How to remove any co-occurrence of sub-list elements from vector (R)

How to remove any co-occurrence of sub-list elements from vector (R)

I review the python question How to remove every occurrence of sub-list from list.
Now I want to know how many creative ways are there in R.

For example, removing any occurrences of sub_list from the main_list.

sub_list

main_list

main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1) sub_list = c(1,2)

desired result: 2 3 4 2 2 1

2 3 4 2 2 1

My suggestions:

a<-c() for(i in 1:(length(main_list)-1)) if (all(main_list[c(i,i+1)]==sub_list)) a<-c(a,c(i,i+1)) main_list[-a] [1] 2 3 4 2 2 1

as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))

Ohh it is really dangerous. Let's try:

main_list = c(2, 1, 2, 3, 12, 1, 2, 4, 2, 2, 1) as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = ""))) [1] 2 3 4 2 2 1 ####However a<-c() for(i in 1:(length(main_list)-1)) if (all(main_list[c(i,i+1)]==sub_list)) a<-c(a,c(i,i+1)) main_list[-a] [1] 2 3 12 4 2 2 1

I Benchmarked solutions base on the memory and time, each solution takes, with a big vector of numbers and used profmem and microbenchmark libraries.

profmem

microbenchmark

set.seed(1587) main_list<-sample(c(8:13,102:105),size = 10000000,replace = T) main_list<-c(c(8,9,12,103),main_list,c(8,9,12,103)) sub_list<-c(8,9,12,103)

d.b's solution does not work for main_list so I modified it as follows:

d.b

main_list

ML = paste(main_list, collapse = ",") # collapse should not be empty SL = paste(sub_list, collapse = ",") out<-gsub(SL, "", ML) out<-gsub("^\,","",out) out<-gsub("\,$","",out) out<-gsub("\,,","\,",out) out<-as.numeric(unlist(strsplit(out,split = ",")))

solution seconds memory_byte memory_base seconds_base <chr> <dbl> <dbl> <dbl> <dbl> 1 d.b 26.0 399904560 1 16.8 2 Grothendieck_2 1.55 1440070304 3.60 1 3 Grothendieck_1 109. 4968036376 12.4 70.3 4 李哲源 2.17 1400120824 3.50 1.40

Any comment about the benchmarking?

Does an open question improve creativity?
– imi
Sep 3 at 1:01

2 Answers
2

Here are two solutions. The first one is obviously simpler and would be used if you favour clarity and maintainability while the second one has no package dependencies and is faster.

1) zoo Use a moving window to compare each subsequence of c(main_list, sub_list) having the required length to the sub_list. (We append sub_list to ensure that there is always something to remove.) This statements returns TRUE or FALSE according to whether the current position is the end of a matching subsequence. Then compute the TRUE index numbers and from that the indices of all elements to be removed and remove them.

main_list, sub_list)

sub_list

library(zoo) w <- length(sub_list) r <- rollapplyr(c(main_list, sub_list), w, identical, sub_list, fill = FALSE) main_list[-c(outer(which(r), seq_len(w) - 1, "-"))] ## [1] 2 3 4 2 2 1

2) Base R. The middle line setting r has the same purpose as the corresponding line in (1) and the last line is the same as the last line in (2) except we use + instead of - due to the fact that embed effectively uses left alignment.

r

+

-

embed

w <- length(sub_list) r <- colSums(t(embed(c(main_list, sub_list), w)) == rev(sub_list)) == w main_list[-c(outer(which(r), seq_len(w) - 1, "+"))] ## [1] 2 3 4 2 2 1

Yes, embed is a good function: stackoverflow.com/a/41839433/4891738, the only inconvenience is that we need a rev. Are you sure that you don't need to protect x[-ind] by testing that length(ind) > 0L?
– 李哲源
Aug 6 at 12:16

embed

rev

x[-ind]

length(ind) > 0L

As long as sub_list has strictly positive length the index length in the last line of either solution above is also strictkly positive so there is no need to test it. One could test whether sub_list has zero length if one wanted to address that edge case.
– G. Grothendieck
Aug 6 at 12:40

sub_list

Yes, when I added the second solution I modified the first to use the same trick.
– G. Grothendieck
Aug 6 at 12:49

Here is a function that does this general thing.

xm

xs

It is required that length(xm) > length(xs) but no such check is made right now.

length(xm) > length(xs)

foo <- function (xm, xs) nm <- length(xm) ns <- length(xs) shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+") d <- xm[shift_ind] == xs first_drop_ind <- which(.colSums(d, ns, length(d) / ns) == ns) if (length(first_drop_ind) > 0L) drop_ind <- outer(0:(ns - 1), first_drop_ind, "+") return(xm[-drop_ind]) else return(xm) main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1) sub_list = c(1,2) foo(main_list, sub_list) #[1] 2 3 4 2 2 1

Explanation

xm <- main_list xs <- sub_list nm <- length(xm) ns <- length(xs) shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+") MAT <- matrix(xm[shift_ind], ns) # [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] #[1,] 2 1 2 3 1 2 4 2 2 #[2,] 1 2 3 1 2 4 2 2 1

So the first step is a shifting and matrix representation, as above.

LOGIC <- MAT == xs # [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] #[1,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE #[2,] FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE

If a co-occurrence is found, a column should contain all TRUE, i.e., the colSums should be ns. In this way we can identify the location of the first value of the matching.

TRUE

colSums

ns

first_drop_ind <- which(colSums(LOGIC) == ns) #[1] 2 5

Now we need to expand it to cover the subsequent values after those initial matches.

drop_ind <- outer(0:(ns - 1), first_drop_ind, "+") # [,1] [,2] #[1,] 2 5 #[2,] 3 6

Finally we remove values at those positions from xm:

xm

xm[-drop_ind] #[1] 2 3 4 2 2 1

Note that in the function, the matrix is not explicitly formed. .colSums is used instead of colSums.

.colSums

colSums

watch out for bug

The if ... else ... in the function is necessary. If no match is found then drop_ind would be integer(0), and using xm[-drop_ind] gives xm[integer(0)] which is integer(0).

if ... else ...

drop_ind

integer(0)

xm[-drop_ind]

xm[integer(0)]

integer(0)

comparison with zoo::rollapplyr

zoo::rollapplyr

## require package `zoo` bar <- function (xm, xs) w <- length(xs) r <- rollapplyr(xm, w, identical, xs, fill = FALSE) if (length(r) > 0L) return(xm[-c(outer(which(r), seq_len(w) - 1, "-"))]) else return(xm) set.seed(0) xm <- sample.int(10, 10000, TRUE) xs <- 1:2 library(zoo) system.time(a <- foo(xm, xs)) # user system elapsed # 0.004 0.000 0.001 system.time(b <- bar(xm, xs)) # user system elapsed # 0.276 0.000 0.273 all.equal(a, b) #[1] TRUE

I guess that rollapplyr is slower is because

rollapplyr

xm

lapply

Personally speaking I prefer to a full Rcpp solution, but some efforts are needed to deal with different data types. As a C programmer I am not familiar with C++'s template approach so this is not straightforward to me. In C I could use macro to achieve the same effect. But anyway, I might revisit this case sometime in future.
– 李哲源
Sep 3 at 1:02

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

gAfl13 cIoAV,ibL4dgfz n55T,ISGsWP6J1Y6S7YQF E,GX,k,XHv

搜尋此網誌

Sfyjdyy