How to remove any co-occurrence of sub-list elements from vector (R)




I reviewed the Python question "How to remove every occurrence of sub-list from list".
Now I want to know how many creative ways there are to do this in R.

For example, removing every occurrence of sub_list from main_list.


main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)



desired result: 2 3 4 2 2 1





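My suggestions:

1. A loop over adjacent pairs:

a <- c()
# collect the positions of every adjacent pair equal to sub_list
for (i in 1:(length(main_list) - 1))
  if (all(main_list[c(i, i + 1)] == sub_list))
    a <- c(a, c(i, i + 1))

main_list[-a]
[1] 2 3 4 2 2 1

2. A regular expression on the collapsed string:

as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))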



Oh, the second one is really dangerous. Let's try a main_list that contains 12 as a single element:

main_list = c(2, 1, 2, 3, 12, 1, 2, 4, 2, 2, 1)
as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))
[1] 2 3 4 2 2 1
#### However, the loop keeps the 12
a <- c()
for (i in 1:(length(main_list) - 1))
  if (all(main_list[c(i, i + 1)] == sub_list))
    a <- c(a, c(i, i + 1))

main_list[-a]
[1] 2 3 12 4 2 2 1



I benchmarked the solutions on the memory and time each one takes, using a big vector of numbers and the profmem and microbenchmark packages.


set.seed(1587)
main_list<-sample(c(8:13,102:105),size = 10000000,replace = T)
main_list<-c(c(8,9,12,103),main_list,c(8,9,12,103))
sub_list<-c(8,9,12,103)



d.b's solution does not work for this main_list, so I modified it as follows:

ML = paste(main_list, collapse = ",") # collapse should not be empty
SL = paste(sub_list, collapse = ",")
out <- gsub(SL, "", ML)
out <- gsub("^,+", "", out)   # drop commas left at the start
out <- gsub(",+$", "", out)   # drop commas left at the end
out <- gsub(",+", ",", out)   # collapse runs of commas left in the middle
out <- as.numeric(unlist(strsplit(out, split = ",")))
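
A comma-separated pattern can still match inside longer numbers: for instance 8,9,12,103 also occurs inside 108,9,12,1035. The following is only a sketch of one way to guard against that, by wrapping the string in commas and requiring a comma on both sides of the match. The helper name remove_sublist_str is hypothetical, and the sketch assumes small integer values that paste() prints as plain digits (no decimals, no scientific notation).

```r
# Hypothetical helper, not part of any answer in this thread.
# Assumes elements print as plain digits, so the pattern contains no
# regex metacharacters.
remove_sublist_str <- function(main_list, sub_list) {
  # Surround the collapsed vector with commas so every element,
  # including the first and last, is delimited on both sides.
  ML <- paste0(",", paste(main_list, collapse = ","), ",")
  SL <- paste(sub_list, collapse = ",")
  # The lookbehind requires a comma before the match without consuming
  # it, so back-to-back occurrences are all removed.
  pattern <- paste0("(?<=,)", SL, ",")
  out <- gsub(pattern, "", ML, perl = TRUE)
  parts <- strsplit(out, ",", fixed = TRUE)[[1]]
  as.numeric(parts[nzchar(parts)])   # drop the empty leading field
}

remove_sublist_str(c(2, 1, 2, 3, 12, 1, 2, 4, 2, 2, 1), c(1, 2))
# [1]  2  3 12  4  2  2  1
```

The element 12 survives because the 1 in 12 is not followed by a comma and the 2 in 12 is not preceded by one.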


  solution       seconds memory_byte memory_base seconds_base
  <chr>            <dbl>       <dbl>       <dbl>        <dbl>
1 d.b              26.0    399904560        1           16.8
2 Grothendieck_2    1.55  1440070304        3.60         1
3 Grothendieck_1  109.    4968036376       12.4         70.3
4 李哲源            2.17  1400120824        3.50         1.40



Any comment about the benchmarking?





Does an open question improve creativity?
– imi
Sep 3 at 1:01




2 Answers



Here are two solutions. The first one is obviously simpler and would be used if you favour clarity and maintainability while the second one has no package dependencies and is faster.



1) zoo. Use a moving window to compare each subsequence of c(main_list, sub_list) of the required length to sub_list. (We append sub_list to ensure that there is always something to remove.) The rollapplyr statement returns TRUE or FALSE according to whether the current position is the end of a matching subsequence. Then compute the index numbers of the TRUE values, expand them to the indices of all elements to be removed, and remove them.


library(zoo)

w <- length(sub_list)
r <- rollapplyr(c(main_list, sub_list), w, identical, sub_list, fill = FALSE)
main_list[-c(outer(which(r), seq_len(w) - 1, "-"))]
## [1] 2 3 4 2 2 1



2) Base R. The middle line setting r has the same purpose as the corresponding line in (1), and the last line is the same as the last line in (1) except that we use + instead of -, because embed effectively uses left alignment.


w <- length(sub_list)
r <- colSums(t(embed(c(main_list, sub_list), w)) == rev(sub_list)) == w
main_list[-c(outer(which(r), seq_len(w) - 1, "+"))]
## [1] 2 3 4 2 2 1
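
To see why rev(sub_list) is needed, here is a small illustration of what embed returns. The vectors x and pat are made up for this sketch and the appended sub_list is left out for brevity.

```r
x <- c(2, 1, 2, 3, 1, 2)
pat <- c(1, 2)
w <- length(pat)

# embed() returns one row per window; within a row the values run from
# newest to oldest, so row i holds c(x[i + 1], x[i]) when w is 2.
windows <- embed(x, w)

# t() puts one window per column, so `== rev(pat)` recycles the
# reversed pattern down each column.
hits <- colSums(t(windows) == rev(pat)) == w
starts <- which(hits)   # positions in x where a match begins
starts
# [1] 2 5
```

Because the window index is the start of the match, the remaining positions are reached by adding 0, 1, ..., w - 1.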





Yes, embed is a good function: stackoverflow.com/a/41839433/4891738, the only inconvenience is that we need a rev. Are you sure that you don't need to protect x[-ind] by testing that length(ind) > 0L?
– 李哲源
Aug 6 at 12:16








As long as sub_list has strictly positive length, the index length in the last line of either solution above is also strictly positive, so there is no need to test it. One could test whether sub_list has zero length if one wanted to address that edge case.
– G. Grothendieck
Aug 6 at 12:40








Yes, when I added the second solution I modified the first to use the same trick.
– G. Grothendieck
Aug 6 at 12:49



Here is a function that does this in general, where xm is the main vector and xs is the sub-vector.

It is required that length(xm) > length(xs), but no such check is made right now.


foo <- function (xm, xs) {
  nm <- length(xm)
  ns <- length(xs)
  # column j holds the indices j, j + 1, ..., j + ns - 1
  shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
  d <- xm[shift_ind] == xs
  # a column whose ns comparisons are all TRUE marks the start of a match
  first_drop_ind <- which(.colSums(d, ns, length(d) / ns) == ns)
  if (length(first_drop_ind) > 0L) {
    drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
    return(xm[-drop_ind])
  } else {
    return(xm)
  }
}



main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)
foo(main_list, sub_list)
#[1] 2 3 4 2 2 1
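
Since the length requirement is not checked, a guarded variant might look like the sketch below. The name remove_sublist and the choice to return xm unchanged for an empty or over-long xs are assumptions made for illustration, not part of the answer.

```r
# Hypothetical guarded variant of the same approach.
remove_sublist <- function(xm, xs) {
  nm <- length(xm)
  ns <- length(xs)
  # Assumed behaviour: an empty pattern, or one longer than xm,
  # cannot match, so xm is returned as is.
  if (ns == 0L || ns > nm) return(xm)
  shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
  d <- xm[shift_ind] == xs
  first_drop_ind <- which(.colSums(d, ns, length(d) / ns) == ns)
  # No match: skip the negative indexing entirely.
  if (length(first_drop_ind) == 0L) return(xm)
  drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
  xm[-drop_ind]
}

remove_sublist(c(2, 1, 2, 3, 1, 2, 4, 2, 2, 1), c(1, 2))
# [1] 2 3 4 2 2 1
```

Without the first guard, 1:(nm - ns + 1) would count downwards when xs is longer than xm, and 0:(ns - 1) would become 0:-1 when xs is empty.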



Explanation


xm <- main_list
xs <- sub_list

nm <- length(xm)
ns <- length(xs)
shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
MAT <- matrix(xm[shift_ind], ns)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,]    2    1    2    3    1    2    4    2    2
#[2,]    1    2    3    1    2    4    2    2    1



So the first step is a shifting and matrix representation, as above.


LOGIC <- MAT == xs
#      [,1] [,2]  [,3]  [,4] [,5]  [,6]  [,7]  [,8]  [,9]
#[1,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE TRUE FALSE  TRUE  TRUE FALSE



If a co-occurrence is found, the corresponding column contains only TRUE, i.e., its colSums value equals ns. In this way we can identify the position of the first value of each match.


first_drop_ind <- which(colSums(LOGIC) == ns)
#[1] 2 5



Now we need to expand it to cover the subsequent values after those initial matches.


drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
#     [,1] [,2]
#[1,]    2    5
#[2,]    3    6



Finally we remove values at those positions from xm:




xm[-drop_ind]
#[1] 2 3 4 2 2 1



Note that in the function, the matrix is not explicitly formed. .colSums is used instead of colSums.





watch out for bug



The if ... else ... in the function is necessary. If no match is found then drop_ind would be integer(0), and using xm[-drop_ind] gives xm[integer(0)] which is integer(0).
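
A minimal illustration of that pitfall, using a made-up three-element vector:

```r
xm <- c(3, 4, 5)
drop_ind <- integer(0)        # what which() gives when nothing matches

# -integer(0) is still integer(0), so this selects nothing
# instead of dropping nothing.
n_left <- length(xm[-drop_ind])
n_left
# [1] 0

# Guarded version: only use negative indexing when there is
# something to drop.
kept <- if (length(drop_ind) > 0L) xm[-drop_ind] else xm
kept
# [1] 3 4 5
```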





comparison with zoo::rollapplyr




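## requires package `zoo`
bar <- function (xm, xs) {
  w <- length(xs)
  # TRUE at the last position of each window that equals xs
  r <- rollapplyr(xm, w, identical, xs, fill = FALSE)
  # test any(r) rather than length(r): r always has length(xm)
  if (any(r)) {
    return(xm[-c(outer(which(r), seq_len(w) - 1, "-"))])
  } else {
    return(xm)
  }
}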



set.seed(0)
xm <- sample.int(10, 10000, TRUE)
xs <- 1:2

library(zoo)

system.time(a <- foo(xm, xs))
# user system elapsed
# 0.004 0.000 0.001

system.time(b <- bar(xm, xs))
# user system elapsed
# 0.276 0.000 0.273

all.equal(a, b)
#[1] TRUE



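I guess that rollapplyr is slower because it first has to coerce xm to a "zoo" object, and because internally it applies the function to each window in an lapply-style loop at the R level.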





Personally speaking, I prefer a full Rcpp solution, but some effort is needed to deal with different data types. As a C programmer I am not familiar with C++'s template approach, so this is not straightforward for me. In C I could use macros to achieve the same effect. But anyway, I might revisit this case sometime in the future.
– 李哲源
Sep 3 at 1:02






