How to remove any co-occurrence of sub-list elements from vector (R)
I reviewed the Python question "How to remove every occurrence of sub-list from list".
Now I would like to know what creative ways there are to do this in R.
For example, removing every occurrence of sub_list from main_list:
main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)
desired result: 2 3 4 2 2 1
My suggestions:
a<-c()
for(i in 1:(length(main_list)-1))
if (all(main_list[c(i,i+1)]==sub_list))
a<-c(a,c(i,i+1))
main_list[-a]
[1] 2 3 4 2 2 1
2
as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))
Oh, the second approach is really dangerous: collapsing without a separator makes 12 indistinguishable from 1 followed by 2. Let's try:
main_list = c(2, 1, 2, 3, 12, 1, 2, 4, 2, 2, 1)
as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))
[1] 2 3 4 2 2 1
####However
a<-c()
for(i in 1:(length(main_list)-1))
if (all(main_list[c(i,i+1)]==sub_list))
a<-c(a,c(i,i+1))
main_list[-a]
[1] 2 3 12 4 2 2 1
I benchmarked the solutions on the time and memory each one takes, using a big vector of numbers and the profmem and microbenchmark libraries.
set.seed(1587)
main_list<-sample(c(8:13,102:105),size = 10000000,replace = T)
main_list<-c(c(8,9,12,103),main_list,c(8,9,12,103))
sub_list<-c(8,9,12,103)
d.b's solution does not work for this main_list, so I modified it as follows:
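ML = paste(main_list, collapse = ",") # collapse should not be empty
SL = paste(sub_list, collapse = ",")
out<-gsub(SL, "", ML)
out<-gsub("^,","",out)      # drop a leading comma left by a removal at the start
out<-gsub(",$","",out)      # drop a trailing comma left by a removal at the end
out<-gsub(",{2,}",",",out)  # squeeze runs of commas left by removals in the middle
out<-as.numeric(unlist(strsplit(out,split = ",")))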
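The comma-separated version can still match across element boundaries (for example "1,2" inside "21,25"). A minimal sketch of one way to guard against that is to anchor the pattern on the separators themselves. The helper name remove_sub_regex is hypothetical, and the sketch assumes small non-negative whole numbers, so that paste() produces plain digit strings without regex metacharacters such as "." or "+".

```r
# Hypothetical helper: regex removal with separator guards.
# Assumes small non-negative whole numbers (plain digit strings after paste()).
remove_sub_regex <- function(main_list, sub_list) {
  # wrap the whole string in commas so every element is delimited on both sides
  ML <- paste0(",", paste(main_list, collapse = ","), ",")
  # a match must start at a comma and be followed by a comma (lookahead, not consumed)
  SL <- paste0(",", paste(sub_list, collapse = ","), "(?=,)")
  out <- gsub(SL, "", ML, perl = TRUE)
  parts <- strsplit(out, ",", fixed = TRUE)[[1]]
  as.numeric(parts[nzchar(parts)])  # drop the empty pieces left by the outer commas
}

remove_sub_regex(c(2, 1, 2, 3, 12, 1, 2, 4, 2, 2, 1), c(1, 2))
```

Because the trailing comma is only looked at and not consumed, two adjacent occurrences are both removed in a single pass.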
  solution       seconds memory_byte memory_base seconds_base
  <chr>            <dbl>       <dbl>       <dbl>        <dbl>
1 d.b              26.0    399904560        1           16.8
2 Grothendieck_2    1.55  1440070304        3.60         1
3 Grothendieck_1  109.    4968036376       12.4         70.3
4 李哲源            2.17  1400120824        3.50         1.40
Any comment about the benchmarking?
2 Answers
Here are two solutions. The first one is obviously simpler and would be used if you favour clarity and maintainability while the second one has no package dependencies and is faster.
1) zoo. Use a moving window to compare each subsequence of c(main_list, sub_list) having the required length to sub_list. (We append sub_list to ensure that there is always something to remove.) This statement returns TRUE or FALSE according to whether the current position is the end of a matching subsequence. Then compute the TRUE index numbers, derive from them the indices of all elements to be removed, and remove them.
library(zoo)
w <- length(sub_list)
r <- rollapplyr(c(main_list, sub_list), w, identical, sub_list, fill = FALSE)
main_list[-c(outer(which(r), seq_len(w) - 1, "-"))]
## [1] 2 3 4 2 2 1
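For readers without zoo installed, the same moving-window idea can be sketched in base R with vapply. This is only an illustration of the logic, not what rollapplyr does internally; the names x, ends and drop are chosen here for the example.

```r
main_list <- c(2, 1, 2, 3, 1, 2, 4, 2, 2, 1)
sub_list <- c(1, 2)

w <- length(sub_list)
x <- c(main_list, sub_list)   # appended copy guarantees at least one match
ends <- seq(w, length(x))     # every position where a full window can end

# TRUE where the window ending at that position equals sub_list
r <- vapply(ends, function(e) identical(x[(e - w + 1):e], sub_list), logical(1))

# expand each matching end position backwards to cover the whole window
drop <- c(outer(ends[r], seq_len(w) - 1, "-"))

# indices pointing into the appended copy lie beyond length(main_list);
# R ignores out-of-range negative indices when extracting
result <- main_list[-drop]
result
```

The appended copy is what makes drop non-empty in every case, so the final subsetting never hits the empty-index case.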
2) Base R. The middle line setting r has the same purpose as the corresponding line in (1), and the last line is the same as the last line in (1) except that we use + instead of - because embed effectively uses left alignment.
w <- length(sub_list)
r <- colSums(t(embed(c(main_list, sub_list), w)) == rev(sub_list)) == w
main_list[-c(outer(which(r), seq_len(w) - 1, "+"))]
## [1] 2 3 4 2 2 1
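To see why rev is needed, it may help to look at what embed returns on a tiny input: each row is one window, listed newest element first. The wrapper below, with the hypothetical name remove_sub, is a sketch of solution (2) as a function with an added guard for a zero-length sub_list, which is an assumption about how one might want that edge case handled.

```r
# each row of embed(x, 2) is a window in reverse order: (x[t], x[t - 1])
e <- embed(1:5, 2)
e[1, ]   # 2 1, i.e. the window (1, 2) reversed

# Hypothetical wrapper around solution (2)
remove_sub <- function(main_list, sub_list) {
  w <- length(sub_list)
  if (w == 0L) return(main_list)  # nothing to search for
  # the appended copy of sub_list guarantees at least one TRUE in r
  r <- colSums(t(embed(c(main_list, sub_list), w)) == rev(sub_list)) == w
  # which(r) holds window start positions, so expand forwards with "+"
  main_list[-c(outer(which(r), seq_len(w) - 1, "+"))]
}

remove_sub(c(2, 1, 2, 3, 1, 2, 4, 2, 2, 1), c(1, 2))
```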
Yes, embed is a good function: stackoverflow.com/a/41839433/4891738; the only inconvenience is that we need a rev. Are you sure that you don't need to protect x[-ind] by testing that length(ind) > 0L?
– 李哲源
Aug 6 at 12:16
As long as sub_list has strictly positive length, the index length in the last line of either solution above is also strictly positive, so there is no need to test it. One could test whether sub_list has zero length if one wanted to address that edge case.
– G. Grothendieck
Aug 6 at 12:40
Yes, when I added the second solution I modified the first to use the same trick.
– G. Grothendieck
Aug 6 at 12:49
Here is a function that does this in general, where xm is the main vector and xs is the sub-vector.
It is required that length(xm) > length(xs), but no such check is made right now.
foo <- function (xm, xs) {
  nm <- length(xm)
  ns <- length(xs)
  # column j holds the indices j, j + 1, ..., j + ns - 1 of one window
  shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
  d <- xm[shift_ind] == xs
  # a window matches when all ns comparisons in its column are TRUE
  first_drop_ind <- which(.colSums(d, ns, length(d) / ns) == ns)
  if (length(first_drop_ind) > 0L) {
    drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
    return(xm[-drop_ind])
  } else {
    return(xm)
  }
}
main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)
foo(main_list, sub_list)
#[1] 2 3 4 2 2 1
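Since the function above makes no check on the lengths, here is a minimal sketch of a guarded variant. The name foo_checked is hypothetical, and returning xm unchanged for an empty or over-long xs is an assumption about the desired behaviour, not something the answer specifies.

```r
# Hypothetical guarded variant of foo()
foo_checked <- function(xm, xs) {
  nm <- length(xm)
  ns <- length(xs)
  # assumption: an empty pattern, or one longer than xm, can never match
  if (ns == 0L || ns > nm) return(xm)
  shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
  d <- xm[shift_ind] == xs
  first_drop_ind <- which(.colSums(d, ns, length(d) / ns) == ns)
  if (length(first_drop_ind) == 0L) return(xm)  # no match: avoid xm[-integer(0)]
  xm[-outer(0:(ns - 1), first_drop_ind, "+")]
}

foo_checked(c(2, 1, 2, 3, 1, 2, 4, 2, 2, 1), c(1, 2))
foo_checked(c(1, 2), c(1, 2, 3))   # pattern longer than the vector
```

Without the first guard, 1:(nm - ns + 1) would count downwards when ns > nm and 0:(ns - 1) would be c(0, -1) when ns is 0, so the index matrix would be wrong rather than empty.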
Explanation
xm <- main_list
xs <- sub_list
nm <- length(xm)
ns <- length(xs)
shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
MAT <- matrix(xm[shift_ind], ns)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,]    2    1    2    3    1    2    4    2    2
#[2,]    1    2    3    1    2    4    2    2    1
So the first step is a shifting and matrix representation, as above.
LOGIC <- MAT == xs
#      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]
#[1,] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
#[2,] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
If a co-occurrence is found, a column should contain all TRUE, i.e., the colSums should be ns. In this way we can identify the location of the first value of each match.
first_drop_ind <- which(colSums(LOGIC) == ns)
#[1] 2 5
Now we need to expand it to cover the subsequent values after those initial matches.
drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
#     [,1] [,2]
#[1,]    2    5
#[2,]    3    6
Finally we remove values at those positions from xm:
xm[-drop_ind]
#[1] 2 3 4 2 2 1
Note that in the function, the matrix is not explicitly formed. .colSums is used instead of colSums.
watch out for bug
The if ... else ... in the function is necessary. If no match is found then drop_ind would be integer(0), and using xm[-drop_ind] gives xm[integer(0)] which is integer(0).
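The pitfall is easy to reproduce on a toy vector. The snippet below is a small illustration with made-up values; the setdiff line at the end is one possible branch-free alternative, not part of the function above.

```r
x <- c(10, 20, 30)
ind <- integer(0)   # no positions to drop

naive <- x[-ind]    # -integer(0) is still integer(0), so nothing is selected
guarded <- if (length(ind) > 0L) x[-ind] else x

length(naive)       # 0: everything was lost
length(guarded)     # 3: the vector is returned unchanged

# alternative without a branch: select the complement explicitly
keep <- x[setdiff(seq_along(x), ind)]
```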
comparison with zoo::rollapplyr
## require package `zoo`
bar <- function (xm, xs) {
  w <- length(xs)
  r <- rollapplyr(xm, w, identical, xs, fill = FALSE)
  # with fill = FALSE, r is as long as xm, so test for an actual match instead of length(r)
  if (any(r)) {
    return(xm[-c(outer(which(r), seq_len(w) - 1, "-"))])
  } else {
    return(xm)
  }
}
set.seed(0)
xm <- sample.int(10, 10000, TRUE)
xs <- 1:2
library(zoo)
system.time(a <- foo(xm, xs))
# user system elapsed
# 0.004 0.000 0.001
system.time(b <- bar(xm, xs))
# user system elapsed
# 0.276 0.000 0.273
all.equal(a, b)
#[1] TRUE
I guess that rollapplyr is slower because of the extra work it does on xm and because it relies on lapply.
Personally, I would prefer a full Rcpp solution, but some effort is needed to deal with different data types. As a C programmer I am not familiar with C++'s template approach, so this is not straightforward for me. In C I could use macros to achieve the same effect. Anyway, I might revisit this case sometime in the future.
– 李哲源
Sep 3 at 1:02
Does an open question improve creativity?
– imi
Sep 3 at 1:01