Minus operation of data frames

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP



Minus operation of data frames



I have 2 data frames df1 and df2.


df1


df2


df1 <- data.frame(c1=c("a","b","c","d"),c2=c(1,2,3,4) )
df2 <- data.frame(c1=c("c","d","e","f"),c2=c(3,4,5,6) )

> df1
c1 c2
1 a 1
2 b 2
3 c 3
4 d 4

> df2
c1 c2
1 c 3
2 d 4
3 e 5
4 f 6



I need to perform set operation of these 2 data frames. I used merge(df1,df2,all=TRUE) and merge(df1,df2,all=FALSE) method to get the union and intersection of these data frames and got the required output. What is the function to get the minus of these data frames,that is all the positions existing on one data frame but not the other? I need the following output.


merge(df1,df2,all=TRUE)


merge(df1,df2,all=FALSE)


c1 c2
1 a 1
2 b 2





Do you want to get lines in df1 that are not in df2 and lines in df2 that are not in df1 ?
– juba
Apr 22 '13 at 9:39






@juba, I believe this is more of setdiff but for data.frames
– Arun
Apr 22 '13 at 9:43


setdiff


data.frame





Yes, that's what I thought, but the result given is not a setdiff. That's why I ask the question :)
– juba
Apr 22 '13 at 9:44


setdiff





sorry, don't follow. c,3 and d,4 are present in both. So, setdiff(df1, df2) should return those rows not in df2, which is a,1 and b,2. This seems to be a setdiff operation to me (if implemented for data.frame)
– Arun
Apr 22 '13 at 9:48


c,3


d,4


setdiff(df1, df2)


df2


a,1 and b,2


setdiff





@juba, that depends on how you do: setdiff(df1, df2) should return OP's input. setdiff(df2, df1) should return what you say. It's a set operation. So, it should give x entries not in y (so order matters).
– Arun
Apr 22 '13 at 9:51



setdiff(df1, df2)


setdiff(df2, df1)




7 Answers
7



I remember coming across this exact issue quite a few months back. Managed to sift through my Evernote one-liners.



Note: This is not my solution. Credit goes to whoever wrote it (whom I can't seem to find at the moment).



If you don't worry about rownames then you can do:


rownames


df1[!duplicated(rbind(df2, df1))[-seq_len(nrow(df2))], ]
# c1 c2
# 1 a 1
# 2 b 2



Edit: A data.table solution:


data.table


dt1 <- data.table(df1, key="c1")
dt2 <- data.table(df2)
dt1[!dt2]



or better one-liner (from v1.9.6+):


setDT(df1)[!df2, on="c1"]



This returns all rows in df1 where df2$c1 doesn't have a match with df1$c1.


df1


df2$c1


df1$c1





this is working perfectly. Thank you
– Dinoop Nair
Apr 22 '13 at 10:14





This data.table solution is much faster than doing a setdiff or %in% lookup, thanks!
– daroczig
May 13 '16 at 21:20


data.table


setdiff


%in%



I prefer sqldf package:


sqldf


require(sqldf)
sqldf("select * from df1 except select * from df2")

## c1 c2
## 1 a 1
## 2 b 2



You can create identifier columnas then subset:



e.g.


df1 <- data.frame(c1=c("a","b","c","d"),c2=c(1,2,3,4), indf1 = rep("Y",4) )
df2 <- data.frame(c1=c("c","d","e","f"),c2=c(3,4,5,6),indf2 = rep("Y",4) )
merge(df1,df2)
# c1 c2 indf1 indf2
#1 c 3 Y Y
#2 d 4 Y Y

bigdf <- merge(df1,df2,all=TRUE)
# c1 c2 indf1 indf2
#1 a 1 Y <NA>
#2 b 2 Y <NA>
#3 c 3 Y Y
#4 d 4 Y Y
#5 e 5 <NA> Y
#6 f 6 <NA> Y



Then subset how you wish:


bigdf[is.na(bigdf$indf1) ,]
# c1 c2 indf1 indf2
#5 e 5 <NA> Y
#6 f 6 <NA> Y

bigdf[is.na(bigdf$indf2) ,] #<- output you requested those not in df2
# c1 c2 indf1 indf2
#1 a 1 Y <NA>
#2 b 2 Y <NA>





this is not possible..bcoz the given data is only a sample dataframes.the actual data frames contains huge amount of rows. so the object size may become very large by this method.
– Dinoop Nair
Apr 22 '13 at 9:42





@DinoopNair Then you may do a merge with all.x=TRUE and subset on indf2 ?
– juba
Apr 22 '13 at 9:56



all.x=TRUE


indf2



If you're not planning on using any of the actual data in d2, then you don't need merge at all:


d2


merge


df1[!(df1$c1 %in% df2$c1), ]





i need to check the values in both columns.
– Dinoop Nair
Apr 22 '13 at 9:52



You can check the values in both columns and subset like this (just adding another solution):


na.omit( df1[ sapply( 1:ncol(df1) , function(x) ! df1[,x] %in% df2[,x] ) , ] )
# c1 c2
#1 a 1
#2 b 2



One issue with https://stackoverflow.com/a/16144262/2055486 is it assumes neither data frame already has duplicated rows. The following function removes that limitation and also works with arbitrary user defined columns in x or y.



The implementation uses a similar idea to the implementation of duplicated.data.frame in concatenating the columns together with a separator. duplicated.data.frame uses "r", which can cause collisions if the entries have embedded "r" characters. This uses the ASCII record separator "30" which will have a much lower chance of appearing in input data.


duplicated.data.frame


duplicated.data.frame


"r"


"r"


"30"


setdiff.data.frame <- function(x, y,
by = intersect(names(x), names(y)),
by.x = by, by.y = by)
stopifnot(
is.data.frame(x),
is.data.frame(y),
length(by.x) == length(by.y))

!do.call(paste, c(x[by.x], sep = "30")) %in% do.call(paste, c(y[by.y], sep = "30"))


# Example usage
# remove all 4 or 6 cylinder 4 gear cars or 8 cylinder 3 gear rows
to_remove <- data.frame(cyl = c(4, 6, 8), gear = c(4, 4, 3))
mtcars[setdiff.data.frame(mtcars, to_remove), ]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8

# with differing column names
to_remove2 <- data.frame(a = c(4, 6, 8), b = c(4, 4, 3))
mtcars[setdiff.data.frame(mtcars, to_remove2, by.x = c("cyl", "gear"), by.y = c("a", "b")), ]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8



I think the simplest solution is with dplyr (tidyverse).


require(tidyverse)
anti_join(df1, df2)




Thank you for your interest in this question.
Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



Would you like to answer one of these unanswered questions instead?

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard