Say I have large datasets in R and I just want to know whether two of them are the same. I use this often when I'm experimenting with different algorithms to achieve the same result. For example, say we have the following datasets:
df1 <- data.frame(num = 1:5, let = letters[1:5])
df2 <- df1
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3
So this is what I do to compare them:
table(x == y, useNA = 'ifany')
Which works great when the datasets have no NAs:
> table(df1 == df2, useNA = 'ifany')
TRUE
10
But not so much when they have NAs:
> table(df3 == df4, useNA = 'ifany')
TRUE <NA>
11 1
In the example, it's easy to dismiss the NA as not a problem since we know that both data frames are equal. The problem is that NA == <anything> yields NA, so whenever one of the datasets has an NA, it doesn't matter what the other one has in that same position: the result is always going to be NA.
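To make the problem concrete, here is a minimal sketch with two data frames that really do differ in the position where one of them has an NA. The names df5 and df6 are made up for this illustration and are not part of the example above.

```r
# Hypothetical data frames: they differ only in row 6 of num, where df5 has an NA
df5 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df6 <- data.frame(num = c(1:5, 99), let = letters[1:6])

# Element-wise comparison: the NA position compares as NA, not FALSE
cmp <- df5 == df6
cmp[6, "num"]

# The table therefore shows no FALSE count, just as it did for df3 and df4
table(cmp, useNA = 'ifany')
```

The table looks the same as the one for df3 and df4, even though df5 and df6 are not equal.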
So using table() to compare datasets doesn't seem ideal to me. How can I better check if two data frames are identical?
P.S.: Note this is not a duplicate of R - comparing several datasets, Comparing 2 datasets in R or Compare datasets in R
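Look up all.equal. It has some caveats (for numeric values it tests near equality within a tolerance, and when the objects differ it returns a description of the differences rather than FALSE), but it might work for you.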
all.equal(df3,df4)
# [1] TRUE
all.equal(df2,df1)
# [1] TRUE
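A rough sketch of how this might be used when the data frames actually differ. Here df7 is a hypothetical data frame introduced only for illustration. Wrapping the call in isTRUE() gives a single logical value, and identical() is a stricter base R check you may prefer if you want exact equality.

```r
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3
# Hypothetical data frame that differs from df3 exactly where df3 has an NA
df7 <- data.frame(num = c(1:5, 99), let = letters[1:6])

# all.equal() returns TRUE or a character vector describing the differences,
# so wrap it in isTRUE() to get a single logical value
same_34 <- isTRUE(all.equal(df3, df4))
same_37 <- isTRUE(all.equal(df3, df7))

# identical() is strict: no numeric tolerance, and it always returns TRUE or FALSE
ident_34 <- identical(df3, df4)
ident_37 <- identical(df3, df7)

print(c(same_34, same_37, ident_34, ident_37))
```

Both checks treat an NA in one data frame and a value in the other as a difference, which is the case the table() approach cannot detect.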