Is there an efficient way to append to an existing csv file without duplicates in R?

Asked Jan 15, 2015 · Viewed 28.3k times

I append a data.frame to an existing file with the write.table function, which can write duplicate records into the file. Here is some sample code:

df1 <- data.frame(name = c('a', 'b', 'c'), a = c(1, 2, 2))
write.csv(df1, "export.csv", row.names = FALSE, na = "NA")

# "export.csv" now holds two copies of df1
write.table(df1, "export.csv", row.names = FALSE, na = "NA",
            append = TRUE, quote = FALSE, sep = ",", col.names = FALSE)

Ideally the output file should keep only one copy of df1, but write.table has no parameter for duplicate checking.

Thank you in advance for any suggestions.

Answer

Sirvydas · Jan 15, 2015

You could read the data.frame from file, rbind it with the new data.frame and check for duplicate values. For writing efficiency, append only the non-duplicate rows.

If you came to this question because you are working with big data sets and read/write time is a concern, take a look at the data.table package and its fread function.
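As a rough sketch of that route (assuming the data.table package is installed; fread, fwrite, and fsetdiff are its actual functions), the same deduplicated append looks like this:

```r
library(data.table)

# initial data, written with the same structure as in the question
old <- data.table(name = c('a', 'b', 'c'), a = c(1L, 2L, 2L))
fwrite(old, "export.csv")

# a new batch containing some rows that are already on file
new <- data.table(name = c('a', 'b', 'c'), a = c(1L, 2L, 3L))

# fsetdiff() keeps only the rows of `new` absent from the file
toAppend <- fsetdiff(new, fread("export.csv"))

# with append = TRUE, fwrite skips the header on an existing file
fwrite(toAppend, "export.csv", append = TRUE)
```

fread and fwrite are considerably faster than read.csv and write.table on large files, which is the main reason to switch.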

# initial data.frame
df1 <- data.frame(name = c('a', 'b', 'c'), a = c(1, 2, 2))
write.csv(df1, "export.csv", row.names = FALSE, na = "NA")

# a new data.frame with a couple of duplicate rows
df2 <- data.frame(name = c('a', 'b', 'c'), a = c(1, 2, 3))
dfRead <- read.csv("export.csv") # read the file back in
all <- rbind(dfRead, df2)        # stack both data.frames
# keep only the rows of df2 that are not already in the file
nonDuplicate <- all[!duplicated(all) & c(rep(FALSE, nrow(dfRead)), rep(TRUE, nrow(df2))), ]
# append the non-duplicate rows to the file
write.table(nonDuplicate, "export.csv", row.names = FALSE, na = "NA",
            append = TRUE, quote = FALSE, sep = ",", col.names = FALSE)
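If you need to do this repeatedly, the same logic can be wrapped in a small helper. This is only a sketch; `appendUnique` is a name invented here, not a standard function:

```r
# sketch of a reusable helper; `appendUnique` is a hypothetical name
appendUnique <- function(df, file) {
  if (!file.exists(file)) {
    # no file yet: just write df with a header
    write.csv(df, file, row.names = FALSE, na = "NA")
    return(invisible(df))
  }
  existing <- read.csv(file)
  combined <- rbind(existing, df)
  # keep only rows of `df` that are not already in the file
  isNew <- !duplicated(combined) &
    c(rep(FALSE, nrow(existing)), rep(TRUE, nrow(df)))
  newRows <- combined[isNew, ]
  write.table(newRows, file, row.names = FALSE, na = "NA",
              append = TRUE, quote = FALSE, sep = ",", col.names = FALSE)
  invisible(newRows)
}

appendUnique(data.frame(name = c('a', 'b', 'c'), a = c(1, 2, 2)), "out.csv")
appendUnique(data.frame(name = c('a', 'b', 'c'), a = c(1, 2, 3)), "out.csv")
# "out.csv" now holds the three original rows plus only the new (c, 3) row
```

Note that this still reads the whole file on every call, so for very large files the data.table approach above scales better.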