A data.frame is appended to an existing file with the write.table function, which can write duplicate records into the file. Here is the sample code:
df1<-data.frame(name=c('a','b','c'), a=c(1,2,2))
write.csv(df1, "export.csv", row.names = FALSE, na = "NA")
# after this append, "export.csv" contains two copies of df1
write.table(df1, "export.csv", row.names = FALSE, na = "NA",
            append = TRUE, quote = FALSE, sep = ",", col.names = FALSE)
Ideally the output file should keep only one copy of df1, but the write.table function has no parameter for duplicate checking.
Thanks in advance for any suggestions.
You could read the data.frame back from the file, rbind it with the new data.frame, and check for duplicate rows. For writing efficiency, append only the non-duplicate rows. If you came up with this question because you are working with big data sets and read/write time is a concern, take a look at the data.table package and its fread function.
# initial data.frame
df1 <- data.frame(name = c('a', 'b', 'c'), a = c(1, 2, 2))
write.csv(df1, "export.csv", row.names = FALSE, na = "NA")

# a new data.frame with a couple of duplicate rows
df2 <- data.frame(name = c('a', 'b', 'c'), a = c(1, 2, 3))

dfRead <- read.csv("export.csv")  # read the file
combined <- rbind(dfRead, df2)    # rbind both data.frames

# keep only the rows of df2 that are not already in the file
isNew <- c(rep(FALSE, nrow(dfRead)), rep(TRUE, nrow(df2)))
nonDuplicate <- combined[!duplicated(combined) & isNew, ]

# append the file with the non-duplicate rows
write.table(nonDuplicate, "export.csv", row.names = FALSE, na = "NA",
            append = TRUE, quote = FALSE, sep = ",", col.names = FALSE)
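If speed is the concern, the same dedup-then-append idea can be sketched with data.table's fread and fwrite (an assumption here: a data.table version that provides fwrite and its append argument, i.e. 1.9.8 or later). The duplicate check itself is the same duplicated() logic as above; only the I/O functions change.

```r
library(data.table)

# initial file, as in the question
dt1 <- data.table(name = c('a', 'b', 'c'), a = c(1, 2, 2))
fwrite(dt1, "export.csv")

# new rows, two of which already exist in the file
dt2 <- data.table(name = c('a', 'b', 'c'), a = c(1, 2, 3))

existing <- fread("export.csv")             # fast read of the current file
combined <- rbindlist(list(existing, dt2))  # stack old and new rows

# keep only the rows of dt2 that are not already in the file
isNew <- c(rep(FALSE, nrow(existing)), rep(TRUE, nrow(dt2)))
newRows <- combined[!duplicated(combined) & isNew]

fwrite(newRows, "export.csv", append = TRUE)  # fast append
```

After this runs, "export.csv" holds the original three rows plus the single genuinely new row of dt2.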