R: Make unique the duplicated levels in all factor columns in a data frame

panman · Nov 29, 2014 · Viewed 8.2k times

For several days now I've been stuck with a problem in R: making the duplicated levels in multiple factor columns of a data frame unique using a loop. This is part of a larger project.

I have more than 200 SPSS data sets where the number of cases varies between 4,000 and 23,000 and the number of variables varies between 120 and 1,200 (an excerpt of one of the SPSS data sets can be found here). The files contain both numeric and factor variables, and many of the factor variables have duplicated levels. I have used read.spss from the foreign package to import them into data frames, keeping the value labels because I need them for further use. During the import R warns me about the duplicated levels in the factor columns:

> adn <- read.spss("/tmp/adn_110.sav", use.value.labels = TRUE,
use.missings = TRUE, to.data.frame = TRUE)
Warning messages:
1: In read.spss("/tmp/adn_110.sav", use.value.labels = TRUE, use.missings = TRUE,  :
  /tmp/adn_110.sav: Unrecognized record type 7, subtype 18 encountered in system file
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
3: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

The data frame, exported as .RData, can be found here. When I use table (for example) to get the counts for each level of any factor column, all duplicated levels are displayed, but the counts for all duplicated levels are added to the first occurrence of the duplicate levels and for all others 0s are returned:

> table(adn[["adn01"]], useNA = "ifany")
  Incorrect         Incorrect Partially correct Partially correct 
          8                 0                 4                 0 
    Correct              <NA> 
          2                 1 
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

I know I can easily treat the factor as.numeric when calling table. However, I need the level names displayed in the output. I can use make.unique to make the levels for individual factor columns unique, appending a number at the end of the duplicate levels:

> levels(adn[["adn01"]]) <- make.unique(levels(adn[["adn01"]]), sep = " ")

Works like a charm. Then table shows me the correct counts:

> table(adn[["adn01"]], useNA = "ifany")

          Incorrect         Incorrect 1   Partially correct 
                  5                   3                   1 
Partially correct 1             Correct                <NA> 
                  3                   2                   1 

However, doing this by hand for each factor column in each of the more than 200 files, where the number of variables varies between 120 and 1,200, would be the work of a lifetime. And if the files change I will have to redo everything. I naively thought looping through the columns would be easy. However, make.unique requires a character vector of names. I have tried the following:

> lapply(adn[ , 1:length(adn)], make.unique(as.vector(attr(adn[ , 1:length(adn)],
"levels"))))
Error in make.unique(as.vector(attr(adn[, 1:length(adn)], "levels"))) : 
  'names' must be a character vector

No luck. I have tried many other things in the last few days, including classical for loops. Still the same: 'names' must be a character vector. I guess the problem is in indexing the levels attribute of the columns, which is a list component, but I can't figure out what is wrong. An additional issue may be that not all columns are factors. Can someone help?

EDIT:

The solution provided by akrun works perfectly. Thank you once again!

Answer

akrun · Nov 29, 2014

Try

 load('adn.RData')
 indx <- sapply(adn, is.factor)
 adn[indx] <- lapply(adn[indx], function(x) {
                   levels(x) <- make.unique(levels(x))
                   x })


 table(adn[['adn01']], useNA='ifany')

 #     Incorrect         Incorrect.1   Partially correct Partially correct.1 
 #             5                   3                   1                   3 
 #       Correct                <NA> 
 #             2                   1 


  table(adn[['adn03']], useNA='ifany')

  #  Incorrect Partially correct           Correct              <NA> 
  #          6                 3                 5                 1 
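
As a side note on why the attempt in the question fails, here is a minimal sketch on a toy data frame (toy, g and n are made-up names, not part of adn). The levels attribute lives on each factor column, not on the data frame, so attr() on the data frame returns NULL and make.unique stops before lapply does any work. In addition, the second argument of lapply has to be a function rather than the result of a call.

```r
# Toy data frame standing in for adn (hypothetical column names)
toy <- data.frame(g = factor(c("a", "b", "a")), n = 1:3)

# The data frame itself carries no "levels" attribute; only factor columns do
is.null(attr(toy, "levels"))   # TRUE
levels(toy$g)                  # the levels of one factor column

# So make.unique() receives NULL instead of a character vector and errors
msg <- tryCatch(
  make.unique(as.vector(attr(toy, "levels"))),
  error = function(e) conditionMessage(e)
)
msg  # the same "'names' must be a character vector" message as in the question
```

Selecting the factor columns first and calling levels() inside an anonymous function, as in the code above, avoids both problems.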

Update

If you have multiple files, you can read them into a list and then process the list. For example, assuming the files are in the working directory:

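library(foreign)  # provides read.spss
# Pattern adjusted to match names like adn_110.sav; change it to fit your files
files <- list.files(pattern = '^adn_\\d+\\.sav$')
lst1 <- lapply(files, function(x) read.spss(x, use.value.labels = TRUE,
          use.missings = TRUE, to.data.frame = TRUE)) #not tested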

For testing purposes, I am creating lst1 with the same dataset adn.

adn1 <- adn
lst1 <- list(adn, adn1)

Now apply make.unique to the factor columns of each list element:

lst2 <- lapply(lst1, function(dat) {
  indx <- sapply(dat, is.factor)
  dat[indx] <- lapply(dat[indx], function(x) {
    levels(x) <- make.unique(levels(x))
    x
  })
  dat
})


  lapply(lst2, function(x) table(x[['adn01']], useNA='ifany'))
  # [[1]]

  #    Incorrect         Incorrect.1   Partially correct Partially correct.1 
  #            5                   3                   1                   3 
  #      Correct                <NA> 
  #            2                   1 

  # [[2]]

  #    Incorrect         Incorrect.1   Partially correct Partially correct.1 
  #            5                   3                   1                   3 
  #      Correct                <NA> 
  #            2                   1
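
The per-column logic can also be wrapped in a small reusable function. The sketch below is an illustration only, not part of the original answer: make_levels_unique and the toy data are hypothetical, and the factor with duplicated levels is built by hand with structure() to imitate what the import produced. The sep argument is passed through to make.unique, so sep = " " should give the "Incorrect 1" style of labels used in the question instead of "Incorrect.1".

```r
# Hypothetical helper: make the levels of every factor column unique
make_levels_unique <- function(dat, sep = ".") {
  is_fac <- vapply(dat, is.factor, logical(1))
  dat[is_fac] <- lapply(dat[is_fac], function(x) {
    levels(x) <- make.unique(levels(x), sep = sep)
    x
  })
  dat
}

# Toy data (assumption): a factor whose levels attribute contains duplicates,
# assembled with structure() because factor() would not create one
codes <- c(1L, 2L, 2L, 3L, 4L, 5L)
labs  <- c("Incorrect", "Incorrect", "Partially correct",
           "Partially correct", "Correct")
toy <- data.frame(id = seq_along(codes))
toy$adn01 <- structure(codes, levels = labs, class = "factor")

fixed <- make_levels_unique(toy, sep = " ")
levels(fixed$adn01)

# Check that no factor column still has duplicated levels
still_dup <- vapply(fixed, function(x) {
  is.factor(x) && anyDuplicated(levels(x)) > 0
}, logical(1))
any(still_dup)
```

With a list of data frames such as lst1, lapply(lst1, make_levels_unique) would play the same role as the nested lapply call above.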