Appending new records with unseen string values to a dataframe causes an "invalid factor level" warning and results in NA

Farrel · Oct 27, 2009

I have a dataframe (14.5K rows by 15 columns) containing billing data from 2001 to 2007.

I append new 2008 data to it with: alltime <- rbind(alltime,all2008)

Unfortunately that generates a warning:

> Warning message:
In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA,  :
  invalid factor level, NAs generated

My guess is that there are some new patients whose names were not in the previous dataframe and therefore it would not know what level to give those. Similarly new unseen names in the 'referring doctor' column.

What's the solution?

Answer

Marek · Oct 29, 2009

It could be caused by a mismatch of column types between the two data.frames.

First of all, check the types (classes). For diagnostic purposes, do this:

new2old <- rbind( alltime, all2008 ) # this gives you a warning
old2new <- rbind( all2008, alltime ) # this should be without warning

cbind(
    alltime = sapply( alltime, class),
    all2008 = sapply( all2008, class),
    new2old = sapply( new2old, class),
    old2new = sapply( old2new, class)
)

I expect there will be a row that looks like this:

            alltime  all2008   new2old  old2new
...         ...      ...       ...      ...
some_column "factor" "numeric" "factor" "character"
...         ...      ...       ...      ...

If so, here is the explanation: rbind does not check that the types match. If you read the rbind.data.frame code, you can see that the first argument determines the output types. If a column in the first data.frame is a factor, then the output column is a factor with levels unique(c(levels(x1),levels(x2))). But when that column in the second data.frame is not a factor, levels(x2) is NULL, so the levels are not extended.

This means your output data are wrong! There are NAs in place of the true values.
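A minimal sketch of the situation described above, using made-up data: the data frames `old` and `new` and the column `code` are hypothetical stand-ins for the real billing data, and the factor is built explicitly so the example does not depend on the `stringsAsFactors` default. It shows a factor column in the first argument meeting a numeric column in the second.

```r
# Hypothetical toy data: the same column is a factor in the first
# data frame and numeric in the second.
old <- data.frame(code = factor(c("a", "b")))
new <- data.frame(code = c(1, 2))

# The first argument determines the output type (factor), and the numeric
# values do not match any existing level. The warning is suppressed here
# only to keep the example quiet.
combined <- suppressWarnings(rbind(old, new))

class(combined$code)        # class of the combined column
sum(is.na(combined$code))   # how many appended values became NA
```

Swapping the argument order, as in the diagnostic above, lets you compare how the column type changes depending on which data frame comes first.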

I suppose that:

  1. you created your old data with a different R/RODBC version, so the types were produced by different methods (different settings, maybe the decimal separator)
  2. there are NULLs or some unusual data in the problematic column, e.g. someone changed the column in the database.

Solution:

Find the wrong column, find out why it is wrong, and fix it. Eliminate the cause, not the symptoms.
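One possible way to find the wrong column is to compare the classes programmatically instead of reading the table by eye. This is only a sketch: the two small data frames below are invented stand-ins for alltime and all2008, the helper `first_class` is a hypothetical name, and it assumes both data frames have the same column names.

```r
# Hypothetical stand-ins for the real alltime and all2008 data frames.
alltime <- data.frame(
  patient = factor(c("Smith", "Jones")),
  amount  = c(100, 250),
  code    = factor(c("A1", "B2"))
)
all2008 <- data.frame(
  patient = factor(c("Brown", "Smith")),
  amount  = c(80, 120),
  code    = c(11, 22)   # numeric here, but a factor in alltime
)

# Hypothetical helper: the first class of every column, as a named vector.
first_class <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

cls_old <- first_class(alltime)
cls_new <- first_class(all2008[names(alltime)])  # align column order

# Names of the columns whose classes differ between the two data frames.
mismatched <- names(cls_old)[cls_old != cls_new]
mismatched
```

This only tells you which columns to look at. Why the types differ (import settings, odd values in the source data) still has to be investigated at the source.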