Change reference level for variable in R

Mike L · Apr 25, 2013 · Viewed 10.4k times

I have a data set (call it DATA) with a variable, COLOR. The mode of COLOR is numeric and its class is factor. First, I'm a bit confused by the "numeric": when printed, the values of COLOR are not numbers but character values such as White, Blue, or Black. Any clarification on this is appreciated.

Further, I need to write R code to return the levels of the COLOR variable, then determine the current reference level of this variable, and finally set the reference level of this variable to White. I tried using factor, but was entirely unsuccessful.

Thank you for taking the time to help.

Answer

Ben Bolker · Apr 25, 2013

mode(DATA$COLOR) is "numeric" because R internally stores factors as integer codes (to save space), plus an associated vector of labels (the levels) corresponding to the code values. When you print the factor, R automatically substitutes the corresponding label for each code.

f <- factor(c("orange","banana","apple"))
f
## [1] orange banana apple 
## Levels: apple banana orange
str(f)
##  Factor w/ 3 levels "apple","banana",..: 3 2 1
as.integer(f)    ## extract the underlying integer codes (c(f) returns a factor in R >= 4.1.0)
## [1] 3 2 1 
attributes(f)
## $levels
## [1] "apple"  "banana" "orange"
## $class
## [1] "factor"
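To tie this back to the "numeric" confusion in the question, here is a small sketch on the same toy factor (not the asker's data) showing how the different inspection functions see it:

```r
f <- factor(c("orange", "banana", "apple"))

mode(f)     # "numeric": the storage mode of the underlying codes
typeof(f)   # "integer": the codes are stored as integers
class(f)    # "factor": the class attribute that makes R print labels

# Recover the labels as an ordinary character vector
as.character(f)
```

So the "numeric" mode describes the hidden codes, while the class is what controls how the values are displayed.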

... I need to write R code to return the levels of the COLOR variable ...

levels(DATA$COLOR)

... then determine the current reference level of this variable,

levels(DATA$COLOR)[1]

... and finally set the reference level of this variable to White.

DATA$COLOR <- relevel(DATA$COLOR, ref = "White")
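Putting the three steps together, here is a minimal sketch on a made-up data frame. The colors below are invented stand-ins, since the real DATA is not shown in the question:

```r
# Hypothetical stand-in for the asker's data
DATA <- data.frame(
  COLOR = factor(c("White", "Blue", "Black", "Blue", "White"))
)

levels(DATA$COLOR)       # all levels, sorted alphabetically by default
levels(DATA$COLOR)[1]    # the first level is the reference level

# Move "White" to the front; the other levels keep their relative order
DATA$COLOR <- relevel(DATA$COLOR, ref = "White")
levels(DATA$COLOR)
```

relevel() only reorders the levels; the values in the column are unchanged. The reference level matters mainly in modelling, where R's default treatment contrasts use the first level as the baseline.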