I understand this is a very basic question but I don't understand what levels mean in R.
For reference, I have done a simple script to read CSV table, filter on one of the fields, pass this on to a new variable and clear the memory allocated for the first variable. If I call unique() on the field on which I filtered, I see that the results were indeed filtered but there is one additional line showing 'Levels' corresponding to data that is in the original dataset.
Example:
df = read.csv(path, sep=",", header=TRUE)
df_intrate = df[df$AssetClass == "ASSET CLASS A", ]
rm(df)
gc()
unique(df_intrate$AssetClass)
Results:
[1] ASSET CLASS A
Levels: ASSET CLASS E ASSET CLASS D ASSET CLASS C ASSET CLASS B ASSET CLASS A
Is the structural information from df
somehow preserved in df_intrate despite R studio showing that df_intrate is indeed the expected number of rows for ASSET CLASS A
?
Is the structural information from df somehow preserved in df_intrate despite R studio showing that df_intrate is indeed the expected number of rows for ASSET CLASS A ?
Yes. This is how categorical variables, called factors, are stored in R - both the levels, a vector of all possible values, and the actual values taken, are stored:
x = factor(c('a', 'b', 'c', 'a', 'b', 'b'))
x
# [1] a b c a b b
# Levels: a b c
y = x[1]
# [1] a
# Levels: a b c
You can get rid of unused levels with droplevels()
, or by re-applying the factor
function, creating a new factor out of only what is present:
droplevels(y)
# [1] a
# Levels: a
factor(y)
# [1] a
# Levels: a
You can also use droplevels
on a data frame to drop all unused levels from all factor columns:
dat = data.frame(x = x)
str(dat)
# 'data.frame': 6 obs. of 1 variable:
# $ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 2
str(dat[1, ])
# Factor w/ 3 levels "a","b","c": 1
str(droplevels(dat[1, ]))
# Factor w/ 1 level "a": 1
Though unrelated to your current issue, we should also mention that factor
has an optional levels
argument which can be used to specify the levels of a factor and the order in which they should go. This can be useful if you want a specific order (perhaps for plotting or modeling), or if there are more possible levels than are actually present and you want to include them. If you don't specify the levels
, the default will be alphabetical order.
x = c("agree", "disagree", "agree", "neutral", "strongly agree")
factor(x)
# [1] agree disagree agree neutral strongly agree
# Levels: agree disagree neutral strongly agree
## not a good order
factor(x, levels = c("disagree", "neutral", "agree", "strongly agree"))
# [1] agree disagree agree neutral strongly agree
# Levels: disagree neutral agree strongly agree
## better order
factor(x, levels = c("strongly disagree", "disagree", "neutral", "agree", "strongly agree"))
# [1] agree disagree agree neutral strongly agree
# Levels: strongly disagree disagree neutral agree strongly agree
## good order, more levels than are actually present
You can use ?reorder
and ?relevel
(or just factor
again) to change the order of levels for an already created factor.