Convert factor to integer in a data frame

Anna · Feb 28, 2012 · Viewed 40.5k times

I have the following code

anna.table<-data.frame (anna1,anna2)
write.table(anna.table, file="anna.file.txt", sep='\t', quote=FALSE)

my table in the end contains numbers such as the following

chr         start    end      score
chr2      41237927  41238801    151
chr1      36976262  36977889    226
chr8      83023623  83025129    185

and so on......

After that I am trying to keep only the rows that meet some criterion, such as a score below a specific value, so I am doing the following:

anna3<-"data/anna/anna.file.txt"
anna.total<-read.table(anna3,header=TRUE)
significant.anna<-subset(anna.total,score <=0.001)

Error: In Ops.factor(score, 0.001) <= not meaningful for factors

So I guess the problem is that my table has factors and not integers: anna.total$score is a factor and I must make it an integer.

If I read correctly, as.numeric might solve my problem, but I cannot work out how to use it.

Could you please give me some advice?

thank you in advance

best regards Anna

PS: I tried the following:

anna3<-"data/anna/anna.file.txt"
anna.total<-read.table(anna3,header=TRUE)
anna.total$score.new<-as.numeric (as.character(anna.total$score))
write.table(anna.total,file="peak.list.numeric.v3.txt",append = FALSE ,quote = FALSE,col.names =TRUE,row.names=FALSE, sep="\t")

anna.peaks<-subset(anna.total,fdr.new <=0.001)
Warning messages:
1: In Ops.factor(score, 0.001) : <= not meaningful for factors

Again I have the same problem.

Answer

Gavin Simpson · Feb 28, 2012

With anna.table (it is a data frame by the way, a table is something else!), the easiest way will be to just do:

anna.table2 <- data.matrix(anna.table)

as data.matrix() will convert factors to their underlying numeric (integer) level codes. This works as intended for a data frame that contains only numeric, integer, logical, or factor variables. Character columns are not carried over as strings: data.matrix() first converts them to factors and then to integer codes, so the result is still a numeric matrix, but those codes are rarely what you want.

If you want anna.table2 to be a data frame, not a matrix, then you can subsequently do:

anna.table2 <- data.frame(anna.table2)
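
As a rough illustration of what data.matrix() does to a factor, here is a minimal sketch on made-up data. The names toy, m and toy2 are hypothetical and are not part of the question's data.

```r
## hypothetical toy data: one factor column, one numeric column
toy <- data.frame(grp = factor(c("low", "high", "low")),
                  val = c(1.5, 2.5, 3.5))

m <- data.matrix(toy)   # numeric matrix; the factor is replaced by its integer codes
levels(toy$grp)         # "high" "low": the codes follow this order
m[, "grp"]              # 2 1 2, i.e. positions in levels(), not the labels

toy2 <- data.frame(m)   # back to a data frame, every column numeric
str(toy2)
```

The codes only tell you which level each value had, so this is useful when the labels themselves carry no numeric meaning.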

Another option is to coerce each factor variable to its integer level codes yourself. Here is an example of that:

## dummy data
set.seed(1)
dat <- data.frame(a = factor(sample(letters[1:3], 10, replace = TRUE)), 
                  b = runif(10))

## sapply over `dat`, converting factor to numeric
dat2 <- sapply(dat, function(x) if(is.factor(x)) {
                                    as.numeric(x)
                                } else {
                                    x
                                })
dat2 <- data.frame(dat2) ## convert to a data frame

Which gives:

> str(dat)
'data.frame':   10 obs. of  2 variables:
 $ a: Factor w/ 3 levels "a","b","c": 1 2 2 3 1 3 3 2 2 1
 $ b: num  0.206 0.177 0.687 0.384 0.77 ...
> str(dat2)
'data.frame':   10 obs. of  2 variables:
 $ a: num  1 2 2 3 1 3 3 2 2 1
 $ b: num  0.206 0.177 0.687 0.384 0.77 ...
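
A possible variation on the sapply() approach, assuming you would rather stay in a data frame throughout instead of going through a matrix: assign the result of lapply() back into the data frame with `[]`. The name dat5 is hypothetical.

```r
## same dummy data as above
set.seed(1)
dat <- data.frame(a = factor(sample(letters[1:3], 10, replace = TRUE)),
                  b = runif(10))

## assigning into dat5[] keeps the data frame structure and the column names
dat5 <- dat
dat5[] <- lapply(dat5, function(x) if (is.factor(x)) as.numeric(x) else x)
str(dat5)
```

Non-factor columns are returned unchanged, so each column keeps its own type rather than being forced into a single matrix type.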

However, do note that the above will work only if you want the underlying numeric representation. If your factor has essentially numeric levels, then we need to be a bit cleverer in how we convert the factor to a numeric whilst preserving the "numeric" information coded in the levels. Here is an example:

## dummy data
set.seed(1)
dat3 <- data.frame(a = factor(sample(1:3, 10, replace = TRUE), levels = 3:1), 
                   b = runif(10))

## sapply over `dat3`, converting factor to numeric
dat4 <- sapply(dat3, function(x) if(is.factor(x)) {
                                    as.numeric(as.character(x))
                                } else {
                                    x
                                })
dat4 <- data.frame(dat4) ## convert to a data frame

Note how we need to do as.character(x) first before we do as.numeric(). The extra call recovers the level labels as character strings, which as.numeric() then parses as numbers. To see why this matters, note what dat3$a is:

> dat3$a
 [1] 1 2 2 3 1 3 3 2 2 1
Levels: 3 2 1

If we just convert that to numeric, we get the wrong data, as R converts the underlying level codes:

> as.numeric(dat3$a)
 [1] 3 2 2 1 3 1 1 2 2 3

If we coerce the factor to a character vector first, then to a numeric one, we preserve the original information rather than R's internal representation:

> as.numeric(as.character(dat3$a))
 [1] 1 2 2 3 1 3 3 2 2 1
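
The same conversion can also be written by converting the levels once and then indexing them with the factor, which is the form suggested in the help page for factor. A small sketch on a hypothetical factor f:

```r
## hypothetical factor whose labels are numbers, with the levels in reverse order
f <- factor(c(1, 2, 2, 3), levels = 3:1)

as.numeric(f)                 # level codes: 3 2 2 1
as.numeric(as.character(f))   # labels parsed as numbers: 1 2 2 3
as.numeric(levels(f))[f]      # same result, converting each level only once
```

Both of the last two lines give the original values; the levels() form only does less work when the factor is long and has few levels.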

If your data are like this second example, then you can't use the simple data.matrix() trick: it is the same as applying as.numeric() directly to the factor and, as this second example shows, that doesn't preserve the original information.
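
Applied to data shaped like the table in the question, the conversion might look like the sketch below. The file contents, the "n/a" entry and the threshold of 200 are all assumptions made for illustration; stringsAsFactors = TRUE is set explicitly so that the score column is read as a factor regardless of the R version.

```r
## hypothetical file contents modelled on the question; the "n/a" entry is an
## assumption, added to show one way a score column can end up non-numeric
txt <- "chr\tstart\tend\tscore
chr2\t41237927\t41238801\t151
chr1\t36976262\t36977889\t226
chr8\t83023623\t83025129\tn/a"

anna.total <- read.table(text = txt, header = TRUE, sep = "\t",
                         stringsAsFactors = TRUE)
str(anna.total$score)   # a factor, because "n/a" cannot be read as a number

## convert via the labels; entries that are not numbers become NA
anna.total$score.new <- suppressWarnings(
    as.numeric(as.character(anna.total$score)))

## subset on the converted column, not on the original factor
significant.anna <- subset(anna.total, score.new <= 200)
significant.anna
```

Whichever column holds the converted values is the one to name inside subset(); comparing the original factor column will keep producing the Ops.factor message.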