Normalize data in R data.frame column

Imlerith · Sep 5, 2016 · Viewed 9.2k times

Suppose I have the following data:

a <- data.frame(var1=letters,var2=runif(26))

I want to scale every value in var2 so that the var2 column sums to 1 (basically turning the var2 column into a probability distribution).

I have tried the following:

a$var2 <- lapply(a$var2,function(x) (x-min(a$var2))/(max(a$var2)-min(a$var2)))

This not only gives an overall sum greater than 1, but also turns the var2 column into a list, on which I can't do operations like sum.

Is there any valid way of turning this column into a probability distribution?

Answer

Zheyuan Li · Sep 5, 2016

If x is a vector with non-negative values, no NA, and a sum greater than zero, you can normalize it with

x / sum(x)

which is a proper probability mass function.
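
Applied to the data frame from the question, this could look like the following minimal sketch. The seed is an assumption added only to make the run reproducible; it is not part of the original question.

```r
# Recreate the example data from the question
set.seed(1)  # assumption: fixed seed for reproducibility only
a <- data.frame(var1 = letters, var2 = runif(26))

# Divide each value by the column total; the arithmetic is vectorized
a$var2 <- a$var2 / sum(a$var2)

# The column stays an ordinary numeric vector and sums to 1
# (up to floating-point rounding)
is.numeric(a$var2)
all.equal(sum(a$var2), 1)
```

Because runif returns values strictly between 0 and 1, the non-negativity and positive-sum conditions hold here; for other data they would need to be checked first.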

The transform you used:

(x - min(x)) / (max(x) - min(x))

only rescales x onto [0, 1], but does not ensure "summation to 1".
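
A small illustration of the difference, using a made-up vector (the values are chosen only for easy arithmetic):

```r
# Hypothetical input, chosen so the results are easy to check by hand
x <- c(2, 4, 6, 8)

# Min-max rescaling: the smallest value maps to 0, the largest to 1
scaled <- (x - min(x)) / (max(x) - min(x))
range(scaled)   # 0 and 1
sum(scaled)     # 0 + 1/3 + 2/3 + 1 = 2, not 1

# Dividing by the total instead gives values that sum to 1
normalized <- x / sum(x)
sum(normalized)
```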


Regarding your code

There is no need to use lapply here:

lapply(a$var2, function(x) (x-min(a$var2)) / (max(a$var2) - min(a$var2)))

Just use a vectorized operation:

a$var2 <- with(a, (var2 - min(var2)) / (max(var2) - min(var2)))

As you said, lapply gives you a list, and that is what "l" in "lapply" refers to. You can use unlist to collapse that list into a vector; or, you can use sapply, where "s" implies "simplification (when possible)".
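
As a rough comparison of the options mentioned, the sketch below rebuilds the example data and checks that unlist and sapply both recover the same numeric vector as the vectorized form. The helper name rescale is hypothetical and introduced only for this illustration.

```r
a <- data.frame(var1 = letters, var2 = runif(26))

# Hypothetical helper: rescale one value x relative to the full vector v
rescale <- function(x, v) (x - min(v)) / (max(v) - min(v))

# lapply always returns a list, one element per input value
as_list <- lapply(a$var2, rescale, v = a$var2)
is.list(as_list)

# unlist collapses that list into a plain numeric vector
from_unlist <- unlist(as_list)

# sapply simplifies the result to a vector when it can
from_sapply <- sapply(a$var2, rescale, v = a$var2)

# The vectorized form needs no apply-family call at all
vectorized <- with(a, (var2 - min(var2)) / (max(var2) - min(var2)))

all.equal(from_unlist, vectorized)
all.equal(from_sapply, vectorized)
```

The vectorized form is also the cheapest of the three, since min and max are computed once rather than once per element.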