Suppose I have the following data:
a <- data.frame(var1=letters,var2=runif(26))
Suppose I want to scale every value in var2 such that the sum of the var2 column is equal to 1 (basically turning the var2 column into a probability distribution).
I have tried the following:
a$var2 <- lapply(a$var2,function(x) (x-min(a$var2))/(max(a$var2)-min(a$var2)))
This not only gives an overall sum greater than 1 but also turns the var2 column into a list, on which I can't do operations like sum.
Is there any valid way of turning this column into a probability distribution?
If you have a vector x with non-negative values, no NA, and a positive total, you can normalize it by
x / sum(x)
which gives a proper probability mass function: every value is non-negative and the values sum to 1.
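As a minimal sketch, here is that normalization applied to the data frame from the question. The call to set.seed is an addition made only so the example is reproducible; it is not part of the original code.

```r
set.seed(1)  # arbitrary seed, added only for reproducibility
a <- data.frame(var1 = letters, var2 = runif(26))

# Divide every value by the column total; `/` is vectorized, so no loop is needed
a$var2 <- a$var2 / sum(a$var2)

is.numeric(a$var2)          # still a plain numeric column, not a list
all(a$var2 >= 0)            # all values are non-negative
all.equal(sum(a$var2), 1)   # the sum is 1 up to floating-point tolerance
```

Comparing the sum with all.equal rather than == avoids false negatives from floating-point rounding.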
The transformation you used:
(x - min(x)) / (max(x) - min(x))
only rescales x onto [0, 1], but does not ensure that the values sum to 1.
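A small illustration of the difference, using a hypothetical four-element vector chosen only to keep the arithmetic easy to check by hand:

```r
x <- c(1, 2, 3, 4)  # hypothetical example values

# Min-max rescaling: the smallest value maps to 0 and the largest to 1
scaled <- (x - min(x)) / (max(x) - min(x))
range(scaled)   # 0 and 1
sum(scaled)     # 0 + 1/3 + 2/3 + 1 = 2, not 1

# Normalizing by the total: the values sum to 1
p <- x / sum(x)
p               # 0.1 0.2 0.3 0.4
sum(p)          # 1
```

Min-max rescaling fixes the endpoints of the range, so the sum depends on how the values are spread; dividing by the total fixes the sum instead.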
Regarding your code
There is no need to use lapply here:
lapply(a$var2, function(x) (x-min(a$var2)) / (max(a$var2) - min(a$var2)))
Just use a vectorized operation:
a$var2 <- with(a, (var2 - min(var2)) / (max(var2) - min(var2)))
As you said, lapply gives you a list, which is what the "l" in "lapply" refers to. You can use unlist to collapse that list into a vector, or you can use sapply, where the "s" stands for "simplification (when possible)".
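The three routes can be compared side by side. This is only a sketch: the vector v and the helper rescale are hypothetical names introduced for the illustration, and the seed is arbitrary.

```r
set.seed(1)       # arbitrary seed, added only for reproducibility
v <- runif(26)    # stands in for a$var2

# Hypothetical helper: min-max rescale one value against the whole vector v
rescale <- function(x) (x - min(v)) / (max(v) - min(v))

as_list    <- lapply(v, rescale)   # a list of 26 length-one numerics
from_list  <- unlist(as_list)      # the list collapsed into a numeric vector
simplified <- sapply(v, rescale)   # sapply simplifies to a numeric vector
vectorized <- (v - min(v)) / (max(v) - min(v))  # no apply function at all

is.list(as_list)                    # TRUE, which is why sum() fails on it
is.numeric(from_list)               # TRUE
all.equal(from_list, simplified)    # TRUE
all.equal(simplified, vectorized)   # TRUE
```

All three vector results agree, but the vectorized form is the shortest and avoids recomputing min(v) and max(v) for every element.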