I am new to neural networks and I have a question about classification with the nnet package.
I have data that is a mixture of numeric and categorical variables. I want to make a win/lose prediction using nnet with a function call such as
nnet(WL~., data=training, size=10)
but this gives a different result than if I use a data frame containing only numeric versions of the variables (i.e. all the factors converted to numeric, except my prediction WL).
Can someone explain to me what is happening here? I guess nnet is interpreting the variables differently, but I would like to understand what is happening. I appreciate it's difficult without any data to recreate the problem, but I am just looking for a high-level explanation of how neural networks are fitted using nnet. I can't find this anywhere. Many thanks.
str(training)
'data.frame': 1346 obs. of 9 variables:
$ WL : Factor w/ 2 levels "win","lose": 2 2 1 1 NA 1 1 2 2 2 ...
$ team.rank : int 17 19 19 18 17 16 15 14 14 16 ...
$ opponent.rank : int 14 12 36 16 12 30 11 38 27 31 ...
$ HA : Factor w/ 2 levels "A","H": 1 1 2 2 2 2 2 1 1 2 ...
$ comp.stage : Factor w/ 3 levels "final","KO","league": 3 3 3 3 3 3 3 3 3 3 ...
$ days.since.last.match: num 132 9 5 7 14 7 7 7 14 7 ...
$ days.to.next.match : num 9 5 7 14 7 9 7 9 7 8 ...
$ comp.last.match : Factor w/ 5 levels "Anglo-Welsh Cup",..: 5 5 5 5 5 5 3 5 3 5 ...
$ comp.next.match : Factor w/ 4 levels "Anglo-Welsh Cup",..: 4 4 4 4 4 3 4 3 4 3 ...
vs
str(training.nnet)
'data.frame': 1346 obs. of 9 variables:
$ WL : Factor w/ 2 levels "win","lose": 2 2 1 1 NA 1 1 2 2 2 ...
$ team.rank : int 17 19 19 18 17 16 15 14 14 16 ...
$ opponent.rank : int 14 12 36 16 12 30 11 38 27 31 ...
$ HA : num 1 1 2 2 2 2 2 1 1 2 ...
$ comp.stage : num 3 3 3 3 3 3 3 3 3 3 ...
$ days.since.last.match: num 132 9 5 7 14 7 7 7 14 7 ...
$ days.to.next.match : num 9 5 7 14 7 9 7 9 7 8 ...
$ comp.last.match : num 5 5 5 5 5 5 3 5 3 5 ...
$ comp.next.match : num 4 4 4 4 4 3 4 3 4 3 ...
The difference you are looking for can be explained with a very small example:
library(nnet)
fit.factors <- nnet(y ~ x, data.frame(y = factor(c('W', 'L', 'W')), x = factor(c('1', '2', '3'))), size = 1)
fit.factors
# a 2-1-1 network with 5 weights
# inputs: x2 x3
# output(s): y
# options were - entropy fitting
fit.numeric <- nnet(y ~ x, data.frame(y = factor(c('W', 'L', 'W')), x = c(1, 2, 3)), size = 1)
fit.numeric
# a 1-1-1 network with 4 weights
# inputs: x
# output(s): y
# options were - entropy fitting
When fitting models in R, factor variables are expanded into several indicator (dummy) variables. Hence a factor variable x = c('1', '2', '3') is actually split into three variables x1, x2, x3, exactly one of which holds the value 1 while the others hold the value 0. Moreover, since the levels {1, 2, 3} are exhaustive, one (and only one) of x1, x2, x3 must be one, so the variables are not independent: x1 + x2 + x3 = 1. We can therefore drop the first variable x1, keep only x2 and x3 in the model, and conclude that the level is 1 whenever both x2 == 0 and x3 == 0.
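You can see this expansion directly with model.matrix(), which is what the formula interface uses to build the input matrix (nnet then drops the intercept column, since the hidden units carry their own bias terms):
x <- factor(c('1', '2', '3'))
model.matrix(~ x)
#   (Intercept) x2 x3
# 1           1  0  0
# 2           1  1  0
# 3           1  0  1
# (assign/contrasts attributes omitted)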
That is what you see in the output of nnet: when x is a factor, there are actually length(levels(x)) - 1 inputs to the neural network, and when x is a number, there is only one input to the neural network, which is x itself.
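The weight counts in the printed summaries follow directly from the architecture: each hidden unit has one weight per input plus a bias, and the output unit has one weight per hidden unit plus a bias (assuming no skip-layer connections, the nnet default):
(2 + 1) * 1 + (1 + 1) * 1  # 5 weights for the 2-1-1 network (fit.factors)
(1 + 1) * 1 + (1 + 1) * 1  # 4 weights for the 1-1-1 network (fit.numeric)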
Most R regression functions (nnet, randomForest, glm, gbm, etc.) do this mapping from factor levels to dummy variables internally, and as a user one doesn't need to be aware of it.
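For example, the same expansion is visible in the coefficient names of any formula-based fit (a toy lm here, just to illustrate):
d <- data.frame(y = c(0, 1, 0), x = factor(c('1', '2', '3')))
coef(lm(y ~ x, data = d))
# (Intercept)          x2          x3
#           0           1           0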
Now it should be clear what the difference is between using a dataset with factors and a dataset with numbers replacing the factors. If you do the conversion to numbers, then you are imposing an arbitrary ordering on the levels and assuming that consecutive levels are equally spaced (e.g. that 'KO' lies exactly halfway between 'final' and 'league' in comp.stage). This does result in a slightly simpler model (with fewer inputs, as we do not need dummy variables for each level), but it is often not the correct thing to do.
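A sketch of the difference, using the comp.stage levels from your data:
stage <- factor(c('final', 'KO', 'league'), levels = c('final', 'KO', 'league'))
as.numeric(stage)            # 1 2 3 -- a single input with arbitrary order and spacing
model.matrix(~ stage)[, -1]  # two independent dummy inputs, as nnet would build them
#   stageKO stageleague
# 1       0           0
# 2       1           0
# 3       0           1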