I have built a decision tree using the ctree function from the party package. It has 1700 nodes.
Firstly, is there a way to give ctree the maxdepth argument? I tried the control_ctree option, but it threw an error message saying it couldn't find the ctree function.
Also, how can I consume the output of this tree? How can it be implemented on other platforms like SAS or SQL? I also have another doubt: what does the value "* weights = 4349" at the end of the node signify? How will I know which terminal node votes for which predicted value?
There is a maxdepth option in ctree. It is located in ctree_control(). You can use it as follows:
library(party)  # provides ctree() and ctree_control()
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq, controls = ctree_control(maxdepth = 3))
You can also require the split sizes and the bucket sizes to be no less than a given value:
airct <- ctree(Ozone ~ ., data = airq, controls = ctree_control(minsplit= 50, minbucket = 20))
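You can also raise mincriterion, which lowers the p-value a test must reach before a split is made (mincriterion is 1 - p-value, so 0.99 requires p < 0.01 and makes the tree less eager to split):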
airct <- ctree(Ozone ~ ., data = airq, controls = ctree_control(mincriterion = 0.99))
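These control arguments can also be combined in a single ctree_control() call. The following is only a minimal sketch, assuming the party package is installed; the names airct_small and n_terminal and the particular values are just illustrative choices:

```r
library(party)

airq <- subset(airquality, !is.na(Ozone))

# Combine several restrictions in one ctree_control() call
airct_small <- ctree(
  Ozone ~ ., data = airq,
  controls = ctree_control(maxdepth = 2, minsplit = 20, minbucket = 7,
                           mincriterion = 0.95)
)

# where() returns the terminal node id of every observation,
# so the number of distinct ids is the number of terminal nodes
n_terminal <- length(unique(where(airct_small)))
n_terminal
```

With maxdepth = 2 the tree can have at most four terminal nodes, which is a quick way to check that the control settings were actually picked up.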
The weights = 4349 you've mentioned is just the number of observations in that specific node. ctree gives every observation a weight of 1 by default, but if you feel that some observations deserve bigger weights you can pass a weights vector to ctree(). It has to be the same length as the data set and consist of non-negative integers. Once you do that, the weights = 4349 value has to be interpreted with caution, because it is then a sum of weights rather than a plain count.
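To see that, with default weights, the printed weights value is simply a count, one can compare the sum of the weights stored in each terminal node with the number of observations assigned to it. This is a rough sketch on the airquality data, assuming the party package is installed; where() gives the terminal node id of each observation, and node_ids, counts and weight_sums are illustrative names:

```r
library(party)

airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq, controls = ctree_control(maxdepth = 3))

# Terminal node ids, and how many observations landed in each of them
node_ids <- sort(unique(where(airct)))
counts <- table(where(airct))[as.character(node_ids)]

# Sum of the case weights stored in each terminal node
weight_sums <- sapply(node_ids,
                      function(id) sum(nodes(airct, id)[[1]]$weights))

# Side-by-side comparison: the two columns should agree
data.frame(node = node_ids, n_obs = as.vector(counts), weights = weight_sums)
```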
One way of using weights is to see which observations fell in a certain node. Using the data in the example above we can perform the following:
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq, controls = ctree_control(maxdepth = 3))
unique(where(airct)) # get the ids of the terminal nodes
[1] 5 3 6 9 8
So we can check what fell in node number 5, for example:
n <- nodes(airct , 5)[[1]]
x <- airq[which(as.logical(n$weights)), ]
x
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
...
Using this method you can create data sets that contain the information of your terminal nodes and then import them into SAS or SQL.
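As a rough illustration of that export step, the sketch below attaches the terminal node id and the fitted value to every observation, builds a small lookup table showing which node predicts which value, and writes both to CSV. It assumes the party package is installed; the file names, the tempdir() output location and the object names scored and node_table are hypothetical choices, not part of party:

```r
library(party)

airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq, controls = ctree_control(maxdepth = 3))

# Attach the terminal node id and the fitted value to every observation
scored <- airq
scored$node <- where(airct)
scored$predicted <- as.numeric(predict(airct))

# One row per terminal node: the value it predicts and its size
node_table <- aggregate(predicted ~ node, data = scored, FUN = mean)
node_table$n_obs <- as.vector(
  table(scored$node)[as.character(node_table$node)]
)

# Write plain CSV files that can then be imported elsewhere
out_dir <- tempdir()  # hypothetical output location
write.csv(scored, file.path(out_dir, "airq_scored.csv"), row.names = FALSE)
write.csv(node_table, file.path(out_dir, "airq_nodes.csv"), row.names = FALSE)
```

For a regression tree like this one, all observations in a terminal node share the same fitted value, so node_table works as a node-to-prediction lookup that can be joined against node ids on the other platform.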
You can also get the list of splitting conditions using the function from my answer to "ctree() - How to get the list of splitting conditions for each terminal node?"