Assume we have 3 rules:
[1] {A,B,D} -> {C}
[2] {A,B} -> {C}
[3] Whatever it is
Rule [2] is a subset of rule [1] (because rule [1] contains all the items in rule [2]), so rule [1] should be eliminated (because rule [1] is too specific and its information is included in rule [2]).
I searched the internet, and everyone seems to use this code to remove redundant rules:
subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=T) >= 1
which(redundant)
rules.pruned <- rules.sorted[!redundant]
I don't understand how the code works.
After line 2 of the code, subset.matrix becomes:
     [,1] [,2] [,3]
[1,]   NA    1    0
[2,]   NA   NA    0
[3,]   NA   NA   NA
The cells in the lower triangle are set to NA, and since rule [2] is a subset of rule [1], the corresponding cell is set to 1. So I have 2 questions:
1. Why do we have to set the lower triangle to NA? If we do, how can we check whether rule [2] is a subset of rule [3]? (That cell has been set to NA.)
2. In our case, rule [1] should be the one eliminated, but this code eliminates rule [2] instead of rule [1]. (The first cell in column 2 is 1, so according to line 3 of the code the column sum of column 2 is >= 1, and rule [2] is therefore treated as redundant.)
Any help would be appreciated!
For your code to work, you need an interest measure (confidence or lift), and rules.sorted needs to be sorted by either confidence or lift. In any case, the code is horribly inefficient, since is.subset() creates a matrix of size n^2, where n is the number of rules. Also, is.subset() for rules merges the RHS and LHS of each rule, which is not correct. So don't worry too much about the implementation details.
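To see what the masking and colSums() steps do once the rules are sorted, here is a small hand-built sketch in base R (no arules needed). The rule labels, the matrix values and the ordering are assumptions for illustration: I assume the sort has put the more general rule {A,B} => {C} ahead of {A,B,D} => {C}, and that entry [i, j] is TRUE when all items of rule i are contained in rule j.

```r
# Hypothetical subset matrix for three rules that are ALREADY sorted by
# decreasing confidence (best rule first). Values are made up by hand.
rule.names <- c("{A,B}=>{C}", "{A,B,D}=>{C}", "{E}=>{F}")
subset.matrix <- matrix(
  c(TRUE,  TRUE,  FALSE,
    FALSE, TRUE,  FALSE,
    FALSE, FALSE, TRUE),
  nrow = 3, byrow = TRUE,
  dimnames = list(rule.names, rule.names)
)

# Mask the diagonal (every rule contains itself) and the lower triangle
# (comparisons of a rule against rules ranked above it).
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA

# A column with at least one TRUE left means some earlier-ranked rule is
# contained in that column's rule, so the column's rule is flagged.
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1
print(names(which(redundant)))
```

Under these assumptions the flagged column is the more specific rule {A,B,D} => {C}. Which rule gets dropped depends entirely on the sort order, which is why the code only makes sense on sorted rules.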
A more efficient way to do this is now implemented as the function is.redundant() in package arules (available in version 1.4-2).
This explanation comes from the manual page:
A rule is redundant if a more general rule with the same or a higher confidence exists. That is, a more specific rule is redundant if it is only equally or even less predictive than a more general rule. A rule is more general if it has the same RHS but one or more items removed from the LHS. Formally, a rule X -> Y is redundant if
for some X' subset X, conf(X' -> Y) >= conf(X -> Y).
This is equivalent to a negative or zero improvement as defined by Bayardo et al. (2000). In this implementation other measures than confidence, e.g. improvement of lift, can be used as well.
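The definition above can be spelled out in a few lines of base R. This is only a sketch of the definition, not how arules implements it: the rule list, the confidence values and the helper name dominates() are all made up for illustration.

```r
# Hypothetical rules; the confidence values are invented for this example.
rules <- list(
  list(lhs = c("A", "B", "D"), rhs = "C", conf = 0.80),
  list(lhs = c("A", "B"),      rhs = "C", conf = 0.85),
  list(lhs = c("A"),           rhs = "C", conf = 0.60)
)

# TRUE when `general` has the same RHS, an LHS that is a proper subset of
# the LHS of `specific`, and a confidence at least as high.
dominates <- function(general, specific) {
  identical(general$rhs, specific$rhs) &&
    length(general$lhs) < length(specific$lhs) &&
    all(general$lhs %in% specific$lhs) &&
    general$conf >= specific$conf
}

# A rule is redundant if any other rule in the set dominates it.
is_redundant <- vapply(rules, function(r) {
  any(vapply(rules, dominates, logical(1), specific = r))
}, logical(1))

print(is_redundant)
```

With these made-up numbers only {A,B,D} -> {C} is flagged, because {A,B} -> {C} is more general and has a higher confidence. {A,B} -> {C} itself survives, since the even more general {A} -> {C} has a lower confidence.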
Check out the examples in ?is.redundant.
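A minimal usage sketch, assuming arules 1.4-2 or later is installed. The Groceries data set (shipped with arules) and the support and confidence thresholds are arbitrary choices for illustration, not recommendations.

```r
library(arules)

# Example transactions shipped with the arules package.
data("Groceries")

# Mine some rules; the thresholds are arbitrary illustration values.
rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.8))

# Logical vector: TRUE for rules that have a more general rule with the
# same or a higher confidence.
redundant <- is.redundant(rules, measure = "confidence")

# Keep only the non-redundant rules.
rules.pruned <- rules[!redundant]
```

Unlike the is.subset() approach, this does not require sorting the rules first.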