Association rules in R: removing redundant rules (arules)

BigData · Aug 5, 2016

Assume we have 3 rules:

[1] {A,B,D} -> {C}

[2] {A,B} -> {C}

[3] Whatever it is

Rule [2] is a subset of rule [1] (because rule [1] contains all the items in rule [2]), so rule [1] should be eliminated: it is more specific, and the information it carries is already covered by rule [2].

I searched the internet, and everyone uses this code to remove redundant rules:

subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1
which(redundant)
rules.pruned <- rules.sorted[!redundant]

I don't understand how the code works.

After line 2 of the code, subset.matrix becomes:

      [,1] [,2] [,3]
[1,]   NA    1    0
[2,]   NA   NA    0
[3,]   NA   NA   NA

The cells in the lower triangle are set to NA, and since rule [2] is a subset of rule [1], the corresponding cell is set to 1. So I have two questions:

  1. Why do we have to set the lower triangle to NA? If we do, how can we check whether rule [2] is a subset of rule [3]? (That cell has been set to NA.)

  2. In our case, rule [1] should be the one eliminated, but this code eliminates rule [2] instead. (The first cell in column 2 is 1, so according to line 3 of the code the column sum of column 2 is >= 1, and rule [2] is treated as redundant.)

Any help would be appreciated!

Answer

Michael Hahsler · Aug 7, 2016

For your code to work, you need an interest measure (confidence or lift), and rules.sorted needs to be sorted by that measure. Anyway, the code is horribly inefficient, since is.subset() creates a matrix of size n^2, where n is the number of rules. Also, is.subset() for rules merges the LHS and RHS of each rule, which is not correct. So don't worry too much about the implementation details.
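To illustrate why the ordering matters, here is a minimal base-R sketch. The 3x3 matrix is built by hand rather than by is.subset(), and the rule labels g, s and o are hypothetical. It assumes that entry [i, j] is TRUE when all items of rule i are contained in rule j, and that the rules are already sorted best-first.

```r
# Hypothetical rules, assumed sorted best-first (e.g. by lift):
#   g = {A,B}   -> {C}   (more general)
#   s = {A,B,D} -> {C}   (more specific)
#   o = some unrelated rule
labels <- c("g", "s", "o")
m <- matrix(FALSE, nrow = 3, ncol = 3, dimnames = list(labels, labels))
diag(m) <- TRUE        # every rule is a subset of itself
m["g", "s"] <- TRUE    # all items of g are contained in s

# Discard the diagonal and everything below it, so each column only
# keeps comparisons against rules ranked above it
m[lower.tri(m, diag = TRUE)] <- NA

# A column sum >= 1 means some higher-ranked rule is a subset of this rule
redundant <- colSums(m, na.rm = TRUE) >= 1
print(redundant)
```

Under these assumptions only s is flagged, because the higher-ranked, more general rule g is contained in it. With the opposite order (s ranked above g), the TRUE entry would fall in the lower triangle and be discarded, so nothing would be flagged.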

A more efficient way to do this is now implemented as the function is.redundant() in package arules (available in version 1.4-2). This explanation comes from the manual page:

A rule is redundant if a more general rule with the same or a higher confidence exists. That is, a more specific rule is redundant if it is only equally or even less predictive than a more general rule. A rule is more general if it has the same RHS but one or more items removed from the LHS. Formally, a rule X -> Y is redundant if

for some X' subset X, conf(X' -> Y) >= conf(X -> Y).

This is equivalent to a negative or zero improvement as defined by Bayardo et al. (2000). In this implementation other measures than confidence, e.g. improvement of lift, can be used as well.
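The definition can be spelled out in a few lines of base R. This is only a sketch of the idea, not how arules implements it: the rules and their confidence values are made up, the helper is_more_general() is hypothetical, and only rules present in the list are compared.

```r
# Hypothetical rules with made-up confidence values (not arules output)
rules <- list(
  list(lhs = c("A", "B", "D"), rhs = "C", conf = 0.50),
  list(lhs = c("A", "B"),      rhs = "C", conf = 0.75),
  list(lhs = c("E"),           rhs = "F", conf = 0.90)
)

# TRUE if `general` has the same RHS as `specific` and its LHS is a
# proper subset of the LHS of `specific`
is_more_general <- function(general, specific) {
  identical(general$rhs, specific$rhs) &&
    length(general$lhs) < length(specific$lhs) &&
    all(general$lhs %in% specific$lhs)
}

# A rule is redundant if some more general rule in the list has the
# same or a higher confidence
is_redundant <- vapply(rules, function(r) {
  any(vapply(rules, function(g) {
    is_more_general(g, r) && g$conf >= r$conf
  }, logical(1)))
}, logical(1))

print(is_redundant)
```

Here the first rule, {A,B,D} -> {C}, is flagged because {A,B} -> {C} has the same RHS, a smaller LHS, and a higher (made-up) confidence.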

Check out the examples in ?is.redundant.
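As a rough usage sketch, pruning might look like the following. The toy transactions and the thresholds are invented for illustration; the code assumes arules is installed.

```r
library(arules)

# Toy transactions, invented for this example
trans <- as(list(
  c("A", "B", "C", "D"),
  c("A", "B", "C"),
  c("A", "B", "C"),
  c("A", "B", "D"),
  c("B", "C"),
  c("A", "D")
), "transactions")

# Mine rules; the thresholds are arbitrary
rules <- apriori(trans,
                 parameter = list(support = 0.1, confidence = 0.5, minlen = 2),
                 control = list(verbose = FALSE))

# Logical vector with one entry per rule; the default measure is confidence
redundant <- is.redundant(rules)

inspect(rules[redundant])            # rules that would be dropped
rules.pruned <- rules[!redundant]    # keep only the non-redundant rules
```

Unlike the is.subset() approach, this does not require sorting the rules first.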