How can I get the frequencies of common itemsets from the apriori call in R?

reprogrammer · Jan 13, 2012

Problem:

The apriori function of the arules package infers association rules from the input transactions and reports the support, confidence, and lift of each rule. The association rules are derived from frequent itemsets. I'd like to get the frequent itemsets of the input transactions themselves, that is, all itemsets whose support is at least a given minimum. The support of an itemset is the ratio of the number of transactions that contain the itemset to the total number of transactions.
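To make the definition of support concrete, here is a minimal base R sketch. The helper name `support` and the two hard-coded baskets are illustrative assumptions and are not part of the arules package.

```r
# Two small baskets, each a character vector of items (illustrative data).
transactions <- list(c("a", "b"), c("a", "b", "c"))

# Hypothetical helper: fraction of transactions that contain every item
# of the given itemset.
support <- function(itemset, transactions) {
  contains <- vapply(transactions,
                     function(t) all(itemset %in% t),
                     logical(1))
  mean(contains)
}

support(c("a", "b"), transactions)  # 1: both baskets contain a and b
support(c("b", "c"), transactions)  # 0.5: only the second basket does
```

This brute-force check is only meant to illustrate the ratio; it does not enumerate candidate itemsets the way apriori does.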

Requirements:

  1. I'd strongly prefer to get the frequent itemsets from the intermediate results of the apriori function. That is, I'd rather not write a program from scratch just to compute the frequent itemsets, because the apriori function already computes them as an intermediate step. Nonetheless, if there really is no reasonable way of accessing the intermediate results of the apriori function, I'm open to other solutions.
  2. I'd rather not do string manipulation on the result of the apriori function, because that approach would depend too heavily on the string representation of the result. Again, if it turns out that there are no better alternatives, I may resort to this approach.
  3. I'm aware of the itemFrequency function provided by the arules package. Unfortunately, this function only reports itemsets with a single item. I'm interested in all itemsets of any length with a minimum support.
  4. I'd like the output to be sorted by support numerically (highest first, as in the desired output below) and then by itemset lexicographically.
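The ordering in requirement 4 can be sketched in base R with a two-key `order()` call. The data frame below is made-up illustration data with simplified itemset labels (no braces), and `method = "radix"` is used only so that the string comparison does not depend on the locale.

```r
# Illustrative data: itemset labels and their supports, in arbitrary order.
df <- data.frame(itemset = c("c", "a,b", "b", "a"),
                 support = c(0.5, 1, 1, 1),
                 stringsAsFactors = FALSE)

# Primary key: support, descending (hence the minus sign).
# Secondary key: itemset label, ascending, compared in the C locale.
sorted <- df[order(-df$support, df$itemset, method = "radix"), ]
print(sorted$itemset)
```

With labels that include braces, such as "{a,b}", the relative order of ties can differ between locales, so the collation method may need to be chosen deliberately.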

Example Input:

a,b
a,b,c

Program:

# The following is how I'm using apriori to infer the association rules.
library(package = "arules")
transactions = read.transactions(file = file("stdin"), format = "basket", sep = ",")
rules = apriori(transactions, parameter = list(minlen=1, sup = 0.001, conf = 0.001))
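# WRITE() is the name used by the arules release this was written against;
# newer arules releases provide the same functionality as write().
WRITE(rules, file = "", sep = ",", quote = TRUE, col.names = NA)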

Current Output:

"","rules","support","confidence","lift"
"1","{} => {c}",0.5,0.5,1
"2","{} => {b}",1,1,1
"3","{} => {a}",1,1,1
"4","{c} => {b}",0.5,1,1
"5","{b} => {c}",0.5,0.5,1
"6","{c} => {a}",0.5,1,1
"7","{a} => {c}",0.5,0.5,1
"8","{b} => {a}",1,1,1
"9","{a} => {b}",1,1,1
"10","{b,c} => {a}",0.5,1,1
"11","{a,c} => {b}",0.5,1,1
"12","{a,b} => {c}",0.5,0.5,1

Desired Output:

"itemset","support"
"{a}",1
"{a,b}",1
"{b}",1
"{a,b,c}",0.5
"{a,c}",0.5
"{b,c}",0.5
"{c}",0.5

Answer

reprogrammer · Jan 14, 2012

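I found the generatingItemsets function in the reference manual of the arules package. It returns, for each rule, the itemset that generated it (the union of the rule's left-hand and right-hand sides) together with its support. Because several rules can come from the same itemset, the result may contain duplicates, which unique removes.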

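library(package = "arules")
# Read comma-separated baskets from standard input, one transaction per line.
transactions = read.transactions(file = file("stdin"), format = "basket", sep = ",")
# Mine rules with very low thresholds so that (almost) every itemset yields a rule.
rules = apriori(transactions, parameter = list(minlen=1, sup = 0.001, conf = 0.001))
# One generating itemset per rule; drop the duplicates.
itemsets <- unique(generatingItemsets(rules))
itemsets.df <- as(itemsets, "data.frame")
# Sort by descending support, then by itemset label.
frequentItemsets <- itemsets.df[with(itemsets.df, order(-support, items)),]
names(frequentItemsets)[1] <- "itemset"
write.table(frequentItemsets, file = "", sep = ",", row.names = FALSE)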
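
A possible alternative, sketched here under the assumption that a current arules release is installed: apriori can be asked to return the frequent itemsets directly through its `target = "frequent itemsets"` parameter, which avoids going through rules at all. The in-memory `baskets` list stands in for the stdin input and is illustrative only.

```r
library(arules)

# Illustrative stand-in for the two input lines "a,b" and "a,b,c".
baskets <- list(c("a", "b"), c("a", "b", "c"))
transactions <- as(baskets, "transactions")

# Mine itemsets instead of rules; no confidence threshold is involved.
itemsets <- apriori(transactions,
                    parameter = list(target = "frequent itemsets",
                                     support = 0.001, minlen = 1),
                    control = list(verbose = FALSE))

# Convert to a data frame and sort by descending support, then by label.
itemsets.df <- as(itemsets, "data.frame")
frequentItemsets <- itemsets.df[order(-itemsets.df$support,
                                      as.character(itemsets.df$items)), ]
names(frequentItemsets)[1] <- "itemset"
write.table(frequentItemsets, file = "", sep = ",", row.names = FALSE)
```

Depending on the arules version, the data frame may carry extra columns besides the itemset and its support, and the order of ties among the labels depends on the locale's collation rules.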