I am new to this area and its terminology, so please correct me if I go wrong somewhere. I have two datasets like this:
A B C 0 E
A 0 C 0 0
A 0 C D E
A 0 C 0 E
The way I interpret this is that, at some point in time, (A,B,C,E) occurred together, and so did (A,C), (A,C,D,E), etc.
5A 1B 5C 0 2E
4A 0 5C 0 0
2A 0 1C 4D 4E
3A 0 4C 0 3E
The way I interpret this is that, at some point in time, 5 occurrences of A, 1 occurrence of B, 5 occurrences of C and 2 occurrences of E happened together, and so on.
I am trying to find which items occur together and, if possible, the cause and effect behind this. I do not understand how to use the two datasets for this (or whether one of them is enough). A good tutorial would help, but my primary question is which dataset to use and how to proceed in (i) building frequent itemsets and (ii) building association rules between them.
Can someone point me to practical tutorials or examples (preferably in Python), or at least briefly explain how to approach this problem?
Some theoretical facts about association rules:
To find association rules, you can use the Apriori algorithm. Many Python implementations already exist, although most of them are not efficient enough for practical use:
or use the Orange data mining library, which has a good module for association rules.
Usage example:
'''
Save the first example as item.basket, one transaction per line:
A, B, C, E
A, C
A, C, D, E
A, C, E
Start ipython in the directory where the file is saved, or switch to it with the os module:
>>> import os
>>> os.chdir("c:/orange")
'''
import orange

# Load item.basket from the current directory
items = orange.ExampleTable("item")
# Play with the support argument to filter out rules
rules = orange.AssociationRulesSparseInducer(items, support = 0.1)
for r in rules:
    # Columns: support, confidence, rule
    print "%5.3f %5.3f %s" % (r.support, r.confidence, r)
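If you want to see what happens under the hood, or cannot install Orange, the two steps from the question can be sketched in plain Python. This is only a minimal illustration, not what Orange does internally: the function names (`support`, `frequent_itemsets`, `association_rules`) and the thresholds 0.5 and 0.8 are my own assumptions, and the level-wise search is a simplified Apriori-style loop that will not scale to large data. Support is the fraction of rows containing an itemset; confidence of a rule X -> Y is support(X and Y) divided by support(X).

```python
from itertools import combinations

# First dataset from the question, zeros dropped: one set of items per row.
transactions = [
    {"A", "B", "C", "E"},
    {"A", "C"},
    {"A", "C", "D", "E"},
    {"A", "C", "E"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / float(len(transactions))

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search. Returns {frozenset: support}."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # Keep only the candidates that are frequent enough.
        kept = {}
        for c in level:
            s = support(c, transactions)
            if s >= min_support:
                kept[c] = s
        frequent.update(kept)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates, then prune
        # any candidate that has an infrequent (k-1)-subset.
        joined = {a | b for a in kept for b in kept if len(a | b) == k}
        level = [c for c in joined
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return frequent

def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for every qualifying split."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
                conf = sup / frequent[lhs]
                if conf >= min_confidence:
                    found.append((lhs, itemset - lhs, sup, conf))
    return found

frequent = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.8)
for lhs, rhs, sup, conf in rules:
    print("%5.3f %5.3f %s -> %s" % (sup, conf, sorted(lhs), sorted(rhs)))
```

Note that a rule such as E -> A only says that rows containing E also tend to contain A; it records co-occurrence, not cause and effect. If only the count dataset were available, treating any non-zero count as "present" would reduce it to this form, at the cost of discarding the counts.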
To learn more about association rules and frequent itemset mining, my selection of books is:
There is no shortcut.