R Basket Analysis using arules package with unique order number but duplicate order combinations

SophiaAP · May 13, 2013

I'm just learning R. I'm trying to do a basket analysis using the arules package (but I'm totally open to any other package suggestions!) to compare all possible combinations of 6 different item types being purchased.

My original data set looked like this:

OrderNo, ItemType, ItemCount  
111, Health, 1  
111, Leisure, 2  
111, Sports, 1  
222, Health, 3      
333, Food, 7  
333, Clothing, 1  
444, Clothing, 2  
444, Health, 1  
444, Accessories, 2  

. . .

The list goes on; there are about 3,000 observations.

I collapsed the data into a matrix that contains one row for each unique order containing counts of specific ItemType:

 OrderNo, Accessories, Clothing, Food, Health, Leisure, Sports  
 111, 0, 0, 0, 1, 2, 1  
 222, 0, 0, 0, 3, 0, 0  
 333, 0, 1, 7, 0, 0, 0  
 444, 2, 2, 0, 1, 0, 0  
 . . .
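
For reference, the collapse described above can be sketched in base R with xtabs(). This is only an illustration built from the sample rows shown earlier; the data frame name `orders` is an assumption, not something from the original script.

```r
# Sample rows copied from the question; `orders` is a hypothetical name.
orders <- data.frame(
  OrderNo   = c(111, 111, 111, 222, 333, 333, 444, 444, 444),
  ItemType  = c("Health", "Leisure", "Sports", "Health", "Food",
                "Clothing", "Clothing", "Health", "Accessories"),
  ItemCount = c(1, 2, 1, 3, 7, 1, 2, 1, 2)
)

# One row per order, one column per item type; cells hold the summed counts.
wide <- xtabs(ItemCount ~ OrderNo + ItemType, data = orders)
print(wide)
```

xtabs() sorts the item types alphabetically, which gives the same column order as the matrix above.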

Every time I try to read in the transactions using the following command (and a million attempted variations of it):

tr <- read.transactions("dataset.csv", rm.duplicates=FALSE, format="basket", sep=",")

I get the error message: Error in asMethod(object): can not coerce list with transactions with duplicated items.

I'm assuming this is because I have 3,000 observations and inevitably certain combinations are going to show up more than once (e.g., more than one person is purchasing only one piece of Clothing and nothing else: OrderNo, 0, 1, 0, 0, 0, 0). I know I could collapse the data set on counts of unique combinations, but I'm worried that if I do that, there will be no weights to show the most frequent combinations.

I thought that using format="basket" would account for different orders containing the same item combinations, but apparently that's not the case. I'm so lost. All the documentation I've read implies that this is possible but I can't find any examples or advice on how to approach the problem.
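
One possible reading of the error, offered as an assumption rather than something confirmed in this thread: with format="basket", every comma-separated field on a line is treated as an item label, so a row of the collapsed matrix is read as one transaction whose "items" are the order number and the counts. The base R sketch below only mimics that splitting step (it does not call arules) to show which labels would repeat within a single line.

```r
# One row of the collapsed matrix, as it would appear in the CSV file.
wide_row <- "111, 0, 0, 0, 1, 2, 1"

# In basket format each comma-separated field is treated as an item label.
fields <- trimws(strsplit(wide_row, ",")[[1]])
print(fields)

# Labels that occur more than once within this single "transaction".
repeated <- unique(fields[duplicated(fields)])
print(repeated)

# What a basket-format line for order 111 would look like instead:
# just the item labels, with no order number and no counts.
basket_line <- paste(c("Health", "Leisure", "Sports"), collapse = ",")
print(basket_line)
```

If this reading is right, the duplicates are inside each line ("0" and "1" repeat), not across orders, so two orders with the same combination would not cause the error by themselves.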

Any advice would be so appreciated! My head is spinning on this one.

Extra info: For my end result, I'm looking to get the top five most significant purchase combinations. I don't know if that helps.

Answer

SophiaAP · May 13, 2013

Ok, after hours of searching and reading all the pdfs I could find, I finally found the answer (and most helpful walkthrough of apriori/basket analysis ever!) in the DATA MINING Desktop Survival Guide by Graham Williams:

The read.transactions function can also read data from a file with transaction ID and a single item per line (using the format="single" option).

So there was no need to do all those transformations. I should have just been importing straight from the original csv file, specifying the "single" format option instead of "basket". I also had to make sure the file contained no column names and that each item type appeared only once per order number (for instance, if a person ordered two items from the "Grocery" category, this needs to be represented on one row). The cols option names the column with the transaction ID first and the column with the item second, so with the order number in column 1 and ItemType in column 2 it is cols=c(1,2).

tr <- read.transactions(file='dataset.csv', format='single', sep=',', cols=c(1,2))
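
A minimal end-to-end sketch of this approach, using only the sample rows from the question. The temporary file stands in for dataset.csv, and cols = c(1, 2) follows the arules documentation, where the first entry is the transaction ID column and the second is the item column. Treating "most significant combinations" as "most frequent itemsets", and the support threshold of 0.25, are assumptions made for illustration.

```r
library(arules)

# Sample rows from the question, written without a header line.
# The temporary file stands in for dataset.csv.
csv_lines <- c(
  "111,Health,1", "111,Leisure,2", "111,Sports,1",
  "222,Health,3",
  "333,Food,7", "333,Clothing,1",
  "444,Clothing,2", "444,Health,1", "444,Accessories,2"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# Column 1 = transaction (order) id, column 2 = item label.
# The third column (ItemCount) is not used.
tr <- read.transactions(path, format = "single", sep = ",", cols = c(1, 2))
summary(tr)

# Frequent itemsets with at least two items; the threshold is illustrative.
itemsets <- eclat(tr, parameter = list(support = 0.25, minlen = 2))

# Five itemsets with the highest support.
inspect(head(sort(itemsets, by = "support"), 5))
```

On the full 3,000-row file the support threshold would need tuning, and orders with identical item combinations simply raise the support of that itemset, so no separate weighting step should be needed.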