R Basket analysis using arules package with unique order number but duplicate order combinations
Just learning R. I'm trying to do a basket analysis using the arules package (but I'm totally open to any other package suggestions!) to compare all possible combinations of 6 different item types being purchased.
My original data set looked like this:
OrderNo, ItemType, ItemCount
111, Health, 1
111, Leisure, 2
111, Sports, 1
222, Health, 3
333, Food, 7
333, Clothing, 1
444, Clothing, 2
444, Health, 1
444, Accessories, 2
. . .
The list goes on; there are about 3,000 observations.
I collapsed the data into a matrix that contains one row for each unique order containing counts of specific ItemType:
OrderNo, Accessories, Clothing, Food, Health, Leisure, Sports
111, 0, 0, 0, 1, 2, 1
222, 0, 0, 0, 3, 0, 0
333, 0, 1, 7, 0, 0, 0
444, 2, 2, 0, 1, 0, 0
. . .
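For reference, a collapse like the one above can be sketched in base R with xtabs. The names orders and wide are illustrative, and the rows are only the sample shown earlier, not the full data set.

```r
# Sample rows from the long-format data (illustrative subset only)
orders <- data.frame(
  OrderNo   = c(111, 111, 111, 222, 333, 333, 444, 444, 444),
  ItemType  = c("Health", "Leisure", "Sports", "Health", "Food",
                "Clothing", "Clothing", "Health", "Accessories"),
  ItemCount = c(1, 2, 1, 3, 7, 1, 2, 1, 2)
)

# Sum ItemCount for every OrderNo x ItemType pair; pairs that never occur get 0
wide <- xtabs(ItemCount ~ OrderNo + ItemType, data = orders)
print(wide)
```

Each row of wide is one order and each column one item type, so the cells hold counts rather than item labels.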
Every time I try to read in the transactions using the following command (and a million attempted variations of it):
tr <- read.transactions("dataset.csv", rm.duplicates=FALSE, format="basket", sep=",")
I get the error message: Error in asMethod(object): can not coerce list with transactions with duplicated items.
I'm assuming this is because I have 3,000 observations and inevitably certain combinations are going to show up more than once (e.g., more than one person is purchasing only one piece of Clothing and nothing else: OrderNo, 0, 1, 0, 0, 0, 0). I know I could collapse the data set on counts of unique combinations, but I'm worried that if I do that, there will be no weights to show the most frequent combinations.
I thought that using format="basket" would account for different orders containing the same item combinations, but apparently that's not the case. I'm so lost. All the documentation I've read implies that this is possible but I can't find any examples or advice on how to approach the problem.
Any advice would be so appreciated! My head is spinning on this one.
Extra info: For my end result, I'm looking to get the five most significant purchase combinations. I don't know if that helps.
Ok, after hours of searching and reading all the pdfs I could find, I finally found the answer (and most helpful walkthrough of apriori/basket analysis ever!) in the DATA MINING Desktop Survival Guide by Graham Williams:
The read.transactions function can also read data from a file with transaction ID and a single item per line (using the format="single" option).
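So there was no need to do all those transformations. I should have just imported straight from the original csv file, specifying the "single" format option instead of "basket". With format="basket", each line of the file is read as the list of item labels in one transaction, so a row of counts such as 0, 0, 0, 1, 2, 1 looks like a basket containing the same item several times. That, not repeated combinations across orders, is what triggered the duplicated-items error. I also had to make sure the file contained no column names and that each item type appeared only once per order number (for instance, if a person ordered two items from the "Grocery" category, this needs to be represented on one row). For the "single" format, cols gives the transaction ID column first and the item column second, so with the order number in column 1 and ItemType in column 2, as in the data above, the option is cols=c(1,2).
tr <- read.transactions(file='dataset.csv', format='single', sep=',', cols=c(1,2))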
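Here is a minimal sketch of the whole workflow, assuming the arules package is installed. The sample rows, the temporary file, and the support threshold of 0.25 are illustrative choices, not values from the real data set. Ranking frequent itemsets by support is only one reasonable reading of "most significant combinations".

```r
library(arules)

# Illustrative sample in "single" layout: one (order, item) pair per row
orders <- data.frame(
  OrderNo  = c(111, 111, 111, 222, 333, 333, 444, 444, 444),
  ItemType = c("Health", "Leisure", "Sports", "Health", "Food",
               "Clothing", "Clothing", "Health", "Accessories")
)

# Write the pairs without a header row and without duplicate pairs
path <- tempfile(fileext = ".csv")
write.table(unique(orders), path, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# cols = c(1, 2): transaction ID in column 1, item label in column 2
tr <- read.transactions(path, format = "single", sep = ",", cols = c(1, 2))

# Mine itemsets of at least two items; the support threshold is an assumption
itemsets <- apriori(
  tr,
  parameter = list(target = "frequent itemsets", support = 0.25, minlen = 2),
  control   = list(verbose = FALSE)
)

# Show the five itemsets with the highest support
inspect(head(sort(itemsets, by = "support"), 5))
```

Orders that share the same combination stay as separate transactions, so frequent combinations simply get higher support and no explicit weights are needed.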