Within the apriori function, I want the outcome to contain only these two items on the LHS: HouseOwnerFlag=0 and HouseOwnerFlag=1. The RHS should only contain attributes from the column Product. For instance:
# lhs rhs support confidence lift
# 1 {HouseOwnerFlag=0} => {Product=SV 16xDVD M360 Black} 0.2500000 0.2500000 1.000000
# 2 {HouseOwnerFlag=1} => {Product=Adventure Works 26" 720p} 0.2500000 0.2500000 1.000000
# 3 {HouseOwnerFlag=0} => {Product=Litware Wall Lamp E3015 Silver} 0.1666667 0.3333333 1.333333
# 4 {HouseOwnerFlag=1} => {Product=Contoso Coffee Maker 5C E0900} 0.1666667 0.3333333 1.333333
Part of this is already answered in another question: R arules, mine only rules from specific column
So now I use the following:
rules <- apriori(sales, parameter=list(support =0.01, confidence =0.8, minlen=2), appearance = list(lhs=c("HouseOwnerFlag=0", "HouseOwnerFlag=1")))
Then I use this from that other SO question to ensure that only the Product column is on the RHS:
inspect( subset( rules, subset = rhs %pin% "Product=" ) )
The outcome is like this:
# lhs rhs support confidence lift
# 1 {ProductKey=153, IncomeGroup=Moderate, BrandName=Adventure Works } => {Product=SV 16xDVD M360 Black} 0.2500000 0.2500000 1.000000
# 2 {ProductKey=176, MaritalStatus=M, ProductCategoryName=TV and Video } => {Product=Adventure Works 26" 720p} 0.2500000 0.2500000 1.000000
# 3 {BrandName=Southridge Video, NumberChildrenAtHome=0 } => {Product=Litware Wall Lamp E3015 Silver} 0.1666667 0.3333333 1.333333
# 4 {HouseOwnerFlag=1, BrandName=Southridge Video, ProductKey=170 } => {Product=Contoso Coffee Maker 5C E0900} 0.1666667 0.3333333 1.333333
So apparently the LHS can contain any column, not just HouseOwnerFlag as I specified. From other Stack Overflow questions, I see that I can put default="rhs" in the apriori function, like so:
rules <- apriori(sales, parameter=list(support =0.001, confidence =0.5, minlen=2), appearance = list(lhs=c("HouseOwnerFlag=0", "HouseOwnerFlag=1"), default="rhs"))
Then upon inspecting (without the subset part, just inspect(rules)), there are far fewer rules (7) than before, but the LHS does indeed contain only HouseOwnerFlag:
# lhs rhs support confidence lift
# 1 {HouseOwnerFlag=0} => {MaritalStatus=S} 0.2500000 0.2500000 1.000000
# 2 {HouseOwnerFlag=1} => {Gender=M} 0.2500000 0.2500000 1.000000
# 3 {HouseOwnerFlag=0} => {NumberChildrenAtHome=0} 0.1666667 0.3333333 1.333333
# 4 {HouseOwnerFlag=1} => {Gender=M} 0.1666667 0.3333333 1.333333
However, nothing from the column Product appears on the RHS, so there is no point in inspecting it with subset, as of course it would return nothing. I tested it several times with different support values to see whether Product would appear, but the same 7 rules remain.
So my question is, how can I specify both the LHS (HouseOwnerFlag) and RHS (Product)? What am I doing wrong?
EDIT: You can reproduce this problem by downloading this test dataset from https://www.dropbox.com/s/tax5xalac5xgxtf/testdf.txt?dl=0
Mind you, I only took the first 20 rows from a huge dataset, so unfortunately the output here won't have the same product names as the example I displayed above. But the problem remains the same: I want to get only HouseOwnerFlag=0 and/or HouseOwnerFlag=1 on the LHS and the column Product on the RHS.
It seems that one can't constrain lhs and rhs at once (I did not realise this either before playing with your data). But you can use subset. EDIT: I was wrong, you can also constrain lhs and rhs at once; see Solution 2 below. I keep Solution 1 because in some cases it might be useful to compute a bigger rule set and then split it by the left-hand side.
Solution 1:
rules_sales <- apriori(sales,
                       parameter = list(support = 0.001, confidence = 0.5, minlen = 2, maxlen = 2),
                       appearance = list(lhs = c("HouseOwnerFlag=0", "HouseOwnerFlag=1"),
                                         default = "rhs"))
rules_subset <- subset(rules_sales, (rhs %in% paste0("Product=", unique(sales$Product))))
inspect(rules_subset)
gives:
lhs rhs support confidence lift
1 {HouseOwnerFlag=0} => {Product=SV DVD Movies E100 Yellow} 0.05 0.5 10
2 {HouseOwnerFlag=0} => {Product=Fabrikam Refrigerator 4.6CuFt E2800 Grey} 0.05 0.5 5
3 {HouseOwnerFlag=1} => {Product=Contoso SLR Camera M144 Gold} 0.10 0.5 5
But you should be careful about your low support:
Warning in apriori(sales, parameter = list(support = 0.001, confidence = 0.5, :
You chose a very low absolute support count of 0. You might run out of memory! Increase minimum support.
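To see where the "absolute support count of 0" comes from, here is a minimal sketch of the arithmetic. It assumes the 20-row test dataset mentioned in the question and that the count is obtained by truncating support times the number of transactions; the variable names are only illustrative.

```r
# Assumption: 20 transactions, as in the test dataset from the question
n_transactions <- 20
min_support <- 0.001

# Relative support converted to an absolute count (truncated to an integer)
abs_count <- as.integer(min_support * n_transactions)
abs_count  # 0, i.e. no minimum at all

# Smallest relative support that still requires at least one transaction
support_for_one <- 1 / n_transactions
support_for_one  # 0.05
```

On such a small sample a support of 0.05 already corresponds to a single row, which matches the 0.05 shown in the support column above.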
Solution 2:
I was tricked by the definition of the parameter default. Specifying lhs and rhs together restricts each listed item to the side it is assigned to. The parameter default is automatically set to "both", so all other items not listed in lhs or rhs can still appear on either side. (Explanation of the appearance parameter as implemented in the R package: http://www.inside-r.org/node/86290. I realised that it must be possible when reading the manual of the original C implementation: http://www.borgelt.net/doc/apriori/apriori.html#appearin.) You have to set default="none"; then you can constrain lhs and rhs without using a subset later.
rules_sales <- apriori(sales,
                       parameter = list(support = 0.001, confidence = 0.5, minlen = 2, maxlen = 2),
                       appearance = list(lhs = c("HouseOwnerFlag=0", "HouseOwnerFlag=1"),
                                         rhs = paste0("Product=", unique(sales$Product)),
                                         default = "none"))
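To check that the constraint behaves as described, the same call can be tried on a small made-up data frame. This is only a sketch: toy_sales, its columns and values, and the support threshold are invented for illustration and are not taken from the real dataset.

```r
library(arules)

# Hypothetical stand-in for `sales`; all columns are factors so that
# apriori() can coerce the data frame to transactions
toy_sales <- data.frame(
  HouseOwnerFlag = factor(c(0, 0, 1, 1, 1, 0)),
  Gender         = factor(c("M", "F", "M", "M", "F", "F")),
  Product        = factor(c("A", "A", "B", "B", "B", "C"))
)

toy_rules <- apriori(toy_sales,
                     parameter = list(support = 0.1, confidence = 0.5,
                                      minlen = 2, maxlen = 2),
                     appearance = list(
                       lhs = paste0("HouseOwnerFlag=", levels(toy_sales$HouseOwnerFlag)),
                       rhs = paste0("Product=", levels(toy_sales$Product)),
                       default = "none"))

inspect(toy_rules)

# Both checks should be TRUE: Gender items are excluded by default = "none"
all(lhs(toy_rules) %pin% "HouseOwnerFlag=")
all(rhs(toy_rules) %pin% "Product=")
```

Building the item labels from levels() rather than unique() gives the same labels here; either works as long as the strings match the "column=value" item names that arules generates.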