Hi all and thanks in advance. I am new to the NoSQL game but my current place of employment has tasked me with set comparisons of some big data.
Our system has customer tag set and targeted tag sets.
A tag is an 8 digit number.
A customer tag set may have up to 300 tags but averages 100 tags
A targeted tag set may have up to 300 tags but averages 40 tags.
Pre calculating is not an option as we are shooting for a potential customer base of a billion users.
(These tags are hierarchical so having one tag implies that you also have its parent and ancestor tags. Put that info aside for the moment.)
When a customer hits our site, we need to intersect their tag set against one million targeted tag sets as fast as possible. The customer set must contain all elements of the targeted set to match.
I have been exploring my options and the set intersection in Redis seems like it would be ideal. However, my trolling through the internet has not revealed how much ram would be required to hold one million tag sets. I realize the intersection would be lightning fast, but is this a feasable solution with Redis.
I realize this is brute force and inefficient. I also wanted to use this question as means to get suggestions for ways this type of problem has been handled in the past. As stated before, the tags are stored in a tree. I have begun looking at Mongodb as a possible solution as well.
Thanks again
This is an interesting problem, and I think Redis can help here.
Redis can store sets of integers using an optimized "intset" format. See http://redis.io/topics/memory-optimization for more information.
I believe the correct data structure here is a collection of targeted tag sets, plus a reverse index to map tags to their targeted tag sets.
To store two targeted tag sets:
0 -> [ 1 2 3 4 5 6 7 8 ]
1 -> [ 6 7 8 9 10 ]
I would use:
# Targeted tag sets
sadd tgt:0 1 2 3 4 5 6 7 8
sadd tgt:1 2 6 7 8 9 10
# Reverse index
sadd tag:0 0
sadd tag:1 0
sadd tag:2 0 1
sadd tag:3 0
sadd tag:4 0
sadd tag:5 0
sadd tag:6 0 1
sadd tag:7 0 1
sadd tag:8 0 1
sadd tag:9 1
sadd tag:10 1
This reverse index is quite easy to maintain when targeted tag sets are added/removed from the system.
The global memory consumption depends on the number of tags which are common to multiple targeted tag sets. It is quite easy to store pseudo-data in Redis and simulate the memory consumption. I have done it using a simple node.js script.
For 1 million targeted tag sets (tags being 8 digits numbers, 40 tags per set), the memory consumption is close to 4 GB when there are very few tags shared by the targeted tag sets (more than 32M entries in the reverse index), and about 500 MB when the tags are shared a lot (only 100K entries in the reverse index).
With this data structure, finding the targeted tag sets containing all the tags of a given customer is extremely efficient.
1- Get customer tag set (suppose it is 1 2 3 4)
2- SINTER tag:1 tag:2 tag:3 tag:4
=> result is a list of targeted tag sets having all the tags of the customer
The intersection operation is efficient because Redis is smart enough to order the sets per cardinality and starts with the set having the lowest cardinality.
Now I understand you need to implement the converse operation (i.e. finding the targeted tag sets having all their tags in the customer tag set). The reverse index can still help.
Here in an example in ugly pseudo-code:
1- Get customer tag set (suppose it is 1 2 3 4)
2- SUNIONSTORE tmp tag:1 tag:2 tag:3 tag:4
=> result is a list of targeted tag sets having at least one tag in common with the customer
3- For t in tmp (iterating on the selected targeted tag sets)
n = SCARD tgt:t (cardinality of the targeted tag sets)
intersect = SINTER customer tgt:t
if n == len(intersect), this targeted tag set matches
So you never have to test the customer tag set against 1M targeted tag sets. You can rely on the reverse index to restrict the scope of the search to an acceptable level.