I have the following scenario:
I need to figure out the unique list of ids across a very large set.
So for example, I have 6000 arrays of ids (follower lists), each ranging in size from 1 to 25,000 (one person's follower list).
I want to get the unique list of ids across all of these arrays (the unique followers of followers). Once that is done, I need to subtract out another list of ids (another person's follower list) and get a final count.
The final set of unique ids grows to around 60,000,000 records. In Ruby, adding the arrays to the big array starts to get very slow around a couple of million. Merging in a list takes 0.1 seconds at first, then grows to over 4 seconds at 2 million (nowhere near the 60 million I need to reach).
I wrote a test program in Java, and it does the whole thing in less than a minute.
Maybe I am doing this inefficiently in Ruby, or there is another way. Since my main code is proprietary, I have written a simple test program to simulate the issue:
big_array = []
loop_counter = 0
start_time = Time.now
# final target size of the big array
while big_array.length < 60000000
  loop_counter += 1
  # target size of one person's follower list
  random_size_of_followers = rand(5000)
  follower_list = []
  follower_counter = 0
  while follower_counter < random_size_of_followers
    follower_counter += 1
    # make ids very large so we get a good spread and only some amount of dupes
    follower_id = rand(240000000) + 100000
    follower_list << follower_id
  end
  # combine the big list with this list
  big_array = big_array | follower_list
  end_time = Time.now
  # every 100 iterations, check where we are and how long each loop and combine takes
  if loop_counter % 100 == 0
    elapsed_time = end_time - start_time
    average_time = elapsed_time.to_f / loop_counter.to_f
    puts "average time for loop is #{average_time}, total size of big_array is #{big_array.length}"
    start_time = Time.now
  end
end
Any suggestions? Is it time to switch to JRuby and move stuff like this to Java?
The method you're using there is horribly inefficient, so it's no surprise this is slow. When you're trying to keep track of unique things, an Array requires way more processing than a Hash equivalent.
Here's a simple refactoring that increases the speed about 100x:
all_followers = {}
loop_counter = 0
start_time = Time.now
while all_followers.length < 60000000
  # target size of one person's follower list
  follower_list = []
  rand(5000).times do
    follower_id = rand(240000000) + 100000
    follower_list << follower_id
    all_followers[follower_id] = true
  end
  end_time = Time.now
  # every 100 iterations, check where we are and how long each loop and combine takes
  loop_counter += 1
  if loop_counter % 100 == 0
    elapsed_time = end_time - start_time
    average_time = elapsed_time.to_f / loop_counter.to_f
    puts "average time for loop is #{average_time}, total size of all_followers is #{all_followers.length}"
    start_time = Time.now
  end
end
The nice thing about a Hash is that it's impossible to have duplicate keys, so uniqueness is maintained automatically as you insert. If you need to list all the followers at any time, use all_followers.keys to get the IDs.
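For the subtraction step described in the question, here's a minimal sketch; all_followers mirrors the Hash built above, and other_followers is a hypothetical stand-in for the other person's follower list:

# Hypothetical demo data standing in for the real sets.
all_followers = {}
1_000_000.times { all_followers[rand(240_000_000) + 100_000] = true }
other_followers = Array.new(10_000) { rand(240_000_000) + 100_000 }

# Hash#delete is a constant-time operation, so the subtraction is linear
# in the size of the list being removed, not the size of the big set.
other_followers.each { |id| all_followers.delete(id) }

puts "final unique count: #{all_followers.length}"
puts "sample ids: #{all_followers.keys.first(5).inspect}"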
Hashes take up more memory than their Array equivalents, but that's the price you pay for performance. I'd also suspect one of the big memory consumers here is the many individual follower lists that are generated and seemingly never used, so perhaps you could skip that step entirely, as in the sketch below.
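Here's a minimal sketch of that simplification, using the same random test data but inserting ids straight into the Hash:

all_followers = {}
start_time = Time.now

# Insert each generated id directly; no intermediate follower_list array
# is allocated, so the only large structure in memory is the Hash itself.
while all_followers.length < 60_000_000
  rand(5000).times do
    all_followers[rand(240_000_000) + 100_000] = true
  end
end

puts "built #{all_followers.length} unique ids in #{Time.now - start_time} seconds"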
The key thing here is that the Array#| operator is not very efficient, especially on very large arrays: each call scans both operands and builds a brand-new deduplicated array, so every merge reprocesses the entire accumulated list and gets slower as big_array grows.
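As an aside, if you'd rather not carry dummy true values around, Ruby's standard-library Set (itself implemented on top of a Hash) gives the same constant-time membership behavior with a more direct API. A quick sketch:

require 'set'

all_followers = Set.new

# Set#merge inserts each element via a Hash lookup, so repeatedly merging
# follower lists stays roughly linear in the total number of ids seen,
# unlike Array#| which rebuilds the whole accumulated array on every call.
10.times do
  follower_list = Array.new(rand(5000)) { rand(240_000_000) + 100_000 }
  all_followers.merge(follower_list)
end

puts "unique ids: #{all_followers.size}"

Set also provides subtract, so the removal step from the question maps directly onto it as well.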