big array manipulation is very slow in ruby

Joelio · Oct 20, 2011 · Viewed 7.2k times

I have the following scenario:

I need to figure out the unique list of ids across a very large set.

So for example, I have 6,000 arrays of ids (one per follower), each ranging in size from 1 to 25,000 ids (that follower's own follower list).

I want to get the unique list of ids across all of these arrays of ids (unique followers of followers). Once that is done, I need to subtract out another list of ids (another person's follower list) and get a final count.

The final set of unique ids grows to around 60,000,000 records. In Ruby, adding the arrays to the big array starts to get very slow at around a couple of million records. Adding to the set takes 0.1 seconds at first, then grows to over 4 seconds by 2 million records (nowhere near the 60 million I need to reach).

I wrote a test program in Java and it does the whole thing in less than a minute.

Maybe I am doing this inefficiently in Ruby, or there is another way. Since my main code is proprietary, I've written a simple test program to simulate the issue:

big_array = []
loop_counter = 0
start_time = Time.now

# final target size of the big array
while big_array.length < 60_000_000
  loop_counter += 1

  # target size of one person's follower list
  random_size_of_followers = rand(5000)
  follower_list = []
  follower_counter = 0

  while follower_counter < random_size_of_followers
    follower_counter += 1
    # make ids very large so we get a good spread and only some dupes
    follower_id = rand(240_000_000) + 100_000
    follower_list << follower_id
  end

  # combine the big list with this list
  big_array = big_array | follower_list
  end_time = Time.now

  # every 100 iterations, check where we are and how long each loop and combine takes
  if loop_counter % 100 == 0
    elapsed_time = end_time - start_time
    # average over the last 100 iterations, since start_time resets below
    average_time = elapsed_time.to_f / 100
    puts "average time for loop is #{average_time}, total size of big_array is #{big_array.length}"
    start_time = Time.now
  end
end

Any suggestions? Is it time to switch to JRuby and move stuff like this to Java?

Answer

tadman · Oct 20, 2011

The method you're using there is horribly inefficient, so it's no surprise this is slow. When you're trying to keep track of unique things, an Array requires way more processing than a Hash equivalent.

Here's a simple refactoring that increases the speed about 100x:

all_followers = {}
loop_counter = 0
start_time = Time.now

while all_followers.length < 60_000_000
  # target size of one person's follower list
  follower_list = []

  rand(5000).times do
    follower_id = rand(240_000_000) + 100_000
    follower_list << follower_id
    all_followers[follower_id] = true
  end

  end_time = Time.now

  # every 100 iterations, check where we are and how long each loop takes
  loop_counter += 1

  if loop_counter % 100 == 0
    elapsed_time = end_time - start_time
    # average over the last 100 iterations, since start_time resets below
    average_time = elapsed_time.to_f / 100
    puts "average time for loop is #{average_time}, total size of all_followers is #{all_followers.length}"
    start_time = Time.now
  end
end

The nice thing about a Hash is that it's impossible to have duplicate keys. If you need to list all the followers at any time, use all_followers.keys to get the ids.
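
That also makes the subtraction step from your question cheap. A minimal sketch, assuming other_list is a hypothetical array holding the other person's follower ids:

other_list.each { |id| all_followers.delete(id) }  # drop the other person's followers
final_count = all_followers.length                 # the final count you're after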

Hashes take up more memory than their Array equivalents, but this is the price you have to pay for performance. I'd also suspect one of the big memory consumers here is the many individual lists of followers that are generated and seemingly never used, so perhaps you could skip that step entirely.
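
If you do skip it, the whole loop collapses to a few lines. A rough sketch, using the same simulated random ids as above:

all_followers = {}

while all_followers.length < 60_000_000
  # write each id straight into the Hash; duplicate ids just overwrite themselves
  rand(5000).times { all_followers[rand(240_000_000) + 100_000] = true }
end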

The key thing here is that the Array#| operator is not very efficient, especially when operating on very large arrays: every call builds a brand-new, deduplicated array, so each union costs more as the accumulated result grows.
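
If you want to see the gap directly, here's a rough benchmark sketch (the sizes are arbitrary; it only needs the benchmark standard library):

require 'benchmark'

# 100 simulated follower lists of 5,000 random ids each
chunks = Array.new(100) { Array.new(5000) { rand(240_000_000) + 100_000 } }

array_time = Benchmark.realtime do
  big_array = []
  # each union re-scans and re-deduplicates the entire accumulated array
  chunks.each { |chunk| big_array = big_array | chunk }
end

hash_time = Benchmark.realtime do
  all_followers = {}
  # each insert is a near-constant-time hash write
  chunks.each { |chunk| chunk.each { |id| all_followers[id] = true } }
end

puts "Array#|: #{array_time.round(2)}s, Hash: #{hash_time.round(2)}s"

The Hash version should finish in a fraction of the Array#| time even at these modest sizes, and the gap only widens as the accumulated set grows toward 60 million.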