Ranking with millions of entries

Question 1

Ranking with millions of entries

mysql database ranking

Naatan · Mar 25, 2011 · Viewed 11.6k times · Source

Answer

Answer

A single disk seek is about 15ms, maybe a little less with server grade disks. A response time of less than 500ms limits you to about 30 random disk accesses. That is not a lot.

On my tiny laptop, I have a development database with

root@localhost [kris]> select @@innodb_buffer_pool_size/1024/1024 as pool_mb;
+--------------+
| pool_mb      |
+--------------+
| 128.00000000 |
+--------------+
1 row in set (0.00 sec)

and a slow laptop disk. I created a score table with

root@localhost [kris]> show create table score\G
*************************** 1. row ***************************
       Table: score
Create Table: CREATE TABLE `score` (
  `player_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `score` int(11) NOT NULL,
  PRIMARY KEY (`player_id`),
  KEY `score` (`score`)
) ENGINE=InnoDB AUTO_INCREMENT=2490316 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

with random integer scores and sequential player_id values. We have

root@localhost [kris]> select count(*)/1000/1000 as mrows from score\G
*************************** 1. row ***************************
mrows: 2.09715200
1 row in set (0.39 sec)

The database maintains the pair (score, player_id) in score order in the index score, as data in an InnoDB index is stored in a BTREE, and the row pointer (data pointer) is the primary key value, so that the definition KEY (score) ends up being KEY(score, player_id) internally. We can prove that by looking at the query plan for a score retrieval:

root@localhost [kris]> explain select * from score where score = 17\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: score
         type: ref
possible_keys: score
          key: score
      key_len: 4
          ref: const
         rows: 29
        Extra: Using index
1 row in set (0.00 sec)

As you can see, the key: score is being used with Using index, meaning that no data access is necessary.

The ranking query for a given constant player_id takes precisely 500ms on my laptop:

root@localhost [kris]>  select p.*, count(*) as rank 
    from score as p join score as s on p.score < s.score 
   where p.player_id = 479269\G
*************************** 1. row ***************************
player_id: 479269
    score: 99901
     rank: 2074
1 row in set (0.50 sec)

With more memory and on a faster box it can be quicker, but it is still a comparatively expensive operation, because the plan sucks:

root@localhost [kris]> explain select p.*, count(*) as rank from score as p join score as s on p.score < s.score where p.player_id = 479269;
+----+-------------+-------+-------+---------------+---------+---------+-------+---------+--------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref   | rows    | Extra                    |
+----+-------------+-------+-------+---------------+---------+---------+-------+---------+--------------------------+
|  1 | SIMPLE      | p     | const | PRIMARY,score | PRIMARY | 4       | const |       1 |                          |
|  1 | SIMPLE      | s     | index | score         | score   | 4       | NULL  | 2097979 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+-------+---------+--------------------------+
2 rows in set (0.00 sec)

As you can see, the second table in the plan is an index scan, so the query slows down linearly with the number of players.

If you want a full leaderboard, you need to leave off the where clause, and then you get two scans and quadratic execution times. So this plan implodes completely.

Time to go procedural here:

root@localhost [kris]> set @count = 0; 
    select *, @count := @count + 1 as rank from score where score >= 99901 order by score desc ;
...
|   2353218 | 99901 | 2075 |
|   2279992 | 99901 | 2076 |
|   2264334 | 99901 | 2077 |
|   2239927 | 99901 | 2078 |
|   2158161 | 99901 | 2079 |
|   2076159 | 99901 | 2080 |
|   2027538 | 99901 | 2081 |
|   1908971 | 99901 | 2082 |
|   1887127 | 99901 | 2083 |
|   1848119 | 99901 | 2084 |
|   1692727 | 99901 | 2085 |
|   1658223 | 99901 | 2086 |
|   1581427 | 99901 | 2087 |
|   1469315 | 99901 | 2088 |
|   1466122 | 99901 | 2089 |
|   1387171 | 99901 | 2090 |
|   1286378 | 99901 | 2091 |
|    666050 | 99901 | 2092 |
|    633419 | 99901 | 2093 |
|    479269 | 99901 | 2094 |
|    329168 | 99901 | 2095 |
|    299189 | 99901 | 2096 |
|    290436 | 99901 | 2097 |
...

Because this is a procedural plan, it is unstable:

You cannot use LIMIT, because that will offset the counter. Instead you have to download all this data.
You cannot really sort. This ORDER BY clause works, because it does not sort, but uses an index. As soon as you see using filesort, the counter values will be wildly off.

It is the solution that comes closest to what a NoSQL (read: procedural) database will do as an execution plan, though.

We can stabilize the NoSQL inside a subquery and then slice out the part that is of interest to us, though:

root@localhost [kris]> set @count = 0; 
    select * from ( 
        select *, @count := @count + 1 as rank 
          from score 
         where score >= 99901 
      order by score desc 
    ) as t 
    where player_id = 479269;
Query OK, 0 rows affected (0.00 sec)
+-----------+-------+------+
| player_id | score | rank |
+-----------+-------+------+
|    479269 | 99901 | 2094 |
+-----------+-------+------+
1 row in set (0.00 sec)

root@localhost [kris]> set @count = 0; 
    select * from ( 
        select *, @count := @count + 1 as rank 
          from score 
         where score >= 99901 
      order by score desc 
    ) as t 
    where rank between 2090 and 2100;
Query OK, 0 rows affected (0.00 sec)
+-----------+-------+------+
| player_id | score | rank |
+-----------+-------+------+
|   1387171 | 99901 | 2090 |
|   1286378 | 99901 | 2091 |
|    666050 | 99901 | 2092 |
|    633419 | 99901 | 2093 |
|    479269 | 99901 | 2094 |
|    329168 | 99901 | 2095 |
|    299189 | 99901 | 2096 |
|    290436 | 99901 | 2097 |
+-----------+-------+------+
8 rows in set (0.01 sec)

The subquery will materialize the former result set as an ad-hoc table named t, which we then can access in the outer query. Because it is an ad-hoc table, in MySQL it will have no index. This limits what is possible efficiently in the outer query.

Note how both queries satisfy your timing constraint, though. Here is the plan:

root@localhost [kris]> set @count = 0; explain select * from ( select *, @count := @count + 1 as rank from score where score >= 99901 order by score desc ) as t where rank between 2090 and 2100\G
Query OK, 0 rows affected (0.00 sec)

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2097
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DERIVED
        table: score
         type: range
possible_keys: score
          key: score
      key_len: 4
          ref: NULL
         rows: 3750
        Extra: Using where; Using index
2 rows in set (0.00 sec)

Both query components (the inner, DERIVED query and the outer BETWEEN constraint) will get slower for badly ranked players, though, and then grossly violate your timing constraints.

root@localhost [kris]> set @count = 0; select * from ( select *, @count := @count + 1 as rank from score where score >= 0 order by score desc ) as t;
...
2097152 rows in set (3.56 sec)

The execution time for the descriptive approach is stable (dependent only on table size):

root@localhost [kris]> select p.*, count(*) as rank 
   from score as p join score as s on p.score < s.score 
   where p.player_id = 1134026;
+-----------+-------+---------+
| player_id | score | rank    |
+-----------+-------+---------+
|   1134026 |     0 | 2097135 |
+-----------+-------+---------+
1 row in set (0.53 sec)

Your call.

Question 2

I'm working on a server for an online game which should be able to handle millions of players. Now the game needs leaderboards and wants to be able to show a players current position and possibly other players near the current players position as well as the positions of the players friends.

Now I've done this stuff before in MySQL and I know how it's technically possible, however I figured since this is a common practice for a lot of online games there must be existing libraries or databases particularly for this purpose?

Can anyone advice me what database is the best for these types of queries and possibly any pre-existing libraries that already do a lot of this work? A third party service with API access would be fine too.

Hope to get some good advice, thanks!

Edit:

To clarify, I need a database which can hold millions of entries (so far MySQL is good for that) with which I can easily get ranked results. For example if I get a specific row from the "leaderboard" table I need to know which rank that row has. This query has to be under 500ms regardless of the size of the db.

Alternatively a way to update the table with the current ranking information would be fine too long as this update query does not lock the whole table and the update query runs in under 30 seconds.

Any ideas as to what database / mechanism or third party service to use would be much appreciated!

Ranking with millions of entries

Answer

Related questions