I recently came across the pandas library for Python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package in R (my language of choice for analysis).
Why is pandas so much faster than data.table? Is it because of an inherent speed advantage Python has over R, or is there some tradeoff I'm not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?
Here's the R code and the Python code used to benchmark the various packages.
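For reference, the pandas calls that roughly correspond to those two R merges look like this (the column names here are made up for illustration):

```python
import pandas as pd

X = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
Y = pd.DataFrame({"key": ["b", "c", "d"], "y": [4, 5, 6]})

inner = pd.merge(X, Y, on="key", how="inner")  # like merge(X, Y, all=FALSE)
outer = pd.merge(X, Y, on="key", how="outer")  # like merge(X, Y, all=TRUE)
```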
The reason pandas is faster is that I came up with a better algorithm, implemented very carefully in C/Cython using a fast hash table implementation (klib), to avoid Python interpreter overhead for the non-vectorizable parts. The algorithm is described in some detail in my presentation: A look inside pandas design and development.
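To give a sense of the approach, here is a sketch of the core idea using public pandas APIs (not the actual C/Cython code path): the join keys from both sides are factorized into dense integer codes with a hash table, and the join itself then operates on those integers rather than on the original values.

```python
import numpy as np
import pandas as pd

left_keys = pd.Series(["b", "a", "c", "a"])
right_keys = pd.Series(["a", "c", "d"])

# Factorize both key arrays against a shared hash table so that equal
# keys on either side map to the same integer code.
codes, uniques = pd.factorize(pd.concat([left_keys, right_keys]))
left_codes = codes[: len(left_keys)]
right_codes = codes[len(left_keys):]

# The join proper can now work on small dense integers instead of
# comparing the original (possibly string) values, e.g. rows that
# participate in an inner join are those whose codes appear on both sides:
common = np.intersect1d(left_codes, right_codes)
print(left_codes, right_codes, common)
```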
The comparison with data.table is actually a bit interesting, because the whole point of R's data.table is that it contains pre-computed indexes for various columns to accelerate operations like data selection and merges. In this case (database joins) pandas' DataFrame contains no pre-computed information that is used for the merge; it is, so to speak, a "cold" merge. If I had stored the factorized versions of the join keys, the join would be significantly faster, since factorizing is the biggest bottleneck for this algorithm.
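A rough way to see that factorization dominates is to time it separately from the full merge on synthetic string keys (the exact split will vary with the data; this is just an illustration):

```python
import numpy as np
import pandas as pd
from timeit import timeit

n = 1_000_000
keys = np.random.randint(0, 10_000, n).astype(str)  # string join keys
left = pd.DataFrame({"key": keys, "x": np.random.randn(n)})
right = pd.DataFrame({"key": np.unique(keys), "y": 1.0})

# Timing the key factorization against the full merge shows how much
# of the total cost it accounts for.
t_factorize = timeit(lambda: pd.factorize(left["key"]), number=1)
t_merge = timeit(lambda: left.merge(right, on="key"), number=1)
print(t_factorize, t_merge)
```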
I should also add that the internal design of pandas' DataFrame is much more amenable to these kinds of operations than R's data.frame (which is just a list of arrays internally).
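One way to see this difference is through pandas' internal block manager; note that this is an implementation detail and not a stable API, so the exact attributes below may change across versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=list("abc"))

# Same-dtype columns are consolidated into a single 2-D block, so
# cross-column operations can run on one contiguous array.
print(df._mgr.nblocks)                 # 1 (internal, unstable API)
print(df._mgr.blocks[0].values.shape)  # (3, 5): one block holds all columns

# An R data.frame, by contrast, is a plain list of per-column vectors;
# the closest Python analogue would be a dict of separate 1-D arrays:
list_of_arrays = {c: df[c].to_numpy() for c in df.columns}
```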