Does pandas iterrows have performance issues?

KieranPC picture KieranPC · Jul 21, 2014 · Viewed 55.1k times · Source

I have noticed very poor performance when using iterrows from pandas.

Is this something that is experienced by others? Is it specific to iterrows and should this function be avoided for data of a certain size (I'm working with 2-3 million rows)?

This discussion on GitHub led me to believe it is caused when mixing dtypes in the dataframe, however the simple example below shows it is there even when using one dtype (float64). This takes 36 seconds on my machine:

import pandas as pd
import numpy as np
import time

s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})

start = time.time()
i=0
for rowindex, row in dfa.iterrows():
    i+=1
end = time.time()
print end - start

Why are vectorized operations like apply so much quicker? I imagine there must be some row by row iteration going on there too.

I cannot figure out how to not use iterrows in my case (this I'll save for a future question). Therefore I would appreciate hearing if you have consistently been able to avoid this iteration. I'm making calculations based on data in separate dataframes. Thank you!

---Edit: simplified version of what I want to run has been added below---

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b'],
      'number1':[50,-10]}

t2 = {'letter':['a','a','b','b'],
      'number2':[0.2,0.5,0.1,0.4]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter','number2'], index=[0])

#%% Iterate through filtering relevant data, optimizing, returning info
for row_index, row in table1.iterrows():   
    t2info = table2[table2.letter == row['letter']].reset_index()
    table3.ix[row_index,] = optimize(t2info,row['number1'])

#%% Define optimization
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2']*t1info)
    maxrow = calculation.index(max(calculation))
    return t2info.ix[maxrow]

Answer

Jeff picture Jeff · Jul 21, 2014

Generally, iterrows should only be used in very, very specific cases. This is the general order of precedence for performance of various operations:

1) vectorization
2) using a custom cython routine
3) apply
    a) reductions that can be performed in cython
    b) iteration in python space
4) itertuples
5) iterrows
6) updating an empty frame (e.g. using loc one-row-at-a-time)

Using a custom Cython routine is usually too complicated, so let's skip that for now.

1) Vectorization is ALWAYS, ALWAYS the first and best choice. However, there is a small set of cases (usually involving a recurrence) which cannot be vectorized in obvious ways. Furthermore, on a smallish DataFrame, it may be faster to use other methods.

3) apply usually can be handled by an iterator in Cython space. This is handled internally by pandas, though it depends on what is going on inside the apply expression. For example, df.apply(lambda x: np.sum(x)) will be executed pretty swiftly, though of course, df.sum(1) is even better. However something like df.apply(lambda x: x['b'] + 1) will be executed in Python space, and consequently is much slower.

4) itertuples does not box the data into a Series. It just returns the data in the form of tuples.

5) iterrows DOES box the data into a Series. Unless you really need this, use another method.

6) Updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably common place (and reasonably fast for some python structures), but a DataFrame does a fair number of checks on indexing, so this will always be very slow to update a row at a time. Much better to create new structures and concat.