How to specify the number of rows a pandas dataframe will have?

ps.george · Apr 13, 2015 · Viewed 17k times

I have a Pandas dataframe and I am continually appending a row of data each second as below.

df.loc[time.strftime("%Y-%m-%d %H:%M:%S")] = [reading1, reading2, reading3]
>>>df
                     sensor1 sensor2 sensor3
2015-04-14 08:50:23    5.4     5.6     5.7
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4

If I continue this, eventually I am going to start experiencing memory issues (each append operates on the whole DataFrame).

I only need to keep X rows of the data. i.e. after the operation, it will be:

>>>df
                     sensor1 sensor2 sensor3
(this row is gone)
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4
2015-04-14 08:50:27    5.2     5.4     5.6

Is there a way I can specify a maximum number of rows, so that when any subsequent rows are added, the oldest row is deleted at the same time WITHOUT a "Check length of DataFrame, If length of DataFrame > X, Remove first row, Append new row"?

Like this, but for a Pandas DataFrame: https://stackoverflow.com/a/10155753/4783578

Answer

TheBlackCat · Apr 13, 2015

pandas stores data in arrays. Doing the sort of operation you want inherently requires a copy for an array data structure. Since data is stored in contiguous (or strided) memory, adding something to the end and removing something from the beginning requires copying everything to a new region of memory. There is no way around this. You need to use a different data structure.
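For contrast, a different data structure can do this cheaply. A `collections.deque` created with `maxlen` evicts the oldest entry automatically on every append, in O(1), with no copying:

```python
from collections import deque

# A deque with maxlen drops the oldest item on each append once full,
# in O(1) -- unlike growing a DataFrame row by row.
dq = deque(maxlen=3)
for i in range(5):
    dq.append(i)

print(list(dq))  # [2, 3, 4]
```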

Edit: Thinking about this a bit more, I see two approaches to do this.

The simplest and most straightforward would be to use a collections.deque of tuples. You can just append a new tuple to the end, and if it gets too full it will drop the corresponding one from the beginning. At the end, you can just convert them into a DataFrame. I am only using the for loop as an example; I gather you get your data in a different way, but it wouldn't matter:

import time

import pandas as pd
from collections import deque

maxlen = 1000

dq = deque(maxlen=maxlen)

for reading1, reading2, reading3 in readings:
    dq.append(pd.Series([reading1, reading2, reading3], 
                        index=['sensor1', 'sensor2', 'sensor3'], 
                        name=time.strftime("%Y-%m-%d %H:%M:%S")))

df = pd.concat(dq, axis=1).T
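A self-contained run of this deque approach, using a small `maxlen` and made-up timestamps and readings standing in for the live sensor feed:

```python
import pandas as pd
from collections import deque

maxlen = 3
dq = deque(maxlen=maxlen)

# Made-up (timestamp, reading1, reading2, reading3) tuples; the first
# one falls off the front once the fourth is appended.
readings = [
    ("08:50:23", 5.4, 5.6, 5.7),
    ("08:50:24", 5.5, 5.6, 5.8),
    ("08:50:26", 5.2, 5.3, 5.4),
    ("08:50:27", 5.2, 5.4, 5.6),
]
for ts, r1, r2, r3 in readings:
    dq.append(pd.Series([r1, r2, r3],
                        index=['sensor1', 'sensor2', 'sensor3'],
                        name=ts))

# Each Series becomes a column of the concat result, so transpose to
# get one row per timestamp.
df = pd.concat(dq, axis=1).T
print(df.index.tolist())  # ['08:50:24', '08:50:26', '08:50:27']
```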

The second approach is to use a DataFrame of a fixed size, using the modulo of the maximum length to choose the slot to overwrite, while also keeping the item number in the DataFrame so you can restore the order by sorting on it at the end. In your case you could conceivably sort by time instead, but this approach is more general. As with the previous example, I will use a for loop to demonstrate, but you probably don't have one. I will also assume you don't have a real iterable you can enumerate; if you do, you don't need to keep track of the index number as I do here:

import time

import numpy as np
import pandas as pd

maxlen = 1000

# object dtype so timestamp strings and float readings can share the frame
df = pd.DataFrame(np.full((maxlen, 5), np.nan, dtype=object),
                  columns=['index', 'time', 
                           'sensor1', 'sensor2', 'sensor3'])

i = 0
for reading1, reading2, reading3 in readings:
    df.loc[i % maxlen, :] = [i, time.strftime("%Y-%m-%d %H:%M:%S"),
                             reading1, reading2, reading3]
    i += 1

df.sort_values('index', inplace=True)
del df['index']
df.set_index('time', drop=True, inplace=True)
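A self-contained run of this ring-buffer approach, again with a small `maxlen` and made-up readings. Note that `DataFrame.sort` from the original answer was removed in later pandas; `sort_values` is the current spelling. The frame is built with object dtype so timestamp strings and float readings can coexist:

```python
import numpy as np
import pandas as pd

maxlen = 3
df = pd.DataFrame(np.full((maxlen, 5), np.nan, dtype=object),
                  columns=['index', 'time',
                           'sensor1', 'sensor2', 'sensor3'])

# Made-up (timestamp, reading1, reading2, reading3) tuples; the fourth
# overwrites slot 0, which held the oldest reading.
readings = [
    ("08:50:23", 5.4, 5.6, 5.7),
    ("08:50:24", 5.5, 5.6, 5.8),
    ("08:50:26", 5.2, 5.3, 5.4),
    ("08:50:27", 5.2, 5.4, 5.6),
]
for i, (ts, r1, r2, r3) in enumerate(readings):
    df.loc[i % maxlen, :] = [i, ts, r1, r2, r3]

# Restore chronological order via the stored item number, then drop it.
df = df.sort_values('index')
del df['index']
df = df.set_index('time')
print(df.index.tolist())  # ['08:50:24', '08:50:26', '08:50:27']
```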