I've been using pandas for research now for about two months to great effect. With large numbers of medium-sized trace event datasets, pandas + PyTables (the HDF5 interface) does a tremendous job of allowing me to process heterogeneous data using all the Python tools I know and love.
Generally speaking, I use the Fixed (formerly "Storer") format in PyTables, as my workflow is write-once, read-many, and many of my datasets are sized such that I can load 50-100 of them into memory at a time with no serious disadvantages. (NB: I do much of my work on Opteron server-class machines with 128GB+ system memory.)
However, for large datasets (500MB and greater), I would like to be able to use the more scalable random-access and query abilities of the PyTables "Tables" format, so that I can perform my queries out-of-memory and then load the much smaller result set into memory for processing. The big hurdle here, however, is the write performance. Yes, as I said, my workflow is write-once, read-many, but the relative times are still unacceptable.
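For concreteness, the kind of out-of-memory query I'd eventually like to run looks roughly like the sketch below. The file name is hypothetical, the column names are from my dataset, and it uses the string-based where syntax; it also assumes the queried column was stored as a data column so that it is actually searchable.
import pandas as pd

store = pd.HDFStore('trace_events.h5', mode='r')
# pull only the matching rows into memory; the full multi-GB table stays on disk
long_events = store.select('events', where='duration > 1000000')
store.close()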
As an example of the data volumes involved, I recently ran a large Cholesky factorization that took 3 minutes, 8 seconds (188 seconds) on my 48-core machine. This generated a trace file of ~2.2 GB - the trace is generated in parallel with the program, so there is no additional "trace creation time."
The initial conversion of my binary trace file into the pandas/PyTables format takes a decent chunk of time, but largely because the binary format is deliberately written out of order to reduce the performance impact of the trace generator itself. This is also irrelevant to the performance loss when moving from the Storer format to the Table format.
My tests were initially run with pandas 0.12, numpy 1.7.1, PyTables 2.4.0, and numexpr 0.20.1. My 48-core machine runs at 2.8 GHz per core, and I am writing to an ext3 filesystem which is probably (but not certainly) on an SSD.
I can write the entire dataset to a Storer-format HDF5 file (resulting file size: 3.3 GB) in 7.1 seconds. The same dataset, written to the Table format (resulting file size also 3.3 GB), takes 178.7 seconds to write.
The code is as follows:
with Timer() as t:
    store = pd.HDFStore('test_storer.h5', 'w')
    store.put('events', events_dataset, table=False, append=False)
    store.close()  # close (and flush) inside the timed block
print('Fixed format write took ' + str(t.interval))

with Timer() as t:
    store = pd.HDFStore('test_table.h5', 'w')
    store.put('events', events_dataset, table=True, append=False)
    store.close()
print('Table format write took ' + str(t.interval))
and the output is simply
Fixed format write took 7.1
Table format write took 178.7
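(For reference, Timer above is just a simple wall-clock timing context manager along the lines of the sketch below; the exact implementation isn't important to the numbers.)
import time

class Timer(object):
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, *exc):
        self.interval = time.time() - self.start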
My dataset has 28,880,943 rows, and the columns are basic datatypes:
node_id int64
thread_id int64
handle_id int64
type int64
begin int64
end int64
duration int64
flags int64
unique_id int64
id int64
DSTL_LS_FULL float64
L2_DMISS float64
L3_MISS float64
kernel_type float64
dtype: object
...so I don't think there should be any data-specific issues with the write speed.
I've also tried adding BLOSC compression, to rule out any strange I/O issues that might affect one scenario or the other, but compression seems to decrease the performance of both equally.
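For reference, compression was enabled at the store level, roughly as in the sketch below (the complevel shown is just an example value, not necessarily the one I used):
store = pd.HDFStore('test_table_blosc.h5', 'w', complib='blosc', complevel=9)
store.put('events', events_dataset, table=True, append=False)
store.close()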
Now, I realize that the pandas documentation says that the Storer format offers significantly faster writes, and slightly faster reads. (I do experience the faster reads, as a read of the Storer format seems to take around 2.5 seconds, while a read of the Table format takes around 10 seconds.) But it really seems excessive that the Table format write should take 25 times as long as the Storer format write.
Can any of the folks involved with PyTables or pandas explain the architectural (or otherwise) reasons why writing to the queryable format (which clearly requires very little extra data) should take an order of magnitude longer? And is there any hope for improving this in the future? I'd love to jump in and contribute to one project or the other, as my field is high performance computing and I see a significant use case for both projects in this domain... but it would be helpful to get some clarification on the issues involved first, and/or some advice on how to speed things up from those who know how the system is built.
EDIT:
Running the above tests with %prun in IPython gives the following (somewhat reduced for readability) profile output for the Storer/Fixed format:
%prun -l 20 profile.events.to_hdf('test.h5', 'events', table=False, append=False)
3223 function calls (3222 primitive calls) in 7.385 seconds
Ordered by: internal time
List reduced from 208 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
6 7.127 1.188 7.128 1.188 {method '_createArray' of 'tables.hdf5Extension.Array' objects}
1 0.242 0.242 0.242 0.242 {method '_closeFile' of 'tables.hdf5Extension.File' objects}
1 0.003 0.003 0.003 0.003 {method '_g_new' of 'tables.hdf5Extension.File' objects}
46 0.001 0.000 0.001 0.000 {method 'reduce' of 'numpy.ufunc' objects}
and the following for the Tables format:
%prun -l 40 profile.events.to_hdf('test.h5', 'events', table=True, append=False, chunksize=1000000)
499082 function calls (499040 primitive calls) in 188.981 seconds
Ordered by: internal time
List reduced from 526 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
29 92.018 3.173 92.018 3.173 {pandas.lib.create_hdf_rows_2d}
640 20.987 0.033 20.987 0.033 {method '_append' of 'tables.hdf5Extension.Array' objects}
29 19.256 0.664 19.256 0.664 {method '_append_records' of 'tables.tableExtension.Table' objects}
406 19.182 0.047 19.182 0.047 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
14244 10.646 0.001 10.646 0.001 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
472 10.359 0.022 10.359 0.022 {method 'copy' of 'numpy.ndarray' objects}
80 3.409 0.043 3.409 0.043 {tables.indexesExtension.keysort}
2 3.023 1.512 3.023 1.512 common.py:134(_isnull_ndarraylike)
41 2.489 0.061 2.533 0.062 {method '_fillCol' of 'tables.tableExtension.Row' objects}
87 2.401 0.028 2.401 0.028 {method 'astype' of 'numpy.ndarray' objects}
30 1.880 0.063 1.880 0.063 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
282 0.824 0.003 0.824 0.003 {method 'reduce' of 'numpy.ufunc' objects}
41 0.537 0.013 0.668 0.016 index.py:607(final_idx32)
14490 0.385 0.000 0.712 0.000 array.py:342(_interpret_indexing)
39 0.279 0.007 19.635 0.503 index.py:1219(reorder_slice)
2 0.256 0.128 10.063 5.031 index.py:1099(get_neworder)
1 0.090 0.090 119.392 119.392 pytables.py:3016(write_data)
57842 0.087 0.000 0.087 0.000 {numpy.core.multiarray.empty}
28570 0.062 0.000 0.107 0.000 utils.py:42(is_idx)
14164 0.062 0.000 7.181 0.001 array.py:711(_readSlice)
EDIT 2:
Running again with a pre-release copy of pandas 0.13 (pulled Nov 20 2013 at about 11:00 EST), write times for the Tables format improve significantly but still don't compare "reasonably" to the write speeds of the Storer/Fixed format.
%prun -l 40 profile.events.to_hdf('test.h5', 'events', table=True, append=False, chunksize=1000000)
499748 function calls (499720 primitive calls) in 117.187 seconds
Ordered by: internal time
List reduced from 539 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
640 22.010 0.034 22.010 0.034 {method '_append' of 'tables.hdf5Extension.Array' objects}
29 20.782 0.717 20.782 0.717 {method '_append_records' of 'tables.tableExtension.Table' objects}
406 19.248 0.047 19.248 0.047 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
14244 10.685 0.001 10.685 0.001 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
472 10.439 0.022 10.439 0.022 {method 'copy' of 'numpy.ndarray' objects}
30 7.356 0.245 7.356 0.245 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
29 7.161 0.247 37.609 1.297 pytables.py:3498(write_data_chunk)
2 3.888 1.944 3.888 1.944 common.py:197(_isnull_ndarraylike)
80 3.581 0.045 3.581 0.045 {tables.indexesExtension.keysort}
41 3.248 0.079 3.294 0.080 {method '_fillCol' of 'tables.tableExtension.Row' objects}
34 2.744 0.081 2.744 0.081 {method 'ravel' of 'numpy.ndarray' objects}
115 2.591 0.023 2.591 0.023 {method 'astype' of 'numpy.ndarray' objects}
270 0.875 0.003 0.875 0.003 {method 'reduce' of 'numpy.ufunc' objects}
41 0.560 0.014 0.732 0.018 index.py:607(final_idx32)
14490 0.387 0.000 0.712 0.000 array.py:342(_interpret_indexing)
39 0.303 0.008 19.617 0.503 index.py:1219(reorder_slice)
2 0.288 0.144 10.299 5.149 index.py:1099(get_neworder)
57871 0.087 0.000 0.087 0.000 {numpy.core.multiarray.empty}
1 0.084 0.084 45.266 45.266 pytables.py:3424(write_data)
1 0.080 0.080 55.542 55.542 pytables.py:3385(write)
I noticed while running these tests that there are long periods where writing seems to "pause" (the file on disk is not actively growing), and yet there is also low CPU usage during some of these periods.
I begin to suspect that some known ext3 limitations may interact badly with either pandas or PyTables. Ext3 and other non-extent-based filesystems sometimes struggle to unlink large files promptly, and similar system performance (low CPU usage, but long wait times) is apparent even during a simple 'rm' of a 1GB file, for instance.
To clarify, in each test case, I made sure to remove the existing file, if any, before starting the test, so as not to incur any ext3 file removal/overwrite penalty.
However, when re-running this test with index=None, performance improves drastically (~50 s vs. the ~120 s when indexing). So it would seem that either this process remains CPU-bound (my system has relatively old AMD Opteron Istanbul CPUs running at 2.8 GHz, though it also has 8 sockets of 6-core CPUs, all but one of which, of course, sit idle during the write), or that there is some conflict between the way PyTables or pandas manipulates/reads/analyzes the file once it is partially or fully on the filesystem, which causes pathologically bad I/O behavior while the indexing occurs.
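If the index creation itself is the expensive part, one thing I may try next (a sketch only, not something I've benchmarked) is writing with indexing disabled and then building the index in a separate pass via create_table_index:
store = pd.HDFStore('test_table.h5', 'w')
store.append('events', events_dataset, index=False)  # skip index creation during the write
store.create_table_index('events')                   # build the PyTables index afterward
store.close()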
EDIT 3:
@Jeff's suggested tests on a smaller dataset (1.3 GB on disk), after upgrading PyTables from 2.4 to 3.0.0, have gotten me here:
In [7]: %timeit f(df)
1 loops, best of 3: 3.7 s per loop
In [8]: %timeit f2(df)  # where chunksize=2000000
1 loops, best of 3: 13.8 s per loop
In [9]: %timeit f3(df)  # where chunksize=2000000
1 loops, best of 3: 43.4 s per loop
In fact, my performance seems to beat his in all scenarios except for when indexing is turned on (the default). However, indexing still seems to be a killer, and if the way I'm interpreting the output from top and ls as I run these tests is correct, there remain periods of time when there is neither significant processing nor any file writing happening (i.e., CPU usage for the Python process is near 0, and the file size remains constant). I can only assume these are file reads. Why file reads would be causing slowdowns is hard for me to understand, as I can reliably load an entire 3+ GB file from this disk into memory in under 3 seconds. If they're not file reads, then what is the system 'waiting' on? (No one else is logged into the machine, and there is no other filesystem activity.)
At this point, with upgraded versions of the relevant Python modules, the performance for my original dataset is down to the following figures. Of special interest are the system time, which I assume is at least an upper bound on the time spent performing I/O, and the wall time, which seems perhaps to account for these mysterious periods of no write/no CPU activity.
In [28]: %time f(profile.events)
CPU times: user 0 ns, sys: 7.16 s, total: 7.16 s
Wall time: 7.51 s
In [29]: %time f2(profile.events)
CPU times: user 18.7 s, sys: 14 s, total: 32.7 s
Wall time: 47.2 s
In [31]: %time f3(profile.events)
CPU times: user 1min 18s, sys: 14.4 s, total: 1min 32s
Wall time: 2min 5s
Nevertheless, it would appear that indexing causes significant slowdown for my use case. Perhaps I should limit the fields indexed instead of simply accepting the default case (which may very well index all of the fields in the DataFrame)? I am not sure how this is likely to affect query times, especially in cases where a query selects based on a non-indexed field.
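What I have in mind is something like the sketch below (using my real column names; it assumes data_columns and the columns argument to create_table_index behave the way I understand them to):
store = pd.HDFStore('test_table.h5', 'w')
# only columns declared as data_columns are stored individually and can be queried directly
store.append('events', events_dataset,
             data_columns=['node_id', 'begin', 'end'],
             index=False)
# index only the columns actually used in range queries
store.create_table_index('events', columns=['begin', 'end'])
store.close()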
Per Jeff's request, here is a ptdump of the resulting file:
ptdump -av test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := [],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes :=
[(1, ['node_id', 'thread_id', 'handle_id', 'type', 'begin', 'end', 'duration', 'flags', 'unique_id', 'id', 'DSTL_LS_FULL', 'L2_DMISS', 'L3_MISS', 'kernel_type'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['values_block_0', 'values_block_1']]
/df/table (Table(28880943,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int64Col(shape=(10,), dflt=0, pos=1),
"values_block_1": Float64Col(shape=(4,), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (4369,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 15 attributes:
[CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0,
FIELD_1_NAME := 'values_block_0',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'values_block_1',
NROWS := 28880943,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer',
values_block_0_dtype := 'int64',
values_block_0_kind := ['node_id', 'thread_id', 'handle_id', 'type', 'begin', 'end', 'duration', 'flags', 'unique_id', 'id'],
values_block_1_dtype := 'float64',
values_block_1_kind := ['DSTL_LS_FULL', 'L2_DMISS', 'L3_MISS', 'kernel_type']]
and another %prun with the updated modules and the full dataset:
%prun -l 25 %time f3(profile.events)
CPU times: user 1min 14s, sys: 16.2 s, total: 1min 30s
Wall time: 1min 48s
542678 function calls (542650 primitive calls) in 108.678 seconds
Ordered by: internal time
List reduced from 629 to 25 due to restriction <25>
ncalls tottime percall cumtime percall filename:lineno(function)
640 23.633 0.037 23.633 0.037 {method '_append' of 'tables.hdf5extension.Array' objects}
15 20.852 1.390 20.852 1.390 {method '_append_records' of 'tables.tableextension.Table' objects}
406 19.584 0.048 19.584 0.048 {method '_g_write_slice' of 'tables.hdf5extension.Array' objects}
14244 10.591 0.001 10.591 0.001 {method '_g_read_slice' of 'tables.hdf5extension.Array' objects}
458 9.693 0.021 9.693 0.021 {method 'copy' of 'numpy.ndarray' objects}
15 6.350 0.423 30.989 2.066 pytables.py:3498(write_data_chunk)
80 3.496 0.044 3.496 0.044 {tables.indexesextension.keysort}
41 3.335 0.081 3.376 0.082 {method '_fill_col' of 'tables.tableextension.Row' objects}
20 2.551 0.128 2.551 0.128 {method 'ravel' of 'numpy.ndarray' objects}
101 2.449 0.024 2.449 0.024 {method 'astype' of 'numpy.ndarray' objects}
16 1.789 0.112 1.789 0.112 {method '_g_flush' of 'tables.hdf5extension.Leaf' objects}
2 1.728 0.864 1.728 0.864 common.py:197(_isnull_ndarraylike)
41 0.586 0.014 0.842 0.021 index.py:637(final_idx32)
14490 0.292 0.000 0.616 0.000 array.py:368(_interpret_indexing)
2 0.283 0.142 10.267 5.134 index.py:1158(get_neworder)
274 0.251 0.001 0.251 0.001 {method 'reduce' of 'numpy.ufunc' objects}
39 0.174 0.004 19.373 0.497 index.py:1280(reorder_slice)
57857 0.085 0.000 0.085 0.000 {numpy.core.multiarray.empty}
1 0.083 0.083 35.657 35.657 pytables.py:3424(write_data)
1 0.065 0.065 45.338 45.338 pytables.py:3385(write)
14164 0.065 0.000 7.831 0.001 array.py:615(__getitem__)
28570 0.062 0.000 0.108 0.000 utils.py:47(is_idx)
47 0.055 0.001 0.055 0.001 {numpy.core.multiarray.arange}
28570 0.050 0.000 0.090 0.000 leaf.py:397(_process_range)
87797 0.048 0.000 0.048 0.000 {isinstance}
Here is a similar comparison I just did. It's about 1/3 of the data (10M rows); the final size is about 1.3 GB.
I define 3 timing functions:
Test the Fixed format (called Storer in 0.12). This writes in a PyTables Array format.
def f(df):
    store = pd.HDFStore('test.h5','w')
    store['df'] = df
    store.close()
Write in the Table format, using the PyTables Table format. Do not create an index.
def f2(df):
    store = pd.HDFStore('test.h5','w')
    store.append('df',df,index=False)
    store.close()
Same as f2, but create an index (which is normally done)
def f3(df):
    store = pd.HDFStore('test.h5','w')
    store.append('df',df)
    store.close()
Create the frame
In [25]: df = concat([DataFrame(np.random.randn(10000000,10)),DataFrame(np.random.randint(0,10,size=50000000).reshape(10000000,5))],axis=1)
In [26]: df
Out[26]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 4
dtypes: float64(10), int64(5)
v0.12.0
In [27]: %timeit f(df)
1 loops, best of 3: 14.7 s per loop
In [28]: %timeit f2(df)
1 loops, best of 3: 32 s per loop
In [29]: %timeit f3(df)
1 loops, best of 3: 40.1 s per loop
master/v0.13.0
In [5]: %timeit f(df)
1 loops, best of 3: 12.9 s per loop
In [6]: %timeit f2(df)
1 loops, best of 3: 17.5 s per loop
In [7]: %timeit f3(df)
1 loops, best of 3: 24.3 s per loop
Timing Runs with the same file as provided by the OP (link is below)
In [4]: df = pd.read_hdf('test.h5','df')
In [5]: df
Out[5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28880943 entries, 0 to 28880942
Columns: 14 entries, node_id to kernel_type
dtypes: float64(4), int64(10)
Like f, Fixed format
In [6]: %timeit df.to_hdf('test.hdf','df',mode='w')
1 loops, best of 3: 36.2 s per loop
Like f2, Table format, no index
In [7]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False)
1 loops, best of 3: 45 s per loop
In [8]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False,chunksize=2000000)
1 loops, best of 3: 44.5 s per loop
Like f3, Table format with index
In [9]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000)
1 loops, best of 3: 1min 36s per loop
Like f3, Table format with index, compressed with blosc
In [10]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000,complib='blosc')
1 loops, best of 3: 46.5 s per loop
In [11]: %timeit pd.read_hdf('test.hdf','df')
1 loops, best of 3: 10.8 s per loop
Show the original file (test.h5) and the compressed file (test.hdf)
In [13]: !ls -ltr test.h*
-rw-r--r-- 1 jreback users 3471518282 Nov 20 18:20 test.h5
-rw-rw-r-- 1 jreback users 649327780 Nov 20 21:17 test.hdf
Several points to note.
Not creating an index can make a non-trivial difference in time. I also believe that if you have a string-based index it can substantially worsen write time. That said, you always want to create an index to make retrieval very fast.
You didn't include what your index is, nor whether it's sorted (though I think this makes only a small difference).
The write penalty in my examples is roughly 2x (though I have seen it be somewhat bigger when INCLUDING the index creation time). Thus your 7 s (half of my time) for 3x the number of rows I am writing is quite suspect. I am using a reasonably fast disk array; if you were using a flash-based disk, then this is possible, though.
master/v0.13.0 (to be released very soon) improves the write times on tables substantially.
You can try setting the chunksize parameter to a bigger number when you write the data (its default is 100000). The purpose of the relatively low default is to keep memory usage constant; a bigger chunksize will use more memory, though in theory it should write faster.
Tables offer 2 advantages over the Fixed format: 1) query retrieval, and 2) appendability. Reading the entire table doesn't take advantage of either, so if you ONLY want to read the entire table, then the Fixed format is recommended. (In my experience, the flexibility of Tables greatly outweighs the write penalty, but YMMV.)
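As a toy illustration of both (a sketch with stand-in data and the 0.13-style string where, not the benchmark frame above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 3), columns=['a', 'b', 'c'])

store = pd.HDFStore('demo.h5', 'w')
store.append('df', df, data_columns=['a'])

# appendability: add more rows to the existing table
store.append('df', pd.DataFrame(np.random.randn(1000, 3), columns=['a', 'b', 'c']))

# query retrieval: read back only the rows matching a condition
subset = store.select('df', where='a > 2')

store.close()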
Bottom line: repeat the timings (use IPython's %timeit, as it runs multiple tests). If you can reproduce your results, then please post a %prun and I'll take a look.
Update:
So the recommended way for a table this size is to compress with blosc and use pandas master/0.13.0 along with PyTables 3.0.0.
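In code, that's roughly the following (the chunksize and complevel here are just example values):
df.to_hdf('test.hdf', 'df', mode='w', format='table',
          complib='blosc', complevel=9, chunksize=2000000)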