I am working with a Stata .dta file that is around 3.3 gigabytes, so it is large but not excessively large. I am interested in using IPython and tried to import the .dta file using Pandas, but something wonky is going on. My box has 32 gigabytes of RAM, and attempting to load the .dta file uses up all the RAM (after ~30 minutes) and stalls my computer. This doesn't 'feel' right, in that I am able to open the file in R using read.dta() from the foreign package with no problem, and working with the file in Stata is fine. The code I am using is:
%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')
and I am using IPython in Enthought's Canopy program. The '%time' is there because I am interested in benchmarking this against R's read.dta().
My questions are:
Here is a little function that has been handy for me, using some pandas
features that might not have been available when the question was originally posed:
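import sys
import pandas as pd

def load_large_dta(fname):
    # iterator=True returns a StataReader instead of loading the whole file
    reader = pd.read_stata(fname, iterator=True)
    chunks = []
    try:
        chunk = reader.get_chunk(100*1000)
        while len(chunk) > 0:
            chunks.append(chunk)
            chunk = reader.get_chunk(100*1000)
            # one dot per chunk as a progress indicator
            print('.', end='')
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        # end of file, or the user hit ctrl-c: keep what has been read so far
        pass
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
    print('\nloaded {} rows'.format(len(df)))
    return df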
I loaded an 11G Stata file in 100 minutes with this, and it's nice to have something to play with if I get tired of waiting and hit ctrl-c.
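If you don't need the progress dots or the ctrl-c behaviour, the same chunked idea can probably be written more compactly with the chunksize argument of pd.read_stata, which hands back a reader you can loop over. This is only a sketch: the function name load_dta_in_chunks, the default chunk size and the optional columns argument are my own choices for illustration, and I have not timed it against the 3.3G or 11G files mentioned above. Passing columns reads only the variables you ask for, which may help if memory is the bottleneck.

```python
import os
import tempfile

import pandas as pd


def load_dta_in_chunks(fname, chunksize=100_000, columns=None):
    """Read a Stata file chunk by chunk and return a single DataFrame.

    The name, the default chunksize and the columns argument are
    illustrative assumptions, not something from the original question.
    """
    chunks = []
    # With chunksize set, read_stata returns a StataReader that yields
    # DataFrames of at most `chunksize` rows; `columns` limits which
    # variables are read (None means all of them).
    with pd.read_stata(fname, chunksize=chunksize, columns=columns) as reader:
        for chunk in reader:
            chunks.append(chunk)
    if not chunks:
        return pd.DataFrame()
    # Concatenate once at the end instead of growing a DataFrame in the loop.
    return pd.concat(chunks, ignore_index=True)


if __name__ == "__main__":
    # Small made-up file, just to show the call; not real data.
    path = os.path.join(tempfile.mkdtemp(), "demo.dta")
    demo = pd.DataFrame({"x": range(250), "y": [i / 2 for i in range(250)]})
    demo.to_stata(path, write_index=False)
    df = load_dta_in_chunks(path, chunksize=100, columns=["x"])
    print(len(df), list(df.columns))
```

Collecting the chunks in a list and calling pd.concat once avoids re-copying the accumulated data on every iteration, which is the main cost of appending inside the loop.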