How do you deal with missing data using numpy/scipy?

python numpy data-analysis

Abhijit · Sep 4, 2009 · Viewed 9.3k times · Source

One of the things I deal with most in data cleaning is missing values. R deals with this well using its "NA" missing data label. In python, it appears that I'll have to deal with masked arrays which seem to be a major pain to set up and don't seem to be well documented. Any suggestions on making this process easier in Python? This is becoming a deal-breaker in moving into Python for data analysis. Thanks

Update It's obviously been a while since I've looked at the methods in the numpy.ma module. It appears that at least the basic analysis functions are available for masked arrays, and the examples provided helped me understand how to create masked arrays (thanks to the authors). I would like to see if some of the newer statistical methods in Python (being developed in this year's GSoC) incorporates this aspect, and at least does the complete case analysis.

Answer

If you are willing to consider a library, pandas (http://pandas.pydata.org/) is a library built on top of numpy which amongst many other things provides:

Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form

I've been using it for almost one year in the financial industry where missing and badly aligned data is the norm and it really made my life easier.

How do you deal with missing data using numpy/scipy?

Answer

Related questions