Data Type Recognition/Guessing of CSV data in python

poezn picture poezn · Jul 26, 2011 · Viewed 12.6k times · Source

My problem is in the context of processing data from large CSV files.

I'm looking for the most efficient way to determine (that is, guess) the data type of a column based on the values found in that column. I'm potentially dealing with very messy data. Therefore, the algorithm should be error-tolerant to some extent.

Here's an example:

arr1 = ['0.83', '-0.26', '-', '0.23', '11.23']               # ==> recognize as float
arr2 = ['1', '11', '-1345.67', '0', '22']                    # ==> regognize as int
arr3 = ['2/7/1985', 'Jul 03 1985, 00:00:00', '', '4/3/2011'] # ==> recognize as date
arr4 = ['Dog', 'Cat', '0.13', 'Mouse']                       # ==> recognize as str

Bottom line: I'm looking for a python package or an algorithm that can detect either

  • the schema of a CSV file, or even better
  • the data type of an individual column as an array

Method for guessing type of data represented currently represented as strings goes in a similar direction. I'm worried about performance, though, since I'm possibly dealing with many large spreadsheets (where the data stems from)

Answer

Rufus Pollock picture Rufus Pollock · Aug 3, 2013

You may be interested in this python library which does exactly this kind of type guessing on CSVs and XLS files for you:

It happily scales to very large files, to streaming data off the internet etc.

There is also an even simpler wrapper library that includes a command line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)

The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164