I have three input data files. Each uses a different delimiter for the data contained therein. Data file one looks like this:
apples | bananas | oranges | grapes
Data file two looks like this:
quarter, dime, nickel, penny
Data file three looks like this:
horse cow pig chicken goat
(The change in the number of columns is also intentional.)
The thought I had was to count the number of non-alpha characters, and presume that the highest count was the separator character. However, the files with non-space separators also have spaces before and after the separators, so the spaces win on all three files. Here's my code:
def count_chars(s):
    # Count only the characters that could plausibly be separators.
    valid_seps = [' ', '|', ',', ';', '\t']
    cnt = {}
    for c in s:
        if c in valid_seps:
            cnt[c] = cnt.get(c, 0) + 1
    return cnt

infile = 'pipe.txt'  # or 'comma.txt' or 'space.txt'
with open(infile, 'r') as f:
    records = f.read()
print(count_chars(records))
It will print a dictionary with the counts of all the acceptable characters. In each case, the space always wins, so I can't rely on that to tell me what the separator is.
But I can't think of a better way to do this.
Any suggestions?
How about trying csv.Sniffer from Python's standard csv module: http://docs.python.org/library/csv.html#csv.Sniffer
import csv
sniffer = csv.Sniffer()
dialect = sniffer.sniff('quarter, dime, nickel, penny')
print(dialect.delimiter)
# prints: ,
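Since the padding spaces are what trips up a plain count, one option is to restrict the sniffer to the non-space candidates and only fall back to a space when none of them fits. Here is a minimal sketch of that idea. The helper name guess_delimiter and its candidates and fallback parameters are made up for illustration and are not part of the csv module, and treating "no candidate found" as "space-separated" is an assumption about your data:

```python
import csv

def guess_delimiter(sample, candidates=',|;\t', fallback=' '):
    """Guess the separator in `sample`, preferring non-space candidates.

    `guess_delimiter`, `candidates` and `fallback` are illustrative names,
    not part of the csv module.
    """
    try:
        # Restrict the sniffer to the non-space candidates so that the
        # padding spaces around '|' or ',' cannot be picked as the delimiter.
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Assumption: if none of the candidates fits, the data is
        # space-separated.
        return fallback

for line in ('apples | bananas | oranges | grapes',
             'quarter, dime, nickel, penny',
             'horse cow pig chicken goat'):
    print(repr(guess_delimiter(line)))
```

With the three sample lines from the question this should report '|', ',' and ' ' respectively. The fields will still carry their padding, so you would probably want to call strip() on each value after splitting.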