I am trying to use Python Pandas to import a CSV file. The example data in this file is as follows where the first row is the column names separated by commas.
End Customer Organization ID,End Customer Organization Name,End Customer Top Parent Organization ID,End Customer Top Parent Organization Name,Reseller Top Parent ID,Reseller Top Parent Name,Business,Rev Sum Division,Rev Sum Category,Product Family,Version,Pricing Level,Summary Pricing Level,Detail Pricing Level,MS Sales Amount,MS Sales Licenses,Fiscal Year,Sales Date
11027676,Baroda Western Uttar Pradesh Gramin Bankgfhgfnjgfnmjmhgmghmghmghmnghnmghnmhgnmghnghngh,4078446,Bank Of Barodadfhhgfjyjtkyukujkyujkuhykluiluilui;iooi';po'fserwefvegwegf,1809012,"Hcl Infosystems Ltd - Partnerdghftrutyhb frhywer5y5tyu6ui7iukluyj,lgjmfgnhfrgweffw",Server & CALsdgrgrfgtrhytrnhjdgthjtyjkukmhjmghmbhmgfngdfbndfhtgh,SQL Server & CALdfhtrhtrgbhrghrye5y45y45yu56juhydsgfaefwe,SQL CALdhdfthtrutrjurhjethfdehrerfgwerweqeadfawrqwerwegtrhyjuytjhyj,SQL CALdtrye45y3t434tjkabcjkasdhfhasdjkcbaksmjcbfuigkjasbcjkasbkdfhiwh,2005,Openfkvgjesropiguwe90fujklascnioawfy98eyfuiasdbcvjkxsbhg,Open Lklbjdfoigueroigbjvwioergyuiowerhgosdhvgfoisdhyguiserhguisrh,"Open Stddfm,vdnoghioerivnsdflierohgushdfovhsiodghuiohdbvgsjdhgouiwerho",125.85,1,FY07,12/28/2006
12835756,Uttam Strips Pvt Ltd,12835756,Uttam Strips Pvt Ltd,12565538,Redington C/O Fortis Financial Services Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,9/15/2008
12233135,Bhagwan Singh Tondon,12233135,Bhagwan Singh Tondon,2652941,H B S Systems Pvt Ltd,Server & CAL,SQL Server & CAL,SQL CAL,SQL CAL,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,9/15/2008
11602305,Maya Academy Of Advanced Cinematics,9750934,Maya Entertainment Ltd,336146,Embee Software Pvt Ltd,Server & CAL,Windows Server & CAL,Windows Server HPC,Windows Compute Cluster Server,Non-specific,Open,Open V/MYO - Rec,OLV Perpet L&SA Recur-Def,0,0,FY09,9/25/2008
13336009,Remiel Softech Solution Pvt Ltd,13336009,Remiel Softech Solution Pvt Ltd,13335482,Redington C/O Remiel Softech Solutions Pvt Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,12/23/2008
I am using the below code to import:
import pandas as pd
df=pd.read_csv('file path.csv',sep=',')
It gave the following error:
Traceback (most recent call last):
File "<pyshell#25>", line 1, in <module>
df=pd.read_csv(filename,sep=',')
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
return parser.read()
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
ret = self._engine.read(nrows)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
data = self._reader.read(nrows)
File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 90: invalid start byte
Since it looked like a Unicode error I ran with encoding changed this time:
df=pd.read_csv(filename,encoding='utf-16',sep=',')
It gave the following error:
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
df=pd.read_csv(filename,encoding='utf-16',sep=',')
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 198, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 479, in __init__
self._make_engine(self.engine)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 586, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 957, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "parser.pyx", line 477, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4434)
File "parser.pyx", line 592, in pandas.parser.TextReader._get_header (pandas\parser.c:5660)
File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Not sure why is this happening? Even tried converting the CSV file into Excel with Text to Columns and used read_excel function of Pandas. That too gave the error(below):
Traceback (most recent call last):
File "<pyshell#30>", line 1, in <module>
df=pd.read_excel('J:\dmqp on 192.168.1.41\MS Sales Dump (FY09)xls','MS Sales Dump (FY09)')
File "C:\Python33\lib\site-packages\pandas\io\excel.py", line 52, in read_excel
return ExcelFile(path_or_buf,kind=kind).parse(sheetname=sheetname,
File "C:\Python33\lib\site-packages\pandas\io\excel.py", line 68, in __init__
import xlrd # throw an ImportError if we need to
ImportError: No module named 'xlrd'
Can someone help with the above errors and what is wrong here while importing both as CSV and as Excel.
I tried this code with encoding changed:
df=pd.read_csv(filename,encoding='iso-8859-1',sep=',')
It didn't give any error but imported as one single column rather than breaking it to separate columns.
>>>df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 263244 entries, 0 to 263243
Data columns (total 1 columns):
End Customer Organization ID,End Customer Organization Name,End Customer Top Parent Organization ID,End Customer Top Parent Organization Name,Reseller Top Parent ID,Reseller Top Parent Name,Business,Rev Sum Division,Rev Sum Category,Product Family,Version,Pricing Level,Summary Pricing Level,Detail Pricing Level,MS Sales Amount,MS Sales Licenses,Fiscal Year,Sales Date 263244 non-null values
dtypes: object(1)
After just checking for the above example data by storing it in a text file and then importing this is the output I got:
>>> df =pd.read_csv(r'J:\Data.txt')
>>> print(df)
End Customer Organization ID \
0 11027676
1 12835756
2 12233135
3 11602305
4 13336009
End Customer Organization Name \
0 Baroda Western Uttar Pradesh Gramin Bankgfhgfn...
1 Uttam Strips Pvt Ltd
2 Bhagwan Singh Tondon
3 Maya Academy Of Advanced Cinematics
4 Remiel Softech Solution Pvt Ltd
End Customer Top Parent Organization ID \
0 4078446
1 12835756
2 12233135
3 9750934
4 13336009
End Customer Top Parent Organization Name Reseller Top Parent ID \
0 Bank Of Barodadfhhgfjyjtkyukujkyujkuhykluiluil... 1809012
1 Uttam Strips Pvt Ltd 12565538
2 Bhagwan Singh Tondon 2652941
3 Maya Entertainment Ltd 336146
4 Remiel Softech Solution Pvt Ltd 13335482
Reseller Top Parent Name \
0 Hcl Infosystems Ltd - Partnerdghftrutyhb frhyw...
1 Redington C/O Fortis Financial Services Ltd
2 H B S Systems Pvt Ltd
3 Embee Software Pvt Ltd
4 Redington C/O Remiel Softech Solutions Pvt Ltd
Business \
0 Server & CALsdgrgrfgtrhytrnhjdgthjtyjkukmhjmgh...
1 MBS
2 Server & CAL
3 Server & CAL
4 MBS
Rev Sum Division \
0 SQL Server & CALdfhtrhtrgbhrghrye5y45y45yu56ju...
1 Dynamics ERP
2 SQL Server & CAL
3 Windows Server & CAL
4 Dynamics ERP
Rev Sum Category \
0 SQL CALdhdfthtrutrjurhjethfdehrerfgwerweqeadfa...
1 Dynamics NAV
2 SQL CAL
3 Windows Server HPC
4 Dynamics NAV
Product Family Version \
0 SQL CALdtrye45y3t434tjkabcjkasdhfhasdjkcbaksmj... 2005
1 Dynamics NAV Business Essentials Non-specific
2 SQL CAL Non-specific
3 Windows Compute Cluster Server Non-specific
4 Dynamics NAV Business Essentials Non-specific
Pricing Level \
0 Openfkvgjesropiguwe90fujklascnioawfy98eyfuiasd...
1 Other
2 Open
3 Open
4 Other
Summary Pricing Level \
0 Open Lklbjdfoigueroigbjvwioergyuiowerhgosdhvgf...
1 MBS SA
2 Open L&SA
3 Open V/MYO - Rec
4 MBS SA
Detail Pricing Level MS Sales Amount \
0 Open Stddfm,vdnoghioerivnsdflierohgushdfovhsio... 125.85
1 MBS New Customer Enhanc. Def 0.00
2 Deferred Open L&SA - New 0.00
3 OLV Perpet L&SA Recur-Def 0.00
4 MBS New Customer Enhanc. Def 0.00
MS Sales Licenses Fiscal Year Sales Date
0 1 FY07 12/28/2006
1 0 FY09 9/15/2008
2 0 FY09 9/15/2008
3 0 FY09 9/25/2008
4 0 FY09 12/23/2008
>>>
This is adding '\' after each column and column names are not one after the other. Instead they seem to be on new line after each column is imported.
I guess your main problem has to do with encoding. I have suffered the pain of dealing with weird encodings in csv files. What helped me in those cases was to try to detect the real encoding of the file and load it correctly with pandas.
give this next code a try:
from chardet.universaldetector import UniversalDetector
def test_encoding(file_name):
detector = UniversalDetector()
with open(file_name, 'rb') as f:
for line in f:
detector.feed(line)
if detector.done:
break
detector.close()
r = detector.result
return "Detected encoding %s with confidence %s" % (r['encoding'], r['confidence'])
This will try to infer the encoding of your file and than you can try to load it correctly using pandas. Hope it helps...