Python Pandas does not read the first row of csv file

Tom · Feb 7, 2015 · Viewed 72.5k times

I have a problem reading a CSV (or txt) file with the pandas module. Because numpy's loadtxt function takes too much time, I decided to use pandas read_csv instead.

I want to make a numpy array from a txt file that has four space-separated columns and a very large number of rows (like 256^3; the example below uses 4^3 = 64).

The problem is that, for some reason, pandas's read_csv always seems to skip the first line (first row) of the csv (txt) file, so the result has one row fewer than the file.

Here is the code.

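from __future__ import division
import numpy as np
import pandas as pd

# Grid dimensions; the test array has ngridx*ngridy*ngridz rows.
ngridx = 4
ngridy = 4
ngridz = 4
size = ngridx*ngridy*ngridz

# Columns 0-2 hold the (x, y, z) grid indices; column 3 holds random values.
f = np.zeros((size, 4))
a = np.arange(size)
f[:, 0] = np.floor_divide(a, ngridy*ngridz)
f[:, 1] = np.fmod(np.floor_divide(a, ngridz), ngridy)
f[:, 2] = np.fmod(a, ngridz)
f[:, 3] = np.random.rand(size)
print(f[0])

# Write the array to a space-separated text file, then read it back with pandas.
np.savetxt('Testarray.txt', f, fmt='%6.16f')
g = pd.read_csv('Testarray.txt', delimiter=' ').values
print(g[0])          # does not match f[0]
print(len(g[:, 3]))  # one less than size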

f[0] and g[0] in the output should match, but they don't, which indicates that pandas is skipping the first line of Testarray.txt. Also, the loaded array g is one row shorter than the array f.

I need help.

Thanks in advance.

Answer

unutbu · Feb 7, 2015

By default, pd.read_csv behaves as if header=0 (when the names parameter is not specified), which means the first (i.e. 0th-indexed) line is interpreted as column names.
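To illustrate the role of the names parameter mentioned above, here is a minimal sketch. The column labels x, y and z are hypothetical, chosen only for this illustration. When names is passed and header is left at its default, the first line should stay in the data:

```python
import io

import pandas as pd

text = "1 2 3\n4 5 6\n"

# No names given: the first line is consumed as the column names.
inferred = pd.read_csv(io.StringIO(text), sep=' ')

# names given (hypothetical labels): the first line is kept as data.
named = pd.read_csv(io.StringIO(text), sep=' ', names=['x', 'y', 'z'])

print(len(inferred))  # 1
print(len(named))     # 2
```

This is handy if you want meaningful column labels anyway; otherwise header=None is the more direct choice.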

If your data has no header, then use

pd.read_csv(..., header=None)

For example,

import io
import sys
import pandas as pd
if sys.version_info.major == 3:
    # Python3
    StringIO = io.StringIO 
else:
    # Python2
    StringIO = io.BytesIO

text = '''\
1 2 3
4 5 6
'''

print(pd.read_csv(StringIO(text), sep=' '))

Without header=None, the first line, 1 2 3, sets the column names:

   1  2  3
0  4  5  6

With header=None, the first line is treated as data:

print(pd.read_csv(StringIO(text), sep=' ', header=None))

prints

   0  1  2
0  1  2  3
1  4  5  6
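Applied to a setup like the one in the question, a round trip might look like the sketch below. It assumes a small stand-in array (3 rows instead of the full grid) and writes to a temporary directory rather than the working directory. Both choices are only for illustration and are not part of the original code.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in for the question's 4-column array (illustrative values).
f = np.arange(12, dtype=float).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'Testarray.txt')

    # Same format string as in the question; columns are separated by one space.
    np.savetxt(path, f, fmt='%6.16f')

    # header=None keeps the first line as data instead of using it as column names.
    g = pd.read_csv(path, sep=' ', header=None).values

print(g.shape == f.shape)
print(np.allclose(f, g))
```

With header=None, g should have the same number of rows as f, and g[0] should match f[0].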