I am quite new to NumPy and I am trying to read a tab-delimited (\t) text file into a NumPy array using the following code:
train_data = np.genfromtxt('training.txt', dtype=None, delimiter='\t')
File contents:
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
What I expect is a 2-D array of shape (3, 15),
but with the code above I only get a 1-D array of shape (3,).
I am not sure why the fifteen fields of each row are not each assigned a column.
I also tried numpy's loadtxt(), but it could not handle the type conversions in my data: even though I passed dtype=None, it tried to convert the strings to the default float type and failed.
Tried code:
train_data = np.loadtxt('try.txt', dtype=None, delimiter='\t')
Error:
ValueError: could not convert string to float: State-gov
Any pointers?
Thanks
The issue here is that np.genfromtxt and np.loadtxt both return a structured array if the dtype is structured (i.e., has multiple types). Your array reports a shape of (3,) because technically it is a 1-D array of 'records'. Each record holds all fifteen of your columns, so nothing is lost: you index rows by position and columns by field name rather than with a second integer axis.
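To make the "1-D array of records" idea concrete, here is a minimal sketch that rebuilds the three sample rows in memory instead of reading training.txt. The io.StringIO stand-in and the encoding="utf-8" argument are my assumptions (the latter needs a reasonably recent NumPy and yields unicode rather than byte strings); the data itself is the sample from the question.

```python
import io
import numpy as np

# The three sample rows from the question, rebuilt as tab-delimited text.
ROWS = [
    "38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K",
    "53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K",
    "30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K",
]
text = "\n".join("\t".join(row.split()) for row in ROWS)

# dtype=None asks genfromtxt to infer a type per column; mixed types
# produce a structured array with one field per column.
d = np.genfromtxt(io.StringIO(text), dtype=None, delimiter="\t", encoding="utf-8")

print(d.shape)             # (3,)  one entry per row
print(len(d.dtype.names))  # 15    one field per column
print(d[0])                # the whole first record
print(d["f1"])             # the second column, by its default field name
```

So the (3,) shape counts rows only; the fifteen columns live in the dtype as fields.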
You are loading it correctly:
In [82]: d = np.genfromtxt('tmp',dtype=None)
As you reported, it has a 1d shape:
In [83]: d.shape
Out[83]: (3,)
But all your data is there:
In [84]: d
Out[84]:
array([ (38, 'Private', 215646, 'HS-grad', 9, 'Divorced', 'Handlers-cleaners', 'Not-in-family', 'White', 'Male', 0, 0, 40, 'United-States', '<=50K'),
(53, 'Private', 234721, '11th', 7, 'Married-civ-spouse', 'Handlers-cleaners', 'Husband', 'Black', 'Male', 0, 0, 40, 'United-States', '<=50K'),
(30, 'State-gov', 141297, 'Bachelors', 13, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'Asian-Pac-Islander', 'Male', 0, 0, 40, 'India', '>50K')],
dtype=[('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])
The dtype of the array is structured like so:
In [85]: d.dtype
Out[85]: dtype([('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])
And you can still access "columns" (known as fields) using the names given in the dtype:
In [86]: d['f0']
Out[86]: array([38, 53, 30])
In [87]: d['f1']
Out[87]:
array(['Private', 'Private', 'State-gov'],
dtype='|S9')
It's more convenient to give proper names to the fields:
In [104]: names = "age,military,id,edu,a,marital,job,fam,ethnicity,gender,b,c,d,country,income"
In [105]: d = np.genfromtxt('tmp',dtype=None, names=names)
So you can now access the 'age' field, etc.:
In [106]: d['age']
Out[106]: array([38, 53, 30])
In [107]: d['income']
Out[107]:
array(['<=50K', '<=50K', '>50K'],
dtype='|S5')
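For anyone who wants to try the named-field version without creating a file, here is a self-contained sketch. It reuses the field names from above; the in-memory io.StringIO input and encoding="utf-8" are assumptions on my part, not something the original session used.

```python
import io
import numpy as np

# Sample rows from the question, joined with tabs to mimic the file.
ROWS = [
    "38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K",
    "53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K",
    "30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K",
]
text = "\n".join("\t".join(row.split()) for row in ROWS)

# Same comma-separated field names as in the session above.
names = "age,military,id,edu,a,marital,job,fam,ethnicity,gender,b,c,d,country,income"

d = np.genfromtxt(io.StringIO(text), dtype=None, delimiter="\t",
                  names=names, encoding="utf-8")

print(d.dtype.names[:3])  # the first few field names
print(d["age"])           # integer column
print(d["income"])        # string column
```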
Or the incomes of people under 35:
In [108]: d[d['age'] < 35]['income']
Out[108]:
array(['>50K'],
dtype='|S5')
and over 35:
In [109]: d[d['age'] > 35]['income']
Out[109]:
array(['<=50K', '<=50K'],
dtype='|S5')
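If you really do need a plain 2-D array of shape (3, 15), as the question expected, one possible route (not part of the session above, so treat it as a sketch) is to read every field as a string, since a regular ndarray can hold only one type. Passing dtype=str to np.loadtxt does that; numeric columns then have to be converted by hand when you need them. The in-memory input is again an assumption for the sake of a runnable example.

```python
import io
import numpy as np

# Sample rows from the question, joined with tabs to mimic the file.
ROWS = [
    "38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K",
    "53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K",
    "30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K",
]
text = "\n".join("\t".join(row.split()) for row in ROWS)

# Every cell is kept as a string, so the result is an ordinary 2-D array.
table = np.loadtxt(io.StringIO(text), dtype=str, delimiter="\t")
print(table.shape)  # (3, 15)

# Numeric columns must be converted explicitly before comparing.
ages = table[:, 0].astype(int)
print(table[ages < 35, -1])  # income of the rows with age under 35
```

The trade-off is that you lose the per-column types and field names, which is why the structured array is usually the more convenient form for mixed data like this.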