I have a simple csv file called "test.csv" with the following content:
colA,colB,colC
1,"x",12
2,"y",34
3,"z",56
Let's say I want to skip reading in colA and just read in colB and colC. I want a general way to do this because I have lots of files to read in, and sometimes colA is called something else altogether, but colB and colC are always the same.
According to the read_csv documentation, one way to accomplish this is to pass a named list for col_types and only name the columns you want to keep:
read_csv('test.csv', col_types = list(colB = col_character(), colC = col_numeric()))
By not mentioning colA it should get dropped from the output. However, the resulting data frame is:
Source: local data frame [3 x 3]
colA colB colC
1 1 x 12
2 2 y 34
3 3 z 56
Am I doing something wrong or is the read_csv documentation not correct? According to the help file:
If a list, it must contain one "collector" for each column. If you only want to read a subset of the columns, you can use a named list (where the names give the column names). If a column is not mentioned by name, it will not be included in the output.
There is an answer out there; I just didn't search hard enough: https://github.com/hadley/readr/issues/132
Apparently this was a documentation issue that has since been corrected. This functionality may eventually get added, but Hadley thought it was more useful to be able to update just one column type without dropping the others.
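To illustrate the behaviour described in that issue, here is a minimal sketch, assuming a reasonably recent readr in which a partial specification is written with cols(). The temporary file simply recreates test.csv from the question, and the choice of col_character() for colC is only for demonstration:

```r
library(readr)

# Recreate the example file from the question in a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c('colA,colB,colC', '1,"x",12', '2,"y",34', '3,"z",56'), path)

# Only colC is named here; the other columns are still read,
# with their types guessed by readr.
df <- read_csv(path, col_types = cols(colC = col_character()))

names(df)
```

Naming a column in cols() overrides its type but does not drop the unnamed columns, which matches the output shown in the question.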
Update: The functionality has since been added via cols_only().
The following code is from the readr documentation:
read_csv("iris.csv", col_types = cols_only(Species = col_factor(c("setosa", "versicolor", "virginica"))))
This will read only the Species column of the iris data set. In order to read only a specific column you must also pass the column specification, e.g. col_factor, col_double, etc.
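Applied to the file from the question, the same approach might look like the sketch below. It assumes a readr version that provides cols_only(); the temporary file recreates test.csv, and col_double() stands in for the col_numeric() collector used in the question:

```r
library(readr)

# Recreate the example file from the question in a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c('colA,colB,colC', '1,"x",12', '2,"y",34', '3,"z",56'), path)

# Keep only colB and colC; the first column is dropped
# regardless of what it happens to be called in a given file.
df <- read_csv(
  path,
  col_types = cols_only(colB = col_character(), colC = col_double())
)

names(df)
```

Because the columns to keep are selected by name, this should work across files where the first column is named differently, as long as colB and colC are present.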