I have a sparse data set in CSV format, one whose rows vary in their number of fields. Here is a sample of the file text.
12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco
When I use

read.csv("data.txt", header = FALSE)

R interprets the data set as having 3 columns, because the number of columns is determined from the first five lines of input. Is there any way to force R to put the data in more columns?
Deep in the ?read.table documentation there is the following:

The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the 'Examples').
Therefore, let's define col.names to be a vector of length X (where X is the maximum number of fields in your data set), and set fill = TRUE:
dat <- textConnection("12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco")
read.table(dat, header = FALSE, sep = ",",
           col.names = paste0("V", seq_len(7)), fill = TRUE)
V1 V2 V3 V4 V5 V6 V7
1 12223 University
2 12227 bridge Sky
3 12828 Sunset
4 13801 Ground
5 14853 Tranceamerica
6 14854 San Francisco
7 15595 shibuya Shrine
8 16126 fog San Francisco
9 16520 California ocean summer golden gate beach San Francisco
If the maximum number of fields is unknown, you can find it with the nifty utility function count.fields (which I found in the read.table example code). Note that read.table consumes the textConnection, so recreate dat (or point count.fields at the file itself) before running this:

n <- count.fields(dat, sep = ',')
n
# [1] 2 3 2 2 2 2 3 3 7
max(n)
# [1] 7
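Putting the two steps together, you can compute the column count from the file itself and feed it straight into read.table. A minimal sketch (it writes a few of the sample rows to "data.txt" so it is self-contained; in your case the file already exists, and strip.white = TRUE is an optional extra that trims the spaces after the commas):

```r
# Write a small sample so the example runs on its own;
# skip this step if data.txt already exists.
writeLines(c(
  "12223, University",
  "12227, bridge, Sky",
  "16520, California, ocean, summer, golden gate, beach, San Francisco"
), "data.txt")

# Widest row in the file determines the column count.
n_max <- max(count.fields("data.txt", sep = ","))

dat <- read.table("data.txt", header = FALSE, sep = ",",
                  col.names = paste0("V", seq_len(n_max)),
                  fill = TRUE, strip.white = TRUE)
dim(dat)
# [1] 3 7
```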
Possibly helpful related reading: Only read limited number of columns in R