I'm importing xlsx
2007 tables into R 3.2.1patched
using package readxl 0.1.0
under Windows 7 64
. The tables' size is on the order of 25,000 rows by 200 columns.
Function read_excel()
works a treat. My only problem is with its assignment of column class (datatype) to sparsely populated columns. For example, a given column may be NA for 20,000 rows and then will take a character value on row 20,001. read_excel()
appears to default to column type numeric when scanning the first n rows of a column and finding NAs
only. The data causing the problem are chars in a column assigned numeric. When the error limit is reached, execution halts. I actually want the data in the sparse columns, so setting the error limit higher isn't a solution.
I can identify the troublesome columns by reviewing the warnings thrown. And read_excel()
has an option for asserting a column's datatype by setting argument col_types
according to the package docs:
Either NULL
to guess from the spreadsheet or a character vector containing blank
,numeric
, date
or text
.
But does this mean I have to construct a vector of length 200 populated in almost every position with blank
and text
in handful of positions corresponding to the offending columns?
There's probably a way of doing this in a couple lines of R
code. Create a vector of the required length and fill it with blank
s. Maybe another vector containing the numbers of the columns to be forced to text
, and then ... Or maybe it's possible to call out for read_excel()
just the columns for which its guesses aren't as desired.
I'd appreciate any suggestions.
Thanks in advance.
New solution since readxl
version 1.x:
The solution in the currently preferred answer does no longer work with newer versions than 0.1.0 of readxl
since the used package-internal function readxl:::xlsx_col_types
does no longer exist.
The new solution is to use the newly introduced parameter guess_max
to increase the number of rows used to "guess" the appropriate data type of the columns:
read_excel("My_Excel_file.xlsx", sheet = 1, guess_max = 1048576)
The value 1,048,576 is the maximum number of lines supported by Excel currently, see the Excel specs: https://support.office.com/en-us/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3
PS: If you care about performance using all rows to guess the data type: read_excel
seems to read the file only once and the guess is done in-memory then so the performance penalty is very small compared to the saved work.