R's read.csv prepending 1st column name with junk text

Daniel PP Cabral · Jul 4, 2014 · Viewed 41.5k times

I have exported data from a result grid in SQL Server Management Studio to a csv file. The csv file looks correct.

But when I read the data into an R dataframe using read.csv, the first column name is prepended with "ï..". How do I get rid of this junk text?

Example:

str(trainData)

'data.frame':   64169 obs. of  20 variables:    
 $ ï..Column1             : int  3232...   
 $ Column2                : int  4242...

The data looks something like this (nothing special) :

Column1,Column2
100116577,100116577
100116698,100116702

Answer

Spacedman · Jul 4, 2014

You've got a Unicode UTF-8 BOM at the start of the file:

http://en.wikipedia.org/wiki/Byte_order_mark

A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

R is giving you the ï and then converting the other two into dots, as they are non-alphanumeric characters (read.csv runs column names through make.names when check.names = TRUE, which is the default).

Here:

http://r.789695.n4.nabble.com/Writing-Unicode-Text-into-Text-File-from-R-in-Windows-td4684693.html

Duncan Murdoch suggests:

You can declare a file to be in encoding "UTF-8-BOM" if you want to ignore a BOM on input

So try your read.csv with fileEncoding="UTF-8-BOM" or persuade your SQL wotsit to not output a BOM.
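A minimal sketch of the fileEncoding approach, in case it helps. To keep it self-contained it writes its own small BOM-prefixed file to a temporary path (the path and the sample rows are made up for illustration); with real data you would just pass your own file name to read.csv.

```r
# Reproduce the problem: write a tiny CSV that starts with the UTF-8 BOM
# (bytes EF BB BF). The temp path and rows are illustrative assumptions.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Column1,Column2\n100116577,100116577\n100116698,100116702\n"), con)
close(con)

# Declaring the encoding as "UTF-8-BOM" tells R to drop the BOM on input.
trainData <- read.csv(path, fileEncoding = "UTF-8-BOM")
names(trainData)
```

If this behaves as described, the first column name should come back as plain Column1. The same fileEncoding argument is accepted by read.table and the other read.* wrappers.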

Otherwise you may as well test if the first name starts with ï.. and strip it with substr (as long as you know you'll never have a column that does start like that genuinely...)
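A rough sketch of that fallback, assuming the junk shows up exactly as "ï.." as in the question (the prefix can look different on other platforms or locales, so check names(trainData)[1] first). The helper name strip_bom_prefix is made up for this example.

```r
# Hypothetical helper: remove a known junk prefix from the first column name.
# "\u00ef.." is the "ï.." seen in the question; adjust it if your prefix differs.
strip_bom_prefix <- function(df, prefix = "\u00ef..") {
  first <- names(df)[1]
  n <- nchar(prefix)
  # Only touch the name if it really starts with the prefix.
  if (substr(first, 1, n) == prefix) {
    names(df)[1] <- substr(first, n + 1, nchar(first))
  }
  df
}

# Usage with a stand-in data frame that has the mangled name.
trainData <- data.frame(a = c(100116577, 100116698), Column2 = c(100116577, 100116702))
names(trainData)[1] <- "\u00ef..Column1"
trainData <- strip_bom_prefix(trainData)
names(trainData)
```

This only renames the column; the data itself is untouched, and other column names are left alone.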