When CSV is read as dataframe in spark, all the columns are read as string. Is there any way to get the actual type of column?
I have the following csv file
Name,Department,years_of_experience,DOB
Sam,Software,5,1990-10-10
Alex,Data Analytics,3,1992-10-10
I've read the CSV using the below code
val df = sqlContext.
read.
format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "true").
load(sampleAdDataS3Location)
df.schema
All the columns are read as string. I expect the column years_of_experience to be read as int and DOB to be read as date
Please note that I've set the option inferSchema to true.
I am using the latest version (1.0.3) of spark-csv package
Am I missing something here?
2015-07-30
The latest version is actually 1.1.0, but it doesn't really matter since it looks like inferSchema
is not included in the latest release.
2015-08-17
The latest version of the package is now 1.2.0 (published on 2015-08-06) and schema inference works as expected:
scala> df.printSchema
root
|-- Name: string (nullable = true)
|-- Department: string (nullable = true)
|-- years_of_experience: integer (nullable = true)
|-- DOB: string (nullable = true)
Regarding automatic date parsing I doubt it will ever happen, or at least not without providing additional metadata.
Even if all fields follow some date-like format it is impossible to say if a given field should be interpreted as a date. So it is either lack of out automatic date inference or spreadsheet like mess. Not to mention issues with timezones for example.
Finally you can easily parse date string manually:
sqlContext
.sql("SELECT *, DATE(dob) as dob_d FROM df")
.drop("DOB")
.printSchema
root
|-- Name: string (nullable = true)
|-- Department: string (nullable = true)
|-- years_of_experience: integer (nullable = true)
|-- dob_d: date (nullable = true)
so it is really not a serious issue.
2017-12-20:
Built-in csv parser available since Spark 2.0 supports schema inference for dates and timestamp - it uses two options:
timestampFormat
with default yyyy-MM-dd'T'HH:mm:ss.SSSXXX
dateFormat
with default yyyy-MM-dd
See also How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?