Spark Read/Write (csv) ISO-8859-1

jduff1075 · Aug 24, 2016 · Viewed 7.6k times

I need to read an ISO-8859-1 encoded file, perform some operations on it, and then save the result (also with ISO-8859-1 encoding). To test this, I'm loosely mimicking a test case I found in the Databricks CSV package: https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala

-- specifically: test("DSL test for iso-8859-1 encoded file")

val fileDF = spark.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("charset", "iso-8859-1")
  .option("delimiter", "~")         // bogus - hopefully something not in the file, just want 1 record per line
  .load("s3://.../cars_iso-8859-1.csv")

fileDF.collect                           // I see the non-ASCII characters correctly
val selectedData = fileDF.select("_c0")  // just to show an operation
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", "~")
  .option("charset", "iso-8859-1")
  .save("s3://.../carOutput8859")

This code runs without error, but it doesn't seem to honor the iso-8859-1 charset option on output. At a Linux prompt (after copying the files from S3 to a local Linux machine):

file -i cars_iso-8859-1.csv 
cars_iso-8859-1.csv: text/plain; charset=iso-8859-1

file -i carOutput8859.csv 
carOutput8859.csv: text/plain; charset=utf-8
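
For reference, a check similar in spirit to `file -i` can be sketched on the JVM by testing whether a byte sequence decodes strictly as UTF-8. This is only an illustration: the object name `EncodingCheck` and the sample string are made up here, and `file` uses its own heuristics, so the two may disagree on edge cases (pure ASCII, for example, is valid in both encodings).

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// Hypothetical helper, not part of Spark or spark-csv.
object EncodingCheck {
  // Returns true if the bytes decode cleanly as UTF-8, false otherwise.
  def isValidUtf8(bytes: Array[Byte]): Boolean = {
    val decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of substituting U+FFFD
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try {
      decoder.decode(ByteBuffer.wrap(bytes))
      true
    } catch {
      case _: CharacterCodingException => false
    }
  }

  def main(args: Array[String]): Unit = {
    val text   = "caf\u00e9"                                  // made-up sample with one non-ASCII character
    val latin1 = text.getBytes(StandardCharsets.ISO_8859_1)   // 'é' becomes the single byte 0xE9
    val utf8   = text.getBytes(StandardCharsets.UTF_8)        // 'é' becomes the two bytes 0xC3 0xA9
    println(s"latin1 bytes valid UTF-8? ${isValidUtf8(latin1)}")
    println(s"utf8 bytes valid UTF-8?   ${isValidUtf8(utf8)}")
  }
}
```

A lone 0xE9 byte is not a complete UTF-8 sequence, so ISO-8859-1 output containing accented characters should fail this check, while UTF-8 output passes it.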

I'm just looking for some good examples of reading and writing non-UTF-8 files. At this point I have plenty of flexibility in the approach (it doesn't have to be a CSV reader). Any recommendations or examples?
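
To make the goal concrete, here is a minimal sketch of the kind of round trip I mean, using only plain JVM I/O rather than Spark. It assumes a small local file that fits in memory, so it is not a distributed solution; the object name `Latin1RoundTrip`, the temp files, and the sample data are all hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper for small local files; not a Spark API.
object Latin1RoundTrip {
  // Decode the whole file as ISO-8859-1 (every byte maps to exactly one character).
  def readLatin1(path: Path): String =
    new String(Files.readAllBytes(path), StandardCharsets.ISO_8859_1)

  // Encode the text as ISO-8859-1 and write it out.
  def writeLatin1(path: Path, content: String): Unit = {
    Files.write(path, content.getBytes(StandardCharsets.ISO_8859_1))
    ()
  }

  def main(args: Array[String]): Unit = {
    val in  = Files.createTempFile("cars_iso-8859-1", ".csv")
    val out = Files.createTempFile("carOutput8859", ".csv")

    // Made-up sample data containing a non-ASCII character.
    writeLatin1(in, "Citro\u00ebn,C4\nRenault,Clio\n")

    // Stand-in for "some operation": keep only the first column of each line.
    val firstColumn = readLatin1(in)
      .split("\n")
      .map(_.split(",")(0))
      .mkString("\n")

    writeLatin1(out, firstColumn)
    println(readLatin1(out))
  }
}
```

The point of the sketch is only that both the decode and the encode step name the charset explicitly, which is the behavior I would like the Spark writer to match.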

Answer