Spark CSV package not able to handle \n within fields

unk1102 picture unk1102 · May 30, 2017 · Viewed 8k times · Source

I have a CSV file which I am trying to load using Spark CSV package and it does not load data properly because few of the fields have \n within them for e.g. the following two rows

"XYZ", "Test Data", "TestNew\nline", "OtherData" 
"XYZ", "Test Data", "blablablabla
\nblablablablablalbal", "OtherData" 

I am using the following code which is straightforward I am using parserLib as univocity as read in internet it solves multiple newline problem but it does not seems to be the case for me.

 SQLContext sqlContext = new SQLContext(sc);
    DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .option("parserLib","univocity")
        .load("data.csv");

How do I replace newline within fields which starts with quotes. Is there any easier way?

Answer

Jacek Laskowski picture Jacek Laskowski · May 30, 2017

According to SPARK-14194 (resolved as a duplicate) fields with new line characters are not supported and will never be.

I proposed to solve this via wholeFile option and it seems merged. I am resolving this as a duplicate of that as that one has a PR.

That's however Spark 2.0, and you use spark-csv module.

In the referenced SPARK-19610 it was fixed with the pull request:

hmm, I understand the motivation for this, though my understanding with csv generally either avoid having newline in field or some implementation would require quotes around field value with newline

In other words, use wholeFile option in Spark 2.x (as you can see in CSVDataSource).

As to spark-csv, this comment might be of some help (highlighting mine):

However, that there are a quite bit of similar JIRAs complaining about this and the original CSV datasource tried to support this although that was incorrectly implemented. This tries to match it with JSON one at least and it might be better to provide a way to process such CSV files. Actually, current implementation requires quotes :). (It was told R supports this case too actually).

In spark-csv's Features you can find the following:

The package also supports saving simple (non-nested) DataFrame. When writing files the API accepts several options:

  • quote: by default the quote character is ", but can be set to any character. This is written according to quoteMode.

  • quoteMode: when to quote fields (ALL, MINIMAL (default), NON_NUMERIC, NONE), see Quote Modes