CSV parsing with Commons CSV - Quotes within quotes causing IOException

mhollander38 · Jun 22, 2017 · Viewed 7.7k times

I am using Commons CSV to parse CSV content relating to TV shows. One of the shows has a name that includes double quotes:

116,6,2,29 Sep 10,""JJ" (60 min)","http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj"

The show name is "JJ" (60 min), which is itself wrapped in double quotes. Parsing this line throws java.io.IOException: (line 1) invalid char between encapsulated token and delimiter.

    ArrayList<String> allElements = new ArrayList<>();

    // try-with-resources closes the parser even if getRecords() throws
    try (CSVParser csvFileParser = new CSVParser(new StringReader(line), CSVFormat.DEFAULT)) {
        for (CSVRecord record : csvFileParser.getRecords()) {
            // a CSVRecord iterates over its column values in order
            for (String value : record) {
                allElements.add(value);
            }
        }
    }

    return allElements;

CSVFormat.DEFAULT already sets withQuote('"').

I think this CSV is not properly formatted, since ""JJ" (60 min)" should be """JJ"" (60 min)". Is there a way to get Commons CSV to handle this, or do I need to fix this entry manually?

Additional information: Other show names contain spaces and commas within the CSV entry and are placed within double quotes.
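
For reference, the escaping I would expect is the RFC 4180 convention: wrap the field in double quotes and double every double quote inside it. A minimal sketch of that rule in plain Java (the class and method names are only illustrative and are not part of Commons CSV):

```java
final class CsvEscape {
    // Quote a field the RFC 4180 way: wrap it in double quotes and
    // double every double quote that appears inside the value.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // prints """JJ"" (60 min)"
        System.out.println(quote("\"JJ\" (60 min)"));
    }
}
```

A field written this way is what CSVFormat.DEFAULT expects, so the problem is in the input rather than in the parser configuration.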

Answer

Jeronimo Backes · Jun 23, 2017

The problem here is that the quotes are not properly escaped, and Commons CSV does not handle that. Try univocity-parsers, which is the only Java parser I know of that can handle unescaped quotes inside a quoted value. It is also 4 times faster than Commons CSV. The following code parses your line:

    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;
    import com.univocity.parsers.csv.UnescapedQuoteHandling;

    // configure the parser to keep reading a quoted value until its closing quote
    CsvParserSettings settings = new CsvParserSettings();
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE);

    // create the parser
    CsvParser parser = new CsvParser(settings);

    // parse your line
    String[] out = parser.parseLine("116,6,2,29 Sep 10,\"\"JJ\" (60 min)\",\"http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj\"");

    for (String e : out) {
        System.out.println(e);
    }

This will print:

116
6
2
29 Sep 10
"JJ" (60 min)
http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj
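
If switching libraries is not an option, the same idea can be approximated by repairing each line before it reaches Commons CSV. The sketch below is only a heuristic and is not how univocity-parsers is implemented: inside a quoted field it treats a quote as the closing quote only when the delimiter or the end of the line follows, and doubles any other lone quote. The class and method names are hypothetical, and it assumes one record per line. It cannot resolve every case, for example a literal quote that sits directly before a comma or directly before the real closing quote.

```java
final class QuoteRepair {
    // Heuristic repair: inside a quoted field, a quote closes the field only
    // when it is followed by the delimiter or the end of the line. Any other
    // lone quote is treated as literal text and doubled (RFC 4180 escape).
    static String repair(String line, char delimiter) {
        StringBuilder out = new StringBuilder(line.length() + 8);
        boolean inQuotes = false;
        boolean fieldStart = true;   // true at line start and right after a delimiter
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (!inQuotes) {
                if (c == '"' && fieldStart) {
                    inQuotes = true;              // opening quote of a quoted field
                }
                fieldStart = (c == delimiter);
                out.append(c);
                continue;
            }
            if (c != '"') {
                out.append(c);                    // ordinary character inside quotes
                continue;
            }
            boolean atEnd = (i + 1 == line.length());
            if (atEnd || line.charAt(i + 1) == delimiter) {
                inQuotes = false;                 // closing quote
                out.append('"');
            } else if (line.charAt(i + 1) == '"') {
                out.append("\"\"");               // already an escaped pair, keep it
                i++;
            } else {
                out.append("\"\"");               // unescaped quote, double it
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "116,6,2,29 Sep 10,\"\"JJ\" (60 min)\",\"http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj\"";
        System.out.println(repair(line, ','));
    }
}
```

The repaired line can then be passed to the existing new CSVParser(new StringReader(...), CSVFormat.DEFAULT) call from the question. Lines that are already correctly escaped pass through unchanged.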

Hope it helps.

Disclosure: I'm the author of this library. It's open source and free (Apache 2.0 license).