Is there any Java open source library that supports multi-character (i.e., String with length > 1) separators (delimiters) for CSV?
By definition, CSV means Comma-Separated Values: data with a single character (',') as the delimiter. However, many other single-character alternatives exist (e.g., tab), so CSV effectively stands for "Character-Separated Values" (essentially DSV: Delimiter-Separated Values).
The main Java open source libraries for CSV (e.g., OpenCSV) support virtually any character as the delimiter, but not string (multi-character) delimiters. So, for data separated by strings like "|||", the only option is to preprocess the input and replace each occurrence of the string with a single-character delimiter. From then on, the data can be parsed as single-character-separated values.
It would therefore be nice if there were a library that supported string separators natively, so that no preprocessing was necessary. CSV would then stand for "CharSequence-Separated Values". :-)
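For reference, the preprocessing step described above might look like the following minimal Java sketch. The class name, the method name and the choice of '\u0001' as the replacement character are assumptions made for illustration; the replacement must be a character that never occurs in the data.

```java
final class DelimiterPreprocessor {
    // Hypothetical choice: a control character assumed never to appear in the data.
    static final char PLACEHOLDER = '\u0001';

    /** Replaces every literal occurrence of the multi-character separator with PLACEHOLDER. */
    static String toSingleCharDelimited(String line, String separator) {
        return line.replace(separator, String.valueOf(PLACEHOLDER));
    }

    public static void main(String[] args) {
        String line = "J|||Project report|||\"F, G, I\"|||1";
        String converted = toSingleCharDelimited(line, "|||");
        // Show the placeholder as '~' so the result is readable on a console.
        System.out.println(converted.replace(PLACEHOLDER, '~'));
        // prints: J~Project report~"F, G, I"~1
    }
}
```

The converted line can then be handed to any parser configured with the placeholder as its separator. Note that a separator occurring inside a quoted field is replaced as well, so it would have to be restored after parsing.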
This is a good question. The problem was not obvious to me until I looked at the Javadocs and realised that OpenCSV only supports a character as a separator, not a string.
Here are a couple of suggested workarounds (the examples are in Groovy, but can be converted to Java).
Continue to use OpenCSV, but ignore the empty fields. Obviously this is a cheat, but it will work fine for parsing well-behaved data.
CSVParser csv = new CSVParser((char)'|')
String[] result = csv.parseLine('J||Project report||"F, G, I"||1')
assert result[0] == "J"
assert result[2] == "Project report"
assert result[4] == "F, G, I"
assert result[6] == "1"
or
CSVParser csv = new CSVParser((char)'|')
String[] result = csv.parseLine('J|||Project report|||"F, G, I"|||1')
assert result[0] == "J"
assert result[3] == "Project report"
assert result[6] == "F, G, I"
assert result[9] == "1"
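Use Groovy's tokenize method, which is backed by Java's StringTokenizer. Note that the argument is treated as a set of delimiter characters, so '|||' behaves the same as '|', and empty fields are dropped.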
def result = 'J|||Project report|||"F, G, I"|||1'.tokenize('|||')
assert result[0] == "J"
assert result[1] == "Project report"
assert result[2] == "\"F, G, I\""
assert result[3] == "1"
The disadvantage of this approach is that you lose the ability to ignore quote characters or escape separators.
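In plain Java, the tokenizer approach could be sketched roughly as below. The class and method names are made up for illustration. The sketch also shows `String.split` with `Pattern.quote`, which treats the separator as one whole string rather than as a set of characters, and keeps empty fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

final class TokenizeDemo {
    /** Tokenizer approach: every character of delims is a delimiter; empty tokens are skipped. */
    static List<String> tokenize(String line, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    /** Splits on the separator as a literal string; the -1 limit keeps trailing empty fields. */
    static List<String> splitOnString(String line, String separator) {
        return Arrays.asList(line.split(Pattern.quote(separator), -1));
    }

    public static void main(String[] args) {
        String line = "J|||Project report|||\"F, G, I\"|||1";
        System.out.println(tokenize(line, "|||"));      // quotes are kept in the third field
        System.out.println(splitOnString(line, "|||")); // same fields for this input

        // The two differ once a field contains a single '|' or is empty.
        String tricky = "a|b||||||c";
        System.out.println(tokenize(tricky, "|||"));      // [a, b, c]
        System.out.println(splitOnString(tricky, "|||")); // [a|b, , c]
    }
}
```

Neither variant understands quoting, so a separator inside a quoted field would still split that field.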
Instead of pre-processing the data and altering its content, why not combine both of the above approaches in a two-step process?
Not very efficient, but possibly easier than writing your own CSV parser :-)
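One possible reading of the two-step idea, sketched in plain Java: first split the line on the literal separator, then handle quoting field by field. The `unquote` helper is a hypothetical stand-in for the second step; with OpenCSV on the classpath, each piece could be passed to its parser instead. This sketch assumes the separator never appears inside a quoted field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class TwoStepSplit {
    /**
     * Step 2 stand-in (hypothetical helper): strips one pair of surrounding
     * double quotes and collapses doubled quotes ("") into a single quote.
     */
    static String unquote(String field) {
        if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
            return field.substring(1, field.length() - 1).replace("\"\"", "\"");
        }
        return field;
    }

    static List<String> parseLine(String line, String separator) {
        List<String> fields = new ArrayList<>();
        // Step 1: split on the separator as a literal string; -1 keeps trailing empty fields.
        for (String raw : line.split(Pattern.quote(separator), -1)) {
            // Step 2: deal with quoting for each field separately.
            fields.add(unquote(raw));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "J|||Project report|||\"F, G, I\"|||1";
        System.out.println(parseLine(line, "|||"));
        // prints: [J, Project report, F, G, I, 1]
    }
}
```

Because the printed list uses ", " between elements, the third field "F, G, I" looks like three entries in the output; the list actually holds four fields.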