Importing CSV file with specific delimiters in Talend

Rogier Lommers · Jan 9, 2013

I have a CSV file with unusual delimiters that I want to parse with Talend. Normally, when a CSV uses newline characters as the row delimiter, I use "\n". When it is a tab-delimited file, I use "\t", et cetera. But now I have a file that uses some unusual characters; Wikipedia taught me that these are so-called "control characters". My question is: how can I specify these characters in the tFileInputDelimited component in Talend (see screenshot 2)? Instead of a newline character (\n) I must use the STX control character, but how do I tell Talend which character that is? And what notation is "\n" in the first place?

An example of the file:

https://dl.dropbox.com/u/1757832/talendSeparators1.jpg

The tFileInputDelimited component in Talend, where I must enter the row separator and field separator characters:

https://dl.dropbox.com/u/1757832/talendSeparators2.jpg

Answer

Jean-Michel Garcia · Jan 9, 2013

Have you tried creating tFileInputDelimited metadata for that file?

Doing that, you have more options (see attached picture).

EDIT:

Here are the Unicode code points of the relevant control characters:

SOH (Start of Heading): http://www.fileformat.info/info/unicode/char/0001/index.htm
STX (Start of Text): http://www.fileformat.info/info/unicode/char/0002/index.htm

Have you also tried using those Unicode codes?
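On the question of what notation "\n" is: Talend generates Java code, and the separator fields take Java string expressions, so "\n" and "\t" are ordinary Java escape sequences. Control characters that have no short escape can be written as Unicode escapes instead. The following is a minimal standalone sketch of that notation; the class name and the codeOf helper are hypothetical and only serve to print the character codes.

```java
public class ControlCharEscapes {
    // Short Java escapes: line feed (code 10) and tab (code 9).
    static final String LF = "\n";
    static final String TAB = "\t";

    // Control characters without a short escape, written as Unicode escapes.
    static final String SOH = "\u0001"; // Start of Heading
    static final String STX = "\u0002"; // Start of Text

    // Hypothetical helper: numeric code of the first character of a separator.
    static int codeOf(String separator) {
        return separator.charAt(0);
    }

    public static void main(String[] args) {
        System.out.println("LF  = " + codeOf(LF));
        System.out.println("TAB = " + codeOf(TAB));
        System.out.println("SOH = " + codeOf(SOH));
        System.out.println("STX = " + codeOf(STX));
    }
}
```

Each of these strings is exactly one character long, so any of them can be used wherever a single-character separator is expected.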

EDIT 2, with solution:

Here's a file that uses STX as the field separator:

File content

I've defined a simple tFileInputDelimited schema with two columns (key and value, both strings).

Then I set:

  1. row separator as "\n"
  2. field separator as new String("\u0002")

With those settings, I got the expected result:

.----+------.
| tLogRow_1 |
|=---+-----=|
|key |value |
|=---+-----=|
|key1|value1|
|key2|value2|
'----+------'
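The same split can be checked outside Talend. The sketch below is plain Java, not Talend's generated code: the class name, the parse method, and the in-memory sample content (rebuilt from the key/value rows shown above) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class StxDelimitedDemo {
    // Same separators as in the component settings above.
    static final String ROW_SEPARATOR = "\n";
    static final String FIELD_SEPARATOR = "\u0002";

    // Splits the content into rows, then each row into fields.
    // Pattern.quote makes the separators match literally in the split regex.
    static List<String[]> parse(String content) {
        List<String[]> rows = new ArrayList<>();
        for (String line : content.split(Pattern.quote(ROW_SEPARATOR))) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // Limit -1 keeps trailing empty fields instead of dropping them.
            rows.add(line.split(Pattern.quote(FIELD_SEPARATOR), -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample content mirroring the two rows in the output above.
        String content = "key1" + FIELD_SEPARATOR + "value1" + ROW_SEPARATOR
                       + "key2" + FIELD_SEPARATOR + "value2" + ROW_SEPARATOR;
        for (String[] row : parse(content)) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

Swapping the two constants would model the case from the question, where STX is the row separator rather than the field separator.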