I have a CSV file with unusual delimiters that I want to parse with Talend. Normally, when a CSV uses newline characters as the row delimiter, I use "\n"; when it is a tab-delimited file, I use "\t", and so on. This file, however, uses some unusual characters, which Wikipedia tells me are so-called "control characters". My question is: how can I specify these characters in the tFileInputDelimited component in Talend (see screenshot 2)? Instead of a newline character (\n) I must use the STX control character, but how do I tell Talend which character that is? And what notation is "\n" in the first place?
An example of the file:
https://dl.dropbox.com/u/1757832/talendSeparators1.jpg
The tFileInputDelimited component in Talend, where I must enter the row separator and field separator characters.
Have you tried creating a tFileInputDelimited metadata entry for that file? Doing that gives you more options (see attached picture).
EDIT:
Here are the Unicode code points of the relevant control characters:
SOH (Start of Heading, U+0001): http://www.fileformat.info/info/unicode/char/0001/index.htm
STX (Start of Text, U+0002): http://www.fileformat.info/info/unicode/char/0002/index.htm
Have you also tried using those Unicode codes?
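On the question of what notation "\n" is: Talend's separator fields take Java string expressions, so "\n" and "\t" are ordinary Java escape sequences, and any other character can be written with a Unicode escape of the form backslash, u, and four hex digits. The sketch below is plain Java rather than Talend code, and the class and method names are made up for illustration; it only prints the numeric code behind each escape.

```java
public class ControlCharEscapes {
    // Returns the numeric code of the first character of s.
    static int codeOf(String s) {
        return s.charAt(0);
    }

    public static void main(String[] args) {
        System.out.println(codeOf("\n"));      // line feed -> 10
        System.out.println(codeOf("\t"));      // tab -> 9
        System.out.println(codeOf("\u0001"));  // SOH -> 1
        System.out.println(codeOf("\u0002"));  // STX -> 2
    }
}
```

So the string literal for STX is a one-character Java string, just like "\n" is.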
EDIT 2, with solution:
Here's a file with STX as the field separator.
I've defined a simple tFileInputDelimited schema with two columns (key and value, both strings).
Then I've set:
Row Separator: "\n"
Field Separator: new String("\u0002")
With those settings I got the right behavior:
.----+------.
| tLogRow_1 |
|=---+-----=|
|key |value |
|=---+-----=|
|key1|value1|
|key2|value2|
'----+------'
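To see the same splitting outside Talend, here is a minimal plain-Java sketch. It is not what Talend generates internally; the class name, the parse method, and the sample content are assumptions chosen to match the two rows in the log above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class StxSplitDemo {
    static final String ROW_SEP = "\n";        // same value as the row separator setting
    static final String FIELD_SEP = "\u0002";  // STX, same value as the field separator setting

    // Splits content into rows, then each row into fields.
    static List<String[]> parse(String content) {
        List<String[]> rows = new ArrayList<>();
        for (String line : content.split(Pattern.quote(ROW_SEP))) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // Limit -1 keeps trailing empty fields instead of dropping them.
            rows.add(line.split(Pattern.quote(FIELD_SEP), -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed sample data: two key/value rows separated by STX.
        String content = "key1" + FIELD_SEP + "value1" + ROW_SEP
                       + "key2" + FIELD_SEP + "value2" + ROW_SEP;
        for (String[] row : parse(content)) {
            System.out.println(row[0] + "|" + row[1]);
        }
    }
}
```

If the file used STX as the row separator instead, as in the original question, swapping the two constants should be enough in this sketch.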