I downloaded a file ('CRS 2013 data.txt') from the OECD at http://stats.oecd.org/Index.aspx?datasetcode=CRS1 by selecting Export -> Related files. I want to work with this file in Ubuntu (14.04 LTS).
When I run:
dos2unix CRS\ 2013\ data.txt
I see:
dos2unix: Binary symbol 0x0004 found at line 1703
dos2unix: Skipping binary file CRS 2013 data.txt
I check the encoding of the file with:
file --mime-encoding CRS\ 2013\ data.txt
and see:
CRS 2013 data.txt: utf-16le
I do:
iconv -l | grep utf-16le
which doesn't return anything so I do:
iconv -l | grep UTF-16LE
which returns:
UTF-16LE//
Then I run:
iconv --verbose -f UTF-16LE -t UTF-8 CRS\ 2013\ data.txt -o crs_2013_data_temp.txt
and check:
file --mime-encoding crs_2013_data_temp.txt
and see:
crs_2013_data_temp.txt: utf-8
Then I try:
dos2unix crs_2013_data_temp.txt
and get:
dos2unix: Binary symbol 0x04 found at line 1703
dos2unix: Skipping binary file crs_2013_data_temp.txt
I then try to force it:
dos2unix -f crs_2013_data_temp.txt
It works, i.e., dos2unix completes the conversion without bailing out or complaining, but when I open the file I see entries like "FoÄŤa and ÄŚajniÄŤe".
My question is why? Is it because the BOM is not visible to dos2unix? Because it's missing? Have I not done the conversion right? How do I convert this file correctly so that I can read it?
That 0x0004 character you are seeing in your file has nothing at all to do with the BOM (which is fine, by the way) -- it's an EOT (End of Transmission) character from the C0 control set, and has been at that codepoint since 7-bit ASCII was the new hotness. (It's also the familiar Control-D Unix EOF sequence.)
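If you want to see where the EOT characters are before deciding what to do with them, something along these lines should list the affected line numbers in the UTF-8 copy. This is only a sketch: find_eot_lines is a made-up helper name, and the -a flag (treat the input as text) assumes GNU grep, which is what Ubuntu ships.

```shell
# Hypothetical helper: print the numbers of the lines in file $1 that
# contain an EOT (0x04) byte. Assumes GNU grep for the -a option.
# Usage: find_eot_lines crs_2013_data_temp.txt
find_eot_lines() {
    eot=$(printf '\004')                    # a literal Ctrl-D character
    grep -a -n "$eot" "$1" | cut -d: -f1    # keep only the line numbers
}
```

Run it on the UTF-8 output of iconv rather than on the original UTF-16 file, since grep works byte-wise here and the UTF-16 file is full of NUL bytes.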
Unfortunately, the pre-dos2unix way of applying tr to the file to strip the carriage returns won't work directly since the file is UTF-16; since iconv works for you, though, you can use it to convert to UTF-8 (which tr will work on), and then run this tr command:
tr -d '\r' < crs_2013_data_temp.txt > crs_2013_data_unix.txt
in order to get the text file into the Unix line ending convention. You will have to keep an eye on whatever tools you're feeding the file to, though, to make sure that they don't choke on the Ctrl-D/EOT character; if they do, you can use
tr -d '\004' < crs_2013_data_unix.txt > crs_2013_data_clean.txt
to get rid of it.
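The steps above can also be chained into a single pass, so that no intermediate files are left lying around. A minimal sketch, assuming the input really is UTF-16LE with CRLF line endings; utf16le_to_unix is a made-up name, and the output file name in the usage comment is just an example:

```shell
# Hypothetical helper: convert UTF-16LE file $1 to UTF-8, dropping every
# carriage return and EOT (0x04) byte, and write the result to $2.
# Usage: utf16le_to_unix 'CRS 2013 data.txt' crs_2013_data_clean.txt
utf16le_to_unix() {
    iconv -f UTF-16LE -t UTF-8 "$1" | tr -d '\r\004' > "$2"
}
```

Deleting single bytes with tr is safe on UTF-8 because the bytes 0x0D and 0x04 never occur inside a multi-byte sequence. Only fold the EOT removal in if you are sure nothing downstream needs that character.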
As to how it got there in the first place? I blame the Belgians for letting it sneak into the data they gave the OECD, which they probably keyed in with cat - > file or some other similarly underwhelming means. Also, some text editors try to be a bit too helpful by hiding control characters, even though other tools will bail out when they see them as they think you just stuffed a binary file in that was pretending to be text for a while.
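On the "FoÄŤa and ÄŚajniÄŤe" entries from the question: that pattern is what you get when correct UTF-8 bytes are displayed as Windows-1250 (CP1250), so one plausible reading is that the conversion is fine and the program you open the file with is guessing the wrong encoding. I'm assuming the intended names are "Foča and Čajniče"; the sketch below (misread_as_cp1250 is a made-up name) reproduces the garbling from that guess, assuming your iconv knows CP1250 and the script itself is saved as UTF-8.

```shell
# Hypothetical helper: take UTF-8 text in $1, reinterpret its bytes as
# CP1250, and print what a viewer using that wrong encoding would show.
# Usage: misread_as_cp1250 'Foča and Čajniče'
misread_as_cp1250() {
    printf '%s' "$1" | iconv -f CP1250 -t UTF-8
}
```

If the output matches what you see on screen, try telling your editor or viewer explicitly to open the file as UTF-8 instead of letting it guess.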