dos2unix: Binary symbol 0x04 found at line 1703

dw8547 · Apr 28, 2015

I downloaded a file ('CRS 2013 data.txt') from the OECD at http://stats.oecd.org/Index.aspx?datasetcode=CRS1 by selecting Export -> Related files. I want to work with this file in Ubuntu (14.04 LTS).

When I run:

dos2unix CRS\ 2013\ data.txt

I see:

dos2unix: Binary symbol 0x0004 found at line 1703
dos2unix: Skipping binary file CRS 2013 data.txt

I check the encoding of the file with:

file --mime-encoding CRS\ 2013\ data.txt

and see:

CRS 2013 data.txt: utf-16le

I do:

iconv -l | grep utf-16le

which doesn't return anything so I do:

iconv -l | grep UTF-16LE

which returns:

UTF-16LE//

Then I run:

iconv --verbose -f UTF-16LE -t UTF-8 CRS\ 2013\ data.txt -o crs_2013_data_temp.txt

and check:

file --mime-encoding crs_2013_data_temp.txt

and see:

crs_2013_data_temp.txt: utf-8

Then I try:

dos2unix crs_2013_data_temp.txt

and get:

dos2unix: Binary symbol 0x04 found at line 1703
dos2unix: Skipping binary file crs_2013_data_temp.txt

I then try to force it:

dos2unix -f crs_2013_data_temp.txt

This works, i.e., dos2unix completes the conversion without bailing out or complaining, but when I open the file I see entries like "FoÄŤa and ÄŚajniÄŤe".

My question is: why? Is it because the BOM is not visible to dos2unix, or because it's missing? Have I done the conversion wrong? How do I convert this file correctly so that I can read it?

Answer

LThode · Apr 28, 2015

That 0x0004 character you are seeing in your file has nothing at all to do with the BOM (which is fine, by the way) -- it's an EOT (End of Transmission) character from the C0 control set, and has been at that codepoint since 7-bit ASCII was the new hotness. (It's also the familiar Control-D Unix EOF sequence.)
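If you want to see where the offending byte sits before deleting anything, something along these lines can list the affected lines. This is only a sketch: it assumes the file has already been converted to UTF-8 (or another ASCII-compatible encoding), and find_eot_lines is a made-up helper name, not an existing tool.

```shell
# Hypothetical helper: print the numbers of the lines that contain an
# EOT (0x04) byte. Assumes an ASCII-compatible encoding such as UTF-8.
find_eot_lines() {
    eot=$(printf '\004')   # the single control character to search for
    # LC_ALL=C makes grep work on raw bytes; -a treats the input as text
    # even if grep would otherwise classify it as binary.
    LC_ALL=C grep -a -n "$eot" "$1" | cut -d: -f1
}

# Demo on a throwaway sample with DOS line endings and one EOT on line 2.
sample=$(mktemp)
printf 'first line\r\nsecond\004 line\r\nthird line\r\n' > "$sample"
find_eot_lines "$sample"   # prints 2
```

On the converted file from the question, the call would be find_eot_lines crs_2013_data_temp.txt, and dos2unix's message suggests that line 1703 should be among the results.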

Unfortunately, the pre-dos2unix approach of stripping the carriage returns with tr won't work directly, because the file is UTF-16. Since iconv works for you, though, you can use it to convert the file to UTF-8 (which tr can handle) and then run this tr command:

tr -d '\r' < crs_2013_data_temp.txt > crs_2013_data_unix.txt

to get the text file into the Unix line-ending convention. Keep an eye on whatever tools you feed the file to, though, to make sure they don't choke on the Ctrl-D/EOT character; if they do, you can use

tr -d '\004' < crs_2013_data_unix.txt > crs_2013_data_clean.txt

to get rid of it.
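The conversion and both tr steps can also be chained into one pipeline. The following is a minimal sketch, assuming the input really is UTF-16LE as file reported; utf16_to_clean_unix is an illustrative name, not an existing command. Deleting single bytes with tr is safe on UTF-8 because 0x0D and 0x04 never occur inside a multi-byte sequence.

```shell
# Hypothetical helper: UTF-16LE in, UTF-8 with Unix line endings and no
# EOT characters out. $1 is the input file, $2 the output file.
utf16_to_clean_unix() {
    iconv -f UTF-16LE -t UTF-8 "$1" | tr -d '\r\004' > "$2"
}

# Demo: build a small UTF-16LE sample with CRLF endings, a non-ASCII
# letter (the UTF-8 bytes 0xC4 0x8D) and one EOT, then clean it.
src=$(mktemp)
dst=$(mktemp)
printf 'Fo\304\215a\r\nEOT\004 gone\r\n' | iconv -f UTF-8 -t UTF-16LE > "$src"
utf16_to_clean_unix "$src" "$dst"
od -c "$dst"   # inspect the result byte by byte
```

For the file in the question this would be utf16_to_clean_unix 'CRS 2013 data.txt' crs_2013_data_clean.txt.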

As for how it got there in the first place: I blame the Belgians for letting it sneak into the data they gave the OECD, which they probably keyed in with cat - > file or some other similarly underwhelming means. Also, some text editors try to be a bit too helpful by hiding control characters, even though other tools bail out when they see them, assuming you have slipped in a binary file that was pretending to be text.
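As for the garbled "FoÄŤa and ÄŚajniÄŤe" in the question: that pattern is consistent with correct UTF-8 being displayed by a viewer that decodes it as Windows-1250, so it may be a display setting rather than a botched conversion. The check below is a sketch that reproduces the effect on purpose; it assumes your iconv knows the CP1250 charset name, and misread_as_cp1250 is a made-up helper.

```shell
# Hypothetical helper: decode stdin as if it were Windows-1250 and
# re-encode it as UTF-8, i.e. deliberately misinterpret the input.
misread_as_cp1250() {
    iconv -f CP1250 -t UTF-8
}

# "Foča" in UTF-8: the letter č is the two bytes 0xC4 0x8D (octal 304 215).
# Read as CP1250, those two bytes become the two characters Ä and Ť.
printf 'Fo\304\215a\n' | misread_as_cp1250   # prints FoÄŤa
```

If that matches what you see, the file itself is probably fine, and telling the editor or viewer to open it as UTF-8 should show the names correctly.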