Remove junk characters from a utf-8 file in Unix

user2975559 · Jan 17, 2017 · Viewed 9.4k times

I'm getting junk characters (<9f>, <9d>, etc.), control characters (^Z, ^M, etc.) and NUL characters (^@) in a file. I was able to remove the control and NUL characters, but couldn't eliminate the junk characters. Could anyone suggest a way to remove them?

[Screenshot of the junk data]

Control characters are being removed using the following command:

sed 's/\x1a//g;s/\xef\xbf\xbd//g'
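For illustration, that command can be exercised on a made-up sample line (this assumes GNU sed, whose `\xNN` escapes are an extension; `\xef\xbf\xbd` is the byte sequence of the UTF-8 replacement character U+FFFD, and `LC_ALL=C` is set so sed matches raw bytes):

```shell
# Sample: "ab", a ^Z control byte (octal \032), "cd",
# the UTF-8 replacement character (\357\277\275), then "ef"
printf 'ab\032cd\357\277\275ef\n' |
  LC_ALL=C sed 's/\x1a//g;s/\xef\xbf\xbd//g'
# -> abcdef
```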

NUL characters are removed using this command:

tr -d '\000'

Also, please suggest a single command that removes all three types of garbage characters mentioned above.

Thanks in advance.

Answer

hansaplast · Jan 17, 2017

Strip "unusual" unicode characters

In the comments you mention that you want to strip out control characters while keeping the Greek characters, so the tr solution below does not suit. One option is sed, which offers Unicode support: its [[:alpha:]] class also matches alphabetic characters outside ASCII. You first need to set LC_CTYPE to specify which characters fall into the [[:alpha:]] range. For German with umlauts, that's e.g.

LC_CTYPE=de_DE.UTF-8

Then you can use sed to strip out everything that is not a letter or punctuation:

sed 's/[^[:alpha:];\ -@]//g' < junk.txt

What \ -@ does: it matches all characters in the ASCII range between space and @ (see an ASCII table). sed has a [[:punct:]] class, but unfortunately it also matches a lot of junk, so \ -@ is needed.

You may need to play around a little with LC_CTYPE: with it set to a UTF-8 locale I could match Greek characters, but not Japanese.
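As a small worked example (the input line and the stray ^Z byte are made up; ASCII letters count as [[:alpha:]] in any locale, so this particular sample behaves the same regardless of LC_CTYPE):

```shell
# Sample line with a stray ^Z control byte (octal \032)
printf 'hello;\032 world?\n' > junk.txt

# Keep letters (per LC_CTYPE), the semicolon, and the ASCII
# range from space to @; everything else is deleted
sed 's/[^[:alpha:];\ -@]//g' < junk.txt
# -> hello; world?
```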

If you only care about ASCII

If you only care about regular ASCII characters you can use tr. First convert the file to a "one byte per character" encoding, since tr does not understand multibyte characters, e.g. using iconv.

Then I'd advise a whitelist approach (as opposed to the blacklist approach in your question), as it's a lot easier to state what you want to keep than what you want to filter out.

This command should do it:

iconv -c -f utf-8 -t latin1 < junk.txt | tr -cd '\11\12\40-\176'

This pipeline does two things:

  • converts to Latin-1 (one byte per character); the -c flag makes iconv silently drop any character with no Latin-1 representation (codepoints above 255). Characters in the 128–255 range survive the conversion but are stripped by tr in the next step. Be aware that this also strips things like umlauts or other accented letters in your language which you might want to keep!
  • strips all characters which are outside this whitelist: \11\12\40-\176. The numbers are octal (see e.g. an ASCII table): \11 is tab, \12 is newline (line feed), and \40-\176 is all characters commonly considered "normal" (space through ~).
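Putting it together on a made-up sample (the accented é and the NUL byte are there just to show both filters at work; the filename is an assumption):

```shell
# Sample: ASCII text, an accented letter (é = bytes \303\251),
# a NUL byte, and a trailing !
printf 'caf\303\251 ok\000!\n' > junk.txt

# é survives the Latin-1 conversion (codepoint 233) but is then
# stripped by tr along with the NUL byte; tab, newline and
# printable ASCII pass through untouched
iconv -c -f utf-8 -t latin1 < junk.txt | tr -cd '\11\12\40-\176'
# -> caf ok!
```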