sed replacing ASCII characters in Linux

gaurus · Nov 12, 2015 · Viewed 8.1k times

I want to replace the ASCII/English characters in a file and keep the Unicode characters, in a Linux environment. Sample input:

INSERT INTO text (old_id,old_text,old_flags) VALUES (2815829,'[[चित्र:Youth-soccer-indiana.jpg|thumb|300px|right|बचपन का खेल.एसोसिएशन फुटबॉल, ऊपर दिखाया गया है, एक टीम खेल है जो सामाजिक कार्यों को भी प्रदान करता है।]]\n\n\'\'\'खेल\'\'\', कई [[नियमों]] एवं [[रिवाजों]] द्वारा संचालित होने वाली एक [[प्रतियोगी]] गतिविधि है। \'\'खेल\'\' 

I have tried

~$ sed 's/[^\u0900-\u097F]/ /g' hi.text

but I get

sed: -e expression #1, char 23: Invalid range end

I also tried this, and it seems to work, but not fully:

sed 's/[a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' enwiki-latest-pages-articles-multistream_3.sql  >result.txt

Can anyone tell me how to get sed working with a Unicode range regex?

Answer

Thomas Dickey · Nov 12, 2015

ASCII codes are in the range 0 to 127 inclusive. Within that range, 0-31 and 127 are control characters. Non-ASCII characters encoded as UTF-8 use only bytes in the range 128 to 255, so they never collide with ASCII.
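To make the byte ranges concrete, here is a small sketch that dumps the decimal value of each byte in a two-character string. The sample string and the variable name `bytes` are only illustrative, and it assumes the script is saved as UTF-8.

```shell
# Dump each byte of 'aख' as an unsigned decimal number.
# 'a' is one ASCII byte; 'ख' (U+0916) is three UTF-8 bytes, all above 127.
# tr and sed only normalize od's column padding into single spaces.
bytes=$(printf 'aख' | od -An -tu1 | tr -s ' \n' ' ' | sed -e 's/^ //' -e 's/ $//')
echo "$bytes"   # 97 224 164 150
```

Only the first byte falls in the ASCII range, which is why a pattern that matches ASCII bytes alone cannot damage the Devanagari text.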

Because sed is line-oriented, newline (code 10, control-J) is treated specially. Your file may also include tab (code 9) and carriage return (code 13). In practice, though, you likely only care about tabs and printable ASCII.

Tilde (~) is code 126, the last printable ASCII character (something handy to know); space, code 32, is the first.

So:

sed -e 's/[ -~\t]/ /g'

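where \t is ASCII tab (depending on the implementation, you may need a literal tab instead). This replaces every printable ASCII character and tab with a space, leaving newlines and UTF-8 bytes untouched. To delete those characters rather than blank them out, use an empty replacement.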
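A portable variant might look like the sketch below. It makes two assumptions that are not part of the command above: it forces the C locale so that the range ` -~` is interpreted by byte value rather than by the locale's collation order, and it builds a literal tab with printf for sed implementations that do not understand \t inside a bracket expression. The function name `strip_ascii` and the sample input are hypothetical.

```shell
# Literal tab character, for seds that do not accept \t in a bracket expression.
tab=$(printf '\t')

# Hypothetical helper: replace printable ASCII (space through tilde) and tab
# with a space; bytes 128-255 and newlines pass through unchanged.
strip_ascii() {
    LC_ALL=C sed -e "s/[ -~${tab}]/ /g"
}

# "abc " and "<tab>123" each become four spaces; the Devanagari word survives.
result=$(printf 'abc खेल\t123\n' | strip_ascii)
printf '[%s]\n' "$result"   # [    खेल    ]
```

To run it on a file, redirect as usual, for example `strip_ascii < hi.text > result.txt`.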