I want to replace the ASCII/English characters in a file and keep the Unicode characters, in a Linux environment.
INSERT INTO text (old_id,old_text,old_flags) VALUES (2815829,'[[चित्र:Youth-soccer-indiana.jpg|thumb|300px|right|बचपन का खेल.एसोसिएशन फुटबॉल, ऊपर दिखाया गया है, एक टीम खेल है जो सामाजिक कार्यों को भी प्रदान करता है।]]\n\n\'\'\'खेल\'\'\', कई [[नियमों]] एवं [[रिवाजों]] द्वारा संचालित होने वाली एक [[प्रतियोगी]] गतिविधि है। \'\'खेल\'\'
I have tried
~$ sed 's/[^\u0900-\u097F]/ /g' hi.text
but I get
sed: -e expression #1, char 23: Invalid range end
I also tried this, and it seems to work, but not fully:
sed 's/[a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' enwiki-latest-pages-articles-multistream_3.sql >result.txt
Can anyone tell me how to get sed working with the Unicode range regex?
ASCII codes are in the range 0 to 127 inclusive. From that range, 0-31 and 127 are control characters. Non-ASCII characters encoded as UTF-8 use only bytes from the range 128 to 255 inclusive.
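To see the byte ranges for yourself, here is a small sketch that dumps the bytes of one ASCII letter followed by one Devanagari letter. The sample string and the variable name bytes are my own choices for illustration, not something from the question.

```shell
# "a" followed by the Devanagari letter KHA (U+0916), written as octal escapes
# so the example does not depend on the terminal's encoding.
bytes=$(printf 'a\340\244\226' | od -An -tu1)

# Unquoted on purpose: word splitting collapses od's column padding.
echo $bytes
# prints: 97 224 164 150
```

The first value (97) is the ASCII letter; the three bytes of the Devanagari letter are all above 127.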
Because sed is line-oriented, newline (code 10, Control-J) is treated specially. Your file may also include tab (code 9) and carriage return (code 13), but in practice you likely only care about tabs and printable ASCII.
Tilde (~) is code 126 (something handy to know).
So:
sed -e 's/[ -~\t]/ /g'
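where \t is ASCII tab (depending on the implementation, you may need a literal tab instead) will replace every printable ASCII character, from space (code 32) through tilde (code 126), and every tab with a space, leaving newlines and UTF-8 bytes untouched. Use an empty replacement, s/[ -~\t]//g, if you want to delete them rather than blank them out.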
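As a rough check, the following sketch runs that idea on a made-up sample line. Two things here are my own assumptions rather than part of the command above: I set LC_ALL=C so that sed compares raw bytes and the range is not affected by the locale's collation, and I left \t out of the bracket since the sample has no tabs. The tr variant is just an alternative for deleting instead of replacing.

```shell
# Hypothetical sample: wiki markup and ASCII text around the Devanagari word
# for "sport" (three code points, nine UTF-8 bytes, written as octal escapes).
sample=$(printf '[[\340\244\226\340\245\207\340\244\262]] abc 123')

# Replace each byte in the range space..tilde (codes 32-126) with a space.
replaced=$(printf '%s\n' "$sample" | LC_ALL=C sed -e 's/[ -~]/ /g')

# Delete the same bytes (plus tab) instead of replacing them.
deleted=$(printf '%s\n' "$sample" | LC_ALL=C tr -d ' -~\t')

printf '[%s]\n' "$replaced"
printf '[%s]\n' "$deleted"
```

Both results keep the UTF-8 bytes intact; the only difference is whether the ASCII positions become spaces or disappear.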