I'm trying to extract a word list from a Russian short story.
#!/bin/sh
export LC_ALL=ru_RU.utf8
sed -re 's/\s+/\n/g' | \
sed 's/[\.!,—()«»;:?]//g' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq
However, the tr step is not lowercasing the Cyrillic capital letters. I thought I was being clever using the portable character classes!
$ LC_ALL=ru_RU.utf8 echo "Г" | tr [:upper:] [:lower:]
Г
In case it's relevant, I obtained the Russian text by copy-pasting from a Chrome browser window into Vim. It looks right on screen (in a PuTTY terminal). This is in Cygwin's bash shell, which should (should!) work identically to Bash on Linux.
What is a portable, reliable way to lowercase Unicode text in a pipe?
This is what I found at Wikipedia (without any reference, though):
Most versions of tr, including GNU tr and classic Unix tr, operate on single-byte characters and are not Unicode compliant. An exception is the Heirloom Toolchest implementation, which provides basic Unicode support.
Also, this is old but related.
As I mentioned in the comment, sed seems to work (GNU sed, at least):
$ echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/'
стэк
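Putting that together with the pipeline from the question, a rough sketch could look like the following. It simply swaps the tr step for a sed step using \L, so it assumes GNU sed (\L is a GNU extension) and a UTF-8 locale that is actually installed and active, such as the one exported in the question. The function name wordlist is made up for illustration, and I have not tried this under Cygwin.

```shell
#!/bin/sh
# Sketch only: the question's pipeline with the tr step replaced by GNU sed.
# Assumptions: GNU sed (for \L and \n in the replacement) and an active
# UTF-8 locale, e.g. the LC_ALL=ru_RU.utf8 export from the question.

# wordlist is a hypothetical helper name; it reads text on standard input
# and prints the unique lowercased words, one per line.
wordlist() {
    sed -E 's/[[:space:]]+/\n/g' |   # split on whitespace, one word per line
    sed 's/[.!,—()«»;:?]//g' |       # strip the punctuation listed in the question
    sed 's/.*/\L&/' |                # lowercase each whole line (GNU extension)
    sort -u                          # same effect as sort | uniq
}
```

With the locale exported first, the story can be piped through wordlist on standard input. If sed is not GNU sed, the \L step will not behave as shown here, so this is not the fully portable answer the question asks for.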