What is a good way to tell whether a string contains text in a Right To Left language.
I have found this question which suggests the following approach:
public bool IsArabic(string strCompare)
{
char[] chars = strCompare.ToCharArray();
foreach (char ch in chars)
if (ch >= '\u0627' && ch <= '\u0649') return true;
return false;
}
While this may work for Arabic this doesn't seem to cover other RTL languages such as Hebrew. Is there a generic way to know that a particular character belongs to a RTL language?
Unicode characters have different properties associated with them. These properties cannot be derived from the code point; you need a table that tells you if a character has a certain property or not.
You are interested in characters with bidirectional property "R" or "AL" (RandALCat).
A RandALCat character is a character with unambiguously right-to-left directionality.
Here's the complete list as of Unicode 3.2 (from RFC 3454):
D. Bidirectional tables D.1 Characters with bidirectional property "R" or "AL" ----- Start Table D.1 ----- 05BE 05C0 05C3 05D0-05EA 05F0-05F4 061B 061F 0621-063A 0640-064A 066D-066F 0671-06D5 06DD 06E5-06E6 06FA-06FE 0700-070D 0710 0712-072C 0780-07A5 07B1 200F FB1D FB1F-FB28 FB2A-FB36 FB38-FB3C FB3E FB40-FB41 FB43-FB44 FB46-FBB1 FBD3-FD3D FD50-FD8F FD92-FDC7 FDF0-FDFC FE70-FE74 FE76-FEFC ----- End Table D.1 -----
Here's some code to get the complete list as of Unicode 6.0:
var url = "http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt";
var query = from record in new WebClient().DownloadString(url).Split('\n')
where !string.IsNullOrEmpty(record)
let properties = record.Split(';')
where properties[4] == "R" || properties[4] == "AL"
select int.Parse(properties[0], NumberStyles.AllowHexSpecifier);
foreach (var codepoint in query)
{
Console.WriteLine(codepoint.ToString("X4"));
}
Note that these values are Unicode code points. Strings in C#/.NET are UTF-16 encoded and need to be converted to Unicode code points first (see Char.ConvertToUtf32). Here's a method that checks if a string contains at least one RandALCat character:
static void IsAnyCharacterRightToLeft(string s)
{
for (var i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
{
var codepoint = char.ConvertToUtf32(s, i);
if (IsRandALCat(codepoint))
{
return true;
}
}
return false;
}