How to detect whether a character belongs to a Right To Left language?

Patrick Klug picture Patrick Klug · Dec 2, 2010 · Viewed 13.7k times · Source

What is a good way to tell whether a string contains text in a Right To Left language.

I have found this question which suggests the following approach:

public bool IsArabic(string strCompare)
{
  char[] chars = strCompare.ToCharArray();
  foreach (char ch in chars)
    if (ch >= '\u0627' && ch <= '\u0649') return true;
  return false;
}

While this may work for Arabic this doesn't seem to cover other RTL languages such as Hebrew. Is there a generic way to know that a particular character belongs to a RTL language?

Answer

dtb picture dtb · Dec 2, 2010

Unicode characters have different properties associated with them. These properties cannot be derived from the code point; you need a table that tells you if a character has a certain property or not.

You are interested in characters with bidirectional property "R" or "AL" (RandALCat).

A RandALCat character is a character with unambiguously right-to-left directionality.

Here's the complete list as of Unicode 3.2 (from RFC 3454):

D. Bidirectional tables

D.1 Characters with bidirectional property "R" or "AL"

----- Start Table D.1 -----
05BE
05C0
05C3
05D0-05EA
05F0-05F4
061B
061F
0621-063A
0640-064A
066D-066F
0671-06D5
06DD
06E5-06E6
06FA-06FE
0700-070D
0710
0712-072C
0780-07A5
07B1
200F
FB1D
FB1F-FB28
FB2A-FB36
FB38-FB3C
FB3E
FB40-FB41
FB43-FB44
FB46-FBB1
FBD3-FD3D
FD50-FD8F
FD92-FDC7
FDF0-FDFC
FE70-FE74
FE76-FEFC
----- End Table D.1 -----

Here's some code to get the complete list as of Unicode 6.0:

var url = "http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt";

var query = from record in new WebClient().DownloadString(url).Split('\n')
            where !string.IsNullOrEmpty(record)
            let properties = record.Split(';')
            where properties[4] == "R" || properties[4] == "AL"
            select int.Parse(properties[0], NumberStyles.AllowHexSpecifier);

foreach (var codepoint in query)
{
    Console.WriteLine(codepoint.ToString("X4"));
}

Note that these values are Unicode code points. Strings in C#/.NET are UTF-16 encoded and need to be converted to Unicode code points first (see Char.ConvertToUtf32). Here's a method that checks if a string contains at least one RandALCat character:

static void IsAnyCharacterRightToLeft(string s)
{
    for (var i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        var codepoint = char.ConvertToUtf32(s, i);
        if (IsRandALCat(codepoint))
        {
            return true;
        }
    }
    return false;
}