String concatenation containing Arabic and Western characters

Carlos Ferreira · May 30, 2011 · Viewed 13.8k times

I'm trying to concatenate several strings, each of which mixes Arabic and Western characters. The result is most likely semantically correct, but it is not what I want, because the Unicode Bidirectional Algorithm alters the order in which the characters appear. Basically, I just want to concatenate the strings as if they were all LTR, ignoring the fact that some are RTL: a sort of "agnostic" concatenation.

I'm not sure if I was clear in my explanation, but I don't think I can do it any better.

Hope someone can help me.

Kind regards,

Carlos Ferreira

BTW, the strings are being obtained from the database.

EDIT

[Image: the two input strings and the concatenated result]

The first 2 Strings are the strings I want to concatenate and the third is the result.

EDIT 2

Actually, the concatenated String is a little different from the one in the image, because it got altered during the copy+paste: the 1 is after the first A, not immediately before the second A.

Answer

Mike Samuel · Jun 6, 2011

You can embed bidi regions using Unicode format control code points:

  • Left-to-right embedding (U+202A)
  • Right-to-left embedding (U+202B)
  • Pop directional formatting (U+202C)

So in Java, to embed an RTL language like Arabic in an LTR language like English, you would do

myEnglishString + "\u202B" + myArabicString + "\u202C" + moreEnglish

and to do the reverse (embed LTR text in an RTL string)

myArabicString + "\u202A" + myEnglishString + "\u202C" + moreArabic

See Bidirectional General Formatting for more details, or the Unicode specification chapter on "Directional Formatting Codes" for the source material.
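The embedding pattern above can be wrapped in a pair of small helpers so that the opening code point and the closing U+202C always stay paired. This is only a minimal sketch: the class name BidiEmbedding and the methods ltr and rtl are hypothetical names chosen for illustration, not part of any library, and the sample strings are made up.

```java
// Hypothetical helper class (name chosen for illustration only).
final class BidiEmbedding {
    // The three Unicode format control code points listed in the answer.
    static final char LRE = '\u202A'; // left-to-right embedding
    static final char RLE = '\u202B'; // right-to-left embedding
    static final char PDF = '\u202C'; // pop directional formatting

    private BidiEmbedding() {}

    // Marks text as a left-to-right region, e.g. English inside Arabic.
    static String ltr(String text) {
        return LRE + text + PDF;
    }

    // Marks text as a right-to-left region, e.g. Arabic inside English.
    static String rtl(String text) {
        return RLE + text + PDF;
    }

    public static void main(String[] args) {
        // Sample data (assumed): the Arabic word "marhaba", written with escapes.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        String joined = "Greeting: " + rtl(arabic) + " (id 1)";
        // How this is displayed depends on the bidi support of the output device.
        System.out.println(joined);
    }
}
```

Note that the control code points become part of the resulting String, so they count toward its length and will be present if the value is written back to the database.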