(Updated a little)
I'm not very experienced with internationalization using PHP, it must be said, and a deal of searching didn't really provide the answers I was looking for.
I need to work out a reliable way to convert only 'relevant' text to Unicode to send in an SMS message, using PHP (just temporarily, whilst the service is rewritten in C#). Obviously, messages at the moment are sent as plain text.
I could conceivably convert everything to the Unicode charset (as opposed to using the standard GSM charset), but that would mean that all messages would be limited to 70 characters (instead of 160).
So, I guess my real question is: what is the most reliable way to detect the requirement for a message to be Unicode-encoded, so I only have to do it when it's absolutely necessary (e.g. for non-Latin-language characters)?
Okay, so I've spent the morning working on this, and I'm still no further on than when I started (certainly due to my complete lack of competency when it comes to charset conversion). So here's the revised scenario:
I have text SMS messages coming from an external source, which provides the responses to me as plain text with Unicode slash-escaped characters. E.g. the 'displayed' text:
Let's test öäü éàè אין תמיכה בעברית
Returns:
Let's test \u00f6\u00e4\u00fc \u00e9\u00e0\u00e8 \u05d0\u05d9\u05df \u05ea\u05de\u05d9\u05db\u05d4 \u05d1\u05e2\u05d1\u05e8\u05d9\u05ea
Now, I can send on to my SMS provider in plaintext, GSM 03.38 or Unicode. Obviously, sending the above as plaintext results in a lot of missing characters (my provider replaces them with spaces), so I need to adapt depending on the content. What I want to do is the following:
If all text is within the GSM 03.38 codepage, send it as-is. (Everything above except the Hebrew characters fits into this category, but the escaped characters still need to be converted.)
Otherwise, convert it to Unicode, and send it over multiple messages (as the Unicode limit is 70 chars not 160 for an SMS).
As I said above, I'm stumped on doing this in PHP (C# wasn't much of an issue due to some simple conversion functions built-in), but it's quite probable I'm just missing the obvious, here. I couldn't find any pre-made conversion classes for 7-bit encoding in PHP, either - and my attempts to convert the string myself and send it on seemed futile.
Any help would be greatly appreciated.
To deal with it conceptually before getting into mechanisms (and apologies if any of this is obvious): a string can be defined as a sequence of Unicode characters, Unicode being a database that assigns an ID number, known as a code point, to every character you might need to work with. GSM 03.38 contains a subset of the Unicode characters, so what you're doing is extracting the set of code points from your string and checking whether that set is contained in GSM 03.38.
// Second column of http://unicode.org/Public/MAPPINGS/ETSI/GSM0338.TXT
// (the middle of the list is elided here)
$gsm338_codepoints = array(0x0040, 0x0000, /* ... */ 0x00fc, 0x00e0);
$can_use_gsm338 = true;
// Stop at the first code point that GSM 03.38 cannot represent
foreach (codepoints($mystring) as $codepoint) {
    if (!in_array($codepoint, $gsm338_codepoints)) {
        $can_use_gsm338 = false;
        break;
    }
}
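Rather than typing the array out by hand, it could be built from the mapping file itself. This is only a sketch: gsm338_codepoints_from_mapping() is a made-up name, and it assumes each data line in the file starts with the GSM value and the Unicode code point as two 0x-prefixed hex numbers separated by whitespace, with comment lines starting with '#'. The sample text is a tiny excerpt for illustration, not the real table.

```php
<?php
// Hypothetical helper: collect the Unicode code points (second column)
// from text laid out like GSM0338.TXT. Assumes data lines look like
// "0x00<whitespace>0x0040<whitespace># COMMERCIAL AT".
function gsm338_codepoints_from_mapping($mapping_text) {
    $codepoints = array();
    foreach (preg_split('/\R/', $mapping_text) as $line) {
        // Lines starting with '#' do not match and are skipped.
        if (preg_match('/^0x[0-9A-Fa-f]+\s+0x([0-9A-Fa-f]+)/', $line, $m)) {
            // Store the code point as an array key for fast isset() lookups.
            $codepoints[hexdec($m[1])] = true;
        }
    }
    return $codepoints;
}

// Two-row excerpt for illustration only; the real file has many more rows.
$sample = "# comment line\n"
        . "0x00\t0x0040\t#\tCOMMERCIAL AT\n"
        . "0x7E\t0x00FC\t#\tLATIN SMALL LETTER U WITH DIAERESIS\n";
$set = gsm338_codepoints_from_mapping($sample);
var_dump(isset($set[0x00FC])); // bool(true)
var_dump(isset($set[0x05D0])); // bool(false)
```

Because the code points end up as array keys here, the membership check would be isset($set[$codepoint]) instead of in_array(); array_keys($set) gives a plain list if that is preferred.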
That leaves the definition of the function codepoints($string), which isn't built into PHP. PHP understands a string to be a sequence of bytes rather than a sequence of Unicode characters. The best way of bridging the gap is to get your strings into UTF-8 as quickly as you can and keep them in UTF-8 as long as you can. You'll have to use other encodings when dealing with external systems, but isolate the conversion to the interface to that system and deal only with UTF-8 internally.
The functions you need to convert between php strings in utf8 and sequences of codepoints can be found at http://hsivonen.iki.fi/php-utf8/ , so that's your codepoints() function.
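If the mbstring extension happens to be available, codepoints() could also be sketched without an external library. This is an alternative to the functions linked above, not a description of them, and it assumes the input is already valid UTF-8.

```php
<?php
// Sketch of codepoints(): UTF-8 string in, array of integer code points out.
// Assumes the mbstring extension is loaded and the input is valid UTF-8.
function codepoints($utf8_string) {
    if ($utf8_string === '') {
        return array();
    }
    // UCS-4BE stores every character as one 32-bit big-endian integer.
    $ucs4 = mb_convert_encoding($utf8_string, 'UCS-4BE', 'UTF-8');
    // unpack('N*') reads those integers; array_values() reindexes from 0.
    return array_values(unpack('N*', $ucs4));
}

// "a", "ö" (U+00F6) and "א" (U+05D0), written as explicit UTF-8 bytes
print_r(codepoints("a\xC3\xB6\xD7\x90"));
```

The bytes are spelled out as escapes so the example does not depend on the encoding the source file is saved in.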
If you're taking data from an external source that gives you Unicode slash-escaped characters ("Let's test \u00f6\u00e4\u00fc..."), that string escape format should be converted to UTF-8. I don't know offhand of a function to do this. If one can't be found, it's a matter of string/regex processing plus the use of the hsivonen.iki.fi functions: for example, when you hit \u00f6, replace it with the UTF-8 representation of the code point 0xf6.
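The regex approach might look something like the sketch below. It uses mbstring for the conversion rather than the hsivonen.iki.fi functions, and unescape_unicode() is a made-up name. It also assumes the escapes are UTF-16 code units, as in JSON-style escaping, so a character outside the Basic Multilingual Plane would arrive as two consecutive escapes (a surrogate pair); I don't know whether your source actually does that.

```php
<?php
// Sketch: turn \uXXXX escapes into UTF-8. Assumes mbstring is available
// and that the escapes are UTF-16 code units (JSON-style escaping).
function unescape_unicode($text) {
    return preg_replace_callback(
        // One or more consecutive \uXXXX escapes, matched as a single run
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function ($match) {
            // Gather every hex value in this run of escapes.
            preg_match_all('/\\\\u([0-9a-fA-F]{4})/', $match[0], $parts);
            $utf16 = '';
            foreach ($parts[1] as $hex) {
                $utf16 .= pack('n', hexdec($hex)); // 16-bit big-endian unit
            }
            // Converting the whole run at once keeps surrogate pairs intact.
            return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
        },
        $text
    );
}

echo unescape_unicode('Let\'s test \u00f6\u00e4\u00fc \u05d0\u05d9\u05df'), "\n";
```

The result is a UTF-8 string, so it can go straight into the code point check before deciding between GSM 03.38 and Unicode.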