Dealing with eacute and other special characters using Oracle, PHP and Oci8

ddallala · Mar 1, 2010

Hi I am trying to store names into an Oracle database and fetch them back using PHP and oci8.

However, if I insert an é directly into the Oracle database and use oci8 to fetch it back, I just receive an e.

Do I have to encode all special characters (including é) into HTML entities (i.e. &eacute;) before inserting them into the database, or am I missing something?

Thx


UPDATE: Mar 1 at 18:40

found this function: http://www.php.net/manual/en/function.utf8-decode.php#85034

function charset_decode_utf_8($string) {
    if(@!ereg("[\200-\237]",$string) && @!ereg("[\241-\377]",$string)) {
        return $string;
    }
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e","'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",$string);
    $string = preg_replace("/([\300-\337])([\200-\277])/e","'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",$string);
    return $string;
}

This seems to work, although I'm not sure it's the optimal solution.
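For readers on newer PHP versions: the function above relies on ereg() and the /e regex modifier, both of which were removed in PHP 7. A minimal sketch of the same arithmetic using preg_replace_callback() might look like the following. The name charset_decode_utf_8_modern is made up for illustration, and like the original it only handles 2- and 3-byte UTF-8 sequences.

```php
<?php
// Hypothetical modern equivalent of charset_decode_utf_8(): converts 2- and
// 3-byte UTF-8 sequences into numeric HTML entities and leaves ASCII untouched.
function charset_decode_utf_8_modern(string $string): string
{
    // 3-byte sequences: lead byte 0xE0-0xEF followed by two continuation bytes.
    $string = preg_replace_callback(
        '/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/',
        function (array $m): string {
            $code = (ord($m[1]) - 224) * 4096
                  + (ord($m[2]) - 128) * 64
                  + (ord($m[3]) - 128);
            return '&#' . $code . ';';
        },
        $string
    );

    // 2-byte sequences: lead byte 0xC0-0xDF followed by one continuation byte.
    return preg_replace_callback(
        '/([\xC0-\xDF])([\x80-\xBF])/',
        function (array $m): string {
            $code = (ord($m[1]) - 192) * 64 + (ord($m[2]) - 128);
            return '&#' . $code . ';';
        },
        $string
    );
}

// "\xC3\xA9" is the UTF-8 encoding of é.
echo charset_decode_utf_8_modern("caf\xC3\xA9"), "\n";
```

Note that this still turns text into HTML entities, so it only sidesteps the encoding question rather than answering it.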


UPDATE: Mar 8 at 15:45

Oracle's character set is ISO-8859-1.
In PHP I added:

putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1");

to force the oci8 connection to use that character set. Retrieving the é using oci8 from PHP now works (for VARCHARs, but not for CLOBs; I had to use utf8_encode to extract those).
So then I tried saving the data from PHP to Oracle, and that doesn't work: somewhere along the way from PHP to Oracle the é becomes a ?


UPDATE: Mar 9 at 14:47

So getting closer. After adding the NLS_LANG variable, doing direct oci8 inserts with é works.

The problem is actually on the PHP side. The ExtJS framework encodes form values with encodeURIComponent when submitting a form.
So é is sent as %C3%A9 and then decoded back into é.
However, its length is now 2 (strlen($my_sent_value) = 2) and not 1, and in PHP the comparison $my_sent_value == 'é' evaluates to FALSE.

I think that if I can re-encode all these characters in PHP back into single bytes and then insert them into Oracle, it should work.

Still no luck though
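The length of 2 is what you would expect if the submitted value is UTF-8, since %C3%A9 is the two-byte UTF-8 encoding of é. As a rough sketch of the re-encoding idea described above, assuming the mbstring extension is available and the target really is ISO-8859-1:

```php
<?php
// What PHP ends up with after URL-decoding %C3%A9: the UTF-8 bytes of é.
$sent = rawurldecode('%C3%A9');

// strlen() counts bytes, not characters.
$utf8Length = strlen($sent);

// Convert to ISO-8859-1, where é is the single byte 0xE9.
$latin1 = mb_convert_encoding($sent, 'ISO-8859-1', 'UTF-8');
$latin1Length = strlen($latin1);

printf("UTF-8: %d bytes, ISO-8859-1: %d byte(s)\n", $utf8Length, $latin1Length);
```

The failing == comparison is presumably the same effect: if the PHP source file is saved as ISO-8859-1, the literal é in the script is one byte while the submitted value is two.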


UPDATE: Mar 10 at 11:05

I keep thinking I am so close (yet so far away).

putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9"); works very sporadically.

I created a small PHP script to test:

header('Content-Type: text/plain; charset=ISO-8859-1');
putenv("NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P9");
$conn= oci_connect("user", "pass", "DB");
$stmt = oci_parse($conn, "UPDATE temp_tb SET string_field = '|é|'");
oci_execute($stmt, OCI_COMMIT_ON_SUCCESS);

After running this once and logging into the Oracle database directly, I see that STRING_FIELD is set to |¿|. Obviously not what I had come to expect from my previous experience.
However, if I refresh that PHP page twice quickly, it works!
In Oracle I correctly see |é|.

It seems like maybe the environment variable is not being correctly set or sent in time for the first execution of the script, but is available for the second execution.

My next experiment is to export the variable into PHP's environment. I need to restart Apache for that, so we'll see what happens; hopefully it works.
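One alternative worth trying, sketched here under assumptions: oci_connect() accepts an optional fourth argument naming the client character set, which avoids depending on when putenv() takes effect. Whether this cures the sporadic behaviour described above is not confirmed. Both nls_charset and connect_with_charset are hypothetical helper names, and the credentials are the placeholders from the test script.

```php
<?php
// Hypothetical helper: returns the character-set part of an NLS_LANG value
// such as "AMERICAN_AMERICA.WE8ISO8859P1", or null if there is none.
function nls_charset(string $nlsLang): ?string
{
    $dot = strrpos($nlsLang, '.');
    if ($dot === false || $dot === strlen($nlsLang) - 1) {
        return null;
    }
    return substr($nlsLang, $dot + 1);
}

// Hypothetical wrapper: connects with an explicit client character set
// instead of relying on the NLS_LANG environment variable.
function connect_with_charset(string $user, string $pass, string $db, string $nlsLang)
{
    $charset = nls_charset($nlsLang);
    if ($charset === null) {
        throw new InvalidArgumentException("No character set in NLS_LANG value: $nlsLang");
    }
    // The fourth argument of oci_connect() is the client character set.
    return oci_connect($user, $pass, $db, $charset);
}
```

Usage would be along the lines of connect_with_charset("user", "pass", "DB", "AMERICAN_AMERICA.WE8ISO8859P1"), which requires the oci8 extension to be loaded.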

Answer

Álvaro González · Mar 3, 2010

I presume you are aware of these facts:

  • There are many different character sets: you have to pick one and, of course, know which one you are using.
  • Oracle is perfectly capable of storing text without HTML entities (&eacute;). HTML entities are used in, well, HTML. Oracle is not a web browser ;-)

You must also know that HTML entities are not bound to a specific charset; on the contrary, they're used to represent characters in a charset-independent context.
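A small illustration of that charset independence, as a sketch: the same character stored as different bytes in two charsets produces the same entity, because the entity names the character rather than its byte representation.

```php
<?php
// The same character, é, as raw bytes in two different charsets.
$utf8   = "\xC3\xA9"; // UTF-8: two bytes
$latin1 = "\xE9";     // ISO-8859-1: one byte

// htmlentities() needs to be told which charset the input bytes are in.
$fromUtf8   = htmlentities($utf8, ENT_QUOTES, 'UTF-8');
$fromLatin1 = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');

echo $fromUtf8, "\n";
echo $fromLatin1, "\n";
```

This is also why entities belong at the point where HTML is generated, not in the database column.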

You mention ISO-8859-1 and UTF-8 interchangeably. Which charset do you want to use? ISO-8859-1 is easy to use, but it can only store text in some Latin-script languages (such as Spanish) and it lacks some common characters like the € symbol. UTF-8 is trickier to use, but it can store all characters defined by the Unicode Consortium (which includes everything you'll ever need).
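To make the trade-off concrete, here is a minimal sketch that checks whether a UTF-8 string survives a trip through ISO-8859-1. It assumes the mbstring extension is available; fits_in_latin1 is a hypothetical helper name.

```php
<?php
// Hypothetical helper: true if every character of a UTF-8 string can be
// represented in ISO-8859-1, checked by converting there and back.
function fits_in_latin1(string $utf8): bool
{
    $latin1    = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    $roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    return $roundTrip === $utf8;
}

var_dump(fits_in_latin1("Jos\xC3\xA9"));    // contains é (U+00E9)
var_dump(fits_in_latin1("5 \xE2\x82\xAC")); // contains € (U+20AC)
```

Characters that do not fit get replaced during conversion, which is one way a value can end up as ? or ¿ on its way into a single-byte column.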

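Once you've made the decision, you must configure Oracle to hold data in that charset and choose an appropriate column type. E.g., VARCHAR2 stores text in the database character set (fine for plain ASCII, or for ISO-8859-1 if the database uses WE8ISO8859P1), while NVARCHAR2 stores text in the national character set, which in current Oracle versions is a Unicode encoding.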