PostgreSQL 9.1 using collate in select statements

Henri picture Henri · Oct 17, 2011 · Viewed 8.4k times · Source

I have a postgresql 9.1 database table, "en_US.UTF-8":

CREATE TABLE branch_language
(
    id serial NOT NULL,
    name_language character varying(128) NOT NULL,
    branch_id integer NOT NULL,
    language_id integer NOT NULL,
    ....
)

The attribute name_language contains names in various languages. The language is specified by the foreign key language_id.

I have created a few indexes:

/* us english */
CREATE INDEX idx_branch_language_2
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."en_US" );

/* catalan */
CREATE INDEX idx_branch_language_5
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."ca_ES" );

/* portuguese */
CREATE INDEX idx_branch_language_6
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."pt_PT" );

Now when I do a select I am not getting the results I am expecting.

select name_language from branch_language
where language_id=42 -- id of catalan language
order by name_language collate "ca_ES" -- use ca_ES collation

This generates a list of names but not in the order I expected:

Aficions i Joguines
Agència de viatges
Aliments i Subministraments
Aparells elèctrics i il luminació
Art i Antiguitats
Articles de la llar
Bars i Restaurants
...
Tabac
Àudio, Vídeo, CD i DVD
Òptica

As I expected the last two entries to appear in different positions in the list.

Creating the indexes works. I don't think they are really necessary unless you want to optimize for performance.

The select statement however seems to ignore the part: collate "ca_ES".

This problem also exists when I select other collations. I have tried "es_ES" and "pt_PT" but the results are similar.

Answer

Erwin Brandstetter picture Erwin Brandstetter · Oct 20, 2011

I can't find a flaw in your design. I have tried.

Locales and collation

I revisited this question. Consider this test case on sqlfiddle. It seems to work just fine. I even created the locale ca_ES.utf8 in my local test server (PostgreSQL 9.1.6 on Debian Squeeze) and added the locale to my DB cluster:

CREATE COLLATION "ca_ES" (LOCALE = 'ca_ES.utf8');

I get the same results as can be seen in the sqlfiddle above.

Note that collation names are identifiers and need to be double-quoted to preserve CamelCase spelling like "ca_ES". Maybe there has been some confusion with other locales in your system? Check your available collations:

SELECT * FROM pg_collation;

Generally, collation rules are derived from system locales. Read about the details in the manual here. If you still get incorrect results, I would try to update your system and regenerate the locale for "ca_ES". In Debian (and related Linux distributions) this can be done with:

dpkg-reconfigure locales

NFC

I have one other idea: unnormalized UNICODE strings.

Could it be that your 'Àudio' is in fact '̀ ' || 'Audio'? That would be this character:

SELECT U&'\0300A';
SELECT ascii(U&'\0300A');
SELECT chr(768);

Read more about the acute accent in wikipedia.
You have to SET standard_conforming_strings = TRUE to use Unicode strings like in the first line.

Note that some browsers cannot display unnormalized Unicode characters correctly and many fonts have no proper glyph for the special characters, so you may see nothing here or gibberish. But UNICODE allows for that nonsense. Test to see what you got:

SELECT octet_length('̀A')  -- returns 3 (!)
SELECT octet_length('À')  -- returns 2

If that's what your database has contracted, you need to get rid of it or suffer the consequences. The cure is to normalize your strings to NFC. Perl has superior UNICODE-foo skills, you can make use of their libraries in a plperlu function to do it in PostgreSQL. I have done that to save me from madness.

Read installation instructions in this excellent article about UNICODE normalization in PostgreSQL by David Wheeler.
Read all the gory details about Unicode Normalization Forms at unicode.org.