Humanized or natural number sorting of mixed word-and-number strings

Craig Ringer · Oct 19, 2012

Following up on this question by Sivaram Chintalapudi, I'm interested in whether it's practical in PostgreSQL to do natural - or "humanized" - sorting of strings that contain a mixture of multi-digit numbers and words/letters. There is no fixed pattern of words and numbers in the strings, and there may be more than one multi-digit number in a string.

The only place I've seen this done routinely is in the Mac OS's Finder, which sorts filenames containing mixed numbers and words naturally, placing "20" after "3", not before it.

The collation order desired would be produced by an algorithm that split each string into blocks at letter-number boundaries, then ordered each part, treating letter-blocks with normal collation and number-blocks as integers for collation purposes. So:

'AAA2fred' would become ('AAA',2,'fred') and 'AAA10bob' would become ('AAA',10,'bob'). These can then be sorted as desired:

regress=# WITH dat AS ( VALUES ('AAA',2,'fred'), ('AAA',10,'bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
     dat      
--------------
 (AAA,2,fred)
 (AAA,10,bob)
(2 rows)

as compared to the usual string collation ordering:

regress=# WITH dat AS ( VALUES ('AAA2fred'), ('AAA10bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
    dat     
------------
 (AAA10bob)
 (AAA2fred)
(2 rows)

However, the record comparison approach doesn't generalize because Pg won't compare ROW(..) constructs or records of unequal numbers of entries.
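To make that limitation concrete, here is a minimal sketch (the column aliases are only illustrative). The first comparison works because both rows have three fields; the second should be rejected with an error because the rows differ in length, rather than the shorter row being treated as a prefix:

```sql
-- Same number of fields: compared left to right, so 2 < 10 decides it.
SELECT ROW('AAA', 2, 'fred') < ROW('AAA', 10, 'bob') AS same_arity;

-- Different number of fields: expected to fail at parse time,
-- because row comparison requires equal numbers of entries.
SELECT ROW('AAA', 2, 'fred') < ROW('AAA', 10) AS different_arity;
```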

Given the sample data in this SQLFiddle the default en_AU.UTF-8 collation produces the ordering:

1A, 10A, 2A, AAA10B, AAA11B, AAA1BB, AAA20B, AAA21B, X10C10, X10C2, X1C1, X1C10, X1C3, X1C30, X1C4, X2C1

but I want:

1A, 2A, 10A, AAA1BB, AAA10B, AAA11B, AAA20B, AAA21B, X1C1, X1C3, X1C4, X1C10, X1C30, X2C1, X10C2, X10C10

I'm working with PostgreSQL 9.1 at the moment, but 9.2-only suggestions would be fine. I'm interested in advice on how to achieve an efficient string-splitting method, and how to then compare the resulting split data in the alternating string-then-number collation described. Or, of course, on entirely different and better approaches that don't require splitting strings.

PostgreSQL doesn't seem to support comparator functions, otherwise this could be done fairly easily with a recursive comparator and something like ORDER USING comparator_fn and a comparator(text,text) function. Alas, that syntax is imaginary.

Update: Blog post on the topic.

Answer

Erwin Brandstetter · Oct 19, 2012

This builds on your test data, but it works with arbitrary data and with any number of elements in the string.
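The fiddle's contents are not reproduced here. As a rough reconstruction, a table matching the names used in the queries in this answer (alnum, data) and the values listed in the question could be set up like this. The single text column and the absence of a primary key are assumptions:

```sql
-- Assumed schema: one text column, no primary key.
CREATE TABLE alnum (data text);

-- Values taken from the orderings listed in the question.
INSERT INTO alnum (data) VALUES
  ('1A'), ('10A'), ('2A'),
  ('AAA10B'), ('AAA11B'), ('AAA1BB'), ('AAA20B'), ('AAA21B'),
  ('X10C10'), ('X10C2'),
  ('X1C1'), ('X1C10'), ('X1C3'), ('X1C30'), ('X1C4'),
  ('X2C1');
```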

Register a composite type made up of one text and one integer value once per database. I call it ai:

CREATE TYPE ai AS (a text, i int);

The trick is to form an array of ai from each value in the column.
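Why an array helps: unlike ROW values of different lengths, arrays of different lengths can be compared, element by element. A small sketch, assuming the ai type from above has been created (the column aliases are made up for illustration):

```sql
-- Requires: CREATE TYPE ai AS (a text, i int);

-- The first elements differ in the integer part: 2 sorts before 10.
SELECT ARRAY[ROW('AAA', 2)::ai, ROW('fred', 0)::ai]
     < ARRAY[ROW('AAA', 10)::ai, ROW('bob', 0)::ai] AS two_before_ten;

-- Arrays of unequal length are still comparable:
-- with an equal prefix, the shorter array sorts first.
SELECT ARRAY[ROW('X', 1)::ai]
     < ARRAY[ROW('X', 1)::ai, ROW('C', 10)::ai] AS prefix_first;
```

Both comparisons should return true.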

regexp_matches() with the pattern (\D*)(\d*) and the g option returns one row for every combination of letters and numbers, plus one irrelevant dangling row with two empty strings, '{"",""}'. Filtering or suppressing it would just add cost. Aggregate this into an array, after replacing empty strings ('') with 0 in the integer component (as '' cannot be cast to integer).
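To see the intermediate rows for a single value, something like the following can be run on its own. It assumes standard_conforming_strings is on (the default since 9.1), so the backslashes need no doubling, and that the ai type exists:

```sql
-- One row per letters/digits pair, plus the dangling all-empty row.
SELECT x, x[1] AS letters, x[2] AS digits
FROM   regexp_matches('AAA10bob', '(\D*)(\d*)', 'g') AS x;

-- The same rows after replacing '' with '0' and casting to ai.
SELECT ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai AS part
FROM   regexp_matches('AAA10bob', '(\D*)(\d*)', 'g') AS x;
```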

NULL values sort first, unless you special-case them or wrap the whole shebang in a STRICT function like @Craig proposes.
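A wrapper along those lines might look like the sketch below. The function name natural_sort_key is hypothetical and this is not necessarily the function @Craig wrote. It assumes the ai type exists and a Postgres version where the ARRAY(SELECT ...) form shown in the next section works:

```sql
-- Hypothetical helper: returns NULL for NULL input because of STRICT.
CREATE OR REPLACE FUNCTION natural_sort_key(text)
  RETURNS ai[]
  LANGUAGE sql IMMUTABLE STRICT AS
$$
SELECT ARRAY(SELECT ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai
             FROM   regexp_matches($1, '(\D*)(\d*)', 'g') x)
$$;

-- NULL keys now follow the usual ORDER BY rules,
-- so their position can be set explicitly.
SELECT data
FROM   alnum
ORDER  BY natural_sort_key(data) NULLS LAST, data;
```

Like the query it wraps, this casts each run of digits to int, so very long digit runs would overflow.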

Postgres 9.4 or later

SELECT data
FROM   alnum
ORDER  BY ARRAY(SELECT ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai
                FROM regexp_matches(data, '(\D*)(\d*)', 'g') x)
        , data;

db<>fiddle here

Postgres 9.1 (original answer)

Tested with PostgreSQL 9.1.5, where regexp_replace() had a slightly different behavior.

SELECT data
FROM  (
    SELECT ctid, data, regexp_matches(data, '(\D*)(\d*)', 'g') AS x
    FROM   alnum
    ) x
GROUP  BY ctid, data   -- ctid as stand-in for a missing pk
ORDER  BY regexp_replace (left(data, 1), '[0-9]', '0')
        , array_agg(ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai)
        , data         -- for special case of trailing 0

The first ORDER BY item, regexp_replace (left(data, 1), '[0-9]', '0'), takes care of leading digits and empty strings.

If special characters like {}()"', can occur, you'd have to escape those accordingly.
@Craig's suggestion to use a ROW expression takes care of that.

BTW, this won't execute in sqlfiddle, but it does in my db cluster. JDBC is not up to it. sqlfiddle complains:

Method org.postgresql.jdbc3.Jdbc3Array.getArrayImpl(long,int,Map) is not yet implemented.

This has since been fixed: http://sqlfiddle.com/#!17/fad6e/1