What regular expression can remove duplicate items from a string?

Tom picture Tom · Jul 22, 2010 · Viewed 17k times · Source

Given a string of identifiers separated by :, is it possible to construct a regular expression to extract the unique identifiers into another string, also separated by :?

How is it possible to achieve this using a regular expression? I have tried s/(:[^:])(.*)\1/$1$2/g with no luck, because the (.*) is greedy and skips to the last match of $1.

Example: a:b:c:d:c:c:x:c:c:e:e:f should give a:b:c:d:x:e:f

Note: I am coding in perl, but I would very much appreciate using a regex for this.

Answer

Tim Pietzcker picture Tim Pietzcker · Jul 22, 2010

In .NET which supports infinite repetition inside lookbehind, you could search for

(?<=\b\1:.*)\b(\w+):?

and replace all matches with the empty string.

Perl (at least Perl 5) only supports fixed-length lookbehinds, so you can try the following (using lookahead, with a subtly different result):

\b(\w+):(?=.*\b\1:?)

If you replace that with the empty string, all previous repetitions of a duplicate entry will be removed; the last one will remain. So instead of

a:b:c:d:x:e:f

you would get

a:b:d:x:c:e:f

If that is OK, you can use

$subject =~ s/\b(\w+):(?=.*\b\1:?)//g;

Explanation:

First regex:

(?<=\b\1:.*): Check if you can match the contents of backreference no. 1, followed by a colon, somewhere before in the string.

\b(\w+):?: Match an identifier (from a word boundary to the next :), optionally followed by a colon.

Second regex:

\b(\w+):: Match an identifier and a colon.

(?=.*\b\1:?): Then check whether you can match the same identifier, optionally followed by a colon, somewhere ahead in the string.