RegEx to split camelCase or TitleCase (advanced)

Jmini · Sep 29, 2011

I found a brilliant RegEx to extract the parts of a camelCase or TitleCase expression.

 (?<!^)(?=[A-Z])

It works as expected:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value

For example with Java:

String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words is now {"lorem", "Ipsum"}

My problem is that it does not work in some cases:

  • Case 1: VALUE -> V / A / L / U / E
  • Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext
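A minimal reproduction of both cases, in case it helps. The class name `OriginalRegexDemo` and the helper `split` are only illustrative names, not part of any library:

```java
import java.util.Arrays;

class OriginalRegexDemo {
    // The original pattern: split before every uppercase letter
    // except the one at the very start of the string.
    static String[] split(String s) {
        return s.split("(?<!^)(?=[A-Z])");
    }

    public static void main(String[] args) {
        // Every uppercase letter in a run becomes its own group.
        System.out.println(Arrays.toString(split("VALUE")));         // [V, A, L, U, E]
        System.out.println(Arrays.toString(split("eclipseRCPExt"))); // [eclipse, R, C, P, Ext]
    }
}
```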

To my mind, the result should be:

  • Case 1: VALUE
  • Case 2: eclipse / RCP / Ext

In other words, given n uppercase chars:

  • if the n chars are followed by lower case chars, the groups should be: (n-1 chars) / (n-th char + lower chars)
  • if the n chars are at the end, the group should be: (n chars).

Any idea on how to improve this regex?

Answer

NPE · Sep 29, 2011

The following regex works for all of the above examples:

public static void main(String[] args)
{
    for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
        System.out.println(w);
    }
}   

The first clause works by making the negative lookbehind reject not only a match at the start of the string, but also a match where the capital letter is preceded by another capital letter. This handles cases like "VALUE".
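To see what the first clause does in isolation, here is a small sketch that splits with only that alternative. The class name `FirstClauseDemo` and the constant `FIRST_CLAUSE` are made up for the illustration:

```java
import java.util.Arrays;

class FirstClauseDemo {
    // First alternative only: split before an uppercase letter unless it is
    // at the start of the string or directly preceded by another uppercase letter.
    static final String FIRST_CLAUSE = "(?<!(^|[A-Z]))(?=[A-Z])";

    static String[] split(String s) {
        return s.split(FIRST_CLAUSE);
    }

    public static void main(String[] args) {
        // A run of capitals now stays together...
        System.out.println(Arrays.toString(split("VALUE")));         // [VALUE]
        // ...but so does the capitalized word that follows the run.
        System.out.println(Arrays.toString(split("eclipseRCPExt"))); // [eclipse, RCPExt]
    }
}
```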

The first part of the regex on its own fails on "eclipseRCPExt" because it does not split between "RCP" and "Ext". This is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.
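For completeness, a sketch that runs the combined regex over all five inputs from the question. The class name `CamelCaseSplitDemo` and the constant `SPLIT_REGEX` are illustrative names only:

```java
class CamelCaseSplitDemo {
    // Clause 1: split before a capital that is neither at the start of the
    //           string nor preceded by another capital.
    // Clause 2: split before a capital followed by a lowercase letter,
    //           except at the start of the string.
    static final String SPLIT_REGEX =
        "(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])";

    static String[] split(String s) {
        return s.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        String[] inputs = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String input : inputs) {
            System.out.println(input + " -> " + String.join(" / ", split(input)));
        }
    }
}
```

This should print:

value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCP / Ext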