removing invalid XML characters from a string in java

yossi picture yossi · Nov 21, 2010 · Viewed 78.5k times · Source

Hi i would like to remove all invalid XML characters from a string. i would like to use a regular expression with the string.replace method.

like

line.replace(regExp,"");

what is the right regExp to use ?

invalid XML character is everything that is not this :

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

thanks.

Answer

McDowell picture McDowell · Nov 21, 2010

Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars.

Here is the pattern for removing characters that are illegal in XML 1.0:

// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
                    + "\u0009\r\n"
                    + "\u0020-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]";

Most people will want the XML 1.0 version.

Here is the pattern for removing characters that are illegal in XML 1.1:

// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
                    + "\u0001-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]+";

You will need to use String.replaceAll(...) and not String.replace(...).

String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(pattern, "");