Regex to remove xml declaration from a string

xan · Nov 8, 2010 · Viewed 11.4k times

First of all, I know this is a bad solution and I shouldn't be doing this.

Background: Feel free to skip


However, I need a quick fix for a live system. We currently have a data structure which serialises itself to a string by creating "xml" fragments via a series of string builders. Whether this is valid XML I rather doubt. After creating this xml, and before sending it over a message queue, some clean-up code searches the string for occurrences of the xml declaration and removes them.

The way this is done (iterate every character doing IndexOf for the <?xml) is so slow it's causing thread timeouts and killing our systems. Ultimately I'll be trying to fix this properly (build the XML using XML documents or something similar), but for today I need a quick fix to replace what's there.

Please bear in mind, I know this is a far from ideal solution, but I need a quick fix to get us back up and running.


Question

My thought is to use a regex to find the declarations. I was planning on: <\?xml.*?>, then using Regex.Replace(input, string.Empty) to remove them.

Could you let me know if there are any glaring problems with this regex, or whether just writing it in code using string.IndexOf("<?xml") and string.IndexOf("?>") pairs in a (much saner) loop is better?

EDIT I need to take care of newlines.

Would: <\?xml[^>]*?> do the trick?
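On the newline point, a minimal sketch, assuming .NET's System.Text.RegularExpressions (the class name and the sample input are made up for illustration). By default `.` does not match `\n`, so a dot-based pattern misses a declaration that wraps across lines unless RegexOptions.Singleline is set, whereas a negated class such as `[^>]` matches newlines without any option:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineDemo
{
    // Hypothetical sample: a declaration that wraps across two lines.
    public const string Input =
        "<?xml version=\"1.0\"\n encoding=\"utf-8\"?><item>1</item>";

    // '.' does not match '\n' by default, so the wrapped declaration is left in place.
    public static string DotDefault(string s) =>
        Regex.Replace(s, @"<\?xml.*?\?>", string.Empty);

    // RegexOptions.Singleline makes '.' match '\n' as well.
    public static string DotSingleline(string s) =>
        Regex.Replace(s, @"<\?xml.*?\?>", string.Empty, RegexOptions.Singleline);

    // A negated character class matches newlines without any option.
    public static string NegatedClass(string s) =>
        Regex.Replace(s, @"<\?xml[^>]*?>", string.Empty);
}
```

With this input, DotDefault returns the string unchanged, while DotSingleline and NegatedClass both return `<item>1</item>`.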

EDIT2

Thanks for the help. Regex-wise, <\?xml.*?\?> worked fine. I ended up writing some timing code and testing both a regex and IndexOf(). I found that, for our simplest use case, JUST the declaration stripping took:

  • Nearly a second as it was
  • 0.01 of a second with the regex
  • too fast to time using a loop and IndexOf()

So I went for IndexOf() as it's a very simple loop.
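The actual loop isn't shown here, so the following is only a sketch of what an IndexOf()-based version might look like. The class and method names are hypothetical, and it assumes every declaration ends with `?>`; an unterminated one is left untouched:

```csharp
using System;
using System.Text;

public static class DeclarationStripper
{
    // Hypothetical helper: removes every "<?xml ... ?>" span in a single pass.
    public static string Strip(string input)
    {
        const string open = "<?xml";
        const string close = "?>";
        var result = new StringBuilder(input.Length);
        int pos = 0;

        while (true)
        {
            int start = input.IndexOf(open, pos, StringComparison.Ordinal);
            if (start < 0) break;

            int end = input.IndexOf(close, start + open.Length, StringComparison.Ordinal);
            if (end < 0) break; // unterminated declaration: keep the rest as-is

            result.Append(input, pos, start - pos); // text before the declaration
            pos = end + close.Length;               // resume after "?>"
        }

        result.Append(input, pos, input.Length - pos); // trailing text
        return result.ToString();
    }
}
```

Each character is visited a bounded number of times and the output is built in one StringBuilder, which avoids the repeated string copies that a remove-in-place loop would incur. Note that, like the regex, this also matches anything else starting with `<?xml`, such as an `<?xml-stylesheet ...?>` instruction.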

Answer

Jordi · Nov 8, 2010

You probably want either this: <\?xml.*\?> or this: <\?xml.*?\?>, because the way you have it now, the regex is not looking for '?>' but just for '>'. I don't think you want the first option, because it's greedy and it will remove everything between the first occurrence of '<?xml' and the last occurrence of '?>'. The second option will work as long as you don't have nested XML-tags. If you do, it will remove everything between the first '<?xml' and the first '?>', even if that '?>' belongs to another '<?...?>' tag.
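To make the greedy versus lazy difference concrete, here is a small sketch using .NET's Regex.Replace. The class name and the input are invented for illustration: a fragment with one declaration at the start and a second one nested inside an element.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical fragment: a leading declaration plus one nested inside <a>.
    public const string Input =
        "<?xml version=\"1.0\"?><a><?xml version=\"1.0\"?><b/></a>";

    // Greedy: ".*" runs to the last "?>" on the line, swallowing "<a>" too.
    public static string Greedy(string s) =>
        Regex.Replace(s, @"<\?xml.*\?>", string.Empty);

    // Lazy: ".*?" stops at the nearest "?>", so each declaration goes separately.
    public static string Lazy(string s) =>
        Regex.Replace(s, @"<\?xml.*?\?>", string.Empty);
}
```

With this input, Greedy returns `<b/></a>` (the opening `<a>` is lost), while Lazy returns `<a><b/></a>`.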

Also, I don't know how regexes are implemented in .NET, but I seriously doubt they're faster than using IndexOf.