Replace cdata tags in xml

Toniq · Jul 5, 2013

I have a string that may contain multiple CDATA tags, for example:

<![CDATA[A Survey of Applications of Identity-Based Cryptography in Mobile Ad-Hoc Networks]]>

I am using JavaScript/jQuery and I need to remove all of the CDATA tags (replace them with "").

How could I write this with a regular expression?

Answer

Ethan Brown · Jul 5, 2013

@Jim Garrison is correct in his comment above: there is no way to robustly parse XML/HTML with regular expressions. Both languages allow arbitrarily nested structure, which a regular expression cannot describe in general.

However, that doesn't mean you can't write a regex that will cover most reasonable cases, which is probably going to be sufficient for your needs. For example, the following JavaScript regex will mostly do what you want:

input.replace( /<!\[CDATA\[.*?\]\]>/g, '' );

Two things are worth noting in this regular expression. First, the wildcard (.*?) that matches the CDATA body is made lazy with the ? modifier, so it stops at the first ]]> it finds. Without that, the following bad thing would happen:

'before <![CDATA[blah]]> some stuff between <![CDATA[another cdata]]> after'
    .replace( /<!\[CDATA\[.*\]\]>/g, '' );
// returns "before  after" when we probably 
// wanted "before  some stuff between  after"

The other thing is the g flag, which indicates that all matches should be replaced. Without it, only the first match is replaced.
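
One case the pattern above does not cover: in JavaScript, . does not match line terminators, so a CDATA section whose body spans several lines is left untouched. Here is a minimal sketch of a variant that handles this by swapping . for [\s\S]. The function name removeCdataSections is made up for illustration and is not part of the original answer:

```javascript
// Hypothetical helper (an assumption, not from the original answer):
// removes every CDATA section, including ones that span multiple lines.
function removeCdataSections(input) {
  // [\s\S] matches any character, newlines included.
  // *? keeps the match lazy so it stops at the first "]]>".
  // The g flag replaces every occurrence, not just the first.
  return input.replace(/<!\[CDATA\[[\s\S]*?\]\]>/g, '');
}

const sample = 'before <![CDATA[line one\nline two]]> between <![CDATA[x]]> after';
console.log(JSON.stringify(removeCdataSections(sample)));
// prints "before  between  after"
```

The same caveat applies as before: this is still pattern matching, not parsing, so it only covers the reasonable cases.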

Reading over the comments, it looks like you may want to simply strip out the CDATA tags while leaving their contents intact. As @Jim Garrison points out above, this is a bad idea because you could easily be left with invalid HTML: the whole point of CDATA is that its contents may include characters such as < and & without being treated as markup. But if you do want to do that, here's how:

'outside <![CDATA[(cdata1)]]> inside <![CDATA[(cdata2)]]> after'
    .replace( /<!\[CDATA\[(.*?)\]\]>/g, '$1' );
// yields "outside (cdata1) inside (cdata2) after"
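
If the unwrapped text is going back into XML or HTML, the invalid-markup risk can be reduced by escaping the characters that the CDATA section was protecting. This is a rough sketch, assuming that escaping &, < and > is enough for your context. The names escapeMarkup and unwrapCdata are hypothetical, and [\s\S] is used instead of . so that multi-line bodies are matched too:

```javascript
// Hypothetical helper (an assumption): escapes the characters that are
// special in XML/HTML text content. & must be handled first so that the
// entities produced by the later replacements are not escaped twice.
function escapeMarkup(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper (an assumption): replaces each CDATA section with
// its escaped body. A replacer function is used rather than '$1' so the
// body can be transformed before it is inserted.
function unwrapCdata(input) {
  return input.replace(
    /<!\[CDATA\[([\s\S]*?)\]\]>/g,
    (match, body) => escapeMarkup(body)
  );
}

console.log(unwrapCdata('<title><![CDATA[Ad-Hoc & Mobile <Networks>]]></title>'));
// prints <title>Ad-Hoc &amp; Mobile &lt;Networks&gt;</title>
```

A replacer function also avoids surprises when a body happens to contain sequences such as $1, which would be interpreted specially in a replacement string.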