I am trying to write a regular expression to strip all HTML with the exception of links (the <a href and </a> tags, respectively). It does not have to be 100% secure: I am not worried about injection attacks or anything like that, as I am parsing content that has already been approved and published into a SWF movie.
The original "strip tags" regular expression I'm using was <(.|\n)+?>, and I tried to modify it to <([^a]|\n)+?>, but that of course keeps any tag that has an a anywhere in it, rather than only tags that start with an a followed by a space.
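To make the failure concrete, here is a minimal sketch in TypeScript (chosen only because its regex flavour is ECMAScript-based, like ActionScript's; the sample string and variable names are made up for illustration). It runs both patterns over the same input, with the g flag added so that every match is replaced.

```typescript
// The two patterns from the question, with the global flag added so that
// replace() removes every match rather than only the first one.
const stripAll = /<(.|\n)+?>/g;
const stripNoA = /<([^a]|\n)+?>/g;

// Hypothetical sample input, not taken from the real content.
const sample = '<p>See <a href="x.html">this</a> <span>page</span></p>';

// Original pattern: every tag is removed, links included.
const allStripped = sample.replace(stripAll, "");

// Modified pattern: only tags containing no letter "a" at all are removed,
// so <span> survives alongside the link.
const attempt = sample.replace(stripNoA, "");

console.log(allStripped); // See this page
console.log(attempt);     // See <a href="x.html">this</a> <span>page</span>
```

The span tags survive only because the word "span" happens to contain an a, which is exactly the problem described above.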
Not that it should really matter, but in case anyone cares to know I am writing this in ActionScript 3.0 for a Flash movie.
<(?!\/?a(?=>|\s.*>))\/?.*?>
Try this. I had something similar for p tags and it worked for them, so I don't see why it wouldn't work here. It uses a negative lookahead to check that what follows the < is not an a (optionally prefixed with a / character), where a nested positive lookahead requires that the a be followed either by > or by whitespace, any other characters, and then >. If that check passes, the rest of the pattern matches up to the next > character. Put this in a substitution with
s/<(?!\/?a(?=>|\s.*>))\/?.*?>//g;
This should leave only the opening and closing a tags.
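As a rough check of the pattern above, here is a small TypeScript sketch (TypeScript rather than ActionScript or Perl purely so it can be run easily; the helper name stripExceptAnchors and the input string are hypothetical). It assumes the same pattern is used unchanged with the g flag.

```typescript
// The pattern from this answer, with the global flag so every match is removed.
const nonAnchorTag = /<(?!\/?a(?=>|\s.*>))\/?.*?>/g;

// Hypothetical helper name, introduced only for this example.
function stripExceptAnchors(html: string): string {
  return html.replace(nonAnchorTag, "");
}

// Made-up input: <abbr> starts with "a" but is not an anchor, so it should go.
const input = '<div><p>Read <a href="x.html">this</a> or <abbr>that</abbr>.</p></div>';
const output = stripExceptAnchors(input);

console.log(output); // Read <a href="x.html">this</a> or that.
```

Two limits are worth keeping in mind. The pattern is case-sensitive, so an uppercase <A HREF=...> tag would be stripped unless the i flag is added, and because . does not match a newline by default, a tag that spans several lines is left in place. ActionScript 3.0's RegExp follows the ECMAScript style, so the pattern should carry over, but that has not been verified here.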