I am looking for a quick way to strip HTML tags out of a ColdFusion string. We are pulling in an RSS feed that could potentially have anything in it. We then manipulate the information and send it on to another place. Currently we are doing this with a regular expression. Is there a better way to do this?
<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
<cfset myFeed.item[i].description.value =
REReplaceNoCase(myFeed.item[i].description.value, '<(.|\n)*?>', '', 'ALL')>
</cfloop>
We are using ColdFusion 8.
Disclaimer: I am a fierce advocate of using a proper parser (instead of regex) to parse HTML. However, this question isn't about parsing HTML, but about destroying it. For all tasks that go beyond that, use a parser.
I think your regex is good. As long as there is nothing more than removing all HTML tags from the input, using a regex like yours is safe.
Anything else would probably be more hassle than it's worth, but you could write a small function that loops through the string char-by-char once and removes everything that's within tag brackets, e.g. by switching a flag on when it encounters a "<" character, switching it off again at ">", and copying characters to the output only while the flag is off.
For a high-demand part of your app, this may be faster than the regex. But the regex is clean and probably fast enough.
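The char-by-char approach could be sketched roughly as follows. This is a minimal, untimed sketch: the function name stripTags and the variable names are made up for illustration, and collecting the output in a java.lang.StringBuffer is my own choice, not something the approach requires.

```cfm
<cffunction name="stripTags" returntype="string" output="false">
    <cfargument name="input" type="string" required="true">
    <!--- Hypothetical helper: collect output in a Java StringBuffer
          instead of concatenating CF strings inside the loop --->
    <cfset var out = CreateObject("java", "java.lang.StringBuffer").init()>
    <cfset var inTag = false>
    <cfset var ch = "">
    <cfset var i = 0>
    <cfloop from="1" to="#Len(arguments.input)#" index="i">
        <cfset ch = Mid(arguments.input, i, 1)>
        <cfif ch EQ "<">
            <!--- A tag starts: stop copying --->
            <cfset inTag = true>
        <cfelseif ch EQ ">" AND inTag>
            <!--- The tag ends: resume copying with the next character --->
            <cfset inTag = false>
        <cfelseif NOT inTag>
            <!--- Outside any tag: keep the character --->
            <cfset out.append(ch)>
        </cfif>
    </cfloop>
    <cfreturn out.toString()>
</cffunction>

<!--- Usage inside the existing loop --->
<cfset myFeed.item[i].description.value = stripTags(myFeed.item[i].description.value)>
```

Like the regex, this drops an unclosed tag at the end of the string and leaves a stray ">" outside of a tag untouched. Whether it actually beats the regex would have to be measured on real feed data.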
Maybe this modified regex has some advantages for you:
<[^>]*(?:>|$)
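[^>]* is better than (.|\n)*?: a negated character class needs no alternation, and it matches newlines as well. The (?:>|$) at the end also removes a tag that is left unclosed at the end of the string.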
The use of REReplaceNoCase() is unnecessary when there are no actual letters in the pattern. Case-insensitive regex matching is slower than doing it case-sensitively.
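Putting both suggestions together, the loop from the question might look like this (a sketch that reuses the question's own myFeed structure and only swaps the function and the pattern):

```cfm
<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
    <!--- Case-sensitive REReplace with the modified pattern:
          "<", then anything that is not ">", then ">" or end of string --->
    <cfset myFeed.item[i].description.value =
        REReplace(myFeed.item[i].description.value, '<[^>]*(?:>|$)', '', 'ALL')>
</cfloop>
```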