regular expression to extract text from HTML

Ron Harlev · Oct 8, 2008 · Viewed 51.3k times

I would like to extract all the text (displayed or not) from an arbitrary HTML page.

I would like to remove:

  • any HTML tags
  • any JavaScript
  • any CSS styles

Is there a regular expression (one or more) that will achieve that?

Answer

nickf · Oct 8, 2008

Remove JavaScript and CSS (the dot must be allowed to match newlines, and matching should be case-insensitive):

<(script|style).*?</\1>

Remove tags:

<.*?>
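
As a rough illustration, the two patterns can be applied in sequence, script and style blocks first and then the remaining tags. The sketch below uses Python's re module; the helper name extract_text and the sample input are made up for this example and are not part of the original answer. It assumes reasonably well-formed markup: a ">" inside an attribute value, or an HTML comment containing tags, will confuse the second pattern.

```python
import re

# Matches a whole <script> or <style> element, contents included.
# DOTALL lets "." span newlines; IGNORECASE also catches <SCRIPT>, <Style>, etc.
SCRIPT_STYLE_RE = re.compile(r"<(script|style).*?</\1>", re.DOTALL | re.IGNORECASE)

# Matches any remaining tag; DOTALL covers tags broken across lines.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def extract_text(html):
    """Hypothetical helper: strip scripts, styles, and tags from an HTML string."""
    without_code = SCRIPT_STYLE_RE.sub("", html)   # step 1: drop JavaScript and CSS
    without_tags = TAG_RE.sub("", without_code)    # step 2: drop the remaining tags
    # Collapse the runs of whitespace left behind by the removed markup.
    return " ".join(without_tags.split())


sample = """<html><head><style>
p { color: red; }
</style><script type="text/javascript">
var x = 1;
</script></head>
<body><p>Hello, <b>world</b>!</p></body></html>"""

print(extract_text(sample))  # prints: Hello, world!
```

The order matters: if the tags were removed first, the script and style contents would be left behind as plain text. Replacing tags with a space instead of an empty string avoids gluing together words from adjacent block elements, at the cost of stray spaces before punctuation.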