Convert HTML to plain text in JS without browser environment

Erik Aigner · Mar 2, 2013 · Viewed 40.4k times

I have a CouchDB view map function that generates an abstract of a stored HTML document (first x characters of text). Unfortunately I have no browser environment to convert HTML to plain text.

Currently I use this multi-stage regexp:

html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
    .replace(/<script([\s\S]*?)<\/script>/gi, ' ')
    .replace(/(<(?:.|\n)*?>)/gm, ' ')
    .replace(/\s+/gm, ' ');

While it's a very good filter, it's obviously not a perfect one, and some leftovers occasionally slip through. Is there a better way to convert to plain text without a browser environment?

Answer

Gaël · Mar 2, 2013

This regular expression strips every tag (it leaves the contents of script and style elements, and any HTML entities, in place):

text.replace(/<[^>]*>/g, '');
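For the abstract use case in the question, the tag-stripping step can be combined with the other clean-up stages. The following is only a minimal sketch, not a full HTML parser: the function name `htmlToAbstract`, the `maxLength` parameter, and the small `NAMED` entity table are hypothetical and chosen for illustration. It sticks to ES5 syntax on the assumption that the view server runs an older JavaScript engine, and it replaces tags with a space rather than an empty string so that adjacent block elements do not run together.

```javascript
// Hypothetical table: only a handful of named entities are covered here.
var NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ' };

// Decode one entity match; unknown entities are left untouched.
function decodeEntity(match, body) {
  if (body.charAt(0) === '#') {
    var isHex = body.charAt(1) === 'x' || body.charAt(1) === 'X';
    var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
    // String.fromCharCode only covers the Basic Multilingual Plane.
    return code > 0 && code <= 0xFFFF ? String.fromCharCode(code) : match;
  }
  return NAMED.hasOwnProperty(body) ? NAMED[body] : match;
}

// Hypothetical helper: returns at most maxLength characters of plain text.
function htmlToAbstract(html, maxLength) {
  var text = String(html)
    // Drop comments, then style/script elements together with their contents.
    .replace(/<!--[\s\S]*?-->/g, ' ')
    .replace(/<(style|script)\b[\s\S]*?<\/\1\s*>/gi, ' ')
    // Drop every remaining tag.
    .replace(/<[^>]*>/g, ' ')
    // Decode entities last, so an escaped "&lt;b&gt;" survives as literal text.
    .replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z0-9]+);/gi, decodeEntity)
    // Collapse whitespace runs and trim both ends.
    .replace(/\s+/g, ' ')
    .replace(/^\s+|\s+$/g, '');
  return text.slice(0, maxLength);
}
```

Called as `htmlToAbstract(doc.html, 200)` inside a map function (assuming the HTML is stored in a field named `html`), this would emit the first 200 characters of text. Because it is still regex-based, malformed markup such as a `>` inside an attribute value can leave fragments behind.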