I have a CouchDB view map function that generates an abstract of a stored HTML document (first x
characters of text). Unfortunately I have no browser environment to convert HTML to plain text.
Currently I use this multi-stage regexp:
html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
.replace(/<script([\s\S]*?)<\/script>/gi, ' ')
.replace(/(<(?:.|\n)*?>)/gm, ' ')
.replace(/\s+/gm, ' ');
While it's a very good filter, it's obviously not a perfect one, and some leftovers sometimes slip through. Is there a better way to convert HTML to plain text without a browser environment?
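This regular expression strips every tag, although it leaves the contents of script and style elements and any HTML entities in place: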
text.replace(/<[^>]*>/g, '');
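If the leftovers are mostly comments, script/style bodies, or entities such as &amp;amp;, a slightly more thorough version might look like the sketch below. It is not a real HTML parser (a ">" inside an attribute value will still confuse it), the names htmlToText, decodeEntities, and abstractOf are made up for illustration, and the entity table is a small assumed subset you would need to extend. It sticks to ES5 so that it should also run inside a CouchDB map function.

```javascript
// Small, deliberately incomplete entity table (an assumption; extend as needed).
var ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ' };

// Decode named entities from the table above plus numeric ones (&#65; or &#x41;).
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, function (match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1).toLowerCase() === 'x';
      var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // String.fromCharCode only handles the BMP; leave anything else untouched.
      return code > 0 && code <= 0xFFFF ? String.fromCharCode(code) : match;
    }
    // Unknown named entities are left as they are.
    return Object.prototype.hasOwnProperty.call(ENTITIES, body) ? ENTITIES[body] : match;
  });
}

// Hypothetical helper: strip markup, then decode entities, then tidy whitespace.
function htmlToText(html) {
  var text = String(html)
    .replace(/<!--[\s\S]*?-->/g, ' ')                      // comments
    .replace(/<(script|style)\b[\s\S]*?<\/\1\s*>/gi, ' ')  // script/style incl. content
    .replace(/<[^>]*>/g, ' ');                             // every remaining tag
  // Decode only after the tags are gone, so "&lt;b&gt;" survives as the text "<b>".
  text = decodeEntities(text);
  return text.replace(/\s+/g, ' ').replace(/^\s+|\s+$/g, '');
}

// Hypothetical helper for the abstract: the first maxLength characters of text.
function abstractOf(html, maxLength) {
  return htmlToText(html).slice(0, maxLength);
}
```

Inside the map function you would then emit something like abstractOf(doc.html, 200), where doc.html stands for whatever field holds the markup. Replacing tags with a space rather than an empty string keeps words in adjacent elements from running together; the final whitespace pass collapses the extra spaces again.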