Node.js: sanitize HTML and also remove tags

user1025852 · Jul 26, 2015

How can I tell sanitize-html to actually remove the HTML tags and keep only the content inside them? Currently, if I configure it to keep div elements, for example, the output also contains the tags themselves (<div>some content</div>). I want only the inner text ('some content').

In short: I don't want the tags, attributes, etc., only the content of those elements.

var Crawler = require("js-crawler");
var download = require("url-download");
var sanitizeHtml = require('sanitize-html');
var util = require('util');
var fs = require('fs');

new Crawler().configure({depth: 1})
  .crawl("http://www.cnn.com", function onSuccess(page) {
    var clean = sanitizeHtml(page.body, {
      allowedTags: ['p', 'em', 'strong', 'div']
    });
    console.log(clean);
    fs.writeFile('sanitized.txt', clean, function (err) {
      if (err) throw err;
      console.log('It\'s saved! in same location.');
    });

    console.log(util.inspect(clean, {showHidden: false, depth: null}));
    var str = JSON.stringify(clean.toString());
    console.log(str);
    /*download(page.url, './download')
    .on('close', function () {
      console.log('One file has been downloaded.');
    });*/
  });

Answer

Tom Boutell · Jul 26, 2015

I'm the author of sanitize-html.

You can set allowedTags to an empty array. sanitize-html does not discard the contents of a disallowed tag, only the tag itself (with the exception of a few tags like "script" and "style" for which this would not make sense). Otherwise it wouldn't be much use for its original intended purpose, which is cleaning up markup copied and pasted from word processors and the like into a rich text editor.

However, if you have markup like:

<div>One</div><div>Two</div>

That will come out as:

OneTwo

To work around that, you can use the textFilter option to ensure the text of a tag is always followed by at least one space:

textFilter: function(text) {
  return text + ' ';
}

However, this will also introduce extra spaces in sentences that contain inline tags like "strong" and "em".

So the more I think about it, the best answer for you is probably a completely different npm module:

https://www.npmjs.com/package/html-to-text

It's widely used and much better suited to your use case. sanitize-html is really meant for situations where you want the tags... just not the wrong tags.