I'm trying to remove all the html/javascript using bs4, however, it doesn't get rid of javascript. I still see it there with the text. How can I get around this?
I tried using nltk
which works fine however, clean_html
and clean_url
will be removed moving forward. Is there a way to use soups get_text
and get the same result?
I tried looking at these other pages:
BeautifulSoup get_text does not strip all tags and JavaScript
Currently i'm using the nltk's deprecated functions.
EDIT
Here's an example:
import urllib
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print soup.get_text()
I still see the following for CNN:
$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});
/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});
How can I remove the js?
Only other options I found are:
https://github.com/aaronsw/html2text
The problem with html2text
is that it's really really slow at times, and creates noticable lag, which is one thing nltk was always very good with.
Based partly on Can I remove script tags with BeautifulSoup?
import urllib
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.decompose() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)