How to detect either the string contains an html (can be html4, html5, just partials of html within text)? I do not need a version of HTML, but rather if the string is just a text or it contains an html. Text is typically multiline with also empty lines
example inputs:
html:
<head><title>I'm title</title></head>
Hello, <b>world</b>
non-html:
<ht fldf d><
<html><head> head <body></body> html
You can use an HTML parser, like BeautifulSoup
. Note that it really tries it best to parse an HTML, even broken HTML, it can be very and not very lenient depending on the underlying parser:
>>> from bs4 import BeautifulSoup
>>> html = """<html>
... <head><title>I'm title</title></head>
... </html>"""
>>> non_html = "This is not an html"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
>>> bool(BeautifulSoup(non_html, "html.parser").find())
False
This basically tries to find any html element inside the string. If found - the result is True
.
Another example with an HTML fragment:
>>> html = "Hello, <b>world</b>"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
Alternatively, you can use lxml.html
:
>>> import lxml.html
>>> html = 'Hello, <b>world</b>'
>>> non_html = "<ht fldf d><"
>>> lxml.html.fromstring(html).find('.//*') is not None
True
>>> lxml.html.fromstring(non_html).find('.//*') is not None
False