The following test reads a file and uses lxml.html to print the leaf nodes of the page's DOM tree.
However, I'm also trying to figure out how to take the input from a string instead of a file. Using
lxml.html.fromstring(s)
doesn't work, because it returns an Element rather than an ElementTree.
So I'm trying to figure out how to convert an Element to an ElementTree.
Thoughts?
import subprocess

import lxml.html
from lxml import etree  # trying this to see if it's needed to convert an Element to an ElementTree

#cmd = 'cat osu_test.txt'
cmd = 'cat o2.txt'
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
s = proc.communicate()[0].strip()
# s contains HTML, not XML text

#doc = lxml.html.parse(s)
doc = lxml.html.parse('osu_test.txt')
doc1 = lxml.html.fromstring(s)

for node in doc.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc.getpath(node)
        #print "aaa ", node.tag

nt = etree.ElementTree(doc1)  # <<<<< doesn't work.. so what will??
for node in nt.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc.getpath(node)
        #print "aaa ", node.tag
===============================
Update:
(Parsing HTML instead of XML.) I added the changes suggested by Abbas and got the following errors:
doc1 = etree.fromstring(s)
File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48621)
File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72232)
File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71093)
File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67862)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64244)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65165)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64508)
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 48, column 220
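The traceback above comes from lxml's XML parser: etree.fromstring only accepts the entities predefined by XML, so an HTML entity such as &nbsp; is rejected, whereas lxml.html.fromstring goes through the HTML parser. A minimal sketch of the difference, written for Python 3 and using a made-up one-line snippet in place of the real page:

```python
import lxml.html
from lxml import etree

# Hypothetical stand-in for the real page content.
snippet = "<p>one&nbsp;two</p>"

# The XML parser does not know the HTML entity &nbsp; and raises.
try:
    etree.fromstring(snippet)
    xml_failed = False
except etree.XMLSyntaxError as err:
    xml_failed = True
    print("XML parser:", err)

# The HTML parser resolves the entity to a non-breaking space (U+00A0).
elem = lxml.html.fromstring(snippet)
print("HTML parser:", elem.tag, repr(elem.text))
```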
Update:
I managed to get the test working, but I'm not exactly sure why. If someone with Python chops wants to provide an explanation, that would help future people who stumble on this.
from cStringIO import StringIO
from lxml.html import parse

doc1 = parse(StringIO(s))
for node in doc1.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc1.getpath(node)
It appears that StringIO implements the file-like I/O interface that parse() needs in order to process the input string for the test HTML. Similar to what casting provides in other languages, perhaps...
thanks
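One likely explanation for the StringIO result: lxml.html.parse expects a filename, URL, or file-like object, so a raw HTML string handed to it directly is treated as a path to open. Wrapping the string in a StringIO gives parse an object with a read() method instead. A sketch of the same idea for Python 3, where io.BytesIO stands in for cStringIO; the sample page is made up:

```python
import io

import lxml.html

# Hypothetical page content, held in memory rather than in a file.
page = b"<html><head><title>t</title></head><body><p>one</p><br></body></html>"

# parse() reads from the file-like object and returns an ElementTree.
tree = lxml.html.parse(io.BytesIO(page))

# Collect (tag, path) for every leaf node, as in the original loop.
leaves = [(node.tag, tree.getpath(node)) for node in tree.iter() if len(node) == 0]
for tag, path in leaves:
    print("aaa", tag, path)
```

Because parse returns the tree itself, getpath can be called on it directly, with no conversion step.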
To get the root tree from an _Element (generated with lxml.html.fromstring), you can use the getroottree method:
doc = lxml.html.fromstring(s)
tree = doc.getroottree()
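A short sketch of this approach end to end, assuming Python 3 and a made-up full HTML document held in a string:

```python
import lxml.html

# Hypothetical input: a complete HTML document as a string.
s = "<html><head><title>t</title></head><body><p>one</p><p>two</p></body></html>"

root = lxml.html.fromstring(s)   # an element, not a tree
tree = root.getroottree()        # the ElementTree that contains that element

# getpath lives on the tree, so the leaf-node loop works again.
paths = [tree.getpath(node) for node in tree.iter() if len(node) == 0]
for path in paths:
    print(path)
```

This avoids the StringIO wrapper entirely when the HTML is already in memory.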