With ElementTree in Python, how can I extract all the text from a node, stripping any tags in that element and keeping only the text?
For example, say I have the following:
<tag>
Some <a>example</a> text
</tag>
I want to return Some example text. How do I go about doing this? So far, the approaches I've taken have had fairly disastrous outcomes.
If you are running under Python 3.2+ (or Python 2.7, where the method was also added), you can use itertext. itertext creates a text iterator which loops over this element and all subelements, in document order, and returns all inner text:
import xml.etree.ElementTree as ET
xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
# -> 'Some example text'
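Note that the markup in the question has line breaks around the inner text, and itertext() returns those as well. A minimal sketch of one way to tidy that up, assuming that collapsing every run of whitespace into a single space is acceptable for your data (the variable names raw and normalized are only illustrative):

```python
import xml.etree.ElementTree as ET

# The question's markup, including the line breaks around the inner text.
xml = """<tag>
Some <a>example</a> text
</tag>"""

root = ET.fromstring(xml)

# itertext() keeps the whitespace exactly as it appears in the document.
raw = ''.join(root.itertext())

# str.split() with no arguments splits on any run of whitespace, so
# re-joining with one space collapses newlines and repeated spaces.
normalized = ' '.join(raw.split())

print(repr(raw))    # -> '\nSome example text\n'
print(normalized)   # -> Some example text
```

If the whitespace inside the element is significant, skip the normalization step and use the joined string as is.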
If you are running a lower version of Python, you can reuse the implementation of itertext() by attaching it to the Element class, after which you can call it exactly like above:
import xml.etree.ElementTree as ET

# original implementation of .itertext(), taken from Python 2.7
def itertext(self):
    tag = self.tag
    # comments and processing instructions have a non-string tag; skip them
    if not isinstance(tag, basestring) and tag is not None:
        return
    if self.text:
        yield self.text
    for e in self:
        for s in e.itertext():
            yield s
        if e.tail:
            yield e.tail

# if necessary, monkey-patch the Element class
if 'itertext' not in ET.Element.__dict__:
    ET.Element.itertext = itertext

xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
# -> 'Some example text'
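If you would rather not modify the Element class, the same traversal can be written as a standalone generator. This is only a sketch of that alternative, not part of ElementTree: the names iter_text and string_types are made up for illustration, and the try/except shim is an assumption that lets the same code run on both Python 2 and Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical compatibility shim: basestring exists only on Python 2.
try:
    string_types = basestring
except NameError:
    string_types = str

def iter_text(elem):
    """Yield the text of elem and its descendants in document order."""
    # Comments and processing instructions have a non-string tag; skip them.
    if not isinstance(elem.tag, string_types) and elem.tag is not None:
        return
    if elem.text:
        yield elem.text
    for child in elem:
        # Text inside the child comes first, then the text that follows it.
        for piece in iter_text(child):
            yield piece
        if child.tail:
            yield child.tail

root = ET.fromstring('<tag>Some <a>example</a> text</tag>')
result = ''.join(iter_text(root))
print(result)
# -> Some example text
```

The trade-off is that you call iter_text(root) instead of root.itertext(), but nothing in the standard library is patched.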