I'm hoping someone will just point out something obvious that I'm missing here. I feel like I've done this a hundred times and for some reason tonight, the behavior coming from this is throwing me for a loop.
I'm reading in some XML from a public API. I want to extract all the text from a certain node (everything within 'body'), which also includes a variety of child nodes. Simple example:
<xml>
  <metadata>
    <article>
      <body>
        <sec>
          <title>A Title</title>
          <p>
            This contains
            <italic>italics</italic>
            and
            <xref ref-type="bibr">xref's</xref>
            .
          </p>
        </sec>
        <sec>
          <title>Second Title</title>
        </sec>
      </body>
    </article>
  </metadata>
</xml>
So ultimately I want to traverse the tree within the desired node (again, 'body') and extract all the text contained in its natural order. Simple enough, so I just write up this little Groovy script...
def xmlParser = new XmlParser()
def xml = xmlParser.parseText(rawXml)
xml.metadata.article.body[0].depthFirst().each { node ->
    if (node.children().size() == 1) {
        println node.text()
    }
}
...which proceeds to blow up with "No signature of method: java.lang.String.children()". So I'm thinking to myself "wait, what? Am I going crazy?" Node.depthFirst() should only return a List of Nodes. I add a little 'instanceof' check and sure enough, I'm getting a combination of Node objects and String objects. Specifically, the lines of text that aren't wrapped in an element of their own, i.e. "This contains" and "and", are returned as Strings. Everything else is a Node (as expected).
I can work around this easily. However, this doesn't seem like correct behavior and I'm hoping someone can point me in the right direction.
I'm pretty sure that's correct behavior (though I've always found the XmlSlurper and XmlParser to have screwy APIs). Everything you can iterate over really should implement a node interface, IMO, and potentially carry a type of TEXT that tells you to get the text from it.
Those text nodes are valid nodes that in many cases you'd want to hit during a depth-first traversal of the XML. If they weren't returned, your check for a children size of 1 wouldn't work, because some nodes (like the <p> tag) have mixed text and elements underneath them.
What makes things even worse is that depthFirst doesn't consistently return text nodes: where the text is an element's only child, as with italic above, it isn't returned as a separate String.
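If you want to see the mix for yourself, here's a rough diagnostic sketch run against just the <p> element from the question. The names sample, p and seen are mine, purely for illustration, and the Strings are trimmed explicitly so the listing doesn't depend on the parser's whitespace settings.

```groovy
import groovy.xml.XmlParser  // Groovy 3+; on 2.x drop this line (groovy.util.XmlParser is a default import)

def sample = '''<p>
  This contains
  <italic>italics</italic>
  and
  <xref ref-type="bibr">xref's</xref>
  .
</p>'''

def p = new XmlParser().parseText(sample)

// Pair each returned item's class name with a short label for it
def seen = p.depthFirst().collect { item ->
    def label = (item instanceof Node) ? "<${item.name()}>" : "'${item.trim()}'"
    "${item.getClass().simpleName}: ${label}".toString()
}
seen.each { println it }
```

The listing should show Node and String entries interleaved, with no separate String for the text inside italic or xref.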
I like to use Groovy method signatures to let the runtime figure out the right way to handle each node (rather than using something like instanceof), like this:
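// On Groovy 4+, add: import groovy.xml.XmlParser (the groovy.util version was removed)
def rawXml = """<xml>
  <metadata>
    <article>
      <body>
        <sec>
          <title>A Title</title>
          <p>
            This contains
            <italic>italics</italic>
            and
            <xref ref-type="bibr">xref's</xref>
            .
          </p>
        </sec>
        <sec>
          <title>Second Title</title>
        </sec>
      </body>
    </article>
  </metadata>
</xml>"""

// Loose text inside mixed content arrives as a String: use it as-is
def processNode(String nodeText) {
    return nodeText
}

// Everything else is a Node: keep its text only when it wraps a single text child.
// A lone child that is itself a Node (like <title> in the second <sec>) is skipped
// here and picked up when the traversal reaches it.
def processNode(Object node) {
    def children = node.children()
    if (children.size() == 1 && children[0] instanceof String) {
        return node.text()
    }
    // falls through to null for container nodes; findResults drops nulls
}

def xmlParser = new XmlParser()
def xml = xmlParser.parseText(rawXml)
// '**' is shorthand for depthFirst()
def xmlText = xml.metadata.article.body[0].'**'.findResults { node ->
    processNode(node)
}
println xmlText.join(" ")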
Prints
A Title This contains italics and xref's . Second Title
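If the inconsistency in depthFirst bothers you, another option is to skip it and recurse over children() yourself, still letting method dispatch choose between text and elements. This is only a sketch: collectText and words are names I made up, the XML is cut down to the body element, and Strings are trimmed explicitly rather than relying on the parser's defaults.

```groovy
import groovy.xml.XmlParser  // Groovy 3+; on 2.x drop this line (groovy.util.XmlParser is a default import)

// A String child is leaf text: trim it and keep it
List<String> collectText(String text) {
    [text.trim()]
}

// A Node contributes the text of its children, in document order
List<String> collectText(Node node) {
    node.children().collectMany { child -> collectText(child) }
}

def doc = new XmlParser().parseText('''<body>
  <sec>
    <title>A Title</title>
    <p>
      This contains
      <italic>italics</italic>
      and
      <xref ref-type="bibr">xref's</xref>
      .
    </p>
  </sec>
  <sec>
    <title>Second Title</title>
  </sec>
</body>''')

def words = collectText(doc)
println words.join(' ')
```

Because every String is visited through its parent's children(), it doesn't matter whether a piece of text is an element's only child or part of mixed content.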
Alternatively, the XmlSlurper class probably does more of what you want/expect, and its text() method gives more reasonable output. If you really don't need to do any sort of DOM walking with the results (what XmlParser is "better" for), I'd suggest XmlSlurper:
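// Reuses rawXml from above; on Groovy 4+, add: import groovy.xml.XmlSlurper
def xmlSlurper = new XmlSlurper()
def xml = xmlSlurper.parseText(rawXml)
// text() concatenates every text node under body, whitespace included
def bodyText = xml.metadata.article.body[0].text()
println bodyText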
Prints:
A Title
This contains
italics
and
xref's
.
Second Title
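If you'd rather have that on one line, like the XmlParser version, you can collapse the whitespace afterwards. A minimal sketch, assuming the same sample XML cut down to the body element (body and flat are illustrative names):

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on 2.x drop this line (groovy.util.XmlSlurper is a default import)

def body = new XmlSlurper().parseText('''<body>
  <sec>
    <title>A Title</title>
    <p>
      This contains
      <italic>italics</italic>
      and
      <xref ref-type="bibr">xref's</xref>
      .
    </p>
  </sec>
  <sec>
    <title>Second Title</title>
  </sec>
</body>''')

// tokenize() with no arguments splits on whitespace, so joining the tokens
// collapses newlines and indentation into single spaces
def flat = body.text().tokenize().join(' ')
println flat
```

One caveat: this relies on whitespace already separating the pieces of text in the source. Two elements written back to back with nothing between them would have their text run together.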