I'm trying to parse a malformed HTML page, the Eclipse download site, with XmlSlurper. The W3C validator reports several errors in the page.
I tried the fault-tolerant parser from this post:
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
// Getting the xhtml page thanks to Neko SAX parser
def mirrors = new XmlSlurper(new SAXParser()).parse("http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz")
mirrors.'**'
Unfortunately, it looks like not all content is parsed into the XML object. The faulty subtrees are simply ignored.
For example,
mirrors.depthFirst().find { it.text() == 'North America' }
returns null instead of the H4 element in the page.
Is there a robust way to parse arbitrary HTML content in Groovy?
The following code parses the page without errors:
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
def parser = new SAXParser()
def page = new XmlSlurper(parser).parse('http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz')
However, I don't know exactly which elements you'd like to find.
Here, for example, the 'All mirrors' link is found:
page.depthFirst().find {
it.text() == 'All mirrors'
}.@href
EDIT
Both outputs are null:
println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}
EDIT 2
Below you can find a working example that downloads the file and parses it correctly. I used wget to download the file (something goes wrong when downloading it with Groovy; I don't know what yet).
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def temp = File.createTempFile('eclipse', 'tmp')
temp.deleteOnExit()
def cmd = ['wget', host, '-O', temp.absolutePath].execute()
// Fail fast if wget did not exit successfully
assert cmd.waitFor() == 0
def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(temp.text)
println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}
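The wget run above suggests that the parser itself copes with broken markup. A minimal offline sketch to check that in isolation, assuming an invented HTML snippet (the example.org URLs and mirror names are hypothetical and not taken from the Eclipse page):

```groovy
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper // on Groovy 4 this class lives in groovy.xml instead

// Hypothetical, deliberately malformed markup: no <html>/<body>,
// unclosed <li> and <p> elements.
def html = '''<h4>North America</h4>
<ul>
<li><a href="http://example.org/a">Mirror A</a>
<li><a href="http://example.org/b">Mirror B</a>
</ul>
<p>Unclosed paragraph'''

def page = new XmlSlurper(new SAXParser()).parseText(html)

// NekoHTML upper-cases element names by default, hence 'H4' and 'A'.
def heading = page.depthFirst().find { it.text() == 'North America' }
assert heading.name() == 'H4'

// Collect the href attribute of every anchor that survived parsing.
def hrefs = page.depthFirst().findAll { it.name() == 'A' }.collect { it.@href.text() }
assert hrefs == ['http://example.org/a', 'http://example.org/b']
```

If this passes while the same lookup on the live page returns null, the difference is more likely in what was downloaded than in how it was parsed.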
EDIT 3
And finally, problem solved. Using Groovy's url.toURL().text causes problems when no User-Agent header is specified. With the header set, it works correctly and the elements are found, with no external tools needed.
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(host.toURL().getText(requestProperties: ['User-Agent': 'Non empty']))
assert page.depthFirst().find { it.text() == 'North America'}
assert page.depthFirst().find { it.text().contains('North America')}
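To see which User-Agent a plain url.text request sends compared with the requestProperties variant above, here is a small sketch that uses the JDK's built-in com.sun.net.httpserver.HttpServer as a local echo server. The server, its address and the '(none)' placeholder are illustrative assumptions and not part of the original problem:

```groovy
import com.sun.net.httpserver.HttpServer

// Local echo server: replies with the User-Agent header it received.
// Port 0 asks the OS for any free port.
def server = HttpServer.create(new InetSocketAddress('127.0.0.1', 0), 0)
server.createContext('/') { exchange ->
    byte[] body = (exchange.requestHeaders.getFirst('User-Agent') ?: '(none)').bytes
    exchange.sendResponseHeaders(200, body.length)
    exchange.responseBody.withStream { it.write(body) }
}
server.start()

def url = ('http://127.0.0.1:' + server.address.port + '/').toURL()
def defaultAgent, customAgent
try {
    // Plain request: whatever the JVM sends by default.
    defaultAgent = url.text
    // Same request with an explicit User-Agent, as in the snippet above.
    customAgent = url.getText(requestProperties: ['User-Agent': 'Non empty'])
} finally {
    server.stop(0)
}

println "default: ${defaultAgent}"
println "custom:  ${customAgent}"
assert customAgent == 'Non empty'
```

The default value depends on the JVM in use, so it is printed rather than asserted. A server that treats that default differently from a browser-like agent would explain why the same URL yields different HTML.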