I'm trying to use Ruby's Nokogiri to parse large (1 GB or more) XML files. I'm testing my code on a smaller file containing only 4 records, available here. I'm using Nokogiri version 1.5.0 and Ruby 1.8.7 on Ubuntu 10.10. Since I don't understand SAX very well, I'm trying Nokogiri::XML::Reader to start.
My first attempt, to retrieve the content of the PMID tag, looks like this:
#!/usr/bin/ruby
require "rubygems"
require "nokogiri"

file = ARGV[0]
reader = Nokogiri::XML::Reader(File.open(file))
p = []
reader.each do |node|
  if node.name == "PMID"
    p << node.inner_xml
  end
end
puts p.inspect
Here's what I hoped to see:
["21714156", "21693734", "21692271", "21692260"]
Here's what I actually saw:
["21714156", "", "21693734", "", "21692271", "", "21692260", ""]
It seems that for some reason, my code is finding, or generating, an extra, empty PMID tag for every instance of PMID. Either that, or inner_xml does not work as I thought.
I'd be grateful if anyone could confirm that my code and data generate the result shown and suggest where I'm going wrong.
Each element in the stream comes through as two events: one to open the element and one to close it. The opening event will have
node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
and the closing event will have
node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT
The empty strings you're seeing are just the element closing events. Remember that with a streaming parser (Reader is a pull-style cursor rather than SAX proper, but the idea is the same), you're basically walking through a tree, so you need the second event to tell you when you're going back up and closing an element.
You probably want something more like this:
reader.each do |node|
  if node.name == "PMID" && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    p << node.inner_xml
  end
end
Or perhaps:
reader.each do |node|
  next if node.name != 'PMID'
  next if node.node_type != Nokogiri::XML::Reader::TYPE_ELEMENT
  p << node.inner_xml
end
Or some other variation on that.