While trying to parse HTML with Yahoo Query Language (YQL) and the XPath support it provides, I found that I could not extract "text()" or attribute values.
For example, the query
select * from html where url="http://stackoverflow.com"
and xpath='//div/h3/a'
gives a list of anchors as XML:
<results>
<a class="question-hyperlink" href="/questions/661184/filling-the-text-area-with-the-text-when-a-button-is-clicked" title="In ASP.net, I need the code to fill the text area (in the form) when a button is clicked. Can you help me through by showing a simple .aspx code containing the script tag? ">Filling the text area with the text when a button is clicked</a>...
</results>
Now when I try to extract the node value using
select * from html where url="http://stackoverflow.com"
and xpath='//div/h3/a/text()'
I get the results concatenated into a single string rather than a node list, e.g.:
<results>Xcode: attaching to a remote process for debuggingWhy is b
…… </results>
How do I separate the results into a node list, and how do I select attribute values?
A query like this
select * from html where url="http://stackoverflow.com"
and xpath='//div/h3/a[@href]'
gave me the same results as querying //div/h3/a.
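YQL requires the XPath expression to evaluate to an itemPath rather than to node text. Once you have an itemPath, you can project various values from the tree.
In other words, an itemPath should point to a node in the resulting HTML rather than to text content or attributes. YQL returns all matching nodes and their children when you select * from the data. This is presumably also why //div/h3/a[@href] gave the same results: in XPath, [@href] is a predicate that keeps only the anchors that have an href attribute; it does not select the attribute itself.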
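The "select nodes first, then project values" idea can be tried locally. The following is only a rough analogy in Python using the standard library's ElementTree, not YQL itself; the SAMPLE markup and its hrefs are hypothetical stand-ins for a fetched page.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a fetched page; not real Stack Overflow markup.
SAMPLE = """
<html><body>
  <div><h3><a href="/questions/1/first" title="t1">First question</a></h3></div>
  <div><h3><a href="/questions/2/second" title="t2">Second question</a></h3></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Step 1: the path selects element nodes (the analogue of an itemPath).
anchors = root.findall(".//div/h3/a")

# Step 2: project the values you want from each selected node.
texts = [a.text for a in anchors]          # like "select content ..."
hrefs = [a.get("href") for a in anchors]   # like "select href ..."

for href, text in zip(hrefs, texts):
    print(href, "->", text)
```

Because each anchor stays a separate node until the projection step, the text values come back as a list instead of one concatenated string.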
For example:
select * from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
This returns all the a elements matching the XPath. To project out just the text content, use:
select content from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
"content" returns the text content held within the node.
To project out attributes, specify them relative to the XPath expression. In this case you need href, which is an attribute of a:
select href from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
This returns:
<results>
<a href="/questions/663973/putting-a-background-pictures-with-leds"/>
<a href="/questions/663013/advantages-and-disadvantages-of-popular-high-level-languages"/>
....
</results>
If you need both the href attribute and the text content, you can run the following YQL query:
select href, content from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
This returns:
<results>
<a href="/questions/663950/double-pointer-const-issue-issue">double pointer const issue issue</a>...
</results>
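Once the results come back in that shape, splitting them into separate items on the client side is straightforward. Here is a minimal sketch in Python, assuming you already have the results element as a string; parse_links is a hypothetical helper name, the two entries reuse links quoted in this thread, and a real response may wrap results in an outer envelope that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Fragment in the shape shown above; both entries reuse hrefs and
# titles quoted elsewhere in this thread.
RESULTS_XML = (
    '<results>'
    '<a href="/questions/663950/double-pointer-const-issue-issue">'
    'double pointer const issue issue</a>'
    '<a href="/questions/661184/filling-the-text-area-with-the-text-when-a-button-is-clicked">'
    'Filling the text area with the text when a button is clicked</a>'
    '</results>'
)


def parse_links(results_xml):
    """Return a list of (href, text) pairs, one per <a> child of <results>."""
    root = ET.fromstring(results_xml)
    # (a.text or "") guards against anchors that carry no text content.
    return [(a.get("href"), (a.text or "").strip()) for a in root.findall("a")]


links = parse_links(RESULTS_XML)
for href, text in links:
    print(href, "->", text)
```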
Hope that helps. Let me know if you have more questions about YQL.