Extracting lxml xpath for html table

mkt2012 picture mkt2012 · Apr 7, 2011 · Viewed 13.8k times · Source

I have a html doc similar to following:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
    <div id="Symbols" class="cb">
    <table class="quotes">
    <tr><th>Code</th><th>Name</th>
        <th style="text-align:right;">High</th>
        <th style="text-align:right;">Low</th>
    </tr>
    <tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
        <td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
        <td>A Inc.</td>
        <td align="right">45.44</td>
        <td align="right">44.26</td>
    <tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
        <td><a href="/xyz.com/B.htm" title="Display,B">B</a></td>
        <td>B Inc.</td>
        <td align="right">18.29</td>
        <td align="right">17.92</td>
</div></html>

I need to extract code/name/high/low information from the table.

I used following code from one of the similar examples in Stack Over Flow:

#############################
import urllib2
from lxml import html, etree

webpg = urllib2.urlopen(http://www.eoddata.com/stocklist/NYSE/A.htm).read()
table = html.fromstring(webpg)

for row in table.xpath('//table[@class="quotes"]/tbody/tr'):
    for column in row.xpath('./th[position()>0]/text() | ./td[position()=1]/a/text() | ./td[position()>1]/text()'):
        print column.strip(),
    print

#############################

I am getting nothing output. I have to change the first loop xpath to table.xpath('//tr') from table.xpath('//table[@class="quotes"]/tbody/tr')

I just don't understand why the xpath('//table[@class="quotes"]/tbody/tr') not work.

Answer

samplebias picture samplebias · Apr 7, 2011

You are probably looking at the HTML in Firebug, correct? The browser will insert the implicit tag <tbody> when it is not present in the document. The lxml library will only process the tags present in the raw HTML string.

Omit the tbody level in your XPath. For example, this works:

tree = lxml.html.fromstring(raw_html)
tree.xpath('//table[@class="quotes"]/tr')
[<Element tr at 1014206d0>, <Element tr at 101420738>, <Element tr at 1014207a0>]