How to get results from the XML SAX parser in Python

Shiva Krishna Bavandla · Sep 4, 2012 · Viewed 17.9k times

I am working with the XML SAX parser to parse XML files. My XML and code are below.

XML file:

<job>
    <title>Registered Nurse-Epilepsy</title>
    <job-code>881723</job-code>
    <detail-url>http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance
    </detail-url>
    <job-category>Neuroscience Nursing</job-category>
    <description>
        <summary>
            <div class='descriptionheader'>Description</div><P STYLE="margin-top:0px;margin-bottom:0px"><SPAN STYLE="font-family:Arial;font-size:small">Utilizing the standards set forth for Nursing Practice by the ANA and ONS, the RN will organize, modify, evaluate, document and maintain the plan of care for Epilepsy and/or Neurological patients. It will include individualized, family centered, holistic, supportive, and safe age-specific care.</SPAN></P><div class='qualificationsheader'>Qualifications</div><UL STYLE="list-style-type:disc"> <LI>Graduate of an accredited school of Professional Nursing.</LI> <LI>BSN preferred </LI> <LI>Current licensure with the Board of Nurse Examiners for the State of Texas</LI> <LI>Experience in Epilepsy Monitoring and/or Neurological background preferred.</LI> <LI>ACLS preferred, within 6 months of hire</LI> <LI>PALS required upon hire</LI> </UL>
       </summary>
    </description>
    <posted-date>2012-07-26</posted-date>
    <location>
       <address>7777 Forest Lane</address>
       <city>Dallas</city>
       <state>TX</state>
       <zip>75230</zip>
       <country>US</country>
    </location>
    <company>
       <name>Medical City (Dallas, TX)</name>
      <url>http://www.hcanorthtexas.com/careers/search-jobs.dot</url>
    </company>
</job> 

Python code (partial, up to the startElement function, which is where my question is):

import xml.sax
import xml.sax.handler

class Exact(xml.sax.handler.ContentHandler):
  def __init__(self):
    self.curpath = []

  def startElement(self, name, attrs):
    print("%s %s" % (name, attrs))
    self.clearFields()


  def endElement(self, name):
    pass

  def characters(self, data):
    self.buffer += data

  def clearFields(self):
    self.fields = {}
    self.fields['title'] = None
    self.fields['job-code'] = None
    self.fields['detail-url'] = None
    self.fields['job-category'] = None
    self.fields['description'] = None
    self.fields['summary'] = None
    self.fields['posted-date'] = None
    self.fields['location'] = None
    self.fields['address'] = None
    self.fields['city'] = None
    self.fields['state'] = None
    self.fields['zip'] = None
    self.fields['country'] = None
    self.fields['company'] = None
    self.fields['name'] = None
    self.fields['url'] = None

    self.buffer = ''

if __name__ == '__main__':
  parser = xml.sax.make_parser()
  handler = Exact()
  parser.setContentHandler(handler)
  parser.parse(open('/path/to/xml_file.xml'))

Result: the output of the print statement above is:

job     <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
title   <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-code <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
detail-url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-category <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
description  <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
summary       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
posted-date   <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
location      <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
address       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
city          <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
state         <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
zip           <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
country       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
company       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
name          <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
url           <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>

As you can see above, the print statement gives me the name and attrs of each element, but what I actually want is the value of each element. How can I fetch the values for all of these tags? At the moment I only get the node names, not their values.

Edited Code:

I am really confused about how to map the data from the nodes to the keys in the dictionary shown above.

Answer

l4mpi · Sep 4, 2012

To get the content of an element, you need to override the characters method. Add this to your handler class:

def characters(self, data):
    print(data)

Be careful with this, though: the parser is not required to give you all of an element's text in a single chunk, so characters may be called several times for one element. You should use an internal buffer and read it when needed. In most of my XML/SAX code I do something like this:

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []

    def _flushCharBuffer(self):
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def characters(self, data):
        self._charBuffer.append(data)

... and then call the flush method at the end of the elements where I need the data.
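
As a minimal sketch of why the buffer matters, the handler below counts how often characters is called for one element and joins the pieces in endElement. The class name ChunkCountingHandler and the sample XML string are made up for illustration. How the text is split depends on the underlying parser, so only the joined result should be relied on.

```python
import xml.sax
import xml.sax.handler


class ChunkCountingHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler: buffers text and records how it arrived."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._charBuffer = []
        self.chunk_count = 0
        self.texts = {}

    def characters(self, data):
        # May be called several times for a single text node.
        self.chunk_count += 1
        self._charBuffer.append(data)

    def endElement(self, name):
        # Join whatever chunks arrived, then reset the buffer.
        self.texts[name] = ''.join(self._charBuffer)
        self._charBuffer = []


handler = ChunkCountingHandler()
xml.sax.parseString(b"<title>Nurse &amp; Doctor</title>", handler)
print(handler.chunk_count)     # parser-dependent, do not rely on it
print(handler.texts['title'])  # Nurse & Doctor
```

Printing inside characters directly could show the same text in several fragments, which is why the join happens only once the element has ended.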

For your whole use case, assuming you have a file containing multiple job descriptions and want a list of jobs with each job being a dictionary of its fields, do something like this:

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []
        self._result = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer)
        self._charBuffer = []
        return data.strip() #remove strip() if whitespace is important

    def parse(self, f):
        xml.sax.parse(f, self)
        return self._result

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        if name == 'job': self._result.append({})

    def endElement(self, name):
        if not name == 'job': self._result[-1][name] = self._getCharacterData()

jobs = MyHandler().parse("job-file.xml") #a list of all jobs
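
To map node values onto a fixed set of keys like the ones in the question, one option is to store only the leaf fields you care about and to clear the buffer whenever an element starts. The sketch below assumes that approach. The class name JobFieldHandler, the FIELDS tuple, the jobs wrapper element and the inline sample XML are all illustrative and not part of the original code.

```python
import xml.sax
import xml.sax.handler

# Assumed subset of the leaf fields listed in the question.
FIELDS = ('title', 'job-code', 'posted-date', 'city', 'state', 'zip')


class JobFieldHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler: one dict per <job>, restricted to FIELDS."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._charBuffer = []
        self.jobs = []

    def startElement(self, name, attrs):
        # Drop text that preceded this element (e.g. indentation).
        self._charBuffer = []
        if name == 'job':
            # Start every job with all known keys set to None.
            self.jobs.append(dict.fromkeys(FIELDS))

    def characters(self, data):
        self._charBuffer.append(data)

    def endElement(self, name):
        if name in FIELDS and self.jobs:
            self.jobs[-1][name] = ''.join(self._charBuffer).strip()
        self._charBuffer = []


# Assumed input: several <job> elements inside a <jobs> root element.
SAMPLE = b"""<jobs>
  <job>
    <title>Registered Nurse-Epilepsy</title>
    <job-code>881723</job-code>
    <location><city>Dallas</city><state>TX</state></location>
  </job>
</jobs>"""

handler = JobFieldHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.jobs[0]['title'])        # Registered Nurse-Epilepsy
print(handler.jobs[0]['city'])         # Dallas
print(handler.jobs[0]['posted-date'])  # None (not present in the sample)
```

Container elements such as location are skipped here, so they never end up as empty strings in the result.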

If you just need to parse a single job at a time, you can simplify the list part and throw away the startElement method - just set _result to a dict and assign to it directly in endElement.
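
A rough sketch of that single-job simplification might look like the following. The class name SingleJobHandler, the parseString helper and the inline XML are assumptions made for the example.

```python
import xml.sax
import xml.sax.handler


class SingleJobHandler(xml.sax.handler.ContentHandler):
    """Hypothetical single-job variant: the result is one dict."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._charBuffer = []
        self._result = {}

    def _getCharacterData(self):
        data = ''.join(self._charBuffer)
        self._charBuffer = []
        return data.strip()  # remove strip() if whitespace is important

    def parseString(self, text):
        # Parse XML held in a bytes object and return the field dict.
        xml.sax.parseString(text, self)
        return self._result

    def characters(self, data):
        self._charBuffer.append(data)

    def endElement(self, name):
        # No startElement needed: every non-job element becomes a key.
        if name != 'job':
            self._result[name] = self._getCharacterData()


job = SingleJobHandler().parseString(
    b"<job><title>Registered Nurse-Epilepsy</title><zip>75230</zip></job>")
print(job['title'])  # Registered Nurse-Epilepsy
print(job['zip'])    # 75230
```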