I am working on an XML SAX parser to parse XML files. My code is below.
XML file:
<job>
<title>Registered Nurse-Epilepsy</title>
<job-code>881723</job-code>
<detail-url>http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance
</detail-url>
<job-category>Neuroscience Nursing</job-category>
<description>
<summary>
<div class='descriptionheader'>Description</div><P STYLE="margin-top:0px;margin-bottom:0px"><SPAN STYLE="font-family:Arial;font-size:small">Utilizing the standards set forth for Nursing Practice by the ANA and ONS, the RN will organize, modify, evaluate, document and maintain the plan of care for Epilepsy and/or Neurological patients. It will include individualized, family centered, holistic, supportive, and safe age-specific care.</SPAN></P><div class='qualificationsheader'>Qualifications</div><UL STYLE="list-style-type:disc"> <LI>Graduate of an accredited school of Professional Nursing.</LI> <LI>BSN preferred </LI> <LI>Current licensure with the Board of Nurse Examiners for the State of Texas</LI> <LI>Experience in Epilepsy Monitoring and/or Neurological background preferred.</LI> <LI>ACLS preferred, within 6 months of hire</LI> <LI>PALS required upon hire</LI> </UL>
</summary>
</description>
<posted-date>2012-07-26</posted-date>
<location>
<address>7777 Forest Lane</address>
<city>Dallas</city>
<state>TX</state>
<zip>75230</zip>
<country>US</country>
</location>
<company>
<name>Medical City (Dallas, TX)</name>
<url>http://www.hcanorthtexas.com/careers/search-jobs.dot</url>
</company>
</job>
Python code (partial, shown only up to the startElement function to illustrate my question):
from xml.sax.handler import ContentHandler
import xml.sax
import xml.parsers.expat
import ConfigParser

class Exact(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.curpath = []

    def startElement(self, name, attrs):
        print name,attrs
        self.clearFields()

    def endElement(self, name):
        pass

    def characters(self, data):
        self.buffer += data

    def clearFields(self):
        self.fields = {}
        self.fields['title'] = None
        self.fields['job-code'] = None
        self.fields['detail-url'] = None
        self.fields['job-category'] = None
        self.fields['description'] = None
        self.fields['summary'] = None
        self.fields['posted-date'] = None
        self.fields['location'] = None
        self.fields['address'] = None
        self.fields['city'] = None
        self.fields['state'] = None
        self.fields['zip'] = None
        self.fields['country'] = None
        self.fields['company'] = None
        self.fields['name'] = None
        self.fields['url'] = None
        self.buffer = ''

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    handler = Exact()
    parser.setContentHandler(handler)
    parser.parse(open('/path/to/xml_file.xml'))
Result: the output of the print statement above is given below.
job <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
title <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-code <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
detail-url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-category <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
description <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
summary <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
posted-date <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
location <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
address <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
city <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
state <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
zip <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
country <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
company <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
name <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
As you can see above, I am getting name and attrs from the print statement, but what I actually want is the value of each element. How can I fetch the values for all the tags above? I am only getting the node names, not their values.
Edit: I am really confused about how to map the data from the nodes to the keys in the dictionary shown above.
To get the content of an element, you need to override the characters method. Add this to your handler class:
def characters(self, data):
    print data
Be careful with this, though: the parser is not required to give you all the data of an element in a single chunk, so characters may be called several times for one text node. You should use an internal buffer and read it when needed. In most of my XML/SAX code I do something like this:
class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []

    def _flushCharBuffer(self):
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def characters(self, data):
        self._charBuffer.append(data)
... and then call the flush method in endElement for the elements whose data I need.
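As a minimal sketch of that pattern (written for Python 3; the TextCollector class name, the texts attribute and the inline sample document are assumptions made for illustration, not part of your code), the flush can be called from endElement like this:

```python
import xml.sax
import xml.sax.handler


class TextCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler: buffers character data and flushes it per element."""

    def __init__(self):
        super().__init__()
        self._charBuffer = []
        self.texts = {}  # element name -> text (illustrative only)

    def _flushCharBuffer(self):
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def characters(self, data):
        # May be called several times for a single text node.
        self._charBuffer.append(data)

    def endElement(self, name):
        # Flush on every end tag so text does not leak into the next element.
        self.texts[name] = self._flushCharBuffer().strip()


# Assumed inline sample, a cut-down version of the job record.
sample = b"<job><title>Registered Nurse-Epilepsy</title><city>Dallas</city></job>"
handler = TextCollector()
xml.sax.parseString(sample, handler)
print(handler.texts['title'])
print(handler.texts['city'])
```

Flushing on every end tag, even for elements whose text is not needed, keeps whitespace between tags from piling up in the buffer.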
For your whole use case - assuming you have a file containing multiple job descriptions and want a list which holds the jobs with each job being a dictionary of the fields, do something like this:
import xml.sax

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []
        self._result = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer)
        self._charBuffer = []
        return data.strip()  # remove strip() if whitespace is important

    def parse(self, f):
        xml.sax.parse(f, self)
        return self._result

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        # each <job> starts a new dictionary in the result list
        if name == 'job':
            self._result.append({})

    def endElement(self, name):
        # store the buffered text under the element's name
        if name != 'job':
            self._result[-1][name] = self._getCharacterData()

jobs = MyHandler().parse("job-file.xml")  # a list of all jobs
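If you want to fill only a fixed set of keys, as your clearFields does, one possible variation is to check each element name against a set of wanted names. The following is a sketch for Python 3 under some assumptions: the FIELDS set, the JobFieldHandler name, the jobs wrapper element and the inline sample are all hypothetical and only there to make the example self-contained.

```python
import xml.sax
import xml.sax.handler

# Hypothetical set of leaf fields, taken from the keys in the question's clearFields().
FIELDS = {'title', 'job-code', 'detail-url', 'job-category', 'summary',
          'posted-date', 'address', 'city', 'state', 'zip', 'country',
          'name', 'url'}


class JobFieldHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self._charBuffer = []
        self.jobs = []

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        if name == 'job':
            # Start each job with every known key preset to None.
            self.jobs.append(dict.fromkeys(FIELDS))

    def endElement(self, name):
        # Always empty the buffer, but keep the text only for wanted fields.
        text = ''.join(self._charBuffer).strip()
        self._charBuffer = []
        if name in FIELDS and self.jobs:
            self.jobs[-1][name] = text


# Assumed sample: a <jobs> root wrapping one shortened <job> record.
sample = b"""<jobs>
<job><title>Registered Nurse-Epilepsy</title><job-code>881723</job-code>
<location><city>Dallas</city><state>TX</state></location></job>
</jobs>"""

handler = JobFieldHandler()
xml.sax.parseString(sample, handler)
job = handler.jobs[0]
print(job['title'], job['job-code'], job['city'], job['zip'])
```

Fields that never appear in a job, such as zip in this sample, simply stay None, and container elements like location are not stored at all.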
If you just need to parse a single job at a time, you can simplify the list part and drop the startElement method: just set _result to a dict and assign to it directly in endElement.
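A rough sketch of that single-job variant, again for Python 3, could look like the following. The SingleJobHandler name, the parse_string helper and the inline sample are assumptions added for illustration.

```python
import xml.sax
import xml.sax.handler


class SingleJobHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self._charBuffer = []
        self._result = {}  # a single dict instead of a list of dicts

    def characters(self, data):
        self._charBuffer.append(data)

    def endElement(self, name):
        text = ''.join(self._charBuffer).strip()
        self._charBuffer = []
        if name != 'job':
            self._result[name] = text

    def parse_string(self, data):
        # Hypothetical helper: parses XML held in memory rather than a file.
        xml.sax.parseString(data, self)
        return self._result


job = SingleJobHandler().parse_string(
    b"<job><title>Registered Nurse-Epilepsy</title><zip>75230</zip></job>")
print(job)
```

No startElement is needed here because there is only one record to fill.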