I'd like to parse a very large (about 200MB) RDF file in Python. Should I be using SAX or some other library? I'd appreciate some very basic code that I can build on, say to retrieve a tag.
Thanks in advance.
If you are looking for fast performance then I'd recommend using Raptor with the Redland Python bindings. Raptor is written in C, and its performance is far better than RDFLib's; the Python bindings let you use it without having to deal with C directly.
Another tip for improving performance: forget about parsing RDF/XML and go with another flavor of RDF such as Turtle or N-Triples. Parsing N-Triples in particular is much faster than parsing RDF/XML, because the N-Triples syntax is simpler.
You can transform your RDF/XML into N-Triples using rapper, a tool that comes with Raptor:
rapper -i rdfxml -o ntriples YOUR_FILE.rdf > YOUR_FILE.ntriples
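If you prefer to drive that conversion from a Python script, a minimal sketch using the standard subprocess module could look like this (the file names are placeholders):

import subprocess

# Run rapper and redirect its output to the converted file.
# The paths here are placeholders; substitute your own.
with open("YOUR_FILE.ntriples", "w") as out:
    subprocess.check_call(
        ["rapper", "-i", "rdfxml", "-o", "ntriples", "YOUR_FILE.rdf"],
        stdout=out)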
The N-Triples file will contain triples like:
<s1> <p> <o> .
<s2> <p2> "literal" .
and parsers tend to be very efficient at handling this structure. It is also more memory-efficient than RDF/XML because, as you can see, the representation is smaller and each statement sits on its own line.
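To see why the format is so streaming-friendly, here is a deliberately naive sketch that counts statements by scanning the file line by line in constant memory; it assumes simple triples like the ones above (a real N-Triples parser also has to deal with escapes and literals containing spaces):

count = 0
with open("YOUR_FILE.ntriples") as f:   # placeholder file name
    for line in f:                      # one statement per line
        line = line.strip()
        if not line or line.startswith("#"):
            continue                    # skip blank lines and comments
        count += 1
print(count, "triples")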
The code below is a simple example using the Redland Python bindings:
import RDF
parser = RDF.Parser(name="ntriples")  # the parser name can be ntriples, turtle, rdfxml, ...
model = RDF.Model()
# parse_into_model loads the whole file into the in-memory model
parser.parse_into_model(model, "file://file_path", "http://your_base_uri.org")
for triple in model:
    print(triple.subject, triple.predicate, triple.object)
The base URI is prepended to relative URIs inside your RDF document. You can check the documentation for the Redland Python bindings API for more details.
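If even the converted file is too big to hold in an in-memory model, the bindings can also hand you statements one at a time. A minimal sketch using parse_as_stream, with the same placeholder paths as above:

import RDF

parser = RDF.Parser(name="ntriples")
# parse_as_stream yields statements as they are parsed,
# without building the whole model in memory first
for statement in parser.parse_as_stream("file://file_path",
                                        "http://your_base_uri.org"):
    print(statement.subject, statement.predicate, statement.object)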
If you don't care much about performance, then use RDFLib; it is simple and easy to use.
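For completeness, a minimal RDFLib sketch (the file name is a placeholder; note that RDFLib keeps the whole graph in memory):

from rdflib import Graph

g = Graph()
g.parse("YOUR_FILE.rdf", format="xml")  # RDFLib calls RDF/XML simply "xml"
for s, p, o in g:
    print(s, p, o)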