How to navigate XML with xpath in R

CallumH · Dec 5, 2016

Can I have some xpath navigational assistance please, for use with an XML document in R?

I have provided a very stripped-down version of my actual data to illustrate ('my_xml'). In general, I am looking to import a PMML document (exported by SPSS as XML) into R.

For this example I would like to know how to return the 'value' attribute values from nodes whose 'property' attribute is 'valid'. You will see that I have set one property to 'notvalid'.

library(XML)

my_xml = xmlParse('<?xml version="1.0" encoding="UTF-8"?>
<PMML xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.1" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 pmml-4-1.xsd">
<Header copyright="(C) Copyright IBM Corp. 1989, 2014.">
<Application name="IBM SPSS Statistics 23.0" version="23.0.0.0"/>
</Header><DataDictionary numberOfFields="15">
<DataField dataType="string" displayName="Target_Status_OIDV_M!" name="Target_Status_OIDV_M!" optype="categorical">
<Extension extender="spss.com" name="format" value="1"/>
<Extension extender="spss.com" name="width" value="1"/>
<Extension extender="spss.com" name="decimals" value="0"/>
<Value displayValue="F" property="valid" value="F"/>
<Value displayValue="T" property="valid" value="T"/>
</DataField>
<DataField dataType="string" displayName="Status_OIDV_M2" name="Status_OIDV_M2" optype="categorical">
<Extension extender="spss.com" name="format" value="1"/>
<Extension extender="spss.com" name="width" value="4"/>    
<Extension extender="spss.com" name="decimals" value="0"/>    
<Value displayValue="0000" property="valid" value="0000"/>    
<Value displayValue="0001" property="valid" value="0001"/>    
<Value displayValue="0100" property="notvalid" value="0100"/>    
</DataField>
</DataDictionary>
</PMML>')

I check the class and then look for all the values:

class(my_xml) #  "XMLInternalDocument" - excellent... just what xpathApply is looking for :)

get_value_attr = xpathApply(my_xml, "//@value")
print(get_value_attr) # a start in the right direction

So next I try adding my property condition, but I just get an empty list:

get_value_attr_with_condition = xpathApply(my_xml, "//@value[@property='valid']") 
print(get_value_attr_with_condition) # returns an empty list

This makes me realise that in all the examples I have seen, square-bracket attribute conditions are only applied to a node (i.e. //mynodename[@attribute='superduper']); I have never seen one applied to another attribute.

But when I search with xpath for the 'Value' nodes anywhere in the doc (i.e. with '//'), it returns an empty list (NB: I am now searching for the 'Value' node with a capital 'V', not the 'value' attribute).

get_values = xpathApply(my_xml, "//Value") 
print(get_values) 

If I search for the current node, using a period...

my_current_node = xpathApply(my_xml, ".") 
print(my_current_node) 

It's another empty list - why doesn't this select my current node?

I thought maybe xpathApply was looking for some additional args, so I made 3 attempts:

get_that_value = xpathApply(my_xml, "//Value", xmlGetAttr, "value")
print(get_that_value) # empty list again

get_that_property = xpathApply(my_xml, "//Value", xmlGetAttr, "property")
print(get_that_property) # empty list again

get_the_xmlValue = xpathApply(my_xml, "//Value", xmlValue)
print(get_the_xmlValue) 

Nope - I must be doing something wrong! But what???

Answer

hrbrmstr · Dec 5, 2016

For the main inquiry:

xpathSApply(my_xml, "//*[@value and @property='valid']/@value")
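
The attempt in the question, //@value[@property='valid'], comes back empty because the predicate is applied to the 'value' attribute itself, and an attribute node has no attributes of its own. Moving the condition onto the owning element and then stepping to @value is what makes the query above work. A minimal sketch of the difference, assuming a cut-down document with no namespace (the 'doc' object and its contents are hypothetical, made up for illustration):

```r
library(XML)

# Hypothetical stand-in for my_xml: two Value elements, no namespace
doc <- xmlParse(
  '<root>
     <Value property="valid" value="F"/>
     <Value property="notvalid" value="T"/>
   </root>',
  asText = TRUE
)

# Predicate on the attribute node: an attribute has no @property, so nothing matches
on_attribute <- xpathSApply(doc, "//@value[@property='valid']")

# Predicate on the element, then select that element's value attribute
on_element <- xpathSApply(doc, "//*[@property='valid']/@value")

print(length(on_attribute))
print(on_element)
```

Because the wildcard * matches elements in any namespace and these attributes are not namespaced, this form should also work unchanged on the namespaced PMML document.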

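For the second inquiry, you need to deal with the default namespace. Every element in the document belongs to http://www.dmg.org/PMML-4_1, so an unprefixed //Value matches nothing. Bind that namespace to a prefix and use the prefix in the query: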

nsDefs <- xmlNamespaceDefinitions(my_xml)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))
names(ns)[1] <- "x"

xpathSApply(my_xml, "//x:Value", namespaces=ns)
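
The two pieces can probably be combined into a single query that names the Value element explicitly and filters on its property. The following is a sketch only: it assumes the PMML 4.1 namespace URI shown in the question, uses an arbitrary prefix 'x', and builds a small hypothetical 'doc' as a stand-in for my_xml.

```r
library(XML)

# Hypothetical stand-in for my_xml, using the same default namespace
doc <- xmlParse(
  '<PMML xmlns="http://www.dmg.org/PMML-4_1">
     <DataField name="Status">
       <Value displayValue="0000" property="valid" value="0000"/>
       <Value displayValue="0100" property="notvalid" value="0100"/>
     </DataField>
   </PMML>',
  asText = TRUE
)

# Map an arbitrary prefix to the document's default namespace URI
ns <- c(x = "http://www.dmg.org/PMML-4_1")

# Without the prefix, the element name does not match anything
no_prefix <- xpathSApply(doc, "//Value")

# With the prefix, select valid Value elements and read their value attribute
valid_values <- xpathSApply(
  doc, "//x:Value[@property='valid']",
  xmlGetAttr, "value",
  namespaces = ns
)

print(length(no_prefix))
print(valid_values)
```

Writing the URI out by hand avoids the xmlNamespaceDefinitions step, at the cost of hard-coding the PMML version; which is preferable depends on whether the documents being read all use the same schema.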