High speed XSLT to convert from XML to CSV

Raghavendra Nilekani picture Raghavendra Nilekani · Mar 5, 2013 · Viewed 9.2k times · Source

I have been trying to get CSV data from an XML document in most efficient time using XSLT. Following is my sample XML

<?xml version="1.0" encoding="ISO-8859-1"?>
 <sObjects xmlns="urn:sobject.partner.soap.sforce.com">
     <sObject>
        <Name>Raagu</Name>
        <BillingStreet>Hoskote</BillingStreet>
</sObject>
     <sObject>
        <Name>Rajath</Name>
         <BillingStreet>BTM</BillingStreet>
         <age>25</age>
</sObject>
     <sObject>
        <Name>Sarath</Name>
         <BillingStreet>Murgesh</BillingStreet>
         <location>Bangalore</location>
         <age>#N/A</age>
</sObject>
     <sObject>
         <Name>Bharath</Name>
         <BillingStreet>EGL</BillingStreet>
         <location>Bangalore</location>
             <shipping>Hoskote</Shipping>
</sObject>
     <sObject>
         <Id>12312321321</Id>
         <Name>Guru</Name>
         <location>Sirsi</location>
         <date>12-12-12</date>
</sObject>
     <sObject>
         <Name>Appa</Name>
         <BillingStreet>someStrrt</BillingStreet>
         <accountNo>213213</accountNo>
</sObject>
           <sObject>
          <Name>Sarath</Name>
          <BillingStreet>Murgesh</BillingStreet>
         <location>Bangalore</location>
</sObject>
     <sObject>
          <Name>Sarath</Name>
         <BillingStreet>Murgesh</BillingStreet>
          <location>Bangalore</location>
</sObject>
     <sObject>
          <Name>Sarath</Name>
          <BillingStreet>Murgesh</BillingStreet>
           <location>Bangalore</location>
</sObject>

and I wanted Output of this kind

 <?xml version="1.0" encoding="utf-8"?><csv xmlns="http://www.approuter.com/schemas/RootNode"><data>Name,BillingStreet,age,location,Shipping,Id,date,accountNo
Raagu,Hoskote,,,,,,
Rajath,BTM,25,,,,,
Sarath,Murgesh,#N/A,Bangalore,,,,
Bharath,EGL,,Bangalore,Hoskote,,,
Guru,,,Sirsi,,12312321321,12-12-12,
Appa,someStrrt,,,,,,213213
Sarath,Murgesh,,Bangalore,,,,
Sarath,Murgesh,,Bangalore,,,,
Sarath,Murgesh,,Bangalore,,,,</data></csv>

To get this done, I have tried following XSLT

<xsl:stylesheet version="1.0" xmlns:p0="urn:sobject.partner.soap.sforce.com" xmlns:csv="csv:csv" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output encoding="utf-8"  method="xml"/>
    <xsl:strip-space elements="*" />
    <xsl:variable name="delimiter" select="','"/>

    <xsl:key name="field" match="p0:sObject/*" use="name()"/>

<!-- variable containing the first occurrence of each field -->
    <xsl:variable name="allFields"
    select="/*/*/*[generate-id()=generate-id(key('field', name())[1])]"/>

    <xsl:template match="/">
    <!-- Output the CSV header -->
        <xsl:element name="csv" namespace="http://www.approuter.com/schemas/RootNode">
            <xsl:element name="data" namespace="http://www.approuter.com/schemas/RootNode">

                <xsl:for-each select="$allFields">
                    <xsl:value-of select="name()" />
                    <xsl:if test="position() &lt; last()">
                        <xsl:value-of select="$delimiter" />
                    </xsl:if>
                </xsl:for-each>

                <xsl:text>&#xa;    </xsl:text>

                <xsl:apply-templates select="/*/p0:sObject" />

            </xsl:element>
        </xsl:element>

    </xsl:template>

    <xsl:template match="p0:sObject">
        <xsl:variable name="this" select="." />
        <xsl:for-each select="$allFields">
            <xsl:value-of select="$this/*[name() = name(current())]" />
            <xsl:if test="position() &lt; last()">
                <xsl:value-of select="$delimiter" />
            </xsl:if>
        </xsl:for-each>
        <xsl:if test="position() &lt; last()">
            <xsl:text>&#xa;    </xsl:text>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

The above XSLT works really well from functioanality point of view. But I am trying this to process around 10000 records. i,e 10000 instances on sObject elements each sObject will contain around 15 fields under it.

If I run this above XSLT to process these many records, it is going for toss. XSLT is taking around 20 minutes to process and giving csv data. I wanted to get this done within seconds. That is XSLT should take between 3-4 seconds to process 10k records (sObject entries) to give valid CSV data as shown above.

This is where I am stuck to enhance the XSLT and need help in modifying this XSLT to work faster.

Answer

ljdelight picture ljdelight · Mar 5, 2013

I think this is a difficult problem. I didn't see anything obvious. Using a multi-step build is the trick: I created pass1.xsl and pass2.xsl which creates the output tree much faster.

I made a test file that had a tree size of 252097 nodes (697768 characters). Your XSL took 21sec and the below two xsls took a few seconds.

Pass 1 XSL

<xsl:stylesheet version="2.0" xmlns:p0="urn:sobject.partner.soap.sforce.com" 
    xmlns:csv="csv:csv" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output encoding="utf-8"  method="xml" indent="no"/>
    <xsl:strip-space elements="*" />
    <xsl:variable name="delimiter" select="','"/>

    <xsl:variable name="allFields">
        <xsl:for-each select="distinct-values( /*/*/*/name() )">
            <xsl:element name="{.}" />
        </xsl:for-each>    
    </xsl:variable>

    <xsl:template match="p0:sObjects">
        <xsl:element name="{local-name(.)}" namespace="urn:sobject.partner.soap.sforce.com">
            <xsl:element name="order" namespace="urn:sobject.partner.soap.sforce.com">
                <xsl:for-each select="$allFields/*">
                    <xsl:value-of select="name()" />
                    <xsl:if test="position() &lt; last()">
                        <xsl:value-of select="$delimiter" />
                    </xsl:if>
                </xsl:for-each>
            </xsl:element>

            <xsl:apply-templates select="/*/p0:sObject" />
        </xsl:element>
    </xsl:template>

    <xsl:template match="p0:sObject">
        <xsl:variable name="this" select="." />
        <xsl:element name="{local-name(.)}" namespace="urn:sobject.partner.soap.sforce.com">
            <xsl:for-each select="$allFields/*">
                <xsl:element name="{local-name(.)}" namespace="urn:sobject.partner.soap.sforce.com">
                    <xsl:value-of select="$this/*[name() = name(current())]" />
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>

Pass 2 XSL

<xsl:stylesheet version="2.0" xmlns:p0="urn:sobject.partner.soap.sforce.com" 
    xmlns:csv="csv:csv" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output encoding="utf-8"  method="xml" indent="no"/>
    <xsl:strip-space elements="*" />
    <xsl:variable name="delimiter" select="','"/>

    <xsl:template match="/">
        <xsl:element name="csv" namespace="http://www.approuter.com/schemas/RootNode">
            <xsl:element name="data" namespace="http://www.approuter.com/schemas/RootNode">
                <xsl:apply-templates select="/p0:sObjects/*" />
            </xsl:element>
        </xsl:element>
    </xsl:template>

    <xsl:template match="p0:order">
        <xsl:value-of select="." /> 
    </xsl:template>

    <!-- use Michael's suggestion about using first instead of last() -->
    <xsl:template match="p0:sObject/*[ position() = 1 ]">
        <xsl:text>&#xa;    </xsl:text>
        <xsl:value-of select="." />
    </xsl:template>

    <xsl:template match="p0:sObject/*[ position() > 1 ]">
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="." />
    </xsl:template>
</xsl:stylesheet>

Pass 1 output

This is the output xml from pass 1. This would create an xml that looks like this (which is easier to parse for the next phase. See the new empty elements under sObject?):

<sObjects xmlns="urn:sobject.partner.soap.sforce.com">
   <order>Name,BillingStreet,age,location,shipping,Id,date,accountNo</order>
   <sObject>
      <Name>Raagu</Name>
      <BillingStreet>Hoskote</BillingStreet>
      <age/>
      <location/>
      <shipping/>
      <Id/>
      <date/>
      <accountNo/>
   </sObject>
</xObjects>

Finally, the result:

<csv xmlns="http://www.approuter.com/schemas/RootNode"><data>Name,BillingStreet,age,location,shipping,Id,date,accountNo
    Raagu,Hoskote,,,,,,
    Rajath,BTM,25,,,,,
...

The command I executed:

saxonb-xslt -t bigxml.xml pass1.xsl > intermediate.xml
saxonb-xslt -t intermediate.xml pass2.xsl > res.xml