Convert XML file to CSV in java

ant picture ant · Jul 20, 2010 · Viewed 34.9k times · Source

@Before There will be probably some duplicate questions suggestions, I don't think that is the case maybe read this first, I'll try to be as brief as possible. Title gives basic idea.

Here is an example XML(case 1) :

<root>
      <Item>
        <ItemID>4504216603</ItemID>
        <ListingDetails>
          <StartTime>10:00:10.000Z</StartTime>
          <EndTime>10:00:30.000Z</EndTime>
          <ViewItemURL>http://url</ViewItemURL>
            ....
           </item>      

Here is an example XML(case 2) :

          <Item>
            <ItemID>4504216604</ItemID>
            <ListingDetails>
              <StartTime>10:30:10.000Z</StartTime>
              <!-- Start difference from case 1 -->
              <averages>
              <AverageTime>value1</AverageTime>
              <category type="TX">9823</category>
              <category type="TY">9112</category>
              <AveragePrice>value2</AveragePrice>
              </averages>
              <!-- End difference from case 1 -->
              <EndTime>11:00:10.000Z</EndTime>
              <ViewItemURL>http://url</ViewItemURL>
                ....
               </item>
                </root>

I borrowed this XML from google, anyways my objects are not always the same, sometimes there are extra elements like in case2. Now I'd like to produce CSV like this from both cases:

ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice
4504216603,10:00:10.000Z,10:00:30.000Z,http://url
4504216604,10:30:10.000Z,11:00:10.000Z,http://url,value1,value2

This 1st line is header it should also be included in csv. I got some useful links to stax today, I don't really don't know what is the right/optimal approach for this, I'm struggling with this for 3 days now, not really willing to give up yet.

Tell me what you think how would you solve this

I forgot to mention this is very huge xml file up to 1gb

BOUNTY UPDATE :

I'm looking for more Generic approach, meaning that this should work for any number of nodes with any depth, and sometimes as in the example xml, it can happen that one item object has greater number of nodes than the next/previous one so there should be also case for that(so all columns and values match in CSV).

Also it can happen that nodes have the same name/localName but different values and attributes, if that is the case then new column should appear in CSV with appropriate value. (I added example of this case inside <averages> tag called category)

Answer

Mark McLaren picture Mark McLaren · Jul 30, 2010

The code provided should be considered a sketch rather than the definitive article. I am not an expert on SAX and the implementation could be improved for better performance, simpler code etc. That said SAX should be able to cope with streaming large XML files.

I would approach this problem with 2 passes using the SAX parser. (Incidentally, I would also use a CSV generating library to create the output as this would deal with all the fiddly character escaping that CSV involves but I haven't implemented this in my sketch).

First pass: Establish number of header columns

Second pass: Output CSV

I assume that the XML file is well formed. I assume that we don't have a scheme/DTD with a predefined order.

In the first pass I have assumed that a CSV column will be added for every XML element containing text content or for any attribute (I have assumed attributes will contain something!).

The second pass, having established the number of target columns, will do the actual CSV output.

Based on your example XML my code sketch would produce:

ItemID,StartTime,EndTime,ViewItemURL,AverageTime,category,category,type,type,AveragePrice
4504216603,10:00:10.000Z,10:00:30.000Z,http://url,,,,,,
4504216604,10:30:10.000Z,11:00:10.000Z,http://url,value1,9823,9112,TX,TY,value2

Please note I have used the google collections LinkedHashMultimap as this is helpful when associating multiple values with a single key. I hope you find this useful!

import com.google.common.collect.LinkedHashMultimap;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map.Entry;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class App {

    public static void main(String[] args) throws SAXException, FileNotFoundException, IOException {
        // First pass - to determine headers
        XMLReader xr = XMLReaderFactory.createXMLReader();
        HeaderHandler handler = new HeaderHandler();
        xr.setContentHandler(handler);
        xr.setErrorHandler(handler);
        FileReader r = new FileReader("test1.xml");
        xr.parse(new InputSource(r));

        LinkedHashMap<String, Integer> headers = handler.getHeaders();
        int totalnumberofcolumns = 0;
        for (int headercount : headers.values()) {
            totalnumberofcolumns += headercount;
        }
        String[] columnheaders = new String[totalnumberofcolumns];
        int i = 0;
        for (Entry<String, Integer> entry : headers.entrySet()) {
            for (int j = 0; j < entry.getValue(); j++) {
                columnheaders[i] = entry.getKey();
                i++;
            }
        }
        StringBuilder sb = new StringBuilder();
        for (String h : columnheaders) {
            sb.append(h);
            sb.append(',');
        }
        System.out.println(sb.substring(0, sb.length() - 1));

        // Second pass - collect and output data

        xr = XMLReaderFactory.createXMLReader();

        DataHandler datahandler = new DataHandler();
        datahandler.setHeaderArray(columnheaders);

        xr.setContentHandler(datahandler);
        xr.setErrorHandler(datahandler);
        r = new FileReader("test1.xml");
        xr.parse(new InputSource(r));
    }

    public static class HeaderHandler extends DefaultHandler {

        private String content;
        private String currentElement;
        private boolean insideElement = false;
        private Attributes attribs;
        private LinkedHashMap<String, Integer> itemHeader;
        private LinkedHashMap<String, Integer> accumulativeHeader = new LinkedHashMap<String, Integer>();

        public HeaderHandler() {
            super();
        }

        private LinkedHashMap<String, Integer> getHeaders() {
            return accumulativeHeader;
        }

        private void addItemHeader(String headerName) {
            if (itemHeader.containsKey(headerName)) {
                itemHeader.put(headerName, itemHeader.get(headerName) + 1);
            } else {
                itemHeader.put(headerName, 1);
            }
        }

        @Override
        public void startElement(String uri, String name,
                String qName, Attributes atts) {
            if ("item".equalsIgnoreCase(qName)) {
                itemHeader = new LinkedHashMap<String, Integer>();
            }
            currentElement = qName;
            content = null;
            insideElement = true;
            attribs = atts;
        }

        @Override
        public void endElement(String uri, String name, String qName) {
            if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {
                if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {
                    addItemHeader(qName);
                }
                if (attribs != null) {
                    int attsLength = attribs.getLength();
                    if (attsLength > 0) {
                        for (int i = 0; i < attsLength; i++) {
                            String attName = attribs.getLocalName(i);
                            addItemHeader(attName);
                        }
                    }
                }
            }
            if ("item".equalsIgnoreCase(qName)) {
                for (Entry<String, Integer> entry : itemHeader.entrySet()) {
                    String headerName = entry.getKey();
                    Integer count = entry.getValue();
                    //System.out.println(entry.getKey() + ":" + entry.getValue());
                    if (accumulativeHeader.containsKey(headerName)) {
                        if (count > accumulativeHeader.get(headerName)) {
                            accumulativeHeader.put(headerName, count);
                        }
                    } else {
                        accumulativeHeader.put(headerName, count);
                    }
                }
            }
            insideElement = false;
            currentElement = null;
            attribs = null;
        }

        @Override
        public void characters(char ch[], int start, int length) {
            if (insideElement) {
                content = new String(ch, start, length);
            }
        }
    }

    public static class DataHandler extends DefaultHandler {

        private String content;
        private String currentElement;
        private boolean insideElement = false;
        private Attributes attribs;
        private LinkedHashMultimap dataMap;
        private String[] headerArray;

        public DataHandler() {
            super();
        }

        @Override
        public void startElement(String uri, String name,
                String qName, Attributes atts) {
            if ("item".equalsIgnoreCase(qName)) {
                dataMap = LinkedHashMultimap.create();
            }
            currentElement = qName;
            content = null;
            insideElement = true;
            attribs = atts;
        }

        @Override
        public void endElement(String uri, String name, String qName) {
            if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {
                if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {
                    dataMap.put(qName, content);
                }
                if (attribs != null) {
                    int attsLength = attribs.getLength();
                    if (attsLength > 0) {
                        for (int i = 0; i < attsLength; i++) {
                            String attName = attribs.getLocalName(i);
                            dataMap.put(attName, attribs.getValue(i));
                        }
                    }
                }
            }
            if ("item".equalsIgnoreCase(qName)) {
                String data[] = new String[headerArray.length];
                int i = 0;
                for (String h : headerArray) {
                    if (dataMap.containsKey(h)) {
                        Object[] values = dataMap.get(h).toArray();
                        data[i] = (String) values[0];
                        if (values.length > 1) {
                            dataMap.removeAll(h);
                            for (int j = 1; j < values.length; j++) {
                                dataMap.put(h, values[j]);
                            }
                        } else {
                            dataMap.removeAll(h);
                        }
                    } else {
                        data[i] = "";
                    }
                    i++;
                }
                StringBuilder sb = new StringBuilder();
                for (String d : data) {
                    sb.append(d);
                    sb.append(',');
                }
                System.out.println(sb.substring(0, sb.length() - 1));
            }
            insideElement = false;
            currentElement = null;
            attribs = null;
        }

        @Override
        public void characters(char ch[], int start, int length) {
            if (insideElement) {
                content = new String(ch, start, length);
            }
        }

        public void setHeaderArray(String[] headerArray) {
            this.headerArray = headerArray;
        }
    }
}