How to Parse Big (50 GB) XML Files in Java

Joe Maher picture Joe Maher · Oct 11, 2014 · Viewed 39k times · Source

Currently im trying to use a SAX Parser but about 3/4 through the file it just completely freezes up, i have tried allocating more memory etc but not getting any improvements.

Is there any way to speed this up? A better method?

Stripped it to bare bones, so i now have the following code and when running in command line it still doesn't go as fast as i would like.

Running it with "java -Xms-4096m -Xmx8192m -jar reader.jar" i get a GC overhead limit exceeded around article 700000

Main:

public class Read {
    public static void main(String[] args) {       
       pages = XMLManager.getPages();
    }
}

XMLManager

public class XMLManager {
    public static ArrayList<Page> getPages() {

    ArrayList<Page> pages = null; 
    SAXParserFactory factory = SAXParserFactory.newInstance();

    try {

        SAXParser parser = factory.newSAXParser();
        File file = new File("..\\enwiki-20140811-pages-articles.xml");
        PageHandler pageHandler = new PageHandler();

        parser.parse(file, pageHandler);
        pages = pageHandler.getPages();

    } catch (ParserConfigurationException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }


    return pages;
    }    
}

PageHandler

public class PageHandler extends DefaultHandler{

    private ArrayList<Page> pages = new ArrayList<>();
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(){
        super();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {

        stringBuilder = new StringBuilder();

         if (qName.equals("page")){

            page = new Page();
            idSet = false;

        } else if (qName.equals("redirect")){
             if (page != null){
                 page.setRedirecting(true);
             }
        }
    }

     @Override
     public void endElement(String uri, String localName, String qName) throws SAXException {

         if (page != null && !page.isRedirecting()){

             if (qName.equals("title")){

                 page.setTitle(stringBuilder.toString());

             } else if (qName.equals("id")){

                 if (!idSet){

                     page.setId(Integer.parseInt(stringBuilder.toString()));
                     idSet = true;

                 }

             } else if (qName.equals("text")){

                 String articleText = stringBuilder.toString();

                 articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); //remove references
                 articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); //remove links underneath headings
                 articleText = articleText.replaceAll("(?s)==See also==.+", " "); //remove everything after see also
                 articleText = articleText.replaceAll("\\|", " "); //Separate multiple links
                 articleText = articleText.replaceAll("\\n", " "); //remove new lines
                 articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); //remove all non alphanumeric except dashes and spaces
                 articleText = articleText.trim().replaceAll(" +", " "); //convert all multiple spaces to 1 space

                 Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
                 Matcher matcher = pattern.matcher(articleText);
                 matcher.find();

                 try {
                     page.setSummaryText(matcher.group());
                 } catch (IllegalStateException se){
                     page.setSummaryText("None");
                 }
                 page.setText(articleText);

             } else if (qName.equals("page")){

                 pages.add(page);
                 page = null;

            }
        } else {
            page = null;
        }
     }

     @Override
     public void characters(char[] ch, int start, int length) throws SAXException {
         stringBuilder.append(ch,start, length); 
     }

     public ArrayList<Page> getPages() {
         return pages;
     }
}

Answer

Don Roby picture Don Roby · Oct 19, 2014

Your parsing code is likely working fine, but the volume of data you're loading is probably just too large to hold in memory in that ArrayList.

You need some sort of pipeline to pass the data on to its actual destination without ever store it all in memory at once.

What I've sometimes done for this sort of situation is similar to the following.

Create an interface for processing a single element:

public interface PageProcessor {
    void process(Page page);
}

Supply an implementation of this to the PageHandler through a constructor:

public class Read  {
    public static void main(String[] args) {

        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing, 
                // but I don't know what that is...
                System.out.println(page);
           }
        }) ;
    }

}


public class XMLManager {

    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();

        try {

            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            PageHandler pageHandler = new PageHandler(processor);

            parser.parse(file, pageHandler);

        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

Send data to this processor instead of putting it in the list:

public class PageHandler extends DefaultHandler {

    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
         //Unchanged from your implementation
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
         //Unchanged from your implementation
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
            //  Elide code not needing change

            } else if (qName.equals("page")){

                processor.process(page);
                page = null;

            }
        } else {
            page = null;
        }
    }

}

Of course, you can make your interface handle chunks of multiple records rather than just one and have the PageHandler collect pages locally in a smaller list and periodically send the list off for processing and clear the list.

Or (perhaps better) you could implement the PageProcessor interface as defined here and build in logic there that buffers the data and sends it on for further handling in chunks.