PDF Parsing with Text and Coordinates

Alexis Canyon · Jun 21, 2011 · Viewed 15.1k times · Source

I am currently using PDFBox to parse a PDF, and I am trying to figure out how to retrieve data about the text, such as its font (bold, size, etc.) and its location on the page.

Any suggestions?

Answer

Mark Storer · Jun 22, 2011

After poking around the (hard to find) PDFBox docs, I found this little gem.

Apparently one of the examples shows exactly how to do everything you asked. Basically, you subclass PDFTextStripper and override the processTextPosition method. There, you query the TextPosition for whatever information you need.
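For illustration, here is a minimal sketch of that approach. It is not the example from the PDFBox repo. It assumes the PDFBox 2.x package layout (org.apache.pdfbox.text; in 1.8 these classes live in org.apache.pdfbox.util). The names TextPositionCollector, Glyph, collect and looksBold are made up for this sketch. PDF has no bold flag on text, so the bold check is only a guess based on the font name.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical helper (not part of PDFBox): records font and position data
// for every TextPosition the stripper encounters.
class TextPositionCollector extends PDFTextStripper {

    /** One piece of text plus the data read from its TextPosition. */
    static final class Glyph {
        final String text;
        final String fontName;   // may be null if the font has no name
        final float fontSizePt;
        final float x;
        final float y;
        final boolean looksBold; // heuristic only, see looksBold()

        Glyph(String text, String fontName, float fontSizePt, float x, float y) {
            this.text = text;
            this.fontName = fontName;
            this.fontSizePt = fontSizePt;
            this.x = x;
            this.y = y;
            this.looksBold = looksBold(fontName);
        }
    }

    private final List<Glyph> glyphs = new ArrayList<>();

    TextPositionCollector() throws IOException {
        super();
    }

    // Assumption: bold fonts usually carry "Bold" in their name
    // (e.g. "Helvetica-Bold"). This can miss fonts or misfire.
    static boolean looksBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        // Keep the stripper's normal behaviour so getText() still returns text.
        super.processTextPosition(text);
        String fontName = text.getFont() == null ? null : text.getFont().getName();
        glyphs.add(new Glyph(
                text.getUnicode(),
                fontName,
                text.getFontSizeInPt(),
                text.getXDirAdj(),
                text.getYDirAdj()));
    }

    List<Glyph> getGlyphs() {
        return glyphs;
    }

    /** Runs the stripper over an already opened document and returns what it saw. */
    static List<Glyph> collect(PDDocument document) throws IOException {
        TextPositionCollector collector = new TextPositionCollector();
        collector.getText(document); // drives the processTextPosition callbacks
        return collector.getGlyphs();
    }
}
```

To use it, open the document with whatever loader your PDFBox version provides (PDDocument.load in 2.x, Loader.loadPDF in 3.x), pass it to collect, and read the fields of each Glyph.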

For future reference, you can find the javaDoc here: http://pdfbox.apache.org/apidocs/index.html

Edit 2018-04-02: original link is dead, but example can be found in the SVN repo here.