Parsing PDF files (especially with tables) with PDFBox

Matheus Moreira · Jul 8, 2010 · Viewed 101.1k times

I need to parse a PDF file that contains tabular data. I'm using PDFBox to extract the file's text as a String, which I then parse. The problem is that text extraction doesn't work as I expected for tabular data. For example, I have a file that contains a table like this (7 columns: the first two always have data, and in each row exactly one Complexity column and exactly one Financing column has data):

+----------------------------------------------------------------+
| AIH | Value | Complexity                     | Financing       |
|     |       | Medium | High | Not applicable | MAC/Other | FAE |
+----------------------------------------------------------------+
| xyz | 12.43 | 12.43  |      |                | 12.43     |     |
+----------------------------------------------------------------+
| abc | 1.56  |        | 1.56 |                |           | 1.56|
+----------------------------------------------------------------+

Then I use PDFBox:

// Load the PDF, extract all of its text as one String, then release the file.
PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
document.close();

Those two lines of data would be extracted like this:

xyz 12.43 12.4312.43
abc 1.56 1.561.56

There is no whitespace between the last two numbers, but that is not the biggest problem. The problem is that I don't know what the last two numbers mean: Medium, High, Not applicable? MAC/Other, FAE? I have no way to relate the numbers to their columns.

It is not required for me to use the PDFBox library, so a solution that uses another library is fine. What I want is to be able to parse the file and know what each parsed number means.

Answer

purecharger · Aug 12, 2010

You will need to devise an algorithm to extract the data in a usable format, and you will need to do this regardless of which PDF library you use. Characters and graphics are drawn by a series of stateful drawing operations, i.e. move to this position on the page and draw the glyph for character 'c'.
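To make the "stateful drawing operations" point concrete, here is a toy model in plain Java. It is only a sketch: ToyTextStream, moveTo and show are hypothetical names, and this is not the real PDF operator set or any PDFBox API. It shows that the stream only ever records "this glyph at this position".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a stateful text-drawing stream; NOT the real PDF operator set. */
final class ToyTextStream {
    private double x;                          // current position (the state)
    private double y;
    private final List<String> placed = new ArrayList<>();

    /** "Move": changes the state only, draws nothing. */
    ToyTextStream moveTo(double newX, double newY) {
        x = newX;
        y = newY;
        return this;
    }

    /** "Show": draws one glyph at the current position, then advances by its width. */
    ToyTextStream show(char c, double width) {
        placed.add(c + "@(" + x + "," + y + ")");
        x += width;
        return this;
    }

    /** Everything a text extractor gets to see: glyphs with positions. */
    List<String> placed() {
        return placed;
    }
}
```

Nothing in the recorded output says "cell" or "column"; that structure has to be inferred from the positions.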

I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions of your table. Then it's a simple matter of setting up text regions and determining which numbers/letters/characters are drawn in which region. Since you know the layout of the regions, you'll be able to tell which column the extracted text belongs to.
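The region-matching step can be sketched in plain Java, independent of PDFBox. This is a minimal sketch under assumptions: you have already collected the x coordinates of the table's vertical line segments (for example from your strokePath override) and you know the x position of each piece of text. ColumnLocator and its methods are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper: maps an x position to a table column. */
final class ColumnLocator {
    private final double[] boundaries;   // sorted x positions of the vertical rules

    /**
     * @param verticalLineXs x coordinates of the vertical segments that were stroked
     * @param tolerance      x values closer than this are treated as the same rule
     */
    ColumnLocator(List<Double> verticalLineXs, double tolerance) {
        // The same rule is often stroked once per row, so merge near-duplicates.
        TreeSet<Double> sorted = new TreeSet<>(verticalLineXs);
        List<Double> merged = new ArrayList<>();
        for (double x : sorted) {
            if (merged.isEmpty() || x - merged.get(merged.size() - 1) > tolerance) {
                merged.add(x);
            }
        }
        boundaries = new double[merged.size()];
        for (int i = 0; i < boundaries.length; i++) {
            boundaries[i] = merged.get(i);
        }
    }

    /** Number of columns, i.e. gaps between adjacent rules. */
    int columnCount() {
        return Math.max(0, boundaries.length - 1);
    }

    /** Zero-based column containing x, or -1 if x lies outside the table. */
    int columnOf(double x) {
        for (int i = 0; i + 1 < boundaries.length; i++) {
            if (x >= boundaries[i] && x < boundaries[i + 1]) {
                return i;
            }
        }
        return -1;
    }
}
```

For the table in the question, eight vertical rules give seven columns. With made-up rule positions 0, 40, 90, 150, 200, 310, 390 and 430, text drawn at x = 100 falls in column 2 (Medium) and text at x = 320 falls in column 5 (MAC/Other). The same idea applied to the y coordinates of horizontal segments gives the row.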

Also, the reason you may not get spaces between text that is visually separated is that, very often, the PDF does not draw a space character. Instead, the text matrix is updated and a 'move' command positions the next character a "space width" apart from the previous one.
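If you collect glyph positions yourself, the missing spaces can be rebuilt from the gaps. The following is a rough sketch, not what PDFBox does internally: Glyph and SpaceRebuilder are hypothetical names, the glyphs of a line are assumed to be sorted left to right, and the gapRatio threshold is an assumption you would need to tune for your documents.

```java
import java.util.List;

/** Hypothetical glyph record: the character, its left edge, and its advance width. */
final class Glyph {
    final char ch;
    final double x;
    final double width;

    Glyph(char ch, double x, double width) {
        this.ch = ch;
        this.x = x;
        this.width = width;
    }
}

final class SpaceRebuilder {
    /**
     * Joins the glyphs of one text line, inserting a space wherever the gap
     * to the previous glyph exceeds gapRatio * spaceWidth.
     */
    static String join(List<Glyph> line, double spaceWidth, double gapRatio) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                // Distance from the right edge of the previous glyph to this one.
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapRatio * spaceWidth) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }
}
```

A larger gapRatio makes the heuristic less eager to insert spaces, which helps with loosely kerned text but can glue narrow columns together.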

Good luck.