How to extract the contents of a table in pdf file?

Bertrand picture Bertrand · Jul 10, 2015 · Viewed 10.8k times · Source

I want to extract the contents of a table in pdf like like this :

enter image description here

i wrote this java programme using iText java PDF libray which can read the contents of a PDF file line by line, but I do not know how to get the contents of table

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class PDFReader {

    public static void main(String[] args) {

        // TODO, add your application code
        System.out.println("Lecteur PDF");
        System.out.println (ReadPDF("D:/test.pdf"));
    }
        private static String ReadPDF(String pdf_url)
    {
        StringBuilder str=new StringBuilder();
        try
        {

         PdfReader reader = new PdfReader(pdf_url);
        int n = reader.getNumberOfPages();
         for(int i=1;i<n;i++)
         {
            String str2=PdfTextExtractor.getTextFromPage(reader, i);
            str.append(str2);
           System.out.println(str);
         }
        }catch(Exception err)
        {
            err.printStackTrace();
        }
        return String.format("%s", str);
    }
}

this is what I get :

enter image description here

but that's not what I want, I want to extract the contents of the table line by line and column by column, for example, save each line in an java array

the first array will contain : "N°", "DATE OBSERVATIONS", "TEXTE"

the second array will contain : "029/14", "Le 1er sept 2014 remplace AVURNAV...", "SETE A compter du lundi 7 juillet 2014 débuteront les trav..."

the third array will contain : "037/14", "Le 15 octobre 2014 remplace AVURNAV ...", "SETE Du 15 septembre 2014 au 15 juillet 2015, travaux ...."

and so on

Thanks

Answer

3-14159265358979323846264 picture 3-14159265358979323846264 · Jul 10, 2015

You may have to identify common field beginning/end character sequences to split your data into an array if your PDF library doesn't support extracting tables. For instance the first fields is nnn/nn, the second field ends nnnn/nn and the third field ends where the next first field begins.

This is a tricky problem - I have had to use coordinate based approaches to deal with this before, but your pdf library may not support extracting the position of letters as well as the actual text.