I want to extract the contents of a table in pdf like like this :
i wrote this java programme using iText java PDF libray which can read the contents of a PDF file line by line, but I do not know how to get the contents of table
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
public class PDFReader {
public static void main(String[] args) {
// TODO, add your application code
System.out.println("Lecteur PDF");
System.out.println (ReadPDF("D:/test.pdf"));
}
private static String ReadPDF(String pdf_url)
{
StringBuilder str=new StringBuilder();
try
{
PdfReader reader = new PdfReader(pdf_url);
int n = reader.getNumberOfPages();
for(int i=1;i<n;i++)
{
String str2=PdfTextExtractor.getTextFromPage(reader, i);
str.append(str2);
System.out.println(str);
}
}catch(Exception err)
{
err.printStackTrace();
}
return String.format("%s", str);
}
}
this is what I get :
but that's not what I want, I want to extract the contents of the table line by line and column by column, for example, save each line in an java array
the first array will contain : "N°", "DATE OBSERVATIONS", "TEXTE"
the second array will contain : "029/14", "Le 1er sept 2014 remplace AVURNAV...", "SETE A compter du lundi 7 juillet 2014 débuteront les trav..."
the third array will contain : "037/14", "Le 15 octobre 2014 remplace AVURNAV ...", "SETE Du 15 septembre 2014 au 15 juillet 2015, travaux ...."
and so on
Thanks
You may have to identify common field beginning/end character sequences to split your data into an array if your PDF library doesn't support extracting tables.
For instance the first fields is nnn/nn
, the second field ends nnnn/nn
and the third field ends where the next first field begins.
This is a tricky problem - I have had to use coordinate based approaches to deal with this before, but your pdf library may not support extracting the position of letters as well as the actual text.