Are there any open source libraries that support table identification & extraction?
By this I mean:
I have looked through similar questions on this topic and found the following:
Currently, I am thinking that I would have to spend a lot of time developing a Machine Learning solution to identify table structures from PDFs. Therefore, any alternative approaches would be more than welcome!
After many fruitful hours of exploring OCR libraries, bounding boxes and clustering algorithms - I found a solution so simple it makes you want to cry!
I hope you are using Linux;
pdftotext -layout NAME_OF_PDF.pdf
AMAZING!!
Now you have a nice text file with all the information lined up in nice columns, now it is trivial to format into a csv etc..
It is for times like this that I love Linux, these guys came up with AMAZING solutions to everything, and put it there for FREE!