Can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the page.
We would like that data to be output in XML or JSON format. We're currently looking at PdfTextStream, which seems pretty good, but would like to hear other people's experiences and suggestions.
Are there alternatives (commercial or free) for extracting text from a PDF programmatically?
I was given a 400-page PDF file with a table of data that I had to import (luckily no images). Ghostscript worked for me:
gswin64c -sDEVICE=txtwrite -o output.txt input.pdf
The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines, etc., and suck in all 30,000 records. -dSIMPLE and -dCOMPLEX made no difference in this case.
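For anyone wanting a starting point for the clean-up step, here is a rough Python sketch of the kind of post-processing described above. It is not the app I actually used. The function name, the header regex, and the assumption that table columns are separated by runs of two or more spaces are all illustrative guesses, so adjust them to whatever your output.txt really looks like.

```python
import re


def clean_txtwrite_output(text, header_pattern=None):
    """Turn extracted plain text into a list of rows (lists of column strings).

    Assumptions (hypothetical, adapt to your file):
      - blank lines carry no data and can be dropped
      - page header lines, if any, match `header_pattern`
      - columns within a row are separated by two or more whitespace characters
    """
    header_re = re.compile(header_pattern) if header_pattern else None
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if header_re and header_re.search(stripped):
            continue  # skip repeated page headers
        # Split on runs of 2+ whitespace characters so single spaces
        # inside a value (e.g. "New York") are preserved.
        rows.append(re.split(r"\s{2,}", stripped))
    return rows
```

You would read output.txt into a string, pass it in along with a regex that matches your page headers, and then feed the resulting rows to whatever does the import.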