Is there a proper library which I can use to convert PDF to HTML or some other format that can be converted to HTML easily?
I searched similar questions, but had no luck.
I want to be able to extract text from PDFs, and possibly images. I'm not looking to embed the PDF inside the HTML.
If you're on Linux, try pdftohtml:
sudo apt-get install poppler-utils
pdftohtml -enc UTF-8 -noframes infile.pdf outfile.html
On macOS (with Homebrew), pdftohtml can be installed with:
brew install pdftohtml
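If you have more than one file to convert, the pdftohtml command shown earlier can be wrapped in a small loop. This is only a rough sketch: the function names html_name and convert_all are made up for illustration, and it assumes pdftohtml is already on your PATH and that each output file should sit next to its PDF.

```shell
#!/bin/sh
# Hypothetical helper: derive the output name by swapping .pdf for .html.
html_name() {
    printf '%s\n' "${1%.pdf}.html"
}

# Hypothetical helper: convert every PDF in a directory
# (defaults to the current directory).
convert_all() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        # Same flags as the single-file command above.
        pdftohtml -enc UTF-8 -noframes "$pdf" "$(html_name "$pdf")"
    done
}
```

After sourcing the script, something like `convert_all ~/Documents/reports` should process that folder one file at a time.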
The open-source ebook converter Calibre can also convert PDF files to HTML and is available on macOS, Windows and Linux.