Convert scanned PDF to .txt files using Tesseract

Ganesh Nannaware · Jan 31, 2014 · Viewed 16.4k times

I have to convert a .pdf file containing scanned images into .txt files. Tesseract OCR only converts images to .txt, so I first need to extract the pages as .tif images and then convert them. Can anyone help me with this?

Answer

Karol S · Jan 31, 2014

Use ImageMagick to render the PDF as a TIFF that Tesseract can read:

convert -density 600 input.pdf output.tif

Density is in DPI; in my experience, 600 DPI works best.
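
To go all the way to a .txt file, the conversion can be followed by a Tesseract call on the resulting TIFF. What follows is only a minimal sketch, assuming both `convert` and `tesseract` are on the PATH. The helper names `ocr_basename` and `pdf_to_txt` are made up for illustration and are not part of either tool.

```shell
#!/bin/sh
# Sketch: render a scanned PDF to TIFF with ImageMagick, then OCR it with Tesseract.
# Assumption: `convert` (ImageMagick) and `tesseract` are installed and on PATH.

# Hypothetical helper: turn "dir/input.pdf" into the base name "input".
ocr_basename() {
    base=${1##*/}                  # drop any leading directory
    printf '%s\n' "${base%.pdf}"   # drop the .pdf extension
}

# Hypothetical helper: PDF -> TIFF -> text, written to the current directory.
pdf_to_txt() {
    pdf=$1
    out=$(ocr_basename "$pdf")
    # Same command as in the answer: 600 DPI rendering of the PDF pages.
    convert -density 600 "$pdf" "$out.tif" || return 1
    # Tesseract takes an image and an output base; it adds ".txt" itself.
    tesseract "$out.tif" "$out"
}

# Usage (after sourcing this file):
#   pdf_to_txt input.pdf
```

Tesseract appends .txt to the output base on its own, so `pdf_to_txt input.pdf` should leave input.tif and input.txt in the current directory.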