I'm trying to convert PDFs to TIFF images for subsequent OCR. I use "-density 300x300 -depth 8" as parameters. The first problem is that a 500 KB PDF file produces a 72 MB TIFF file. The second problem is the poor quality of the resulting image, which causes OCR to fail. You can see it for yourself here. TIFF image generated (printed) by Adobe Acrobat Reader:
ImageMagick TIFF image:
The difference is huge. How can I get an image as good as the Adobe-generated one using ImageMagick? It doesn't have to be TIFF; other formats would be fine too.
UPD: I've found the 'antialias' option. Now it's much better, but the OCR result is still not as accurate as with the Adobe version.
My suggestion is to use a Ghostscript command line, because ImageMagick uses Ghostscript in the background anyway (in IM terminology, Ghostscript is a "delegate" for some of the conversions, such as PDF->TIFF).
Here is a commandline that should work well for letter-sized pages of a multi-page PDF file:
gswin32c.exe ^
-o page_%03d.tif ^
-sDEVICE=tiffg4 ^
-r720x720 ^
-g6120x7920 ^
input.pdf
The -g... parameter controls the absolute width and height of the output pages in device pixels (6120x7920 at 720 dpi happens to be letter-sized, i.e. 8.5 x 11 inches).
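The relationship between page size, resolution and the -g value can be checked with a little arithmetic. Here is a minimal Python sketch; the function name device_size is made up for illustration and is not part of Ghostscript:

```python
def device_size(width_in, height_in, dpi):
    """Return (width, height) in device pixels for a page given in inches."""
    # Pixels = inches * dots per inch, rounded to the nearest whole pixel.
    return round(width_in * dpi), round(height_in * dpi)


# US letter is 8.5 x 11 inches.
w, h = device_size(8.5, 11, 720)
print(f"-r720x720 -g{w}x{h}")  # prints "-r720x720 -g6120x7920"
```

If you change the resolution, the -g value should be recomputed the same way, otherwise the page will be cropped or padded.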
These TIFF pages...
Your IM parameter of -depth 8 isn't well suited to later OCR, since it creates shades of gray around the letters, which don't help recognition.
Your OCR results should now be much better than before.
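If you want to drive the conversion from a script instead of typing the command, something like the following Python sketch might do. It assumes gswin32c.exe is on your PATH; the helper names build_gs_command and convert, and their default values, are my own invention for illustration, not anything Ghostscript provides:

```python
import subprocess


def build_gs_command(pdf_path, out_pattern="page_%03d.tif", device="tiffg4",
                     dpi=720, width_in=8.5, height_in=11.0,
                     gs_exe="gswin32c.exe"):
    """Build the Ghostscript argument list for a PDF -> TIFF conversion."""
    # -g expects device pixels: page size in inches times the resolution.
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return [
        gs_exe,
        "-o", out_pattern,        # one output file per page (%03d = page number)
        f"-sDEVICE={device}",
        f"-r{dpi}x{dpi}",
        f"-g{width_px}x{height_px}",
        pdf_path,
    ]


def convert(pdf_path, **options):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    # Passing a list (no shell) keeps the % in the output pattern untouched.
    subprocess.run(build_gs_command(pdf_path, **options), check=True)


# Only prints the command; call convert("input.pdf") to actually run it.
print(" ".join(build_gs_command("input.pdf")))
```

With the defaults this builds the same command as above, so convert("input.pdf") should write page_001.tif, page_002.tif and so on into the current directory.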
If your OCR can't handle TIFF G4 format (which I doubt), then you could generate other TIFF subformats with the help of Ghostscript. For example:
gswin32c.exe ^
-o page_%03d.tif ^
-sDEVICE=tiffgray ^
-r720x720 ^
-g6120x7920 ^
-sCompression=lzw ^
input.pdf
Or:
gswin32c.exe ^
-o page_%03d.tif ^
-sDEVICE=tiff24nc ^
-r720x720 ^
-g6120x7920 ^
-sCompression=lzw ^
input.pdf
The tiffgray device creates 8-bit gray output. The tiff24nc device creates 8-bit RGB color output. Both types of TIFF will of course be bigger than the tiffg4 output.
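To get a feeling for how much bigger, you can compare the raw pixel data per page for the three devices. This is only a rough Python sketch: it looks at uncompressed data for a 6120x7920 page (1 bit per pixel for tiffg4, 8 for tiffgray, 24 for tiff24nc), and the real file sizes depend on the page content and on how well G4 or LZW compresses it. The function name raw_size_bytes is just something I made up:

```python
def raw_size_bytes(width_px, height_px, bits_per_pixel):
    """Size of the uncompressed pixel data for one page, in bytes."""
    return width_px * height_px * bits_per_pixel // 8


# Bits per pixel produced by each Ghostscript TIFF device used above.
BITS_PER_PIXEL = {"tiffg4": 1, "tiffgray": 8, "tiff24nc": 24}

for device, bits in BITS_PER_PIXEL.items():
    size_mb = raw_size_bytes(6120, 7920, bits) / 1e6
    print(f"{device}: {size_mb:.1f} MB per page before compression")
```

That works out to roughly 6 MB, 48 MB and 145 MB of raw data per letter-sized page at 720 dpi, which is why the bilevel G4 output is the most compact starting point.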