PDF to tiff ImageMagick problem

clumpter · May 14, 2011 · Viewed 9.7k times

I'm trying to convert PDFs to TIFF images for subsequent OCR. I use "-density 300x300 -depth 8" as parameters. The first problem is that a 500 KB PDF file produces a 72 MB TIFF file. The second problem is the poor quality of the resulting image, which causes the OCR to fail. You can see it for yourself.

TIFF image generated (printed) by Adobe Acrobat Reader: [image]

ImageMagick TIFF image: [image]

The difference is huge. How can I get an image as good as the Adobe-generated one using ImageMagick? It doesn't have to be TIFF; other formats would be fine too.

UPD: I've found the 'antialias' option. Now it's much better, but the OCR result is still not as accurate as with the Adobe version.

Answer

Kurt Pfeifle · May 15, 2011

My suggestion: use a Ghostscript command line directly, because ImageMagick uses Ghostscript in the background anyway (in IM terminology, Ghostscript is a "delegate" for some conversions, such as PDF->TIFF).

Here is a commandline that should work well for letter-sized pages of a multi-page PDF file:

gswin32c.exe ^
   -o page_%03d.tif ^
   -sDEVICE=tiffg4 ^
   -r720x720 ^
   -g6120x7920 ^
    input.pdf

The -g... parameter sets the absolute width and height of the output pages in device pixels. At 720 dpi, 6120x7920 pixels is 8.5 x 11 inches (6120/720 = 8.5 and 7920/720 = 11), which is letter size.
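If you need a different resolution or page size, the -g value has to be recomputed. The following is only a small sketch of that arithmetic in Python; the helper names device_pixels and geometry_flag are made up for illustration and are not part of Ghostscript.

```python
from fractions import Fraction


def device_pixels(width_in, height_in, dpi):
    """Return (width, height) in device pixels for a page size given in inches."""
    # Fraction keeps the arithmetic exact, so 8.5 in * 720 dpi is exactly 6120.
    width = Fraction(str(width_in)) * dpi
    height = Fraction(str(height_in)) * dpi
    return round(width), round(height)


def geometry_flag(width_in, height_in, dpi):
    """Format the pixel size as a Ghostscript -g switch (hypothetical helper)."""
    width, height = device_pixels(width_in, height_in, dpi)
    return f"-g{width}x{height}"


print(geometry_flag(8.5, 11, 720))  # -g6120x7920
print(geometry_flag(8.5, 11, 300))  # -g2550x3300
```

The value passed to -r and the value used to compute -g must match, otherwise the page content will not fill the output page as intended.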

These TIFF pages...

  1. ...will be black+white,
  2. ...will have a resolution of 720dpi,
  3. ...will be G4-compressed and
  4. ...will be much smaller than your uncompressed 300 dpi output from the IM command line.
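To get a feel for the sizes involved, here is a rough back-of-the-envelope calculation in Python. It assumes letter-sized pages and looks only at the raw raster size per page; the question does not say how many pages the PDF has or whether the IM output was gray or RGB, so treat the figures as an illustration rather than an explanation of the exact 72 MB.

```python
def raw_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed raster size of one page in bytes."""
    return width_px * height_px * bits_per_pixel // 8


# Letter page at 720 dpi, 1 bit per pixel (what tiffg4 holds before G4 compression).
bilevel_720 = raw_bytes(6120, 7920, 1)
# Letter page at 300 dpi, 8-bit gray and 8-bit-per-channel RGB, uncompressed.
gray_300 = raw_bytes(2550, 3300, 8)
rgb_300 = raw_bytes(2550, 3300, 24)

print(f"{bilevel_720:,}")  # 6,058,800
print(f"{gray_300:,}")     # 8,415,000
print(f"{rgb_300:,}")      # 25,245,000
```

So even before compression, a 1-bit page at 720 dpi is smaller than an 8-bit page at 300 dpi, and G4 compression typically shrinks a text page a great deal further.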

Your IM parameter -depth 8 isn't well suited to later OCR: it creates shades of gray around the letters, which don't help recognition.

Your OCR results should now be much better than before.

If your OCR can't handle TIFF G4 format (which I doubt), then you could generate other TIFF subformats with the help of Ghostscript. For example:

gswin32c.exe ^
   -o page_%03d.tif ^
   -sDEVICE=tiffgray ^
   -r720x720 ^
   -g6120x7920 ^
   -sCompression=lzw ^
    input.pdf

Or, for RGB color output:

gswin32c.exe ^
   -o page_%03d.tif ^
   -sDEVICE=tiff24nc ^
   -r720x720 ^
   -g6120x7920 ^
   -sCompression=lzw ^
    input.pdf

The tiffgray device creates 8-bit gray output. The tiff24nc device creates RGB color output with 8 bits per channel (24 bits per pixel). Both types of TIFF will of course be bigger than the tiffg4 output.
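If you want to drive these conversions from a script, the three command lines above differ only in the device and the optional compression switch. Here is a minimal Python sketch that assembles them; the function names gs_tiff_command and convert and the output file patterns are my own inventions, and gswin32c.exe is assumed to be on the PATH (on Linux or macOS the executable is usually called gs).

```python
import subprocess


def gs_tiff_command(pdf, device="tiffg4", compression=None,
                    output="page_%03d.tif", dpi=720,
                    width_px=6120, height_px=7920,
                    gs_exe="gswin32c.exe"):
    """Build the Ghostscript argument list for a PDF -> TIFF conversion."""
    cmd = [
        gs_exe,
        "-o", output,                     # %03d is replaced by the page number
        f"-sDEVICE={device}",
        f"-r{dpi}x{dpi}",
        f"-g{width_px}x{height_px}",      # page size in device pixels
    ]
    if compression:
        cmd.append(f"-sCompression={compression}")
    cmd.append(pdf)
    return cmd


def convert(pdf, **options):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(gs_tiff_command(pdf, **options), check=True)


# Show the three variants from this answer without actually running them.
for device, compression in [("tiffg4", None),
                            ("tiffgray", "lzw"),
                            ("tiff24nc", "lzw")]:
    print(" ".join(gs_tiff_command("input.pdf", device=device,
                                   compression=compression,
                                   output=f"{device}_%03d.tif")))
```

Using a separate output pattern per device keeps one run from overwriting the pages of another. To actually convert, call for example convert("input.pdf", device="tiffgray", compression="lzw").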