batch convert and crop postscript to pdf

PatrickT · Jan 3, 2012

I know barely enough to survive in this digital world.

I have many one-page postscript files (graphs/images) that I wish to convert to pdf and automatically crop to a narrow box. I'm on Windows right now (I do use Linux too, so don't hesitate to post code for Linux).

I have in the past been successful by combining Ghostscript gswin32c.exe and Calibre pdfmanipulate.exe. This is probably a familiar approach to many here.

But this approach has become fraught with problems, for several reasons.

One problem arose after I "upgraded" to the 64 bit gswin64c.exe. The 32 bit version gswin32c.exe still works on my system though, so I can't complain too much.

Another problem arose while dealing with postscript files that are perhaps improperly coded. There seem to be at least two issues, but I'm not sure which, if any, is responsible, or whether both are. One is that the bounding box line, e.g. %%BoundingBox: 135 179 484 587, is not always placed on the second line from the top. I understand that can be an issue. Another is that the bounding box above corresponds to a "Portrait" orientation in Ghostscript, but the cropping follows the "Landscape" orientation. Yet another problem, which I have not identified, is that for some files the cropping seems quite random.

So here is my 32bit approach (which works for high quality files), followed by the 64bit adaptation which doesn't work (perhaps because it calls some pypdf script on my machine rather than the patched script provided by calibre, if I understand https://bugs.launchpad.net/ubuntu/+source/calibre/+bug/800551 and http://www.mobileread.com/forums/archive/index.php/t-103097.html, but I'm just guessing and don't know a workaround anyhow):

@echo off
echo batch processing with Latex ps2pdf followed by Ghostscript gswin64c.exe and Calibre2 pdfmanipulate.exe
for %%I in (*.ps,*.eps) do (
"C:\Program Files\MiKTeX 2.9\miktex\bin\x64\ps2pdf" %%I
)
for %%I in (*.pdf) do (
"C:\Program Files (x86)\Ghostscript\gs9.00\bin\gswin32c.exe" -dSAFER -dNOPAUSE -dBATCH -sDEVICE#bbox "%%I" 2> bounding
"C:\Program Files (x86)\Calibre2\pdfmanipulate.exe" crop -o "%%~nICropped32.pdf" -b bounding "%%I"
pause
"C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -sDEVICE#bbox "%%I" 2> bounding
"C:\Program Files (x86)\Calibre2\pdfmanipulate.exe" crop -o "%%~nICropped64.pdf" -b bounding "%%I"
pause
)

The above 32 bit approach works on high quality files, e.g. Postscript level 3 produced by PSTricks or by Maple's standard 2D plot driver, but not on older files, e.g. Postscript level 2 (if that) produced by Maple's classic plot driver.

I have found a workaround for some such files. It consists in using epstopdf from the (MiKTeX) LaTeX distribution. It works on those Maple classic files. Unfortunately it doesn't work on some other postscript files I generated several years ago with PSTricks and other software like Matlab.

And so I need to make several transformations and select the ones that worked. I wonder if you would have suggestions that would make my life easier. If I can fix the BoundingBox and Portrait/Landscape issues I should be quite content.

I thank you in advance for any suggestions. A Linux suggestion would be acceptable. My preference will go to a solution that can handle the diversity of files in one single push of the "return" key.

And of course I'm looking for a lossless type of cropping, one that consists only in interpreting the bounding box, but not in transforming it into a (possibly) lower quality pdf.

EDIT: I forgot to say. When I apply gswin32c/pdfmanipulate to a high quality level 3 postscript file, the file named "bounding" fills with information like:

%%BoundingBox: 34 128 567 667
%%HiResBoundingBox: 34.364390 128.875004 566.054069 666.071980

In the example above, the file was already pretty much cropped. Note the closeness between %%BoundingBox and %%HiResBoundingBox

but applied to a low quality level 2 (or so it claims to be) postscript file, the "bounding" file fills with :

%%BoundingBox: 189 137 574 467
%%HiResBoundingBox: 189.485994 137.843996 573.299983 466.668478

but the bounding box really ought to be

%%BoundingBox: 135 179 484 587

The above (135 179 484 587) is the bounding box provided by the postscript file itself (which I moved to the second line by copy-pasting), and it is consistent with the bounding box interpreted by Ghostview/Ghostscript when in the Portrait orientation.

But it gets completely ignored by Ghostscript...

I don't know where the 189 137 574 467 comes from --- it's very wrong...

EDIT 2. I'd like to clarify a few points in response to Ken's questions. Ken's remarks are quoted in turn, each followed by my reply:

Hi Ken, thanks for your reply,

sorry if my question was unclear --- nevertheless you seem to have understood the gist of it --- let me take your questions in turn:

I'm unsure why you are using 2 applications, it should be possible to perform the entire transformation with just Ghostscript.

I didn't find a way to do it all with Ghostscript so I used another way. I found the Ghostscript/Calibre suggestion here, http://www.mobileread.com/forums/archive/index.php/t-72885.html, and elsewhere, tried it, and it worked until recently.

I'm not saying it's not possible to do it all with Ghostscript, I'm merely saying that I didn't find a way to.

"One problem arose after I "upgraded" to the 64 bit gswin64c.exe" You haven't said what the problem was, have you reported it as a bug ? If people don't report bugs, they don't get fixed......

I gave the links describing the problem and the bug report, here: https://bugs.launchpad.net/ubuntu/+source/calibre/+bug/800551, http://www.mobileread.com/forums/archive/index.php/t-103097.html, my problem is the exact same one.

You seem to have some confusion between PostScript programs and comments. Any line in a PostScript program beginning '%' is a comment, and has no effect on the operation of the program. So BoundingBox comments won't do anything at all.

I beg to differ, if I may. Take a postscript file, remove the %%BoundingBox line, save, and open it in Ghostview. Ghostview throws up error messages and then displays the file without using the bounding box information, e.g. a figure surrounded by a lot of white space instead of tightly surrounded by the bounding box. So yes, this comment does something, within Ghostview at least. Having removed the %%BoundingBox line, if you then use Calibre/pdfmanipulate to crop the pdf, it will crop it wrongly in cases where having the %%BoundingBox would have worked. So this "comment" is quite useful in the context of displaying and cropping.

Note there is no requirement for it to be the second line of the file.....

It is recommended by Adobe. Quoting from Adobe,

"The second required DSC header comment provides information about the size of the EPS file and must be present so the including application can transform and clip the EPS file properly. This is the bounding box comment."

http://partners.adobe.com/public/developer/en/ps/5002.EPSF_Spec.pdf

Adobe say "must." Personally I couldn't care less if it's a must or not, as long as I can produce pdf from my eps that are properly bounded.

In general Ghostscript ignores DSC comments, however if you set ProcessDSC to true, then it will make very limited use of it (primarily the BoundingBox comment to set the page size).

with pdfmanipulate it makes all the difference between a properly cropped pdf and an improperly cropped one.

Moving on. You say you are using LaTeX ps2pdf, if you already have a PostScript file, you can send that to Ghostscript for conversion to PDF. It's not clear to me what exactly you are using Ghostscript for in this case, simply to find the real bounding box of the page ?

yes.

It's not clear to me what you mean by 'lossless' cropping, if you crop the content you must be losing something clearly, even if it's just white space.....

I mean that I don't want the cropping process to "rasterize" (or whatever it's called, you will know the term) the whole image. The part of the file that is cropped out is not useful to me so it's not much of a loss. The part of the file that is within the crop should be of the same quality as the original. That's the general idea.

You can find comments about this here, which is one place where I found useful information, http://www.charlietanksley.net/philtex/reading-pdfs-on-portables/

It's easy enough to do the conversion in one pass if you know the size you want to crop to,

No, I don't know the size. That's why I'm going to such lengths to have software calculate it for me, and it's obviously not a simple thing, because Ghostscript and epstopdf don't always agree on the optimal crop: each gets it right for some files but not for others.

if you don't know the size then you can do it in 2 passes using only Ghostscript by first extracting the BoundingBox as you have done. That will get you 4 numbers, the bottom left and top right of the bounding box (if I remember correctly). You then create a 'translate' PostScript operation to move the content of the page down and left (so that it starts at 0,0, the bottom left corner). You also create a page device request to set the page size, the size being given by width = right - left and height = top - bottom. Feed the original file, along with the PostScript operators, to Ghostscript and select the pdfwrite device and you will get a PDF file.

A batch file example would be great, if you have one handy. I have seen several examples based on pdfwrite and none that I've tried have worked. The devil is in the detail.
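
For reference, the two-pass, Ghostscript-only idea described above can be sketched in sh roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, GS stands for whichever Ghostscript executable is installed, and the shift is done with a PageOffset page device request in place of an explicit 'translate', since that is the form I am more confident survives files that reset the graphics state.

```shell
#!/bin/sh
# Sketch of a two-pass crop using only Ghostscript (names are illustrative).
# GS is the Ghostscript executable; override it for gswin32c/gswin64c.
GS=${GS:-gs}

# Pass 1: ask the bbox device for the extent of the marks on the page.
# The bbox device reports on stderr, hence the 2>&1.
measured_bbox() {
    "$GS" -q -dSAFER -dNOPAUSE -dBATCH -sDEVICE=bbox "$1" 2>&1 |
        awk '/^%%BoundingBox:/ { print $2, $3, $4, $5; exit }'
}

# Turn "llx lly urx ury" into "width height xoffset yoffset":
# width = right - left, height = top - bottom, offsets move (llx, lly) to (0, 0).
bbox_geometry() {
    echo "$(($3 - $1)) $(($4 - $2)) $((0 - $1)) $((0 - $2))"
}

# Pass 2: rewrite the file with pdfwrite on a page of exactly that size.
# The body runs in a subshell so the variables stay local.
crop_to_pdf() (
    in=$1
    out=${in%.*}-cropped.pdf
    set -- $(measured_bbox "$in")
    [ $# -eq 4 ] || { echo "no bounding box found for $in" >&2; exit 1; }
    set -- $(bbox_geometry "$@")
    "$GS" -q -dSAFER -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
        -dDEVICEWIDTHPOINTS="$1" -dDEVICEHEIGHTPOINTS="$2" -dFIXEDMEDIA \
        -sOutputFile="$out" \
        -c "<</PageOffset [$3 $4]>> setpagedevice" \
        -f "$in"
)
```

It would be called once per file, e.g. for f in *.ps *.eps; do [ -e "$f" ] && crop_to_pdf "$f"; done. Because the second pass goes through pdfwrite, the content should stay vector rather than being rasterized, although I have not checked this against the problematic files.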

As far as the bounding box goes, it may be a bug, or it may be that the file makes a mark, potentially using a white ink at the outside location. In this case the bounding box device will still regard it as part of the page content. You may be able to see that it isn't, but the device cannot. Consider if the page was first filled with a dark background, and the content outlined using white ink.

The files were all created with software such as Matlab, Maple, PSTricks and it's unlikely (but obviously not impossible) that there would be invisible white marks outside of the area given by the %%BoundingBox.

In many cases, the %%BoundingBox comment contains all the information that is needed and I'd like Ghostscript or Calibre or pdfwrite or whatever to use that information.
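
For files that Ghostscript recognises as EPS, the -dEPSCrop switch is meant to do exactly this: it sizes the output page from the file's own %%BoundingBox comment instead of measuring the marks. A minimal sketch, assuming the input really carries an EPS header (as far as I know the switch is ignored for plain PostScript) and using a made-up function name:

```shell
#!/bin/sh
# GS is the Ghostscript executable; override it for gswin32c/gswin64c.
GS=${GS:-gs}

# Convert one EPS file to PDF, cropping the page to the file's own
# %%BoundingBox comment via -dEPSCrop (no separate measuring pass).
# The output is written next to the input, with a .pdf extension.
eps_to_cropped_pdf() {
    "$GS" -q -dSAFER -dNOPAUSE -dBATCH -dEPSCrop -sDEVICE=pdfwrite \
        -sOutputFile="${1%.*}.pdf" "$1"
}
```

Since this trusts the declared box, a wrong %%BoundingBox in the file gives a wrong crop, so it only helps for the files where the comment is known to be good.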

I cannot offer a comprehensive solution without understanding more about what you want to do, and ideally seeing one or more of your problematic files.

That would be very easy, how can I post a postscript file for your viewing? It's 420 kilobytes.

Thanks Ken, let's hope we can find a workable solution.

EDIT 3. I have identified a big part of the problem.

My postscript file has the following bounding box, pretty close to an optimal crop: %%BoundingBox: 135 179 484 587

When I run Ghostscript gswin64c/gswin32c to compute the bounding box, viz

for %%I in (*.ps,*.eps) do ("C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -dAutoRotatePages=/None -sDEVICE#bbox "%%I" 2> bounding)

I get:

%%BoundingBox: 145 189 475 574
%%HiResBoundingBox: 145.331574 189.485994 474.155986 573.299983

When I run ps2pdf followed by Ghostscript gswin64c, i.e.

for %%I in (*.ps,*.eps) do ("C:\Program Files\MiKTeX 2.9\miktex\bin\x64\ps2pdf" %%I)
for %%I in (*.pdf) do ("C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -dAutoRotatePages=/None -sDEVICE#bbox "%%I" 2> bounding)

I get the following bounding box:

%%BoundingBox: 189 137 574 467
%%HiResBoundingBox: 189.395994 137.843996 573.299983 466.668478

So the problem is that the conversion from ps to pdf with ps2pdf introduces a change in the bounding box information, which results in incorrect cropping. Replacing ps2pdf with something else, like epstopdf, solves the problem here. Of course there are other solutions. Particularly valuable are solutions involving Ghostscript only, as suggested by Ken and luser droog. Their very valuable (and superior to my quick fix) suggestions are below. Something like this has worked:

for %%I in (*.eps,*.ps) do ("C:\Program Files\MiKTeX 2.9\miktex\bin\x64\epstopdf" %%I)
for %%I in (*.pdf) do (
"C:\Program Files\Ghostscript\gs9.04\bin\gswin64c.exe" -dSAFER -dNOPAUSE -dBATCH -dAutoRotatePages=/None -sDEVICE#bbox "%%I" 2> bounding
"C:\Program Files (x86)\Calibre2\pdfmanipulate.exe" crop -o "%%~nICropped.pdf" -b bounding "%%I"
)
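
A side note on the numbers in EDIT 3: the box measured after ps2pdf (189 137 574 467) is exactly the box measured on the original file (145 189 475 574) turned through 90 degrees on a page 612 points wide. That suggests ps2pdf rotated the page rather than corrupting the box. The sketch below only replays that arithmetic; the 612-point width (US Letter) and the name rotate_bbox are assumptions for illustration.

```shell
#!/bin/sh
# Map a bounding box "llx lly urx ury" through a 90-degree page rotation on a
# page PAGE_W points wide. 612 (US Letter width) is an assumption, not a value
# read from the files themselves.
PAGE_W=${PAGE_W:-612}

rotate_bbox() {
    echo "$2 $((PAGE_W - $3)) $4 $((PAGE_W - $1))"
}

rotate_bbox 145 189 475 574   # prints: 189 137 574 467
```

If rotation is indeed the cause, note that -dAutoRotatePages=/None is an option of the pdfwrite device, so it would presumably have to be given to the ps2pdf step; on the bbox call it should have no effect.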

Answer

luser droog · Jan 4, 2012

If simply enforcing the BoundingBox comment will do what you want, you can replace the first call to Ghostscript with a simple text scan.

Here's the sh version of the script above (can't stand those Windows pathnames!)

for i in *.pdf ; 
do 
    gs -dSAFER -dNOPAUSE -dBATCH -sDEVICE=bbox "$i" 2> bounding ; 
    pdfmanipulate crop -o "${i%.pdf}-cropped.pdf" -b bounding "$i" ; 
done

And you can modify it to use grep like this:

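for i in *.pdf ; 
do 
    # the %%BoundingBox comment is in the PostScript source, not in the PDF,
    # so read it from the matching .ps file (assumed to share the base name)
    grep -m 1 '^%%BoundingBox:' "${i%.pdf}.ps" > bounding ; 
    pdfmanipulate crop -o "${i%.pdf}-cropped.pdf" -b bounding "$i" ; 
done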

If I were trying to do this on Windows, I would install Cygwin and use the same script.
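
A slightly more defensive text scan, as a sketch: it assumes the comment is read from the PostScript source, accepts the line wherever it sits in the file, skips a "%%BoundingBox: (atend)" placeholder in the header, and tolerates DOS line endings. The function name is made up for illustration.

```shell
#!/bin/sh
# Print the first %%BoundingBox comment that actually carries numbers,
# wherever it sits in the file. "(atend)" placeholders and
# %%HiResBoundingBox lines are skipped; CR characters are stripped first.
declared_bbox_line() {
    tr -d '\r' < "$1" |
        LC_ALL=C awk '/^%%BoundingBox:/ && $2 != "(atend)" { print; exit }'
}
```

In the loop above it would be used as declared_bbox_line "${i%.pdf}.ps" > bounding, again assuming each PDF sits next to a .ps source with the same base name.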