Correctly converting PDF to PS and vice versa

Andrei F · May 28, 2012

I'm using "pdftops" (from poppler-utils) to convert .pdf files to .ps files and then "ps2pdf" for the reverse process. The problem is that in the .pdf files created from the .ps files the text looks OK, but when I try to copy it, the characters come out garbled (as if they were corrupted). I have used these tools on other files for a long time and they worked fine. I also tried "pdftohtml -xml" to create an .xml file, and there the text is OK (the characters are extracted correctly).

  1. What could be going wrong in the conversion? Are there options for "pdftops" and "ps2pdf" that need to be changed?
  2. If I create the .xml output, is there a way to create a .pdf file from the .xml file?

EDIT: Output of "pdffonts original.pdf": [image: pdffonts_output_originalpdf]

Output of "pdffonts roundtripped.pdf": [image: pdffonts_output_roundtrippedpdf]

Answer

Kurt Pfeifle · May 28, 2012

I'm only covering the PS->PDF conversion here. (I'm assuming that by "vice-versa" you don't mean a round-trip conversion of the very same file [PDF->PS->PDF], but the general direction of conversion for any PS file. Is that correct?)

First of all, your ps2pdf is most likely just a shell script that internally runs a Ghostscript command with some default parameters to do the real work. ps2pdf is much easier to use; Ghostscript has many more options but is more difficult to learn. In exchange, ps2pdf takes away a lot of the control you would have if you used Ghostscript directly. (You can tweak a few parameters with ps2pdf, but at that point you are already close to running the real Ghostscript command anyway.)

Second, without knowing exactly how your PS input file was produced, it is difficult to give good advice: Does your PS file embed the fonts it uses? Which types of fonts are they? And so on.

Third, over the last few years Ghostscript has gained a lot of additional power and control for PDF output, and has had a few bugs and weak spots removed. So which version of Ghostscript is installed on your system? (Remember, ps2pdf calls Ghostscript; it will not work without a locally installed gs executable.)
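A quick way to answer the version question, and to see whether ps2pdf really is a wrapper script on your machine, could look like the sketch below. The helper name report_gs_setup is made up for illustration; it only assumes that ps2pdf and gs are on your PATH and that gs accepts --version.

```shell
# report_gs_setup is a hypothetical helper name, not part of any package.
report_gs_setup() {
  # Locate ps2pdf; stop early if it is not installed.
  ps2pdf_path=$(command -v ps2pdf) || { echo "ps2pdf: not found in PATH" >&2; return 1; }
  echo "ps2pdf: $ps2pdf_path"
  # A wrapper script starts with "#!"; a compiled binary does not.
  if [ -f "$ps2pdf_path" ] && [ "$(head -c 2 "$ps2pdf_path")" = "#!" ]; then
    echo "ps2pdf is a script; read it to see the gs call it makes"
  fi
  # Report the Ghostscript version that ps2pdf would end up using.
  command -v gs >/dev/null 2>&1 || { echo "gs: not found in PATH" >&2; return 1; }
  echo "gs version: $(gs --version)"
}
```

Running report_gs_setup in a terminal prints the ps2pdf location and the Ghostscript version, which is the information asked for above.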

One likely cause for your inability to copy text from the PDF could be the font type (and encoding) that ended up being used and embedded in your PDF file. Which font details can you tell us about your resulting PDFs? (Try pdffonts your.pdf to find out -- pdffonts is also part of the Poppler utils you mentioned.)
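To compare the font situation before and after the conversion, something like the following sketch may help. The function name compare_fonts is hypothetical; original.pdf and roundtripped.pdf are the file names from the question. In the pdffonts table, the "emb" and "sub" columns show whether a font is embedded and subsetted, and the "uni" column shows whether it carries a ToUnicode map, which is what copy-and-paste usually relies on (a missing map does not always mean extraction will fail).

```shell
# compare_fonts is a hypothetical helper; it prints the pdffonts table for each file given.
compare_fonts() {
  command -v pdffonts >/dev/null 2>&1 || { echo "pdffonts not found in PATH" >&2; return 1; }
  for pdf in "$@"; do
    echo "== $pdf =="
    # Columns include: name, type, encoding, emb, sub, uni, object ID.
    pdffonts "$pdf"
  done
}
```

Called as compare_fonts original.pdf roundtripped.pdf, it prints both tables one after the other, so differences in font type, encoding and the "uni" column are easy to spot.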

You may try this (full) Ghostscript command for PS->PDF conversion and check where it takes you:

gs \
  -o output.pdf \
  -sDEVICE=pdfwrite \
  -dPDFSETTINGS=/prepress \
  -dHaveTrueTypes=true \
  -dEmbedAllFonts=true \
  -dSubsetFonts=false \
  -c ".setpdfwrite <</NeverEmbed [ ]>> setdistillerparams" \
  -f input.ps
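After the conversion, one way to check whether the text in the new PDF can be extracted is to dump it with pdftotext, which is also part of the Poppler utils. This is only a sketch: preview_text is a made-up helper name, and output.pdf is the file name from the command above.

```shell
# preview_text is a hypothetical helper: print the first lines of text extracted from a PDF.
# Usage: preview_text FILE.pdf [LINES]   (LINES defaults to 20)
preview_text() {
  command -v pdftotext >/dev/null 2>&1 || { echo "pdftotext not found in PATH" >&2; return 1; }
  # "-" as the output file makes pdftotext write to stdout.
  pdftotext "$1" - | head -n "${2:-20}"
}
```

If preview_text output.pdf shows readable words, copy-and-paste from a viewer will most likely work too; if it shows garbage, the font encoding problem is still present.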