In brief, I'm dealing with a problematic PDF, which:
evince
, because of missing font information; ghostscript
can fully render the same PDF. Thus -- regardless of what ghostscript
uses to fill in the blanks (maybe fallback glyphs, or a different method to accessing fonts) -- I'd like to be able to use ghostscript
to produce ("distill") an output PDF, where pretty much nothing will be changed, except font information added, so evince
can render the same document in the same manner as ghostscript
can.
My question is thus - is this possible to do at all; and if so, what would be command line be to achieve something like that?
Many thanks in advance for any answers,
Cheers!
I'm actually on an older Ubuntu 10.04, and I might be experiencing - not a bug - but an installation problem with evince
(lack of poppler-data
package), as noted in Bug #386008 “Some fonts fail to display due to “Unknown font tag...” : Bugs : “poppler” package : Ubuntu.
However, that is exactly what I'd like to handle, so I'll use the fontspec.pdf
attached to that post ("PDF triggering the bug.", // v.) to demonstrate the problem.
evince
First, I open this pdf's page 3 in evince
; and evince
complains:
$ evince --page-label=3 fontspec.pdf
Error: Missing language pack for 'Adobe-Japan1' mapping
Error: Unknown font tag 'F5.1'
Error (7597): No font in show
Error: Unknown font tag 'F5.1'
Error (7630): No font in show
Error: Unknown font tag 'F5.1'
Error (7660): No font in show
Error: Unknown font tag 'F5.1'
...
The rendering looks like this:
... and it is obvious that some font shapes are missing.
acroread
Just a note on how Adobe's Acrobat Reader for Linux behaves; the following command line:
$ ./Adobe/Reader9/bin/acroread /a "page=3" fontspec.pdf
... generates no output to terminal whatsoever (for more on /a
switch, see Man page acroread) -- and the program has absolutely no problem displaying the fonts.
Also, while I'd like to avoid the roundtrip to postscript - however, note that acroread
itself can be used to convert a PDF to postscript:
$ ./Adobe/Reader9/bin/acroread -v
9.5.1
$ ./Adobe/Reader9/bin/acroread -toPostScript \
-rotateAndCenter -choosePaperByPDFPageSize \
-start 3 -end 3 \
-level3 -transQuality 5 \
-optimizeForSpeed -saveVM \
fontspec.pdf ./
Again, the above command line will generate no output to terminal; -optimizeForSpeed -saveVM
are there because apparently they deal with fonts; the last argument ./
is the output directory (output file is automatically called fontspec.ps
).
Now, evince
can display the previously missing fonts in the fontspec.ps
output - but again complains:
$ evince fontspec.ps
GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
...
... and furthermore, all text seems to be flattened to curves in the postscript - so now one cannot select the text in the .ps file in evince
anymore (note that the .ps file cannot be opened in acroread
). However, one can convert this .ps back into .pdf again:
$ pstopdf fontspec.ps # note, `pstopdf` has no output filename option;
# it will automatically choose 'fontspec.pdf',
# and overwrite previous 'fontspec.pdf' in
# the same directory
... and now the text in the output of pstopdf
is selectable in evince
, all fonts are there, and evince
doesn't complain anymore. However, as I noted, I'd like to avoid roundtrip to postscript files altogether.
display
(from imagemagick
)We can also observe the page in the same document with imagemagick
s display
(note that image panning from the commandline using 'display' is apparently still not available, so I've used -crop
below to adjust the viewport):
$ display -density 150 -crop 740x450+280+200 fontspec.pdf[2]
**** Warning: considering '0000000000 00000 n' as a free entry.
...
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
... which generates some ghostscrip
ish errors - and results with something like this:
... where it's obvious that the missing fonts that evince
couldn't render, are now shown here, with imagemagick
s display
, properly.
ghostscript
Finally, we can use ghostscript as x11 viewer itself -- to observe the same page, same document:
$ gs -sDevice=x11 -g740x450 -r150x150 -dFirstPage=3 \
-c '<</PageOffset [-120 520]>> setpagedevice' \
-f fontspec.pdf
GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
Processing pages 3 through 74.
Page 3
>>showpage, press <return> to continue<<
^C
... and results with this output:
In conclusion: ghostscript
(and apparently by extension, imagemagick
) can seemingly find the missing font (or at least some replacement for it), and render a page with that -- even if evince
fails at that for the same document.
I would, therefore, simply like to export a PDF version from ghostscript
, that would have only the missing fonts embedded, and no other processing; so I try this:
$ gs -dBATCH -dNOPAUSE -dSAFER \
-dEmbedAllFonts -dSubsetFonts=true -dMaxSubsetPct=99 \
-dAutoFilterMonoImages=false \
-dAutoFilterGrayImages=false \
-dAutoFilterColorImages=false \
-dDownsampleColorImages=false \
-dDownsampleGrayImages=false \
-dDownsampleMonoImages=false \
-sDEVICE=pdfwrite \
-dFirstPage=3 -dLastPage=3 \
-sOutputFile=mypg3out.pdf -f fontspec.pdf
GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
Processing pages 3 through 3.
Page 3
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
... but it doesn't work - the output file mypg3out.pdf
suffers from the exact same problems in evince
as noted previously.
Note: While I'd like to avoid the postscript roundtrip, a good example of gs
command line with from pdf to ps with font embedding is here: (#277826) pdf - How to make GhostScript PS2PDF stop subsetting fonts; but the same command line switches for .pdf to .pdf to not seem to have any effect on the problem described above.
Right, I got a bit further on this (but not completely) - so I'll post a partial answer/comment here.
Essentially, this is not a problem about font embedding in PDF - this is a problem with font mapping.
To show that, let's analyse the mypg3out.pdf
, which was extracted by gs
in the OP (from the 3rd page of the fontspec.pdf
document):
$ pdffonts mypg3out.pdf
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Error: Missing language pack for 'Adobe-Japan1' mapping
CAAAAA+Osaka-Mono-Identity-H CID TrueType yes yes yes 19 0
GBWBYF+CMMI9 Type 1C yes yes yes 28 0
FDFZUN+Skia-Regular_wght13333_wdth11999 TrueType yes yes yes 16 0
ZRLTKK+Optima-Regular TrueType yes yes yes 30 0
ZFQZLD+FPLNeu-Bold Type 1C yes yes yes 8 0
DDRFOG+FPLNeu-Italic Type 1C yes yes no 22 0
HMZJAO+FPLNeu-Regular Type 1C yes yes yes 10 0
RDNKXT+FPLNeu-Regular Type 1C yes yes yes 32 0
GBWBYF+Skia-Regular_wght13333_wdth11999 TrueType yes yes no 26 0
As the output shows - all fonts are, indeed, embedded; so something else is the problem. (It would have been more difficult to observe this in the complete fontspec.pdf
, as there are a ton of fonts there, and a ton of error messages.)
The crucial point (I think) here, is that there is:
Error: Missing language pack for 'Adobe-Japan1' mapping
" message; and CID TrueType
font, which is CAAAAA+Osaka-Mono-Identity-H
There seems to be an obvious relationship between the CID TrueType
and the 'Adobe-Japan1' mapping error; and I got that finally clarified by CID fonts - How to use Ghostscript:
CID fonts are PostScript resources containing a large number of glyphs (e.g. glyphs for Far East languages, Chinese, Japanese and Korean). Please refer to the PostScript Language Reference, third edition, for details.
CID font resources are a different kind of PostScript resource from fonts. In particular, they cannot be used as regular fonts. CID font resources must first be combined with a CMap resource, which defines specific codes for glyphs, before it can be used as a font. This allows the reuse of a collection of glyphs with different encodings.
All good - except here we're dealing with PDF fonts, not PostScript fonts as such; let's demonstrate that a bit.
For instance, 5.3. Using Ghostscript To Preview Fonts - Making Fonts Available To Ghostscript - Font HowTo describes how the Ghostscript-installed script called prfont.ps
can be used to render a table of fonts.
However, here it would be easier with just Listing Ghostscript Fonts [gs-devel], and using resourcestatus operator to query for a specific font - which doesn't require a special .ps script:
$ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
-c 'currentpagedevice (*) {=} 100 string /Font resourceforall'
...
Processing pages 1 through 1.
Page 1
URWAntiquaT-RegularCondensed
Palatino-Italic
Hershey-Gothic-Italian
...
$ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
-c '/TimesNewRoman findfont pop [/TimesNewRoman /Font resourcestatus]'
....
Processing pages 1 through 1.
Page 1
Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Querying operating system for font files...
Loading TimesNewRomanPSMT font from /usr/share/fonts/truetype/msttcorefonts/times.ttf... 2549340 1142090 3496416 1237949 1 done.
We got a list of fonts; however, those are system fonts available to ghostscript
- not the fonts embedded in the PDF!
(Basically,
gs -o /dev/null -dNODISPLAY -f mypg3out.pdf -c 'currentpagedevice (*) {=} 100 string /Font resourceforall' | grep -i osaka
will return nothing, and -c '/CAAAAA+Osaka-Mono-Identity-H findfont pop [/CAAAAA+Osaka-Mono-Identity-H /Font resourcestatus]'
will conclude with "Didn't find this font on the system! Substituting font Courier for CAAAAA+Osaka-Mono-Identity-H.")To list the fonts in the PDF, the pdf_info.ps script file from Ghostscript (not installed, in sources) can be used:
$ wget "http://git.ghostscript.com/?p=ghostpdl.git;a=blob_plain;f=gs/toolbin/pdf_info.ps" -O pdf_info.ps
$ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsNeeded pdf_info.ps
...
No system fonts are needed.
$ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsUsed -dShowEmbeddedFonts pdf_info.ps
...
Font or CIDFont resources used:
CAAAAA+Osaka-Mono
DDRFOG+FPLNeu-Italic
FDFZUN+Skia-Regular_wght13333_wdth11999
GBWBYF+CMMI9
GBWBYF+Skia-Regular_wght13333_wdth11999
GTIIKZ+Osaka-Mono
HMZJAO+FPLNeu-Regular
RDNKXT+FPLNeu-Regular
ZFQZLD+FPLNeu-Bold
ZRLTKK+Optima-Regular
So finally we can observe the CAAAAA+Osaka-Mono
in Ghostscript - although I wouldn't know how to query more specific information about it from within ghostscript
.
In the end, I guess my question boils down to: how could ghostscript
be used, to map the glyphs from a CID embedded font - into a font with a different "encoding" (or "character map"?), which will not require additional language files?
I have also experimented with these approaches:
pdffonts
on the output here will not have the Osaka-Mono listed, but it will still complain "Error: Missing language pack for 'Adobe-Japan1' mapping": $ wget http://whalepdfviewer.googlecode.com/svn/trunk/cmaps/japanese/Adobe-Japan1-UCS2 $ gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH -f mypg3out.pdf Adobe-Japan1-UCS2
pdffonts
list: gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \ -c '/CIDSystemInfo << /Registry (Adobe) /Ordering (Unicode) /Supplement 1 >>' \ -f mypg3out.pdf
Error: /undefinedresource in findresource
: gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \ -c '/Osaka-Mono-Identity-H /H /CMap findresource [/Osaka-Mono-Identity /CIDFont findresource] == ' \ -f mypg3out.pdf
Note finally that some of the .ps scripts ghostscript
installs, it may use automatically; for instance, you can find gs_ttf.ps
:
$ locate gs_ttf.ps
/usr/share/ghostscript/9.02/Resource/Init/gs_ttf.ps
... and then using sudo nano
, you can add the statement locate gs_ttf.ps
(Hello from gs_ttf.ps\n) print
at the beginning of the code; then whenever one of the above gs
commands is called, the printout will be visible in stdout.
Please keep in mind that a CMap resource unidirectionally maps character codes to CIDs. Those other resources that Acrobat uses are best referred to as PDF mapping resources. Among them, there is a special category called ToUnicode mapping resources that unidirectionally map CIDs to UTF-16BE character codes