Java PDFBox - Reading and modifying a pdf with special characters (diacritics)

Andrei F · Apr 12, 2013 · Viewed 8.8k times

I'm trying to modify a PDF using this method (first code block: using PDFStreamParser and iterating through the PDFOperator tokens, then updating each COSString when needed):

http://www.coderanch.com/t/556009/open-source/PdfBox-Replace-String-double-pdf

I'm having an issue with some UTF-8 characters (diacritics): when I print the text that I want to update, it shows up as "Societ? ?ii Na?ionale" (where '?' is a code like 0002 or 0004).

The funny things are:

  1. When I write the updated PDF file, the characters are shown correctly (even though I couldn't detect and replace them).
  2. If I strip the text using PDFTextStripper's getText(...), the text is extracted perfectly.
  3. I tried two PDFBox versions: 1.5.0 (which behaves as described above) and 1.8.1 (where the final, written PDF file does not display special characters correctly and "null" strings appear in the document).

What can I do (or configure) in the classes used for updating the PDF so that all of the UTF-8 characters are read and displayed correctly?

EDIT:

Screenshot: [image not included]

EDIT 2:

I searched through the pdfbox source code in PDFTextStripper and its superclass, and I found out how the text was extracted:

At the beginning of the processStream method we have:

graphicsState = new PDGraphicsState(aPage.findCropBox());

When stripping the text in processEncodedText, an instance of the PDFont class is used like this:

final PDFont font = graphicsState.getTextState().getFont();

and the text is extracted from a byte[] with:

String c = font.encode( string, i, codeLength );

The new problem is that when I try to obtain a PDFont with the same two lines of code, I get a null font, and thus I cannot use the .encode(...) method. The source code for those classes is here: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/pdfbox/1.5.0/org/apache/pdfbox/util/PDFStreamEngine.java and http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/pdfbox/1.5.0/org/apache/pdfbox/util/PDFTextStripper.java

I'm digging now for more ...

Answer

plinth · Apr 12, 2013

You can't just replace the text in strings. I don't say this lightly. I used to work on Acrobat many years ago and did the text search tool in the initial version, so I have a fairly deep understanding of the issues of text encoding. The main problem is that every string in PDF is encoded in some way. This is because PDF was made before Unicode was generally available and had a history in PostScript. PostScript liked having very flexible encoding methods for fonts and encouraged re-encoding.

So let's take a step back and understand the whole picture.

A string in PDF which is meant to be shown with a text operator is, by default, encoded as a series of 8-bit character codes. To determine what glyph is drawn for each byte, the byte is pushed through an encoding vector for that font. The encoding vector maps the byte to a glyph name, which is then looked up in the font and drawn on the page. Be aware that this description is a half-truth (more later).

Most apps that generate PDF are kind and just use a standard encoding such as StandardEncoding or WinAnsiEncoding, most of which are pretty reasonable. Others will use a standard encoding along with an encoding delta, which lists the differences between that standard encoding and what is actually encoded.
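
As a rough illustration of the lookup described above, here is a minimal Java sketch of a byte going through an encoding vector made of a base encoding plus a differences table. Everything in it is an assumption for illustration: the class and method names are made up (this is not PDFBox API), the base table is a three-entry stand-in for a real 256-entry encoding, and the re-purposed code 0x61 is an invented example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of an encoding vector; not PDFBox code.
class EncodingVectorSketch {
    // Tiny excerpt standing in for a standard base encoding.
    static Map<Integer, String> baseEncoding() {
        Map<Integer, String> base = new HashMap<>();
        base.put(0x41, "A");
        base.put(0x61, "a");
        base.put(0xC4, "Adieresis"); // WinAnsiEncoding places this glyph at 0xC4
        return base;
    }

    // Overlay the differences on top of the base encoding.
    static Map<Integer, String> withDifferences(Map<Integer, String> base,
                                                Map<Integer, String> differences) {
        Map<Integer, String> result = new HashMap<>(base);
        result.putAll(differences);
        return result;
    }

    // Map each byte of a string to the glyph name the font would be asked for.
    static String[] glyphNames(byte[] pdfString, Map<Integer, String> encoding) {
        String[] names = new String[pdfString.length];
        for (int i = 0; i < pdfString.length; i++) {
            int code = pdfString[i] & 0xFF; // treat the byte as an unsigned code
            names[i] = encoding.getOrDefault(code, ".notdef");
        }
        return names;
    }

    public static void main(String[] args) {
        Map<Integer, String> diffs = new HashMap<>();
        diffs.put(0x61, "tcommaaccent"); // assumed: this document re-purposes code 0x61
        Map<Integer, String> enc = withDifferences(baseEncoding(), diffs);
        byte[] raw = {0x41, 0x61, (byte) 0xC4};
        System.out.println(String.join(" ", glyphNames(raw, enc)));
    }
}
```

The point of the sketch is that the same byte 0x61 means "a" under the base encoding and something else once the differences are applied, so the raw bytes alone do not tell you which character you are looking at.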

Some apps try to be much more frugal in the PDF they generate, so they look at the glyphs they use and decide to embed a subset of the font. If they only use upper and lower case roman letters and digits, they rebuild the font with only those glyphs, and may choose to re-index them as well and provide an encoding vector such that byte 0x00 goes to the glyph 'a' and 0x01 goes to the glyph 'b' and so on.
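
This kind of re-indexing is the likely reason the raw strings in the question print as codes like 0002 or 0004. The following is a minimal Java sketch of the idea, assuming a made-up scheme where codes are handed out in order of first use; the class and method names are hypothetical and real PDF producers choose their own numbering.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a subset font with re-indexed codes.
class SubsetEncodingSketch {
    // Assign codes 0x00, 0x01, ... to characters in order of first use.
    // Assumes at most 256 distinct characters, since each code is one byte.
    static Map<Character, Integer> buildSubset(String usedText) {
        Map<Character, Integer> codes = new LinkedHashMap<>();
        for (char c : usedText.toCharArray()) {
            if (!codes.containsKey(c)) {
                codes.put(c, codes.size());
            }
        }
        return codes;
    }

    // Encode text with the subset's private codes; fail if a glyph is missing.
    static byte[] encode(String text, Map<Character, Integer> codes) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Integer code = codes.get(text.charAt(i));
            if (code == null) {
                throw new IllegalArgumentException("glyph not in subset: " + text.charAt(i));
            }
            out[i] = (byte) code.intValue();
        }
        return out;
    }

    // Show the raw codes the way a naive reader of the string would see them.
    static String describe(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < raw.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", raw[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Character, Integer> subset = buildSubset("Societ\u0103\u0163ii");
        System.out.println(describe(encode("\u0163i", subset)));
    }
}
```

Note that encode throws for any character that was not in the original text: in a subset font, a replacement string can only use glyphs that happened to be kept.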

Now back to the half-truth. There is a class of fonts that are encoded by character ID (or CID), and TrueType and OpenType fonts can fall into that category. In this case, you get access to Unicode, but again there is an encoding step where the string, which is now UTF16BE, gets mapped to a CID, which is used to get the glyph from the font. And for no particularly good reason, Adobe uses a PostScript function to do the mapping. And again, this is about a 3/4 truth, because there are different encodings as well for older management of Chinese, Japanese, and Korean fonts.
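
A minimal Java sketch of that extra step, under simplifying assumptions: codes are always two bytes, big-endian, and the code-to-CID mapping is a flat table. Real CMaps are written in PostScript syntax, can use ranges and variable-length codes, and the names below are hypothetical rather than any library's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of two-byte codes being mapped to CIDs.
class CidCodeSketch {
    // Split the string bytes into two-byte big-endian character codes.
    static int[] toCodes(byte[] pdfString) {
        if (pdfString.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        int[] codes = new int[pdfString.length / 2];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = ((pdfString[2 * i] & 0xFF) << 8) | (pdfString[2 * i + 1] & 0xFF);
        }
        return codes;
    }

    // Look up each code in the (assumed flat) code-to-CID table.
    // Unmapped codes fall back to CID 0, which conventionally is .notdef.
    static int[] toCids(int[] codes, Map<Integer, Integer> codeToCid) {
        int[] cids = new int[codes.length];
        for (int i = 0; i < codes.length; i++) {
            cids[i] = codeToCid.getOrDefault(codes[i], 0);
        }
        return cids;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> cmap = new HashMap<>();
        cmap.put(0x0163, 42); // assumed entry: code 0x0163 selects CID 42
        int[] codes = toCodes(new byte[] {0x00, 0x02, 0x01, 0x63});
        for (int cid : toCids(codes, cmap)) {
            System.out.println(cid);
        }
    }
}
```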

So before you blithely put a character into a string for a PDF font, you have to ask a few questions:

  1. Is my glyph in the font?
  2. Is my glyph in the encoding?
  3. What is the encoding of my glyph?

And any one of those may be different from what you expect. So for example, if you want to put in Ä (A with a diaeresis), you have to see if the font has the glyph for it (which may not be there because the font is a subset). Then the font may have a funny encoding which may not include the glyph. And finally, the actual byte value(s) to use for Ä may not be standard.
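
The three questions can be sketched as a single lookup. This is a minimal Java illustration under stated assumptions: the font is modeled as a plain set of glyph names and the encoding as a code-to-glyph-name map, and the class and method names are hypothetical rather than taken from PDFBox.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.OptionalInt;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of the checks needed before writing a character.
class GlyphCheckSketch {
    // Returns the code to write for the glyph, or empty if it cannot be written.
    static OptionalInt codeFor(String glyphName, Set<String> fontGlyphs,
                               Map<Integer, String> encoding) {
        // Question 1: is the glyph in the (possibly subset) font at all?
        if (!fontGlyphs.contains(glyphName)) {
            return OptionalInt.empty();
        }
        // Questions 2 and 3: is it in the encoding, and under which code?
        // If several codes map to the glyph, the first in iteration order wins.
        for (Map.Entry<Integer, String> entry : encoding.entrySet()) {
            if (entry.getValue().equals(glyphName)) {
                return OptionalInt.of(entry.getKey());
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        Set<String> fontGlyphs = new HashSet<>(Arrays.asList("A", "Adieresis"));
        Map<Integer, String> encoding = new TreeMap<>();
        encoding.put(0x01, "Adieresis"); // assumed non-standard code for this glyph
        System.out.println(codeFor("Adieresis", fontGlyphs, encoding)); // present, code 1
        System.out.println(codeFor("A", fontGlyphs, encoding));         // in font, not encoded
        System.out.println(codeFor("Odieresis", fontGlyphs, encoding)); // not in font
    }
}
```

Only the first call yields a usable code, and that code (0x01) is not the value a standard encoding would use for Ä, which is the situation the paragraph above warns about.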

So when I see someone trying to simply replace a chunk of text in PDF content, all I see is a world of pain. For most sane PDFs, this will work, say, 90% of the time, but for anything exotic - good luck. PDF's text rendering quirks are painful enough that it's sometimes easier to think of it as a write-only format.