how to add unicode in truetype0font on pdfbox 2.0.0?

omri a picture omri a · Sep 14, 2016 · Viewed 8.4k times · Source

I've been using the PDFBOX version 2.0.0 in a Java project to convert pdfs to text.

several of my pdfs are missing the ToUnicode method, so they come out in Gibberish while I export them.

2016-09-14 10:44:55 WARN org.apache.pdfbox.pdmodel.font.PDSimpleFont(1):322 - No Unicode mapping for 694 (30) in font MPBAAA+F1

in the WARN above, instead of the real character, a gibberish unicode (30) was presented.

I was able to overcome it by editing the additional.txt file in pdfbox, since from trial & error I understood that the code of the character (694 in this case) represents a certain Hebrew letter (צ).

here's a short example of what I've edited inside the file:

-694;05E6 #HexaDecimal value for the letter צ
-695;05E7
-696;05E8

later I've encountered almost the same warning on a different pdf, but instead of gibberish characters I got no characters at all. a more detailed explination of this issue can be seen here - pdf reading via pdfbox in java

2016-09-14 11:07:10 WARN org.apache.pdfbox.pdmodel.font.PDType0Font(1):431 - No Unicode mapping for CID+694 (694) in font ABCDEE+Tahoma,Bold

As you can see, the warning came from a different class (PDType0Font) rather than the first warning (PDSimpleFont), but the code name (694) is the same in both of them and they are both talking about the same character.

Does there's a different file that I should edit other than additional.txt to point the 694 code (the Hebrew letter צ) to it's correct unicode?

Thanks main root

drill down on the first type0 font

Answer

Tilman Hausherr picture Tilman Hausherr · Sep 22, 2016

Here's some code to add a ToUnicode CMap stream in a font. Obviously I can't do it with your file, so I used one of my test files, which can be found here. I had to work on each entry separately and didn't do all. However the result is good enough to extract the first word in the green print ("Bedingungen").

The scenario is somewhat tailored to you:

  • Identity-H entry
  • no ToUnicode entry
  • specific font name

    try (PDDocument doc = PDDocument.load(f))
    {
        for (int p = 0; p < doc.getNumberOfPages(); ++p)
        {
            PDPage page = doc.getPage(p);
            PDResources res = page.getResources();
            for (COSName fontName : res.getFontNames())
            {
                PDFont font = res.getFont(fontName);
                COSBase encoding = font.getCOSObject().getDictionaryObject(COSName.ENCODING);
                if (!COSName.IDENTITY_H.equals(encoding))
                {
                    continue;
                }
                // get real name
                String fname = font.getName();
                int plus = fname.indexOf('+');
                if (plus != -1)
                {
                    fname = fname.substring(plus + 1);
                }
                if (font.getCOSObject().containsKey(COSName.TO_UNICODE))
                {
                    continue;
                }
                System.out.println("File '" + f.getName() + "', page " + (p + 1) + ", " + fontName.getName() + ", " + font.getName());
                if (!fname.startsWith("Calibri-Bold"))
                {
                    continue;
                }
                COSStream toUnicodeStream = new COSStream();
                try (PrintWriter pw = new PrintWriter(toUnicodeStream.createOutputStream(COSName.FLATE_DECODE)))
                {
                    // "9.10 Extraction of Text Content" in the PDF 32000 specification
                    pw.println ("/CIDInit /ProcSet findresource begin\n" +
                            "12 dict begin\n" +
                            "begincmap\n" +
                            "/CIDSystemInfo\n" +
                            "<< /Registry (Adobe)\n" +
                            "/Ordering (UCS) /Supplement 0 >> def\n" +
                            "/CMapName /Adobe-Identity-UCS def\n" +
                            "/CMapType 2 def\n" +
                            "1 begincodespacerange\n" +
                            "<0000> <FFFF>\n" +
                            "endcodespacerange\n" +
                            "10 beginbfchar\n" + // number is count of entries
                            "<0001><0020>\n" + // space
                            "<0002><0041>\n" + // A
                            "<0003><0042>\n" + // B
                            "<0004><0044>\n" + // D
                            "<0013><0065>\n" + // e
                            "<0012><0064>\n" + // d
                            "<0017><0069>\n" + // i
                            "<001B><006E>\n" + // n
                            "<0015><0067>\n" + // g
                            "<0020><0075>\n" + // u
                            "endbfchar\n" +
                            "endcmap CMapName currentdict /CMap defineresource pop end end");
                }
                font.getCOSObject().setItem(COSName.TO_UNICODE, toUnicodeStream);
            }
        }
        doc.save("huhu.pdf");
    }
    

Btw the unreleased 2.1 version of PDFDebugger has some improved features to show fonts, you can get it here:

You can use it to verify that your ToUnicode CMap makes sense. Here's what I get with my changes: enter image description here