Extracting text from garbled PDF

SNAG picture SNAG · Aug 29, 2012 · Viewed 38.4k times · Source

I have a PDF file with valuable textual information.

The problem is that I cannot extract the text, all I get is a bunch of garbled symbols. The same happens if I copy and paste the text from the PDF reader to a text file. Even File -> Save as text in Acrobat Reader fails.

I have used all tools I could get my hands on and the result is the same. I believe that this has something to do with fonts embedding, but I don't know what exactly?

My questions:

  • What is the culprit of this weird text garbling?
  • How to extract the text content from the PDF (programmatically, with a tool, manipulating the bits directly, etc.)?
  • How to fix the PDF to not garble on copy?

Answer

Bobrovsky picture Bobrovsky · Aug 30, 2012

Some PDF files are produced without special information that is crucial for successful extraction of text from them. Even by the Adobe tools. Basically, such files do not contain glyph-to-character mapping information.

Such files will be displayed and printed just fine (because shapes of the characters are properly defined), but text from them can't be properly copied / extracted (because there is no information about meaning of used glyphs/shapes).

For example, Distiller produces such files when "Smallest File Size" preset is used.

Other than OCR there is no other way to retrieve text from such files, I'm afraid. We recently published a guide for how to OCR PDFs in .NET.


Supplementing the original answer

The original answer mentioned the "information about meaning of used glyphs/shapes". This information should be contained in a PDF structure called a /ToUnicode table. Such a table is required for each and every font which is embedded as a subset and uses non-standard (Custom) encoding.

In order to quickly evaluate the chances for extractability of text contents, you can use the pdffonts command line utility. This prints in tabular form a series of items about each font used by the PDF. The presence of a /ToUnicode table is indicated by column headed uni.

A few example outputs:

$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-good.pdf

    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes yes     13  0


$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad1.pdf

    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes no      12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0


$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad2.pdf

    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0

The good.pdf lets you extract the text contents for both fonts correctly, because both fonts have an accompanying /ToUnicode table.

For the bad1.pdf and the bad2.pdf the text extraction succeeds only for one of the two fonts, and fails for the other, because only one font has a /ToUnicode table.

I, Kurt Pfeifle, have recently created a series of hand-coded PDF files to demonstrate the influence of existing, buggy, manipulated or missing /ToUnicode tables in the PDF source code. These PDFs are extensively-commented and suitable to be explored with the help of a text editor. Above pdffonts output examples were created with the help of these hand-coded files. (There are a few more PDFs showing different results, which an interested reader may want to explore...)