I have a PDF file with valuable textual information.
The problem is that I cannot extract the text; all I get is a bunch of garbled symbols. The same happens if I copy and paste the text from the PDF reader into a text file. Even File -> Save as text in Acrobat Reader fails.
I have used every tool I could get my hands on, and the result is always the same. I believe this has something to do with font embedding, but I don't know what exactly.
My questions:
Some PDF files, even ones produced by Adobe tools, are created without information that is crucial for successfully extracting text from them. Basically, such files do not contain glyph-to-character mapping information.
Such files display and print just fine (because the shapes of the characters are properly defined), but their text can't be properly copied or extracted (because there is no information about the meaning of the glyphs/shapes used).
For example, Distiller produces such files when the "Smallest File Size" preset is used.
I'm afraid there is no way other than OCR to retrieve text from such files. We recently published a guide on how to OCR PDFs in .NET.
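To make the point about the missing glyph-to-character mapping concrete, here is a toy sketch in Python. It is not how any real PDF library works internally; the character codes, the mapping values, and the `extract` helper are all hypothetical and chosen only for illustration.

```python
# Toy model: a PDF page stores character codes that select glyphs in a font,
# not Unicode text. All values below are hypothetical.
shown_codes = [0x01, 0x02, 0x03, 0x03, 0x04]  # codes found in the page content

# What a glyph-to-character mapping would provide for this (hypothetical) font.
code_to_char = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}


def extract(codes, mapping=None):
    """Turn character codes into text, with or without a mapping."""
    if mapping is not None:
        # Use the mapping; unknown codes become the replacement character.
        return "".join(mapping.get(code, "\ufffd") for code in codes)
    # No mapping available: the only option is to guess, here by treating
    # each code as a code point, which yields meaningless control characters.
    return "".join(chr(code) for code in codes)


with_mapping = extract(shown_codes, code_to_char)
without_mapping = extract(shown_codes)
print(repr(with_mapping))
print(repr(without_mapping))
```

The glyph shapes are the same in both cases, which is why such a file still displays and prints correctly; only the path back from code to character is missing.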
The original answer mentioned the "information about meaning of used glyphs/shapes". This information should be contained in a PDF structure called a /ToUnicode table. Such a table is required for each and every font which is embedded as a subset and uses a non-standard (Custom) encoding.
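As a rough illustration of what such a table contains: a /ToUnicode entry points to a CMap stream whose `beginbfchar` ... `endbfchar` sections pair a character code with a UTF-16BE string, both written as hex. The Python sketch below parses only those sections. The sample CMap text and the `parse_bfchar` helper are made up for this example, and `beginbfrange` sections, which real files also use, are deliberately ignored.

```python
import re

# Hypothetical, heavily shortened ToUnicode CMap for a three-glyph subset font.
SAMPLE_CMAP = """\
begincmap
1 begincodespacerange
<00> <FF>
endcodespacerange
3 beginbfchar
<01> <0048>
<02> <0069>
<03> <0021>
endbfchar
endcmap
"""


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from the bfchar sections."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # The destination is a UTF-16BE string written as hex digits.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


table = parse_bfchar(SAMPLE_CMAP)
# Iterating over a bytes object yields the integer character codes.
decoded = "".join(table[code] for code in b"\x01\x02\x03")
print(decoded)
```

Without this table, an extractor sees only the codes 01, 02, 03 and has nothing reliable to translate them with.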
To quickly evaluate the chances of extracting the text contents, you can use the pdffonts command line utility. It prints a table with one row of details for each font used by the PDF. The presence of a /ToUnicode table is indicated by the column headed uni.
A few example outputs:
$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-good.pdf
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes yes     13  0

$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad1.pdf
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes no      12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0

$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad2.pdf
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0
The good.pdf lets you extract the text contents for both fonts correctly, because both fonts have an accompanying /ToUnicode table.
For the bad1.pdf and the bad2.pdf, text extraction fails for every font whose uni column reads no, because that font has no /ToUnicode table. According to the output above, that is both fonts in bad1.pdf and only Helvetica-Bold in bad2.pdf.
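The check described above can be scripted. The following Python sketch assumes that `pdffonts` is installed and on the PATH, that its output has the column layout shown above (two header lines, then one row per font ending in emb, sub, uni and the two-part object ID), and that font names contain no whitespace. The function names `parse_pdffonts` and `fonts_without_tounicode` are made up for this example.

```python
import subprocess


def parse_pdffonts(output):
    """Parse pdffonts text output into one dict per font row."""
    fonts = []
    # Skip the header line and the dashed separator line.
    for line in output.splitlines()[2:]:
        # Split from the right, because the type column may contain spaces.
        parts = line.rsplit(maxsplit=5)
        if len(parts) != 6:
            continue  # blank or unexpected line
        rest, emb, sub, uni, _obj, _gen = parts
        fonts.append({
            "name": rest.split()[0],
            "emb": emb == "yes",
            "sub": sub == "yes",
            "uni": uni == "yes",
        })
    return fonts


def fonts_without_tounicode(pdf_path):
    """Run pdffonts and return the names of fonts whose uni column is 'no'."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return [f["name"] for f in parse_pdffonts(result.stdout) if not f["uni"]]
```

An empty list from `fonts_without_tounicode` suggests that text extraction has a good chance of working; any font it does return is a candidate for garbled output. Fed the textextract-bad2.pdf output shown above, the parser flags only CAAAAA+Helvetica-Bold.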
I, Kurt Pfeifle, have recently created a series of hand-coded PDF files to demonstrate the influence of existing, buggy, manipulated or missing /ToUnicode tables in the PDF source code. These PDFs are extensively commented and suitable for exploring with a text editor. The pdffonts output examples above were created with the help of these hand-coded files. (There are a few more PDFs showing different results, which an interested reader may want to explore...)