PDF and text layer

Jochen Hebbrecht · Jul 10, 2012

According to this site http://www.searchable-pdf.com/content.php?lang=en&c=61, a PDF can be searchable when a text layer is added.

I was looking for the technical specification of PDF. I think text can be stored in a PDF in two ways: a) as a text layer above the image layer (as described on the page above), or b) when you create a PDF from a Word document (with text), I don't think Word will store all the text in the text layer. I think it will store it in the image layer. Is that right?

Since PDF 1.4, XMP has been added (http://en.wikipedia.org/wiki/Extensible_Metadata_Platform). But what is XMP? Is this the "text layer" which I discussed above?

If a scanner performs OCR on an image, does it store the text in the "text layer" or in the "XMP" field? And is this only possible when the PDF is version 1.4?

And how can I detect if a PDF already has text data? For example: PDF A has been scanned with OCR and PDF B has not. How can I know that PDF B should be sent to a separate OCR engine?

Answer

Frank · Jul 10, 2012

The PDF specification makes no mention of a 'text layer'. Normally, there is just one way to 'store' text: by means of text-showing operators. These operators draw text at a specific location, using a specific color, font, font size and text rendering mode. There are several text rendering modes. For the purpose of answering your question, text can be visible or invisible.

A scanner that performs OCR renders both the raster image and the text to the PDF document. The text is rendered using the invisible text rendering mode. The result is that you can select the text with the mouse (the highlighted area is shown at the expected location on top of the image) and you can search for text. Again, the search result is shown at the correct location.

What happens when you generate PDF from a Word document depends on the software that you use to convert. To my knowledge, these converters do not generate an image but they will generate visible text.
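To make the difference between the two cases concrete, here is a minimal sketch in Python. The two content streams are hand-written, hypothetical examples (not output from any real scanner or converter), and the tokenizer is deliberately simplified: it ignores nested parentheses in strings and the q/Q graphics state stack. It tracks the Tr operator, which sets the text rendering mode (0 is the default, filled text; 3 means neither fill nor stroke, i.e. invisible), and reports the mode in effect at each text-showing operator.

```python
import re

# Hypothetical page from a scanner with OCR: the scan is painted as an
# image (Do), then the recognised words are drawn with "3 Tr" (invisible).
OCR_PAGE = b"""
q 595 0 0 842 0 0 cm /Im0 Do Q
BT
/F1 12 Tf
3 Tr
72 700 Td
(Hello world) Tj
ET
"""

# Hypothetical page from a Word-to-PDF converter: no image, and no Tr
# operator, so the text is drawn with the default mode 0 (visible).
WORD_PAGE = b"""
BT
/F1 12 Tf
72 700 Td
(Hello world) Tj
ET
"""

# Simplified tokenizer: literal strings, hex strings, then anything else
# that is not whitespace or a delimiter.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>|[^\s()<>\[\]]+")
SHOW_OPS = {b"Tj", b"TJ", b"'", b'"'}  # the text-showing operators


def text_modes(content: bytes) -> list[tuple[int, str]]:
    """Return (rendering mode, operator) for each text-showing operator."""
    mode = 0  # default text rendering mode
    prev = b""
    shown = []
    for match in TOKEN.finditer(content):
        tok = match.group()
        if tok == b"Tr":
            mode = int(prev)  # the operand precedes the operator
        elif tok in SHOW_OPS:
            shown.append((mode, tok.decode()))
        prev = tok
    return shown


if __name__ == "__main__":
    for name, page in (("OCR scan", OCR_PAGE), ("Word export", WORD_PAGE)):
        print(name, text_modes(page))
```

In both cases the text is stored the same way, with the same operators. Only the rendering mode differs, which is why there is no separate "text layer" to look for.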

XMP is metadata, as opposed to visual data: it describes the document and is not drawn on the page.
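As an illustration of what XMP holds, the sketch below pulls simple properties out of an XMP packet. XMP is an XML packet wrapped in xpacket processing instructions, and it is often stored uncompressed so that tools can find it by scanning the file. The sample packet and its values ("Hypothetical Scanner 1.0" and so on) are made up for the example, and the function name xmp_properties is my own. A packet inside a compressed stream would not be found by this approach.

```python
import re
import xml.etree.ElementTree as ET

# An XMP packet sits between "<?xpacket begin=...?>" and "<?xpacket end=...?>".
XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)

# Hypothetical packet, as a scanner might write it. Note that it carries
# properties about the document, not the words on any page.
SAMPLE_XMP = b"""<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/"
                   xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
   <xmp:CreatorTool>Hypothetical Scanner 1.0</xmp:CreatorTool>
   <pdf:Producer>Hypothetical OCR Engine</pdf:Producer>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""


def xmp_properties(pdf: bytes) -> dict[str, str]:
    """Return the simple element-valued properties of all XMP packets."""
    props = {}
    for match in XPACKET.finditer(pdf):
        root = ET.fromstring(match.group(1).strip())
        for elem in root.iter():
            text = (elem.text or "").strip()
            if text and len(elem) == 0:  # leaf element with a value
                local_name = elem.tag.rsplit("}", 1)[-1]
                props[local_name] = text
    return props


if __name__ == "__main__":
    print(xmp_properties(SAMPLE_XMP))
```

Whatever the OCR engine recognised on the page will not show up here; it lives in the page content, drawn by text-showing operators.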

Finally, with respect to your question about detecting whether a PDF has text data, here is a similar question (10k only).
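For a rough idea of how such a check could work, here is a heuristic sketch using only the Python standard library. It is an assumption-laden shortcut, not a real PDF parser: it scans every stream in the file, inflates it if it looks Flate-compressed, and reports whether any text-showing operator appears between BT and ET. It will miss text in encrypted files or in streams using other filters, and the name has_text is my own. A proper PDF library will be more reliable in production.

```python
import re
import zlib

# Raw bytes between "stream" and "endstream" (skipping the "stream" that
# is part of "endstream" itself).
STREAM = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# Simplified tokenizer: literal strings, hex strings, then anything else
# that is not whitespace or a delimiter.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>|[^\s()<>\[\]]+")
SHOW_OPS = {b"Tj", b"TJ", b"'", b'"'}  # the text-showing operators


def stream_payloads(pdf: bytes):
    """Yield each stream, inflated when it is valid zlib data."""
    for match in STREAM.finditer(pdf):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line before "endstream"
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            yield raw  # not Flate-compressed (or another filter): scan as-is


def has_text(pdf: bytes) -> bool:
    """Heuristic: True if any stream shows text inside a BT ... ET block."""
    for data in stream_payloads(pdf):
        in_text_object = False
        for match in TOKEN.finditer(data):
            tok = match.group()
            if tok == b"BT":
                in_text_object = True
            elif tok == b"ET":
                in_text_object = False
            elif in_text_object and tok in SHOW_OPS:
                return True
    return False


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print("has text" if has_text(handle.read()) else "send to OCR")
```

With this check, PDF A (scanned with OCR) would report text because of its invisible text operators, while PDF B (image only) would not and could be routed to the OCR engine.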