How can I extract text from a PDF file in Perl?

Pawan Rao picture Pawan Rao · Jul 16, 2009 · Viewed 39.7k times · Source

I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from command line (i.e using Perl system function) for extracting text from PDF files, this method works fine.

The problem is that we have symbols like α, β and other special characters in the PDF files which are not being displayed in the generated txt file. Also few extra spaces are being added randomly in the text.

Is there a better and more reliable way to extract text from PDF files such that the text will include all the symbols like α, β etc and the text will exactly match the text in the PDF (i.e without extra spaces)?

Answer

joe picture joe · Jul 16, 2009

These modules you can acheive the extract text from pdf

PDF::API2

CAM::PDF

CAM::PDF::PageText

From CPAN

   my $pdf = CAM::PDF->new($filename);
   my $pageone_tree = $pdf->getPageContentTree(1);
   print CAM::PDF::PageText->render($pageone_tree);

This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.

All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.