Using readPDF in R (tm package)

JDY · Aug 18, 2015 · Viewed 8.7k times

I'm a beginner at R and am having a bit of trouble using the tm package. I need to extract specific data from pages 55 through 300 of this PDF and thought that R might be a good way to do so. (If anyone has a better idea, please let me know!) I did some searching and, after installing the tm package and the xpdf package, I've tried reading this and tried zx8754's solution with no luck. I suspect it has something to do with the readPDF command, because I get the following:

Error in readPDF(PdftotextOptions = "-layout") : unused argument (PdftotextOptions = "-layout")

I think it has to do with trying to use the tm package and the xpdf packages together, so I read Tony Breyal's solution (I can't post more than 2 links), set pdfinfo and pdftotext as environment variables (I'm on Win 8), and restarted. I'm sure I'm missing something. Right now I have pdftotext.exe in my working directory in R. Can anyone help me configure this so that the tm package calls the xpdf files correctly and readPDF works as it should?

Again, I'm very new to this, so apologies if I'm way off. All help would be very much appreciated.

Thanks in advance,

Justin

Answer

eipi10 · Aug 19, 2015

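To get you started, here is an example of a complete readPDF command for reading a PDF file. Note that the pdftotext option is passed through the control argument, as control = list(text = "-layout"), rather than as PdftotextOptions, which your error message shows readPDF rejecting as an unused argument. readPDF threw an error when I tried to retrieve the PDF file directly from the link you provided, so I downloaded the PDF file to my working directory first.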

library(tm)

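# File name (the PDF, downloaded to the working directory)
filename <- "ea0607.pdf"

# readPDF() returns a reader function, which is called right away on the file.
# text = "-layout" is passed to pdftotext to keep the original page layout.
doc <- readPDF(control = list(text = "-layout"))(elem = list(uri = filename),
                                                 language = "en",
                                                 id = "id1")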

The code above converted the PDF file to text and stored the result in doc. doc is actually a list, as can be seen with the following code:

str(doc)

List of 2
 $ content: chr [1:23551] "  STATE UNIVERSITY SYSTEM OF FLORIDA" "" "EXPENDITURE ANALYSIS" "      2006-2007" ...
 $ meta   :List of 7
  ..$ author       : chr "greg.jacques"
  ..$ datetimestamp: POSIXlt[1:1], format: "2007-12-10 11:33:48"
  ..$ description  : NULL
  ..$ heading      : chr " PGM=EASUSI-V01                                        STATE UNIVERSITY SYSTEM                                                 "| __truncated__
  ..$ id           : chr "ea0607.pdf"
  ..$ language     : chr "en"
  ..$ origin       : chr "Acrobat PDFMaker 8.1 for Word"
  ..- attr(*, "class")= chr "TextDocumentMeta"
 - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"

The text of the PDF file is stored in doc$content, while doc$meta holds metadata about the PDF file. Each element of doc$content is one line of the extracted text. Here are elements 300 through 310:

doc$content[300:310]

 [1] ""                                                                                                                      
 [2] "and General (E&G) budget entity. The Expenditure Analysis continues to reflect special units separately and the"       
 [3] ""                                                                                                                      
 [4] "traditional program components and related activities have been further defined to support the funding formula. The"   
 [5] ""                                                                                                                      
 [6] "Expenditure Analysis format was revised in 1995-96 to include all activities in the funding formula as well as college"
 [7] ""                                                                                                                      
 [8] "detail by activity for the UF Health Science Center, the USF Health Science Center and the FSU Medical School. A"      
 [9] ""                                                                                                                      
[10] "definition of each follows:"                                                                                           
[11] ""    
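
Since you only need pages 55 through 300, here is a rough sketch of one way to narrow the lines down by page. It rests on an assumption I have not checked against your file: pdftotext normally writes a form feed character ("\f") at each page break, and I am assuming those characters survive in doc$content at the start of the first line of each new page. The lines vector below is a made-up stand-in for doc$content, and the page range 2 to 3 stands in for 55 to 300.

```r
# Made-up stand-in for doc$content: one element per line of text,
# with "\f" (form feed) assumed to mark the first line of each new page
lines <- c("page one, line one", "page one, line two",
           "\fpage two, line one", "page two, line two",
           "\fpage three, line one")

# Page number of each line: 1 plus the number of form feeds seen so far
page <- cumsum(grepl("\f", lines, fixed = TRUE)) + 1

# Keep only the lines on the pages of interest
# (pages 2 to 3 here; 55 to 300 for the real file)
wanted <- lines[page >= 2 & page <= 3]

# Drop the form feed markers from the selected lines
wanted <- sub("\f", "", wanted, fixed = TRUE)
```

If grepl("\f", doc$content, fixed = TRUE) finds no matches in your file, this approach does not apply and you would need another marker, such as a repeated page header, to locate the page boundaries.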

Hopefully that will help you get started.