How to convert data from pdf files into data frames

user3745002 picture user3745002 · Jun 16, 2014 · Viewed 7.7k times · Source

I'm trying to convert the data from a large number of PDF files to data frames in R. I've been converting the PDF files to .txt files using read.fwf(), but the issue is that the widths of all .txt files are not the same. Is there a way to determine the widths of the columns, or is there a way to use a function other than read.fwf()?

I have a large number of files to convert and they all have different formats to begin with, so finding the specific column widths for each file is getting very tedious. Is there a more efficient way to convert data from PDF files to data frames in R?

Answer

eipi10 picture eipi10 · Jun 17, 2014

Here's one possible solution using Regular Expressions. You use the readPDF function from the tm package to convert the PDF files to text, giving you each row as a text string. Then you use Regular Expressions to partition the data into the appropriate column fields for conversion to a data frame.

I've packaged this into a function so that you can read and parse all the PDF files and combine them into a single data frame in one operation. If your other files have formatting idiosyncrasies not present in the file you posted, then you'll need to do some tweaking to make it work properly.

The code also checks for some simple data format issues and saves "bad" rows in a separate text file for later inspection and processing. Once again, you might need to tweak this if your other files have different formatting variations.

# Use text-mining package to extract text from PDF files    
library(tm)

# Function to read a PDF file and turn it into a data frame
PDFtoDF = function(file) {
  ## Extract PDF text. Each line of PDF becomes one element of the string vector dat.
  dat = readPDF(control=list(text="-layout"))(elem=list(uri=file), 
                                              language="en", id="id1") 
  dat = c(as.character(dat))

  ## Keep only those strings that contain the data we want. 
  ## These are the ones that begin with a number.
  dat = dat[grep("^ {0,2}[0-9]{1,3}", dat)]

  ## Create separators so we can turn strings into a data frame. We'll use the 
  ## pipe "|" as a separator.

  # Add pipe after first number (the row number in the PDF file)
  dat = gsub("^ ?([0-9]{1,3}) ?", "\\1|", dat)

  # Replace each instance of 2 or more spaces in a row with a pipe separator. This 
  # works because the company names have a single space between words, while data
  # fields generally have more than one space between them. 
  # (We just need to first add an extra space in a few cases where there's only one
  # space between two data fields.)
  dat = gsub("(, HVOL )","\\1 ", dat)
  dat = gsub(" {2,100}", "|", dat)

  ## Check for data format problems
  # Identify rows without the right number of fields (there should 
  # be six pipe characters per row) and save them to a file for 
  # later inspection and processing (in this case row 11 of the PDF file is excluded))
  excludeRows = lapply(gregexpr("\\|", dat), function(x) length(x)) != 6
  write(dat[excludeRows], "rowsToCheck.txt", append=TRUE)

  # Remove the excluded rows from the string vector
  dat = dat[!excludeRows]

  ## Convert string vector to data frame 
  dat = read.table(text=dat, sep="|", quote="", stringsAsFactors=FALSE)
  names(dat) = c("RowNum", "Reference Entity", "Sub-Index", "CLIP", 
                  "Reference Obligation", "CUSIP/ISIN", "Weighting")
  return(dat)
}

# Create vector of names of files to read
files = list.files(pattern="CDX.*\\.pdf")

# Read each file, convert it to a data frame, then rbind into single data frame
df = do.call("rbind", lapply(files, PDFtoDF))

# Sample of data frame output from your sample file
df
    RowNum    Reference Entity    Sub-Index      CLIP           Reference Obligation   CUSIP/ISIN Weighting
1        1         ACE Limited          FIN 0A4848AC9     ACE-INAHldgs 8.875 15Aug29    00440EAC1     0.008
2        2           Aetna Inc.         FIN 0A8985AC5     AET 6.625 15Jun36 BondCall    00817YAF5     0.008
3        3           Alcoa Inc.  INDU, HVOL 014B98AD5                AA 5.72 23Feb19    013817AP6     0.008