I'm trying to convert the data from a large number of PDF files into data frames in R. So far I've been converting each PDF file to a .txt file and then reading it in with read.fwf(), but the problem is that the column widths are not the same from one file to the next. Is there a way to determine the column widths automatically, or to use a function other than read.fwf()?
I have a large number of files to convert and they all have different formats to begin with, so finding the specific column widths for each file is getting very tedious. Is there a more efficient way to convert data from PDF files to data frames in R?
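For context, this is roughly the workflow I've been repeating for each file (the file name and widths below are just an illustration; every file needs its own set of widths measured by hand):

# Hypothetical per-file workflow: the PDF has already been exported to
# text, and the column widths were worked out manually for this one file
dat = read.fwf("CDX_file1.txt",
               widths = c(4, 30, 12, 10, 28, 10, 8),
               stringsAsFactors = FALSE)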
Here's one possible solution using regular expressions. You use the readPDF function from the tm package to convert the PDF files to text, which gives you each row of the PDF as a text string. Then you use regular expressions to partition each string into the appropriate column fields for conversion to a data frame.
I've packaged this into a function so that you can read and parse all the PDF files and combine them into a single data frame in one operation. If your other files have formatting idiosyncrasies not present in the file you posted, then you'll need to do some tweaking to make it work properly.
The code also checks for some simple data format issues and saves "bad" rows in a separate text file for later inspection and processing. Once again, you might need to tweak this if your other files have different formatting variations.
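One setup note before the code: readPDF itself doesn't extract the text; depending on your version of tm, its default engine shells out to the pdftotext command-line utility (from Xpdf/Poppler), and the control=list(text="-layout") argument below is passed through as a pdftotext option. So make sure both the package and the utility are available:

# One-time setup (skip if already installed)
install.packages("tm")
# Should return a non-empty path if the pdftotext utility is on your PATH
Sys.which("pdftotext")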
# Use text-mining package to extract text from PDF files
library(tm)
# Function to read a PDF file and turn it into a data frame
PDFtoDF = function(file) {
  ## Extract the PDF text. Each line of the PDF becomes one element of the
  ## string vector dat.
  dat = readPDF(control=list(text="-layout"))(elem=list(uri=file),
                                              language="en", id="id1")
  dat = c(as.character(dat))
  ## Keep only those strings that contain the data we want.
  ## These are the ones that begin with a number.
  dat = dat[grep("^ {0,2}[0-9]{1,3}", dat)]
  ## Create separators so we can turn the strings into a data frame. We'll
  ## use the pipe "|" as the separator.
  # Add a pipe after the first number (the row number in the PDF file)
  dat = gsub("^ ?([0-9]{1,3}) ?", "\\1|", dat)
  # Replace each run of 2 or more spaces with a pipe separator. This works
  # because the company names have a single space between words, while data
  # fields generally have more than one space between them.
  # (We just need to first add an extra space in the few cases where there's
  # only one space between two data fields.)
  dat = gsub("(, HVOL )", "\\1 ", dat)
  dat = gsub(" {2,100}", "|", dat)
  ## Check for data format problems
  # Identify rows without the right number of fields (there should be six
  # pipe characters per row) and save them to a file for later inspection
  # and processing (in this case row 11 of the PDF file is excluded)
  excludeRows = sapply(gregexpr("\\|", dat), length) != 6
  write(dat[excludeRows], "rowsToCheck.txt", append=TRUE)
  # Remove the excluded rows from the string vector
  dat = dat[!excludeRows]
  ## Convert the string vector to a data frame
  dat = read.table(text=dat, sep="|", quote="", stringsAsFactors=FALSE)
  names(dat) = c("RowNum", "Reference Entity", "Sub-Index", "CLIP",
                 "Reference Obligation", "CUSIP/ISIN", "Weighting")
  return(dat)
}
# Create vector of names of files to read
files = list.files(pattern="CDX.*\\.pdf")
# Read each file, convert it to a data frame, then rbind into single data frame
df = do.call("rbind", lapply(files, PDFtoDF))
# Sample of data frame output from your sample file
df
  RowNum Reference Entity  Sub-Index      CLIP       Reference Obligation CUSIP/ISIN Weighting
1      1      ACE Limited        FIN 0A4848AC9 ACE-INAHldgs 8.875 15Aug29  00440EAC1     0.008
2      2       Aetna Inc.        FIN 0A8985AC5 AET 6.625 15Jun36 BondCall  00817YAF5     0.008
3      3       Alcoa Inc. INDU, HVOL 014B98AD5            AA 5.72 23Feb19  013817AP6     0.008
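To see what the separator logic is doing, here are the three gsub() steps applied to a single line. The string below is made up to mimic the layout of the PDF text (single spaces within fields, two or more spaces between fields, and a lone single space after the HVOL sub-index); it's illustrative, not taken from your file:

x = "3 Alcoa Inc.  INDU, HVOL 014B98AD5  AA 5.72 23Feb19  013817AP6  0.008"
x = gsub("^ ?([0-9]{1,3}) ?", "\\1|", x)  # pipe after the leading row number
x = gsub("(, HVOL )", "\\1 ", x)          # pad the one single-space field gap
x = gsub(" {2,100}", "|", x)              # runs of 2+ spaces become pipes
x
# [1] "3|Alcoa Inc.|INDU, HVOL|014B98AD5|AA 5.72 23Feb19|013817AP6|0.008"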