Extract tables from pdf (to excel), pref. w/ vba

MeRuud picture MeRuud · Feb 23, 2013 · Viewed 27.9k times · Source

I am trying to extract tables from pdf files with vba and export them to excel. If everything works out the way it should, it should go all automatic. The problem is that the table are not standardized.

This is what I have so far.

  1. VBA (Excel) runs XPDF, and converts all .pdf files found in current folder to a text file.
  2. VBA (Excel) reads through each text file line by line.

And the code:

With New Scripting.FileSystemObject
With .OpenTextFile(strFileName, 1, False, 0)

    If Not .AtEndOfStream Then .SkipLine
    Do Until .AtEndOfStream
        //do something
    Loop
End With
End With

This all works great. But now I am getting to the issue of extracting the tables from the text files. What I am trying to do is VBA to find a string e.g. "Year's Income", and then output the data, after it, into columns. (Until the table ends.)

The first part is not very difficult (find a certain string), but how would I go about the second part. The text file will look like this Pastebin. The problem is that the text is not standardized. Thus for example some tables have 3-year columns (2010 2011 2012) and some only two (or 1), some tables have more spaces between the columnn, and some do not include certain rows (such as Capital Asset, net).

I was thinking about doing something like this but not sure how to go about it in VBA.

  1. Find user defined string. eg. "Table 1: Years' Return."
  2. a. Next line find years; if there are two we will need three columns in output (titles +, 2x year), if there are three we will need four (titles +, 3x year).. etc
    b. Create title column + column for each year.
  3. When reaching end of line, go to next line
  4. a. Read text -> output to column 1.
    b. Recognize spaces (Are spaces > 3?) as start of column 2. Read numbers -> output to column 2.
    c. (if column = 3) Recognize spaces as start of column 3. Read numbers -> output to column 3.
    d. (if column = 4) Recognize spaces as start of column 4. Read numbers -> output to column 4.
  5. Each line, loop 4.
  6. Next line does not include any numbers - End table. (probably the easiet just a user defined number, after 15 characters no number? end table)

I based my first version on Pdf to excel, but reading online people do not recommend OpenFile but rather FileSystemObject (even though it seems to be a lot slower).

Any pointers to get me started, mainly on step 2?

Answer

CuberChase picture CuberChase · Feb 24, 2013

You have a number of ways to dissect a text file and depending on how complex it is might cause you to lean one way or another. I started this and it got a bit out of hand... enjoy.

Based on the sample you've provided and the additional comments, I noted the following. Some of these may work well for simple files but can get unwieldy with bigger more complex files. Furthermore, there may be slightly more efficient methods or tricks to what I have used here but this will definitely get you going an achieve the desired outcome. Hopefully this makes sense in conjunction with the code provided:

  • You can use booleans to help you determine what 'section' of the text file you are in. Ie use InStr on the current line to determine you are in a Table by looking for the text 'Table' and then once you know you are in the 'Table' section of the file start looking for the 'Assets' section etc
  • You can use a few methods to determine the number of years (or columns) you have. The Split function along with a loop will do the job.
  • If your files always have constant formatting, even only in certain parts, you can take advantage of this. For example, if you know your file line will always have a dollar sign in front of the them, then you know this will define the column widths and you can use this on subsequent lines of text.

The following code will extract the Assets details from the text file, you can mod it to extract other sections. It should handle multiple rows. Hopefully I've commented it sufficient. Have a look and I'll edit if needs to help out further.

 Sub ReadInTextFile()
    Dim fs As Scripting.FileSystemObject, fsFile As Scripting.TextStream
    Dim sFileName As String, sLine As String, vYears As Variant
    Dim iNoColumns As Integer, ii As Integer, iCount As Integer
    Dim bIsTable As Boolean, bIsAssets As Boolean, bIsLiabilities As Boolean, bIsNetAssets As Boolean

    Set fs = CreateObject("Scripting.FileSystemObject")
    sFileName = "G:\Sample.txt"
    Set fsFile = fs.OpenTextFile(sFileName, 1, False)

    'Loop through the file as you've already done
    Do While fsFile.AtEndOfStream <> True
        'Determine flag positions in text file
        sLine = fsFile.Readline

        Debug.Print VBA.Len(sLine)

        'Always skip empty lines (including single spaceS)
        If VBA.Len(sLine) > 1 Then

            'We've found a new table so we can reset the booleans
            If VBA.InStr(1, sLine, "Table") > 0 Then
                bIsTable = True
                bIsAssets = False
                bIsNetAssets = False
                bIsLiabilities = False
                iNoColumns = 0
            End If

            'Perhaps you want to also have some sort of way to designate that a table has finished.  Like so
            If VBA.Instr(1, sLine, "Some text that designates the end of the table") Then
                bIsTable = False
            End If 

            'If we're in the table section then we want to read in the data
            If bIsTable Then
                'Check for your different sections.  You could make this constant if your text file allowed it.
                If VBA.InStr(1, sLine, "Assets") > 0 And VBA.InStr(1, sLine, "Net") = 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = False
                If VBA.InStr(1, sLine, "Liabilities") > 0 Then bIsAssets = False: bIsLiabilities = True: bIsNetAssets = False
                If VBA.InStr(1, sLine, "Net Assests") > 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = True

                'If we haven't triggered any of these booleans then we're at the column headings
                If Not bIsAssets And Not bIsLiabilities And Not bIsNetAssets And VBA.InStr(1, sLine, "Table") = 0 Then
                    'Trim the current line to remove leading and trailing spaces then use the split function to determine the number of years
                    vYears = VBA.Split(VBA.Trim$(sLine), " ")
                    For ii = LBound(vYears) To UBound(vYears)
                        If VBA.Len(vYears(ii)) > 0 Then iNoColumns = iNoColumns + 1
                    Next ii

                    'Now we can redefine some variables to hold the information (you'll want to redim after you've collected the info)
                    ReDim sAssets(1 To iNoColumns + 1, 1 To 100) As String
                    ReDim iColumns(1 To iNoColumns) As Integer
                Else
                    If bIsAssets Then
                        'Skip the heading line
                        If Not VBA.Trim$(sLine) = "Assets" Then
                            'Increment the counter
                            iCount = iCount + 1

                            'If iCount reaches it's limit you'll have to redim preseve you sAssets array (I'll leave this to you)
                            If iCount > 99 Then
                                'You'll find other posts on stackoverflow to do this
                            End If

                            'This will happen on the first row, it'll happen everytime you
                            'hit a $ sign but you could code to only do so the first time
                            If VBA.InStr(1, sLine, "$") > 0 Then
                                iColumns(1) = VBA.InStr(1, sLine, "$")
                                For ii = 2 To iNoColumns
                                    'We need to start at the next character across
                                    iColumns(ii) = VBA.InStr(iColumns(ii - 1) + 1, sLine, "$")
                                Next ii
                            End If

                            'The first part (the name) is simply up to the $ sign (trimmed of spaces)
                            sAssets(1, iCount) = VBA.Trim$(VBA.Mid$(sLine, 1, iColumns(1) - 1))
                            For ii = 2 To iNoColumns
                                'Then we can loop around for the rest
                                sAssets(ii, iCount) = VBA.Trim$(VBA.Mid$(sLine, iColumns(ii) + 1, iColumns(ii) - iColumns(ii - 1)))
                            Next ii

                            'Now do the last column
                            If VBA.Len(sLine) > iColumns(iNoColumns) Then
                                sAssets(iNoColumns + 1, iCount) = VBA.Trim$(VBA.Right$(sLine, VBA.Len(sLine) - iColumns(iNoColumns)))
                            End If
                        Else
                            'Reset the counter
                            iCount = 0
                        End If
                    End If
                End If

            End If
        End If
    Loop

    'Clean up
    fsFile.Close
    Set fsFile = Nothing
    Set fs = Nothing
End Sub