How to extract specific tables from an HTML file using native PowerShell commands?

Tom A. · Sep 19, 2014 · Viewed 27.4k times

I use the PAL tool (https://pal.codeplex.com/) to generate HTML reports from perfmon logs on Windows. After PAL processes the .blg files from perfmon, it dumps the information into an HTML document containing tables with various data points about how the system performed. I am currently writing a script that finds all HTML files in a directory and runs Get-Content on each of them.

What I would like to do is scrape the Get-Content output for specific tables that have a varying number of rows. Is it possible, using native PowerShell cmdlets, to look for specific tables, count how many rows are in each table, and output just the desired tables and their rows?

Here is an example of the table format I'm trying to scrape:

<H3>Overall Counter Instance Statistics</H3>
<TABLE ID="table6" BORDER=1 CELLPADDING=2>
<TR><TH><B>Condition</B></TH><TH><B>\LogicalDisk(*)\Disk Transfers/sec</B></TH><TH><B>Min</B></TH><TH><B>Avg</B></TH><TH><B>Max</B></TH><TH><B>Hourly Trend</B></TH><TH><B>Std Deviation</B></TH><TH><B>10% of Outliers Removed</B></TH><TH><B>20% of Outliers Removed</B></TH><TH><B>30% of Outliers Removed</B></TH></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/C:</TD><TD>1</TD><TD>7</TD><TD>310</TD><TD>0</TD><TD>11</TD><TD>5</TD><TD>5</TD><TD>5</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/D:</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/E:</TD><TD>0</TD><TD>24</TD><TD>164</TD><TD>-1</TD><TD>11</TD><TD>22</TD><TD>21</TD><TD>20</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/HarddiskVolume5</TD><TD>0</TD><TD>0</TD><TD>2</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/L:</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/T:</TD><TD>0</TD><TD>7</TD><TD>430</TD><TD>0</TD><TD>21</TD><TD>3</TD><TD>2</TD><TD>2</TD></TR>
</TABLE>

The table ID is constant across all the output files, but the number of table rows is not. Any help is appreciated!

Answer

Alexander Obersht · Sep 19, 2014

OK, this isn't thoroughly tested but works with your example table in PS 2.0 with IE11:

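# Parsing HTML with IE.
$oIE = New-Object -ComObject InternetExplorer.Application
# Navigate() expects a full path or URL, so resolve the file name first.
$oIE.Navigate((Resolve-Path "file.html").Path)
# Navigate() returns immediately; wait until the page has finished loading (4 = complete).
while ($oIE.Busy -or $oIE.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
$oHtmlDoc = $oIE.Document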

# Getting table by ID.
$oTable = $oHtmlDoc.getElementByID("table6")

# Extracting table rows as a collection.
$oTbody = $oTable.childNodes | Where-Object { $_.tagName -eq "tbody" }
$cTrs = $oTbody.childNodes | Where-Object { $_.tagName -eq "tr" }

# Creating a collection of table headers.
$cThs = $cTrs[0].childNodes | Where-Object { $_.tagName -eq "th" }
$cHeaders = @()
foreach ($oTh in $cThs) {
    $cHeaders += `
        ($oTh.childNodes | Where-Object { $_.tagName -eq "b" }).innerHTML
}

# Converting rows to a collection of PS objects exportable to CSV.
$cCsv = @()
foreach ($oTr in $cTrs) {
    # Wrapping in @() keeps this an array even when a row has a single cell.
    $cTds = @($oTr.childNodes | Where-Object { $_.tagName -eq "td" })
    # Skipping the first row (headers), which has no <td> cells.
    if ($cTds.Count -eq 0) { continue }
    $oRow = New-Object PSObject
    for ($i = 0; $i -lt $cHeaders.Count; $i++) {
        $oRow | Add-Member -MemberType NoteProperty -Name $cHeaders[$i] `
            -Value $cTds[$i].innerHTML
    }
    $cCsv += $oRow
}

# Closing IE.
$oIE.Quit()

# Exporting CSV.
$cCsv | Export-Csv -Path "file.csv" -NoTypeInformation

Honestly, I didn't aim for optimal code. It's just an example of how you could work with DOM objects in PS and convert them to PS objects.
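
If you'd rather not depend on IE at all, here is a rough sketch of the same idea using only regular expressions on the text that Get-Content gives you. This is not part of the tested code above: the function name Get-HtmlTableRows and its parameters are made up for illustration, it assumes the table is simple and well formed like your example (no nested tables, one header row of TH cells), and it does not decode HTML entities.

```powershell
# Hypothetical helper (an assumption, not from the code above): extracts one
# table from raw HTML text by its ID and emits one PSObject per data row.
function Get-HtmlTableRows {
    param(
        [string]$Html,      # full HTML text of one report
        [string]$TableId    # value of the table's ID attribute, e.g. "table6"
    )

    $opts = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase, Singleline'

    # Isolate everything between <TABLE ID="..."> and the next </TABLE>.
    $tablePattern = '<table\b[^>]*\bid\s*=\s*"?' + [regex]::Escape($TableId) +
                    '\b"?[^>]*>(.*?)</table>'
    $tableMatch = [regex]::Match($Html, $tablePattern, $opts)
    if (-not $tableMatch.Success) { return }

    $headers = @()
    $rowMatches = [regex]::Matches($tableMatch.Groups[1].Value, '<tr\b[^>]*>(.*?)</tr>', $opts)
    foreach ($row in $rowMatches) {
        $rowHtml = $row.Groups[1].Value

        # Cell text with nested tags (such as <B>) stripped out.
        $cells = @(
            [regex]::Matches($rowHtml, '<t[hd]\b[^>]*>(.*?)</t[hd]>', $opts) |
                ForEach-Object { ($_.Groups[1].Value -replace '<[^>]+>', '').Trim() }
        )

        # A row containing <TH> cells is treated as the header row.
        if ($rowHtml -match '<th\b') {
            $headers = $cells
            continue
        }

        $obj = New-Object PSObject
        for ($i = 0; $i -lt $headers.Count; $i++) {
            $obj | Add-Member -MemberType NoteProperty -Name $headers[$i] -Value $cells[$i]
        }
        $obj
    }
}

# Usage with a cut-down version of the example table.
$html = @'
<H3>Overall Counter Instance Statistics</H3>
<TABLE ID="table6" BORDER=1 CELLPADDING=2>
<TR><TH><B>Condition</B></TH><TH><B>Min</B></TH><TH><B>Max</B></TH></TR>
<TR><TD>No Thresholds</TD><TD>1</TD><TD>310</TD></TR>
<TR><TD>No Thresholds</TD><TD>0</TD><TD>164</TD></TR>
</TABLE>
'@

$rows = @(Get-HtmlTableRows -Html $html -TableId 'table6')
$rows.Count    # number of data rows found in the table
$rows | Export-Csv -Path "table6.csv" -NoTypeInformation
```

To run this over a directory, you could read each file inside a Get-ChildItem loop with [System.IO.File]::ReadAllText($_.FullName) and pass the result as -Html. Regex parsing like this is brittle compared with a real DOM, so treat it as a starting point only.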