I've hit an interesting problem that has eluded me thus far. I'm trying to extract specific information from a local HTML document. It's essentially a series of tables, and I only need specific values. I've imported the document using:

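$sourcePath = "C:\Temp\Record.htm"
$oIE = New-Object -ComObject InternetExplorer.Application
$oIE.Navigate($sourcePath)
# Wait until the page has finished loading (ReadyState 4 = complete) before reading the document
while ($oIE.Busy -or $oIE.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
$sourceHTML = $oIE.Document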

Using the IE COM object was necessary because "HTMLFile" created an object but none of the inner/outer text was available. I've broken the file down into table cells for parsing, using:

$sourceHTML.body.getElementsByTagName('td')

But herein lies my problem. I need to get the 7-digit number from this entry, but I am falling short:

<td width="25%"><font face="Arial" size="1"><b>Serial Number</b></font></td>
<td width="25%"><font face="Arial" size="1">8111111</font></td>

Edit: Longer section of HTML as requested. There are about six of these tables in the document:

    <p style="text-align: center;">
        <font style="color: rgb(255, 0, 0); font-family: Arial Narrow; 
        font-size: 20pt; font-weight: bold;">TITLE OF TABLE
        </font>
    </p>
    <h2>Registration Details</h2><br>
    <table width="100%" bordercolor="#000000" border="1" cellspacing="0">
        <tbody>
            <tr>
                <td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"><b>
                        <font face="Arial" size="1">Personal Details</font>
                    </b></td>
                <td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff">&nbsp;</td>
                <td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"><b>
                        <font face="Arial" size="1">Contact (Work) Address Details</font>
                    </b></td>
                <td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff">&nbsp;</td>
            </tr>
        </tbody>
    </table>
    <table width="100%" border="0" cellspacing="0">
        <tbody>
            <tr>
                <td width="25%">
                    <font face="Arial" size="1"><b>Employment</b></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1"><b>CompanyNameHere</b></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1"><b>Workplace</b></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1"><b>Test Street</b></font>
                </td>
            </tr>
            <tr>
                <td width="25%">
                    <font face="Arial" size="1"><b>Employment Type</b></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1">Regular</font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1"><b>Address Line 1</b></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1">10 Earth Place</font>
                </td>
            </tr>
            <tr>
                <td width="25%">
                    <font face="Arial" size="1"><b>Employment Category</b></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1"></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1"><b>Address Line 2</b></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1"></font>
                </td>
            </tr>
            <tr>
                <td width="25%">
                    <font face="Arial" size="1"><b>Employment Option</b></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1"></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1"><b>Address Line 3</b></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1"></font>
                </td>
            </tr>
            <tr>
                <td width="25%">
                    <font face="Arial" size="1"><b>Serial Number</b></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1">8111111</font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1"><b>Suburb/Town/City</b></font>
                </td>
                <td width="25%">
                    <font face="Arial" size="1">City Lakes</font>
                </td>
            </tr>
        </tbody>
    </table><br>

I tried using a regex to pull a 7-digit number starting with 8 (they all will), but it also matched any 8 followed by six digits inside longer strings, such as GUIDs. Is there a better way to do this? I will need to pull multiple values from different tables in the document, and I don't think a regex is a suitable method for everything. Ideally I would match the label cell (Serial Number) and then extract the value from the next cell in the same row, but I'm not 100% sure how to do that.
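
To make the idea concrete, here is a minimal sketch of matching the "Serial Number" label and reading the cell that follows it. `Get-ValueAfterLabel` is a hypothetical helper name, not an existing cmdlet, and the stand-in objects below only mimic the `innerText` property of the real `<td>` elements, so the sketch runs without IE. The anchored pattern `^8\d{6}$` is just a shape check on the extracted value, which avoids the partial matches a bare regex over the whole document would give.

```powershell
# Hypothetical helper: emits the trimmed text of the cell that immediately
# follows every cell whose trimmed text equals $Label.
# Note: -eq on strings is case-insensitive in PowerShell.
function Get-ValueAfterLabel {
    param(
        [Parameter(Mandatory)] [object[]] $Cells,
        [Parameter(Mandatory)] [string]   $Label
    )
    for ($i = 0; $i -lt $Cells.Count - 1; $i++) {
        # String interpolation turns a missing/null innerText into ''
        if ("$($Cells[$i].innerText)".Trim() -eq $Label) {
            "$($Cells[$i + 1].innerText)".Trim()
        }
    }
}

# Stand-in objects (an assumption for illustration) with the same cell text
# as the last row of the sample table.
$cells = 'Serial Number', '8111111', 'Suburb/Town/City', 'City Lakes' |
    ForEach-Object { [pscustomobject]@{ innerText = $_ } }

$serial = Get-ValueAfterLabel -Cells $cells -Label 'Serial Number'

# Optional shape check: exactly 7 digits, starting with 8.
if ($serial -match '^8\d{6}$') { $serial }
```

With the real document, the cells would presumably be passed as `@($sourceHTML.body.getElementsByTagName('td'))`; that assumes the COM collection enumerates into a PowerShell array, which I have not verified here. The same helper would then work for the other labels, for example `-Label 'Suburb/Town/City'`.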

Thank you

  • Can you share a bigger example of your HTML? Commented Feb 20 at 4:31
  • @SantiagoSquarzon I have added the first table, thanks.
    – Ian
    Commented Feb 20 at 4:53
  • Maybe you could try the PsParseHTML PowerShell module and convert it to text with that module and then do your search magic
    – Turdie
    Commented Feb 20 at 4:55
  • Note that the <font> element has been obsolete for 20 years or more.
    – Rob
    Commented Feb 20 at 8:33
