I've hit an interesting problem that has eluded me thus far. I'm trying to extract specific information from a local HTML document. It's essentially a series of tables, and I only need specific values. I've loaded the document using:
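$sourcePath = "C:\Temp\Record.htm"
$oIE = New-Object -ComObject InternetExplorer.Application
$oIE.Navigate($sourcePath)
# Navigate() returns immediately, so wait until the page has finished loading (ReadyState 4 = complete)
while ($oIE.Busy -or $oIE.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
$sourceHTML = $oIE.Document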
Using the IE COM object was necessary because "HTMLFile" created an object but none of the inner/outer text was available. I've broken the file down into table cells for parsing, using:
$sourceHTML.body.getElementsByTagName('td')
But herein lies my problem. I need to get the 7-digit number from this entry, but I am falling short:
<td width="25%"><font face="Arial" size="1"><b>Serial Number</b></font></td>
<td width="25%"><font face="Arial" size="1">8111111</font></td>
Edit: Longer section of HTML as requested. There are about six of these tables in the document:
<p style="text-align: center;">
<font style="color: rgb(255, 0, 0); font-family: Arial Narrow;
font-size: 20pt; font-weight: bold;">TITLE OF TABLE
</font>
</p>
<h2>Registration Details</h2><br>
<table width="100%" bordercolor="#000000" border="1" cellspacing="0">
<tbody>
<tr>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"><b>
<font face="Arial" size="1">Personal Details</font>
</b></td>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"> </td>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"><b>
<font face="Arial" size="1">Contact (Work) Address Details</font>
</b></td>
<td width="25%" bordercolorlight="#ffffff" bordercolordark="#ffffff"> </td>
</tr>
</tbody>
</table>
<table width="100%" border="0" cellspacing="0">
<tbody>
<tr>
<td width="25%">
<font face="Arial" size="1"><b>Employment</b></font>
</td>
<td width="25%">
<font face="Arial" size="1"><b>CompanyNameHere</b></font>
</td>
<td width="25%">
<font face="Arial" size="1"><b>Workplace</b></font>
</td>
<td width="25%">
<font face="Arial" size="1"><b>Test Street</b></font>
</td>
</tr>
<tr>
<td width="25%">
<font face="Arial" size="1"><b>Employment Type</b></font>
</td>
<td width="25%">
<font face="Arial" size="1">Regular</font>
</td>
<td width="25%">
<font face="Arial" size="1"><b>Address Line 1</b></font>
</td>
<td width="25%">
<font face="Arial" size="1">10 Earth Place</font>
</td>
</tr>
<tr>
<td width="25%">
<font face="Arial" size="1"><b>Employment Category</b></font>
</td>
<td width="25%">
<font face="Arial" size="1"></font>
</td>
<td width="25%">
<font face="Arial" size="1"><b>Address Line 2</b></font>
</td>
<td width="25%">
<font face="Arial" size="1"></font>
</td>
</tr>
<tr>
<td width="25%">
<font face="Arial" size="1"><b>Employment Option</b></font>
</td>
<td width="25%">
<font face="Arial" size="1"></font>
</td>
<td width="25%">
<font face="Arial" size="1"><b>Address Line 3</b></font>
</td>
<td width="25%">
<font face="Arial" size="1"></font>
</td>
</tr>
<tr>
<td width="25%">
<font face="Arial" size="1"><b>Serial Number</b></font>
</td>
<td width="25%">
<font face="Arial" size="1">8111111</font>
</td>
<td width="25%">
<font face="Arial" size="1"><b>Suburb/Town/City</b></font>
</td>
<td width="25%">
<font face="Arial" size="1">City Lakes</font>
</td>
</tr>
</tbody>
</table><br>
I tried using a regex to pull a 7-digit number starting with 8 (they all will), but it also matched every 8 followed by six more digits inside longer strings, such as GUIDs. Is there a better way to do this? I will need to pull multiple values from different tables in the document, and I don't think a regex is a suitable method for everything. Ideally I would match the label cell (Serial Number) and then extract the value from the cell that follows it, but I'm not 100% sure how to do that.
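To make the label-matching idea concrete, here is a minimal sketch. It assumes the text of every td has already been collected into a string array in document order. The function name Get-ValueAfterLabel and the stand-in data are hypothetical, and the commented line showing how the array might be built from the IE document is an untested assumption.

```powershell
# Hypothetical helper: given the text of every <td> in document order,
# emit the text of the cell that follows each cell whose text equals $Label.
function Get-ValueAfterLabel {
    param(
        [string[]]$CellText,
        [string]$Label
    )
    for ($i = 0; $i -lt $CellText.Count - 1; $i++) {
        if ($CellText[$i].Trim() -eq $Label) {
            $CellText[$i + 1].Trim()
        }
    }
}

# Stand-in data mirroring the last row of the sample table. With the IE
# document, the array might instead be built like this (untested assumption):
#   $cellText = foreach ($td in $sourceHTML.body.getElementsByTagName('td')) { "$($td.innerText)" }
$cellText = @('Serial Number', '8111111', 'Suburb/Town/City', 'City Lakes')

$serial = Get-ValueAfterLabel -CellText $cellText -Label 'Serial Number'
$serial   # 8111111
```

Because the function emits one value per matching label, a label that appears in several of the six tables would return several values, which can then be indexed or filtered. A label followed by an empty cell returns an empty string.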
Thank you
<font> element has been obsolete for 20 years or more.