I'm working on a personal project to automatically fill out the USPS Click & Ship form and then output the Ref# and the Delivery Confirmation #.

So far I've been able to get the whole process done, but I can't for the life of me figure out how to pull out the Ref# (which is my order #) and the Delivery Confirmation #.

For every package you print a label for, the confirmation HTML page that comes back contains the following:

 <tr class="smTableText">
  <td style="border-top:solid 1px #AAAAAA; padding-bottom:4px;" valign="top">
    <table cellpadding="0" cellspacing="0" border="0" style="margin:7px 0px 0px 5px;">
      <tr> 
       <td valign="top" class="mainText" width=46>1 of 1</td>  
       <td valign="top" width=21><a href="javascript:toggleMoreInfo(0)" tabindex="19"><img src="/cns/images/common/button_plus.gif" height="11" width="11" border="0" hspace="0" vspace="0" id="Img1" style="margin-right:10px;" alt=""></a></td>  
       <td valign="top" width=203><div class="mainText" style="margin-bottom:10px; height:1em; overflow:hidden;" id="Div1">FIRSTLAST NAME<BR>STREET ADDRESS<BR>CITY, STATE  ZIP5-ZIP4<div class="smTableText">[email protected]<BR>Ref#: 100000000<BR></div> </div><div class="smTableText"></div> </td> 
      </tr>
    </table>
  </td> 
  <td style="border-top:solid 1px #AAAAAA; padding-bottom:4px; padding-top:7px;" valign="top" class="smTableText"><div id="Div2" style="margin-left:7px; height:2.4em; overflow:hidden;">&nbsp;Ship Date: 11/17/09<br>&nbsp;Weight: 0lbs 9oz<br>&nbsp;From: 48506<br></div></td>
  <td style="border-top:solid 1px #AAAAAA; padding-bottom:4px; padding-right:15px; padding-top:7px;" valign="top" align="right" class="smTableText"><div class="smTableText" id="Div3" style="height:2.4em; overflow:hidden; margin-bottom:3px;">Priority Mail                      <br>Delivery Confirm.<br></div> <span style="font-weight:bold;" class="smTableText">Label Total</span></td>
  <td style="border-top:solid 1px #AAAAAA; padding-bottom:4px; padding-right:15px; padding-top:7px;" valign="top" align="right" class="smTableText"><div class="smTableText" id="Div4" style="height:2.4em; overflow:hidden; margin-bottom:3px;">$4.80<br>$0.00<br></div><span class="smTableTextbold">$4.80</span></td>
</tr>
<tr class="smTableText"> <td colspan=4 style="height:20px;" valign="top"><div class="mainText" style="margin:0px; padding:4px 8px 0px 8px; display:block; border-top:solid 1px #AAAAAA;">Delivery Confirmation&#153; Label Number: <span class="mainTextbold">0000 1111 2222 3333 4444 55</span></div></td> </tr>

What I need to do is loop through the entire page, find "Ref#: " and capture the next 9 characters, then find the next "Label Number: <span class="mainTextbold">" and capture the next 27 characters. Each pair of Ref#: and Label Number: <span class="mainTextbold"> values should be saved to an array.

I'm guessing that regex will probably be my best option for this? Can anyone provide an example of how this would work? VB.NET preferred, but C# is OK too.

UPDATE: As pointed out in the Comments, this is not XML but rather the HTML code from the WebBrowser Control which the page is being displayed on.

I am auto filling in each page, then invoking the click action on the submit button to get to the next page. The problem is that on this last page, the data I need isn't neatly wrapped in a tag unique to that field for me to pull from.

UPDATE #2: Alright, using the example given I have come up with the following. It seems like a lot of work to pull out the 2 values. I am guessing there must be a more efficient way of doing it.

   'Sub getdeliverynum(ByVal sText As String)
Sub getdeliverynum()
    Me.MainTabControl.SelectedTab = USPSsiteTAB
    WebBrowser1.Navigate("http://www.vaporstix.com/usps.html")
    While Not WebBrowser1.ReadyState = WebBrowserReadyState.Complete
        Application.DoEvents()
    End While
    Dim input As String = WebBrowser1.DocumentText
    Dim pattern As String = "Ref#: ([^<]+)[\S\s]*?Label Number: <span class=""mainTextbold"">([^<]+)"

    For Each match As Match In Regex.Matches(input, pattern)
        Dim instance As Double
        Dim ref As String = ""
        Dim track As String = ""
        instance = 0
        For Each group As Group In match.Groups
            instance = instance + 1
            If instance = 1 Then
                'do nothing this is the full string.... 
            ElseIf instance = 2 Then
                ref = group.Value
            ElseIf instance = 3 Then
                track = group.Value
            End If
        Next
        'replace with insert to db... this is for testing.
        MsgBox("Ref: " + ref + vbCrLf + "Confirmation: " + track)
    Next

End Sub
+2  A: 

You should use System.Xml and a proper parser to do that work. XPath, or even navigating the XmlDocument directly, would let you achieve what you are looking for.

Imports System.Xml.XPath

' Load the document and select every <span class="mainTextbold"> node.
Dim xpathDoc As New XPathDocument("c:\builder.xml")
Dim xmlNav As XPathNavigator = xpathDoc.CreateNavigator()
Dim xmlNI As XPathNodeIterator = xmlNav.Select("//span[@class='mainTextbold']")

' Print the element name and text content of each match.
While xmlNI.MoveNext()
    System.Console.WriteLine(xmlNI.Current.Name & " : " & xmlNI.Current.Value)
End While

I suggest you take a look there or there for more info on how to extract information from an XmlDocument.

An XPath selector like //span[@class='mainTextbold'] would return all of those spans.

As per Heinzi's remark, your document doesn't seem to be valid XHTML. You should convert it to XHTML using TidyNet and then parse the result of the conversion.
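Since the question's update says the page is already loaded in a WebBrowser control, a different route from the XPath approach above is to walk the control's own DOM, which needs no XHTML conversion. The following is only a sketch under assumptions: the module and function names are made up for illustration, and it relies on the class attribute being readable as "className" through the control's underlying Internet Explorer DOM.

```vbnet
Imports System.Collections.Generic
Imports System.Windows.Forms

Module LabelNumberDomSketch
    ' Hypothetical helper: collects the text of every <span class="mainTextbold">
    ' in a document that has already finished loading.
    Function GetLabelNumbers(ByVal doc As HtmlDocument) As List(Of String)
        Dim result As New List(Of String)
        For Each el As HtmlElement In doc.GetElementsByTagName("span")
            ' Assumption: the class attribute is exposed as "className".
            If el.GetAttribute("className") = "mainTextbold" AndAlso el.InnerText IsNot Nothing Then
                result.Add(el.InnerText.Trim())
            End If
        Next
        Return result
    End Function
End Module
```

It would be called as GetLabelNumbers(WebBrowser1.Document) once the control's ReadyState is Complete. Note that it only returns the label numbers; the Ref# values sit in plain text rather than in their own tag, so they would still need string matching.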

RageZ
OK, I should have mentioned that this is being returned to me via the VS 2008 built-in WebBrowser control as an actual web page. So I assume `xpathDoc = New XPathDocument("c:\builder.xml")` would become something like `xpathDoc = New XPathDocument(webbrowser1)`? Sorry if I am way off here... I've been using functions such as `WebBrowser1.Document.GetElementsByTagName("THETAG")` and `curElement.GetAttribute("THEATTRIBUTE").ToString` to help locate textboxes, dropdown lists, etc. to auto fill the form.
Travis Walker
Going to dig through those links in the meantime and see where that leads me. I knew regex could probably get it done, but I was certain there was a better way. Now it seems I am at least on the right track. I'll check back here shortly.
Travis Walker
The HTML posted in the question is not valid XML (see, e.g., the unclosed <BR> tag...)
Heinzi
+1  A: 

To answer the original question, taking into account all the mandatory caveats about "parsing" HTML with regexes, here's a regex that will do what you want:

Ref#: (.{9})[\S\s]*?Label Number: <span class="mainTextbold">(.{27})

Backreference \1 will contain the 9 characters after Ref#:, \2 will contain the 27 characters after Label number...

Alternatively, to make it a bit more robust, you could also use

Ref#: ([^<]+)[\S\s]*?Label Number: <span class="mainTextbold">([^<]+)

That way, the regex will match any characters except opening angle brackets after the markers. It will lead to a lot more backtracking in case of strings where the regex can't find a match at all. Depending on the regex engine used, this can be avoided if you use possessive matches:

Ref#: ([^<]++)[\S\s]*?Label Number: <span class="mainTextbold">([^<]++)
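For the VB.NET code in the question, note that the .NET regex engine does not accept possessive quantifiers such as `[^<]++`. It does support atomic groups, `(?>...)`, which likewise refuse to give back what they matched. A minimal sketch, assuming a sample string trimmed from the confirmation page (the module name and the sample are illustrative only):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AtomicGroupSketch
    Sub Main()
        ' Sample input trimmed from the confirmation page shown in the question.
        Dim input As String = "Ref#: 100000000<BR></div> Label Number: <span class=""mainTextbold"">0000 1111 2222 3333 4444 55</span>"

        ' (?>[^<]+) is an atomic group; the outer parentheses still capture it.
        Dim pattern As String = "Ref#: ((?>[^<]+))[\S\s]*?Label Number: <span class=""mainTextbold"">((?>[^<]+))"

        Dim m As Match = Regex.Match(input, pattern)
        If m.Success Then
            Console.WriteLine(m.Groups(1).Value) ' 100000000
            Console.WriteLine(m.Groups(2).Value) ' 0000 1111 2222 3333 4444 55
        End If
    End Sub
End Module
```

The atomic groups are non-capturing, so the two values are still found in groups 1 and 2.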

The rationale behind my support of using regexes for this task:

  1. it's trivial and easy to read/maintain - arguably easier than parsing code
  2. there's only one match per page, no nesting.
  3. it's an automatically generated page, so the structure is uniform. If USPS changes their page layout, you'd have to adjust the regex, but you'd also have to adjust your XML parser in that case.
Tim Pietzcker
There should be a space between `Ref#:` and `(.{9})`. In addition, I would replace both `(.{9})` and `(.{27})` with `[^<]+`, to just read until the next tag (end) instead of a fixed number of characters.
Heinzi
Tim / Heinzi, exactly what I was looking for! Tim, you're spot on for the reasons listed above. If there is a better way of doing this I would be more than willing to listen, but I believe this should have me up and running!
Travis Walker
This got a downvote? Sheesh...
Heinzi
Thanks Heinzi, I added your example. Testing in RegexBuddy on a string there the second marker doesn't match, the engine takes over 25.000 steps to fail using `[^<]+` (compared to about 2500 using `.{9}` or `[^<]++`). Downvote's to be expected from the "Don't do anything on HTML with regexes" dogmatists :)
Tim Pietzcker
s/there/where/g
Tim Pietzcker
Heinzi, does the edit to the original post look correct? Or is there a better way to pull out the values?
Travis Walker
@Travis: I created a separate answer for this (because comments are not good for posting source code).
Heinzi
+1  A: 

Regarding your updated question about pulling out the values:

For Each match As Match In Regex.Matches(input, pattern)
    Dim ref As String = match.Groups(1).Value
    Dim track As String = match.Groups(2).Value

    ' replace with insert to db... this is for testing.
    MsgBox("Ref: " + ref + vbCrLf + "Confirmation: " + track)
Next

(untested)
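To keep each pair instead of showing it in a message box, as the question asks, the same loop could fill a list. This is a rough sketch rather than a definitive implementation: ExtractPairs and the module name are made-up names, the sample string is a cut-down stand-in for the real page, and a List of KeyValuePair is just one convenient container.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LabelPairsSketch
    ' Hypothetical helper: returns one (Ref#, label number) pair per label in the page HTML.
    Function ExtractPairs(ByVal html As String) As List(Of KeyValuePair(Of String, String))
        Dim pattern As String = "Ref#: ([^<]+)[\S\s]*?Label Number: <span class=""mainTextbold"">([^<]+)"
        Dim pairs As New List(Of KeyValuePair(Of String, String))
        For Each m As Match In Regex.Matches(html, pattern)
            ' Group 1 is the Ref#, group 2 is the label number.
            pairs.Add(New KeyValuePair(Of String, String)(m.Groups(1).Value.Trim(), m.Groups(2).Value.Trim()))
        Next
        Return pairs
    End Function

    Sub Main()
        ' Cut-down stand-in for WebBrowser1.DocumentText.
        Dim sample As String = "Ref#: 100000000<BR></div> Label Number: <span class=""mainTextbold"">0000 1111 2222 3333 4444 55</span>"
        For Each pair As KeyValuePair(Of String, String) In ExtractPairs(sample)
            Console.WriteLine(pair.Key & " -> " & pair.Value)
        Next
    End Sub
End Module
```

In the question's code, ExtractPairs(WebBrowser1.DocumentText) would replace the outer loop, and the database insert could then iterate over the returned list.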

Heinzi