I have an HTML document of this format:

<tr><td colspan="4"><span class="fullName">Bill Gussio</span></td></tr>
    <tr>
        <td class="sectionHeader">Contact</td>
        <td class="sectionHeader">Phone</td>
        <td class="sectionHeader">Home</td>
        <td class="sectionHeader">Work</td>
    </tr>
    <tr valign="top">
        <td class="sectionContent"><span>Screen Name:</span> <span>bhjiggy</span><br><span>Email 1:</span> <span>[email protected]</span></td>
        <td class="sectionContent"><span>Mobile: </span><span>2404173223</span></td>
        <td class="sectionContent"><span>NY</span><br><span>New York</span><br><span>78642</span></td>
        <td class="sectionContent"><span>MD</span><br><span>Owings Mills</span><br><span>21093</span></td>
    </tr>

    <tr><td colspan="4"><hr class="contactSeparator"></td></tr>

    <tr><td colspan="4"><span class="fullName">Eddie Osefo</span></td></tr>
    <tr>
        <td class="sectionHeader">Contact</td>
        <td class="sectionHeader">Phone</td>
        <td class="sectionHeader">Home</td>
        <td class="sectionHeader">Work</td>
    </tr>
    <tr valign="top">
        <td class="sectionContent"><span>Screen Name:</span> <span>eddieOS</span><br><span>Email 1:</span> <span>[email protected]</span></td>
        <td class="sectionContent"></td>
        <td class="sectionContent"><span></span></td>
        <td class="sectionContent"><span></span></td>
    </tr>

    <tr><td colspan="4"><hr class="contactSeparator"></td></tr>

So it alternates: a chunk of contact info, then a "contact separator". I want to grab the contact info, so my first obstacle is to grab the chunks in between the contact separators. I have already figured out the regular expression using Rubular. It is:

/<tr><td colspan="4"><span class="fullName">((.|\s)*?)<hr class="contactSeparator">/

You can check on Rubular to verify that this isolates the chunks.

However, my big issue is that I am having trouble with the Ruby code. I use the built-in match function and print the results, but I do not get what I expect. Here is the code:

page = agent.get uri.to_s    
chunks = page.body.match(/<tr><td colspan="4"><span class="fullName">((.|\s)*?)<hr class="contactSeparator">/).captures

chunks.each do |chunk|
   puts "new chunk: " + chunk.inspect
end

Note that page.body is just the body of the HTML document grabbed by Mechanize. The HTML document is much larger but has this format. The unexpected output is below:

new chunk: "Bill Gussio</span></td></tr>\r\n\t<tr>\r\n\t\t<td class=\"sectionHeader\">Contact</td>\r\n\t\t<td class=\"sectionHeader\">Phone</td>\r\n\t\t<td class=\"sectionHeader\">Home</td>\r\n\t\t<td class=\"sectionHeader\">Work</td>\r\n\t</tr>\r\n\t<tr valign=\"top\">\r\n\t\t<td class=\"sectionContent\"><span>Screen Name:</span> <span>bhjiggy</span><br><span>Email 1:</span> <span>[email protected]</span></td>\r\n\t\t<td class=\"sectionContent\"><span>Mobile: </span><span>2404173223</span></td>\r\n\t\t<td class=\"sectionContent\"><span>NY</span><br><span>New York</span><br><span>78642</span></td>\r\n\t\t<td class=\"sectionContent\"><span>MD</span><br><span>Owings Mills</span><br><span>21093</span></td>\r\n\t</tr>\r\n\t\r\n\t<tr><td colspan=\"4\">"
new chunk: ">"

There are 2 surprises here for me:

1) There are not 2 matches that contain the chunks of contact info, even though on rubular I have verified that these chunks should be extracted.

2) All of the \r\n\t sequences (carriage returns, line feeds, tabs) are showing up in the matches.

Can anyone see the issue here?

Alternatively, if anyone knows of a good free AOL contacts importer, that would be great. I have been using blackbook but it keeps failing for me on AOL and I am attempting to fix it. Unfortunately, AOL has no contacts API yet.

Thank you!
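
A note on the two surprises above. `String#match` returns only the first match, and `.captures` lists the groups of that one match. The pattern has two groups (the outer one and the inner `(.|\s)`), so the second "chunk" printed is just the last character the inner group matched, ">". `String#scan` returns every match instead. The \r\n\t sequences are really in the page body; `inspect` only makes them visible as escapes. Below is a minimal sketch, assuming a shortened sample string (`html` is hypothetical and stands in for `page.body`) and swapping `((.|\s)*?)` for `(.*?)` with the `/m` flag so that there is a single capture group:

```ruby
# Hypothetical, shortened sample standing in for page.body
html = <<~HTML
  <tr><td colspan="4"><span class="fullName">Bill Gussio</span></td></tr>
  <tr valign="top"><td class="sectionContent"><span>Screen Name:</span> <span>bhjiggy</span></td></tr>
  <tr><td colspan="4"><hr class="contactSeparator"></td></tr>
  <tr><td colspan="4"><span class="fullName">Eddie Osefo</span></td></tr>
  <tr valign="top"><td class="sectionContent"><span>Screen Name:</span> <span>eddieOS</span></td></tr>
  <tr><td colspan="4"><hr class="contactSeparator"></td></tr>
HTML

# /m lets "." match newlines, so one lazy group replaces ((.|\s)*?)
pattern = /<tr><td colspan="4"><span class="fullName">(.*?)<hr class="contactSeparator">/m

# scan returns one array of captures per match; take the single group from each
chunks = html.scan(pattern).map(&:first)

chunks.each do |chunk|
  # Collapse runs of whitespace (including \r\n\t) for readable output
  puts "new chunk: " + chunk.gsub(/\s+/, " ").strip
end
```

This only addresses the Ruby side of the question; the answers below argue for using an HTML parser rather than a regex for this kind of extraction.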

+3  A: 

If you're just extracting information out of XML, it might be easier to use something other than regular expressions. XPath is a good tool for extracting info from XML. I believe there are some libraries available for Ruby that support XPath; maybe try REXML.

Andy White
Specifically, I am extracting from HTML. I messed around with XPath, but because of the exact data I am trying to pull out, it actually seems very difficult to get what I want. It seems like XPath is good for getting all the data between 2 nodes, which is not what I want. Also, the XPath docs for Ruby are bad!
Tony
+3  A: 

See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why this is a bad idea. Use an HTML parser instead.

Chas. Owens
That is a good link on why it is a bad idea to use a regex. However, I have had some trouble using XPath-style expressions to parse this data. The main reason is that I am not grabbing all the data inside a node.
Tony
A: 

This is the code that parses that HTML. Feel free to suggest something better:

contacts = []
email, mobile = "", ""

names = page.search("//span[@class='fullName']")

# Every contact has a fullName node, so for each fullName node we grab the chunk of contact info
names.each do |n|

  # next_sibling.next_sibling skips:
  # <tr>
  #   <td class="sectionHeader">Contact</td>
  #   <td class="sectionHeader">Phone</td>
  #   <td class="sectionHeader">Home</td>
  #   <td class="sectionHeader">Work</td>
  # </tr>
  # to give us the actual chunk of contact information;
  # taking the children of that chunk then gives us rows of contact info
  contact_info_rows = n.parent.parent.next_sibling.next_sibling.children

  # Iterate through the rows of contact info
  contact_info_rows.each do |row|

    # Iterate through the contact info in each row
    row.children.each do |info|
      # Get email. There are two ".next_sibling" calls because the space after the "Email 1:" element is processed as a sibling
      email = info.next_sibling.next_sibling.content.strip if info.content.strip == "Email 1:"

      # If the contact info has a screen name but no email, use <screen name>@aol.com
      if info.content.strip == "Screen Name:" && email == ""
        email = info.next_sibling.next_sibling.content.strip + "@aol.com"
      end

      # Get mobile numbers
      mobile = info.next_sibling.content.strip if info.content.strip == "Mobile:"

      # Maybe we can try to get zips later. Right now the zip field can look like the street address field,
      # so we cannot tell the difference. There is no label node.
      #zip_match = /\A\D*(\d{5})-?\d{4}\D*\z/i.match(info.content.strip)
      #zip_match = /\A\D*(\d{5})[^\d-]*\z/i.match(info.content.strip)
    end

  end

  contacts << { :name => n.content, :email => email, :mobile => mobile }

  # Clear variables for the next contact
  email, mobile = "", ""
end
Tony
+2  A: 

Using an HTML parser such as hpricot will save you lots of headaches :)

sudo gem install hpricot

It's mostly written in C, so it's fast as well.

Here is how to use it:

http://wiki.github.com/why/hpricot/hpricot-basics

Aaron Qian
I ended up using Nokogiri.
Tony
Yup, there is Nokogiri too... It's a new contender in the parser realm.
Aaron Qian