ansaurus

Question

Ruby regular expression help using match to extract pieces of html doc

Answer 1

+3 A:

If you're just extracting information out of XML, it might be easier to use something other than regular expressions. XPath is a good tool for extracting info from XML. I believe there are some libraries available for Ruby that support XPath; REXML might be worth a try.
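As a rough illustration of the XPath approach, here is a minimal sketch using REXML. The sample markup is hypothetical (the real page is not shown in this thread), and REXML expects well-formed XML, so messy real-world HTML may need a dedicated HTML parser instead.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample markup; not taken from the original page.
SAMPLE = <<~XML
  <contacts>
    <contact>
      <span class="fullName">Jane Doe</span>
      <span class="email">jane@example.com</span>
    </contact>
    <contact>
      <span class="fullName">John Roe</span>
    </contact>
  </contacts>
XML

doc = REXML::Document.new(SAMPLE)

# Collect the text of every span whose class attribute is exactly "fullName".
names = REXML::XPath.match(doc, "//span[@class='fullName']").map(&:text)

puts names.inspect
```

The XPath expression selects by element name and attribute value, so it does not depend on the exact spacing or quoting used in the markup.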

Andy White 2009-04-02 05:11:49

Specifically, I am extracting from HTML. I messed around with XPath, but because of the exact data I am trying to pull out, it actually seems very difficult to get what I want. XPath seems good for getting all the data between two nodes, which is not what I want. Also, the XPath docs for Ruby are bad!

Tony 2009-04-02 13:01:02

Answer 2

+3 A:

See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why this is a bad idea. Use an HTML parser instead.
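To make the point concrete, here is a small sketch comparing a regex with a parser on two hypothetical snippets that differ only in quoting and attribute order. REXML is used here purely because it ships with Ruby; it is an XML parser and assumes well-formed input, so treat this as an illustration rather than a recommendation for real HTML. The helper names `name_via_regex` and `name_via_parser` are made up for this example.

```ruby
require 'rexml/document'

# Two hypothetical snippets that a parser treats the same way,
# but that differ in quoting style and attribute order.
variant_a = %q(<div><span class="fullName">Jane Doe</span></div>)
variant_b = %q(<div><span id="n1" class='fullName'>Jane Doe</span></div>)

# A regex written against the first variant only.
NAME_RE = /<span class="fullName">(.*?)<\/span>/

# Returns the captured name, or nil when the pattern does not match.
def name_via_regex(markup)
  match = NAME_RE.match(markup)
  match && match[1]
end

# Returns the text of the first fullName span, or nil when there is none.
def name_via_parser(markup)
  doc = REXML::Document.new(markup)
  node = REXML::XPath.first(doc, "//span[@class='fullName']")
  node && node.text
end

puts name_via_regex(variant_a).inspect
puts name_via_regex(variant_b).inspect
puts name_via_parser(variant_a).inspect
puts name_via_parser(variant_b).inspect
```

The regex finds the name only in the first variant, while the parser-based lookup finds it in both, because it works on the parsed structure rather than on the literal characters.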

Chas. Owens 2009-04-02 05:34:53

That is a good link on why it is a bad idea to use a regex. However, I have had some trouble using XPath-style expressions to parse this data. The main reason is that I am not grabbing all of the data inside a node.

Tony 2009-04-02 13:15:03

Answer 3

A:

This is the code that parses that HTML. Feel free to suggest something better:

    contacts = []
    email, mobile = "",""

    names = page.search("//span[@class='fullName']")

    # Every contact has a fullName node, so for each fullName node, we grab the chunk of contact info
    names.each do |n|

      # next_sibling.next_sibling skips:
      # <tr>
      #   <td class="sectionHeader">Contact</td>
      #   <td class="sectionHeader">Phone</td>
      #   <td class="sectionHeader">Home</td>
      #   <td class="sectionHeader">Work</td>
      # </tr>
      # to give us the actual chunk of contact information
      # then taking the children of that chunk gives us rows of contact info
      contact_info_rows = n.parent.parent.next_sibling.next_sibling.children

      # Iterate through the rows of contact info
      contact_info_rows.each do |row|

        # Iterate through the contact info in each row
        row.children.each do |info|
          # Get Email. There are two ".next_sibling" calls because the whitespace after the "Email 1" element is parsed as a sibling node
          if info.content.strip == "Email 1:" then email = info.next_sibling.next_sibling.content.strip end

          # If the contact info has a screen name but no email, use <screen name>@aol.com
          if (info.content.strip == "Screen Name:" && email == "") then email = info.next_sibling.next_sibling.content.strip + "@aol.com" end

          # Get Mobile #'s
          if info.content.strip == "Mobile:" then mobile = info.next_sibling.content.strip end

          # Maybe we can try and get zips later.  Right now the zip field can look like the street address field
          # so we can not tell the difference.  There is no label node
          #zip_match = /\A\D*(\d{5})-?\d{4}\D*\z/i.match(info.content.strip) 
          #zip_match = /\A\D*(\d{5})[^\d-]*\z/i.match(info.content.strip)     
        end  

      end

      contacts << { :name => n.content, :email => email, :mobile => mobile }

      # clear variables
      email, mobile = "", ""
    end
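One possible simplification, sketched below under assumptions: look up each value by its label and step to the next element, which skips the whitespace text nodes that force the doubled `next_sibling` calls. The row layout is hypothetical, `value_for` is a made-up helper, and the sketch uses REXML (which needs well-formed markup) rather than the parser behind `page.search` above.

```ruby
require 'rexml/document'

# Hypothetical row layout: each label cell is followed by its value cell.
ROW = <<~XML
  <table>
    <tr>
      <td>Email 1:</td>
      <td> jane@example.com </td>
      <td>Mobile:</td>
      <td>555-0100</td>
    </tr>
  </table>
XML

# Returns the stripped text of the element that follows the <td> whose
# text equals `label`, or "" when the label (or its value cell) is missing.
def value_for(doc, label)
  label_cell = REXML::XPath.match(doc, "//td").find { |td| td.text.to_s.strip == label }
  # next_element skips whitespace-only text nodes between the two cells.
  value_cell = label_cell && label_cell.next_element
  value_cell ? value_cell.text.to_s.strip : ""
end

doc = REXML::Document.new(ROW)
email  = value_for(doc, "Email 1:")
mobile = value_for(doc, "Mobile:")
screen = value_for(doc, "Screen Name:")

# Same fallback as above: build an address from the screen name if no email was found.
email = "#{screen}@aol.com" if email.empty? && !screen.empty?

puts({ :email => email, :mobile => mobile }.inspect)
```

Because each lookup starts from the label, the result does not depend on how many whitespace nodes sit between a label and its value.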

Tony 2009-04-02 15:39:26

Answer 4

+2 A:

Using an HTML parser such as Hpricot will save you lots of headaches :)

sudo gem install hpricot

It's mostly written in C, so it's fast as well.

Here is how to use it:

http://wiki.github.com/why/hpricot/hpricot-basics

Aaron Qian 2009-04-02 16:33:04

I ended up using Nokogiri.

Tony 2009-04-03 04:11:36

Yup, there is Nokogiri too... It's a new contender in the parser realm.

Aaron Qian 2009-04-03 15:56:10
