I am using Mechanize/Nokogiri and need to parse the following HTML. Can anyone help me with the XPath syntax to do this, or suggest any other method that would work?

<table>
  <tr class="darkRow">
 <td>
   <span>
  <a href="?x=mSOWNEBYee31H0eV-V6JA0ZejXANJXLsttVxillWOFoykMg5U65P4x7FtTbsosKRbbBPuYvV8nPhET7b5sFeON4aWpbD10Dq">
   <span>4242YP</span>
  </a>
   </span>
 </td>
 <td>
   <span>Subject of Meeting</span>
 </td>
 <td>
   <span>
  <span>01:00 PM</span> 
  <span>Nov 11 2009</span> 
  <span>America/New_York</span>
   </span>
 </td>
 <td>
   <span>30</span>
 </td>
 <td>
   <span>
  <span>[email protected]</span>
   </span>
 </td>
 <td>
  <span>39243368</span>
 </td>
  </tr>
  .
  .
  .
  <more table rows with the same format>
</table>

I want this as the output:

"4242YP","Subject of Meeting","01:00 PM Nov 11 2009 America/New_York","30","[email protected]", "39243368"
.
.
.
<however many rows exist in the html table>
+3  A: 

Something like this?

items = doc.xpath('//tr').map do |row|
  row.xpath('.//span/text()')
     .select { |node| node.text.match(/\w+/) }
     .map(&:text)
end

returns: => [["4242YP", "Subject of Meeting", "01:00 PM", "Nov 11 2009", "America/New_York", "30", "[email protected]", "39243368"], ["abcdefg"]]

The select keeps only the text nodes that contain at least one word character, which drops the whitespace-only text nodes found between some of your nested spans. You may need to refine the select filter for your specific case.

I added a minimal second row with a single span containing abcdefg, so that you can see the nested array.

JasonTrue
Didn't use your example exactly, but it got me thinking about different ways of doing it. Thanks for the help!
thomas
Yep, I could only hazard guesses as to how predictable your HTML format is, and how important the joining of the nested spans was, so I figured you could work from something minimalist.
JasonTrue
A: 

Here's part of the XSL to transform your input, if you have an XSL transformer:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<xsl:template match="/">
   <xsl:apply-templates select="//tr"/>
</xsl:template>

<xsl:template match="tr">
   "<xsl:value-of select="td/span/a/span"/>","<xsl:value-of select="td[position()=2]/span"/>","<xsl:value-of select="td[position()=3]/span/span[position()=1]"/>"
</xsl:template>

</xsl:stylesheet>

Output produced looks like this:

"4242YP","Subject of Meeting","01:00 PM"
"4242YP","Subject of Meeting","01:00 PM"

(I duplicated your first table row).

The select attributes in the XSL should give you a good idea of the XPath expressions you'd need to get the remaining fields.

Carl Smotricz