I am trying to use Nokogiri to parse the following HTML segment:

<tr>
 <th>Total Weight</th>
 <td>< 1 g</td>
 <td style="text-align: right">0 %</td>

</tr>             
<tr><td class="skinny_black_bar" colspan="3"></td></tr>

However, I think the "<" sign in "< 1 g" is causing Nokogiri problems. Does anyone know any workarounds? Is there a way I can escape the "<" sign? Or maybe there is a function I can call to just get the plain html segment?

A:

A bare "less than" sign (<) in text content isn't valid HTML; it should be written as &lt;. Browsers, though, have a lot of error-recovery code for figuring out what the HTML was meant to say instead of just displaying an error. That's why your invalid HTML sample displays the way you'd want it to in browsers.

So the trick is to make sure Nokogiri does the same work to compensate for bad HTML. Make sure to parse the file as HTML instead of XML:

f = File.open("table.html")
doc = Nokogiri::HTML(f)

This parses your file without raising an exception, but it throws away the "< 1 g" text. Look at how the content of the first two TD elements is parsed:

doc.xpath('(//td)[1]/text()').to_s
=> "\n "

doc.xpath('(//td)[2]/text()').to_s
=> "0 %"

Nokogiri threw out your invalid text, but kept parsing the surrounding structure. You can even see the error message from Nokogiri:

doc.errors
=> [#<Nokogiri::XML::SyntaxError: htmlParseStartTag: invalid element name>]
doc.errors[0].line
=> 3

Yup, line 3 is bad.

So it seems like Nokogiri doesn't have the same level of support for parsing invalid HTML as browsers do. I recommend using some other library to pre-process your files. I tried running TagSoup on your sample file and it fixed the < by changing it to &lt; like so:

% java -jar tagsoup-1.1.3.jar foo.html | xmllint --format -
src: foo.html
<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tbody>
        <tr>
          <th colspan="1" rowspan="1">Total Weight</th>
          <td colspan="1" rowspan="1">&lt;1 g</td>
          <td colspan="1" rowspan="1" style="text-align: right">0 %</td>
        </tr>
        <tr>
          <td colspan="3" rowspan="1" class="skinny_black_bar"/>
        </tr>
      </tbody>
    </table>
  </body>
</html>
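
If pulling in a Java tool is more than you want, the same kind of pre-processing can be roughed out in plain Ruby. This is only a minimal sketch: `escape_stray_lt` and `STRAY_LT` are hypothetical names (not part of Nokogiri or TagSoup), and the regex assumes that any "<" not followed by a letter, "/", "!" or "?" is stray text rather than markup. It would also rewrite a "<" inside script or style blocks, so it only suits simple fragments like the one in the question.

```ruby
# Hypothetical helper, not part of Nokogiri or TagSoup.
# Matches a "<" that cannot begin a tag, closing tag, comment,
# doctype, or processing instruction.
STRAY_LT = %r{<(?![A-Za-z/!?])}

# Replace every stray "<" with the entity &lt; and leave real tags alone.
# Caveat: this also touches "<" inside <script>/<style> content.
def escape_stray_lt(html)
  html.gsub(STRAY_LT, '&lt;')
end

sample = <<~HTML
  <tr>
   <th>Total Weight</th>
   <td>< 1 g</td>
   <td style="text-align: right">0 %</td>
  </tr>
HTML

cleaned = escape_stray_lt(sample)
puts cleaned
```

The cleaned string can then be handed to Nokogiri::HTML(cleaned) in place of the raw file contents, and the first TD should keep its text.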
Harold L