views:

88

answers:

1

Suppose I have the following HTML:

html = Four score and seven <b>years ago</b>

I want to parse this with Hpricot:

doc = Hpricot(html)

Find the <b> node:

node = doc.at('b')

and then get the character index of the <b> node within its parent:

node.character_index
=> 22

How can I do this (i.e., what's the real version of the character_index() function I just made up)?

+1  A: 

I don't think Hpricot works like that. Here is what I get doing a "node.inspect" based on your example

node.inspect
"{elem <b> \"years\" </b>}"

So, the position in the overall text that you are asking for just isn't there.

However, there are limited number of things you'd probably like to use the index for and you may be able to do these through the standard Hpricot methods

Mike Buckbee
Also, see this Ruby-Forum topic:http://www.ruby-forum.com/topic/167535 where this same question is asked by someone wanting to check links. Relevant points: **1)** why do this when "Character position is meaningless in an XML and HTML DOM. Whitespace can change character positions without affecting the DOM at all" and **2)** Using libxml as an alternative since "libxml stores the line number of every element. So you can extract all links, check them, and print out element.line_num for each one that fails the check"
i5m