Currently I'm parsing an HTML document using Nokogiri and iterating through all the code tags like this:

html = Nokogiri::HTML(doc)
html.css("code").each do |code|
  # do something with code
  if /^@@@@/.match(code.text.split("\n")[0])
    return "this code element is at line blah"
  end
end

I don't have to use Nokogiri; it was just a convenient way to iterate through all the code elements.

When a code tag's text begins with @@@@, I want to be able to report the line number in the document where that code tag occurs. Keep in mind that two code tags can be identical.

A: 

Something like this might help get you started:

#!/usr/bin/env ruby

require 'nokogiri'

html = Nokogiri::HTML(File.open('./test.html', 'r'))

code_nodes = html.css('code').select{ |node| node.text[/^@@@@/] }
code_nodes.each do |node|

  # start at the next sibling (usually the text node holding this line's newline)
  current_node = node.next
  line_num = 0

  while current_node
    # count line-endings
    line_num += current_node.text.count("\n")

    # step to the previous sibling (AKA the one above this one.)
    current_node = current_node.previous
  end

  puts line_num
end

where test.html looks like:

<html>
  <body>
    <code>1</code>
    <code>@@@@2</code>
    <code>3</code>
  </body>
</html>

This is quick 'n dirty code that doesn't get it quite right, but I think the basic idea is sound. You need to find the nodes with "@@@@" in them, then walk backwards, counting the line-endings in each node, until you reach the top node.

The difficulty is that .previous() only returns the preceding sibling under the same parent. It never climbs to the parent or to the parent's own preceding siblings, so anything in the <head> block, if there is one, won't be counted. So, the logic in the while-loop probably needs to climb via parent() as well, and visit the parent's previous siblings and the nodes they contain.

Greg
A: 
doc = <<-HTML 
<html>
<body>
  <code>1</code>
  <code>@@@@</code>
  <code>3</code>
  <code>@@@@</code>
  <code>
    @@@@
  </code>
</body>
</html>
HTML

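# The lookahead keeps the whitespace after <code> out of the match, so any
# newlines in it are counted toward the next match instead of being lost.
line_numbers = doc.scan(/(.*?<code>)(?=\s*@@@@)/m)
  .map { |match| match.first.count("\n") }
  .inject([1]) { |sum, n| sum << sum.last + n }[1..-1]
puts line_numbers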
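
A variation on the same regex idea, offered as a sketch rather than a drop-in replacement: instead of accumulating newline counts from one match to the next, take each match's character offset and count the newlines before it. It assumes plain `<code>` opening tags with no attributes, and marked_code_lines is a hypothetical helper name.

```ruby
# Return the 1-based line number of every <code> tag whose text
# starts with @@@@ (leading whitespace allowed).
def marked_code_lines(doc)
  lines = []
  doc.scan(/<code>\s*@@@@/) do
    offset = Regexp.last_match.begin(0)       # index where "<code>" starts
    lines << doc[0...offset].count("\n") + 1  # newlines before it, plus one
  end
  lines
end

doc = <<~HTML
  <html>
  <body>
    <code>1</code>
    <code>@@@@</code>
    <code>3</code>
    <code>@@@@</code>
    <code>
      @@@@
    </code>
  </body>
  </html>
HTML

p marked_code_lines(doc)  # => [4, 6, 7]
```

Each line number is computed independently here, so a miscount on one match cannot carry over to the ones after it.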
Macario
A: 

I figured out that, for some unknown reason, Nokogiri::HTML won't return the line number of a given element when you call the line method on it, but Nokogiri::XML will. Confusing, much?

Therefore, the solution is this:

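require 'nokogiri'

# Defined first so it exists by the time the loop below calls it.
def parse_code(code)
  # do something and report back using:
  # Assume for a moment document.path exists.
  puts "Parsed code block on: #{code.line} of #{document.path}"
end

html = Nokogiri::XML(html)
html.css("code").each do |code|
  # match against the element's text, not the node object itself
  if /^@@@@/.match(code.text)
    parse_code(code)
  end
end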
Ryan Bigg