I'd like to find a way to produce the HTML result shown further below, using the following Ruby code and the Nokogiri gem:

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='1'>A</p>
      <p id='2'>B</p>
      <h1>Bla</h1>
      <p id='3'>C</p>
      <p id='4'>D</p>
      <p id='5'>E</p>
    </body>
  </html>
HTML_END

# The selected-array is given by the application.
# It consists of a sorted array with all ids of 
# <p> that need to be enclosed by the <div>
selected = ["2","3","4"]
first_p = selected.first
last_p = selected.last

#
# WHAT RUBY CODE DO I NEED TO INSERT HERE TO GET
# THE RESULTING HTML AS SEEN BELOW?
#

The resulting HTML should look like this (please note the inserted <div id='XYZ'>):

<html>
  <body>
    <p id='1'>A</p>
    <div id='XYZ'>
      <p id='2'>B</p>
      <h1>Bla</h1>
      <p id='3'>C</p>
      <p id='4'>D</p>
    </div>
    <p id='5'>E</p>
  </body>
</html>

Thanks for your help!

+2  A: 

In these kinds of situations you typically want to use whatever SAX interface the underlying library offers you, to traverse and rewrite the input XML (or XHTML) statefully and serially:

require 'nokogiri'
require 'cgi'

Nokogiri::XML::SAX::Parser.new(
  Class.new(Nokogiri::XML::SAX::Document) {
    def initialize first_p, last_p
      @first_p, @last_p = first_p, last_p
    end

    def start_document
      puts '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">'
    end

    def start_element name, attrs = []
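      # attrs arrives as a flat array or as [name, value] pairs, depending on the Nokogiri version
      attrs = Hash[*attrs.flatten]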
      @depth += 1 unless @depth.nil?
      print '<div>' if name=='p' && attrs['id'] == @first_p
      @depth = 1    if name=='p' && attrs['id'] == @last_p && @depth.nil?
      print "<#{ [ name, attrs.collect { |k,v| "#{k}=\"#{CGI::escapeHTML(v)}\"" } ].flatten.join(' ') }>"
    end

    def end_element name
      @depth -= 1 unless @depth.nil?
      print "</#{name}>"
      if @depth == 0
        print '</div>'
        @depth = nil
      end
    end

    def cdata_block string
      # CDATA content is emitted verbatim; escaping it would change the data
      print "<![CDATA[#{string}]]>"
    end

    def characters string
      print CGI::escapeHTML(string)
    end

    def comment string
      print "<!--#{string}-->"
    end
  }.new('2', '4')
).parse(<<-HTML_END)
  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
  <html>
    <body>
      <!-- comment -->
      <![CDATA[
        cdata goes here
      ]]>
      &quot;special&quot; entities 
      <p id="1">A</p>
      <p id="2">B</p>
      <p id="3">C</p>
      <p id="4">D</p>
      <p id="5">E</p>
      <emptytag/>
    </body>
  </html>
HTML_END

Alternatively, you can use the DOM interface instead of SAX: load the entire document into memory (as you started doing in your question), then insert and remove nodes as follows:

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='1'>A</p>
      <p id='2'>B</p>
      <p id='3'>C</p>
      <p id='4'>D</p>
      <p id='5'>E</p>
    </body>
  </html>
HTML_END

first_p = "2"
last_p = "4"

doc.css("p[id=\"#{first_p}\"] ~ p[id=\"#{last_p}\"]").each { |node|
  div_node = nil
  node.parent.children.each { |sibling_node|
    if sibling_node.name == 'p' && sibling_node['id'] == first_p
      div_node = Nokogiri::XML::Node.new('div', doc)
      sibling_node.add_previous_sibling(div_node)
    end
    unless div_node.nil?
      sibling_node.remove
      div_node << sibling_node
    end
    if sibling_node.name == 'p' && sibling_node['id'] == last_p
      div_node = nil
    end
  }
}

puts doc

Cheers, V.

vladr
This is not correct. A <div> may contain other block-level elements.
Zack Mulgrew
@zacm my mistake, I was thinking of span instead of div
vladr
Hm, that looks quite complicated. Hpricot offers easy ways to alter HTML code (http://wiki.github.com/why/hpricot/hpricot-altering) so I can't imagine Nokogiri wouldn't offer something similar... too bad Nokogiri's documentation isn't as good as Hpricot's. :(
Javier
@Javier, see my update with the DOM way of doing things (like hpricot's)... not much simpler given the specific problem you are trying to solve (if only it supported more advanced CSS3 selectors...), but still
vladr
@Vlad: What are the more advanced CSS3 selectors you'd like to see supported? As Nokogiri supports CSS3, its developer would probably be quite interested in such a feature request.
Javier
By the way: Does anyone know a good guide/tutorial for Nokogiri? At least something like the hpricot-github-wiki-articles? I couldn't find a thing and I'm really struggling with how to use Nokogiri correctly.
Javier
@Vlad: Before I forget, thanks a lot for the update of your answer!
Javier
@Vlad: I'm trying to understand your code and this is probably quite a n00b question, but could you explain what the CSS selector (actually the ~ between the two paragraphs) does? Thank you very much.
Javier
I just noticed that I forgot to mention that I have a sorted array with the selected values. I've corrected my question.
Javier
@Javier, the CSS3 ~ selector is documented at http://www.w3.org/TR/css3-selectors/ ; essentially "A ~ B" selects the elements B that are preceded by a sibling A (i.e. it guarantees that we only enter the block above when |node| is preceded by a sibling P with ID first_p)
vladr
@Javier, the CSS3 selector specs mandate that ':not()' can only refer to simple selectors; if that were not the case, all elements between two P's with IDs first_p and last_p respectively could possibly have been obtained with a single selector without all the 'if's inside the loop above
vladr
E.g. 'p#first_p, p#first_p ~ :not(p#last_p ~ *)' would select and return all elements from p#first_p until p#last_p inclusive in one shot, which you could then have removed in bulk from the DOM and reinserted under the DIV, though it would still not have been trivial given your particular problem.
vladr
+1  A: 

This is the working solution I've implemented into my project (Vlad@SO & Whitelist@irc#rubyonrails: Thanks for your help and inspiration.):

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='1'>A</p>
      <p id='2'>B</p>
      <h1>Bla</h1>
      <p id='3'>C</p>
      <p id='4'>D</p>
      <p id='5'>E</p>
    </body>
  </html>
HTML_END

# The selected-array is given by the application.
# It consists of a sorted array with all ids of 
# <p> that need to be enclosed by the <div>
selected = ["2","3","4"]

# We want elements, not node sets!
# .first returns a Nokogiri::XML::Element instead of a Nokogiri::XML::NodeSet
first_p = value.css("p##{selected.first}").first
last_p = value.css("p##{selected.last}").first
parent = value.css('body').first

# build and set new div_node
div_node = Nokogiri::XML::Node.new('div', value)
div_node['id'] = 'XYZ'

# add div_node before first_p
first_p.add_previous_sibling(div_node)

selected_node = false

parent.children.each do |tag|
  # start moving nodes once we reach a <p> whose id is in the selection
  selected_node = true if selected.include? tag['id']
  # move this node (selected <p> or anything in between) into the div
  div_node.add_child(tag) if selected_node
  # stop moving nodes after the last selected <p>
  selected_node = false if selected.last == tag['id']
end

puts value.to_html
Javier