I have a document and want to extract a few elements that are direct children of a parent element while leaving out the others. The problem is that I don't get the elements in the order they appear in the document. The reason might be that the CSS selector I am using is wrong...

require 'rubygems'
require 'nokogiri'
require 'open-uri'

html = <<END
  <content>
    <p>Lorem</p>
    <div>
      FOO
      <p>BAR</p>
    </div>
    <h1>Ipsum</h1>
    <p>Dolor</p>
    <div>
      BAR
      <h2>FOO</h2>
    </div>
    <h2>Sit</h2>
    <p>Amet</p>
  </content>
END

Nokogiri::HTML(html).css('content > p, content > h1, content > h2').to_html
# "<p>Lorem</p><p>Dolor</p><p>Amet</p><h1>Ipsum</h1><h2>Sit</h2>"

What I want is

<p>Lorem</p><h1>Ipsum</h1><p>Dolor</p><h2>Sit</h2><p>Amet</p>
A: 

You want the elements listed in the order they appear in the document, but as you can see, they come back grouped in the order of the selectors in your CSS expression.

To solve this, you could add a class attribute to the elements you want and then select on that class alone. With only one CSS selector, the elements should come back in document order.

JoostK
Thanks for your answer! Unfortunately I can only read the source html, so I cannot add any attributes...
zoopzoop
A: 

Try using this XPath:

//content/p|//content/h1|//content/h2
Ben Alpert
Perfect, thanks!
zoopzoop