ansaurus

Question

Nokogiri oddness?

Answer 1

+7 A:

It has to do with the way Nokogiri's parse method works. Here's the source:

# File lib/nokogiri.rb, line 55
    def parse string, url = nil, encoding = nil, options = nil
      doc =
        if string =~ /^\s*<[^Hh>]*html/i # Probably html
          Nokogiri::HTML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_HTML)
        else
          Nokogiri::XML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_XML)
        end
      yield doc if block_given?
      doc
    end

The key is the line if string =~ /^\s*<[^Hh>]*html/i # Probably html. When you pass the result of open directly, Nokogiri receives an IO-like object rather than a String, so the regex test never matches and the document is always parsed as XML. On the other hand, read returns a String, so the test can match and the document can be treated as HTML. In this case it is, because the string matches that regex. Here's the start of that string:

<!DOCTYPE html PUBLIC

The regex matches the "!DOCTYPE " to [^Hh>]* and then matches the "html", thus assuming it's HTML. Why someone selected this regex to determine if the file is HTML is beyond me. With this regex, a file that begins with a tag like <definitely-not-html> is considered HTML, but <this-is-still-not-html> is considered XML, because [^Hh>]* cannot get past the "h" in "this". You're probably best off staying away from this dumb function and invoking Nokogiri::HTML::Document.parse or Nokogiri::XML::Document.parse directly.
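To see the heuristic's behavior without fetching anything, the regex from the source above can be run directly against a few sample strings. This is a minimal sketch in plain Ruby; the names PROBABLY_HTML and probably_html? are made up for illustration and are not part of Nokogiri.

```ruby
# The sniffing regex copied from Nokogiri.parse as quoted above.
PROBABLY_HTML = /^\s*<[^Hh>]*html/i

# Hypothetical helper (not a Nokogiri method): true when the heuristic
# would route the input to the HTML parser.
def probably_html?(string)
  !!(string =~ PROBABLY_HTML)
end

samples = [
  '<!DOCTYPE html PUBLIC',
  '<definitely-not-html>',
  '<this-is-still-not-html>',
  '<feed xmlns="http://www.w3.org/2005/Atom">'
]

samples.each do |s|
  puts "#{s.inspect} => #{probably_html?(s) ? 'HTML' : 'XML'}"
end
```

The first two samples come out as HTML and the last two as XML, which is why an Atom feed passed in as a string still ends up in the XML parser while an XHTML page does not.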

Pesto 2009-07-21 13:26:23

Ah. And Ugh. Yes, it's very easily fooled. To work around it I wrote some methods for both document types that test for "/html/head" and for the RSS and Atom tags, and they seem to catch HTML, RSS and Atom docs reliably. I'm parsing a document as both HTML::Document and XML::Document though, and don't like having to do that. I kind of think Hpricot scores a point because it only has one document type. Now, why does a .xpath('/feed/entry') search fail when .search('feed entry') succeeds on a Nokogiri::XML::Document? That's making me nuts too because it doesn't seem consistent.

Greg 2009-07-21 16:03:46

Technically the CSS selector `feed entry` isn't equivalent to the XPath `/feed/entry`. The equivalent XPath is `//feed//entry`. In the case of Atom, your original XPath is correct, though. Your problem is that you have to include the namespaces. Try this: `/xmlns:feed/xmlns:entry`

Pesto 2009-07-21 18:13:18

Thanks Pesto, you've been very helpful!

Greg 2009-07-22 04:39:36

@ZED: How about upvoting and accepting the answer then?

Geoffrey Chetwood 2009-07-23 13:06:17

I accepted the answer, but I lack the reputation to influence the vote. :-)

Greg 2009-07-23 21:45:50

Answer 2

+1 A:

Responding to this part of your question:

I thought I could write some tests to determine the type but then I ran into xpaths not finding elements, but regular searches working:

I've just come across this problem using Nokogiri to parse an Atom feed. The problem seemed to come down to the default namespace declaration on the root element:

<feed xmlns="http://www.w3.org/2005/Atom">

Removing the xmlns declaration from the source XML lets Nokogiri search with unprefixed XPath as usual. Removing that declaration from the feed itself obviously wasn't an option here, so instead I just removed the namespaces from the document after parsing, e.g.:

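require 'nokogiri'
require 'open-uri'  # on Ruby 3.0+, call URI.open instead of open

doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
doc.remove_namespaces!  # strip all namespaces so plain XPath matches
doc.xpath('/feed/entry').length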

Ugly I know, but it did the trick.

Will 2010-06-10 14:45:31

+1 for the remove_namespaces! method. I never knew that and your comment saved me awesome amounts of time.

rhh 2010-07-23 19:43:31
