I use the hpricot gem in Ruby on Rails to parse a webpage and extract the meta tag contents. But if the page has a <noscript> tag just after the <head> tag, it throws an exception:

Exception: undefined method `[]' for nil:NilClass

I even updated the gem to the latest version, but the result is the same.

This is the sample code I use:

require 'rubygems'
require 'hpricot'
require 'open-uri'
begin
       index_page = Hpricot(open("http://sample.com"))
       puts index_page.at("/html/head/meta[@name='verification']")['content'].gsub(/\s/, "")
rescue Exception => e
       puts "Exception: #{e}"
end

I was thinking of removing the noscript tag before giving the page to hpricot. Or is there any other way to do it?

My HTML snippet:
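
A minimal sketch of that pre-processing idea, assuming the page is already available as a string and that the noscript blocks are well formed and not nested. `strip_noscript` is a hypothetical helper name, not part of Hpricot, the content value is a placeholder, and a regex is a blunt tool for HTML in general:

```ruby
# Hypothetical helper (not part of Hpricot): drop every <noscript> block
# from the raw HTML string before it is handed to a parser.
def strip_noscript(html)
  # /m lets "." span newlines, /i ignores tag case, and the lazy ".*?"
  # stops at the first closing tag so separate blocks are removed one by one.
  html.gsub(%r{<noscript\b[^>]*>.*?</noscript\s*>}mi, '')
end

html = <<~HTML
  <html>
  <head>
  <noscript>
  <meta http-equiv="refresh" content="0; url=http://www.yoursite.com/noscripts.html"/>
  </noscript>
  <meta name="verification" content="abc123" />
  </head>
  </html>
HTML

cleaned = strip_noscript(html)
puts cleaned
# The cleaned string could then be parsed as before, e.g. Hpricot(cleaned)
```

This only sidesteps the problem; it does not explain why the parser trips over the noscript block in the first place.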

<html> 
<head> 
<noscript> 
<meta http-equiv="refresh" content="0; url=http://www.yoursite.com/noscripts.html"/> 
</noscript> 
<meta name="verification" content="7ff5e90iormq5niy6x98j75-o1yqwcds-c1b1pjpdxt3ngypzdg7p80d6l6xnz5v3buldmmjcd4hsoyagyh4w95-ushorff60-f2e9bzgwuzg4qarx4z8xkmefbe-0-f" /> 
</head> 
<body> 
<h1>Testing</h1> 
</body> 
</html>
A: 

I can't duplicate the exception with Hpricot. However, I do see problems with how you are trying to find the meta tag.

I shortened the HTML sample to help my sample code fit into the answer box here, then saved the HTML locally so I could use open-uri to get at it.

<html> 
<head> 
<noscript> 
<meta http-equiv="refresh" /> 
</noscript> 
<meta name="norton-safeweb-site-verification" /> 
</head> 
<body> 
<h1>Testing</h1> 
</body> 
</html>

Contemplate the results of the searches below:

#!/usr/bin/env ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://localhost:3000/test.html'))

(doc / 'meta').size # => 2
(doc / 'meta')[1] # => {emptyelem <meta name="norton-safeweb-site-verification">}

(doc % 'meta[@name]') # => {emptyelem <meta name="norton-safeweb-site-verification">}

(doc % 'meta[@name="verification"]') # => nil
(doc % 'meta[@name*="verification"]') # => {emptyelem <meta name="norton-safeweb-site-verification">}

(doc % 'meta[@name="norton-safeweb-site-verification"]') # => {emptyelem <meta name="norton-safeweb-site-verification">}

Remember that '/' in Hpricot means .search() or "find all occurrences" and '%' means .at() or "find the first occurrence". A long path to the desired element is often less likely to find what you want. Look for unique things in the element or its siblings or parents. A long accessor breaks more easily because the preceding layout of the page is considered when searching; if something in the page changes, the accessor becomes invalid, so search atomically or in the smallest group of elements you can. Also, the Hpricot docs recommend using CSS accessors, so I'm using those in the example code.

Searching for any 'meta' tag found two occurrences. So far so good. Grabbing the second one was one way of getting at what you want.

Searching for "meta with a name parameter" found the target.

Searching for "meta with a name parameter consisting of 'verification'" fails, because there isn't one. Searching inside the parameter using "*=" works.

Searching for "meta with a name parameter consisting of 'norton-safeweb-site-verification'" succeeds, because that is the full parameter value.
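
The nil returned by a failed search is also where the error in the question comes from: calling ['content'] on nil raises the undefined method `[]' exception. A small guard avoids that. This is only a sketch: `meta_content` is a hypothetical helper name, and the only assumption about the document object is that it responds to .at() and returns an element or nil, as Hpricot documents do.

```ruby
# Hypothetical helper: return the whitespace-stripped "content" attribute
# of the first element matching +selector+, or nil if nothing matches.
# +doc+ is assumed to be any object whose #at returns an element or nil.
def meta_content(doc, selector)
  element = doc.at(selector)
  return nil if element.nil? # no match: avoid calling [] on nil

  value = element['content']
  value.nil? ? nil : value.gsub(/\s/, '')
end

# Usage with the document from the example above (assumes a meta tag
# that actually carries a content attribute):
#   content = meta_content(doc, 'meta[@name*="verification"]')
#   puts(content || 'verification meta tag not found')
```

Returning nil instead of raising lets the caller decide whether a missing tag is an error or just an unverified site.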

Hpricot has a pretty good set of CSS selectors:

http://wiki.github.com/whymirror/hpricot/supported-css-selectors

Now, all that said, I recommend using Nokogiri over Hpricot. I have found cases where Hpricot silently failed but Nokogiri successfully parsed malformed XML and HTML.

Greg
Hi Greg, thanks for your reply. I'll try the CSS selector method and see if it works properly for me. By the way, how will I get the content of my meta tag after selecting it by name with (doc % 'meta[@name*="verification"]')?
railscoder
Also, I want to make sure the meta tag is inside the head tag. But this CSS selector will select the meta tag even if it is present outside the body tag, right?
railscoder
You'll get the content of the parameter the same way you did in your code sample. Re: the meta tags - they are valid in the head block only, not the body block so your second comment/question doesn't make a lot of sense. Find it by looking for the right "name" parameter, like I showed in the example code.
Greg
For a specific reason I need to get the meta tag like this: at("/html/head/meta[@name='verification']"). But if there is a noscript tag inside the head, the parser cannot find the meta tag; otherwise it works. However, at("meta[@name='verification']") works perfectly even with the noscript tag.
railscoder
This seems to be a bug in hpricot. I raised an issue on GitHub.
railscoder