This is my first question here, and it would be great to find an answer. I am new to Nokogiri.

Here is my problem. I have something like this in the HTML head of a target site (here, a TechCrunch post):

<meta content="During my time at TechCrunch I've seen thousands of startups and written about hundreds of them. I sure as hell don't know all ..." name="description"/>

I would like a script that runs through the meta tags, finds the one whose name attribute is "description", and returns the value of its content attribute.

I have tried something like this:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.techcrunch.com/2009/10/11/the-underutilized-power-of-the-video-demo-to-explain-what-the-hell-you-actually-do/"
doc = Nokogiri::HTML(open(url))
posts = doc.xpath("//meta")
posts.each do |link|
  a = link.attributes['name']
  b = link.attributes['content']
end

After this, I could select the tag whose name attribute equals "description", but this code returns nil for a and b.

I also tried posts = doc.xpath("//meta"), posts = doc.xpath("//meta/*"), etc., but still got nil.

+1  A: 

The problem is not with the XPath; it seems the document does not parse fully. You can check that with puts doc: the output does not contain the full input. It seems to be a problem with parsing comments (I suspect either invalid HTML or a bug in libxml2).

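In your case I would use a regular expression as a workaround. <meta> tags are simple enough that this might work, e.g. /<meta name="([^"]*)" content="([^"]*)"/ (this pattern expects name before content, so swap the two for the tag in the question, which has content first).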
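A rough sketch of the regex workaround that does not depend on attribute order, using only Ruby's core classes. The helper name meta_description and the sample string are hypothetical, and the code assumes double-quoted attribute values and no ">" inside the tag.

```ruby
# Hypothetical helper: pull the description out of raw HTML without a parser.
# Assumes double-quoted attribute values and no '>' inside the tag.
def meta_description(html)
  html.scan(/<meta\b[^>]*>/i).each do |tag|
    # Collect name="value" pairs from this tag into a hash.
    attrs = tag.scan(/([\w-]+)\s*=\s*"([^"]*)"/).to_h
    return attrs['content'] if attrs['name'] == 'description'
  end
  nil # no matching meta tag found
end

html = '<head><meta content="During my time at TechCrunch ..." name="description"/></head>'
puts meta_description(html)
```

This is still a workaround; it will miss single-quoted or unquoted attributes, so a real parser is preferable once the document parses.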

Adrian
A: 

You should change

doc = Nokogiri::HTML(open(url))

to

doc = Nokogiri::HTML(open(url).read)

Update: or maybe not :) Actually, your code works for me, using Ruby 1.8.7 / Nokogiri 1.4.0.

mykhal