Greetings everyone:

I would love to get some information from a huge collection of Google search result pages. The only thing I need is the URLs inside a particular set of HTML tags.

I could not find any other suitable way to handle this problem, so I am now turning to Ruby.

This is what I have written so far:

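require 'net/http'
require 'uri'

url = URI.parse('http://www.google.com.au')
res = Net::HTTP.start(url.host, url.port) { |http|
    # Use a real query string: anything after '#' is a fragment and is not part of the search query.
    http.get('/search?hl=en&q=helloworld') }
puts res.body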

Unfortunately I cannot use the recommended hpricot gem (the installation fails because a make command is missing, or something like that).

So I would like to stick with this approach.

Now that I can get the response body as a string, the only thing I need is to retrieve whatever is inside the cite HTML tags.

How should I do that? Using a regular expression? Can anyone give me an example?

Many thanks in advance!!!

A: 

Split the string on the opening tag. If you limit the split to two pieces, you get a head (everything before the tag) and a tail (everything after it). Then split the tail once on the closing tag, which again gives two pieces: the new head is the text that was between your tags, and the new tail is the remainder of the string, which you can process again if the tag may appear more than once.

An example that may not be exactly correct but you get the idea:

head1, tail1 = str.split('<tag>', 2)    # limit 2: split only at the first opening tag
head2, tail2 = tail1.split('</tag>', 2) # head2 is the text between the tags
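
To handle a tag that appears more than once, the two splits can be repeated on the remainder. The following is only a minimal sketch of that idea: `between_tags` is a hypothetical helper name, the sample markup is invented, and it assumes the opening tag carries no attributes.

```ruby
# Hypothetical helper (not part of any library): collects the text between
# every <tag>...</tag> pair using only String#split with a limit of 2.
def between_tags(str, tag)
  open_tag  = "<#{tag}>"
  close_tag = "</#{tag}>"
  results = []
  rest = str
  loop do
    # Split at the first opening tag; tail is nil when none is left.
    _head, tail = rest.split(open_tag, 2)
    break if tail.nil?
    # Split at the first closing tag; rest is nil when the tag is unclosed.
    inner, rest = tail.split(close_tag, 2)
    break if rest.nil?
    results << inner
  end
  results
end

# Invented sample input for illustration.
html = '<p><cite>www.example.com/a</cite> text <cite>www.example.org/b</cite></p>'
puts between_tags(html, 'cite')
```

With the sample string above this prints the two example addresses, one per line.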
kajaco
scan is much better than split for this...
glenn mcdonald
better in what way?
kajaco
+2  A: 

If you're having problems with hpricot, you could also try Nokogiri, which is very similar and lets you do the same things.

Mike Trpcic
A: 

I think this will solve it:

# res is a Net::HTTPResponse, so scan its body string
res.body.scan(/<cite>([^<>]*)<\/cite>/imu).flatten

# This one ignores empty tags:

res.body.scan(/<cite>([^<>]*)<\/cite>/imu).flatten.reject { |x| x.empty? }
khelll
Thanks for that!
Michael Mao
This will screw up if the citation has any html tags in it, including italic, bold, etc. It will also fail if the opening tag has any attributes, though I don't know if Google ever does that. Here's an alternate regex that deals with these two cases: /<cite(?: .*?)?>(.*?)<\/cite>/imu
glenn mcdonald
Seems interesting, can you elaborate on the "<cite(?: .*?)?>" part? Just for the record...
khelll
(?:) is a non-capturing group, so (?: .*?) will non-greedily match a space followed by any other characters. The ? after the parens makes the whole paren thing optional. In other words, "<cite" can be followed immediately by the ">", or by a space and then a bunch of other stuff.
glenn mcdonald
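
To make the difference between the two patterns in this thread concrete, here is a small sketch that runs both against an invented snippet. The method names `strict_cites` and `lenient_cites` and the sample markup are assumptions for illustration only, and the `u` flag is left out because the sample is plain ASCII. Stripping nested tags from the lenient matches is an extra step added here, not something the comments above prescribe.

```ruby
# Pattern from the answer: no attributes, no nested tags allowed.
def strict_cites(html)
  html.scan(/<cite>([^<>]*)<\/cite>/im).flatten
end

# Pattern from the comment: optional attributes, anything inside the tag.
# Nested markup such as <b>...</b> is then stripped from each match.
def lenient_cites(html)
  html.scan(/<cite(?: .*?)?>(.*?)<\/cite>/im).flatten
      .map { |text| text.gsub(/<[^>]*>/, '') }
end

# Invented sample input; real result pages may look different.
html = '<cite>www.example.com/plain</cite>' \
       '<cite class="url">www.example.org/<b>bold</b>/page</cite>'

p strict_cites(html)
p lenient_cites(html)
```

On this sample the strict pattern only finds the first citation, while the lenient one finds both.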
Hi all: Google is kind enough not to add any other tags inside a cite tag :) That's good news, isn't it?
Michael Mao
+1  A: 

Here's one way to do it using Nokogiri:

require 'nokogiri'

# Parse the response body and collect the text of every cite element
Nokogiri::HTML(res.body).css("cite").map {|cite| cite.content}
Greg Campbell