ansaurus

Question

How to use Ruby to get the string between HTML <cite> tags?

Answer 1

A:

Split the string on the opening tag. Assuming only one instance of the tag (or limiting the split to two pieces), you get two pieces, which I'll call head and tail. Take tail and split it on the closing tag, again into at most two pieces. The new head is what was between your tags, and the new tail is the remainder of the string, which you can process again if the tag could appear more than once.

An example that may not be exactly correct but you get the idea:

head1, tail1 = str.split('<tag>', 2)    # tail1 is everything after the opening tag
head2, tail2 = tail1.split('</tag>', 2) # head2 is the text between the tags
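
The "process the tail again" step can be sketched as a small loop. This is only an illustration of the idea above, not part of the original answer: the helper name between_tags and the sample string are made up. Note that String#split needs a limit of 2 to return two pieces; a limit of 1 returns the whole string unsplit.

```ruby
# Collect the text between every open_tag ... close_tag pair using
# String#split with a limit of 2 (hypothetical helper for illustration).
def between_tags(str, open_tag, close_tag)
  results = []
  rest = str
  loop do
    # Split once on the opening tag; everything before it is discarded.
    _head, tail = rest.split(open_tag, 2)
    break if tail.nil? # no opening tag left

    # Split once on the closing tag; inner is the text we want.
    inner, rest = tail.split(close_tag, 2)
    break if rest.nil? # opening tag without a matching closing tag

    results << inner
  end
  results
end

html = 'a <cite>one</cite> b <cite>two</cite> c'
p between_tags(html, '<cite>', '</cite>')
```

For the sample string this prints ["one", "two"]. The sketch only handles a bare opening tag with no attributes, which is the same limitation as the two-line version.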

kajaco 2009-09-18 02:26:27

scan is much better than split for this...

glenn mcdonald 2009-09-18 13:53:07

better in what way?

kajaco 2009-09-20 01:07:32

Answer 2

+2 A:

If you're having problems with Hpricot, you could also try Nokogiri, which is very similar and lets you do the same things.

Mike Trpcic 2009-09-18 02:52:51

Answer 3

A:

I think this will solve it:

res.scan(/<cite>([^<>]*)<\/cite>/imu).flatten

# This one to ignore empty tags:

res.scan(/<cite>([^<>]*)<\/cite>/imu).flatten.reject(&:empty?)

khelll 2009-09-18 03:10:53

Thanks for that!

Michael Mao 2009-09-18 05:26:39

This will screw up if the citation has any html tags in it, including italic, bold, etc. It will also fail if the opening tag has any attributes, though I don't know if Google ever does that. Here's an alternate regex that deals with these two cases: /<cite(?: .*?)?>(.*?)<\/cite>/imu

glenn mcdonald 2009-09-18 14:05:31

Seems interesting, can you elaborate on the "<cite(?: .*?)?>" part? Just for the record...

khelll 2009-09-18 14:20:12

(?:) is a non-capturing group, so (?: .*?) will non-greedily match a space followed by any other characters. The ? after the parens makes the whole paren thing optional. In other words, "<cite" can be followed immediately by the ">", or by a space and then a bunch of other stuff.
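
To see the difference in practice, here is a small sketch that runs both patterns from this thread against a made-up sample string (the sample HTML and variable names are assumptions for illustration only):

```ruby
# The two patterns quoted in this thread.
strict  = /<cite>([^<>]*)<\/cite>/imu           # no attributes, no nested tags
lenient = /<cite(?: .*?)?>(.*?)<\/cite>/imu     # optional attributes, any content

# Made-up sample: one plain tag, one with an attribute and nested markup.
sample = '<cite>plain.example.com</cite> ' \
         '<cite class="x">www.<b>example</b>.org</cite>'

strict_hits  = sample.scan(strict).flatten
lenient_hits = sample.scan(lenient).flatten

p strict_hits
p lenient_hits
```

The strict pattern finds only "plain.example.com", while the lenient one also returns "www.<b>example</b>.org", nested tags included.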

glenn mcdonald 2009-09-18 21:50:27

Hi all: Google is so kind as not to add any other tags inside a cite tag :) That's good news, isn't it?

Michael Mao 2009-09-23 06:17:56

Answer 4

+1 A:

Here's one way to do it using Nokogiri:

Nokogiri::HTML(res.body).css("cite").map {|cite| cite.content}

Greg Campbell 2009-09-18 03:41:43
