Basically, I want to strip from the document all the text between <blockquote> tags. I'm a regular expression newb, and even after using Rubular I'm no closer to the answer.

Any help is appreciated.

+10  A: 

Use an HTML parser and forget regular expressions. Regex is incapable of correctly handling HTML.

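require 'nokogiri'                  # gem install nokogiri
doc = Nokogiri::HTML(your_html)     # your_html: a String holding the HTML document
doc.xpath("//blockquote").remove    # drops every <blockquote> element and its contents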

From: Strip text from HTML document using Ruby

There are more examples of how to use Nokogiri and XPath, if you look around.

Tomalak
A: 

raw example:

/<blockquote>([^<]*)<\/blockquote>/
oraz
This fails for `<blockquote>Some <b>bold text</b></blockquote>`. As I said: Regex is *technically incapable* of correctly handling HTML.
Tomalak
@Tomalak: Yes, I see.
oraz
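
The failure described in the comment above can be reproduced with plain Ruby. This is only a minimal sketch: the surrounding `<p>` tags and the variable names are assumptions made for illustration, while the pattern and the failing blockquote come from the thread.

```ruby
# Pattern from the answer: the body of the blockquote may not contain "<".
pattern = /<blockquote>([^<]*)<\/blockquote>/

# Illustrative inputs (the <p> wrappers are assumed, not from the original post).
plain  = "<p>Intro</p><blockquote>Hello world</blockquote><p>Outro</p>"
nested = "<p>Intro</p><blockquote>Some <b>bold text</b></blockquote><p>Outro</p>"

plain_result  = plain.gsub(pattern, "")
nested_result = nested.gsub(pattern, "")

puts plain_result   # the blockquote is removed
puts nested_result  # unchanged, because the inner <b> tag stops [^<]* from matching
```

As soon as the quoted text contains any markup of its own, the pattern silently matches nothing and the blockquote stays in the document.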
A: 

Sample string:

<blockquote>Hello world</blockquote>

Type the following regex into Rubular: <blockquote>(.+?)<\/blockquote>

or for something more generic:

<.*?>(.+?)<\/.*?>

hope it helps!

Paul
This fails for `<blockquote>Some <blockquote>quoted text</blockquote> within a quote.</blockquote>`.
Tomalak
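
A short sketch of why the nested case breaks, using the input string from the comment above; the variable names are illustrative. The lazy `.+?` stops at the first closing tag it finds, which belongs to the inner blockquote.

```ruby
# Lazy pattern from the answer above.
pattern = /<blockquote>(.+?)<\/blockquote>/

html = "<blockquote>Some <blockquote>quoted text</blockquote> within a quote.</blockquote>"

# The match runs from the OUTER opening tag to the INNER closing tag.
captured = html[pattern, 1]
result   = html.gsub(pattern, "")

puts captured  # Some <blockquote>quoted text
puts result    #  within a quote.</blockquote>
```

The leftover text and the dangling closing tag show that the regex has no notion of which closing tag pairs with which opening tag.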
If we are just talking Ruby: resultarray = htmlstring.split(/<.*?>/). The split() method discards the regex matches and keeps the text between them. FYI: the scan() method does the opposite. If you're a newb, I suggest spending some time learning regexes; they're pretty language-agnostic and will serve you well.
Paul
If this comment was for me: No, I'm not a "newb" as far as regular expressions go. ;) And `htmlstring.split(/<.*?>/)` fails for `<b title="HTML is > than RegEx">Don't do HTML with RegEx</b>`.
Tomalak
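
The last counterexample can also be checked directly. This is a minimal sketch using the string from the comment above; the variable names are illustrative.

```ruby
# %q{...} avoids having to escape the quotes and the apostrophe in the sample.
html = %q{<b title="HTML is > than RegEx">Don't do HTML with RegEx</b>}

# The lazy <.*?> ends at the first ">", which sits inside the title attribute.
parts = html.split(/<.*?>/)
text  = parts.join

puts text  # part of the attribute value leaks into the "text"
```

Because the first `>` is inside the attribute value, the rest of the tag (` than RegEx">`) is treated as text, which an HTML parser would not do.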