I have a string with a bunch of break tags.

Unfortunately, they are irregular:

<Br> <BR> <br/> <BR/> <br /> etc.

I am using Nokogiri, but I don't know how to tell it to break up the string at each break tag.

Thanks.

+3  A: 

If you can break on regular expressions, use the following delimiter:

<\s*[Bb][Rr]\s*\/*>

Explanation:

One left angle bracket, zero or more whitespace characters, B or b, R or r, zero or more whitespace characters, zero or more forward slashes, one right angle bracket.

To use the regex, look here:
http://www.regular-expressions.info/ruby.html
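
As a quick sanity check, here is a minimal sketch that tests the delimiter against each variant listed in the question. The variable names (`br_tag`, `variants`, `results`) are chosen for illustration only; `String#match?` assumes Ruby 2.4 or later.

```ruby
# The delimiter from the answer above, stored in a local variable.
br_tag = /<\s*[Bb][Rr]\s*\/*>/

# The irregular variants listed in the question.
variants = ['<Br>', '<BR>', '<br/>', '<BR/>', '<br />']

# String#match? returns true if the pattern occurs in the string.
results = variants.map { |tag| tag.match?(br_tag) }
p results  # => [true, true, true, true, true]
```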

Stefan Kendall
How do I break it? Do I use gsub? string.gsub(<\s*[Bb][Rr]\s*\/*>)?
Looks like "split" is what you need.
Stefan Kendall
A: 

If you parse the string with Nokogiri, you can then scan through it and ignore anything other than text elements:

require 'nokogiri'
doc = Nokogiri::HTML.parse('a<Br>b<BR>c<br/>d<BR/>e<br />f')
text = []
doc.search('p').first.children.each do |node|
  text << node.content if node.text?
end
p text  # => ["a", "b", "c", "d", "e", "f"]

Note that you have to search for the first p tag because Nokogiri wraps the whole thing in a full document: <!DOCTYPE ...><html><body><p>YOUR TEXT</p></body></html>.

Pesto
+2  A: 

So to implement iftrue's response:

a = 'a<Br>b<BR>c<br/>d<BR/>e<br />f'
a.split(/<\s*[Bb][Rr]\s*\/*>/)
# => ["a", "b", "c", "d", "e", "f"]

This leaves you with an array of the pieces of the string between the HTML breaks.
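
One thing to watch for: if the string starts with a break tag, `split` returns an empty string as the first element. The sketch below, using hypothetical variable names and a case-insensitive form of the same pattern, shows one way to drop empty pieces and trim stray whitespace.

```ruby
# Case-insensitive form of the delimiter; br_tag is a name chosen for this sketch.
br_tag = /<\s*br\s*\/*>/i

input = '<Br>this<BR>is<br/>a<BR/>text<br />string'

# A leading delimiter produces an empty first element.
raw = input.split(br_tag)
p raw     # => ["", "this", "is", "a", "text", "string"]

# Trim surrounding whitespace and discard empty pieces.
pieces = raw.map(&:strip).reject(&:empty?)
p pieces  # => ["this", "is", "a", "text", "string"]
```

Whether empty pieces should be kept depends on the data; consecutive break tags, for example, may mark intentional blank lines.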

Max Masnick
A little simpler with just /<br *\/?>/i
glenn mcdonald
Thanks glenn, that is the best.
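
For comparison, a small sketch of how the two patterns differ. Both give the same result on the variants from the question, but the shorter one only allows plain spaces after `br` and no whitespace before it. The names `strict`, `simple`, and the `odd` input are made up for this illustration.

```ruby
strict = /<\s*[Bb][Rr]\s*\/*>/   # pattern from the first answer
simple = /<br *\/?>/i            # shorter pattern from the comment

# On the variants from the question, both patterns split identically.
listed = 'a<Br>b<BR>c<br/>d<BR/>e<br />f'
same_on_listed = listed.split(strict) == listed.split(simple)
p same_on_listed       # => true

# A newline inside the tag, or a space before "br", is only caught by the longer pattern.
odd = "a<br\n/>b< br >c"
p odd.split(strict)    # => ["a", "b", "c"]
p odd.split(simple)    # => ["a<br\n/>b< br >c"]
```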
A: 

Pesto's answer is 99% of the way there; however, Nokogiri supports creating a document fragment, which doesn't wrap the text in a full document:

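require 'nokogiri'
# Parse as a fragment, keep only the text nodes, and extract their content.
fragment = Nokogiri::HTML::DocumentFragment.parse('<Br>this<BR>is<br/>a<BR/>text<br />string')
text = fragment.children.select(&:text?).map(&:content)
puts text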
# >> this
# >> is
# >> a
# >> text
# >> string
Greg