I have a string with a bunch of break tags.
Unfortunately, they are irregular:
<Br> <BR> <br/> <BR/> <br />
etc.
I am using Nokogiri, but I don't know how to tell it to break up the string at each break tag.
Thanks.
If you can break on regular expressions, use the following delimiter:
<\s*[Bb][Rr]\s*\/*>
Explanation:
One left angle bracket, zero or more whitespace characters, B or b, R or r, zero or more whitespace characters, zero or more forward slashes, one right angle bracket.
To use the regex, look here:
http://www.regular-expressions.info/ruby.html
If you parse the string with Nokogiri, you can then scan through it and ignore anything other than text elements:
require 'nokogiri'
doc = Nokogiri::HTML.parse('a<Br>b<BR>c<br/>d<BR/>e<br />f')
text = []
doc.search('p').first.children.each do |node|
text << node.content if node.text?
end
p text # => ["a", "b", "c", "d", "e", "f"]
Note that you have to search for the first p tag because Nokogiri will wrap the whole thing in <!DOCTYPE blah blah><html><body><p>YOUR TEXT</p></body></html>.
So to implement iftrue's response:
a = 'a<Br>b<BR>c<br/>d<BR/>e<br />f'
a.split(/<\s*[Bb][Rr]\s*\/*>/)
# => ["a", "b", "c", "d", "e", "f"]
...you're left with an array of the bits of the string between the HTML breaks.
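If the input is messier than the samples in the question, a slightly more forgiving variant may help. This is only a sketch under a few assumptions: the names `BR_TAG` and `split_on_breaks` are made up for illustration, the `i` flag replaces the `[Bb][Rr]` character classes, `[^>]*` is my guess at how to tolerate attributes such as `<br class="x">`, and dropping empty pieces (for example from a leading break tag) is a choice you may not want.

```ruby
# Hypothetical helper, not part of Nokogiri or any library.
# Matches <br>, <BR/>, <br />, and <br ...attributes...>, case-insensitively.
# The \b keeps it from matching other tags that merely start with "br".
BR_TAG = /<\s*br\b[^>]*>/i

def split_on_breaks(html)
  html.split(BR_TAG)     # cut the string at every break tag
      .map(&:strip)      # trim whitespace around each piece
      .reject(&:empty?)  # drop empty pieces, e.g. from a leading <br>
end

p split_on_breaks('<Br>this<BR>is<br/>a<BR/>text<br />string')
# => ["this", "is", "a", "text", "string"]
```

Like any regex approach, this treats the input as plain text, so it will also split on a break tag that appears inside a comment or an attribute value. If that matters, the Nokogiri-based answers are the safer route.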
Pesto's 99% of the way there. However, Nokogiri supports creating a document fragment, which doesn't wrap the text in the DOCTYPE declaration and html/body/p tags:
require 'nokogiri'
# Keep only the text nodes, then take their string content
text = Nokogiri::HTML::DocumentFragment.parse('<Br>this<BR>is<br/>a<BR/>text<br />string').children.select(&:text?).map(&:content)
puts text
# >> this
# >> is
# >> a
# >> text
# >> string