Here's a regex solution:
irb> SPAN_RE = /(?i:<\/?span[^>]*>)/
#=> /(?i:<\/?span[^>]*>)/
irb> ALL_SPANS_RE = /(?:#{SPAN_RE}*(?!#{SPAN_RE}))/
#=> /(?:(?i-mx:<\/?span[^>]*>)*(?!(?i-mx:<\/?span[^>]*>)))/
irb> def word_wrap(str,width)
  full_re = /((?:#{ALL_SPANS_RE}.){0,#{width-1}}#{ALL_SPANS_RE}\S(?:#{SPAN_RE}+|\b))\s*/
  str.gsub(/\s*\n/, ' ').gsub(full_re, "\\1\n")
end
#=> nil
irb> text =<<TEXT
Lorem <span>ipsum
dolor</span>
si<span>t</span>
amet,
conse<span>ctetur
adipiscing
elit</span> Praesent
TEXT
#=> "Lorem <span>ipsum\ndolor</span>\nsi<span>t</span>\namet,\nconse<span>ctetur\nadipiscing\nelit</span> Praesent\n"
irb> puts word_wrap(text,20)
Lorem <span>ipsum dolor</span> si<span>
t</span> amet, conse<span>ctetur
adipiscing elit</span>
Praesent
#=> nil
irb> word_wrap(text,20)
#=> "Lorem <span>ipsum dolor</span> si<span>\nt</span> amet, conse<span>ctetur\nadipiscing elit</span>\nPraesent\n"
Basically, we grab as many characters as we can,
up to the wrap width, ignoring spans (and making
sure we don't grab parts of spans), while making sure
we end on a non-space character followed by
either a span or a word break.
I'll break down how the regex works:
SPAN_RE
matches one span tag: <span>, </span>,
or an opening tag that carries attributes.
(?i: - Start of a non-capturing parenthesis (useful for grouping patterns)
The i flag means the inner pattern is case-insensitive
< - a literal '<' character
\/? - 0 or 1 forward slashes
span - the letters "span"
[^>]* - 0 or more other characters that are not a '>' character
> - a literal '>' character
) - end of the non-capturing parenthesis
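To see what SPAN_RE accepts in practice, here is a small sketch. The sample string is made up for illustration and is not part of the original text.

```ruby
SPAN_RE = /(?i:<\/?span[^>]*>)/

# Made-up sample: lower-case and upper-case spans, one with an attribute,
# plus a <div> pair that should be left alone.
sample = 'a <span>b</span> c <SPAN class="x">d</SPAN> <div>e</div>'

# scan returns every non-overlapping match of the pattern.
tags = sample.scan(SPAN_RE)
p tags
# => ["<span>", "</span>", "<SPAN class=\"x\">", "</SPAN>"]

# Caveat: [^>]* accepts any characters, so a tag whose name merely
# starts with "span" is matched as well.
loose = '<spanner>'.match?(SPAN_RE)
p loose
# => true
```

If tags such as <spanner> can appear in the input, adding a \b right after span would probably be enough to exclude them.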
ALL_SPANS_RE
matches all the spans at a given position, guaranteeing that the next character
matched is not the start of a span tag.
(?: - Start of a non-capturing parenthesis (useful for grouping patterns)
#{SPAN_RE}* - 0 or more spans
(?! - Start of a negative lookahead
#{SPAN_RE} - exactly 1 span
Since this is inside a negative lookahead, it means that the next
character in the string is not allowed to start a span
) - end of the negative lookahead
) - end of the non-capturing parenthesis
This means that we can match one character after ALL_SPANS_RE
and be sure that we're not grabbing part of a span.
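The effect of the negative lookahead is easiest to see by comparing the pattern with and without it. This is only a sketch; the test strings and the variable names (first_visible, unguarded, guarded) are invented for illustration.

```ruby
SPAN_RE      = /(?i:<\/?span[^>]*>)/
ALL_SPANS_RE = /(?:#{SPAN_RE}*(?!#{SPAN_RE}))/

# All leading tags are consumed, so the "." captures the first
# visible character.
first_visible = '<span></span>x<span>y'[/\A#{ALL_SPANS_RE}(.)/, 1]
p first_visible
# => "x"

# Without the lookahead, the engine is free to backtrack: when no
# character follows the tag, it gives the tag back and lets "." take
# the "<" that opens it.
unguarded = '<span>'[/\A#{SPAN_RE}*(.)/, 1]
p unguarded
# => "<"

# With the lookahead, that backtracking path is rejected, so there is
# simply no match.
guarded = '<span>'[/\A#{ALL_SPANS_RE}(.)/, 1]
p guarded
# => nil
```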
The full_re then just greedily matches as many characters as it can,
up to the desired width (ignoring spans), making sure that it ends on a
non-space character that is either at the end of a word or followed by a span.
( - start of a capturing parenthesis
(?: - start of a non-capturing parenthesis
#{ALL_SPANS_RE} - any and all spans
. - one character (which can't be the start of a span)
) - end of non-capturing parenthesis
{0,#{width-1}} - match the preceding pattern up to width-1 times
so this matches at most width-1 characters (ignoring spans)
#{ALL_SPANS_RE} - any and all spans
\S - a non-whitespace character
we don't want to insert a "\n" after whitespace
(?: - a non-capturing parenthesis
#{SPAN_RE}+ - 1 or more spans
| - OR
\b - a word boundary (typically the end of a word)
these alternatives make sure we only break at a word boundary or at a span tag
) - end of non-capturing parenthesis
) - end of capturing parenthesis
\s* - any whitespace
since we're wrapping, we can just toss this when we insert the newline
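
As a sanity check on the breakdown above, the following sketch measures each wrapped line with its span tags removed. visible_width is a hypothetical helper added here for illustration; it is not part of the solution itself.

```ruby
SPAN_RE      = /(?i:<\/?span[^>]*>)/
ALL_SPANS_RE = /(?:#{SPAN_RE}*(?!#{SPAN_RE}))/

def word_wrap(str, width)
  full_re = /((?:#{ALL_SPANS_RE}.){0,#{width-1}}#{ALL_SPANS_RE}\S(?:#{SPAN_RE}+|\b))\s*/
  str.gsub(/\s*\n/, ' ').gsub(full_re, "\\1\n")
end

# Hypothetical helper (an assumption for this sketch): the number of
# characters left in a line once every span tag has been removed.
def visible_width(line)
  line.gsub(SPAN_RE, '').length
end

# Same sample text as in the irb session above.
text = "Lorem <span>ipsum\ndolor</span>\nsi<span>t</span>\namet,\n" \
       "conse<span>ctetur\nadipiscing\nelit</span> Praesent\n"

wrapped = word_wrap(text, 20)
widths  = wrapped.lines.map { |line| visible_width(line.chomp) }
p widths
# => [20, 19, 15, 8]
```

No line exceeds 20 visible characters. The first line ends in "si<span>", which shows that the #{SPAN_RE}+ alternative allows a break at a tag even when the tag sits inside a word.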