Hello everyone,

I need to write a small Ruby function that does word wrapping. I have the following function:

def word_wrap(text, line_width)
  if line_width.nil? || line_width < 2
    line_width = 40
  end
  text.split("\n").collect do |line|
    line.length > line_width ? line.gsub(/(.{1,#{line_width}})(\s+|$)/, "\\1\n").strip : line
  end * "\n"
end

This is basically the word_wrap function included in Rails.

I would like to write a similar function that parses a string containing span elements, except that the tags should not be counted when wrapping the line.

Example:

s = "Lorem <span>ipsum dolor</span> si<span>t</span> amet, conse<span>ctetur adipiscing elit</span> Praesent"

At the moment, word_wrap(s, 20) gives something like this:

Lorem <span>ipsum
dolor</span>
si<span>t</span>
amet,
conse<span>ctetur
adipiscing
elit</span> Praesent

It should be:

Lorem <span>ipsum dolor</span> si
<span>t</span> amet, conse<span>ctetur 
adipiscing elit</span> 
Praesent

As you can see, the new word_wrap function creates lines of at most 20 characters, without counting the <span> and </span> tags.

How would you do that? All suggestions are welcome!

Thanks in advance for your help.

A: 

This is fun! (Albeit with a bit of a homework smell.)

I have removed my previous attempts; you may check this answer's history if curious. Here's my definitive answer: a little longer vertically, but more down to earth and less hacky than the previous ones.

s = "Lorem <span>ipsum dolor</span> si<span>t</span> amet, conse<span>ctetur adipiscing elit</span> Praesent"

def word_wrap(s_arg, line_width = 40)
  producer = s_arg.dup
  consumer = ""
  counter  = 0
  while !producer.empty?
    if producer =~ %r[\A</?span>]
      consumer << producer.slice!(0, $&.length)
      next
    end
    consumer << producer.slice!(0, 1)
    counter += 1
    next if counter <= line_width
    consumer.sub!(/ (\S*?)\z/, "\n\\1")
    counter = $1.length
  end
  consumer
end

puts word_wrap(s, 20)
kch
Thanks for your help as well. However, I have to say that I find rampion's solution prettier. Also, the result is not quite the same as expected: Lorem <span>ipsum dolor</span>\n si<span>t</span> amet,\n conse<span>ctetur\n adipiscing elit</span>\n Praesent
Thank you for spending time helping me!
gn2
A: 

Here's a regex solution

irb> SPAN_RE = /(?i:<\/?span[^>]*>)/
#=> /(?i:<\/?span[^>]*>)/
irb> ALL_SPANS_RE = /(?:#{SPAN_RE}*(?!#{SPAN_RE}))/
#=> /(?:(?i-mx:<\/?span[^>]*>)*(?!(?i-mx:<\/?span[^>]*>)))/
irb> def word_wrap(str,width)
         full_re = /((?:#{ALL_SPANS_RE}.){0,#{width-1}}#{ALL_SPANS_RE}\S(?:#{SPAN_RE}+|\b))\s*/
         str.gsub(/\s*\n/, ' ').gsub(full_re, "\\1\n")
     end
#=> nil
irb> text =<<TEXT
     Lorem <span>ipsum
     dolor</span>
     si<span>t</span>
     amet,
     conse<span>ctetur
     adipiscing
     elit</span> Praesent
     TEXT
#=> "Lorem <span>ipsum\ndolor</span>\nsi<span>t</span>\namet,\nconse<span>ctetur\nadipiscing\nelit</span> Praesent\n"
irb> puts word_wrap(text,20)
Lorem <span>ipsum dolor</span> si<span>
t</span> amet, conse<span>ctetur
adipiscing elit</span>
Praesent
#=> nil
irb> word_wrap(text,20)
#=> "Lorem <span>ipsum dolor</span> si<span>\nt</span> amet, conse<span>ctetur\nadipiscing elit</span>\nPraesent\n"

Basically we grab as many characters as we can up to the word width, ignoring spans (and making sure we don't grab parts of spans), and making sure we end on a non-space character, followed by either a span, or a word break.

I'll break down how the regex works:

SPAN_RE matches one span tag (either <span> or </span> or ..., in any letter case and with or without attributes)

(?i:    - Start of a non-capturing parenthesis (useful for grouping patterns)
          The i flag means the inner pattern is case-insensitive
  <     - a literal '<' character
  \/?   - 0 or 1 forward slashes
  span  - the letters "span"
  [^>]* - 0 or more other characters that are not a '>' character
  >     - a literal '>' character
)       - end of the non-capturing parenthesis
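
As a quick sanity check of the breakdown above, here is a minimal sketch that tests a few strings against SPAN_RE. The `samples` list is made up for illustration (it is not from the original post), and the pattern is anchored with \A and \z so that the whole string has to be exactly one tag:

```ruby
SPAN_RE = /(?i:<\/?span[^>]*>)/

# Hypothetical sample strings, chosen only to exercise the pattern.
samples = ["<span>", "</span>", "<SPAN>", "<span class=\"x\">", "<div>", "span"]

# Keep only the strings that are, in their entirety, a single span tag.
matches = samples.select { |s| s.match?(/\A#{SPAN_RE}\z/) }

p matches  # => ["<span>", "</span>", "<SPAN>", "<span class=\"x\">"]
```

One caveat that follows from `[^>]*`: any tag whose name merely starts with "span" (say, a hypothetical <spanner>) would be treated as a span as well.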

ALL_SPANS_RE matches all the spans at a given position - guaranteeing that the next character matched is not the start of a span tag.

(?:             - Start of a non-capturing parenthesis (useful for grouping patterns)
  #{SPAN_RE}*   - 0 or more spans
  (?!           - Start of a negative lookahead
    #{SPAN_RE}  - exactly 1 span
                  Since this is inside a negative lookahead, it means that the next 
                  character in the string is not allowed to start a span
  )             - end of the negative lookahead
)               - end of the non-capturing parenthesis

This means that we can match one character after the ALL_SPANS_RE and be sure that we're not grabbing part of a span.
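
To make that concrete, here is a small sketch (the input fragment is taken from the question's example string; the variable name `units` is my own) that scans with ALL_SPANS_RE followed by a single `.`. Each match is one visible character, with any span tags sitting directly in front of it attached:

```ruby
SPAN_RE      = /(?i:<\/?span[^>]*>)/
ALL_SPANS_RE = /(?:#{SPAN_RE}*(?!#{SPAN_RE}))/

# Each element is "zero or more span tags + exactly one visible character".
units = "si<span>t</span> a".scan(/#{ALL_SPANS_RE}./)

p units         # => ["s", "i", "<span>t", "</span> ", "a"]
p units.length  # => 5 (the visible characters: s, i, t, space, a)
```

This is the unit that full_re repeats up to width-1 times, which is why tags never count toward the line width.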

The full_re then just greedily matches as many characters as it can, up to the desired width (ignoring spans), making sure that it ends on a non-space character that is either the end of a word or followed by a span.

(                     - start of a capturing parenthesis
  (?:                 - start of a non-capturing parenthesis
    #{ALL_SPANS_RE}   - any and all spans
    .                 - one character (which can't be the start of a span)
  )                   - end of non-capturing parenthesis
  {0,#{width-1}}      - match preceding pattern up to width-1 times
                        so this matches up to width-1 characters (ignoring spans)
  #{ALL_SPANS_RE}     - any and all spans
  \S                  - a non-whitespace character
                        we don't want to insert a "\n" after whitespace
  (?:                 - a non-capturing parenthesis
    #{SPAN_RE}+       - 1 or more spans
    |                 - OR
    \b                - the end of a word
                        these alternatives make sure we aren't breaking in the middle of a word
  )                   - end of non-capturing parenthesis
)                     - end of capturing parenthesis
\s*                   - any whitespace
                        since we're wrapping, we can just toss this when we insert the newline
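
To check that the pieces really do keep each line within the limit, here is a self-contained sketch that reuses the definitions from this answer on the question's example string and measures each wrapped line with the span tags stripped out. The `visible_widths` name is mine, introduced only for this check:

```ruby
SPAN_RE      = /(?i:<\/?span[^>]*>)/
ALL_SPANS_RE = /(?:#{SPAN_RE}*(?!#{SPAN_RE}))/

def word_wrap(str, width)
  full_re = /((?:#{ALL_SPANS_RE}.){0,#{width - 1}}#{ALL_SPANS_RE}\S(?:#{SPAN_RE}+|\b))\s*/
  str.gsub(/\s*\n/, ' ').gsub(full_re, "\\1\n")
end

s = "Lorem <span>ipsum dolor</span> si<span>t</span> amet, conse<span>ctetur adipiscing elit</span> Praesent"

wrapped = word_wrap(s, 20)

# Width of each line as a reader would see it, i.e. with the tags removed.
visible_widths = wrapped.lines.map { |line| line.chomp.gsub(SPAN_RE, "").length }

p visible_widths  # => [20, 19, 15, 8]
```

No line exceeds 20 visible characters, even though the first line is 39 characters long once the tags are included.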
rampion
this is pretty.
kch
Very impressive regexp. I had a go at it, but obviously I don't master regexps as well as you do. Thank you very much for your help. For everyone's information, I will post the result of your function, which is slightly different from the one I posted, but is absolutely valid: Lorem <span>ipsum dolor</span> si<span> t</span> amet, conse<span>ctetur adipiscing elit</span> Praesent
Thanks!
gn2
Rampion, thank you very much for all these explanations. It was really kind of you to take the time to explain your solution in detail. :)
gn2
Give someone a regex, and they'll be able to work for a day. Teach someone how to make regexes, and they'll be able to work for life.
rampion