ansaurus

Question

Ruby: How to implement word wrap that ignores <span> tags when computing line length?

Answer 1

A:

This is fun! (Albeit with a bit of a homework smell.)

I have removed my previous attempts; you may check this answer's history if curious. Here's my definitive answer: a little longer vertically, but more down to earth and less hacky than the previous ones.

s = "Lorem <span>ipsum dolor</span> si<span>t</span> amet, conse<span>ctetur adipiscing elit</span> Praesent"

def word_wrap(s_arg, line_width = 40)
  producer = s_arg.dup
  consumer = ""
  counter  = 0
  while !producer.empty?
    if producer =~ %r[\A</?span>]
      consumer << producer.slice!(0, $&.length)
      next
    end
    consumer << producer.slice!(0, 1)
    counter += 1
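    next if counter <= line_width
    # Break at the last space on the current line; if there is none, keep going.
    next unless consumer.sub!(/ (\S*?)\z/, "\n\\1")
    # Count only the visible characters carried over, not the tag text.
    counter = $1.gsub(%r[</?span>], "").length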
  end
  consumer
end

puts word_wrap(s, 20)

kch 2009-05-07 00:06:14

Thanks for your help as well. However, I have to say that I find rampion's solution prettier. Also, the result is not quite the same as expected: Lorem ipsum dolor\n sit amet,\n consectetur\n adipiscing elit\n Praesent

Thank you for spending time helping me!

gn2 2009-05-07 10:26:39

Answer 2

+1 A:

Here's a regex solution

irb> SPAN_RE = /(?i:<\/?span[^>]*>)/
#=> /(?i:<\/?span[^>]*>)/
irb> ALL_SPANS_RE = /(?:#{SPAN_RE}*(?!#{SPAN_RE}))/
#=> /(?:(?i-mx:<\/?span[^>]*>)*(?!(?i-mx:<\/?span[^>]*>)))/
irb> def word_wrap(str,width)
         full_re = /((?:#{ALL_SPANS_RE}.){0,#{width-1}}#{ALL_SPANS_RE}\S(?:#{SPAN_RE}+|\b))\s*/
         str.gsub(/\s*\n/, ' ').gsub(full_re, "\\1\n")
     end
#=> nil
irb> text =<<TEXT
     Lorem <span>ipsum
     dolor</span>
     si<span>t</span>
     amet,
     conse<span>ctetur
     adipiscing
     elit</span> Praesent
     TEXT
#=> "Lorem <span>ipsum\ndolor</span>\nsi<span>t</span>\namet,\nconse<span>ctetur\nadipiscing\nelit</span> Praesent\n"
irb> puts word_wrap(text,20)
Lorem <span>ipsum dolor</span> si<span>
t</span> amet, conse<span>ctetur
adipiscing elit</span>
Praesent
#=> nil
irb> word_wrap(text,20)
#=> "Lorem <span>ipsum dolor</span> si<span>\nt</span> amet, conse<span>ctetur\nadipiscing elit</span>\nPraesent\n"

Basically we grab as many characters as we can up to the line width, ignoring spans (and making sure we don't grab parts of spans), and making sure we end on a non-space character followed by either a span or a word break.

I'll break down how the regex works:

SPAN_RE matches one span tag (either <span> or </span> or <SPAN> or ...)

(?i:    - Start of a non-capturing parenthesis (useful for grouping patterns)
          The i flag means the inner pattern is case-insensitive
  <     - a literal '<' character
  \/?   - 0 or 1 forward slash
  span  - the letters "span"
  [^>]* - 0 or more other characters that are not a '>' character
  >     - a literal '>' character
)       - end of the non-capturing parenthesis
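To see the breakdown above in action, here is a small sketch that tries SPAN_RE against a few strings. The sample strings are made up for illustration and are not from the question; the pattern is anchored with \A and \z only so that each sample has to match in full.

```ruby
# Same pattern as SPAN_RE in the answer above.
SPAN_RE = /(?i:<\/?span[^>]*>)/

# Hypothetical sample strings, chosen only for illustration.
samples = ['<span>', '</span>', '<SPAN>', '<span class="x">', '<div>', 'span']

# Keep only the samples that the pattern matches in full.
matched = samples.select { |s| s.match?(/\A#{SPAN_RE}\z/) }
p matched
# prints ["<span>", "</span>", "<SPAN>", "<span class=\"x\">"]
```

Note that, as written, the pattern would also accept a tag such as <spanner>, since [^>]* allows any characters after "span".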

ALL_SPANS_RE matches all the spans at a given position, guaranteeing that the next character matched is not the start of a span tag.

(?:             - Start of a non-capturing parenthesis (useful for grouping patterns)
  #{SPAN_RE}*   - 0 or more spans
  (?!           - Start of a negative lookahead
    #{SPAN_RE}  - exactly 1 span
                  Since this is inside a negative lookahead, it means that the next 
                  character in the string is not allowed to start a span
  )             - end of the negative lookahead
)               - end of the non-capturing parenthesis

This means that we can match one character after ALL_SPANS_RE and be sure that we're not grabbing part of a span.
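A quick way to check this is to scan a string with "all spans, then one character" and look at the pieces. This is only a sketch: UNIT_RE is a name introduced here for illustration and does not appear in the answer's code.

```ruby
SPAN_RE      = /(?i:<\/?span[^>]*>)/
ALL_SPANS_RE = /(?:#{SPAN_RE}*(?!#{SPAN_RE}))/

# One "visible character" unit: any run of span tags, then a single character.
# (UNIT_RE is a hypothetical helper name, not part of the original answer.)
UNIT_RE = /#{ALL_SPANS_RE}./

units = 'si<span>t</span> a'.scan(UNIT_RE)
p units
# prints ["s", "i", "<span>t", "</span> ", "a"]
```

Each element carries exactly one visible character, with any tags in front of it attached, so five elements correspond to the five visible characters of "sit a".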

The full_re then just greedily matches as many characters as it can, up to the desired width (ignoring spans), making sure that it ends on a non-space character that is either the end of a word or followed by a span.

(                     - start of a capturing parenthesis
  (?:                 - start of a non-capturing parenthesis
    #{ALL_SPANS_RE}   - any and all spans
    .                 - one character (which can't be the start of a span)
  )                   - end of non-capturing parenthesis
  {0,#{width-1}}      - match preceding pattern up to width-1 times
                        so this matches up to width-1 characters (ignoring spans)
  #{ALL_SPANS_RE}     - any and all spans
  \S                  - a non-whitespace character
                        we don't want to insert a "\n" after whitespace
  (?:                 - a non-capturing parenthesis
    #{SPAN_RE}+       - 1 or more spans
    |                 - OR
    \b                - the end of a word
                        these alternatives make sure we aren't breaking in the middle of a word
   )                  - end of non-capturing parenthesis
 )                    - end of capturing parenthesis
 \s*                  - any whitespace
                        since we're wrapping, we can just toss this when we insert the newline
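To check that the wrapped lines really respect the width once the tags are ignored, one can strip the spans from each output line and measure what is left. The sketch below reuses word_wrap and the sample text from the irb session; visible_widths is a hypothetical helper added only for this check.

```ruby
SPAN_RE      = /(?i:<\/?span[^>]*>)/
ALL_SPANS_RE = /(?:#{SPAN_RE}*(?!#{SPAN_RE}))/

# word_wrap as defined in the answer above.
def word_wrap(str, width)
  full_re = /((?:#{ALL_SPANS_RE}.){0,#{width-1}}#{ALL_SPANS_RE}\S(?:#{SPAN_RE}+|\b))\s*/
  str.gsub(/\s*\n/, ' ').gsub(full_re, "\\1\n")
end

# visible_widths is a hypothetical helper: it drops the span tags from each
# line and returns the number of characters that remain.
def visible_widths(wrapped)
  wrapped.lines.map { |line| line.chomp.gsub(SPAN_RE, '').length }
end

# The same text as in the irb session above.
text = "Lorem <span>ipsum\ndolor</span>\nsi<span>t</span>\namet,\n" \
       "conse<span>ctetur\nadipiscing\nelit</span> Praesent\n"

wrapped = word_wrap(text, 20)
widths  = visible_widths(wrapped)
p widths
# prints [20, 19, 15, 8]
```

Every line stays within the requested 20 visible characters, although the first break lands inside "sit", because a span tag is accepted as a break point.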

rampion 2009-05-07 04:32:18

this is pretty.

kch 2009-05-07 04:55:32

Very impressive regexp. I had a go at it, but obviously I haven't mastered regexps as well as you have. Thank you very much for your help. For everyone's information, I will post the result of your function, which is slightly different from the one I posted, but is absolutely valid: Lorem ipsum dolor si t amet, consectetur adipiscing elit Praesent

Thanks!

gn2 2009-05-07 09:23:15

Rampion, thank you very much for all these explanations. It was really kind of you to take the time to explain your solution in detail. :)

gn2 2009-05-07 15:54:18

Give someone a regex, and they'll be able to work for a day. Teach someone how to make regexes, and they'll be able to work for life.

rampion 2009-05-07 15:57:16
