I have a string in Ruby that looks something like this:

str = "<html>\n<head>\n\n  <title>My Page</title>\n\n\n</head>\n\n<body>" +
      "  <h1>My Page</h1>\n\n<div id=\"pageContent\">\n  <p>Here is a para" +
      "graph. It can contain  spaces that should not be removed.\n\nBut\n" +
      "line breaks that should be removed.</p></body></html>"

How would I remove all whitespace (spaces, tabs, and line breaks) that is outside of a tag, or at least not inside a tag that holds content such as <p>, using only native Ruby?

(I'd like to avoid using XSLT or something for a task this simple.)

A: 

You can condense each group of consecutive space characters into a single space (i.e., "hello     world" into "hello world") by using String#squeeze:

"hello     world".squeeze(" ")  # => "hello world"

The argument to squeeze specifies which characters are squeezed.

EDIT: I misread your question, sorry.

On its own, squeeze(" ") does not do what you asked for. It would

  • collapse consecutive spaces inside tags (which you want to keep)
  • leave individual spaces outside tags
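
A small illustration of those two points, using a made-up snippet rather than the string from the question (the variable names are only for this example):

```ruby
# Hypothetical sample input: indentation outside <p>, double spaces inside it.
html = "<div>\n  <p>keep  these  spaces</p>\n</div>"

# squeeze(" ") only collapses runs of the space character.
squeezed = html.squeeze(" ")

# The double spaces inside <p> are gone, while the line breaks and one
# space of indentation outside the tags are still there.
puts squeezed.inspect
# => "<div>\n <p>keep these spaces</p>\n</div>"
```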

I'll work on a solution right now.

Justin L.
+1  A: 
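str = str.gsub(/[\n\t]/, " ").gsub(/>\s*</, "><")

The first gsub replaces every line break and tab with a space (note the character class [\n\t]; the pattern /\n\t/ would only match a line break immediately followed by a tab). The second removes whitespace between tags. gsub is used rather than gsub! so that the chain stays safe when a pattern does not match.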
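
One caveat when chaining the bang variant: String#gsub! returns nil when it makes no substitution, so a second call chained onto it can fail with NoMethodError. A minimal sketch of the difference, with an invented input string:

```ruby
# Hypothetical input that contains no tabs or line breaks.
text = "no tabs or newlines here"

# gsub! modifies the receiver in place and returns nil if nothing matched.
result = text.gsub!(/[\n\t]/, " ")
puts result.inspect   # => nil

# gsub always returns a string, so it can be chained safely.
safe = text.gsub(/[\n\t]/, " ").gsub(/>\s*</, "><")
puts safe.inspect     # => "no tabs or newlines here"
```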

You will end up with multiple spaces inside your tags, but if you simply removed every \n and \t, you would get something like "not be removed.Butline breaks", which is not very readable. Another regular expression or the aforementioned .squeeze(" ") could take care of that.
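
As a rough sketch, the two substitutions can be wrapped in a small helper. The method name strip_markup_whitespace is made up for this example, and the first pattern uses [\n\t]+ (an assumption on my part, not the exact pattern above) so that a run of line breaks becomes one space instead of several. Runs of plain spaces inside text are left alone.

```ruby
# Hypothetical helper, not part of Ruby or any library.
def strip_markup_whitespace(html)
  html
    .gsub(/[\n\t]+/, " ")  # each run of line breaks/tabs becomes one space
    .gsub(/>\s+</, "><")   # drop whitespace that sits between two tags
end

sample = "<div>\n\n  <p>Two  spaces stay.\n\nBut\nbreaks go.</p>\n</div>"
puts strip_markup_whitespace(sample)
# => <div><p>Two  spaces stay. But breaks go.</p></div>
```

This only looks at whitespace between a closing ">" and the next "<", so it would also alter whitespace-sensitive content such as <pre> blocks; treat it as a starting point rather than a general HTML minifier.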

dhabersack