I have to process a block of text, which might have some spurious newlines in the middle of some of the fields. I want to strip these newlines out (replacing them with spaces), without stripping out the 'valid' newlines, which are always preceded by a \t.

So, I want to replace every newline that is not preceded by a tab with a space. To make things a little more complicated, if there's already a space on either side of the newline then I want to keep it rather than add another. In other words, this

"one\ttwo\tbuckle my \nshoe\t\t\n"

would become

"one\ttwo\tbuckle my shoe\t\t\n"

i.e., with one space between 'my' and 'shoe', not two.

EDIT - some clarification: the unwanted newlines are in the middle of a piece of regular text. If there's a space between the words where the newline occurs, I want to keep it. Otherwise, I want to add one in. E.g.

"one\ttwo\tbuckle my \nshoe\t\t\n"
=> "one\ttwo\tbuckle my shoe\t\t\n"

"one\ttwo\tbuckle my\nshoe\t\t\n"
=> "one\ttwo\tbuckle my shoe\t\t\n"

"one\ttwo\tbuckle my \n shoe\t\t\n"
=> "one\ttwo\tbuckle my shoe\t\t\n"

EDIT 2: a clumsy but working solution I came up with. I'm not very happy with it; the double gsub seems inelegant.

>> strings = ["one\ttwo\tbuckle my\nshoe\t\t\n", "one\ttwo\tbuckle my \nshoe\t\t\n", "one\ttwo\tbuckle my \n shoe\t\t\n"]
=> ["one\ttwo\tbuckle my\nshoe\t\t\n", "one\ttwo\tbuckle my \nshoe\t\t\n", "one\ttwo\tbuckle my \n shoe\t\t\n"]
>> strings.collect{|s| s.gsub(/[^\t]\n\s?/){|match| match.gsub(/\s*\n\s*/," ")} }
=> ["one\ttwo\tbuckle my shoe\t\t\n", "one\ttwo\tbuckle my shoe\t\t\n", "one\ttwo\tbuckle my shoe\t\t\n"]

This seems to work better than any of the suggestions below given my now extended requirements about adding/preserving spaces.

A: 
str = str.gsub(/\s*(?<!\t)\n\s*/, " ")
reko_t
thanks reko. That doesn't seem to make any difference to my strings: see my edit above.
Max Williams
Sorry, there was a typo in the regexp: `(<?` was supposed to be `(?<`. Try it again now.
reko_t
A: 

No lookbehind option

You can match:

(\G|[^\t])\n

And replace each match with a backreference to whatever group 1 matched.

Here's a Ruby snippet (as seen on ideone.com):

from = "\none\ttwo\tbuckle my \nshoe\t\t\nx\n\n\t\n\n"
to   = "one\ttwo\tbuckle my shoe\t\t\nx\t\n"

mod  = from.gsub(/(\G|[^\t])\n/, '\1')

puts(mod == to) # true

Essentially, we match either a character that is not a \t followed by a \n, and replace the pair with just that character (preserving it while deleting the \n), or we continue from the end of the previous match using \G, which lets us delete a \n at the beginning of the string or one that follows another deleted \n.
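
The snippet above deletes the stray newline outright. As a rough sketch of how the same capture-group idea might be adapted to the question's later requirement (exactly one space where the newline was), something like the following could work. The method name `join_broken_lines` is made up for illustration, and the pattern drops \G, so a newline at the very start of the string or directly after another newline is left untouched.

```ruby
# Hypothetical helper (not from the original answer): join lines that were
# broken mid-field, without using lookbehind.
def join_broken_lines(text)
  # Group 1 captures the last non-whitespace character before the newline.
  # Because a tab is whitespace, a newline that directly follows a tab can
  # never be part of a match and is therefore preserved.
  # Any spaces around the newline are consumed and replaced by one space.
  text.gsub(/(\S) *\n */, '\1 ')
end

inputs = [
  "one\ttwo\tbuckle my \nshoe\t\t\n",
  "one\ttwo\tbuckle my\nshoe\t\t\n",
  "one\ttwo\tbuckle my \n shoe\t\t\n"
]
inputs.each { |s| p join_broken_lines(s) }
```

All three inputs from the question should come out as "one\ttwo\tbuckle my shoe\t\t\n" under this sketch.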


Lookbehind option

If the flavor supports lookbehind, you can also match:

(?<!\t)\n

And simply replace with the empty string.
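
In Ruby, assuming version 1.9 or later (where the regex engine supports lookbehind), this could be sketched as below. The method names are hypothetical. The second method is a variant aimed at the question's extended requirement: it also swallows any plain spaces around the newline and puts a single space in its place.

```ruby
# Hypothetical helper: delete every newline not immediately preceded by a tab.
def delete_stray_newlines(text)
  text.gsub(/(?<!\t)\n/, "")
end

# Hypothetical variant: replace the stray newline, together with any spaces
# around it, by exactly one space. The lookbehind is checked at the position
# just before the newline, after the optional leading spaces.
def join_stray_newlines(text)
  text.gsub(/ *(?<!\t)\n */, " ")
end

p delete_stray_newlines("one\ttwo\tbuckle my \nshoe\t\t\n")
p join_stray_newlines("one\ttwo\tbuckle my\nshoe\t\t\n")
```

One caveat of the variant: a newline preceded by a tab and then a space is treated as stray, since the character right before it is no longer the tab.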

polygenelubricants
A: 

With a double-negative ([^\S\t] means all whitespace except TAB characters)

def fix(str)
  return str.gsub(/([^\t]|^)[^\S\t]+/, '\1 ')
end

the following tests

#! /usr/bin/ruby

require "test/unit"
require "test/unit/ui/console/testrunner"

class MyTestCases < Test::Unit::TestCase
  # assert_equal takes the expected value first, then the actual value.
  def test_after_space
    assert_equal "one\ttwo\tbuckle my shoe\t\t\n",
                 fix("one\ttwo\tbuckle my \nshoe\t\t\n")
  end

  def test_no_whitespace_neighbors
    assert_equal "one\ttwo\tbuckle my shoe\t\t\n",
                 fix("one\ttwo\tbuckle my\nshoe\t\t\n")
  end

  def test_whitespace_surrounded
    assert_equal "one\ttwo\tbuckle my shoe\t\t\n",
                 fix("one\ttwo\tbuckle my \n shoe\t\t\n")
  end

  def test_leading_newline
    assert_equal " one\ttwo",
                 fix("\none\ttwo")
  end
end

Test::Unit::UI::Console::TestRunner.run(MyTestCases)

all pass:

Loaded suite MyTestCases
Started
....
Finished in 0.000412 seconds.

4 tests, 4 assertions, 0 failures, 0 errors
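
To make the double negative in [^\S\t] more concrete, here is a small sketch that checks individual characters against that class. The constant and method names are invented for illustration. \S means "not whitespace", so negating the set "\S or tab" leaves exactly the whitespace characters other than tab.

```ruby
# Hypothetical names, for illustration only.
NON_TAB_WHITESPACE = /[^\S\t]/

# True when the given one-character string is whitespace other than a tab.
def non_tab_whitespace?(char)
  !(char =~ NON_TAB_WHITESPACE).nil?
end

# Space, newline and carriage return are in the class; tab and letters are not.
[" ", "\n", "\r", "\t", "a"].each do |ch|
  puts "#{ch.inspect}: #{non_tab_whitespace?(ch)}"
end
```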
Greg Bacon