views:

408

answers:

5

I am working on some rather inefficient C# code that wants to remove blanks lines. It does this:

      string b;
 ... 
      while ( b.IndexOf("\n\n") >= 0 )
       b = b.Replace ("\n\n", "\n");

A single replace would not cope with (e.g.) \n\n\n in the input, so the loop is needed. I think it ought to work, and it usually does.

But sometimes it manages to get into an infinite loop, and I don't understand how. On every iteration the number of \n should reduce, so it ought to terminate eventually.

Any ideas?

+6  A: 

Would this work:

String c = Regex.Replace(b, "\n\n+", "\n");
notnoop
Why the extra \n? The answer I posted below should take care of it with just \n+.
Adam Robinson
Steve Wortham
It's a premature optimization to prevent Regex from wasteful matching to replace '\n' with '\n' which is a NOOP.
notnoop
I totally agree with you msaeed. I was about to write the exact same code but you beat me to it.
Steve Wortham
So we have an accepted answer now, but still no explanation as to the cause of the initial problem...
Dan Tao
A: 

Can you give an example of a string for which this gets into an infinite loop? Also to debug your program you might try replacing it with:

while(b.IndexOf("\n\n")>=0)
{
     Console.Write(b)
     Console.Write(b.IndexOf("\n\n").ToString())
     b = b.Replace("\n\n", "\n");
}

and see what it outputs.

Alex319
Unfortunately my example is a 13000 character string.I have added some Console.Writes that confirm IndexOf is returning the same value on successive iterations. Using Visual Studio 2003, I don't see how to look into the string and see non-printing characters. I suppose I could add more Writes.
Rob625
+6  A: 

I don't have an explanation for your inexplicable infinite loop (are you CERTAIN it's infinite? Have you checked to see if the string changes?), but you can do this much easier and faster with a regular expression:

b = System.Text.RegularExpressions.Regex.Replace(b, "\n+", "\n")
Adam Robinson
This it the best approach, IMO.
Reed Copsey
A: 

I'm just putting this answer up here to clarify a point in case anyone else comes along and suggests that the code posted above will infinitely loop if b is an empty string. That is incorrect:

String b = String.Empty;

Console.WriteLine(b.IndexOf("\n\n"));

// output: -1

The documentation states that IndexOf will return 0 if the value parameter passed to it is empty, not if the string itself (b in this case) is empty.

Dan Tao
A: 

I have pinned down the problem to a nasty string I get by reading a file (full code below).

File s.tab contains these 18 hex bytes: FF FE 41 00 0D 0A 00 0D 0A 00 0D 0A 00 42 00

Here is the debug output from my program:

b.Length=8 loop n=1, i=3, b=A?? 
?? B 
stuck at i=3, b(i)=10 2573 3328... 
done n=1, i=3, b=A?? 
?? B

So it is something to do with invalid unicode. I have printed out the decimal values of the characters of string b, starting at i = 3 = IndexOf("\n\n"). IndexOf seems to see the 10 as a newline (OK), and then 2573 (which is 0D 0A) as another (not OK?). Then Replace doesn't agree.

Clearly there is something wrong with the data in the file. But I still don't think this should happen. IndexOf and Replace ought to agree.

I am implementing msaeed's solution. Many thanks.

Debug code:

  {
   System.IO.StreamReader aFile = System.IO.File.OpenText( @"c:\xfer\s.tab");
   string a = aFile.ReadToEnd();
   aFile.Close();

   int nn=0, ii;
   Console.WriteLine ("a.Length={0}", a.Length);
   while ( (ii=a.IndexOf("\n\n")) >= 0 )
   {
    nn++;
    Console.WriteLine("loop n={0}, i={1}, a={2}"
     , nn
     , ii
     , a);
    if (ii == a.IndexOf("\n\n"))
    {
     Console.WriteLine ("stuck at i={0}, a(i)={1} {2} {3}..."
      , ii
      , (int)(a.ToCharArray()[ii])
      , (int)(a.ToCharArray()[ii+1])
      , (int)(a.ToCharArray()[ii+2])
      );
     break;
    }
    a = a.Replace ("\n\n", "\n");
   }
   Console.WriteLine("done n={0}, i={1}, a={2}", nn, ii, a);
  }
Rob625