views:

35

answers:

1

Yesterday I discovered an odd bug in rather simple code that basically gets text from an ifstream and tokenizes it. The code that actually fails does a number of get()/peek() calls looking for the token "/*". If the token is found in the stream, unget() is called so the next method sees the stream starting with the token.

Sometimes, seemingly depending only on the length of the file, the unget() call fails. Internally it calls pbackfail() which then returns EOF. However after clearing the stream state, I can happily read more characters so it's not exactly EOF..

After digging in, here's the full code that easily reproduces the problem:

#include <iostream>
#include <fstream>
#include <string>

  //generate simplest string possible that triggers problem
void GenerateTestString( std::string& s, const size_t nSpacesToInsert )
{
  s.clear();
  for( size_t i = 0 ; i < nSpacesToInsert ; ++i )
    s += " ";
  s += "/*";
}

  //write string to file, then open same file again in ifs
bool WriteTestFileThenOpenIt( const char* sFile, const std::string& s, std::ifstream& ifs )
{
  {
    std::ofstream ofs( sFile );
    if( ( ofs << s ).fail() )
      return false;
  }
  ifs.open( sFile );
  return ifs.good();
}

  //find token, unget if found, report error, show extra data can be read even after error 
bool Run( std::istream& ifs )
{
  bool bSuccess = true;

  for( ; ; )
  {
    int x = ifs.get();
    if( ifs.fail() )
      break;
    if( x == '/' )
    {
      x = ifs.peek();
      if( x == '*' )
      {
        ifs.unget();
        if( ifs.fail() )
        {
          std::cout << "oops.. unget() failed" << std::endl;
          bSuccess = false;
        }
        else
        {
          x = ifs.get();
        }
      }
    }
  }

  if( !bSuccess )
  {
    ifs.clear();
    std::string sNext;
    ifs >> sNext;
    if( !sNext.empty() )
      std::cout << "remaining data after unget: '" << sNext << "'" << std::endl;
  }

  return bSuccess;
}

int main()
{
  std::string s;
  const char* testFile = "tmp.txt";
  for( size_t i = 0 ; i < 12290 ; ++i )
  {
    GenerateTestString( s, i );

    std::ifstream ifs;
    if( !WriteTestFileThenOpenIt( testFile, s, ifs ) )
    {
      std::cout << "file I/O error, aborting..";
      break;
    }

    if( !Run( ifs ) )
      std::cout << "** failed for string length = " << s.length() << std::endl;
  }
  return 0;
}

The program fails when the string length gets near the typical multiple=of-2 buffersizes 4096, 8192, 12288, here's the output:

oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 4097
oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 8193
oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 12289

This happens when tested on on Windows XP and 7, both compiled in debug/release mode, both dynamic/static runtime, both 32bit and 64bit systems/compiles, all with VS2008, default compiler/linker options. No problem found when testing with gcc4.4.5 on a 64bit Debian system.

Questions:

  1. can other people please test this? I would really appreciate some active collaboration form SO.
  2. is there anything that is not correct in the code that could cause the problem (not talking about whether it makes sense)
  3. or any compiler flags that might trigger this behaviour?
  4. all parser code is rather critical for the application and is tested heavily, but off course this problem was not found in the test code. Should I come up with extreme test cases, and if so, how do I do that? How could I ever predict this could cause a problem?
  5. if this really is a bug, where should do I best report it?
+1  A: 

is there anything that is not correct in the code that could cause the problem (not talking about whether it makes sense)

Yes. Standard streams are required to have at least 1 unget() position. So you can safely do only one unget() after a call to get(). When you call peek() and the input buffer is empty, underflow() occurs and the implementation clears the buffer and loads a new portion of data. Note that peek() doesn't increase current input location, so it points to the beginning of the buffer. When you try to unget() the implementation tries to decrease current input position, but it's already at the beginning of the buffer so it fails.

Of course this depends on the implementation. If the stream buffer holds more than one character then it may sometimes fail and sometimes not. As far as I know microsoft's implementation stores only one character in basic_filebuf (unless you specify a greater buffer explicitly) and relies on <cstdio> internal buffering (btw, that's one reason why MVS iostreams are slow). Quality implementation may load the buffer again from the file when unget() fails. But it isn't required to do so.

Try to fix your code so you don't need more than one unget() position. If you really need it then wrap the stream with a stream that guarantees that unget() won't fail (look at Boost.Iostreams). Also the code you posted is nonsense. It tries to unget() and then get() again. Why?

ybungalobill
thanks, this is it! I know the code is nonsense, it's just for testing the problem; the get() is there to mimic the behaviour of the actual code, which after finding the /* continues to read additional characters. I'm not sure why the original author chose to use peek()/get()/unget(), it would probably be less error-prone to have the method use just get() and return a flag specifying if end of line, EOF or the token was last consumed..
stijn
btw I do think the implementation stores more then one character in basic_filebuf by default: when inspecting the memory pointed to by the next read pointer, I can always see a lerge part of the file.
stijn
@stijn: You're right. Maybe they changed it in 2008 or maybe it's just more complicated. From your output in the post it looks like it uses a 4KB buffer.
ybungalobill