Yesterday I discovered an odd bug in rather simple code that basically gets text from an ifstream and tokenizes it. The code that actually fails does a number of get()/peek() calls looking for the token "/*". If the token is found in the stream, unget() is called so the next method sees the stream starting with the token.
Sometimes, seemingly depending only on the length of the file, the unget() call fails. Internally it calls pbackfail(), which then returns EOF. However, after clearing the stream state I can happily read more characters, so the stream is not actually at EOF.
After digging in, here's the full code that easily reproduces the problem:
#include <iostream>
#include <fstream>
#include <string>

// generate simplest string possible that triggers the problem
void GenerateTestString( std::string& s, const size_t nSpacesToInsert )
{
    s.clear();
    for( size_t i = 0 ; i < nSpacesToInsert ; ++i )
        s += " ";
    s += "/*";
}

// write string to file, then open the same file again in ifs
bool WriteTestFileThenOpenIt( const char* sFile, const std::string& s, std::ifstream& ifs )
{
    {
        std::ofstream ofs( sFile );
        if( ( ofs << s ).fail() )
            return false;
    }
    ifs.open( sFile );
    return ifs.good();
}

// find token, unget if found, report error, show extra data can be read even after error
bool Run( std::istream& ifs )
{
    bool bSuccess = true;
    for( ; ; )
    {
        int x = ifs.get();
        if( ifs.fail() )
            break;
        if( x == '/' )
        {
            x = ifs.peek();
            if( x == '*' )
            {
                ifs.unget();
                if( ifs.fail() )
                {
                    std::cout << "oops.. unget() failed" << std::endl;
                    bSuccess = false;
                }
                else
                {
                    x = ifs.get();
                }
            }
        }
    }
    if( !bSuccess )
    {
        ifs.clear();
        std::string sNext;
        ifs >> sNext;
        if( !sNext.empty() )
            std::cout << "remaining data after unget: '" << sNext << "'" << std::endl;
    }
    return bSuccess;
}

int main()
{
    std::string s;
    const char* testFile = "tmp.txt";
    for( size_t i = 0 ; i < 12290 ; ++i )
    {
        GenerateTestString( s, i );
        std::ifstream ifs;
        if( !WriteTestFileThenOpenIt( testFile, s, ifs ) )
        {
            std::cout << "file I/O error, aborting..";
            break;
        }
        if( !Run( ifs ) )
            std::cout << "** failed for string length = " << s.length() << std::endl;
    }
    return 0;
}
The program fails when the string length gets near multiples of the typical 4096-byte buffer size: 4096, 8192, 12288. Note that each failing length is exactly one past such a boundary, so the '/' is the last character of one buffer fill and the '*' is the first character of the next. Here's the output:
oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 4097
oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 8193
oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 12289
This happens when tested on Windows XP and Windows 7, compiled in both debug and release mode, with both the dynamic and static runtime, on both 32-bit and 64-bit systems/compiles, all with VS2008 and default compiler/linker options. No problem was found when testing with gcc 4.4.5 on a 64-bit Debian system.
Questions:
- can other people please test this? I would really appreciate some active collaboration from SO.
- is there anything incorrect in the code that could cause the problem (leaving aside whether the code makes sense)?
- or any compiler flags that might trigger this behaviour?
- all parser code is rather critical for the application and is tested heavily, but of course this problem was not caught by the test code. Should I come up with extreme test cases, and if so, how do I construct them? How could I ever have predicted that this would cause a problem?
- if this really is a bug, where do I best report it?