tags:

views:

928

answers:

14

There's something really weird going on: strcmp() returns -1 though both strings are exactly the same. Here is a snippet from the output of the debugger (gdb):

(gdb) print s[i][0] == grammar->symbols_from_int[107][0]
$36 = true
(gdb) print s[i][1] == grammar->symbols_from_int[107][1]
$37 = true
(gdb) print s[i][2] == grammar->symbols_from_int[107][2]
$38 = true
(gdb) print s[i][3] == grammar->symbols_from_int[107][3]
$39 = true
(gdb) print s[i][4] == grammar->symbols_from_int[107][4]
$40 = true
(gdb) print s[i][5] == grammar->symbols_from_int[107][5]
$41 = false
(gdb) print grammar->symbols_from_int[107][4]
$42 = 0 '\0'
(gdb) print s[i]
$43 = (char * const&) @0x202dc50: 0x202d730 "Does"
(gdb) print grammar->symbols_from_int[107]
$44 = (char * const&) @0x1c9fb08: 0x1c9a062 "Does"
(gdb) print strcmp(s[i],grammar->symbols_from_int[107])
$45 = -1

Any idea what's going on?

Thanks in advance,

Onur

Edit 1: Here are some snippets of my code:

# include <unordered_map>       // Used as hash table
# include <stdlib.h>
# include <string.h>
# include <stdio.h>
# include <vector>

using namespace std;
using std::unordered_map;
using std::hash;

struct eqstr
{
  bool operator()(const char* s1, const char* s2) const
  {
    return strcmp(s1, s2) == 0;
  }
};

...
<some other code>
...

class BPCFG {

  public:

        char *symbols;  // Character array holding all grammar symbols, with NULL seperating them.
        char *rules;    // Character array holding all rules, with NULL seperating them.

        unordered_map<char *, int , hash<char *> , eqstr> int_from_symbols; // Hash table holding the grammar symbols and their integer indices as key/value pairs.
...
<some other code>
...

vector<char *> symbols_from_int;        // Hash table holding the integer indices and their corresponding grammar symbols as key/value pairs.
void load_symbols_from_file(const char *symbols_file);
}

void BPCFG::load_symbols_from_file(const char *symbols_file) {
        char buffer[200];
        FILE *input = fopen(symbols_file, "r");
        int symbol_index = 0;
        while(fscanf(input, "%s", buffer) > 0) {
                if(buffer[0] == '/')
                        strcpy(symbols + symbol_index, buffer+1);
                else
                        strcpy(symbols + symbol_index, buffer);
                symbols_from_int.push_back(symbols + symbol_index);
                int_from_symbols[symbols+symbol_index] = symbols_from_int.size()-1;
                probs.push_back(vector<double>());
                hyperprobs.push_back(vector<double>());
                rules_from_IntPair.push_back(vector<char *>());
                symbol_index += strlen(symbols+symbol_index) + 1;
        }


        fclose(input);
}

This last function (BPCFG::load_symbols_from_file) seems to be the only function I modify symbols_from_int in my whole code. Please tell me if you need some more code. I'm not putting everything because it's hundreds of lines.

Edit 2: OK, I think I should add one more thing from my code. This is the constructor of BPCFG class:

BPCFG(int symbols_length, int rules_length, int symbol_count, int rule_count):
   int_from_symbols(1.5*symbol_count),
   IntPair_from_rules(1.5*rule_count),
   symbol_after_dot(10*rule_count)
{
    symbols = (char *)malloc(symbols_length*sizeof(char));
    rules = (char *)malloc(rules_length*sizeof(char));
}

Edit 3: Here is the code on the path to the point of error. It's not compilable, but it shows where the code stepped through (I checked with next and step commands in the debugger that the code indeed follows this route):

BPCFG my_grammar(2000, 5500, 194, 187);
my_grammar.load_symbols_from_file("random_50_1_words_symbols.txt");
<some irrelevant code>
my_grammar.load_rules_from_file("random_50_1_words_grammar.txt", true);
<some irrelevant code>
my_grammar.load_symbols_after_dots();

BPCFGParser my_parser(&my_grammar);
BPCFGParser::Sentence s;

// (Sentence is defined in the BPCFGParser class with
// typedef vector<char *> Sentence;)

Edge e;
try {
        my_parser.parse(s, e);
}
catch(char *e) {fprintf(stderr, "%s", e);}

void BPCFGParser::parse(const Sentence & s, Edge & goal_edge) {

        /* Initializing the chart */

        chart::active_sets.clear();
        chart::passive_sets.clear();
        chart::active_sets.resize(s.size());
        chart::passive_sets.resize(s.size());

        // initialize(sentence, goal);

        try {
                initialize(s, goal_edge);
        }
        catch (char *e) {
                if(strcmp(e, UNKNOWN_WORD) == 0)
                        throw e;
        }
<Does something more, but the execution does not come to this point>
}

void BPCFGParser::initialize(const Sentence & s, Edge & goal_edge) {
        // create a new chart and new agendas
        /* For now, we plan to do this during constructing the BPCFGParser object */

        // for each word w:[start,end] in the sentence
        //   discoverEdge(w:[start,end])

        Edge temp_edge;

        for(int i = 0;i < s.size();i++) {
                temp_edge.span.start = i;
                temp_edge.span.end = i+1;
                temp_edge.isActive = false;
                /* Checking whether the given word is ever seen in the training corpus */
                unordered_map<char *, int , hash<char *> , eqstr>::const_iterator it = grammar->int_from_symbols.find(s[i]);
                if(it == grammar->int_from_symbols.end())
                        throw UNKNOWN_WORD;
                <Does something more, but execution does not come to this point>
        }
}

Where I run the print commands in the debugger is the last

throw UNKNOWN_WORD;

command. I mean, I was stepping with next on GDB and after seeing this line, I ran all these print commands.

Thank you for your interest,
Onur

+1  A: 

Given that GDB output the only possible cause I can see is that strcmp() is bugged.

You basically did in GDB what strcmp does: compare character per character, until both are zeros (at 4).

Can you try print strcmp("Does", "Does"); ?


EDIT: also try:

print stricmp(s[i], grammar->symbols_from_int[107], 1);
print stricmp(s[i], grammar->symbols_from_int[107], 2);
print stricmp(s[i], grammar->symbols_from_int[107], 3);
print stricmp(s[i], grammar->symbols_from_int[107], 4);
print stricmp(s[i], grammar->symbols_from_int[107], 5);
Andreas Bonini
strcmp has been out in the wild for a long time. It's far more likely that the questioner has done something that's not matching his/her expectations.
duffymo
A bug in a very commonly used library, especially doing something this trivial, is pretty damn unlikely. Unless he's getting strcmp from somewhere other than the standard library.
Paul Tomblin
@duffymo: what I was thinking is that he may be using a `strcmp()` he wrote himself, or anyway not the one in the standard library. If you have an alternative explanation you're welcome to offer one, but given the GDB output it appears strcmp() is not behaving like it should.
Andreas Bonini
I'm also considering stuff like stack smashing and similar, but I don't see a scenario where they could result in this
Andreas Bonini
print strcmp("Does", "Does")returns 0.I included stdlib.h and string.h and I compile with the option -std=gnu++0x. My only expectation from strcmp() is that it returns 0 whenever the given strings are equal and returns something else otherwise.
Onur Cobanoglu
To clarify, I did not write my strcmp(). I'm using the one in string.h
Onur Cobanoglu
A: 

Does strlen(s[i]) and strlen(grammar->symbols_from_int[107]) return the same?

Also, I can't imagine this is the problem but could you use a constant value rather than i just to make sure that something weird is not going?

Jesse
Yes. I just tried and both return 4.
Onur Cobanoglu
The value of i is 0 at the time I check strcmp(). "print s[i]" and "print s[0]" returns exactly the same output by the debugger.
Onur Cobanoglu
A: 

May be size of the two string are not same.

Ashish
No, he's shown that both strings are null terminated at element 4.
Steve Fallows
A: 

Maybe there is a non-printable character in one of the strings?

simon
No, please examine the GDB output carefully.
Andreas Bonini
Yes I see the gdb output. But that doesn't mean that the failure isn't a result of corrupted data.
simon
+10  A: 

This sounds like s is a pointer to an array that was on the stack which is overwritten as soon as a new function is called, ie strcmp()

What does the debugger say they are after the strcmp() call?

epatel
+1. Yep. A million times more likely to be data corruption than a bug in strcmp! The strings are clearly the same when you examine them with the debugger, but that doesn't mean the memory locations still have those values in them by the time strcmp executes. (@Onur: Posting the actual code that is failing would make everything much clearer).
Jason Williams
s is of type vector<char *> and its reference is passed on through function calls with const keyword (so compiler checked whether it is being tried to be modified) and according to the debugger, it does not change. When I type "print s" before and after the call to strcmp(), I get exactly the same output.
Onur Cobanoglu
The best move is to post the code - if it isn't doing what you think it should do, then describing what you think it should do won't help us diagnose it.
Jason Williams
it's not an answer because custom strcmp realization works at same point of code. There are many clarifications on situation in responses and comments, that not placed in question, look for comment here: http://stackoverflow.com/questions/1959572/c-strcmp-does-not-work-correctly/1959960#1959960
ThinkJet
A: 

Here are some snippets of my code:

# include <unordered_map>       // Used as hash table
# include <stdlib.h>
# include <string.h>
# include <stdio.h>
# include <vector>

using namespace std;
using std::unordered_map;
using std::hash;

struct eqstr
{
  bool operator()(const char* s1, const char* s2) const
  {
    return strcmp(s1, s2) == 0;
  }
};

...
<some other code>
...

class BPCFG {

  public:

        char *symbols;  // Character array holding all grammar symbols, with NULL seperating them.
        char *rules;    // Character array holding all rules, with NULL seperating them.

        unordered_map<char *, int , hash<char *> , eqstr> int_from_symbols; // Hash table holding the grammar symbols and their integer indices as key/value pairs.
...
<some other code>
...

vector<char *> symbols_from_int;        // Hash table holding the integer indices and their corresponding grammar symbols as key/value pairs.
void load_symbols_from_file(const char *symbols_file);
}

void BPCFG::load_symbols_from_file(const char *symbols_file) {
        char buffer[200];
        FILE *input = fopen(symbols_file, "r");
        int symbol_index = 0;
        while(fscanf(input, "%s", buffer) > 0) {
                if(buffer[0] == '/')
                        strcpy(symbols + symbol_index, buffer+1);
                else
                        strcpy(symbols + symbol_index, buffer);
                symbols_from_int.push_back(symbols + symbol_index);
                int_from_symbols[symbols+symbol_index] = symbols_from_int.size()-1;
                probs.push_back(vector<double>());
                hyperprobs.push_back(vector<double>());
                rules_from_IntPair.push_back(vector<char *>());
                symbol_index += strlen(symbols+symbol_index) + 1;
        }


        fclose(input);
}

This last function (BPCFG::load_symbols_from_file) seems to be the only function I modify symbols_from_int in my whole code. Please tell me if you need some more code. I'm not putting everything because it's hundreds of lines.

Onur Cobanoglu
How do you allocate memory for BPCFG::symbols?
Nikolai N Fetissov
How do you allocate and mainatin the `symbols` array? Are you sure it is not reallocated inside the cycle? I don't see it in the code, but maybe you are not poisting everything.
AndreyT
@Nikolai N Fetissov: I allocate memory for BPCFG::symbols in the constructor of BPCFG with malloc(). I posted the constructor code recently.@AndreyT: Content of the symbols array is modified only in BPCFG::load_symbols_from_file() in my whole code and this function is called only once in my code (see the code on the path to the point of error). It's not called anywhere else.
Onur Cobanoglu
@Onur: Well, what I'm syaing is: Is the code of "load_symbols_from_file" you posted *complete*? Also, how do you determine how much memory to `malloc` for `BPCFG::symbols`? You said that you already posted the constrcutor for `BPCFG` - I don't see it anywhere. I don't see the `malloc` for `BPCFG::symbols` anywhere.
AndreyT
@AndreyT: This is the complete code for "load_symbols_from_file". Nothing hidden. Sorry for the confusion created by me, the constructor definition is just in the answer one above, and the call to the constructor is given in one of my answers beginning with "Here is the code on the path to the point of error.".
Onur Cobanoglu
A: 

Here is the code on the path to the point of error. It's not compilable, but it shows where the code stepped through (I checked with next and step commands in the debugger that the code indeed follows this route):

BPCFG my_grammar(2000, 5500, 194, 187);
my_grammar.load_symbols_from_file("random_50_1_words_symbols.txt");
<some irrelevant code>
my_grammar.load_rules_from_file("random_50_1_words_grammar.txt", true);
<some irrelevant code>
my_grammar.load_symbols_after_dots();

BPCFGParser my_parser(&my_grammar);
BPCFGParser::Sentence s;

// (Sentence is defined in the BPCFGParser class with
// typedef vector<char *> Sentence;)

Edge e;
try {
        my_parser.parse(s, e);
}
catch(char *e) {fprintf(stderr, "%s", e);}

void BPCFGParser::parse(const Sentence & s, Edge & goal_edge) {

        /* Initializing the chart */

        chart::active_sets.clear();
        chart::passive_sets.clear();
        chart::active_sets.resize(s.size());
        chart::passive_sets.resize(s.size());

        // initialize(sentence, goal);

        try {
                initialize(s, goal_edge);
        }
        catch (char *e) {
                if(strcmp(e, UNKNOWN_WORD) == 0)
                        throw e;
        }
<Does something more, but the execution does not come to this point>
}

void BPCFGParser::initialize(const Sentence & s, Edge & goal_edge) {
        // create a new chart and new agendas
        /* For now, we plan to do this during constructing the BPCFGParser object */

        // for each word w:[start,end] in the sentence
        //   discoverEdge(w:[start,end])

        Edge temp_edge;

        for(int i = 0;i < s.size();i++) {
                temp_edge.span.start = i;
                temp_edge.span.end = i+1;
                temp_edge.isActive = false;
                /* Checking whether the given word is ever seen in the training corpus */
                unordered_map<char *, int , hash<char *> , eqstr>::const_iterator it = grammar->int_from_symbols.find(s[i]);
                if(it == grammar->int_from_symbols.end())
                        throw UNKNOWN_WORD;
                <Does something more, but execution does not come to this point>
        }
}

Where I run the print commands in the debugger is the last

throw UNKNOWN_WORD;

command. I mean, I was stepping with next on GDB and after seeing this line, I ran all these print commands.

Thank you for your interest,
Onur

Onur Cobanoglu
A: 

OK, I think I should add one more thing from my code. This is the constructor of BPCFG class:

BPCFG(int symbols_length, int rules_length, int symbol_count, int rule_count):
   int_from_symbols(1.5*symbol_count),
   IntPair_from_rules(1.5*rule_count),
   symbol_after_dot(10*rule_count)
{
    symbols = (char *)malloc(symbols_length*sizeof(char));
    rules = (char *)malloc(rules_length*sizeof(char));
}
Onur Cobanoglu
This should be an edit of your question, not an answer. Do you need a certain rep to edit your own question?
Fred Larson
@Fred Larson: Sorry for being a novice, thanks for editing my question and again sorry for taking your time with this. I'll keep that in mind next time.
Onur Cobanoglu
+1  A: 

The only way you'll ever figure this out is to step INTO strcmp with the debugger.

Die in Sente
A: 

I would strongly recommend trying to solve the problem with regular STL strings first - you'll get cleaner code and automated memory management so you can concentrate on the parser logic. Only after the thing is working and profiling proves that string manipulation is the performance bottleneck I would look at all the algorithms used in greater detail, then at specialized string allocators, and - as the last resort - at manual character array manipulation, in that order.

Nikolai N Fetissov
-1, There's always a guy plugging his favorite library/coding style while ignoring the question at hand. Wish I could give more than -1..
Blindy
@Blindy: the STL is not merely a favorite library. Nikolai is advocating an approach that obviates the problem with strcmp, which is an answer to the problem. Sometime the best solution is to take a different route.
outis
Boy, there's always a guy who doesn't know what he's talking about ...
Nikolai N Fetissov
+1  A: 

I would strongly suggest you zero out the memory before you start using it. I realize that the GDB output makes no sense because you do verify it's a null terminated strings, but I've had a lot of string.h bizarre problems go away with a memset, bzero, calloc or whatever you want to use.

Specifically, zero out the memory in the constructor and the buffer you use when reading from the file.

FranticPedantic
A: 

Thanks to all spared time to answer. I wrote my own string comparison function and it worked correctly at the same point, so obviously it's a problem with strcmp(). Nevertheless, my code still does not work as I expect. But I'll request help for that only after I analyze it thoroughly. Thanks.

Onur Cobanoglu
A: 

As others have pointed out, it's near to impossible that there would be a problem with strcmp.

It would be best to strip down the problematic code to the absolute minimum required to reproduce the problem (also include instructions on how to compile - which compiler, flags, runtime library are you using?). Most likely you will find a bug in the process.

If not, you will receive a lot of credit for finding a bug in one of the most intensively used C functions ;-)

jeroenh
As I wrote in one of my self-replies: The string comparison function I wrote works correctly (finds that the strings are actually the same) at the very same point where strcmp() fails (credit it or not :) ).I compile with:g++ -std=gnu++0x -o experiment experiments.cpp -g (the last one for debugger)I'm working on Linux Ubuntu 9.04, g++ 4.3.3., what other kind of information would you like? I already stripped down the portions of code which I thought to be relevant.
Onur Cobanoglu
A: 

Mark you own strcmp realization as inline and look what happens ...

From GCC 4.3 Release Series Changes, New Features, and Fixes for GCC 4.3.4 :

"During feedback directed optimizations, the expected block size the memcpy, memset and bzero functions operate on is discovered and for cases of commonly used small sizes, specialized inline code is generated."

May be some other related bugs exists ...
Try to switch compiler optimization or/and function inlining off.

ThinkJet