views:

512

answers:

9

I got a function find_nodes() with a loop inside:

for (htmlNodePtr current_node=root_node
    ; current_node!=NULL
    ; current_node=current_node->next) 
{
    if (xmlHasProp(current_node,(xmlChar *)"href")) {
        if (xmlHasProp(current_node,(xmlChar *)attribute)) {
            if (strcmp(value,
                (char *)xmlGetProp(current_node,(xmlChar *)attribute))==0) {
                    found_nodes[numb_found]=current_node;
                    numb_found++;
            }
        }
    }

    find_nodes(found_nodes,numb_found,
               current_node->children,mode,attribute,value);

}

I'm getting segmentation fault in this assignment:

found_nodes[numb_found]=current_node;

I checked the numb_found value and it's ok for few iterations, and after that instead of few+1 it equals -1207604106

What could cause that?

+3  A: 

You're overrunning your array boundaries somehow and looking at random data.

Okay, looking at this, we don't have enough information, but I observe that this appears to be a recursive search through a DOM tree. You're passing numb_found as an argument, so when you assign to it in a recursive call, that value won't be updated above there. Eventually you're going to run into trouble with that.

Charlie Martin
A: 

Guessing from the given code, either you've got something going on that's stomping the stack, OR numb_found is getting very large and overflowing. Put in some real code (like the type info for all of the above) and we'll be able to tell you more.

I'd be suspicious that found_nodes is a fixed size array on the local stack, and that you are over-running it, as well.

Michael Kohne
A: 

Maybe I'm misunderstanding your code, but since you're passing numb_found and not &numb_found, you're just going to be rewriting the found node anyway every time you return from a recursion, which seems like a bug to me.

McWafflestix
num_found is declared as a non-const ref, so the caller's copy gets updated by the increment.
Dan Breslau
+1  A: 

You've stomped on memory. Compile your code with -g and run it using valgrind and valgrind will tell you exactly where the error is.

Norman Ramsey
A: 

This is the whole function:

void find_nodes(htmlNodePtr *found_nodes, int &numb_found, htmlNodePtr root_node, SearchMode mode, const char *attribute, const char *value) {

    htmlNodePtr tmp_ptr;

    switch (mode) {

     case S_HREF:
      for (htmlNodePtr current_node=root_node; current_node!=NULL; current_node=current_node->next) {
       if (xmlHasProp(current_node,(xmlChar *)"href")) {
        if (xmlHasProp(current_node,(xmlChar *)attribute)) {
         if (strcmp(value,(char *)xmlGetProp(current_node,(xmlChar *)attribute))==0) {
          found_nodes[numb_found]=current_node;
          numb_found++;
         }
        }
       }

       find_nodes(found_nodes,numb_found,current_node->children,mode,attribute,value);

      }
      break;
     case S_KEYWORD:
      for (htmlNodePtr current_node=root_node; current_node!=NULL; current_node=current_node->next) {
       if (xmlHasProp(current_node,(xmlChar *)"href")) {
        if (strcmp(value,(char *)xmlNodeGetContent(current_node))==0) {
         found_nodes[numb_found]=current_node;
         numb_found++;
        }
       }

       find_nodes(found_nodes,numb_found,current_node->children,mode,attribute,value);
      }
      break;
     case S_TAG:
      for (htmlNodePtr current_node=root_node; current_node!=NULL; current_node=current_node->next) {
       if (xmlHasProp(current_node,(xmlChar *)attribute)) {
        if (strcmp(value,(char *)xmlGetProp(current_node,(xmlChar *)attribute))==0) {
         tmp_ptr=inner_href_seek(current_node);
         if (tmp_ptr==NULL) {
          find_nodes(found_nodes,numb_found,current_node->children,mode,attribute,value);
          continue;
         }
         else {
          found_nodes[numb_found]=tmp_ptr;
          numb_found++;
         }
        }
       }

       find_nodes(found_nodes,numb_found,current_node->children,mode,attribute,value);
      }
      break;
    }
}

The array is fixed sixe, but it's way bigger than needed. Am I passing the numb_found in the proper way?

=== EDIT ===

char** get_urls(string url, ParseTreeNode *tree_root, int &numb_found) {

    numb_found=0;
    char **url_list;
    htmlDocPtr doc;
    htmlNodePtr root_node;
    string site_content;

    if (get_page(url,site_content)<0) {
     url_list=NULL;
     return url_list;
    }

    // get a DOM
    doc=htmlReadMemory(site_content.data(),site_content.size(),url.data(),NULL,0);

    // and the root
    root_node=xmlDocGetRootElement(doc);

    if (tree_root==NULL) {
     url_list=NULL;
     return url_list;
    }

    LeafList *l_list;
    l_list= new LeafList();

    l_list->numb_leafs=0;

    get_leaf_list(l_list,tree_root);

    htmlNodePtr matching_nodes[256];
    int numb_matching_nodes;

    htmlNodePtr tmp_nodes[64];
    int numb_tmp;

    SearchMode tmp_rule;

    for (int i=0;i<l_list->numb_leafs;i++) {
     if (l_list->leaf_buff[i]->data->rule!=TAG) continue;
     else {
      numb_matching_nodes=0;
      find_nodes(matching_nodes,numb_matching_nodes,root_node,S_TAG,l_list->leaf_buff[i]->data->attribute.data(),l_list->leaf_buff[i]->data->value.data());

      if (numb_matching_nodes==0) continue;
      else l_list->leaf_buff[i]->state=true;

      for (int j=0;j<numb_matching_nodes;j++) {
       for (int k=0;k<l_list->numb_leafs;k++) {
        if (k==i) continue;
        else {
         switch(l_list->leaf_buff[k]->data->rule) {
          case HREF:
           tmp_rule=S_HREF;
           break;
          case TAG:
           tmp_rule=S_TAG;
           break;
          case KEYWORD:
           tmp_rule=S_KEYWORD;
           break;
         }

         find_nodes(tmp_nodes,numb_tmp,matching_nodes[j],tmp_rule,l_list->leaf_buff[k]->data->attribute.data(),l_list->leaf_buff[i]->data->value.data());

         if (numb_tmp>0) l_list->leaf_buff[k]->state=true;
         else l_list->leaf_buff[k]->state=false;
        }
       }

       if (tree_is_true(l_list)) {
        url_list[numb_found]=(char *)xmlGetProp(matching_nodes[j],(xmlChar *)"href");
        numb_found++;
       }
      }
     }
    }

    for (int i=0;i<l_list->numb_leafs;i++) {
     if (l_list->leaf_buff[i]->data->rule!=HREF) continue;
     else {
      numb_matching_nodes=0;
      find_nodes(matching_nodes,numb_matching_nodes,root_node,S_HREF,l_list->leaf_buff[i]->data->attribute.data(),l_list->leaf_buff[i]->data->value.data());

      if (numb_matching_nodes==0) continue;
      else l_list->leaf_buff[i]->state=true;

      for (int j=0;j<numb_matching_nodes;j++) {
       for (int k=0;k<l_list->numb_leafs;k++) {
        if ((k==i)||(l_list->leaf_buff[k]->data->rule==TAG)) continue;
        else {
         switch(l_list->leaf_buff[k]->data->rule) {
          case HREF:
           tmp_rule=S_HREF;
           break;
          case KEYWORD:
           tmp_rule=S_KEYWORD;
           break;
         }

         find_nodes(tmp_nodes,numb_tmp,matching_nodes[j],tmp_rule,l_list->leaf_buff[k]->data->attribute.data(),l_list->leaf_buff[i]->data->value.data());

         if (numb_tmp>0) l_list->leaf_buff[k]->state=true;
         else l_list->leaf_buff[k]->state=false;
        }
       }

       if (tree_is_true(l_list)) {
        url_list[numb_found]=(char *)xmlGetProp(matching_nodes[j],(xmlChar *)"href");
        numb_found++;
       }
      }
     }
    }

    for (int i=0;i<l_list->numb_leafs;i++) {
     if (l_list->leaf_buff[i]->data->rule!=KEYWORD) continue;
     else {
      numb_matching_nodes=0;
      find_nodes(matching_nodes,numb_matching_nodes,root_node,S_KEYWORD,l_list->leaf_buff[i]->data->attribute.data(),l_list->leaf_buff[i]->data->value.data());

      if (numb_matching_nodes==0) continue;
      else {
       for (int i=0;i<numb_matching_nodes;i++) {
        url_list[numb_found]=(char *)xmlGetProp(matching_nodes[i],(xmlChar *)"href");
        numb_found++;
       }
      }
     }
    }

    return url_list;
}
A: 

I notice that the invalid value (-1207604106) for numb_found is 0xB8056C76, which kinda smells like a pointer value. That could be explained by overrunning the array, though you say that it's not being overrun...

I suggest that you verify that the array really is "way bigger than needed". Add traces (using cerr) on the lines where you add a node to the array; at a minimum, have the traces print out the value of numb_found every time through. What's the maximum value you get to before the crash? How does that actually compare to the array size?

Dan Breslau
A: 

This is what valgrind returned

==7464==
==7464== Use of uninitialised value of size 4
==7464==    at 0x80494EF: find_nodes(_xmlNode**, int&, _xmlNode*, SearchMode, char const*, char const*) (search_engine.cpp:90)
==7464==    by 0x8049CF2: get_urls(std::string, ParseTreeNode*, int&) (search_engine.cpp:237)
==7464==    by 0x804907B: main (tester.cpp:39)
==7464==
==7464== Invalid write of size 4
==7464==    at 0x80494EF: find_nodes(_xmlNode**, int&, _xmlNode*, SearchMode, char const*, char const*) (search_engine.cpp:90)
==7464==    by 0x8049CF2: get_urls(std::string, ParseTreeNode*, int&) (search_engine.cpp:237)
==7464==    by 0x804907B: main (tester.cpp:39)
==7464==  Address 0xcef11ec0 is not stack'd, malloc'd or (recently) free'd
==7464==
==7464== Process terminating with default action of signal 11 (SIGSEGV)
==7464==  Access not within mapped region at address 0xCEF11EC0
==7464==    at 0x80494EF: find_nodes(_xmlNode**, int&, _xmlNode*, SearchMode, char const*, char const*) (search_engine.cpp:90)
==7464==    by 0x8049CF2: get_urls(std::string, ParseTreeNode*, int&) (search_engine.cpp:237)
==7464==    by 0x804907B: main (tester.cpp:39)
==7464==
==7464== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 18 from 2)
==7464== malloc/free: in use at exit: 73,605 bytes in 3,081 blocks.
==7464== malloc/free: 3,666 allocs, 585 frees, 172,154 bytes allocated.
==7464== For counts of detected errors, rerun with: -v
==7464== searching for pointers to 3,081 not-freed blocks.
==7464== checked 329,952 bytes.
==7464==
==7464== LEAK SUMMARY:
==7464==    definitely lost: 186 bytes in 30 blocks.
==7464==      possibly lost: 8,324 bytes in 6 blocks.
==7464==    still reachable: 65,095 bytes in 3,045 blocks.
==7464==         suppressed: 0 bytes in 0 blocks.
==7464== Rerun with --leak-check=full to see details of leaked memory.

The maximum value I got was 24 and the array has 1024 elements

Please edit your original post, rather than posting new "answers" as updates.
Dan Breslau
Also, what is the code at search_engine.cpp:237 andsearch_engine.cpp:90 ?
Dan Breslau
A: 

Seriously, you should not be asking us this question. It will be far quicker for you to either:

  • run this through a debugger, single-stepping the code and keeping a watch on numb_found after each instruction; or
  • in the absence of a debugger, putting a printf/cout of numb_found after every statement (with a unique ID: printf("A:%d\n,numb_found);).

This will be thew quickest way to see exactly what's causing the problem.

In any case, your latest answer/comment indicates that you're passing an uninitialized value to find_nodes() and, given the fact that most of them are pointers, that's also causing you write to invalid memory.

I can't tell from the valgrind output which parameter is uninitialized so put a printf/cout at the top of the function to print out all the pointers (not the contents of the pointers). This should allow you to see which parameter is bad or has been corrupted.

paxdiablo
It's the numb_found that is being passed uninitialized, but what should I do now to avoid it? Everything is alright until the 25th iteration, then the variable somehow gets currupted.
zacjev, there's a difference between uninitialized and corrupted. For s start, you need to add the complete code for the function that calls find_nodes() to your question so we can see what may be causing it, rather than playing guessing games.
paxdiablo
I added the function to the post below.
+2  A: 

in your get_urls function you declare but not initialize

char **url_list;

and after that you use it

if (tree_is_true(l_list)) {
    url_list[numb_found]=(char *)xmlGetProp(matching_nodes[j],(xmlChar *)"href");
    numb_found++;
}

-1207604106 is 0xB8056C76 -- perfectly fitting for pointer ;-)

Well - I'm ashamed to have missed that (and I know there are pointers not being freed - I'll get to that when I get the function working). It works now... sometimes. For some arguments the function actually returns proper urls from the tested site. For other it crashes again in the same place - still getting this kind of values: -1208693642
the same place -- later? on which iteration it crashes again?Maybe 256 for matching_nodes is not enough ;-)