I'm having a very tough time debugging a multi-threaded C application that I've made a few changes to. I've been unable to use GDB to identify the issue (see the GDB output below the code for more info).

The following code is from one of the tasks that is opened in its own thread. I've snipped out most of the code following the problem.

void tskProcessTenMinuteTables(void *input)
{
    /* Check the minute as soon as we start.  If we're started on a ten min
     * boundary, sleep for one minute.
     */
    time_t now;
    time_t wakeup;
    struct tm *next_tick_ptr;

    now = time(NULL);
    next_tick_ptr = localtime(&now);

    /* returns a time struct populated w/ next ten min boundary */
    GetNextTenMinBoundary(next_tick_ptr); 
    wakeup = mktime(next_tick_ptr);

    sleep(2); /* Without this sleep, the following if() was always true. */ 


    if(next_tick_ptr->tm_min % 10 == 0)
    {
        fprintf(stderr, "On tenmin boundary on initialization.. task sleeping for 60 seconds.\n");

        /*  debug statements to test the cause of segfault.  */
        fprintf(stderr, "NOM NOM NOM\n");
        printf( "Test%d\n", 1);
        fprintf(stderr, "Test%d\n", 2);  /* <~~~ This statement is the guilty party */

        sleep(60);
    }

    /*  Main loop.  Every loop besides the tick itself will consist only 
    *   of a call to time and a comparison of current stamp with wakeup.
    *   this should be pretty light on the processing side.
    *
    *   Re-implement this as a sleep/awake with a signal in the future.
    */
    while(1)
    {
        now = time(NULL);

        if( now >= wakeup )
        {
            fprintf(stderr, "Triggered 1.\n");
            fprintf(stderr, "Triggered 2.\n");  

            char statement[150];

            fprintf(stderr, "Triggered 3.\n");      
            sprintf(statement, "SELECT ten_min_end(%d::int2)",GetTenMinPeriodNumber());
            fprintf(stderr, "Triggered 4.\n");
            DBCallStoredProcedure(statement);
            fprintf(stderr, "Triggered 5.\n");
        }
    }
}

The cause is attempting to use fprintf with variadic(?) args. Calling it with nothing but the format string works. printf works with or without args.

fprintf(stderr, "Hi #%d.\n", 1); <~~ segfault
fprintf(stderr, "Hi #1.\n"); <~~ works
printf("Hi #%d.\n", 1); <~~ works
printf("Hi #1.\n"); <~~ works

When run in gdb, I receive the following spewage before gdb becomes unresponsive. A kill -9 is needed to terminate.

$gdb ir_client
(gdb) r
Starting program: /home/ziop/Experimental_IR_Clients/ir-10-20/IR_Client/obj-linux-x86/ir_client 
[Thread debugging using libthread_db enabled]
[New Thread 0xb7fe5b70 (LWP 32269)]
[New Thread 0xb7fc4b70 (LWP 32270)]
(032266 - -1208067216) 20-Oct-2010 10:56:19.59 - IR_Client_ConnectCmdPort - Socket connected.
[New Thread 0xb7ffdb70 (LWP 32272)]
(032266 - main thread) 20-Oct-2010 10:56:19.59 - sl_exit - Exiting thread with code 0.
On tenmin boundary on initialization.. task sleeping for 60 seconds.
NOM NOM NOM 
Test1

I'm fairly new at C, so it may be something obvious. My first thought was that something about the unbuffered output was not thread-safe, but the fprintf always succeeds if no variable is passed. Pthread funkiness is still my top suspect. Unfortunately I'm stuck with the architecture for the time being.

Thanks in advance.

A: 

Usually, these kinds of problems are related to memory corruption. Symptoms such as inconsistent segfaults on different lines whenever you slightly change the code are a classic example.

Try running your program through a tool such as valgrind; you will very likely see some illegal memory accesses. Fix those, and I suspect things will work.

Yuval A
+1  A: 

Step one is to try running the function without introducing threads. Just write a .c file with a main that does the bare minimum setup needed before the thread would be started, and then, rather than starting the thread, calls the function directly. It is much easier to debug if you can recreate the problem with just one thread.
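A minimal sketch of such a harness, assuming the task can be called with a NULL argument. The names task_under_test, thread_entry and run_task are hypothetical stand-ins and not part of the OP's code; in the real program the body of task_under_test would be a call to tskProcessTenMinuteTables.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for the task being debugged. */
static void task_under_test(void *input)
{
    (void)input;
    fprintf(stderr, "Test%d\n", 2);
}

/* Adapter: pthread_create expects a function that returns void *. */
static void *thread_entry(void *input)
{
    task_under_test(input);
    return NULL;
}

/* Runs the task directly (use_thread == 0) or in a thread created with
 * default attributes. Returns 0 on success or a pthread error code. */
int run_task(int use_thread)
{
    pthread_t tid;
    int rc;

    if (!use_thread) {
        task_under_test(NULL);
        return 0;
    }

    rc = pthread_create(&tid, NULL, thread_entry, NULL);
    if (rc != 0)
        return rc;
    return pthread_join(tid, NULL);
}
```

Calling run_task(0) from main first and run_task(1) afterwards separates "the function itself is broken" from "the function only breaks inside a thread". Passing NULL attributes also takes any application-specific thread settings out of the picture.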

Additionally, if you are using gcc you should compile with:

-fstack-protector-all -Wstack-protector -fno-omit-frame-pointer

in addition to your normal flags (at least until you find the problem). These will help with debugging and possibly issue more warnings at compile time. I assume that you know how -O flags can affect debug-ability and functionality (especially if you are already doing something wrong or undefined in the C code).

When you are in GDB and things look like they have locked up, or the program is taking a long time to do something, you can usually press Ctrl+Z (Ctrl+C also works) to get back to the (gdb) prompt without killing the program. Either key sends the program a signal that GDB intercepts (SIGTSTP for Ctrl+Z, SIGINT for Ctrl+C), which lets you interact with GDB again so you can find out what the program is actually doing.

edit

I apparently solved the problem in the comments discussion, so I'll write up what the problem was here.

A quick glance at the code did not suggest a problem that would result in a segmentation fault (illegal memory access), and Zypsy (the OP) told me that the function ran fine when being called directly from main rather than being run via a separate thread.

Valgrind reported that the thread's stack could not be extended to a certain address. On Linux the main thread's stack is mapped into the application in such a way that it can easily grow, but this often isn't done when memory is allocated for thread stacks.

I asked Zypsy (the OP) to insert some code that would print the address of something known to be low on the thread's stack (printf("thread stk = %p\n", &input);) so that this value could be compared to the address given in the failure message. From this I could estimate the stack size. The result did not suggest that very much stack space was consumed between the beginning of the thread function and its failure, but the space also did not seem too small for the code in the question (it apparently turned out to be too small, though).
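A rough sketch of the same measurement done in code rather than by reading addresses off the screen. The function stack_used_since is a hypothetical helper, not something from the OP's program: it takes the address of a variable in the thread's entry function (such as &input) and reports how far the current frame is from it.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Approximate number of bytes between entry_marker (the address of a
 * parameter or local in the thread's entry function) and a local in
 * this function's own frame. The absolute difference is used, so the
 * result does not depend on which way the stack grows. */
size_t stack_used_since(const void *entry_marker)
{
    volatile char here;                 /* lives in this frame */
    uintptr_t a = (uintptr_t)entry_marker;
    uintptr_t b = (uintptr_t)&here;
    size_t used = (size_t)(a > b ? a - b : b - a);

    fprintf(stderr, "thread stk = %p, now at %p, about %zu bytes apart\n",
            (void *)entry_marker, (void *)&here, used);
    return used;
}
```

Calling stack_used_since(&input) from a deeply nested point in the task gives a lower bound on the stack consumed so far. It cannot see frames that have not been entered yet, which may be why the estimate in this case looked harmless even though the stack turned out to be too small.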

Because pthread_create lets you either accept the default thread attributes (by passing NULL) or pass an argument specifying various settings for the thread, I asked whether the code that called pthread_create could be posted so that I could look for suspect settings.

After looking at this code (an application-specific wrapper around various pthread_ functions), I saw that some stack-related attributes were in fact being set. I asked the OP to look at the calls to this function for anything suspicious in how the stack was allocated (making sure that the size value and the allocated memory size were actually the same). The OP then found that this thread's stack was being allocated smaller than the stacks of the other threads. The stack was too small after all.
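The OP's wrapper was never posted here, so the following is only a sketch of how a caller-allocated thread stack can be set up so that the allocation and the size passed to pthreads cannot disagree. The names worker and run_on_private_stack are hypothetical, and the use of posix_memalign with page alignment is an assumption about a reasonable way to obtain the memory.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical thread body: the statement that crashed in the question. */
static void *worker(void *arg)
{
    int *ran = arg;
    fprintf(stderr, "Test%d\n", 2);
    *ran = 1;
    return NULL;
}

/* Runs worker on a caller-allocated stack of stack_size bytes.
 * Returns 0 on success, a pthread/allocation error code on failure,
 * or -1 if the page size is unknown or the worker never ran. */
int run_on_private_stack(size_t stack_size)
{
    pthread_attr_t attr;
    pthread_t tid;
    void *stack = NULL;
    int ran = 0;
    int rc;
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0)
        return -1;

    rc = posix_memalign(&stack, (size_t)page, stack_size);
    if (rc != 0)
        return rc;

    rc = pthread_attr_init(&attr);
    if (rc == 0) {
        /* One variable feeds both the allocation above and the
         * attribute here, so the two sizes cannot drift apart. */
        rc = pthread_attr_setstack(&attr, stack, stack_size);
        if (rc == 0)
            rc = pthread_create(&tid, &attr, worker, &ran);
        pthread_attr_destroy(&attr);
    }
    if (rc == 0)
        rc = pthread_join(tid, NULL);

    free(stack);    /* safe only because the thread has been joined */

    if (rc != 0) {
        fprintf(stderr, "run_on_private_stack: %s\n", strerror(rc));
        return rc;
    }
    return ran ? 0 : -1;
}
```

Checking every return code matters here: pthread_attr_setstack rejects sizes below PTHREAD_STACK_MIN with EINVAL, so a badly undersized request fails loudly instead of producing a thread that crashes later. A size that is legal but still too small for the task is not caught this way and shows up exactly as it did in the question.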

nategoose
Gah... No issues when running the function in the main thread. Figuring I had a thread issue, I cleared out the function to contain only a fprintf(stderr, "Test%d\n", 1); and the return. GDB handles the function now. When run as a separate thread it throws a cryptic "0x001b4e4e in buffered_vfprintf (s=0x2c9580, format=0x8058ef1 "Test\d.\n", args=0xb7ffd308 "\004") at vfprintf.c:2221". I'd go with the guess that stderr isn't safe for use in threads, but the existing logging function, which uses a mutex, exhibits the same behavior. Valgrind gives the same "Can't extend stack" message.
Zypsy