Hi all
I have an application consisting of different modules written in C++.
One of the modules handles distributed tasks on Sun Grid Engine. It uses the DRMAA API for submitting and monitoring grid jobs. If the client doesn't support the grid, the local machine should be used instead.

The API's shared object, libdrmaa.so, is linked at compile time and loaded at runtime.
If the client using my application has this ".so", everything is fine, but if the client doesn't have it, the application exits because it fails to load shared libraries.
To avoid this, I have replaced the API calls with function pointers obtained using dlopen() and dlsym(). Now I can use the local machine instead of the grid if the call to dlopen() doesn't succeed, and my objective is achieved.
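For reference, the approach described above might look roughly like the sketch below. It is not the actual code from the application: `GridApi` and `load_grid_api` are hypothetical names, only one symbol is resolved, and the `drmaa_init` signature is written as I understand the DRMAA C binding, so check it against your own drmaa.h. A function pointer type that doesn't match the real function is one way this technique can go wrong without any error at load time.

```cpp
#include <dlfcn.h>
#include <cstddef>

// Assumed signature of drmaa_init from the DRMAA C binding; verify against drmaa.h.
using drmaa_init_fn = int (*)(const char* contact, char* error_diagnosis,
                              std::size_t error_diag_len);

// Hypothetical holder for the library handle and the resolved entry point.
struct GridApi {
    void* handle = nullptr;
    drmaa_init_fn init = nullptr;

    // True only if the library loaded and the symbol was resolved.
    bool available() const { return handle != nullptr && init != nullptr; }
};

// Try to load the grid library; on any failure return an empty GridApi
// so the caller can fall back to the local machine.
GridApi load_grid_api(const char* lib_name) {
    GridApi api;
    api.handle = dlopen(lib_name, RTLD_NOW | RTLD_LOCAL);
    if (api.handle == nullptr) {
        return api;  // library not present: caller uses the local machine
    }

    dlerror();  // clear any stale error before calling dlsym
    void* sym = dlsym(api.handle, "drmaa_init");
    if (dlerror() != nullptr || sym == nullptr) {
        // Symbol missing: never leave a half-initialised API behind.
        dlclose(api.handle);
        api.handle = nullptr;
        return api;
    }

    api.init = reinterpret_cast<drmaa_init_fn>(sym);
    return api;
}
```

The caller would check `api.available()` once at startup and take the local code path when it returns false. Every symbol obtained with dlsym() needs the same check, and the handle must stay open for as long as any of the pointers is in use.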
The problem is that the application now runs successfully for small test cases, but with larger test cases it throws a segmentation fault, while the same code calling the API directly (linked at compile time) works correctly.

Am I missing something while using dlsym() and dlopen()?
Is there any other way to achieve the same goal?

Any help would be appreciated.

Thanx,

A: 

If you are throwing an exception across an extern "C" function then the application has to quit. This is because the C ABI does not have the facilities to propagate exceptions.

To counter this when using DLLs (or shared libs) you normally have a single C function that returns a C++ object. Then the remaining interaction is with that C++ object that was returned from the DLL.

This pattern suggests (and I stress suggests) a factory-like object, thus your DLL should have a single extern "C" function that returns a void* which you can reinterpret_cast<> back into a C++ factory object.
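A minimal sketch of that pattern, under some assumptions: `JobRunner`, `LocalRunner`, `create_job_runner` and `destroy_job_runner` are hypothetical names, and everything is shown in one file for brevity, whereas in practice the interface would live in a shared header and the extern "C" functions in the library. The sketch uses static_cast<> to get back from void*, which is sufficient when the pointer was converted to void* from that same type.

```cpp
#include <string>

// Hypothetical interface known to both the application and the library.
class JobRunner {
public:
    virtual ~JobRunner() = default;
    virtual std::string name() const = 0;
};

// Hypothetical implementation that would live inside the shared library.
class LocalRunner : public JobRunner {
public:
    std::string name() const override { return "local"; }
};

// The only functions with C linkage. No exception may escape them,
// so failures are reported by returning a null pointer.
extern "C" void* create_job_runner() noexcept {
    try {
        // Convert to the interface type first so the caller can cast back to it.
        return static_cast<JobRunner*>(new LocalRunner);
    } catch (...) {
        return nullptr;
    }
}

// Destruction happens on the library side, through the virtual destructor.
extern "C" void destroy_job_runner(void* runner) noexcept {
    delete static_cast<JobRunner*>(runner);
}
```

On the application side, the pointer returned by `create_job_runner` (looked up with dlsym() in the real setup) is cast back with `static_cast<JobRunner*>` and used as an ordinary C++ object. Exceptions thrown by its member functions never cross an extern "C" boundary, since only the two factory functions have C linkage.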

Martin York
+3  A: 

It is very unlikely to be a direct problem with the code loaded via dlsym() - in the sense that the dynamic loading makes it seg-fault.

What it may be doing is exposing a separate problem, probably by moving stuff around. This probably means a stray (uninitialized) pointer that points somewhere 'legitimate' in the static link case but somewhere else in the dynamic link case - and the somewhere else triggers the seg-fault. Indeed, that is a benefit to you in the long run - it shows that there is a problem that otherwise might remain undetected for a long time.

I regard this as particularly likely since you mention that it occurs with larger tests and not with small ones.

Jonathan Leffler
Yeah, I agree with you, but how should I go about correcting this? The entire code is pretty bulky :(
Neeraj
The first stage is to work out what is triggering the seg-fault. Is it a null pointer access to data, or to a function pointer? Or is it some other problem? Also, you need to track the calling stack at the point of failure, using the core dump of the program compiled with debugging enabled. If the stack back trace is corrupt, the problem is likely to be a buffer overflow; you've tried to save data without checking that there was enough space. If not, it may suggest where to look next: which class and functions were in use at the time of the crash. Good luck; these bugs can be frustrating.
Jonathan Leffler
+1  A: 

As Jonathan Leffler says, the problem very likely exists in the case where you are using the API directly; it just hasn't caused a crash yet.

Your very first step when you get a SIGSEGV should be analyzing the resulting core dump (or just running the app directly under a debugger) and looking at where it crashed. I'll bet $0.02 that it's crashing somewhere inside malloc or free, in which case the problem is plain old heap corruption, and there are many heap-checker tools available to help you catch it. Sun provides watchmalloc, which is a good start.

Employed Russian