views:

532

answers:

7

In our server/client-setup we're experiencing some weird behaviour. The client is a C/C++-application which uses OCI to connect to an Oracle server (using the OTL library).

Every now and then the DB server dies in a way (yes this is the core issue, but from application-side we're unable to solve it but have to deal with it anyway), that the machine does not respond anymore to new requests/connections but the existing ones, like the Oracle-connections, do not drop or time out. Queries sent to the DB just never return successfully anymore.

What possibilities (if any) are provided by Oracle to detect these stalled connections from the client-application side and recover in a more or less safe way?

A: 

There's a set_timeout API in OTL that might be useful for this.

Edit: Actually, ignore that. set_timeout doesn't work with OCI. Have a look at the set_timeout description from here where it describes a technique that can be used with OCI

Glen
You are referring to the workaround, where another thread monitors the one executing the request and initiates a otl_connect->cancel() if it thinks the connection is dead?
Kosi2801
A: 

There is a manual way to avoid this. You can open a firewall and do something like ping database after every specified duration of time. In this way the database connection will not get lost.

idea

If (current_time - lastPingTime > configuredPingTime) { //Dummy query select 1 from dual; }

Sachin Chourasiya
Sorry, that doesn't work as the problem is that the connection does not go down or gets lost but just does not react anymore. It's a known problem on the database-server itself (and it's under heavy investigation) but the applications must be resistant to these odd conditions.
Kosi2801
I will help you further if you can send me the ORACLE CODE and the ERROR STRING recieved in this condition.
Sachin Chourasiya
No ORACLE CODE and ERROR STRING, because the last query/operation on the connection just stalls and never returns any result or error.
Kosi2801
+1  A: 

In all my DB schema i have a table with one constant record. Just poll such table periodically by simple SQL request. All other methods unreliable.

Pirks
As described in my question, such a periodical fetch would eventually lock up and never return again. What I need is a possibility to detect such a situation of possibly already stalled connections and recover them.
Kosi2801
A: 

Sounds like you need to fire off a query to the database (eg SELECT * FROM dual;), then if the database hasn't responded within a specified amount of time, assume the server has died and react accordingly. I'm afraid I don't know C/C++, but can you use multi-threading to fire off the statement then wait for the response, without hanging the application?

P72endragon
I'm already doing something like this. A second thread checks the last-active timestamp on my connections which I'm tracking and fires an otl_connect->cancel() if it's been too long. Strangely it does not break with an ORA-xxx (user requested cancel) but stays stuck.
Kosi2801
A: 

This works - I have done exactly what you are looking for. Have a parent process (A) create a child process (B). The child process (B) connects to the database, performs a query (something like "select 1 from a_table" - you will get better performance if you avoid using "dual" for this and create your own table). If (B) is successful then it writes out that it was successful and exits. (A) is waiting for a specified amount of time. I used 15 seconds. If (A) detects that (B) is still running - then it can assume that the database is hung - it Kills (B) and takes necessary actions (Like calling me on the phone with a SMS).

If you configure SQL*NET to use a timeout you will probably notice that large queries will fail because of it. The OCI set_timeout configuration will also cause this.

Philip Schlump
Sounds like an interesting aproach. Do you use the client-process just to periodically check the functionality of the DB in general or for all DB-related stuff?If the former do you have also an approach how to deal with the already-stalled other connections?
Kosi2801
Yes the parent process checks a bunch of d.b. related items. It also checks for network connectivity.
Philip Schlump
A: 

I am also having the same issue. OCIBreak also didn't work. Any other idea, apart from killing the thread which is stuck.

siddhusingh
A: 

This is a bug in Oracle ( or call it a feature ) till 11.1.0.6 and they said the patch on Oracle 11g release 1 ( patch 11.1.0.7 ) which has the fix. Need to see that. If it happens you will have to cancel ( kill ) the thread performing this action. Not good approach though

siddhusingh
As this is the most useful hint in a long time and has led me to the changelog for the patchset (http://www.anysql.net/doc/bug11107.html#OCI ), I'm accepting this as the best answer finally. Thanks.
Kosi2801