views:

419

answers:

4

We have an x-files problem with our .NET application. Or, rather, hybrid Win32 and .NET application.

When it attempts to communicate with Oracle, it just dies. Vanishes. Goes to the big black void in the sky. No event log message, no exception, no nothing.

If we simply ask the application to talk to a MS SQL Server instead, which has the effect of replacing the usage of OracleConnection and related classes with SqlConnection and related classes, it works as expected.

Today we had a breakthrough.

For some reason, a customer had figured out that by placing all the application files in a directory on his desktop, it worked as expected with Oracle as well. Moving the directory down to the root of the drive, or in C:\Temp or, well, around a bit, made the crash reappear.

Basically it was 100% reproducable that the application worked if run from directory on desktop, and failed if run from directory in root.

Today we figured out that the difference that counted was wether there was a space in the directory name or not.

So, these directories would work:

C:\Program Files\AppDir\Executable.exe
C:\Temp Lemp\AppDir\Executable.exe
C:\Documents and Settings\someuser\Desktop\AppDir\Executable.exe

whereas these would not:

C:\CompanyName\AppDir\Executable.exe
C:\Programfiler\AppDir\Executable.exe      <-- Program Files in norwegian
C:\Temp\AppDir\Executable.exe

I'm hoping someone reading this has seen similar behavior and have a "aha, you need to twiddle the frob on the oracle glitz driver configuration" or similar.

Anyone?


Followup #1: Ok, I've processed the procmon output now, both files from when I hit the button that attempts to open the window that triggers the cascade failure, and I've noticed that they keep track mostly, there's some smallish differences near the top of both files, and they they keep track a long way down.

However, when one run fails, the other keeps going and the next few lines of the log output are these:

ReadFile C:\oracle\product\10.2.0\db_1\BIN\orageneric10.dll    SUCCESS    Offset: 274 432, Length: 32 768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O
ReadFile C:\oracle\product\10.2.0\db_1\BIN\orageneric10.dll    SUCCESS    Offset: 233 472, Length: 32 768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O

After this, the working run continues to execute, and the other touches the mscorwks.dll files a few times before threads close down and the app closes. Thus, the failed run does not touch the above files.


Followup #2: Figured I'd try to upgrade the oracle client drivers, but 10.2.0.1 is apparently the highest version available for Windows 2003 server and XP clients.


Followup #3: Well, we've ended up with a black-box solution. Basically we found that the problem is somewhere related to XPO and Oracle. XPO has a system-table it manages, called XPObjectType, with three columns: Oid, TypeName and AssemblyName. Due to how Oracle is configured in the databases we talk to, the column names were OID, TYPENAME and ASSEMBLYNAME. This would ordinarily not be a problem, except that XPO talks to the schema information directly and checks if the table is there with the right column names, and XPO doesn't handle case differences so it sees a XPObjectType table with three unknown columns and none of those it expects.

Exactly what XPO does now I don't really know, but if I dropped this table, and recreated it with the right case, using double quotes around all the column names to get the case right, the problem doesn't crop up.

Exactly where the space in the folder name comes into this, I still have no idea, but this problem had two tiers:

  1. Stop the application from crashing at our customers, short-term solution
  2. Fix the bug, long-term solution

Right now tier 1 is solved, tier 2 will be put back into the queue for now and prioritized. We're facing some bigger changes to our data tier anyway so this might not be a problem we need to solve, at least if all our Oracle-customers verify that the table-fix actually gets rid of the problem.

I'll accept the answer by Dave Markle since though Process Monitor (the big brother of File Monitor) didn't actually pinpoint the problem, I was able to use it to determine that after my breakpoint in user-code where XPO had built up the query for this table, no I/O happened until all the entries for the application closing down was logged, which led me to believe it was this table that was the culprit, or at least influenced the problem somehow.

If I manage to get to the real cause of this, I'll update the post.

+3  A: 

Here's what I would do. First, TRIPLE-check that you're seeing the behavior you think you're seeing. I can see this happening the other way around by not using System.IO.Path to concatenate paths, but not like you're seeing it. Triple-check that the file permissions make sense.

Next, download Filemon from MS and watch what's happening on the filesystem as your program hits these troubled spots. You can filter out specific file activity (removing your anti-virus file activity, for example) to make everything look a bit cleaner while you do this. Look for file access errors using FileMon for both the success case and the error case for your program. That should point you to what file's being accessed and causing the problem. For example, if you see a FILE_NOT_FOUND error accessing a nonsense filename, you can be assured that you or the vendor are doing something wrong, possibly leading to your problem...

Dave Markle
I've quadruple-checked it. Putting all the files in C:\Temp and it crashes. Rename the directory to "Te mp" and it works. I'll try filemon.
Lasse V. Karlsen
Wow. That's weird. Let us know what you find with Filemon.
Dave Markle
Will do. Just to be sure I've said it clearly enough. The *only* difference *I* do, is to rename the directory and re-run the app, still with the same configuration and thus database connectivity. I've run FileMon now, in the process of filtering the files for comparison.
Lasse V. Karlsen
Really strange. Do you not get any file I/O errors at all in FileMon?
Dave Markle
None, the log for when it works (space in folder names) and when it fails (no space) is not 100% equal up to the point where one fails and the other doesn't, but it looks like that might just be timing issues due to some background processing during logon. Question updated.
Lasse V. Karlsen
+1  A: 

You should probably see if you can reproduce the problem it with a simple application that only tries to open a connection to Oracle. That way you can be 100% sure that the problem is with OracleConnection or the Oracle driver and not with your own code.

Rasmus Faber
A: 

You should get a medal for perseverance for that !.

"Exactly what XPO does now I don't really know, but if I dropped this table, and recreated it with the right case, using double quotes around all the column names to get the case right, the problem doesn't crop up.

Exactly where the space in the folder name comes into this, I still have no idea"

The issues I get with spaces in names is that they generally interpret the bit before the space as the name and the rest as a parameter. If that is the case, then with the plain name it can see "C\Temp" and it is a directory. With the spaced name, it gets "C:\Program Files", looks for "C:\Program" and that doesn't exist. It would fail, for example, to overwrite "C:\Temp" but would succeed in writing "C:\Program". Wonder whether it would still fail with "C:\Program Files" if there is a file or directory called "C:\Program"

Gary
A: 

Hi, I´d suspect the oracle client to be honest. Had a problem which was similar in it´s frustrating nature.

If we installed on 64 bit machines the client would stop at start when connecting to oracle even though the app is 32 bit. We eventually tracked it down to the fact that a certain oracle client (Ora 10 had a problem with brackets in the path so a program running under program files would work under program files (x86) caused the crash. Updating the machine to use the 11G client fixed the problem but there were also some patches available from metalink which are not directly available. Whats strange in your case is that you get no exception but the behaviour of moving the application to a new folder fixes the issue in a similar way so it may be related.

ORA-12154: TNS:could not resolve the connect identifier specified or ORA-6413: Connection not open.

Useful links http://blogs.msdn.com/debarchan/archive/2009/02/04/good-old-connectivity-issue.aspx

Details from Metalink below.

Metalink Bug 3807408 Cannot externally authenticate user with quote in username

Description If an externally authenticated username contains a '(',')' or '=' then the user cannot be authenticated. Additionally if a program name / path contains these characters it may not be possible to connect . eg: Windows clients installed in a directory "C:\Program Files (x86)" fail to connect with ORA-12154: TNS:could not resolve the connect identifier specified

The hallmark of this problem is that the Net trace (level 16) shows the problem character/s replaced by a "?" in the trace.

Workaround For the authentication problem: change username, or do not use remote OS authentication for those users

For the program / directory issue: change the program/directory name

Crocked