A bit of history: We have an application that was originally written many years ago (1998 is the first date in PVCS, but the app is about 5 years older than that, as it was originally a DOS program). This application communicates with a piece of hardware over a serial port. When we got to Windows XP we started receiving reports of the app dying after running for a short time. It seems that the serial comms just 'died' and the app was left in a stuck state. The only way to recover from this situation was to restart the application.

The only information I can find about this problem is a note left in an old Word document. It claims that the Windows message system would miss the fact that data had been received, the buffer would fill and the system would get stuck, but there's no evidence to back this up. It also mentions that this is only prevalent at high baud rates (115200+).

The solution was to provide customers with USB->Serial converters along with the hardware.

Today: We are working on a new version of the hardware that will run across a network as well as over serial ports. To let me work on the network code without the actual hardware, we are using a VSCOM NetCom113 device. It also installs a virtual COM port on the user's (i.e. my) machine.

Now that I have the network code integrated with the app, the NetCom device appears to exhibit the same behaviour as a physical COM port. This is undesirable, as I need the app to run for longer than ~30 seconds.

Google turns up nothing that matches the problem we are experiencing.

I was wondering:

  • Has anyone experienced this before? If so what did you do to fix/workaround the problem?
  • Does anyone have any suggestions as to whether the original author of the document is correct and what I can do to test the theory?

Unfortunately I can't post code, as the serial code is tightly coupled with the rest of the system, though I can answer questions about it.

Updates:

  • The code is written using Win32 Comm routines - so I am using CreateFile, ReadFile. There's also judicious calls to GetOverlappedResult.
  • It's not hanging per se, it's just that the comms stop. You can access the menus and click the buttons, but nothing can interact with the connected hardware. Using RealTerm you can see that no data is coming in or going out.
  • I think the reference to the Windows message system means that the problem is internal to Windows: data has arrived, but the kernel has missed it and thus not told the rest of the system about it.
  • Flow control is not used.
  • Writing a 'simple' test is difficult because the code is tightly coupled and the underlying protocol is quite complex, so it would require a lot of work.
+2  A: 

Are you using DOS-style serial code, or the Win32 CreateFile approach?

If the former, be very suspicious: if at all possible I'd convert to the latter.

If the latter, do you know on what kind of system call it's hanging? Are you in a blocking read call? or an overlapped I/O call? or waiting on an event? (I'm not sure I have enough experience to help, but those are the kinds of questions that come to mind)

You might also check into the queue size, which you can set with the SetupComm function.

I don't buy the "Windows Message system" stuff -- it sounds fishy; you can write good Win32 serial i/o code that never uses Windows messages.

edit: does your Overlapped I/O use events? I seem to remember something about auto-reset events occasionally missing their trigger... check your overlapped I/O calls very carefully to see whether you're handling the possible outcomes properly. Perhaps there's a way to make your code more robust by automatically cancelling the overlapped i/o and restarting another read. (I assume the problem is in the read half, not the write half?)

edit 2: A suggestion: assuming the win32 side has missed a byte or packet, and your devices are in deadlock because they're both expecting each other to respond to something, can you tweak the other side of the serial I/O to regularly send some type of "ping" packet with an incrementing counter? (and log the ping packets on the PC side; that way you can see whether you've missed any)
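
To make the ping idea concrete, here is a minimal sketch of the PC-side bookkeeping. It assumes the device could be made to put an 8-bit incrementing counter in each ping, which nothing in the question confirms; `PingTracker` and the counter width are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: tracks an 8-bit counter carried in each ping packet
// and reports how many pings went missing between two received ones.
class PingTracker {
public:
    // Feed the counter from each received ping.
    // Returns the number of pings skipped since the previous one.
    unsigned observe(std::uint8_t counter) {
        unsigned missed = 0;
        if (hasLast_) {
            // 8-bit subtraction wraps, so 255 -> 0 counts as a step of 1.
            std::uint8_t step = static_cast<std::uint8_t>(counter - last_);
            // A step of 0 is a duplicate; anything above 1 means lost pings.
            missed = (step == 0) ? 0u : step - 1u;
            totalMissed_ += missed;
        }
        last_ = counter;
        hasLast_ = true;
        ++received_;
        return missed;
    }

    unsigned long totalMissed() const { return totalMissed_; }
    unsigned long received() const { return received_; }

private:
    std::uint8_t last_ = 0;
    bool hasLast_ = false;
    unsigned long totalMissed_ = 0;
    unsigned long received_ = 0;
};
```

Logging the return value of `observe` with a timestamp each time it is non-zero would show whether bytes are being lost shortly before the comms stop. Note that a gap of 256 or more pings would go unnoticed with an 8-bit counter.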

Jason S
I agree - the DOS code was buggy and the win32 apis should work fine.
Tim
I've updated the main description.
graham.reeds
The overlapped I/O does use events (quite a few, in fact). All are manual reset. I will check that they are all reset correctly. Unfortunately I can't modify the other side of the connection. I do get ping packets, though not with an incrementing counter.
graham.reeds
Is the protocol simple enough that you could simulate it with other hardware (a PC or whatever is most convenient)? That would decouple the two devices and allow you to stress-test each a bit more easily.
Jason S
Not really. The packeted data is from 2 bytes (a ping) to 2007 bytes. Also information can be cross packeted (ie a config file or job data can be across several packets if it is sufficiently large). Additionally there are over 500 separate messages that need to be handled. So emulation is not easy.
graham.reeds
Re edit: Also the problem only exhibits itself with the actual serial ports. A USB->Serial converter does not exhibit this problem.
graham.reeds
oy. I feel your pain....
Jason S
That's why the problem has gone unfixed for the past 7 years. However I can't do any testing with the NetCom device as it seems to fail the same way the original serial port does.
graham.reeds
+1  A: 

Are you sure you have your flow control set up correctly? DTR, RTS, etc...

Adam Davis
Flow control is not used.
graham.reeds
Have you written a much simpler 'test' program that demonstrates the problem?
Adam Davis
Several problems prevent a simple test. 1) The crash is random. 2) The serial code is tightly coupled with the rest of the program. 3) The hardware expects certain responses to the messages that it sends. So to create an effective test, the majority of the code needs to be recreated.
graham.reeds
I think you're going to have to bite the bullet if you want to get to the root of the problem. You might consider adding a serial class and making the project go through that so you can test with other serial port controls. This would narrow down your focus.
Adam Davis
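
One possible shape for such a serial class, as a rough sketch only. `ISerialPort` and `LoopbackPort` are hypothetical names, not anything from the app; the real implementation would wrap the existing CreateFile/ReadFile code behind the same interface, and the loopback version exists purely so the protocol code can be exercised without hardware.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical interface: the rest of the app would talk to this instead of
// calling the Win32 comm routines directly.
class ISerialPort {
public:
    virtual ~ISerialPort() = default;
    // Returns the number of bytes actually read (0 if none are available).
    virtual std::size_t read(std::uint8_t* buf, std::size_t maxLen) = 0;
    // Returns the number of bytes accepted for sending.
    virtual std::size_t write(const std::uint8_t* data, std::size_t len) = 0;
};

// In-memory stand-in for tests: everything written can be read back.
class LoopbackPort : public ISerialPort {
public:
    std::size_t read(std::uint8_t* buf, std::size_t maxLen) override {
        const std::size_t n = std::min(maxLen, queue_.size());
        const auto last = queue_.begin() + static_cast<std::ptrdiff_t>(n);
        std::copy(queue_.begin(), last, buf);
        queue_.erase(queue_.begin(), last);
        return n;
    }

    std::size_t write(const std::uint8_t* data, std::size_t len) override {
        queue_.insert(queue_.end(), data, data + len);
        return len;
    }

private:
    std::deque<std::uint8_t> queue_;
};
```

With the app going through the interface, a different serial implementation (or a scripted fake that replays captured traffic) can be swapped in to see whether the stall follows the port code or the protocol code.
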
A: 

I have written apps that use USB / Bluetooth serial ports and have never had an issue. With Bluetooth I have seen sustained bit rates of 800,000 bps for long periods of time. Most people don't properly implement the port.

My serial port

dbasnett
A: 

Not sure if this is a possibility for you, but if you could re-write the code using C#.NET you'd have access to the SerialPort class there. It might remedy your problem. I know a lot of legacy code based around the Win32 API for hardware I/O ports tended to fail in XP due to timing (had a small bit of experience with MIDI).

In addition, I don't know if you can use the Win32 method of Serial Port access in Vista, so that might shut out future MS OSes from being able to use your code.

sheepsimulator
Not really a possibility - the app was a massive C++/MFC thing that grew over a 12 year period. It's a moot point now, as the division was closed down due to the economic downturn. No new hardware and no updates to the software :-(
graham.reeds