This paper (When the CRC and TCP checksum disagree) suggests that, because the TCP checksum algorithm is rather weak, roughly one in every 16 million to 10 billion packets will carry an error that the checksum fails to detect.

Are there any application developers out there who protect their data against this kind of error by adding checksums at the application level?

Are there any patterns available to protect against such errors when doing EJB remote method invocation (Java EE 5)? Or does Java already checksum serialized objects automatically (in addition to the underlying network protocol)?
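For illustration, this is roughly what I mean by application-level checksumming - a purely hypothetical wrapper (not from any framework) that pairs the serialized bytes with a CRC32 and verifies it before deserializing:

    import java.io.*;
    import java.util.zip.CRC32;

    // Hypothetical sketch: carry the serialized business object together with
    // a CRC32 that the receiver re-computes before deserializing.
    public final class ChecksummedPayload implements Serializable {
        private static final long serialVersionUID = 1L;

        private final byte[] data; // serialized business object
        private final long crc;    // CRC32 of 'data', computed by the sender

        public ChecksummedPayload(Serializable obj) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(obj);
            oos.close();
            this.data = bos.toByteArray();
            CRC32 c = new CRC32();
            c.update(data);
            this.crc = c.getValue();
        }

        public Object verifyAndUnwrap() throws IOException, ClassNotFoundException {
            CRC32 c = new CRC32();
            c.update(data);
            if (c.getValue() != crc) {
                throw new StreamCorruptedException("application-level CRC mismatch");
            }
            ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data));
            try {
                return ois.readObject();
            } finally {
                ois.close();
            }
        }
    }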

Enterprise software has long been running on machines that use not only ECC memory but also error checking inside the CPU, e.g. at the registers (SPARC and others). Bit errors in storage systems (hard drives, cables, ...) can be detected - and, with redundancy, corrected - by using Solaris ZFS.

I was never afraid of network bit errors because of TCP - until I saw that article.

It might not be much work to implement application-level checksumming for a handful of client-server remote interfaces. But what about distributed enterprise software that runs on many machines in a single datacenter? There can be a really huge number of remote interfaces.

Are enterprise software vendors like SAP, Oracle and others just ignoring this kind of problem? What about banks? What about stock exchange software?

Follow-up: Thank you very much for all your answers! So it seems that checking for undetected network data corruption is pretty uncommon - but such errors do seem to occur.

Couldn't I solve this problem simply by configuring the Java EE application servers (or the EJB deployment descriptors) to use RMI over TLS, with TLS configured to use an HMAC based on MD5 or SHA-1, and by configuring the Java SE clients to do the same? Would that give me reliable, transparent checksumming (albeit overkill), so that I would not have to implement it at the application level? Or am I completely confused, network-stack-wise?
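If plain RMI (outside the EJB container) were an option, I imagine something along these lines would do it - a rough sketch using the JSSE-based RMI socket factories from the JDK; the Echo interface and the keystore setup are just placeholders:

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;
    import javax.rmi.ssl.SslRMIClientSocketFactory;
    import javax.rmi.ssl.SslRMIServerSocketFactory;

    // Sketch only: export a remote object over TLS so every call is covered
    // by the TLS record MAC. Assumes javax.net.ssl.keyStore / trustStore
    // system properties are set on both sides.
    public class SecureRmiServer {

        // Minimal remote interface, purely for illustration.
        public interface Echo extends Remote {
            String echo(String msg) throws RemoteException;
        }

        static class EchoImpl implements Echo {
            public String echo(String msg) { return msg; }
        }

        public static void main(String[] args) throws Exception {
            SslRMIClientSocketFactory csf = new SslRMIClientSocketFactory();
            SslRMIServerSocketFactory ssf = new SslRMIServerSocketFactory();

            Echo stub = (Echo) UnicastRemoteObject.exportObject(new EchoImpl(), 0, csf, ssf);

            Registry registry = LocateRegistry.createRegistry(Registry.REGISTRY_PORT, csf, ssf);
            registry.rebind("Echo", stub);
            System.out.println("Echo service exported over TLS");
        }
    }

Whether the container-managed EJB transport (RMI/IIOP) can be configured equivalently is, I suppose, vendor-specific.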

+2  A: 

I've worked on trading systems for investment banks, and I can assure you there is no extra checksumming going on - most apps use plain sockets. Given the current problems in the financial sector, I think bad TCP/IP checksums should be the least of your worries.

anon
@altCognito please do not edit my answers so as to alter their meaning. thank you.
anon
This is a case of Premature Optimization http://c2.com/cgi/wiki?PrematureOptimization (old answer removed, I'm spooked)
altCognito
+1  A: 

Well, that paper is from 2000, so it's from a LONG time ago (man, am I old), and it is based on a pretty limited set of traces. So take its figures with a huge grain of salt. That said, it would be interesting to see whether this is still the case. I suspect things have changed, though some classes of errors may well still exist, such as hardware faults.

If you really need the extra application-level assurance, a cryptographic hash of the data (SHA-N, MD5, etc.) would be more useful than a plain checksum.
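Roughly along these lines, using nothing beyond the JDK's MessageDigest (just a sketch, not tied to any particular framework):

    import java.security.MessageDigest;

    // Sketch: the sender ships a SHA-256 digest alongside the payload and
    // the receiver recomputes it before trusting the data.
    public class DigestCheck {
        static byte[] sha256(byte[] payload) throws Exception {
            return MessageDigest.getInstance("SHA-256").digest(payload);
        }

        public static void main(String[] args) throws Exception {
            byte[] msg = "BUY 100 ACME @ 12.34".getBytes("UTF-8");
            byte[] sentDigest = sha256(msg);

            // ... msg and sentDigest travel across the network ...

            boolean intact = MessageDigest.isEqual(sentDigest, sha256(msg));
            System.out.println("payload intact: " + intact);
        }
    }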

jesup
The document is from 2000, but the TCP protocol with its checksum is even older - it dates from the 1970s - and the weakness is still there. So I wouldn't expect it to disappear just by becoming "ancient".
Andrey
+2  A: 

I am convinced that every application that cares about data integrity should use a secure hash. Most, however, do not. People simply ignore the problem.
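A check as small as the sketch below (standard javax.crypto HMAC; the hard-coded key and names are purely illustrative) would catch corruption anywhere between the two endpoints, whatever hardware sits in between:

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    // Sketch of an end-to-end integrity tag computed by the sender and
    // verified by the receiver. The hard-coded key is for illustration only.
    public class EndToEndIntegrity {
        static byte[] tag(byte[] key, byte[] message) throws Exception {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(message);
        }

        public static void main(String[] args) throws Exception {
            byte[] key = "shared-demo-key".getBytes("UTF-8");
            byte[] order = "SELL 50 ACME @ 11.99".getBytes("UTF-8");

            byte[] senderTag = tag(key, order);
            // ... order and senderTag pass through sockets, routers, queues ...
            boolean intact = java.security.MessageDigest.isEqual(senderTag, tag(key, order));
            System.out.println("order intact: " + intact);
        }
    }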

Although I have seen data corruption frequently over the years - even corruption that got past checksums - the most memorable case in fact involved a stock trading system. A bad router was corrupting data in a way that usually got past the TCP checksum: it was flipping the same bit off and on. And of course, no one is alerted about the packets that did fail the TCP checksum. The application had no additional checks for data integrity.

The messages were things like stock orders and trades. The consequences of corrupting the data are as serious as it sounds.

Luckily, the corruption caused the messages to be invalid enough to result in the trading system completely crashing. The consequences of some lost business were nowhere near as severe as the potential consequences of executing bogus transactions.

We identified the problem by luck - someone's SSH session between two of the servers involved failed with a strange error message. SSH, of course, does verify data integrity.

After this incident, the company did nothing to mitigate the risk of data corruption while in flight or in storage. The same code remains in production, and in fact additional code has gone into production that assumes the environment around it will never corrupt data.

This actually is the correct decision for all the individuals involved. A developer who prevents a problem that was caused by some other part of the system (e.g. bad memory, bad hard drive controller, bad router) is not likely to gain anything. The extra code creates the risk of adding a bug, or being blamed for a bug that isn't actually related. If a problem does occur later, it will be someone else's fault.

For management, it's like spending time on security. The odds of an incident are low, but the "wasted" effort is visible. For example, notice how end-to-end data integrity checking has been compared to premature optimization already here.

As for things changing since that paper was written - all that has changed is that we have higher data rates, more complex systems, and faster CPUs that make a cryptographic hash less costly. More chances for corruption, and a lower cost to prevent it.

The real issue is whether it is better in your environment to detect/prevent problems or to ignore them. Remember that once you detect a problem, it may become your responsibility. And if you spend time preventing problems that management does not recognize as problems, it can make you look like you are wasting time.

chuck
