Hello all.

I have an extremely strange bug.

I have two applications that communicate over TCP/IP.

Application A is the server, and application B is the client.

Application A sends a bunch of float values to application B every 100 milliseconds.

The bug is the following: sometimes some of the float values received by application B are not the same as the values transmitted by application A.

Initially, I thought there was a problem with the Ethernet or TCP/IP drivers (some sort of data corruption). I then tested the code on other Windows machines, but the problem persisted.

I then tested the code on Linux (Ubuntu 10.04.1 LTS) and the problem is still there!!!

The values are logged just before they are sent and just after they are received.

The code is pretty straightforward: the message protocol has a 4-byte header like this:

//message header
struct MESSAGE_HEADER {
    unsigned short type;
    unsigned short length;
};

//orientation message
struct ORIENTATION_MESSAGE : MESSAGE_HEADER
{
  float azimuth;
  float elevation;
  float speed_az;
  float speed_elev;
};

//any message
struct MESSAGE : MESSAGE_HEADER {
    char buffer[512];
};

//receive specific size of bytes from the socket
static int receive(SOCKET socket, void *buffer, size_t size) {
    int r;
    do {
        r = recv(socket, (char *)buffer, size, 0);
        if (r == 0 || r == SOCKET_ERROR) break;
        buffer = (char *)buffer + r;
        size -= r;
    } while (size);
    return r;
}

//send specific size of bytes to a socket
static int send(SOCKET socket, const void *buffer, size_t size) {
    int r;
    do {
        r = send(socket, (const char *)buffer, size, 0);
        if (r == 0 || r == SOCKET_ERROR) break;
        buffer = (const char *)buffer + r; //keep the pointer const-correct
        size -= r;
    } while (size);
    return r;
}

//get message from socket
static bool receive(SOCKET socket, MESSAGE &msg) {
    int r = receive(socket, &msg, sizeof(MESSAGE_HEADER));
    if (r == SOCKET_ERROR || r == 0) return false;
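    size_t length = ntohs(msg.length);
    if (length == 0) return true;
    if (length > sizeof(msg.buffer)) return false; //reject a body that would overflow the buffer
    r = receive(socket, msg.buffer, length);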
    if (r == SOCKET_ERROR || r == 0) return false;
    return true;
}

//send message
static bool send(SOCKET socket, const MESSAGE &msg) {
    int r = send(socket, &msg, ntohs(msg.length) + sizeof(MESSAGE_HEADER));
    if (r == SOCKET_ERROR || r == 0) return false;
    return true;
}

When I receive the message 'orientation', sometimes the 'azimuth' value is different from the one sent by the server!

Shouldn't the data be the same all the time? Doesn't TCP/IP guarantee that the data is delivered uncorrupted? Could an exception in the math co-processor affect the TCP/IP stack? Is it a problem that I receive a small number of bytes first (4 bytes) and then the message body?

EDIT:

The problem is in the endianness swapping routine. The following code swaps the endianness of a specific float, then swaps it back, and prints the bytes before and after:

#include <cstdio>
using namespace std;

float ntohf(float f)
{
    float r;
    unsigned char *s = (unsigned char *)&f;
    unsigned char *d = (unsigned char *)&r;
    d[0] = s[3];
    d[1] = s[2];
    d[2] = s[1];
    d[3] = s[0];
    return r;
}

int main() {
    unsigned int l = 3206974079u; //0xBF268A7F; same size as float, unlike unsigned long on 64-bit Linux
    float f1 = (float &)l;
    float f2 = ntohf(ntohf(f1));
    unsigned char *c1 = (unsigned char *)&f1;
    unsigned char *c2 = (unsigned char *)&f2;
    printf("%02X %02X %02X %02X\n", c1[0], c1[1], c1[2], c1[3]);
    printf("%02X %02X %02X %02X\n", c2[0], c2[1], c2[2], c2[3]);
    getchar();
    return 0;
}

The output is:

7F 8A 26 BF
7F CA 26 BF

I.e. the float assignment probably normalizes the value, producing a different value from the original.
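One way to see what happens to the value without involving the FPU at all is to look at the bit patterns with integer arithmetic only. The sketch below is not part of the original program; the helper names (swap32, is_signaling_nan, quieted, demo) are made up for illustration. It assumes IEEE 754 single precision, and it models the change as "set the top mantissa bit", which is how an x87 FPU turns a signaling NaN into a quiet NaN when it loads one with the invalid-operation exception masked. Under those assumptions, the byte-swapped form of 0xBF268A7F is the signaling NaN pattern 0x7F8A26BF, and quieting it reproduces the 8A to CA change seen in the output above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Reverse the byte order of a 32-bit pattern using integer operations only.
constexpr std::uint32_t swap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// True if the pattern is an IEEE 754 single-precision signaling NaN:
// exponent all ones, mantissa non-zero, top mantissa bit (quiet bit) clear.
constexpr bool is_signaling_nan(std::uint32_t bits) {
    return (bits & 0x7F800000u) == 0x7F800000u &&
           (bits & 0x007FFFFFu) != 0 &&
           (bits & 0x00400000u) == 0;
}

// Model of an FPU quieting a signaling NaN: set the quiet bit, leave
// every other pattern untouched.
constexpr std::uint32_t quieted(std::uint32_t bits) {
    return is_signaling_nan(bits) ? (bits | 0x00400000u) : bits;
}

// Print the pattern in the byte order a little-endian machine stores it.
void print_le_bytes(std::uint32_t v) {
    std::printf("%02X %02X %02X %02X\n",
                (unsigned)(v & 0xFF), (unsigned)((v >> 8) & 0xFF),
                (unsigned)((v >> 16) & 0xFF), (unsigned)(v >> 24));
}

// Replays the double swap from the test program on plain integers.
void demo() {
    const std::uint32_t original = 0xBF268A7Fu;        // 3206974079
    const std::uint32_t swapped = swap32(original);    // 0x7F8A26BF
    assert(is_signaling_nan(swapped));
    print_le_bytes(original);                          // 7F 8A 26 BF
    print_le_bytes(swap32(quieted(swapped)));          // 7F CA 26 BF
}
```

Calling demo() prints the same two lines as the test program, which suggests the value is not being normalized in general: only swapped patterns that happen to look like a signaling NaN get altered.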

Any input on this is welcome.

EDIT2:

Thank you all for your replies. It seems the problem is that the swapped float, when returned via the 'return' statement, is pushed onto the CPU's floating-point stack. The caller then pops the value from the stack and the value is rounded; but it is the swapped float, so the rounding corrupts the value.
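A common way around this is to keep the swapped bytes in an integer and never hold them in a float variable, so they never pass through a floating-point register. The following is a minimal sketch under some assumptions: float is a 32-bit IEEE 754 type, the swap is unconditional like the original ntohf, and the names float_to_wire, wire_to_float and check_round_trip are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");

// Reverse the byte order of a 32-bit integer.
std::uint32_t swap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Host float -> swapped bit pattern. The result is an integer, so the
// swapped bytes are never interpreted as a floating-point value.
std::uint32_t float_to_wire(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copy the bits without type punning
    return swap32(bits);
}

// Swapped bit pattern -> host float. Only the correctly ordered bits
// are ever stored in a float.
float wire_to_float(std::uint32_t wire) {
    const std::uint32_t bits = swap32(wire);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Self-check: an ordinary value must survive the round trip bit for bit.
void check_round_trip(float f) {
    const float g = wire_to_float(float_to_wire(f));
    assert(std::memcmp(&f, &g, sizeof f) == 0);
}
```

With this layout the message struct would carry std::uint32_t fields on the wire instead of float fields, and the conversion happens only at the point where a value is written into or read out of the message.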

+2  A: 

TCP tries to deliver unaltered bytes, but unless the machines have similar CPUs and operating systems, there's no guarantee that the floating-point representation on one system is identical to that on the other. You need a mechanism for ensuring this, such as XDR or Google's protobuf.

Steve Emmerson
Yes, the machines have similar CPUs and operating system, and both programs are using the same code base and compiler.
axilmar
A: 

You're sending binary data over the network, using implementation-defined padding for the struct layout, so this will only work if you're using the same hardware, OS and compiler for both application A and application B.

If that's ok, though, I can't see anything wrong with your code. One potential issue is that you're using ntohs to extract the length of the message, and that length is the total length minus the header length, so you need to make sure you are setting it properly. It needs to be done as

msg.length = htons(sizeof(ORIENTATION_MESSAGE) - sizeof(MESSAGE_HEADER));

but you don't show the code that sets up the message...
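For reference, here is roughly what I would expect the setup code to look like. This is only a sketch: the type code MSG_ORIENTATION and the function make_orientation are invented for the example, the structs are copied from the question with 1-byte packing, and the header is the POSIX one (on Windows it would be <winsock2.h>). The floats are left in host byte order here.

```cpp
#include <arpa/inet.h>  // htons/ntohs on POSIX (assumption); <winsock2.h> on Windows
#include <cassert>
#include <cstddef>

#pragma pack(push, 1)
struct MESSAGE_HEADER {
    unsigned short type;
    unsigned short length;  // body length only, in network byte order
};

struct ORIENTATION_MESSAGE : MESSAGE_HEADER {
    float azimuth;
    float elevation;
    float speed_az;
    float speed_elev;
};
#pragma pack(pop)

static_assert(sizeof(MESSAGE_HEADER) == 4, "header must be 4 bytes on the wire");
static_assert(sizeof(ORIENTATION_MESSAGE) == 20, "header plus four 4-byte floats");

const unsigned short MSG_ORIENTATION = 1;  // hypothetical type code

// Build an orientation message whose length field excludes the header.
ORIENTATION_MESSAGE make_orientation(float az, float el, float s_az, float s_el) {
    const std::size_t body = sizeof(ORIENTATION_MESSAGE) - sizeof(MESSAGE_HEADER);
    assert(body <= 512);  // must fit MESSAGE::buffer on the receiving side
    ORIENTATION_MESSAGE m;
    m.type = htons(MSG_ORIENTATION);
    m.length = htons(static_cast<unsigned short>(body));
    m.azimuth = az;
    m.elevation = el;
    m.speed_az = s_az;
    m.speed_elev = s_el;
    return m;
}
```

The receiver then reads 4 header bytes, converts length with ntohs, and reads exactly that many body bytes (16 for this message).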

Chris Dodd
It's not the padding. I use #pragma pack(push, 1), so the packing is 1 byte. If it were the padding, the problem would manifest itself immediately.
axilmar