I'm on a local LAN with only 8 connected computers using a Netgear 24-port gigabit switch. Network load is very low, and the send/receive buffers on all involved nodes (running Slackware 11) have been set to 16 MB. I'm also running tcpdump on each node to monitor the traffic.

A sending node sends a 10044-byte UDP datagram which, more often than not (3 times out of 4), never reaches the receiving application. In these cases tcpdump shows that the first x fragments are missing and only the last 3 (all with offsets > 0 and in order) are captured. The fragmented UDP datagram therefore cannot be reassembled and is most likely discarded.
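For reference, the number of IP fragments such a datagram produces can be estimated with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes a standard 1500-byte Ethernet MTU and a 20-byte IPv4 header without options, neither of which is stated above.

```python
import math

def ipv4_fragment_count(udp_payload_len, mtu=1500, ip_header_len=20, udp_header_len=8):
    """Estimate how many IPv4 fragments a UDP datagram is split into.

    Assumptions (not taken from the post): 1500-byte MTU, 20-byte IPv4
    header with no options.
    """
    # The UDP header travels inside the first fragment and counts as IP payload.
    ip_payload = udp_payload_len + udp_header_len
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - ip_header_len) // 8 * 8
    return math.ceil(ip_payload / per_fragment)
```

Under these assumptions, `ipv4_fragment_count(10044)` gives 7, so losing all but the last 3 fragments means the first 4 never showed up on the wire.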

I find the missing fragments strange, since I have also tried a simple load test bursting out 10000 UDP messages of the same size; the receiving application sends a response to each, and all tests so far have returned 100% of the responses.

Any clues or hints?

A: 

Update!

After resuming testing of the above-mentioned software, I found a repeatable way of reproducing the error.

Using windump on the sending Windows machine and tcpdump on the receiving machine, after leaving the application idle for some time (~5 minutes), I tried sending the UDP message but ended up with only a single fragment caught by windump and tcpdump; the 3 remaining fragments are lost. Sending the same message one more time works fine: both windump and tcpdump catch all 4 fragments, and the application on the receiving side gets the message. The pattern is repeatable.

I started searching and found the following information, but it still does not give me a clear answer.

http://www.eggheadcafe.com/software/aspnet/32856705/first-udp-message-to-a-sp.aspx

Re-examining the logs, I now notice the ARP request/reply being sent, which matches one of the ideas given in the link above.

NOTE! I filter windump on the sending side using: "dst host receivernode"

Capture from windump: first failed udp message, should be 4 fragments long

14:52:45.342266 arp who-has receivernode tell sendernode
14:52:45.342599 IP sendernode> receivernode : udp

Capture from windump: second udp message, exactly the same contents, all 4 fragments caught

14:52:54.132383 IP sendernode.10104 > receivernode .10113: UDP, length 6019
14:52:54.132397 IP sendernode> receivernode : udp
14:52:54.132406 IP sendernode> receivernode : udp
14:52:54.132414 IP sendernode> receivernode : udp
14:52:54.132422 IP sendernode> receivernode : udp
14:52:56.142421 arp reply sendernode is-at 00:11:11:XX:XX:fd (oui unknown)

Does anyone have a good idea about what's happening? Please elaborate!

Kristofer
Some more research gave this: if no ARP cache entry exists and the UDP datagram size exceeds the MTU, only the last fragment is sent to the destination; the remaining fragments are silently discarded.

http://support.microsoft.com/kb/233401

http://www.keil.com/support/man/docs/rlarm/rlarm_tn_using_udp_arpempty.htm

A workaround is to extend the cache timeout, add a keep-alive message, or add a static entry instead of using a dynamic one. Still, is there any way of getting notified when this happens?
Kristofer
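The keep-alive idea from the comment above might look roughly like the following Python sketch. The helper name, the one-byte probe, and the settle delay are all assumptions made for illustration; the receiving application would have to be written to ignore the probe, and the delay needed for the ARP reply to arrive will depend on the network.

```python
import socket
import time

def send_with_arp_warmup(sock, payload, addr, probe=b"\x00", settle=0.05):
    """Send a tiny probe datagram before the real one.

    The probe fits in a single frame, so if the ARP cache entry for the
    peer has expired, the probe is what waits on ARP resolution rather
    than the fragments of the large datagram. All names and defaults here
    are hypothetical.
    """
    sock.sendto(probe, addr)   # small datagram, never fragmented
    time.sleep(settle)         # assumed long enough for the ARP reply on a quiet LAN
    return sock.sendto(payload, addr)  # bytes handed to the stack for the real datagram
```

A call such as `send_with_arp_warmup(sock, message, ("receivernode", 10113))` would replace the plain `sendto`. Sending the probe on a timer shorter than the ARP cache lifetime is another variant of the same idea.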
Changing the ArpCacheLife parameter on the Windows machine corrected the problem. The Linux equivalent is /proc/sys/net/ipv4/neigh/$DEV/gc_stale_time
Kristofer
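On the Linux side, the value named in the comment above can be read from a script, for example to pick a keep-alive interval shorter than it. This is a minimal sketch: the function name and the `proc_root` parameter are hypothetical, and it simply assumes the file holds a single integer number of seconds.

```python
from pathlib import Path

def read_gc_stale_time(dev="default", proc_root="/proc/sys/net/ipv4/neigh"):
    """Return gc_stale_time (seconds) for a device such as "eth0".

    "default" is the kernel's template entry; proc_root is a parameter
    only so the function can be pointed at another directory.
    """
    return int((Path(proc_root) / dev / "gc_stale_time").read_text().strip())
```

For example, `read_gc_stale_time("eth0")` would read /proc/sys/net/ipv4/neigh/eth0/gc_stale_time, assuming an interface with that name exists.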