I have a third-party component that, in a certain situation, tries to send too many UDP messages to too many separate addresses. The burst happens when the software starts, and the situation is temporary. I'm not sure whether the cause is the sheer number of messages or the fact that each one goes to a separate IP address.

Anyway, changing the underlying protocol or the problematic component is not an option, so I'm looking for a workaround. The stack trace looks like this:

java.io.IOException: No buffer space available
    at java.net.PlainDatagramSocketImpl.send(Native Method)
    at java.net.DatagramSocket.send(DatagramSocket.java:612)

This issue occurs (at least) with Java versions 1.6.0_13 and 1.6.0_10 and Linux versions Ubuntu 9.04 and RHEL 4.6.

Are there any Java system properties or Linux configuration tweaks which might help?

+2  A: 

When sending lots of messages, especially over gigabit ethernet in Linux, the stock parameters for your kernel are usually not optimal. You can increase the Linux kernel buffer size for networking through:

echo 1048576 > /proc/sys/net/core/wmem_max
echo 1048576 > /proc/sys/net/core/wmem_default
echo 1048576 > /proc/sys/net/core/rmem_max
echo 1048576 > /proc/sys/net/core/rmem_default

Run these as root.

Alternatively, use sysctl:

sysctl -w net.core.rmem_max=8388608

There are many more network tuning options.

See "Linux Network Tuning" by IBM and "More tuning information".

Aiden Bell
Thanks. In addition to those parameters, I also tried tweaking net.ipv4.udp_mem and net.ipv4.udp_wmem_min. First I doubled the values, then I doubled them again, and finally I made them 10 times as big as the defaults. Nothing has helped so far, though.
auramo
@auramo, which JVM are you using: the Sun build or the OpenJDK build from your distro? I would recommend using the one from your distro, the open one if possible, as it will be less 'safe' and more accurate in interfacing with the kernel/libc.
Aiden Bell
I'm using the Sun builds of 1.6.0_13 and 1.6.0_10. I could easily try the OpenJDK versions, but switching the end product from the Sun implementation to OpenJDK would be a major hassle at this point in the project.
auramo
If you switch to OpenJDK (which is based on the Sun source anyway) and the problem is solved, then you can ask around the Sun forums about differences that would cause this, and get help reconfiguring the Sun JRE release to work ;) It might not be the Linux kernel but some component in between (for example, a C library that the Sun blob links to statically).
Aiden Bell
I have the same suspicion: Sun's JDK has some C code that makes library calls which override the sysctl values I tried to change. While googling, I bumped into several articles saying that you can override udp_mem, wmem_max, etc. with a C API call in your client code.
auramo
I'm sure it is doable. That's VM architectures for you :p
Aiden Bell
Finally tried with OpenJDK (the one from ubuntu 9.04 repos, IcedTea6 1.4.1 6b14-1.4.1-0ubuntu7). Same problem.
auramo
+1  A: 

This might be a bit complicated, but as far as I know, Java uses the SPI [1] pattern for the network sub-library. This allows you to change the implementation used for various network operations. If you use OpenJDK, its source could give you some hints about what to wrap with your own implementation, and how. Then, in your implementation, you slow down the I/O, with some sleeps for example.

Or, just for fun, you could override the default DatagramSocket with your own modified implementation. Give it the same package name and, as far as I know, it will take precedence over the default JRE class. At least this method worked for me with a buggy third-party library.

Edit:

[1] Service Provider Interface is a way to separate client and service code within an API. This separation allows different client and different provider implementations. It can usually be recognized by a name ending in Impl: in your stack trace, java.net.PlainDatagramSocketImpl is the provider implementation, while DatagramSocket is the client-side API.

You commented that you don't want to slow down the communication all the time. There are several hacks to avoid that. For example, measure the time in your code and slow down the communication only during the first 1-2 minutes after your first incoming method call. After that, you can skip the sleep.

Another option would be to identify the misbehaving class in the library, decompile it (with JAD, for example), and fix it. Then replace the original class file in the library.

kd304
Can you tell me what an SPI pattern is? I want to overcome a 1-2 minute, boot-time burst. For that I definitely do not want to slow down my UDP I/O which needs to be speedy throughout the time the application is running (it's a server application).
auramo
Ok, that old pattern :-)
auramo
A: 

I'm also currently seeing this problem on both Debian and RHEL. At this point I believe I've isolated it to the NIC and/or the NIC driver. What hardware configuration do you have that exhibits this problem? For us, it seems to occur only on new Dell PowerEdge servers that we recently acquired, which have Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet NICs.

I can also confirm that the trigger is the rapid generation of outbound UDP packets to many different IP addresses in a short window. I've attempted to write a simple Java application that can reproduce it (ours is occurring with snmp4j).

EDIT

Look at my answer here: http://stackoverflow.com/questions/1043567/java-ioexception-no-buffer-space-available-while-sending-udp-packets-on-linux/1548174#1548174

Matthew B. Jones
My problem occurred on many hardware configurations, on an HP workstation as well as a rack server. Eventually we ended up hacking the underlying component (a Java component from another team inside our company) that did the excessive network messaging and triggered the issue. Now that component performs far fewer UDP requests/responses, and the problem is solved for us.
auramo
+2  A: 

I've finally determined what the issue is. The Java IOException is misleading: it says "No buffer space available", but the root issue is that the local ARP table has filled up. On Linux, the default ARP table limit is 1024 entries (see the files /proc/sys/net/ipv4/neigh/default/gc_thresh1, /proc/sys/net/ipv4/neigh/default/gc_thresh2, /proc/sys/net/ipv4/neigh/default/gc_thresh3, the last of which is the hard maximum).

What was happening in my case (and, I assume, in yours) is that the Java code is sending UDP packets from an IP address in the same subnet as the destination addresses. When this is the case, the Linux machine performs an ARP lookup to translate each destination IP address into a hardware MAC address. Since you are blasting out packets to many different IPs, the local ARP table fills up quickly, hits 1024 entries, and that is when the Java exception is thrown.
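
To check whether this is what is happening on your machine, you can compare the number of ARP entries with the hard limit while the burst is running. The following is only a rough sketch: the function name arp_usage is made up for illustration, and it assumes that /proc/net/arp lists one IPv4 neighbor entry per line after a single header line.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): report how many ARP
# entries exist compared with the kernel's hard limit (gc_thresh3).
# Both file paths can be overridden, which is handy for experimenting.
arp_usage() {
    arp_file="${1:-/proc/net/arp}"
    limit_file="${2:-/proc/sys/net/ipv4/neigh/default/gc_thresh3}"
    # /proc/net/arp starts with one header line; every later line is an entry.
    entries=$(tail -n +2 "$arp_file" | wc -l)
    limit=$(cat "$limit_file")
    # $((...)) normalizes the count in case wc pads it with spaces.
    echo "ARP entries: $((entries)) / hard limit: $limit"
}

# Only run against the live system if both files are readable here.
if [ -r /proc/net/arp ] && [ -r /proc/sys/net/ipv4/neigh/default/gc_thresh3 ]; then
    arp_usage
fi
```

If the entry count climbs toward the limit just before the exception appears, the ARP table is the likely culprit.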

The solution is simple: either increase the limit by editing the files I mentioned earlier, or move your server into a different subnet from your destination addresses, so that the Linux box no longer performs neighbor ARP lookups for them (the packets are instead handled by a router on the network).
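
As a sketch of the first option, the thresholds can be raised with sysctl instead of editing the /proc files by hand. The values below are illustrative assumptions only, not recommendations; pick numbers that comfortably exceed the number of hosts you expect to contact on the local subnet.

```shell
# Run as root. The numbers are placeholder values chosen for illustration.
sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh3=4096
```

Settings made this way are lost on reboot. To keep them, the same key=value pairs can be added to /etc/sysctl.conf.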

Matthew B. Jones