My team has a situation where an SNMP SET will fail once every two weeks or so. Since this set happens automatically, we don't necessarily notice it immediately when it fails, and this can result in an inconsistent configuration and associated wailing and gnashing of teeth. The plan is to fix this by having our software automatically retry the SET when it fails.

The problem is, we aren't sure why the failure is happening. My (extremely limited) knowledge of SNMP isn't particularly helpful in diagnosing this problem, so I thought I'd ask StackOverflow for some advice. We think that every so often a spike in network traffic causes the SET to fail. Since SNMP uses UDP for communication, I would think it would be relatively easy for a command to be lost if traffic were high for a short period of time. However, I have no idea how common this is. We have a small network with a single Cisco router, and there are fewer than a dozen SNMP-controlled devices on that network. In addition to the SNMP traffic, there are some status web pages being loaded from the various devices. In case it makes a difference, I believe we are using the AdventNet SNMP API version 4.0.4 for Java.

Does it sound reasonable that there will be some SET commands dropped occasionally, or should we be looking for other causes?

+2  A: 

SNMP is unreliable by design: it normally runs over UDP, which does not guarantee delivery, and routers will drop SNMP packets when they have higher-priority work to do. So yes, it sounds very reasonable that SET commands are dropped occasionally :)

First, upgrade to the newest version of the SNMP library if there is one.

Then you can set up a retry mechanism: verify each SET with a GET. If the verification fails, queue the SET for a later attempt. This requires a more elaborate queuing mechanism: a later SET for the same setting should either be queued behind, or replace, a SET that is already waiting.
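One way to picture the "queued over" variant is the sketch below, where a newer SET for the same OID replaces the one already waiting. The class name `PendingSetQueue` and the `trySet` callback are made up for illustration; the callback stands in for whatever "SET, then verify with GET" routine your SNMP library lets you write, and nothing here is AdventNet API.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;

/** Hypothetical retry queue: at most one pending SET per OID, newest value wins. */
final class PendingSetQueue {
    // LinkedHashMap keeps insertion order, so retries run oldest-first.
    private final Map<String, String> pending = new LinkedHashMap<>();

    /** Queue a failed SET; a later SET for the same OID replaces the earlier one. */
    synchronized void enqueue(String oid, String value) {
        pending.remove(oid);      // drop the stale value, if any
        pending.put(oid, value);  // re-insert at the tail of the queue
    }

    /**
     * Retry every pending SET once. trySet is assumed to return true only
     * when the SET went through and was confirmed. Returns how many are still pending.
     */
    synchronized int retryAll(BiPredicate<String, String> trySet) {
        Iterator<Map.Entry<String, String>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            if (trySet.test(entry.getKey(), entry.getValue())) {
                it.remove();      // confirmed on the device, no longer pending
            }
        }
        return pending.size();
    }

    /** Copy of the current queue contents, in retry order. */
    synchronized Map<String, String> snapshot() {
        return new LinkedHashMap<>(pending);
    }
}
```

Replacing rather than appending means a stale value can never be written after a newer one, at the cost of losing the history of intermediate values.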

Another option is to synchronize the entire state every hour: GET each setting and, if it differs from the value you expect, SET it again. Changes that have not made it through after more than 3 hours can be reported using an alerting system.
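A rough sketch of one such synchronization pass follows. The names `StateReconciler`, `get` and `set` are placeholders for illustration, not part of any SNMP library, and the threshold of three failed hourly rounds is only an approximation of the "over 3 hours" rule.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.Function;

/** Hypothetical periodic sync: compare desired state with the device, re-SET on mismatch. */
final class StateReconciler {
    // Number of consecutive rounds in which each OID could not be brought in sync.
    private final Map<String, Integer> failedRounds = new HashMap<>();
    private final int alertThreshold;

    StateReconciler(int alertThreshold) {
        this.alertThreshold = alertThreshold;
    }

    /**
     * Run one pass. get reads the current value for an OID (null if unknown);
     * set attempts a SET and returns false on failure. Both are assumed wrappers
     * around the real SNMP calls. Returns the OIDs that have now failed for
     * alertThreshold or more consecutive rounds and should be reported.
     */
    List<String> reconcileOnce(Map<String, String> desired,
                               Function<String, String> get,
                               BiPredicate<String, String> set) {
        List<String> overdue = new ArrayList<>();
        for (Map.Entry<String, String> entry : desired.entrySet()) {
            String oid = entry.getKey();
            String want = entry.getValue();
            // Already correct, or corrected by a SET that a second GET confirms.
            boolean inSync = want.equals(get.apply(oid))
                    || (set.test(oid, want) && want.equals(get.apply(oid)));
            if (inSync) {
                failedRounds.remove(oid);
            } else {
                int rounds = failedRounds.merge(oid, 1, Integer::sum);
                if (rounds >= alertThreshold) {
                    overdue.add(oid);
                }
            }
        }
        return overdue;
    }
}
```

To run it hourly, `reconcileOnce` could be scheduled with `ScheduledExecutorService.scheduleAtFixedRate` and the returned list handed to whatever alerting you already have.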

There are many more options, but if you see only about one failure every week or two, I'd go with the simplest one: verify each SET with a GET, retry up to 5 times, and if it still fails, send an email.
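As a minimal sketch of that simplest option: the `SnmpClient` interface below is a hypothetical stand-in for your SNMP library (it is not the AdventNet API), and the email step is left to the caller, which gets `false` back when all attempts fail.

```java
import java.io.IOException;

/** Hypothetical wrapper around the real SNMP library; not an AdventNet interface. */
interface SnmpClient {
    void set(String oid, String value) throws IOException;
    String get(String oid) throws IOException;
}

/** Verifies every SET with a GET and retries a bounded number of times. */
final class VerifiedSetter {
    private final SnmpClient client;
    private final int maxAttempts;
    private final long delayMillis;

    VerifiedSetter(SnmpClient client, int maxAttempts, long delayMillis) {
        this.client = client;
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
    }

    /** Returns true once a GET confirms the value, false if every attempt failed. */
    boolean setAndVerify(String oid, String value) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                client.set(oid, value);
                if (value.equals(client.get(oid))) {
                    return true;  // the device reports the value we wrote
                }
            } catch (IOException e) {
                // Timeout or lost packet: count it as a failed attempt and retry.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);  // brief pause so a traffic spike can pass
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;  // give up promptly if the thread is being stopped
                }
            }
        }
        return false;  // caller should raise the alarm (e.g. send the email)
    }
}
```

A call such as `new VerifiedSetter(client, 5, 2000).setAndVerify(oid, value)` would then cover the "retry 5 times" part; the two-second delay is an arbitrary choice.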

Andomar
Thanks! It is good to know we are heading down the right trail.
A. Levy