I am trying out RabbitMQ with this python binding.

One thing I noticed is that if I kill a consumer uncleanly (emulating a crashed program), the server will think that this consumer is still there for a long time. The result of this is that every other message will be ignored.

For example, if you kill one consumer and reconnect, then 1 in 2 messages will be ignored. If you kill a second consumer, then 2 in 3 messages will be ignored. If you kill a third, then 3 in 4 messages will be ignored, and so on.

I've tried turning on acknowledgments, but it doesn't seem to be helping. The only solution I have found is to manually stop the server and reset it.

Is there a better way?

How to recreate this scenario

  • Run rabbitmq.

  • Unarchive this library.

  • Download the consumer and publisher here. Run amqp_consumer.py twice. Run amqp_publisher.py, feeding in some data and observe that it works as expected. Messages are received round robin style.

  • Kill one of the consumer processes with kill -9 or task manager.

  • Now when you publish a message, 50% of the messages will be lost.

+1  A: 

Please provide a few more specifics regarding the components you've declared. Usually (and independent of the client implementation) a queue with the properties

  • exclusive and
  • auto-delete

should get removed as soon as the connection between the declaring client and the broker breaks up. This won't help you with shared queues, though. Please detail a bit what exactly you are trying to model.

yawn
I am not talking about when queues get deleted. I am talking about how rabbitmq doesn't detect crashed connections for a very long time and keeps trying to send them messages as though they are still there.
Unknown
+1  A: 

RabbitMQ doesn't have a timeout on acknowledgements from the client that a message has been processed: see this post (the whole thread might be of interest). Some salient points from the post:

The AMQP ack model for subscriptions and "pull" are identical. In both cases the message is kept on the server but is unavailable to other consumers until it either has been ack'ed (and gets removed), nack'ed (with basic.reject; though RabbitMQ does not implement that) or the channel/connection is closed (at which point the message becomes available to other consumers).

and (my emphases)

There is no timeout on waiting for acks. Usually that is not a problem since the common cases of a missing ack - network or client failure - will result in the connection getting dropped (and thus trigger the behaviour described above). Still, a timeout could be useful to, say, deal with alive but unresponsive consumers. That has come up in discussion before. Is there a specific use case you have in mind that requires such functionality?

The problem might well be occurring because in a client pull model, it's harder for the server to detect a broken connection (as opposed to an alive but unresponsive consumer), particularly as the server seems happy to wait forever for an ack.

Update: On Linux, you can attach signal handlers for SIGTERM and/or SIGINT and hopefully close down the connection in an orderly way from the client. (SIGKILL, which is what kill -9 sends, cannot be caught, so no handler runs in that case.) On Windows, I believe closing from Task Manager invokes the Win32 TerminateProcess API, about which MSDN says:

If a process is terminated by TerminateProcess, all threads of the process are terminated immediately with no chance to run additional code. This means that the thread does not execute code in termination handler blocks. In addition, no attached DLLs are notified that the process is detaching.

This means it might be difficult to catch termination and close down in an orderly way.
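As a rough illustration of the signal-handler idea above, here is a minimal Python sketch. FakeConnection and install_shutdown_handlers are hypothetical names made up for this example; a real client would pass in its own AMQP connection object, assuming it exposes a close() method. Only SIGTERM and SIGINT are registered, since SIGKILL cannot be caught.

```python
import signal
import sys


class FakeConnection:
    """Stand-in for an AMQP connection object (hypothetical, for illustration)."""

    def __init__(self):
        self.closed = False

    def close(self):
        # A real client would tell the broker the connection is closing here.
        self.closed = True


def install_shutdown_handlers(connection, exit_func=sys.exit):
    """Close `connection` when SIGTERM or SIGINT arrives, then exit."""

    def handler(signum, frame):
        connection.close()
        exit_func(0)

    # SIGKILL is deliberately absent: it cannot be caught or ignored.
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)
    return handler
```

This only helps with polite termination requests. A kill -9 or a TerminateProcess call bypasses the handler entirely, so the broker still has to notice the dead socket on its own.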

It might be worth pursuing on the RabbitMQ list with your own use case for an ack timeout.

Vinay Sajip
According to that mailing list, if the consumer terminates the connection, it should operate correctly. However, kill -9 or End Process in Task Manager should also terminate the connection in that manner. But it still doesn't work correctly.
Unknown
+3  A: 

I don't see amqp_consumer.py or amqp_publisher.py in the tarball, so reproducing the fault is tricky.

RabbitMQ terminates connections, releasing their unacknowledged messages for redelivery to other clients, whenever it is told by the operating system that a socket has closed. Your symptoms are very strange, in that even a kill -9 ought to cause the TCP socket to be cleaned up properly.
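To make the behaviour described above concrete, here is a toy model in Python. It is only a sketch of the idea (round-robin delivery, unacknowledged messages held per consumer, requeue on connection close) and not how RabbitMQ is actually implemented; ToyBroker and all of its methods are invented for this example.

```python
from collections import deque


class ToyBroker:
    """Illustrative model only; not RabbitMQ's real implementation."""

    def __init__(self):
        self.queue = deque()   # messages ready for delivery
        self.consumers = []    # consumer names, in round-robin order
        self.unacked = {}      # consumer name -> messages awaiting an ack
        self._next = 0

    def subscribe(self, name):
        self.consumers.append(name)
        self.unacked[name] = []

    def publish(self, message):
        self.queue.append(message)
        self._dispatch()

    def _dispatch(self):
        # Hand each ready message to the next consumer in turn.
        while self.queue and self.consumers:
            name = self.consumers[self._next % len(self.consumers)]
            self._next += 1
            self.unacked[name].append(self.queue.popleft())

    def ack(self, name, message):
        self.unacked[name].remove(message)

    def connection_closed(self, name):
        # The broker has learned the socket is gone: requeue its unacked
        # messages (in their original order) and redeliver them.
        pending = self.unacked.pop(name)
        self.consumers.remove(name)
        self.queue.extendleft(reversed(pending))
        self._dispatch()


broker = ToyBroker()
broker.subscribe("live")
broker.subscribe("dead")  # imagine this process was killed uncleanly
for msg in ["m1", "m2", "m3", "m4"]:
    broker.publish(msg)
stuck = list(broker.unacked["dead"])  # half the messages sit here
broker.connection_closed("dead")      # only now are they redelivered
```

In this model, as long as the broker never learns that the "dead" consumer's socket has closed, half of the messages stay parked with it, which matches the 1-in-2 symptom from the question.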

Some people have noticed problems with sockets surviving longer than they should when running with a firewall or NAT device between the AMQP clients and the server. Could that be an issue here, or are you running everything on localhost? Also, what operating system are you running the various components of the system on?

ETA: From your comment below, I am guessing that while you are running the server on Linux, you may be running the clients on Windows. If this is the case, then it could be that the Windows TCP driver is not closing the sockets correctly, which is different from the kill -9 behaviour on Unix. (On Unix, the kernel will properly close the TCP connections on any killed process.)

If that's the case, then the bad news is that RabbitMQ can only release resources when the socket is closed, so if the client operating system doesn't do that, there's nothing it can do. This is the same as almost every other TCP-based service out there.

The good news, though, is that AMQP supports a "heartbeat" option for exactly these cases, where the networking fabric is untrustworthy. You could try enabling heartbeats. When they're enabled, if the server doesn't receive any traffic within a configurable interval, it decides that the connection must be dead.

The bad news, however, is that I don't think py-amqplib supports heartbeats at the moment. Worth a try, though!
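For what it's worth, the heartbeat idea itself is simple. The sketch below is a stdlib-only Python illustration of the rule described above (no traffic within the interval means the connection is presumed dead). It is not the AMQP wire protocol and not RabbitMQ's exact timeout rule; HeartbeatMonitor and its methods are hypothetical names.

```python
import time


class HeartbeatMonitor:
    """Tracks the last time any traffic was seen on each connection."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval  # seconds of silence tolerated
        self.clock = clock        # injectable so the logic can be tested
        self.last_seen = {}

    def traffic_from(self, conn_id):
        # Any frame counts, whether a heartbeat or ordinary data.
        self.last_seen[conn_id] = self.clock()

    def dead_connections(self):
        # Connections that have been silent for longer than the interval.
        now = self.clock()
        return [c for c, t in self.last_seen.items()
                if now - t > self.interval]
```

A server loop would call traffic_from() on every received frame and periodically close whatever dead_connections() returns, which releases the unacknowledged messages even when the operating system never reports the socket as closed.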

Tony Garnock-Jones
Sorry about that. The producer and consumer are here: http://blogs.digitar.com/jjww/code-samples/
Unknown
I am running rabbitmq on a remote linux server while I am running the producer and consumer. I realize that the socket may not have been closed cleanly, but that is exactly what I want to emulate. I was testing to see how rabbitmq handles crashed processes which may not have closed the socket cleanly, and unfortunately it doesn't seem to handle this very well.
Unknown
@Tony; how does one enable the "heartbeat" option within the RabbitMQ server (within /etc/rabbitmq/rabbitmq.config for example)?
Ryan Duffield
Never mind, I understand now that this is configured client-side.
Ryan Duffield