We have a message processing system with high performance demands. Recently we have noticed that the first message takes many times longer then subsequent messages. A bunch of transformation and message augmentation happens as this goes through our system, much of it done by way of external lib.
I just profiled this issue (using callgrind), comparing a "run" of just one message with a "run" of many messages (providing a baseline of comparison).
The main difference I see is the function "do_lookup_x" taking up a huge amount of time. Looking at the various calls to this function, they all seem to be called by the common function: _dl_runtime_resolve. Not sure what this function does, but to me this looks like the first time the various shared libraries are being used, and are then being loaded in to memory by the ld.
Is this a correct assumption? That the binary will not load the shared libraries in to memory until they are being prepped for use, therefore we will see a massive slowdown on the first message, but on none of the subsequent?
How do we go about avoiding this?
Note: We operate on the microsecond scale.