A:

I am working on a similar project. From everything I've read, and from personal experience, your best option is to work with small chunks of audio and send each one as soon as you can. Any jitter buffering should be done on the receiving side.

It is typical for a VoIP application to send 50-100 packets per second. For uLaw encoding at 8000Hz (one byte per sample, so 8000 bytes per second), this works out to a payload of 80-160 bytes per packet. The reasoning is that some packets will inevitably be dropped, and you want the impact on the receiver to be as small as possible. With 10ms or 20ms of audio data per packet, a dropped packet may cause a small hiccup, but nothing nearly as bad as losing 2k of audio data (~250ms).
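The arithmetic behind those numbers can be checked with a few lines. This is only a sketch in Python (not .NET), and the function names are made up for illustration; it assumes uLaw's one byte per sample and counts payload only, ignoring any packet headers.

```python
def payload_bytes(sample_rate_hz, bytes_per_sample, packets_per_second):
    """Audio payload carried by each packet at a constant packet rate."""
    return sample_rate_hz * bytes_per_sample // packets_per_second


def audio_ms(num_bytes, sample_rate_hz, bytes_per_sample):
    """How many milliseconds of audio a payload of num_bytes represents."""
    return 1000 * num_bytes / (sample_rate_hz * bytes_per_sample)


# uLaw at 8000Hz: one byte per sample.
for pps in (50, 100):
    size = payload_bytes(8000, 1, pps)
    print(pps, "packets/s:", size, "bytes,", audio_ms(size, 8000, 1), "ms each")

# Losing a single 2000-byte packet loses a quarter second of speech.
print("2000 bytes =", audio_ms(2000, 8000, 1), "ms")
```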

Additionally, with a large packet size, you must accumulate all of the data at the sender before sending it. So given a typical network latency of 50ms, with 20ms of audio data per packet, the receiver is not going to hear what the sender says for a minimum of 70ms. Now imagine what happens when 250ms of audio is being sent at once: at least 300ms will elapse between the sender speaking and the receiver playing that audio.
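As a rough sketch of that reasoning (Python, with a hypothetical function name), the minimum delay is just the time spent filling one packet plus the network latency. It deliberately leaves out encoding time and whatever the receiver's jitter buffer adds, so real numbers will be higher.

```python
def min_one_way_delay_ms(audio_per_packet_ms, network_latency_ms):
    """Lower bound on the delay between speaking and playback.

    The sender cannot transmit a packet until it has captured all of the
    audio in it, so the first sample waits a full packet duration before
    it even leaves the machine.
    """
    return audio_per_packet_ms + network_latency_ms


# 50ms network latency is the figure assumed in the text above.
for packet_ms in (10, 20, 250):
    print(packet_ms, "ms packets ->", min_one_way_delay_ms(packet_ms, 50), "ms minimum")
```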

Users seem to be more forgiving of packet loss here and there, which results in sub-par audio quality, because the audio quality of most telephones isn't that great to begin with. However, users are also used to very low latency on modern telephone circuits, so introducing a round-trip delay of even 250ms can be extremely frustrating.

Now, as far as implementing buffering goes, I have found it a good strategy to use a Queue (whoops, using .NET here :)) and wrap it in a class that tracks the desired minimum and maximum number of packets in the Queue. Use rigorous locking, since you will most likely be accessing it from multiple threads.

If the Queue "bottoms out" and has zero packets in it (buffer underrun), set a flag and return null until the packet count climbs back to your desired minimum. Your consumer will have to check for null and, when it gets one, not queue anything into the output buffer. Alternatively, your consumer could keep track of the last packet and repeatedly enqueue it, which may cause looping audio, but in some cases that may "sound" better than silence. Either way, you will have to keep doing this until the producer puts enough packets into the queue to reach the minimum. Waiting for the minimum results in a longer period of silence for the user, but that is generally better accepted than short, frequent periods of silence (choppiness).

If you get a burst of packets and the producer fills up the queue (reaching the desired maximum), you can either start ignoring the new packets or drop enough packets from the front of the queue to return to the minimum.
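Here is roughly what that wrapper could look like. It is a minimal sketch in Python rather than .NET, and the class and method names are invented for illustration. It makes two assumptions of its own: the buffer starts out in the "refilling" state so playback waits for the minimum at startup, and on overflow it takes the drop-from-the-front option rather than ignoring new packets.

```python
import threading
from collections import deque


class JitterBuffer:
    """Hypothetical receiver-side packet queue with min/max packet counts."""

    def __init__(self, min_packets, max_packets):
        if not 0 < min_packets <= max_packets:
            raise ValueError("need 0 < min_packets <= max_packets")
        self._min = min_packets
        self._max = max_packets
        self._queue = deque()
        self._lock = threading.Lock()
        # Assumption: treat startup like an underrun and wait for the minimum.
        self._refilling = True

    def put(self, packet):
        """Called by the producer (network) thread."""
        with self._lock:
            self._queue.append(packet)
            if len(self._queue) > self._max:
                # Burst: drop the oldest packets until we are back at the minimum.
                while len(self._queue) > self._min:
                    self._queue.popleft()
            if self._refilling and len(self._queue) >= self._min:
                self._refilling = False

    def get(self):
        """Called by the consumer (playback) thread; None means underrun."""
        with self._lock:
            if self._refilling:
                return None
            if not self._queue:
                # Bottomed out: set the flag and stay silent until refilled.
                self._refilling = True
                return None
            return self._queue.popleft()
```

The consumer calls `get()` once per playback interval. When it gets `None` it either writes nothing to the output buffer or repeats the last packet it played, as described above.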

Picking those min/max values is tough though. You are trying to balance smooth audio (no underruns) with minimum latency between sender and receiver. VoIP is fun but it sure can be frustrating! Good luck!

Brad
Thanks for your input. I like it!
Xiphias3