I have an array of 320 int16 elements that represents 20 ms of an audio signal (16-bit LPCM). I am looking for the simplest and fastest method to decide whether this array contains active audio (such as speech or music) rather than noise or silence. The decision does not need to be very accurate, but it must be very fast.

My first idea was to sum the squares or the absolute values of the elements and compare the sum with a threshold, but that is very slow on my system, even though it is O(n).

+3  A: 

You're not going to get much faster than a sum-of-squares approach.

One optimization that you may not be using yet is a running total. That is, in each time step, instead of summing the squares of the last n samples, keep a running total and update it with the square of the most recent sample. To keep the running total from growing without bound, add an exponential decay. In pseudocode:

decay_constant=0.999;  // Some suitable value smaller than 1
total=0;
for t=1,...
    // Exponential decay
    total=total*decay_constant;

    // Add in the square of the latest sample
    total+=current_sample*current_sample;

    if total>threshold
        // do something
    end
end

Of course, you'll have to tune the decay constant and threshold to suit your application. If this isn't fast enough to run in real time, you have a seriously underpowered DSP...
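
A minimal runnable sketch of the pseudocode above, in Python for readability (the loop translates directly to C). The function name `running_energy_flags` and the default threshold are hypothetical placeholders, not tuned values. With a decay of 0.999 the total effectively remembers about 1/(1 - 0.999) = 1000 samples, which is longer than one 320-sample frame.

```python
def running_energy_flags(samples, decay=0.999, threshold=1.0e6):
    """Return one bool per sample: True while the decayed energy exceeds threshold.

    decay and threshold are illustrative defaults (assumptions); tune them.
    """
    total = 0.0
    flags = []
    for s in samples:
        # Exponential decay of the old energy, then add the newest squared sample.
        total = total * decay + s * s
        flags.append(total > threshold)
    return flags


# Two synthetic 20 ms frames: digital silence and a loud square wave.
silence = [0] * 320
tone = [8000 if (i // 8) % 2 == 0 else -8000 for i in range(320)]
```

Calling `running_energy_flags(silence)` should yield only `False` values, while the square wave crosses the placeholder threshold on its first sample.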

Martin B
Don't forget to add a simple filter to suppress high-frequency noise. A low-pass filter can be as simple as 'remembering' the previous sample, averaging it with the current one, and using the result instead of the raw sample. Very fast and very effective.
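
A sketch of that two-sample average, assuming the filter state starts at zero (silence before the first sample); the name `smooth` is hypothetical. This filter cancels a component at exactly half the sample rate and attenuates lower frequencies progressively less, so it is a gentle low-pass rather than a sharp one.

```python
def smooth(samples):
    """Average each sample with its predecessor (two-tap moving average)."""
    out = []
    prev = 0  # assumption: silence before the first sample
    for s in samples:
        # Right shift by one is floor division by 2; cheap on integer hardware.
        out.append((s + prev) >> 1)
        prev = s
    return out
```

An input alternating between +1000 and -1000 comes out as zeros after the first sample, while a constant input passes through unchanged once the filter has settled.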
Toad
+2  A: 

You might try calculating two simple "statistics". The first is the spread (max - min); silence will have a very low spread. The second is variety: divide the range of possible values into, say, 16 brackets (sub-ranges of values) and, as you go through the elements, determine which bracket each element falls into. Noise will have similar counts in all brackets, whereas music or speech should prefer some of them while neglecting others.

This should be possible in a single pass through the array, and you do not need complicated arithmetic, just some additions and comparisons.

Also consider an approximation, for example taking only every fourth value, which reduces the number of checked elements to 80. For an audio signal, this should be okay.
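
A single-pass sketch of both statistics with the every-fourth-sample approximation. The function name `spread_and_histogram` is hypothetical, and the bracket layout (16 equal brackets over the full int16 range, selected with a shift) is an assumption. How to turn the counts into a decision is left open here.

```python
def spread_and_histogram(frame, step=4):
    """Return (max - min, counts per bracket) using every `step`-th sample."""
    lo = hi = frame[0]
    counts = [0] * 16
    for i in range(0, len(frame), step):
        s = frame[i]
        if s < lo:
            lo = s
        if s > hi:
            hi = s
        # Map -32768..32767 to 0..65535, then keep the top 4 bits (16 brackets).
        counts[(s + 32768) >> 12] += 1
    return hi - lo, counts
```

For a frame of 320 zeros this returns a spread of 0 with all 80 inspected samples in bracket 8, the bracket that starts at zero.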

PeterK
A: 

Clearly, the complexity will be at least O(n). Simple algorithms that compute a value range are probably good enough for now, but I would search the web for Voice Activity Detection (VAD) and related code samples.

Iulian Şerbănoiu
A: 

I did something like this a while back. After some experimentation I arrived at a solution that worked sufficiently well in my case.

I used the rate of change in the cube of the running average over about 120 ms. When there is silence (that is, only noise), the expression should hover around zero. As soon as the rate keeps increasing over a couple of runs, you probably have some action going on.


rate = cur_avg^3 - prev_avg^3

I used a cube because the square just wasn't aggressive enough. If the cube is too slow for you, try using the square and a bit shift instead. Hope this helps.
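
One possible reading of this as code. The answer does not say what is being averaged, so this sketch assumes the mean absolute sample value per 20 ms frame, averaged over the last 6 frames (about 120 ms). The class name `CubeRateDetector` and its parameters are hypothetical.

```python
from collections import deque


class CubeRateDetector:
    """Track cur_avg^3 - prev_avg^3 across successive frames."""

    def __init__(self, frames=6):
        # Assumption: 6 frames of 20 ms each approximate the 120 ms window.
        self.levels = deque(maxlen=frames)
        self.prev_cube = 0.0

    def update(self, frame):
        # Assumption: the frame level is the mean absolute sample value.
        level = sum(abs(s) for s in frame) / len(frame)
        self.levels.append(level)
        avg = sum(self.levels) / len(self.levels)
        cube = avg ** 3
        rate = cube - self.prev_cube
        self.prev_cube = cube
        return rate
```

Feeding one frame per call, a caller would watch for `update` returning clearly positive values several times in a row before declaring activity.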

manneorama
A: 

If you just look at the squared norm of the buffer, you'll trigger on constant signals, so I'd subtract off the mean before summing the squares (i.e. signal power). Use a circular buffer and adjust its size and the frequency of calculation for accuracy vs. responsiveness. It shouldn't be slow to do this with 16-bit integers.
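
A per-frame sketch of mean-removed power, without the circular buffer mentioned above. It relies on the identity sum((x - mean)^2) = sum(x^2) - (sum(x))^2 / n, so one pass over the samples is enough. The function name `ac_power` is hypothetical. In C, the accumulators would need to be wider than 32 bits, since n * sum(x^2) can exceed that range.

```python
def ac_power(frame):
    """Mean of squared deviations from the frame mean, computed in one pass."""
    n = len(frame)
    total = 0
    total_sq = 0
    for s in frame:
        total += s
        total_sq += s * s
    # n*sum(x^2) - (sum x)^2 equals n^2 times the variance; stay in integers
    # until the final division.
    return (n * total_sq - total * total) / (n * n)
```

A constant frame such as `[1000] * 320` gives 0.0, so a DC offset alone no longer triggers the detector, while a full-swing alternation between +1000 and -1000 gives 1000000.0.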

eryksun