views: 891
answers: 3

I'm trying to determine the "beats per minute" from real-time audio in C#. It's not music I'm detecting, though, just a constant tapping sound. My problem is determining the time between those taps so I can calculate "taps per minute." I have tried using the WaveIn.cs class that's out there, but I don't really understand how it's sampling. I'm not getting a set number of samples per second to analyze. I guess I just don't know how to read in an exact number of samples per second so I can work out the time between two samples.

Any help to get me in the right direction would be greatly appreciated.

+1  A: 
Robert Harvey
A: 

Assuming we're talking about the same WaveIn.cs, the constructor of WaveLib.WaveInRecorder takes a WaveLib.WaveFormat object as a parameter. This allows you to set the audio format, i.e. sample rate, bit depth, etc. Just scan the audio samples for peaks (or however you're detecting "taps") and record the average distance, in samples, between peaks.

Since you know the sample rate of the audio stream (e.g. 44100 samples/second), take your average peak distance (in samples), multiply by 1/(sample rate) to get the time (in seconds) between taps, divide by 60 to get the time (in minutes) between taps, and invert to get the taps/minute.
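
As a rough sketch of that math in C# (the sample rate and the peak positions below are made-up example values; use whatever your WaveLib.WaveFormat actually specifies):

    // Hypothetical example: positions (sample indices) where peaks were detected
    int sampleRate = 44100;                            // from your WaveFormat
    int[] peakPositions = { 0, 22050, 44100, 66150 };

    // Average distance, in samples, between consecutive peaks
    double totalGap = 0;
    for (int i = 1; i < peakPositions.Length; i++)
        totalGap += peakPositions[i] - peakPositions[i - 1];
    double avgGapSamples = totalGap / (peakPositions.Length - 1);

    double secondsPerTap = avgGapSamples / sampleRate; // multiply by 1/(sample rate)
    double tapsPerMinute = 60.0 / secondsPerTap;       // invert and scale to a minute
    Console.WriteLine(tapsPerMinute);                  // prints 120 for these values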

Hope that helps

Donnie DeBoer
+1  A: 

I'm not sure which WaveIn.cs class you're using, but usually with code that records audio, you either A) tell the code to start recording, and then at some later point you tell the code to stop, and you get back an array (usually of type short[]) that comprises the data recorded during this time period; or B) tell the code to start recording with a given buffer size, and as each buffer is filled, the code makes a callback to a method you've defined with a reference to the filled buffer, and this process continues until you tell it to stop recording.

Let's assume that your recording format is 16 bits (aka 2 bytes) per sample, 44100 samples per second, and mono (1 channel). In the case of (A), let's say you start recording and then stop recording exactly 10 seconds later. You will end up with a short[] array that is 441,000 (44,100 x 10) elements in length. I don't know what algorithm you're using to detect "taps", but let's say that you detect taps in this array at element 0, element 22,050, element 44,100, element 66,150 etc. This means you're finding taps every .5 seconds (because 22,050 is half of 44,100 samples per second), which means you have 2 taps per second and thus 120 BPM.

In the case of (B) let's say you start recording with a fixed buffer size of 44,100 samples (aka 1 second). As each buffer comes in, you find taps at element 0 and at element 22,050. By the same logic as above, you'll calculate 120 BPM.
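
Just to illustrate one (deliberately simple) way of finding those tap positions, here is a sketch that scans a 16-bit mono buffer for samples above an amplitude threshold; the Threshold and MinGapSamples values are made-up numbers you'd need to tune for your signal:

    // Hypothetical sketch: return the sample indices of taps in one buffer
    // by looking for samples that exceed an amplitude threshold.
    // Uses List<int> from System.Collections.Generic.
    const short Threshold = 10000;   // made-up value; tune for your tapping sound
    const int MinGapSamples = 4410;  // ignore re-triggers within 0.1 s at 44100 Hz

    static List<int> FindTaps(short[] buffer)
    {
        var taps = new List<int>();
        int i = 0;
        while (i < buffer.Length)
        {
            if (Math.Abs((int)buffer[i]) > Threshold)
            {
                taps.Add(i);            // sample index of this tap in the buffer
                i += MinGapSamples;     // skip ahead so one tap isn't counted twice
            }
            else
            {
                i++;
            }
        }
        return taps;
    }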

Hope this helps. With beat detection in general, it's best to record for a relatively long time and count the beats across a large array of data. Trying to estimate the "instantaneous" tempo is more difficult and prone to error, just as estimating the pitch of a recording is harder to do in real time than with a recording of a full note.

MusiGenesis
So if I do mono, each of those numbers in my array represents one sample for one channel? If I were to do 2 channels, would my array then be 88,200 in size, alternating between channels?
zac
Yes, stereo means you have twice as many samples per second, and the samples are interleaved (left, right, left, right etc.), so elements 0, 2, 4, 6 etc. represent data for the left channel, and elements 1, 3, 5, 7 etc. represent data for the right channel.
MusiGenesis
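
In code, splitting an interleaved stereo buffer into its two channels is just a matter of striding through the array by two. A quick sketch, assuming buffer is the interleaved short[] you recorded:

    // Split interleaved stereo samples (L, R, L, R, ...) into separate channels
    short[] left = new short[buffer.Length / 2];
    short[] right = new short[buffer.Length / 2];
    for (int i = 0; i < left.Length; i++)
    {
        left[i] = buffer[2 * i];        // even indices hold the left channel
        right[i] = buffer[2 * i + 1];   // odd indices hold the right channel
    }
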
Another question: when I do an FFT on this to get amplitude, it returns an array half the size of my original one. From what I have read, this is because it converts the signal into a real and an imaginary part and uses both of those to get the amplitude. What does each value in this array now account for? Two samples?
zac
It depends on what code you're using to do the FFT. For audio DSP, an FFT function usually takes two arrays, 1 for the real and 1 for the imaginary part. Before the FFT, the real array contains the recorded sample values, and the imaginary array is all zeroes. After the transform, the two arrays are the same size as before, but will both now contain different, non-zero values. Whatever code you're using is probably combining these two transformed arrays into a single, half-size array that contains the frequency components. CONTINUED...
MusiGenesis
The values in this array now contain what are usually called frequency "bins", where the number in each bin represents the magnitude of the frequency components in that range. If, for example, your original audio was recorded at 44100 Hz (normal CD audio) and your FFT window is, say, 1000 samples (giving a 500-element magnitude array), then each bin represents the magnitude of a 44.1 Hz slice of your original audio. So the value in element[0] represents the frequency content from 0 to 44.1 Hz, the value in element[1] represents the magnitude from 44.1 to 88.2 Hz, element[2] is 88.2 to 132.3 Hz, and so on.
MusiGenesis
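
To make the bin arithmetic concrete, here's a small sketch (not tied to any particular FFT library) that combines the transformed real and imaginary arrays, re and im, into magnitudes and notes which frequency range each bin covers:

    // Hypothetical sketch: re[] and im[] are the FFT output arrays (double[]),
    // each the same length as the FFT window. Only the first half is useful,
    // because it covers 0 Hz up to the Nyquist frequency (sampleRate / 2).
    double sampleRate = 44100.0;
    int fftSize = re.Length;                  // e.g. a 1000-sample window
    double binWidthHz = sampleRate / fftSize; // 44.1 Hz per bin in that case

    double[] magnitude = new double[fftSize / 2];
    for (int bin = 0; bin < magnitude.Length; bin++)
    {
        // Combine the real and imaginary parts into one magnitude per bin
        magnitude[bin] = Math.Sqrt(re[bin] * re[bin] + im[bin] * im[bin]);
        // This bin covers frequencies from (bin * binWidthHz)
        // up to ((bin + 1) * binWidthHz).
    }
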
Incidentally, this is why FFT is not a good approach for pitch detection. To make the "bins" narrow enough to read a pitch accurately, your FFT window has to be enormous, which means it will be very slow.
MusiGenesis
Go ahead and at least vote my answer up, you ungrateful bastard. :)
MusiGenesis
Wow, you're a genius with this sound analysis stuff. It's making sense to me now. So what am I looking at when I'm analyzing the raw array before doing an FFT on it? Will it still give me an amplitude to detect my tap with? Because I guess I would then need it to get the time between the taps.
zac
For detecting each tap in your recorded audio, you actually don't need to do an FFT at all (I wasn't quite sure why you mentioned it). Here is my answer to an earlier question about note onset detection (which is very similar to what you're trying to do): http://stackoverflow.com/questions/294468/note-onset-detection/294724#294724
MusiGenesis
Also, the term "genius" is a measure of accomplishment, not of aptitude or knowledge, and I haven't accomplished anything of significance. Thanks, though. :)
MusiGenesis