Hi. Given limited resources (a slower CPU, and tight limits on code size and RAM), what is the best way to detect the pitch of a musical note, similar to what an electronic or software tuner would do?

Should I use:

  • Kiss FFT
  • FFTW
  • Discrete Wavelet Transform
  • autocorrelation
  • zero crossing analysis
  • octave-spaced filters

Or something else?

In a nutshell, what I am trying to do is to recognize a single musical note, from two octaves below middle C to two octaves above, played on any (reasonable) instrument. I'd like to be within 20% of a semitone - in other words, if the user plays too flat or too sharp, I need to be able to tell. However, I will not need the accuracy required for tuning.

Thank you in advance.

-mz

+2  A: 

In my project danstuner, I took code from Audacity. It essentially takes an FFT, then finds the peak power by fitting a cubic curve to the FFT magnitude and locating the peak of that curve. It works pretty well, although I had to guard against octave-jumping.
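
For illustration, here is a minimal sketch of that general idea in Python/NumPy. It refines the strongest FFT bin with parabolic (rather than cubic) interpolation, a common simplification; it is not the actual Audacity code, and the function name is made up:

    import numpy as np

    def fft_peak_pitch(x, fs):
        # Window the frame and take the magnitude spectrum
        spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
        k = int(np.argmax(spectrum))
        # Refine the peak location by fitting a parabola through the
        # peak bin and its neighbors (log magnitude fits peaky shapes better)
        if 0 < k < len(spectrum) - 1:
            a, b, c = np.log(spectrum[k - 1:k + 2])
            k += 0.5 * (a - c) / (a - 2 * b + c)
        return k * fs / len(x)  # (fractional) bin index -> Hz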

See Spectrum.cpp.

dfrankow
Audacity might have something slicker now; that was quite a while ago.
dfrankow
Actually, I think the code computed the autocorrelation using the FFT. Wow, it's been a long time.
dfrankow
+5  A: 

If you want to do pitch recognition in realtime (and accurate to within 1/100 of a semi-tone), your only real hope is the zero-crossing approach. And it's a faint hope, sorry to say. Zero-crossing can estimate pitch from just a couple of wavelengths of data, and it can be done with a smartphone's processing power, but it's not especially accurate, as tiny errors in measuring the wavelengths result in large errors in the estimated frequency. Devices like guitar synthesizers (which deduce the pitch from a guitar string with just a couple of wavelengths) work by quantizing the measurements to notes of the scale. This may work for your purposes, but be aware that zero-crossing works great with simple waveforms, but tends to work less and less well with more complex instrument sounds.
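
For concreteness, here is a naive zero-crossing estimator in Python/NumPy. It is only a sketch of the basic idea; as warned above, it degrades quickly on complex instrument sounds:

    import numpy as np

    def zero_crossing_pitch(x, fs):
        # Sample indices where the signal goes from negative to non-negative
        neg = np.signbit(x)
        rising = np.nonzero(neg[:-1] & ~neg[1:])[0]
        if len(rising) < 2:
            return None  # not enough crossings to estimate a period
        # Average period between the first and last rising crossings
        return fs * (len(rising) - 1) / (rising[-1] - rising[0])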

In my application (a software synthesizer that runs on smartphones) I use recordings of single instrument notes as the raw material for wavetable synthesis, and in order to produce notes at a particular pitch, I need to know the fundamental pitch of a recording, accurate to within 1/1000 of a semi-tone (I really only need 1/100 accuracy, but I'm OCD about this). The zero-crossing approach is much too inaccurate for this, and FFT-based approaches are either way too inaccurate or way too slow (or both sometimes).

The best approach that I've found in this case is to use autocorrelation. With autocorrelation you essentially guess the pitch and then measure the autocorrelation of your sample at the corresponding wavelength. By scanning through the range of plausible pitches (say A = 55 Hz through A = 880 Hz) in semitone steps, I locate the most-correlated pitch, then do a finer-grained scan in the neighborhood of that pitch to get a more accurate value.
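
A rough sketch of that scan in Python/NumPy. The normalization, the 10-cent fine step, and the function names are illustrative choices, not the answerer's actual code:

    import numpy as np

    def correlation_at(x, lag):
        # Normalized autocorrelation of x against itself shifted by `lag` samples
        a, b = x[:-lag], x[lag:]
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    def scan_pitch(x, fs, lo=55.0, hi=880.0):
        # Coarse scan: one candidate pitch per semitone from lo to hi
        steps = int(round(12 * np.log2(hi / lo)))
        coarse = lo * 2.0 ** (np.arange(steps + 1) / 12.0)
        best = max(coarse, key=lambda f: correlation_at(x, int(round(fs / f))))
        # Fine scan: +/- 50 cents around the winner, in 10-cent steps
        fine = best * 2.0 ** (np.arange(-50, 51, 10) / 1200.0)
        return max(fine, key=lambda f: correlation_at(x, int(round(fs / f))))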

The approach best for you depends entirely on what you're trying to use this for.

MusiGenesis
Interesting answer. One thing confuses me, though... isn't autocorrelation O(N^2) while the FFT is O(N log N)? That is, to compute a fast autocorrelation, people often use an FFT as the first step. So how is the FFT too slow and inaccurate while autocorrelation works? Is it because you can effectively use a shorter segment of data with autocorrelation?
tom10
@tom10: usually the relative performance calculations are for non-musical applications. For pitch detection, you can take advantage of the fact that the pitch is *probably* in the range of 55-880 Hz, instead of 0-44100 Hz, so you don't have to scan/autocorrelate through the full theoretical range. Also, there's a simple mathematical trick you can do that makes a set of scans take much less time when done together than the sum of times if they're done individually.
MusiGenesis
You *can* use FFT to get in the neighborhood of the fundamental, and then scan/autocorrelate to get a more precise measurement. So far, I've found my method (pure autocorrelation) to be slightly faster, but I recently had an insight (thanks to an SO question) into a way to speed up my FFT code, so I may move to this approach soon myself. Ultimately, however, there is a definite curve to the FFT approach where speed goes down as accuracy increases, so autocorrelation has to be involved at some point.
MusiGenesis
@MusiGenesis: Thanks for the info. Very interesting!
tom10
@MusiGenesis: Thanks. Can you please tell me what the simple mathematical trick is that you refer to above?
mahboudz
+3  A: 

Zero crossing won't work, because a typical sound contains harmonics, so the waveform crosses zero far more often than the fundamental frequency alone would account for.

Something I experimented with (as a home side project) was this:

  1. Sample the sound with ADC at whatever sample rate you need.
  2. Detect the levels of the short-term positive and negative peaks of the waveform (sliding window or similar). I.e. an envelope detector.
  3. Make a square wave that goes high when the waveform goes within 90% (or so) of the positive envelope, and goes low when the waveform goes within 90% of the negative envelope. I.e. a tracking square wave with hysteresis.
  4. Measure the frequency of that square wave with straightforward count/time calculations, using as many samples as you need to get the required accuracy (a rough sketch follows this list).
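
A rough Python sketch of steps 1-4. The decay constant and the loop structure are guesses at one possible implementation, not the original experiment's code, and it assumes a roughly zero-mean input signal:

    def square_wave_pitch(samples, fs, decay=0.999):
        pos_env, neg_env = 0.0, 0.0
        high = False     # state of the tracking square wave
        rising = []      # sample indices of low -> high transitions
        for i, s in enumerate(samples):
            # Step 2: crude decaying peak-followers for each envelope
            pos_env = max(s, pos_env * decay)
            neg_env = min(s, neg_env * decay)
            # Step 3: square wave with ~90% hysteresis thresholds
            if not high and s >= 0.9 * pos_env:
                high = True
                rising.append(i)
            elif high and s <= 0.9 * neg_env:
                high = False
        # Step 4: frequency = cycles counted / time elapsed
        if len(rising) < 2:
            return None
        return fs * (len(rising) - 1) / (rising[-1] - rising[0])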

However, I found that with inputs from my electronic keyboard, for some instrument sounds it picked up 2× the base frequency (the next octave up). This was a side project, and I never got around to implementing a solution before moving on to other things. But I thought it had promise, with a much lower CPU load than an FFT.

Craig McQueen
+3  A: 

I'm not familiar with all the methods you mention, but what you choose should depend primarily on the nature of your input data. Are you analysing pure tones, or does your input source have multiple notes? Is speech a feature of your input? Are there any limitations on the length of time you have to sample the input? Are you able to trade off some accuracy for speed?

To some extent what you choose also depends on whether you would like to perform your calculations in time or in frequency space. Converting a time series to a frequency representation takes time, but in my experience tends to give better results.

Autocorrelation compares two signals in the time domain. A naive implementation is simple but relatively expensive to compute, as it requires pair-wise multiplication of all points in the original and time-shifted signals at every candidate lag, followed by a search through the resulting function for the peak corresponding to the fundamental period. There are alternative methods. For example, Average Magnitude Differencing replaces the multiplications with absolute differences (the fundamental then appears as a minimum rather than a peak); it is a very cheap form of autocorrelation, but accuracy suffers. All autocorrelation techniques run the risk of octave errors, since peaks other than the fundamental exist in the function.
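
A sketch of the AMDF variant in Python/NumPy; the search band is simply the question's rough pitch range (an illustrative choice), and the fundamental period shows up as the deepest dip:

    import numpy as np

    def amdf_pitch(x, fs, fmin=65.0, fmax=1050.0):
        # Only evaluate lags corresponding to plausible periods
        lags = np.arange(int(fs / fmax), int(fs / fmin) + 1)
        # Average |difference| between the signal and its shifted copy
        d = [np.mean(np.abs(x[lag:] - x[:-lag])) for lag in lags]
        return fs / lags[int(np.argmin(d))]  # deepest dip -> period -> Hz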

Measuring zero-crossing points is simple and straightforward, but will run into problems if you have multiple waveforms present in the signal.

In frequency-space, techniques based on FFT may be efficient enough for your purposes. One example is the harmonic product spectrum technique, which compares the power spectrum of the signal with downsampled versions at each harmonic, and identifies the pitch by multiplying the spectra together to produce a clear peak.
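
A compact harmonic product spectrum in Python/NumPy; using five harmonics is a typical choice, not something prescribed above:

    import numpy as np

    def hps_pitch(x, fs, n_harmonics=5):
        spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
        hps = spectrum.copy()
        for h in range(2, n_harmonics + 1):
            down = spectrum[::h]          # spectrum downsampled by factor h
            hps[:len(down)] *= down       # harmonics pile onto the fundamental
        # Only the first bins saw a contribution from every factor
        k = int(np.argmax(hps[:len(spectrum) // n_harmonics]))
        return k * fs / len(x)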

As ever, there is no substitute for testing and profiling several techniques, to empirically determine what will work best for your problem and constraints.

An answer like this can only scratch the surface of this topic.

ire_and_curses
+4  A: 

If you don't need that much accuracy, a simple FFT would be sufficient. Window the chunk of audio first so that you get well-defined peaks, then find the first significant peak.

Bin width = sampling rate / FFT size. Fundamentals range from 20 Hz to 7 kHz, so a sampling rate of 14 kHz would be enough. The next "standard" sampling rate is 22050 Hz.

This is a range of 8 octaves, so if you want each bin to be 20% of a semitone, 480 bins would be enough. A 512-point FFT is the next power of two up, which takes my laptop's processor less than 32 µs to calculate, for 23 ms of data.

This won't work with signals that have a "missing fundamental", but if you don't care about timpani, you'll be fine. Finding the "first significant" peak is somewhat difficult (since harmonics are often stronger than the fundamental), but you can figure out a method that suits your situation.

Autocorrelation and harmonic product spectrum are better at finding the true fundamental of a wave instead of one of its harmonics, but I don't think they deal as well with inharmonicity, and many instruments, such as piano and guitar, are inharmonic (their harmonics are slightly sharp of exact integer multiples). It really depends on your circumstances, though.

Also, you can save even more processor cycles by computing only within a specific frequency band of interest, using the Chirp-Z transform.
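
For example, with SciPy (scipy.signal.czt, available in SciPy 1.8 and later), a "zoomed" spectrum over roughly the question's C2-C6 band might look like the sketch below; the band edges and point count are illustrative:

    import numpy as np
    from scipy.signal import czt  # SciPy >= 1.8

    def zoom_spectrum(x, fs, fmin=65.0, fmax=1050.0, m=512):
        # Evaluate m spectrum points between fmin and fmax only,
        # instead of the full 0..fs/2 range a plain FFT would cover
        step = (fmax - fmin) / m                  # Hz between output points
        w = np.exp(-2j * np.pi * step / fs)       # ratio between successive points
        a = np.exp(2j * np.pi * fmin / fs)        # starting point on the unit circle
        spec = np.abs(czt(x * np.hanning(len(x)), m=m, w=w, a=a))
        freqs = fmin + step * np.arange(m)
        return freqs, spec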

I've written up a few different methods in Python for comparison purposes.

endolith
This is very informative. Thank you. I think for now I will only care about 4 octaves, with half an octave on each side, for a total of 5 octaves. Given what you are telling me, I might be able to do 6 or 8 without any trouble.
mahboudz
+1  A: 

Doesn't endolith's answer grossly underestimate the number of bins required? FFT bins are linear in frequency, right? So getting 20%-of-a-semitone accuracy at the lower end of the spectrum, say 100 Hz, would require a bin width of around one hertz. Four octaves up from 100 Hz is 1600 Hz, so you would need 1500 bins. Or am I missing something? Entirely possible...

tml