+12  Q:

## Algorithms for determining the key of an audio sample

Hi,

I am interested in determining the musical key of an audio sample. How would (or could) an algorithm go about trying to approximate the key of a musical audio sample?

Antares Autotune and Melodyne are two pieces of software that do this sort of thing.

Can anyone give a bit of a layman's explanation of how this would work: mathematically deducing the key of a song by analysing the frequency spectrum for chord progressions and so on?

This topic interests me a lot!

Edit - brilliant sources and a wealth of information to be found in the contributions from everyone who answered this question.

Especially from: the_mandrill and Daniel Brückner.

+3  A:

As far as I can tell from this article, various keys each have their own characteristic frequencies, so such software likely analyzes the audio sample to detect the most common notes and chords. After all, multiple keys can share the same configuration of sharps and flats, the difference being the note the key starts on and thus the chords those keys contain. So counting how often the significant notes and chords appear seems like the only real way you could figure that sort of thing out. I don't think you can get a layman's explanation of the actual mathematical formulas without leaving out a lot of information.

Do note that this is coming from somebody who has absolutely no experience in this area, with his first exposure being the article linked in this answer.

That's a brilliant article and right on the money! Thanks.
You're quite welcome.
+1 for the amazing article.
+1  A:

It's a complex topic, but a simple algorithm for determining a single key (single note) would look like this:

Do a Fourier transform on, say, 4096 samples (the exact size depends on your resolution requirements) of a part of the recording that contains the note. Then find the power peak in the spectrum - this is the frequency of the note.
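A minimal Python sketch of that FFT approach, assuming NumPy is available (the function name and the 4096-sample window are illustrative, and as the comments note, the strongest peak is not always the perceived pitch):

```python
import numpy as np

def detect_note_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest spectral peak.

    A minimal sketch of the FFT approach described above; real
    pitch detection needs more care than taking the biggest peak.
    """
    n = len(samples)
    window = np.hanning(n)               # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(samples * window))
    peak_bin = int(np.argmax(spectrum))  # strongest peak - not always the pitch!
    return peak_bin * sample_rate / n    # bin index -> frequency in Hz

# Synthetic check: a 440 Hz sine sampled at 44.1 kHz
sr = 44100
t = np.arange(4096) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
print(detect_note_frequency(tone, sr))   # close to 440 Hz (within one FFT bin)
```

Note the resolution limit: with 4096 samples at 44.1 kHz, each bin is about 10.8 Hz wide, so the answer is only accurate to within one bin.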

Things get trickier if you have a chord, different instruments/effects, or a non-homophonic musical pattern.

Yes I think you'd need a fairly clean sample to work with. Plus one that fits with Western tonal structures too of course. Good answer, many thanks.
Frequency of a peak != pitch, at least for musical instruments. Better to use one of the popular pitch detection algorithms.
@Paul R - yes, I've seen that the perception of *volume* of a pitch is determined by its frequency, not by some other measure. This also confuses me a bit though.
@AlexW: pitch is a *percept* rather than an actual physical quantity, but it's usually quite close to the fundamental frequency of the note being played. In some instruments though the fundamental frequency may be of quite low amplitude, or even missing altogether, hence the need to use a proper pitch detection algorithm rather than a power spectrum.
+1  A:

You can use the Fourier Transform to calculate the frequency spectrum from an audio sample. From this output, you can use the frequency values for particular notes to turn this into a list of notes heard during the sample. Choosing the strongest notes heard per sample over a series of samples should give you a decent map of the different notes used, which you can compare to the different musical scales to get a list of the possible scales that contain that combination of notes.

To help decide which particular scale is being used, make a note (no pun intended) of the most frequently heard notes. In Western music, the root of the scale is typically the most common note heard, followed by the fifth, and then the fourth. You can also look for patterns such as common chords, arpeggios, or progressions.

Sample size will probably be important here. Ideally, each sample will be a single note (so that you don't get two chords in one sample). If you filter out and concentrate on the low frequencies, you may be able to use the volume spikes ("clicks") normally associated with percussion instruments in order to determine the song's tempo and "lock" your algorithm to the beat of the music. Start with samples that are a half-beat in length and adjust from there. Be prepared to throw out some samples that don't have a lot of useful data (such as a sample taken in the middle of a slide).
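A rough Python sketch of this note-to-scale matching, assuming equal temperament with A4 = 440 Hz (the helper names are made up for illustration, and only major scales are checked):

```python
import numpy as np

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
MAJOR = {0, 2, 4, 5, 7, 9, 11}  # semitone offsets of a major scale

def pitch_class(freq):
    """Map a frequency in Hz to a pitch class 0-11 (C=0), assuming A4 = 440 Hz."""
    midi = int(round(69 + 12 * np.log2(freq / 440.0)))
    return midi % 12

def candidate_major_keys(freqs):
    """Return the major keys whose scale contains every observed pitch class."""
    observed = {pitch_class(f) for f in freqs}
    keys = []
    for root in range(12):
        scale = {(root + step) % 12 for step in MAJOR}
        if observed <= scale:
            keys.append(NOTE_NAMES[root])
    return keys

# The notes C, E, G, B fit more than one major scale:
print(candidate_major_keys([261.63, 329.63, 392.00, 493.88]))  # ['C', 'G']
```

Note the ambiguity in the result: that is where counting which note occurs most often (the root, as described above) comes in to break the tie.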

It's not that easy to extract pitch from a power spectrum - there are much better pitch detection algorithms.
The whole process is a complex one, but very interesting. Chords, I think, create a lot of complexity, as they generate their own resonances and harmonic frequencies that must be very difficult to account for in an algorithm!
@AlexW- Yes harmonic resonance is present, but it appears at a much lower magnitude than the chord itself. If you know the chord, you can predict the harmonics that might be heard and filter them out of your results accordingly.
@bta yes that's true. Going by the material generated from this page, it's an all-round tricky task. Maybe if you can strip away unnecessary artefacts from the music it would be easier to determine the key (e.g. adding a bandpass filter first to get rid of very high and low frequencies).
@AlexW- I would recommend starting with something recorded as a series of electronic tones (from an electronic keyboard, perhaps). Simple tones are much easier to work with, and once you get the hang of that you can slowly move to more and more complex sounds. Real-world instruments (and to a greater degree, voices) are a complex combination of sounds and are much tougher to crack; if you are targeting a specific instrument, it helps if you can filter out anything outside that instrument's range.
@bta - good idea. :)
+1  A:

First you need a pitch detection algorithm (e.g. autocorrelation).

You can then use your pitch detection algorithm to extract the pitch over a number of short time windows. After that you would need to see which musical key the sampled pitches best fit.
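A minimal autocorrelation pitch detector might look like this in Python (a sketch under simplifying assumptions; real detectors such as YIN add normalisation and parabolic interpolation around the chosen lag):

```python
import numpy as np

def autocorrelation_pitch(samples, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate pitch by finding the lag at which the signal best
    matches a shifted copy of itself."""
    samples = samples - np.mean(samples)
    corr = np.correlate(samples, samples, mode='full')
    corr = corr[len(corr) // 2:]          # keep non-negative lags only
    lo = int(sample_rate / fmax)          # smallest plausible period in samples
    hi = int(sample_rate / fmin)          # largest plausible period in samples
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sample_rate / lag              # period in samples -> frequency in Hz

# Synthetic check: a 220 Hz sine sampled at 44.1 kHz
sr = 44100
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
print(autocorrelation_pitch(tone, sr))    # close to 220 Hz
```

Restricting the lag search to a plausible pitch range (here 50-1000 Hz) is what keeps the trivial peak at lag zero, and sub-harmonic peaks at multiples of the period, from winning.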

I'm not sure this could work on chords as you are hearing many pitches at once.
@AlexW: yes, chords are going to be tricky - you are going to want to sample the more melodic and monophonic parts of the music.
At the moment this is, to be honest, a rather vague notion. The maths is intimidating, but it's important to remember that tools exist to handle the 'boilerplate' Fourier transforms. It's a case of understanding the data and experimenting with algorithms.
+9  A:

It's worth being aware that this is a very tricky problem and if you don't have a background in signal processing (or an interest in learning about it) then you have a very frustrating time ahead of you. If you're expecting to throw a couple of FFTs at the problem then you won't get very far. I hope you do have the interest as it is a really fascinating area.

Initially there is the problem of pitch recognition, which is reasonably easy to do for simple monophonic instruments (eg voice) using a method such as autocorrelation or the harmonic sum spectrum (eg see Paul R's link). However, you'll often find that this gives the wrong results: you'll often get half or double the pitch you were expecting. These are called pitch-period doubling or octave errors, and they occur essentially because the FFT or autocorrelation assumes that the data has constant characteristics over time. If you have an instrument played by a human there will always be some variation.

Some people approach the problem of key recognition as being a matter of doing the pitch recognition first and then finding the key from the sequence of pitches. This is incredibly difficult if you have anything other than a monophonic sequence of pitches. If you do have a monophonic sequence of pitches then it's still not a clear cut method of determining the key: how you deal with chromatic notes, for instance, or determining whether it's major or minor. So you'd need to use a method similar to Krumhansl's key finding algorithm.

So, given the complexity of this approach, an alternative is to look at all the notes being played at the same time. If you have chords, or more than one instrument, then you're going to have a rich spectral soup of many sinusoids playing at once. Each individual note is comprised of multiple harmonics of a fundamental frequency, so A (at 440Hz) will be comprised of sinusoids at 440, 880, 1320... Furthermore, if you play an E (see this diagram for pitches) then that is 659.25Hz, which is almost one and a half times that of A (actually 1.498). This means that every 3rd harmonic of A coincides with every 2nd harmonic of E. This is the reason that chords sound pleasant: they share harmonics. (As an aside, the whole reason that western harmony works is due to the quirk of fate that 2 to the power 7/12 is nearly 1.5.)
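The arithmetic behind that harmonic coincidence can be checked in a couple of lines (a quick illustration, not part of any algorithm):

```python
# Equal temperament: a perfect fifth is 7 semitones, i.e. a ratio of 2**(7/12)
A4 = 440.0
E5 = A4 * 2 ** (7 / 12)     # the fifth above A4, ~659.26 Hz

print(2 ** (7 / 12))        # ~1.498 - the "nearly 1.5" quirk mentioned above
print(A4 * 3, E5 * 2)       # 3rd harmonic of A vs 2nd harmonic of E: ~1.5 Hz apart
```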

If you look beyond this interval of a 5th to major, minor and other chords then you'll find other ratios. I think many key-finding techniques enumerate these ratios and then fill a histogram entry for each spectral peak in the signal. So in the case of detecting the chord A5 you would expect to find peaks at 440, 880, 659, 1320, 1760 and 1977 Hz; for B5 it'll be 494, 988, 740, etc. So create a frequency histogram and, for every sinusoidal peak in the signal (eg from the FFT power spectrum), increment the corresponding histogram entry. Then for each key A-G tally up the bins in your histogram, and the one with the most entries is most likely your key.

That's just a very simple approach but may be enough to find the key of a strummed or sustained chord. You'd also have to chop the signal into small intervals (eg 20ms) and analyse each one to build up a more robust estimate.
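A simplified sketch of such a tally in Python, assuming NumPy and equal temperament with A4 = 440 Hz. Rather than enumerating per-chord ratios, it folds each spectral peak into a 12-bin pitch-class histogram and scores each major key by the energy landing on its scale; all the names here are illustrative:

```python
import numpy as np

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
MAJOR_STEPS = (0, 2, 4, 5, 7, 9, 11)   # semitone offsets of a major scale

def key_histogram_vote(peak_freqs, peak_mags):
    """Fold spectral peaks into a 12-bin pitch-class histogram, then
    score every major key by the total energy falling on its scale.
    Harmonics of in-key notes mostly land on in-key pitch classes,
    so the right key tends to collect the most energy."""
    hist = np.zeros(12)
    for f, m in zip(peak_freqs, peak_mags):
        pc = int(round(69 + 12 * np.log2(f / 440.0))) % 12
        hist[pc] += m
    scores = {NOTE_NAMES[r]: sum(hist[(r + s) % 12] for s in MAJOR_STEPS)
              for r in range(12)}
    return max(scores, key=scores.get)

# Made-up peaks covering the A major scale (A B C# D E F# G#),
# with the root and fifth strongest, as discussed elsewhere on this page:
freqs = [440.0, 493.88, 554.37, 587.33, 659.25, 739.99, 830.61]
mags  = [2.0,   0.8,    1.0,    0.9,    1.2,    0.7,    0.6]
print(key_histogram_vote(freqs, mags))  # 'A'
```

With only a single chord the vote is often ambiguous (an A major triad fits the keys of A, D and E equally well), which is why accumulating over many short analysis windows matters.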

EDIT:
If you want to experiment then I'd suggest downloading a package like Octave or CLAM which makes it easier to visualise audio data and run FFTs and other operations.

• My PhD thesis on some aspects of pitch recognition -- the maths is a bit heavy going but chapter 2 is (I hope) quite an accessible introduction to the different approaches of modelling musical audio
• http://en.wikipedia.org/wiki/Auditory_scene_analysis -- Bregman's Auditory Scene analysis which though not talking about music has some fascinating findings about how we perceive complex scenes
• Dan Ellis has done some great papers in this and similar areas
• Keith Martin has some interesting approaches
Chordata looks pretty neat.
A good pitch detection algorithm should not be detecting chords or determining "major or minor". It should be detecting individual notes. This is how an ear with absolute pitch works (I have the ability + musical education) - I do not hear "C major chord". I hear C+E+G and then determine that it is, indeed, a C major chord. Even if you sit on the piano keyboard or press a combo of random keys (like C+Cis+D+Fis+G+Bes+B), I will still be able to name every note, although it will not be a "chord". This is because (my) ear does not operate on chords or tonalities. It operates on notes.
The ear detects individual frequencies (to be more precise, the brain does the analysis) and maps them to chromatic note names (C, Cis/Des, D, etc). After that the combination of notes can be analyzed and recognized as some kind of chord, and you'll be able to guess the tonality. I believe computer tone detection should work in a similar way. And another thing - the easiest way to detect keys or chords will probably be processing a signal histogram, because every key or chord will be visible as a "blip" on the histogram at certain frequencies.
@SigTerm: the problem isn't as clear cut as you make out. When there are multiple instruments playing (and in particular for orchestral scores) it's simply not possible to hear every single note, yet it can be simple to hear the chord. From a signal processing point of view the problem is ambiguous, since you have several instruments playing the same pitch, or at (almost) integer multiples thereof; therefore the signal from each instrument isn't orthogonal. I think it was one of Tangian's papers that showed a complex tone can be indistinguishable from a chord. (see above for link)
Besides, polyphonic pitch recognition is *incredibly* difficult (there are a handful of systems in the world that perform well) and is therefore unsuited to being a front end to a chord/key finding system.
I would start by going from a single note, recording frequency bands and analysing the frequencies of all 12 semitones. This would build a database of associated harmonics as well as the actual base frequency for that particular pitch. Once you have built an extensive database of harmonic and base-note frequencies, you might be able to take measurements from chords and approximate the note combinations based upon the earlier readings. This method may not work in a good enough time frame for real-time analysis, but it could once some extensive reference work has been performed on the music.
You could model the waveform used for frequency measurements using acoustically shaped oscillators, to mimic more natural harmonic structures and complexity.
@the_mandrill: "not possible to hear every single note" it IS possible to hear every single one. As I said, I'm not an expert in audio processing, but I've had more than enough musical training, and I can name notes by ear. With a very complex chord (12+ notes), picking out every note takes more time. With a simple chord (4+), it is instant, but you can always name them all. Also, chords can be voiced differently. The C major chord (C+E+G) can be C5+E5+G5, E5+G5+C6, C5+G5+E3, C4+C5+G5+E6, etc. Different notes, same harmony. This makes chord detection useless. You need to pick out individual notes.
@the_mandrill: With complex-sounding harmonics, when you're recognizing pitch by ear (and when you can't name all notes instantly), it goes like this: You concentrate on sound of one instrument, then for all currently "active" sounds of every instrument, you concentrate on individual notes and "name" them. Recognition of one note (ear) is instant. Not sure how brain does it, "concentrating" is probably equivalent to setting up filter sensitivity, and picking up individual notes probably equals to histogram scanning. Also, don't forget that it may be possible to use trained neural networks.
@the_mandrill: "polyphonic pitch recognition is incredibly difficult" difficult or not, it is the proper way to do it. A chord-detection system will screw up on non-standard music with dissonant chords: dodecaphonic music, for example, maybe even jazz.
@AlexW - if you only want to recognise one specific instrument then building up a harmonic database may work well enough. There was some work doing this for piano transcription that made explicit use of the fact that there is slight inharmonicity in the higher harmonics (as Daniel mentions).
@SigTerm: It's not always possible (or necessary) to hear every single note. A chord composed of C4+C5 may be indistinguishable from a complex tone at C4. The only reason you may be able to hear it as two notes is that you have a prior expectation of the harmonic structure of that particular instrument. If you construct it out of sine waves (which intrinsically is what you're detecting) then it can be impossible to tell apart. Similarly, C4+C5+G5 sounds just like a complex tone at C4. So the whole problem of chord recognition is ambiguous. See Terhardt's virtual pitch theory for more.
@the_mandrill: "A chord of C4+C5 may be indistinguishable from a complex tone at C4" Although it is possible theoretically, it doesn't match my musical experience; I haven't ever heard a sound like this from a real instrument. "expectation of the harmonic" In this case you'll need a solution that will rapidly train itself, classify frequency patterns, and associate them with notes. This obviously won't be a simple numeric algorithm; it will be slow, and the results will be probabilistic, not precise, but I think it can be done at least for one polyphonic instrument at a time. Or a harmonic database.
@SigTerm it's interesting that you have those differences in audible recognition of tones. The human brain has an incredible advantage with this sort of ability, to recognise the note from training and from musicality generally. Transferring this skill to a computer seems to test the best minds in the field.
+1 for the 2^(7/12) ~= 1.5 bit. I've been wondering about that for some time.
+3  A:

I worked on the problem of transcribing polyphonic CD recordings into scores for more than two years at university. The problem is notoriously hard. The first scientific papers related to it date back to the 1940s, and to this day there are no robust solutions for the general case.

All the basic assumptions you usually read about are not exactly right, and most are wrong enough to be unusable for anything but very simple scenarios.

The frequencies of overtones are not exact multiples of the fundamental frequency: there are non-linear effects, so the higher partials drift away from their expected frequencies, and not just by a few Hertz; it is not unusual to find the 7th partial where you expected the 6th.

Fourier transformations do not play nice with audio analysis because the frequencies one is interested in are spaced logarithmically while the Fourier transform yields linearly spaced frequencies. At low frequencies you need high frequency resolution to separate neighbouring pitches, but this yields bad time resolution and you lose the ability to separate notes played in quick succession.
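A few lines illustrate this resolution mismatch (the sample rate and FFT size are just typical examples):

```python
# FFT bins are spaced linearly at sample_rate / n_fft Hz everywhere,
# while semitones are spaced logarithmically: each one is a factor
# of 2**(1/12), so the gap in Hz shrinks as you go down in pitch.
sr, n_fft = 44100, 4096
bin_width = sr / n_fft                  # ~10.8 Hz, regardless of frequency

def semitone_gap(freq):
    """Distance in Hz from a note to the next semitone up."""
    return freq * (2 ** (1 / 12) - 1)

for name, f in [('A1', 55.0), ('A4', 440.0), ('A7', 3520.0)]:
    gap = semitone_gap(f)
    ok = 'resolvable' if gap > bin_width else 'NOT resolvable'
    print(f'{name} ({f:7.1f} Hz): next semitone {gap:6.2f} Hz away -> {ok}')
```

At A1 the neighbouring semitone is only ~3.3 Hz away, well under the ~10.8 Hz bin width, which is exactly the motivation for the logarithmically spaced transforms (like the Constant Q transform) mentioned below.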

An audio recording does (probably) not contain all the information needed to reconstruct the score. A large part of our music perception happens in our ears and brain. That is why some of the most successful systems are expert systems with large knowledge repositories about the structure of (western) music that rely only to a small extent on signal processing to extract information from the audio recording.

When I am back home I will look through the papers I have read, pick the 20 or 30 most relevant ones and add them here. I really suggest reading them before you decide to implement something - as stated before, most common assumptions are somewhat incorrect, and you really don't want to rediscover all the things found and analyzed over more than 50 years while implementing and testing.

It's a hard problem, but it's much fun, too. I would really like to hear what you tried and how well it worked.

For now you may have a look at the Constant Q transform, the Cepstrum and the Wigner(–Ville) distribution. There are also some good papers on how to extract the frequency from shifts in the phase of short-time Fourier spectra - this allows the use of very short window sizes (for high time resolution) because the frequency can be determined with a precision several thousand times finer than the frequency resolution of the underlying Fourier transformation.

All these transforms fit the problem of audio processing much better than the ordinary Fourier transform. To improve the results of the basic transforms, have a look at the concept of energy reassignment.

+1. For me though, as things stand, I do not have the mathematical knowledge to fully comprehend the Constant Q transform as you do. What I can do is try to think of practical solutions based on my not particularly extensive knowledge of computing and programming.
A:

If you need to classify a bunch of songs right now, then crowd-source the problem with something like Mechanical Turk.

That would be a 'Mechanical Turk With Pitch Perfect Musical Understanding'... good luck finding your source!