I am trying to use an ANN for pitch detection of musical notes. The network is a simple two-layer MLP whose inputs are basically a DFT (averaged and logarithmically distributed), and whose 12 outputs correspond to the 12 notes of a particular octave.

The network is trained with several samples of those 12 notes played by some instrument (one note at a time), and a few samples of "silence".

The results are actually good. The network is able to detect those notes played by different instruments pretty accurately, it's relatively immune to noise, and it doesn't even lose its sanity completely when played a whole song.

The goal, however, is to be able to detect polyphonic sound, so that when two or more notes are played together, the corresponding neurons all fire. The surprising thing is that the network already does that to some extent (despite being trained on monophonic samples only), though less consistently and less accurately than for monophonic notes. My question is: how can I enhance its ability to recognise polyphonic sound?

The problem is I don't truly understand why it works already. The different notes (or their DFTs) are basically different points in space for which the network is trained. So I see why it recognises similar sounds (nearby points), but not how it "concludes" the output for a combination of notes (which forms a point distant from each of the training examples). In the same way, an AND network which is trained on (0,0) (0,1) (1,0) = (0) is not expected to "conclude" that (1,1) = (1).

The brute-force approach to this is to train the network with as many polyphonic samples as possible. However, since the network seems to somehow vaguely grasp the idea from the monophonic samples, there's probably something more fundamental here.

Any pointers? (sorry for the length, btw :).

+6  A: 

The reason it works already is probably quite simply that you didn't train it to pick one and only one output (at least I assume you didn't). In the simple case when the output is just a dot product of the input and the weights, the weights would become matched filters for the corresponding pitch. Since everything is linear, multiple outputs would simultaneously get activated if multiple matched filters simultaneously saw good matches (as is the case for polyphonic notes). Since your network probably includes nonlinearities, the picture is a bit more complex, but the idea is probably the same.
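To make the matched-filter picture concrete, here is a minimal sketch of the purely linear case. Everything in it is an assumption made for illustration, not the asker's actual setup: the 36 log-frequency bins, the toy note spectra (a fundamental plus two weaker harmonics), and the 0.5 threshold.

```python
import numpy as np

N_NOTES = 12   # one octave
N_BINS = 36    # hypothetical log-frequency bins, 12 per octave

def note_template(k):
    """Toy spectrum of note k: fundamental plus two weaker harmonics."""
    t = np.zeros(N_BINS)
    t[k] = 1.0          # fundamental
    t[k + 12] = 0.5     # 2nd harmonic, one octave up
    t[k + 19] = 0.25    # 3rd harmonic, roughly an octave and a fifth up
    return t / np.linalg.norm(t)

# Each row of W is the matched filter (unit-norm template) for one note.
W = np.stack([note_template(k) for k in range(N_NOTES)])

def detect(spectrum, threshold=0.5):
    """Return the indices of all notes whose filter output exceeds the threshold."""
    scores = W @ spectrum          # one dot product per note
    return np.flatnonzero(scores > threshold).tolist()

single = note_template(0)
chord = note_template(0) + note_template(4)   # two notes sounding together

print(detect(single))  # [0]
print(detect(chord))   # [0, 4]
```

Nothing here was ever shown a chord, yet both outputs fire, because each output only asks "does my template appear in the input?". Notes a fifth apart share a harmonic bin in this toy model, so their filters produce a little cross-talk, which is one plausible reason real polyphonic detection is less clean than the monophonic case.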

Regarding ways to improve it, training with polyphonic samples is certainly one possibility. Another possibility is to switch to a linear filter. The DFT of a polyphonic sound is basically the sum of DFTs of each individual sound. You want a linear combination of inputs to become a corresponding linear combination of outputs, so a linear filter is appropriate.
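A quick numerical check of that linearity claim, as a sketch with made-up signals (two sine waves at 440 Hz and 554.37 Hz, sampled at an assumed 8 kHz). Note the caveat in the second half: the complex DFT is exactly additive, but if your inputs are magnitudes (as an averaged, log-distributed DFT probably is), additivity only holds approximately.

```python
import numpy as np

fs = 8000                                  # assumed sample rate in Hz
t = np.arange(1024) / fs
a = np.sin(2 * np.pi * 440.0 * t)          # one note
b = np.sin(2 * np.pi * 554.37 * t)         # another note

# The complex DFT is linear: DFT(a + b) == DFT(a) + DFT(b).
mix_dft = np.fft.rfft(a + b)
sum_dft = np.fft.rfft(a) + np.fft.rfft(b)
print(np.allclose(mix_dft, sum_dft))       # True

# Magnitudes are only sub-additive (triangle inequality), because the
# phases of the two notes can partially cancel in bins where they overlap.
mix_mag = np.abs(mix_dft)
sum_mag = np.abs(np.fft.rfft(a)) + np.abs(np.fft.rfft(b))
print(bool(np.all(mix_mag <= sum_mag + 1e-9)))  # True
```

So a linear filter fits the complex spectrum exactly and a magnitude spectrum only roughly, with the error concentrated in bins where the notes' partials overlap.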

Incidentally, why do you use a neural network for this in the first place? It seems that just looking at the DFT and, say, taking the frequency with the largest magnitude would give you better results more easily.
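For reference, the peak-picking idea might look roughly like this. It is only a sketch: the function name `peak_pitch_class`, the Hann window, the 8 kHz sample rate, and the A4 = 440 Hz tuning are all my assumptions.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def peak_pitch_class(signal, fs):
    """Name the note whose pitch is closest to the strongest DFT peak."""
    windowed = signal * np.hanning(len(signal))     # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    peak = np.argmax(spectrum[1:]) + 1              # skip the DC bin
    # Convert frequency to a MIDI note number, assuming A4 = 440 Hz = MIDI 69.
    midi = 69 + 12 * np.log2(freqs[peak] / 440.0)
    return NOTE_NAMES[int(round(midi)) % 12]

fs = 8000
t = np.arange(4096) / fs
print(peak_pitch_class(np.sin(2 * np.pi * 440.0 * t), fs))   # A
```

As written it reports a single note and trusts the strongest partial, which for some instruments is a harmonic rather than the fundamental, so for polyphonic input you would need to pick several peaks and sort out which ones are harmonics of others.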

A: 

I experimented with evolving a CTRNN (Continuous Time Recurrent Neural Network) to detect the difference between two sine waves. I had moderate success, but never had time to follow up with a bank of these neurons (i.e. in bands similar to those of the cochlea).

stephendwolff