I am trying to use an ANN for pitch detection of musical notes. The network is a simple two-layer MLP whose inputs are basically a DFT (averaged and logarithmically distributed), and whose 12 outputs correspond to the 12 notes of a particular octave.
The network is trained with several samples of those 12 notes played by some instrument (one note at a time), and a few samples of "silence".
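To make the setup concrete, here is a minimal sketch of how such inputs and targets might be built. It is only my reading of "averaged and logarithmically distributed": the function names, the band count (48), the frequency range and the Hann window are all assumptions for illustration, not the exact preprocessing used.

```python
import numpy as np

def log_band_features(signal, sample_rate, n_bands=48, f_min=55.0, f_max=3520.0):
    """Average DFT magnitudes into logarithmically spaced bands.

    n_bands, f_min and f_max are illustrative values (assumptions).
    """
    window = np.hanning(len(signal))              # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(signal * window))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    edges = np.geomspace(f_min, f_max, n_bands + 1)  # log-spaced band edges
    features = np.zeros(n_bands)
    for i in range(n_bands):
        in_band = (freqs >= edges[i]) & (freqs < edges[i + 1])
        if in_band.any():
            features[i] = spectrum[in_band].mean()   # average the bins in this band
    return features

def note_target(note_index=None, n_notes=12):
    """One output per note; all zeros stands for a "silence" sample."""
    target = np.zeros(n_notes)
    if note_index is not None:
        target[note_index] = 1.0
    return target
```

Each training pair would then be something like `(log_band_features(sample, rate), note_target(k))`, with `note_target()` for the silence samples.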
The results are actually good. The network detects those notes played by different instruments pretty accurately, it is relatively immune to noise, and it doesn't even lose its sanity completely when played a song.
The goal, however, is to detect polyphonic sound, so that when two or more notes are played together, the corresponding neurons all fire. Surprisingly, the network already does that to some extent (despite being trained on monophonic samples only), though less consistently and less accurately than for monophonic notes. My question is: how can I enhance its ability to recognise polyphonic sound?
The problem is that I don't truly understand why it already works. The different notes (or their DFTs) are basically different points in space for which the network is trained. So I see why it recognises similar sounds (nearby points), but not how it "concludes" the output for a combination of notes (which forms a point distant from each of the training examples). In the same way, an AND network trained on (0,0) (0,1) (1,0) = (0) is not expected to "conclude" that (1,1) = (1).
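The AND analogy can be checked with a toy example. This is a minimal sketch using a single sigmoid neuron trained by plain gradient descent (my own simplification, not the two-layer MLP above; the learning rate and step count are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The three training points from the AND example, all labelled 0.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
y = np.zeros(3)

w = np.zeros(2)   # weights start at zero
b = 0.0
lr = 0.5          # arbitrary learning rate
for _ in range(2000):
    p = sigmoid(X @ w + b)
    grad = p - y              # cross-entropy gradient w.r.t. the pre-activation
    w -= lr * (X.T @ grad)
    b -= lr * grad.sum()

# The unseen point (1,1): nothing in the training data pushes it towards 1.
unseen = sigmoid(np.array([1.0, 1.0]) @ w + b)
print(unseen)
```

The output for (1,1) stays below 0.5, because every update only pushes the weights and the bias down. That is the behaviour I would naively have expected for note combinations as well.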
The brute-force approach is to train the network with as many polyphonic samples as possible. However, since the network seems to somehow vaguely grasp the idea from the monophonic samples, there's probably something more fundamental here.
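For reference, the brute-force route could look roughly like the sketch below. It assumes polyphonic samples can be approximated by summing monophonic recordings of equal length, which is my assumption rather than something I have verified; `mix_pairs` and its input dictionary are hypothetical names.

```python
import itertools
import numpy as np

def mix_pairs(note_signals, n_notes=12):
    """Build two-note training samples from monophonic recordings.

    note_signals: dict mapping note index -> 1-D array, all the same length
    (hypothetical input layout). Returns a list of (waveform, target) pairs.
    """
    samples = []
    for a, b in itertools.combinations(sorted(note_signals), 2):
        mixed = 0.5 * (note_signals[a] + note_signals[b])  # simple additive mix
        target = np.zeros(n_notes)
        target[[a, b]] = 1.0                                # both neurons should fire
        samples.append((mixed, target))
    return samples
```

With 12 notes this gives 66 two-note combinations per set of recordings, and the count grows quickly for three or more notes, which is why I would rather understand the underlying mechanism.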
Any pointers? (sorry for the length, btw :).