Hi, I would like to get some sort of distance measure between two pieces of audio. For example, I want to compare the sound of an animal to the sound of a human mimicking that animal, and then return a score of how similar the sounds were.

It seems like a difficult problem. What would be the best way to approach it? I was thinking of extracting a couple of features from the audio signals and then computing a Euclidean distance or cosine similarity (or something like that) on those features. What kind of features would be easy to extract and useful for determining the perceptual difference between sounds?

(I saw something on how Shazam uses hashing, but that seemed like a different problem, because there the two pieces of audio are exactly the same apart from added noise. In this case the two pieces of audio are not the same; they are just perceptually similar.)

I'm looking forward to your ideas :)

Cheers, Bart

+2  A: 

I'd suggest looking into spectrum analysis. Whilst this isn't as straightforward as you're most likely wanting, I'd expect that decomposing the audio into its underlying frequencies would provide some very useful data to analyse. Check out this link
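To make the idea concrete, here is a minimal sketch of what "decomposing the audio into its underlying frequencies" could look like. It is only an illustration under some assumptions: the signals are short lists of samples at a known sample rate, a naive O(n^2) discrete Fourier transform is used so that nothing beyond the standard library is needed (a real project would use an FFT library), and the helper names (`magnitude_spectrum`, `peak_frequency`, `spectral_distance`, `sine`) are made up for this example.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive O(n^2) discrete Fourier transform; returns |X[k]| for k = 0..n//2."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def peak_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest bin in the magnitude spectrum."""
    spectrum = magnitude_spectrum(samples)
    strongest_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
    return strongest_bin * sample_rate / len(samples)


def spectral_distance(a, b):
    """Euclidean distance between unit-normalised magnitude spectra.

    Assumes both signals have the same length and sample rate.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # avoid dividing by zero
        return [x / norm for x in v]

    sa, sb = unit(magnitude_spectrum(a)), unit(magnitude_spectrum(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sa, sb)))


def sine(freq_hz, sample_rate=800, n=80, amplitude=1.0):
    """Hypothetical test signal: a pure tone (stand-in for real recordings)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]
```

With these helpers, `peak_frequency(sine(100), 800)` picks out the 100 Hz bin, and `spectral_distance` comes out near zero for the same tone at two different volumes but large for tones at different frequencies. Real animal sounds change over time, so in practice you would probably compare spectra of short overlapping windows rather than one spectrum for the whole clip.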

Will A
Thanks. I could try to generate frequency spectra of different sounds and see whether similar-sounding sounds produce similar spectra and different sounds don't. From what I understand of the Wikipedia link, the frequency spectrum has to be created using a Fourier transform?
Bart
+1  A: 
Andrew
You mean generating a frequency spectrum, right? Wouldn't the line of the difference function just be 0 when you take it against the original sound wave?
Bart
Yes, I mean generating a frequency spectrum. In so many words. :) If the line of best fit being compared was based on an average of the two sound waves, no, I don't believe it would just be 0. Could be wrong, though!
Andrew
+1  A: 

In computer science research, the process of comparing a set of sounds for similarities is called Content-Based Audio Indexing, Retrieval, and Fingerprinting.

One method of doing this is to:

  1. Run several bits of signal processing on each audio file to extract features, such as pitch over time, frequency spectrum, autocorrelation, dynamic range, transients, etc.

  2. Put all the features for each audio file into a multi-dimensional array and dump each multi-dimensional array into a database

  3. Use optimization techniques (such as gradient descent) to find the best match for a given audio file in your database of multi-dimensional data.
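The three steps above can be sketched roughly as follows. This is only a toy illustration with several assumptions: the features (zero-crossing rate, RMS level, peak level) are chosen because they are trivial to compute, not because they are known to be perceptually good; the "database" is a plain dict; and the matching step is a brute-force nearest-neighbour scan using Euclidean distance, swapped in for the optimization techniques mentioned in step 3. All function names are hypothetical.

```python
import math


def extract_features(samples):
    """Step 1: return a small feature vector [zero-crossing rate, RMS, peak]."""
    n = len(samples)
    # Count sign changes between consecutive samples.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (n - 1)
    rms = math.sqrt(sum(x * x for x in samples) / n)
    peak = max(abs(x) for x in samples)
    return [zcr, rms, peak]


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def best_match(query_features, database):
    """Step 3 (simplified): brute-force scan for the closest stored vector."""
    return min(database, key=lambda name: euclidean_distance(query_features, database[name]))


def tone(freq_hz, amplitude=1.0, sample_rate=8000, n=800):
    """Hypothetical stand-in for a loaded audio file: a pure tone."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


# Step 2: store one feature vector per sound (a dict stands in for the database).
database = {
    "low_hum": extract_features(tone(100)),
    "high_whistle": extract_features(tone(1500)),
}

# A slightly quieter, slightly lower whistle should still match "high_whistle".
query = extract_features(tone(1400, amplitude=0.9))
closest = best_match(query, database)
```

In a real system the features would usually be scaled to comparable ranges before computing distances, since otherwise whichever feature has the largest numeric range dominates the result.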

The trick to making this work well is picking the right features. Doing this automatically and getting good results can be tricky. The guys at Pandora do this really well, and in my opinion they have the best similarity matching around. They encode their vectors by hand though, by having people listen to music and rate it in many different ways. See their Music Genome Project and List of Music Genome Project attributes for more info.

For automatic distance measurements, there are several projects that do this kind of thing, including Marsyas, MusicBrainz, and The Echo Nest.

The Echo Nest has one of the simplest APIs I've seen in this space. Very easy to get started.

Nick Haddad