Hi, I would like to get some sort of distance measure between two pieces of audio. For example, I want to compare the sound of an animal to the sound of a human mimicking that animal, and then return a score of how similar the sounds were.
It seems like a difficult problem. What would be the best way to approach it? I was thinking of extracting a couple of features from the audio signals and then computing a Euclidean distance or cosine similarity (or something like that) on those features. What kind of features would be easy to extract and useful for determining the perceptual difference between sounds?
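To make the idea concrete, here is a rough sketch of what I have in mind, in plain Python with no audio libraries. The feature here (the share of spectral energy falling in a few equal-width frequency bands) is just a placeholder I picked for illustration, and the names `band_profile`, `cosine_similarity` and `tone` are made up for this example. I am not claiming this feature captures perceptual similarity well; it only shows the "extract features, then compare vectors" pipeline.

```python
import cmath
import math


def band_profile(signal, frame_size=256, n_bands=8):
    """Return the fraction of spectral energy in each of n_bands equal-width bands.

    The signal is cut into non-overlapping frames, a naive DFT is taken of each
    frame, and the squared magnitudes are summed per band over all frames.
    """
    half = frame_size // 2  # only the positive-frequency bins are used
    # Precomputed complex exponentials exp(-2*pi*i*m/frame_size)
    twiddle = [cmath.exp(-2j * math.pi * m / frame_size) for m in range(frame_size)]
    totals = [0.0] * n_bands
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[start:start + frame_size]
        for k in range(half):
            coeff = sum(x * twiddle[(k * n) % frame_size] for n, x in enumerate(frame))
            totals[k * n_bands // half] += abs(coeff) ** 2
    total = sum(totals)
    if total == 0.0:
        raise ValueError("signal is silent or shorter than one frame")
    return [t / total for t in totals]


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def tone(freq_hz, sample_rate=8000, n_samples=1024):
    """Synthetic test input: a pure sine tone (stands in for real recordings)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]


# Two nearby low tones and one high tone
low_a = band_profile(tone(200))
low_b = band_profile(tone(220))
high = band_profile(tone(3000))

print("cosine:   ", cosine_similarity(low_a, low_b), cosine_similarity(low_a, high))
print("euclidean:", math.dist(low_a, low_b), math.dist(low_a, high))
```

With these synthetic tones, the two low tones come out as more similar to each other than to the high tone under both measures. What I don't know is which features to put in place of `band_profile` so that the score tracks what a listener would call similar.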
(I saw something on how Shazam uses hashing, but that seemed like a different problem, because there the two pieces of audio are exactly the same apart from added noise. In this case the two pieces of audio are not the same; they are just perceptually similar.)
I'm looking forward to your ideas :)
Cheers, Bart