views: 136
answers: 2

What is the current state of the art of sound matching / search in practical terms? I am currently remotely involved in planning a web application which, among other things, will contain and expose a database of short recorded audio clips (at most 3-5 seconds; names of people). The question has been raised whether it would be possible to implement search based on user voice input. My gut tells me that it is an impossible task from both a computational and an algorithmic point of view, especially in a web application (and besides, it would not be a core feature of the application). I realize that there are perhaps a number of academic projects and that it would make a good research topic, but it's not something that could be added to a medium-sized web application as an extra feature. To support my claims I spent half an hour searching so that I would not miss anything obvious, but I really could not find any good sources.

I know that it's not very responsible to ask a question on SO without spending more time researching on my own, but I've noticed that firing off a question on SO is far more effective, precise and fast than just randomly Googling stuff.

+2  A: 

There are some audio fingerprinting technologies out there (mostly proprietary) which essentially 'hash' an audio file. Searching is then a simple hashtable or database lookup.
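In outline, the lookup side really is that simple once you have a fingerprint function. Here is a minimal Python sketch; note that the exact SHA-1 hash below is only a placeholder for a real perceptual fingerprint (which is the hard part, and what the proprietary technologies actually provide):

```python
import hashlib

def fingerprint(pcm_bytes):
    # Placeholder: real fingerprinting systems derive a perceptual key
    # from spectral features, so different encodings of the same audio
    # map to the same (or a nearby) key. An exact hash like this one
    # only matches bit-identical input.
    return hashlib.sha1(pcm_bytes).hexdigest()

index = {}  # fingerprint -> clip id

def add_clip(clip_id, pcm_bytes):
    index[fingerprint(pcm_bytes)] = clip_id

def lookup(pcm_bytes):
    # Constant-time dictionary lookup; a database index works the same way.
    return index.get(fingerprint(pcm_bytes))  # None if no match
```

With a perceptual fingerprint swapped in, the dictionary would typically be replaced by a nearest-neighbour lookup rather than an exact-match one.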

Musicbrainz has a good run-down of the various technologies here.

Whether or not these fingerprints are suitable or accurate for your particular situation, I couldn't tell you.

Nick
+1  A: 

I'm not sure whether you are trying to identify the speaker from the input or to match the input against the names in the database. However: I used to have the idea of developing a metric to calculate the 'distance' between two spoken words. I never got close to an implementation, but I figured out the following:

1) You need to define the significant features of the audio. This is the 'hashing' part Nick described in his answer. Even a spectrogram may contain too much information to be useful. An approach I found potentially interesting (without having any theoretical knowledge of speech research) was MFCC (mel-frequency cepstral coefficients). There is free code at etsi.org (look for speech recognition and standards).
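To make the feature-extraction step concrete, here is a stripped-down MFCC computation in Python/NumPy, written as an illustration rather than a production implementation (real pipelines add pre-emphasis, liftering, and careful filterbank construction; the frame length, hop, and coefficient counts below are arbitrary defaults):

```python
import numpy as np

def mel(f):
    # Convert frequency in Hz to the perceptual mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale from 0 to sr/2.
    pts = np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2)
    hz = 700.0 * (10.0 ** (pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def mfcc(signal, sr=8000, frame_len=256, hop=128, n_filters=20, n_ceps=13):
    # 1) Split the signal into overlapping, windowed frames.
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # 2) Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2
    # 3) Log energies through the mel filterbank.
    fb = mel_filterbank(n_filters, frame_len, sr)
    logmel = np.log(power @ fb.T + 1e-10)
    # 4) DCT-II decorrelates the log energies into cepstral coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                   / (2 * n_filters))
    return logmel @ basis.T  # shape: (n_frames, n_ceps)
```

Each 3-5 second clip then reduces to a few hundred short feature vectors, which is what you would actually compare instead of the raw waveform.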

2) The speed of speech can vary, which complicates things. Dynamic time warping can be used to tackle this. See this Matlab code for an example.
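The classic DTW recurrence is short enough to sketch directly; here is a plain Python/NumPy version (quadratic in sequence length, with no banding or other speedups) that compares two sequences of feature vectors such as the MFCC frames from step 1:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a, b: arrays of shape (n, d) and (m, d), one feature vector per frame.
    """
    a, b = np.atleast_2d(a), np.atleast_2d(b)
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of a[:i] against b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # a-frame repeated
                                 cost[i, j - 1],      # b-frame repeated
                                 cost[i - 1, j - 1])  # frames matched
    return cost[n, m]
```

Because frames may be repeated on either side, a word spoken slowly still aligns cheaply with the same word spoken quickly, which is exactly the variation that defeats naive frame-by-frame comparison.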

I don't think this would be very easy to implement, and it would need a lot of tuning. And it's definitely not state of the art.

Ville Koskinen