I have a voice application that would be much improved if it could use a "trigger word" to start recording audio. I don't need a full speech-to-text engine, just the ability to detect the trigger word reliably and efficiently.

I am wondering if there are any specialized speech engines that support this specific use case, or any libraries/methods for developing such a single-purpose detection engine. Ideally I'd like it to work in noisy environments, but it can be trained for a single user's voice.

Pointers to research papers / topics would also be appreciated so I know what to ask for.

A: 

Okay, I could be completely off, but using a full-featured speech-recognition library may be overkill for your use case.

If you can live with something simpler but still audio-driven, consider this:

Detecting a hand-clap is very simple. A hand-clap has high energy across the whole audio band. Detecting it is simple and computationally much cheaper than full-blown speech recognition.

In a nutshell, you record the audio, do a (short-time) FFT on the data and detect the case where you have high energy in 80% of the available frequency bins. Requiring 80% rather than all of the bins takes care of any phasing issues caused by a simple recording-room/microphone setup. Then adjust the threshold to taste and you're done.

Doing the same with speech-recognition is possible as well, but you will burn tons of CPU cycles.
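The clap-detection idea above can be sketched roughly as follows. This is only a minimal illustration, not a tested detector: the function names (`power_spectrum`, `is_clap`), the frame length and the threshold value are my own assumptions, and a naive DFT stands in for a real FFT library just to keep the example self-contained. No windowing or overlap between frames is applied.

```python
import cmath
import math


def power_spectrum(frame):
    """Return the power of DFT bins 1..N/2 for one audio frame.

    A naive O(N^2) DFT is used here only to avoid dependencies;
    real code would call an FFT routine instead. The DC bin is skipped.
    """
    n = len(frame)
    powers = []
    for k in range(1, n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        powers.append(abs(acc) ** 2)
    return powers


def is_clap(frame, energy_threshold, min_fraction=0.8):
    """True if at least min_fraction of the bins exceed energy_threshold.

    energy_threshold is a hypothetical tuning value ("adjust to taste");
    min_fraction=0.8 is the 80% figure from the description above.
    """
    powers = power_spectrum(frame)
    loud_bins = sum(1 for p in powers if p > energy_threshold)
    return loud_bins / len(powers) >= min_fraction


if __name__ == "__main__":
    # A single spike is broadband, like a clap; a steady tone is not.
    impulse = [0.0] * 64
    impulse[10] = 1.0
    tone = [math.sin(2 * math.pi * 4 * i / 64) for i in range(64)]
    print("impulse:", is_clap(impulse, energy_threshold=0.5))
    print("tone:   ", is_clap(tone, energy_threshold=0.5))
```

In practice you would run `is_clap` on each successive short frame coming from the microphone and tune `energy_threshold` against the background noise of the room.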

Nils Pipenbrinck
A: 

What O/S? I wonder for example whether Speech functionality in Windows Vista would help you. Recognising a single word seems like the simplest possible problem for any speech analyzer.

ChrisW
Recognizing a single phrase would be easier. The longer the key word or phrase to be recognized is, the easier it is to avoid false positives. That's why finite-grammar recognition is much easier and more reliable than dictation.
A: 

A question was asked just a few days ago about speech recognition possibilities on Linux. What you are asking for is a subset of that, so I assume some of those answers contain useful information. The article linked in joeforker's answer was very interesting.

hlovdal
An explanation of why this was downvoted would be appreciated.
hlovdal
A: 

Hi.

I have a voice recording win32 app. I use an OCX to manage recording/playback.

I know it is not exactly the solution you are asking for, but you might want to consider a foot pedal. It is simple to program and would work much like a spoken word to start/stop recording. Check these: www.pedalpower.com

Hope it helps,

Reinaldo.

reinaldo Crespo
A: 

A colleague of mine on the Red5 project created a similar demo using trigger words to cause a search to be run against an image repository. Saying "cat" caused an image of a cat to appear within about a second. The client application was written in Flash and the back-end ran on Red5 using the free Sphinx library. You could certainly do what you want with Sphinx without much effort.
Sphinx project: http://cmusphinx.sourceforge.net/sphinx4/

Mondain