I’ve got a lot of speech audio in WMA format and I’d like to machine transcribe it – even if the transcription is not 100% accurate, I think it could help quite a bit as an “index” to some of the audio. I’m willing to write some code to make this happen, but can Microsoft’s Speech APIs help me here? Is there already an app that can do this for me?


You would need a suitable program for this, such as dictation software. The Speech API works the other way around (text to speech). I don't believe there is anything open source for this either, as it is a very, very complicated piece of software.

SAPI covers both recognition and synthesis, so it's certainly possible that it can be used. I'm not familiar with it, though, so I can't say if Windows actually provides access to built-in recognition for English even on a non-English OS. It might still provide enough to get you started, though.
Michael Madsen
Oh, I didn't know that. I only remembered the version in XP; now that you mention it, Vista does have this recognition feature.

SAPI can certainly do what you want. Start with an in-proc recognizer, connect up your audio as a file stream (you'll probably need to transcode your WMA files to a WAV stream, as SAPI only takes WAV input, but you can do the transcoding on the fly), set dictation mode, and off you go.
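
For the transcoding step, here is a minimal sketch that assumes the ffmpeg command-line tool is installed and on the PATH. ffmpeg is my assumption, not something SAPI requires, and the helper name `wma_to_wav_command` is made up for illustration. It builds the command that converts a WMA file to 16 kHz, mono, 16-bit PCM WAV; the sample rate is a guess at a reasonable recognizer input, so adjust it if your engine expects something else.

```python
from pathlib import Path


def wma_to_wav_command(src, dst=None, sample_rate=16000):
    """Build an ffmpeg command that transcodes `src` to mono 16-bit PCM WAV.

    Hypothetical helper: it only builds the argument list, it does not run it.
    """
    src = Path(src)
    # Default output name: same file name with a .wav extension.
    dst = Path(dst) if dst is not None else src.with_suffix(".wav")
    return [
        "ffmpeg",
        "-i", str(src),           # input file (WMA)
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample, 16 kHz by default
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM
        str(dst),
    ]


if __name__ == "__main__":
    import subprocess
    import sys

    # Convert every file named on the command line.
    for name in sys.argv[1:]:
        subprocess.run(wma_to_wav_command(name), check=True)
```

Saved as a script, it would be run with the WMA file names as arguments, writing a .wav next to each input. Converting to files first rather than on the fly is just the simpler option for a sketch.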

Now the disappointing bit. You probably won't get terribly good results; in fact, I suspect that unless you're very lucky, you'll probably get total garbage.

There are several problems:

  1. Dictation really only works well once the SR engine has been trained. If you're lucky (like me), you can get OK results, but if the speaker has an accent, training is a must.
  2. Training only works well for a single voice. If you've got multiple speakers in a single audio file, it's not going to work well.
  3. The audio model for dictation (and Speech Recognition in general) assumes that you're using a close-talk microphone (i.e., a microphone right next to your face, to minimize noise pickup). If your WMA files have extra noise, accuracy will go down dramatically.

I would actually suggest using Dragon NaturallySpeaking Professional; they've spent the time and money to make transcription work. I haven't used it myself, so I don't know how well it would work in your situation.

Eric Brown
I did a bit of research on Dragon NaturallySpeaking, and the transcription tool assumes that it's taking its input from a voice recorder or similar tool, so it has a similar set of problems (it requires training, assumes a single voice, and assumes the microphone is close to the speaker).
Eric Brown
That is true, but the Dragon engine has been used successfully for "audio mining" before. If you need an accurate transcript, you will be disappointed. If you want to find keywords or phrases in a reasonable-quality audio source (like TV, not a phone conference recording), it works. This was a number of years ago, but I'm sure it hasn't gotten worse.
Mike Elkins
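
To illustrate the keyword-finding idea, here is a rough sketch of turning an imperfect transcript into a searchable index. It assumes the recognizer has already produced a list of (start time in seconds, recognized text) pairs; that input format, the `build_keyword_index` name, and the tiny stopword list are all made up for this example and do not come from any particular engine.

```python
import re
from collections import defaultdict

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "that", "this", "with", "are", "was"}


def build_keyword_index(segments):
    """Map each keyword to the start times of the segments that contain it.

    `segments` is an iterable of (start_seconds, text) pairs (hypothetical format).
    """
    index = defaultdict(list)
    for start_seconds, text in segments:
        # Use a set so a word repeated within one segment is recorded once.
        for word in set(re.findall(r"[a-z']+", text.lower())):
            if len(word) > 2 and word not in STOPWORDS:
                index[word].append(start_seconds)
    return dict(index)


if __name__ == "__main__":
    demo = [
        (0.0, "Welcome to the budget meeting"),
        (42.5, "The budget is over by ten percent"),
    ]
    print(build_keyword_index(demo)["budget"])
```

Even with many misrecognized words, looking up a keyword gives candidate positions in the audio to jump to, which is all an "index" needs.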