I’ve got a lot of speech audio in WMA format and I’d like to machine transcribe it – even if the transcription is not 100% accurate, I think it could help quite a bit as an “index” to some of the audio. I’m willing to write some code to make this happen, but can Microsoft’s Speech APIs help me here? Is there already an app that can do this for me?
You would need an according program to achieve this, like a dictating software. The Speech API is the other way around. I don't believe there is something opensource for this either, as this is a very, very complicated piece of software.
SAPI can certainly do what you want. Start with an in-proc recognizer, connect up your audio as a file stream (you'll probably need to transcode your WMA files to a WAV stream, as SAPI only takes WAV input, but you can do the transcoding on the fly), set dictation mode, and off you go.
Now the disappointing bit. You probably won't get terribly good results; in fact, I suspect that unless you're very lucky, you'll probably get total garbage.
There are several problems:
- Dictation really only works well once the SR engine has been trained. If you're lucky (like me), you can get OK results, but if the speaker has an accent, training is a must.
- Training only works well for a single voice. If you've got multiple speakers in a single audio file, it's not going to work well.
- The audio model for dictation (and Speech Recognition in general) assumes that you're using a close-talk microphone (i.e., a microphone right next to your face, to minimize noise pickup). If your WMA files have extra noise, accuracy will go down dramatically.
I actually would suggest using Dragon Naturally Speaking Professional; they've spent the time and money to make transcription work. I haven't used it myself, so I don't know how well it would work in your situation.