Hello

I'm working on a program for tone deaf people. I've been working with SAPI and a TTS engine. The program plays a 3D animation of a hand at the same time. The problem is that the voices, even at their slowest speech rate, are too fast for what I want. So I thought about using speech recognition instead, but then I have to do a lot of processing on the text before the animation starts.

So I want to know whether it would be possible to run speech recognition on a .wav file of my own voice, and afterwards do the same processing as with TTS (with SAPI events...), but using the .wav file with my voice.

If it's possible, please tell me how. If you think there are better alternatives, let me know.

Thanks for your time (and excuse my English)

Jesuskiewicz

A: 

Now that I understand what you want to happen, I can say that as far as I know, the SAPI SR engine doesn't really provide phoneme-level markup that's synchronized to the incoming text.

What you could try (although I have no real expectation that this will work) would be to take the audio, run it through a pronunciation grammar to generate phonemes, and then use the text elements to find the corresponding bits of audio.

When I say a 'pronunciation grammar', I mean a dictation grammar with the pronunciation model loaded - set it up like this:

CComPtr<ISpRecoGrammar> cpGrammar;
// ... initialize SR engine and create a grammar ...
HRESULT hr = cpGrammar->LoadDictation(L"Pronunciation", SPLO_STATIC);
// check hr before activating the grammar

In your recognition handler, you would need to parse out the elements:

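// ipReco is the ISpRecoResult* delivered with the recognition event.
ISpRecoResult* ipReco;
SPPHRASE* pPhrase = NULL;
if (SUCCEEDED(ipReco->GetPhrase(&pPhrase)) && pPhrase != NULL)
{
    // ulCountOfElements is a ULONG, so use an unsigned loop index.
    for (ULONG i = 0; i < pPhrase->Rule.ulCountOfElements; ++i)
    {
        const SPPHRASEELEMENT * pElem = pPhrase->pElements + i;
        // examine pElem->ulAudioTimeOffset, pElem->ulAudioSizeTime, etc.
    }
    // GetPhrase allocates the SPPHRASE with CoTaskMemAlloc; the caller frees it.
    ::CoTaskMemFree(pPhrase);
}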
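
Once you have the element timings, you still have to turn them into something the animation can use. Here is a rough, portable sketch of that step. ElementTiming, AnimationCue and buildCues are names I made up for illustration, not part of SAPI; ElementTiming just mirrors the ulAudioTimeOffset and ulAudioSizeTime fields, which SAPI reports in 100-nanosecond units relative to the start of the phrase. You would fill it from each pElem inside the loop above.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the timing fields of one SPPHRASEELEMENT.
struct ElementTiming {
    std::string text;               // display text of the recognized element
    std::uint32_t audioTimeOffset;  // start, in 100-ns units from phrase start
    std::uint32_t audioSizeTime;    // length, in 100-ns units
};

// Hypothetical cue handed to the animation code, in milliseconds.
struct AnimationCue {
    std::string text;
    double startMs;
    double durationMs;
};

// One millisecond is 10,000 units of 100 ns.
constexpr double kHundredNsPerMs = 10000.0;

// Convert per-element timings into millisecond cues, keeping the input order.
std::vector<AnimationCue> buildCues(const std::vector<ElementTiming>& elements) {
    std::vector<AnimationCue> cues;
    cues.reserve(elements.size());
    for (const ElementTiming& e : elements) {
        cues.push_back({e.text,
                        e.audioTimeOffset / kHundredNsPerMs,
                        e.audioSizeTime / kHundredNsPerMs});
    }
    return cues;
}
```

Each cue then tells the animation when a word starts in the recording and how long it lasts. Keep in mind this is word-level timing only, so anything finer (per phoneme or per viseme) would have to be estimated.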

I hope this is enough to get you started...

Eric Brown
Thank you for the answer. I think I'm going to take another approach (if I find one)... My mistake was thinking that recognition raises Viseme events just as synthesis does. But it's true, I've realized that it didn't make sense, since recognition works at the word level. Thanks again.
Jesuskiewicz