We are seeing speech recognition implemented more and more often, along with requests for libraries that do good speech recognition. What's the rationale (in terms of usability) behind it versus a keyboard or keypad? What reasons would you have to invest in this development?

For example, take call centers. A few years ago, almost every call center used an IVR that prompted for a key press to navigate its menus. Now we're seeing more and more menus that prompt for a spoken keyword and/or a key press: "please say invoice or press 1 to see your invoice". We're seeing the same thing in company phone directories: "please say the name of the person you are trying to reach" ... "Franck Loyd" ... "Did you say Jack Freud? Please say yes if you want to reach this person or say no to try again".

I guess it's a plus when you're in your car without holding your phone but is it worth the additional waiting time? Longer interactions to go through all the choices, longer prompt times while the system tries to work out whether something was said, and so on? Also, reliability is better than it was, definitely, but sometimes it feels more like a toy someone decided to plug into the system so it can feel futuristic.

Any experience designing IVR or software that used (or chose not to) speech recognition?

Thanks!

+1  A: 

I think that speech recognition, like any method of input, has its pros and cons.

Pros

  • No learning curve, we have been speaking since a very young age.
  • Very user-intuitive.
  • On the phone, no need to constantly move the headset from your ear.

Cons

  • Longer wait time
  • With bad sound quality, it can take multiple attempts to get the selection right.
Dmitri Farkov
It also has the limitation of requiring user-specific training in order to optimize performance. If you have an unusual accent, the generically trained phone systems you encounter might give you a rough time.
Steve S
I like the "move the headset from your ear" argument, but on the other hand, on most systems you need to type in at least something on the keypad (your PIN, credit card number, etc.), and a good IVR shouldn't be more than, what, 4 or 5 levels deep? It shouldn't make you dial too much.
lpfavreau
@Ipfavreau: I have come across systems that actually have you speak each number, though it tends to be a frustrating endeavor.
Steve S
@Steve S: And I guess we're not talking about security issues either. "Please say your PIN out loud" ... "9! 9! 4! 9!". "Thank you, me and the weird looking guy following you can now access your account". ;-)
lpfavreau
Yeah, that's definitely an issue. Usually when I'm using such a system I can find a private place (my home or my car), but it sure doesn't encourage good security.
Steve S
I had a problem once when my fully legit version of Vista decided to require random activation confirmation from Microsoft (what a pointless thing to do). Their whole automated speech-recognition system asked you to say your activation code, number by number, pausing after each number and telling you "Next number...". Very frustrating. I am not sure what was worse though: the system that could barely understand me, or the offshore tech support who understood me even less.
Dmitri Farkov
+1  A: 

In some cases a company is required to handle rotary phones. It might turn out to be more cost-effective to just set up the recognition system instead of both.

Voice recognition has a lot more overhead than touch tones. If you want the best results, you need to constantly tweak the app and train the system on unrecognized word pronunciations. You also need to be very particular about how you prompt the user with voice recognition, or you may get unexpected responses.

Overall, touch tone is a lot easier, as there is only a limited set of possible options at any given time.

If your app is straightforward enough, voice rec may only complicate it. Press 2 for some other language...

cwhite
A: 

Speech recognition seems to be buggy!

TokenMacGuy
Haha, yes I saw that one on XKCD, excellent!
lpfavreau
A: 

Speech recognition is definitely the wave of the future when combined with touchscreen technology. As an example, I use tazti speech recognition. It's available in XP and Vista versions. Since Microsoft's touchscreen "Surface" platform runs on Vista, I'm sure tazti will work with the touchscreen technology. When I tried tazti speech recognition, the built-in commands worked great. It also lets me create my own speech commands, and those work great too. Voice searching Google, Yahoo, Wikipedia, YouTube and many other search engines works great. It has many other features as well, but it doesn't have dictation. I found that I eliminate 70% or more of my internet-generated clicks. NOTE: Tazti is a free download from their website.

+1  A: 

What's the rationale (in terms of usability) behind it versus a keyboard or keypad?

Usability is a very broad term. If I were to attempt to enter my address with a touch pad, it wouldn't be considered very usable. Some argue that using a speech engine with an overall success rate of 70-80% isn't very usable either. As indicated in other posts, hands-free input can be much easier for those on a mobile phone. However, using words instead of numeric input can actually be less intuitive than a touch tone phone if the topic is somewhat foreign to the caller. A caller hearing unfamiliar terms and phrases can't remember them over the 10-30 seconds of the prompt, but with a keypad they can hover a finger over the best-sounding choice or remember the order of the choices.

What reasons would you have to invest in this development?

This is an odd question. Usually the decision to use speech or not in an IVR environment is not driven from the development view of the world. Unless you have a specific requirement that really requires speech, you are almost always reducing overall success rates. Speech is usually a factor of corporate image ... or having the latest technological toy.

I guess it's a plus when you're in your car without holding your phone but is it worth the additional waiting time?

Speech recognition latencies aren't very high these days when using modern ASRs. In most cases, input is processed in parallel with speech, and the time between the end of speech and the recognition result is 0.5 to 1 s. Be aware that many IVRs then need to perform data look-ups after some inputs, and this can make the system appear slower. Normal inputs pushing beyond 1 s are usually the sign of an under-powered deployment.

It may not have been under-powered when originally implemented, but through tuning efforts, you make a lot of performance versus accuracy decisions. To get that next .1%, resources can be pushed beyond what they should be at peak.

Also, reliability is better than it was, definitely, but sometimes it feels more like a toy someone decided to plug into the system so it can feel futuristic.

In general, yes. On the reliability note, you need to really look at the overall numbers to get a sense of the system. It is a battle of statistics where the individual isn't very important (unless they hold the title of VP or above). Through optimization of the input (shifting prompting), resource usage and other speech reco tuning parameters, you attempt to maximize accuracy. For basic natural language responses, you can get into the upper 90s. However, your overall success rate is much lower. Imagine 5 prompts all at 98% (in reality, you tend to have a bunch at 99 and then a few in the mid 90s or slightly below): .98 * .98 * .98 * .98 * .98 ≈ 90%. That means 1 out of 10 calls failing. That is before caller confusion and business rules. DTMF input is usually very near 100%, even after several inputs.
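To make the arithmetic above concrete, here is a minimal Python sketch. It assumes each prompt succeeds or fails independently of the others (a simplification), and the function name `overall_success_rate` and the second list of rates are made up for illustration, not measurements from a real deployment.

```python
from math import prod


def overall_success_rate(per_prompt_rates):
    """Chance that every prompt in a call is recognized correctly.

    Assumption: prompts succeed or fail independently, so the
    per-prompt rates simply multiply.
    """
    return prod(per_prompt_rates)


# Five prompts, all at 98%, as in the example above.
uniform = overall_success_rate([0.98] * 5)

# Hypothetical mix: a bunch at 99% and a couple in the mid 90s.
mixed = overall_success_rate([0.99, 0.99, 0.99, 0.95, 0.94])

print(f"five prompts at 98%: {uniform:.1%}")
print(f"mixed prompts:       {mixed:.1%}")
```

The point is that a few weaker prompts pull the whole call down faster than the per-prompt numbers suggest, which is why you have to look at the end-to-end rate and not the rate of any single prompt.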

Any experience designing IVR or software that used (or chose not to) speech recognition?

Yes. But I suspect that really isn't the question you want. As someone on the technology side, this is usually not your decision and you have limited influence on it. If you are really looking for the pros/cons of speech:

Pros:

  • Cool/hip (note, speech alone isn't sufficient. You need a great VUI and voice talent)
  • Good for a highly mobile crowd that shuns ear pieces. The future is supposed to be blending speech with tactile input. Maybe. It probably won't come from the IVR side of the market.
  • Good for tasks that can't be done with DTMF. Note, many of these problems tend to have low success rates in speech as well. Cost (versus humans) is usually the driving factor not usability. Dropping a call into a voicemail box for things like address change can be very cost effective.

Cons:

  • Expensive to develop, deploy and maintain. Adding new choices can have a significant impact on success rates if you aren't careful. Always monitor the impact of change.
  • Is often deployed inappropriately. For example, "just say your numeric menu choice". This is nearly always a case of "we want speech coolness, but can't afford what it really takes to achieve speech coolness".
  • Success rates will be lower and therefore call center costs will be higher.
  • Failures tend to focus on specific prompts and individual callers. A caller that regularly experiences problems with your system will be very unhappy with you.
  • Callers get angry when they aren't understood. Is your goal to identify a subset of your customer base and really get them angry?
Jim Rush
Great answer. Thanks. About the additional waiting time, I was mostly referring to the longer prompts often required to describe how to interact with the system, rather than the time the system needs for voice analysis. The prompts are often: "press 1 to access your invoices or say 'invoices' loud and clear, press 2 to talk to someone or say 'impossible' while mumbling".
lpfavreau