views: 951
answers: 21
What's so difficult about the subject that algorithm designers are having a hard time tackling it?

Is it really that complex?

I'm having a hard time grasping why this topic is so problematic. Can anyone give me an example as to why this is the case?

+35  A: 

Because if people find it hard to understand other people with a strong accent, why do you think computers will be any better at it?

Simon
I love the explanation =)
dionadar
Can you calculate 99923423423 ^ 32423343? Nope, but a computer can ;)
Adinochestva
Adinochestva: Actually, calculating that would take a while even for a computer. And there's no reason why a human couldn't emulate a Turing machine, so theoretically it is just as hard for a computer as for a human.
DrJokepu
In response to Adinochestva: yes, I can calculate extremely large numbers. It would take a long time, but I know the steps to take. I can recognise speech, but I can't explain exactly how I do it - I just do.
Andrew Shepherd
WolframAlpha gives an (approximate) answer pretty much immediately
AakashM
@Adinochestva: Mine can't. C# thinks it's "Infinity". =:) It's big but I'm sure it's not that big.
Simon P Stevens
Speech and calculations are two completely different things. A calculation has one answer; speech does not. For instance, in America the tendency is to pronounce MRSA as "MERSA", whereas in the UK we say it as M.R.S.A. They both mean the same thing and should be transcribed the same way, but the computer needs to know the difference. The same is true for any number of differences in language, such as slang. Most English people find it impossible to understand people with broad Scottish accents, because they pronounce things completely differently to the way they are pronounced in England.
Simon
AakashM: Telling you the last few digits of a number that has more than 300 million digits isn't really "approximate"; it is just an application of the (well-known) (a ^ b) % c algorithm. Playing with exponents isn't an approximate answer either, it's playing with logarithm bases. An approximate answer would be the normal form of the number with a low (such as below 1%) error.
DrJokepu
@Adinochestva: 99923423423 ^ 32423343 = 99892120848 ;-). @AakashM: either a human or a computer can give an approximate answer - the problem with a human actually computing it is that it has 300 million digits. I reckon it would take me about 3 years to write it down, let alone compute it.
Steve Jessop
The question means "Why aren't existing speech recognition technologies advancing?", i.e. why is there no visible progress in this area? So answers like "it's hard for people, so impossible for computers" and "if it's easy, do it" aren't relevant.
Kamarey
@onebyone: I think AakashM was using ^ as the exponentiation operator rather than XOR, but meh.
Noldorin
Now imagine trying to do speech recognition on Chinese, where each word can have four different tones that differ only in pitch contour.
HVS
@Kamarey. As usual on SO, the title and the actual question differ. The title says "why isn't it advancing", but the actual question also asks, "why is it difficult". Personally I think this answer is a good starting point for the latter. It does not account for why Yuval A (perhaps correctly, perhaps incorrectly) perceives zero progress in the field, but personally I don't think an answer is obliged to cover every part of a question.
Steve Jessop
You can calculate 99923423423^32423343 in your head without too much difficulty if you're willing to accept an approximate answer.
David Plumpton
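The side debate about computing 99923423423 ^ 32423343 comes down to modular exponentiation: a computer can report the trailing digits of such a power almost instantly via square-and-multiply, even though writing out the full 300-million-digit number is hopeless. A minimal Python sketch (Python's built-in three-argument `pow` implements the same algorithm):

```python
def mod_pow(base, exp, mod):
    """Square-and-multiply: computes (base ** exp) % mod in O(log exp) multiplications."""
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:                      # odd exponent: fold the current base in
            result = (result * base) % mod
        base = (base * base) % mod       # square for the next bit of the exponent
        exp >>= 1
    return result

# Last 11 digits of the number from the comment thread:
print(mod_pow(99923423423, 32423343, 10**11))
```

Note that this yields only the trailing digits (the value modulo 10^11), not an approximation of the full number - which is exactly the distinction being argued in the comments.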
+2  A: 

Speech synthesis is very complex by itself - many parameters are combined to form the resulting speech. Breaking it apart is hard even for people - sometimes you mishear one word as another.

sharptooth
+7  A: 

beecos iyfe peepl find it hard to arnerstand uvver peepl wif e strang acsent wie doo yoo fink compootrs wyll bee ani bettre ayt it?

I bet that took you half a second to work out what the hell I was typing, and all Iw as doing was repeating Simon's answer in a different 'accent'. The processing power just isn't there yet, but it's getting there.

Russell Troywest
And I just noticed I made an error in my typing of "and all IW as saying", which ironically helps my point, I think. That's a bit like a speech tic or stutter, which makes speech recognition even harder than just accent issues....
Russell Troywest
It's not just strange accents - the (English) Speech Recognition tool in Macs fails to recognize even the British accent!
DrJokepu
I'm British and I can't understand some of our regional accents.
Russell Troywest
Exactly. I have no idea what people from Liverpool are talking about half the time.
Simon
Everton. The other half they're talking about Liverpool.
Steve Jessop
A: 

It's not my field, but I do believe it is advancing, just slowly.

And I believe Simon's answer is somewhat correct in a way: part of the problem is that no two people speak alike in terms of the patterns that a computer is programmed to recognize. Thus, it is difficult to analyze speech.

PTBNL
+3  A: 

The variety in language would be the predominant factor making it difficult, and dialects and accents make it more complicated still. There is also context: "The book was read." "The book was red." How do you determine the difference? The extra effort needed to resolve this would make it easier to just type the thing in the first place.
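The "read"/"red" ambiguity is exactly what recognizers attack with a language model: each homophone candidate is scored by how plausible it is next to the surrounding words. A toy Python sketch - the bigram counts here are invented purely for illustration, not taken from any real corpus:

```python
# Hypothetical bigram counts from some text corpus -- illustrative numbers only.
bigram_counts = {
    ("was", "read"): 120,
    ("was", "red"): 45,
    ("book", "was"): 200,
}

def pick_homophone(prev_word, candidates):
    """Choose the candidate forming the more frequent bigram with the previous word."""
    return max(candidates, key=lambda w: bigram_counts.get((prev_word, w), 0))

# The acoustic signal alone cannot decide; the language model tips the balance:
print(pick_homophone("was", ["read", "red"]))
```

Real systems use the same idea at much larger scale (n-gram or neural language models), but the principle - context decides between acoustically identical words - is the same.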

Now, there would probably be more effort devoted to this if it was more necessary, but advances in other forms of data input have come along so quickly that it is not deemed that necessary.

Of course, there are areas where it would be great, even extremely useful or helpful. Situations where you have your hands full or can't look at a screen for input. Helping the disabled etc. But most of these are niche markets which have their own solutions. Maybe some of these are working more towards this, but most environments where computers are used are not good candidates for speech recognition. I prefer my working environment to be quiet. And endless chatter to computers would make crosstalk a realistic problem.

On top of this, unless you are dictating prose to the computer, any other type of input is easier and quicker using keyboard, mouse or touch. I did once try coding using voice input. The whole thing was painful from beginning to end.

Xetius
A: 

Computers are not even very good at natural language processing to start with. They are great at matching but when it comes to inferring, it gets hairy.

Then try to figure out the same word from hundreds of different accents/inflections, and it suddenly doesn't seem so simple.

Tom Hubbard
+1  A: 

Most of the time we humans understand based on context, so that a particular sentence is in harmony with the whole conversation. Unfortunately, computers have a big handicap in this sense: they just try to capture the words, not what's between them.

We would understand a foreigner whose English accent is very poor, maybe guessing what he is trying to say instead of what he is actually saying.

Umair Ahmed
+1  A: 

To recognize speech well, you need to know what people mean - and computers aren't there yet at all.

Michiel de Mare
+1  A: 

Because Lernout&Hauspie went bust :)

(sorry, as a Belgian I couldn't resist)

Philippe Leybaert
+1, exactly my thought when I saw this question. :)
KristoferA - Huagati.com
A: 

Well I have got Google Voice Search on my G1 and it works amazingly well. The answer is, the field is advancing, but you just haven't noticed!

Matt Howells
google voice search is far from speech recognition.
tharkun
@tharkun: Google Voice Search makes heavy use of speech recognition technology.
Jim Ferrans
+1  A: 

You said it yourself: algorithm designers are working on it... but language and speech are not algorithmic constructs. They are the peak of the development of a highly complex human system involving concepts, meta-concepts, syntax, exceptions, grammar, tonality, emotions, neuronal as well as hormonal activity, etc. etc.

Language needs a highly heuristic approach, which is why progress is slow and the prospects are maybe not too optimistic.

tharkun
+13  A: 

I remember reading that Microsoft had a team working on speech recognition, and they called themselves the "Wreck a Nice Beach" team (a name given to them by their own software).

To actually turn speech into words, it's not as simple as mapping discrete sounds; there has to be an understanding of the context as well. The software would need to have a lifetime of human experience encoded in it.

Andrew Shepherd
"Recognize Speech" ~= "Wreck a Nice Beach" example = +1.
Beska
and even then it could/would fail with background noise, new accents, or surprising changes in topic just like a meat bag
jk
+1  A: 

I once asked a similar question of my instructor: what challenge is there in making a speech-to-text converter? Among the answers he gave, he asked me to pronounce 'p' and 'b'. Then he said that they differ for a very small time at the beginning, and after that they sound similar. My point is that it is hard even to recognize which sound was made; recognizing a voice would be even harder. Also, note that once you record people's voices, it is just numbers that you store. Imagine trying to find metrics like accent, frequency, and other parameters useful for identifying a voice from nothing but matrices of numbers. Computers are good at numerical processing, but voice is not really 'numbers': you need to encode voice as numbers and then do all the computation on those.
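The "voice is just numbers" point can be made concrete. From raw samples, a recognizer derives simple features such as the zero-crossing rate, which already helps separate voiced sounds (like 'b', driven by periodic vocal-fold vibration) from unvoiced ones (like 'p', closer to a noise burst). A sketch with synthetic stand-in signals - the sine wave and white noise here are illustrative, not real speech:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs (a classic voicing cue)."""
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2

sr = 16000                                # 16 kHz sample rate
t = np.arange(sr) / sr                    # one second of samples
voiced = np.sin(2 * np.pi * 120 * t)      # 120 Hz tone, standing in for voiced pitch
rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(sr)        # white noise, standing in for an unvoiced burst

print(zero_crossing_rate(voiced))         # low: the tone crosses zero rarely
print(zero_crossing_rate(unvoiced))       # near 0.5: noise flips sign constantly
```

Real front-ends use richer features (spectrograms, MFCCs), but even this single number shows how acoustic distinctions become arithmetic on arrays.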

Actually, the difference between 'p' and 'b' is not precisely so much in initial sound as the voiced vs. unvoiced aspect of them. They are definitely similar, both being bilabial plosives, but the voiced aspect of b is what differentiates it from the unvoiced p.
Beska
+2  A: 

The basic problem is that human language is ambiguous. Therefore, in order to understand speech, the computer (or human) needs to understand the context of what is being spoken. That context is actually the physical world the speaker and listener inhabit. And no AI program has yet demonstrated a deep understanding of the physical world.

anon
I think SHRDLU, by Terry Winograd, had a somewhat deep understanding of the physical world. At least a small part of it.
Walter Mitty
I don't think it understood anything. If you asked it to move the "six-sided solid object whose colour is the same as my tie", I bet it would have had difficulties!
anon
A: 

If speech recognition was possible with substantially less MIPS than the human brain, we really could talk to the animals.

Evolution wouldn't spend all those calories on grey matter if it weren't required to do the job.

soru
+7  A: 

This kind of problem is more general than speech recognition alone. It also exists in vision processing, natural language processing, artificial intelligence, ...

Speech recognition is affected by the semantic gap problem:

The semantic gap characterizes the difference between two descriptions of an object by different linguistic representations, for instance languages or symbols. In computer science, the concept is relevant whenever ordinary human activities, observations, and tasks are transferred into a computational representation.

Between an audio waveform and a textual word, the gap is big.

Between the word and its meaning, it is even bigger...

fa.
A: 

Spoken language is context-sensitive and ambiguous, and computers don't deal well with ambiguous commands.

+1  A: 

I would expect some advances from Google in the future because of their voice data collection through 1-800-GOOG411.

Soldier.moth
Hehe, and yet Google's Speech To Text for voice mail is horrible.
Moshe
A: 

I don't agree with the assumption in the question - I have recently been introduced to Microsoft's speech recognition and am impressed. It can learn my voice after a few minutes and usually identifies common words correctly. It also allows new words to be added. It is certainly usable for my purposes (understanding chemistry).

Differentiate between recognising the (word) tokens and understanding the meaning of them.

I don't yet know about other languages or operating systems.

peter.murray.rust
A: 

The problem is that there are two types of speech recognition engines. Speaker-trained ones such as Dragon are good for dictation. They can recognize almost any spoken text with fairly good accuracy, but require (a) training by the user, and (b) a good microphone.

Speaker-independent speech rec engines are most often used in telephony. They require no "training" by the user, but must know ahead of time exactly what words are expected. The application development effort to create these grammars (and deal with errors) is huge. Telephony is limited to a 4 kHz bandwidth due to historical limits in our public phone network, and this limited audio quality greatly hampers the engines' ability to "hear" what people are saying. Digits such as "six" or "seven" contain an "ssss" sound that is particularly hard for the engines to distinguish, which means that recognizing strings of digits - one of the most basic recognition tasks - is problematic. Add in regional accents, where "nine" is pronounced "nan" in some places, and accuracy really suffers.
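The 4 kHz telephone limit can be illustrated directly: the /s/ in "six" concentrates its energy well above 4 kHz, so band-limiting wipes most of it out. A numpy sketch using an idealized FFT brick-wall low-pass - real phone channels are messier, and the 5-7 kHz noise band is only a crude stand-in for a fricative:

```python
import numpy as np

def bandlimit(signal, sample_rate, cutoff_hz):
    """Zero out all frequency content above cutoff_hz (ideal low-pass filter)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0
    return np.fft.irfft(spectrum, n=len(signal))

sr = 16000
rng = np.random.default_rng(1)
noise = rng.standard_normal(sr)

# Crude stand-in for an /s/ fricative: noise band-passed to 5-7 kHz.
spec = np.fft.rfft(noise)
f = np.fft.rfftfreq(sr, d=1.0 / sr)
spec[(f < 5000) | (f > 7000)] = 0
fricative = np.fft.irfft(spec, n=sr)

phone = bandlimit(fricative, sr, 4000)   # simulate the telephone channel
print(np.sum(phone**2) / np.sum(fricative**2))  # nearly all fricative energy is gone
```

Vowels, whose energy sits mostly below 4 kHz, pass through largely intact; it is precisely the consonant cues that the channel destroys.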

The best hope is interfaces that combine graphics and speech rec. Think of an iPhone application that you can control with your voice.

IanRae
+20  A: 
nacmartin
Why wasn't this marked as an answer?
baeltazor
Answered months later.
MiseryIndex
+1 excellent explanation.
Lazer
+1 wonderful answer, very insightful. Too bad it was answered months later...
Yuval A
FYI, Hermansky's paper is from 1997. Hardly "the current state of the research", but very interesting nevertheless.
François