I'm curious what programming term or methodology describes what Google does when it shows you the "did you mean" link for a query that is made up of multiple words run together.

For example, if I type in "redflower.jpg", it knows to break that up into "Red Flower". Is there a common paradigm for doing that sort of operation? Would a Lucene search give you that?

thanks!

+3  A: 

If Google does not see a lot of matching results for redflowers.jpg, it might then try splitting the string into multiple words until it finds a split that returns a lot of matching results.

It might also recognize the image extension (.jpg) and then try to find images with a similar name.

If I had to write an algorithm like this, I would use a huge EXISTING database (either a dictionary or a search engine) and then try what I described at the beginning of my post.

Lobsterm
Ahh the advantages of being an incredibly wealthy company with a massively-used, massively scalable backend system...
notJim
Yes! Specifically, it probably uses a dictionary to recognize the fact that Red and Flower are words, and then uses the probability of particular phrases occurring in whatever language it thinks you're using to discover which phrase is most likely. For example, it proposes "Red Flower" as opposed to "Redfl Ower" or "Red FL ower" or "Red Flow Er" because "Red Flower" is much more probable.
nearlymonolith
but how would they know how to cut the words at the right points?
They would not know before trying it. Like Anthony Morelli said, if you take RedFlowers, they would try to find at least one popular word:
Re -> is a popular word
Red -> is even more popular
Redf -> is obviously not popular
[...]
And the popularity is based on the number of results you would get when searching for those specific words.
Lobsterm
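The idea in the comments above (known words plus "which split is most probable") can be sketched as a small dynamic-programming word segmenter. This is only an illustration under stated assumptions, not a description of what Google actually does: the `WORD_COUNTS` table is made up and stands in for a real dictionary or search-hit counts, and the names `segment`, `word_log_prob`, and `strip_extension` are hypothetical.

```python
import math
from functools import lru_cache

# Hypothetical popularity counts; a real system would use a large corpus
# or the number of search results per word.
WORD_COUNTS = {
    "red": 500, "flower": 300, "flowers": 200, "re": 50,
    "flow": 150, "er": 5, "low": 120, "ow": 2, "fl": 1,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)
IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif"}


def strip_extension(query):
    """Drop a trailing image extension such as '.jpg', if there is one."""
    stem, dot, ext = query.rpartition(".")
    return stem if dot and ext.lower() in IMAGE_EXTENSIONS else query


def word_log_prob(word):
    """Log-probability of a chunk; unknown chunks pay a penalty per character."""
    count = WORD_COUNTS.get(word)
    if count is None:
        return len(word) * math.log(1.0 / (TOTAL * 10))
    return math.log(count / TOTAL)


def segment(text):
    """Return the most probable split of `text` under a unigram model."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best (score, words) for the suffix text[start:].
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            rest_score, rest_words = best(end)
            word = text[start:end]
            candidates.append((word_log_prob(word) + rest_score, (word,) + rest_words))
        return max(candidates)

    return list(best(0)[1])


print(segment(strip_extension("redflower.jpg")))  # ['red', 'flower']
```

With these made-up counts, "red flower" beats "red flow er" because the product of the word probabilities is higher, and anything containing a non-word such as "redfl" is penalised heavily. Scoring whole splits rather than greedily taking the first popular prefix is what keeps "re" from winning over "red".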
A: 

Perhaps they could look at what other people do when they have searched for redflowers.jpg? Maybe a number of people searched for "redflowers.jpg", didn't click on any links, and then searched for "Red Flower" and found some results worth clicking on.

Of course they would have to take into account that the queries are similar (contain matching strings), otherwise some strange results might appear.

destinsmithn
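A minimal sketch of the query-log idea in this answer, assuming a hypothetical log format where each session is a list of `(query, clicked_any_result)` pairs in the order they were issued. The `SESSIONS` data, the `min_support` threshold, and the function names are invented for illustration; the similarity check is just one simple way to require that the two queries "contain matching strings".

```python
from collections import Counter

# Hypothetical session log: each session is one user's queries in order.
SESSIONS = [
    [("redflowers.jpg", False), ("red flowers", True)],
    [("redflowers.jpg", False), ("red flowers", True)],
    [("redflowers.jpg", False), ("cheap flights", True)],
    [("redflowers.jpg", True)],
]


def normalise(query):
    """Lowercase and keep only letters and digits, so spacing differences vanish."""
    return "".join(ch for ch in query.lower() if ch.isalnum())


def is_similar(original, rewrite):
    """True when one normalised query is contained in the other."""
    a, b = normalise(original), normalise(rewrite)
    return bool(a) and bool(b) and (a in b or b in a)


def suggest(query, sessions, min_support=2):
    """Suggest the follow-up query users most often succeeded with, if any."""
    counts = Counter()
    for session in sessions:
        # Look at each consecutive pair of queries within a session.
        for (first, first_clicked), (second, second_clicked) in zip(session, session[1:]):
            abandoned = first == query and not first_clicked
            if abandoned and second_clicked and is_similar(first, second):
                counts[second] += 1
    if not counts:
        return None
    best, support = counts.most_common(1)[0]
    return best if support >= min_support else None


print(suggest("redflowers.jpg", SESSIONS))  # red flowers
```

The unrelated follow-up "cheap flights" is ignored because it shares no text with the original query, which is the guard against strange suggestions mentioned above.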