views: 261
answers: 6

How would you find the correct words in a long stream of characters?

Input :

"The revised report onthesyntactictheoriesofsequentialcontrolandstate"

Google's Output:

"The revised report on syntactic theories sequential controlandstate"

(which is close enough, considering how quickly they produce the output)

How do you think Google does it? How would you increase the accuracy?

A: 

Check out spelling-correction algorithms. Here is a link to an article on the algorithm used at Google: http://www.norvig.com/spell-correct.html. It also points to a scientific paper from Google on this topic.

Skarab
Nice link, but only very remotely relevant to the question. Extending this algorithm to find all possible conjunctions of 5 words would be a **huge** waste of resources.
Nikita Rybak
In NLP nothing **comes** easy...
Skarab
+3  A: 

Here is some code in Mathematica that I started to develop for a recent code golf.
It is a minimal-matching, non-greedy, recursive algorithm. That means that the sentence "the pen is mightier than the sword" (without spaces) returns {"the pen is might er than the sword"} :)

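(* findAll repeatedly strips the longest dictionary-word prefix from the remaining
   string; characters that start no dictionary word are skipped one at a time, and
   the result is then flagged with "~" instead of "=".
   findFirst returns {the longest prefix of s that is a dictionary word}, or {} if none. *)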
findAll[s_] :=
  Module[{a = s, b = "", c, sy = "="},
  While[
   StringLength[a] != 0,
   j = "";
   While[(c = findFirst[a]) == {} && StringLength[a] != 0,
    j = j <> StringTake[a, 1];
    sy = "~";
    a = StringDrop[a, 1];
   ];
   b = b <> " " <> j ;
   If[c != {},
    b = b <> " " <> c[[1]];
    a = StringDrop[a, StringLength[c[[1]]]];
   ];
  ];
   Return[{StringTrim[StringReplace[b, "  " -> " "]], sy}];
]

findFirst[s_] :=
  If[s != "" && (c = DictionaryLookup[s]) == {}, 
   findFirst[StringDrop[s, -1]], Return[c]];

Sample Input

ss = {"twodreamstop", 
      "onebackstop", 
      "butterfingers", 
      "dependentrelationship", 
      "payperiodmatchcode", 
      "labordistributioncodedesc", 
      "benefitcalcrulecodedesc", 
      "psaddresstype", 
      "ageconrolnoticeperiod",
      "month05", 
      "as_benefits", 
      "fname"}

Output

 twodreamstop              = two dreams top
 onebackstop               = one backstop
 butterfingers             = butterfingers
 dependentrelationship     = dependent relationship
 payperiodmatchcode        = pay period match code
 labordistributioncodedesc ~ labor distribution coded es c
 benefitcalcrulecodedesc   ~ benefit c a lc rule coded es c
 psaddresstype             ~ p sad dress type
 ageconrolnoticeperiod     ~ age con rol notice period
 month05                   ~ month 05
 as_benefits               ~ as _ benefits
 fname                     ~ f name

HTH

belisarius
Could you please describe the algorithm?
kunjaan
+10  A: 

I would try a recursive algorithm like this:

  • Try inserting a space at each position. If the left part is a word, then recur on the right part.
  • Count the number of valid words / number of total words in all the final outputs. The one with the best ratio is likely your answer.

For example, giving it "thesentenceisgood" would run:

thesentenceisgood
the sentenceisgood
    sent enceisgood
         enceisgood: OUT1: the sent enceisgood, 2/3
    sentence isgood
             is good
                go od: OUT2: the sentence is go od, 4/5
             is good: OUT3: the sentence is good, 4/4
    sentenceisgood: OUT4: the sentenceisgood, 1/2
these ntenceisgood
      ntenceisgood: OUT5: these ntenceisgood, 1/2

So you would pick OUT3 as the answer.
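Here is a rough Python sketch of that idea (illustrative only, not code from this answer; WORDS is a tiny hypothetical dictionary standing in for a real word list):

WORDS = {"the", "these", "sent", "sentence", "is", "go", "good"}

def segmentations(s):
    # Yield candidate token lists: split off every dictionary-word prefix and
    # recurse on the remainder; the remainder itself is always one candidate.
    yield [s]
    for i in range(1, len(s)):
        left, right = s[:i], s[i:]
        if left in WORDS:
            for rest in segmentations(right):
                yield [left] + rest

def best_split(s):
    # Score each candidate by valid words / total tokens and keep the best one.
    score = lambda tokens: sum(t in WORDS for t in tokens) / len(tokens)
    return max(segmentations(s), key=score)

print(" ".join(best_split("thesentenceisgood")))   # -> the sentence is good

Note that this enumerates every candidate, so it is exponential in the worst case; memoizing on the remaining suffix (dynamic programming) would tame that, as later answers point out.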

Claudiu
+1 - But "the sentence is goo d" could give 5/5, depending on the dictionary. Which raises the question of how to evaluate two complete parses, such as the infamous "the pen is mightier than the sword" example.
Jeffrey L Whitledge
@Jeffrey: heh, we think alike. so OUT2, OUT3, and OUTJEFFREY could all be picked. you might want to weigh it in favor of longer words somehow, since I think these errors will tend to happen by creating more shorter words than necessary. or you can do some NLP things, identify them as parts of speech and see which ones are most likely..
Claudiu
I didn't say optimal, I said I'd try this and (implicitly) it might be good enough. And my alg would pick mississippi, since that is 1, whereas "miss issippi" is 0.5.
Claudiu
@Claudiu, sorry I tried to delete my comment
piccolbo
+6  A: 

Try a stochastic regular grammar (equivalent to hidden Markov models) with the following rules:

for every word in a dictionary:
    stream -> word_i stream             with probability p_w
    word_i -> letter_i1 ... letter_in   with probability q_w (this is the spelling of word_i)
stream -> letter stream                 with probability p (for any letter)
stream -> epsilon                       with probability 1

The probabilities could be derived from a training set, but see the following discussion. The most likely parse is computed using the Viterbi algorithm, which has quadratic time complexity in the number of hidden states, in this case your vocabulary, so you could run into speed issues with large vocabularies.

But what if you set all p_w = 1, q_w = 1, and p = .5? This means these are probabilities in an artificial language model where all words are equally likely and all non-words are equally unlikely. Of course you could segment better if you didn't use this simplification, but the algorithm complexity goes down by quite a bit. If you look at the recurrence relation in the Wikipedia entry, you can try to simplify it for this special case. The Viterbi parse probability up to position k simplifies to

VP(k) = max_l(VP(k-l) * (1 if text[k-l:k] is a word else .5^l))

You can bound l with the maximum length of a word and check whether l letters form a word with a hash lookup. The complexity of this is independent of the vocabulary size and is O(<text length> * <max l>). Sorry, this is not a proof, just a sketch, but it should get you going.

Another potential optimization: if you create a trie of the dictionary, you can check whether a substring is a prefix of any correct word. So when you query text[k-l:k] and get a negative answer, you already know that the same is true for text[k-l:k+d] for any d. To take advantage of this you would have to rearrange the recursion significantly, so I am not sure it can be fully exploited (it can, see the comments below).
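Here is a rough Python sketch of that simplified recurrence (illustrative only; WORDS is a hypothetical hash-backed dictionary, and the 1 / .5^l scores stand in for trained probabilities):

WORDS = {"the", "sentence", "is", "good"}
MAX_L = max(map(len, WORDS))

def segment(text):
    n = len(text)
    # best[k] = (probability of the best parse of text[:k], split point achieving it)
    best = [(1.0, 0)] + [(0.0, 0)] * n
    for k in range(1, n + 1):
        for l in range(1, min(k, MAX_L) + 1):
            piece = text[k - l:k]
            p = best[k - l][0] * (1.0 if piece in WORDS else 0.5 ** l)
            if p > best[k][0]:
                best[k] = (p, k - l)
    # Walk the stored split points backwards to recover the tokens.
    tokens, k = [], n
    while k > 0:
        j = best[k][1]
        tokens.append(text[j:k])
        k = j
    return list(reversed(tokens))

print(segment("thesentenceisgood"))   # -> ['the', 'sentence', 'is', 'good']

Bounding l at the maximum word length is safe even for long non-word stretches, because consuming them one letter at a time (l = 1) yields the same .5^l product.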

piccolbo
Actually, I just convinced myself that you can use the potential optimization I described. Imagine a T x T matrix, where T is the length of the text. An entry (i, j) is 0 if text[i:j] is a word, otherwise it is 1. Now, if you fill this matrix (you need only a diagonal band up to the max word length) with a nested loop for i: for j:, then as soon as you realize that text[i:j] is not a prefix of any word, you can fill the rest with 1s. This gives you an advantage if you use a sparse representation, like a defaultdict that defaults to 1, so that you actually need to write only the 0s.
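A rough Python rendering of that fill (illustrative; a plain set of prefixes stands in for the trie):

from collections import defaultdict

WORDS = {"the", "sentence", "is", "good"}
PREFIXES = {w[:i] for w in WORDS for i in range(1, len(w) + 1)}
MAX_L = max(map(len, WORDS))

def non_word_matrix(text):
    # notword[(i, j)] is 0 if text[i:j] is a dictionary word, else 1 (the default);
    # the sparse defaultdict means only the 0 entries are ever written.
    notword = defaultdict(lambda: 1)
    for i in range(len(text)):
        for j in range(i + 1, min(i + MAX_L, len(text)) + 1):
            if text[i:j] not in PREFIXES:
                break          # no word starts with text[i:j], so stop extending j
            if text[i:j] in WORDS:
                notword[(i, j)] = 0
    return notword

print(dict(non_word_matrix("thesentenceisgood")))
# -> {(0, 3): 0, (3, 11): 0, (11, 13): 0, (13, 17): 0}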
piccolbo
This form of recursion is the first known example of dynamic programming and is found in a paper by Bellman from the '60s (the application is signal processing).
piccolbo
To make it more accurate you could use a text corpus to derive the probabilities; see for instance http://www.cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html, problem 3, and even look at pairs of words or more, introducing rules like stream -> word_i word_j stream. You could also use existing n-gram data, like the huge corpus released by Google (not free, and noncommercial use only, though). But there is a computational cost to pay.
piccolbo
A: 

Duplicate.

http://stackoverflow.com/questions/3856630/how-to-separate-words-in-a-sentence-with-spaces

Answered:

http://blog.nu42.com/2010/10/sometimes-brute-force-solution-is.html

If you do not need 100% accuracy, I like the Google solution¹. ;-)

#!/bin/bash
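# For each concatenated token, ask Google and scrape its "Did you mean to
# search for ..." suggestion (printed from field 7 onward of that line).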

for i in $(cat concatenated.txt); do
  echo -n "$i : "
  lynx -dump http://www.google.com/search?q=$i | \
    grep "Did you mean to search for" | \
    awk '{ for( i = 7; i <= NF; i++ ) printf( "%s ", $i ) }'
  echo
  sleep 1
done

¹ Breaks Google TOS.

Dave Jarvis
Yes, if you are happy with that answer. Anybody can write a brute force solution, but this problem has a nice dynamic programming solution.
piccolbo
A: 

After doing the recursive splitting and dictionary lookup, to increase the quality of the word pairs in your phrase you might want to employ the mutual information of word pairs.

This essentially means going through a training set and finding the M.I. values of word pairs, which tell you that Albert Simpson is less likely than Albert Einstein :)

You can try searching Science Direct for academic papers on this theme. For basic information on mutual information, see http://en.wikipedia.org/wiki/Mutual_information
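As a rough illustration (not the project code), here is a Python sketch of pointwise mutual information, PMI(x, y) = log(p(x, y) / (p(x) p(y))), estimated over adjacent word pairs in a hypothetical toy corpus:

import math
from collections import Counter

def pmi_table(tokens):
    # Estimate PMI for each adjacent word pair from raw counts.
    words = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    n_words, n_pairs = sum(words.values()), sum(pairs.values())
    return {
        (x, y): math.log((c / n_pairs) / ((words[x] / n_words) * (words[y] / n_words)))
        for (x, y), c in pairs.items()
    }

corpus = "albert einstein wrote that albert einstein liked physics".split()
print(pmi_table(corpus)[("albert", "einstein")])   # high score: this pair co-occurs more than chance

Pairs that co-occur more often than chance ("albert einstein") get a high score, while implausible pairs ("albert simpson") score low or never appear.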

Last year I was involved in the phrase-search part of a search engine project, in which I was trying to parse the Wikipedia dataset and rank each word pair. I've got the code in C++ and could share it with you if you can find some use for it. It parses Wikimedia and, for every word pair, finds the mutual information.