I have a huge list of people's full names that I must search for in a huge text.

Only part of the name may appear in the text, and it may be misspelled, mistyped, or abbreviated. The text is not tokenized, so I don't know where a person's name starts, and I don't know whether the name will appear in the text at all.

Example:

I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:

  • ...The candidate Barack Obama was elected the president of the United States... (incomplete)
  • ...The candidate Barack Hussein was elected the president of the United States... (incomplete)
  • ...The candidate Barack H. O. was elected the president of the United States... (abbreviated)
  • ...The candidate Barack ObaNa was elected the president of the United States... (misspelled)
  • ...The candidate Barack OVama was elected the president of the United States... (mistyped, B is next to V)
  • ...The candidate John McCain lost the election... (no occurrence of the Obama name)

Certainly there isn't a deterministic solution for it, but...

What is a good heuristic for this kind of search?

If you had to, how would you do it?

A: 

The best way I can think of would be to define grammars in Python NLTK. However, it can get quite complicated for what you want.

I'd personally go for regular expressions, generating a list of permutations with some programming.
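As a sketch of the permutation idea (the helper names here are my own invention, not from the answer): from one full name, generate loose patterns in which each part may be written out or abbreviated to its initial, and any subset of at least two parts may appear in order.

```python
import itertools
import re

def name_patterns(full_name):
    """Build loose regexes from one full name: each part may be written
    out or abbreviated to its initial, and parts may be dropped, as long
    as at least two of them appear in order."""
    parts = full_name.split()
    alts = [r"(?:%s|%s\.)" % (re.escape(p), re.escape(p[0])) for p in parts]
    patterns = []
    for r in range(2, len(parts) + 1):
        for combo in itertools.combinations(alts, r):
            patterns.append(r"\s+".join(combo))
    return patterns

def name_occurs(full_name, text):
    """True if any permutation pattern of the name matches the text."""
    return any(re.search(p, text) for p in name_patterns(full_name))
```

This catches the incomplete and abbreviated cases from the question ("Barack Obama", "Barack Hussein", "Barack H. O.") but not the misspelled ones; for those you'd combine it with Soundex or edit distance, as the other answers suggest.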

Eric
+5  A: 

Split everything on spaces, removing special characters (commas, periods, etc.). Then use something like Soundex to handle misspellings. Or you could go with something like Lucene if you need to search a lot of documents.
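A minimal Soundex sketch (simplified: it skips the full algorithm's special handling of h and w) shows why it catches the misspellings from the question:

```python
def soundex(word):
    """Simplified Soundex: keep the first letter, encode the rest as
    digits by sound class, collapse adjacent duplicates, pad to 4 chars."""
    mapping = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            mapping[ch] = digit
    word = word.lower()
    head = word[0].upper()
    out = []
    prev = mapping.get(word[0], "")
    for ch in word[1:]:
        d = mapping.get(ch, "")   # vowels and h/w/y encode as ""
        if d and d != prev:
            out.append(d)
        prev = d
    return (head + "".join(out) + "000")[:4]
```

"Obama", "ObaNa", and "OVama" all encode to the same code (m/n share one digit, and so do b/v), while "McCain" encodes differently.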

joegtp
I like the soundex idea to handle misspellings. It's probably not going to run very fast for huge lists, but it's still an interesting idea.
Bill the Lizard
If you are using MSSQL there are two sql-functions SOUNDEX and DIFFERENCE ready to use.
Stefan
Does SOUNDEX work well for non-English languages? In my case it is Brazilian Portuguese.
Daniel Silveira
How about abbreviations?
Daniel Silveira
@Daniel, look here: http://stackoverflow.com/questions/299949/sql-servers-soundex-function-on-non-latin-character-sets. It is not that hard to create your own version of Soundex, so you could do it yourself.
Stefan
Check here to see how Soundex works: http://www.ancestry.com/learn/library/article.aspx?article=2253
Stefan
For misspelled or mistyped words I'm thinking about Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance). But how do I check for incomplete names?
Daniel Silveira
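One way to make the Levenshtein idea from the comment above cope with incomplete names (the scoring scheme here is just one possible heuristic, not an established method): match each part of the name against the text tokens independently, and score the name by the fraction of its parts found.

```python
import re

def levenshtein(a, b):
    """Classic edit distance via the row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_name_score(full_name, text, max_dist=1):
    """Fraction of the name's parts that appear in the text, allowing
    up to max_dist edits per part; incomplete names just score lower."""
    parts = full_name.lower().split()
    tokens = re.findall(r"\w+", text.lower())
    hits = sum(1 for p in parts
               if any(levenshtein(p, t) <= max_dist for t in tokens))
    return hits / len(parts)
```

For "Barack Hussein Obama" against "...Barack ObaNa was elected..." this scores 2/3 (Barack exact, ObaNa within one edit of Obama, Hussein absent), and 0 against the McCain sentence; you would then threshold the score.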
+1  A: 

At first blush I'd go for an indexing server: Lucene, FAST, or Microsoft Indexing Server.

Dan Blair
A: 

Both SQL Server and Oracle have a built-in SOUNDEX function.

Additionally, SQL Server has a built-in function called DIFFERENCE that can be used.

Brian Schmitt
A: 

Plain old regular expression scripting will do the job.

Use Ruby; it's quite fast. Read lines and match words.

cheers

cnicolaou
+2  A: 

What you want is a Natural Language Processing library. You are trying to identify a subset of proper nouns. If names are the main source of proper nouns, it will be easy; if a decent number of other proper nouns are mixed in, it will be more difficult. If you are writing in Java, look at OpenNLP; for C#, SharpNLP. After extracting all the proper nouns, you could probably use WordNet to remove most non-name proper nouns. You may be able to use WordNet to identify subparts of names like "John" and then search the neighboring tokens to pick up other parts of the name. You will have problems with something like "John Smith Industries". You will have to look at your underlying data to see if there are features you can take advantage of to help narrow the problem.

Using an NLP solution is the only really robust technique I have seen for similar problems. You may still have issues, since 200 pages is actually fairly small. Ideally you would have more text and be able to use more statistical techniques to help disambiguate between names and non-names.
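A crude stand-in for the proper-noun extraction step (a real NLP library like OpenNLP does far better; this capitalization regex is my own simplification): treat maximal runs of capitalized words as candidate proper nouns, then match those candidates against the name list.

```python
import re

def candidate_proper_nouns(text):
    """Maximal runs of capitalized words, as a rough proxy for an NER
    pass. Sentence-initial words will produce false candidates."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)
```

On the question's sample sentence this pulls out "Barack Obama" and "United States" (plus the sentence-initial "The"), shrinking the fuzzy-matching problem from the whole text to a short candidate list.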

Steve
+4  A: 

You said it's about 200 pages.

Divide it into 200 one-page PDFs.

Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.

Joel Spolsky
$5 or $0.05? $1,000 could be cost-prohibitive.
John Sheehan
Joel has a lot of interns to feed - those Aeron chairs don't pay for themselves!
Martin Beckett
Compared to the cost of writing code to do it, $1,000 seems like a bargain. But I suspect you could probably get Mechanical Turk workers to search a page for names for about $0.50... try one or two; it's supposed to be easy!
Joel Spolsky
A: 

I would use C# and LINQ. I'd tokenize all the words on spaces and then use LINQ to sort the text (possibly using the Distinct() function) to isolate the text I'm interested in. While manipulating the text, I'd keep track of the indexes (which you can do with LINQ) so that I could relocate the text in the original document, if that's a requirement.
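The index-tracking part of this answer can be sketched in a few lines (shown here in Python rather than the C#/LINQ the answer proposes): keep each token's start offset so a match can be located back in the original document.

```python
import re

def tokens_with_offsets(text):
    """Split on whitespace while remembering where each token starts
    in the original text, so matches can be relocated later."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]
```

Any fuzzy match found among the tokens then maps straight back to a character position in the source text.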

Guy