I have a huge list of people's full names that I must search for in a huge text.
Only part of a name may appear in the text, and it may be misspelled, mistyped, or abbreviated. The text has no tokens, so I don't know where a person's name starts in the text. And I don't know whether a given name will appear in the text at all.
Example:
I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:
- ...The candidate Barack Obama was elected the president of the United States... (incomplete)
- ...The candidate Barack Hussein was elected the president of the United States... (incomplete)
- ...The candidate Barack H. O. was elected the president of the United States... (abbreviated)
- ...The candidate Barack ObaNa was elected the president of the United States... (misspelled)
- ...The candidate Barack OVama was elected the president of the United States... (mistyped, B is next to V)
- ...The candidate John McCain lost the election... (no occurrence of Obama's name)
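To make the cases above concrete, here is a naive baseline sketch in Python. It is only an illustration of the problem, not a proposed solution: the function names `token_matches` and `matched_parts` and the 0.75 similarity threshold are my own assumptions, and it relies on `difflib.SequenceMatcher` from the standard library for the fuzzy comparison.

```python
import re
from difflib import SequenceMatcher


def token_matches(part, token, threshold=0.75):
    """Return True if a text token plausibly stands for one part of a name.

    `threshold` is an assumed cut-off, not a tuned value.
    """
    if len(token) == 1:
        # Treat a single uppercase letter as an abbreviation ("H." -> "Hussein").
        return token.isupper() and part[0].upper() == token
    # Otherwise compare the two words by character-level similarity.
    ratio = SequenceMatcher(None, part.lower(), token.lower()).ratio()
    return ratio >= threshold


def matched_parts(full_name, text, threshold=0.75):
    """Return the parts of `full_name` that have a plausible match in `text`."""
    # Crude word split on runs of ASCII letters; punctuation is dropped.
    tokens = re.findall(r"[A-Za-z]+", text)
    return [
        part
        for part in full_name.split()
        if any(token_matches(part, tok, threshold) for tok in tokens)
    ]


name = "Barack Hussein Obama"
print(matched_parts(name, "The candidate Barack ObaNa was elected the president"))
print(matched_parts(name, "The candidate John McCain lost the election"))
```

This compares every name part with every word, so it ignores word order and adjacency, knows nothing about keyboard layout, and is far too slow for a huge list against a huge text. That is exactly why I am looking for a better heuristic.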
Certainly there isn't a deterministic solution for it, but...
What is a good heuristic for this kind of search?
If you had to, how would you do it?