views: 56
answers: 4
Hi,

What is the best way to parse large texts (5,000 words or more) and search them for names that are stored in a database? The texts will be multilingual.

My first idea is a rather naive approach: take every word that begins with a capital letter and compare it against the database. But this fails for texts written entirely in lowercase.
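
Roughly what I have in mind (the names and text here are just made-up examples):

    # Naive approach: collect capitalized words and look them up in a
    # name list; the names and text are example data only.
    import re

    known_names = {"Angela", "Merkel", "Obama"}   # would come from the database

    text = "Yesterday Angela Merkel gave a speech in Berlin."

    capitalized = re.findall(r"\b[A-ZÄÖÜ][\w-]*", text)
    found = [word for word in capitalized if word in known_names]
    print(found)   # ['Angela', 'Merkel'] -- but finds nothing in all-lowercase text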

Edit: The texts are not static but dynamic (e.g. web sites).

Best

Macs

+4  A: 

Use your RDBMS's built-in full-text indexing capabilities.

Full-Text Search (SQL Server)

MySQL Full-Text Search Functions

Full Text Indexing using Oracle Text
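
For illustration, a minimal sketch of the idea using SQLite's FTS5 module as a stand-in for the engines linked above (table and column names are made up); each of the products above has its own equivalent syntax:

    # Sketch of full-text indexing, using SQLite's FTS5 module as a
    # stand-in for the linked engines. Requires a Python build whose
    # sqlite3 library has FTS5 compiled in. Table/column names are made up.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE documents USING fts5(url, body)")
    conn.execute(
        "INSERT INTO documents (url, body) VALUES (?, ?)",
        ("http://example.com/page1", "Yesterday Angela Merkel gave a speech."),
    )

    names = ["Angela Merkel", "Barack Obama"]   # would come from the names table
    for name in names:
        # Quote the name so it is searched as a phrase, not as separate terms.
        rows = conn.execute(
            "SELECT url FROM documents WHERE documents MATCH ?",
            ('"' + name + '"',),
        ).fetchall()
        if rows:
            print(name, "found in", [r[0] for r in rows])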

Mitch Wheat
A: 

You can use the Aho-Corasick algorithm and construct a dictionary automaton from the names that you are trying to match. Its running time is linear in the length of the text plus the total length of the names plus the number of matches.
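
A minimal sketch of this approach, assuming the third-party pyahocorasick package is installed; the names are just example data:

    # Aho-Corasick matching via the third-party pyahocorasick package
    # (pip install pyahocorasick); names and text are example data.
    import ahocorasick

    names = ["Angela Merkel", "Barack Obama", "Pierre Dupont"]

    automaton = ahocorasick.Automaton()
    for name in names:
        # Add lowercased keys so matching also works in all-lowercase texts.
        automaton.add_word(name.lower(), name)
    automaton.make_automaton()

    text = "yesterday barack obama met angela merkel in berlin"
    for end_index, name in automaton.iter(text.lower()):
        start_index = end_index - len(name) + 1
        print(name, "found at offset", start_index)

Lowercasing both the keys and the text side-steps the all-lowercase problem mentioned in the question, and the whole text is scanned in a single pass.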

JG
A: 

You will need a dictionary of names.

Or you can try http://www.opencalais.com/, which knows quite a large collection of names.

phsiao
Wow, thanks for that one. It's a real option alongside the other answers :)
macs
A: 

I made a method for replacing multiple strings in a large text here: http://stackoverflow.com/questions/711753/a-better-way-to-replace-many-strings-obfuscation-in-c. Perhaps you can use the same principle.
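
The linked answer is in C#; as a rough Python illustration of the same principle (all names combined into one pattern and matched in a single pass over the text; the names are made up):

    # One-pass search for many names via a single combined regex pattern;
    # a rough illustration of the principle, not the linked C# code.
    import re

    names = ["Angela Merkel", "Barack Obama", "Pierre Dupont"]  # from the database

    # Longest names first so overlapping alternatives prefer the longer match.
    pattern = re.compile(
        "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True)),
        re.IGNORECASE,
    )

    text = "Yesterday barack obama met Angela Merkel in Berlin."
    for match in pattern.finditer(text):
        print(match.group(0), "found at offset", match.start())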

Guffa