I am looking for a reference database that can be used to test for possible name typos in a contact database. This is for a batch process, so performance isn't a real issue. Ideally I'd like a comprehensive database, but even something like "top 5000" would go a long way.

Thanks!

+14  A: 

I don't know about a database, but populating one yourself from a resource such as this http://www.census.gov/genealogy/names/dist.all.last should work fine :)
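
A rough sketch of how such a list could be used in a batch check, in Python. It assumes each line of the file starts with the surname followed by whitespace-separated numeric columns (the `sample` rows below are illustrative, not copied from the file), and it uses the standard library's `difflib.get_close_matches` for the fuzzy lookup. The function names `load_names` and `suspicious` and the 0.8 cutoff are made up for this example.

```python
import difflib

def load_names(lines):
    """Collect the first whitespace-separated token of each line as a known name."""
    names = set()
    for line in lines:
        parts = line.split()
        if parts:
            names.add(parts[0].upper())
    return names

def suspicious(name, known, cutoff=0.8):
    """Return close known names if `name` is not itself known, else an empty list."""
    key = name.strip().upper()
    if key in known:
        return []  # exact hit: nothing to flag
    return difflib.get_close_matches(key, sorted(known), n=3, cutoff=cutoff)

# Illustrative rows in the assumed layout: name, then numeric columns.
sample = [
    "SMITH          1.006  1.006      1",
    "JOHNSON        0.810  1.816      2",
    "WILLIAMS       0.699  2.515      3",
]
known = load_names(sample)
print(suspicious("Jonhson", known))  # candidates for a possible typo
print(suspicious("Smith", known))    # known name, so nothing is flagged
```

A hit here only means "worth a human look", not "definitely a typo": a name that is missing from the list may simply be rare.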

Supernovah
Bear in mind, though, that these are the top-x for the US of A. For other places, you'd have to get this data from the respective census authority.
Piskvor
Thanks- just what I needed!
+12  A: 

I don't understand how you can find typos in names. I mean, my first name is Philippe (French), but it can be Philip, Philips, Felipe, Fèlipe, or anything else. Likewise, there is a traditional French name, Sandrine, but there is a trend to write it Cendrine, all the more since the law on names was recently relaxed in France. And so on.
OK, perhaps a Jhon smells like a typo (a common two-letter inversion), but you can't tell for sure.
Typos in last names are even harder to detect... unless you check against a limited, known list (employees of a company, for example).
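
For what it's worth, the two-letter inversion case against a known list could be sketched like this in Python. The helper names and the tiny `known_names` set are invented for the example; a real list would come from whatever limited source you trust.

```python
def adjacent_swaps(name):
    """Yield every string obtained by swapping two neighbouring, different letters."""
    for i in range(len(name) - 1):
        if name[i] != name[i + 1]:
            yield name[:i] + name[i + 1] + name[i] + name[i + 2:]

def transposition_candidates(name, known):
    """Return known names that are exactly one adjacent swap away from `name`."""
    key = name.strip().lower()
    if key in known:
        return []  # already a known spelling, nothing to suggest
    return sorted({c for c in adjacent_swaps(key) if c in known})

# Hypothetical known list, e.g. the employees of a company.
known_names = {"john", "philippe", "philip", "felipe", "sandrine"}

print(transposition_candidates("Jhon", known_names))      # one swap away from a known name
print(transposition_candidates("Cendrine", known_names))  # no swap helps, so no suggestion
```

Note that Cendrine gets no suggestion at all, which is the point made above: an unknown spelling is not necessarily a typo, and the check can only propose candidates.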

PhiLho
+5  A: 

I know a first name database http://www.lexique.org/public/Prenoms100.zip which covers Phil, Phile, Philip, Philipp, Phillip, Felipe, Philippe. (around 12000 first names)

I think you won't find anything useful for last names, as they are far more numerous than first names. This is a known problem in computational linguistics.

Mapad
+1  A: 

If there is no additional language information involved, this can be pretty useless. I would not spend effort on this, as it probably works for only a small percentage of the population.

PS: Don't forget the Chinese, Russian and Indian names (millions)

Drejc
+1  A: 

I personally know people who have unique names (names their parents deliberately made up to be unique), and I also personally know people whose names appear to be misspelled but are actually what their parents named them. I would not even attempt to fix name typos. What we do instead is import the names (and we require a unique identifier to come from our clients). Then the next time we import, we match on the unique identifier. If the name was changed on our side (because we contacted the person and he or she told us what to change it to), then the name is not updated. But if the name was not changed on our side and it is different in the file (usually because of a marriage or divorce), then the name is updated. You'll need some kind of flag on the data record to tell that it was updated manually. We populate this through a trigger.
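
Roughly, the update rule looks like the Python sketch below. This is only an illustration, not our actual code: the `Contact` class, the `manually_updated` field name and the `apply_import` function are made up here, and in our system the flag is set by a database trigger rather than in application code.

```python
from dataclasses import dataclass

@dataclass
class Contact:
    external_id: str                # unique identifier supplied by the data source
    name: str
    manually_updated: bool = False  # hypothetical flag; set when a person corrected the name

def apply_import(existing, incoming):
    """Merge (external_id, name) pairs into `existing`, a dict keyed by external_id."""
    for external_id, name in incoming:
        record = existing.get(external_id)
        if record is None:
            # New identifier: create the record instead of guessing a match by name.
            existing[external_id] = Contact(external_id, name)
        elif not record.manually_updated and record.name != name:
            # Never corrected by hand and the file differs: trust the file.
            record.name = name
        # Otherwise keep the stored name (manual correction wins, or nothing changed).
    return existing

contacts = {
    "c1": Contact("c1", "Jon Smith", manually_updated=True),
    "c2": Contact("c2", "Ann Jones"),
}
apply_import(contacts, [("c1", "John Smith"), ("c2", "Ann Brown"), ("c3", "Lee Wong")])
for contact in contacts.values():
    print(contact)
```

Matching is done on the identifier only, so a name that looks like a typo is never "fixed" automatically.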

Far more important when importing name data is to avoid creating duplicates (hence our requirement for a unique identifier from our data sources) and to avoid matching records incorrectly (you can't consider the name alone when checking whether a record already exists).

HLGEM