views:

1000

answers:

8

I am trying to determine what the best way is to find variations of a first name in a database. For example, I search for Bill Smith. I would like it return "Bill Smith", obviously, but I would also like it to return "William Smith", or "Billy Smith", or even "Willy Smith". My initial thought was to build a first name hierarchy, but I do not know where I could obtain such data, if it even exists.

Since users can search the directory, I thought this would be a key feature. For example, people I went to school with called me Joe, but I always go by Joseph now. So, I was looking at doing a phonetic search on the last name, either with NYSIIS or Double Metaphone and then searching on the first name using this name heirarchy. Is there a better way to do this - maybe some sort of graded relevance using a full text search on the full name instead of a two part search on the first and last name? Part of me thinks that if I stored a name as a single value instead of multiple values, it might facilitate more search options at the expense of being able to address a user by the first name.

As far as platform, I am using SQL Server 2005 - however, I don't have a problem shifting some of the matching into the code; for example, pre-seeding the phonetic keys for a user, since they wouldn't change.

Any thoughts or guidance would be appreciated. Countless searches have pretty much turned up empty. Thanks!

Edit: It seems that there are two very distinct camps on the functionality and I am definitely sitting in the middle right now. I could see the argument of a full-text search - most likely done with a lack of data normalization, and a multi-part approach that uses different criteria for different parts of the name.

The problem ultimately comes down to user intent. The Bill / William example is a good one, because it shows the mutation of a first name based upon the formality of the usage. I think that building a name hierarchy is the more accurate (and extensible) solution, but is going to be far more complex. The fuzzy search approach is easier to implement at the expense of accuracy. Is this a fair comparison?

Resolution: Upon doing some tests, I have determined to go with an approach where the initial registration will take a full name and I will split it out into multiple fields (forename, surname, middle, suffix, etc.). Since I am sure that it won't be perfect, I will allow the user to edit the "parts", including adding a maiden or alternate name. As far as searching goes, with either solution I am going to need to maintain what variations exists, either in a database table, or as a thesaurus. Neither have an advantage over the other in this case. I think it is going to come down to performance, and I will have to actually run some benchmarks to determine which is best. Thank you, everyone, for your input!

+1  A: 

I think your basic approach is solid. I don't think fulltext is going to help you. For seeding, behindthename.com seems to have large amount of the data you want.

chaos
That is definitely the sort of data that I am looking for, but it does bring me back to a problem. Looking at the sources for most of that data - it is sited as being user-contributed or from books, meaning that there is a large manual effort in creating and maintaining it.
joseph.ferris
Also, thank you for the link. I wasn't aware of that site before. (I ran out of room on my last comment.)
joseph.ferris
+1  A: 

Are you using SQl Server 2005 Express with Advanced Services as to me it sounds you would benefit from the Full Text indexing and more specifically Contains and Containstable which you can use with specific instructions here is a link for the uses of Containstable:

http://msdn.microsoft.com/en-us/library/ms189760.aspx

and here is the download link for SQL Server 2005 With Advanced Services:

http://www.microsoft.com/downloads/details.aspx?familyid=4C6BA9FD-319A-4887-BC75-3B02B5E48A40&displaylang=en

Hope this helps,

Andrew

REA_ANDREW
For my development environment I have SQL Server 2005 Enterprise. I will need to get specific information from my hosting provider as to their configuration. How does Advanced Services apply to a Standard or Enterprise edition - are the features already available? Thank you for the leads!
joseph.ferris
Yes those services will definitely be included in the enterprise edition. I am just glad though that they released these advanced services for the express edition because I have found them too useful. :-) Glad I could be of some help!
REA_ANDREW
+1  A: 

You can use the SQL Server Full Text Search and do an inflectional search.

Basically like:

SELECT ProductId, ProductName FROM ProductModel WHERE CONTAINS(CatalogDescription, ' FORMSOF(THESAURUS, metal) ')

Check out: http://en.wikipedia.org/wiki/SQL_Server_Full_Text_Search#Inflectional_Searches http://msdn.microsoft.com/en-us/library/ms345119.aspx http://www.mssqltips.com/tip.asp?tip=1491

Andrew Barrett
Thank you! That seems promising. It would imply that I would be more likely to be searching on a single "name" as opposed to a normalized version of the name, which I am fine with.
joseph.ferris
+4  A: 

In my opinion you should either do a feature right and make it complete, or you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time ("Looks like you're writing a letter", anyone?).

In case of human names, a computer will get it wrong most of the time, doing it right and complete is impossible, IMHO. Maybe you can hack something that does the most common English names. But actually, the intelligence to look for both "Bill" and "William" is built into almost any English speaking person - I would leave it to them to connect the dots.

Tomalak
+1 for "you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time".
MicSim
While I agree in principle, I think that leaving first name suggestion functionality out is lazy. I don't anticipate it be 100% accurate, but am not sure that less than perfect makes something half-assed. While a person can figure it out, they also expects you to guess their intent. I'm torn.
joseph.ferris
I'm just saying that you are probably going to implement something that only catches most of the cases. That you can think of. And you increase the chance of false positives, which are a lot more annoying (to me at least) than *explainable* non-matches. I think the problem domain is just too small.
Tomalak
I definitely wasn't intending to disagree with you. The problem is more about meeting user expectations. I've already conceded the fact that it will work better for English names, no matter what the solution. Signal to noise ratio is going to be directly related to the amount of effort applied.
joseph.ferris
And the amount of effort applied will be directly connected to the overall performance. :-) But I know it's hard to *not do* something the client wants. It's just the "tech-head" view on things, I guess.
Tomalak
Exactly. While we know what the result vs. effort will be, the user is ultimately oblivous to this and just expects that everything works a certain way. To spin a Kevin Smith quote, "This job would be great if it wasn't for the f****' users."
joseph.ferris
+1  A: 

No, Full Text searches will not help to solve your problem.

I think you might want to take a look at some of the following links: (Funny, no one mentioned SoundEx till now)

Basically SoundEx allows you to evaluate the level of similarity in similar sounding words. The function is also available on SQL 2005.

As a side issue, instead of returning similar results, it might prove more intuitive to the user to use a AJAX based script to deliver similar sounding names before the user initiates his/her search. That way you can show the user "similar names" or "did you mean..." kind of data.

Cerebrus
Thank you. I have used SoundEx before and am considering other phonetic algorithms - namely double metaphone and NYSIIS. What is interesting is how the approach that I will use will dictate the data structure - namely, a phonetic implementation would use a more normalized form of name.
joseph.ferris
+1  A: 

Not sure what your application is, but if your users know at the time of sign up that people from their past might be searching the database for them, you could offer them the chance in the user profile to define other names they might be known as (including last names, women change these all the time and makes finding them much harder!) and that they want people to be able to search on. Store these in a separate related table. Then search on that. Just make the structure such that you can define one name as the main name (the one you use for everything except the search.)

HLGEM
That is a good point. Looking at the Facebook model, it appears that they take a whole name and parse it out, then allow you to change the parts - including adding an alias for former names that is used just to search against.
joseph.ferris
A: 

Here's an idea for automatically finding "name synonyms" like Bill/William. That problem has been studied in the broader context of synonyms in general: inducing them from statistics of which words commonly appear in the same contexts in a large text corpus like the Web. You could try combining that approach with a list of names like Moby Names; I don't know if it's been done before.

Here are some pointers.

Darius Bacon
+1  A: 

You'll find that you're dabbling in an area known as "Natural Language Processing" and you'll need to do several things, most of which can be found under the topic of stemming.

Simplistic stemming simply breaks the word apart, but more advanced algorithms associate words that mean the same thing - for instance Google might use stemming to convert "cat" and "kitten" to "feline" and search for all three, weighing the actual word provided by the user as slightly heavier so exact matches return before stemmed matches.

It's a known problem, and there are open source stemmers available.

Adam Davis
This exercise has very little to do with NLP. "Stemming" refers solely to extracting the stem from a word. This is useful in search engines to map say philosophy, philosopher, and philosophical into one group. It is done (in English at least) by applying transformation rules (e.g. removal of known suffixes) to the end of the word. This has only limited application to first names (e.g. Bobbie/Bobby -> Bob) and uses a (small) set of rules that is NOT the same as the full-blown set used in a (more general) English text stemmer.
John Machin
(continued) It doesn't address the Bob vs Robert[a] or Ron[nie] vs Veronica/Ronald problem, which needs an alias/nickname dictionary, nor does it deal with the typo/reado/spello/thinko problem, nor with the immigrant tendency to Anglicise given names e.g. Karl -> Carl -> Charles or Guglielmo/Guillaume/Uiliami becoming Bill :-)
John Machin