views:

93

answers:

3

I'm working on a website that will be based on user contributed data, submitted using a regular HTML form.

To simplify my question, let's say that there will be two fields in the form: "User Name" and "Country" (this is just an example, not the actual site).

There will be two tables in the database : "countries" and "users," with "users.country_id" being a foreign key to the "countries" table (one-to-many).

The initial database will be empty. Users from all over the world will submit their names and the countries they live in and eventually the "countries" table will get filled out with all of the country names in the world.

Since one country can have several alternative names, input like Chile, Chili, Chilli will generate 3 different records in the countries table, but in fact there is only one country. When I search for records from Chile, Chili and Chilli will not be included.

So my question is - what would be the best way to deal with a situation like this, with conditions such that the initial database is empty, no other resources are available and everything is based on user input?

How can I organize it in such way that Chile, Chili and Chilli would be treated as one country, with minimum manual interference.

What are the best practices when it comes to normalizing user submitted data and is there a scientific term for this? I'm sure this is a common problem.

Again, I used country names just to simplify my question, it can be anything that has possible different spellings.

A: 

I would say to use dropdown list for the country and you can easily populate it using javascript. You can find list of all countries here http://openconcept.ca/blog/mgifford/text_list_all_countries

As to your normalization question, I don't see any problem with design according to your example

DanielZ
Country names are used just for the sake of simplicity in my example.
Vincent
A: 

You cannot programmatically determine that Chile should be the same as Chili which is the same as Chilli. With the country example you can have a list of countries that you enter into your db and have a drop down list that users can select from.

If the data is all user entered, you can only match if it is exactly the same therefore their meaning is the same.

You might come up with an algorithm that associates words that are alike but IMHO that is just calling for nondeterministic results (disaster). For example (using a different example than your country one) you might programmatically determine that the word fight and sight are only off by one letter therefore they are alike. But are they really? Just because the two words are syntactically close does not mean they are semantically close too. And I am guessing that is what you are after.

c_maker
I feel like country names is a bad example to explain my problem.Another example: MoMA and Museum of Modern Art. Lexically different but refer to one entity. I understand that it would be impossible to detect this programmatically. I'm just looking for general ways to deal with it, with a hope that someone dealt with something similar before. One solution would be something in the user interface that will allow users to flag MoMA as a variation of Museum of Modern Art, website admin can then apply appropriate changes to the database. But what if there are thousands of records like this?
Vincent
A: 

Search engines whether file based (Lucene, Sphinx, etc) or database (Oracle Text, MSSQL Fulltext) solve this problem with a thesaurus. That is, they collect words together on the basis of them being synonyms. The qualification for being a synonym is tighter than in Roget's book, but the principle is the same. Synonyms bundle up abbreviations, acronyms and common misspellings. So for instance, a search thesaurus might identify street and st as being the same thing. Although, context is everything: in the string "St Pancras Road" st is a synonym for saint.

So, does this help you at all? Up to a point. It suggests the sort of thing you want to implement:

string      | canonical
------------+----------
street      |
st          | street
strete      | street
Chile       |
chilly      | Chile
chili       | Chile

The unfortunate thing is that building and maintaining a thesaurus requires human ingenuity and effort. Building a taxonomy requires expertise; tracking new additions requires time. The other thing is that even with a thesaurus the matches remain probabalistic: MoMA might be the same as Museum of Modern Art but is it the same as SFMOMA or NYMOMA? Not exactly but maybe 90% the same?

An alternative approach would be to do what SO does with tags. When you tagged your question a dropdown box appeared, suggesting available tags. As you typed more letters the list narrowed. This is isn't fool foolproof, witness the presence of tags like tsql and t-sql but it is pretty good. SO also has a backup, which is to provide the power users with a list of freshly minted tags so they can investigate these coinages and perhaps quash them. But that still remnains a manual process.

Alas there is no alogorithm that is going to be able to tell that MoMA is the same as Museum of Modern Art, let alone figure out whether it references the institution in New York or San Francisco.

APC