Use case: User 1 uploads 100 company names (e.g. Microsoft, Bank of Sierra)

User 2 uploads 100 company names (e.g. The Gap, Uservoice, Microsoft, Inc.)

I want User 1's notion of Microsoft and User 2's notion of Microsoft to map to a centrally maintained entity with a unique index for Microsoft.

If someone uploads a name which isn't in the central repository, I guess I'd like it to be entered as is. But then what happens if that first entry is incorrectly spelled (e.g. Vergin Mobile instead of Virgin Mobile)? How can we best correct it, and correlate new uploads to that same index?

Technically, should the central repository be a separate database altogether? And should the user-generated information also be kept in a separate database from the business transactions that will occur against it?

I'm starting out with a broad definition of the problem and hoping to chunk it up with your input. Thanks.

A: 
company table    
  id
  name

company_synonym table
  company_id
  name

This schema structure solves the problems you have listed.
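
For instance, a minimal sketch of that schema in SQL, with a lookup that resolves a user's spelling to the central entity (the sample name is from the question; the column types are assumptions about your setup):

    CREATE TABLE company (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    );

    CREATE TABLE company_synonym (
        company_id INTEGER NOT NULL REFERENCES company (id),
        name       VARCHAR(255) NOT NULL
    );

    -- Resolve an uploaded spelling to the centrally maintained entity
    SELECT c.id, c.name
      FROM company c
      JOIN company_synonym s ON s.company_id = c.id
     WHERE s.name = 'Microsoft, Inc.';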

mson
A: 

Do you see what happens when you try to enter a new question on this site, and it suggests all those previous questions that might be the same?

Probably even that will be insufficient. It's insufficient here.

le dorfier
A: 

FWIW, this has nothing to do with database normalization. This is a data cleanup task.

Data cleanup cannot be fully automated in the general case. Many people try, but it's impossible to detect all the ways that the input data might be malformed. You can automate some percentage of the cases with techniques such as:

  • Force users to select company names from a list instead of typing them. Of course this is best for single entries, not for bulk uploads.
  • Compare the SOUNDEX of the input company names to the SOUNDEX of company names already in the database. This is useful for identifying possible matches, but it can also give false positives. So you need a human to review them.
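
A rough sketch of that SOUNDEX comparison (table names follow the schema answer above; SOUNDEX is built into MySQL and SQL Server, among others):

    -- Surface possible matches for a human to review; expect false positives
    SELECT c.id, c.name
      FROM company c
     WHERE SOUNDEX(c.name) = SOUNDEX('Vergin Mobile');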

Ultimately, you need to design your software to make it easy for an administrator to "merge" entries (and update any references from other database tables) as they are discovered to be duplicates of one another. There's no elegant way to do this with cascading foreign keys; you just have to write a bunch of UPDATE statements.
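
A merge might then look roughly like this (company and company_synonym follow the schema above; the invoice table stands in for whatever tables reference companies in your system):

    -- Hypothetical: fold duplicate company 42 into surviving company 7
    UPDATE company_synonym SET company_id = 7 WHERE company_id = 42;
    UPDATE invoice         SET company_id = 7 WHERE company_id = 42;  -- repeat per referencing table

    -- Keep the duplicate's name as a synonym so future uploads still resolve
    INSERT INTO company_synonym (company_id, name)
        SELECT 7, name FROM company WHERE id = 42;

    DELETE FROM company WHERE id = 42;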

Bill Karwin
A: 

LinkedIn does this somehow. However, they don't do batch uploads. Basically, you want to set up some sort of difference calculator that will trigger an action on potential matches.

Dropping words like "Inc" and "The" is one rule, and then there is pattern matching, or closely matching words that are misspelled.
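
A sketch of the word-dropping rule in SQL (this only handles two tokens; a real normalizer would need a longer stop-word list and punctuation handling):

    -- Hypothetical normalization: lowercase, then strip a couple of noise tokens
    SELECT TRIM(
               REPLACE(
                   REPLACE(LOWER('The Gap, Inc.'), ', inc.', ''),
                   'the ', '')
           ) AS normalized_name;  -- yields 'gap'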

This is not an easy thing to do with batch uploads from a workflow standpoint. You will need a known, approved data dictionary, and then each upload/addition has to be vetted. Eventually the number of additions will dwindle.

I agree that this is not a database issue - it is a workflow issue.

EDIT

I would have an approved list, and then some rules that propagate a potential "good" name to the approved list. How you implement that is left as an exercise for the reader...
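
One way to structure that (a sketch only; the table and status values are assumptions) is a staging table that holds each upload until it is vetted against the approved list:

    -- Hypothetical staging table for the vetting workflow
    CREATE TABLE company_candidate (
        id            INTEGER PRIMARY KEY,
        uploaded_name VARCHAR(255) NOT NULL,
        status        VARCHAR(16)  NOT NULL DEFAULT 'pending',  -- pending/approved/rejected
        company_id    INTEGER REFERENCES company (id)           -- set once matched or approved
    );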

Tim
A: 

There is a whole class of systems, called Master Data Management (MDM), that try to do this for different domains such as partners, addresses, and products. These are typically large, full-featured systems; this is nothing that can properly be done in an ad-hoc fashion. These things sound easy at first, but get very difficult very soon.

Sorry I'm not being too cheery here, but this can quickly turn into a nightmare, similar to trying to solve an NP-complete problem.

IronGoofy