Use case: User 1 uploads 100 company names (e.g. Microsoft, Bank of Sierra)

User 2 uploads 100 company names (e.g. The Gap, Uservoice, Microsoft, Inc.)

I want User 1's notion of Microsoft and User 2's notion of Microsoft to map to a centrally maintained entity with a unique index for Microsoft.

If someone uploads a name which isn't in the central repository, I guess I'd like it to be entered as is. But then what happens if that first entry is incorrectly spelled (e.g. Vergin Mobile instead of Virgin Mobile)? How can we best correct it, and correlate new uploads to that same index?

Technically, should the central repository be a separate database altogether? And should the user-generated information also be kept in a separate database from the business transactions that will occur against it?

I'm starting out with a broad definition of the problem and hoping to chunk it up with your input. Thanks.

A: 
company table    
  id
  name

company_synonym table
  company_id
  name

This schema structure solves the problems you have listed.
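
For instance, a minimal sketch of that schema in SQL, with a lookup that resolves a user's spelling to the central entity (the sample name is from the question; the column types are assumptions about your setup):

    CREATE TABLE company (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    );

    CREATE TABLE company_synonym (
        company_id INTEGER NOT NULL REFERENCES company (id),
        name       VARCHAR(255) NOT NULL
    );

    -- Resolve an uploaded spelling to the centrally maintained entity
    SELECT c.id, c.name
      FROM company c
      JOIN company_synonym s ON s.company_id = c.id
     WHERE s.name = 'Microsoft, Inc.';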

mson
A: 

Do you see what happens when you try to enter a new question on this site, and it suggests all those previous questions that might be the same?

Probably even that will be insufficient. It's insufficient here.

le dorfier
A: 

FWIW, this has nothing to do with database normalization. This is a data cleanup task.

Data cleanup cannot be fully automated in the general case. Many people try, but it's impossible to detect all the ways that the input data might be malformed. You can automate some percentage of the cases with techniques such as:

  • Force users to select company names from a list instead of typing them. Of course this is best for single entries, not for bulk uploads.
  • Compare the SOUNDEX of the input company names to the SOUNDEX of company names already in the database. This is useful for identifying possible matches, but it can also give false positives. So you need a human to review them.
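
A rough sketch of that SOUNDEX comparison (table names follow the schema answer above; SOUNDEX is built into MySQL and SQL Server, among others):

    -- Surface possible matches for a human to review; expect false positives
    SELECT c.id, c.name
      FROM company c
     WHERE SOUNDEX(c.name) = SOUNDEX('Vergin Mobile');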

Ultimately, you need to design your software to make it easy for an administrator to "merge" entries (and update any references from other database tables) as they are discovered to be duplicates of one another. There's no elegant way to do this with cascading foreign keys; you just have to write a bunch of UPDATE statements.
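
A merge might then look roughly like this (company and company_synonym follow the schema above; the invoice table stands in for whatever tables reference companies in your system):

    -- Hypothetical: fold duplicate company 42 into surviving company 7
    UPDATE company_synonym SET company_id = 7 WHERE company_id = 42;
    UPDATE invoice         SET company_id = 7 WHERE company_id = 42;  -- repeat per referencing table

    -- Keep the duplicate's name as a synonym so future uploads still resolve
    INSERT INTO company_synonym (company_id, name)
        SELECT 7, name FROM company WHERE id = 42;

    DELETE FROM company WHERE id = 42;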

Bill Karwin
A: 

LinkedIn does this somehow. However, they don't do batch uploads. Basically, you want to set up some sort of difference calculator that will trigger an action on potential matches.

Dropping words like "Inc" and "The" is one rule, and then there is pattern matching, or closely matching words that are misspelled.
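
A sketch of the word-dropping rule in SQL (this only handles two tokens; a real normalizer would need a longer stop-word list and punctuation handling):

    -- Hypothetical normalization: lowercase, then strip a couple of noise tokens
    SELECT TRIM(
               REPLACE(
                   REPLACE(LOWER('The Gap, Inc.'), ', inc.', ''),
                   'the ', '')
           ) AS normalized_name;  -- yields 'gap'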

This is not an easy thing to do with batch uploads from a workflow standpoint. You will need a known, approved data dictionary, and then each upload/addition has to be vetted. Eventually the number of additions will dwindle.

I agree that this is not a database issue - it is a workflow issue.

EDIT

I would have an approved list, and then some rules that propagate a potential "good" name to the approved list. How you implement that is left as an exercise for the reader...
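
One way to structure that (a sketch only; the table and status values are assumptions) is a staging table that holds each upload until it is vetted against the approved list:

    -- Hypothetical staging table for the vetting workflow
    CREATE TABLE company_candidate (
        id            INTEGER PRIMARY KEY,
        uploaded_name VARCHAR(255) NOT NULL,
        status        VARCHAR(16)  NOT NULL DEFAULT 'pending',  -- pending/approved/rejected
        company_id    INTEGER REFERENCES company (id)           -- set once matched or approved
    );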

Tim
A: 

There is a whole class of systems, called Master Data Management (MDM), that try to do this for different domains such as partners, addresses, and products. These are typically large, full-featured systems; this is nothing that can properly be done in an ad-hoc fashion. These things sound easy at first, but get very difficult very soon.

Sorry I'm not being too cheery here, but this can quickly turn into a nightmare, similar to trying to solve an NP-complete problem.

IronGoofy