views: 170
answers: 4

I am a system administrator for a company, supporting our CRM (Salesforce).

I don't like the built-in functionality, so I generally manage data through CSV/XLS files for uploading and downloading, because I can write better queries that way. One of my tasks is uploading Contacts and Accounts. Because of a lack of unique identifiers and frequent misspellings, I'm encountering duplicates in uploaded records.

  • When uploading contacts, I try to match by email, phone number, and/or last name/first name
  • With companies I am generally limited to just the account name, which comes with numerous possible abbreviations and misspellings

I'm looking for a better method to check for duplicates with the following constraints:

Given a list of names, emails or phone numbers (all stored as text fields), do a comparison check between two tables looking for the best match out of the second table. Preferably between multiple fields, but even if it's just one, it will greatly assist my endeavor.

The size of the dataset is about 17,000 records in the longest table. Values are generally entered about 50 at a time. My limitations are the hardware I have and a lack of budgetary discretion. My programming ability is pretty basic, but I can learn, and I have Eclipse and Visual Studio on the system.

Does anyone have a suggestion as to how I can solve this problem, either programmatically or with a third-party tool?
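For illustration, here is a minimal sketch (in Python, which I could run alongside Eclipse) of the kind of per-field exact-match check I have in mind; the column names email/phone/lastname are just placeholders for my actual fields:

```python
import csv

def normalize(value):
    """Trim and lower-case a text field so comparisons are case-insensitive."""
    return value.strip().lower()

def load(path, fields):
    """Read a CSV file and keep only the named columns, normalized."""
    with open(path, newline="") as f:
        return [{k: normalize(row.get(k, "")) for k in fields}
                for row in csv.DictReader(f)]

def find_duplicates(new_rows, master_rows, fields=("email", "phone", "lastname")):
    """Flag a new row as a possible duplicate if any field exactly matches
    the same field of some master row (blank fields are ignored)."""
    index = {f: {r[f] for r in master_rows if r[f]} for f in fields}
    dupes = []
    for row in new_rows:
        hits = [f for f in fields if row[f] and row[f] in index[f]]
        if hits:
            dupes.append((row, hits))
    return dupes
```

This only catches exact matches after normalization, which is where I am today; it's the misspellings and abbreviations that I don't know how to handle.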

+1  A: 

The two database comparison tools I have used are xSQL and Redgate SQL Compare.

I prefer Redgate for its ease of use, but it is pretty pricey.

xSQL is a little complicated to get the hang of at first, but it is quite powerful; it also does a much better job of scripting schema and data than SQL Server does, and it's a lot cheaper.

Actually, looking at your question a little closer, it seems that you want a bit of intelligence in your comparisons. In that case, since you already have Visual Studio installed, you have the tools to do some pretty in-depth comparisons; you just need to do a bit of coding.

CodeKiwi
A: 

You are essentially talking about natural language parsing, along with more general pattern matching (the abbreviations etc.). You won't find a simple solution to that, but you can improve your chances.

I would start with perhaps one of the easier points: misspellings. The “soundex” algorithm (properly, the Soundex Indexing System) codifies words by what they sound like when pronounced (presumably by a US English speaker), and renders them such that words that sound alike compare as similar. This is very useful when trying to find misspellings based on homophones: present the user with matches that have similar Soundex values.
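A minimal sketch of the idea in Python follows; it implements the standard American Soundex rules (first letter kept, consonants grouped into six digit classes, h/w transparent, padded to four characters) and should be treated as an illustration rather than a reference implementation:

```python
def soundex(name):
    """American Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    # Consonant groups: 1=bfpv, 2=cgjkqsxz, 3=dt, 4=l, 5=mn, 6=r
    codes = {}
    for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), 1):
        for ch in letters:
            codes[ch] = str(digit)
    result = [name[0].upper()]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")  # vowels map to "" and reset prev
        if code and code != prev:
            result.append(code)
        prev = code
    return ("".join(result) + "000")[:4]
```

Because "Robert" and "Rupert" both code to R163, comparing Soundex codes instead of raw spellings will surface likely sound-alike misspellings for the user to review.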

bignose
A: 

Name and address matching can be pretty hard. I've worked on commercial tools that did this, and they came with a hefty price tag, especially as they work with a complete list of addresses and sometimes names too.

The other approach that works, but is time-consuming, is to build up your own mini library of scripts. There are two ways to go: you can try Perl, awk, shell script, or any other language you like that works with file- and text-based data; or you could build up a suite of SQL queries and code snippets, maybe in Access. Either way, if your budget is zero you'll probably spend a lot of time building your own library, but it's a useful skill to have.
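As a starting point for such a library, a similarity score costs nothing if you use what ships with Python's standard library. A hedged sketch, where the 0.8 cutoff and the sample account names are arbitrary assumptions you'd tune against your own data:

```python
import difflib

def best_match(name, candidates, cutoff=0.8):
    """Return (candidate, score) for the candidate most similar to `name`,
    or None if nothing scores at or above the cutoff."""
    best, best_score = None, cutoff
    for cand in candidates:
        # ratio() is 1.0 for identical strings, approaching 0.0 for unrelated ones
        score = difflib.SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score >= best_score:
            best, best_score = cand, score
    return (best, best_score) if best is not None else None
```

For example, an uploaded "Globex Inc." would pair with an existing "Globex Inc" account, while a genuinely new name would return None and be safe to insert.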

MrTelly
A: 

As @MrTelly mentioned, name and address (and company) matching is very complicated.

One third-party tool is MatchUp by Melissa DATA; there's a stand-alone GUI as well as a programmer's API.

GUI: http://www.melissadata.com/products/matchup.htm

API: http://www.melissadata.com/dqt/matchup-api.htm

Disclaimer: I work for Melissa DATA and was the primary developer for these products. But I wholeheartedly stand behind my work!

Marc Bernier