views:

532

answers:

7

Hello all,

I'm running into a problem when trying to select records from my 2005 MS-SQL database (I'm still very new to SQL, but I learned and use the basic commands from w3schools already). In theory, all my manufacturer records should be unique. At least that is how I intended it to be when I did my first massive data dump into it. Unfortunately, that is not the case and now I need to fix it! Here is my scenario:

Table name = ItemCatalog
Relevant columns = Partnumber,Manufacturer,Category

When I did a "SELECT DISTINCT Manufacturer FROM ItemCatalog" this little problem is what turned up:

Cables2Go
CablesToGo
Cables To Go
CableToGo Inc
CablesToGo Inc

All 5 of those showed up as distinct, which they are. Can't fault my SELECT statement for returning it, but from my human perspective they are all the same manufacturer! One method I see working is doing an UPDATE command and fixing all the permutations that show up, but I have a LOT of manufacturers and this would be very time consuming.

Is there a way when I punch in a SELECT statement, that I can find all the likely permutations of a manufacturer name (or any field really)? I attempted the LIKE operator, so my statement would read "SELECT Manufacturer FROM ItemCatalog WHERE Manufacturer LIKE '%CablesToGo%'", but that didn't turn out as well as I had hoped. Here's the nasty bit, my other program that I'm putting together absolutely requires that I only ask for a single manufacturer name, not all 5 variations. Maybe I'm talking in circles here, but is there is a simple way in one statement for me to find a similar string?

Thanks! Bill

A: 

I know others are suggesting query fixes - I thought I'd elaborate on my long-term fix for kicks.

You could create another table relating each of the variations to a single manufacturer entity. If I encountered this situation at work (and I have), I would be enticed to fix it.

Create a manufacturer's table with a primary key, name, etc..

Create a table with aliases - these will only be needed when you are presented with data that doesn't have the manufacturer's ID (like an import file).

Modify ItemCatalog such that it references the primary key from the manufacturer table (i.e. a ManufacturerID foreign key).

When importing data to ItemCatalog, assign the ManufacturerID foreign key based on matches to the alias table. If you have a name that matches 2+ records then you flag them for manual review or you try to match on more than manufacturer name.

Mayo
A: 

'Cable%Go%' might work for that one case, but if you have other variations for other strings, you'll probably have to do a lot of manual data cleanup.

Andy White
It misses 'CableToGo Inc'. Drop the 's' and you'd be OK.
Jonathan Leffler
Fixed, thanks for noticing that.
Andy White
A: 

Standard SQL has a SIMILAR statement, which is a bit more powerful than LIKE.

However, you could use LIKE to good effect with:

Manufacturer LIKE 'Cable%Go%'

This would work in this specific case, finding all the variants listed. However, it would also find 'Cable TV Gorgons' and you probably don't need them included. Your version would also find 'We Hate CablesToGo With Ferocity Inc', which you probably didn't want either.

However, data cleansing is a major problem, and there are companies that make a living out of providing data cleansing. You often end up making a dictionary or thesaurus of terms (company names here) mapping all the variants encountered to the canonical form. The problem is that sometimes you find the same variant spelling is used for two separate canonical forms. For example, a pair of bright sparks might both decide to use 'C2G' as an abbreviation, but one uses it for 'Cables To Go Inc' and the other uses it for 'Computers To Gamers Inc'. You have to use some other information to determine whether a particular instance of 'C2G' means 'Cables' or 'Computers'.

Jonathan Leffler
Do you know if MS-SQL has the SIMILAR statement? I'm not finding any documentation on it anywhere, so I assume that's a 'no'? I'm afraid ugly data cleanup is what I'm stuck with :(
Bill Sambrone
If you can't find it documented in MS-SQL manuals rather easily, then it probably isn't there; it is one of the least widely implemented parts of the SQL standard (see http://savage.net.au/SQL/ for standard SQL grammars).
Jonathan Leffler
And ugly data cleanup is going to be necessary, even with SIMILAR to provide regex support instead of just LIKE. The pain is there. If it is any consolation, I had to do this once (well, a few dozen times as I was developing it) mapping random misspellings of port names around the world to the canonical format. It was long enough ago that Perl wasn't available to me (1991?), and for a while I used a shell script that split a file of mappings into separate parts that could be run through the brain-dead `sed` on the machine. That got too awful, so I ended up with a homebrew program that did it.
Jonathan Leffler
Aw rats, I thought it might come to this. I've already started work on some simple vbscript stuff to try and fix the worst of it for me. At least I've learned a very important lesson here about keys and table structure!
Bill Sambrone
A: 

I suggest you to use object relational mapping tool to map your table into object and add filtering logic there.

Roy
A: 

One option you have is to loosen your wildcard search to something like 'Cables%Go%'. This might be good in the short term, but with this approach you run the risk of matching more manufacturers than you want (ie , Cables on the Go, etc).

You could also put together a mapping table, which would put all of the variants of Cables To Go into a single group, which your app can query and normalize for your ItemCatalog query.

Another option you have is to introduce a Manufacturers table. This your ItemCatalog table would then have a foreign key to this table and only allow manufacturers that are in the Manufacturer table. This would require some cleanup of your ItemCatalog table to get it working, assuming that you want all of the variants of Cables to Go to be the same.

akf
+3  A: 

If you are doing some data mining, you could also try the SOUNDEX and DIFFERENCE function in SQL Server.

While they are both outdated (they don't handle foreign character very well), they could yield some interesting result for you:

SELECT * FROM ItemCatalog WHERE SOUNDEX(Manufacturer) = SOUNDEX('Cables To Go');

and

SELECT * FROM ItemCatalog WHERE DIFFERENCE(Name, 'Cables To Go') >= 3;

The number 3 means likely similar (0 mean not similar and 4 is very similar)

There are a few number of better SOUNDEX function available on the internet. See Tek-Tips for an example.

Here is another example at SQL Team.

Pierre-Alain Vigeant
A: 

I DIDNOT GET THE EXACT MEANING OF SOUNDX

Akbar Khan