views: 182
answers: 7

So I have a huge list of entries in a DB (MySQL).

I'm using Python and Django in the creation of my web application.

This is the base Django model I'm using:

from django.db import models

class DJ(models.Model):
    alias = models.CharField(max_length=255)
    # other fields...

My DB now contains duplicates,

e.g. Above & Beyond, Above and Beyond, Above Beyond, DJ Above and Beyond, Disk Jokey Above and Beyond, ...

This is a problem, as it blows a big hole in my DB and therefore in my application.


I'm sure other people have encountered this problem and thought about it.

My ideas are the following:

  • Create a set of rules so that a duplicate entry cannot be created?

    e.g. "DJ Above and Beyond" cannot be created because "Above & Beyond" is already in the DB

  • Relate these aliases to each other somehow?

    e.g. relate "DJ Above and Beyond" to "Above & Beyond"


I literally have no clue how to go about this; even if someone could just point me in a direction, that would be very helpful.

Any help would be very much appreciated! Thank you guys.

+4  A: 

I guess you could do something based on Levenshtein distance, but there's no real way to do this automatically without creating a fairly complex rules-based system.

Unless you can define a rules system that can work out for any x and y whether x is a duplicate of y, you're going to have to deal with this in a fuzzy, human way.

Stack Overflow has a fairly decent way of dealing with this - warn users if something may be a duplicate, based on something like Levenshtein distance (and perhaps some kind of rules engine), and then allow a subset of your users to merge things as duplicates if other users ignore the warnings.
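
For illustration, a minimal Python sketch of that "warn on close matches" check might look like this; the helper names and the threshold of 3 are arbitrary assumptions, nothing Django-specific:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two sequences.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(current[j - 1] + 1,                      # insertion
                               previous[j] + 1,                         # deletion
                               previous[j - 1] + (item_a != item_b)))   # substitution
        previous = current
    return previous[-1]

def possible_duplicates(new_alias, existing_aliases, max_distance=3):
    # Return existing aliases within max_distance edits of the new alias.
    return [alias for alias in existing_aliases
            if levenshtein(new_alias.lower(), alias.lower()) <= max_distance]

# possible_duplicates("Above and Beyond", ["Above & Beyond", "Armin van Buuren"])
# -> ["Above & Beyond"]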

Dominic Rodger
You may also want to consider Levenshtein distance using words rather than characters - then `DJ Above and Beyond` would have distance 1 from `Above and Beyond`. Of course, then `DJ Mike` and `DJ Steve` will have distance 1 also, even though they are clearly not ripping each other off. Because of cases like this, there is no *general* solution that does not require a case-by-case inspection by a human being.
BlueRaja - Danny Pflughoeft
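
As an aside, the levenshtein() sketch above already works on word lists, since it only compares elements with !=; a hypothetical word-level use would be:

levenshtein("DJ Above and Beyond".split(), "Above and Beyond".split())  # -> 1
levenshtein("DJ Mike".split(), "DJ Steve".split())                      # -> 1, the false positive mentioned
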
You are right, but it seems like one of my only options. Would **Above and Behind** be close to **Above and Beyond**?
RadiantHex
"Levenshtein distance using words" I could make it better by using stopwords maybe? Anyways thanks very much for your contribution, I have a lot to discover and experiment now. :)
RadiantHex
+1  A: 

This is not a complete solution, but one thought I had:

from django.db import models

class DJ(models.Model):
    # other fields, no alias!
    pass

class DJAlias(models.Model):
    dj = models.ForeignKey(DJ, on_delete=models.CASCADE)
    alias = models.CharField(max_length=255)

This would allow you to have several aliases for the same DJ.

But you will still need to find a proper way to ensure the aliases are added to the right DJ; see Dominic's post.

But if you check a new alias against the several other aliases pointing to one DJ, the matching algorithms might work better.
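
A rough sketch of how a new alias could be checked against the aliases already attached to each DJ before saving; possible_duplicates is the hypothetical helper sketched under Dominic's answer, and djalias_set is simply Django's default reverse accessor for the ForeignKey above:

def find_matching_djs(candidate_alias, max_distance=3):
    # DJs that already have an alias suspiciously close to the candidate.
    matches = []
    for dj in DJ.objects.prefetch_related('djalias_set'):
        existing = [a.alias for a in dj.djalias_set.all()]
        if possible_duplicates(candidate_alias, existing, max_distance):
            matches.append(dj)
    return matches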

vikingosegundo
+3  A: 

From the examples you give, it sounds like you have more of a natural language problem than an exact matching problem. Given that natural language matching is inexact by nature, you're unlikely to come up with a perfect solution.

  • String distance doesn't really work, as strings that are algorithmically close may not be semantically close (e.g. "DJ Above & Beyond" should match "Above and Beyond" but not "DJ Above & Beyond 2", which is closer in Levenshtein distance).
  • Some cheap alternatives to natural language parsing are Soundex, which matches by phonetic sound, and stemming, which removes prefixes/suffixes to normalize on word stems. I suppose you could create a linked list of word roots, but this wouldn't be terribly accurate either.
  • If this is a user-facing program, you could echo "near misses" to the user, e.g. "Is one of these what you meant to enter?"
  • You could normalize the entries in some way so that different entries map to the same normalized value (e.g. case-normalize, map "&" -> "and", etc., which some of the above suggestions might be a step towards) to find near misses or to map multiple inputs to a single value (a rough sketch follows below).
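
A minimal normalization sketch along those lines; the substitutions (lowercasing, "&" -> "and", dropping a leading "DJ"/"Disk Jokey") are assumptions taken from the examples in the question, not a complete rule set:

import re

def normalize_alias(alias):
    # Map different spellings of the same act onto one canonical key.
    key = alias.lower()
    key = key.replace('&', ' and ')
    key = re.sub(r'^(dj|disk jokey)\s+', '', key)   # drop leading "DJ" / "Disk Jokey"
    key = re.sub(r'[^a-z0-9]+', ' ', key).strip()   # strip punctuation, collapse whitespace
    return key

# normalize_alias("DJ Above & Beyond") -> "above and beyond"
# normalize_alias("Above and Beyond")  -> "above and beyond"

Storing such a normalized key alongside the raw alias would turn the duplicate lookup into a simple exact-match query.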

One caveat: my experience only applies to English, e.g. an English PorterStemmer won't recognize the one French title you put in there.

Steve B.
Stemming is a word-based, not a phrase-based, process. The idea is to normalize grammatical forms to their word stem, e.g. Ran, Run, Running -> Run, basically by encoding suffix/prefix removal as well as the many special cases. It would take you forever, except that it's been done and code examples are out there. I was thinking you could transform a phrase into a linked list of word stems. I'm sure there are cases where this would break too (e.g. "Dumb and Dumber" -> "Dumb and Dumberer"), as what you're trying to do is by nature a fuzzy problem. Maybe in conjunction with other strategies...
Steve B.
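
If you want to try this, NLTK's PorterStemmer is one readily available implementation (picking NLTK here is just an assumption; the point above is only that stemmer code already exists):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_key(title):
    # Crude normalized key: the tuple of word stems of a title.
    return tuple(stemmer.stem(word) for word in title.lower().split())

# stem_key("Running and Runs") -> ('run', 'and', 'run')
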
+1  A: 

You could try to solve this problem for this instance only (replacing "&" with "and", replacing "DJ" with "Disk Jokey", or ignoring "DJ", etc.). If your table only contains DJs you could set up a bunch of rules like those. If your table contains more diverse stuff you will have to go with a more structural approach. Could you give a sample of your dataset?

Lex
Hi Lex, thanks for your answer. There are only DJs in the list, but the aliases vary a lot, e.g. **"DJ Lex"** could be written as **"L.E.X."**, but **simple names** also pop up, like **"Sasha"**.
RadiantHex
+2  A: 

I think this is more of a social problem than a programming problem. Any sort of programmatic solution to natural language processing like this is going to be buggy and error-prone. It's very hard to distinguish things that are close but legitimately different from the sort of undesired duplicates that you're talking about.

As Dominic mentioned, Stack Overflow's tagging system is a pretty good model for this. It provides cues to the user that encourage them to use existing tags if appropriate (drop down lists as the user types), it allows trusted users to retag individual questions, and it allows moderators to do mass retags.

This is really a process that has to have a person directly involved.
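
As a sketch of just the "cues as the user types" part (assuming the DJ/DJAlias models from vikingosegundo's answer; the view name and URL wiring are hypothetical), a simple substring-based suggestion endpoint might look like:

from django.http import JsonResponse

def alias_suggestions(request):
    # Return up to 10 existing aliases containing what the user has typed so far.
    q = request.GET.get('q', '').strip()
    suggestions = []
    if len(q) >= 2:
        suggestions = list(
            DJAlias.objects.filter(alias__icontains=q)
                           .values_list('alias', flat=True)[:10]
        )
    return JsonResponse({'suggestions': suggestions})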

Chris Upchurch
+1  A: 

First of all, the programming task (NLP etc., as mentioned) is of course interesting. But as mentioned, it's overkill to aim to perfect it.

But the other view is, as mentioned, the "social" one: who enters the data, who views it, and how long and how correct should it be? So it's a naming-convention issue, and it reminds me of the great project musicbrainz.org. Should your site "just work", or do you prefer to follow standards? In the latter case I would orient myself along the MusicBrainz project, in case you haven't heard of it. E.g. see here for Above & Beyond: they have one alias defined, and they use it to match user searches: http://musicbrainz.org/show/artist/aliases.html?artistid=58438. Also check out the Artist_Alias page in the wiki.

The data model is worth a look, and there are even several API bindings to sync data, also in Python.

groovehunter
Thank you groovehunter! This is a good source of info for putting my problem in context.
RadiantHex
+1  A: 

How about changing the model so that "alias" becomes a list of keys into another table that looks like this (skipping small words like "the", "and", etc.): 1 => Above; 2 => Beyond; 3 => Disk; 4 => Jokey.

Then, when you want to insert a new record, check how many of the significant words from the title are already in the table and match currently existing model entities. If more than 50% (for example) match, you may have a coincidence, and you can show that list to the visitor and ask "did you mean one of these?".
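
A rough in-Python sketch of that check (skipping the separate keyword table for brevity; the stop-word set and the 50% threshold are just placeholders):

STOP_WORDS = {'the', 'and', 'a', 'an', 'of', 'dj'}

def significant_words(title):
    return {w for w in title.lower().split() if w not in STOP_WORDS}

def looks_like_duplicate(new_title, existing_title, threshold=0.5):
    # True if more than `threshold` of the new title's significant words
    # already appear in an existing title.
    new_words = significant_words(new_title)
    shared = new_words & significant_words(existing_title)
    return bool(new_words) and len(shared) / len(new_words) > threshold

# looks_like_duplicate("DJ Above and Beyond", "Above & Beyond") -> True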

Ilian Iliev
Thanks for your reply Ilian Iliev! You are right, I should skip those words; I found out that they are called "stop words".
RadiantHex