How can websites as big as Wikipedia sort out duplicate entries?

I need to know the exact procedure, from the moment a user creates the duplicate entry onwards. If you don't know Wikipedia's exact procedure but know a method that would work, please share it.

----update----

Suppose there is wikipedia.com/horse and somebody later creates wikipedia.com/the_horse. This is a duplicate entry! It should be deleted, or perhaps redirected to the original page.

A: 

I assume they have a procedure that removes extraneous words such as 'the' to create a canonical title, and if that canonical title matches an existing page, the new entry is not allowed.
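
For illustration, a minimal sketch of that kind of canonicalization might look like the following (the stop-word list, function names, and duplicate check are my assumptions, not MediaWiki's actual code):

```python
# Minimal sketch of canonical-title matching -- illustrative only, not
# MediaWiki's real implementation.

STOP_WORDS = {"the", "a", "an"}  # assumed list of extraneous words

def canonical_title(raw_title):
    """Normalize a submitted title: replace underscores with spaces,
    lowercase it, and drop leading stop words such as 'the'."""
    words = raw_title.replace("_", " ").lower().split()
    while words and words[0] in STOP_WORDS:
        words.pop(0)
    return " ".join(words)

def is_duplicate(raw_title, existing_pages):
    """Reject a new entry whose canonical form matches an existing page."""
    existing = {canonical_title(p) for p in existing_pages}
    return canonical_title(raw_title) in existing

# Example: "the_horse" collapses to "horse", so it is flagged as a duplicate.
print(is_duplicate("the_horse", {"Horse"}))  # True
```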

Alex JL
There are many more complex title variations that users can resubmit!
EBAGHAKI
Yes, there are, and as many people have pointed out, this is not an automatic process. Wikipedia obviously depends on the community to edit submissions. If you're positive about how it works, then why are you asking? Go look at the code for MediaWiki and answer your own question.
Alex JL
If it's not an automatic process, why does your answer sound like it is?
EBAGHAKI
Of course part of the process is automated. You're submitting a form to a website - do you think people are standing there turning a crank? It seemed so clear that Wikipedia's editing is done by users, considering that's the entire point of the site. I didn't think I needed to point that out.
Alex JL
+7  A: 

It's a manual process

Basically, sites such as wikipedia and also stackoverflow rely on their users/editors not to make duplicates or to merge/remove them when they have been created by accident. There are various features that make this process easier and more reliable:

  • Establish good naming conventions ("the horse" is not a well-accepted name, one would naturally choose "horse") so that editors will naturally give the same name to the same subject.
  • Make it easy for editors to find similar articles (see the sketch after this list).
  • Make it easy to flag articles as duplicates or delete them.
  • Make sensible restrictions so that vandals can't misuse these features to remove genuine content from your site.
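
Here is the sketch referred to in the list above: a rough illustration of the "find similar articles" idea using only Python's standard library. The helper name and example titles are invented for this sketch; real wiki software would use a proper search index rather than in-memory fuzzy matching.

```python
# Sketch of the "find similar articles" feature from the list above,
# using only the standard library; the page titles below are made up.
import difflib

existing_titles = ["Horse", "Horse racing", "Seahorse", "House"]

def similar_articles(new_title, titles, limit=5):
    """Suggest existing titles that look close to the one being created,
    so an editor can spot a likely duplicate before saving."""
    return difflib.get_close_matches(new_title, titles, n=limit, cutoff=0.6)

print(similar_articles("The Horse", existing_titles))  # e.g. ['Horse', ...]
```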

Having said this, you still find a lot of duplicate information on wikipedia --- but the editors are cleaning this up as quickly as it is being added.

It's all about community (update)

Community sites (like wikipedia or stackoverflow) develop their procedures over time. Take a look at Wikipedia:About, Stackoverflow:FAQ, or meta.stackoverflow. You can spend weeks reading about all the little (but important) details of how a community builds a site together and how it deals with the problems that arise. Much of this is about rules for your contributors --- but as you develop your rules, many of their details will be put into the code of your site.

As a general rule, I would strongly suggest starting a site with a simple system and a small community of contributors who agree on a common goal, are interested in reading the content of your site, like to contribute, and are willing to compromise and to correct problems manually. At this stage it is much more important to have an "identity" for your community and mutual help than to have many visitors or contributors. You will have to spend much time and care dealing with problems as they arise and delegating responsibility to your members. Once the site has a basis and a commonly agreed direction, you can slowly grow your community. If you do it right, you will gain enough supporters to share the additional work amongst the new members. If you don't care enough, spammers or trolls will take over your site.

Note that Wikipedia grew slowly over many years to its current size. The secret is not "get big" but "keep growing healthily".

Having said that, stackoverflow seems to have grown at a faster rate than wikipedia. You may want to consider the different trade-off decisions that were made here: stackoverflow is much more restrictive about allowing one user to change another user's contribution. Bad information is often simply pushed down to the bottom of a page (low ranking). Hence, it will not produce articles the way wikipedia does, but it's easier to keep problems out.

Yaakov Belch
Thanks for listing the features. So all I need is a good editorial system. Any idea where to get more info about this?
EBAGHAKI
+3  A: 

I can add one to Yaakov's list:

  • Wikipedia makes sure that after merging the information, "The Horse" points to "Horse", so that the same wrong title cannot be used a second time.

Rob Hooft
This is probably the most important way duplicates are avoided: The duplicate can't be added a dozen times, because after it's added once, it's redirected to the actual entry.
Kyralessa
+1  A: 

EBAGHAKI, responding to your last question in the comments above:

If you're trying to design your own system with these features, the key one is:

  • Make the namespace itself editable by the community that is identifying duplicates.

In MediaWiki's case, this is done with the special "#REDIRECT" command -- an article created with only "#REDIRECT [[new article title]]" on its first line is treated as a URL redirect.
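
As a rough illustration (simplified, and not MediaWiki's real parser), handling such a redirect page could look something like this:

```python
# Rough sketch of how a wiki engine might treat a "#REDIRECT" page, in the
# spirit of the MediaWiki behaviour described above (simplified).
import re

REDIRECT_RE = re.compile(r"^#REDIRECT\s*\[\[(.+?)\]\]", re.IGNORECASE)

def redirect_target(page_text):
    """Return the target title if the page's first line is a redirect,
    otherwise None."""
    stripped = page_text.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    match = REDIRECT_RE.match(first_line)
    return match.group(1) if match else None

def resolve(title, pages):
    """Follow redirects until a non-redirect page is reached
    (no loop guard here, for brevity)."""
    target = redirect_target(pages.get(title, ""))
    return resolve(target, pages) if target else title

pages = {
    "The Horse": "#REDIRECT [[Horse]]",
    "Horse": "A horse is a domesticated mammal...",
}
print(resolve("The Horse", pages))  # -> Horse
```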

The rest of the editorial system used in MediaWiki is depressingly simple -- every page is essentially treated as a block of text, with no structure, and with a single-stream revision history that any reader can add a new revision to. Nothing automatic about any of this.
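
To make that model concrete, here is a tiny sketch of a page as a block of text with an append-only revision list. The class and field names are invented for this sketch, not MediaWiki's actual schema:

```python
# A page is just text plus a single stream of revisions that anyone can
# append to; nothing is merged or validated automatically.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Revision:
    text: str
    author: str
    timestamp: datetime

@dataclass
class Page:
    title: str
    revisions: list = field(default_factory=list)

    def edit(self, text, author):
        """Append a new revision to the single-stream history."""
        self.revisions.append(Revision(text, author, datetime.now(timezone.utc)))

    def current_text(self):
        return self.revisions[-1].text if self.revisions else ""

page = Page("Horse")
page.edit("A horse is a mammal.", "alice")
page.edit("A horse is a domesticated mammal.", "bob")
print(page.current_text())
```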

When you try to create a new page, you are shown a long message encouraging you to search for the page title in various ways to see whether an existing page is already there -- many sites have similar processes. Digg is a typical example of one with an aggressive, automated search to try to convince you not to post duplicates -- you have to click through a screen listing potential duplicates and affirm that yours is different, before you are allowed to post.
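
A sketch of that kind of "confirm it isn't a duplicate" gate might look like this; the flow, names, and similarity threshold are assumptions, not Digg's or Wikipedia's actual code:

```python
# Sketch of a duplicate-confirmation gate: list close matches first and
# require the submitter to affirm that their page is different.
import difflib

def submit_page(title, existing_titles, confirm_not_duplicate):
    """Accept a new page only if there are no close matches, or the
    submitter confirms that none of the listed candidates is the same."""
    candidates = difflib.get_close_matches(title, existing_titles, n=5, cutoff=0.6)
    if candidates and not confirm_not_duplicate(candidates):
        return False  # submission blocked at the duplicate-listing screen
    return True

# The lambda stands in for the screen the user has to click through.
print(submit_page("The Horse", ["Horse", "Donkey"], lambda matches: False))  # False
```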

sj
Well thanks, very good points indeed
EBAGHAKI