views:

279

answers:

5

I recently used Wikipedia's function "What links here" (which is found under the "Toolbox" element in any entry's left menu) and it got me started wondering how this function actually works.
I'm guessing that searching through all the article entries after links isn't very effective, so are all the links stored in a separate database? If so, is this updated when an article is edited or another time?

Thanks.

+10  A: 

Whenever a page on Wikipedia is edited, it is placed into a background queue that does some further processing. Some of the things that happen there are:

  • updates to the "what links here" for other pages
  • updates to category index pages
  • updates to the global cache of existing pages to help render "redlinks" on other pages

This sort of information doesn't need to be updated right away when you hit "Submit", so the background processing queue takes care of it. Sometimes this queue can grow quite large, but usually it's kept under control.

You can find more information about this at Help:Job Queue.

Greg Hewgill
+1  A: 

The way I would implement is to get all the links after an edit, then store them in a separate table with the key being the current url. Then I could just query the table with the URL the user is currently on and get all the links that have been marked as linking to that page.

It probably wouldn't be as straightforward as that but that's the general, simplified idea. Probably instead of URLs it would be wiser to store page IDs and so on.

Tatu Ulmanen
+1  A: 

It would make sense for the "update event" of an article to trigger a links parser as this is the only time an article is going to change. The update event in turn would simply scan for links, and query the db for links that are internal to wikipedia.

I imagine each page has a primary key and a simple association table is created to relate the pages PK to all the other pages that link to it.

Theres likely some additional bits that get added to aid performance on such a large site but that would be the basic mechanics.

rism
+1  A: 

You could think this as a more general problem. If you have a link (or pointer or whatever) from A to B, how can B know that A has a link pointing there?

The answer is to store the information to target location. That is, when the page A is edited and a link is created to B, at the same time store information about the link source to B (a reverse link). In case of a web page, the reverse link could be written directly into "what links here" page. Just a single write into a static page. No need to perform any searches or database queries.

PauliL
+1  A: 

Pseudo code for a simple algorithm that would do it

procedure updateChanges(editedPage):
    for_each(link on editedPage):
        if(link is not to another wikipedia page): continue
        pageToUpdate = open(link):
        if(pageToUpdate->whatLinksHere.contains(editedPage)): continue
        pageToUpdate->whatLinksHere.insert(editedPage)

Sorry I just finished my algorithms class so I have an urge to write pseudo code. In this context, the updateChanges() procedure would be something called during the "update the 'what links here' for other pages" phase that Greg Hewgill referred to.

Graphics Noob