Does a tool exist that can parse text and output that text, hyper-linked to Wikipedia entries for words of interest?

For example, I'd like a tool that could turn something like:

The most popular search algorithm on a sorted list is the binary search.

Into:

The most popular search algorithm on a sorted list is the binary search.

It would be wonderful if Wikipedia had an API that did this, since they would be best equipped to determine what the "words of interest" are.

In my example I simply linked every word combination that leads directly to an entry, except for "The" and "most".

+1  A: 

You have two separate problems to solve here:

  1. Deciding which words should be linked
  2. Determining if there's a suitable entry to link these words to

Now, (2) is simpler, though it's also somewhat problematic. Wikipedia seems to have an API that allows you to gather data efficiently, and they also allow "screen scraping". But there's a problem with disambiguation: sometimes you won't land on the entry you wanted. For example, python links to a disambiguation page, as it can be a programming language, a snake and a couple of other things.
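To make (2) concrete, here is a minimal sketch in Python of how such a lookup could work. It assumes the MediaWiki query API at en.wikipedia.org/w/api.php with prop=pageprops, in the default JSON format where pages are keyed by page id. The function names and the User-Agent string are made up for illustration; only lookup() touches the network.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: the English Wikipedia MediaWiki API.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build a query URL asking whether `title` exists and is a disambiguation page."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "pageprops",
        "ppprop": "disambiguation",
        "redirects": 1,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def classify_page(response):
    """Classify a decoded API response as 'missing', 'disambiguation' or 'article'."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        if "missing" in page or "invalid" in page:
            return "missing"
        if "disambiguation" in page.get("pageprops", {}):
            return "disambiguation"
        return "article"
    return "missing"


def lookup(title):
    """Fetch and classify a title (hypothetical helper; performs a network request)."""
    request = Request(
        build_query_url(title),
        headers={"User-Agent": "wikify-sketch/0.1 (example)"},
    )
    with urlopen(request) as resp:
        return classify_page(json.load(resp))
```

With something like this, a title that comes back as "disambiguation" (python, for instance) can be skipped or flagged for manual review instead of being linked blindly.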

(1) is much harder, though. You can take the "simple approach" and attempt to find links for all non-trivial nouns (or even noun/adjective pairs). Non-trivial here means omitting words like "friend, word, computer" etc. But this would result in a plethora of links, which isn't convenient to read. It's really up to you to decide what's interesting in the text, and this depends a lot on the text itself. In an article for professional programmers, do you really want to link to "search algorithm" every time? But for beginners, perhaps you do.
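As a rough illustration of the "simple approach", here is a sketch that skips the part-of-speech step entirely and just tries the longest word combinations first, ignoring a small stopword list. Everything here is an assumption for the example: the STOPWORDS set, the wikify name, the HTML output format, and the has_entry callback, which stands in for whatever existence check you use (an API lookup, a local title dump, and so on).

```python
import re
from urllib.parse import quote

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "most", "a", "an", "is", "on", "of"}


def wikify(text, has_entry, max_words=3):
    """Link the longest phrases (up to max_words words) for which has_entry(phrase) is true."""
    words = list(re.finditer(r"[A-Za-z][A-Za-z'-]*", text))
    out = []
    pos = 0  # index of the first character of `text` not yet copied to `out`
    i = 0
    while i < len(words):
        matched = False
        # Try the longest candidate first, then shrink.
        for n in range(min(max_words, len(words) - i), 0, -1):
            group = [w.group() for w in words[i:i + n]]
            start, end = words[i].start(), words[i + n - 1].end()
            phrase = text[start:end]
            # Skip candidates that span punctuation or consist only of stopwords.
            if " ".join(group) != phrase:
                continue
            if all(w.lower() in STOPWORDS for w in group):
                continue
            if has_entry(phrase):
                url = "https://en.wikipedia.org/wiki/" + quote(phrase.replace(" ", "_"))
                out.append(text[pos:start])
                out.append('<a href="%s">%s</a>' % (url, phrase))
                pos = end
                i += n
                matched = True
                break
        if not matched:
            i += 1
    out.append(text[pos:])
    return "".join(out)


# Usage with a tiny, made-up set of known titles:
known = {"search algorithm", "sorted list", "binary search", "search", "list"}
sentence = "The most popular search algorithm on a sorted list is the binary search."
linked = wikify(sentence, lambda phrase: phrase.lower() in known)
```

On the question's sentence this links "search algorithm", "sorted list" and "binary search" and leaves the rest alone. Deciding which of those links are actually worth showing to a given audience is still the hard part.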

To conclude, I strongly doubt there's a single general-purpose tool that will do the trick for you. But you surely have all the options at hand, and something need-specific can be coded without too much effort.

Eli Bendersky
A: 

Silviu Cucerzan of Microsoft Research tackled this problem. Well, not the problem of inserting the links, but the general issue of determining which entities are being mentioned in a piece of text. Fortunately for you, he used Wikipedia articles as his set of entities. His paper, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data", is available on his website. Direct link: pdf.

Matt G
+1  A: 

There is a tool that does exactly what you're asking for: http://wikify.appointment.at/ It's not perfect, but it works.