I need to find the shortest distance between two Wikipedia pages, measured in "hops".

I have a method to extract all the internal wiki links on a page.

I know the start page and the end page, but I'm coming up blank on how to compute the number of hops from the data.

So far I've been using the link-extraction method to populate a dictionary, with the key being a link on the page and the value being the page it was taken from.
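Roughly, that looks like this (ExtractLinks stands in for my extraction method, and page is whatever page is currently being scraped):

// key = a link found on a page, value = the page it was found on
var foundOn = new Dictionary<string, string>();
foreach (string link in ExtractLinks(page))
    foundOn[link] = page;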

If anyone has any ideas on what a good data structure would be to hold the info, and then how to search through it, I would appreciate it very much.

+5  A: 

Do you know anything about graph theory? You have the necessary data to build a graph, but you'll need to use Dijkstra's algorithm to traverse it to find the shortest path between your two points.
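
For example, you could store the graph as an adjacency list: a dictionary mapping each page to the pages it links to. A rough sketch, where crawledPages and ExtractLinks are placeholders for your crawl state and link-extraction method:

// Adjacency list: page title -> titles of the pages it links to.
var graph = new Dictionary<string, List<string>>();
foreach (var page in crawledPages)
    graph[page] = ExtractLinks(page).ToList();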

Charlie Salts
Yes. Or a simple breadth first search in this case, since all edges have a weight of 1 click.
CaptnCraig
@CaptnCraig - Yes, I think you're right. I was trying to remember all my graph algorithms and I found Dijkstra's so I stopped looking ;)
Charlie Salts
+2  A: 

Maybe it's a bit stupid since I'm not really a C# programmer, but a multidimensional array that contains all the links inside would, depending on the depth of its dimensions, let you know which path contains fewer hops.

It's just a thought; while this is certainly doable in theory, since there is no language limit to the number of dimensions an array can have, I'm pretty sure it'll be really memory hungry!

Something like this:

[source] -> [source link] -> ['source link' link] -> etc
         -> [source link] -> ['source link' link] -> etc
         -> [source link] -> ['source link' link] -> etc
         -> [source link] -> ['source link' link] -> [target]
         -> [source link] -> ['source link' link] -> etc
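
In code that nesting might look roughly like the following, as lists of levels rather than a true multidimensional array (source and ExtractLinks are placeholder names):

// Level 0 is the source page; level n holds the links reachable in n hops.
var levels = new List<List<string>> { new List<string> { source } };
levels.Add(ExtractLinks(source).ToList());
// Each further level is the links of every page in the previous level.
levels.Add(levels[1].SelectMany(p => ExtractLinks(p)).ToList());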
johnnyArt
+1  A: 

Assuming you have an IEnumerable<Link> PageLinks(Link link) method, the number of hops can be computed by the following:

Link currentPage = "somepage";       // assumes Link converts implicitly from string
Link destinationPage = "otherpage";
if (currentPage == destinationPage) return 0;
int hops = 1;
var visited = new HashSet<Link> { currentPage };           // pages already seen
List<Link> currentLinks = PageLinks(currentPage).ToList(); // the current frontier
visited.UnionWith(currentLinks);
while (!currentLinks.Contains(destinationPage))
{
    // Expand the frontier by one hop, skipping pages already visited.
    currentLinks = currentLinks
        .SelectMany(l => PageLinks(l))
        .Where(f => !visited.Contains(f))
        .Distinct()
        .ToList();
    visited.UnionWith(currentLinks);
    hops++;
}
return hops;

Edited to handle cycles more efficiently, although the algorithm would have worked without it. Be aware that it may run forever if the two pages aren't connected.

Yuriy Faktorovich
Very nice. I like your use of IEnumerables to conserve memory while looping over an exponentially increasing dataset. But won't you need loop detection for the *cyclic* graphs in the problem? Also, you need a termination condition in case the destination is never found.
Frank Krueger
A: 

I think the graph is sparse in this case, so it might be a good idea to use something like a HashSet for each Wikipedia page, holding the pages it links to.

You don't really need to implement Dijkstra's shortest-path algorithm here, since this is equivalent to the shortest-path problem where every edge has weight 1. You could just do a breadth-first search and return the depth at which the destination page is found.
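
A minimal sketch of that breadth-first approach (getLinks is a placeholder for whatever returns a page's outgoing links):

// BFS over the unweighted link graph. Returns the number of hops from
// start to target, or -1 if target is unreachable from start.
static int Hops(string start, string target,
                Func<string, IEnumerable<string>> getLinks)
{
    var visited = new HashSet<string> { start };
    var queue = new Queue<(string Page, int Depth)>();
    queue.Enqueue((start, 0));
    while (queue.Count > 0)
    {
        var (page, depth) = queue.Dequeue();
        if (page == target) return depth;
        foreach (var link in getLinks(page))
            if (visited.Add(link))               // Add returns false if already seen
                queue.Enqueue((link, depth + 1));
    }
    return -1;
}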

m3rLinEz
Yes, CaptnCraig already made this comment - Breadth First will do fine as well.
Charlie Salts
+1  A: 

Here's an implementation of Dijkstra's algorithm in Python: http://code.activestate.com/recipes/119466/

adamse