views:

41

answers:

1

Hi!

I've just started to read upon graph-teory and data structures.

I'm building an example application which should be able to find the xpath for the most common links. Imagine a Google serp, my application should be able to find the xpath for all links pointing to a result.

Imagine that theese xpaths were found:

/html/body/h2/a
/html/body/p/a
/html/body/p/strong/a
/html/body/p/strong/a
/html/body/p/strong/a
/html/body/div[@class=footer]/span[@id=copyright]/a

From these xpats, i've thought of a graph like this (i might be completely lost here):

                            html
                             |
                            body
                        h2 -     p           - div[@class=footer]
                        |        |                     |
                        a (1)  a - strong      span[@id=copyright]
                                      |                |
                                      a (3)            a (1)

Is this the best approach to this problem?

What would be the best way (data structure) to store this in memory? The language does not mather. We can see that we have 3 links matching the path html -> body -> p -> strong -> a.

As I said, i'm totally new to this so please forgive me if I thought of this completely wrong.

EDIT: I may be looking for the trie data structure?

+1  A: 

Don't worry about tries yet. Just construct a tree using standard graph representation (node = {value, count, parent} while immediately collapsing same branches and incrementing the counter. Then, sort all the leaves by count in descending order and traverse from each leaf upwards to get a path.

Gintautas Miliauskas