What algorithm helps to identify repeating sequences in a DOM?

tags:

algorithm

views:

answers:

What algorithm helps to identify repeating sequences in a DOM?

Is there a good algorithm I might apply to a DOM to lead me to groups of probably related nodes? The ultimate goal is to get something useful to assist extracting things like TOC's and "blog rolls" from websites. If something like this already exists, I'd be happy if someone let me know that as well.

I realize it's not something I can hope to do deterministically. The reason I suspect there might be a solution out there already comes from recently stepping through the 'diff algorithm' which deals with common sequences. I'm not sure if it's a leap or not to go from 'common' to 'repeating'...

"Related" is a very general term, in that it is always going to depend largely upon what the actual data is and what the relationships you're trying to infer are. I don't quite understand why you're talking about "repeating sequences" as a metric for "relatedness". Stricly speaking, there is not really any "sequence" in a DOM - it's a tree, so you could only talk about ordering (and therefore sequencing) with respect to parent/child relationships or sibling relationships. I'm not sure that you mean any of these things.

That said, there are some things you can say about DOMs. They are trees, so you're essentially looking to identify sub-trees with similar shape, I assume?

One approach you might take is to take two such DOMs and attempt to relate similar nodes (e.g. ones with known attributes or particular nodes) by adding edges (making the whole thing a connected graph), and then calculate a clique.

Other that that, I'm not sure there's much more specific methods I could suggest without a slightly more complete problem description.

Gian 2010-08-31 22:14:27

Thanks! I'll tell you why I said 'related' (not an attempt to justify). I was thinking about a DOM as having two classes of nodes - those with information that I cared about and those without information I cared about. Within the latter, I was trying to think about how to further group them into smaller sets and extract information about those, i.e. related, nodes so that I might identify them on a different page (probably within the same site) just by the characteristics of their surrounding DOM. It all boils down to me wanting to analyze things like MSDN semi-automatically.

Gabriel 2010-08-31 22:34:52

Oh yeah, and by sequence I was thinking of sub-tree, and I think I mean trees where there's only one node at a given level - which in my mind looks like a tall stick with nodes that create a 'sequence' from top to bottom

Gabriel 2010-08-31 22:39:10

You just have to pick an example of a "definitely interesting" node and invent a good similarity relation; then all similar nodes will be interesting. Similarity might be based on factors like: height of path to root, attribute values, tag names, positions among siblings, all of the above for several levels of parent nodes, etc. I used this approach and it worked surprisingly well.

jkff 2010-09-01 04:02:30

ansaurus

tags:

views:

answers:

What algorithm helps to identify repeating sequences in a DOM?

related questions