views: 84

answers: 2

I have some unknown webpages and I want to determine which websites they come from. I have example webpages from each website, and I assume each website has a distinctive template. I do not need complete certainty, and I don't want to use too many resources matching each webpage, so crawling each website for the webpage is out of the question.

I imagine the best way is to compare the tree structure of each webpage's DOM. Are there any libraries that will do this?

Ideally I am after a Python based solution, but if there is an algorithm I can understand and implement then I would be interested in that too.

Thanks

A: 

A quick and dirty way you can try is to split the HTML source into its tags, then compare the resulting collections of strings. You should end up with a collection of tags and content, say:

item[n]   = "<p>"
item[n+1] = "This is some content"
item[n+2] = "</p>"

A regex can do this in just about every language.

Some of the content, not just the tags, will be the same across pages from one site (menus and so on), so a numeric comparison of how many items occur in both collections should be enough. You can improve on that by awarding extra "points" when the same tag or content appears at the same position; a long enough run of consecutive matches should give you reasonable confidence.
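
For example, here is a minimal sketch of that scoring scheme in Python (the function names and the exact weighting are my own assumptions, not a fixed recipe):

import re

def tokenize(html):
    # Split on tags; the capture group keeps the tags themselves as items
    # alongside the text between them.
    return [t.strip() for t in re.split(r'(<[^>]+>)', html) if t.strip()]

def similarity(html_a, html_b):
    a, b = tokenize(html_a), tokenize(html_b)
    shared = len(set(a) & set(b))                        # items present in both pages
    positional = sum(1 for x, y in zip(a, b) if x == y)  # extra "points" for matching positions
    return shared + positional

# Compare an unknown page against one sample page per site and pick the best score:
# best_site = max(samples, key=lambda site: similarity(unknown_html, samples[site]))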

m.bagattini
+3  A: 

You could do this with a Bayes classifier. Feed a few pages from each site into the classifier first; then future pages can be tested against them to see how closely they match.

A Bayes classifier library is available here: reverend (LGPL).

Simplified example:

# initialisation
from reverend.thomas import Bayes
guesser = Bayes()
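# site_one_page_one_data, site_two_page_one_data, etc. are assumed here to be
# strings holding the raw HTML (or tokenised text) of sample pages from each site.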
guesser.train('site one', site_one_page_one_data)
guesser.train('site one', site_one_page_two_data)
# ...etc...
guesser.train('site two', site_two_page_one_data)
guesser.train('site two', site_two_page_two_data)
# ...etc...
guesser.save()

# run time
guesser.load()
results = guesser.guess(page_I_want_to_classify)

For better results, tokenise the HTML first, though that might not be necessary.
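
For instance, one possible way to tokenise (an assumption on my part: only the tag names are kept, so the classifier sees each page's template structure rather than its text; TagExtractor and tags_only are my own names):

from html.parser import HTMLParser

class TagExtractor(HTMLParser):
    # Collect start-tag names, discarding attributes and text content.
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tags_only(html):
    extractor = TagExtractor()
    extractor.feed(html)
    return ' '.join(extractor.tags)

# Train and guess on the tokenised form instead of the raw HTML:
# guesser.train('site one', tags_only(site_one_page_one_data))
# results = guesser.guess(tags_only(page_I_want_to_classify))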

Kylotan