views: 84

answers: 2

I have some unknown webpages and I want to determine which websites they come from. I have example webpages from each website, and I assume each website has a distinctive template. I do not need complete certainty, and I don't want to use too many resources matching each webpage, so crawling each website for the webpage is out of the question.

I imagine the best way is to compare the tree structure of each webpage's DOM. Are there any libraries that will do this?

Ideally I am after a Python based solution, but if there is an algorithm I can understand and implement then I would be interested in that too.

Thanks

A: 

A quick and dirty way you can try is to split the HTML source into its tags, then compare the resulting collections of strings. You should end up with a collection of tags and content, say:

item[n]   = "<p>"
item[n+1] = "This is some content"
item[n+2] = "</p>"

A regex can do this in just about every language.

Some of the content, not just the tags, will be the same across pages from one site (menus and so on), so a numeric comparison of how many items occur in both collections should be enough. You can improve on that by awarding extra "points" when the same tag or content appears at the same position; a long enough run of consecutive matches should give you reasonable confidence.
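
For example, here is a minimal sketch of that scoring scheme in Python (the function names and the exact weighting are my own assumptions, not a fixed recipe):

import re

def tokenize(html):
    # Split on tags; the capture group keeps the tags themselves as items
    # alongside the text between them.
    return [t.strip() for t in re.split(r'(<[^>]+>)', html) if t.strip()]

def similarity(html_a, html_b):
    a, b = tokenize(html_a), tokenize(html_b)
    shared = len(set(a) & set(b))                        # items present in both pages
    positional = sum(1 for x, y in zip(a, b) if x == y)  # extra "points" for matching positions
    return shared + positional

# Compare an unknown page against one sample page per site and pick the best score:
# best_site = max(samples, key=lambda site: similarity(unknown_html, samples[site]))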

m.bagattini
+3  A: 

You could do this with a Bayes classifier. Feed a few pages from each site into the classifier first; then future pages can be tested against them to see how closely they match.

A Bayes classifier library is available here: reverend (LGPL).

Simplified example:

# initialisation
from reverend.thomas import Bayes
guesser = Bayes()
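# site_one_page_one_data, site_two_page_one_data, etc. are assumed here to be
# strings holding the raw HTML (or tokenised text) of sample pages from each site.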
guesser.train('site one', site_one_page_one_data)
guesser.train('site one', site_one_page_two_data)
# ...etc...
guesser.train('site two', site_two_page_one_data)
guesser.train('site two', site_two_page_two_data)
# ...etc...
guesser.save()

# run time
guesser.load()
results = guesser.guess(page_I_want_to_classify)

For better results, tokenise the HTML first, though that might not be necessary.
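
For instance, one possible way to tokenise (an assumption on my part: only the tag names are kept, so the classifier sees each page's template structure rather than its text; TagExtractor and tags_only are my own names):

from html.parser import HTMLParser

class TagExtractor(HTMLParser):
    # Collect start-tag names, discarding attributes and text content.
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tags_only(html):
    extractor = TagExtractor()
    extractor.feed(html)
    return ' '.join(extractor.tags)

# Train and guess on the tokenised form instead of the raw HTML:
# guesser.train('site one', tags_only(site_one_page_one_data))
# results = guesser.guess(tags_only(page_I_want_to_classify))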

Kylotan