I have some unknown webpages and I want to determine which websites they come from. I have example webpages from each website, and I assume each website has a distinctive template. I do not need complete certainty, and I don't want to spend too many resources matching each webpage, so crawling each website to look for the webpage is out of the question.
I imagine the best way is to compare the tree structure of each webpage's DOM. Are there any libraries that will do this?
Ideally I am after a Python-based solution, but if there is an algorithm I can understand and implement myself, I would be interested in that too.
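To make the idea concrete, here is a rough sketch of the kind of comparison I have in mind. It is only an illustration, not a tested solution: it uses the standard library's `html.parser`, reduces each page to the set of its root-to-element tag paths (ignoring text and attributes), and scores two pages by Jaccard similarity. The names `TagPathCollector`, `tag_paths`, `structure_similarity` and `guess_site` are made up for this example, and the choice of Jaccard similarity over tag paths is just my assumption about what "comparing tree structure" could mean cheaply.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, outermost first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/div"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: record the path but do not push.
        self.paths.add("/".join(self.stack + [tag]))

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: pop back to the nearest matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structure_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' tag-path sets, from 0.0 to 1.0."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def guess_site(unknown_html, examples):
    """Return the site whose example page is structurally closest.

    `examples` is assumed to map a site name to one example page's HTML.
    """
    return max(examples,
               key=lambda site: structure_similarity(unknown_html, examples[site]))
```

Something like `guess_site(page_html, {"site-a": example_a, "site-b": example_b})` would then return the best-matching site name. This ignores sibling order and repetition counts, so I suspect a proper tree-comparison method would be more robust, which is why I am asking about existing libraries.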
Thanks