I've been thinking about this for a while now, so I thought I would ask for suggestions:
I have a crawler that enters the root of a site (could be anything from www.StackOverFlow.com, www.SomeDudesPersonalSite.se or even www.Facebook.com). Then I need to determine what "kind of homepage" I'm visiting. Different types could for instance be:
- Forum
- Blog
- Link catalog
- Social media site
- News site
- "One man site"
I've been brainstorming for a while, and the best solution seems to be some heuristic with a point system. By this I mean that each trend I detect on a site gives some points to one or more of the types, and the program then guesses the type with the most points.
But this is where I get stuck. How do you detect trends?
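To make the point system concrete, here is a minimal sketch of what I have in mind. The trend names and the point values in `WEIGHTS` are made up for illustration; I have not tuned or tested them on real sites.

```python
# Site types from the list above.
SITE_TYPES = [
    "forum", "blog", "link catalog",
    "social media site", "news site", "one man site",
]

# Hypothetical weights: each detected trend gives points to some types.
WEIGHTS = {
    "many_outgoing_links": {"link catalog": 3},
    "mostly_dated_pages": {"news site": 2, "blog": 2},
}


def guess_site_type(trends):
    """Sum the points for every trend that was detected and pick the top type.

    `trends` maps a trend name to True/False. Unknown trend names are ignored.
    Returns the best guess and the full score table.
    """
    scores = {site_type: 0 for site_type in SITE_TYPES}
    for trend, present in trends.items():
        if not present:
            continue
        for site_type, points in WEIGHTS.get(trend, {}).items():
            scores[site_type] += points
    best = max(scores, key=scores.get)
    return best, scores


best, scores = guess_site_type(
    {"many_outgoing_links": True, "mostly_dated_pages": False}
)
print(best, scores)
```

Returning the whole score table along with the guess would let me see how close the runner-up was, which is probably useful for spotting ties and weak guesses.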
- Catalogs could be easy: If the number of outgoing links per indexed page is very high, catalogs should get several points.
- News sites/Blogs could be easy: If a large share of the indexed pages contain a date/time, those types should get several points.
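The two trends above could be measured roughly like this. It is only a sketch using the Python standard library: `measure_signals`, `LinkCollector` and `DATE_PATTERN` are names I made up, the date regex only catches a couple of numeric formats, and comparing host names exactly means `www.example.com` and `example.com` would count as different sites.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Assumption: only matches dates like 2012-05-01 or 1/5/2012.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{4}\b")


def count_outgoing_links(page_url, html):
    """Count links whose host differs from the host of the page itself."""
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    count = 0
    for href in collector.hrefs:
        host = urlparse(urljoin(page_url, href)).netloc
        if host and host != own_host:
            count += 1
    return count


def measure_signals(pages):
    """`pages` maps each crawled URL to its HTML source."""
    total = len(pages)
    if total == 0:
        return {"outgoing_links_per_page": 0.0, "dated_page_fraction": 0.0}
    outgoing = sum(count_outgoing_links(url, html) for url, html in pages.items())
    dated = sum(1 for html in pages.values() if DATE_PATTERN.search(html))
    return {
        "outgoing_links_per_page": outgoing / total,
        "dated_page_fraction": dated / total,
    }
```

The resulting numbers would still have to be compared against thresholds (what counts as "very high"?), and I don't know good values for those.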
BUT I can't really come up with many more trends than these.
SO: My question is: Any ideas on how to do this?
Thanks so much..