Hi everyone. I'm looking for features that distinguish a blog from a normal website: things a program can identify from a site's HTML, or particular features that the site supports, e.g. pings. The same goes for news websites.

I'm working on a blog/news monitoring program. It will index sites, automatically determine whether each one is a blog or a news site, and then monitor user feedback (comments etc.) on posts from the sites it classifies as blogs or news sites.

So what I'm really after is suggestions on what I can use or look out for when identifying these sites.

It's going to be a desktop app written in Java, so if you have any Java-specific code, that would be great.

Thanks in advance.

+1  A: 

You can search the page for the word "blog", as this will probably be present. Specifically, you can look for it only in certain parts of the HTML page, or exclude parts such as links. This will give you a decent starting point.
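As a rough illustration of the keyword idea, here is a minimal Java sketch that drops link elements, strips the remaining tags, and then looks for "blog" as a whole word. The class and method names (`BlogKeywordCheck`, `mentionsBlog`) are made up for this example, and using regular expressions on HTML is a simplifying assumption; a real HTML parser would be more robust against malformed markup.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class BlogKeywordCheck {
    // Whole <a ...>...</a> elements, including their link text.
    private static final Pattern ANCHORS = Pattern.compile(
            "<a\\b[^>]*>.*?</a\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Any remaining tag, so attribute values are not counted as page text.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    // "blog" or "blogs" as a standalone word, in any letter case.
    private static final Pattern BLOG_WORD = Pattern.compile(
            "\\bblogs?\\b", Pattern.CASE_INSENSITIVE);

    // Returns true if the visible, non-link text mentions "blog".
    static boolean mentionsBlog(String html) {
        String withoutLinks = ANCHORS.matcher(html).replaceAll(" ");
        String text = TAGS.matcher(withoutLinks).replaceAll(" ");
        return BLOG_WORD.matcher(text).find();
    }
}
```

Calling `BlogKeywordCheck.mentionsBlog(html)` on a downloaded page gives one weak signal; it is probably best combined with other checks rather than trusted on its own.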

Ultimately, though, this is something that will have to be done manually. You should construct an interface for people to specify whether a site is a blog or a news site, or which features it has, when the site is submitted. Then you should create a database of sites and features, and flag entries so that you or another administrator can review them and make changes. Once you do this for a site, you'll never need to do it again. For example, everything under http://*.wordpress.com/ is going to be a blog.
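One possible shape for that lookup is sketched below, assuming reviewed sites are stored as a domain mapped to a type and that a registered domain also covers its subdomains (as in the wordpress.com example). `KnownSiteRegistry`, `SiteType`, `register`, and `lookup` are hypothetical names, and the in-memory map merely stands in for whatever database you end up using.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for the reviewed-sites database.
final class KnownSiteRegistry {
    enum SiteType { BLOG, NEWS }

    private final Map<String, SiteType> byDomain = new LinkedHashMap<>();

    // Records a manually reviewed domain, e.g. "wordpress.com".
    void register(String domain, SiteType type) {
        byDomain.put(domain.toLowerCase(Locale.ROOT), type);
    }

    // Returns the reviewed type if the URL's host is a registered
    // domain or one of its subdomains; empty if the site is unknown.
    Optional<SiteType> lookup(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // not a syntactically valid URI
        }
        if (host == null) {
            return Optional.empty();
        }
        host = host.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, SiteType> entry : byDomain.entrySet()) {
            String domain = entry.getKey();
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }
}
```

After `register("wordpress.com", SiteType.BLOG)`, a lookup for any `*.wordpress.com` URL returns `BLOG`, while an empty result means the site still needs automatic detection or manual review.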

Some features you can detect automatically, or at least have a pretty good chance of detecting, but ultimately you will need a manual review.

Erick Robertson
Thanks for the edits and suggestions.
robinsonc494
A: 

Look for a discoverable RSS or Atom feed, which should be present on a blog or serially-updated news site.
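Feeds are usually advertised in the page head with a `<link>` tag whose `type` is `application/rss+xml` or `application/atom+xml`. The sketch below collects the `href` values of such tags. `FeedDiscovery` and `findFeedUrls` are names invented for this example, and it makes two simplifying assumptions: it matches tags with regular expressions instead of an HTML parser, and it checks only the `type` attribute, not `rel`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class FeedDiscovery {
    // Any <link ...> tag, in any letter case.
    private static final Pattern LINK_TAG = Pattern.compile(
            "<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // A quoted href attribute value inside a tag.
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the href of every <link> tag that declares an RSS or Atom type.
    static List<String> findFeedUrls(String html) {
        List<String> feeds = new ArrayList<>();
        Matcher link = LINK_TAG.matcher(html);
        while (link.find()) {
            String tag = link.group();
            String lower = tag.toLowerCase(Locale.ROOT);
            boolean isFeed = lower.contains("application/rss+xml")
                    || lower.contains("application/atom+xml");
            if (!isFeed) {
                continue; // stylesheets, icons, and other link tags
            }
            Matcher href = HREF.matcher(tag);
            if (href.find()) {
                feeds.add(href.group(1)); // may be relative to the page URL
            }
        }
        return feeds;
    }
}
```

A non-empty result from `FeedDiscovery.findFeedUrls(html)` suggests a serially updated site. Relative hrefs would still need resolving against the page URL before the feed can be fetched.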

ahockley
Thanks, I had RSS in mind; I'll look for the others as well.
robinsonc494