hi guys,

I have to crawl the contents of several blogs. The problem is that I need to classify whether the blog authors are from a specific school and are talking about the school's stuff. May I know the best approach to the crawling, or how I should go about the classification?

+1  A: 

If you're looking for a good Python web scraper, this question seems to have all the information you're looking for.

As for classifying whether the blog is discussing the school's stuff, that's a much trickier problem. I doubt you'll get away from having to have the results reviewed by humans. A really sophisticated scraper would use probabilistic filters--train it on blog posts which do and don't discuss the school, and let it infer the rules itself. That's fairly sophisticated, however, and from the question I'm guessing you want quick-and-dirty. I'd just put together a list of keywords, and review (and refine) the results until it's close enough to what you want.
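A minimal sketch of the quick-and-dirty keyword approach, assuming the post text has already been scraped into a string. The keyword list, the function names, and the threshold are all hypothetical placeholders, not something from the question; you would swap in terms specific to the school and tune them as you review results.

```python
import re

# Hypothetical keyword list (an assumption for illustration only);
# replace with names, nicknames, and places specific to the school.
SCHOOL_KEYWORDS = {"example university", "exu", "north campus"}


def keyword_score(text, keywords=SCHOOL_KEYWORDS):
    """Count how many distinct keywords appear in the text as whole words."""
    lowered = text.lower()
    return sum(
        1
        for kw in keywords
        if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
    )


def looks_school_related(text, threshold=1):
    """Flag a post when at least `threshold` distinct keywords match."""
    return keyword_score(text) >= threshold


if __name__ == "__main__":
    post = "Orientation week at Example University was fun"
    print(looks_school_related(post))
```

Reviewing both the flagged and the unflagged posts by hand, then adding or dropping keywords, is the "refine" loop described above.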

As for identifying the authors, this is the Internet, where no one knows whether or not you're a dog (or, by extension, what school you attended). If you had a list of authors to look for you could always use them as part of the keyword search, but if the authors choose not to identify themselves (or, worse, identify themselves as someone else) there's no practical way to do it.

Chris B.
Hi Chris, from what you're saying, I could probably use a naive Bayes classifier with a training set whose features are keyword occurrences, and perhaps also the Blogspot or WordPress links to supposed friends' blogs found on the pages? Is that the right track?
goh
That's the right idea. I'd pull together positive and negative samples of the posts you're interested in and feed them in. Obviously, the more posts you review, the better the filter gets. And remember to periodically review the results, both the posts getting flagged as well as posts not getting flagged, to help it improve.
Chris B.
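The naive Bayes idea from the comments above could be sketched roughly as follows. This is a small from-scratch multinomial naive Bayes with add-one smoothing, using only the standard library; the class name, the labels "school" and "other", and the four training sentences are made up for illustration. A real filter would need many hand-reviewed positive and negative posts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training posts
        self.vocab = set()           # every token seen during training

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_scores(self, text):
        """Return log P(label) + sum of log P(token | label) per label."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data (invented for this sketch, not real blog posts).
clf = NaiveBayes()
clf.train("exam week at the campus library with my classmates", "school")
clf.train("our lecturer moved the campus orientation to friday", "school")
clf.train("tried a new pasta recipe for dinner tonight", "other")
clf.train("my holiday photos from the beach trip", "other")

if __name__ == "__main__":
    print(clf.classify("campus exam next friday"))
```

The friends' blog links mentioned above could be added as extra features in the same way, for example by appending each linked hostname to the token list before training. That is one possible design, not the only one.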
A: 

Web scraping is one problem. Handling classification is a whole field.

You really have two choices: hire someone who knows how to do it or figure it out. For figuring it out, I strongly recommend the Programming Collective Intelligence book. The examples are in Python, use real world APIs, and invite hacking around to find solutions. Each chapter handles one part of the collective intelligence world, e.g., grouping or classifying, walks through some basics, and provides plenty of references for more information. It might be a good idea to skim the book even if you decide to hire an expert.

Charles Merriam
A: 

Sorry to butt in on your post, but I was wondering if Charles could provide a few other sources for getting really familiar with the subject.

Sasha