views: 1005

answers: 3

I'm trying to create a generalized HTML parser that works well on blog posts. I want to point my parser at a specific entry's URL and get back clean text of the post itself. My basic approach (from Python) has been to use a combination of BeautifulSoup and urllib2, which is okay, but it assumes you know the proper tags for the blog entry. Does anyone have any better ideas?
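To make the "basic approach" concrete, here is a minimal sketch using only the standard library's `html.parser` (in practice you'd fetch the page with urllib and likely use BeautifulSoup instead). The container tag and the class name `"entry"` are assumptions that vary per blog, which is exactly the weakness described above:

```python
# Sketch of the "known tag" approach: collect the text inside a
# container element such as <div class="entry">. Assumes reasonably
# well-formed HTML; the tag/class names are placeholders.
from html.parser import HTMLParser

class EntryExtractor(HTMLParser):
    def __init__(self, tag='div', css_class='entry'):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.depth = 0    # nesting depth inside the target element
        self.chunks = []  # collected text nodes

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and dict(attrs).get('class') == self.css_class:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

html = ('<body><div class="nav">menu</div>'
        '<div class="entry"><p>Hello, post.</p></div></body>')
parser = EntryExtractor()
parser.feed(html)
print(''.join(parser.chunks).strip())  # -> Hello, post.
```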

Here are some thoughts someone could perhaps expand upon; I don't yet have enough knowledge/know-how to implement them.

  1. The Unix program 'lynx' seems to parse blog posts especially well - what parser does it use, and how could it be utilized?

  2. Are there any services/parsers that automatically remove junk, ads, etc.?

  3. In this case, I had a vague notion that it may be a reasonable assumption that blog posts are usually contained in a certain defining tag with class="entry" or something similar. Thus, it may be possible to create an algorithm that finds the enclosing tags with the most clean text between them - any ideas on this?
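Idea 3 can be sketched as a text-density heuristic: track how much text each element's subtree contains, then pick the tightest element (fewest tags) that still holds most of the page's text. This is a toy illustration with assumed thresholds, not a production algorithm, and it assumes well-formed HTML:

```python
# Toy "densest enclosing tag" heuristic: the element wrapping the
# fewest tags while still containing >= min_share of all page text.
from html.parser import HTMLParser

class DensityFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []  # indices of currently open elements
        self.elems = []  # one dict per element seen

    def handle_starttag(self, tag, attrs):
        self.elems.append({'tag': tag, 'attrs': dict(attrs),
                           'text': 0, 'tags': 1})
        for i in self.stack:           # this tag is inside every ancestor
            self.elems[i]['tags'] += 1
        self.stack.append(len(self.elems) - 1)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        n = len(data.strip())
        for i in self.stack:           # text counts toward every ancestor
            self.elems[i]['text'] += n

def densest_block(html, min_share=0.8):
    p = DensityFinder()
    p.feed(html)
    total = max((e['text'] for e in p.elems), default=0)
    candidates = [e for e in p.elems
                  if total and e['text'] >= min_share * total]
    # Prefer the tightest container among the text-heavy candidates.
    return min(candidates, key=lambda e: e['tags'], default=None)

sample = ('<body><div class="nav">a b</div>'
          '<div class="entry"><p>long long long post text here</p>'
          '</div></body>')
best = densest_block(sample)
print(best['tag'])  # -> p
```

The `min_share=0.8` cutoff is arbitrary; tuning it (and scoring punctuation, link density, etc.) is where the real work lies.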

Thanks!

+2  A: 

There are projects out there that specifically look at filtering out the 'noise' of a given page. Typically the way this is done is by giving the algorithm a few examples of a given type of page, and it can look at what parts don't change between them. That being said, you'd have to give the algorithm a few example pages/posts of every blog you wanted to parse. This usually works well when you have a small defined set of sites you'll be crawling (news sites, for instance). The algorithm is basically detecting the template they use in HTML and picking out the interesting part. There's no magic here, it's tough and imperfect.

A great example of this algorithm can be found in the EveryBlock.com source code, which was just open-sourced. Go to everyblock.com/code, download the "ebdata" package, and look at the "templatemaker" module.
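The core intuition behind that kind of template detection can be sketched with the standard library's `difflib`: given two pages rendered from the same template, the matching blocks are the template and the gaps are the per-page content. templatemaker does this far more robustly; this is only the idea, diffing at the token level on made-up sample pages:

```python
# Diff two pages from the same "template": equal opcodes are template,
# everything else is the interesting per-page content.
import difflib

def changing_parts(a_tokens, b_tokens):
    sm = difflib.SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    return [(a_tokens[a1:a2], b_tokens[b1:b2])
            for op, a1, a2, b1, b2 in sm.get_opcodes() if op != 'equal']

a = '<html> <h1> Blog </h1> <div> First post body </div> </html>'.split()
b = '<html> <h1> Blog </h1> <div> Second post text </div> </html>'.split()
print(changing_parts(a, b))
# -> [(['First'], ['Second']), (['body'], ['text'])]
```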

And I don't mean to state the obvious, but have you considered just using RSS from the blogs in question? Usually the fields have the entire blog post, title, and other meta info along with them. Using RSS is going to be far simpler than the previous solution I mentioned.
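For the RSS route, a library like feedparser is the usual choice on a live feed; the sketch below uses only the standard library on an inline sample to show the key distinction: full-text feeds typically carry `<content:encoded>`, while summary-only feeds have just `<description>`. The sample feed and field choice are illustrative assumptions:

```python
# Pull the fullest available body out of each RSS item, preferring
# <content:encoded> (full text) over <description> (summary).
import xml.etree.ElementTree as ET

NS = {'content': 'http://purl.org/rss/1.0/modules/content/'}

def entry_bodies(rss_xml):
    root = ET.fromstring(rss_xml)
    bodies = []
    for item in root.iter('item'):
        full = item.find('content:encoded', NS)
        desc = item.find('description')
        bodies.append(full.text if full is not None
                      else desc.text if desc is not None else None)
    return bodies

sample = '''<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/">
 <channel><title>Example</title>
  <item><title>Post 1</title>
   <description>Short summary only.</description>
   <content:encoded>The full post body.</content:encoded>
  </item>
 </channel>
</rss>'''

print(entry_bodies(sample))  # -> ['The full post body.']
```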

sotangochips
Yeah, I actually have the RSS data; the problem is that many feeds don't have the full text, and I need it in every case. Checking this out now, thanks.
+6  A: 

Boy, do I have the perfect solution for you.

Arc90's readability algorithm does exactly this. Given HTML content, it picks out the content of the main blog post text, ignoring headers, footers, navigation, etc.
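A toy version of the readability idea, for flavor: score candidate containers by the amount of paragraph text (plus commas, which suggest prose) they hold, then keep the best-scoring one. Arc90's real algorithm adds many refinements (class-name hints, link density, sibling merging, etc.); this only illustrates the scoring core, assumes well-formed HTML, and uses a made-up sample page:

```python
# Score each <p>'s parent by prose-ish text volume; the winner is the
# likely article container.
from html.parser import HTMLParser

class ProseScorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # open (tag, element_id) pairs
        self.scores = {}  # element_id -> prose score
        self.names = {}   # element_id -> (tag, attrs)
        self.in_p = 0
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        self.next_id += 1
        self.names[self.next_id] = (tag, dict(attrs))
        self.stack.append((tag, self.next_id))
        if tag == 'p':
            self.in_p += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self.in_p -= 1
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.in_p and len(self.stack) >= 2:
            parent = self.stack[-2][1]  # element containing the <p>
            score = len(data.strip()) + data.count(',')
            self.scores[parent] = self.scores.get(parent, 0) + score

def best_container(html):
    p = ProseScorer()
    p.feed(html)
    if not p.scores:
        return None
    best_id = max(p.scores, key=p.scores.get)
    return p.names[best_id]

sample = ('<div class="sidebar"><p>Links</p></div>'
          '<div class="post"><p>Long, meaty article text, with commas.</p>'
          '<p>More article prose here.</p></div>')
tag, attrs = best_container(sample)
print(tag, attrs.get('class'))  # -> div post
```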

There are implementations in several languages. I'll be releasing a Perl port to CPAN in a couple of days. Done.

Hope this helps!

Anirvan
It turns out this worked really well - I needed to make a few changes to improve it (switching BeautifulSoup to the SGML parser instead of HTMLParser), but what a great solution! Thanks
One quick note: Arc90's Readability tool has some weak spots. On a complex page like this one (http://blog.moertel.com/articles/2007/02/22/a-simple-directory-tree-printer-in-haskell), it silently drops most of the code blocks. That's a significant problem if you are going to use it to extract information from _coding_ blogs.
Telemachus
Thanks for the python and php links, I didn't know those existed.
Tristan Havelick
The PHP version seems fantastic. The link above is broken; here's a new one: http://www.keyvan.net/2010/08/php-readability/
Mridang Agarwalla
A: 

Hi Anirvan,

I tried the Python code above (dated Feb 22, 12:49) and I get an error:

Traceback (most recent call last):
  File "C:\Users\workspace\secpython\src\hn.py", line 213, in <module>
    print upgradeFeed(HN_RSS_FEED)
  File "C:\Users\workspace\secpython\src\hn.py", line 180, in upgradeFeed
    parsedFeed = feedparser.parse(feedData)
AttributeError: 'module' object has no attribute 'parse'

I am using Python 2.6.5 and Eclipse with PyDev (Windows environment). Can you help me solve this problem?

Thanks in advance.