views: 87

answers: 5

The big mission: I am trying to get a few lines of summary of a webpage, i.e. I want a function that takes a URL and returns the most informative paragraph from that page. (That would usually be the first paragraph of actual content text, in contrast to "junk text" like the navigation bar.)

So I managed to reduce an HTML page to a bunch of text by cutting out the tags and throwing out the <HEAD> and all the scripts. But some of the text is still "junk text", and I want to know where the actual paragraphs of text begin. (Ideally the solution would be human-language-agnostic, but a solution that only works for English might help too.)

How can I figure out which of the text is "junk text" and which is actual content?

UPDATE: I see some people have pointed me to an HTML parsing library. I am already using Beautiful Soup. My problem isn't parsing HTML; I've already gotten rid of all the HTML tags. I just have a bunch of text, and I want to separate the content text from the junk text.

A: 

Use an HTML parser. I'm serious. Get a friggin' HTML parser. Why? Read this (if only for teh lulz...) and this (if only to know how fragile any other solution is). Beautiful Soup is supposed to be pretty good; from the looks of the documentation, it should be a piece of cake to get a list of all paragraphs (using findAll).
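
For instance, a minimal sketch (using the modern bs4 package; findAll is the older Beautiful Soup 3 spelling of find_all):

import bs4  # pip install beautifulsoup4

with open("page.html") as f:
    soup = bs4.BeautifulSoup(f.read(), "html.parser")

# find_all('p') returns every <p> tag; get_text() strips the markup inside it.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(paragraphs[:3])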

delnan
I *am* using Beautiful Soup. It would definitely *not* be a piece of cake. I mean, I can get all the `<p>` tags easily, but often the content is not in a `<p>` tag at all, and often there is junk text inside `<p>` tags.
cool-RR
Okay, sorry, but I couldn't have guessed that from the original question. This is indeed tricky... cletus's idea to look for patterns across pages to identify menus etc. sounds like a good shot.
delnan
I think that trying to look for patterns across pages would be too complex. (Then you need to fetch more pages, figure out which pages are "siblings" of this page, etc., and even then there might be differences between the pages' navigation bars.)
cool-RR
+1  A: 

Finding a general solution to this is a non-trivial problem.

To put this in context, a large part of Google's success with search has come from their ability to automatically discern some semantic meaning from arbitrary Web pages, namely figuring out where the "content" is.

One idea that springs to mind: if you can crawl many pages from the same site, you will be able to identify patterns. The menu markup will be largely the same across all pages. If you zero that out somehow (and it will need to be fairly "fuzzy"), what's left is the content.
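
A loose sketch of that zeroing-out (the function and threshold are illustrative assumptions, and exact string matching stands in for the fuzzier comparison this really needs): text blocks that repeat verbatim on several pages of the same site are treated as boilerplate, everything else as content.

from collections import Counter

def strip_common_blocks(pages):
    """pages: one list of text blocks per crawled page from the same site.
    Returns the pages with blocks that recur across pages removed."""
    seen_on = Counter()
    for blocks in pages:
        for block in set(blocks):  # count each block at most once per page
            seen_on[block] += 1
    threshold = max(2, len(pages) // 2)  # "on half the pages" = menu/footer
    return [[b for b in blocks if seen_on[b] < threshold] for blocks in pages]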

The next step would be to identify the blocks of text and what constitutes a boundary between them. Ideally those would be HTML paragraphs, but you won't get that lucky most of the time.

A better approach might be to find the site's RSS feeds and get the content that way, since a feed is already stripped down. Ignore any AdSense (or similar) content and you should be able to get the text.
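
A minimal sketch with the third-party feedparser library (the feed URL is a made-up example; real sites usually advertise theirs via a <link rel="alternate"> tag in the page head):

import feedparser  # pip install feedparser

feed = feedparser.parse("http://example.com/rss.xml")  # hypothetical URL
for entry in feed.entries:
    # entry.summary holds the stripped-down article text (often with light HTML).
    print(entry.title)
    print(entry.summary[:200])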

Oh, and absolutely throw out your regex code for this. It requires an HTML parser, without question.

cletus
Cletus, the HTML is a non-issue. The tags don't interest me; I throw all of them out. The reason I'm thinking about regex is to use it for telling which pieces of text are flowing paragraphs and which are link texts from the navigation bar (or other small bits of text).
cool-RR
+1  A: 

Probably a bit overkill, but you could try nltk, the Natural Language Toolkit. That library is used for processing natural language; it's quite a nice library and an interesting subject. If you just want to get the sentences from a text, you would do something like:

>>> import nltk
>>> nltk.sent_tokenize("Hi this is a sentence. And isn't this a second one, a sentence with a url http://www.google.com in it?")
['Hi this is a sentence.', "And isn't this a second one, a sentence with a url http://www.google.com in it?"]

Or you could use the sentences_from_text method of the PunktSentenceTokenizer class. You have to run nltk.download() before you get started, to fetch the tokenizer data.
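
Applied to the original question, one purely illustrative heuristic: navigation junk tends to be short fragments, while real content tokenizes into several prose-length sentences. Something like:

import nltk

nltk.download('punkt', quiet=True)  # one-time fetch of the sentence tokenizer model

def looks_like_content(block, min_sentences=2, min_words=8):
    """Crude guess: a block is 'content' if it splits into a couple of
    sentences whose average length resembles prose rather than link text.
    Both thresholds are arbitrary assumptions; tune them on real pages."""
    sentences = nltk.sent_tokenize(block)
    if len(sentences) < min_sentences:
        return False
    avg_words = sum(len(s.split()) for s in sentences) / len(sentences)
    return avg_words >= min_words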

DiggyF
A: 

I'd recommend having a look at what Readability does. Readability strips out all but the actual content of the page and restyles it for easy reading. In my experience it works very well at detecting the content.

Have a look at its source code (particularly the grabArticle function) and maybe you can get some ideas.
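
The gist of grabArticle, loosely re-sketched in Python (an approximation to convey the idea, not Readability's actual code): score each candidate text block by prose-like signals such as length and comma count, then keep the top scorer.

def score_block(text):
    """Rough Readability-style scoring: long, comma-rich blocks look more
    like article prose than navigation chrome."""
    score = text.count(",")            # prose tends to contain commas
    score += min(len(text) // 100, 3)  # reward length, capped
    return score

def best_block(blocks):
    return max(blocks, key=score_block) if blocks else None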

Liquid_Fire
+1  A: 

You could use the approach outlined at the AI depot blog, along with some Python code:
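
If it's the article I'm thinking of, the core idea is line-level text density: compare, for each line, how many characters survive tag stripping against the raw HTML length; low-density lines are chrome, high-density runs are content. A loose sketch of that idea (the names and the threshold are mine, not the article's):

def text_density(stripped, raw_html):
    """Fraction of a line's raw HTML characters that survive tag stripping.
    High density suggests content; low density suggests markup/navigation."""
    return float(len(stripped)) / max(len(raw_html), 1)

def content_lines(pairs, threshold=0.5):
    # pairs: (stripped_text, raw_html_line) tuples; the threshold is an assumption.
    return [text for text, html in pairs if text_density(text, html) >= threshold]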

ars