views: 87

answers: 5

The big mission: I am trying to get a few lines of summary of a webpage, i.e. I want a function that takes a URL and returns the most informative paragraph from that page. (That would usually be the first paragraph of actual content text, in contrast to "junk text" like the navigation bar.)

So I managed to reduce an HTML page to a bunch of text by cutting out the tags and throwing out the <HEAD> and all the scripts. But some of the text is still "junk text", and I want to know where the actual paragraphs of text begin. (Ideally the solution would be human-language-agnostic, but a solution that only works for English might help too.)

How can I figure out which of the text is "junk text" and which is actual content?

UPDATE: I see some people have pointed me to an HTML parsing library. I am already using Beautiful Soup. My problem isn't parsing HTML; I've already gotten rid of all the HTML tags. I just have a bunch of text, and I want to separate the content text from the junk text.

A: 

Use an HTML parser. I'm serious. Get a friggin' HTML parser. Why? Read this (if only for teh lulz...) and this (if only to know how fragile any other solution is). Beautiful Soup is supposed to be pretty good; from the looks of the documentation, it should be a piece of cake to get a list of all paragraphs (using findAll).
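
For instance, a minimal sketch (using the modern bs4 package; findAll is the older Beautiful Soup 3 spelling of find_all):

import bs4  # pip install beautifulsoup4

with open("page.html") as f:
    soup = bs4.BeautifulSoup(f.read(), "html.parser")

# find_all('p') returns every <p> tag; get_text() strips the markup inside it.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(paragraphs[:3])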

delnan
I *am* using Beautiful Soup. It would definitely *not* be a piece of cake. I mean, I can get all the `<p>` tags easily, but often the content is not in a `<p>` tag at all, and often there is junk text inside `<p>` tags.
cool-RR
Okay, sorry, but I couldn't have guessed that from the original question. This is indeed tricky... cletus's idea to look for patterns across pages to identify menus etc. sounds like a good shot.
delnan
I think that trying to look for patterns across pages would be too complex. (Then you need to fetch more pages, figure out which pages are "siblings" of this page, etc., and even then there might be differences between the pages' navigation bars.)
cool-RR
+1  A: 

Finding a general solution to this is a non-trivial problem.

To put this in context, a large part of Google's success with search has come from their ability to automatically discern some semantic meaning from arbitrary Web pages, namely figuring out where the "content" is.

One idea that springs to mind: if you can crawl many pages from the same site, you will be able to identify patterns. The menu markup will be largely the same across all pages. If you zero that out somehow (and it will need to be fairly "fuzzy"), what's left is the content.
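
A loose sketch of that zeroing-out (the function and threshold are illustrative assumptions, and exact string matching stands in for the fuzzier comparison this really needs): text blocks that repeat verbatim on several pages of the same site are treated as boilerplate, everything else as content.

from collections import Counter

def strip_common_blocks(pages):
    """pages: one list of text blocks per crawled page from the same site.
    Returns the pages with blocks that recur across pages removed."""
    seen_on = Counter()
    for blocks in pages:
        for block in set(blocks):  # count each block at most once per page
            seen_on[block] += 1
    threshold = max(2, len(pages) // 2)  # "on half the pages" = menu/footer
    return [[b for b in blocks if seen_on[b] < threshold] for blocks in pages]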

The next step would be to identify the blocks of text and what constitutes a boundary between them. Ideally those would be HTML paragraphs, but you won't get that lucky most of the time.

A better approach might be to find the site's RSS feeds and get the content that way, since a feed is already stripped down. Ignore any AdSense (or similar) content and you should be able to get the text.
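
A minimal sketch with the third-party feedparser library (the feed URL is a made-up example; real sites usually advertise theirs via a <link rel="alternate"> tag in the page head):

import feedparser  # pip install feedparser

feed = feedparser.parse("http://example.com/rss.xml")  # hypothetical URL
for entry in feed.entries:
    # entry.summary holds the stripped-down article text (often with light HTML).
    print(entry.title)
    print(entry.summary[:200])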

Oh, and absolutely throw out your regex code for this. It requires an HTML parser, without question.

cletus
Cletus, the HTML is a non-issue. The tags don't interest me; I throw all of them out. The reason I'm thinking about regex is to use it for telling which pieces of text are flowing paragraphs and which are link texts from the navigation bar (or other small bits of text).
cool-RR
+1  A: 

Probably a bit overkill, but you could try nltk, the Natural Language Toolkit. That library is used for processing natural language; it's quite a nice library and an interesting subject. If you just want to get the sentences from a text, you would do something like:

>>> import nltk
>>> nltk.sent_tokenize("Hi this is a sentence. And isn't this a second one, a sentence with a url http://www.google.com in it?")
['Hi this is a sentence.', "And isn't this a second one, a sentence with a url http://www.google.com in it?"]

Or you could use the sentences_from_text method of the PunktSentenceTokenizer class. You have to run nltk.download() before you get started, to fetch the tokenizer data.
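
Applied to the original question, one purely illustrative heuristic: navigation junk tends to be short fragments, while real content tokenizes into several prose-length sentences. Something like:

import nltk

nltk.download('punkt', quiet=True)  # one-time fetch of the sentence tokenizer model

def looks_like_content(block, min_sentences=2, min_words=8):
    """Crude guess: a block is 'content' if it splits into a couple of
    sentences whose average length resembles prose rather than link text.
    Both thresholds are arbitrary assumptions; tune them on real pages."""
    sentences = nltk.sent_tokenize(block)
    if len(sentences) < min_sentences:
        return False
    avg_words = sum(len(s.split()) for s in sentences) / len(sentences)
    return avg_words >= min_words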

DiggyF
A: 

I'd recommend having a look at what Readability does. Readability strips out all but the actual content of the page and restyles it for easy reading. In my experience it works very well at detecting the content.

Have a look at its source code (particularly the grabArticle function) and maybe you can get some ideas.
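
The gist of grabArticle, loosely re-sketched in Python (an approximation to convey the idea, not Readability's actual code): score each candidate text block by prose-like signals such as length and comma count, then keep the top scorer.

def score_block(text):
    """Rough Readability-style scoring: long, comma-rich blocks look more
    like article prose than navigation chrome."""
    score = text.count(",")            # prose tends to contain commas
    score += min(len(text) // 100, 3)  # reward length, capped
    return score

def best_block(blocks):
    return max(blocks, key=score_block) if blocks else None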

Liquid_Fire
+1  A: 

You could use the approach outlined at the AI depot blog, along with some Python code:
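
If it's the article I'm thinking of, the core idea is line-level text density: compare, for each line, how many characters survive tag stripping against the raw HTML length; low-density lines are chrome, high-density runs are content. A loose sketch of that idea (the names and the threshold are mine, not the article's):

def text_density(stripped, raw_html):
    """Fraction of a line's raw HTML characters that survive tag stripping.
    High density suggests content; low density suggests markup/navigation."""
    return float(len(stripped)) / max(len(raw_html), 1)

def content_lines(pairs, threshold=0.5):
    # pairs: (stripped_text, raw_html_line) tuples; the threshold is an assumption.
    return [text for text, html in pairs if text_density(text, html) >= threshold]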

ars