tags:

views:

45

answers:

3

Hello

Is there any way to parse a website using just the content as it is displayed to the user in the browser? That is, instead of downloading "page.html" and parsing the whole page with all its HTML/JavaScript tags, I would like to retrieve the version as displayed to users in their browsers. I want to "crawl" websites and rank them by keyword popularity (working from the raw HTML source is problematic for that purpose).

Thanks!

Joel

A: 

You could get the source and strip the tags out, leaving only non-tag text, which works for almost all pages, except those where JavaScript-generated content is essential.

Delan Azabani
Thanks for the answer. Using re.sub(r'<[^>]*?>', '', in_text) still leaves many unwanted keywords such as "padding", "color", "border", "size", etc. I thought that instead of stripping everything, I could just get the "display version" and work from that.
Joel
That's probably because it's stripping the script and style tags themselves, but not their contents.
Delan Azabani
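Following Delan's point, here is a minimal sketch that removes the script/style elements and their contents before stripping the remaining tags (the function name and sample input are illustrative):

```python
import re

def visible_text(html):
    # Remove <script>/<style> elements and their contents first -- their
    # text is never shown to the user, so it should not count as keywords.
    html = re.sub(r'(?is)<(script|style)\b[^>]*>.*?</\1>', '', html)
    # Then strip the remaining tags, as in the original re.sub.
    return re.sub(r'<[^>]*?>', '', html)

print(visible_text('<p>Hi<style>p { color: red; }</style> there</p>'))
# -> Hi there
```

Note this is still regex-based and will trip over edge cases (comments, CDATA, stray angle brackets); a real HTML parser is the more robust route.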
A: 

A browser also downloads page.html and then renders it. You should work the same way: use an HTML parser such as lxml.html or BeautifulSoup, which lets you ask for only the text enclosed within tags (plus any attributes you care about, such as title and alt).
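For reference, the same idea can be sketched with just the standard library's html.parser (lxml.html's text_content() or BeautifulSoup's get_text() do this more robustly); keeping the title/alt attribute values here is an illustrative choice, not a fixed recipe:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect displayed text, skipping <script>/<style> bodies."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skipping += 1
        # Also keep attribute values a keyword crawler might want.
        for name, value in attrs:
            if name in ("title", "alt") and value:
                self.parts.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skipping:
            self._skipping -= 1

    def handle_data(self, data):
        if not self._skipping:
            self.parts.append(data)

def page_text(html):
    parser = TextOnly()
    parser.feed(html)
    # Collapse runs of whitespace left behind by the markup.
    return " ".join(" ".join(parser.parts).split())

print(page_text('<p>Hello <b>world</b></p><script>var x;</script>'))
# -> Hello world
```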

ikanobori
A: 

The pyparsing wiki Examples page includes this HTML tag stripper.

Paul McGuire