tags:

views:

45

answers:

3

Hello

Is there any way to parse a website using just the content as it is displayed to the user in the browser? That is, instead of downloading "page.html" and parsing the whole page with all its HTML/JavaScript tags, I would like to retrieve the version as displayed to users in their browsers. I want to "crawl" websites and rank them by keyword popularity (working from the raw HTML source is problematic for that purpose).

Thanks!

Joel

A: 

You could get the source and strip the tags out, leaving only non-tag text, which works for almost all pages, except those where JavaScript-generated content is essential.

Delan Azabani
Thanks for the answer. Using re.sub(r'<[^>]*?>', '', in_text) still leaves many unwanted keywords such as "padding", "color", "border", "size", etc. I thought that instead of stripping everything, I could just get the "display version" and work from that.
Joel
That's probably because it's stripping the script and style tags themselves, but not their contents.
Delan Azabani
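Following Delan's point, here is a minimal sketch that removes the script/style elements and their contents before stripping the remaining tags (the function name and sample input are illustrative):

```python
import re

def visible_text(html):
    # Remove <script>/<style> elements and their contents first -- their
    # text is never shown to the user, so it should not count as keywords.
    html = re.sub(r'(?is)<(script|style)\b[^>]*>.*?</\1>', '', html)
    # Then strip the remaining tags, as in the original re.sub.
    return re.sub(r'<[^>]*?>', '', html)

print(visible_text('<p>Hi<style>p { color: red; }</style> there</p>'))
# -> Hi there
```

Note this is still regex-based and will trip over edge cases (comments, CDATA, stray angle brackets); a real HTML parser is the more robust route.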
A: 

A browser also downloads page.html and then renders it. You should work the same way: use an HTML parser such as lxml.html or BeautifulSoup, which lets you ask for only the text enclosed within tags (plus any attributes you care about, such as title and alt).
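For reference, the same idea can be sketched with just the standard library's html.parser (lxml.html's text_content() or BeautifulSoup's get_text() do this more robustly); keeping the title/alt attribute values here is an illustrative choice, not a fixed recipe:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect displayed text, skipping <script>/<style> bodies."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skipping += 1
        # Also keep attribute values a keyword crawler might want.
        for name, value in attrs:
            if name in ("title", "alt") and value:
                self.parts.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skipping:
            self._skipping -= 1

    def handle_data(self, data):
        if not self._skipping:
            self.parts.append(data)

def page_text(html):
    parser = TextOnly()
    parser.feed(html)
    # Collapse runs of whitespace left behind by the markup.
    return " ".join(" ".join(parser.parts).split())

print(page_text('<p>Hello <b>world</b></p><script>var x;</script>'))
# -> Hello world
```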

ikanobori
A: 

The pyparsing wiki Examples page includes this HTML tag stripper.

Paul McGuire