views:

379

answers:

6

I have downloaded the web page into an html file. I am wondering what's the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.

To be clear:

Input:

<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>

Output:

Page title This is paragraph one. This is paragraph two.

Putting the answers together:

# Note: this is the old BeautifulSoup 3 import; with the newer bs4
# package it would be "from bs4 import BeautifulSoup".
from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    # Naive regex approach: drop anything that looks like a tag,
    # allowing for quoted '>' characters inside attribute values.
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Parser approach: parse properly, then join every text node.
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))
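For completeness, a third route uses only the standard library's html.parser module; this is a minimal sketch (assuming Python 3), not taken from the answers:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes a browser would display."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-whitespace text nodes, trimmed.
        if data.strip():
            self.chunks.append(data.strip())

def remove_html_tags3(page):
    parser = TextExtractor()
    parser.feed(page)
    return ' '.join(parser.chunks)

print(remove_html_tags3('<p>Hello <b>world</b>!</p>'))  # -> Hello world !
```

Note that joining trimmed text nodes with spaces can put a space before punctuation, as above; the regex and Beautiful Soup variants make slightly different whitespace trade-offs.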

A: 

The quickest way to get a usable sample of what a browser would display is to remove any tags from the html and print the rest. This can, for example, be done using python's re.
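For example, a rough sketch of that idea (it deliberately ignores edge cases such as `>` inside attribute values, script blocks, and comments):

```python
import re

def strip_tags(page):
    # Naive: treat anything between < and > as a tag and drop it.
    text = re.sub(r'<[^>]*>', '', page)
    # Collapse any whitespace runs left behind.
    return ' '.join(text.split())

print(strip_tags('<p>This is paragraph <b>one</b>.</p>'))
# -> This is paragraph one.
```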

Alexander Gessler
this cannot be done using regex. please, don't confuse people.
SilentGhost
Please explain. I'm not talking about a perfect solution, just about a rough way to get the contents in acceptable quality (I'm aware that the approach is limited). Removing tags is just looking for `<..> and </..>`, so why exactly is it not possible using regexes?
Alexander Gessler
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
SilentGhost
+1. at least your method solves my problem to some extent!
Yin Zhu
"Now you have two problems" http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
Oddthinking
+11  A: 

Parse the HTML with Beautiful Soup.

To get all the text, without the tags, try:

''.join(soup.findAll(text=True))
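(For reference, with the newer bs4 package, assuming it is installed, the same idea reads:)

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body><p>This is paragraph <b>one</b>.</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')
# get_text joins every text node, separated here by single spaces.
print(soup.get_text(' ', strip=True))
```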
Oddthinking
http://www.crummy.com/software/BeautifulSoup/documentation.html I don't see how the renderContents() function helps here; I want to delete the tags.
Yin Zhu
@Yin Zhu - Ah, renderContents works on sub-parts, not the whole document. I replaced the technique with one snipped from the documentation.
Oddthinking
@Yin Zhu: renderContents occurs 6 times in the referenced documentation. Please use a web browser that supports page search.
S.Lott
A: 

If I am understanding your question correctly, this can simply be done with the urlopen function of urllib. Have a look at that function: it opens a URL and reads the response, which will be the HTML code of that page.
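For reference, the download step would look roughly like this (a sketch; urlopen returns a file-like response object whose read() yields the raw bytes):

```python
from urllib.request import urlopen

def fetch_html(url):
    # Open the URL and read the whole response body.
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')
```

This only fetches the markup, which the question says has already been done; stripping the tags still needs one of the other answers.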

Ankit
you're not getting it right, OP says: *I have downloaded the web page into an html file.*
SilentGhost
+3  A: 

The best modules for this task are lxml or html5lib; Beautiful Soup is, imho, not worth using anymore. And for recursively nested markup, regular expressions are definitely the wrong method.

Christian Hausknecht
Would you like to explain why Beautiful Soup is no longer worth using?
Oddthinking
Seconded. What changed about HTML that made Beautiful Soup irrelevant? It abstracts away a lot of the issues with imperfect HTML.
Tom
+2  A: 

You want to look at "Extracting data from HTML documents" in Dive Into Python, because it does (almost) exactly what you want.

TheMachineCharmer
+2  A: 

Personally, I use lxml because it's a swiss-army knife...

from lxml import html

print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())

This tells lxml to retrieve the page, locate the <body> tag then extract and print all the text.

I do a lot of page parsing, and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML, you run a real risk of your regex breaking. A parser is a lot more likely to continue working.

The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser that simplify the task.
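If lxml is not installed, the standard library's xml.etree.ElementTree supports a small XPath subset and the same text-gathering idea, provided the markup is well-formed; a sketch:

```python
import xml.etree.ElementTree as ET

xhtml = ('<html><head><title>Page title</title></head>'
         '<body><p>This is paragraph <b>one</b>.</p></body></html>')
root = ET.fromstring(xhtml)
body = root.find('.//body')       # limited XPath: locate the <body> element
text = ''.join(body.itertext())   # concatenate every text node under it
print(text)                       # -> This is paragraph one.
```

Unlike lxml's html module, ElementTree will reject real-world tag soup, so this only works on clean XHTML.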

Greg