views:

379

answers:

6

I have downloaded the web page into an html file. I am wondering what's the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.

To be clear:

Input:

<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>

Output:

Page title This is paragraph one. This is paragraph two.

Putting the answers together:

# Note: this is the old BeautifulSoup 3 import; with the newer bs4
# package it would be "from bs4 import BeautifulSoup".
from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    # Naive regex approach: drop anything that looks like a tag,
    # allowing for quoted '>' characters inside attribute values.
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Parser approach: parse properly, then join every text node.
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))
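For completeness, a third route uses only the standard library's html.parser module; this is a minimal sketch (assuming Python 3), not taken from the answers:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes a browser would display."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-whitespace text nodes, trimmed.
        if data.strip():
            self.chunks.append(data.strip())

def remove_html_tags3(page):
    parser = TextExtractor()
    parser.feed(page)
    return ' '.join(parser.chunks)

print(remove_html_tags3('<p>Hello <b>world</b>!</p>'))  # -> Hello world !
```

Note that joining trimmed text nodes with spaces can put a space before punctuation, as above; the regex and Beautiful Soup variants make slightly different whitespace trade-offs.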

A: 

The quickest way to get a usable sample of what a browser would display is to remove any tags from the html and print the rest. This can, for example, be done using python's re.
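For example, a rough sketch of that idea (it deliberately ignores edge cases such as `>` inside attribute values, script blocks, and comments):

```python
import re

def strip_tags(page):
    # Naive: treat anything between < and > as a tag and drop it.
    text = re.sub(r'<[^>]*>', '', page)
    # Collapse any whitespace runs left behind.
    return ' '.join(text.split())

print(strip_tags('<p>This is paragraph <b>one</b>.</p>'))
# -> This is paragraph one.
```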

Alexander Gessler
this cannot be done using regex. please, don't confuse people.
SilentGhost
Please explain. I'm not talking about a perfect solution, just about a rough way to get the contents in acceptable quality (I'm aware that the approach is limited). Removing tags is just looking for `<..> and </..>`, so why exactly is it not possible using regexes?
Alexander Gessler
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
SilentGhost
+1. at least your method solves my problem to some extent!
Yin Zhu
"Now you have two problems" http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
Oddthinking
+11  A: 

Parse the HTML with Beautiful Soup.

To get all the text, without the tags, try:

''.join(soup.findAll(text=True))
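(For reference, with the newer bs4 package, assuming it is installed, the same idea reads:)

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body><p>This is paragraph <b>one</b>.</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')
# get_text joins every text node, separated here by single spaces.
print(soup.get_text(' ', strip=True))
```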
Oddthinking
http://www.crummy.com/software/BeautifulSoup/documentation.html I don't see how the renderContents() function helps here; I want to delete the tags.
Yin Zhu
@Yin Zhu - Ah, renderContents works on sub-parts, not the whole document. I replaced the technique with one snipped from the documentation.
Oddthinking
@Yin Zhu: renderContents occurs 6 times in the referenced documentation. Please use a web browser that supports page search.
S.Lott
A: 

If I am understanding your question correctly, this can simply be done with the urlopen function of urllib. Have a look at that function: it opens a URL and reads the response, which will be the HTML code of that page.
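For reference, the download step would look roughly like this (a sketch; urlopen returns a file-like response object whose read() yields the raw bytes):

```python
from urllib.request import urlopen

def fetch_html(url):
    # Open the URL and read the whole response body.
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')
```

This only fetches the markup, which the question says has already been done; stripping the tags still needs one of the other answers.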

Ankit
you're not getting it right, OP says: *I have downloaded the web page into an html file.*
SilentGhost
+3  A: 

The best modules for this task are lxml or html5lib; Beautiful Soup is, imho, not worth using anymore. And for recursively nested markup, regular expressions are definitely the wrong method.

Christian Hausknecht
Would you like to explain why Beautiful Soup is no longer worth using?
Oddthinking
Seconded. What changed about HTML that made Beautiful Soup irrelevant? It abstracts away a lot of the issues with imperfect HTML.
Tom
+2  A: 

You want to look at "Extracting data from HTML documents" in Dive Into Python, because it does (almost) exactly what you want.

TheMachineCharmer
+2  A: 

Personally, I use lxml because it's a swiss-army knife...

from lxml import html

print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())

This tells lxml to retrieve the page, locate the <body> tag then extract and print all the text.

I do a lot of page parsing, and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML, you run a real risk of your regex breaking. A parser is a lot more likely to continue working.

The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser that simplify the task.
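If lxml is not installed, the standard library's xml.etree.ElementTree supports a small XPath subset and the same text-gathering idea, provided the markup is well-formed; a sketch:

```python
import xml.etree.ElementTree as ET

xhtml = ('<html><head><title>Page title</title></head>'
         '<body><p>This is paragraph <b>one</b>.</p></body></html>')
root = ET.fromstring(xhtml)
body = root.find('.//body')       # limited XPath: locate the <body> element
text = ''.join(body.itertext())   # concatenate every text node under it
print(text)                       # -> This is paragraph one.
```

Unlike lxml's html module, ElementTree will reject real-world tag soup, so this only works on clean XHTML.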

Greg