ansaurus

Question

Searching for specific HTML string using Python

Answer 1

+1 A:

BeautifulSoup or lxml.

Ignacio Vazquez-Abrams 2010-04-04 21:06:20

He only needs literal string replacement. No need to parse HTML...

twneale 2010-04-04 21:32:18

Answer 2

A:

htmllib (Python 2 only; it was removed in Python 3, where html.parser is the standard-library alternative)

newtover 2010-04-04 21:09:18

Answer 3

+5 A:

If the string you are searching for will be in the HTML literally, then simple string replacement will be fine:

with open(html_file) as f:
    old_html = f.read()
new_html = old_html.replace(my_string, "")
# Rewrite the file only if the replacement changed something.
if new_html != old_html:
    with open(html_file, "w") as f:
        f.write(new_html)

As an example of the string not appearing in the HTML literally, suppose you are looking for "Test", as you said. Do you want it to match these snippets of HTML?

<a href='test.html'>Test</a>
<A HREF='test.html'>Test</A>
<a href="test.html" class="external">Test</a>
<a href="test.html">Tes&#116;</a>

And so on: the "same" HTML can be written in many different ways. If you know the precise characters used in the HTML, simple string replacement is fine. If you need to match at the HTML semantic level, you'll need a more advanced tool such as BeautifulSoup. Be aware, though, that the output may then differ substantially from the original file, even in sections untouched by the deletion, because the entire file is parsed and reconstituted.
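
To make the difference concrete, here is a minimal sketch of semantic-level matching. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs; the class name LinkTextCollector and the helper link_texts are made up for this illustration. It reports the decoded text of each link, so all four snippets above yield "Test", even though a literal search for ">Test<" misses the last one.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the decoded text content of every <a> element."""

    def __init__(self):
        # convert_charrefs=True turns references such as &#116; into "t".
        super().__init__(convert_charrefs=True)
        self.link_texts = []
        self._depth = 0   # greater than 0 while inside an <a> element
        self._parts = []  # text fragments of the link being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":    # the parser hands over tag names lowercased
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.link_texts.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def link_texts(html):
    """Return the text of every link found in the given HTML string."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return parser.link_texts


snippets = [
    "<a href='test.html'>Test</a>",
    "<A HREF='test.html'>Test</A>",
    '<a href="test.html" class="external">Test</a>',
    '<a href="test.html">Tes&#116;</a>',
]

for snippet in snippets:
    print(link_texts(snippet), ">Test<" in snippet)
```

This only finds the matches. Deleting them and writing the file back out is where a parser-based tool ends up re-serializing the whole document.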

To run the code over many files, you'll find os.walk useful for finding files in a directory tree (os.path.walk, its Python 2 predecessor, was removed in Python 3), or glob.glob for matching filenames against shell-like wildcard patterns.
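
As a rough sketch of running the literal replacement over a whole tree: the function name remove_string_from_tree, the "*.html" default pattern, and the UTF-8 encoding are assumptions made for this example, not anything from the question. It uses os.walk, which works in both Python 2 and 3, together with fnmatch for the wildcard test.

```python
import fnmatch
import os


def remove_string_from_tree(root, target, pattern="*.html"):
    """Delete every literal occurrence of target from matching files under root.

    Returns the list of paths that were rewritten.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            path = os.path.join(dirpath, name)
            # Assumption: the files are UTF-8; adjust the encoding if not.
            with open(path, encoding="utf-8") as f:
                old_html = f.read()
            new_html = old_html.replace(target, "")
            # Leave files that do not contain the string untouched.
            if new_html != old_html:
                with open(path, "w", encoding="utf-8") as f:
                    f.write(new_html)
                changed.append(path)
    return changed
```

Calling remove_string_from_tree("site", my_string) would then process every .html file below a (hypothetical) site directory. It is worth trying this on a copy of the files first, since the originals are overwritten in place.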

Ned Batchelder 2010-04-04 21:09:54

That solves the string replacement, but what about having to run the same script for hundreds of HTML pages?

Morpheous 2010-04-04 21:17:09

Added os.path.walk and glob.glob to the answer...

Ned Batchelder 2010-04-04 21:19:16
