I'm looking for a way to find a piece of markup that appears in 1000+ HTML files published on Unix servers (served by Apache) and replace it with either empty nodes or alternate HTML markup.

ex:

Find

<div id="someComponent"> .....{a bunch of interior markup} .... </div>

Replace with {empty}

ex 2:

Find </div></body>

Replace </div>{some HTML markup needed here}</body>

+1  A: 

One way to do it: use Python with BeautifulSoup to parse each HTML file, do the replacement, and write the result back.
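A rough sketch of the parse-and-rewrite idea follows. BeautifulSoup is a third-party package, so to stay dependency-free this example substitutes the standard library's `html.parser`; it is not BeautifulSoup's API. The names `DivRemover` and `strip_div` are made up for illustration, and the sketch assumes the target is a `div` identified by its `id`, as in the question's first example.

```python
from html.parser import HTMLParser


class DivRemover(HTMLParser):
    """Copy HTML through, dropping the <div> with a given id (hypothetical helper)."""

    def __init__(self, target_id, replacement=""):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.replacement = replacement
        self.out = []
        self.depth = 0  # > 0 while we are inside the target div

    def emit(self, text):
        if not self.depth:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # nested div: remember to skip its </div> too
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.out.append(self.replacement)
            self.depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "div":
                self.depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.emit(data)

    def handle_entityref(self, name):
        self.emit(f"&{name};")

    def handle_charref(self, name):
        self.emit(f"&#{name};")

    def handle_comment(self, data):
        self.emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.emit(f"<!{decl}>")

    def handle_pi(self, data):
        self.emit(f"<?{data}>")


def strip_div(html, target_id, replacement=""):
    parser = DivRemover(target_id, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because it counts nested `div` elements, this handles a component that contains other divs. It does normalise end tags to lower case, so the output is not guaranteed to be byte-identical outside the removed component.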

jldupont
Agreed. Normally I'd suggest `perl -pi -e`, but it seems like you need something that's aware of HTML's structure.
David Wolever
A: 

If the markup is written in the same way in all the files, sed or perl will be much quicker than BeautifulSoup or the like, but a text-based pattern is harder to make robust against the various ways the same HTML can be written out as text (attribute order, quoting, whitespace, letter case).
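To illustrate that trade-off on the question's second example, here is a minimal text-based sketch in Python that tolerates whitespace and letter-case differences around `</div></body>`. The `SNIPPET` value and the function name are placeholders, since the question does not say what markup needs to be inserted.

```python
import re

# Hypothetical markup to inject; the question leaves the real content unspecified.
SNIPPET = '<p class="footer-note">placeholder</p>'

# The closing </div> followed by </body>, allowing whitespace between and
# inside the tags, and either letter case in the tag names.
CLOSING = re.compile(r"(</div\s*>)(\s*)(</body\s*>)", re.IGNORECASE)


def insert_before_body_close(html, snippet=SNIPPET):
    # A function as the replacement keeps backslashes in the snippet literal.
    return CLOSING.sub(
        lambda m: m.group(1) + snippet + m.group(2) + m.group(3),
        html,
        count=1,
    )
```

Even this small pattern only covers the variations it was written for; anything else (a comment between the two tags, for instance) would need another tweak, which is where a parser starts to pay off.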

Do you have a more concrete example of what kind of markup you're looking for, and ideally how it might vary from file to file? Where in the file will it be? Also, is it okay to prettify or tidy the HTML in the process if necessary?

Oh, and are you running something on the server(s), or do you need code to spider the server to retrieve the HTML files for processing?

Walter Mundt
+1  A: 

If it is really simple (no parsing needed, the markup is well known, and the target elements are never nested inside one another), the fastest way should be:

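(In Zsh; in Bash 4+, run `shopt -s globstar` first and drop the trailing `(.)`, which is a Zsh-only glob qualifier meaning "regular files only")

perl -0777 -pi -e 's#<div class="toto">.*?</div>#<span>new content</span>#gs' /path/to/files/**/*.html(.)

The -0777 switch makes perl read each file as a whole, and the s modifier lets . match newlines, so the pattern also matches a div that spans several lines.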

That should replace everything from <div class="toto"> up to the next </div> with <span>new content</span>.

But beware: it will NOT work for nested divs such as ...<div class="toto"> ... <div class="toto"> ... </div> ... </div> ..., because the non-greedy match stops at the first </div> it finds.
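The nesting problem is easy to reproduce. The snippet below applies the same non-greedy pattern with Python's `re` module (used here only as a stand-in for the perl substitution) to a made-up nested input:

```python
import re

# Same pattern as the one-liner; DOTALL plays the role of perl's /s modifier.
pattern = re.compile(r'<div class="toto">.*?</div>', re.DOTALL)

nested = '<div class="toto"> a <div class="toto"> b </div> c </div>'
result = pattern.sub("<span>new content</span>", nested)

print(result)  # <span>new content</span> c </div>
```

The match ends at the inner `</div>`, so the tail of the outer div and a stray closing tag are left behind.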

yogsototh
I will try the above recommended approach and get back to you once I've tested it in our dev environment.
nopuck4you
Can the search HTML and the replacement HTML be read from files instead of being given inline?
nopuck4you
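One possible way to read the search and replacement markup from files, sketched in Python rather than perl: the function name `replace_in_tree` and its parameters are illustrative only, and it assumes UTF-8 files and an exact, literal match (no regular expressions). Surrounding whitespace in the two snippet files is stripped so that a trailing newline added by an editor does not prevent a match.

```python
from pathlib import Path


def replace_in_tree(root, search_file, replace_file, pattern="*.html"):
    """Literal search/replace in every matching file under root.

    Returns the list of files that were changed.
    """
    search = Path(search_file).read_text(encoding="utf-8").strip()
    replacement = Path(replace_file).read_text(encoding="utf-8").strip()
    if not search:
        raise ValueError("search file is empty")

    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        if search in text:
            # Rewrites the file in place; back up the tree before running this.
            path.write_text(text.replace(search, replacement), encoding="utf-8")
            changed.append(path)
    return changed
```

A call might look like `replace_in_tree("/path/to/files", "search.html", "replace.html")`, where both snippet file names are placeholders. Because the match is literal, the same caveat about varying markup applies as for the sed and perl approaches.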