The goal is to read through HTML files and change all instances of MyWord to Myword, except that the word must NOT be changed when it appears inside, or as part of, a path, file name, or script:

href="..."
src="..."
url(...)
class="..."
id="..."
script inline or linked (file name) --> <script ...></script>
styles inline or linked (file name) --> <link ...>   <style></style>  

Now the question of all questions: how do you determine whether an instance of the word is in a position where it's OK to change it? (Or, how do you determine whether the word is inside one of the locations listed above and should not be changed?)

Here is my code. It can be changed to read line by line, etc., but I just cannot think of how to define and enforce a rule to match the above...

Here it is:

#!/usr/bin/python

import os
import time
from stat import *

def fileExtension(s):
   i = s.rfind('.')
   if i == -1:
      return ''
   tmp = '|' + s[i+1:] + '|'
   return tmp

def changeFiles():
   # get all files in current directory with desired extension
   files = [f for f in os.listdir('.') if extStr.find(fileExtension(f)) != -1]

   for f in files:
      if os.path.isdir(f):
         continue

      st = os.stat(f)
      atime = st[ST_ATIME] # org access time
      mtime = st[ST_MTIME] # org modification time

      fw = open(f, 'r+')
      tmp = fw.read().replace(oldStr, newStr)
      fw.seek(0)
      fw.write(tmp)
      fw.truncate() # drop leftover bytes if the new text is shorter
      fw.close()

      # put file timestamp back to org timestamp
      os.utime(f,(atime,mtime))

   # if we want to check subdirectories, recurse once per directory
   # (after all files in this directory have been processed)
   if checkSubDirs:
      dirs = [d for d in os.listdir('.') if os.path.isdir(d)]
      for d in dirs:
         os.chdir(d)
         changeFiles()
         os.chdir('..')

# ==============================================================================
# ==================================== MAIN ====================================

oldStr = 'MyWord'
newStr = 'Myword'
extStr = '|html|htm|'
checkSubDirs = True

changeFiles()  

Does anybody know how? Any suggestions? ANY help is appreciated; I have been racking my brain for two days now and just cannot think of anything.

A: 

Use a regex. Here is an example that you can start with; I hope this helps:

import re

html = """
    <href="MyWord" />
    MyWord
"""

re.sub(r'(?<!href=")MyWord', 'Myword', html)
# output: '\n    <href="MyWord" />\n    Myword\n'
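
Note that the lookbehind above only protects the word when it comes directly after href=", so something like href="pages/MyWord.html" would still be changed. One possible way to cover the other locations listed in the question is to match the protected constructs first and replace only in the text between them. The following is a rough sketch under that assumption, not a complete solution: the names PROTECTED and replace_outside_protected are made up for illustration, it assumes attribute values are quoted, and it does not handle HTML comments or other edge cases that a real HTML parser would.

```python
import re

# One capturing group wraps the whole alternation (inner groups are
# non-capturing), so re.split() returns
# [outside, protected, outside, protected, ..., outside].
PROTECTED = re.compile(r"""
    ( <script\b.*?</script>                      # inline or linked scripts
    | <style\b.*?</style>                        # inline style blocks
    | <link\b[^>]*>                              # linked stylesheets
    | \b(?:href|src|class|id)\s*=\s*"[^"]*"      # double-quoted attributes
    | \b(?:href|src|class|id)\s*=\s*'[^']*'      # single-quoted attributes
    | url\([^)]*\)                               # CSS url references
    )
""", re.IGNORECASE | re.DOTALL | re.VERBOSE)


def replace_outside_protected(html, old, new):
    """Replace old with new everywhere except inside protected constructs."""
    parts = PROTECTED.split(html)
    # Even indices are ordinary text; odd indices are protected matches.
    return ''.join(
        part if i % 2 else part.replace(old, new)
        for i, part in enumerate(parts)
    )


sample = (
    '<a href="MyWord/MyWord.html" class="MyWord">MyWord</a>'
    '<script src="MyWord.js"></script>'
    '<p style="background:url(MyWord.png)">MyWord rocks</p>'
)
print(replace_outside_protected(sample, 'MyWord', 'Myword'))
```

In changeFiles() this could stand in for the plain .replace(oldStr, newStr) call. Other attributes (title, alt, and so on) are not protected here and would need to be added to the pattern if they should be left alone.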

ref : http://docs.python.org/library/re.html

singularity
@knitti: You mean "don't parse HTML with regex"? Do I look like I'm parsing HTML, heh?
singularity
OK, you're right, you just happen to have some text which could be HTML and do some search-and-replace. Sorry. When you edit your answer I can take back my vote.
knitti
@knitti: ???? I don't quite follow you. Are you being sarcastic? OK, let's be more mature. The thing is that on SO a lot of HTML parsing questions are being ignored and left unanswered because of this misunderstanding of the problem with regex and HTML, and of what parsing means. Quote: "But we should also be teaching them the very real difference between parsing HTML and the simple expedience of processing a few strings" (Jeff Atwood, ref: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html)
singularity
No, I screwed this up, and I offered to take my vote back, which only works when you reedit and I take it back within 5 mins or so.
knitti
@singularity: Hmmm, that looks interesting. I am terrible at regex but will play with your example and go from there... Thanks.
Angie