I've been putting together a list of pages that we need to update with new content (we're switching media formats). In the process I'm cataloging pages that correctly have the new content.
Here's the general idea of what I'm doing:
- Iterate through a file structure and get a list of files
- For each file, read it into a buffer and use a regex search to match a specific tag
- If the tag matches, test 2 more regex patterns against it
- Write the resulting match (one or the other) into a database
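The first two steps can be sketched roughly as follows. This is a minimal sketch, not my exact script: the function name `iter_embed_tags`, the use of `os.walk`, and the UTF-8 decoding with `errors="replace"` are assumptions made for illustration, and the database step is left out.

```python
import io
import os
import re

# Same tag pattern as in the snippet further down.
EMBED_PATTERN = "(<embed .*?</embed>)"


def iter_embed_tags(root):
    """Yield (path, tag) for every <embed> element found in files under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: files are text; undecodable bytes are replaced.
            with io.open(path, "r", encoding="utf-8", errors="replace") as handle:
                filebuffer = handle.read()
            for tag in re.findall(EMBED_PATTERN, filebuffer):
                yield path, tag
```

Each yielded tag would then be tested against the two URL patterns before anything is written to the database.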
Everything works fine up until the 3rd regex pattern match, where I get the following:
'NoneType' object has no attribute 'group'
I can comment out the 2nd match and the 3rd works fine. And it's a complete mystery to me.
import re

# only interested in embedded content
pattern = "(<embed .*?</embed>)"
# matches content pointing to our old root
pattern2 = 'data="(http://.*?/media/.*?")'
# matches content pointing to our new root
pattern3 = 'data="(http://.*?/content/.*?")'

matches = re.findall(pattern, filebuffer)
for match in matches:
    if len(match) > 0:
        urla = re.search(pattern2, match)
        if urla.group(1) is not None:
            print filename, urla.group(1)
        urlb = re.search(pattern3, match)
        if urlb.group(1) is not None:
            print filename, urlb.group(1)
As you can see, I've even tried using different variable names for the 2nd and 3rd pattern matches, which doesn't help at all. If I comment out the entire urla block, urlb works fine.
Any idea what I might be doing wrong? Or is there some type of shared regex object that isn't intended to be used in more than one or two instances?
The URLs are a bit more complicated than those listed above, which is why I'm using regex matches for the conditions. It's looking like I'll have to do multiple passes, but I don't grasp why I should have to.
Thank you.
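One possible explanation, sketched under assumptions: `re.search` returns `None` when a pattern does not match, so calling `.group(1)` on that result raises the `'NoneType'` error, and a tag that points at only one root will always miss the other pattern. The helper name `classify` and the `example.com` URLs below are hypothetical; the two patterns are the ones from the snippet.

```python
import re

pattern2 = 'data="(http://.*?/media/.*?")'
pattern3 = 'data="(http://.*?/content/.*?")'


def classify(tag):
    """Return ('old', url), ('new', url), or None for a single <embed> tag."""
    urla = re.search(pattern2, tag)
    # Test the match object itself, not urla.group(1): urla is None on a miss.
    if urla is not None:
        return ("old", urla.group(1))
    urlb = re.search(pattern3, tag)
    if urlb is not None:
        return ("new", urlb.group(1))
    return None


# Hypothetical tag that points at the new root only.
new_tag = '<embed data="http://example.com/content/clip.swf"></embed>'
assert re.search(pattern2, new_tag) is None  # this is where .group(1) would fail
assert classify(new_tag) is not None
```

Note that the captured group in both patterns includes the closing double quote, because the `"` sits inside the parentheses.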
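If a single pass is the goal, one option might be to fold both roots into one pattern with alternation. This is only a sketch: `ROOT_PATTERN` and `find_root` are hypothetical names, and the pattern is a simplification of the ones above (it uses `[^"]` instead of `.` and leaves the closing quote out of the captured URL), so it would need adjusting for the real URLs.

```python
import re

# Group 1 is the URL, group 2 is whichever root segment was found.
ROOT_PATTERN = re.compile(r'data="(http://[^"]*?/(media|content)/[^"]*)"')


def find_root(tag):
    """Return (root, url) for a tag, or None if neither root appears."""
    match = ROOT_PATTERN.search(tag)
    if match is None:
        return None
    return match.group(2), match.group(1)
```

With this shape, the old/new decision comes from `match.group(2)` rather than from which of two separate searches happened to succeed.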