Hello Everyone!

I am working on a very simple script to clean up a few hundred thousand small XML files. My current method is to iterate through the directory and, for each file, read it, use String#gsub! to make all my changes (not sure if this is best), and then write the new contents back to the file. My code looks something like the following:

Dir.entries('.').each do |file_name|

  f = File.read( file_name )

  f.gsub!( /softwareiconneedsshine>(.|\s)*<\/softwareiconneedsshine>/i, '' )
  f.gsub!( /<rating>(.|\s)*<\/rating>/, '' )

  f.gsub!( /softwareIdentifiers>/, 'version_history>' )

  #some more regex's

  File.open( file_name, 'w' ) { |w| w.write(f) }

end

This all looks fine and dandy, but for some reason that I cannot figure out, the program hangs at the gsub! calls that are similar to the first two shown. The hangs seem random, but they only ever happen at those points. Sometimes it works, other times it just hangs. Why would it work sometimes but not others?

Any help is greatly appreciated!!

+2  A: 

Without knowing anything else about your environment or the type of files you're reading, I would suggest making your Kleene stars non-greedy. That is, change `(.|\s)*` to `(.|\s)*?`.

jason.rickman
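To make the suggestion concrete, here is a minimal sketch of the greedy and non-greedy versions of the `<rating>` pattern from the question. The sample XML is hypothetical (the real files are not shown in the question), and the third variant, which uses Ruby's `m` flag so that `.` also matches newlines, is an alternative way to write the pattern rather than something proposed in the answer.

```ruby
# Hypothetical sample document; the real files' contents are not shown.
xml = <<~XML
  <app>
    <rating>4</rating>
    <name>Example</name>
    <rating>5</rating>
    <softwareIdentifiers>1.0</softwareIdentifiers>
  </app>
XML

# Greedy: runs from the first <rating> to the LAST </rating>,
# so everything in between (including <name>) is removed too.
greedy = xml.gsub(/<rating>(.|\s)*<\/rating>/, '')

# Non-greedy: stops at the first </rating> after each <rating>.
lazy = xml.gsub(/<rating>(.|\s)*?<\/rating>/, '')

# Alternative spelling: the m flag lets . match newlines,
# so the (.|\s) alternation is not needed.
lazy_m = xml.gsub(/<rating>.*?<\/rating>/m, '')

puts greedy.include?('<name>')  # false
puts lazy.include?('<name>')    # true
puts lazy == lazy_m             # true
```

Note that the greedy version does not just risk being slow; on a file with more than one `<rating>` element it also deletes the content between them.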
That worked surprisingly well; the script executed flawlessly! But could you explain exactly what that did? I'm still not sure what is going on. Thanks!
John
For details, see the descriptions of `*` and `*?` here: http://www.regular-expressions.info/reference.html
bta
'Greedy' matching starts with the largest possible match and shrinks it until it finds the proper match. 'Lazy' matching starts with the smallest possible match and expands it. For example, take the string `abc "def" "ghi" jkl`. The 'greedy' regex `".*"` would match `"def" "ghi"` and the 'lazy' regex `".*?"` would match `"def"`.
bta
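The quoted-string example above can be checked directly in Ruby. This is just a small sketch of that one example; the variable names are only for illustration.

```ruby
text = 'abc "def" "ghi" jkl'

# Greedy: .* grabs as much as it can, then backs off to the last quote.
greedy = text[/".*"/]

# Lazy: .*? grabs as little as it can, stopping at the first closing quote.
lazy = text[/".*?"/]

puts greedy  # "def" "ghi"
puts lazy    # "def"
```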
I'm guessing it's trying to find the longest possible match in a long document, so it's taking a long time to traverse the whole document.
Ken Bloom
@Ken Bloom - it's actually a fairly small document. @bta - thanks! That makes a lot of sense.
John