views: 232
answers: 5

I often find myself needing a tool that would allow me to:

search a large file for multiple multi-line regex patterns and replace the matches using back-references.

Should I:

  1. take the 2 hours it'll require to build myself such a tool
  2. use something someone has already built (please suggest)
  3. learn to use a language that's particularly good at this type of thing (Perl?)


Example
I have an XML document containing thousands of entries. About 100 entries, each with a known value field, need to be removed. I can build a regular expression for each entry; the expression is identical for all 100 except for the value string. The tool would either need to loop through once per value, or run just once with 100 OR terms (|) in the expression (it would be huge). In this case I'm replacing the matches with a blank, but in other cases I'd reformat the text and re-insert the value field.
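A minimal sketch of the alternation approach in Python — note the entry/value layout, the value strings, and the `<entry>`/`<value>` element names below are all assumptions for illustration, not taken from the actual document:

```python
import re

# Hypothetical sample input (assumed structure)
text = """<entries>
<entry>
  <value>apple</value>
</entry>
<entry>
  <value>bad-1</value>
</entry>
<entry>
  <value>bad-2</value>
</entry>
</entries>
"""

# One pattern with an alternation over every unwanted value;
# \s* crosses the line breaks inside each multi-line entry.
values = ["bad-1", "bad-2"]
pattern = re.compile(
    r"<entry>\s*<value>(?:%s)</value>\s*</entry>\n?"
    % "|".join(map(re.escape, values))
)

# Replace each matching entry with a blank
cleaned = pattern.sub("", text)
print(cleaned)
```

Anchoring the pattern on `<entry>\s*<value>` keeps the match inside a single entry; a bare `<entry>.*?<value>` with `re.DOTALL` would let the match start at an earlier, innocent entry and delete it too.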

+2  A: 

I reckon you should write the thing in Python. The Python re library is great:

# get the re library
import re

# this is the line to process
xml_line = '<stuff><bad i_am_naughty="True"></bad></stuff>'
# compile a regex that captures the text before, inside, and after the <bad> element
exp = re.compile("(.*)(<bad.*bad>)(.*)")
# run the regex on the line
match = exp.search(xml_line)
# print out the groups the regex found
print(match.groups())

N.B. You could also use Python's XML parsing libraries to strip out the elements you don't want. Using an XML parser removes some of the complexity I have ignored in my example (multiple lines, etc.). In lieu of a Python XML parsing example, this question has some good answers on parsing XML in Python.
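As a rough sketch of that parsing route, the standard-library ElementTree module can delete the unwanted elements directly (the document and element names here are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names are assumptions
doc = ET.fromstring('<stuff><good/><bad i_am_naughty="True"/><good/></stuff>')

# Delete every <bad> child directly, instead of regex-matching the markup
for bad in doc.findall("bad"):
    doc.remove(bad)

cleaned = ET.tostring(doc)
print(cleaned)
```

This sidesteps the multi-line problem entirely, since the parser doesn't care how the markup is split across lines.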

Cannonade
A: 

I would suggest not using regular expressions. XML should usually be handled with XML tools. Why not just use XSLT?

Daniel Brückner
This case is xml but many cases aren't.
Mr Grieves
I think he wants to use regexes to make the expression for choosing one of the values easy to construct. The alternative in XSLT would have about 100 matching templates, right?
hughdbrown
A: 

Is there some reason why you have discounted sed as an option? It seems to do exactly what you need with its substitute command and the RE engine supports back-references to minimize the number of REs you'll have to create.

It's not whizz-bang at procedural code, so if you have a lot of decisions to make, you're better off writing something in Python/Perl, but it allows reasonably complex substitutions.

paxdiablo
Does sed work on multi-line regexes? I thought it was like awk in this regard: it works only on single lines in sequence.
hughdbrown
According to wikipedia.org: "It reads input files line by line (sequentially), applying the operation which has been specified via the command line (or a sed script), and then outputs the line."
Mr Grieves
No, I wouldn't use it for anything more complicated but, if your search strings don't cross lines, it would be adequate. Anything more complicated I would do in Perl.
paxdiablo
IIRC, sed *can* be made to work with multiline regexes, but it quickly becomes really hairy. For complex substitutions, I'd go with Python / Perl.
Jeff Shannon
+1  A: 

I am not quite sure what your data looks like, but I would consider writing the tool in Python in three passes:

  1. convert the file of XML path plus variable = value to lines of XML.path.variable=value
  2. apply massive regex to each line, possibly deleting line from output
  3. convert shortened list of XML.path.variable=value lines back to XML
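A rough sketch of those three passes — the `entries`/`entry`/`value` layout and the unwanted values are assumptions for illustration, and ElementTree stands in for whatever parser you'd actually use:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input document (assumed structure)
doc = ET.fromstring(
    "<entries>"
    "<entry><value>keep-me</value></entry>"
    "<entry><value>bad-1</value></entry>"
    "<entry><value>bad-2</value></entry>"
    "</entries>"
)

# Pass 1: flatten each entry to a single "path=value" line
lines = ["entry.value=%s" % e.findtext("value") for e in doc.findall("entry")]

# Pass 2: apply one big alternation regex, deleting matching lines
bad = re.compile(r"=(?:bad-1|bad-2)$")
kept = [ln for ln in lines if not bad.search(ln)]

# Pass 3: rebuild XML from the surviving lines
root = ET.Element("entries")
for ln in kept:
    _, value = ln.split("=", 1)
    ET.SubElement(ET.SubElement(root, "entry"), "value").text = value

print(ET.tostring(root))
```

The point of the flattening is that once each entry is one line, plain single-line regex tools work and the multi-line problem disappears.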
hughdbrown
A: 

There are a large number of modules that could be used to handle XML:

http://www.crummy.com/software/BeautifulSoup/
http://codespeak.net/lxml/