tags:

views:

336

answers:

6

I'm using Python to write a regular expression for replacing parts of the string with a XML node.

The source string looks like:

Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace

And the result string should be like:

Hello
<replace name="str1"> this is to replace </replace>
<replace name="str2"> this is to replace </replace>

Can anyone help me?

+5  A: 

Here is an excellent tutorial on how to write regular expressions in Python.

JasCav
A: 

Maybe like this ?

import re

mystr = """Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""

prog = re.compile(r'REPLACE\((.*?)\)\s(.*)')

for line in mystr.split("\n"):
    print prog.sub(r'< replace name="\1" > \2',line)
A: 

Something like this should work:

import re,sys

f = open( sys.argv[1], 'r' )
for i in f:
    g = re.match( r'REPLACE\((.*)\)(.*)', i )
    if g is None:
        print i
    else:
        print '<replace name=\"%s\">%s</replace>' % (g.group(1),g.group(2))
f.close()
Graeme Perrow
+4  A: 

What makes your problem a little bit tricky is that you want to match inside of a multiline string. You need to use the re.MULTILINE flag to make that work.

Then, you need to match some groups inside your source string, and use those groups in the final output. Here is code that works to solve your problem:

import re


s_pat = "^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)

s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""


def mksub(m):
    return '<replace name="%s">%s</replace>' % m.groups()


s_output = re.sub(pat, mksub, s_input)

The only tricky part is the regular expression pattern. Let's look at it in detail.

^ matches the start of a string. With re.MULTILINE, this matches the start of a line within a multiline string; in other words, it matches right after a newline in the string.

\s* matches optional whitespace.

REPLACE matches the literal string "REPLACE".

\( matches the literal string "(".

( begins a "match group".

[^)] means "match any character but a ")".

+ means "match one or more of the preceding pattern.

) closes a "match group".

\) matches the literal string ")"

(.*) is another match group containing ".*".

$ matches the end of a string. With re.MULTILINE, this matches the end of a line within a multiline string; in other words, it matches a newline character in the string.

. matches any character, and * means to match zero or more of the preceding pattern. Thus .* matches anything, up to the end of the line.

So, our pattern has two "match groups". When you run re.sub() it will make a "match object" which will be passed to mksub(). The match object has a method, .groups(), that returns the matched substrings as a tuple, and that gets substituted in to make the replacement text.

EDIT: You actually don't need to use a replacement function. You can put the special string \1 inside the replacement text, and it will be replaced by the contents of match group 1. (Match groups count from 1; the special match group 0 corresponds the the entire string matched by the pattern.) The only tricky part of the \1 string is that \ is special in strings. In a normal string, to get a \, you need to put two backslashes in a row, like so: "\\1" But you can use a Python "raw string" to conveniently write the replacement pattern. Doing so you get this:

import re

s_pat = "^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)

s_repl = r'<replace name="\1">\2</replace>'

s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""


s_output = re.sub(pat, s_repl, s_input)
steveha
A: 
import re

a="""Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""

regex = re.compile(r"^REPLACE\(([^)]+)\)\s+(.*)$", re.MULTILINE)

b=re.sub(regex, r'< replace name="\1" > \2 < /replace >', a)

print b

will do the replace in one line.

Tim Pietzcker
+1  A: 

Here is a solution using pyparsing. I know you specifically asked about a regex solution, but if your requirements change, you might find it easier to expand a pyparsing parser. Or a pyparsing prototype solution might give you a little more insight into the problem leading toward a regex or other final implementation.

src = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
"""

from pyparsing import Suppress, Word, alphas, alphanums, restOfLine

LPAR,RPAR = map(Suppress,"()")
ident = Word(alphas, alphanums)
replExpr = "REPLACE" + LPAR + ident("name") + RPAR + restOfLine("body")
replExpr.setParseAction(
    lambda toks : '<replace name="%(name)s">%(body)s </replace>' % toks
    )

print replExpr.transformString(src)

In this case, you create the expression to be matched with pyparsing, define a parse action to do the text conversion, and then call transformString to scan through the input source to find all the matches, apply the parse action to each match, and return the resulting output. The parse action serves a similar function to mksub in @steveha's solution.

In addition to the parse action, pyparsing also supports naming individual elements of the expression - I used "name" and "body" to label the two parts of interest, which are represented in the re solution as groups 1 and 2. You can name groups in an re, the corresponding re would look like:

s_pat = "^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$"

Unfortunately, to access these groups by name, you have to invoke the group() method on the re match object, you can't directly do the named string interpolation as in my lambda parse action. But this is Python, right? We can wrap that callable with a class that will give us dict-like access to the groups by name:

class CallableDict(object):
    def __init__(self,fn):
        self.fn = fn
    def __getitem__(self,name):
        return self.fn(name)

def mksub(m):    
    return '<replace name="%(name)s">%(body)s</replace>' %  CallableDict(m.group)

s_output = re.sub(pat, mksub, s_input)

Using CallableDict, the string interpolation in mksub can now call m.group for each field, by making it look like we are retrieving the ['name'] and ['body'] elements of a dict.

Paul McGuire