I have an end tag followed by a carriage return and line feed (x0Dx0A), followed by one or more tabs (x09), followed by a new start tag.

Something like this:

</tag1>x0Dx0Ax09x09x09<tag2> or </tag1>x0Dx0Ax09x09x09x09x09<tag2>

What Python regex should I use to replace it with something like this:

</tag1><tag3>content</tag3><tag2>

Thanks in advance.

A: 

The regex for a generic version of this (i.e. one that matches regardless of the numbers in the tag names) is:

(</tag\d+>)x0Dx0A(?:x09)+(<tag\d+>)

You can use this in what cjrh provided to do the replacement, as follows:

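import re
# Sample input with the separators written as literal text, as in the question
text    = '</tag1>x0Dx0Ax09x09x09<tag2> or </tag1>x0Dx0Ax09x09x09x09x09<tag2>'
# Raw strings keep \d and the \1, \2 backreferences intact
pattern = r'(</tag\d+>)x0Dx0A(?:x09)+(<tag\d+>)'
replace = r'\1<tag3>content</tag3>\2'
output  = re.compile(pattern, re.M | re.S).sub(replace, text)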
JGB146
Your grouping parentheses are in strange places; how do you propose to actually use that regex?
John Machin
Honestly, today was my first venture into Python Regex, and some of that was from improper translation between what I'm used to in PHP and what I saw elsewhere in Python. I then saw that others were editing their posts and thought they had it solved. Will adjust mine accordingly to be more complete.
JGB146
Edited (and tested). I believe this performs exactly as requested (at least, it did in my tests!)
JGB146
The likelihood that the OP has literally "x0Dx0Ax09" etc in his data is rather small ...
John Machin
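
To illustrate that last comment: if the data contains real CR, LF and tab characters rather than the literal text "x0Dx0Ax09", the generic pattern needs the corresponding escapes. A minimal sketch, assuming the tags are always named tag followed by digits (the sample string here is made up for illustration):

```python
import re

# Assumption: the input holds real CR, LF and tab characters,
# not the literal text "x0Dx0Ax09".
text = '</tag1>\r\n\t\t\t<tag2> or </tag7>\r\n\t\t\t\t\t<tag8>'

# Any numbered end tag, then CRLF, then one or more tabs, then any numbered start tag.
pattern = re.compile(r'(</tag\d+>)\r\n\t+(<tag\d+>)')

# \1 and \2 put the captured end tag and start tag back around the new element.
result = pattern.sub(r'\1<tag3>content</tag3>\2', text)
print(result)
```

Inside a raw string, the re module itself interprets \r, \n and \t, so the pattern matches the actual control characters.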
A: 

Here is code for something like what you say that you need:

>>> import re
>>> sample = '</tag1>\r\n\t\t\t\t<tag2>'
>>> sample
'</tag1>\r\n\t\t\t\t<tag2>'
>>> pattern = '(</tag1>)\r\n\t+(<tag2>)'
>>> replacement = r'\1<tag3>content</tag3>\2'
>>> re.sub(pattern, replacement, sample)
'</tag1><tag3>content</tag3><tag2>'
>>>

Note that \r\n\t+ may be a bit too specific, especially if production of your input is not under your control. It may be better to adopt the much more general \s* (zero or more whitespace characters).
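
A rough sketch of that more general variant, in case it helps. The helper name insert_between and its default argument are made up for this example; only re.sub is standard:

```python
import re

def insert_between(text, new='<tag3>content</tag3>'):
    """Insert `new` between </tag1> and <tag2>, dropping any whitespace between them.

    Hypothetical helper, not part of any library.
    """
    # \s* matches zero or more whitespace characters (CR, LF, tab, space, ...).
    # A function is used as the replacement so that backslashes in `new`
    # are never interpreted as group references.
    return re.sub(r'(</tag1>)\s*(<tag2>)',
                  lambda m: m.group(1) + new + m.group(2),
                  text)

print(insert_between('</tag1>\r\n\t\t\t\t<tag2>'))
print(insert_between('</tag1>\n    <tag2>'))
```

Because \s* also matches the empty string, '</tag1><tag2>' with nothing in between gets the insertion too; use \s+ instead if that is not wanted.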

Using regexes to parse XML and HTML is not a good idea in general ... while it's hard to see a failure mode here (apart from elementary errors in getting the pattern correct), you might like to tell us what the underlying problem is, in case some other solution is better.

John Machin