ansaurus

Question

Python Regex Question

Answer 1

A:

A generic version of the regex (i.e. one that matches regardless of the numbers in the tag names) is:

(</tag\d+>)x0Dx0A(?:x09)+(<tag\d+>)

You can use this in what cjrh provided to do the replacement, as follows:

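import re
# Sample text containing the literal character sequences x0D, x0A and x09
input   = '</tag1>x0Dx0Ax09x09x09<tag2> or </tag1>x0Dx0Ax09x09x09x09x09<tag2>'
# Raw string so that \d reaches the regex engine unchanged
pattern = r'(</tag\d+>)x0Dx0A(?:x09)+(<tag\d+>)'
replace = r'\1<tag3>content</tag3>\2'
# Pass the names defined above (pattern, replace) to compile() and sub()
output  = re.compile(pattern, re.M | re.S).sub(replace, input)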

JGB146 2010-07-23 21:24:08

Your grouping parentheses are in strange places; how do you propose to actually use that regex?

John Machin 2010-07-24 01:02:09

Honestly, today was my first venture into Python Regex, and some of that was from improper translation between what I'm used to in PHP and what I saw elsewhere in Python. I then saw that others were editing their posts and thought they had it solved. Will adjust mine accordingly to be more complete.

JGB146 2010-07-24 01:20:11

Edited (and tested). I believe this performs exactly as requested (at least, it did in my tests!)

JGB146 2010-07-24 01:38:25

The likelihood that the OP has literally "x0Dx0Ax09" etc in his data is rather small ...

John Machin 2010-07-24 03:18:51

Answer 2

+1 A:

Here is code for something like what you say that you need:

>>> import re
>>> sample = '</tag1>\r\n\t\t\t\t<tag2>'
>>> sample
'</tag1>\r\n\t\t\t\t<tag2>'
>>> pattern = '(</tag1>)\r\n\t+(<tag2>)'
>>> replacement = r'\1<tag3>content</tag3>\2'
>>> re.sub(pattern, replacement, sample)
'</tag1><tag3>content</tag3><tag2>'
>>>

Note that \r\n\t+ may be a bit too specific, especially if production of your input is not under your control. It may be better to adopt the much more general \s* (zero or more whitespace characters).
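As a rough sketch of the more general \s* approach mentioned above: the sample strings below are made up for illustration, and only the first comes from this answer. Note that because \s* also matches an empty run, the replacement is applied even when the two tags are directly adjacent; use \s+ instead if that is not wanted.

```python
import re

# Hypothetical inputs: only the first whitespace layout appears in the answer,
# the other two are assumptions added to show what \s* also accepts.
samples = [
    '</tag1>\r\n\t\t\t\t<tag2>',  # CRLF followed by tabs
    '</tag1>\n    <tag2>',         # LF followed by spaces
    '</tag1><tag2>',               # no whitespace at all
]

# \s* matches any run of whitespace, including an empty one, between the tags.
pattern = re.compile(r'(</tag1>)\s*(<tag2>)')
replacement = r'\1<tag3>content</tag3>\2'

results = [pattern.sub(replacement, s) for s in samples]
for result in results:
    print(result)
```

Each of the three samples should come out as '</tag1><tag3>content</tag3><tag2>'.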

Using regexes to parse XML and HTML is not a good idea in general ... while it's hard to see a failure mode here (apart from elementary errors in getting the pattern correct), you might like to tell us what the underlying problem is, in case some other solution is better.

John Machin 2010-07-24 01:26:29
