I have a Perl regular expression (shown here, though hopefully you won't need to understand the whole thing to answer this question) that contains the \G metacharacter. I'd like to translate it into Python, but Python doesn't appear to support \G. What can I do?

+2  A: 

Try these:

import re
re.sub()
re.findall()
re.finditer()

for example:

# Finds all words of length 3 or 4
s = "the quick brown fox jumped over the lazy dogs."
print(re.findall(r'\b\w{3,4}\b', s))

# prints ['the', 'fox', 'over', 'the', 'lazy', 'dogs']
Triptych
+1  A: 

You can use re.match to match anchored patterns. re.match only matches at the beginning (position 0) of the text; the match method of a compiled pattern also accepts a pos argument, so you can anchor the match wherever you specify.

import re

def match_sequence(pattern,text,pos=0):
  pat = re.compile(pattern)
  match = pat.match(text,pos) # anchored at pos, much like \G
  while match:
    yield match
    if match.end() == pos:
      break # infinite loop otherwise
    pos = match.end()
    match = pat.match(text,pos)

This only matches the pattern starting at the given position, and then any further matches that begin exactly where the previous one ended (0 characters in between).

>>> for match in match_sequence(r'[^\W\d]+|\d+',"he11o world!"):
...   print(match.group())
...
he
11
o
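
To see why the anchoring matters, here is a minimal sketch that contrasts an unanchored re.findall with the same position-anchored loop. The helper name anchored_tokens is made up for illustration, and unlike the generator above it simply drops an empty match instead of yielding it once.

```python
import re

def anchored_tokens(pattern, text, pos=0):
    """Collect consecutive matches, each starting where the previous one ended.

    Returns the matched strings and the position where matching stopped.
    """
    pat = re.compile(pattern)
    tokens = []
    while True:
        match = pat.match(text, pos)  # anchored at pos, similar to Perl's \G
        if match is None or match.end() == pos:
            break  # no match here, or an empty match that would loop forever
        tokens.append(match.group())
        pos = match.end()
    return tokens, pos

text = "he11o world!"
pattern = r'[^\W\d]+|\d+'

print(re.findall(pattern, text))       # ['he', '11', 'o', 'world']
print(anchored_tokens(pattern, text))  # (['he', '11', 'o'], 5)
```

The unanchored search skips over the space and picks up 'world', while the anchored loop stops at position 5, the first character the pattern cannot consume.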
MizardX
A: 

Python does not have the /g modifier for its regexes, and so does not have the \G regex token. A pity, really.

Robert P
A: 

Don't try to put everything into one expression, as it becomes very hard to read, translate (as you can see for yourself), and maintain.

import re
# text_block is assumed to hold the input text as a single string
lines = [re.sub(r'http://[^\s]+', r'<\g<0>>', line) for line in text_block.splitlines() if not line.startswith('//')]
print('\n'.join(lines))

Python is usually not at its best when you translate literally from Perl; it has its own programming patterns.

Mike
A: 

I know I'm a little late, but here's an alternative to the \G approach:

import re

def replace(match):
    if match.group(0)[0] == '/': return match.group(0)
    else: return '<' + match.group(0) + '>'

source = '''http://a.com http://b.com
//http://etc.'''

pattern = re.compile(r'(?m)^//.*$|http://\S+')
result = re.sub(pattern, replace, source)
print(result)

output (via Ideone):

<http://a.com> <http://b.com>
//http://etc.

The idea is to use a regex that matches both kinds of string: a URL or a commented line. Then you use a callback (delegate, closure, embedded code, etc.) to find out which one you matched and return the appropriate replacement string.

As a matter of fact, this is my preferred approach even in flavors that do support \G. Even in Java, where I have to write a bunch of boilerplate code to implement the callback.
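
One possible variation on the callback idea, sketched under the assumption that named groups are acceptable here: give each alternative a name and let the callback check match.lastgroup instead of inspecting the first character. The group names comment and url and the function name bracket_urls are invented for this example.

```python
import re

# Each alternative gets its own named group so the callback can tell them apart.
PATTERN = re.compile(r'(?m)(?P<comment>^//.*$)|(?P<url>http://\S+)')

def bracket_urls(source):
    """Wrap URLs in angle brackets, leaving lines that start with // untouched."""
    def replace(match):
        # lastgroup is the name of the group that matched: 'comment' or 'url'
        if match.lastgroup == 'comment':
            return match.group(0)
        return '<' + match.group(0) + '>'
    return PATTERN.sub(replace, source)

source = 'http://a.com http://b.com\n//http://etc.'
print(bracket_urls(source))
```

Because only one of the two groups can take part in any given match, lastgroup identifies the branch directly, which may be easier to extend if more kinds of token are added later.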

(I'm not a Python guy, so forgive me if the code is terribly un-pythonic.)

Alan Moore