ansaurus

Question

Python regex question

Answer 1

+2 A:

This would be a concise way:

import re

s = "aa67bc54c9"
print ''.join(t * int(n) for t, n in re.findall(r"([a-z]+)([0-9]+)", s))

This solution uses a regular expression to match "one or more letters followed by one or more numbers", searching for all of them in the input string. Then it uses a list comprehension to iterate through each group found, assigning the letters to t and the digits to n in turn. The list generates strings using the string * operator, which repeates a string a given number of times (int() is used to convert the digit string into an integer). Finally, ''.join() is used to paste everything together.

For the regular expression, [a-z] is a character class consisting of any single (lowercase) letter of the alphabet. [a-z]+ means one or more lowercase letters. Similarly, [0-9]+ means one or more digits. The grouping parentheses around each component "capture" the characters within them and make them available as a result of the findall() function. There are two groups of parentheses, so there are two output values, which get assigned to t and n in the list comprehension.

Greg Hewgill 2009-10-01 03:53:22

Thank you very much for the answer could you please exaplin the regex ([a-z]+)([0-9]+)

Edwards 2009-10-01 03:56:32

Totally would have beaten you if StackOverflow hadn't gone down for me. But I would have given it a little more space.

Chris Lutz 2009-10-01 03:56:54

To understand the regex, first read my answer. For a full explanation of regular expressions in Python, here's an excellent tutorial: http://www.amk.ca/python/howto/regex/

steveha 2009-10-01 06:08:06

Answer 2

+3 A:

Here is my Python solution.

import re
pat = re.compile("^(\D+)(\d+)(.*)$")

def rle_expand(s):
    lst = []
    while True:
        m = pat.match(s)
        if m:
            n = int(m.group(2))
            lst.append(m.group(1) * n)
        else:
            lst.append(s)
            break
        s = m.group(3)
    return "".join(lst)

s = "aa03bc05d9whew"

print rle_expand(s)
# prints aaaaaabcbcbcbcbcdddddddddwhew

s = “aa67bc54c9”
print rle_expand(s)
# prints: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcccccccccc

The problem is basically to expand a run-length encoding. First you have some sort of a pattern, then some digits that specify how many times to repeat the pattern.

First we import the re module, to gain access to Python's regular expressions.

Next we compile a pattern once, so we can use it later. What will this pattern do?

The pattern uses parentheses to mark groups of letters from the string being matched. There are three pairs of parens so this will match three groups. Before the first group is a '^' character, which anchors to the start of the string, and after the last group is a '$' character, which anchors to the end of the string; these aren't strictly necessary in this case. The first group matches anything that is not a digit using the special sequence \D; the + extends it to match a run of one or more instances of not-a-digit. The second group is similar, using \d+ to match a run of one or more digits. The third group uses . to match any character, then extends it with * to match a run of 0 or more of any character. (Note that * and + are very similar; it's just that * matches 0 or more of something and + matches one or more.)

Using a standard Python idiom, we build a string using a list. We start out with an empty list (called lst). As long as the pattern keeps matching things, we append things to this list. When we are done, we use "".join() to join the list together into a string.

pat.match() returns an object called a "match object", or None if the match failed. If the match succeeded, we convert match group 2 to an integer, and use Python string repetition operator ("multiply") on match group 1 to do the run-length expansion. After this, we rebind the name s with the results of match group 3, thereby snipping off the part of the string we just processed, and loop. If the match failed, we just append all of s to the list and break out of the loop.

Building a list and then using "".join() on the list is a standard Python idiom. It will give good performance with any version of Python. Because Python strings are immutable, you can suffer from very slow performance if you build up a long dynamic string by repeatedly appending to a string; you wind up copying the early parts of the string many times as you build your final string. Python lists can be trivially appended to, and then the final join operation is quite fast. (Recent versions of Python have optimized the case where you repeatedly append to a string, and no longer suffer from the repeated copying in this case.)

Greg Hewgill's solution only recognizes lower-case letters from 'a' to 'z' for the expansion text; you could fix that by putting \D instead of [a-z]. His solution uses explicit ranges such as [0-9] where my solution uses the Python shorthand abbreviations such as \d. His solution only expands the run length encoded sequences; if there is a trailing sequence that does not have an integer on it, mine passes that sequence unchanged, while his silently ignores it. However, it must be said that his solution is brutally elegant and I wish I had thought of it. :-)

steveha 2009-10-01 06:04:30

Stevha thanks a ton for your elegant explanation and also the expanded regex.I wish stackoverflow allows me to choose more then on answer honestly this solution of using a list to get what i want was the first one to come to my mind .

Edwards 2009-10-01 07:38:55

ansaurus

tags:

views:

answers:

Python regex question

related questions