ansaurus

Question

Python regex

Answer 1

+4 A:

I would try findall or finditer instead of match.

Edit by Oli: Yeah, findall works brilliantly, but I had to simplify the regex to:

r"'(?P<main>\d+)\[(?P<thumb>\d+)\]',?"
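For anyone comparing the three calls, here is a minimal sketch of the difference, assuming the sample string used in the other answers (the names `data`, `img_regex`, `first` and `pairs` are only for illustration). `match` only tries at the start of the string and gives back a single match object, while `finditer` walks through every non-overlapping match:

```python
import re

# Sample data in the same shape as the other answers in this thread.
data = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"

img_regex = re.compile(r"'(?P<main>\d+)\[(?P<thumb>\d+)\]',?")

# match() anchors at the beginning of the string and returns one match object.
first = img_regex.match(data)

# finditer() yields a match object for every non-overlapping match,
# so the named groups stay available for each pair.
pairs = [(m.group("main"), m.group("thumb")) for m in img_regex.finditer(data)]
print(pairs)
```

`findall` returns the same pairs directly as a list of tuples, which is handy when the group names are not needed.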

stesch 2008-12-06 13:38:54

Whoop! findall worked with a slightly simplified regex. I'll alter your answer to show what worked. Thanks!

Oli 2008-12-06 13:43:54

Answer 2

+1 A:

Modifying your regexp a little,

>>> import re
>>> str = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
>>> imgRegex = re.compile(r"'(?P<main>\d+)\[(?P<thumb>\d+)\]',?")
>>> print(imgRegex.findall(str))
[('813702104', '813702106'), ('813702141', '813702143'), ('813702172', '813702174')]

Which is a "2 dimensional array" - in Python, "a list of 2-tuples".
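If you would rather have integers than strings, the tuples from findall can be converted in one pass. This is only a sketch; `data`, `img_regex` and `rows` are illustrative names, and the sample string is the complete one with all three quoted items:

```python
import re

data = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
img_regex = re.compile(r"'(?P<main>\d+)\[(?P<thumb>\d+)\]',?")

# findall() returns one tuple per match; unpack each tuple and convert to int.
rows = [[int(main), int(thumb)] for main, thumb in img_regex.findall(data)]
print(rows)
```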

gimel 2008-12-06 13:44:13

Answer 3

+1 A:

I've got something that seems to work on your data set:

In [18]: import re
In [19]: str = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
In [20]: ptr = re.compile( r"'(?P<one>\d+)\[(?P<two>\d+)\]'" )
In [21]: ptr.findall( str )
Out[21]:
[('813702104', '813702106'),
 ('813702141', '813702143'),
 ('813702172', '813702174')]

ayaz 2008-12-06 13:50:55

Answer 4

+3 A:

I don't think I would use a regex for this task. Python's list comprehensions are quite powerful for this:

In [27]: s = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"

In [28]: d=[[int(each1.strip(']\'')) for each1 in each.split('[')] for each in s.split(',')]

In [29]: d[0][1]
Out[29]: 813702106

In [30]: d[1][0]
Out[30]: 813702141

In [31]: d
Out[31]: [[813702104, 813702106], [813702141, 813702143], [813702172, 813702174]]

JV 2008-12-06 13:54:34

split() is the way to go.

Tomalak 2008-12-06 15:09:00

I explained why I thought regex was the only way: The string in real life is not a pure little array. It's buried in a 100k HTML file. I could extract by regex and then split... but that seems a little silly, no?

Oli 2008-12-06 16:40:22

Answer 5

+1 A:

Alternatively, you could use Python's [statement for item in list] syntax for building lists (a list comprehension; "listmaker" is the name Python 2's grammar uses for it). You should find this to be considerably faster than a regex, particularly for small data sets. Larger data sets will show a less marked difference (the regular expression engine only has to be loaded once no matter the size), but the listmaker should usually be faster.

Start by splitting the string on commas:

>>> str = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
>>> arr = [pair for pair in str.split(",")]
>>> arr
["'813702104[813702106]'", "'813702141[813702143]'", "'813702172[813702174]'"]

Right now, this returns the same thing as just str.split(","), so isn't very useful, but you should be able to see how the listmaker works — it iterates through list, assigning each value to item, executing statement, and appending the resulting value to the newly-built list.
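Spelled out as an ordinary loop, that description looks roughly like the sketch below (`data` and `arr` are just illustrative names):

```python
data = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"

# Equivalent of: arr = [pair for pair in data.split(",")]
arr = []
for pair in data.split(","):   # iterate through the list
    arr.append(pair)           # append the resulting value to the new list

print(arr)
```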

In order to get something useful accomplished, we need to put a real statement in, so we get a slice of each pair which removes the single quotes and closing square bracket, then further split on that conveniently-placed opening square bracket:

>>> arr = [pair[1:-2].split("[") for pair in str.split(",")]
>>> arr
[['813702104', '813702106'], ['813702141', '813702143'], ['813702172', '813702174']]

This returns a two-dimensional array like you describe, but the items are all strings rather than integers. If you're simply going to use them as strings, that's far enough. If you need them to be actual integers, you simply use an "inner" listmaker as the statement for the "outer" listmaker:

>>> arr = [[int(x) for x in pair[1:-2].split("[")] for pair in str.split(",")]
>>> arr
[[813702104, 813702106], [813702141, 813702143], [813702172, 813702174]]

This returns a two-dimensional array of the integers represented in a string like the one you provided, without ever needing to load the regular expressions engine.
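If you want to check the speed difference on your own machine rather than take it on faith, timeit from the standard library is one way to do it. The sketch below is only an assumption about how you might set up the comparison: `with_regex`, `with_split`, the simplified pattern and the repeat count are all illustrative, and it does not claim which side wins.

```python
import re
import timeit

data = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
pattern = re.compile(r"'(\d+)\[(\d+)\]'")

def with_regex():
    # Regex approach: findall() returns tuples of strings, converted to ints.
    return [[int(a), int(b)] for a, b in pattern.findall(data)]

def with_split():
    # Split approach: strip the quotes and closing bracket, then split on "[".
    return [[int(x) for x in pair[1:-2].split("[")] for pair in data.split(",")]

# Both approaches should agree before their timings are compared.
assert with_regex() == with_split()

print("regex:", timeit.timeit(with_regex, number=1000))
print("split:", timeit.timeit(with_split, number=1000))
```

The timings will vary with the Python version and the size of the input, so it is worth running this against data that looks like yours.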

Ben Blank 2008-12-15 17:04:40

Ben, I've edited the post to highlight that in its real-world application, the string comes from a large HTML file and therefore vanilla splitting isn't a direct option.

Oli 2008-12-16 13:27:34

Ah, yes. If you need to use regexes anyway to pull it out of the HTML, you may as well run with them. :-D

Ben Blank 2008-12-19 18:08:31

Note part deux: The real string is embedded in a large HTML file and therefore splitting does not appear to be an option.
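As a rough sketch of that situation, findall can be run over the whole document and will pick out only the quoted `digits[digits]` items. The `html` fragment below is hypothetical, since the real markup is not shown in the thread, and `img_regex` and `pairs` are illustrative names:

```python
import re

# Hypothetical page fragment; the real markup is not shown in the thread.
html = """<html><body>
<script>var imgs = ['813702104[813702106]','813702141[813702143]'];</script>
<p>Unrelated text with numbers 12345 and [brackets].</p>
</body></html>"""

img_regex = re.compile(r"'(?P<main>\d+)\[(?P<thumb>\d+)\]'")

# findall() scans the entire document, so no prior extraction step is needed.
pairs = img_regex.findall(html)
print(pairs)
```

If the page contains other quoted `digits[digits]` strings that are not image IDs, the pattern would need more surrounding context to tell them apart.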
