+2  Q: 

Python regex

I have a string like this that I need to parse into a 2D array:

 str = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"

the array equiv would be:

arr[0][0] = 813702104
arr[0][1] = 813702106
arr[1][0] = 813702141
arr[1][1] = 813702143
#... etc ...

I'm trying to do this by REGEX. The string above is buried in an HTML page but I can be certain it's the only string in that pattern on the page. I'm not sure if this is the best way, but it's all I've got right now.

imgRegex = re.compile(r"(?:'(?P<main>\d+)\[(?P<thumb>\d+)\]',?)+")

If I run imgRegex.match(str).groups() I only get one result (the first couplet). How do I either get multiple matches back or a 2d match object (if such a thing exists!)?

Note: Contrary to how it might look, this is not homework

Note part deux: The real string is embedded in a large HTML file and therefore splitting does not appear to be an option.

I'm still getting answers for this, so I thought I'd better edit it to show why I'm not changing the accepted answer. Splitting, though more efficient on this test string, isn't going to extract the parts from a whole HTML file. I could combine a regex and splitting, but that seems silly.

If you do have a better way to find the parts from a load of HTML (the pattern \d+\[\d+\] is unique to this string in the source), I'll happily change accepted answers. Anything else is academic.

+4  A: 

I would try findall or finditer instead of match.

Edit by Oli: Yeah, findall works brilliantly, but I had to simplify the regex to:

r"'(?P<main>\d+)\[(?P<thumb>\d+)\]',?"
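For anyone landing here later, a minimal sketch of the findall approach on a string buried in markup. The `html` snippet is an invented stand-in for the real page (only the quoted list comes from the question), and the groups are left unnamed for brevity. With the outer repeated group removed, each match is one pair, so findall can return all of them instead of match() reporting a single repetition.

```python
import re

# Hypothetical stand-in for the real page; only the quoted list is from the question.
html = """<html><body><script>
var imgs = ['813702104[813702106]','813702141[813702143]','813702172[813702174]'];
</script><p>Unrelated text 12345</p></body></html>"""

# Two capturing groups, so findall returns one (main, thumb) tuple per match.
img_regex = re.compile(r"'(\d+)\[(\d+)\]',?")

pairs = img_regex.findall(html)
print(pairs)
# [('813702104', '813702106'), ('813702141', '813702143'), ('813702172', '813702174')]
```

The surrounding markup is simply skipped, because findall scans the whole string for non-overlapping matches rather than anchoring at the start the way match() does.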
stesch
Whoop! findall worked with a slightly simplified regex. I'll alter your answer to show what worked. Thanks!
Oli
+1  A: 

Modifying your regexp a little,

>>> import re
>>> str = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
>>> imgRegex = re.compile(r"'(?P<main>\d+)\[(?P<thumb>\d+)\]',?")
>>> print(imgRegex.findall(str))
[('813702104', '813702106'), ('813702141', '813702143'), ('813702172', '813702174')]

Which is a "2 dimensional array" - in Python, "a list of 2-tuples".
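If the arr[i][j] layout with integers from the question is what's wanted, the tuples of strings can be converted in one more step. A small sketch, assuming the same test string; the names `s`, `pairs` and `arr` are just for illustration.

```python
import re

s = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
pairs = re.findall(r"'(\d+)\[(\d+)\]',?", s)

# Turn each (main, thumb) tuple of strings into a list of two ints.
arr = [[int(main), int(thumb)] for main, thumb in pairs]

print(arr[0][0], arr[0][1])  # 813702104 813702106
print(arr[1][0], arr[1][1])  # 813702141 813702143
```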

gimel
+1  A: 

I've got something that seems to work on your data set:

In [19]: str = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
In [20]: ptr = re.compile( r"'(?P<one>\d+)\[(?P<two>\d+)\]'" )
In [21]: ptr.findall( str )
Out[21]:
[('813702104', '813702106'),
 ('813702141', '813702143'),
 ('813702172', '813702174')]
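Note that findall ignores the group names and returns plain tuples. If the names matter, finditer might be a better fit, since it yields match objects. A rough sketch with the same pattern; `rows` is a name made up for this example.

```python
import re

s = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
ptr = re.compile(r"'(?P<one>\d+)\[(?P<two>\d+)\]'")

# finditer yields one match object per pair, so the named groups stay accessible.
rows = [m.groupdict() for m in ptr.finditer(s)]

print(rows[0]["one"], rows[0]["two"])  # 813702104 813702106
print(len(rows))                       # 3
```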
ayaz
+3  A: 

I don't think I would use a regex for this task. Python's list comprehensions are quite powerful for this:

In [27]: s = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"

In [28]: d=[[int(each1.strip(']\'')) for each1 in each.split('[')] for each in s.split(',')]

In [29]: d[0][1]
Out[29]: 813702106

In [30]: d[1][0]
Out[30]: 813702141

In [31]: d
Out[31]: [[813702104, 813702106], [813702141, 813702143], [813702172, 813702174]]
JV
split() is the way to go.
Tomalak
I explained why I thought regex was the only way: The string in real life is not a pure little array. It's buried in a 100k HTML file. I could extract by regex and then split... but that seems a little silly, no?
Oli
+1  A: 

Alternatively, you could use Python's [statement for item in list] syntax for building lists (a list comprehension, or "listmaker"; the "statement" part is really any expression). For a small, clean string like this one, you should find it considerably faster than a regex. Larger data sets will show a less marked difference, since the pattern only has to be compiled once no matter how much data there is, but I would still expect the listmaker to be faster.

Start by splitting the string on commas:

>>> str = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
>>> arr = [pair for pair in str.split(",")]
>>> arr
["'813702104[813702106]'", "'813702141[813702143]'", "'813702172[813702174]'"]

Right now, this returns the same thing as just str.split(","), so isn't very useful, but you should be able to see how the listmaker works — it iterates through list, assigning each value to item, executing statement, and appending the resulting value to the newly-built list.

In order to get something useful accomplished, we need to put a real statement in, so we get a slice of each pair which removes the single quotes and closing square bracket, then further split on that conveniently-placed opening square bracket:

>>> arr = [pair[1:-2].split("[") for pair in str.split(",")]
>>> arr
[['813702104', '813702106'], ['813702141', '813702143'], ['813702172', '813702174']]

This returns a two-dimensional array like you describe, but the items are all strings rather than integers. If you're simply going to use them as strings, that's far enough. If you need them to be actual integers, you simply use an "inner" listmaker as the statement for the "outer" listmaker:

>>> arr = [[int(x) for x in pair[1:-2].split("[")] for pair in str.split(",")]
>>> arr
[[813702104, 813702106], [813702141, 813702143], [813702172, 813702174]]

This returns a two-dimensional array of the integers represented in a string like the one you provided, without ever needing to load the regular expressions engine.
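The speed claim is easy to check rather than take on faith. Below is a rough timing sketch using the standard timeit module; the function names `with_regex` and `with_split` are made up for this example, and the numbers will vary by machine and Python version, so no expected timings are shown.

```python
import re
import timeit

s = "'813702104[813702106]','813702141[813702143]','813702172[813702174]'"
pattern = re.compile(r"'(\d+)\[(\d+)\]'")

def with_regex(text):
    # One findall pass, then convert each pair of strings to ints.
    return [[int(a), int(b)] for a, b in pattern.findall(text)]

def with_split(text):
    # Split on commas, trim the quotes and closing bracket, split on "[".
    return [[int(x) for x in pair[1:-2].split("[")] for pair in text.split(",")]

# Both approaches should agree on the clean test string.
assert with_regex(s) == with_split(s)

for fn in (with_regex, with_split):
    seconds = timeit.timeit(lambda: fn(s), number=10_000)
    print(f"{fn.__name__}: {seconds:.4f}s for 10,000 runs")
```

Keep in mind that this only compares the two on the clean test string; the split version relies on the exact quoting and comma layout and will not work on a full HTML page.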

Ben Blank
Ben, I've edited the post to highlight that in its real-world application, the string comes from a large HTML file and therefore vanilla splitting isn't a direct option.
Oli
Ah, yes. If you need to use regexes anyway to pull it out of the HTML, you may as well run with them. :-D
Ben Blank