views:

138

answers:

7

I have a string like this:

"a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more".

I would like to get this as an output:

(("bla", 123, 456), ("bli", 789, 123), ("blu", 789))

I haven't been able to find the right Python regex to achieve that.

+1  A: 

You need a lot of escapes in your regular expression since {, } and | are special characters in regular expressions. A first step toward extracting the relevant parts of the string would be this:

import re
regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')
regex.findall(line)  # line is the input string

For the example this gives:

[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]

You can then convert the digit strings to integers and drop empty strings such as the one in the last match.
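A minimal sketch of that post-processing step follows. The helper name `clean_match` is made up for illustration, and it assumes that every field meant to become an integer consists only of digits.

```python
import re

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')

def clean_match(groups):
    """Drop empty strings and turn all-digit strings into ints (hypothetical helper)."""
    return tuple(int(g) if g.isdigit() else g for g in groups if g)

result = tuple(clean_match(m) for m in regex.findall(s))
print(result)
```

Fields that are neither empty nor all digits are passed through unchanged as strings.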

sth
`{` is a special character?
SilentGhost
Well, if you put {a,b} after a pattern, that is special, and you can omit one or both of a and b there. But I think if you just put "{{" into a pattern, it will just match "{{". I tried it, and it worked for me.
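A quick check along those lines, as a sketch with made-up sample strings: a brace that does not form a valid repeat count is treated as a literal by Python's re module, while {2,3} after an item acts as a quantifier.

```python
import re

# "{{" is not followed by a repeat count, so both braces match literally.
literal = re.findall("{{", "a {{b}} c")

# "{2,3}" after "a" is a quantifier: two or three consecutive "a" characters.
quantified = re.findall(r"a{2,3}", "a aa aaaa")

print(literal)
print(quantified)
```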
steveha
+1  A: 
>>> re.findall(r' {{(\w+)\|(\w+)(?:\|(\w+))?}} ', s)
[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]

If you still want numbers there, you'd need to iterate over the output and convert the digit strings to integers with int.

SilentGhost
That doesn't match what the question specified. He specifically wanted parentheses around all the data rather than a list of groups. Plus, he wanted double quotes only around the first element in each group and no quotes around the others.
Bryan Oakley
And yet it was accepted... poorly written question, lucky answer?
Jefromi
@Bryan: regexes work on strings; they have no idea what numbers are and only know digits. The *quotes around the data* are presentational quotes that indicate that the value is a string. As I've clearly said, if the OP needs numbers, he can convert the respective values to integers.
SilentGhost
Regarding the *within quotes* requirement: I can only generalise so far. All I see is a single example; nowhere does the OP indicate that any other patterns are possible.
SilentGhost
@SilentGhost: I know they work on strings. I was just trying to clarify because your solution doesn't give what was explicitly asked for. Your solution is probably what he really wanted though, since your answer was accepted.
Bryan Oakley
My solution includes the steps needed to achieve exact compliance with the written text. I just think that spelling out how to convert a string to an integer here would be an insult to the reader's intelligence.
SilentGhost
A: 
[re.split(r'\|', i) for i in re.findall("{{(.*?)}}", s)]

Returns:

[['bla', '123', '456'], ['bli', '789', '123'], ['blu', '789']]

This method works regardless of the number of elements in the {{ }} blocks.
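As an illustration of that claim, here is a small sketch on a made-up input whose blocks hold one, two, and four elements. It swaps in str.split for re.split, which gives the same result for a fixed one-character delimiter.

```python
import re

# Hypothetical sample text with blocks of differing lengths.
text = "{{solo}} and {{a|1}} and {{b|1|2|3}}"

# Grab the inside of each {{...}} block, then split it on the pipe.
blocks = [inner.split("|") for inner in re.findall(r"{{(.*?)}}", text)]
print(blocks)
```

Note that an empty block such as {{}} would produce a list holding one empty string.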

Jeff B
A: 

To get the groups as lists of strings, you need a regex and a split:

import re
[p.split("|") for p in re.findall(r"\{\{([^}]*)\}\}", s)]

To get it with the numbers converted, do this:

toint = lambda x: int(x) if x.isdigit() else x
[[toint(x) for x in p.split("|")] for p in re.findall(r"\{\{([^}]*)\}\}", s)]
Joakim Lundborg
A: 

We might be able to get fancy and do everything in a single complicated regular expression, but that way lies madness. Let's do one regexp that grabs the groups, and then split the groups up. We could use a regexp to split the groups, but we can just use str.split(), so let's do that.

import re
pat_group = re.compile("{{([^}]*)}}")
def mixed_tuple(iterable):
    lst = []
    for x in iterable:
        try:
            lst.append(int(x))
        except ValueError:
            lst.append(x)
    return tuple(lst)

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

lst_groups = re.findall(pat_group, s)
lst = [mixed_tuple(x.split("|")) for x in lst_groups]

In pat_group, "{{" just matches literal "{{". "(" starts a group. "[^}]" is a character class that matches any character except for "}", and '*' allows it to match zero or more such characters. ")" closes out the group and "}}" matches literal characters. Thus, we match the "{{...}}" patterns, and can extract everything between the curly braces as a group.

re.findall() returns a list of groups matched from the pattern.

Finally, a list comprehension splits each string and returns the result as a tuple.
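The same pattern can also be written with re.VERBOSE so that each piece described above carries its own comment. This is only a sketch: the name `pat_verbose` is made up, and the braces are escaped here purely for readability.

```python
import re

pat_verbose = re.compile(r"""
    \{\{        # two literal opening braces
    ([^}]*)     # group 1: zero or more characters that are not a closing brace
    \}\}        # two literal closing braces
""", re.VERBOSE)

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
groups = pat_verbose.findall(s)
print(groups)
```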

steveha
A: 

Assuming your actual format is {{[a-z]+|[0-9]+|[0-9]+}}, here's a complete program with conversion to ints.

import re

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
result = []

for match in re.finditer('{{.*?}}', s):
    # Split on pipe (|) and strip non-alphanumerics such as the braces
    parts = [''.join(filter(str.isalnum, part)) for part in match.group().split('|')]

    # Convert to int when possible
    for index, part in enumerate(parts):
        try:
            parts[index] = int(part)
        except ValueError:
            pass

    result.append(tuple(parts))
Triptych
A: 

Is pyparsing overkill for this? Maybe, but without too much suffering, it does deliver the desired output, without a thicket of backslashes to escape the '{', '|', or '}' characters. Plus, there's no need for post-parse conversions of integers and whatnot - the parse actions take care of this kind of stuff at parse time.

from pyparsing import Word, Suppress, alphas, alphanums, nums, delimitedList

LBRACE,RBRACE,VERT = map(Suppress,"{}|")
word = Word(alphas,alphanums)
integer = Word(nums)
integer.setParseAction(lambda t: int(t[0]))

patt = (LBRACE*2 + delimitedList(word|integer, VERT) + RBRACE*2)
patt.setParseAction(lambda toks:tuple(toks.asList()))


s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

print(tuple(p[0] for p in patt.searchString(s)))

Prints:

(('bla', 123, 456), ('bli', 789, 123), ('blu', 789))
Paul McGuire