ansaurus

Question

Answer 1

+1 A:

You need a lot of escapes in your regular expression since {, } and | are special characters in them. A first step to extract the relevant parts of the string would be this:

import re
regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')
regex.findall(line)

For the example this gives:

[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]

Then you can convert the digit-only strings to integers and drop the empty strings, such as the one in the last match.
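A minimal sketch of that clean-up step, not part of the original answer. The sample line is borrowed from the other answers in this thread, and the helper name clean is made up for illustration.

```python
import re

regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')
line = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

def clean(match):
    # Hypothetical helper: skip empty strings (the unmatched optional group)
    # and turn digit-only strings into integers.
    return tuple(int(part) if part.isdigit() else part for part in match if part)

result = [clean(m) for m in regex.findall(line)]
print(result)
```

Filtering on truthiness is enough here because findall reports an unmatched optional group as an empty string.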

sth 2009-10-06 20:54:27

`{` is a special character?

SilentGhost 2009-10-06 20:57:51

Well, if you put {a,b} after a pattern, that is special, and you can omit one or both of a and b there. But I think if you just put "{{" into a pattern, it will just match "{{". I tried it, and it worked for me.
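A quick way to check this yourself, as a sketch with made-up test strings: a brace only acts as a repetition count when it forms a valid {m,n} after a pattern, otherwise Python's re matches it literally.

```python
import re

# "{{" is not a valid {m,n} quantifier, so both braces are matched literally.
literal = re.findall("{{", "a {{b}} c")

# Escaping the braces gives the same result.
escaped = re.findall(r"\{\{", "a {{b}} c")

# A well-formed {m,n} after a pattern is a repetition count.
bounded = re.findall(r"a{2,3}", "a aa aaaa")

# The upper bound can be omitted: "two or more".
open_ended = re.findall(r"a{2,}", "a aa aaaa")

print(literal, escaped, bounded, open_ended)
```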

steveha 2009-10-06 21:01:42

Answer 2

+1 A:

>>> re.findall(r' {{(\w+)\|(\w+)(?:\|(\w+))?}} ', s)
[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]

If you still want numbers there, you need to iterate over the output and convert the digit strings to integers with int.

SilentGhost 2009-10-06 20:56:55

that doesn't match what the question specified. he specifically wanted parentheses around all the data rather than a list of groups. Plus, he wanted double quotes only around the first element in each group and not quotes around the others.

Bryan Oakley 2009-10-06 21:08:30

And yet it was accepted... poorly written question, lucky answer?

Jefromi 2009-10-06 21:10:01

@Bryan: regexes work on strings; they have no idea what numbers are, they only know digits. The *quotes around the data* are presentational quotes that indicate that the value is a string. As I've clearly said, if the OP needs numbers, he can convert the respective values to integers.

SilentGhost 2009-10-06 21:11:08

Regarding the *within quotes* requirement: I can only generalise so far. All I see is one example; nowhere does the OP indicate that any other patterns are possible.

SilentGhost 2009-10-06 21:13:30

@SilentGhost: I know they work on strings. I was just trying to clarify because your solution doesn't give what was explicitly asked for. Your solution is probably what he really wanted though, since your answer was accepted.

Bryan Oakley 2009-10-06 21:27:11

My solution includes the steps needed to achieve exact compliance with the written text. I just think that spelling out the string-to-integer conversion here would be an insult to the reader's intelligence.

SilentGhost 2009-10-06 21:32:05

Answer 3

A:

[re.split(r'\|', i) for i in re.findall(r"{{(.*?)}}", s)]

Returns:

[['bla', '123', '456'], ['bli', '789', '123'], ['blu', '789']]

This method works regardless of the number of elements in the {{ }} blocks.
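To illustrate that claim, here is a small sketch that wraps the same idea in a function; the name split_blocks and the sample string are invented for this example.

```python
import re

def split_blocks(text):
    # Grab the contents of each {{...}} block, then split them on the pipe.
    return [block.split("|") for block in re.findall(r"\{\{(.*?)\}\}", text)]

# Made-up input with one, two and five elements per block.
sample = "{{solo}} then {{a|1}} then {{b|2|3|4|5}}"
blocks = split_blocks(sample)
print(blocks)
```

Since the pipe is a fixed delimiter, str.split is used here in place of re.split; both give the same lists.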

Jeff B 2009-10-06 21:02:03

Answer 4

A:

To get the exact output you wrote, you need a regex and a split:

import re
[p.split("|") for p in re.findall(r"\{\{([^}]*)\}\}", s)]

To get it with the numbers converted, do this:

toint = lambda x: int(x) if x.isdigit() else x
[[toint(x) for x in p.split("|")] for p in re.findall(r"\{\{([^}]*)\}\}", s)]

Joakim Lundborg 2009-10-06 21:02:20

Answer 5

A:

We might be able to get fancy and do everything in a single complicated regular expression, but that way lies madness. Let's do one regexp that grabs the groups, and then split the groups up. We could use a regexp to split the groups, but we can just use str.split(), so let's do that.

import re
pat_group = re.compile("{{([^}]*)}}")
def mixed_tuple(iterable):
    lst = []
    for x in iterable:
        try:
            lst.append(int(x))
        except ValueError:
            lst.append(x)
    return tuple(lst)

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

lst_groups = re.findall(pat_group, s)
lst = [mixed_tuple(x.split("|")) for x in lst_groups]

In pat_group, "{{" just matches literal "{{". "(" starts a group. "[^}]" is a character class that matches any character except for "}", and '*' allows it to match zero or more such characters. ")" closes out the group and "}}" matches literal characters. Thus, we match the "{{...}}" patterns, and can extract everything between the curly braces as a group.

re.findall() returns a list of groups matched from the pattern.

Finally, a list comprehension splits each string and returns the result as a tuple.
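The pattern walkthrough above can also be written into the regexp itself. This is only a sketch using re.VERBOSE, with the braces escaped for readability; the name pat_group_verbose is made up and it is meant to match the same text as pat_group.

```python
import re

# Same pattern as pat_group, annotated piece by piece. In VERBOSE mode,
# whitespace and "#" comments outside character classes are ignored.
pat_group_verbose = re.compile(r"""
    \{\{        # two literal opening braces
    ( [^}]* )   # group: zero or more characters other than a closing brace
    \}\}        # two literal closing braces
""", re.VERBOSE)

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
groups = pat_group_verbose.findall(s)
print(groups)
```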

steveha 2009-10-06 21:08:03

Answer 6

A:

Assuming your actual format is {{[a-z]+|[0-9]+|[0-9]+}}, here's a complete program with conversion to ints.

import re

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
result = []

for match in re.finditer('{{.*?}}', s):

    # Split on pipe (|) and filter out non-alphanumerics (the braces)
    parts = [''.join(filter(str.isalnum, part)) for part in match.group().split('|')]

    # Convert to int when possible
    for index, part in enumerate(parts):
        try:
            parts[index] = int(part)
        except ValueError:
            pass

    result.append(tuple(parts))

Triptych 2009-10-06 21:14:40

Answer 7

A:

Is pyparsing overkill for this? Maybe, but without too much suffering, it does deliver the desired output, without a thicket of backslashes to escape the '{', '|', or '}' characters. Plus, there's no need for post-parse conversions of integers and whatnot - the parse actions take care of this kind of stuff at parse time.

from pyparsing import Word, Suppress, alphas, alphanums, nums, delimitedList

LBRACE,RBRACE,VERT = map(Suppress,"{}|")
word = Word(alphas,alphanums)
integer = Word(nums)
integer.setParseAction(lambda t: int(t[0]))

patt = (LBRACE*2 + delimitedList(word|integer, VERT) + RBRACE*2)
patt.setParseAction(lambda toks:tuple(toks.asList()))


s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

print(tuple(p[0] for p in patt.searchString(s)))

Prints:

(('bla', 123, 456), ('bli', 789, 123), ('blu', 789))

Paul McGuire 2009-10-06 23:09:24


Regex for extraction in Python
