views:

16239

answers:

11

Learning to program, trying to do this

I have a string like this

2+24*48/32

and I want to split it into a list like this

['2', '+', '24', '*', '48', '/', '32']

I have messed around with .split() but that's messy as it returns a list, which means I would now have to iterate over two strings, etc

+2  A: 

Regular expressions:

>>> import re
>>> splitter = re.compile(r'([+*/])')
>>> splitter.split("2+24*48/32")

You can expand the regular expression to include any other characters you want to split on.

Cristian
+2  A: 

s = "2+24*48/32"

p = re.compile(r'(\W+)')

p.split(s)

Jiayao Yu
+15  A: 

Building on Cristian's solution, I would suggest matching for digits. That way, you won't have to specify every single operator you want to support (=, %, ^, whitespace, etc)

import re
splitter = re.compile(r'[\D]') # Match non-digits
splitter.split("2+24*48/32=10")

EDIT: (JFS)

The last expression returns just a list of strings with numbers without operators. Here's an improved variant:

>>> re.split(r'(\D)', '2+24*48/32')
['2', '+', '24', '*', '48', '/', '32']
Readonly
+8  A: 

This looks like a parsing problem, and thus I am compelled to present a solution based on parsing techniques.

While it may seem that you want to 'split' this string, I think what you actually want to do is 'tokenize' it. Tokenization or lexxing is the compilation step before parsing. I have amended my original example in an edit to implement a proper recursive decent parser here. This is the easiest way to implement a parser by hand.

import re

patterns = [
    ('number', re.compile('\d+')),
    ('*', re.compile(r'\*')),
    ('/', re.compile(r'\/')),
    ('+', re.compile(r'\+')),
    ('-', re.compile(r'\-')),
]
whitespace = re.compile('\W+')

def tokenize(string):
    while string:

        # strip off whitespace
        m = whitespace.match(string)
        if m:
            string = string[m.end():]

        for tokentype, pattern in patterns:
            m = pattern.match(string)
            if m:
                yield tokentype, m.group(0)
                string = string[m.end():]

def parseNumber(tokens):
    tokentype, literal = tokens.pop(0)
    assert tokentype == 'number'
    return int(literal)

def parseMultiplication(tokens):
    product = parseNumber(tokens)
    while tokens and tokens[0][0] in ('*', '/'):
        tokentype, literal = tokens.pop(0)
        if tokentype == '*':
            product *= parseNumber(tokens)
        elif tokentype == '/':
            product /= parseNumber(tokens)
        else:
            raise ValueError("Parse Error, unexpected %s %s" % (tokentype, literal))

    return product

def parseAddition(tokens):
    total = parseMultiplication(tokens)
    while tokens and tokens[0][0] in ('+', '-'):
        tokentype, literal = tokens.pop(0)
        if tokentype == '+':
            total += parseMultiplication(tokens)
        elif tokentype == '-':
            total -= parseMultiplication(tokens)
        else:
            raise ValueError("Parse Error, unexpected %s %s" % (tokentype, literal))

    return total

def parse(tokens):
    tokenlist = list(tokens)
    returnvalue = parseAddition(tokenlist)
    if tokenlist:
        print 'Unconsumed data', tokenlist
    return returnvalue

def main():
    string = '2+24*48/32'
    for tokentype, literal in tokenize(string):
        print tokentype, literal

    print parse(tokenize(string))

if __name__ == '__main__':
    main()

Implementation of handling of brackets is left as an exercise for the reader. This example will correctly do multiplication before addition.

Jerub
I'm reading up on tokenizing now to understand it. So I'm not able too say where the problem is though I think it's in the fact that this script will eval * and / at the same time, which is incorrect. 8/2*2 this string should print a result of 2, but it prints a result of 8.
excuse me im wrong, always took bomdas literally turns out multiplication and division are equal in order of predecnce and whichever is occurs first is evaluated first
A: 

i'm sure Tim meant "splitter = re.compile(r'([\D])')". if you copy exactly what he has down you only get the digits not the operators.

+11  A: 
>>> import re
>>> re.findall(r'\d+|\D+', '2+24*48/32=10')

['2', '+', '24', '*', '48', '/', '32', '=', '10']

Matches consecutive digits or consecutive non-digits.

Each match is returned as a new element in the list.

Depending on the usage, you may need to alter the regular expression. Such as if you need to match numbers with a decimal point.

>>> re.findall(r'[0-9\.]+|[^0-9\.]+', '2+24*48/32=10.1')

['2', '+', '24', '*', '48', '/', '32', '=', '10.1']
molasses
+3  A: 

Another solution to this would be to avoid writing a calculator like that altogether. Writing an RPN parser is much simpler, and doesn't have any of the ambiguity inherent in writing math with infix notation.

import operator, math
calc_operands = {
    '+': (2, operator.add),
    '-': (2, operator.sub),
    '*': (2, operator.mul),
    '/': (2, operator.truediv),
    '//': (2, operator.div),
    '%': (2, operator.mod),
    '^': (2, operator.pow),
    '**': (2, math.pow),
    'abs': (1, operator.abs),
    'ceil': (1, math.ceil),
    'floor': (1, math.floor),
    'round': (2, round),
    'trunc': (1, int),
    'log': (2, math.log),
    'ln': (1, math.log),
    'pi': (0, lambda: math.pi),
    'e': (0, lambda: math.e),
}

def calculate(inp):
    stack = []
    for tok in inp.split():
        if tok in self.calc_operands:
            n_pops, func = self.calc_operands[tok]
            args = [stack.pop() for x in xrange(n_pops)]
            args.reverse()
            stack.append(func(*args))
        elif '.' in tok:
            stack.append(float(tok))
        else:
            stack.append(int(tok))
    if not stack:
        raise ValueError('no items on the stack.')
    return stack.pop()
    if stack:
        raise ValueError('%d item(s) left on the stack.' % len(stack))

calculate('24 38 * 32 / 2 +')
Aaron Gallagher
Why don't you just go implement forth, it'll only be 5 more lines!
Jerub
A: 

Why not just use SymPy? It should do what you're trying to achieve.

+2  A: 

This is a parsing problem, so neither regex not split() are the "good" solution. Use a parser generator instead.

I would look closely at pyparsing. There have also been some decent articles about pyparsing in the Python Magazine.

Ber
+13  A: 

It just so happens that the tokens you want split are already Python tokens, so you can use the built-in tokenize module. It's almost a one-liner:

from cStringIO import StringIO
from tokenize import generate_tokens
STRING = 1
list(token[STRING] for token 
     in generate_tokens(StringIO('2+24*48/32').readline)
     if token[STRING])
['2', '+', '24', '*', '48', '/', '32']
Glyph
Great answer, I didn't realize this module existed :)
Kiv
A: 

This doesn't answer the question exactly, but I believe it solves what you're trying to achieve. I would add it as a comment, but I don't have permission to do so yet.

I personally would take advantage of Python's maths functionality directly with exec:

expression = "2+24*48/32"
exec "result = " + expression
print result
38

Syhon
Forgive me if I'm wrong, but wouldn't it be preferable to use `result = eval(expression)`?
P Daddy
Indeed it would; my apologies.
Syhon