I want to be able to pull out the type and count of letters from a piece of text where the letters could be in any order. There is some other parsing going on which I have working, but this bit has me stumped!

input -> result
"abc" -> [['a',1], ['b',1],['c',1]]
"bbbc" -> [['b',3],['c',1]]
"cccaa" -> [['a',2],['c',3]]

I could use search or scan and repeat for each possible letter, but is there a clean way of doing it?

This is as far as I got:

from pyparsing import *


def handleStuff(string, location, tokens):
    return [tokens[0][0], len(tokens[0])]


stype = Word("abc").setParseAction(handleStuff)
section =  ZeroOrMore(stype("stype"))


print section.parseString("abc").dump()
print section.parseString("aabcc").dump()
print section.parseString("bbaaa").dump()
+6  A: 

One solution:

text = 'sufja srfjhvlasfjkhv lasjfvhslfjkv hlskjfvh slfkjvhslk'
print([(x,text.count(x)) for x in set(text)])

No pyparsing involved, since pyparsing seems like overkill for this.
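To get the exact shape shown in the question (sorted [letter, count] pairs), the same idea could be sketched roughly like this. The function name `letter_counts` is made up for illustration, and sorting the letters is an assumption based on the expected results in the question.

```python
def letter_counts(text):
    # One [letter, count] pair per distinct letter, in alphabetical order.
    # str.count rescans the text once per letter, which is fine for short inputs.
    return [[ch, text.count(ch)] for ch in sorted(set(text))]

print(letter_counts("cccaa"))
```

This could also serve as the body of a parse action if pyparsing is only used to pull out the chunk of letters.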

Lennart Regebro
I think I'll go with this: use pyparsing to pull out the chunk, then process it with this in a setParseAction. I'd still be interested to know if there is a pyparsing solution though!
Thanks for your solution - it's now been pipped by the pyparsing one, but thanks for your help and the very neat solution!
+1  A: 

pyparsing apart -- in Python 3.1, collections.Counter makes such counting tasks really easy. A good version of Counter for Python 2 can be found here.
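As a rough sketch of what that looks like (assuming a Python version that ships collections.Counter; the helper name `letter_counts` and the sorting step are my own additions to match the output format in the question):

```python
from collections import Counter

def letter_counts(text):
    # Counter tallies every character in a single pass over the text.
    counts = Counter(text)
    # Sort by letter and convert to the [letter, count] layout from the question.
    return [[ch, n] for ch, n in sorted(counts.items())]

print(letter_counts("bbbc"))
```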

Alex Martelli
Like the Counter class - will keep that in mind for other things. Thanks.
+3  A: 

I like Lennart's one-line solution.

Alex mentions another great option if you're using Python 3.1.

Yet another option is collections.defaultdict:

>>> from collections import defaultdict
>>> mydict = defaultdict(int)
>>> for c in 'bbbc':
...   mydict[c] += 1
...
>>> mydict
defaultdict(<type 'int'>, {'c': 1, 'b': 3})
Adam Bernier
Sadly using python 2.6!
defaultdict was added in 2.5
Adam Bernier
+1  A: 

If you want a pure-pyparsing approach, this feels about right:

from pyparsing import *

# helper to build an expression for one character, collecting every match under that character's name
def makeExpr(ch):
    expr = Literal(ch).setResultsName(ch, listAllMatches=True)
    return expr

expr = OneOrMore(MatchFirst(makeExpr(c) for c in "abc"))
expr.setParseAction(lambda tokens: [[a,len(b)] for a,b in tokens.items()])


tests = """\
abc
bbbc
cccaa
""".splitlines()

for t in tests:
    print t,expr.parseString(t).asList()

Prints:

abc [['a', 1], ['c', 1], ['b', 1]]
bbbc [['c', 1], ['b', 3]]
cccaa [['a', 2], ['c', 3]]

But this gets into obscure territory, since it relies on some of the more arcane features of pyparsing (results names with listAllMatches=True). In general, I like frequency counters that use defaultdict (haven't tried Counter yet), since it's pretty clear just what you are doing.

Paul McGuire
+3  A: 

I wasn't clear from your description whether the input characters could be mixed like "ababc", since in all your test cases, the letters were always grouped together. If the letters are always grouped together, you could use this pyparsing code:

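from pyparsing import *

def makeExpr(ch):
    # Word(ch) matches a run of one or more of the same character;
    # the parse action replaces the run with [character, run length]
    expr = Word(ch).setParseAction(lambda tokens: [ch,len(tokens[0])])
    return expr

expr = Each([Optional(makeExpr(ch)) for ch in "abc"])

# tests is the list of sample strings defined in my other answer
for t in tests:
    print t,expr.parseString(t).asList()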

The Each construct takes care of matching the groups in any order, and Word(ch) handles the 1-to-n repetition of each letter. The parse action takes care of converting the parsed tokens into [character, count] pairs.
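For comparison, the same run-counting idea can be sketched without pyparsing using itertools.groupby. This is a different technique, not what Each does internally, and the name `run_counts` is made up for illustration. It also assumes the letters are grouped: unlike Each, it does not reject a letter that shows up in two separate runs, it just reports each run on its own.

```python
from itertools import groupby

def run_counts(text):
    # groupby yields one group per run of consecutive equal characters,
    # so "cccaa" gives a 'c' run followed by an 'a' run.
    return [[ch, len(list(run))] for ch, run in groupby(text)]

print(run_counts("cccaa"))
```

The pairs come back in input order rather than sorted, so wrap the result in sorted() if the order matters.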

Paul McGuire
Yes the chars are grouped so this is perfect. Thanks for the solution and explanation. Loving pyparsing!