views:

766

answers:

10

I've been parsing poker hand histories for the past year and have learned quite a deal about parsing in general.

We started with regexes but quickly realized that wouldn't scale easily. We skipped languages from ruby to c++ and finally came to grips that it was the algorithim that had to change.

We picked up Boost::Spirit and watched our speed dramatically rise on orders of more than 10 times our original speed. We then skipped over to java and are currently using antlr to create grammars for each site. This is definitely the fastest method yet and it's very thorough which is nice cause you know exactly where you stand in terms of a 'complete' grammar. Unfortunately, I have spent incredible amounts of time working with these grammars -- they work pretty damn well but not perfectly yet.

Anyways, enough with the background on to the question at hand -- are there any 'exotic' or less well known techniques to parsing that I'm not aware of? I only know of lexing/parsing a grammar and the other inferior regex/loop method.

For those of you who are not familiar with poker hand histories I'll post one so you can tell what the structure is.

Full Tilt Poker Game #12037626529: Table durrrr (heads up, deep) - $500/$1000 -
Pot Limit Omaha Hi - 2:00:48 ET - 2009/05/05
Seat 1: durrrr ($196,456.50)
Seat 2: Gus Hansen ($65,499)
durrrr posts the small blind of $500
Gus Hansen posts the big blind of $1,000
The button is in seat #1
*** HOLE CARDS ***
durrrr raises to $3,000
Gus Hansen raises to $9,000
durrrr calls $6,000
*** FLOP *** [3d 4d 7d]
Gus Hansen has 15 seconds left to act
Gus Hansen checks
durrrr checks
*** TURN *** [3d 4d 7d] [Jh]
Gus Hansen checks
durrrr checks
*** RIVER *** [3d 4d 7d Jh] [Ah]
Gus Hansen has 15 seconds left to act
Gus Hansen checks
durrrr has 15 seconds left to act
123stayfree (Observer): GUS I NOW BRING U LUCK
durrrr bets $7,600
Gus Hansen has 15 seconds left to act
Gus Hansen has requested TIME
Hernandez777 (Observer): Gus has the super-duper nuts
Gus Hansen calls $7,600
Podobed45 (Observer): fluuuuuuuuuush
*** SHOW DOWN ***
durrrr shows [Kc 3s Qd As] two pair, Aces and Threes
Gus Hansen mucks
durrrr wins the pot ($33,199.50) with two pair, Aces and Threes
*** SUMMARY ***
Total pot $33,200 | Rake $0.50
Board: [3d 4d 7d Jh Ah]
Seat 1: durrrr (small blind) collected ($33,199.50)
Seat 2: Gus Hansen (big blind) mucked

I'm well aware of other methods of collecting the information (such as screen-scraping and dll injection) but the need to transform the hand history into structured data is still there so I'm looking only at methods that grab the info such as regex/grammars...

I think if I don't find something I'm going to rewrite our grammars with ocamllex/ocamlyacc.

update

fyi: regexen speed was ~60 hands/sec while the grammars were processing 600+ hands/sec... the entire hand is transformed into xml after the data is all sorted out... there are between 20-30 regexes needed (at last count) for EACH site you want to parse....each site on the grammar side has it's own grammar with ungodly amounts of lexer/parser rules (but it's still smaller code size)

I do have the dragon book and have been reading through it -- which has spurned my interest in using the ocamllex/ocamlyacc.... speed is the name of the game here..

+2  A: 

Parser combinators is a very popular method of building parsers in functional languages such as Haskell.

andri
+1  A: 

You might try playing with Hadoop which is an Apache project inspired by Google's MapReduce and GFS.

http://hadoop.apache.org/core/

It won't necessarily make a single machine hum but it allows you to easily and cheaply throw more hardware at the problem.

Alternately, Erlang might be a good option if you've got a multi-core machine. It's good at concurrency and pattern matching.

http://erlang.org/

steamer25
A: 

No particular recommendation but I feel I should point you at this list

ShuggyCoUk
A: 

Read the Dragon Book: http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886

It covers lexical and syntactical analysis in depth (among other topics). You can use this to help you understand the "language" you are trying to parse to determine the best way to go about it.

mmorrisson
+1  A: 

Wikipedia has a nice overview about parser types, here: http://en.wikipedia.org/wiki/Parser

And a comparison about parser generator tools, here: http://en.wikipedia.org/wiki/Comparison_of_parser_generators

I think GLR is a kind of less well known method which is interesting because it deals with language ambiguities.

Vizu
+3  A: 

If you're looking to maximize speed, then you might do better to use OcamlYacc/FsYacc over ANTLR. OcamlYacc creates LL(1) parsers, which typically have better performance than ANTLR-style LL(*) parsers (someone can correct me if I'm wrong). [Edit to add:] Looks like someone corrected me: OCamlYacc produces LALR(1) parsers. I can't say with any confidence whether OcamlYacc parsers are faster than ANTLR parsers.

OCaml/F# are very good languages for building a DSL, and in my opinion much more appropriate for the job than Java, mostly because its ridiculously easy to create and traverse an AST represented as a union data structure. I recommed this tutorial which demonstrates how to parse SQL in F#.

Juliet
I don't have much exp. with the diff. parsers out there but everything I've read has said the time/space complexities are in favor of LL(1) and also incidentally solve a few problems in LL(k)/LL(*)
feydr
This is incorrect. ocamlyacc, like yacc and bison, produces LALR(1), not LL(1).
Bob Aman
Looking for a SQL parser, and I like F#, so +1
James Hugard
A: 

Recursive Descent Parsing might work for you. It is very customizable. It may be a bit slower than yacc/antlr, but may be fast enough. The basic idea: You encode every grammar rule as a function.

Yuval F
ANTLR generates recursive-descent parsers (as do most LL-based parser generators)
Scott Stanchfield
+3  A: 

Since you're looking for exotic, read this article about Vaughan Pratt's Top Down Operator Precedence...

http://javascript.crockford.com/tdop/tdop.html

Nosredna
+1  A: 

Since you are talking about using OCaml for parsing, this page gives an overview of different parsing options for that language:

Parser generators for the OCaml language

If you decide to settle for ocamlyacc (or menhir), these tutorials may be a little easier than the reference manual:

Bruno De Fraine
+2  A: 

You have to ask yourself if what you really want to do is play around with parsers (admittedly fun, and what I prefer myself) or if you want to actually get work done on your poker bot. Mostly likely, exotic parsing techniques are overkill for what you need. Just choose a fast language with some straightforward, easy to use parsers. You should probably be able to process 10k hands / sec with straight C + flex. Or, ocamllex + ocamlyacc should be more than enough. If you have to hadoopify your code I think you're doing something wrong. Network latency should end up being your real bottleneck, not parsing speed. What kind of machine are you running this on?

Another alternative is using a parser generator to autogenerate a parse table, and then hand optimizing that, or hand optimizing from the NFA (you probably won't save much though, and the tradeoff in programmer time probably isn't worth it). Combinator parsing is likely going to be slower.

On average, for a given grammar of equivalent power LL will be slower than LALR. In particular, if the poker hands are actually parseable by an LALR parser, then bison/byacc + flex will beat ANTLR hands down, every time. I'm personally pretty happy with menhir, though it's a raging bitch and a half to get working with godi + ocamlbuild.

--Nico