I've been parsing poker hand histories for the past year and have learned quite a deal about parsing in general.
We started with regexes but quickly realized that wouldn't scale easily. We skipped languages from ruby to c++ and finally came to grips that it was the algorithim that had to change.
We picked up Boost::Spirit and watched our speed dramatically rise on orders of more than 10 times our original speed. We then skipped over to java and are currently using antlr to create grammars for each site. This is definitely the fastest method yet and it's very thorough which is nice cause you know exactly where you stand in terms of a 'complete' grammar. Unfortunately, I have spent incredible amounts of time working with these grammars -- they work pretty damn well but not perfectly yet.
Anyways, enough with the background on to the question at hand -- are there any 'exotic' or less well known techniques to parsing that I'm not aware of? I only know of lexing/parsing a grammar and the other inferior regex/loop method.
For those of you who are not familiar with poker hand histories I'll post one so you can tell what the structure is.
Full Tilt Poker Game #12037626529: Table durrrr (heads up, deep) - $500/$1000 -
Pot Limit Omaha Hi - 2:00:48 ET - 2009/05/05
Seat 1: durrrr ($196,456.50)
Seat 2: Gus Hansen ($65,499)
durrrr posts the small blind of $500
Gus Hansen posts the big blind of $1,000
The button is in seat #1
*** HOLE CARDS ***
durrrr raises to $3,000
Gus Hansen raises to $9,000
durrrr calls $6,000
*** FLOP *** [3d 4d 7d]
Gus Hansen has 15 seconds left to act
Gus Hansen checks
durrrr checks
*** TURN *** [3d 4d 7d] [Jh]
Gus Hansen checks
durrrr checks
*** RIVER *** [3d 4d 7d Jh] [Ah]
Gus Hansen has 15 seconds left to act
Gus Hansen checks
durrrr has 15 seconds left to act
123stayfree (Observer): GUS I NOW BRING U LUCK
durrrr bets $7,600
Gus Hansen has 15 seconds left to act
Gus Hansen has requested TIME
Hernandez777 (Observer): Gus has the super-duper nuts
Gus Hansen calls $7,600
Podobed45 (Observer): fluuuuuuuuuush
*** SHOW DOWN ***
durrrr shows [Kc 3s Qd As] two pair, Aces and Threes
Gus Hansen mucks
durrrr wins the pot ($33,199.50) with two pair, Aces and Threes
*** SUMMARY ***
Total pot $33,200 | Rake $0.50
Board: [3d 4d 7d Jh Ah]
Seat 1: durrrr (small blind) collected ($33,199.50)
Seat 2: Gus Hansen (big blind) mucked
I'm well aware of other methods of collecting the information (such as screen-scraping and dll injection) but the need to transform the hand history into structured data is still there so I'm looking only at methods that grab the info such as regex/grammars...
I think if I don't find something I'm going to rewrite our grammars with ocamllex/ocamlyacc.
update
fyi: regexen speed was ~60 hands/sec while the grammars were processing 600+ hands/sec... the entire hand is transformed into xml after the data is all sorted out... there are between 20-30 regexes needed (at last count) for EACH site you want to parse....each site on the grammar side has it's own grammar with ungodly amounts of lexer/parser rules (but it's still smaller code size)
I do have the dragon book and have been reading through it -- which has spurned my interest in using the ocamllex/ocamlyacc.... speed is the name of the game here..