I have a database full of messages from a bulletin board. The board uses BBCode for formatting. For example:

  • I'm not formatted
  • This is [b]bold[/b] text
  • Tags can also [i][b]be[/b] nested[/i]
  • And the [b]nesting [i]can be[/b] rather[/i] ugly

My ultimate goal is to convert these messages to well-formed XML (no discussion here ;) ). I don't want to use regular expressions, which will fail at some point (in fact, they already do).

First step: parse a message into some kind of internal representation (a graph, a tree, etc.). And I'm stuck at this point. The actual extraction is not that big a problem, but the storage is.

How do I represent this kind of markup in some meaningful structure? My problem seems to be similar (or almost identical) to a browser building a DOM from an HTML file, so I think there must be some established strategies for solving it. I know the solution will not be perfect, but I'm willing to invest a lot of time to build the best one possible.
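To make the storage question concrete, here is a minimal sketch of one possible representation: a plain tree of nodes built with a stack of currently open tags, without regular expressions. When a closing tag does not match the innermost open tag, the sketch closes the tags in between and reopens them afterwards, which is roughly how browsers recover from misnested formatting. The names `Node`, `KNOWN_TAGS`, `tokenize`, `parse` and `to_xml` are all hypothetical, and the sketch assumes only the `b` and `i` tags from the examples above, with no attributes.

```python
from dataclasses import dataclass, field
from typing import List, Union
from xml.sax.saxutils import escape

KNOWN_TAGS = {"b", "i"}  # assumption: only the tags from the examples


@dataclass
class Node:
    tag: str
    children: List[Union["Node", str]] = field(default_factory=list)


def tokenize(text):
    """Yield ("open", tag), ("close", tag) or ("text", s) by plain scanning."""
    pos = 0
    while pos < len(text):
        start = text.find("[", pos)
        end = text.find("]", start + 1) if start != -1 else -1
        if start == -1 or end == -1:
            yield ("text", text[pos:])  # no complete tag left
            return
        body = text[start + 1:end]
        name = (body[1:] if body.startswith("/") else body).lower()
        if name in KNOWN_TAGS:
            if start > pos:
                yield ("text", text[pos:start])
            yield ("close" if body.startswith("/") else "open", name)
            pos = end + 1
        else:
            # Not a tag we know: keep the bracket as literal text.
            yield ("text", text[pos:start + 1])
            pos = start + 1


def parse(text):
    root = Node("root")
    stack = [root]  # currently open elements, innermost last
    for kind, value in tokenize(text):
        if kind == "text":
            stack[-1].children.append(value)
        elif kind == "open":
            node = Node(value)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            if value not in [n.tag for n in stack[1:]]:
                continue  # stray closing tag: ignored in this sketch
            reopen = []
            while stack[-1].tag != value:
                reopen.append(stack.pop().tag)  # close misnested inner tags
            stack.pop()  # close the requested tag
            for tag in reversed(reopen):  # reopen the inner tags, same order
                node = Node(tag)
                stack[-1].children.append(node)
                stack.append(node)
    return root  # tags still open at the end are closed implicitly


def to_xml(node):
    inner = "".join(
        escape(c) if isinstance(c, str) else to_xml(c) for c in node.children
    )
    return f"<{node.tag}>{inner}</{node.tag}>"


print(to_xml(parse("And the [b]nesting [i]can be[/b] rather[/i] ugly")))
# <root>And the <b>nesting <i>can be</i></b><i> rather</i> ugly</root>
```

The tree keeps text as plain strings and elements as `Node` objects, so serializing to XML is a simple recursive walk. Reopening tags can leave empty elements behind (for example an empty `<i></i>`); whether to prune those is a design choice this sketch leaves open.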

Question: Do you have any tips, hints, or comments? Any articles or papers you can recommend? Or a book that discusses this topic? I'm grateful for any input.

+2  A: 
rayd09
Wow. Thank you very much. Your sample code solved nearly all my problems :)
Martin
Thank you and good luck. I just fixed a typo, and I was passing the wrong parameter in the push call; that is fixed now too.
rayd09