I have a database full of messages from a bulletin board. The board uses BB codes as formatting style. I.e.:
- I'm not formatted
- This is [b]bold[/b] text
- Tags can also [i][b]be[/b] nested[/i]
- And the [b]nesting [i]can be[/b] rather[/i] ugly
My ultimate goal is to convert these messages to some well formed XML (no discussion here ;) ). I don't want to use regular expression, which will fail at some point (in fact: it does).
First step: parse a message into some kind of internal representation (a graph, a tree, etc.). And I'm stuck at this point. The actual extraction is not that big problem, but the storage is.
How do I represent this kind of markup into some meaningful structure. My problem seems to be similar (or almost identical) to a browser building a DOM from a HTML file. So I think there are some strategies to solve it. I know the solution will not be perfect but im willing to invest a vast amount of time to do build the best possible.
Question: Do you have any tips/hint/comments? Any articles or paper you can recommend? Or a book which discusses these topic? I'm grateful for any input.