views:

102

answers:

3

I decided to write a small parser to parse BBCode and return properly formatted HTML. I am having a hard time deciding what the most efficient way to represent the keywords would be. I could always use separate strings to hold them, but I feel like there must be some unknown data structure (to me) that would allow for efficient lookup.

I am using C++ if there is anything in the STL I can use. I don't intend to actually use it so I don't need to use anything like PHP. It will not have a GUI interface; just input a text file and it outputs a new file with the HTML parsed out.

Edit: By keywords, I mean the opening and closing tags, such as [b] and [/b].

+1  A: 

The classic is the Aho-Corasick keyword tree introduced in their 1973 paper.

linear time word insertion, linear time word lookup.

San Jacinto
Since insertion is at compile time, why is this better than using a raw array which also has linear lookup?
Hooked
Huh? why is insertion done at compile-time? Do you mean that YOUR insertion is only done at compile-time?Additionally, I fail to see how an array is a linear-time search, unless you are using some advanced algorithms based on suffix arrays, which I doubt you are.The aho-corasick method will search against ALL keywords for a match in time linear with respect to the length of the word you are searching for.
San Jacinto
+4  A: 

Since you know all your keywords in advance, you can take advantage of perfect hashing, e.g. via this library -- see also the wikipedia entry and the pointers from it.

Alex Martelli
+2  A: 

The classic answer is a hash table. Constant time insertion/replacement.

But it's not entirely clear what you want. If it's just to keep the keywords neatly organized instead of peppered through your code a simple array would do; then use #defines to index and select them.

Kristoffon
I want to keep it simple. If I am worried about speed (ie I need constant time instead of linear time), is my only option to store individual strings and use them separately?
Hooked
If you are going to use this approach, Karp-Rabin is the answer. You won't achieve true constant time.
San Jacinto