views:

132

answers:

5

Alright so I am writing a function as part of a lexical analyzer that 'looks up' or searches for a match with a keyword. My lexer catches all the obvious tokens such as single and multi character operators (+ - * / > < = == etc) (also comments and whitespace are already taken out) so I call a function after I've collected a stream of only alphanumeric characters (including underscores) into a string, this string then needs to be matched as either a known keyword or an identifier.

So I was wondering how I might go about identifying it? I know I basically need to compare it to some list or array or something of all the built in keywords, and if it matches one return that match to it's corresponding enum value; otherwise, if there is no match, then it must be a function or variable identifier. So how should I look for matches? I read somewhere that something called a Binary Search Tree is an efficient way to do it or by using Hash Tables, problem is I've never used either so I am not sure if it's the right way. Could I possibly use a MySQL database?

+2  A: 

A "trie" will surely be the most efficient way.

Ben Voigt
+4  A: 

If your set of keywords is fixed, a perfect hash can be built for O(1) lookup. Check out gperf or cmph.

ergosys
You'd still have hash collisions with non-keywords, so I don't see this being more efficient than other methods. It's also not O(1), true the complexity doesn't depend on the number of keywords but it does depend on the length of each keyword.
Ben Voigt
Verification after the lookup is a string compare, but this is unlikely to be an significant factor in the performance. Since the hash is perfect, there is no hash collision penalty, the input either matches the hashed slot or it doesn't, no additional search is needed.
ergosys
+2  A: 

Whatever implementation of std::map you have will probably be sufficient.

no one important
Or `std::tr1::unordered_map` if your compiler supports it, which the latest VC++ and GCC both do. :)
Jonathan Grynspan
A: 

For singe character keywords a lookup table would be perfect. For multicharacter (especially if the lengths differs): a hash table. If you need performance, you could even use source code generation to create the hash tables (using a simple hash function that is able or not to ignore case, depending on your syntax).

So I'd implement it with a LUT and a hash table: first you check the first character with the LUT (if it's a simple operator, it would start with a non-alpha-numeric value), and, if not found, check the hash table.

ruslik
A: 

This is for a language, with a specific set of keywords that never change, and there aren't very many of them?

If so, it probably doesn't matter what you use. You will have bigger fish to fry.

However, since the list doesn't change, it would be hard to beat a hard coded search like this:

// search on first letter
switch(s[0]){
  case 'a':
    // search on 2nd letter, etc.
    break;
  case 'b':
    // search on 2nd letter, etc.
    break;
  ........
  case '_':
    // search on 2nd letter, etc.
    break;
}
Mike Dunlavey