views:

127

answers:

4

When constructing a lexer/tokenizer, is it a mistake to rely on C functions such as isdigit/isalpha/...? As far as I know, they are locale-dependent. Should I pick a character set, concentrate on it, and build a character mapping myself from which I look up classifications? Then the problem becomes lexing multiple character sets: do I write one lexer/tokenizer per character set, or do I write a single lexer so that the only thing I have to change is the character mapping? What are common practices?
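To make the "character mapping" idea concrete, here is roughly what I have in mind -- a table-driven classifier for a single 8-bit character set (names like CH_DIGIT, init_ascii_table and lex_is_digit are just placeholders); supporting another character set would mean filling the table differently:

#include <stdint.h>

/* Classification flags for one 8-bit source character set. */
enum {
    CH_DIGIT = 1 << 0,
    CH_ALPHA = 1 << 1,
    CH_SPACE = 1 << 2
};

static uint8_t char_class[256];

/* Fill the table once at startup; one such routine per character set. */
static void init_ascii_table(void)
{
    for (int c = '0'; c <= '9'; c++) char_class[c] |= CH_DIGIT;
    for (int c = 'a'; c <= 'z'; c++) char_class[c] |= CH_ALPHA;
    for (int c = 'A'; c <= 'Z'; c++) char_class[c] |= CH_ALPHA;
    char_class[' ']  |= CH_SPACE;
    char_class['\t'] |= CH_SPACE;
    char_class['\n'] |= CH_SPACE;
    char_class['\r'] |= CH_SPACE;
}

/* Locale-independent queries used by the lexer. */
static int lex_is_digit(unsigned char c) { return char_class[c] & CH_DIGIT; }
static int lex_is_alpha(unsigned char c) { return char_class[c] & CH_ALPHA; }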

+2  A: 

The ctype.h functions are not very usable for chars that contain anything but ASCII. The default locale is C (essentially the same as ASCII on most machines), no matter what the system locale is. Even if you use setlocale to change the locale, chances are that the system uses a character set with characters wider than 8 bits (e.g. UTF-8), in which case you cannot tell anything useful from a single char.

Wide chars handle more cases properly, but even they fail too often.

So, if you want to classify non-ASCII characters reliably (isspace and the like), you have to do it yourself (or possibly use an existing library).

Note: ASCII only has character codes 0-127 (or 32-127), and what some call 8-bit ASCII is actually some other character set (commonly CP437, CP1252 or ISO-8859-1, and often something else entirely).
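To make the single-char limitation concrete, here is a small example (assuming the default "C" locale): "é" encoded as UTF-8 is the two bytes 0xC3 0xA9, and neither byte classifies as a letter, so a byte-at-a-time lexer learns nothing useful from isalpha:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    /* UTF-8 encoding of "é"; declared unsigned char because passing a plain
       (possibly negative) char to isalpha is undefined behaviour. */
    const unsigned char utf8_e_acute[] = { 0xC3, 0xA9, 0x00 };

    for (const unsigned char *p = utf8_e_acute; *p; p++)
        printf("byte 0x%02X: isalpha -> %d\n", *p, isalpha(*p));

    return 0;
}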

Tronic
I see, this problem is not as trivial as I initially liked to think. In theory it should be possible to abstract away this character-set trouble, though, by using my own internal representation and then mapping the wanted "codes" from other character sets onto that representation.
Questionable
+2  A: 

You are not likely to get very far trying to build a locale-sensitive parser -- it will drive you mad. ASCII works fine for most parsing needs -- don't fight it :D

If you do want to fight it and use proper Unicode character classifications, you should look at the ICU library, which implements Unicode religiously.
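For example, ICU4C exposes classification functions that take a whole code point rather than a single byte. A minimal sketch (the exact compile/link flags, e.g. -licuuc, depend on your installation):

#include <stdio.h>
#include <unicode/uchar.h>   /* u_isalpha, u_isdigit, u_isUWhiteSpace */

int main(void)
{
    UChar32 cp = 0x00E9;  /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */

    printf("U+%04X: alpha=%d digit=%d space=%d\n",
           cp, u_isalpha(cp), u_isdigit(cp), u_isUWhiteSpace(cp));
    return 0;
}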

Hassan Syed
+2  A: 

For now, I would concentrate on getting the lexer working with the plain ASCII character set first; then, once the lexer is working, add mapping support for other encodings such as UTF-16, along with locale support.

And no, it is not a mistake to rely on the ctype functions such as isdigit, isalpha and so on...

Actually, there is a standard equivalent of ctype.h for wide characters, <wctype.h>, so at a later stage it might be in your best interest to define a macro of your own, so that you can transparently switch the code over to handle different locales and character widths...

/* Wrapper macro so call sites stay unchanged in both builds. */
#ifdef LEX_WIDECHARS
#include <wctype.h>
#define lex_isdigit(c) iswdigit(c)
#else
#include <ctype.h>
#define lex_isdigit(c) isdigit((unsigned char)(c))
#endif

You would define it along those lines in this context...

Hope this helps, Best regards, Tom.

tommieb75
+1  A: 

Generally you need to ask yourself:

  • What exactly do you want to do -- what kind of parsing?
  • What languages do you want to support: a wide range, or Western European only?
  • What encoding do you want to use: UTF-8 or a localized 8-bit encoding?
  • What OS are you using?

Let's start: if you work with Western languages in a localized 8-bit encoding, then probably yes, you may rely on is*, provided the locales are installed and configured.
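For example (a sketch -- the locale name is system-dependent and must actually be installed), with an ISO-8859-1 locale the byte 0xE9 ('é' in Latin-1) does classify as a letter:

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* Assumes a Latin-1 locale such as fr_FR.ISO-8859-1 is installed. */
    if (!setlocale(LC_CTYPE, "fr_FR.ISO-8859-1"))
        return 1;  /* locale not available on this system */

    unsigned char c = 0xE9;  /* 'é' in ISO-8859-1 */
    printf("isalpha(0x%02X) = %d\n", c, isalpha(c));
    return 0;
}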

However:

  • If you work with UTF-8 you can't, because is* would only cover ASCII; everything outside ASCII takes more than one byte (see the sketch after this list).
  • If you want to support Eastern languages, most of your assumptions about parsing would be wrong; for example, Chinese does not use spaces to separate words. Many languages do not even have upper and lower case, including alphabet-based ones like Hebrew and Arabic.
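To illustrate the first point above, a small sketch (utf8_seq_len is a made-up helper) showing that a single byte only tells you how long the character is, not what it is:

#include <stdio.h>

/* Number of bytes in a UTF-8 sequence, derived from the lead byte;
   returns 0 for a continuation or invalid lead byte. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* ASCII                */
    if ((lead & 0xE0) == 0xC0) return 2;  /* U+0080 .. U+07FF     */
    if ((lead & 0xF0) == 0xE0) return 3;  /* U+0800 .. U+FFFF     */
    if ((lead & 0xF8) == 0xF0) return 4;  /* U+10000 .. U+10FFFF  */
    return 0;
}

int main(void)
{
    /* "né" in UTF-8: 'n' is one byte, 'é' is the two bytes 0xC3 0xA9. */
    const unsigned char s[] = { 'n', 0xC3, 0xA9, 0x00 };

    for (const unsigned char *p = s; *p; p += utf8_seq_len(*p))
        printf("lead byte 0x%02X starts a %d-byte character\n",
               *p, utf8_seq_len(*p));
    return 0;
}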

So, what exactly do you want to do?

I'd suggest taking a look at the ICU library, which has various break iterators, or at other toolkits like Qt that provide some basic boundary analysis.
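As a rough sketch of what the ICU route looks like in C (ubrk_* is ICU4C's break-iterator API; link flags such as -licuuc depend on your installation):

#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar text[64];
    int32_t len;

    /* Convert a UTF-8 string to ICU's UTF-16 representation. */
    u_strFromUTF8(text, 64, &len, "some input text", -1, &status);

    /* Iterate over word boundaries according to the given locale. */
    UBreakIterator *bi = ubrk_open(UBRK_WORD, "en_US", text, len, &status);
    if (U_FAILURE(status)) return 1;

    for (int32_t b = ubrk_first(bi); b != UBRK_DONE; b = ubrk_next(bi))
        printf("boundary at %d\n", (int)b);

    ubrk_close(bi);
    return 0;
}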

Artyom