I'm currently working on a scanner generator. The generator already works fine, but when I use character classes the algorithm gets very slow.

The scanner generator produces a scanner for UTF-8 encoded files. The full range of Unicode code points (0x000000 to 0x10FFFF) should be supported.

If I use large character sets, like the any operator '.' or the Unicode property {L}, the NFA (and also the DFA) contains a lot of states (> 10,000). So the conversion from NFA to DFA and the construction of the minimal DFA take a long time (even though the resulting minimal DFA contains only a few states).

Here's my current implementation for creating the character-set part of the NFA:

void CreateNfaPart(int startStateIndex, int endStateIndex, Set<int> characters)
{
    transitions[startStateIndex] = CreateEmptyTransitionsArray();
    foreach (int character in characters) {
        // get the UTF-8 encoded bytes for the character
        byte[] encoded = EncodingHelper.EncodeCharacter(character);
        int tStartStateIndex = startStateIndex;
        // build (or reuse) a chain of states for all bytes except the last
        for (int i = 0; i < encoded.Length - 1; i++) {
            int tEndStateIndex = transitions[tStartStateIndex][encoded[i]];
            if (tEndStateIndex == -1) {
                tEndStateIndex = CreateState();
                transitions[tEndStateIndex] = CreateEmptyTransitionsArray();
            }
            transitions[tStartStateIndex][encoded[i]] = tEndStateIndex;
            tStartStateIndex = tEndStateIndex;
        }
        // the last byte leads to the common end state
        transitions[tStartStateIndex][encoded[encoded.Length - 1]] = endStateIndex;
    }
}

Does anyone know how to implement this function much more efficiently, so that only the necessary states are created?

EDIT:

To be more specific, I need a function like:

List<Set<byte>[]> Convert(Set<int> characters)
{
    ???????
}

A helper function that converts a character (int) to its UTF-8 encoded byte[] is already defined:

byte[] EncodeCharacter(int character)
{ ... }
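
For illustration, here is a minimal sketch of one possible first step (it assumes your Set<T> type exposes an Add method, and it is not a full solution). It groups characters that share the same UTF-8 prefix (all bytes except the last) and merges their last bytes into a single set, so for example all continuation bytes under one common prefix collapse into one transition. It does not yet merge ranges in the leading bytes the way RE2 does:

List<Set<byte>[]> Convert(Set<int> characters)
{
    // map from a UTF-8 prefix (all bytes except the last) to the set of
    // possible last bytes under that prefix
    var lastBytes = new Dictionary<string, Set<byte>>();
    var prefixes = new Dictionary<string, byte[]>();

    foreach (int character in characters) {
        byte[] encoded = EncodingHelper.EncodeCharacter(character);
        byte[] prefix = new byte[encoded.Length - 1];
        Array.Copy(encoded, prefix, prefix.Length);
        string key = BitConverter.ToString(prefix); // prefix bytes as a dictionary key

        if (!lastBytes.ContainsKey(key)) {
            lastBytes[key] = new Set<byte>();
            prefixes[key] = prefix;
        }
        lastBytes[key].Add(encoded[encoded.Length - 1]);
    }

    var result = new List<Set<byte>[]>();
    foreach (var entry in lastBytes) {
        byte[] prefix = prefixes[entry.Key];
        var sequence = new Set<byte>[prefix.Length + 1];
        for (int i = 0; i < prefix.Length; i++) {
            var single = new Set<byte>();
            single.Add(prefix[i]); // singleton set for each shared prefix byte
            sequence[i] = single;
        }
        sequence[prefix.Length] = entry.Value; // merged set of last bytes
        result.Add(sequence);
    }
    return result;
}

Each returned array then describes one chain of transitions: singleton sets for the shared prefix bytes followed by one merged set for the final byte.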
+1  A: 

Look at what regular expression libraries like Google RE2 and TRE are doing.

jilles
I think Google RE2 does the kind of thing I need, but it's very complex... I found some interesting code at http://code.google.com/p/re2/source/browse/re2/compile.cc (starting at line 559)
youllknow
+1  A: 

There are a number of ways to handle it. They all boil down to treating sets of characters at a time in the data structures, instead of ever enumerating the entire alphabet. That's also how you make scanners for Unicode in a reasonable amount of memory.

You have many choices about how to represent and process sets of characters. I'm presently working with a solution that keeps an ordered list of boundary conditions and corresponding target states. You can process operations on these lists much faster than you could if you had to scan the entire alphabet at each juncture. In fact, it's fast enough that it runs in Python with acceptable speed.
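
For illustration, a minimal sketch of such a boundary list (the names here are hypothetical, not taken from the answer's actual code): transitions are stored as a sorted list of interval start points with a target state per interval, so a class like [0x80, 0x10FFFF] costs two entries instead of more than a million.

class IntervalTransitions
{
    // boundaries[i] is the first code point of interval i; targets[i] is the
    // state reached on any code point in [boundaries[i], boundaries[i+1])
    private readonly List<int> boundaries = new List<int>();
    private readonly List<int> targets = new List<int>();

    // simplified: assumes intervals are added in ascending order without overlap
    public void Add(int from, int to, int target)
    {
        int n = boundaries.Count;
        if (n > 0 && boundaries[n - 1] == from) {
            // the previous "no transition" gap starts exactly at 'from': reuse it
            targets[n - 1] = target;
        } else {
            boundaries.Add(from);
            targets.Add(target);
        }
        boundaries.Add(to + 1);
        targets.Add(-1); // -1 = no transition
    }

    public int Lookup(int codePoint)
    {
        // binary search for the interval containing codePoint
        int index = boundaries.BinarySearch(codePoint);
        if (index < 0) index = ~index - 1; // last boundary <= codePoint
        return index >= 0 ? targets[index] : -1;
    }
}

Set operations such as union and intersection on two such lists can be done in a single merge pass over the boundaries, which is why this stays fast even in an interpreted language.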

Ian