I'm currently working on a scanner generator. The generator already works fine. But when using character classes the algorithm gets very slow.
The scanner generator produces a scanner for UTF8 encoded files. The full range of characters (0x000000 to 0x10ffff) should be supported.
If I use large character sets, like the any operator '.' or the unicode property {L}, the nfa (and also the dfa) contains a lot of states ( > 10000 ). So the convertation for nfa to dfa and create the minimal dfa takes a long time (even if the output minimal dfa contains only a few states).
Here's my current implementation of creating a character set part of the nfa.
void CreateNfaPart(int startStateIndex, int endStateIndex, Set<int> characters)
{
transitions[startStateIndex] = CreateEmptyTransitionsArray();
foreach (int character in characters) {
// get the utf8 encoded bytes for the character
byte[] encoded = EncodingHelper.EncodeCharacter(character);
int tStartStateIndex = startStateIndex;
for (int i = 0; i < encoded.Length - 1; i++) {
int tEndStateIndex = transitions[tStartStateIndex][encoded[i]];
if (tEndStateIndex == -1) {
tEndStateIndex = CreateState();
transitions[tEndStateIndex] = CreateEmptyTransitionsArray();
}
transitions[tStartStateIndex][encoded[i]] = tEndStateIndex;
tStartStateIndex = tEndStateIndex;
}
transitions[tStartStateIndex][encoded[encoded.Length - 1]] = endStateIndex;
}
Does anyone know how to implement the function much more efficiently to create only the necessary states?
EDIT:
To be more specific I need a function like:
List<Set<byte>[]> Convert(Set<int> characters)
{
???????
}
A helper function to convert a character (int) to a UTF8 encoding byte[] is defined as:
byte[] EncodeCharacter(int character)
{ ... }