views:

163

answers:

5

Hello all :)

I'm working on a state machine which is supposed to extract function calls of the form

/* I am a comment */
//I am a comment
pref("this.is.a.string.which\"can have QUOTES\"", 123456);

where the extracted data would be pref("this.is.a.string.which\"can have QUOTES\"", 123456); from a file. Currently, to process a 41kb file, this process is taking close to a minute and a half. Is there something I'm seriously misunderstanding here about this finite state machine?

#include <boost/algorithm/string.hpp>
std::vector<std::string> Foo()
{
    std::string fileData;
    //Fill filedata with the contents of a file
    std::vector<std::string> results;
    std::string::iterator begin = fileData.begin();
    std::string::iterator end = fileData.end();
    std::string::iterator stateZeroFoundLocation = fileData.begin();
    std::size_t state = 0;
    for(; begin < end; begin++)
    {
        switch (state)
        {
        case 0:
            if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {
                stateZeroFoundLocation = begin;
                begin += 4;
                state = 2;
            } else if (*begin == '/')
                state = 1;
            break;
        case 1:
            state = 0;
            switch (*begin)
            {
            case '*':
                begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end();
                break;
            case '/':
                begin = std::find(begin, end, L'\n');
            }
            break;
        case 2:
            if (*begin == '"')
                state = 3;
            break;
        case 3:
            switch(*begin)
            {
            case '\\':
                state = 4;
                break;
            case '"':
                state = 5;
            }
            break;
        case 4:
            state = 3;
            break;
        case 5:
            if (*begin == ',')
                state = 6;
            break;
        case 6:
            if (*begin != ' ')
                state = 7;
            break;
        case 7:
            switch(*begin)
            {
            case '"':
                state = 8;
                break;
            default:
                state = 10;
                break;
            }
            break;
        case 8:
            switch(*begin)
            {
            case '\\':
                state = 9;
                break;
            case '"':
                state = 10;
            }
            break;
        case 9:
            state = 8;
            break;
        case 10:
            if (*begin == ')')
                state = 11;
            break;
        case 11:
            if (*begin == ';')
                state = 12;
            break;
        case 12:
            state = 0;
            results.push_back(std::string(stateZeroFoundLocation, begin));
        };
    }
    return results;
}

Billy3

EDIT: Well this is one of the strangest things I've ever seen. I just rebuilt the project and it's running reasonably again. Odd.

+1  A: 

I don't know if this is part of the problem, but you have a typo in case 0, "perf" is misspelled as "pref".

Ferruccio
Actually it's mispelled in my question -- the state machine produces correct output it just takes FOREVER :( (Thank you +1)
Billy ONeal
+1  A: 

Well it's hard to say just by looking at it...but I'm guessing the find algorithms are doing it. Why are you searching within a FSM? By definition you're supposed to giving them one character at a time....Add more states. Also try making results a list, not a vector. A lot of copying going on with

vector<string>

But mostly: Profile it!

Chris H
Somehow I don't think it's the find functions because everything else in my program is using boost's string algorithms without exorbitant runtimes. And considering all the find algorithms are doing is incrementing `begin`, I don't see how replacing them with more states will speed things up.
Billy ONeal
I also don't think it's the vector because I'm getting maybe 50 results from this file -- it shouldn't be taking minutes for that many results. Also adding `results.reserve(2000);` at the top of the function does not help.
Billy ONeal
A: 

Finite state machines are a workable solution, but for text processing, its best to use a highly optimized finite state machine generator. In this case, a regular expression. Here it is as Perl regex:

# first clean the comments
$source =~ s|//.*$||;      # replace "// till end of line" with nothing
$source =~ s|/\*.*?\*/||s; # replace "/* any text until */" with nothing
                           # depending on your data, you may need a few other
                           # rules here to avoid blanking data, you could replace
                           # the comments with a unique identifier, and then
                           # expand any identifiers that the regex below returns

# then find your data
while ($source =~ /perf\(\s*"(.+?)",\s*(\d+)\s*\);/g) { 
   # matches your function signature and moves along source
   # do something with the captured groups, in this case $1 and $2
}

Since most regex libraries are Perl compatible, it shouldn't be hard to translate the syntax. If your search becomes more complicated, then a parser is in order.

Eric Strom
Adding a regex library is not an option for a simple case like this. It's the only case like it in the entire program, and it's hard to justify 200kb+ of regex code for what can be done with a single FSM.
Billy ONeal
While size is of course a concern, making sure that your state machine handles all of the edge cases (data that contains something that looks like a comment, differences in spacing, multiline calls...) can be difficult. Since you don't need many advanced regex features, I would imagine that there are small enough libraries. Even if not, modeling your solution as a regex, and then translating to C++ might be easier
Eric Strom
"Even if not, modeling your solution as a regex, and then translating to C++ might be easier" <-- That's exactly what I did to build the state machine :)
Billy ONeal
Sounds good, glad you got things working, any idea what caused your compile bug?
Eric Strom
I disagree. I would certainly not suggest using Regex to anyone willing to parse something. Sure it might work at the beginning, but when the requirements build up you end up with an unwieldy huge beast no-one has any chance to understand... and then people ask for nesting and you're in for a world of hurt.
Matthieu M.
+2  A: 

Unless your 41 kb file is mostly comments or prefs, it's going to spend most of its time in state 0. And for each character in state 0, you make a minimum of two function calls.

if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

You can speed this up by pre-testing to see if the current character is 'p'

if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

If the character isn't 'p' then there is no need to make any function calls. In particular not creating a iterator, which is probably where the time is being spent.

John Knoeller
A: 

If you are doing parsing, why not use a parser library.

I typically have Boost.Spirit.Qi in mind.

  • You express your grammar with EBNF-like expressions which certainly makes it easier for maintenance.
  • It's a header only library, so you don't have any problem of bringing a whole binary in the mix.

While I can appreciate the minimalist approach, I am afraid that your idea of coding a finite state machine on your own is not that wise. It works well with a toy example, but as the requirements add up you'll have one monstrous switch and it's going to become more and more complicated to understand what's going on.

And please, don't tell me you know it's not going to evolve: I don't believe in oracles ;)

Matthieu M.
As I told Eric Storm, I can't justify a huge library (like spirit) like this for a one-off function with this specific task. If I had other requirements, then I'd consider this. It might be header only, but it's still a lot of boilerplate code that is used
Billy ONeal