For my very own little parser framework, I am trying to define (something like) the following function:

template <class T>
// with operator>>( std::istream&, T& )
void tryParse( std::istream& is, T& tgt )
{
    is >> tgt /* , *BUT* store every character that is consumed by this operation
    in some string. If afterwards, is.fail() (which should indicate a parsing
    error for now), put all the characters read back into the 'is' stream so that
    we can try a different parser. */
}

Then I could write something like this (maybe not the best example):

/* grammar: MyData     = <IntTriple> | <DoublePair>
            DoublePair = <double> <double>
            IntTriple  = <int> <int> <int> */
class MyData
{ public:
    union { DoublePair dp; IntTriple it; } data;
    bool isDoublePair;
};

istream& operator>>( istream& is, MyData& md )
{
    /* If I used just "is >> md.data.it" here instead, the
       operator>>( ..., IntTriple ) might consume two ints, then hit an
       unexpected character, and fail, making it impossible to read these two
       numbers as doubles in the "else" branch below. */
    tryParse( is, md.data.it );
    if ( !is.fail() )
        md.isDoublePair = false;
    else
    {
        md.isDoublePair = true;
        is.clear();
        is >> md.data.dp;
    }
    return is;
}

Any help is greatly appreciated.

+3  A: 

This is not what streams are intended for. You should read the data you want to parse into a buffer and then hand that buffer (preferably as an iterator-range) to the functions that parse it. This could look something like this:

template <class T, class U>
bool tryParse( U & begin, U & end, T & target ) {
    // return true if parse was successful, false otherwise
}

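To read from an istream into a buffer, you can use an istreambuf_iterator (an istream_iterator<char> would silently skip whitespace). The extra parentheses around the first argument keep the line from being parsed as a function declaration:

 std::vector< char > buffer((std::istreambuf_iterator<char>(is)), std::istreambuf_iterator<char>());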

This reads the entire stream into the vector when it is created.
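
As a rough illustration of the iterator-range idea, here is a minimal sketch for the special case of parsing an int. The name tryParseInt is made up for this example, it takes end by value rather than by reference, it assumes multi-pass (forward) iterators such as those of a vector or string, and it ignores overflow:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: parse an optionally signed decimal int from [begin, end).
// 'begin' is advanced only if the parse succeeds, so a failed attempt consumes
// nothing and a different parser can be tried from the same position.
template <class U>
bool tryParseInt(U& begin, U end, int& target)
{
    U it = begin;  // work on a copy of the position
    bool negative = false;
    if (it != end && (*it == '-' || *it == '+')) {
        negative = (*it == '-');
        ++it;
    }
    if (it == end || !std::isdigit(static_cast<unsigned char>(*it)))
        return false;  // no digits: report failure, leave 'begin' untouched
    int value = 0;
    for (; it != end && std::isdigit(static_cast<unsigned char>(*it)); ++it)
        value = value * 10 + (*it - '0');  // overflow is not handled in this sketch
    target = negative ? -value : value;
    begin = it;  // commit the consumed characters
    return true;
}
```

Because the function only moves begin on success, backtracking costs nothing: the caller simply tries the next alternative with the same iterator.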

Space_C0wb0y
+2  A: 

Putting the characters back is tricky. Some streams support unget() and putback(somechar), but there is no guarantee how many characters you can unget (if any).

A more reliable way is to read the characters into a buffer and parse that, or store the characters read in the first parsing attempt and use that buffer when parsing a second time.
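
A minimal sketch of the buffer approach, assuming the input has already been read into a std::string. The (buf, pos) signature is an assumption made for this example, and IntTriple and DoublePair are stand-ins for the types in the question:

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Stand-ins for the question's types.
struct IntTriple  { int a, b, c; };
struct DoublePair { double x, y; };

std::istream& operator>>(std::istream& is, IntTriple& t)  { return is >> t.a >> t.b >> t.c; }
std::istream& operator>>(std::istream& is, DoublePair& p) { return is >> p.x >> p.y; }

// Hypothetical helper: try to parse a T from buf, starting at index pos.
// On success, pos is advanced past the consumed characters. On failure,
// pos and tgt are left unchanged, so another parser can start from the same place.
template <class T>
bool tryParse(const std::string& buf, std::size_t& pos, T& tgt)
{
    std::istringstream in(buf.substr(pos));  // copies the tail; fine for a sketch
    T tmp = T();
    if (!(in >> tmp))
        return false;
    in.clear();  // tellg() fails if eofbit is still set after reading to the end
    const std::streamoff consumed = in.tellg();
    pos += static_cast<std::size_t>(consumed);
    tgt = tmp;
    return true;
}
```

With the input "1.5 2.5", tryParse into an IntTriple fails and leaves pos at 0, after which tryParse into a DoublePair succeeds from the same position.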

Anthony Williams
+1 for parsing a separate buffer.
Mark B
+2  A: 

Unfortunately, streams have only very minimal and rudimentary putback support.

The last time I needed this, I wrote my own reader classes, which wrapped a stream but had a buffer to put things back into, and read from the stream only when that buffer was empty. You could obtain a state object from them, and either commit that state or roll back to an earlier one.
The default action in the state class's destructor was to roll back, so that you could parse ahead without giving much thought to error handling, because an exception would simply roll the parser's state back to a point where a different grammar rule could be tried. (I think this is called backtracking.) Here's a sketch:

class parse_buffer {
    friend class parse_state;
public:
    typedef std::string::size_type index_type;

    parse_buffer(std::istream& str);

    index_type get_current_index() const;
    void set_current_index(index_type);

    std::string get_next_string(bool skip_ws = true);
    char get_next_char(bool skip_ws = true);
    char peek_next_char(bool skip_ws = true); 

    std::string get_error_string() const; // returns string starting at error idx
    index_type get_error_index() const;
    void set_error_index(index_type);

    bool eof() const;

    // ...
};

class parse_state {
public:
    parse_state(parse_buffer&);
    ~parse_state();

    void commit();
    void rollback();

    // ...
};

This should give you an idea. It has none of the implementation, but that was straightforward and should be easy to redo. Also, the real code had many convenience functions, such as ones that read a delimited string, consumed a string if it was one of several given keywords, or read a string and converted it to a type given as a template parameter.

The idea was that a function would set the error index to its starting position, save the parse state, and try to parse until it either succeeded or ran into a dead end. In the latter case, it would just throw an exception. This would destroy the parse_state objects on the stack, rolling the state back up to a function that could catch the exception and either try something else or output an error (which is where get_error_string() comes in).

If you want a really fast parser, this strategy might be wrong, but then streams are often too slow, too. OTOH, the last time I used something like this, I made an XPath parser that operates on a proprietary DOM, which is used to represent scenes in a 3D renderer. And it was not the XPath parser that got all the heat from the guys trying to get higher frame rates. :)

sbi
That sounds quite interesting. Do you still have the code and is it open source / may I take a look at it?
rainmaker
Oh, how nice! So, parse_state::rollback() just calls set_current_index( index_on_my_creation ) on its parse_buffer? and commit() does - uhm - nothing, leaving the parse_buffer's index where it is? Ah, but commit() could tell parse_buffer that we're not going to rollback() before the current position and therefore it is safe (at least for this parse_state) to forget everything before that position... Yes, I think I begin to understand. I'm gonna try this, thanks a lot!
rainmaker
@rainmaker: Yes, you got the idea. One more thing committing does is to set the error index to the current position, so that later parse errors would yield the current index as where the faulty text started. I vaguely remember that this had its nooks and crannies (wouldn't `parse_state` also have to store the old error index?), but it has been several years since I used this scheme, so, absent the old code, if I needed to do this again, I wouldn't have more than the above to start from either. `:)`
sbi
+1  A: 

You can do some interesting things with a stream's streambuf members. In particular, a class derived from std::streambuf has direct access to the buffer's pointers (eback(), gptr(), egptr()).

However, you have no guarantee on the size of the buffers.

Alexandre C.