views:

267

answers:

6

Attention please:

I already implemented this stuff, just not in any way generic or elegant. This question is motivated by my wanting to learn more tricks with the stl, not the problem itself.

This I think is clear in the way I stated that I already solved the problem, but many people have answered in their best intentions with solutions to the problem, not answers to the question "how to solve this the stl way". I am really sorry if I phrased this question in a confusing way. I hate to waste people's time.

Ok, here it comes:

I get a string full of encoded Data.

It comes in: N >> 64 bytes

  • every 3 byte get decoded into an int value
  • after at most 64 byte (yes,not divisible by 3!) comes a byte as checksum
  • followed by a line feed.
  • and so it goes on.

It ends when 2 successive linefeeds are found.

It looks like a nice or at least ok data format, but parsing it elegantly the stl way is a real bit**.

I have done the thing "manually".

But I would be interested if there is an elegant way with the stl- or maybe boost- magic that doesn't incorporate copying the thing.

Clarification: It gets really big sometimes. The N >> 64byte was more like a N >>> 64 byte ;-)

UPDATE Ok, the N>64 bytes seems to be confusing. It is not important.

  • The sensor takes M measurements as integers. Encodes each of them into 3 bytes. and sends them one after another
  • when the sensor has sent 64byte of data, it inserts a checksum over the 64 byte and an LF. It doesn't care if one of the encoded integers is "broken up" by that. It just continues in the next line.(That has only the effect to make the data nicely human readable but kindof nasty to parse elegantly.)
  • if it has finished sending data it inserts a checksum-byte and LFLF

So one data chunk can look like this, for N=129=43x3:

|<--64byte-data-->|1byte checksum|LF 
|<--64byte-data-->|1byte checksum|LF 
|<--1byte-data-->|1byte checksum|LF
LF

When I have M=22 measurements, this means I have N=66 bytes of data. After 64 byte it inserts the checksum and LF and continues. This way it breaks up my last measurement which is encoded in byte 64, 65 and 66. It now looks like this: 64, checksum, LF, 65, 66. Since a multiple of 3 divided by 64 carries a residue 2 out of 3 times, and everytime another one, it is nasty to parse. I had 2 solutions:

  1. check checksum, concatenate data to one string that only has data bytes, decode.
  2. run through with iterators and one nasty if construct to avoid copying.

I just thought there might be someting better. I mused about std::transform, but it wouldn't work because of the 3 byte is one int thing.

A: 

It seems to me that something as simple as the following should solve the problem:

string line;
while( getline( input, line ) && line != "" ) {    
  int val = atoi( line.substr(0, 3 ).c_str() );
  string data = line.substr( 3, line.size() - 4 );
  char csum = line[ line.size() - 1 ];
  // process val, data and csum
}

In a real implementation you would want to add error checking, but the basic logic should remain the same.

anon
er, Neil, you copy the string. string data = line.substr....I dont want to be naggy, but it is a really BIG string in some cases.
AndreasT
@ AndreasT How big? And how much memory do you have? If you are parsing, you typically need to make copies. Or you could use iterators. You need to make your question more explicit - for example, what does the consumer of the parsed data want to see?
anon
@ AndreasT And in your original question, you said "after at most 64 byte", so the copying at the place you pointed to in your comment won't be significant.
anon
In other words, is the checksum always followed by a linefeed? Or does the linefeed occur after some multiple of 64 bytes?
gary
@Neil: since strtok() for example jumps through some hoops to avoid copying, I don't think that parsing usually means copying.
AndreasT
@AndreasT I don't think strtok can be held up as an example of good practice. In the parsers I use and write, copies are always made.
anon
A: 

Why are you concerned with copying? Is it the time overhead or the space overhead?

It sounds like you are reading all of the unparsed data into a big buffer, and now you want to write code that makes the big buffer of unparsed data look like a slightly smaller buffer of parsed data (the data minus the checksums and linefeeds), to avoid the space overhead involved in copying it into the slightly smaller buffer.

Adding a complicated layer of abstraction isn't going to help with the time overhead unless you only need a small portion of the data. And if that's the case, maybe you could just figure out which small portion you need, and copy that. (Then again, most of the abstraction layer may be already written for you, e.g. the Boost iterator library.)

An easier way to reduce the space overhead is to read the data in smaller chunks (or a line at a time), and parse it as you go. Then you only need to store the parsed version in a big buffer. (This assumes you're reading it from a file / socket / port, rather than being passed a large buffer that is not under your control.)

Another way to reduce the space overhead is to overwrite the data in place as you parse it. You will incur the cost of copying it, but you will only need one buffer. (This assumes that the 64-byte data doesn't grow when you parse it, i.e. it is not compressed.)

bk1e
I am concerned with copying because it is the single thing that will have any performance impact whatsoever. This runs on a robot and resources are generally scarce. The rest is just simple math that cannot be avoided in any case. The data comes in big chunks as I stated in the question. Thanks for your thoughts, but really, the question is about an stl solution, as generic and fast as possible. Nothing else. I already implemented this stuff, just not in any way generic or elegant. This question is motivated by my wanting to learn more tricks with the stl, not the problem itself.
AndreasT
@AndreasT, thanks for clarifying that. It's not purely STL, but have you looked into the Boost iterator library? It has iterator adapters for filtering and transforming data at the iterator level: http://www.boost.org/doc/libs/1_41_0/libs/iterator/doc/index.html
bk1e
+2  A: 

As much as I like STL, I don't think there's anything wrong with doing things manually, especially if the problem does not really fall into the cases the STL has been made for. Then again, I'm not sure why you ask. Maybe you need an STL input iterator that (checks and) discards the check sums and LF characters and emits the integers?

I assume the encoding is such that LF can only appear at those places, i.e., some kind of Base-64 or similar?

Christopher Creutzig
Why I ask: I have come to love the simplicity and generality of stl solutions once you get over the initial rise in the learning curve. I just hoped that there was some algorithm/iterator/functor combination that can cope with this data format. But as you rightly state, this seems to be a situation that the stl has not been made for.
AndreasT
Other than that, I think using a custom iterator might just be the answer that currently fits the question best. (+1) This rather adds more complexity than it solves, since the logic in the iterator is almost the same as I use for the "manual" solution + the iterator declaration overhead. It doesn't really provide any "divide and conquer" advantages, but it fits most of the requirements. Thank you.
AndreasT
A: 

You should think of your communication protocol as being layered. Treat

|<--64byte-data-->|1byte checksum|LF

as fragments to be reassembled into larger packets of contiguous data. Once the larger packet is reconstituted, it is easier to parse its data contiguously (you don't have to deal with measurements being split up across fragments). Many existing network protocols (such as UDP/IP) does this sort of reassembly of fragments into packets.

It's possible to read the fragments directly into their proper "slot" in the packet buffer. Since your fragments have footers instead of headers, and there is no out-of-order arrival of your fragments, this should be fairly easy to code (compared to copyless IP reassembly algorithms). Once you receive an "empty" fragment (the duplicate LF), this marks the end of the packet.

Here is some sample code to illustrate the idea:

#include <vector>
#include <cassert>

class Reassembler
{
public:
    // Constructs reassembler with given packet buffer capacity
    Reassembler(int capacity) : buf_(capacity) {reset();}

    // Returns bytes remaining in packet buffer
    int remaining() const {return buf_.end() - pos_;}

    // Returns a pointer to where the next fragment should be read
    char* back() {return &*pos_;}

    // Advances the packet's position cursor for the next fragment
    void push(int size) {pos_ += size; if (size == 0) complete_ = true;}

    // Returns true if an empty fragment was pushed to indicate end of packet
    bool isComplete() const {return complete_;}

    // Resets the reassembler so it can process a new packet
    void reset() {pos_ = buf_.begin(); complete_ = false;}

    // Returns a pointer to the accumulated packet data
    char* data() {return &buf_[0];}

    // Returns the size in bytes of the accumulated packet data
    int size() const {return pos_ - buf_.begin();}

private:
    std::vector<char> buf_;
    std::vector<char>::iterator pos_;
    bool complete_;
};


int readFragment(char* dest, int maxBytes, char delimiter)
{
    // Read next fragment from source and save to dest pointer
    // Return number of bytes in fragment, except delimiter character
}

bool verifyChecksum(char* fragPtr, int size)
{
    // Returns true if fragment checksum is valid
}

void processPacket(char* data, int size)
{
    // Extract measurements which are now stored contiguously in packet
}

int main()
{
    const int kChecksumSize = 1;
    Reassembler reasm(1000); // Use realistic capacity here
    while (true)
    {
        while (!reasm.isComplete())
        {
            char* fragDest = reasm.back();
            int fragSize = readFragment(fragDest, reasm.remaining(), '\n');
            if (fragSize > 1)
                assert(verifyChecksum(fragDest, fragSize));
            reasm.push(fragSize - kChecksumSize);
        }
        processPacket(reasm.data(), reasm.size());
        reasm.reset();
    }
}

The trick will be making an efficient readFragment function that stops at every newline delimiter and stores the incoming data into the given destination buffer pointer. If you tell me how you acquire your sensor data, then I can perhaps give you more ideas.

Emile Cormier
I had thought of that too. It is essentially what I meant with solution one.
AndreasT
Ah, sorry about that Andreas. I'll leave my answer here for others to laugh.. er look at for inspiration.
Emile Cormier
A: 

An elegant solution this isn't. It would be more so by using a "transition matrix", and only reading one character at a time. Not my style. Yet this code has a minimum of redundant data movement, and it seems to do the job. Minimally C++, it really is just a C program. Adding iterators is left as an exercise for the reader. The data stream wasn't completely defined, and there was no defined destination for the converted data. Assumptions noted in comments. Lots of printing should show functionality.

// convert series of 3 ASCII decimal digits to binary
// there is a checksum byte at least once every 64 bytes - it can split a digit series
// if the interval is less than 64 bytes, it must be followd by LF (to identify it)
// if the interval is a full 64 bytes, the checksum may or may not be followed by LF
// checksum restricted to a simple sum modulo 10 to keep ASCII format
// checksum computations are only printed to allowed continuation of demo, and so results can be
// inserted back in data for testing
// there is no verification of the 3 byte sets of digits
// results are just printed, non-zero return indicates error


int readData(void) {
    int binValue =  0, digitNdx = 0,  sensorCnt = 0, lineCnt = 0;
    char oneDigit;
    string sensorTxt;

    while( getline( cin, sensorTxt ) ) {
        int i, restart = 0, checkSum = 0, size = sensorTxt.size()-1;
        if(size < 0)
            break;
        lineCnt++;
        if(sensorTxt[0] == '#')
            continue;
        printf("INPUT: %s\n", &sensorTxt[0]);           // gag

        while(restart<size) {
            for(i=0; i<min(64, size); i++) {
                oneDigit = sensorTxt[i+restart] & 0xF;
                checkSum += oneDigit;
                binValue = binValue*10 + oneDigit;
                //printf("%3d-%X ", binValue, sensorTxt[i+restart]);
                digitNdx++;
                if(digitNdx == 3) {
                    sensorCnt++;
                    printf("READING# %d (LINE %d) = %d  CKSUM %d\n",
                           sensorCnt, lineCnt, binValue, checkSum);
                    digitNdx = 0;
                    binValue = 0;
                }
            }
            oneDigit = sensorTxt[i+restart] & 0x0F;
            char compCheckDigit = (10-(checkSum%10)) % 10;
            printf("    CKSUM at sensorCnt %d  ", sensorCnt);
            if((checkSum+oneDigit) % 10)
                printf("ERR got %c exp %c\n", oneDigit|0x30, compCheckDigit|0x30);
            else
                printf("OK\n");
            i++;
            restart += i;
        }
    }
    if(digitNdx)
        return -2;
    else
        return 0;
}

The data definition was extended with comments, you you can use the following as is:

# normal 64 byte lines with 3 digit value split across lines
00100200300400500600700800901001101201301401501601701801902002105
22023024025026027028029030031032033034035036037038039040041042046
# short lines, partial values - remove checksum digit to combine short lines
30449
0451
0460479
0480490500510520530540550560570580590600610620630641
# long line with embedded checksums every 64 bytes
001002003004005006007008009010011012013014015016017018019020021052202302402502602702802903003103203303403503603703803904004104204630440450460470480490500510520530540550560570580590600610620630640
# dangling digit at end of file (with OK checksum)
37
gary
<:| I am really sorry for the effort you made. But this solution is neither particularly stl'y nor avoids copying. I mentioned that the problem in general is solved. I just wanted to know if there is some clever combination of algorithms, iterators, streams and a functor to do this elegantly.
AndreasT
+1  A: 

As others have said, there is no silver bullet in stl/boost to elegantly solve your problem. If you want to parse your chunk directly via pointer arithmetic, perhaps you can take inspiration from std::iostream and hide the messy pointer arithmetic in a custom stream class. Here's a half-arsed solution I came up with:

#include <cctype>
#include <iostream>
#include <vector>
#include <boost/lexical_cast.hpp>

class Stream
{
public:
    enum StateFlags
    {
        goodbit = 0,
        eofbit  = 1 << 0,   // End of input packet
        failbit = 1 << 1    // Corrupt packet
    };

    Stream() : state_(failbit), csum_(0), pos_(0), end_(0) {}
    Stream(char* begin, char* end) {open(begin, end);}
    void open(char* begin, char* end)
        {state_=goodbit; csum_=0; pos_=begin, end_=end;}
    StateFlags rdstate() const {return static_cast<StateFlags>(state_);}
    bool good() const {return state_ == goodbit;}
    bool fail() const {return (state_ & failbit) != 0;}
    bool eof() const {return (state_ & eofbit) != 0;}
    Stream& read(int& measurement)
    {
        measurement = readDigit() * 100;
        measurement += readDigit() * 10;
        measurement += readDigit();
        return *this;
    }

private:
    int readDigit()
    {
        int digit = 0;

        // Check if we are at end of packet
        if (pos_ == end_) {state_ |= eofbit; return 0;}

        /* We should be at least csum|lf|lf away from end, and we are
            not expecting csum or lf here. */
        if (pos_+3 >= end_ || pos_[0] == '\n' || pos_[1] == '\n')
        {
            state_ |= failbit;
            return 0;
        }

        if (!getDigit(digit)) {return 0;}
        csum_ = (csum_ + digit) % 10;
        ++pos_;

        // If we are at checksum, check and consume it, along with linefeed
        if (pos_[1] == '\n')
        {
            int checksum = 0;
            if (!getDigit(checksum) || (checksum != csum_)) {state_ |= failbit;}
            csum_ = 0;
            pos_ += 2;

            // If there is a second linefeed, we are at end of packet
            if (*pos_ == '\n') {pos_ = end_;}
        }
        return digit;
    }

    bool getDigit(int& digit)
    {
        bool success = std::isdigit(*pos_);
        if (success)
            digit = boost::lexical_cast<int>(*pos_);
        else
            state_ |= failbit;
        return success;
    }

    int csum_;
    unsigned int state_;
    char* pos_;
    char* end_;
};


int main()
{
    // Use (8-byte + csum + LF) fragments for this example
    char data[] = "\
001002003\n\
300400502\n\
060070081\n\n";

    std::vector<int> measurements;
    Stream s(data, data + sizeof(data));
    int meas = 0;

    while (s.read(meas).good())
    {
        measurements.push_back(meas);
        std::cout << meas << " ";
    }

    return 0;
}

Maybe you'll want to add extra StateFlags to determine if failure is due to checksum error or framing error. Hope this helps.

Emile Cormier