views: 460
answers: 9

Hi,

What is the best way of storing data to a file on a network, which will later be read back in programmatically? The target platform for the program is Linux (Fedora), but it will need to write the file to a Windows (XP) machine.

This needs to be in C++. There will be a high number of write / read events, so it needs to be efficient, and the data needs to be written out in such a way that it can be read back in easily.

The whole file won't necessarily be read back in; I'll need to search for a specific block of data in the file and read just that back in.

Will a simple binary stream writer do? How should I store the data - XML?

Anything else I need to worry about?


UPDATE: To clarify, here are some answers to peterchen's points.

**Please clarify:**

* **do you only append blocks, or do you also need to remove / update them?**

I only need to append to the end of the file, but will need to search through it and retrieve from any point in it.

* **are all blocks of the same size?**

No, the data will vary in size - some will be free-text comments (like a post here), others will be specific object-like data (sets of parameters).

* **is it necessary to be a single file?**

No, but desirable

* **by which criteria do you need to locate blocks?**

By data type and by timestamp. For example, if I periodically write out a specific set of parameters, in amongst other data like free text, I want to find the value of those parameters at a certain date/time - so I'll need to search for the time I wrote out those parameters nearest that date and read them back in.

* **must the data be readable for other applications?**

No.

* **do you need concurrent access?**

Yes, I may be continuing to write as I read, but there should only ever be one write at a time.

* **Amount of data (per block / total) - kilo, mega, giga, tera?**

The amount of data per write will be low... from a few bytes to a couple of hundred bytes - the total should be no more than a few hundred kilobytes, possibly a few megabytes. (Still unsure as yet.)

> **If you need all of this, rolling your own will be a challenge; I would definitely recommend using a database. If you need less than that, please specify so we can recommend.**

A database would overcomplicate the system, so unfortunately that is not an option.

+5  A: 

Your question is too general. I would first define my needs, then a record structure for the file, and then use a textual representation to save it. Take a look at Eric Steven Raymond's data metaformats, at JSON, and maybe CSV or XML. All of peterchen's points seem relevant.

Yuval F
+1  A: 

You'll need to look at the kind of data you are writing out. Once you are dealing with objects instead of PODs, simply writing out the binary representation of the object will not necessarily result in anything that you can deserialise successfully.

If you are "only" writing out text, reading the data back in should be comparatively easy if you are writing out in the same text representation. If you are trying to write out more complex data types you'll probably need to look at something like boost::serialization.
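A hypothetical illustration of that caveat (the names and values below are made up): writing the raw bytes of a struct works while it is a POD, but as soon as it holds something like a std::string you are serialising a heap pointer rather than the text it points to.

    #include <cstdio>
    #include <string>

    // A POD: safe to write byte-for-byte (modulo padding/endianness across platforms).
    struct Sample { int id; double value; };

    // NOT a POD: std::string holds a pointer to heap memory, so writing the raw
    // bytes of this struct stores the pointer value, not the characters.
    struct Comment { int id; std::string text; };

    int main() {
        Sample  s{1, 3.14};
        Comment c{2, "hello"};

        std::FILE* f = std::fopen("data.bin", "wb");  // error handling omitted
        std::fwrite(&s, sizeof s, 1, f);  // fine: the bytes are the data
        std::fwrite(&c, sizeof c, 1, f);  // broken: a dangling pointer on disk
        std::fclose(f);
    }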

Timo Geusch
I do need to store the 'value' or 'state' of particular objects, but can't decide if serialization is the answer or if I should manually parse them down into key value pairs, or vectors of data.
Krakkos
Some of the objects have quite complex data, in that they contain other objects which contain other objects, etc.
Krakkos
If you are trying to store quite complex data, I would definitely investigate the available serialisation libraries, as they will save you a lot of headaches and late-night debugging over rolling your own.
Timo Geusch
+1  A: 

Your application sounds like it needs a database. If you can afford one, use it. But don't use an embedded database engine like SQLite with a file on network storage, since it may be too unstable for your purposes. If you still want something like it, you would have to access it through your own reader/writer process with its own access protocol. The same stability concerns apply if you use a text-based file format like XML instead, so you would have to do the same there.

I can't be certain without knowing your workload though.

artificialidiot
+3  A: 

> there will be a high number of write / read events so it needs to be efficient,

That will not be efficient.

I did a lot of timing on this back in the Win2K days, when I had to implement a program that essentially had a file copy in it. What I found was that by far the biggest bottleneck in my program seemed to be the overhead in each I/O operation. The single most effective thing I found in reducing total runtime was to reduce the number of I/O operations I requested.

I started off doing plain stream I/O, but that was no good because the stupid library was issuing an I/O for every single character. Its performance compared to the shell "copy" command was just pitiful. Then I tried writing out an entire line at a time, but that was only marginally better.

Eventually I ended up writing the program to attempt to read the entire file into memory so that in most cases there would be only 2 I/Os: one to read it in and another to write it out. This is where I saw the huge savings. The extra code involved in dealing with the manual buffering was more than made up for in less time waiting for I/Os to complete.

Of course this was 7 years or so ago, so I suppose things may be much different now. Time it yourself if you want to be sure.
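To make the buffering point concrete, here is a minimal sketch (file names are made up, error handling omitted) that reads the whole file with one I/O request and writes it back out with another:

    #include <cstdio>
    #include <vector>

    int main() {
        // Read the entire file in one request instead of char-by-char.
        std::FILE* in = std::fopen("input.dat", "rb");
        std::fseek(in, 0, SEEK_END);
        long size = std::ftell(in);
        std::fseek(in, 0, SEEK_SET);

        std::vector<char> buffer(size);
        std::fread(buffer.data(), 1, buffer.size(), in);   // one read
        std::fclose(in);

        // ... process buffer in memory ...

        std::FILE* out = std::fopen("output.dat", "wb");
        std::fwrite(buffer.data(), 1, buffer.size(), out); // one write
        std::fclose(out);
    }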

T.E.D.
A: 

Store it as binary if you're not doing text storage. Text is hideously inefficient; XML is even worse. An inefficient storage format means larger file transfers, which means more time. If you have to store text, filter it through a zip library.
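As one way of doing the "filter it through a zip library" step, here is a minimal zlib sketch (the sample text is made up; link with -lz):

    #include <zlib.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        std::string text = "free text comment... repeated text compresses well";

        // Ask zlib for the worst-case output size, then compress in one call.
        uLongf destLen = compressBound(text.size());
        std::vector<Bytef> compressed(destLen);
        int rc = compress(compressed.data(), &destLen,
                          reinterpret_cast<const Bytef*>(text.data()),
                          text.size());
        if (rc != Z_OK) return 1;

        std::printf("%zu bytes -> %lu bytes\n", text.size(), destLen);
        // Store the original size alongside the data so it can be
        // uncompress()ed later.
    }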

Your main issue is going to be file locking and concurrency. Everything starts to get grody when you have to write/read/write in a concurrent fashion. At that point, get a DB of some sort installed and BLOB the data up or something, because otherwise you'll be writing your own DB... and no one wants to reinvent that wheel (you know, unless they're starting their own DB company, or are a PhD student, or have a strange hobby...)

Paul Nathan
+1  A: 

If you are only talking about a few megabytes, I wouldn't store it on disk at all. Have a process on the network that accepts data and stores it internally, and also accepts queries on that data. If you need a record of the data, this process can also write it to disk. Note that this sounds a lot like a database, and this indeed may be the best way to do it. I don't see how this complicates the system. In fact, it makes it much easier. Just write a class that abstracts the database, and have the rest of the code use that.
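A minimal sketch of that abstraction (all names here are hypothetical): the rest of the code talks only to the interface, so whether the backing store is a flat file, an in-memory process, or a real database becomes an implementation detail.

    #include <ctime>
    #include <string>
    #include <vector>

    // Hypothetical interface: the rest of the program sees only this.
    class DataStore {
    public:
        virtual ~DataStore() {}
        // Append a typed, timestamped block of data.
        virtual void append(const std::string& type, std::time_t timestamp,
                            const std::vector<char>& payload) = 0;
        // Retrieve the block of `type` written nearest to `timestamp`.
        virtual std::vector<char> findNearest(const std::string& type,
                                              std::time_t timestamp) = 0;
    };

    // Start with a flat-file implementation...
    class FileStore : public DataStore { /* ... */ };
    // ...and swap in a database later without touching the callers.
    class PostgresStore : public DataStore { /* ... */ };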

I went through this same process myself in the past, including dismissing a database as too complicated. It started off fairly simple, but after a couple of years we had written our own, poorly implemented, buggy, hard to use database. At that point, we junked our code and moved to postgres. We've never regretted the change.

KeithB
Yes, the more I think about it, and the more I anticipate extensions, the more it sounds like a proper database is the answer. Initially I will have to implement some kind of manual file writing, but I expect I'll need to propose a database solution eventually.
Krakkos
+3  A: 

You could keep a separate index file that is read into a vector of fixed-size records:

    struct structBlockInfo
    {
        int  iTimeStamp;   // timestamp of the block
        char cBlockType;   // type of data (parameters or simple text)
        long vOffset;      // offset of the block in the real file
    };

Every time you added a new block, you would also add the corresponding entry to this vector and save it.

Now, if you wanted to read a specific block, you could search this vector, position yourself in the real file with fseek (or whatever) at the corresponding offset, and read X bytes (from this offset to the start of the next block, or to the end of the file). Then cast the buffer to something depending on the cBlockType. Examples:

    struct structBlockText
    {
        char cComment[1];   // variable-length text follows ("struct hack";
                            // standard C++ has no flexible array member)
    };

    struct structBlockValuesExample1
    {
        int iValue1;
        int iValue2;
    };

    struct structBlockValuesExample2
    {
        int  iValue1;
        int  iValue2;
        long lValue1;
        char cLittleText[1];  // variable-length text follows ("struct hack")
    };

Read some bytes...

fread(cBuffer, 1, iTotalBytes, p_File);

If it was a structBlockText...

structBlockText* p_stBlock = (structBlockText*) cBuffer;

If it was a structBlockValuesExample1...

structBlockValuesExample1* p_stBlock = (structBlockValuesExample1*) cBuffer;

Note that cBuffer can hold more than one block.
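Putting the pieces together, a lookup might look like this minimal sketch (the file names, the 'P' type tag, and the target timestamp are all made up; error handling and struct-padding concerns omitted):

    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    struct structBlockInfo      // the index record from above
    {
        int  iTimeStamp;
        char cBlockType;
        long vOffset;
    };

    int main()
    {
        // Load the whole index into memory with a single read.
        std::FILE* fIdx = std::fopen("data.idx", "rb");
        std::fseek(fIdx, 0, SEEK_END);
        long bytes = std::ftell(fIdx);
        std::fseek(fIdx, 0, SEEK_SET);
        std::vector<structBlockInfo> index(bytes / sizeof(structBlockInfo));
        std::fread(index.data(), sizeof(structBlockInfo), index.size(), fIdx);
        std::fclose(fIdx);

        // Linear scan for the 'P' (parameters) entry nearest a target time.
        const int target = 1234567;
        std::size_t best = index.size();
        for (std::size_t i = 0; i < index.size(); ++i) {
            if (index[i].cBlockType != 'P') continue;
            if (best == index.size() ||
                std::labs(index[i].iTimeStamp - target) <
                std::labs(index[best].iTimeStamp - target))
                best = i;
        }
        if (best == index.size()) return 1;   // no such block

        // The block runs from its offset to the next block's offset (or EOF),
        // since blocks are only ever appended.
        std::FILE* fDat = std::fopen("data.dat", "rb");
        std::fseek(fDat, 0, SEEK_END);
        long end = (best + 1 < index.size()) ? index[best + 1].vOffset
                                             : std::ftell(fDat);
        std::vector<char> cBuffer(end - index[best].vOffset);
        std::fseek(fDat, index[best].vOffset, SEEK_SET);
        std::fread(cBuffer.data(), 1, cBuffer.size(), fDat);
        std::fclose(fDat);
        // ... cast cBuffer.data() according to cBlockType, as shown above ...
    }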

João Augusto
I would recommend a similar solution. Some provisions may be needed for concurrency (atomic update of the index AFTER data was appended), and of course automatic padding should be disabled, and manual padding used instead.
peterchen
Yes, the main point is that since he needs efficiency, it's always faster to read X bytes at a time than to read an int, then another int, and so on... Also, I can't think of another way of searching the file without some auxiliary file, since the blocks aren't all of the same type.
João Augusto
This is a very good idea, thanks... I hadn't thought of using a separate index here... :) Some of the 'blocks' are quite complex data types, but that's another post ;)
Krakkos
+1  A: 

This is what I have for reading/writing of data:

#include <fstream>

template<class T>
int write_pod( std::ofstream& out, const T& t )
{
    out.write( reinterpret_cast<const char*>( &t ), sizeof( T ) );
    return sizeof( T );
}

template<class T>
void read_pod( std::ifstream& in, T& t )
{
    in.read( reinterpret_cast<char*>( &t ), sizeof( T ) );
}

This doesn't work for vectors, deques, etc., but it is easy to handle them by simply writing out the number of items followed by the data:

struct object {
    std::vector<small_objects> values;

    template <class archive>
    void deserialize( archive& ar ) {
        size_t size;
        read_pod( ar, size );
        values.resize( size );
        for ( size_t i = 0; i < size; ++i ) {
            values[i].deserialize( ar );
        }
    }
};

Of course you will need to implement the serialize & deserialize functions, but they are easy to write...
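For symmetry, the matching serialize might look like this sketch (same assumptions as the snippet above: small_objects provides its own serialize/deserialize pair):

    template <class archive>
    void serialize( archive& ar ) {
        size_t size = values.size();
        write_pod( ar, size );           // item count first...
        for ( size_t i = 0; i < size; ++i ) {
            values[i].serialize( ar );   // ...then each item
        }
    }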

graham.reeds
This will work as long as the writer and reader are on the same system, compiled with the same version of the compiler with the same options. If your intent is to have the data readable in the long-term, or portable between machines, eventually this will come back to bite you.
KeithB
+1  A: 

I would check out the Boost Serialization library.

One of their examples is:

#include <fstream>

// include headers that implement a archive in simple text format
#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>

/////////////////////////////////////////////////////////////
// gps coordinate
//
// illustrates serialization for a simple type
//
class gps_position
{
private:
    friend class boost::serialization::access;
    // When the class Archive corresponds to an output archive, the
    // & operator is defined similar to <<.  Likewise, when the class Archive
    // is a type of input archive the & operator is defined similar to >>.
    template<class Archive>
    void serialize(Archive & ar, const unsigned int version)
    {
        ar & degrees;
        ar & minutes;
        ar & seconds;
    }
    int degrees;
    int minutes;
    float seconds;
public:
    gps_position(){};
    gps_position(int d, int m, float s) :
        degrees(d), minutes(m), seconds(s)
    {}
};

int main() {
    // create and open a character archive for output
    std::ofstream ofs("filename");

    // create class instance
    const gps_position g(35, 59, 24.567f);

    // save data to archive
    {
        boost::archive::text_oarchive oa(ofs);
        // write class instance to archive
        oa << g;
        // archive and stream closed when destructors are called
    }

    // ... some time later restore the class instance to its original state
    gps_position newg;
    {
        // create and open an archive for input
        std::ifstream ifs("filename");
        boost::archive::text_iarchive ia(ifs);
        // read class state from archive
        ia >> newg;
        // archive and stream closed when destructors are called
    }
    return 0;
}
Yes, thanks, several people have mentioned the Boost libraries - I'll check them out :)
Krakkos