ansaurus

Question

Elegant way of parsing Data files for Simulation

Answer 1

A:

Take a look at Boost.Spirit. It's a decent parser framework.

edit, fixed link

Glen 2010-03-25 15:59:21

While I really like Spirit, unless you are already familiar with it, its probably overkill for this purpose.

KeithB 2010-03-25 16:08:48

Thanks Glen. It is good to know about Boost.Spirit. Since I am somewhat of a newcomer to the C++ world, I was wondering if this could be done without reliance on frameworks such as Spirit. Also, I would probably need sample code that gives me an overview of the necessary plumbing need to read the file, parse the file and move the parsed data into other data structures with minimal overhead since computation time will cut into the simulation time.

sc_ray 2010-03-25 16:13:43

Answer 2

+3 A:

Addressing the question of designing the in memory storage and linking:

You haven't told us enough. The necessary structure of your data depends on how you need to use the data.

If you are going to walk sequentially through (all or part) the data by starttime, shouldn't you be able to visit events in order by starttime? If you're going to jump into the middle of the stream sometime shouldn't you be able to search efficiently by starttime.
If you want to examine the event active during a certain interval, you also need to be able to search efficiently by endtime.
If you want to examine all the interaction of a single device you need to be able to select events by device (which you proposed structure does nicely)
... what other use cases do you have...

If you don't need the best performance possible (i.e. good performance will do) a relational DB might be in order. Or you can build in memory structures with all the characteristics you need, but they may be moderately complicated...

dmckee 2010-03-25 16:09:33

Thanks. Fortunately for me, the trace data has been arranged in a chronological fashion. As soon as I start parsing the data, the simulation triggers. As soon as I read the first line, I check the HashMap for whether the DeviceId and the interacting DeviceId exists.If not, I create two new DeviceId Objects, populate their properties and store them in the HashMap. When the simulation has run for a little while, I analyze the device interactions for any trends based on the frequency of interactions and the timespan a device spent with another device.I am "device agnostic" during my simulation.

sc_ray 2010-03-25 17:11:36

Answer 3

+1 A:

I done similar thing with interacting people. For future expandability, I would do the following: Have a Device class that holds the id and a vector of pointers Interaction objects. The Devices could be kept in a map (or hashmap) foe easy lookup. The interaction class would contain the rest of the information from the file. This will allow you to create polymorphic devices and interactions, in case you every have multiple kinds of devices or interactions. You might also want to have factories for the devices and interactions, to facilitate this.

KeithB 2010-03-25 16:16:13

Right. *Whatever* storage arrangement you use, you should keep each piece of data in *one* place and use a lot of references (or pointers) to manage your organizational needs.

dmckee 2010-03-25 16:22:34

Thanks Keith. Can you explain what is the vector of pointers Interaction Objects? Keeping the Device in a HashMap with the DeviceId as the key is a great idea. During the course of the simulation I have to look up deviceIds and their interaction history to extrapolate useful trends. Also, are you suggesting I create a separate class for the interaction part? I am assuming this Interaction object would consist of the InteractingDeviceId,InteractionStartTime and InteractionEndTime.

sc_ray 2010-03-25 17:00:39

Answer 4

+3 A:

Your approach seems to be correct given the information you've provided.

I'm assuming you'd be creating a class something like:

class device {
  public:
    int id;
    vector<interaction> interactions;
    void add_interaction(interaction add_me); // uses vector::insert
};

with

typedef struct interaction_t {
    int other_device_id;
    int start_time;
    int end_time;
} interaction;

At that point, you should be able to read in the file, one line at a time, and pull out the data.

device* pDev = NULL;
interaction new_interaction;
ifstream ifs( "data.dat" );
char temp[MAX_LINE_LENGTH+1];
int id, other_id, start, end;

while(ifs.getline(temp, MAX_LINE_LENGTH)) {
    sscanf(temp, "%i\t%i\t%i\t%i",
        &id,
        &new_interaction.other_device_id,
        &new_interaction.start_time,
        &new_interaction.end_time);
    pDev = find_device_by_id(id);
    pDev->add_interaction(new_interaction);
}

Code is untested and for illustration purposes only, but you can get the idea. The trick would be writing the find_device_by_id function (would return a pointer to the device object with a matching id field). This shouldn't require too heavy of a memory overhead per input line; if your input files are huge, you may not be able to store the data in memory and may have to store in a database instead.

bta 2010-03-25 16:58:44

Doing this you end up with two (or more for n-sided interactions) copies of every interaction. If you maintain a separate interaction list and use a `vector<interaction *>` (or use a smarter pointer), you get the same behavior without the duplication of data. Otherwise, nice.

dmckee 2010-03-25 17:20:01

If it is possible for a device index to appear in either the "DeviceID" column or the "InteractingDeviceID" column, then you are right. Dropping `interaction.other_device_id`, having a global vector of `interaction`, and having each `device` store only pointers to `interaction`s would be better. When I wrote this, I was assuming that the list of devices and the list of interacting devices were distinct (as if this data were tracking "initiator" devices and a list of the "target" devices they sent commands to).

bta 2010-03-25 17:49:42

Thanks a lot..This explains it.

sc_ray 2010-03-25 18:41:20

ansaurus

tags:

views:

answers:

Elegant way of parsing Data files for Simulation

related questions