views:

64

answers:

4

I am working on this project where I need to read in a lot of data from .dat files and use the data to perform simulations. The data in my .dat file looks as follows:

DeviceID  InteractingDeviceID InteractionStartTime InteractionEndTime
  1            2                  1101                1105

1,2 1101 and 1105 are tab delimited and it means Device 1 interacted with Device 2 at 1101 ms and ended the interaction at 1105ms.

I have a trace data sets that compile thousands of such interactions and my job is to analyze these interactions.

The first step is to parse the file. The language of choice is C++. The approach I was thinking of taking was to read the file, for every line that's read create a Device Object. This Device object will contain the property DeviceId and an array/vector of structs, that will contain a list of all the devices the given DeviceId interacted with over the course of the simulation.The struct will contain the Interacting Device Id, Interaction Start Time and Interaction End Time.

I have a two fold question here:

  1. Is my approach correct?

  2. If I am on the right track, how do I rapidly parse these tab delimited data files and create Device objects without excessive memory overhead using C++?

A push in the right direction will be much appreciated.

Thanks

A: 

Take a look at Boost.Spirit. It's a decent parser framework.

edit, fixed link

Glen
While I really like Spirit, unless you are already familiar with it, its probably overkill for this purpose.
KeithB
Thanks Glen. It is good to know about Boost.Spirit. Since I am somewhat of a newcomer to the C++ world, I was wondering if this could be done without reliance on frameworks such as Spirit. Also, I would probably need sample code that gives me an overview of the necessary plumbing need to read the file, parse the file and move the parsed data into other data structures with minimal overhead since computation time will cut into the simulation time.
sc_ray
+3  A: 

Addressing the question of designing the in memory storage and linking:

You haven't told us enough. The necessary structure of your data depends on how you need to use the data.

  • If you are going to walk sequentially through (all or part) the data by starttime, shouldn't you be able to visit events in order by starttime? If you're going to jump into the middle of the stream sometime shouldn't you be able to search efficiently by starttime.
  • If you want to examine the event active during a certain interval, you also need to be able to search efficiently by endtime.
  • If you want to examine all the interaction of a single device you need to be able to select events by device (which you proposed structure does nicely)
  • ... what other use cases do you have...

If you don't need the best performance possible (i.e. good performance will do) a relational DB might be in order. Or you can build in memory structures with all the characteristics you need, but they may be moderately complicated...

dmckee
Thanks. Fortunately for me, the trace data has been arranged in a chronological fashion. As soon as I start parsing the data, the simulation triggers. As soon as I read the first line, I check the HashMap for whether the DeviceId and the interacting DeviceId exists.If not, I create two new DeviceId Objects, populate their properties and store them in the HashMap. When the simulation has run for a little while, I analyze the device interactions for any trends based on the frequency of interactions and the timespan a device spent with another device.I am "device agnostic" during my simulation.
sc_ray
+1  A: 

I done similar thing with interacting people. For future expandability, I would do the following: Have a Device class that holds the id and a vector of pointers Interaction objects. The Devices could be kept in a map (or hashmap) foe easy lookup. The interaction class would contain the rest of the information from the file. This will allow you to create polymorphic devices and interactions, in case you every have multiple kinds of devices or interactions. You might also want to have factories for the devices and interactions, to facilitate this.

KeithB
Right. *Whatever* storage arrangement you use, you should keep each piece of data in *one* place and use a lot of references (or pointers) to manage your organizational needs.
dmckee
Thanks Keith. Can you explain what is the vector of pointers Interaction Objects? Keeping the Device in a HashMap with the DeviceId as the key is a great idea. During the course of the simulation I have to look up deviceIds and their interaction history to extrapolate useful trends. Also, are you suggesting I create a separate class for the interaction part? I am assuming this Interaction object would consist of the InteractingDeviceId,InteractionStartTime and InteractionEndTime.
sc_ray
+3  A: 

Your approach seems to be correct given the information you've provided.

I'm assuming you'd be creating a class something like:

class device {
  public:
    int id;
    vector<interaction> interactions;
    void add_interaction(interaction add_me); // uses vector::insert
};

with

typedef struct interaction_t {
    int other_device_id;
    int start_time;
    int end_time;
} interaction;

At that point, you should be able to read in the file, one line at a time, and pull out the data.

device* pDev = NULL;
interaction new_interaction;
ifstream ifs( "data.dat" );
char temp[MAX_LINE_LENGTH+1];
int id, other_id, start, end;

while(ifs.getline(temp, MAX_LINE_LENGTH)) {
    sscanf(temp, "%i\t%i\t%i\t%i",
        &id,
        &new_interaction.other_device_id,
        &new_interaction.start_time,
        &new_interaction.end_time);
    pDev = find_device_by_id(id);
    pDev->add_interaction(new_interaction);
}

Code is untested and for illustration purposes only, but you can get the idea. The trick would be writing the find_device_by_id function (would return a pointer to the device object with a matching id field). This shouldn't require too heavy of a memory overhead per input line; if your input files are huge, you may not be able to store the data in memory and may have to store in a database instead.

bta
Doing this you end up with two (or more for n-sided interactions) copies of every interaction. If you maintain a separate interaction list and use a `vector<interaction *>` (or use a smarter pointer), you get the same behavior without the duplication of data. Otherwise, nice.
dmckee
If it is possible for a device index to appear in either the "DeviceID" column or the "InteractingDeviceID" column, then you are right. Dropping `interaction.other_device_id`, having a global vector of `interaction`, and having each `device` store only pointers to `interaction`s would be better. When I wrote this, I was assuming that the list of devices and the list of interacting devices were distinct (as if this data were tracking "initiator" devices and a list of the "target" devices they sent commands to).
bta
Thanks a lot..This explains it.
sc_ray