I have a task which I know how to code (in C#), but I know a simple implementation will not meet ALL my needs. So I am looking for tricks which might meet them all.

  1. I am writing a simulation involving N entities interacting over time.

  2. N will start at around 30 and grow into many thousands.

    a. The number of entities will change during the course of the simulation.
    

    b. I expect this will require each entity to have its own trace file.

  3. Each entity has a minimum of 20 parameters, and up to millions, that I want to track over time.

    a. This will most likely mean we can't keep all values in memory at all times; keeping some subset in memory should be fine.

    b. The number of parameters per entity will initially be fixed, but I can think of some tests which would have the number of parameters slowly changing over time.

  4. The simulation will last for millions of time steps, and I need to keep every value of every parameter.

  5. What I will be using these traces for:

    a. Plotting a configurable subset of the parameters over a fixed window of time, from the current time step back into the past.

    i. Normally on the order of 300 time steps.
    
    
    ii. These plots are in real time while the simulation is running.
    

    b. I will be using these traces to re-play the simulation, so I need to quickly access all the parameters at a given time step so I can quickly move to different times in the simulation.

    i. This requires that the values be stored in file(s) which can be inspected/loaded after restarting the software.
    
    
    ii. Using a database is NOT an option.
    

    c. I will be using the parameters for follow-up analysis which I can't define up front, so a more flexible system is desirable.

My initial thought:

  1. One class per entity which holds all the parameters.

  2. Backed by a memory mapped file.

  3. Only a fixed, but moving, window of the file is mapped into main memory.

  4. A second memory-mapped file which holds time indexes into the main file for quicker access while re-playing the simulation. This may be very important because each entity file will represent a different time slice of the full simulation. (A rough sketch of this design follows.)
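
For illustration, a rough C# sketch of this design, assuming one trace file per entity with a fixed number of float parameters per step; all names are hypothetical and bounds checks are omitted:

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class EntityTrace : IDisposable
{
    readonly int paramCount;
    readonly long recordSize;            // bytes per time step
    readonly MemoryMappedFile file;
    MemoryMappedViewAccessor window;     // only a moving slice is mapped
    long windowFirstStep;
    const long WindowSteps = 1024;       // how many steps to keep mapped

    public EntityTrace(string path, int paramCount, long maxSteps)
    {
        this.paramCount = paramCount;
        recordSize = sizeof(float) * paramCount;
        file = MemoryMappedFile.CreateFromFile(
            path, FileMode.OpenOrCreate, null, recordSize * maxSteps);
        MapWindow(0);
    }

    // Re-map the view so it starts at the given step (slides the window).
    void MapWindow(long firstStep)
    {
        if (window != null) window.Dispose();
        windowFirstStep = firstStep;
        window = file.CreateViewAccessor(
            firstStep * recordSize, WindowSteps * recordSize);
    }

    public void Write(long step, float[] values)
    {
        if (step < windowFirstStep || step >= windowFirstStep + WindowSteps)
            MapWindow(step);
        window.WriteArray((step - windowFirstStep) * recordSize, values, 0, paramCount);
    }

    public void Read(long step, float[] into)
    {
        if (step < windowFirstStep || step >= windowFirstStep + WindowSteps)
            MapWindow(step);
        window.ReadArray((step - windowFirstStep) * recordSize, into, 0, paramCount);
    }

    public void Dispose()
    {
        window.Dispose();
        file.Dispose();
    }
}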

+1  A: 

Just for the memory part...

1. You can save the data as an XElement (sorry for not knowing much about LINQ), which holds the XML structure.

2. Hold a record counter.

After n records, save the XElement to an XML file (data1.xml, ..., dataN.xml).

It can be a perfect log for any parameters you have, with any structure you like:

<run>
  <step id="1">
     <param1 />
     <param2 />
     <param3 />
  </step>
  .
  .
  .
  <step id="N">
     <param1 />
     <param2 />
     <param3 />
  </step>
</run>

This way your memory stays free and the data remains accessible. You don't have to think too much about DB issues, and it's pretty amazing what LINQ can do for you... just open the correct XML log file...
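
For illustration, a minimal sketch of this rolling XElement log; the file-naming scheme and parameter list are hypothetical:

using System.Xml.Linq;

class XmlTraceLog
{
    XElement run = new XElement("run");
    int recordCount;
    int fileCount;
    const int RecordsPerFile = 10000;   // the "n records" above

    // One <step> element per time step, matching the layout shown above.
    public void LogStep(int stepId, double param1, double param2, double param3)
    {
        run.Add(new XElement("step", new XAttribute("id", stepId),
            new XElement("param1", param1),
            new XElement("param2", param2),
            new XElement("param3", param3)));

        if (++recordCount == RecordsPerFile)
        {
            fileCount++;
            run.Save("data" + fileCount + ".xml");   // data1.xml, data2.xml, ...
            run = new XElement("run");
            recordCount = 0;
        }
    }
}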

I thought of something similar to this, but I can't see how this works with the playback requirements; plus, the overhead of the XML elements could double the size of these files. Since I expect these files to be a few gigabytes each, this seems too wasteful. Taking the full set of requirements into account is the hard part, and is why I am asking this question. I am open to any thoughts and discussions to help work this out. Thanks.
Jim Kramer
Millions of steps... It will take you ages to read this XML.
romkyns
+3  A: 

I would start with SQLite. SQLite is essentially a binary file format plus a library that lets you query it conveniently and quickly. It is not really like a database server, in that you can run it on any machine with no installation whatsoever.

I strongly recommend against XML, given the requirement of millions of steps, potentially with millions of parameters.
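
For illustration, a minimal sketch of per-step logging into SQLite from C#, assuming the third-party System.Data.SQLite provider; the schema and names are hypothetical:

using System.Data.SQLite;

class SqliteTraceDemo
{
    public static void Main()
    {
        using (var conn = new SQLiteConnection("Data Source=trace.db"))
        {
            conn.Open();
            new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS trace " +
                "(entity INTEGER, step INTEGER, param INTEGER, value REAL)",
                conn).ExecuteNonQuery();

            // Wrapping the per-step inserts in a transaction is essential
            // for write speed with SQLite.
            using (var tx = conn.BeginTransaction())
            using (var cmd = new SQLiteCommand(
                "INSERT INTO trace VALUES (@entity, @step, @param, @value)", conn))
            {
                cmd.Parameters.AddWithValue("@entity", 1);
                cmd.Parameters.AddWithValue("@step", 42);
                cmd.Parameters.AddWithValue("@param", 0);
                cmd.Parameters.AddWithValue("@value", 3.14);
                cmd.ExecuteNonQuery();
                tx.Commit();
            }
        }
    }
}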

EDIT: Given the sheer amount of data involved, SQLite may well end up being too slow for you. Don't get me wrong, SQLite is very fast, but it won't beat seeks & reads, and it looks like your use case is such that basic binary IO is rather appropriate.

If you go with the binary IO method, you should expect some moderately involved coding, and the absence of such niceties as your file staying in a consistent state if the application dies halfway through (unless you code for this specifically, that is).

romkyns
I'd go with SQL CE instead - same usage, but more functional and accessible from C#.
codekaizen
Fair point. SQLCE may be better in terms of toolchain. It's a little harder to redistribute though - I believe the standard way is by redistributing the (free) MSI.
romkyns
Interesting thought. I just did a back-of-the-envelope calculation: each entity will have value logs totaling about 1.5 GB, and with just 4 entities that would exceed the 4 GB limit of SQL CE. I did a quick search for the limits of SQLite and could not find them; I will do more searching. This was an idea I had not thought about, and it is the kind of information I need.
Jim Kramer
http://www.sqlite.org/whentouse.html suggests (under "Very large datasets") that in the default build, the max file size is 2TB. Also, using numbers from http://www.sqlite.org/limits.html, the absolute upper limit appears to be 32TB, but you'd need to compile it yourself to get this.
romkyns
Coding complexity is not an issue. The range of requirements, and the desire not to pick a wrong solution, is the reason I have asked this question. So far this discussion still leads me to a solution similar to my original thoughts.
Jim Kramer
+2  A: 

KISS -- just write a logfile for each entity, and at each time slice write out every parameter in a specified order (so you don't double the size of the logfile by adding parameter names). You can have a header in each logfile if you want to specify the parameter names of each column and the identity of the entity.

If there are many parameter values that will remain fixed or slowly changing during the course of the simulation, you can write these out to another file that encodes only changes to parameter values rather than every value at each time slice.

You should probably synchronize the logging so that each log entry is written out with the same time value. Rather than coordinate through a central file, just make the first value in each line of the file the time value.

Forget about a database - too slow and too much overhead for simulation-replay purposes. For replaying a simulation, you simply need sequential access to each time slice, which is most efficiently and quickly implemented by reading in the lines of the files one by one.

For the same reason - speed and space efficiency - forget XML.
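
For illustration, a minimal sketch of such a per-entity logfile writer: a header naming the entity and the column order, then one line per time slice with the time value first (all names are hypothetical):

using System.IO;

class EntityLog
{
    readonly StreamWriter writer;

    public EntityLog(string path, string entityId, string[] paramNames)
    {
        writer = new StreamWriter(path);
        writer.WriteLine("# entity: " + entityId);
        writer.WriteLine("# columns: time " + string.Join(" ", paramNames));
    }

    // One line per time slice; the time value comes first, so no central
    // coordination file is needed.
    public void LogSlice(long time, double[] values)
    {
        writer.Write(time);
        foreach (double v in values)
        {
            writer.Write(' ');
            writer.Write(v);
        }
        writer.WriteLine();
    }

    public void Close()
    {
        writer.Close();
    }
}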

Larry Watanabe
These were my general thoughts too, but I am just not sure because of the wide range of requirements. Yes, the logging will be synchronized no matter which method I use. Secondly, almost every value will change during each step.
Jim Kramer
If you go this way, I think you're better off with a binary file and fixed width entries, rather than lines of text. That will make seeking a trivial task (and will be much more efficient space-wise). It seems from your description that fixed-width entries are no problem.
romkyns
I expect to write out raw binary copies of the floating/integer values.
Jim Kramer
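
For illustration, with raw binary values of fixed width, seeking to an arbitrary time step during replay reduces to a single multiplication; a hypothetical sketch:

using System.IO;

class BinaryTraceReader
{
    readonly FileStream stream;
    readonly BinaryReader reader;
    readonly int paramCount;
    readonly long recordSize;

    public BinaryTraceReader(string path, int paramCount)
    {
        this.paramCount = paramCount;
        recordSize = sizeof(double) * paramCount;   // fixed width per time step
        stream = new FileStream(path, FileMode.Open, FileAccess.Read);
        reader = new BinaryReader(stream);
    }

    // Random access for replay: compute the record offset directly, so no
    // separate index file is needed when the record width is fixed.
    public double[] ReadStep(long step)
    {
        stream.Seek(step * recordSize, SeekOrigin.Begin);
        double[] values = new double[paramCount];
        for (int i = 0; i < paramCount; i++)
            values[i] = reader.ReadDouble();
        return values;
    }
}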
Binary will be more efficient, but if you go this route, write a tool that will let you convert these files to text form. Then you can just pipe it to e.g. "more" or whatever. This will aid in debugging.
Larry Watanabe
The wide-range requirements are probably there because the actual requirements are unknown and people are erring on the safe side. When you actually see the data, that will be the time to start optimizing - not before. "Premature optimization is the root of all evil".
Larry Watanabe
No, the requirements are known and are not just erring on the safe side. While it is true that the first uses of these requirements will come nowhere near the limits specified, the final uses will, and in fact the limits may be understated.
Jim Kramer