views: 45
answers: 1

Hey everyone,

I am writing an application that parses a large file, generates a large amount of data, and does some complex visualization with it. Since all this data can't be kept in memory, I did some research and I'm starting to consider embedded databases as a temporary container for it.

My question is: is this an established way of solving this problem? And is an embedded database supposed to (besides structuring the data) manage it by keeping only a subset in memory, like a cache, while the rest stays on disk? Thank you.

Edit: to clarify: I am writing a desktop application. Its input will be a file hundreds of MB in size. After reading the file, the application will generate a large number of graphs to be visualized. Since the graphs may have a very large number of nodes, they may not fit into memory. Should I save them into an embedded database that takes care of keeping only the relevant data in memory (do embedded databases do that?), or should I write my own module that does that?
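
To make the question more concrete, here is roughly the usage pattern I have in mind (a rough sketch using SQLite through Python's sqlite3 module; the table layout and the idea of fetching nodes by ID range are placeholders, not my actual design):

    import sqlite3

    conn = sqlite3.connect("graphs.db")   # on-disk file, not an in-memory structure
    conn.execute("""CREATE TABLE IF NOT EXISTS nodes (
                        graph_id INTEGER,
                        node_id  INTEGER,
                        payload  TEXT,
                        PRIMARY KEY (graph_id, node_id))""")

    # while parsing the input file, stream rows in instead of building the
    # whole graph in memory
    def add_node(graph_id, node_id, payload):
        conn.execute("INSERT OR REPLACE INTO nodes VALUES (?, ?, ?)",
                     (graph_id, node_id, payload))

    # for visualization, fetch only the part of one graph that is on screen
    def nodes_for_view(graph_id, first_id, last_id):
        cur = conn.execute("SELECT node_id, payload FROM nodes "
                           "WHERE graph_id = ? AND node_id BETWEEN ? AND ?",
                           (graph_id, first_id, last_id))
        return cur.fetchall()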

+2  A: 

Tough question - but I'll share my experience and let you decide if it helps.

If you need to retain the output from processing the source file, and you use that to produce multiple views of the derived data, then you might consider using an embedded database. The reasons to use an embedded database (IMHO):

  • To take advantage of RDBMS features (ACID, relationships, foreign keys, constraints, triggers, aggregation...) - see the sketch after this list
  • To make it easier to export the data in a flexible manner
  • To enable access to your processed data to external clients (known format)
  • To allow more flexible transformation of the data when preparing for viewing
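
To make the first point concrete, here is a minimal sketch (SQLite through Python's sqlite3 module; the schema is invented purely for illustration) of letting the engine enforce relationships and constraints on the derived data:

    import sqlite3

    conn = sqlite3.connect("derived.db")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked to
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS graphs (
            graph_id INTEGER PRIMARY KEY,
            name     TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS nodes (
            node_id  INTEGER PRIMARY KEY,
            graph_id INTEGER NOT NULL REFERENCES graphs(graph_id),
            weight   REAL CHECK (weight >= 0)
        );
    """)
    # bad rows (an orphaned graph_id, a negative weight) are now rejected by the
    # engine instead of silently polluting the derived data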

Factors which you should consider when making the decision:

  • What are the target platform(s) (Windows, Linux, Android, iPhone, PDA)?
  • What technology base? (Java, .Net, C, C++, ...)
  • What resource constraints are expected or need to be designed for? (RAM, CPU, HD space)
  • What operational behaviours do you need to take into account (connected to network, disconnected)?

On a typical modern desktop there is enough spare capacity to handle most operations. On eeePCs, PDAs, and other portable devices, maybe not. On embedded devices, very likely not. The language you use may have built-in features to help with memory management - maybe you can take advantage of those. The connectivity aspect (stateful / stateless / etc.) may also affect how much you really need to keep in memory at any given point.

If you are dealing with really big files, then you might consider a streaming approach so that only a small portion of the overall data is in memory at a time - but that doesn't really mean you should (or shouldn't) use an embedded database. Straight text or binary files could work just as well (record based, column based, line based... whatever).
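
A rough sketch of the streaming idea (the file names and the per-record transform are placeholders):

    # read the source a line at a time, transform it, and append the result to a
    # flat derived file, so memory use stays roughly constant regardless of size
    def transform(line):
        return line.upper()        # placeholder for the real per-record processing

    with open("big_input.txt") as src, open("derived.txt", "w") as dst:
        for line in src:           # the file object itself streams line by line
            dst.write(transform(line))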

Some database engines give you more effective ways to interact with the data once it is stored - it depends on the engine. I find that if your base files (by which I mean the files you generate initially from the original source) require a lot of aggregation, then an RDBMS engine can be very helpful in simplifying your logic. Another option is to build your base transform and then add further steps that process it into other temporary stores for each specific view, which are in turn processed for rendering to the target (report?) format.
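
For example, a single GROUP BY over the stored rows replaces a fair amount of hand-written accumulation code (SQLite again, reusing the invented nodes table from the sketch above):

    import sqlite3

    conn = sqlite3.connect("derived.db")
    # one aggregate query instead of hand-rolled per-graph accumulators
    for graph_id, node_count, avg_weight in conn.execute(
            "SELECT graph_id, COUNT(*), AVG(weight) FROM nodes GROUP BY graph_id"):
        print(graph_id, node_count, avg_weight)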

Just a stream-of-consciousness response - hope that helps a little.

Edit:

Per your further clarification, I'm not sure an embedded database is the direction you want to take. You either need to make some simplifying assumptions for rendering your graphs or investigate methods like segmentation (render one section of the graph, cache the output, then render the next section).
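
A very rough sketch of what I mean by segmentation (the segment size, the query, and the render/cache step are all placeholders):

    import sqlite3

    SEGMENT_SIZE = 10000

    def render_segment(rows, index):
        # placeholder: draw these rows and write the result to a cache file
        with open("segment_%d.cache" % index, "w") as out:
            out.write("%d nodes rendered\n" % len(rows))

    conn = sqlite3.connect("derived.db")
    index = 0
    while True:
        rows = conn.execute(
            "SELECT node_id, weight FROM nodes ORDER BY node_id LIMIT ? OFFSET ?",
            (SEGMENT_SIZE, index * SEGMENT_SIZE)).fetchall()
        if not rows:                      # every segment has been rendered
            break
        render_segment(rows, index)       # only one segment is in memory at a time
        index += 1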

AJ