I have a bunch of files (on the order of 10 per second) coming into a system, where they are stored in a database. Each file contains an entry for each of somewhere between 1 and 500 devices. A given device will appear in multiple files (but not in every file). This data eventually needs to be written to another database, organized per device. There are two different file formats.
There is an API that takes care of the final database step. It accepts several entries for a single device at once; behind the scenes it also does some lookups to find IDs in that database, so processing multiple entries for a device in one call means doing those lookups once instead of once per entry.
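For concreteness, the API looks roughly like this (these are not the real names or signatures, just the shape that matters for the question: one call per device, carrying however many entries have been collected for it):

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

// Hypothetical shape of the existing data API - not the real names,
// just the part that matters here: one call saves a batch of entries
// for a single device, so the ID lookups behind it happen once per batch.
interface DeviceDataApi {
    void saveEntries(String deviceId, List<DeviceEntry> entries);
}

// A common data object produced by parsing either file format (assumed fields).
record DeviceEntry(String deviceId, Instant timestamp, Map<String, String> values) {}
```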
To do this, I have a program with several parts (a rough sketch follows the list):
- Parse files, extracting the data into a common set of data objects.
  - This is a threaded process, with one thread per file, adding the data to a thread-safe collection.
  - As each file is loaded, its DB entry is marked as 'in progress'.
- Save the objects to the destination database.
  - This is another threaded process, which extracts all the objects for a given device and then tells the data API to save them.
  - Once the saves for all devices from a single file are successful (or if any fail), the DB entry for the original file is marked as success/failed.
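Here is a minimal sketch of that structure (using the hypothetical `DeviceDataApi` and `DeviceEntry` from above; the pool sizes and queue bound are guesses, and the bounded queue is the part that is supposed to stop the parsers from racing ahead of the savers):

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of the two stages with a bounded hand-off between them.
// The queue bound is what keeps parsed-but-unsaved data (and RAM) in check:
// when the savers fall behind, queue.put() blocks and the parser threads stall.
// All sizes are guesses to be tuned; FileRecord, parseFile and markInProgress
// stand in for the real types in my system.
class Pipeline {
    private static final int PARSER_THREADS = 4;      // guess - tune
    private static final int SAVER_THREADS  = 8;      // guess - tune
    private static final int QUEUE_CAPACITY = 50_000; // bounds parsed-but-unsaved entries

    private final BlockingQueue<DeviceEntry> queue = new ArrayBlockingQueue<>(QUEUE_CAPACITY);
    private final ExecutorService parsers = Executors.newFixedThreadPool(PARSER_THREADS);
    private final ExecutorService savers  = Executors.newFixedThreadPool(SAVER_THREADS);
    private final DeviceDataApi api;

    Pipeline(DeviceDataApi api) { this.api = api; }

    // Stage 1: one task per incoming file.
    void submitFile(FileRecord file) {
        parsers.submit(() -> {
            markInProgress(file);
            for (DeviceEntry entry : parseFile(file)) {
                try {
                    queue.put(entry);            // blocks when the queue is full = backpressure
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
    }

    // Stage 2: each saver repeatedly drains a chunk, groups it by device,
    // and makes one API call per device in that chunk.
    void startSavers() {
        for (int i = 0; i < SAVER_THREADS; i++) {
            savers.submit(() -> {
                List<DeviceEntry> batch = new ArrayList<>();
                while (!Thread.currentThread().isInterrupted()) {
                    batch.clear();
                    try {
                        batch.add(queue.take());        // wait for at least one entry
                    } catch (InterruptedException e) {
                        return;
                    }
                    queue.drainTo(batch, 5_000);        // grab whatever else is ready
                    Map<String, List<DeviceEntry>> byDevice = new HashMap<>();
                    for (DeviceEntry entry : batch) {
                        byDevice.computeIfAbsent(entry.deviceId(), k -> new ArrayList<>()).add(entry);
                    }
                    byDevice.forEach(api::saveEntries); // one lookup-heavy call per device
                }
            });
        }
    }

    // Stubs for the parts that already exist in my system.
    private void markInProgress(FileRecord file) { /* update the file's DB row */ }
    private List<DeviceEntry> parseFile(FileRecord file) { return List.of(); }
}

record FileRecord(long id, String path) {}
```

I've left out the per-file success/failure bookkeeping to keep the sketch short - in the real program each entry would need to carry its source file's ID, with a countdown per file so the file's DB row can be marked once its last entry is saved.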
My question is: what is the best way to manage when files get parsed, how many threads to use, how much RAM to allow, and so on? Points to bear in mind:
- The data API is going to take the longest - most of the time, the threads there will just be waiting for the API call to return.
- The overall efficiency of the system improves when more data is grouped per device.
- The application shouldn't run out of RAM, or have so many files parsed but waiting to be saved that the OS starts swapping.
- It's unknown how many simultaneous calls the DB API can handle, or how quickly it runs - this process needs to adapt to that.
So how do I know when to parse files so that this runs as fast as it can, without causing a performance hit by using too much RAM?
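The only mechanism I've come up with for the 'adapt to the API' part is to cap the number of in-flight API calls with a semaphore and adjust that cap from observed call latency - something like the sketch below (the thresholds and step sizes are invented; whether this is a sensible approach is part of what I'm asking):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch of one way to adapt to an API whose capacity I don't know:
// cap the number of in-flight save calls with a semaphore and nudge the cap
// up or down based on how long recent calls took. The thresholds and step
// sizes here are pure guesses.
class AdaptiveApiLimiter {

    // Semaphore.reducePermits() is protected, so expose it via a small subclass.
    private static final class ResizableSemaphore extends Semaphore {
        ResizableSemaphore(int permits) { super(permits); }
        void reduce(int n) { super.reducePermits(n); }
    }

    private static final int MAX_PERMITS = 64;
    private final ResizableSemaphore permits = new ResizableSemaphore(4); // start low
    private int target = 4;
    private long movingAvgMillis = 0;

    // Wrap every call to the data API in this.
    <T> T call(Supplier<T> apiCall) throws InterruptedException {
        permits.acquire();
        long start = System.nanoTime();
        try {
            return apiCall.get();
        } finally {
            adjust((System.nanoTime() - start) / 1_000_000);
            permits.release();
        }
    }

    private synchronized void adjust(long lastCallMillis) {
        movingAvgMillis = (movingAvgMillis * 7 + lastCallMillis) / 8; // crude moving average
        if (movingAvgMillis < 500 && target < MAX_PERMITS) {   // API keeping up: allow more
            target++;
            permits.release();
        } else if (movingAvgMillis > 2_000 && target > 1) {    // API slowing down: back off
            target--;
            permits.reduce(1);                                  // takes effect as calls finish
        }
    }
}
```

Each saver thread would then wrap its API calls, e.g. `limiter.call(() -> { api.saveEntries(deviceId, entries); return null; })`, instead of calling the API directly - but I'm not sure whether latency alone is a good enough signal to adapt on.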