I have a design question. I have a file that is several GB (between 3 and 4 GB). The file is ordered by timestamp. I am trying to figure out the best way to deal with this file.

I was thinking of reading the whole file into memory, transmitting the data to different machines, and then running my analysis on those machines.

Would it be wise to upload this into a database before running my analysis?

Just so you know, I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running the analysis, the database might get too slow.

Any ideas?

@update :

I want to process the records one by one. Basically, I am trying to run a model on timestamped data, but I have various models, so I want to distribute the work so that the whole process runs overnight every day. I want to make sure that I can easily increase the number of models without decreasing system performance, which is why I am planning to distribute the data to all the machines running the models (each machine will run a single model).

+1  A: 

Would it be wise to upload this into a database before running my analysis?

Yes.

I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running the analysis, the database might get too slow.

Don't worry about it; it will be fine. Just introduce a marker so that the rows processed by each computer are identified.
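
Just to make that marker idea concrete, here is a minimal JDBC sketch. The `events` table, its `event_time` and `processed_by` columns, the worker name, and the connection string are all hypothetical, not something from this answer:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;

    public class ClaimRows {
        public static void main(String[] args) throws Exception {
            String worker = args.length > 0 ? args[0] : "worker-01";     // this machine's marker
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://dbhost/analysis", "user", "password"); // illustrative connection
            // Mark an unprocessed time slice as belonging to this machine.
            PreparedStatement claim = con.prepareStatement(
                    "UPDATE events SET processed_by = ? " +
                    "WHERE processed_by IS NULL AND event_time >= ? AND event_time < ?");
            claim.setString(1, worker);
            claim.setTimestamp(2, Timestamp.valueOf("2010-11-01 00:00:00"));
            claim.setTimestamp(3, Timestamp.valueOf("2010-11-01 01:00:00"));
            int claimed = claim.executeUpdate();
            System.out.println(worker + " claimed " + claimed + " rows");
            con.close();
        }
    }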

I'm not sure I fully understand all of your requirements, but if you need to persist the data (refer to it more than once), then a database is the way to go. If you just need to process portions of these output files and trust the results, you can do it on the fly without storing any contents.

Only store the data you need, not everything in the files.

Beth
It sounds like the OP is concerned about performance. The database route will be, *by far*, the slowest possible implementation. Sounds like we're talking about millions upon millions of rows to insert.
Kirk Woll
Sorry, I didn't add that I want to run this every day. Would it still be wise to upload it to the database?
@user465353 Tell us more about how you need to analyze the data; it makes a big difference whether you need to look at the data set as a whole or can process lines/records one by one.
nos
I want to process the records one by one. Basically, I am trying to run a model on timestamped data, but I have various models, so I want to distribute the work so that the whole process runs overnight every day. I want to make sure that I can easily increase the number of models without decreasing system performance, which is why I am planning to distribute the data to all the machines running the models (each machine will run a single model).
+2  A: 

You can also access the file on the hard disk itself, reading a small chunk at a time. Java has a class called RandomAccessFile for this, and the same concept is available in other languages as well.
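
A minimal sketch of that chunked-reading approach, assuming the records are newline-delimited text; the file name, start offset, and chunk size are illustrative:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ChunkReader {
        public static void main(String[] args) throws IOException {
            // Read one region of a huge file without loading the whole thing into memory.
            RandomAccessFile raf = new RandomAccessFile("big-timestamped.log", "r");
            long start = 1024L * 1024 * 1024;    // byte offset to start from (illustrative)
            byte[] buffer = new byte[64 * 1024]; // 64 KB per read

            raf.seek(start);                     // jump straight to the offset
            int read = raf.read(buffer);
            raf.close();

            if (read > 0) {
                String chunk = new String(buffer, 0, read, "UTF-8");
                // The first and last lines may be partial, because the seek
                // does not land exactly on a line boundary.
                for (String line : chunk.split("\n")) {
                    // ... process one record here ...
                }
            }
        }
    }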

Whether you want to load the data into a database before doing the analysis should be governed purely by the requirements. If you can read the file and keep processing it as you go, there is no need to store it in a database. But if the analysis requires data from different areas of the file, then a database would be a good idea.

lalit
+1  A: 

You do not need the whole file in memory, just the data you need for the analysis. You can read every line and store only the needed parts of the line, plus the offset where the line starts in the file, so you can find it later if you need more data from that line.
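
A minimal sketch of that idea, assuming newline-delimited, tab-separated records with the timestamp as the first field; the file name and field layout are assumptions:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    public class OffsetIndex {
        // Keep only what the analysis needs, plus where the full line lives on disk.
        static class Record {
            final long offset;       // byte offset of the line in the file
            final String timestamp;  // the one field kept in memory
            Record(long offset, String timestamp) { this.offset = offset; this.timestamp = timestamp; }
        }

        public static void main(String[] args) throws IOException {
            String path = "big-timestamped.log";   // illustrative file name
            List<Record> index = new ArrayList<Record>();

            BufferedReader in = new BufferedReader(new FileReader(path));
            long offset = 0;
            String line;
            while ((line = in.readLine()) != null) {
                String timestamp = line.split("\t")[0];        // assumed: timestamp is the first field
                index.add(new Record(offset, timestamp));
                offset += line.getBytes("UTF-8").length + 1;   // +1 for '\n' (assumes Unix line endings)
            }
            in.close();

            // Later: re-read the full line for a record that needs a closer look.
            if (index.size() > 42) {
                RandomAccessFile raf = new RandomAccessFile(path, "r");
                raf.seek(index.get(42).offset);
                System.out.println(raf.readLine());
                raf.close();
            }
        }
    }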

codymanix
A: 

Depending on the analysis needed, this sounds like a textbook case for using MapReduce with Hadoop. It will support your requirement of adding more machines in the future. Have a look at the Hadoop wiki: http://wiki.apache.org/hadoop/

Start with the overview, get the standalone setup working on a single machine, and try doing a simple analysis on your file (e.g. start with a "grep" or something). There is some assembly required, but once you have things configured I think it could be the right path for you.
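
A minimal map-only "grep" sketch against the org.apache.hadoop.mapreduce API, assuming a reasonably recent Hadoop release (older releases use `new Job(conf, ...)` instead of `Job.getInstance`); the class names, pattern, and paths are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GrepJob {
        public static class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            private String pattern;

            @Override
            protected void setup(Context context) {
                pattern = context.getConfiguration().get("grep.pattern", "ERROR");
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Emit only the lines that contain the pattern.
                if (value.toString().contains(pattern)) {
                    context.write(value, NullWritable.get());
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("grep.pattern", args.length > 2 ? args[2] : "ERROR");
            Job job = Job.getInstance(conf, "grep");
            job.setJarByClass(GrepJob.class);
            job.setMapperClass(GrepMapper.class);
            job.setNumReduceTasks(0);                 // map-only: just filter lines
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Once the standalone setup works, something like `hadoop jar grep.jar GrepJob /input /output ERROR` would run it; the input and output paths here are placeholders.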

MikeG
A: 

I had a similar problem recently, and just as @lalit mentioned, I used RandomAccessFile against my file located on the hard disk.

In my case I only needed read access to the file, so I launched a bunch of threads, each thread starting at a different point in the file. That got the job done and really improved my throughput, since each thread could spend a good amount of time blocked while doing some processing, and meanwhile other threads could be reading the file.

A program like the one I described should be very easy to write; just try it and see if the performance is what you need.
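
A minimal sketch of that multi-threaded layout, assuming newline-delimited records and that each thread may skip the partial line at the start of its slice; the file name and thread count are illustrative:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ParallelFileReader {
        public static void main(String[] args) throws Exception {
            final String path = "big-timestamped.log";   // illustrative file name
            final int threads = 4;

            RandomAccessFile probe = new RandomAccessFile(path, "r");
            final long length = probe.length();
            probe.close();
            final long sliceSize = length / threads;

            Thread[] workers = new Thread[threads];
            for (int i = 0; i < threads; i++) {
                final long start = i * sliceSize;
                final long end = (i == threads - 1) ? length : start + sliceSize;
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        try {
                            // Each thread opens its own handle so seeks don't interfere.
                            RandomAccessFile raf = new RandomAccessFile(path, "r");
                            raf.seek(start);
                            if (start > 0) {
                                raf.readLine();   // skip the partial line at the slice boundary
                            }
                            String line;
                            while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                                // ... per-record processing goes here ...
                            }
                            raf.close();
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
        }
    }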

Abel Morelos