views: 139
answers: 5

I am reading a CSV file into a list of lists in Python. It is around 100 MB right now; in a couple of years that file will grow to 2-5 GB. I am doing lots of log calculations on the data, and the 100 MB file takes the script around 1 minute to process. After the script does a lot of fiddling with the data, it creates URLs that point to Google Charts and then downloads the charts locally.

Can I continue to use Python on a 2 GB file, or should I move the data into a database?
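
For reference, a minimal sketch of the kind of pipeline described above, assuming a file named data.csv with a numeric second column; the file name, column layout, and chart URL parameters are all placeholders for illustration:

    import csv
    import math
    import urllib.request

    # Read the whole CSV into a list of lists (every value arrives as a string).
    with open("data.csv", newline="") as f:
        rows = list(csv.reader(f))

    # Example "log calculation" on the second column.
    logs = [math.log(float(row[1])) for row in rows if float(row[1]) > 0]

    # Build a Google Chart URL from the first few values and download the image.
    chart_url = ("https://chart.googleapis.com/chart?cht=lc&chs=400x200&chd=t:"
                 + ",".join("%.2f" % v for v in logs[:50]))
    urllib.request.urlretrieve(chart_url, "chart.png")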

+1  A: 

At 2 GB, you may start running up against speed issues. I work with model simulations that read hundreds of CSV files, and it takes about an hour to go through 3 iterations, or about 20 minutes per loop.

This is a matter of personal preference, but I would go with something like PostgreSQL because it combines the speed of Python with the capacity of a SQL-driven relational database. I encountered the same issue a couple of years ago when my Access db was corrupting itself and crashing on a daily basis. It was either MySQL or PostgreSQL, and I chose Postgres because of its Python friendliness. That's not to say MySQL would not work with Python, because it does, which is why I say it's personal preference.
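
As a rough illustration of what the Python side of a PostgreSQL setup can look like, a minimal sketch using the third-party psycopg2 driver; the connection parameters, table, and columns are invented for the example:

    import psycopg2  # third-party PostgreSQL driver

    # Connection parameters are placeholders.
    conn = psycopg2.connect(dbname="logs", user="me", password="secret", host="localhost")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS log_entries (
            ts    TIMESTAMP,
            name  TEXT,
            value DOUBLE PRECISION
        )
    """)
    cur.execute("INSERT INTO log_entries VALUES (%s, %s, %s)",
                ("2010-01-01 00:00:00", "cpu", 0.42))
    conn.commit()

    # Let the database do the aggregation instead of looping in Python.
    cur.execute("SELECT name, AVG(value) FROM log_entries GROUP BY name")
    for name, avg_value in cur.fetchall():
        print(name, avg_value)

    cur.close()
    conn.close()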

Hope that helps with your decision-making!

myClone
Thank you very much, this is very helpful. Can you give me an example of why Postgres is better?
I__
I wouldn't say Postgres is better than, for example, MySQL or even Oracle. For me it was cost. Postgres is open source and my database is non-commercial, so I wanted to keep things as transparent and flexible as possible. I also like PostgreSQL's interface, and from a usability standpoint it matched my learning curve.
myClone
I think duffymo's explanation covers it. Relational databases are super powerful and will handle many of the tasks you are asking Python to do. However, if you are simply interested in storage and reference, with little to no querying or calculating, a flat file may be enough. My assumption was that you were eventually going to be performing calculations and adding/changing data, which is why I recommended going with an RDBMS.
myClone
+4  A: 

I'd only put it into a relational database if:

  1. The data is actually relational and expressing it that way helps shrink the size of the data set by normalizing it.
  2. You can take advantage of triggers and stored procedures to offload some of the calculations that your Python code is performing now.
  3. You can take advantage of queries to only perform calculations on data that's changed, cutting down on the amount of work done by Python.

If none of those things is true, I don't see much difference between a database and a file. Both ultimately have to be stored on the file system.

If Python has to process all of it, and getting it into memory means loading an entire data set, then there's no difference between a database and a flat file.

2 GB of data in memory could mean page swapping and thrashing by your application. I would be careful and get some data before I blamed the problem on the file. Merely accessing the data from a database won't solve a paging problem.

If your data's flat, I see less advantage in a database, unless "flat" == "highly denormalized".

I'd recommend some profiling to see what's consuming CPU and memory before I made a change. You're guessing about the root cause right now. Better to get some data so you know where the time is being spent.
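
To make the profiling suggestion concrete, a minimal sketch using the standard library's cProfile; process_file here is a stand-in for whatever the script actually does, and memory usage would need a separate tool:

    import cProfile
    import pstats

    def process_file(path):
        # Stand-in for the real CSV parsing and log calculations.
        ...

    # Profile one run and dump the raw stats to a file.
    cProfile.run("process_file('data.csv')", "profile.out")

    # Show the 10 call sites where the most cumulative time is spent.
    stats = pstats.Stats("profile.out")
    stats.sort_stats("cumulative").print_stats(10)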

duffymo
Correct me if I am wrong, but, for example, a database would be better for a huge file that requires you to sort stuff, right?
I__
The answer depends on the file and the schema. You're correct that databases are good at sorting, but there are other considerations: indexing, number of JOINs, etc.
duffymo
It's flat. There's no relational data.
I__
Databases are often really good at sorting huge amounts of data. Sorting a big ol' list in Python would probably not be very efficient if the list doesn't fit in memory, for example. Also, indexing would allow you to search your data efficiently.
André Laszlo
@Andre - agreed, but there's no indication that the data processing has to sort or that the calculations depend on the data being in sorted order.
duffymo
@duffymo: True.
André Laszlo
That's the problem: I'm not sure what he is doing. Or even whether he needs the entire 2-4 GB file, or whether that is just the total size he expects the file to grow to and he will only want the last few MB or few hundred MB for processing...
Cervo
Also, a database will not beat a C program doing quicksort. The database is made to handle constraints and datatype checks, and will often use temporary files because it has to service multiple requests. A C program with a tight array will probably out-sort the database. Searching could also be faster in C once the data is sorted, because there is no need for index seeks/locks and everything will already be in RAM. The bigger question is the nature of the calculations. For a pure speedup I wouldn't use a DB. For better data organization and reporting later I would.
Cervo
Great points, Cervo. Thanks.
duffymo
+4  A: 

If you need to go through all lines each time you perform the "fiddling", it wouldn't really make much difference, assuming the actual "fiddling" is what's eating your cycles.

Perhaps you could store the results of your calculations somehow; then a database would probably be nice. Also, databases have methods for ensuring data integrity and stuff like that, so a database is often a great place for storing large sets of data (duh! ;)).
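
As a rough sketch of the "store the results" idea, assuming results can be keyed by something like a month or file name; the cache file name and the compute function are hypothetical:

    import json
    import os

    CACHE_PATH = "results_cache.json"

    def load_cache():
        """Read previously stored results, or start with an empty cache."""
        if os.path.exists(CACHE_PATH):
            with open(CACHE_PATH) as f:
                return json.load(f)
        return {}

    def save_cache(cache):
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)

    def get_result(key, compute):
        """Return the cached value for key, computing and storing it only if missing."""
        cache = load_cache()
        if key not in cache:
            cache[key] = compute()
            save_cache(cache)
        return cache[key]

    # Hypothetical usage: only recompute the expensive part for months not seen before.
    # monthly_total = get_result("2010-05", lambda: expensive_log_calculation(rows))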

André Laszlo
+1 for "store the results of your calculations". I'll point out that it's also possible with a file, if you choose to append the results to your file at the end of the calculation, so it's a wash.
duffymo
Yeah :) And of course a database is just some fancy algorithms and "a file" in the end. So you can reinvent the database using python if you want (it actually sounds fun...).
André Laszlo
Usually databases are written in compiled languages, and for a sort, compiled languages and Python are orders of magnitude apart. Also, sometimes databases can automatically parallelize things across processors/disks for you. But at the same time a database is mostly just another way to store the data. Unless there is some specific way you plan to take advantage of something it provides for speed, it's not going to magically make things faster. On a per-record basis, even scripting languages often beat SQL cursors.
Cervo
+2  A: 

I don't know exactly what you are doing. But a database will just change how the data is stored, and in fact it might take longer, since most reasonable databases may have constraints on columns and additional processing for the checks. In many cases, having the whole file locally and going through it doing calculations is going to be more efficient than querying and writing it back to the database (subject to disk speeds, network and database contention, etc.). But in some cases the database may speed things up, especially because, if you do indexing, it is easy to get subsets of the data.

Anyway, you mentioned logs, so before you go database crazy I have the following ideas for you to check out. I'm also not sure whether you have to keep going through every log since the beginning of time to download charts and you expect the file to grow to 2 GB, or whether you are eventually expecting 2 GB of traffic per day/week.

  1. ARCHIVING -- you can archive old logs, say every few months. Copy the production logs to an archive location and clear the live logs out. This will keep the file size reasonable. If you are wasting time scanning the file to find the small piece you need, then this will solve your issue.

  2. You might want to consider converting to Java or C. Especially on loops and calculations you might see a factor of 30 or more speedup. This will probably reduce the time immediately. But over time, as the data creeps up, some day this will slow down as well. If you have no bound on the amount of data, eventually even hand-optimized assembly by the world's greatest programmer will be too slow. But it might give you 10x the time...

  3. You also may want to think about figuring out the bottleneck (is it disk access, is it CPU time) and, based on that, figuring out a scheme to do this task in parallel. If it is processing, look into multi-threading (and eventually multiple computers); if it is disk access, consider splitting the file among multiple machines... It really depends on your situation. But I suspect archiving might eliminate the need here.

  4. As was suggested, if you are doing the same calculations over and over again, then just store them. Whether you use a database or a file this will give you a huge speedup.

  5. If you are downloading stuff and that is a bottleneck, look into conditional GETs using the If-Modified-Since header, and only download changed items (see the first sketch after this list). If you are just processing new charts, then ignore this suggestion.

  6. Oh, and if you are sequentially reading a giant log file looking for a specific place line by line, just keep another file storing the last file position you worked with, and do a seek each run (a sketch of this also follows the list).

  7. Before an entire database, you may want to think of SQLite.

  8. Finally, a "couple of years" seems like a long time in programmer time. Even if it is just 2, a lot can change. Maybe your department/division will be laid off. Maybe you will have moved on, and your boss too. Maybe the system will be replaced by something else. Maybe there will no longer be a need for what you are doing. If it was 6 months I'd say fix it, but for a couple of years, in most cases, I'd say just use the solution you have now, and once it gets too slow then look to do something else. You could make a comment in the code with your thoughts on the issue, and even send an e-mail to your boss so he knows it as well. But as long as it works and will continue doing so for a reasonable amount of time, I would consider it "done" for now. No matter what solution you pick, if the data grows unbounded you will need to reconsider it: adding more machines, more disk space, new algorithms/systems/developments. Solving it for a "couple of years" is probably pretty good.
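
A rough sketch of point 5's conditional download, sending an If-Modified-Since header so unchanged files are not fetched again; the function and its arguments are hypothetical, and whether the chart server honors the header is an assumption:

    import urllib.request
    import urllib.error

    def download_if_modified(url, dest, last_modified=None):
        """Fetch url into dest only if it changed since last_modified (an HTTP date string)."""
        req = urllib.request.Request(url)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req) as resp:
                with open(dest, "wb") as f:
                    f.write(resp.read())
                # Remember this value for the next run.
                return resp.headers.get("Last-Modified")
        except urllib.error.HTTPError as e:
            if e.code == 304:        # Not Modified: keep the local copy
                return last_modified
            raise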
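
And a sketch of point 6, remembering the last byte offset so each run only reads the lines appended since the previous one; the offset file name and the process function are arbitrary:

    import os

    OFFSET_FILE = "last_position.txt"   # arbitrary name for the bookmark file

    def read_new_lines(log_path):
        """Yield only the lines appended to log_path since the previous run."""
        start = 0
        if os.path.exists(OFFSET_FILE):
            with open(OFFSET_FILE) as f:
                start = int(f.read().strip() or 0)

        with open(log_path, "rb") as log:
            log.seek(start)                      # jump straight to where we left off
            for raw in iter(log.readline, b""):
                yield raw.decode("utf-8")
            end = log.tell()

        with open(OFFSET_FILE, "w") as f:        # remember the new position
            f.write(str(end))

    # Hypothetical usage: only the new lines are processed on each run.
    # for line in read_new_lines("app.log"):
    #     process(line)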

Cervo
Java or C is faster than Python by 30x?!
I__
Today's Great Language Shootout has the fastest program beating Python by 10x. Python is quite slow quite often.
Paul Nathan
Depending upon what you are doing it can be. Compiled languages have a big advantage for tight loops and calculations. For those types of things a 10x+ difference is not unheard of.
Cervo
@Paul Nathan - Wow. Actually, that's where I saw a 30x speed difference in some test between C and Python (although not recently). Only 10x between C and Python would be a huge improvement on Python's part...
Cervo
Actually, in those results Python sometimes loses by 85x to C. Anyway, I was being conservative with the 30, but yeah, Python is often an order of magnitude slower than C and sometimes even 100x slower for specific benchmarks. Java and C are close, within 1x-5x in most benchmarks. But still, 5x could do a week's worth of work in a day. Usually constant factors like 1 or 5 don't matter, but with a huge dataset every little bit helps...
Cervo
@Cervo: looking at the benchmarks from my link, it...uhh.... starts at 10x and just gets worse and worse. *I should brush my high-level C++ skillz up...*
Paul Nathan
+1  A: 

I always reach for a database for larger datasets.

A database gives me some stuff for "free"; that is, I don't have to code it.

  • searching
  • sorting
  • indexing
  • language-independent connections

Something like SQLite might be the answer for you.

Also, you should investigate the "nosql" databases; it sounds like your problem might fit well into one of them.
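
As a hedged sketch of the SQLite route using only the standard library; the table layout, file names, and the assumption that the CSV has three columns (timestamp, name, value) are made up for the example:

    import csv
    import sqlite3

    conn = sqlite3.connect("logs.db")
    conn.execute("CREATE TABLE IF NOT EXISTS entries (ts TEXT, name TEXT, value REAL)")

    # Bulk-load the CSV once; later runs can query instead of re-parsing the file.
    with open("data.csv", newline="") as f:
        conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", csv.reader(f))
    conn.commit()

    # An index makes lookups on large tables cheap.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_name ON entries(name)")

    # Sorting and searching happen in the database instead of in a Python list.
    for row in conn.execute(
            "SELECT name, AVG(value) FROM entries GROUP BY name ORDER BY name"):
        print(row)

    conn.close()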

Paul Nathan
What is a NoSQL DB?
I__
Also, databases give you stuff you don't ask for, like concurrency, locking, constraints, etc. Mostly you want these, but compared to a text file it is extra stuff you may not need. Definitely explore optimizing your text file, then NoSQL and SQLite solutions, and finally databases. Although I think for just a speedup a database won't help. You could probably do faster sorting on your own: 4 GB already fits into memory, so a quicksort (even two quicksorts and a merge) would probably beat a database sort.
Cervo
Err, assuming you aren't using Python to do that sort... In that case the compiled-language advantage may make even a database sort quicker than Python for large numbers of records...
Cervo
NoSQL is a category of database management systems - usually they don't have relational constraints, and often they don't have ACID properties.
Paul Nathan
@Cervo: "NoSQL" == "Not Only SQL". Look at CouchDB, Voldemort, Neo4J, Hadoop, BigTable, etc. http://nosql-database.org/
duffymo
I was thinking of some of the simpler NoSQL solutions. But generally any database comes with the whole transaction-processing/locking baggage and data integrity checking. Not all NoSQL solutions come with all that; some are more complex than others and made for handling different aspects of transactions. I was thinking more of something super simple like BDB (well, I don't think that would apply to this problem) than something like BigTable or Cassandra.
Cervo