I am an occasional Python programmer who has so far only worked with MySQL or SQLite databases. I am the computer person for everything in a small company, and I have started a new project where I think it is about time to try new databases.

The sales department makes a CSV dump every week, and I need to make a small scripting application that allows people from other departments to combine the information, mostly by linking the records. I have all of this solved; my problem is the speed. I am using plain text files for all of it and, unsurprisingly, it is very slow.

I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with big amounts of data in a decent time.

Update: I think I was not being very detailed about my database usage and thus explained my problem badly. I am reading all the data, ~900 MB or more, from a CSV into a Python dictionary and then working with it. My problem is storing and, mostly, reading the data quickly.
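Simplified, the loading step looks something like this (the file and column names here are made up for illustration):

    # Simplified sketch of the current approach; file and column names are made up.
    import csv

    def load_dump(path):
        data = {}
        with open(path, 'rb') as f:              # Python 2; use newline='' on Python 3
            for row in csv.DictReader(f):
                data[row['record_id']] = row     # one dictionary entry per CSV record
        return data

    sales = load_dump('weekly_dump.csv')         # ~900 MB file, currently takes hours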

Many thanks!

+11  A: 

You probably do need a full relational DBMS, if not right now, then very soon. If you start now, while your problems and data are simple and straightforward, then when they become complex and difficult you will have plenty of experience with at least one DBMS to help you. You probably don't need MySQL on all desktops; you might, for example, install it on a server and feed data out over your network. But you perhaps need to provide more information about your requirements, toolset and equipment to get better suggestions.

And, while the other DBMSes have their strengths and weaknesses too, there's nothing wrong with MySQL for large and complex databases. I don't know enough about SQLite to comment knowledgeably about it.

EDIT: @Eric, from your comments on my answer and the other answers I hold even more strongly the view that it is time you moved to a database. I'm not surprised that trying to do database operations on a 900 MB Python dictionary is slow. I think you have to first convince yourself, then your management, that you have reached the limits of what your current toolset can cope with, and that future developments are threatened unless you rethink matters.

If your network really can't support a server-based database, then (a) you really need to make your network robust, reliable and performant enough for such a purpose, but (b) if that is not an option, or not an early option, you should be thinking along the lines of a central database server passing out digests/extracts/reports to the other users, rather than a simultaneous, full RDBMS working in a client-server configuration.

The problems you are currently experiencing are problems of not having the right tools for the job. They are only going to get worse. I wish I could suggest a magic way in which this is not the case, but I can't and I don't think anyone else will.

High Performance Mark
Sadly, the server installation option is not available; the network (wifi) is quite flaky here. Thanks for your reply!
Eric
+1  A: 

Here is a performance benchmark of different database suites -> Database Speed Comparison

I'm not sure how objective the above comparison is, though, seeing as it's hosted on sqlite.org. SQLite only seems to be a bit slower when dropping tables; otherwise you shouldn't have any problems using it. Both SQLite and MySQL seem to have their own strengths and weaknesses: in some tests one is faster than the other, and in other tests the reverse is true.

If you've been experiencing lower than expected performance, perhaps it is not SQLite that is causing it. Have you done any profiling or otherwise made sure nothing else is causing your program to misbehave?

EDIT: Updated with a link to a slightly more recent speed comparison.

jimka
No, I have not done any profiling; I am an occasional programmer. I will look at it and at the database speed comparison. Thanks for your answer!
Eric
That comparison page is useless. They even start off with the disclaimer "The numbers here are old enough to be nearly meaningless"
JimB
@JimB, You're right, fixed now.
jimka
+1  A: 

Have you done any benchmarking to confirm that it is the text files that are slowing you down? If you haven't, there's a good chance that tweaking some other part of the code will speed things up so that it's fast enough.
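For example, a quick way to check where the time goes (assuming the loading code lives in a function, here called load_dump as a placeholder) would be:

    # Minimal profiling sketch; load_dump and the file name are placeholders
    # for whatever currently reads the CSV into the dictionary.
    import cProfile
    import pstats

    cProfile.run("load_dump('weekly_dump.csv')", 'load.prof')
    stats = pstats.Stats('load.prof')
    stats.sort_stats('cumulative').print_stats(20)   # show the 20 most expensive calls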

Simon Callan
Fair question. At some point I need to read a 900 MB CSV file into a Python dictionary; it takes a lot of time (~hours) and hangs up the whole system. Thanks for your answer!
Eric
@Eric: That's not sensible. It should read in minutes. Can you please provide profiling data?
S.Lott
@S.Lott: @jimka already told me in his/her answer to look at profiling. I have no idea how to do it, but I am looking at some documentation about it at the moment. Any tips welcome! Thanks!
Eric
A: 

If you have that problem with a CSV file, maybe you can just pickle the dictionary and generate a "binary" pickle file with the pickle.HIGHEST_PROTOCOL option. It can be faster to read and you get a smaller file. You can load the CSV file once, then generate the pickled file, allowing faster loads on subsequent accesses.
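A minimal sketch of that idea (the dictionary and file names are placeholders):

    # One-time conversion: pickle the dictionary ('data') already built from the
    # CSV, so later runs load the binary pickle instead of re-parsing the CSV.
    import cPickle as pickle          # plain 'pickle' on Python 3

    with open('weekly_dump.pkl', 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

    # Subsequent accesses:
    with open('weekly_dump.pkl', 'rb') as f:
        data = pickle.load(f)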

Anyway, with 900 MB of information, you're going to spend some time loading it into memory. Another approach is not to load it into memory in one step, but to load only the information when it is needed, maybe splitting it into different files by date or by any other category (company, type, etc.).

Khelben
Thanks! It can be an interesting workaround, easy to do while I get a better solution!
Eric
I'm not sure this is such a good idea. Correct me if I'm wrong, but Pickle isn't a streaming format, so in order to generate a Pickle file, you'd first have to load *all* 900 megs into Python's memory before saving the Pickle file. Instead, I'd recommend making a CSV dump, which is natively supported by the mysqldump tool and the "SELECT * INTO OUTFILE" query syntax, and then running this output through gzip to compress it.
Chris S
What I liked about this idea is that none of my desktop users would dare to try opening the 900 MB CSV file with Excel anymore (they love trying that!). It was easy and quick to test, but I did not get any improvement.
Eric
+1  A: 

It sounds like each department has their own feudal database, and this implies a lot of unnecessary redundancy and inefficiency.

Instead of transferring hundreds of megabytes to everyone across your network, why not keep your data in MySQL and have the departments upload their data to the database, where it can be normalized and accessible by everyone?

As your organization grows, having completely different departmental databases that are unaware of each other, and contain potentially redundant or conflicting data, is going to become very painful.

Chris S
Hi Chris,"It sounds like each department has their own feudal database, and this implies a lot of unnecessary redundancy and inefficiency."You exactly got it here, sadly I can not change this without upper management support. So I still have this problem to solve :)
Eric
@Eric -- your comment on @Chris's answer makes it clear that you are looking for a technical solution to what is fundamentally an organisational or management problem. Only broken dreams and bitter tears lie ahead.
High Performance Mark
Dear @High Performance Mark: I am aware of the organisational problems; sadly, I still need to find the best approach with the tools I have.
Eric
@Eric, I feel your pain. A bad solution now is often better than the perfect solution never.
Chris S
+1  A: 

Does the machine this process runs on have sufficient memory and bandwidth to handle this efficiently? Putting MySQL on a slow machine and recoding the tool to use MySQL rather than text files could potentially be far more costly than simply adding memory or upgrading the machine.

ecounysis
A: 

Take a look at mongodb.

asdfsadf
Thanks, this was the kind of answer I was waiting for; however, it would be nice if you could elaborate on why mongodb is a good idea.
Eric
+5  A: 

Quick Summary

  • You need enough memory (RAM) to solve your problem efficiently. I think you should upgrade your memory. When reading the excellent High Scalability Blog you will notice that big sites, in order to solve their problems efficiently, store the complete problem set in memory.
  • You do need a central database solution. I don't think doing this by hand with Python dictionaries alone will get the job done.
  • How to solve "your problem" depends on your queries. What I would try first is putting your data in elasticsearch (see below) and querying the database (to see how it performs). I think this is the easiest way to tackle your problem, but as you can read below there are a lot of ways to tackle it.

We know:

  • You use Python as your programming language.
  • Your database is ~900 MB (I think that's pretty large, but absolutely manageable).
  • You have loaded all the data into a Python dictionary. This is where I assume the problem lies. Python tries to store the dictionary (and Python dictionaries aren't the most memory-friendly structures) in your memory, but you don't have enough memory (how much memory do you have?). When that happens you rely heavily on virtual memory: when you access the dictionary you are constantly swapping data from your disk into memory, and this swapping causes "thrashing". I am assuming that your computer does not have enough RAM. If so, I would first upgrade your memory with at least 2 GB of extra RAM. When your problem set is able to fit in memory, solving the problem is going to be a lot faster. I opened my computer architecture book, where it says (in the chapter on the memory hierarchy) that main memory access time is about 40-80 ns while disk access time is about 5 ms. That is a BIG difference. A rough way to check the dictionary's memory footprint is sketched right after this list.
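As a rough sketch of such a check on a Unix machine (load_dump and the file name are placeholders for your existing loading code):

    # Rough check (Unix only) of how much resident memory the dictionary costs.
    # load_dump and the file name stand in for the existing loading code.
    import resource

    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    data = load_dump('weekly_dump.csv')
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    # ru_maxrss is reported in kilobytes on Linux (bytes on Mac OS X).
    print('dictionary added roughly %d MB' % ((after - before) / 1024))

If that figure is close to or above your physical RAM, the machine will thrash.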

Missing information

  • Do you have a central server? You should use/have a server.
  • What kind of operating system does your server run? Linux/Unix/Windows/Mac OS X? In my opinion your server should run Linux/Unix/Mac OS X.
  • How much memory does your server have?
  • Could you specify your data set (CSV) a little better?
  • What kind of data mining are you doing? Do you need full-text-search capabilities? I am not assuming you are doing any complicated (SQL) queries. Performing that task with only Python dictionaries would be a complicated problem. Could you formalize the queries that you would like to perform? For example:
    • "get all users who work for department x"
    • "get all sales from user x"

Database needed

I am the computer person for everything in a small company, and I have started a new project where I think it is about time to try new databases.

You are certainly right that you need a database to solve your problem. Doing it yourself using only Python dictionaries is difficult, especially when your problem set can't fit in memory.

MySQL

I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way to play with big amounts of data in a decent time.

A centralized (client-server) database is exactly what you need to solve your problem. Let all the users access the database on one PC which you manage. You can use MySQL to solve your problem.
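As a rough sketch of that setup (using the MySQLdb driver; the connection details, table and column names are all invented for illustration):

    # Sketch: load the weekly CSV into one central MySQL server, then let every
    # desktop query the server instead of parsing a 900 MB file locally.
    import csv
    import MySQLdb

    conn = MySQLdb.connect(host='db-server', user='sales', passwd='secret', db='company')
    cur = conn.cursor()

    # One-time weekly import, run on (or near) the server.
    with open('weekly_dump.csv', 'rb') as f:
        reader = csv.reader(f)
        next(reader)                              # skip the header line
        rows = [row[:3] for row in reader]
    cur.executemany(
        "INSERT INTO sales (record_id, customer, amount) VALUES (%s, %s, %s)", rows)
    conn.commit()

    # A desktop client pulls only the rows it needs.
    cur.execute("SELECT record_id, amount FROM sales WHERE customer = %s", ('ACME',))
    for record_id, amount in cur.fetchall():
        print('%s  %s' % (record_id, amount))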

Tokyo Tyrant

You could also use Tokyo Tyrant to store all your data. Tokyo Tyrant is pretty fast and the data does not have to fit in RAM. It handles retrieving data more efficiently than using Python dictionaries. However, if your problem set can fit completely in memory, I think you should have a look at Redis (below).

Redis

You could, for example, use Redis (quick start in 5 minutes) to store all sales in memory; Redis is extremely fast. Redis is extremely powerful and can do these kinds of queries insanely fast. The only problem with Redis is that the data set has to fit completely in RAM, but I believe the author is working on that (a nightly build already supports virtual memory). Also, as I already said, solving your problem completely from memory is how big sites handle their problems in a timely manner.
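A minimal sketch with the redis-py client (the key and field names are invented):

    # Sketch: keep each sales record as a Redis hash, keyed by its record id.
    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)

    # Store one record (normally you would loop over the CSV rows).
    r.hset('sale:1001', 'customer', 'ACME')
    r.hset('sale:1001', 'amount', '42.50')

    # Reads are in-memory lookups, so they are very fast.
    print(r.hgetall('sale:1001'))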

Document stores

This article tries to compare key-value stores with document stores like CouchDB/Riak/MongoDB. These stores are better at searching (a little slower than KV stores), but aren't good at full-text search.

Full-text-search

If you want to do full-text-search queries you could look at:

  • elasticsearch (videos): When I saw the video demonstration of elasticsearch it looked pretty cool. You could try putting (posting simple JSON) your data into elasticsearch and see how fast it is; see the sketch after this list. I am following elasticsearch on GitHub and the author is committing a lot of new code to it.
  • solr (tutorial): A lot of big companies are using solr (GitHub, Digg) to power their search. They got a big boost going from MySQL full-text search to solr.
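A minimal sketch of the elasticsearch option above, indexing one record over its HTTP/JSON API with only the standard library (the index, type and field names are invented):

    # Sketch: index one CSV record in elasticsearch and query it back.
    import json
    import urllib2                    # urllib.request on Python 3

    record = {'record_id': '1001', 'customer': 'ACME', 'amount': 42.50}

    req = urllib2.Request('http://localhost:9200/sales/record/1001',
                          data=json.dumps(record),
                          headers={'Content-Type': 'application/json'})
    req.get_method = lambda: 'PUT'    # store the document under id 1001
    print(urllib2.urlopen(req).read())

    # Simple query: every record whose customer field is ACME.
    print(urllib2.urlopen(
        'http://localhost:9200/sales/record/_search?q=customer:ACME').read())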
Alfred
First of all, thanks a lot, Alfred, for this elaborate answer. Your pointers to Tokyo Tyrant, Redis and the document-store link were exactly what I was hoping for when I posted my question; however, I was not expecting to be told that MySQL can also do the job. The part about virtual memory and thrashing was something I knew was happening but did not understand at a technical level, so thanks a lot for the explanation! My application is *very simple* and it works exactly as my users want; the problem is that sometimes they need to use it with big files, and that is what prompted my question here.
Eric
Answering your questions: - Most of the desktop computers have 4 GB of RAM. - Mixed environment, GNU/Linux and Windows. - Very simple queries; it is mostly comparing one column of the first file with another column of the second file. It would be pretty much: "GET row1 from DATABASE1" (gives a CSV line), "GET row2 from DATABASE2" (gives a CSV line), compare the 2nd element of row1.database1 with the 3rd element of row2.database2, and if they match, output both rows to the final report. - I do not need any full-text-search capability.
Eric
MySQL is also a pretty fast KV-store, not as fast as Redis, but with the limited searches you have to perform it can do the job. A lot of heavy users of MySQL say "Normalization is for sissies." Okay, actually this is a quote from a blog post at http://code.flickr.com/blog/2010/02/08/using-abusing-and-scaling-mysql-at-flickr/.
Alfred
A: 

It has been a couple of months since I posted this question and I wanted to let you all know how I solved this problem. I am using Berkeley DB with the bsddb module instead of loading all the data into a Python dictionary. I am not fully happy, but my users are. My next step is trying to get a shared server with Redis, but unless users start complaining about speed, I doubt I will get it. Many thanks to everybody who helped here, and I hope this question and its answers are useful to somebody else.
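Simplified, the new storage code looks roughly like this (the file names and the key column are made up for illustration; bsddb is the Python 2 standard-library module):

    # Sketch: store each CSV row in a Berkeley DB hash file instead of one huge
    # in-memory dictionary.
    import csv
    import bsddb                      # 'bsddb3' outside the Python 2 stdlib

    # One-time conversion of the weekly CSV dump.
    db = bsddb.hashopen('weekly_dump.db', 'c')
    with open('weekly_dump.csv', 'rb') as f:
        for row in csv.reader(f):
            db[row[0]] = ','.join(row)    # Berkeley DB stores plain strings
    db.close()

    # Later lookups read single records from disk instead of loading 900 MB.
    db = bsddb.hashopen('weekly_dump.db', 'r')
    print(db['1001'].split(','))
    db.close()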

Eric
Thanks for giving us feedback. If your users are happy you should be happy I guess :)
Alfred