I'm building an English web dictionary where users can type in words and get definitions. I thought about this for a while, and since the data is 100% static and I would only ever retrieve one word at a time, I figured I was better off using the filesystem (ext3) as the database instead of MySQL for storing definitions. My reasoning was that there would be less overhead, considering that you have to connect to MySQL, and that in itself is a very slow operation.

My fear is that if my system were to get bombarded by, let's say, 500 word retrievals/sec, would I still be better off using the filesystem as the database? Or would the increased filesystem reads hinder performance compared to whatever MySQL might be doing under the hood?

Currently the hierarchy is segmented by the first, second and third letters of the word. So if you search for the definition of "water", the script (PHP) will try to read from "../dict/w/a/t/water.word" (after stripping problematic characters from the word and lowercasing it).
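As a sanity check of that path mapping, here's a minimal sketch in Python (the actual script is PHP; the cleanup rule and `DICT_ROOT` below are my assumptions, not the asker's code). It also surfaces an edge case: words shorter than three letters don't fit a three-level scheme.

```python
import os
import re

DICT_ROOT = "../dict"  # assumed root of the definition tree

def word_path(word):
    """Map a word to its definition file, mirroring the described scheme."""
    # Lowercase and strip anything outside a-z (hypothetical cleanup rule).
    clean = re.sub(r"[^a-z]", "", word.lower())
    if len(clean) < 3:
        return None  # words like "at" need a fallback scheme
    return os.path.join(DICT_ROOT, clean[0], clean[1], clean[2], clean + ".word")

print(word_path("Water"))  # ../dict/w/a/t/water.word (on POSIX)
```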

Am I heading in the right direction with this, or is there a faster solution (not counting storing definitions in memory using something like memcached)? Will the number of files stored in a directory affect performance? What's a rough benchmark for how many files I should store per directory?

+1  A: 

The type of lookups a dictionary requires is exactly what a database is good at. I think the filesystem method you describe will be unworkable. Don't make it hard on yourself: use a database.

Mitch Wheat
A: 

A DB sounds perfect for your needs. I also don't see why memcached is relevant (how big is your data? Can't be more than a few GB... right?)

Assaf Lavie
+1  A: 

You can keep a connection pool around to speed up connecting to the DB.

Also, if this application needs to scale to multiple servers, the file system may be tricky to share between servers.

So, I third the suggestion. Use a DB.

But unless it's a fabulously large dictionary, caching would mean you're nearly always getting stuff from local memory, so I don't think this is going to be the biggest issue for your application :)

Paul
A: 

The data is approximately a couple of GBs, and my goal is speed, speed, speed (definitions will be loaded via XHR). The data, as I said, is static and is never going to change, and nowhere would I use anything other than a single read operation per request. So I'm having a pretty hard time convincing myself to use MySQL and all its bloat.

Which would be first to fail under high load with this strategy, the filesystem or MySQL? As for scaling, replication is the answer, since the data will never change and is only a couple of GBs.

+2  A: 

What are your grounds for believing that this decision will matter to the overall performance of the solution? What does it do other than provide definitions?

Do you have MySQL as part of the solution anyway, or would you need to add it should you select it as the solution here?

Where is the definitive source of definitions? The (maybe replicated) filesystem, or some off line DB?

It seems like something that should be in a DB architecturally - filesystems are a strange place to map a large number of names to values (as is evidenced by your file system structure breaking things down by initial letters)

If it's in the DB, answering questions like "how many definitions are there?" is a lot easier, but if you don't care about such things for your application, this may not matter.
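For illustration, the kind of keyed lookup and counting a database gives you for free can be sketched like this (an in-memory SQLite table is used purely as a self-contained stand-in for MySQL; the table name and sample row are made up):

```python
import sqlite3

# Stand-in for a MySQL definitions table with an indexed word column
# (a PRIMARY KEY gives SQLite an index automatically).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE definitions (word TEXT PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO definitions VALUES (?, ?)",
             ("water", "a transparent, odourless liquid..."))

# The per-request lookup: one indexed SELECT per word.
row = conn.execute("SELECT body FROM definitions WHERE word = ?",
                   ("water",)).fetchone()
print(row[0])

# Questions like "how many definitions are there?" become one query.
count = conn.execute("SELECT COUNT(*) FROM definitions").fetchone()[0]
print(count)
```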

So to some extent this feels like trying to hyper-optimise the performance of something whose performance won't actually make much difference to the overall solution.

I'm a fan of "make it correct, then make it fast", and "correct" would be more straightforward to achieve with a DB.

And of course, the ultimate answer would be to try both and see which one works best in your situation.

Paul

A: 

Make it work first. Premature optimisation is bad.

Using a database enables easier refactoring of your schema, and you don't have to write an implementation of an index-based lookup, which in actual fact is nontrivial.

Saying that connecting to a database "is a very slow operation" overstates the problem. Actually connecting should not take very long, plus you can reuse connections anyway.

If you are worried about read scaling, a 1G database is very small, so you can push read-only replicas of it to each web server and have each read from its local copy. Provided writes stay at a level which doesn't impact read performance, that gives you almost perfect read scalability.

Moreover, 1G of data will fit into RAM easily, so you can make it fast by loading the entire database into memory at startup (before that node advertises itself to the load balancer).

500 lookups per second is trivially small. I would start worrying about 5000 per second per server, maybe. If you can't achieve 5000 key lookups per second on modern hardware (from a database which fits in RAM?!!), there is something seriously wrong with your implementation.
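That claim is easy to sanity-check with a toy in-memory table (the sizes and key names below are invented; in the real system the data would be loaded from disk at startup):

```python
import time

# Toy stand-in for a dictionary loaded fully into RAM at startup.
# 100k entries is illustrative; the thread's real data set is ~1 GB.
definitions = {f"word{i}": f"definition {i}" for i in range(100_000)}

n = 500_000
keys = [f"word{i % 100_000}" for i in range(n)]  # precompute so we time only lookups
start = time.perf_counter()
for k in keys:
    _ = definitions[k]
elapsed = time.perf_counter() - start
print(f"{n / elapsed:,.0f} lookups/sec")
```

On any modern machine this comes out orders of magnitude above the 500/sec the asker is worried about.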

MarkR
A: 

Agreed that this is premature optimization, and that MySQL will surely be performant enough for this use case. I should add that you could also use a file-based database, like the very fast Tokyo Cabinet, as a compromise. Sadly it doesn't have a PHP binding, so you could fall back to its grandfather, DBM.

That said, do not use a filesystem, there's no good reason to, as far as I can see.

Vinko Vrsalovic
A: 

The system does work (though it's not released to the public yet, and I'm only on a single machine) and currently implements the directory-structure solution. So yes, if I were to use MySQL I'd have to rework it and move the data into MySQL. The source of definitions is Wiktionary's XML dump, which is approximately 600 MB in size (though larger than that once you convert the wiki markup to HTML). I parsed that file to create the directory structure.

The basis of this question was that I ran a simple benchmark reading a definition from the filesystem vs. MySQL (without using caching nor any optimizations), and the filesystem reads were 4000% faster than MySQL's SELECT on an indexed column.

Of course, caching in memory would render this question moot (since you could do that with or without an RDBMS). So, caching aside, and considering I'm not really keen on jumping through all the hoops to get MySQL to perform under high concurrency, which solution will outperform the other from the get-go: the filesystem or MySQL? Only benchmarks will tell.

"(without using caching nor any optimizations)" Are you sure? Filesystems get cached by the OS :) Did that time include the DB connection (which in practice will be amortized over many lookups)? So one lookup isn't very interesting; ten thousand random ones would perhaps be more meaningful.
Paul
You're right. I'll have to do a bunch of benchmarks and some tests using ab.
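Along the lines of Paul's comment, a fairer benchmark does many random lookups rather than one. A rough Python sketch against a throwaway file tree (all names, sizes and the layout are invented for illustration) might look like:

```python
import os
import random
import string
import tempfile
import time

# Build a small throwaway tree shaped like the asker's w/a/t/water.word scheme.
random.seed(1)
root = tempfile.mkdtemp()
words = {"".join(random.choices(string.ascii_lowercase, k=6)) for _ in range(1000)}
for w in words:
    d = os.path.join(root, w[0], w[1], w[2])
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, w + ".word"), "w") as f:
        f.write("definition of " + w)

# Ten thousand random (repeated) lookups, timed as a batch.
sample = random.choices(list(words), k=10_000)
start = time.perf_counter()
for w in sample:
    with open(os.path.join(root, w[0], w[1], w[2], w + ".word")) as f:
        body = f.read()
elapsed = time.perf_counter() - start
print(f"{len(sample) / elapsed:,.0f} file reads/sec")
```

Note the caveat from the comment still applies: after the first pass the OS page cache is warm, so this mostly measures cached reads; an equivalent MySQL benchmark would likewise need to reuse one connection across the whole batch.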
A: 

Use a virtual drive in your RAM (google a how-to for your distro), or if your data is served by PHP, use APC; memcached might work well with MySQL. Personally, I don't think the optimization you are doing here is really where you should be spending your time. 500 requests a second is massive. I think using MySQL would give you better features to build on later. You need to concentrate on features, not speed, if you want to differentiate yourself from your competitors. Also, there are a few good talks about UI for the web; server speed is only a small factor in the whole picture.

Good luck