views:

588

answers:

5

I think I understand sharding to be putting back your sliced up data (the shards) into an easy to deal with aggregate that makes sense in the context. Is this correct?

Update: I guess I am struggling here. In my opinion the application tier should have no business determining where data should be stored. At best it should be shard client of some sort. Both responses answered the what but not the why is it important aspect. What implications does it have outside of the obvious performance gains? Are these gains sufficient to offset the MVC violation? Is sharding mostly important in very large scale applications or does it apply to smaller scale ones?

+3  A: 

If you have queries to a DBMS for which the locality is quite restricted (say, a user only fires selects with a 'where username = $my_username') it makes sense to put all the usernames starting with A-N on one server and all from M-Z on the other. By this you get near linear scaling for some queries.

Long story short: Sharding is basically the process of distributing tables onto different servers in order to balance the load onto both equally.

Of course, it's so much more complicated in reality. :)

bayer
So sharding affects the design of the data you are storing... sorry if I don't quite understand.
ojblass
I added another sentence to make it clearer.
bayer
+10  A: 

Sharding is just another name for "horizontal partitioning" of a database. You might want to search for that term to get it clearer.

From Wikipedia:

Horizontal partitioning is a design principle whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location. The advantage is the number of rows in each table is reduced (this reduces index size, thus improves search performance). If the sharding is based on some real-world aspect of the data (e.g. European customers vs. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.

From a comment to this article (which I recommend to read, including comments):

[Sharding:] Firstly, each database server is identical, having the same table structure. Secondly, the data records are logically split up in a sharded database. Unlike the partitioned database, each complete data record exists in only one shard (unless there's mirroring for backup/redundancy) with all CRUD operations performed just in that database. You may not like the terminology used, but this does represent a different way of organizing a logical database into smaller parts.

Update: You wont break MVC. The work of determining the correct shard where to store the data would be transparently done by your data access layer. There you would have to determine the correct shard based on the criteria which you used to shard your database. (As you have to manually shard the database into some different shards based on some concrete aspects of your application.) Then you have to take care when loading and storing the data from/into the database to use the correct shard.

Maybe this example with Java code makes it somewhat clearer (it's about the Hibernate Shards project), how this would work in a real world scenario.

To address the "why sharding": It's mainly only for very large scale applications, with lots of data. First, it helps minimizing response times for database queries. Second, you can use more cheaper, "lower-end" machines to host your data on, instead of one big server, which might not suffice anymore.

MicSim
Forgive me but shouldn't the database make the determinations of where to store data. Does this affect code on the application tier?
ojblass
See my above update in posting.
MicSim
+1  A: 

Is sharding mostly important in very large scale applications or does it apply to smaller scale ones?

Sharding is a concern if and only if your needs scale past what can be served by a single database server. It's a swell tool if you have shardable data and you have incredibly high scalability and performance requirements. I would guess that in my entire 12 years I've been a software professional, I've encountered one situation that could have benefited from sharding. It's an advanced technique with very limited applicability.

Besides, the future is probably going to be something fun and exciting like a massive object "cloud" that erases all potential performance limitations, right? :)

earino
+1  A: 

In my opinion the application tier should have no business determining where data should be stored

This is a good rule but like most things not always correct.

When you do your architecture you start with responsibilities and collaborations. Once you determine your functional architecture, you have to balance the non-functional forces.

If one of these non-functional forces is massive scalability, you have to adapt your architecture to cater for this force even if it means that your data storage abstraction now leaks into your application tier.

Hans Malherbe
+1  A: 

Take a look at Hibernate Shards, I haven't tried it yet but it looks really easy to use.