views:

519

answers:

4

Short version

If I split my users into shards, how do I offer a "user search"? Obviously, I don't want every search to hit every shard.

Long version

By shard, I mean have multiple databases where each contains a fraction of the total data. For (a naive) example, the databases UserA, UserB, etc. might contain users whose names begin with "A", "B", etc. When a new user signs up, I simple examine his name and put him into the correct database. When a returning user signs in, I again look at his name to determine the correct database to pull his information from.

The advantage of sharding vs read replication is that read replication does not scale your writes. All the writes that go to the master have to go to each slave. In a sense, they all carry the same write load, even though the read load is distributed.

Meanwhile, shards do not care about each other's writes. If Brian signs up on the UserB shard, the UserA shard does not need to hear about it. If Brian sends a message to Alex, I can record that fact on both the UserA and UserB shards. In this way, when either Alex or Brian logs in, he can retrieve all his sent and received messages from his own shard without querying all shards.

So far, so good. What about searches? In this example, if Brian searches for "Alex" I can check UserA. But what if he searches for Alex by his last name, "Smith"? There are Smiths in every shard. From here, I see two options:

  1. Have the application search for Smiths on each shard. This can be done slowly (querying each shard in succession) or quickly (querying each shard in parallel), but either way, every shard needs to be involved in every search. In the same way that read replication does not scale writes, having searches hit every shard does not scale your searches. You may reach a time when your search volume is high enough to overwhelm each shard, and adding shards does not help you, since they all get the same volume.
  2. Some kind of indexing that itself is tolerant of sharding. For example, let's say I have a constant number of fields by which I want to search: first name and last name. In addition to UserA, UserB, etc. I also have IndexA, IndexB, etc. When a new user registers, I attach him to each index I want him to be found on. So I put Alex Smith into both IndexA and IndexS, and he can be found on either "Alex" or "Smith", but no substrings. In this way, you don't need to query each shard, so search might be scalable.

So can search be scaled? If so, is this indexing approach the right one? Is there any other?

+2  A: 

I'm assuming you are talking about shards a la : http://highscalability.com/unorthodox-approach-database-design-coming-shard

If you read that article he goes into some detail on exactly your question, but long answer short, you write custom application code to bring your disparate shards together. You can do some smart hashing to both query individual shards and insert data into shards. You need to ask a more specific question to get a more specific answer.

Zak
Thanks. I've actually read that site extensively. I've tried to clarify my question above; which hopefully is beyond the article you helpfully linked.
+1  A: 

You actually do need every search to hit every shard, or at least every search needs to be performed against an index that contains the data from all shards, which boils down to the same thing.

Presumably you shard based on a single property of the user, probably a hash of the username. If your search feature allows the user to search based on other properties of the user it is clear that there is no single shard or subset of shards that can satisfy a query, because any shard could contain users that match the query. You can't rule out any shards before performing the search, which implies that you must run the query against all shards.

Please see my clarification above.
+5  A: 

There is no magic bullet.

Searching each shard in succession is out of the question, obviously, due to the incredibly high latency you will incur.

So you want to search in parallel, if you have to.

There are two realistic options, and you already listed them -- indexing, and parallelized search. Allow me to go into a little more detail on how you would go about designing them.

The key insight you can use is that in search, you rarely need the complete set of results. You only need the first (or nth) page of results. So there is quite a bit of wiggle room you can use to decrease response time.

Indexing

If you know the attributes on which the users will be searched, you can create custom, separate indexes for them. You can build your own inverted index, which will point to the (shard, recordId) tuple for each search term, or you can store it in the database. Update it lazily, and asynchronously. I do not know your application requirements, it might even be possible to just rebuild the index every night (meaning you will not have the most recent entries on any given day -- but that might be ok for you). Make sure to optimize this index for size so it can fit in memory; note that you can shard this index, if you need to.

Naturally, if people can search for something like "lastname='Smith' OR lastname='Jones'", you can read the index for Smith, read the index for Jones, and compute the union -- you do not need to store all possible queries, just their building parts.

Parallel Search

For every query, send off requests to every shard unless you know which shard to look for because the search happens to be on the distribution key. Make the requests asynchronous. Reply to the user as soon as you get the first page-worth of results; collect the rest and cache locally, so that if the user hits "next" you will have the results ready and do not need to re-query the servers. This way, if some of the servers are taking longer than others, you do not need to wait on them to service the request.

While you are at it, log the response times of the sharded servers to observe potential problems with uneven data and/or load distribution.

SquareCog
+1  A: 

You may want to look at Sphinx (http://www.sphinxsearch.com/articles.html). It supports distributed searching. GigaSpaces has parallel query and merge support. This can also be done with MySQL Proxy (http://jan.kneschke.de/2008/6/2/mysql-proxy-merging-resultsets).

To build a non-sharded indexed kinds of defeats the purpose of the shard in the first place :-) A centralized index probably won't work if shards were necessary.

I think all the shards need to be hit in parallel. The results need to be filtered, ranked, sorted, grouped and the results merged from all the shards. If the shards themselves become overwhelmed you have to do the usual (reshard, scale up, etc) to underwhelm them again.

Todd Hoff