Short version
If I split my users into shards, how do I offer a "user search"? Obviously, I don't want every search to hit every shard.
Long version
By sharding, I mean having multiple databases where each contains a fraction of the total data. For a (naive) example, the databases UserA, UserB, etc. might contain users whose names begin with "A", "B", etc. When a new user signs up, I simply examine his name and put him into the correct database. When a returning user signs in, I again look at his name to determine the correct database to pull his information from.
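To make the routing concrete, here is a minimal Python sketch of what I have in mind. The in-memory dicts only stand in for the real databases, and the names `shard_for`, `sign_up`, and `sign_in` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the real databases:
# shard name -> {user name -> user record}
shards = defaultdict(dict)

def shard_for(name):
    """Pick a shard from the first letter of the name (the naive scheme)."""
    return "User" + name[0].upper()

def sign_up(name, info):
    # A new user is written to exactly one shard.
    shards[shard_for(name)][name] = info

def sign_in(name):
    # A returning user is read from the same shard, found the same way.
    return shards[shard_for(name)].get(name)

sign_up("Brian", {"email": "brian@example.com"})
sign_up("Alex", {"email": "alex@example.com"})
```

Both sign-up and sign-in touch a single shard, which is the property I would like search to have too.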
The advantage of sharding vs read replication is that read replication does not scale your writes. All the writes that go to the master have to go to each slave. In a sense, they all carry the same write load, even though the read load is distributed.
Meanwhile, shards do not care about each other's writes. If Brian signs up on the UserB shard, the UserA shard does not need to hear about it. If Brian sends a message to Alex, I can record that fact on both the UserA and UserB shards. In this way, when either Alex or Brian logs in, he can retrieve all his sent and received messages from his own shard without querying all shards.
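The message case might look roughly like this. Again, this is only a sketch: the `message_shards` dict and the function names are assumptions, and a real system would have to handle a failure between the two writes.

```python
from collections import defaultdict

# Hypothetical stand-in: shard name -> list of message records
message_shards = defaultdict(list)

def shard_for(name):
    return "User" + name[0].upper()

def send_message(sender, recipient, text):
    record = {"from": sender, "to": recipient, "text": text}
    # Record the message on the sender's and the recipient's shard.
    # The set avoids a double write when both users live on the same shard.
    for shard in {shard_for(sender), shard_for(recipient)}:
        message_shards[shard].append(record)

def messages_for(user):
    # Only the user's own shard is read; no other shard is involved.
    own = message_shards[shard_for(user)]
    return [m for m in own if user in (m["from"], m["to"])]

send_message("Brian", "Alex", "hi")
```

The cost is one extra write per message, but each read stays on a single shard.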
So far, so good. What about searches? In this example, if Brian searches for "Alex" I can check UserA. But what if he searches for Alex by his last name, "Smith"? There are Smiths in every shard. From here, I see two options:
- Have the application search for Smiths on each shard. This can be done slowly (querying each shard in succession) or quickly (querying each shard in parallel), but either way, every shard needs to be involved in every search. In the same way that read replication does not scale writes, having searches hit every shard does not scale your searches. You may reach a time when your search volume is high enough to overwhelm each shard, and adding shards does not help you, since they all get the same volume.
- Use some kind of index that is itself sharded. For example, let's say I have a fixed number of fields by which I want to search: first name and last name. In addition to UserA, UserB, etc. I also have IndexA, IndexB, etc. When a new user registers, I attach him to each index I want him to be found on. So I put Alex Smith into both IndexA and IndexS, and he can be found by either "Alex" or "Smith", but not by substrings. In this way, I don't need to query every shard, so search might be scalable.
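Here is a rough Python sketch of the two options side by side. Everything in it is hypothetical: the dicts stand in for databases, and `register`, `search_all_shards`, and `search_index` are names I made up for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the databases.
user_shards = defaultdict(dict)                       # "UserA" -> {"Alex Smith": {...}}
index_shards = defaultdict(lambda: defaultdict(set))  # "IndexS" -> {"smith": {"Alex Smith"}}

def register(first, last):
    full = f"{first} {last}"
    user_shards["User" + first[0].upper()][full] = {"first": first, "last": last}
    # Attach the user to one index shard per searchable field.
    for term in (first, last):
        index_shards["Index" + term[0].upper()][term.lower()].add(full)

def search_all_shards(term):
    """Option 1: query every user shard in parallel and merge the results."""
    wanted = term.lower()

    def scan(shard):
        return [name for name, u in shard.items()
                if wanted in (u["first"].lower(), u["last"].lower())]

    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(scan, list(user_shards.values())))
    return sorted(name for part in parts for name in part)

def search_index(term):
    """Option 2: consult exactly one index shard; exact matches only."""
    shard = index_shards.get("Index" + term[0].upper(), {})
    return sorted(shard.get(term.lower(), ()))

register("Alex", "Smith")
register("Brian", "Smith")
register("Brian", "Jones")
```

With this data, `search_index("Smith")` reads only IndexS, while `search_all_shards("Smith")` has to scan both UserA and UserB to return the same names. The price of the index is one extra write per searchable field at registration.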
So can search be scaled? If so, is this indexing approach the right one? Is there any other?