Facebook has cooked into their search some features that are unique -- possibly some are patented, even? The features I speak of are driven by three distinct requirements:

  1. The fact that their database is gigantic, and they can't just JOIN their way over to the data they need as they need it, the way you typically can in a single-homed business app with fewer than a million records.
  2. The expectations of their users are shaped by other search experiences, namely Google, so that long-tail search queries are done by appending keywords to the person's name being searched for, such as "Orlando, Florida" or "Rotary Club" (or some other identifying value like an employer name).
  3. The data architecture appears to be shallow, based on the window we have into it from the application (of course it's not actually shallow). What I'm saying is that beyond the so-called "Basic Information" in a user profile, such as gender and current city, so much of what makes a profile unique is not rigidly assigned to logical columns.

So, then, complexity exists in the needs associated with the size of the dataset, BUT with it comes the need to deliver relevant results to a user community that's not savvy in search, yet has had its expectations and training provided by The Google.

Given all of that (a refinement of my question):

a.) What search features are necessary for Facebook that we should take note of and deploy in our own search apps/engines? By necessary, I mean driven either by the massive size of the data set or by the expectations of the users, and by the need for the site to organically grow and increase the relationships among its data -- I mean, its users.

b.) What search features are innovative and worthy of attention by data and/or search architects?

Some are obvious, such as using synonyms for first names -- fuzzy matching a query for "Bill" with a "William" record. You can do this in Solr with a list of synonyms. I'd call this a basic feature that is necessary, not innovative, of course.
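To make that concrete, here's a minimal sketch of the idea in Python: query-side expansion against a made-up nickname map, roughly approximating what a Solr synonym filter does at analysis time. The synonym map and function are purely illustrative, not an actual Solr configuration:

```python
# Hypothetical sketch: query-side first-name synonym expansion,
# approximating what a Solr synonym filter does at analysis time.
# The synonym map here is illustrative, not a real Solr config.
NAME_SYNONYMS = {
    "bill": {"bill", "william", "will", "billy"},
    "bob": {"bob", "robert", "rob", "bobby"},
    "peggy": {"peggy", "margaret", "meg"},
}

def expand_first_name(term):
    """Return the term plus any known nickname/formal-name variants."""
    term = term.lower()
    for variants in NAME_SYNONYMS.values():
        if term in variants:
            return variants
    return {term}

# A query for "Bill" would also match "William" records:
print(expand_first_name("Bill"))  # {'bill', 'william', 'will', 'billy'}
```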

Others, which are innovative, deserve our attention. The first example of innovation I can call attention to is that their search relevancy is custom per user. If I type "John Smith" I get a different set of results than another searcher would (theoretically better matches for me: people in my network, friends of friends, etc.). Before you say that's not innovative because you can type just "Pizza" into Google and they'll give you relevant results by appending your locale to the query, follow along, please. My hope is that answers to, and discussion of, this question will frame some of the technical requirements as well as provide ideas to include as features in search.

For instance...

  • Would you guess they run a regular batch process to denormalize the data? (i.e., a batch job that materializes a link table of first-degree connections, second-degree connections, etc. -- see the sketch after this list)
  • From such a batch or denormalization step, does it then limit the number of hits? This is suggested by the way only the logically nearest "John Smith" matches are returned. However, searches for uncommon names [such as my own first and last name] seem unaffected by a limit on results, and the search will look around the world, completely outside of those "few degrees" of separation.
  • Are they increasing the relevance scoring by age, giving more relevancy to matches near the same age group as the searcher? (comment: it seems they should; it could be at least a minor speed bump to intergenerational communications/meetings that should not happen -- euphemistic, I know)
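For what it's worth, here is a hypothetical sketch of what such a batch denormalization job might look like: materializing first- and second-degree connections from a friendship edge list, so nothing has to walk the graph at search time. All names and data are invented for illustration; this is not Facebook's process:

```python
# Hypothetical batch job: materialize first- and second-degree
# connections from a friendship edge list, so search-time code never
# has to walk the graph. Data and field names are made up.
from collections import defaultdict

# friendship edges as (user_id, friend_id) pairs, assumed symmetric
edges = [(1, 2), (2, 3), (1, 4), (4, 5)]

first_degree = defaultdict(set)
for a, b in edges:
    first_degree[a].add(b)
    first_degree[b].add(a)

second_degree = defaultdict(set)
for user, friends in first_degree.items():
    for friend in friends:
        # friends-of-friends, excluding the user and direct friends
        second_degree[user] |= first_degree[friend] - friends - {user}

# These precomputed sets could then be written back to a link table,
# or embedded in each user's search document as a "network" field.
print(sorted(second_degree[1]))  # [3, 5]
```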

Technically, on the back end, is it best to do a denormalization process at the database level and THEN index the "documents"? (clarification: for those uninitiated to enterprise search, a "document" is more or less similar in concept to a database record)

OR is there no database denormalization? In place of that, does the process of writing the search index include writing into each "document" the related information and the people who are "in-network" or just a few degrees apart?

CERTAINLY it's necessary to pre-process such info. Without having done this exact thing in practice myself, it seems to me that it's advantageous to denormalize in batches at the database level, the reason being that the search server is good at finding info super fast, but the database server is better at getting the matching data (assuming it extends out to related columns which are not in the search index).
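As an illustration of that batch approach, here is a small sketch (with invented table and field names) of an ETL step that flattens normalized rows into one denormalized "document" per user, ready to be handed to a search server such as Solr:

```python
# Hypothetical ETL step: flatten normalized rows (users, friendships)
# into one denormalized "document" per user, ready to be posted to a
# search server such as Solr. All table/field names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT, employer TEXT);
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    INSERT INTO users VALUES (1, 'John Smith', 'Orlando, Florida', 'Rotary Club'),
                             (2, 'Jane Doe', 'Boston', 'Acme');
    INSERT INTO friends VALUES (1, 2), (2, 1);
""")

def build_documents(conn):
    """Yield one flat dict per user; the database does the joining once,
    in batch, rather than at query time."""
    for uid, name, city, employer in conn.execute("SELECT * FROM users"):
        friend_ids = [row[0] for row in conn.execute(
            "SELECT friend_id FROM friends WHERE user_id = ?", (uid,))]
        yield {
            "id": uid,
            "name": name,
            "city": city,
            "employer": employer,
            "network": friend_ids,  # denormalized "in-network" field
        }

docs = list(build_documents(conn))
# docs could now be sent to the search index (e.g. a Solr update request).
```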

Further, expanding upon the concept of search relevancy being dependent upon the user-searcher, notice that it is also derived from the user's recent browsing activity. For example, a search for "John Smith Orlando" might never produce the "right" John Smith, but after visiting the correct John Smith's Facebook page (suppose you got his URL in an email), even without adding John Smith as a friend, a subsequent search on John Smith will actually return that result the very next time. [I wonder how long before that ages out, or if it ages out at all?]
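One plausible way to express that behavior -- and I'm only guessing at the mechanism -- is a per-user boost query over recently visited profile IDs applied at query time. The sketch below uses Solr's (e)dismax parameter names (q, defType, bq), but it is an assumption for illustration, not Facebook's implementation:

```python
# Hypothetical query-time boost: lift recently visited profile IDs in
# the ranking. Parameter names follow Solr's (e)dismax conventions
# ("q", "defType", "bq"); this is only a sketch, not Facebook's code.
def build_search_params(query_text, visited_profile_ids):
    """Build Solr-style query params that boost profiles the searcher
    has recently viewed."""
    params = {
        "q": query_text,
        "defType": "edismax",
    }
    if visited_profile_ids:
        # e.g. bq=id:(123 OR 456)^10 -- a modest boost; could decay with age
        ids = " OR ".join(str(i) for i in visited_profile_ids)
        params["bq"] = f"id:({ids})^10"
    return params

print(build_search_params("John Smith Orlando", [8675309]))
# {'q': 'John Smith Orlando', 'defType': 'edismax', 'bq': 'id:(8675309)^10'}
```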

I used Facebook as an example here because they're huge. Their size forces a well-thought-out architecture -- such as what stays in its normal form and what cannot, because you just can't JOIN a 100-million-record table repeatedly (re-joining the same person table from another "fork" off of a link table or a derived table can produce the "friends of friends" effect).

The practice of relevancy tuning is really almost an art. Data sets, business rules, and users' expectations are unique enough that a multipurpose scoring template, or even a set of best practices, is nearly impossible to create.

That being said, by looking to the big sites who are pulling off search well enough, there is a technique to emulate, isn't there?

What are those techniques in place at Facebook? Given their size, they can't just fetch what the user needs when they need it via ORM (not a slam against ORM champions) -- this requires well-planned normalization, SQL-level indexing, denormalization, and search server indexing.

Can anyone suggest what are some of the techniques in place there? For that matter, any large site with a similar search (and a large data set) will also provide good, on-topic suggestions.

+1  A: 

The question is kind of vague and we can only speculate as to what Facebook does.

But we can discuss instead how a typical Solr-powered search works, which is a more concrete topic. Yes, you have to denormalize data (here are some good tips on Solr schema design) when loading it into the Solr index. This ETL process can be done with the Data Import Handler or with a custom process. Data sources can be anything, not just relational databases. How you design your schema depends largely on what kind of searches you'll be performing.

Full denormalization (Solr really has a flat schema) means no joins so it's pretty scalable (see Solr shards and replication).

Your other concern was relevancy in search results. Here, Solr is very tunable (see the Relevancy Cookbook, FAQ). Yes, it's almost an art as you say, since every application has a different concept of relevancy, so it needs to be tuned differently. And yet the default relevancy is usually acceptable for an out-of-the-box Solr instance (kudos to Solr and Lucene devs for that).
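To make "tunable" concrete, here is an illustrative, entirely made-up re-ranking function combining a base text score with application-specific boosts -- the kind of per-application relevancy decisions discussed above. The weights are arbitrary and are not taken from Solr, Lucene, or Facebook:

```python
# Illustrative (made-up) relevancy function combining a base text score
# with application-specific boosts. Weights are arbitrary examples of
# the per-application tuning discussed above, not real Solr/Facebook values.
def score(doc, searcher, base_text_score):
    s = base_text_score
    if doc["id"] in searcher["friends"]:
        s *= 3.0  # direct friends rank highest
    elif doc["id"] in searcher["friends_of_friends"]:
        s *= 1.5  # then friends of friends
    if doc.get("city") == searcher.get("city"):
        s *= 1.2  # same city: small boost
    if abs(doc.get("age", 0) - searcher.get("age", 0)) <= 5:
        s *= 1.1  # similar age: minor boost
    return s

searcher = {"friends": {42}, "friends_of_friends": {77}, "city": "Orlando", "age": 34}
doc = {"id": 77, "city": "Orlando", "age": 31}
print(score(doc, searcher, base_text_score=1.0))  # ~1.98
```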

Mauricio Scheffer
Regarding speculation, in terms of features (not the back end), it's all open for us to take note of what has been provided to the user, regardless of its hidden or undocumented nature. A good example is my noting how the search adds to its relevant universe a record previously *not* presented as a query result at all, after the user visits that person's page; not by friending them, just visiting their page. That actually raises that search result into a *most* relevant slot, giving it top position in autocomplete. I find this worthy of documenting as a best practice (or at least a good idea).
Chris Adragna
Next, Mauricio, I respectfully suggest that your conclusion stating "yes, you have to denormalize" requires clarification, or could just go unsaid. Are you saying so because that's the state of the data in Solr (yes, a given), or explaining that it's a by-product of pushing data into Solr during indexing? If that's what you're saying, I'm sorry, but within my (tedious) question I did make a place for questioning denormalization -- asking at which layer and at which point in the process?
Chris Adragna
Again, respectfully, I see the state of denormalization in Solr as a given. In contrast, within the database, it’s a point worthy of discussion, ideally accompanied by illustrations of better performance or optimal alignment to the app objects/logic.
Chris Adragna
Still on denormalization as it pertains to search... I'm trying to raise this question: should the data layer have a regular batch process to denormalize, driven by either of two sides -- 1. search and 2. the post-search fetching of data detail? Perhaps as it pertains to search it's inconsequential, given the natural state of denormalized data captured during the index process. However, performance needs could make it necessary to store denormalized data in SQL. Here, yes, we may need to speculate, but perhaps someone might know this from experience?
Chris Adragna
@Chris: Solr has a flat schema, therefore when you load data into it from your database (or other data sources) you *have* to denormalize the data. The relational database schema doesn't have to be modified. It's the ETL process that denormalizes data to feed it to Solr.
Mauricio Scheffer
@Chris: about relevancy, I already posted some resources that barely scratch the surface of the topic. Solr is really *very* flexible about this. It's really not a big deal to tune it to make the pages of people you visited in the past more relevant than others.
Mauricio Scheffer
@Chris: "which layer and at which point in the process": this can't be answered in a general way. It depends on each application. I recommend getting a Solr consultant if you need to implement search in your application.
Mauricio Scheffer
@Chris: also, you seem to think that what Facebook does is not accessible to "mere mortals". It's not like that. I don't mean to say it's trivial, but it's certainly doable thanks to projects like Solr/Lucene, Xapian, Sphinx, etc.
Mauricio Scheffer
:) Any elevating of Facebook to god status was unintentional. They're a terrific example because many people are familiar with the app, and their huge size doesn't afford them the opportunity to pop out unscalable features, which would probably have a tremendous cool factor but no practicality nor academic value. I've worked with data sets of various sizes, and I admit that the things you can get away with in a small data set don't sharpen your mind as well as the same tasks in multi-million-record tables (similar to Facebook JOIN-ing many times, plus derivative tables for added fun). :)
Chris Adragna
I'm a Solr novice, but I'm very fortunate to have worked with a couple of great consultants, one being an author of "the" book. Because of that, I'm motivated to add to my repertoire, and I think that FB is a good place to look, largely because of their scale, but in some ways also because of their "innovation" (as I labeled it). I get what you're saying: perhaps it's not a big deal to include a visited page among the relevant results. I'll try not to be in awe of them, but I sure would like to piece together a feature set that's worthy of documenting as good objectives to "try on your own." :)
Chris Adragna
+2  A: 

For the database, Facebook utilizes MySQL because of its speed and reliability. MySQL is used primarily as a key-value store, with data randomly distributed amongst a large set of logical instances. These logical instances are spread out across physical nodes, and load balancing is done at the physical node level. As far as customizations are concerned, Facebook has developed a custom partitioning scheme in which a global ID is assigned to all data. They also have a custom archiving scheme based on how frequently and recently data is accessed, on a per-user basis. Most data is distributed randomly.
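As a toy sketch of the general shape of that idea (not Facebook's actual scheme), routing a global ID to one of many logical instances can be as simple as hashing the ID; the instance names and count below are invented:

```python
# Toy sketch: route a global ID to one of many logical database
# instances. Instance names and count are invented for illustration;
# this is the general shape of the idea, not Facebook's actual scheme.
import hashlib

LOGICAL_INSTANCES = [f"mysql-logical-{n:04d}" for n in range(4096)]

def instance_for(global_id):
    """Hash the global ID to pick a logical instance deterministically."""
    digest = hashlib.md5(str(global_id).encode()).hexdigest()
    return LOGICAL_INSTANCES[int(digest, 16) % len(LOGICAL_INSTANCES)]

print(instance_for(123456789))  # the same ID always maps to the same instance
```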

For some parts, like the inbox, Facebook uses a NoSQL database that is "eventually consistent": when you query a cluster of them, you get "the best answer at that time" and not necessarily what is correct.

From parts of your question it appears you're trying to take practices that work for social media and apply them more widely. Eventual consistency won't work in accounting, trading, medical, or research applications. If it's Auntie Fannie's latest picture of her cat, no one cares if the FB page doesn't show the most recent one ALL THE TIME. You're willing to sacrifice that accuracy for such banality.

Turning every third-normal-form business app into key-value pairs because FB does it isn't a train I'm willing to board.

Stephanie Page
I appreciate the technical background you provided. While I had a significant exchange of comments and refinement of the question/answer with Mauricio, your answer is more in the spirit of what I was hoping to elicit. I'm guessing that my posing the question about Facebook specifically may have hindered the discussion. I may try again to resurrect a question in a similar vein, asking in generalities instead.
Chris Adragna