Facebook has cooked into their search some features that are unique -- possibly some are patented, even? The features I speak of are driven by three distinct requirements:
- The fact that their database is gigantic, and they can't just JOIN their way over to the data they need as they need it, as you typically can in a single-homed business app with fewer than a million records.
- The expectations of their users are shaped by other search experiences, namely Google, so that long-tail search queries are done by appending keywords to the person's name being searched for, such as "Orlando, Florida" or "Rotary Club" (or some other identifying value like an employer name).
- The data architecture appears to be shallow, based on the window we have on it looking in from the application (of course it's not shallow). What I'm saying is that beyond the so-called "Basic Information" in a user profile, such as gender, and current city, so much of what makes a profile unique is not rigidly assigned to logical columns.
So the complexity comes from the size of the dataset, BUT alongside it there is a need to deliver relevant results to a user community that's not savvy in search and whose expectations and training have been provided by The Google.
Given all of that (a refinement of my question):
a.) What search features are necessary for Facebook that we should take note of and deploy in our own search apps/engines? By necessary, I mean driven either by the massive size of the data set, or by the expectations of the users and the need for the site to organically grow and increase the relationships among its data -- I mean, users.
b.) What search features are innovative and worthy of attention by data and/or search architects?
Some are obvious, such as using synonyms for first names -- fuzzy matching a query for "Bill" with a "William" record. You can do this in Solr with a list of synonyms. I'd call this a basic feature that is necessary, not innovative of course.
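To be clear about what I mean by the synonym feature, here is a minimal sketch in plain Python (not Solr's actual mechanism). The nickname table, `build_lookup`, `expand_query`, and `matches` are all names I made up for illustration; a real list of first-name synonyms would be far larger.

```python
# Hypothetical nickname table; a real deployment would load a much larger list.
NICKNAMES = {
    "william": {"bill", "will", "billy"},
    "robert": {"bob", "rob", "bobby"},
}

def build_lookup(nicknames):
    """Map every variant (and the canonical name) to its full set of equivalents."""
    lookup = {}
    for canonical, variants in nicknames.items():
        group = {canonical} | variants
        for name in group:
            lookup.setdefault(name, set()).update(group)
    return lookup

LOOKUP = build_lookup(NICKNAMES)

def expand_query(query):
    """Return one set of acceptable tokens per query term."""
    return [LOOKUP.get(tok, {tok}) for tok in query.lower().split()]

def matches(query, record_name):
    """True when every query term (or one of its synonyms) appears in the record."""
    record_tokens = set(record_name.lower().split())
    return all(alts & record_tokens for alts in expand_query(query))

print(matches("Bill Smith", "William Smith"))
```

Expanding at query time like this keeps the index untouched; doing the expansion at index time instead would trade index size for simpler queries.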
Others, which are innovative, deserve our attention. The first example of innovation that I can call attention to is that their search relevancy is custom per user. If I type "John Smith" I get a different set of results than another searcher would (theoretically better matches for me: people in my network, friends of friends, etc.). Before you say that's not innovative because you can type just "Pizza" in Google and they'll give you relevant results by appending your locale to the query, follow along, please. My hope is that answers to this question (discussions, really) would frame some of the technical requirements as well as provide ideas for features to include in search.
For instance...
- Would you guess they run a regular batch process to denormalize the data? (i.e. a batch job to make a link table of in-place first degree of separation, second degree, etc.)
- From such a batch or denormalization, does it then limit the number of hits? This is evidenced by returning only the logically nearest "John Smith" matches. However, searches for uncommon names [such as my own first and last name] do not seem to be affected by a limit on results, and the search will look around the world, completely outside of those "few degrees" of separation.
- Are they increasing the relevance scoring by age, giving more relevancy to matches that are near the same age group as the searcher? (comment: it seems they should, it could be at least a minor speed bump to intergenerational communications/meetings that should not happen -- euphemistic, I know)
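Here is roughly what I picture for the guesses above, as a sketch only. I have no knowledge of how Facebook actually does it. The breadth-first walk stands in for the batch job that precomputes degrees of separation, and `score` shows one way degree and age gap could feed into relevance. The weights (3.0, 1.5, and the 10-year scale) are invented for illustration.

```python
from collections import deque

def degrees_of_separation(friends, start, max_degree=2):
    """Breadth-first walk from `start`; returns {user_id: degree} up to max_degree."""
    degree = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if degree[user] == max_degree:
            continue  # do not expand past the cutoff
        for friend in friends.get(user, ()):
            if friend not in degree:
                degree[friend] = degree[user] + 1
                queue.append(friend)
    del degree[start]  # the searcher is not their own match
    return degree

def score(base, degree, searcher_age, candidate_age):
    """Scale a base text-match score by network closeness and age proximity."""
    # Hypothetical weights: first degree counts most, out-of-network gets no boost.
    network_boost = {1: 3.0, 2: 1.5}.get(degree, 1.0)
    age_gap = abs(searcher_age - candidate_age)
    age_factor = 1.0 / (1.0 + age_gap / 10.0)
    return base * network_boost * age_factor

friends = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(degrees_of_separation(friends, "a"))
```

With scoring like this, a common name fills the first page with in-network matches, while an uncommon name still surfaces out-of-network people because nothing else competes with them.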
Technically, on the back end, is it best to do a denormalization process at the database level and THEN index the "documents"? (clarification: for those uninitiated to enterprise search, a "document" is more or less similar in concept to a database record)
OR is there no database denormalization at all, and instead the process of writing the search index includes writing into each "document" the related information and the people who are "in-network" or just a few degrees apart?
CERTAINLY it's necessary to pre-process such info. Without having done this exact thing in practice, myself, it seems to me that it's advantageous to denormalize in batches at the database level, reason being that the search server is good at finding info super fast, but the database server is better at getting the matching data (assuming it extends out to related columns which are not in the search index).
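To make the two options easier to compare, here is a sketch of the hand-off between them: a link table that a database batch job might produce, folded into one search "document" per person. The row shape `(user_id, other_id, degree)` and the field names `degree_1` and `degree_2` are my own assumptions, not any real schema.

```python
from collections import defaultdict

def build_documents(users, links):
    """Fold a denormalized link table into one search document per user.

    users: {user_id: name}
    links: iterable of (user_id, other_id, degree) rows from the batch job
    """
    by_user = defaultdict(lambda: defaultdict(list))
    for user_id, other_id, degree in links:
        by_user[user_id][degree].append(other_id)

    docs = []
    for user_id, name in users.items():
        docs.append({
            "id": user_id,
            "name": name,
            "degree_1": sorted(by_user[user_id][1]),  # direct friends
            "degree_2": sorted(by_user[user_id][2]),  # friends of friends
        })
    return docs

users = {1: "John Smith", 2: "Ann Lee"}
links = [(1, 2, 1), (1, 3, 2), (2, 1, 1)]
for doc in build_documents(users, links):
    print(doc)
```

A query could then boost any document whose `degree_1` or `degree_2` field contains the searcher's id, without touching the database at search time.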
Expanding upon the concept of search relevancy being dependent upon the searcher, notice that it is also derived from the user's recent browsing activity. For example, a search for "John Smith Orlando" might never produce the "right" John Smith, but after visiting the correct John Smith's Facebook page (suppose you got his URL in an email), even without adding John Smith as a friend, the very next search on John Smith will actually return that result. [I wonder how long before that ages out, or if it ages out at all?]
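One way I could imagine the "recently visited" effect working, including the aging out I wondered about, is a boost that decays over time. This is purely a guess at the behavior. The one-week half-life, the maximum boost of 5.0, and the function names are all assumptions of mine.

```python
def recency_boost(seconds_since_view, half_life_seconds=7 * 24 * 3600, max_boost=5.0):
    """Multiplier that starts at max_boost and decays toward 1.0 (no boost)."""
    decay = 0.5 ** (seconds_since_view / half_life_seconds)
    return 1.0 + (max_boost - 1.0) * decay

def rerank(results, last_viewed, now):
    """Reorder (profile_id, base_score) pairs, favoring recently viewed profiles.

    last_viewed: {profile_id: timestamp of the searcher's last visit}
    """
    def final_score(item):
        profile_id, base = item
        if profile_id in last_viewed:
            return base * recency_boost(now - last_viewed[profile_id])
        return base
    return sorted(results, key=final_score, reverse=True)

# The profile viewed a moment ago outranks a stronger text match.
print(rerank([("a", 2.0), ("b", 1.0)], {"b": 100}, 100))
```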
I used Facebook as an example here because they're huge. Their size forces a well-thought-out architecture -- such as what stays in its normal form, and what cannot because you just can't JOIN a 100 million record table repeatedly (rejoining the same person table from another "fork" off of a link table or a derived table can produce the "friends of friends" effect).
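For anyone who hasn't written that kind of self-join, here is a small runnable sketch using SQLite from Python. The `person` and `friendship` tables and their columns are hypothetical, and the data is a toy set. The point is the shape of the query, which works at this scale but is exactly what gets too expensive to repeat at 100 million rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friendship (person_id INTEGER, friend_id INTEGER);
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Dee');
    -- Friendships stored in both directions: 1-2, 2-3, 3-4.
    INSERT INTO friendship VALUES (1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3);
""")

# Join the link table to itself: f1 finds my friends, f2 finds their friends.
FRIENDS_OF_FRIENDS = """
    SELECT DISTINCT p.id, p.name
    FROM friendship AS f1
    JOIN friendship AS f2 ON f2.person_id = f1.friend_id
    JOIN person AS p ON p.id = f2.friend_id
    WHERE f1.person_id = :me
      AND f2.friend_id != :me
      AND f2.friend_id NOT IN (
          SELECT friend_id FROM friendship WHERE person_id = :me
      )
    ORDER BY p.id
"""

def friends_of_friends(conn, me):
    """Second-degree connections of `me`, excluding `me` and direct friends."""
    return conn.execute(FRIENDS_OF_FRIENDS, {"me": me}).fetchall()

print(friends_of_friends(conn, 1))
```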
The practice of relevancy tuning is really almost an art. Data sets, business rules, and users' expectations are unique enough that a multipurpose scoring template, or even a set of best practices, is nearly impossible to create.
That being said, by looking to the big sites who are pulling off search well enough, there is a technique to emulate, isn't there?
What are those techniques in place at Facebook? Given their size, they can't just fetch what the user needs when they need it via ORM (not a slam to ORM champions) -- this requires well-planned normalization, SQL-level indexing, DE-normalization, and search server indexing.
Can anyone suggest what are some of the techniques in place there? For that matter, any large site with a similar search (and a large data set) will also provide good, on-topic suggestions.