+7  A: 

Lucene's built-in Similarity is a fairly standard "Inverse Document Frequency" scoring algorithm. The Wikipedia article is brief, but covers the basics. The book Lucene in Action breaks down the Lucene formula in more detail; it doesn't mirror the current Lucene formula perfectly, but all of the main concepts are explained.

Primarily, the score varies with number of times that term occurs in the current document (the term frequency), and inversely with the number of times a term occurs in all documents (the document frequency). The other factors in the formula are secondary, adjusting the score in attempt to make scores from different queries fairly comparable to each other.

erickson
-1 The book says less than the JavaDocs so don't bother buying.
Piotr Czapla
+1  A: 

Think of each document and search term as a vector whose coordinates represent some measure of how important each word in the entire corpus of documents is to that particular document or search term. Similarity tells your the distance between two different vectors.

Say your corpus is normalized to ignore some terms, then a document consisting only of those terms would be located at the origin of a graph of all of your documents in the vector space defined by your corpus. Each document that contains some other terms, then represents a point in the space whose coordinates are defined by the importance of that term in the document relative to that term in the corpus. Two documents (or a document and search) whose coordinates put their "points" closer together are more similar than those with coordinates that put their "points" further apart.

tvanfosson