views:

173

answers:

2

I'm curious whether anyone understands, knows, or can point me to comprehensive literature or source code on how Google created its Popular passages feature. If you know of any other application that does the same thing, please post that as an answer too.

If you do not know what I am referring to, here is a link to an example of Popular Passages. When you look at the overview of the book Modelling the legal decision process for information technology applications ... By Georgios N. Yannopoulos, you can see something like:

Popular passages

... direction, indeterminate. We have not settled, because we have not anticipated, the question which will be raised by the unenvisaged case when it occurs; whether some degree of peace in the park is to be sacrificed to, or defended against, those children whose pleasure or interest it is to use these things. When the unenvisaged case does arise, we confront the issues at stake and can then settle the question by choosing between the competing interests in the way which best satisfies us. In doing... Page 86

Appears in 15 books from 1968-2003

This would be a world fit for "mechanical" jurisprudence. Plainly this world is not our world; human legislators can have no such knowledge of all the possible combinations of circumstances which the future may bring. This inability to anticipate brings with it a relative indeterminacy of aim. When we are bold enough to frame some general rule of conduct (eg, a rule that no vehicle may be taken into the park), the language used in this context fixes necessary conditions which anything must satisfy... Page 86

Appears in 8 books from 1968-2000

more

It must be an intensive pattern-matching process. The only things I can think of are n-gram models, text corpora, and automatic plagiarism detection. But n-grams are mostly used as probabilistic models for predicting the next item in a sequence, and text corpora (to my knowledge) are created manually. And in this particular case, popular passages, a passage can contain a great many words.

I am really lost. If I wanted to create such a feature, how or where should I start? Also, please say in your response which programming languages are best suited to this kind of work: F# or any other functional language, Perl, Python, Java... (I am becoming an F# fan myself.)

PS: Can someone add the tag automatic-plagiarism-detection? I can't.

A: 

In the small sample I looked over, it looks like all the passages picked were inline or block quotes. Just a guess, but perhaps Google Books looks for quote marks/differences in formatting and a citation, then uses a parsed version of the bibliography to associate the quote with the source. Hooray for style manuals.

This approach is obviously of no help for detecting plagiarism, and of little help if the corpus isn't in a format that preserves text formatting.
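To make the guess concrete, here is a minimal Python sketch of the "look for quote marks" step. It is only an illustration of the idea, not how Google Books actually works: the `extract_quotes` function, the `MIN_WORDS` threshold, and the regular expression are all my own assumptions, and it handles only double-quoted inline quotes, not block quotes or citations.

```python
import re

# Hypothetical threshold: skip short quoted phrases such as "mechanical".
MIN_WORDS = 8

# Matches text between straight or curly double quotes (an assumed convention).
QUOTE_RE = re.compile(r'["\u201c]([^"\u201c\u201d]+)["\u201d]')


def extract_quotes(text, min_words=MIN_WORDS):
    """Return quoted passages that contain at least min_words words."""
    quotes = []
    for match in QUOTE_RE.finditer(text):
        # Collapse line breaks and repeated spaces inside the quote.
        passage = " ".join(match.group(1).split())
        if len(passage.split()) >= min_words:
            quotes.append(passage)
    return quotes
```

Each extracted passage would then still have to be matched against a parsed bibliography to find its source, which this sketch does not attempt.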

outis
A: 

If you know which books cite or reference other books, you don't need to look at all possible books, only at the books that cite each other. If it is a scientific reference, line and page numbers are often included with the quote or can be found in the bibliography at the end of the book, so maybe Google parses only this information?

Google Scholar certainly has information about citations from paper to paper, and maybe from book to book too.
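As a rough sketch of that idea in Python: compare texts only for pairs of books that are known to cite each other, and look for word n-grams they share. This is just one way it could be done, not a description of Google's system. The function names, the `books` and `citations` input formats, and the default n-gram length of 8 are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def shingles(text, n=8):
    """Return the set of word n-grams (shingles) found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_passages(books, citations, n=8):
    """Map each shared n-gram to the set of book ids that contain it.

    books:     dict of book id -> full text (hypothetical input format)
    citations: iterable of (citing_id, cited_id) pairs (hypothetical)
    Only pairs listed in citations are compared, never all pairs.
    """
    cache = {}  # book id -> shingle set, computed once per book
    found = defaultdict(set)
    for citing, cited in citations:
        for book_id in (citing, cited):
            if book_id not in cache:
                cache[book_id] = shingles(books[book_id], n)
        for gram in cache[citing] & cache[cited]:
            found[gram].update((citing, cited))
    return dict(found)
```

Counting the size of each set gives an "appears in N books" figure; merging overlapping n-grams back into whole passages is left out here.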

Janusz