How do I get the x most similar texts to one text, out of a large collection of texts?

(Maybe changing "page" to "text" here is better.)

The solution should not compare the text against every other text, because that is too slow.

A: 

You will have to define a function to measure the "difference" between two pages. I can imagine a variety of such functions; you will have to choose one suited to your domain:

  • Difference of Keyword Sets - You can prune each document of the most common words in the dictionary, ending up with a list of unique keywords per document. The difference function would then measure how much the two documents' keyword sets differ (for example, the size of their symmetric difference).

  • Difference of Text - Calculate each distance based upon the number of edits it takes to turn one doc into another, using a text diffing algorithm (see Text Difference Algorithm).

Once you have a difference function, simply calculate the difference of your current doc with every other doc, then return the other doc that is closest.
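
For illustration, here is a minimal Python sketch of the keyword-set variant combined with that brute-force search. The stop-word list and the function names are placeholders, not a definitive implementation:

    # Illustrative stop-word list; a real one would be much larger.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

    def keywords(doc):
        """Prune common words, leaving the set of keywords for a document."""
        return {w for w in doc.lower().split() if w not in STOP_WORDS}

    def difference(doc_a, doc_b):
        """Difference of keyword sets: the size of their symmetric difference."""
        return len(keywords(doc_a) ^ keywords(doc_b))

    def closest(current, others):
        """Brute force: compare the current doc with every other doc."""
        return min(others, key=lambda other: difference(current, other))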

If you need to do this a lot and you have a lot of documents, then the problem becomes a bit more difficult.

Frank Krueger
+1  A: 

The ability to identify similar documents/pages, whether web pages or more general forms of text or even code, has many practical applications. The topic is well represented in scholarly papers and also in less specialized forums. In spite of this relative wealth of documentation, it can be difficult to find the information and techniques relevant to a particular case.

By describing the specific problem at hand and the associated requirements, it may be possible to provide you with more guidance. In the meantime, the following provides a few general ideas.

Many different functions can be used to measure, in some fashion, the similarity of pages. Selecting one (or possibly several) of these functions depends on various factors, including the amount of time and/or space one can allot to the problem, and also the level of tolerance desired for noise.

Some of the simpler metrics are:

  • length of the longest common sequence of words
  • number of common words
  • number of common sequences of words of more than n words
  • number of common words for the top n most frequent words within each document.
  • length of the document

Some of the metrics above work better when normalized, for example to avoid favoring long pages which, through their sheer size, have more chances of sharing words with other pages. Two of the simpler metrics, with such a normalization, are sketched below.
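
As a rough illustration, here is how two of these might look in Python; lowercased whitespace tokenization is a simplifying assumption:

    from difflib import SequenceMatcher

    def longest_common_word_run(a, b):
        """Length of the longest common sequence of words."""
        wa, wb = a.lower().split(), b.lower().split()
        match = SequenceMatcher(None, wa, wb).find_longest_match(0, len(wa), 0, len(wb))
        return match.size

    def common_word_ratio(a, b):
        """Number of common words, normalized (Jaccard) so sheer length does not dominate."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0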

More complicated and/or computationally expensive measurements are (the first two are sketched after this list):

  • Edit distance (in fact a generic term, as there are many ways to measure the edit distance; in general, the idea is to measure how many [editing] operations it would take to convert one text into the other)
  • Algorithms derived from the Ratcliff/Obershelp algorithm (but counting words rather than letters)
  • Linear algebra-based measurements
  • Statistical methods such as Bayesian filters
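
For example, a word-level edit distance and a word-level Ratcliff/Obershelp similarity might be sketched in Python as follows (the standard library's difflib.SequenceMatcher implements Ratcliff/Obershelp and accepts word lists as readily as strings):

    from difflib import SequenceMatcher

    def word_edit_distance(a, b):
        """Levenshtein distance counting whole-word insertions, deletions and substitutions."""
        wa, wb = a.split(), b.split()
        prev = list(range(len(wb) + 1))
        for i, x in enumerate(wa, 1):
            cur = [i]
            for j, y in enumerate(wb, 1):
                cur.append(min(prev[j] + 1,              # delete x
                               cur[j - 1] + 1,           # insert y
                               prev[j - 1] + (x != y)))  # substitute x with y
            prev = cur
        return prev[-1]

    def word_ratcliff_obershelp(a, b):
        """Similarity in [0, 1], counting words rather than letters."""
        return SequenceMatcher(None, a.split(), b.split()).ratio()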

In general, we can distinguish between measurements/algorithms where most of the calculation can be done once per document, followed by an extra pass that compares or combines these measurements (with relatively little extra computation), and algorithms that must process the documents to be compared in pairs.

Before choosing one (or indeed several) of these measures, along with some weighting coefficients, it is important to consider additional factors beyond the similarity measurement per se. For example, it may be beneficial to...

  • normalize the text in some fashion (in the case of web pages in particular, similar page contents or similar paragraphs are made to look less similar by all the "decorum" associated with the page: headers, footers, advertisement panels, different markup etc.); a rough sketch follows this list
  • exploit markup (e.g. giving more weight to similarities found in the title or in tables than to similarities found in plain text)
  • identify and eliminate domain-related (or even generally known) expressions. For example, two completely different documents may appear similar if they have in common two "boilerplate" paragraphs pertaining to some legal disclaimer or some general-purpose description, not truly associated with the essence of each document's content.
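
On the first point, a rough normalization sketch using only Python's standard library; treating <script> and <style> blocks and excess whitespace as the only "decorum" to strip is a simplifying assumption:

    import re
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Keeps text content, skipping <script> and <style> blocks."""
        def __init__(self):
            super().__init__()
            self.parts, self._skip = [], 0
        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip += 1
        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip:
                self._skip -= 1
        def handle_data(self, data):
            if not self._skip:
                self.parts.append(data)

    def normalize(html):
        """Strip markup and collapse whitespace before comparing pages."""
        extractor = TextExtractor()
        extractor.feed(html)
        return re.sub(r"\s+", " ", " ".join(extractor.parts)).strip().lower()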
mjv
A: 

Tokenize the texts, remove stop words, and arrange the terms in a term vector. Calculate tf-idf. Arrange all the vectors in a matrix and calculate the distances between them to find similar docs, using for example the Jaccard index.
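
A minimal sketch of this pipeline, assuming scikit-learn is available; note that it substitutes cosine similarity for the Jaccard index, since cosine is the usual companion metric for real-valued tf-idf vectors:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the cat sat on the mat",
            "a dog sat on the log",
            "tf-idf gives rare terms more weight"]
    query = "where did the cat sit"

    # Tokenization, stop-word removal and tf-idf weighting in one step.
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(docs)    # one tf-idf row per document
    scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    top = scores.argsort()[::-1][:2]           # indices of the 2 most similar docs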

piotr
A: 

I don't know what you mean by "similar", but perhaps you ought to load your texts into a search system like Lucene and pose your 'one text' to it as a query. Lucene does pre-index the texts, so it can quickly find the most similar ones (by its lights) at query time, as you asked.

Darius Bacon
A: 

It all depends on what you mean by "similar". If you mean "about the same subject", looking for matching N-grams usually works pretty well. For example, just make a map from trigrams to the texts that contain them, and put all trigrams from all of your texts into that map. Then, when you get the text to be matched, look up all of its trigrams in your map and pick the most frequent texts that come back (perhaps with some normalization by length).
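
A small Python sketch of this idea, assuming word-level trigrams (character trigrams would work similarly) and a simple normalization by word count:

    from collections import Counter, defaultdict

    def trigrams(text):
        words = text.lower().split()
        return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

    def build_index(texts):
        """Map each trigram to the set of texts (by index) that contain it."""
        index = defaultdict(set)
        for doc_id, text in enumerate(texts):
            for tri in trigrams(text):
                index[tri].add(doc_id)
        return index

    def most_similar(query, texts, index, k=5):
        """Count trigram hits per document and return the k best matches."""
        hits = Counter()
        for tri in trigrams(query):
            for doc_id in index.get(tri, ()):
                hits[doc_id] += 1
        # Normalize by length so longer texts are not unduly favored.
        ranked = sorted(hits, key=lambda d: hits[d] / len(texts[d].split()), reverse=True)
        return [texts[d] for d in ranked[:k]]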

Keith Randall