views: 1100
answers: 3

I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.

The method I need to use has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), along with simple implementations of tf-idf and cosine similarity.

Is there any program that can do this? Or should I start writing this from scratch?

+2  A: 

For our Information Retrieval course, we use some code written by our professor in Java. Sorry, no Python port. "It is being released for educational and research purposes only under the GNU General Public License."

You can check out the documentation at http://userweb.cs.utexas.edu/~mooney/ir-course/doc/

More specifically, check out http://userweb.cs.utexas.edu/users/mooney/ir-course/doc/ir/vsr/HashMapVector.html

You can download it from http://userweb.cs.utexas.edu/users/mooney/ir-course/

Penang
+4  A: 

Check out the NLTK package (http://www.nltk.org); it has everything you need.

For cosine similarity:


import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    # Despite the name, this returns the cosine *similarity*: 1.0 for
    # identical directions, and within [0, 1] for non-negative term vectors.
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
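
A quick sanity check with two made-up term-count vectors (the numbers are just for illustration):

import numpy

v1 = numpy.array([1, 2, 0, 1])  # term counts for document 1
v2 = numpy.array([1, 1, 1, 0])  # term counts for document 2
print cosine_distance(v1, v2)   # ~0.71; identical documents give 1.0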

For ngrams:


from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function.  Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """

    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)] 
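
For documents you would typically tokenize first and then take word (or character) n-grams, e.g.:

>>> ngrams("the quick brown fox".split(), 2)
[('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]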

For tf-idf you will have to compute the term distribution first. I am using Lucene to do that, but you could do something very similar with NLTK, using FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
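
For completeness, here is a minimal sketch of tf-idf on top of FreqDist (the function name and the exact weighting scheme are my own choices, not NLTK's):

from math import log
from nltk import FreqDist

def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc."""
    n = len(docs)
    # document frequency: how many documents each term occurs in
    df = FreqDist(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = FreqDist(doc)  # raw term frequency within this document
        vectors.append(dict((t, tf[t] * log(float(n) / df[t])) for t in tf))
    return vectors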

If you like PyLucene, this will show you how to compute tf-idf:

    # reader = lucene.IndexReader(FSDirectory.open(index_loc))
    docs = reader.numDocs()
    for i in xrange(docs):
        tfv = reader.getTermFreqVector(i, fieldname)
        if tfv:
            rec = {}
            terms = tfv.getTerms()
            frequencies = tfv.getTermFrequencies()
            # zipping with xrange(maxtokensperdoc) caps the terms taken per document
            for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
                df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
                tmap.setdefault(t, len(tmap))
                rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute tf-idf
            # and normalize the values using cosine normalization
            if cosine_normalization:
                denom = sum([x**2 for x in rec.values()]) ** 0.5
                for k, v in rec.items():
                    rec[k] = v / denom
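
Once two documents are represented as such cosine-normalized records, their similarity is just the dot product over the terms they share (a small helper of my own, not part of Lucene):

def sparse_cosine(rec1, rec2):
    # both records are assumed cosine-normalized, so the dot product
    # over shared terms is exactly the cosine similarity in [0, 1]
    return sum(w * rec2[t] for t, w in rec1.items() if t in rec2)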
roman
+2  A: 

In case you're still interested in this problem, I've done something very similar using Lucene Java and Jython. Here are some snippets from my code.

Lucene preprocesses documents and queries using so-called analyzers. This one uses Lucene's built-in n-gram filter:

# Jython imports for the Lucene 3.x classes used below
from org.apache.lucene.analysis import Analyzer, ASCIIFoldingFilter, LowerCaseTokenizer
from org.apache.lucene.analysis.ngram import NGramTokenFilter

class NGramAnalyzer(Analyzer):
    '''Analyzer that yields n-grams for minlength <= n <= maxlength'''
    def __init__(self, minlength, maxlength):
        self.minlength = minlength
        self.maxlength = maxlength
    def tokenStream(self, field, reader):
        lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
        return NGramTokenFilter(lower, self.minlength, self.maxlength)

To turn a list of ngrams into a Document:

from org.apache.lucene.document import Document, Field

doc = Document()
doc.add(Field('n-grams', ' '.join(ngrams),
        Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))

To store a document in an index:

from org.apache.lucene.index import IndexWriter

wr = IndexWriter(index_dir, NGramAnalyzer(minlength, maxlength), True,
                 IndexWriter.MaxFieldLength.LIMITED)
wr.addDocument(doc)

Building queries is a little more difficult, because Lucene's QueryParser expects a query language with special operators, quotes, etc., but this can be circumvented (as partly explained here).
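
One way around it (a sketch of mine, assuming the Lucene 3.x API and an IndexSearcher called searcher) is to skip QueryParser entirely and assemble a BooleanQuery from TermQuery objects, one per n-gram:

q = BooleanQuery()  # note: BooleanQuery defaults to a maximum of 1024 clauses
for gram in ngrams:
    q.add(TermQuery(Term('n-grams', gram)), BooleanClause.Occur.SHOULD)
top = searcher.search(q, 10)  # the 10 most similar indexed documents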

larsmans