I have a full inverted index in the form of a nested Python dictionary. Its structure is:

{word : { doc_name : [location_list] } }

For example, let the dictionary be called index; then for the word "spam" the entry would look like:

{ 'spam' : { 'doc1.txt' : [102, 300, 399], 'doc5.txt' : [200, 587] } }

So the documents containing a given word can be obtained with index[word].keys(), and its frequency in a particular document with len(index[word][document]).
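
For instance, a minimal sketch using only the example entry above:

index = { 'spam' : { 'doc1.txt' : [102, 300, 399], 'doc5.txt' : [200, 587] } }

print(list(index['spam'].keys()))        # documents containing 'spam' -> ['doc1.txt', 'doc5.txt']
print(len(index['spam']['doc1.txt']))    # frequency of 'spam' in doc1.txt -> 3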

Now my question is: how do I implement a normal query search on this index? That is, given a query containing, let's say, 4 words, find the documents containing all four words (ranked by total frequency of occurrence), then the documents containing 3 of them, and so on.

Edit: Added the code below, written using S. Lott's answer. It works exactly as I want (just some formatting of the output is needed), but I know it could be improved.

from collections import defaultdict
from operator import itemgetter

# Take input

query = input(" Enter the query : ")

# Some preprocessing

query = query.lower()
query = query.strip()

# now the real work

wordlist = query.split()
search_words = [ x for x in wordlist if x in index ]    # list of query words that are present in the index

print("\nsearching for words ... : ", search_words, "\n")

doc_has_word = [ (index[word].keys(), word) for word in search_words ]
doc_words = defaultdict(list)
for d, w in doc_has_word:
    for p in d:
        doc_words[p].append(w)      # map each document to the query words it contains

# create a dictionary identifying matches for each document

result_set = {}

for i in doc_words.keys():
    count = 0
    matches = len(doc_words[i])     # number of query words matched in this document
    for w in doc_words[i]:
        count += len(index[w][i])   # count total occurrences of those words
    result_set[i] = (matches, count)

# Now print in sorted order: most words matched first, ties broken by total frequency

print("   Document \t\t Words matched \t\t Total Frequency ")
print('-' * 40)
for doc, (matches, count) in sorted(result_set.items(), key=itemgetter(1), reverse=True):
    print(doc, "\t", doc_words[doc], "\t", count)

Please comment. Thanks.

+2  A: 

Here's a start:

doc_has_word = [ (index[word].keys(),word) for word in wordlist ]

This will build a list of (document list, word) pairs. You can't easily make a dictionary out of that directly, since each document can occur many times (once for every query word it contains).
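
With the example index from the question and wordlist = ['spam'], for instance, doc_has_word would look roughly like this (the keys() view is written out as a list for readability):

doc_has_word = [ (['doc1.txt', 'doc5.txt'], 'spam') ]    # (documents containing the word, the word itself)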

But

from collections import defaultdict
doc_words = defaultdict(list)
for d, w in doc_has_word:
    doc_words[d].append(w)

Might be helpful.

S.Lott
doc_words[d].append(w) ...... d is a list, hence unhashable. Thanks for the code, it was of great help. I have edited the question and posted the code I wrote using yours.
Siddharth Sharma
For the final code, see the question; it was written using this answer's code. Marking this as accepted.
Siddharth Sharma
A: 
import itertools

index = {...}

def query(*args):
    result = []

    # one (document, frequency) pair for every query word / document combination
    doc_count = [(doc, len(index[word][doc])) for word in args for doc in index[word]]
    # groupby only merges consecutive items, so sort by document name first
    doc_count.sort(key=lambda pair: pair[0])
    doc_group = itertools.groupby(doc_count, key=lambda pair: pair[0])

    for doc, group in doc_group:
        result.append((doc, sum(freq for _, freq in group)))

    # highest total frequency first
    return sorted(result, key=lambda x: x[1], reverse=True)
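
With the example index from the question (plus a hypothetical 'eggs' entry added here purely for illustration), a call would look something like this; note that this function ranks by total frequency only:

index = { 'spam' : { 'doc1.txt' : [102, 300, 399], 'doc5.txt' : [200, 587] },
          'eggs' : { 'doc1.txt' : [45], 'doc5.txt' : [12, 13, 14] } }

print(query('spam', 'eggs'))    # -> [('doc5.txt', 5), ('doc1.txt', 4)]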
singularity
A: 

Here is a solution for finding the similar documents (the hardest part):

from functools import reduce    # reduce is no longer a builtin in Python 3

wordList = ['spam', 'eggs', 'toast']    # our list of words to query for
wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])

wordMatches is a list where each element is the dictionary of document matches for one of the query words.

similarDocs is a set of the documents that contain all of the words being queried for. This is found by taking just the document names out of each of the dictionaries in the wordMatches list, representing these lists of document names as sets, and then intersecting the sets to find the common document names.

Once you have found the documents that are similar, you should be able to use a defaultdict (as shown in S. Lott's answer) to append all of the lists of matches together for each word and each document.
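
A rough sketch of that combination, ranking the common documents by total frequency; the index below is in the question's format, with the 'eggs' and 'toast' entries invented here for illustration:

from collections import defaultdict
from functools import reduce

# example index in the question's format ('eggs' and 'toast' entries are hypothetical)
index = {
    'spam':  {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]},
    'eggs':  {'doc1.txt': [45],            'doc5.txt': [12, 13, 14]},
    'toast': {'doc1.txt': [7],             'doc5.txt': [88, 91]},
}

wordList = ['spam', 'eggs', 'toast']
wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])

# total up the occurrences of every query word in each of the common documents
docTotals = defaultdict(int)
for docMatch in wordMatches:
    for doc, locations in docMatch.items():
        if doc in similarDocs:
            docTotals[doc] += len(locations)

# rank the common documents by total frequency, highest first
ranking = sorted(docTotals.items(), key=lambda item: item[1], reverse=True)
print(ranking)    # -> [('doc5.txt', 7), ('doc1.txt', 5)]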

Trey
I was already able to get the matching docs for all the words in the query; I forgot to mention that in the post.
Siddharth Sharma