I have a full inverted index in the form of a nested Python dictionary. Its structure is:

{word : { doc_name : [location_list] } }

For example, let the dictionary be called index; then for the word "spam" the entry would look like:

{ 'spam' : { 'doc1.txt' : [102, 300, 399], 'doc5.txt' : [200, 587] } }

So the documents containing a given word can be obtained with index[word].keys(), and its frequency in a particular document with len(index[word][document]).
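
For instance, a minimal sketch using only the example entry above:

index = { 'spam' : { 'doc1.txt' : [102, 300, 399], 'doc5.txt' : [200, 587] } }

print(list(index['spam'].keys()))        # documents containing 'spam' -> ['doc1.txt', 'doc5.txt']
print(len(index['spam']['doc1.txt']))    # frequency of 'spam' in doc1.txt -> 3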

Now my question is: how do I implement a normal query search on this index? That is, given a query containing, let's say, 4 words, find the documents containing all four words (ranked by total frequency of occurrence), then the documents containing 3 of them, and so on.

Edit: Added the code below, written using S. Lott's answer. It works exactly as I want (just some formatting of the output is needed), but I know it could be improved.

from collections import defaultdict
from operator import itemgetter

# Take input

query = input(" Enter the query : ")

# Some preprocessing

query = query.lower()
query = query.strip()

# now the real work

wordlist = query.split()
search_words = [ x for x in wordlist if x in index ]    # list of query words that are present in the index

print("\nsearching for words ... : ", search_words, "\n")

doc_has_word = [ (index[word].keys(), word) for word in search_words ]
doc_words = defaultdict(list)
for d, w in doc_has_word:
    for p in d:
        doc_words[p].append(w)      # map each document to the query words it contains

# create a dictionary identifying matches for each document

result_set = {}

for i in doc_words.keys():
    count = 0
    matches = len(doc_words[i])     # number of query words matched in this document
    for w in doc_words[i]:
        count += len(index[w][i])   # count total occurrences of those words
    result_set[i] = (matches, count)

# Now print in sorted order: most words matched first, ties broken by total frequency

print("   Document \t\t Words matched \t\t Total Frequency ")
print('-' * 40)
for doc, (matches, count) in sorted(result_set.items(), key=itemgetter(1), reverse=True):
    print(doc, "\t", doc_words[doc], "\t", count)

Please comment. Thanks.

+2  A: 

Here's a start:

doc_has_word = [ (index[word].keys(),word) for word in wordlist ]

This will build a list of (document list, word) pairs. You can't easily make a dictionary out of that directly, since each document can occur many times (once for every query word it contains).
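
With the example index from the question and wordlist = ['spam'], for instance, doc_has_word would look roughly like this (the keys() view is written out as a list for readability):

doc_has_word = [ (['doc1.txt', 'doc5.txt'], 'spam') ]    # (documents containing the word, the word itself)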

But

from collections import defaultdict
doc_words = defaultdict(list)
for d, w in doc_has_word:
    doc_words[d].append(w)

Might be helpful.

S.Lott
doc_words[d].append(w) ...... d is a list, hence unhashable. Thanks for the code, it was of great help. I have edited the question and posted the code I wrote using yours.
Siddharth Sharma
For the final code, see the question; it was written using this answer's code. Marking this as accepted.
Siddharth Sharma
A: 
import itertools

index = {...}

def query(*args):
    result = []

    # one (document, frequency) pair for every query word / document combination
    doc_count = [(doc, len(index[word][doc])) for word in args for doc in index[word]]
    # groupby only merges consecutive items, so sort by document name first
    doc_count.sort(key=lambda pair: pair[0])
    doc_group = itertools.groupby(doc_count, key=lambda pair: pair[0])

    for doc, group in doc_group:
        result.append((doc, sum(freq for _, freq in group)))

    # highest total frequency first
    return sorted(result, key=lambda x: x[1], reverse=True)
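
With the example index from the question (plus a hypothetical 'eggs' entry added here purely for illustration), a call would look something like this; note that this function ranks by total frequency only:

index = { 'spam' : { 'doc1.txt' : [102, 300, 399], 'doc5.txt' : [200, 587] },
          'eggs' : { 'doc1.txt' : [45], 'doc5.txt' : [12, 13, 14] } }

print(query('spam', 'eggs'))    # -> [('doc5.txt', 5), ('doc1.txt', 4)]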
singularity
A: 

Here is a solution for finding the similar documents (the hardest part):

from functools import reduce    # reduce is no longer a builtin in Python 3

wordList = ['spam', 'eggs', 'toast']    # our list of words to query for
wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])

wordMatches is a list where each element is the dictionary of document matches for one of the query words.

similarDocs is a set of the documents that contain all of the words being queried for. This is found by taking just the document names out of each of the dictionaries in the wordMatches list, representing these lists of document names as sets, and then intersecting the sets to find the common document names.

Once you have found the documents that are similar, you should be able to use a defaultdict (as shown in S. Lott's answer) to append all of the lists of matches together for each word and each document.
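
A rough sketch of that combination, ranking the common documents by total frequency; the index below is in the question's format, with the 'eggs' and 'toast' entries invented here for illustration:

from collections import defaultdict
from functools import reduce

# example index in the question's format ('eggs' and 'toast' entries are hypothetical)
index = {
    'spam':  {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]},
    'eggs':  {'doc1.txt': [45],            'doc5.txt': [12, 13, 14]},
    'toast': {'doc1.txt': [7],             'doc5.txt': [88, 91]},
}

wordList = ['spam', 'eggs', 'toast']
wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])

# total up the occurrences of every query word in each of the common documents
docTotals = defaultdict(int)
for docMatch in wordMatches:
    for doc, locations in docMatch.items():
        if doc in similarDocs:
            docTotals[doc] += len(locations)

# rank the common documents by total frequency, highest first
ranking = sorted(docTotals.items(), key=lambda item: item[1], reverse=True)
print(ranking)    # -> [('doc5.txt', 7), ('doc1.txt', 5)]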

Trey
I was already able to get the matching docs for all the words in the query; I forgot to mention that in the post.
Siddharth Sharma