def makecounter():
     return collections.defaultdict(int)

class RankedIndex(object):
  def __init__(self):
    self._inverted_index = collections.defaultdict(list)
    self._documents = []
    self._inverted_index = collections.defaultdict(makecounter)


def index_dir(self, base_path):
    num_files_indexed = 0
    allfiles = os.listdir(base_path)
    self._documents = os.listdir(base_path)
    num_files_indexed = len(allfiles)
    docnumber = 0
    self._inverted_index = collections.defaultdict(list)

    docnumlist = []
    for file in allfiles: 
            self.documents = [base_path+file] #list of all text files
            f = open(base_path+file, 'r')
            lines = f.read()

            tokens = self.tokenize(lines)
            docnumber = docnumber + 1
            for term in tokens:  
                if term not in sorted(self._inverted_index.keys()):
                    self._inverted_index[term] = [docnumber]
                    self._inverted_index[term][docnumber] +=1                                           
                else:
                    if docnumber not in self._inverted_index.get(term):
                        docnumlist = self._inverted_index.get(term)
                        docnumlist = docnumlist.append(docnumber)
            f.close()
    print '\n \n'
    print 'Dictionary contents: \n'
    for term in sorted(self._inverted_index):
        print term, '->', self._inverted_index.get(term)
    return num_files_indexed
    return 0

Executing this code raises an IndexError: list index out of range.

The code above builds a dictionary index that stores each term as a key and, as its value, the list of document numbers in which the term occurs. For example, if the term 'cat' occurs in documents 1.txt, 5.txt and 7.txt, the dictionary will contain: cat <- [1,5,7]

Now I have to modify it to store the term frequency as well. If the word 'cat' occurs twice in document 1, three times in document 5 and once in document 7, the expected result is a list of lists in a dict, term <- [[docnumber, term freq], [docnumber, term freq]], that is: cat <- [[1,2],[5,3],[7,1]]

I played around with the code, but nothing works. I have no idea how to modify this data structure to achieve the above.

Thanks in advance.

A: 

Perhaps you could just create a simple class for (docname, frequency).

Then your dict could have lists of this new data type. You can do a list of lists, too, but a separate data type would be cleaner.
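
A minimal sketch of this idea, assuming Python 2.7 or later. The `Posting` name, its two fields, and the `build_postings` helper are illustrative assumptions and not part of the question's code. Because a namedtuple is immutable, the frequencies are counted per document first and the record is created once.

```python
import collections

# Hypothetical record type: the name Posting and its fields are assumptions.
Posting = collections.namedtuple('Posting', ['docnumber', 'frequency'])


def build_postings(docs):
    """Map each term to a list of Posting(docnumber, frequency).

    `docs` is a list of token lists; document numbers start at 1.
    """
    index = collections.defaultdict(list)
    for docnumber, tokens in enumerate(docs, 1):
        # Count every term of this document before creating its records.
        counts = collections.Counter(tokens)
        for term in sorted(counts):
            index[term].append(Posting(docnumber, counts[term]))
    return dict(index)


# Hypothetical sample data standing in for three tokenized documents.
docs = [['cat', 'cat', 'dog'], ['dog'], ['cat']]
index = build_postings(docs)
# index['cat'] holds one Posting for document 1 and one for document 3.
```

Each entry can then be read by name, for example `index['cat'][0].frequency`, instead of by position.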

JoshD
+1  A: 

Here is a general algorithm you could use, but you will have to adapt some of your code to it. It produces a dict containing a dictionary of word counts for each file.

filedicts = {}
for file in allfiles:
  # one word-count dict per file
  filedict = filedicts[file] = {}

  # `terms` stands for the tokens read from this file
  for term in terms:
    filedict.setdefault(term, 0)
    filedict[term] += 1
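
The same counting loop as a self-contained sketch that works on already-tokenized input instead of files. The function names `count_terms` and `invert` and the sample data are assumptions made for illustration. The second function regroups the per-file counts by term, which is closer to the inverted index asked about.

```python
def count_terms(tokens_by_file):
    """Return {filename: {term: count}} for already-tokenized files."""
    filedicts = {}
    for filename, terms in tokens_by_file.items():
        filedict = filedicts[filename] = {}
        for term in terms:
            filedict.setdefault(term, 0)
            filedict[term] += 1
    return filedicts


def invert(filedicts):
    """Regroup {filename: {term: count}} as {term: {filename: count}}."""
    inverted = {}
    for filename, filedict in filedicts.items():
        for term, count in filedict.items():
            inverted.setdefault(term, {})[filename] = count
    return inverted


# Hypothetical sample data standing in for tokenized files.
sample = {'1.txt': ['cat', 'cat', 'dog'], '5.txt': ['cat']}
filedicts = count_terms(sample)
inverted = invert(filedicts)
```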
mikerobi
+2  A: 
Alex Martelli
I've made the changes you suggested. I realize that your approach is much simpler and clearer than implementing a dict of list of lists. However, it currently gives me an error; I've edited the code above.
csguy11
@csguy, in your `index_dir` method (assuming it **is** one, your indentation as posted above is all wrong) you completely destroy whatever was previously assigned to `self._inverted_index` by assigning to it your previous, erroneous data structure, thus making your edits to your code totally and utterly irrelevant. You **do** realize that when you do `self.a = b` it **just does not matter in the least any more** whatever, if anything, was previously assigned to `self.a`, right?!
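
A minimal sketch of the rebinding described above. The `Demo` class and its `count` and `rebind` methods are hypothetical stand-ins for `RankedIndex`, reduced to the two assignments in question. Once the attribute is rebound to a `defaultdict(list)`, a missing term yields an empty list, and indexing it by document number fails.

```python
import collections


def makecounter():
    return collections.defaultdict(int)


class Demo(object):  # hypothetical stand-in for RankedIndex
    def __init__(self):
        self._inverted_index = collections.defaultdict(makecounter)

    def count(self, term, docnumber):
        # Works only while _inverted_index maps term -> {docnumber: count}.
        self._inverted_index[term][docnumber] += 1

    def rebind(self):
        # Rebinding discards the dict of counters created in __init__.
        self._inverted_index = collections.defaultdict(list)


demo = Demo()
demo.count('cat', 1)
demo.count('cat', 1)
counts_before = dict(demo._inverted_index['cat'])

demo.rebind()
try:
    demo.count('cat', 1)  # now indexes an empty list
    error = None
except IndexError as exc:
    error = exc
```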
Alex Martelli
I got what the problem was, but since I don't really understand your implementation, I decided to stick with my method, i.e. a dict of list of lists, even though it's overly complicated.
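
A rough sketch of that dict of list of lists layout, under the assumption that documents are processed one at a time in increasing document-number order. The `add_occurrence` helper and the sample data are hypothetical and not taken from the code above.

```python
def add_occurrence(index, term, docnumber):
    """Record one occurrence of `term` in document `docnumber`.

    `index` maps term -> [[docnumber, frequency], ...]. Documents are
    assumed to be processed in increasing docnumber order, so only the
    last entry of a term's list can belong to the current document.
    """
    postings = index.setdefault(term, [])
    if postings and postings[-1][0] == docnumber:
        # Same document as the last entry: bump its count.
        postings[-1][1] += 1
    else:
        # First occurrence of the term in this document.
        postings.append([docnumber, 1])


# Hypothetical sample data: document number -> tokens.
docs = {1: ['cat', 'cat'], 5: ['cat', 'cat', 'cat'], 7: ['cat']}
index = {}
for docnumber in sorted(docs):
    for term in docs[docnumber]:
        add_occurrence(index, term, docnumber)
```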
csguy11