I have a full inverted index in the form of a nested Python dictionary. Its structure is:

{word : { doc_name : [location_list] } }

For example, let the dictionary be called index; then for the word 'spam', the entry would look like:

{ 'spam' : { 'doc1.txt' : [102,300,399], 'doc5.txt' : [200,587] } }

I used this structure because Python dicts are pretty well optimised, and it makes programming easier.

For any word 'spam', the documents containing it are given by:

index['spam'].keys()

and the posting list for a document doc1.txt by:

index['spam']['doc1.txt']

At present I am using cPickle to store and load this dictionary. But the pickled file is around 380 MB and takes a long time to load - 112 seconds (approx.; I timed it using time.time()) - and memory usage goes up to 1.2 GB (Gnome System Monitor). Once it loads, it's fine. I have 4 GB of RAM.

len(index.keys()) gives 229758

code :

import cPickle as pickle

f = open('full_index', 'rb')
print 'Loading index... please wait...'
index = pickle.load(f)  # This takes ages
f.close()
print 'Index loaded. You may now proceed to search'

My question is: how can I make it load faster? I only need to load it once, when the application starts. After that, access time is important for responding to queries. Should I switch to a database like SQLite and create an index on its keys? If so, how do I store the values to get an equivalent schema that makes retrieval easy? Is there anything else I should look into?

Addendum

Using Tim's answer, pickle.dump(index, file, -1), the pickled file is considerably smaller - around 237 MB (the dump took 300 seconds) - and it now takes half the time to load (61 seconds, as opposed to 112 seconds earlier, again timed with time.time()).

But should I migrate to a database for scalability ?

As for now I am marking Tim's answer as accepted.

PS: I don't want to use Lucene or Xapian ... This question refers to http://stackoverflow.com/questions/3687715/storing-an-inverted-index . I had to ask a new question because I wasn't able to delete the previous one.

A: 

Have you tried using an alternative storage format such as YAML or JSON? Python supports JSON natively from Python 2.6 via the json module, and there are third-party modules for YAML.
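For instance, a minimal sketch of the JSON route (assuming the index contains only strings, ints and lists, which map directly onto JSON; the filename is illustrative):

import json

# Dump the nested dict; dicts become JSON objects, lists become arrays.
with open('full_index.json', 'w') as f:
    json.dump(index, f)

# Load it back. Note that in Python 2, json.load returns unicode
# strings, so keys come back as u'spam' rather than 'spam'.
with open('full_index.json', 'r') as f:
    index = json.load(f)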

You may also try the shelve module.
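A rough sketch of what shelve could look like here (shelve keys must be plain strings, which fits this index; the upside is that only the postings you actually look up are read from disk):

import shelve

# Build the shelf once: one entry per word, with the
# {doc_name: [locations]} dict as the value (pickled internally).
db = shelve.open('index_shelf')
for word, postings in index.iteritems():
    db[word] = postings
db.close()

# At query time, open read-only and fetch postings on demand;
# nothing is loaded into memory up front.
db = shelve.open('index_shelf', flag='r')
docs = db['spam'].keys()               # documents containing 'spam'
positions = db['spam']['doc1.txt']     # posting list for one document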

Tamás
@Tamás: No, I haven't tried YAML or JSON. I have read about using shelve for my problem, but as far as I know, shelve doesn't perform well with files as large as mine.
Siddharth Sharma
+3  A: 

Try the protocol argument when using cPickle.dump/cPickle.dumps. From cPickle.Pickler.__doc__:

Pickler(file, protocol=0) -- Create a pickler.

This takes a file-like object for writing a pickle data stream. The optional proto argument tells the pickler to use the given protocol; supported protocols are 0, 1, 2. The default protocol is 0, to be backwards compatible. (Protocol 0 is the only protocol that can be written to a file opened in text mode and read back successfully. When using a protocol higher than 0, make sure the file is opened in binary mode, both when pickling and unpickling.)

Protocol 1 is more efficient than protocol 0; protocol 2 is more efficient than protocol 1.

Specifying a negative protocol version selects the highest protocol version supported. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.

The file parameter must have a write() method that accepts a single string argument. It can thus be an open file object, a StringIO object, or any other custom object that meets this interface.

Converting to JSON or YAML will probably take longer than pickling most of the time - pickle stores native Python types.
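In code, that amounts to something like this (a sketch; the file must be opened in binary mode for protocols above 0):

import cPickle as pickle

# -1 selects the highest protocol available (protocol 2 on Python 2),
# which produces a smaller, faster binary pickle than protocol 0.
with open('full_index', 'wb') as f:
    pickle.dump(index, f, -1)

# Read it back; binary mode is required for protocols > 0.
with open('full_index', 'rb') as f:
    index = pickle.load(f)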

Tim McNamara
@Tim: thank you! Using pickle.dump(obj, file, -1), the file is smaller and loads quicker. I've added timing details to the question.
Siddharth Sharma
A: 

Depending on how long 'long' is, you have to think about the trade-offs you face: either have all the data ready in memory after a (long) startup, or load only partial data (in which case you need to split the data across multiple files, or use SQLite or something similar). I doubt that loading all the data upfront from e.g. SQLite into a dictionary will bring any improvement.
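For reference, one possible SQLite layout (the table and column names here are purely illustrative, not from the question): one row per (word, document) pair with the position list serialised as a string, plus an index on word so each query is a single indexed lookup rather than a full scan:

import sqlite3
import json

conn = sqlite3.connect('inverted_index.db')
conn.execute('CREATE TABLE IF NOT EXISTS postings '
             '(word TEXT, doc TEXT, positions TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_word ON postings (word)')

def add_posting(word, doc, locations):
    # Store the location list as a JSON string, e.g. '[102, 300, 399]'.
    # Remember to call conn.commit() after bulk inserts.
    conn.execute('INSERT INTO postings VALUES (?, ?, ?)',
                 (word, doc, json.dumps(locations)))

def lookup(word):
    # Returns {doc_name: [location_list]}, mirroring index[word].
    rows = conn.execute('SELECT doc, positions FROM postings WHERE word = ?',
                        (word,))
    return dict((doc, json.loads(pos)) for doc, pos in rows)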

knitti