I have a full inverted index in the form of a nested Python dictionary. Its structure is:
{word : { doc_name : [location_list] } }
For example, if the dictionary is called index, then the entry for the word "spam" would look like:
{ 'spam' : { 'doc1.txt' : [102,300,399], 'doc5.txt' : [200,587] } }
I used this structure because Python dicts are well optimised and it makes programming easier.
For any word such as 'spam', the documents containing it are given by:
index['spam'].keys()
and the posting list for a document such as doc1.txt by:
index['spam']['doc1.txt']
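To make the structure concrete, here is a minimal sketch with made-up data. The second word 'eggs' and its position are purely illustrative and not taken from the real index.

```python
# Toy index with the same shape as the real one: {word: {doc_name: [positions]}}
index = {
    'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]},
    'eggs': {'doc5.txt': [12]},  # hypothetical extra entry for illustration
}

docs_with_spam = sorted(index['spam'].keys())  # documents containing 'spam'
positions = index['spam']['doc1.txt']          # posting list for one document
missing = index.get('ham', {})                 # unknown word: empty dict, no KeyError

print(docs_with_spam)
print(positions)
```

Using index.get(word, {}) instead of index[word] avoids a KeyError for words that are not in the index.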
At present I am using cPickle to store and load this dictionary. But the pickled file is around 380 MB and takes a long time to load: about 112 seconds (timed with time.time()), and memory usage goes up to 1.2 GB (according to GNOME System Monitor). Once it is loaded, it's fine. I have 4 GB of RAM.
len(index.keys())
gives 229758
Code:
import cPickle as pickle

print 'Loading index... please wait...'
with open('full_index', 'rb') as f:  # the file is closed once loading finishes
    index = pickle.load(f)  # This takes ages
print 'Index loaded. You may now proceed to search'
My question is: how can I make it load faster? I only need to load it once, when the application starts. After that, access time is what matters for responding to queries. Should I switch to a database like SQLite and create an index on its keys? If yes, how do I store the values so that I have an equivalent schema that makes retrieval easy? Is there anything else that I should look into?
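For concreteness, one possible equivalent schema is sketched below, assuming one row per (word, document, position). The table name postings, the column names, the index name, and the helpers docs_for and positions_for are all hypothetical, and whether this would beat the in-memory dict for query speed is untested here.

```python
import sqlite3

# Hypothetical schema: one row per (word, document, position).
conn = sqlite3.connect(':memory:')  # a file path would be used for a real index
conn.execute('CREATE TABLE postings (word TEXT, doc TEXT, pos INTEGER)')

# Same toy data as the dict example in the question.
index = {'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]}}

# Flatten the nested dict into rows and bulk-insert them.
rows = [(word, doc, pos)
        for word, docs in index.items()
        for doc, plist in docs.items()
        for pos in plist]
conn.executemany('INSERT INTO postings VALUES (?, ?, ?)', rows)

# Composite index so lookups by word, or by word and doc, avoid a full scan.
conn.execute('CREATE INDEX idx_word_doc ON postings (word, doc)')
conn.commit()


def docs_for(word):
    """Rough equivalent of index[word].keys()."""
    cur = conn.execute(
        'SELECT DISTINCT doc FROM postings WHERE word = ? ORDER BY doc',
        (word,))
    return [row[0] for row in cur]


def positions_for(word, doc):
    """Rough equivalent of index[word][doc]."""
    cur = conn.execute(
        'SELECT pos FROM postings WHERE word = ? AND doc = ? ORDER BY pos',
        (word, doc))
    return [row[0] for row in cur]


print(docs_for('spam'))
print(positions_for('spam', 'doc1.txt'))
```

With a layout like this nothing has to be loaded up front, but every query goes through SQL instead of a dict lookup, which is the trade-off I am asking about.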
Addendum
Using Tim's answer, pickle.dump(index, file, -1),
the pickled file is considerably smaller, around 237 MB (it took 300 seconds to dump), and it now takes about half the time to load (61 seconds, as opposed to 112 seconds earlier, again measured with time.time()).
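The effect of the protocol argument can be reproduced on a small scale. This is only a sketch on a synthetic index (the word and document names are generated, not real data); it compares the ASCII protocol 0, which was the default in Python 2, with -1, which selects the highest protocol available.

```python
import pickle  # on Python 2, `import cPickle as pickle` is the faster equivalent

# Synthetic index with the same shape: {word: {doc_name: [positions]}}
index = {
    'word%d' % i: {'doc%d.txt' % j: list(range(i, i + 50)) for j in range(5)}
    for i in range(200)
}

text_blob = pickle.dumps(index, 0)     # protocol 0: ASCII format
binary_blob = pickle.dumps(index, -1)  # -1: highest (binary) protocol available

print(len(text_blob), len(binary_blob))

# The binary pickle round-trips to an equal dictionary.
restored = pickle.loads(binary_blob)
```

The exact sizes depend on the Python version and the data, so no numbers are quoted here.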
But should I migrate to a database for scalability?
For now, I am marking Tim's answer as accepted.
PS: I don't want to use Lucene or Xapian. This question refers to http://stackoverflow.com/questions/3687715/storing-an-inverted-index . I had to ask a new question because I wasn't able to delete the previous one.