views: 140

answers: 5
Hello, I am working on a project on Information Retrieval. I have built a full inverted index using Hadoop/Python. Hadoop outputs the index as (word, documentlist) pairs, which are written to a file. For quick access, I have created a dictionary (hashtable) from that file. My question is: how do I store such an index on disk so that it still has quick access time? At present I am storing the dictionary using Python's pickle module and loading from it, but that brings the whole index into memory at once (or does it?). Please suggest an efficient way of storing and searching through the index.

My dictionary structure is as follows (using nested dictionaries):

{word : {doc1:[locations], doc2:[locations], ....}}

so that I can get the documents containing a word with dictionary[word].keys(), and so on.

+2  A: 

Use shelve (the standard library's persistent, dictionary-like storage).

At present I am storing the dictionary using python pickle module and loading from it but it brings the whole of index into memory at once (or does it?).

Yes, it does bring it all in.

Is that a problem? If it's not an actual problem, then stick with it.

If it's a problem, what kind of problem do you have? Too slow? Too fast? Too colorful? Too much memory used? What problem do you have?
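The shelve suggestion above can be sketched as follows. This is a minimal illustration, not the questioner's actual index: the word keys and posting lists are made-up sample data, and the file name `inverted_index` is an assumption. The point is that shelve keeps the index on disk and loads only the postings for the key you look up, rather than unpickling the whole dictionary.

```python
# Minimal sketch: persisting an inverted index with the standard-library
# shelve module. Keys and postings below are made-up sample data.
import shelve

# Write: each word is a key mapped to its {doc: [locations]} postings.
with shelve.open("inverted_index") as db:
    db["hadoop"] = {"doc1": [4, 17], "doc2": [2]}
    db["python"] = {"doc1": [9]}

# Read: a single lookup; the rest of the index stays on disk.
with shelve.open("inverted_index") as db:
    postings = db["hadoop"]

docs = sorted(postings.keys())  # documents containing "hadoop"
```

Note that shelve pickles each value individually, so a lookup costs one key access plus one unpickle, independent of the total index size.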

S.Lott
Thank you all for your kind support ... I am supposed to build a text search engine that uses an inverted index. So far, I have built a full inverted index. The problem I think I may have is that as the index size grows, bringing it all in would probably consume too much memory(???). At present I am only working on a prototype with reduced functionality, so the index size is trivial, but when it's finished it will probably be a large file. That's the problem that I have. –
Siddharth Sharma
@Siddharth Sharma: "consume too much memory(???)". If you don't know, then don't start by trying to optimize it. First -- build using a simple dictionary until you can **prove** that it uses too much memory. Then -- and only after you have **proof** -- switch to shelve. Later, you'll have to switch to a bloom filter. But only after you can **prove** that shelve is too slow.
S.Lott
@S.Lott: thanks ... yes, Knuth said "premature optimization is the root of all evil" .... will try to keep it in mind.
Siddharth Sharma
@Siddharth Sharma: Keep your goals in mind. (1) Works. (2) Optimal use of resources. You can't do #2 until you've done #1.
S.Lott
A: 

Just store it in a string like this:

<entry1>,<entry2>,<entry3>,...,<entryN>

If <entry*> contains the ',' character, use some other delimiter such as '\t'. This is smaller than an equivalent pickled string.

If you want to load it, just do:

L = s.split(delimiter)
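A round trip of this scheme might look like the following sketch; the entry strings are made-up examples, and the `doc:locations` formatting inside each entry is an assumption (the answer leaves the per-entry format open):

```python
# Sketch of the delimiter-joined representation: join entries on write,
# split on read. Entries are made-up samples; '\t' is the delimiter
# because the entries themselves contain ','.
entries = ["doc1:4,17", "doc2:2"]

s = "\t".join(entries)   # this string is what you would write to disk

L = s.split("\t")        # loading is a single split
```

You would still need to parse each entry back into its document and locations, which is the main cost of this approach compared with pickle.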
OTZ
A: 

You could store the repr() of the dictionary and use that to re-create it.
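As a sketch of this round trip (with a made-up index), the safe way to re-create the dictionary from its repr() string is ast.literal_eval, which parses Python literals without the risks of eval():

```python
# Sketch: serialize the nested dict as its repr() string, then parse it
# back with ast.literal_eval. The index contents are made-up sample data.
import ast

index = {"hadoop": {"doc1": [4, 17], "doc2": [2]}}

text = repr(index)                 # this string is what you write to disk
restored = ast.literal_eval(text)  # safe parse back into a dict
```

Like pickle, this still loads the entire index into memory at once, so it only changes the on-disk format, not the memory profile.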

ikanobori
That'll be space-inefficient. My solution takes less space.
OTZ
Why does space matter? What problem does the original question actually have? Time? Space? Licensing fees for third party software? There's no hint as to what to optimize.
S.Lott
A: 

If it's taking a long time to load or using too much memory, you might need a database. There are many you might use; I would probably start with SQLite. Then your problem is "reduced" ;-) to simply formulating the right query to get what you need out of the database. This way you will only load what you need.
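A minimal sketch of the SQLite approach with the standard-library sqlite3 module follows. The table layout (one row per word/document/location) and all names and sample data are assumptions for illustration; an index on the word column is what makes single-word lookups fast.

```python
# Sketch: store postings as (word, doc, location) rows in SQLite, so a
# query loads only the rows for one word. Schema and data are made up.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent index
conn.execute("CREATE TABLE postings (word TEXT, doc TEXT, location INTEGER)")
conn.execute("CREATE INDEX idx_word ON postings (word)")
conn.executemany(
    "INSERT INTO postings VALUES (?, ?, ?)",
    [("hadoop", "doc1", 4), ("hadoop", "doc1", 17), ("hadoop", "doc2", 2)],
)

# Fetch only the documents containing one word:
rows = conn.execute(
    "SELECT DISTINCT doc FROM postings WHERE word = ?", ("hadoop",)
)
docs = sorted(row[0] for row in rows)
```

With a file-backed database instead of ":memory:", only the queried pages are read from disk, which directly addresses the "whole index in memory" concern.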

kindall
A: 

I would use Lucene. Why reinvent the wheel?

Jay Askren