Hi

I'm trying to figure out what kind of binary file format can support my needs for an inverted index. Let's say I have documents that I can identify by a unique ID, and each document has 360 fixed values in the range 0-65535. Something like this:

Document0: [1, 10, 123, ...] // 360 values

Document1: [1, 10, 345, ...] // 360 values

Now, the inverted index itself is easy: for each possible value I can create a list of the documents that contain it, and queries execute fast, e.g.:

1: [Document0, Document1]

10: [Document0, Document1]

123: [Document0]

345: [Document1]
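
For clarity, building this part in memory is trivial; e.g. a dict mapping value -> document list (Python here purely as illustration):

from collections import defaultdict

# value -> list of IDs of documents that contain it
index = defaultdict(list)

documents = {
    0: [1, 10, 123],  # really 360 values per document
    1: [1, 10, 345],
}

for doc_id, values in documents.items():
    for value in values:
        index[value].append(doc_id)

print(index[1])    # [0, 1]
print(index[123])  # [0]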

But I want to store a large number of documents in some kind of (binary) file, with the ability to query fast but also to add new documents without recreating the whole structure.

Now I'm struggling with how to organize that file. If I want fast access, I need fixed-length document arrays so I can seek into the file and then read. But fixed size means a lot of empty space in the document lists. My only idea is some kind of bucketing system, where each value's list belongs to a bucket of a specific size, e.g. buckets of size 1, 2, 4, 8, 16, 32, ... (or something like that), plus a header that tells me where each bucket starts and how big it is. This idea would optimize storage size, but again I have a problem with adding new documents.
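
To make the bucketing idea concrete, here is roughly what I have in mind (just a sketch; the names and the doubling policy are placeholders): a fixed header with one slot per value recording the bucket's file offset, capacity and used count; when a bucket fills up, a bucket of doubled capacity is written at the end of the file and the slot is repointed.

import struct

SLOT = struct.Struct("<QII")      # per value: bucket offset, capacity, count
HEADER_SIZE = 65536 * SLOT.size   # fixed header, one slot per possible value

def create(path):
    with open(path, "wb") as f:
        f.write(b"\x00" * HEADER_SIZE)   # all slots start empty

def add_posting(f, value, doc_id):
    # Read this value's header slot.
    f.seek(value * SLOT.size)
    offset, capacity, count = SLOT.unpack(f.read(SLOT.size))

    if count == capacity:
        # Bucket full (or not yet allocated): write a bucket of doubled
        # capacity at the end of the file and copy the old entries over.
        new_capacity = max(1, capacity * 2)
        old = b""
        if capacity:
            f.seek(offset)
            old = f.read(count * 4)
        f.seek(0, 2)
        offset = f.tell()
        f.write(old + b"\x00" * (new_capacity * 4 - len(old)))
        capacity = new_capacity

    # Append the new document ID and write the updated slot back.
    f.seek(offset + count * 4)
    f.write(struct.pack("<I", doc_id))
    f.seek(value * SLOT.size)
    f.write(SLOT.pack(offset, capacity, count + 1))

# Usage: create("index.bin") once, then
# with open("index.bin", "r+b") as f: add_posting(f, 123, 0)

This handles additions, but every outgrown bucket leaves dead space behind, so I'm back to wasting storage.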

Any ideas on how to organize my 'inverted index' file?

Best.

A: 

I would go for 65,536 files, each holding the IDs of the documents that contain that value. If you want to go gentle on the filesystem, divide that into 256 directories with 256 files each.

00\00.idx
00\01.idx
..
FF\FF.idx
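
One possible way to wire that up (Python purely for illustration; the path scheme is high byte -> directory, low byte -> file, and each file is a flat array of little-endian 32-bit document IDs):

import os
import struct

def path_for(value, root="index"):
    # e.g. value 0x1A2B -> index/1A/2B.idx
    return os.path.join(root, "%02X" % (value >> 8), "%02X.idx" % (value & 0xFF))

def add(value, doc_id, root="index"):
    p = path_for(value, root)
    os.makedirs(os.path.dirname(p), exist_ok=True)
    with open(p, "ab") as f:   # appending never rewrites existing data
        f.write(struct.pack("<I", doc_id))

def query(value, root="index"):
    try:
        with open(path_for(value, root), "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return []
    return list(struct.unpack("<%dI" % (len(data) // 4), data))

Appending to a file is cheap, so new documents never force a rebuild of the whole structure.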
Daniel Mošmondor
A: 

That sounds good. Reads are very fast; writes, on the other hand, are slower, since I need to make sure that each file contains unique document IDs (for now I have a simple model that keeps a constant number of files in memory and dumps them to disk when some threshold is reached). Thanks for the response.
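
For anyone reading along, the buffering I mean looks roughly like this (a simplified sketch; set-based dedup per value and a flush threshold on the total number of buffered postings are my own choices):

from collections import defaultdict

class BufferedIndexWriter:
    def __init__(self, flush_fn, threshold=100000):
        self.pending = defaultdict(set)  # value -> doc IDs; a set dedupes in memory
        self.size = 0
        self.flush_fn = flush_fn         # e.g. the per-file add() from the answer
        self.threshold = threshold

    def add(self, value, doc_id):
        if doc_id not in self.pending[value]:
            self.pending[value].add(doc_id)
            self.size += 1
        if self.size >= self.threshold:
            self.flush()

    def flush(self):
        # Note: this only dedupes within the buffer; IDs already on disk
        # would need a read-back check or a periodic compaction pass.
        for value, doc_ids in self.pending.items():
            for doc_id in sorted(doc_ids):
                self.flush_fn(value, doc_id)
        self.pending.clear()
        self.size = 0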