Hi
I'm trying to figure out what kind of binary file format can support my needs for an inverted index. Let's say I have documents that I can identify by a unique ID, and each document has exactly 360 values in the range 0-65535. Something like this:
Document0: [1, 10, 123, ...] // 360 values
Document1: [1, 10, 345, ...] // 360 values
Now, the inverted index itself is easy: for each possible value I can build a list of the documents that contain it, and a query can be executed fast, e.g.:
1: [Document0, Document1]
10: [Document0, Document1]
123: [Document0]
345: [Document1]
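In memory this structure is trivial; here's a minimal Python sketch (the documents dict is just example data), in case it makes the question clearer:

from collections import defaultdict

# value -> list of document IDs that contain it
index = defaultdict(list)

documents = {
    0: [1, 10, 123],   # in reality each document has exactly 360 values
    1: [1, 10, 345],
}

for doc_id, values in documents.items():
    for value in values:              # each value is in the range 0-65535
        index[value].append(doc_id)

print(index[10])                      # query: documents containing 10 -> [0, 1]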
But I want to store a large number of documents in some kind of binary file and have the ability to query fast, but also to add new documents without recreating the whole structure.
Now I'm struggling with how to organize that file. If I want fast access, I need fixed-length document lists so I can seek in the file and then read. But a fixed size means a lot of empty space in most of the document lists. My only idea is some kind of bucketing system, where each value's list lives in a bucket of a specific capacity, e.g. buckets of size 1, 2, 4, 8, 16, 32, ... (or something like that), and I need some kind of header that tells me where each bucket starts and how big it is. This would keep the stored size down, but again I have a problem with adding new documents (a full bucket would have to be moved to a bigger one).
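To make the bucket idea concrete, here's a rough Python sketch of what I'm imagining: a fixed-size header record per possible value, followed by a data section with the buckets (the field layout and names are only assumptions for illustration, not a decided format):

import struct

# One fixed-size header record per possible value (65536 records total):
# offset of the bucket in the data section, bucket capacity (a power of two),
# and how many document IDs in the bucket are actually used.
HEADER_RECORD = struct.Struct("<QII")          # offset: u64, capacity: u32, count: u32
NUM_VALUES = 65536
DATA_START = NUM_VALUES * HEADER_RECORD.size   # data section follows the header table

def read_header_record(f, value):
    """Seek to the header record for `value` and unpack it."""
    f.seek(value * HEADER_RECORD.size)
    return HEADER_RECORD.unpack(f.read(HEADER_RECORD.size))

def read_posting_list(f, value):
    """Read the document IDs stored in the bucket for `value`."""
    offset, capacity, count = read_header_record(f, value)
    f.seek(DATA_START + offset)
    # document IDs stored as u32 each (just an assumption)
    return list(struct.unpack("<%dI" % count, f.read(4 * count)))

With a layout like this, a query is one seek into the header table plus one seek into the data section. The part I can't figure out is growth: when a bucket's count reaches its capacity, do I have to copy the whole list into a bucket of the next size and update the header, or is there a better way?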
Any idea how to organize my 'inverted index' file?
Best.