I am solving a problem of getting the top K frequency from an array of structure containing value and frequency sorted by value ascending. The data size is really huge say a million entries. After discussing this with my friend, we shortlisted hash bucket approach with separate overflow chaining. Here’s my approach:
Since the number of file descriptors is limited, we do not want to associate a separate external bucket with each in memory buffer. So all the external buckets are in the same file where each block represents one bucket for the frequency and it may point to another block in case this also overflows. I’ll also keep the holes in between blocks so that it can be used for overflow. We’ll use fake pointers (block numbers) as next ptr for each block pointing to another block representing same frequency.
I was just wondering how I am going to implement this. How do I access something stored in disk in the form of blocks. Please Help. This all made a lot of sense while discussing but I am wondering how do I work for the externally stored file.