views:

113

answers:

3

Say you have a large collection with n objects on disk and each one has a variable-sized string. What are common practices of efficient ways to make an index of those objects with plain string comparison. Storing the whole strings on the index would be prohibitive in the long rundue to size and I/O, but since disks have a high latency storing only references isn't a good idea, either.

I've been thinking on using a B-Tree-like design with tries but can't find any database implementation using this approach. In fact, it's hard to find how major databases implement indexes for strings (it probably gets lost in the vast results for SQL-level information.)

TIA!

EDIT: changed title from "Efficient external sorting and searching of stored objects with large strings" to "Efficient storage of external index of strings."

A: 

What are you doing with the objects?

If you're running a large system that needs low latency to handle lots of concurrent requests, then I'd store the objects in a database and have it take care of the sorting and indexing. This would be much simpler than implementing B-tree from scratch and possibly having it be buggy.

DBMSs also have caching and various other features that might make your life easier.

Ben S
Thanks. The objects are very dynamic. This is a **database design** question (hence, the tag.)
alecco
@aleccolocco: Very Dynamic is a bit vague. Many existing RDBMS will do this just fine. Why invent your own?
S.Lott
@S.Lott OK, then please read it as a curiosity-based hypothetical question on algorithms.
alecco
A: 

Start by being clear what you want. Do you want to sort them or index them? Sorting is likely to require moving at least some of the items on disk, but indexing would likely leave them where they are.

If you really want to sort them, Knuth's "The Art of Computer Programming" volume three covers sorting and searching in about as much details as you're likely to want.

Tim
Thank you. Yes, I read TAOCP 3 long time ago and I already have the sorting algorithms for this implementation. Also have extensive knowledge on SQLite internals. As my question states, the purpose is to create an SORTED EXTERNAL INDEX of the strings (not in-memory.) What is sorted is the index, not the objects. The question core is how to efficiently store the index for searching (and other operations, implied for my choosing to a B-Tree approach.) Thanks again.
alecco
Then you may wish to reconsider the title "Efficient external sorting and searching of stored objects with large strings" :-)
Tim
Is the new title OK?
alecco
Yes, much better!
Tim
+2  A: 

A "prefix B-tree" or "simple prefix B-tree" would probably be helpful here.

A "simple prefix B-tree" is a bit simpler, just storing the shortest prefix that separates two items, without trying to eliminate redundancy within those prefixes (e.g. for 'astronomy' and 'azimuth', it would store just 'as' and 'az', but not try to keep from duplicating the 'a').

A "prefix B-tree" is close to what you've described -- something like a trie, but in a B-tree structure to give good characteristics when stored primarily on disk. Nonetheless, it's intended to remove (most of) the redundancy within the prefixes that form the index.

There is one other question: do you really need to traverse the records in order, or do you just need to look up a specified record quickly? If the latter is adequate, you might be able to use extendible hashing instead. Extendible hashing has been around (in a number of different forms) for a few decades, and still works pretty well. The general idea is fairly simple: hash the strings to create keys of fixed length, then create some sort of tree of those fixed-length pseudo-keys. As with (almost) any hash, you have to be prepared to deal with collisions. As with other hash tables, the details of the hashing and collision resolution vary (though probably not quite as much with extendible hashing as in-memory hashing).

As for real use, major DBMS and DBMS-like systems use all of the above. B-tree variants are probably the most common in the general purpose DBMS market (e.g. Oracle or MS SQL Server). Extendible hashing is used in a fair number of more-specialized products (e.g., Lotus Domino Server).

Jerry Coffin
Yes, it needs to traverse in order, in particular find ranges. Thanks a lot, finally a real answer.
alecco