views:

183

answers:

2
+2  Q: 

Database Indexes

I need to develop a "naive" implementation of database indexes for use in a distributed environment. I know almost nothing about the subject, and I'm a bit pressured by time.

I would love to hear some opinions, examples and algorithms on the subject. I'd like to be able to have a mental representation of what I need to implement.

EDIT: I'm referring to clustered indexes

+3  A: 

There are basically two main types of indexes :

  • Clustered (ie the data is physically organized, and you re-sort it at each insertion if needed)

    Typical use-case : the physical organization is usually the same as the insertion order, so the re-sorting overhead is not a problem. This is for example the case with sequential UID's (the so called "IDENTITY" fields in a database context)

    An obvious drawback of clustered indexing is that you can have only one such index on your data.

    Naive implementation if the insertion order is exactly the sorting order : use a List.

    1. Insertion is O(1) : you just append the new data of the list
    2. Access is O(1) if the ID's are sequential (ie array indexes exactly matches UID), O(log) otherwise
  • Unclustered (ie you keep pointers on the data, as in a Hashtable)

    Typical use-case : The clustering is not appropriate because it would induce to big an insertion overhead.

Depending on your needs, you'll probably end up using on those two datastructures

An extensive repository of Index-related information is available here

Brann
In SQL Server - yes. Other database systems might have other types of indices. The question wasn't quite clear on this...
marc_s
Can you expand on the clustered index a bit, that's what I'm after
Mihai Lazar
@Brann - Ok, guess I got it know. I suppose that I'll have to create some sort of algoritm for the non-sequential data.
Mihai Lazar
Most DBMS will NOT resort at insertion. What they will do is have a growth factor and have the data not in order once it spills outside of this range. As such you use a maintenance plan to rebuild your indexes on an appropriate schedule to reorder your clustered index. Otherwise single row insertions might trigger movement of terabytes of data.
Spence
@Spence : Indeed. I was focusing on the OP question (ie "naive implementation")
Brann
+1  A: 

A really fast-and-easy-to-implement, really naive index implementation, most appropriate to any language that has a native associative array format, is a hash whose keys are extant values for the column you're indexing and whose values are arrays of row IDs for the rows with that value.

chaos