views: 9688
answers: 8
Given that indexing is so important as your dataset increases in size, can someone explain how indexing works at a database agnostic level?

For information on queries to index a field, check out http://stackoverflow.com/questions/1156/how-do-i-index-a-database-field

+2  A: 

Perhaps it would be useful to set a standard where posters of informational articles first pose the question which the article most clearly answers. This might perhaps help navigation and create a more consistent and semantic interface.

Just an idea, sorry for polluting your article with it :)

sparkes
+1  A: 

@sparkes, I am trying to jump start the article process, so there isn't a right or wrong way just yet :)

I thought about doing it the way you suggested, and if enough people think it's the way to go, I'll happily edit it that way. Maybe Jeff can weigh in with an acceptable way of doing things here.

Xenph Yan
+1  A: 

I agree with sparkes, it makes more sense in the context of the site to ask the question first and then answer it with your article.

John Downey
+52  A: 

Why is it needed?

When data is stored on disk-based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic unit of disk access. Disk blocks are structured in much the same way as linked lists: both contain a section for data and a pointer to the location of the next node (or block), and neither needs to be stored contiguously.

Because a set of records can be physically sorted on only one field, searching on a field that isn't sorted requires a Linear Search, which takes N/2 block accesses on average, where N is the number of blocks that the table spans. If that field is a non-key field (i.e. it doesn't contain unique entries), then the entire table space must be searched, at N block accesses.

With a sorted field, however, a Binary Search may be used, which takes log2 N block accesses. Also, since the data is sorted, even given a non-key field the rest of the table need not be searched for duplicate values once a higher value is found. The performance increase is thus substantial.
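The difference between the two access patterns can be sketched with a toy model in Python, where each list element stands in for one disk block (a deliberate simplification; a real engine reads fixed-size blocks, not single values):

```python
def linear_search(blocks, target):
    """Scan blocks front to back, counting accesses (unsorted field)."""
    for accesses, value in enumerate(blocks, start=1):
        if value == target:
            return accesses
    return len(blocks)  # not found: every block was read

def binary_search(blocks, target):
    """Halve the remaining range each step, counting accesses (sorted field)."""
    lo, hi, accesses = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        accesses += 1
        if blocks[mid] == target:
            return accesses
        if blocks[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return accesses

blocks = list(range(1, 1_000_001))  # 1,000,000 sorted "blocks"
print(linear_search(blocks, 700_000))  # 700000 accesses
print(binary_search(blocks, 700_000))  # at most 20 accesses
```

On a million blocks the linear scan's cost depends on where the value sits, averaging N/2, while the binary search never exceeds ceil(log2 N) = 20 accesses.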

What is indexing?

Indexing is a way of sorting a number of records on one or more fields. Creating an index on a field in a table creates another data structure which holds the field value and a pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.
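As a rough illustrative model (not MySQL's actual on-disk B-tree implementation), an index can be pictured as a sorted list of (value, pointer) pairs sitting beside an otherwise unsorted table; the table rows here are made-up sample data:

```python
import bisect

# Hypothetical table: rows of (id, lastName, firstName), in insertion order.
table = [
    (1, "Yan",   "Xenph"),
    (2, "Smith", "Alice"),
    (3, "Jones", "Bob"),
    (4, "Smith", "Carol"),
]

# An "index" on firstName: (field value, pointer to the row), sorted by value.
first_name_index = sorted((row[2], i) for i, row in enumerate(table))

def lookup(index, value):
    """Binary-search the index, then follow the pointers back to the rows."""
    pos = bisect.bisect_left(index, (value, -1))  # -1 sorts before any pointer
    rows = []
    while pos < len(index) and index[pos][0] == value:
        rows.append(table[index[pos][1]])
        pos += 1
    return rows

print(lookup(first_name_index, "Alice"))  # [(2, 'Smith', 'Alice')]
```

Because the index is sorted, `bisect` can find the first matching entry in O(log n) steps, and duplicates (a non-key field) are adjacent, so the scan stops as soon as a higher value is reached, exactly as described above.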

The downside to indexing is that indexes require additional space on the disk. In a MyISAM database the indexes for a table are stored together in one file, which can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

How does it work?

Firstly, let’s outline a sample database table schema;

Field name          Data type      Size on disk
id (Primary key)    Unsigned INT   4 bytes
firstName           Char(50)       50 bytes
lastName            Char(50)       50 bytes
emailAddress        Char(100)      100 bytes

Note: char was used in place of varchar to allow for an accurate size-on-disk value.
This sample database contains five million rows and is unindexed. The performance of several queries will now be analyzed: a query using the id (a sorted key field) and one using the firstName (a non-key, unsorted field).

Example 1

Given our sample database of r = 5,000,000 records of a fixed size, giving a record length of R = 204 bytes, stored in a MyISAM database using the default block size B = 1,024 bytes, the blocking factor of the table is bfr = (B/R) = 1024/204 = 5 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.

A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted instead, requiring an average of log2 1000000 = 19.93, or 20 block accesses. Instantly we can see this is a drastic improvement.

The firstName field, however, is neither sorted (so a binary search is impossible) nor unique, so the table must be searched to the very end: exactly N = 1,000,000 block accesses. It is this situation that indexing aims to correct.
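The arithmetic in this example can be checked with a short Python sketch:

```python
import math

B = 1024               # MyISAM default block size, in bytes
R = 4 + 50 + 50 + 100  # record length: id + firstName + lastName + emailAddress
r = 5_000_000          # rows in the sample table

bfr = B // R            # blocking factor: whole records per block
N = math.ceil(r / bfr)  # blocks spanned by the table

print(bfr, N)                   # 5 1000000
print(N // 2)                   # linear search on the id key field: 500000
print(math.ceil(math.log2(N)))  # binary search on the sorted id: 20
print(N)                        # full scan on the unsorted firstName: 1000000
```

Note the floor in `B // R`: a record is not split across blocks, so the 4 leftover bytes per block (1024 - 5 x 204) are simply wasted.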

Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, and therefore fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below;

Field name          Data type   Size on disk
firstName           Char(50)    50 bytes
(record pointer)    Special     4 bytes

Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.

Example 2

Given our sample database of r = 5,000,000 records with an index record length of R = 54 bytes and the default block size B = 1,024 bytes, the blocking factor of the index is bfr = (B/R) = 1024/54 = 18 records per disk block. The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778 blocks.

Now a search using the firstName field can utilise the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08, or 19 block accesses. Finding the address of the actual record requires a further block access to read it, bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000 block accesses required by the non-indexed table.
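The index arithmetic can likewise be verified in Python:

```python
import math

B = 1024          # block size, in bytes
R_index = 50 + 4  # index record: firstName (50 bytes) + record pointer (4 bytes)
r = 5_000_000     # index entries, one per table row

bfr = B // R_index            # 18 index records per block
N_index = math.ceil(r / bfr)  # blocks needed to hold the index

# Binary search of the index, plus one access to fetch the actual record.
accesses = math.ceil(math.log2(N_index)) + 1

print(bfr, N_index, accesses)  # 18 277778 20
```

Twenty block accesses against a million: the indexed lookup on firstName now costs about the same as the binary search on the sorted id field did.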

When should it be used?

Given that creating an index requires additional disk space (277,778 blocks extra in the above example), and that too many indexes can cause issues arising from file-system size limits, careful thought must be given to selecting the correct fields to index.

Since indexes are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would simply be a waste of disk space and of processing time on every insert and delete operation, and thus should be avoided.

Also, given the nature of a binary search, the cardinality, or uniqueness, of the data is important. Indexing a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would narrow each lookup to roughly 5,000 of the 5,000,000 records. With such a low cardinality the effectiveness is reduced to little better than a linear search, and the query optimizer will typically avoid using an index when a query would match a large fraction (roughly 30% or more) of the records, effectively making the index a waste of space.
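A quick back-of-the-envelope calculation, assuming the values are evenly distributed across the rows, shows how cardinality determines how much an index narrows a search:

```python
r = 5_000_000  # rows in the sample table

def rows_per_value(cardinality):
    """Approximate rows matched per distinct value, assuming an even spread."""
    return r // cardinality

print(rows_per_value(2))      # 2500000 -> the index barely narrows the search
print(rows_per_value(1_000))  # 5000 -> each lookup touches ~0.1% of the rows
```

The higher the cardinality relative to the row count, the fewer rows each indexed lookup has to touch, which is why unique or near-unique fields benefit most from indexing.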

Xenph Yan
very well said!
Yoely
this is a great answer. i wish that i could give it more than one up-vote.
Andrew Garrison
A: 

Like the rest, I can agree to having a standard way to do things; however, it will be very difficult to achieve a decent level of standardisation if it is up to the user to make sure it is done correctly.

For example, I also like that the question and the answer are separated from each other, as pointed out by sparkes, but then you have the minor nuisance of modding two items up: the question itself and the answer to it. This will (I think) eventually lead to users giving the answer directly in the question, as other users probably won't mod both the question and the answer up (just as is the case here).

It is therefore my thought that it should be up to the website creators to allow us users to keep a level of standardisation throughout stackoverflow, by providing the means to do so in any way they see fit.

Now, I don't want to turn this thread into a second http://stackoverflow.uservoice.com/, and I think the rest of the beta testers should avoid doing so in other topics as well, but this is, again, another example proving to me that reaching a high enough level of standardisation is almost impossible without having the means to do it.

Sven
+1  A: 

I'm starting to think that, given that self-answerers will be rare, either they will agree to conform to a standard (question, answer, and article as a tag) or users with 1000+ rep will modify these posts to a standard that most people agree on.

Having the creators write custom code for a differing use of the website isn't efficient, especially when we can band together and enforce a de facto standard.

Xenph Yan
A: 

I understand your point, and I too believe in the power of the crowd to keep things in balance, and I strongly believe in having as much freedom as humanly possible in doing things.
This isn't the first site with user-driven modding and content that, if played out correctly, just works.

But on the other side, if there are no guidelines for the "standard way to do things", then who can judge whether the power users are abusing their abilities? Who can or will stand up against them if they are the ones forcing others to do things the way they see them?
If that happens, you create a trust problem among your users, and stackoverflow will become a place where people feel they must undergo the wrath of the power users, whereas if the standard is laid down by the creators themselves it won't feel that way.

Sven
A: 

The explanation was simple and excellent. Great info! Many thanks.

Nelson Anthony