Hi,

Typically, databases are designed as below to allow multiple types for an entity.

Entity Name | Type | Additional Info

In a bank database, for example, the entity name could be an account number and the type could be 'savings', 'current', etc.

Usually the type is some kind of string, and there may be additional information associated with each entity type.

Queries are normally posed like this: find the account numbers of a particular type; find the account numbers of type X that have a balance greater than 1 million.

To answer these queries, the query analyzer will use an index if one exists on the relevant column; otherwise it will do a full scan of all the rows.
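For concreteness, the queries above might look like this in SQL (the table and column names are made up purely for illustration):

    -- Hypothetical accounts table for the bank example above
    CREATE TABLE Accounts (
        AccountNumber INT PRIMARY KEY,
        AccountType   VARCHAR(20),     -- 'savings', 'current', ...
        Balance       DECIMAL(18,2)
    );

    -- "Find account numbers of a particular type"
    SELECT AccountNumber FROM Accounts WHERE AccountType = 'savings';

    -- "Find account numbers of type X with balance greater than 1 million"
    SELECT AccountNumber FROM Accounts
    WHERE AccountType = 'savings' AND Balance > 1000000;

    -- Without an index on AccountType, both queries need a full table scan;
    -- a conventional index lets the engine seek instead:
    CREATE INDEX IX_Accounts_Type_Balance ON Accounts (AccountType, Balance);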

I am thinking about the following optimization: why not store a hash or integral value of each column's data in the actual table, chosen so that the ordering property is maintained, making comparisons easy? (A rough sketch follows the list of advantages below.)

This has the following advantages:

1. The table size will be a lot smaller, because we store small fixed-size values for each column.
2. We can build a clustered B+ tree index on the hash values of each column to retrieve the rows matching, greater than, or smaller than some value.
3. The corresponding values can be retrieved easily by keeping the B+ tree index in main memory and looking up the matching rows.
4. Infrequently used values never need to be retrieved.
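One way the idea could be approximated is with a dictionary table that assigns small integer codes in sorted order, so that comparisons on the codes mirror comparisons on the original values. All names below are hypothetical; this is only a sketch of the idea, not a finished design:

    -- Dictionary: each distinct string value gets a small integer code,
    -- assigned in sorted order so code comparisons mirror value comparisons
    CREATE TABLE AccountTypeDict (
        TypeCode  INT PRIMARY KEY,       -- e.g. 1 = 'current', 2 = 'savings'
        TypeValue VARCHAR(20) UNIQUE
    );

    -- The main table stores only the compact codes instead of the full strings
    CREATE TABLE AccountsEncoded (
        AccountNumber INT PRIMARY KEY,
        TypeCode      INT REFERENCES AccountTypeDict(TypeCode),
        Balance       DECIMAL(18,2)
    );

    -- Predicates are evaluated on the codes; the original strings are
    -- looked up only when they actually have to be returned
    SELECT a.AccountNumber, d.TypeValue
    FROM AccountsEncoded a
    JOIN AccountTypeDict d ON d.TypeCode = a.TypeCode
    WHERE a.TypeCode = 2;

Note that a code assignment like this only keeps the ordering property as long as no new value has to slot in between existing codes; a plain hash would not preserve ordering at all.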

I have more optimizations in mind; I will post those based on the feedback to this question.

I am not sure if this is already implemented in databases; it is just a thought.

Thank you for reading this.

-- Bala

Update:

I am not trying to emulate what the database already does. Normally, indexes are created by the database administrator. I am proposing a physical schema with indexes on all the fields in the database, so that the table size is reduced and certain queries become easy to answer.

Update (replies to Joe's answer):

How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.

In a typical table, all the physical data is stored. By generating a hash value for each column's data, I am storing only the hash value in the actual table. I agree that this does not reduce the size of the database, but it does reduce the size of the table. It is useful when you don't need to return all the column values.

Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.

There can be only one clustered index on a table, and all other indexes have to be nonclustered. With my approach, I would effectively have a clustered index on all the values of the database, which should improve query performance.
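For reference, the single-clustered-index limitation looks like this in SQL Server (names are illustrative); any additional ordering has to be a nonclustered index, which keeps its own sorted copy of the key columns:

    CREATE CLUSTERED INDEX CIX_Accounts_Number ON Accounts (AccountNumber);  -- succeeds
    CREATE CLUSTERED INDEX CIX_Accounts_Type   ON Accounts (AccountType);    -- fails: only one clustered index per table
    CREATE NONCLUSTERED INDEX IX_Accounts_Type ON Accounts (AccountType);    -- allowed instead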

Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?

The basic idea is that instead of creating a separate table for each column to get efficient access, we do it at the physical level.

Now each row of the table will look like this:

Row1: OrderedHash(Column1), OrderedHash(Column2), OrderedHash(Column3)
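To make that concrete, a query against such a layout would compare encoded values rather than the originals. This is purely illustrative, with OrderedHash standing in for whatever order-preserving encoding is chosen and made-up table and column names:

    -- Conceptually, "WHERE Column2 > 'savings'" becomes a comparison on the code:
    SELECT Column1Code, Column2Code, Column3Code
    FROM EncodedTable
    WHERE Column2Code > 2;   -- 2 = OrderedHash('savings') in this made-up encoding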

A: 

How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.

Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.

Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?

Joe
please see my comments above.
Algorist
A: 

I don't think your approach is very helpful.

Hash values only help with equality/inequality comparisons, not with less-than/greater-than comparisons, unlike pretty much every database index.

Even with (in)equality, hash functions do not offer a 100% guarantee of having given you the right answer, as hash collisions can happen, so you will still have to fetch and compare the original value - boom, you just lost what you wanted to save.
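In SQL Server terms, for example, this is why a CHECKSUM-based lookup is normally written with the original column repeated in the predicate: the hash only narrows the candidates, and the real value still has to be compared (names reused from the illustrative table above):

    SELECT AccountNumber
    FROM Accounts
    WHERE CHECKSUM(AccountType) = CHECKSUM('savings')  -- may match colliding values
      AND AccountType = 'savings';                      -- re-check the original to weed out collisions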

You can have the rows in a table ordered only one way at a time. So if you have an application where you have to order rows differently in different queries (e.g. query A needs a list of customers ordered by their name, query B needs a list of customers ordered by their sales volume), one of those queries will have to access the table out-of-order.

If you don't want the database to have to work around columns you do not use in a query, then use indexes with extra data columns - if your query is ordered according to that index, and your query only uses columns that are in the index (columns the index is based on plus columns you have explicitly added into the index), the DBMS will not read the original table.
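In SQL Server, for example, this is the INCLUDE clause on a nonclustered index; a covering index such as the hypothetical one below lets the query be answered from the index alone, without reading the base table:

    -- Index keyed on AccountType, carrying Balance as an extra (included) column
    CREATE NONCLUSTERED INDEX IX_Accounts_Type_Covering
        ON Accounts (AccountType)
        INCLUDE (Balance);

    -- Can be satisfied entirely from the index above:
    SELECT AccountType, Balance
    FROM Accounts
    WHERE AccountType = 'savings';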

Etc.

Bandi-T
+1  A: 

Google for "hash index". For example, in SQL Server such an index is created and queried using the CHECKSUM function.

This is mainly useful when you need to index a column which contains long values, e.g. varchars which are on average more than 100 characters or something like that.
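A minimal sketch of that pattern, assuming SQL Server and made-up names: hash the long value into a computed CHECKSUM column, index the small hash, and then filter on both the hash and the original value, since CHECKSUM can collide:

    CREATE TABLE Documents (
        DocId    INT PRIMARY KEY,
        LongText VARCHAR(800),
        TextHash AS CHECKSUM(LongText)          -- computed hash column
    );

    -- Index the small integer hash instead of the long varchar
    CREATE INDEX IX_Documents_TextHash ON Documents (TextHash);

    -- Seek on the hash, then confirm against the original value
    SELECT DocId
    FROM Documents
    WHERE TextHash = CHECKSUM('some long string...')
      AND LongText = 'some long string...';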

Todd Owen