views:

172

answers:

7

Suppose you have a dense table with an integer primary key, where you know the table will contain 99% of all values from 0 to 1,000,000.

A super-efficient way to implement such a table is an array (or a flat file on disk), assuming a fixed record size.

Is there a way to achieve similar efficiency using a database?

Clarification - When stored in a simple table/array, access to an entry is O(1): just a memory read (or a read from disk). As I understand it, databases store their records in trees, so they cannot achieve identical performance; access to an average record takes a few hops.
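For concreteness, here is a minimal sketch of that array/flat-file approach (Python is used for illustration; the record layout - one 8-byte integer per key - is an assumption, not part of the question). The offset of a record is just key * record_size, so a lookup is a single seek and a fixed-size read:

    import struct

    RECORD_SIZE = 8          # assumed layout: one 8-byte signed integer per key
    NUM_KEYS = 1_000_000

    def write_table(path, values):
        # The record for key k lives at byte offset k * RECORD_SIZE,
        # so no index structure is needed at all.
        with open(path, "wb") as f:
            for key in range(NUM_KEYS):
                f.write(struct.pack("<q", values.get(key, 0)))

    def read_value(path, key):
        # O(1) lookup: one seek plus one fixed-size read.
        with open(path, "rb") as f:
            f.seek(key * RECORD_SIZE)
            return struct.unpack("<q", f.read(RECORD_SIZE))[0]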

A: 

If you've got a decent number of records in a DB (and 1MM is decent, though not really that big), then indexes are your friend.

Eric
A: 

You're talking about old fixed-record-length flat files. And yes, they are super-efficient compared to databases, but like struct/value arrays vs. classes, they simply do not have the kinds of features that we typically expect today.

Things like:

  • searching on different columns/combinations
  • variable-length columns
  • nullable columns
  • editability
  • restructuring
  • concurrency control
  • transaction control, etc.
RBarryYoung
A: 

Create a table with an ID column and a bit column. Use a clustered index for the ID column (the ID column is your primary key). Insert all 1,000,000 elements (do so in order or it will be slow). This is somewhat inefficient in terms of space (you're using O(n log n) space instead of O(n) space).

I don't claim this is efficient, but it will be stored in a similar manner to how an array would have been stored.

Note that the ID column can be marked as an identity (auto-increment) column in most DB systems, in which case you can just insert 1,000,000 items and it will do the counting for you. I am not sure whether such a DB avoids explicitly storing the counter's value, but if it does, you'd only end up using O(n) space.
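A rough sketch of this recipe (using SQLite from Python as a stand-in, since its INTEGER PRIMARY KEY plays the role of the clustered index; the table and column names are illustrative):

    import sqlite3

    conn = sqlite3.connect("dense.db")
    # In SQLite, an INTEGER PRIMARY KEY is the table's clustered B-tree key,
    # so the rows themselves are stored in id order.
    conn.execute("CREATE TABLE dense (id INTEGER PRIMARY KEY, flag INTEGER NOT NULL)")
    # Inserting in key order only ever appends at the right edge of the tree,
    # which avoids page splits and keeps the load fast.
    conn.executemany(
        "INSERT INTO dense (id, flag) VALUES (?, 1)",
        ((i,) for i in range(1_000_000)),
    )
    conn.commit()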

Brian
A: 

When your primary key is an integer sequence, it can be a good idea to have a reverse-key index. This spreads contiguous values apart in the index tree, so concurrent inserts don't all land on the same index block. However, there is a catch - with reverse indexes you will not be able to do range searching.
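A toy illustration of the effect (Python; real reverse-key indexes, e.g. Oracle's CREATE INDEX ... REVERSE, reverse the key's bytes rather than its decimal digits, which are reversed here only to make the scattering visible):

    def reverse_key(key, width=7):
        # Reverse the zero-padded decimal digits of the key.
        return int(str(key).zfill(width)[::-1])

    for key in range(100, 104):
        print(key, "->", reverse_key(key))
    # 100 -> 10000, 101 -> 1010000, 102 -> 2010000, 103 -> 3010000:
    # consecutive keys land far apart in the index, which is why a range
    # scan over 100..103 no longer maps to one contiguous stretch.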

Sathya
A: 

The big question is: efficient for what?

For Oracle, ideas might include:

  • read access by id: an index-organized table (this might be what you are looking for; a sketch follows below)
  • insert only, no updates: no indexes, no spare space
  • read access via full table scan: compression
  • highly concurrent writes when the id comes from a sequence: a reverse-key index
  • for the actual question, precisely as asked: write all rows into a single BLOB (the table contains one column and one row). You might be able to access this like an array, but I am not sure, since I don't know what operations are possible on BLOBs. Even if it works, I don't think this approach would be useful in any realistic scenario.
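Here is a sketch of the index-organized idea, shown with SQLite from Python (its WITHOUT ROWID tables are stored the same way; the Oracle equivalent would be a table created with ORGANIZATION INDEX):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # A WITHOUT ROWID table stores the whole row inside the primary-key
    # B-tree: SQLite's analogue of an Oracle index-organized table.
    conn.execute("""
        CREATE TABLE dense (
            id  INTEGER PRIMARY KEY,
            val INTEGER NOT NULL
        ) WITHOUT ROWID
    """)
    conn.executemany("INSERT INTO dense VALUES (?, ?)",
                     ((i, i * 2) for i in range(1000)))
    # A lookup by id is a single B-tree descent; there is no extra hop
    # from an index entry to a separate heap row.
    print(conn.execute("SELECT val FROM dense WHERE id = ?", (42,)).fetchone())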
Jens Schauder
+1  A: 

Perhaps I don't understand your question, but a database is designed to handle data. I work all day long with databases that have millions of rows. They are efficient enough.

I don't know what your definition of "achieve similar efficiency using a database" is. In a database (in my experience), what exactly you are trying to do is what matters for performance.

If you simply need a single record based on a primary key, then the database should naturally be efficient enough, assuming it is properly structured (for example, 3NF).

Again, you need to design your database to be efficient for what you need. Furthermore, consider how you will write queries against the database in a given structure.

In my work, I've been able to cut query execution time from >15 minutes to 1 or 2 seconds simply by optimizing my joins, the WHERE clause, and the overall query structure. Proper indexing, obviously, is also important.

Also, consider the database engine you are going to use. I've been assuming SQL Server or MySQL, but those may not be right for you. I've heard (but have never tested the idea) that SQLite is very quick - faster than either of the aforementioned. There are many other options as well, I'm sure.

Update: Based on your explanation in the comments, I'd say no - you can't. You are asking about mechanisms designed for two completely different things. A database persists data over long periods of time and is usually optimized for many connections and many reads/writes. In your description, the array data in memory is accessed by a single program, and that program owns the memory; it's not (usually) shared. I do not see how you could achieve the same performance.

Another thought: the absolute closest thing you could get to this, in SQL Server specifically, is a table variable. A table variable is (in theory) held in memory only. I've heard people refer to table variables as SQL Server's "array". Any regular table write or CREATE statement prompts the RDBMS to write to disk (I think first to the log and then to the data files). Large data reads can also cause the DB to write to private temp tables to store data for later use.

Frank V
I don't have any specific performance goals. What I'm wondering is this: given data organized as I specified (integer primary key, dense, almost no gaps in the data), the best way to store it is an array. That way I get guaranteed O(1) access with only a single memory read. Can you configure a database to store rows in a similar manner (not in a tree, for instance)?
ripper234
+1  A: 

There is not much you can do to specify how data will be physically stored in a database. The most you can do is specify whether data and indexes will be stored separately, or whether the data will be stored in one index tree (a clustered index, as Brian described).

But in your case this does not matter at all, for the following reasons:

  1. All databases make heavy use of caching. 1,000,000 records can hardly exceed 1 GB of memory, so your complete database will quickly end up in the database cache.
  2. If you are reading a single record at a time, the main overhead you will see is accessing data over the database protocol. The process goes something like this:
    • connect to the database - open a communication channel
    • send the SQL text from the application to the database
    • the database analyzes the SQL (parses it, checks whether the command was compiled previously, compiles it if it is issued for the first time, ...)
    • the database executes the SQL; after a few executions the data from your example will be cached in memory, so execution will be very fast
    • the database packs the fetched records for transport to the application
    • the data is sent over the communication channel
    • the database component in the application unpacks the received data into some dataset representation (e.g. an ADO.NET dataset)

In your scenario, executing the SQL and finding the records takes very little time compared to the total time needed to get the data from the database to the application. Even if you could force the database to store its data in an array, there would be no visible gain.
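To see how much of the cost is per-statement overhead rather than the lookup itself, here is a small sketch (Python with in-process SQLite; there is no network hop here at all, so a client/server database would show an even bigger gap):

    import sqlite3, time

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     ((i, i) for i in range(1_000_000)))

    # One statement per row: pays statement overhead on every lookup.
    start = time.perf_counter()
    for i in range(100_000):
        conn.execute("SELECT val FROM t WHERE id = ?", (i,)).fetchone()
    row_at_a_time = time.perf_counter() - start

    # One statement for the same rows: the lookup cost itself is tiny.
    start = time.perf_counter()
    rows = conn.execute("SELECT val FROM t WHERE id < 100000").fetchall()
    one_query = time.perf_counter() - start

    print(f"row-at-a-time: {row_at_a_time:.3f}s  single query: {one_query:.3f}s")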

zendar