Suppose you're trying to map a domain object to a database schema, and in your code that domain object has a hashtable/dictionary member, like so:

public class SpaceQuadrant : PersistentObject
{

    public SpaceQuadrant()
    {
    }

    public virtual Dictionary<SpaceCoordinate, SpaceObject> Space
    {
        get;
        set;
    }
}

A Dictionary is just a hashtable mapping keys to values. I've come up with multiple ways to do this, creating various join tables or loading techniques, but they all kind of suck in terms of getting the O(1) access time that you get with a hashtable.

How would you represent the SpaceQuadrant, SpaceCoordinate, and SpaceObject in a database schema? A simple schema description in code would be nice, e.g.:

table SpaceQuadrant
{
    ID int not null primary key,
    EntryName varchar(255) not null,
    SpaceQuadrantJoinTableId int not null
                 foreign key references ...anothertable...
}

but any thoughts at all would be nice as well, thanks for reading!

More Information:

Thanks for the great answers already! I've only skimmed them, and I want to take some time to think about each before I respond.

If you think there is a better way to define these classes, then by all means show me an example; any language you're comfortable with is cool.

+2  A: 

Relations are not hash tables; they are sets.

I wouldn't organize the database using the coordinates as the key. What if an object changes location? Instead, I would probably treat coordinates as attributes of an object.

Also, I assume there is a fixed number of dimensions, for example, three. If so, then you can store these attributes of an object in fixed columns:

CREATE TABLE SpaceQuadrant (
  quadrant_id INT NOT NULL PRIMARY KEY,
  quadrant_name VARCHAR(20)
  -- other attributes
);

CREATE TABLE SpaceObject (
  object_id INT NOT NULL PRIMARY KEY,
  x NUMERIC(9,2) NOT NULL,
  y NUMERIC(9,2) NOT NULL,
  z NUMERIC(9,2) NOT NULL,
  object_name VARCHAR(20) NOT NULL,
  -- other attributes
  quadrant_id INT NOT NULL,
  FOREIGN KEY (quadrant_id) REFERENCES SpaceQuadrant(quadrant_id)
);

In your object-oriented class, it's not clear why your objects are in a dictionary. You mention accessing them in O(1) time, but why do you do that by coordinate?

If you're using that to optimize finding objects that are near a certain point (the player's spaceship, for instance), you could also build into your SQL query that populates this SpaceQuadrant a calculation of every object's distance from that given point, and sort the results by distance.
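To make that distance-sorted query concrete, here's a small sketch using SQLite from Python. The table and column names follow the schema above; the sample objects and the ship's position are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SpaceQuadrant (
  quadrant_id INTEGER NOT NULL PRIMARY KEY,
  quadrant_name VARCHAR(20)
);
CREATE TABLE SpaceObject (
  object_id INTEGER NOT NULL PRIMARY KEY,
  x NUMERIC(9,2) NOT NULL,
  y NUMERIC(9,2) NOT NULL,
  z NUMERIC(9,2) NOT NULL,
  object_name VARCHAR(20) NOT NULL,
  quadrant_id INTEGER NOT NULL,
  FOREIGN KEY (quadrant_id) REFERENCES SpaceQuadrant(quadrant_id)
);
""")
conn.execute("INSERT INTO SpaceQuadrant VALUES (1, 'Alpha')")
conn.executemany(
    "INSERT INTO SpaceObject VALUES (?, ?, ?, ?, ?, 1)",
    [(1, 0.0, 0.0, 0.0, 'station'),
     (2, 3.0, 4.0, 0.0, 'probe'),
     (3, 10.0, 10.0, 10.0, 'asteroid')])

# Squared distance from the ship at (1, 1, 0), nearest objects first.
ship = (1.0, 1.0, 0.0)
rows = conn.execute("""
    SELECT object_name,
           (x-?)*(x-?) + (y-?)*(y-?) + (z-?)*(z-?) AS dist2
    FROM SpaceObject
    WHERE quadrant_id = 1
    ORDER BY dist2
""", (ship[0], ship[0], ship[1], ship[1], ship[2], ship[2])).fetchall()
for name, d2 in rows:
    print(name, d2)
```

Note that without spatial support this is a full scan over the quadrant's objects, which is fine for small quadrants but won't use an index.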

I don't know enough about your program to know if these suggestions are relevant. But are they at least making you think of different ways of organizing the data?

Bill Karwin
Doing proximity searches that way would be dead slow - you can't use an index, so it's a full table scan plus a calculation for every row. Dedicated geolocation support could make it faster.
Blaisorblade
The way the OP is doing it now, he has to load the entire contents of the table anyway, before he can benefit from his O(1) hash-search, doesn't he?
Bill Karwin
Well, I was trying to avoid loading the whole table, maybe through lazy-loading or something. I'm really just exploring how to design both the code objects and the database schema in such a way that it puts as little load on the database as possible.
Mark Rogers
+2  A: 

In the simplest case, the dictionary has a key which would map to the primary key of a table - so that when you specify the values of the key, you can immediately find the matching data via a simple lookup.

In this case, you would need a table SpaceQuadrant with any general (single-valued) attributes that describe or characterize a space quadrant. The SpaceQuadrant table would have a primary key, possibly a generated ID, possibly a natural value. The hashtable would then consist of a table with the primary key value for cross-referencing the SpaceQuadrant, with the position (a SpaceCoordinate) and the attributes of the quadrant and coordinate.

Now, if you have an extensible DBMS, you can define a user-defined type for the SpaceCoordinate; failing that, you can use a trio of columns - x, y, z or r, theta, rho, for example - to represent the position (SpaceCoordinate).

In general terms, the structure I'm describing is quite similar to Bill Karwin's; the key (pun not intended until after I was rereading the message) difference is that it is perfectly OK in my book to have the position as part of the primary key of the subordinate table, if you are sure that's the best way to organize it. You might also have an object ID column as an alternative candidate key. Alternatively, if objects have an existence independent of the space quadrant they happen to be in at the moment (or can occupy multiple positions, because they aren't points but space stations or something), then you might put the SpaceObject in a separate table. What is best depends on information we don't have available to us.

You should be aware of the limitations of using a SpaceCoordinate as part of the primary key:

  • no two objects can occupy the same position (that's called a collision in a hash table, as well as in 3D space),
  • if the position changes, then you have to update the key data, which is more expensive than an update of non-key data,
  • proximity lookups will be hard - exact lookups are easy enough.
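To make the first limitation concrete, here's a small SQLite sketch (the table name is illustrative, not from the answer) showing the key "collision" when two objects try to occupy the same position:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE SpaceObjectPosition (
  x NUMERIC NOT NULL, y NUMERIC NOT NULL, z NUMERIC NOT NULL,
  object_name TEXT NOT NULL,
  PRIMARY KEY (x, y, z)
)""")
conn.execute("INSERT INTO SpaceObjectPosition VALUES (1, 2, 3, 'probe')")

collision = ""
try:
    # A second object at the same coordinates violates the primary key.
    conn.execute("INSERT INTO SpaceObjectPosition VALUES (1, 2, 3, 'station')")
except sqlite3.IntegrityError as e:
    collision = str(e)
print(collision)
```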

The same is true of your dictionary in memory; if you change the coordinates, you have to remove the record from the old location and place it in the new location in the dictionary (or the language has to do that for you behind the scenes).
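The in-memory version of that rekeying can be sketched in a few lines of Python (all names are illustrative):

```python
# Coordinate tuples as keys, object names as values.
space = {(0, 0, 0): "station", (3, 4, 0): "probe"}

def move(space, old, new):
    # Rekeying: remove the entry at the old position,
    # reinsert it at the new position.
    space[new] = space.pop(old)

move(space, (3, 4, 0), (5, 5, 0))
assert (3, 4, 0) not in space
print(space[(5, 5, 0)])  # prints "probe"
```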

Jonathan Leffler
+2  A: 

A dictionary is a table. The hash is a question of what kind of index is used. Most RDBMSs assume that tables are big and densely packed, which makes a hashed index inappropriate.

table SpaceQuadrant { 
    ID Primary Key,
    -- whatever other attributes are relevant
}

table Space {
    SpaceCoordinate Primary Key,
    Quadrant Foreign Key SpaceQuadrant(ID),
    SpaceObject -- whatever the object is
}

Your Space objects have FK references to the Quadrant in which they're located.

Depending on your RDBMS, you might be able to find a hash-based index that gets you the performance you're hoping for. MySQL, for example, supports HASH indexes on tables that use the MEMORY (formerly HEAP) storage engine.

S.Lott
+1  A: 

First, dedicated support for geo-located data exists in many databases: different algorithms can be used (a spatial analogue of the B-Tree, for instance), and support for proximity searches will probably exist.

Since you have a different hash table for each SpaceQuadrant, you'd need something like (edited from S.Lott's post):

table Space {
    SpaceCoordinate,
    Quadrant Foreign Key SpaceQuadrant(ID),
    SpaceObject -- whatever the object is (by ID)
    Primary Key(SpaceCoordinate, Quadrant)
}

This is a (SpaceCoordinate, Quadrant) -> SpaceObjectId dictionary.
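As a sketch, here's that dictionary-as-table in SQLite from Python, with the coordinate expanded into x, y, z columns (all names are illustrative). An exact-match SELECT on the full composite key is the dictionary lookup:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Space (
  x NUMERIC NOT NULL, y NUMERIC NOT NULL, z NUMERIC NOT NULL,
  quadrant_id INTEGER NOT NULL,
  object_id INTEGER NOT NULL,
  PRIMARY KEY (x, y, z, quadrant_id)
)""")
conn.execute("INSERT INTO Space VALUES (1, 2, 3, 7, 42)")

# Dictionary-style lookup: ((1, 2, 3), quadrant 7) -> object id.
row = conn.execute(
    "SELECT object_id FROM Space WHERE x=? AND y=? AND z=? AND quadrant_id=?",
    (1, 2, 3, 7)).fetchone()
print(row[0])  # prints 42
```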

=====

Now, about your O(1) performance concern: there are several reasons why it's misplaced.

In many databases you can use a hash index for memory-based tables, as somebody told you. But if you need persistent storage, you'd have to update two tables (the in-memory one and the persistent one) instead of one (unless there is built-in support for this). To discover whether that's worth it, you'd need to benchmark on the actual data, at actual data sizes.

Also, forcing a table into memory can have worse implications.

If something ever gets swapped, you're dead. If you had used a B-Tree (i.e. a normal disk-based index), its algorithms would have minimized the I/O needed; otherwise, all DBMSs would use hash tables and rely on swapping instead of B-Trees. You can try to anticipate whether you'll fit in memory, but...

Moreover, B-Trees are not O(1), but O(log_512(N)) or thereabouts (I know that collapses to O(log N), but bear with me). You'd need (2^9)^4 = 2^36, i.e. about 64Gi entries, for that depth to reach 4, and if you had that much data you'd need a big-iron server anyway for it to fit in memory. So it's almost O(1), and the constant factors are what actually matter.
Ever heard of algorithms with low asymptotic complexity but big constant factors, which only beat simpler ones at impractical data sizes?
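A quick check of that arithmetic, assuming a B-Tree fanout of 512 = 2^9 keys per node:

```python
import math

# Depth d of a B-Tree with fanout 512 covers up to 512**d keys.
assert 512 ** 4 == 2 ** 36  # the figure quoted above

# Levels needed to index 2**36 keys:
depth = math.log(2 ** 36, 512)
print(round(depth))  # prints 4
```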

Finally, I think DB authors are smarter than me and you. Especially given the declarative nature of SQL, hand-optimizing it this way isn't going to pay off. If an index fits in memory, the engine could choose to build and use a hashtable version of the disk index as needed, if it were worth it. Investigate your docs for that.

But the bottom line is that premature optimization is evil, especially when it's of this kind (weird optimizations we think up on our own, as opposed to standard SQL optimizations), and in a declarative language.

Blaisorblade
I ended up doing a variation on this. All the answers were great, but I chose this one because it was closest to the actual solution I chose. If I end up switching it out, I might go back and choose another answer, but I doubt it.
Mark Rogers