views:

160

answers:

2

Hello,

I have a relatively large database tables (just under a million rows) that has mapping like this:

(int, int, int, int) ----> (float, float)

Not all possible int combinations map to a float pair.

Here is where my knowledge of database design ends:

  1. The database size is not a concern for me, but query time is. Should I worry about optimizing the database for searching? Would it be enough if I just did a single query on the above table or should I use some scheme to optimize the way information is stored?

  2. I also need to go the other way, but the problem here is that I will need to find the closest value of my (float, float) pair that matches my parameters, so this will require interpolation of submitted float values. I was going to search for closest value of the first float, then the second float, then for a match in the table. Again, is there anythign I can do to my table to make it better for this kind of search?

Thanks,

+2  A: 

Due to long post size, A summary

  1. Check MySQL's spatial extension
  2. Because your int's sequences matter, you can combine or hash them into a key that MySQL can index and search quicker, saving the ints to refactor the key when they change.
  3. You are in a great position to distribute/parallelize/subset by using the first
    integer as a key to some distribution mechanism, and the second and so on.
  4. The examples presented below are a bit bonkers, possibly causing 2^32 databases each with 2^32 tables then 2^32 rows. They are just examples of using alot of space to increase speed.
  5. I am not a DB expert.

Looks like an injective mapping of float values to integer sequences.

I would go for:

tbl:
varchar my_ints; float; int a, int b, int c, ...

And then interpolate ints into a string:

"abcde" such as "1928 9384 2993 4884" (ignoring spaces)

And then let MySQL do a search on the string index.

Or, as others have suggested, investigate MySQL's Spatial extensions, If they don't apply, here are some ideas which may be of interest:

I would keep all the data in one table (or a spliced table). Don't have foreign keys to floats or anything like that. The join overhead isn't worth it and you're not worried about size. (I don't think you would split the table, just saying :S)

As ints are a basic type in SQL databases...

You can optimize by ensuring that you index on the items you search for alot. If the mapping is deterministic over time, then you can also cache the results.

Database access works in tiers. Optimize your searches (an SQL query debugger such as the MySQL bench stuff might help), then ensure that you arn't querying more than you need to.

If you can pre-hash the integer quad into a single, more indexable value and have:

int_tbl = { int_hash_index, int_a, int_b, int_c, float_value }

Then you may be able to see an overall improvement by translating the search item (in this case the integers) into a single more searchable quantity.

So if you are saying:

SELECT float_value FROM int_tbl WHERE int_a=x AND int_b=y ...

Then you can turn this into:

SELECT float_value FROM int_tbl WHERE int_hash_index=pre_computed_value

Providing that:

  1. The hash is indexable, and represents a perfect mapping between the integers.
  2. The hash can be used as your primary key, allowing duplicate erradication
  3. The performance of the tuple enumeration is improved.

You can keep the ints in there and update the indexed hash when they change using triggers or something.

Point 3 seems obvious, but you need to benchmark and test.

Also depending on the computation of f(a,b,c,d)->float for example:

f() = (a*b*c*d)*pi

Then you might be able to cache an application side tuple that maps to the float and use that for common sequences of its. If you are using the database for data-mining then this gives quick-apps a common reference cache and also allows you do to background work on the entire set.

Just some ideas, I am no expert.

Edit due to GPS information First check the spatial extensions to mysql ...

below you said it is GPS data. In this case, the integers relate to eachother I presume. And their sequence matters ... you could glue them together into a binary value like:

tbl_key_blob = [int_a_bits int_b_bits int_c_bits]

And use that as your key. I don't know how efficient MySQL's indexing of arbitary binary values is however. I imagine a search tree would be quite efficient.

You can also have separate tables, if, a in (a,b,c,d) is positional. So

int_map_1234_tbl = { int_b, int_c, int_d, float }

Where 1234 is int_a = 1234 so if you have a query:

SELECT * FROM int_tbl WHERE a=x AND b=y ...

You can select table based on interpolation of a:

SELECT * FROM int_map_[a's value here]_tbl WHERE b=y AND c=z ...

of course you can do this to the n-1 degree and end up with:

SELECT * FROM int_map_[c's value here]_tbl WHERE d=d';

But you will need efficient table resolution and that is a mission in terms of size and performance as it is, because you will end up with VERY LARGE indexes of tables.

I am waffling and tired now. If size doesn't matter then you can pre-calculate tons of stuff. The examples above are just space/time trade-offs, such as using more space to store sequences encapsulated in separate tables ... but you may hit the MySQL table limit unless you do something like:

database_[a's value here] -> table_[b's value here]_tbl -> SELECT * FROM that WHERE c=x ...

So you interpolate 'a' into the database name, 'b' into the table and search on C followed by D. All bullshit really until you try things out and benchmark with real queries.

You are in a really great position to distribute queries across many databases, think

node_[a's value].mydbcluster.com->database_[b's value]->table[c's] ...

Providing the distribution of those is even, you will get a fairly even query distribution. It also allows you to fire off queries in parallel depending on their first, second, ..., sequential integer values.

Be benefit to this is that, if any of the following steps fail:

  1. connect to machine with 'a' on it
  2. connect to database with 'b' in it
  3. select table with 'c' and later on it

then you know the combination (a,b,c,d,...,) doesn't exist. And you have to perform those steps anyway.

Might be overkill though, because you could end up with alot of tables/databases unless you get clusters of numbers.

I am no database expert, and when one arrives, I he/she will have comments on this ;) And so on.

good luck

Aiden Bell
Thanks for all the info, certainly helps!
Goro
A: 

Hi Goro,

How do the int groups map to the float pairs? It would help to know how the matching is accomplished.

Crazy Joe Malloy
Hi,The relationship is essentially a giant lookup table, with no mathematical relationship. It's a tranlation from a property address coordinate system to a GPS lat/long location data, so the data is somewhat organized by nature, but completely independent.
Goro
Then a search around proximity, or table splitting on the first thing. See edit.
Aiden Bell
Aiden has covered alot of bases with his post, I'm certain there's something if not several things in there that should help you out. With that said derobert makes an excellent point, MySQL has spatial capabilities that you should certainly investigate if you haven't already - this is just speculation, but you may be reinventing a wheel here.
Crazy Joe Malloy