ansaurus

Question

Efficiently finding unique values in a database table

Answer 1

+9 A:

Use a separate lookup table and store only the id of the message type in your log. This will reduce the size of the log and make it more efficient. It would also normalize your data.

Yuriy Faktorovich 2010-02-11 17:04:38

Answer 2

+2 A:

SELECT  DISTINCT message_type
FROM    message_log

is the most straightforward but not very efficient way.

If you have a list of types that can possibly appear in the log, use this:

SELECT  message_type
FROM    message_types mt
WHERE   message_type IN
        (
        SELECT  message_type
        FROM    message_log
        )

This will be more efficient if message_log.message_type is indexed.

If you don't have this table but want to create one, and message_log.message_type is indexed, use a recursive CTE to emulate loose index scan:

WITH    rows (message_type) AS
        (
        SELECT  MIN(message_type) AS mm
        FROM    message_log
        UNION ALL
        SELECT  message_type
        FROM    (
                SELECT  mn.message_type, ROW_NUMBER() OVER (ORDER BY mn.message_type) AS rn
                FROM    rows r
                JOIN    message_log mn
                ON      mn.message_type > r.message_type
                WHERE   r.message_type IS NOT NULL
                ) q
        WHERE   rn = 1
        )
SELECT  message_type
FROM    rows r
OPTION (MAXRECURSION 0)

Quassnoi 2010-02-11 17:05:37

Wouldn't it be relatively efficient, though, because SQL Server would cache the results and they shouldn't change very often?

RandomBen 2010-02-11 17:08:02

@RandomBen: `SQL Server` caches data pages, not results. A full table or index scan needs to be done anyway, even if all the table (or index) pages are cached. This still takes a long time for a large enough table.

Quassnoi 2010-02-11 17:12:45

Answer 3

+5 A:

Yep, I would definitely go with the separate lookup table. You can then populate it using something like:

INSERT TypeLookup (Type)
SELECT DISTINCT Type
FROM BigMassiveTable

You could then run a top-up job periodically to pull in new types from your main table that don't already exist in the lookup table.
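
A minimal sketch of what such a top-up statement might look like, reusing the TypeLookup and BigMassiveTable names from above. The NOT EXISTS filter and the NULL check are assumptions about how you would want it to behave, not something the original schema dictates:

```sql
-- Top-up: add only the types that are not yet in the lookup table.
-- Assumes TypeLookup(Type) and BigMassiveTable(Type) as in the INSERT above.
INSERT INTO TypeLookup (Type)
SELECT DISTINCT b.Type
FROM   BigMassiveTable AS b
WHERE  b.Type IS NOT NULL
  AND  NOT EXISTS
       (
       SELECT 1
       FROM   TypeLookup AS t
       WHERE  t.Type = b.Type
       );
```

Without an index on BigMassiveTable.Type this still scans the whole table, so it is best suited to an off-peak schedule.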

AdaTheDev 2010-02-11 17:06:46

+1 for treating the ongoing maintenance of the lookup table. If both the big table and the lookup table have "insert timestamps", then the periodic job can be made more efficient by inspecting only "new" records for new message types. Alternatively, an INSERT trigger on `BigMassiveTable` would do the job without a regular batch job.
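
For illustration, the trigger variant could be sketched roughly as follows. The trigger name is hypothetical, and the table and column names are the ones used in the answer above:

```sql
-- Sketch only: keeps TypeLookup in step with inserts into BigMassiveTable.
-- trg_BigMassiveTable_NewTypes is a made-up name.
CREATE TRIGGER trg_BigMassiveTable_NewTypes
ON BigMassiveTable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- "inserted" holds the rows added by the triggering statement.
    INSERT INTO TypeLookup (Type)
    SELECT DISTINCT i.Type
    FROM   inserted AS i
    WHERE  i.Type IS NOT NULL
      AND  NOT EXISTS
           (
           SELECT 1
           FROM   TypeLookup AS t
           WHERE  t.Type = i.Type
           );
END;
```

Two sessions inserting the same new type at the same moment can both pass the NOT EXISTS check, so a unique constraint on TypeLookup.Type is worth having as a backstop.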

pilcrow 2010-02-11 17:31:11

@pilcrow - yeah I think the top-up approach is best, as opposed to a trigger - a trigger will incur a hit for each insert, so I'd keep it as an "off peak" task especially if the addition of new types is not very frequent.

AdaTheDev 2010-02-11 19:17:32

Answer 4

A:

Have you considered an indexed view? Its result set is materialized and persists in storage so that the overhead of the lookup is separated from the rest of whatever you're trying to do.

SQL Server takes care of automagically updating the view when there is a data change which in its opinion would change the contents of the view, so in this respect it's less flexible than an Oracle materialized view.

c. liau 2010-02-11 17:18:20

You cannot index a view over a query with a `DISTINCT` clause.
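
A workaround that is often used is to express the same thing with `GROUP BY`, which an indexed view does allow as long as the select list includes `COUNT_BIG(*)`. A rough sketch, assuming the log lives in dbo.message_log; the view and index names are made up for the example:

```sql
-- Sketch: indexed view that materializes one row per message type.
-- SCHEMABINDING and two-part table names are required for indexed views.
CREATE VIEW dbo.v_message_types
WITH SCHEMABINDING
AS
SELECT  message_type, COUNT_BIG(*) AS cnt
FROM    dbo.message_log
GROUP BY
        message_type;
GO

-- The unique clustered index is what actually materializes the view.
CREATE UNIQUE CLUSTERED INDEX ux_v_message_types
ON      dbo.v_message_types (message_type);
GO

-- NOEXPAND asks the optimizer to read the view's index directly.
SELECT  message_type
FROM    dbo.v_message_types WITH (NOEXPAND);
```

The usual indexed-view prerequisites (the required SET options, for instance) still apply, and every insert into the log now also maintains the view, so this trades write cost for read speed.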

Quassnoi 2010-02-11 17:22:39

Answer 5

A:

The MessageType should be a Foreign Key in the main table to a definition table containing the message type codes and descriptions. This will greatly increase your lookup performance.

Something like

DECLARE @MessageTypes TABLE(
        MessageTypeCode VARCHAR(10),
        MessageTypeDescription VARCHAR(100)
)

DECLARE @Messages TABLE(
        MessageTypeCode VARCHAR(10),
        MessageValue VARCHAR(MAX),
        MessageLogDate DATETIME,
        AdditionalNotes VARCHAR(MAX)
)

With this design, your lookup only needs to query MessageTypes.
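
Table variables cannot carry foreign key constraints, so with permanent tables the same design might look something like the sketch below. The column names are adapted from the table variables above; the MessageId column, the constraint name and the NULL/NOT NULL choices are assumptions made for the example:

```sql
-- Sketch with permanent tables so the foreign key can actually be declared.
CREATE TABLE dbo.MessageTypes (
        MessageTypeCode VARCHAR(10) NOT NULL PRIMARY KEY,
        MessageTypeDescription VARCHAR(100) NOT NULL
);

CREATE TABLE dbo.Messages (
        MessageId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- hypothetical surrogate key
        MessageTypeCode VARCHAR(10) NOT NULL,
        MessageValue VARCHAR(MAX) NULL,
        MessageLogDate DATETIME NOT NULL,
        AdditionalNotes VARCHAR(MAX) NULL,
        CONSTRAINT FK_Messages_MessageTypes
                FOREIGN KEY (MessageTypeCode)
                REFERENCES dbo.MessageTypes (MessageTypeCode)
);

-- The list of types comes from the small table only.
SELECT  MessageTypeCode, MessageTypeDescription
FROM    dbo.MessageTypes
ORDER BY
        MessageTypeCode;
```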

astander 2010-02-11 17:31:35

Answer 6

+1 A:

I just wanted to state the obvious: normalize the data.

message_types
message_type | message_type_name

messages
message_id | message_type | message_type_name

Then you can just do without any cached DISTINCT:

For your dropdown

SELECT * FROM message_types

For your retrieval

SELECT * FROM messages WHERE message_type = ? 

SELECT m.*, mt.message_type_name FROM messages AS m
JOIN message_types AS mt
ON ( m.message_type = mt.message_type)

I'm not sure why you would want a cached DISTINCT which you'll have to update, when you can slightly tweak the schema and have one with referential integrity (RI).

Evan Carroll 2010-02-11 17:52:06

Makes sense. +1. Not sure who downvoted or why. IMO downvotes without explanations are not very nice.

AlexKuznetsov 2010-02-16 20:06:57

Why is this flagged as spam?

Jon B 2010-02-18 14:05:39

Answer 7

+1 A:

Create an index on the message type:

CREATE INDEX IX_Messages_MessageType ON Messages (MessageType)

Then to get a list of unique Message Types, you run:

SELECT DISTINCT MessageType
FROM Messages
ORDER BY MessageType

Because the index is physically sorted in order of MessageType, SQL Server can very quickly and efficiently scan through the index, picking up a list of unique message types.

It does not perform badly - it's what SQL Server is good at.

Admittedly, you can save some space by having a "message types" table. And if you only display a few messages at a time, the bookmark lookup, as it joins back to the MessageTypes table, won't be a problem. But if you start displaying hundreds or thousands of messages at a time, the join back to MessageTypes can get pretty expensive and is needless; it will be faster to have the MessageType stored with the message.

But I would have no problem with creating an index on the MessageType column and selecting distinct. SQL Server loves that sort of thing. But if you find it to be a real load on your server once you're getting dozens of hits a second, then follow the other suggestion and cache them in memory.

My personal solution would be:

create the index
select distinct

and if I still had problems

cache in memory that expires after 30 seconds

As for the normalized/denormalized issue: normalizing saves space, at the cost of CPU when joins are constantly performed. But the logical point of normalization is to avoid duplicate data, which can lead to inconsistent data.

Are you planning on changing the text of a message type? If the text were stored with the messages, you would have to update all rows.

Or is there something to be said for the fact that at the time of the message the message type was "Client response requested"?

Ian Boyd 2010-02-11 18:45:42

Answer 8

A:

As others have said, create a separate table of message types. When you add a record to the message table, check if the message type already exists in the table. If not, add it. In either case, then post the identifier from the message type table into the message table. This should give you normalized data. Yes, it's a little extra time when you add a record, but should be more efficient on retrieval.
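
As a rough illustration of that check-then-insert step, here is a T-SQL sketch. All table, column and procedure names are hypothetical, and `GO` is the batch separator used by tools such as SSMS and sqlcmd:

```sql
-- Hypothetical schema: a small type table and a message table that
-- stores only the type's identifier.
CREATE TABLE dbo.message_types (
    message_type_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    message_type    VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE dbo.messages (
    message_id      INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    message_type_id INT NOT NULL
        REFERENCES dbo.message_types (message_type_id),
    message_value   VARCHAR(MAX) NULL,
    logged_at       DATETIME NOT NULL
);
GO

CREATE PROCEDURE dbo.AddMessage
    @message_type  VARCHAR(50),
    @message_value VARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @type_id INT;

    -- Check whether the message type already exists.
    SELECT @type_id = message_type_id
    FROM   dbo.message_types
    WHERE  message_type = @message_type;

    -- If not, add it and pick up the new identifier.
    IF @type_id IS NULL
    BEGIN
        INSERT INTO dbo.message_types (message_type)
        VALUES (@message_type);

        SET @type_id = SCOPE_IDENTITY();
    END;

    -- In either case, post the identifier into the message table.
    INSERT INTO dbo.messages (message_type_id, message_value, logged_at)
    VALUES (@type_id, @message_value, GETDATE());
END;
GO
```

If two callers add the same new type at the same time, the UNIQUE constraint makes one of them fail rather than create a duplicate; a production version would need to handle that error or serialize the check.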

If there are a lot more adds than reads and if the "message type" is short, an entirely different approach would be to still create the separate message type table, but not reference it when doing adds, and only update it lazily, on demand.

Namely, (a) Include a time-stamp in each message record. (b) Keep a list of the message types found as of the last time you checked. (c) Each time you check, search for any new message types added since the last time, as in:

create table temp_new_types as
    (select distinct message_type
    from message
    where timestamp>last_type_check
);

insert into message_type_list (message_type)
select message_type
from temp_new_types
where message_type not in (select message_type from message_type_list);

drop table temp_new_types;

Then store the timestamp of this check somewhere so you can use it the next time around.
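
One possible way to keep that timestamp, sketched in T-SQL. The single-row state table and the logged_at column are assumptions for the example, and NOT EXISTS is swapped in for NOT IN so that NULL message types cannot upset the comparison:

```sql
-- Hypothetical single-row table holding the time of the last check.
CREATE TABLE dbo.type_check_state (
    last_type_check DATETIME NULL
);
INSERT INTO dbo.type_check_state (last_type_check) VALUES (NULL);
GO

DECLARE @last_check DATETIME, @this_check DATETIME;
SET @this_check = GETDATE();

SELECT @last_check = last_type_check
FROM   dbo.type_check_state;

-- Look only at messages logged since the last check
-- (or at all messages on the very first run).
INSERT INTO dbo.message_type_list (message_type)
SELECT DISTINCT m.message_type
FROM   dbo.message AS m
WHERE  (@last_check IS NULL OR m.logged_at > @last_check)
  AND  m.logged_at <= @this_check
  AND  m.message_type IS NOT NULL
  AND  NOT EXISTS
       (
       SELECT 1
       FROM   dbo.message_type_list AS l
       WHERE  l.message_type = m.message_type
       );

-- Remember where this check stopped.
UPDATE dbo.type_check_state
SET    last_type_check = @this_check;
```

Rows that are committed late but carry an earlier timestamp could slip past the window, so an occasional full rescan may still be worthwhile.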

Jay 2010-02-12 04:06:49

Answer 9

A:

The answer is to use 'DISTINCT', but the best solution differs with the size of the table. Thousands of rows? Millions? Billions? More? These call for very different solutions.

RocketSurgeon 2010-02-12 04:17:12
