ansaurus

Question

How slow is DISTINCT?

Answer 1

+1 A:

Option 2 with some modifications. Assuming you're limiting input to the values contained in the tag table you're then able to perform two different queries for different purposes.

The first query will give you a unique list of all currently USED tags.

The second query will give you all possible tags that can be used, including those that have not yet been used.

You would be better to use a tag ID and set up a relationship between the tag and article_tags tables. This would make option two more efficient as the index would simply be an integer rather than a string.

nullnvoid 2009-01-12 05:40:10

integers are more efficient than strings?

Shawn Simon 2009-01-12 05:47:42

Generally. Depending on the size.

nullnvoid 2009-01-12 05:49:05

currently i have the second situation but using TagID int and Tagtext, just annoying having to delete any used tags after each query.

Shawn Simon 2009-01-12 05:51:04

Answer 2

+4 A:

I would go with whichever solution will result in the best/cleanest design for your application. If you need to attach data directly to a tag then a separate table (i.e. a more normalised solution) would be the correct one.

I would warn against worrying too much about the performance difference in the 2 proposed solutions, if indexed, the difference is likely to be insignificant (both are fairly common use cases and can be easily optimised using standard db techniques). Making a decision between the 2 presented options on the basis of performance sounds like a premature optimisation.

Michael Barker 2009-01-12 05:47:17

Answer 3

+1 A:

I would go with

Article
-------
ArticleID int PK

Article_Tags
------------
ArticleTagID int PK
ArticleID int FK
TagId int FK

Tag
---
TagId int identity(1,1) PK
TagText varchar(50)

There really is no reason to denormalize this from get go. (your first and second versions are not normalized)

Having tags in a separate table is fine, and getting the unique tags is cleaner without that distinct statement.

Performance wise if any of the 3 options are going to perform pretty much the same provided the correct indexing is applied.

Sam Saffron 2009-01-12 06:44:46

are you sure about the second one not being normalized?

Shawn Simon 2009-01-12 14:19:08

Answer 4

+1 A:

To answer the basic question from the title: a DISTINCT usually means sorting the data. Depending on indices, the structure of the query and the amount of data returned this might be gratis (proper index on tagtext, ORDER BY tagtext, small return set) or not (missing index, order irrelevant, massive return set).

David Schmitt 2009-01-12 06:58:49

Wrong, no sort as long as the obvious indexes are supplied. And DISTINCT would be very efficient - dbms's know how to skip from the start of one value to the next.

le dorfier 2009-01-12 07:07:48

That's why I said "Depending on the indices [...] this might be gratis (proper index on tagtext [...]"

David Schmitt 2009-01-12 11:15:21

Answer 5

+1 A:

The overhead of maintaining two tables would be very unlikely to be a reasonable tradeoff for messing up a simple design.

It would be simple enough to test it both ways if you're really worried about it, but from my extensive experience there's no question this is true.

le dorfier 2009-01-12 07:05:44

i will probably execute the update/insert queries on a separate thread anyway since theyre not mission critical

Shawn Simon 2009-01-12 14:19:46

ansaurus

tags:

views:

answers:

How slow is DISTINCT?

related questions