I am no good at SQL.

I am looking for a way to speed up a simple join like this:

SELECT
    E.expressionID,
    A.attributeName,
    A.attributeValue
FROM 
    attributes A
JOIN
    expressions E
ON 
    E.attributeId = A.attributeId

I am doing this tens of thousands of times, and it takes longer and longer as the table gets bigger.

I am thinking indexes. If I were speeding up selects on the individual tables, I'd probably put a nonclustered index on expressionID for the expressions table and another on (attributeName, attributeValue) for the attributes table, but I don't know how this would apply to the join.

EDIT: I already have a clustered index on expressionId (PK), attributeId (PK, FK) on the expressions table and another clustered index on attributeId (PK) on the attributes table

I've seen this question but I am asking for something more general and probably far simpler.

Any help appreciated!

+12  A: 

You definitely want to have indexes on attributeID on both the attributes and expressions tables. If you don't currently have those indexes in place, I think you'll see a big speedup.
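
To check which indexes are already in place before adding any, a query like the following can help. It is a minimal sketch that assumes SQL Server 2005 or later (for the sys catalog views) and the table names from the question; on older versions, EXEC sp_helpindex 'expressions' gives similar information.

```sql
-- List every index on the two tables, with its key and included columns.
-- Assumes SQL Server 2005+ and the table names used in the question.
SELECT
    t.name                 AS table_name,
    i.name                 AS index_name,
    i.type_desc            AS index_type,     -- CLUSTERED / NONCLUSTERED / HEAP
    ic.key_ordinal         AS key_position,   -- 0 means an included (non-key) column
    c.name                 AS column_name,
    ic.is_included_column
FROM sys.indexes AS i
JOIN sys.tables AS t
    ON t.object_id = i.object_id
JOIN sys.index_columns AS ic
    ON ic.object_id = i.object_id
   AND ic.index_id  = i.index_id
JOIN sys.columns AS c
    ON c.object_id = ic.object_id
   AND c.column_id = ic.column_id
WHERE t.name IN ('attributes', 'expressions')
ORDER BY t.name, i.index_id, ic.key_ordinal;
```

An index helps this join most when attributeId is its first key column (key_position = 1).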

JerSchneid
Not to forget that both columns should be of the same data type, and, if they are character data, of the same collation.
Tomalak
Knowing the primary key would help. A single column which is the primary key would already be indexed. It's possible that your expressions table has two fields which make up the primary key. This means that creating an index on E.attributeId would be the way to go. The primary key would create an index using both E.ID and E.attributeId. Adding an index for only E.attributeId would speed it up.
Kieveli
Actually, the primary key isn't autoindexed on all platforms. MySQL, for instance, does not create an index by default on the primary key.
Goblyn27
I have an index on expressionId, attributeId (PK) on the expressions table and a clustered index on attributeId (PK) on the attributes table
JohnIdol
Both tables don't necessarily need an index. It's actually bad form to just blindly add indexes in this manner. You need to make sure your DB stats are up-to-date and see how the table sizes stack up. More than likely, the optimizer is going to do a full table scan on the base table no matter what (since there's no WHERE clause), so the index on AttributeId on the base table is just wasted space.
Matt
@Goblyn27: Can you cite a reference for this? I use MySQL quite a lot, and a PRIMARY KEY constraint does create an index implicitly.
Bill Karwin
please note EDIT on the question
JohnIdol
+4  A: 

In fact, because there are so few columns being returned, I would consider a covered index for this query,

i.e. an index that includes all the fields in the query.
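
For the attributes side, a covering index might look like the sketch below. It assumes SQL Server 2005 or later (for the INCLUDE clause), and the index name is hypothetical. Since the question says attributes already has a clustered index on attributeId, which holds every column at its leaf level, a narrower index like this is likely to pay off only if the table has other, wider columns.

```sql
-- Hypothetical index name; table and column names are from the question.
-- attributeId is the key used for the join; the two selected columns are
-- stored at the leaf level so the query never has to touch the base table.
CREATE NONCLUSTERED INDEX IX_attributes_attributeId_covering
    ON attributes (attributeId)
    INCLUDE (attributeName, attributeValue);
```

The expressions side would need the equivalent: an index led by attributeId that also carries expressionID.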

Goblyn27
how would I index a join? (never done)
JohnIdol
I think that Goblyn is suggesting adding an index on A.attributeId, A.attributeName,A.attributeValue and another on E.attributeId and E.expressionID...but I'm not 100% sure. The theory of this being that all of the data for the query would come directly from the indices and never hit the table.
Greg
Sorry, I wasn't clear on that. Greg is correct. In this instance there would be two covered indexes, one for each table, and the join would take place between the two covered indexes without involving the actual table.
Goblyn27
I'll give it a shot and report back
JohnIdol
+3  A: 

Some things you need to care about are indexes, the query plan and statistics.

Put indexes on attributeId. Or, make sure indexes exist where attributeId is the first column in the key (SQL Server can still use indexes if it's not the 1st column, but it's not as fast).

Highlight the query in Query Analyzer and hit ^L to see the plan. You can see how tables are joined together. Almost always, using indexes is better than not (there are fringe cases where if a table is small enough, indexes can slow you down -- but for now, just be aware that 99% of the time indexes are good).
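
If a graphical plan is not convenient, the estimated plan can also be requested as text. This is a minimal sketch using SET SHOWPLAN_TEXT with the query from the question; while the option is on, SQL Server returns the plan instead of executing the statement.

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch, hence the GO separators.
SET SHOWPLAN_TEXT ON;
GO

-- Not executed while SHOWPLAN_TEXT is ON; only its estimated plan is returned.
SELECT
    E.expressionID,
    A.attributeName,
    A.attributeValue
FROM
    attributes A
JOIN
    expressions E
ON
    E.attributeId = A.attributeId;
GO

SET SHOWPLAN_TEXT OFF;
GO
```

In the output, look for whether each table is read with a scan or a seek, and which join operator (nested loops, merge, or hash) is chosen.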

Pay attention to the order in which tables are joined. SQL Server maintains statistics on table sizes and will determine which one is better to join first. Do some investigation on internal SQL Server procedures to update statistics -- it's been too long so I don't have that info handy.
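
For reference, statistics can be refreshed with UPDATE STATISTICS or the sp_updatestats system procedure. The sketch below uses the table names from the question; whether a full scan is worth its cost depends on table size, so treat the options as something to adapt.

```sql
-- Refresh statistics on the two tables involved in the join.
UPDATE STATISTICS attributes;

-- WITH FULLSCAN reads every row instead of sampling: slower, but more accurate.
UPDATE STATISTICS expressions WITH FULLSCAN;

-- Alternatively, refresh out-of-date statistics for all tables in the current database.
EXEC sp_updatestats;
```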

That should get you started. Really, an entire chapter can be written on how a database can optimize even such a simple query.

Matt
+1  A: 

Another thing to do is add some indexes like this:

attributes.{attributeId, attributeName, attributeValue}
expressions.{attributeId, expressionID}

This is hacky! But useful if it's a last resort.

What this does is create a query plan that can be "entirely answered" by indexes. Usually, an index actually causes a double-I/O in your above query: one to hit the index (i.e. probe into the table), another to fetch the actual row referred to by the index (to pull attributeName, etc).

This is especially helpful if "attributes" or "expressions" is a wide table. That is, a table that's expensive to fetch the rows from.

Finally, the best way to speed your query is to add a WHERE clause!

Matt
would those indexes kill me on insertion? about WHERE - I am using this join to populate a temp table which I am gonna use to find the expressionID (if any) for a given set of name-value pairs (attributes). So I guess I could filter with OR disjuncts attributeNames+AttributeValues on this query to speed it up
JohnIdol
I'd have to dynamically append the OR disjuncts though, because I need something like WHERE (attributeName = 'X' AND attributeValue = 'Y') OR (attributeName = 'Z' AND attributeValue = 'W') ... and so forth! So I'd probably lose time looping through the table with the name-value pairs and building these clauses
JohnIdol
There's always a tradeoff of indexes for insertions. Again (and unfortunately), there's no one-size-fits-all answer. If you only have one or two indexes, and given this one isn't clustered, it's likely not going to kill you. That said, this IS an index that's heavily geared toward a specific query, so use at your discretion.
Matt
table gets about 10k rows and there's huge repetition of name-values. Anyway, I'll probably ask another question for that specific problem - I meant this one just as performance suggestions for simple joins
JohnIdol
+2  A: 

I bet your problem is the huge number of rows that are being inserted into that temp table. Is there any way you can add a WHERE clause before you SELECT every row in the database?
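
One way to filter without building a WHERE clause dynamically is to join against the set of name-value pairs, which is the idea the asker arrives at further down this thread. The following is a rough sketch: the @pairs table variable, its column types, and the sample values are assumptions made for illustration.

```sql
-- Hypothetical holder for the name-value pairs to search for.
-- Column types are assumptions; match them to the attributes table.
DECLARE @pairs TABLE (
    name  NVARCHAR(100) NOT NULL,
    value NVARCHAR(100) NOT NULL,
    PRIMARY KEY (name, value)
);

INSERT INTO @pairs (name, value) VALUES (N'X', N'Y');
INSERT INTO @pairs (name, value) VALUES (N'Z', N'W');

-- Only attributes matching one of the pairs survive the first join,
-- so far fewer rows reach the join with expressions.
SELECT
    E.expressionID,
    A.attributeName,
    A.attributeValue
FROM
    attributes A
JOIN
    @pairs T
ON
    A.attributeName = T.name
    AND A.attributeValue = T.value
JOIN
    expressions E
ON
    E.attributeId = A.attributeId;
```

With this shape, an index on attributes (attributeName, attributeValue) becomes relevant, since the filter is now on those columns.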

JerSchneid
I guess I could filter with OR disjuncts on attributeNames+AttributeValues on this query to speed it up, but the problem is that I'd have to dynamically append the OR disjuncts, because I need something like WHERE (attributeName = 'X' AND attributeValue = 'Y') OR (attributeName = 'Z' AND attributeValue = 'W') ... to ultimately get the ExpressionId of a given set of name-value pairs. So I'd probably lose time looping through the table with the name-value pairs and building these OR disjuncts for the WHERE clause.
JohnIdol
That still may be better? Or you could look into caching that temp table. Either caching it in some middle-tier memory, or making that temp table a permanent table and updating it only when rows from the other tables change?
JerSchneid
If I can't get significant improvements playing with indexes, I'll go with the dynamic filtering of the join as described in the previous comment - I'd like to avoid having persistent caching tables!
JohnIdol
I tried dynamic filtering: with a fully populated db I am now moving 9k instead of 70k. It's a bit better, but not as much as I would have expected. I am rethinking the whole thing - maybe I can join the expressions (e) table with the attributes (a) table on e.articleId = a.articleId and then join the name-value pairs table (t) on a.attributeName = t.name and a.attributeValue = t.value, achieving the same kind of filtering with less computation!
JohnIdol
I started another question --> http://stackoverflow.com/questions/923136/t-sql-filtering-on-dynamic-name-value-pairs
JohnIdol
+1  A: 

If I'm understanding your schema correctly, you're stating that your tables kinda look like this:

Expressions: PK - ExpressionID, AttributeID
Attributes:  PK - AttributeID

Assuming that each PK is a clustered index, that still means that an Index Scan is required on the Expressions table. You might want to consider creating an Index on the Expressions table such as: AttributeID, ExpressionID. This would help to stop the Index Scanning that currently occurs.
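
As a sketch, the suggested index could be created as follows. The index name is hypothetical; the table and column names are taken from the question.

```sql
-- attributeId leads the key so the join can seek on it;
-- expressionID is included in the key so the index alone answers the query.
CREATE NONCLUSTERED INDEX IX_expressions_attributeId_expressionID
    ON expressions (attributeId, expressionID);
```

Because the clustered key here is (expressionID, attributeId), SQL Server already stores expressionID in every nonclustered index on this table, so an index on attributeId alone should cover the query as well; listing both columns mainly makes the intent explicit.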

Your understanding is correct. You mean adding a nonclustered index on expressions for (ExpressionId, AttributeId) in addition to the clustered index that's already there?
JohnIdol