views:

139

answers:

5

I have 2 tables resulted from merging the following tables:

Authors
 -Aid bigint
 -Surname nvarchar(500)
 -Email nvarchar(500)

Articles
 -ArId varchar(50)
 -Year int
 -……Some other fields……

ArticleAuthors
 -ArId varchar(50)
 -Aid bigint

Classifications
 -ClassNumber int
 -ClassDescription nvarchar(100)

ClassArticles
 -ArId varchar(50)
 -ClassNumber int

After denormalizing these tables the resulted tables were:

Articles
 -FieldId int
 -ArId varchar(50)
 -ClassNumber int (Foreign key from the Classifications table)
 -Year int

Authors
 -FieldId int
 -ArId varchar(50) (Foreign key from the Articles table)
 -Aid bigint
 -Surname nvarchar(500)
 -Email nvarchar(500)
 -Year int

Here are the conditions of the data within the resulted tables:

  • SQL Server 2008 database
  • The relationships between the two tables are applied physically
  • The Authors table has 50 million records
  • The Articles table has 20 million records
  • The author has written many articles during the same year with different emails
  • There are authors in the authors table with ArIds that don’t reference ArIds in the Articles table (Orphan records)
  • The values within the Year fields ranges from 2002 and 2009
  • The Articles table has a unique clustered index on the [FieldId and Year] fields and this index created on 9 partitions (1 partition per year)
  • The Authors table has a non-unique clustered index on the [Year, ArId, Aid] fields and this index is created on the same 9 partition as the Articles table (1 partition per year)

The question is:

  • We need to create a stored procedure that gets the following result from the two tables [Aid,Surname,Email] under the following conditions:

  • Authors that have written articles during and after a specific year (AND)

  • The total number of articles for the author is greater than a specific number (AND)
  • The count of the articles written by the author under a specific ClassNumber is greater than a specific percentage of the total number of his articles (AND)
  • Get only the most recent email of the author (in the last year during which he has written an article)
  • If the author has more than one email in the same year, get them all.

We need the query to take the least possible time

If anyone can help, Thank you very much.

A: 

First step would be appropriate indexes. With the where criteria being the primary contenders and then items not used in where but selected can simply be included in the index. As mentioned there are standard tools and queries to find these.

To focus on the query in hand run the Query with "Query | Include Execution Plan" (Ctrl + M) turned on. This should show up any obvious bottlenecks.

dove
according to the recommendations of the Tuning Advisor I have created the following indexes:* on the Authors table-[Aid,ArId]-[Aid,Year]-[Aid]*On the Articles table:-ClassNumber-ClassNumberYear
SubPortal
@Darsh updated answer. if this doesn't help you might include actual query in question.
dove
+1  A: 

Do you want help writing the query, or help in fixing the performance? The query itself should be relatively simple. That's not where you're going to get the most bang for your buck.

SQL Server comes with tools for analyzing queries and boosting performance by tuning your indexes. That's where you're going to see the biggest help in getting it to run quickly.

David Stratton
My problem is the performance and as I mention in the comment on the first answer the best optimization I did takes 3 minutes.
SubPortal
A: 

You suggested in your title that you have a very slow sql query, yet in your post I get the impression that you havent even wrote it yet?

First step is try have a go creating it yourself (hopefully you get an answer anyway) and test just how fast it runs based on what indexes you have. (It might already be quick enough for you)

Second would be to viewing the execution plan for the query and identify if Tables Scans are being used. These are bad, so this is where you might choose to apply indexes etc to improve performace.

SQL Tuning Wizard is another one.

Of course, indexes can take up considerable space so choose them wisely.

kevchadders
I have already written the query ,and the best optimization takes 3 minutes but this is a long time for a web page response.
SubPortal
A: 

Given the beforementioned conditions This is the query I have created but it takes 3 minutes (So long time for a web page response):

CREATE PROC [dbo].[GetAuthorForMailing]
(
    @classNumber INT,
    @noPapers int,
    @year int,
    @percent int
)
AS
BEGIN

CREATE TABLE #TT
(
    Aid bigint, 
    allPapers int,
    classPapers int, 
    perc as CEILING(CAST(classPapers AS DECIMAL) / CAST(allPapers AS DECIMAL) * 100)
)
INSERT INTO #TT(Aid,allPapers,classPapers)
SELECT [YearAuthors].Aid,   
      (
       SELECT COUNT(EA.Aid) 
       FROM Authors EA
       WHERE EA.Aid =[YearAuthors].Aid) AS [AllPapers],
      (
       SELECT COUNT(*) 
       FROM Articles ED INNER JOIN Authors EAD ON EAD.ArId = ED.ArId
       WHERE EAD.Aid = [YearAuthors].Aid AND ED.ClassNumber = @classNumber) AS [ClassPapers]

FROM
(
     SELECT DISTINCT Aid
     FROM Authors EAE
     WHERE EAE.[Year] >= @year AND EAE.Email IS NOT NULL AND EAE.Email != ' '

)AS [YearAuthors]

SELECT DISTINCT EA.Aid,EA.Surname,EA.Email,[Year]
FROM #TT INNER JOIN Authors EA ON EA.Aid = #TT.Aid  
    AND allPapers > @noPapers 
    AND perc > @percent
    AND EA.[Year] = (SELECT MAX([Year]) FROM Authors WHERE Aid = EA.Aid)
WHERE EA.Email IS NOT NULL AND EA.Email != ' '
DROP TABLE #TT
SubPortal
+1  A: 

Without having the data this is very difficult to work on, but I created the tables and duplicated the procedure to get a rough idea on the query plan and potential problems.

First noticable thing, the part of the query written as :

SELECT DISTINCT Aid 
FROM Authors EAE 
WHERE EAE.[Year] >= @year AND EAE.Email IS NOT NULL AND EAE.Email != ' '

Is going to table scan, you have Year as your partitioning key, but within each partition there is no index supporting the email clauses in the query. As a side note the EAE.Email != ' ' might not give you quite what you expect. If ' ' != '' print 'true' else print 'false' That will give false for most systems. (Based on ansi padding)

FROM Articles ED 
INNER JOIN Authors EAD ON EAD.ArId = ED.ArId 
WHERE EAD.Aid = [YearAuthors].Aid AND ED.ClassNumber = @classNumber

ED.ClassNumber will have no supporting index, causing a clustered index scan.

In the final select statement : INNER JOIN Authors EA ON EA.Aid = #TT.Aid

This has no supporting index on the #TT side and doesn't appear to be one on the Authors Table side.

WHERE EA.Email IS NOT NULL AND EA.Email != ' '

this has no supporting index, causing a scan.

There are a lot more issues in there, with a considerable number of sort's appearing that probably will disappear with suitable indexes - You will have to sort out some of the basic indexing on the tables and then get a new query plan / set of problems and iteratively fix the plan - you will not fix it in a single 'silver bullet' shot.

Andrew