views:

541

answers:

4

I used the ANTS profiler to identify the remaining bottleneck in my C# application: the SQL Server stored procedure. I am using SQL Server 2008. Can anybody here help me increase performance, or give me pointers as to what I can do to make it better or more performant?

First, here's the procedure:

PROCEDURE [dbo].[readerSimilarity] 
-- Add the parameters for the stored procedure here
@id int,
@type int
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;

-- Insert statements for procedure here
IF (@type=1) --by Article
 SELECT id1, id2, similarity_byArticle FROM similarity WHERE (id1 = @id OR id2 = @id) 
AND similarity_byArticle != 0

ELSE IF (@type=2) --by Parent
 SELECT id1, id2, similarity_byParent FROM similarity WHERE (id1 = @id OR id2 = @id) 
AND similarity_byParent != 0

ELSE IF (@type=3) --by Child
 SELECT id1, id2, similarity_byChild FROM similarity WHERE (id1 = @id OR id2 = @id) 
AND similarity_byChild != 0

ELSE IF (@type=4) --combined
 SELECT id1, id2, similarity_combined FROM similarity WHERE (id1 = @id OR id2 = @id) 
AND similarity_combined != 0

END

The table 'similarity' consists of two ids (id1 and id2) and a number of columns that store double values. The constraint is that id1 < id2.

Column     Data
-----      ----
ID1         PK, Indexed
ID2         PK, Indexed

The table contains 28.5 million entries.

Stored Procedure Background

The job of the stored procedure is to get all the rows that have the parameter id in either id1 or id2. Additionally, the column specified by the type-parameter cannot be zero.

The stored procedure is called multiple times for different ids. Although only taking ~1.6 ms per call, it sums up, when calling it 17,000 times.

The processor is running at only 25%, which seems to be because the application is waiting for the procedure call to return.

Do you see any way to speed things up?

Calling the Stored Procedure C# Code Snippet

private HashSet<NodeClustering> AddNeighbourNodes(int id)
 {
  HashSet<NodeClustering> resultSet = new HashSet<NodeClustering>();
  HashSet<nodeConnection> simSet = _graphDataLoader.LoadEdgesOfNode(id);

  foreach (nodeConnection s in simSet)
  {
   int connectedId = s.id1;
   if (connectedId == id)
    connectedId = s.id2;

   // if the corresponding node doesn't exist yet, add it to the graph
   if (!_setNodes.ContainsKey(connectedId))
   {
    NodeClustering nodeToAdd = CreateNode(connectedId);
    GraphAddOuter(nodeToAdd);
    ChangeWeightIntoCluster(nodeToAdd.id, s.weight);
    _bFlowOuter += s.weight;
    resultSet.Add(nodeToAdd);
   }
  }

  // the nodes in the result set have been added 
                   to the outernodes -> add to the outernodes count
  _setNodes[id].countEdges2Outside += resultSet.Count;

  return resultSet;
 }

C# Code Background Information

This method is called each time a new id is added to the cluster. It gets all the connected nodes of that id (they are connected, when there is an entry in the db with id1=id or id2=id) via

_graphDataLoader.LoadEdgesOfNode(id);

Then it checks all the connected ids and if they are not loaded yet:

if (!_setNodes.ContainsKey(connectedId))

It Loads them:

CreateNode(connectedId);

The Method:

_graphDataLoader.LoadEdgesOfNode(id);

is called again, this time with the connectedId.

I need this to get all the connections of the new nodes with those nodes that are already in the set.

I probably could collect the ids of all nodes i need to add and call my stored procedure only once with a list of the ids.

Ideas

I could probably load the connected ids connection at once via something like

     SELECT id1, id2, similarity_byArticle FROM similarity WHERE 
                   (id1 = @id OR id2 = @id OR
  id1 IN (SELECT id1 FROM similarity WHERE id2 = @id) OR 
  id2 IN (SELECT id1 FROM similarity WHERE id2 = @id) OR
  id1 IN (SELECT id2 FROM similarity WHERE id1 = @id) OR 
  id2 IN (SELECT id2 FROM similarity WHERE id1 = @id))
                    AND similarity_byArticle != 0

but then I would get more entries than I'd need, because I would get them for already loaded nodes too (which from my tests would make up around 75% of the call).

Questions

  1. How can I speed up the Stored Procedure?
  2. Can I do it differently, is there a more performant way?
  3. Can I use a List<int> as a SP-Parameter?
  4. Any other thoughts?
+3  A: 

I'd try to change the application to only call this one time per ID, but if that is not possible, try this (make sure that there is an index on similarity.id1 and another index on similarity.id2):

PROCEDURE [dbo].[readerSimilarity] 
-- Add the parameters for the stored procedure here
@id int,
@type int
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    -- Insert statements for procedure here
    IF @type=1 --by Article
    BEGIN
        SELECT
            id1, id2,similarity_byArticle
            FROM similarity
            WHERE id1 = @id AND similarity_byArticle!=0
        UNION
        SELECT
            id1, id2,similarity_byArticle
            FROM similarity
            WHERE id2 = @id AND similarity_byArticle!=0

    END
    ELSE IF @type=2 --by Parent
    BEGIN
        SELECT
            id1, id2,similarity_byParent
            FROM similarity
            WHERE id1 = @id AND similarity_byParent!=0
        UNION
        SELECT
            id1, id2,similarity_byParent
            FROM similarity
            WHERE id2 = @id AND similarity_byParent!=0

    END

    ELSE IF @type=3 --by Child
    BEGIN
        SELECT
            id1, id2,similarity_byChild
            FROM similarity
            WHERE id1 = @id AND similarity_byChild!=0
        UNION
        SELECT
            id1, id2,similarity_byChild
            FROM similarity
            WHERE id2 = @id AND similarity_byChild!=0

    END
    ELSE IF @type=4 --combined
    BEGIN
        SELECT
            id1, id2,similarity_combined
            FROM similarity
            WHERE id1 = @id AND similarity_combined!=0
        UNION
        SELECT
            id1, id2,similarity_combined
            FROM similarity
            WHERE id2 = @id AND similarity_combined!=0

    END

END

GO

EDIT based on OP's latest comment:

The whole graph is stored in the MSSQL-Database and I load it successively with the procedure into some Dictionary structures

You need to redesign your load process. You should call the database just one time to load all of this data. Since the IDs are already in a Database table, you can use a join in this query to get the proper IDs from the other table. edit your question with the table schema that contain the IDs to graph, and how they relate to the already posted code. Once you get a single query to return all the data, it will be much faster that 17,000 calls for a single row each time.

KM
are you calling this procedure in a loop? if so post the details, there are ways to return one result set for all items in your loop and save massive amounts of time
KM
problem with this is, that I don't know which ids I need. The clustering algorithm calculates which ids of the graph should belong to the cluster. I could try to load all ids in a given distance, but this would cause me to load a *lot* of data I do not need and I have no assurance that the given distance is enough to load all the ids that will finally belong to the cluster. And loading the whole graph would burst the machines memory (28.5 Mio edges). I redesign the algorithm. Now it is a bit more blurry, but I just need to load all the edges only when actually adding it to the cluster.
Aaginor
how complex is the "clustering algorithm"? is is also looping over data one row at a time. Post the table definitions and the clustering algorithm, perhaps someone here can figure it out.
KM
I optimized the code and if possible, I returned a set of results with one request via the method Charles Bretana offered. Now I have acceptable runtime :)
Aaginor
+4  A: 

If it runs that quickly, your problem is probably in the sheer number of repeated calls to the procedure. Is there a way that you could modify the stored procedure and code to return all the results the app needs in a single call?

Optimizing a query that runs in less than 2ms is probably not a fruitful effort. I doubt you will be able to shave more than fractions of a millisecond with query tweaks.

JohnFx
I think your right. I have tweaked the sp with some of the tips here (using union instead of or), but this just made up a small amount of time.
Aaginor
Even the fastest stored proc called 17K times is going to be a poor performer, the overhead of making the calls alone probably exceeds the time taken by the actual SQL code.
JohnFx
A: 

First create a view

CREATE VIEW ViewArticles
AS
SELECT id1, id2, similarity_byArticle 
FROM similarity 
WHERE (id1 = @id or id2 = @id) 
and similarity_byArticle != 0

In your code populate all the needed ids into a table.

Create a function which takes all the ids table as parameter.

CREATE FUNCTION
  SelectArticles
(
  @Ids TABLE
)
RETURNS TABLE
AS
RETURN
(
     SELECT id1, id2, similarity_byArticle FROM ViewArticles
     INNER JOIN @Ids I ON I.Id = id1
     UNION
     SELECT id1, id2, similarity_byArticle FROM ViewArticles
     INNER JOIN @Ids I ON I.Id = id2
)
Greco
+1  A: 

Pass all the ids into the stored proc at once, using a delimited list (Use a comma or a slash or whatever, I use a pipe character [ | ].. Add the User defined function (UDF) listed below to your database. It will convert a delimited list into a table which you can join to your similarity table. Then in your actual stored proc, you can write...

Create Procedure GetSimilarityIDs
@IdValues Text -- @IdValues is pipe-delimited [|] list of Id Values
As
Set NoCount On
Declare @IDs Table 
   (rowNum Integer Primary Key Identity Not Null,
    Id Integer Not Null)
Insert Into @IDs(Id)
Select Cast(sVal As Integer)
From dbo.ParseString(@IdValues, '|') -- specify delimiter
-- ---------------------------------------------------------

Select id1, id2, similarity_byArticle            
From similarity s Join @IDs i On i.Id = s.Id
Where similarity_byArticle <> 0
Return 0

-- *******************************************************

The below code is to create the generic function UDF that can parse any text string into a table of string values...:

Create FUNCTION [dbo].[ParseTextString] (@S Text, @delim VarChar(5))
Returns @tOut Table 
    (ValNum Integer Identity Primary Key, 
     sVal VarChar(8000))
As
Begin 
Declare @dLLen TinyInt       -- Length of delimiter
Declare @sWin  VarChar(8000) -- Will Contain Window into text string
Declare @wLen  Integer       -- Length of Window
Declare @wLast TinyInt     -- Boolean to indicate processing Last Window
Declare @wPos  Integer     -- Start Position of Window within Text String
Declare @sVal  VarChar(8000) -- String Data to insert into output Table
Declare @BtchSiz Integer     -- Maximum Size of Window
    Set @BtchSiz = 7900      -- (Reset to smaller values to test routine)
Declare @dPos Integer        -- Position within Window of next Delimiter
Declare @Strt Integer        -- Start Position of each data value within Window
-- -------------------------------------------------------------------------
If @delim is Null Set @delim = '|'
If DataLength(@S) = 0 Or
      Substring(@S, 1, @BtchSiz) = @delim Return
-- ---------------------------
Select @dLLen = Len(@delim),
       @Strt = 1, @wPos = 1,
       @sWin = Substring(@S, 1, @BtchSiz)
Select @wLen = Len(@sWin),
       @wLast = Case When Len(@sWin) = @BtchSiz
           Then 0 Else 1 End,
       @dPos = CharIndex(@delim, @sWin, @Strt)
-- ------------------------------------
  While @Strt <= @wLen
  Begin
      If @dPos = 0 -- No More delimiters in window
      Begin                      
          If @wLast = 1 Set @dPos = @wLen + 1 
          Else 
          Begin
              Set @wPos = @wPos + @Strt - 1
              Set @sWin = Substring(@S, @wPos, @BtchSiz)
              -- ----------------------------------------
              Select @wLen = Len(@sWin), @Strt = 1,
                     @wLast = Case When Len(@sWin) = @BtchSiz
                              Then 0 Else 1 End,
                     @dPos = CharIndex(@delim, @sWin, 1)
              If @dPos = 0 Set @dPos = @wLen + 1 
          End
      End
      -- -------------------------------
      Set @sVal = LTrim(Substring(@sWin, @Strt, @dPos - @Strt))
      Insert @tOut (sVal) Values (@sVal)
      -- -------------------------------
      -- Move @Strt to char after last delimiter
      Set @Strt = @dPos + @dLLen 
      Set @dPos = CharIndex(@delim, @sWin, @Strt)
   End
   Return
End
Charles Bretana
Thanks for this code! I was able to shorten execution time for fetching data by ~30% ... but was getting hit hard by separating the result for every id (raised execution time from 30 sek to 300 sek *g*). But I guess I can optimize this, will have a look into it later.
Aaginor