I'm currently working on a data export feature for a survey application. We are using SQL Server 2008. We store data in a normalized format: QuestionId, RespondentId, Answer. We have a couple of other tables that define the question text for each QuestionId and the demographics for each RespondentId...

Currently I'm using some dynamic SQL to generate a pivot that joins the question table to the answer table and creates an export, and it's working... The problem is that it seems slow, and we don't have that much data (fewer than 50k respondents).

Right now I'm thinking, "why am I 'paying' to de-aggregate the data for each query? Why don't I cache that?" The data being exported is based on dynamic criteria. It could be "give me respondents that completed on x date (or range)" or "people that like blue", etc. Because of that, I think I have to cache at the respondent level, find out which respondents are being exported, and then select their combined cached de-aggregated data.

To me, the quick-and-dirty fix is a totally flat table: RespondentId, Question1, Question2, etc. The problem is that we have multiple clients, so that doesn't scale, AND I don't want to have to maintain the flattened table as the survey changes.

So I'm thinking about putting an XML column on the respondent table and caching the result of a SELECT * FROM Data WHERE RespondentId = x FOR XML AUTO. With that in place, I would then be able to get my export with filtering and XML calls into the XML column.
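
Something along these lines is what I have in mind for the caching step. This is only a sketch: the XmlCache column name is a placeholder, the table names mirror the test script below, and I've used FOR XML PATH instead of AUTO just to get predictable element names.

-- add an XML column to the respondent table to hold the cached, flattened answers
ALTER TABLE Respondents ADD XmlCache XML NULL;

-- populate/refresh the cache for a single respondent
DECLARE @RespondentId INT = 42; -- whichever respondent just finished the survey

UPDATE r
SET XmlCache =
    (
        SELECT d.QuestionId, d.Answer
        FROM Data AS d
        WHERE d.RespondentId = r.RespondentId
        FOR XML PATH('question'), ROOT('data'), TYPE
    )
FROM Respondents AS r
WHERE r.RespondentId = @RespondentId;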

What are you doing to export aggregated data in a flattened format (CSV, Excel, etc)? Does this approach seem ok? I worry about the cost of XML functions on larger result sets (think SELECT RespondentId, XmlCol.value('//data/question_1', 'nvarchar(50)') AS [Why is there air?], XmlCol.RinseAndRepeat)...
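
To make that concrete, the export query I'm picturing would look roughly like this. Note that .value() needs a singleton XPath, the exact paths depend on the shape the cached XML ends up with, and the CompletedDate filter and second column alias are just stand-ins for illustration.

DECLARE @StartDate DATETIME = '20100101'; -- example filter value

SELECT r.RespondentId,
    r.XmlCache.value('(/data/question[QuestionId=1]/Answer)[1]', 'nvarchar(50)') AS [Why is there air?],
    r.XmlCache.value('(/data/question[QuestionId=2]/Answer)[1]', 'nvarchar(50)') AS [Question 2]
    -- ...and so on, one .value() call per exported question
FROM Respondents AS r
WHERE r.CompletedDate >= @StartDate; -- CompletedDate stands in for whatever the dynamic criteria are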

Is there a better technology/approach for this?

Thanks!

EDIT: SQL Block for testing. Run steps 1 & 2 to prime the data, test with step 3, clean up with step 4... At a thousand respondents by one hundred questions, it already seems slower than I'd like.

SET NOCOUNT ON;

-- step 1 - create seed data
CREATE TABLE #Questions (QuestionId INT PRIMARY KEY IDENTITY (1,1), QuestionText VARCHAR(50));
CREATE TABLE #Respondents (RespondentId INT PRIMARY KEY IDENTITY (1,1), Name VARCHAR(50));
CREATE TABLE #Data (QuestionId INT NOT NULL, RespondentId INT NOT NULL, Answer INT);

DECLARE @QuestionTarget INT = 100
    ,@QuestionCount INT = 0
    ,@RespondentTarget INT = 1000
    ,@RespondentCount INT = 0
    ,@RespondentId INT;

WHILE @QuestionCount < @QuestionTarget BEGIN
    INSERT INTO #Questions(QuestionText) VALUES(CAST(NEWID() AS CHAR(36)));
    SET @QuestionCount = @QuestionCount + 1;
END;

WHILE @RespondentCount < @RespondentTarget BEGIN
    INSERT INTO #Respondents(Name) VALUES(CAST(NEWID() AS CHAR(36)));
    SET @RespondentId = SCOPE_IDENTITY();
    SET @QuestionCount = 1;

    WHILE @QuestionCount <= @QuestionTarget BEGIN
        INSERT INTO #Data(QuestionId, RespondentId, Answer)
            VALUES(@QuestionCount, @RespondentId, ROUND(((10 - 1 - 1) * RAND() + 1), 0)); -- random answer between 1 and 9

        SET @QuestionCount = @QuestionCount + 1;
    END;

    SET @RespondentCount = @RespondentCount + 1;
END;

-- step 2 - index seed data
ALTER TABLE #Data ADD CONSTRAINT [PK_Data] PRIMARY KEY CLUSTERED (QuestionId ASC, RespondentId ASC);
CREATE INDEX DataRespondentQuestion ON #Data (RespondentId ASC, QuestionId ASC);

-- step 3 - query data
DECLARE @Columns NVARCHAR(MAX)
    ,@TemplateSql NVARCHAR(MAX)
    ,@RunSql NVARCHAR(MAX);

SELECT @Columns = STUFF(
    (
        SELECT DISTINCT '],[' + q.QuestionText
        FROM #Questions AS q
        ORDER BY '],[' + q.QuestionText
        FOR XML PATH('')
    ), 1, 2, '') + ']';

SET @TemplateSql =
'SELECT *
FROM
(
    SELECT r.Name, q.QuestionText, d.Answer
    FROM #Respondents AS r
        INNER JOIN #Data AS d ON d.RespondentId = r.RespondentId
        INNER JOIN #Questions AS q ON q.QuestionId = d.QuestionId
) AS d
PIVOT
(
  MAX(d.Answer)
  FOR d.QuestionText
    IN (xxCOLUMNSxx)
) AS p;';

SET @RunSql = REPLACE(@TemplateSql, 'xxCOLUMNSxx', @Columns);
EXECUTE sys.sp_executesql @RunSql;

-- step 4 - clean up
DROP INDEX DataRespondentQuestion ON #Data;
DROP TABLE #Data;
DROP TABLE #Questions;
DROP TABLE #Respondents;
A: 

No, your approach does not seem OK. Keep your normalized data. If you have proper keys, the "cost" to de-aggregate will be minimal. To further optimize performance, stop using dynamic SQL. Write well-crafted queries and encapsulate them in stored procedures. This will allow SQL Server to cache the query plans instead of rebuilding them every time.

Before you do any of this, however, check the query plan. It is also possible that you are missing an index on at least one of the fields you are searching on, which will result in a full table scan. You may be able to drastically increase your performance with a few well-placed indexes.
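
For example, something along these lines on your answer table (table and column names assumed from your description) would cover the reads your pivot does:

CREATE INDEX IX_Data_Respondent_Question
    ON Data (RespondentId, QuestionId)
    INCLUDE (Answer); -- covering index: the export can be served from the index alone, no key lookups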

MrGumbe
The dynamic SQL is required to get the question text as column names for the pivot. It is not a bottleneck. I am executing the dynamic SQL using sp_executesql and parameters where applicable, so the execution plan is being reused. I'll try to mock up some example SQL...
Ben