I'm currently working on a data export feature for a survey application. We're on SQL Server 2008. Answers are stored in a normalized format: QuestionId, RespondentId, Answer. A couple of other tables define the question text for each QuestionId and the demographics for each RespondentId...
Currently I'm using dynamic SQL to generate a pivot that joins the question table to the answer table and produces the export, and it's working... The problem is that it seems slow, and we don't have that much data (fewer than 50k respondents).
Right now I'm thinking, "Why am I 'paying' to de-aggregate the data on every query? Why don't I cache that?" The data being exported is based on dynamic criteria: it could be "give me respondents that completed on x date (or range)" or "people that like blue", etc. Because of that, I think I have to cache at the respondent level: figure out which respondents are being exported, then select their combined, cached, de-aggregated data.
To me the quick-and-dirty fix is a totally flat table: RespondentId, Question1, Question2, etc. The problem is that we have multiple clients, so that doesn't scale, and I don't want to have to maintain the flattened table as the survey changes.
So I'm thinking about putting an XML column on the respondent table and caching the results of SELECT * FROM Data WHERE RespondentId = x FOR XML AUTO. With that in place, I would then be able to produce the export with filtering and XML calls into the XML column.
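To make that concrete, here's a rough sketch of the caching step (the CachedAnswers column name is a placeholder, and I'd probably use FOR XML PATH instead of AUTO so I control the element shape):

```sql
-- Sketch only: persist each respondent's answers as an XML blob on the
-- respondent row. "CachedAnswers" is a placeholder name.
ALTER TABLE Respondents ADD CachedAnswers XML NULL;

UPDATE r
SET CachedAnswers =
    (
        SELECT d.QuestionId, d.Answer
        FROM Data AS d
        WHERE d.RespondentId = r.RespondentId
        FOR XML PATH('answer'), ROOT('data'), TYPE
    )
FROM Respondents AS r;
```

The cache would get refreshed whenever a respondent's answers change (trigger or application code).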
What are you doing to export aggregated data in a flattened format (CSV, Excel, etc.)? Does this approach seem OK? I worry about the cost of the XML functions on larger result sets (think SELECT RespondentId, XmlCol.value('(/data/question_1)[1]', 'nvarchar(50)') AS [Why is there air?], XmlCol.RinseAndRepeat)...
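For reference, the export I have in mind over that cached column would look roughly like this, assuming an element-per-answer shape such as <data><answer><QuestionId>1</QuestionId><Answer>5</Answer></answer>...</data> (all names here are placeholders, and .value() needs a singleton XPath, hence the trailing [1]):

```sql
-- Sketch only: one .value() call per exported question column.
SELECT r.RespondentId,
       r.CachedAnswers.value('(/data/answer[QuestionId=1]/Answer)[1]', 'nvarchar(50)') AS [Why is there air?],
       r.CachedAnswers.value('(/data/answer[QuestionId=2]/Answer)[1]', 'nvarchar(50)') AS [Question 2]
FROM Respondents AS r
WHERE r.CompletedDate >= @StartDate; -- stand-in for whatever the dynamic criteria are
```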
Is there a better technology/approach for this?
Thanks!
EDIT: SQL block for testing. Run steps 1 and 2 to prime the data, test with step 3, and clean up with step 4... At a thousand respondents by one hundred questions, it already seems slower than I'd like.
SET NOCOUNT ON;
-- step 1 - create seed data
CREATE TABLE #Questions (QuestionId INT PRIMARY KEY IDENTITY (1,1), QuestionText VARCHAR(50));
CREATE TABLE #Respondents (RespondentId INT PRIMARY KEY IDENTITY (1,1), Name VARCHAR(50));
CREATE TABLE #Data (QuestionId INT NOT NULL, RespondentId INT NOT NULL, Answer INT);
DECLARE @QuestionTarget INT = 100
,@QuestionCount INT = 0
,@RespondentTarget INT = 1000
,@RespondentCount INT = 0
,@RespondentId INT;
WHILE @QuestionCount < @QuestionTarget BEGIN
    INSERT INTO #Questions(QuestionText) VALUES(CAST(NEWID() AS CHAR(36)));
    SET @QuestionCount = @QuestionCount + 1;
END;
WHILE @RespondentCount < @RespondentTarget BEGIN
    INSERT INTO #Respondents(Name) VALUES(CAST(NEWID() AS CHAR(36)));
    SET @RespondentId = SCOPE_IDENTITY();
    SET @QuestionCount = 1;
    WHILE @QuestionCount <= @QuestionTarget BEGIN
        INSERT INTO #Data(QuestionId, RespondentId, Answer)
        VALUES(@QuestionCount, @RespondentId, ROUND((8 * RAND() + 1), 0)); -- random answer, 1-9
        SET @QuestionCount = @QuestionCount + 1;
    END;
    SET @RespondentCount = @RespondentCount + 1;
END;
-- step 2 - index seed data
ALTER TABLE #Data ADD CONSTRAINT [PK_Data] PRIMARY KEY CLUSTERED (QuestionId ASC, RespondentId ASC);
CREATE INDEX DataRespondentQuestion ON #Data (RespondentId ASC, QuestionId ASC);
-- step 3 - query data
DECLARE @Columns NVARCHAR(MAX)
,@TemplateSQL NVARCHAR(MAX)
,@RunSQL NVARCHAR(MAX);
-- build the bracketed, comma-delimited column list ([q1],[q2],...) via FOR XML PATH
SELECT @Columns = STUFF(
(
SELECT DISTINCT '],[' + q.QuestionText
FROM #Questions AS q
ORDER BY '],[' + q.QuestionText
FOR XML PATH('')
), 1, 2, '') + ']';
SET @TemplateSQL =
'SELECT *
FROM
(
SELECT r.Name, q.QuestionText, d.Answer
FROM #Respondents AS r
INNER JOIN #Data AS d ON d.RespondentId = r.RespondentId
INNER JOIN #Questions AS q ON q.QuestionId = d.QuestionId
) AS d
PIVOT
(
MAX(d.Answer)
FOR d.QuestionText
IN (xxCOLUMNSxx)
) AS p;';
SET @RunSQL = REPLACE(@TemplateSQL, 'xxCOLUMNSxx', @Columns);
EXECUTE sys.sp_executesql @RunSQL;
-- step 4 - clean up
DROP INDEX DataRespondentQuestion ON #Data;
DROP TABLE #Data;
DROP TABLE #Questions;
DROP TABLE #Respondents;