I need to count, for each word in a list, the number of records in a given table that contain it. If I only had 1 word, I could do this:

select count(id) as NumRecs from essays where essay like '%word%'

But my list could be hundreds or thousands of words, and I don't want to create hundreds or thousands of sql requests serially; that seems silly. I had a thought that I might be able to create a stored procedure that would accept a comma-delimited list of words, and for each word, it would run the above query, and then union them all together, and return one huge dataset. (Sounds reasonable, right? But I'm not sure where to start with that approach...)

Short of some weird thing with union, I might try to do something with a temp table -- inserting a row for each word and record count, and then returning select * from that temp table.

If it's possible with a union, how? And does one approach have advantages (performance or otherwise) over the other?

+5  A: 

If you want to run the query on multiple words returning a result row for each word then you can store those words in a table as you suggested, and join the query with it instead of running lots of queries in a loop. Note that the key word here is join, not union.

SELECT word, COUNT(essay)
FROM words
LEFT JOIN essays
ON essay LIKE '%' + words.word + '%'
GROUP BY word

Result:

'bar', 2
'baz', 2
'corge', 0
'foo', 1
'qux', 1

You could also look into full-text search. It will run much faster than LIKE '%word%', and it will correctly handle word boundaries, which the LIKE-based solution does not (for example, LIKE '%bar%' also matches 'barn').


Test data:

CREATE TABLE essays (essay NVARCHAR(100) NOT NULL);
INSERT INTO essays (essay) VALUES
('foo bar'),
('bar baz'),
('baz qux');

IF OBJECT_ID('words', 'U') IS NOT NULL DROP TABLE words;
CREATE TABLE words (word NVARCHAR(100) NOT NULL);
INSERT INTO words (word) VALUES
('foo'),
('bar'),
('baz'),
('qux'),
('corge');
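
The join can also be tried without a SQL Server instance. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in; the function name count_records_per_word is hypothetical, and SQLite concatenates strings with || where T-SQL uses +. It counts a column from the essays side of the join, so a word with no matching essay reports 0.

```python
import sqlite3


def count_records_per_word(essays, words):
    """Return {word: number of essays containing it as a substring}."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE essays (essay TEXT NOT NULL)")
    conn.execute("CREATE TABLE words (word TEXT NOT NULL)")
    conn.executemany("INSERT INTO essays (essay) VALUES (?)", [(e,) for e in essays])
    conn.executemany("INSERT INTO words (word) VALUES (?)", [(w,) for w in words])
    rows = conn.execute(
        """
        SELECT w.word, COUNT(e.essay)  -- counts matched essays only
        FROM words AS w
        LEFT JOIN essays AS e
          ON e.essay LIKE '%' || w.word || '%'
        GROUP BY w.word
        ORDER BY w.word
        """
    ).fetchall()
    conn.close()
    return dict(rows)


counts = count_records_per_word(
    ["foo bar", "bar baz", "baz qux"],
    ["foo", "bar", "baz", "qux", "corge"],
)
print(counts)
```

As with the T-SQL version, this is a substring match, so a word such as 'bar' would also be counted for an essay containing 'barn'.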
Mark Byers
It's also worth mentioning that SQL Server Express 2008 R2 with Advanced Services (free) includes full-text search, in case running Express Edition is what is currently keeping you from using it.
cfeduke
@cfeduke Thanks for the additional details. This particular app is using MSSQL 2005 Enterprise, so no worries about what is or isn't included. @Mark Byers This looks the most promising, so I'll be playing with it today.
Adam Tuttle
So this works quite well, after I figured out how to do it with full text search -- except it leaves out words that I know occur very frequently. I believe this is because of the Noise Words removal (examples of words it's removing: A, I, Have, Did)... but I want to **keep** them! They are important to the research we're doing. Any ideas?
Adam Tuttle
@Adam Tuttle: I would imagine that the "noise words" list, or stop words list as it is usually known, can be disabled. If not, you could try another product such as Lucene (http://lucene.apache.org/java/docs/), which might be more configurable. I have never tried to disable stop words, so I can't give a more precise answer.
Mark Byers
There are multiple applications using this SQL server instance, so disabling the stop words is a non-starter. I ended up doing essentially the same logic in the application code using regular expressions and word-boundaries, since I have more control over that. Thanks for getting me started, though.
Adam Tuttle
A: 

There are many ways to split a string in SQL Server. This article covers the pros and cons of just about every method: "Arrays and Lists in SQL Server 2005 and Beyond, When Table Value Parameters Do Not Cut it" by Erland Sommarskog

I prefer the Numbers table approach to splitting a string in T-SQL. For this method to work, you need to do this one-time table setup:

SELECT TOP 10000 IDENTITY(int,1,1) AS Number
    INTO Numbers
    FROM sys.objects s1
    CROSS JOIN sys.objects s2
ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number)

Once the Numbers table is set up, create this split function:

CREATE FUNCTION [dbo].[FN_ListToTable]
(
     @SplitOn  char(1)      --REQUIRED, the character to split the @List string on
    ,@List     varchar(8000)--REQUIRED, the list to split apart
)
RETURNS TABLE
AS
RETURN 
(

    ----------------
    --SINGLE QUERY-- --this will not return empty rows
    ----------------
    SELECT
        ListValue
        FROM (SELECT
                  LTRIM(RTRIM(SUBSTRING(List2, number+1, CHARINDEX(@SplitOn, List2, number+1)-number - 1))) AS ListValue
                  FROM (
                           SELECT @SplitOn + @List + @SplitOn AS List2
                       ) AS dt
                      INNER JOIN Numbers n ON n.Number < LEN(dt.List2)
                  WHERE SUBSTRING(List2, number, 1) = @SplitOn
             ) dt2
        WHERE ListValue IS NOT NULL AND ListValue!=''

);
GO 

You can now easily split a CSV string into a table and join on it:

select * from dbo.FN_ListToTable(',','1,2,3,,,4,5,6777,,,')

OUTPUT:

ListValue
-----------------------
1
2
3
4
5
6777

(6 row(s) affected)

You can now join to the split of your CSV like this:

DECLARE @YourTable table (RowID int, RowValue varchar(200))
INSERT INTO @YourTable VALUES (1,'aaa bbb ccc ddd eee fff ggg hhh')
INSERT INTO @YourTable VALUES (2,'bbb ddd fff hhh')
INSERT INTO @YourTable VALUES (3,'aaa bbb zzz')

DECLARE @Words varchar(500)
SET @Words='aaa,bbb,ccc,zzz'

SELECT
    COUNT(y.RowID) AS CountOF,l.ListValue
    FROM @YourTable                                  y
        INNER JOIN dbo.FN_ListToTable(',',@Words) AS l ON y.RowValue LIKE '%'+l.ListValue+'%'
    GROUP BY l.ListValue

OUTPUT:

CountOF     ListValue
----------- ---------------
2           aaa
3           bbb
1           ccc
1           zzz

(4 row(s) affected)
KM
A: 

Does your solution have to use SQL? Here's a nice C# sample already built:

http://msdn.microsoft.com/en-us/library/bb546166.aspx

"This example shows how to use a LINQ query to count the occurrences of a specified word in a string. Note that to perform the count, first the Split method is called to create an array of words. There is a performance cost to the Split method. If the only operation on the string is to count the words, you should consider using the Matches or IndexOf methods instead. However, if performance is not a critical issue, or you have already split the sentence in order to perform other types of queries over it, then it makes sense to use LINQ to count the words or phrases as well."

Boils down to one LINQ query:

var matchQuery = from word in source
                 where word.ToLowerInvariant() == searchTerm.ToLowerInvariant()
                 select word;

int wordCount = matchQuery.Count();
Console.WriteLine("{0} occurrence(s) of the search term \"{1}\" were found.", wordCount, searchTerm);
Shane Cusson
So are you suggesting that he copies his whole table across to run a LINQ Query against it?
Martin Smith
Totally not appropriate for me, sorry. I'm not trying to match the number of occurrences in a string; I want the number of records that contain each word.
Adam Tuttle