tags:

views:

71

answers:

2

Hi.

I have problem with comparing two tables in SQL Server.

I have first table [Table1] with text column where I store my content and second table [table2] with column of my keywords.

And now I want to compare all my keywords against my content and get a list of keywords with number of occurrences in the content. (clear enough?)

A: 

This would return you list of IDs from Table1 (id int, txt ntext) with key woards from Table2 (kwd nvarchar(255)) that exist in ntext field. Number of occurrences is tricky and you will have to write UDF, preferable CLR one, to get it.

I defined word as everything that is separated by space or open parenthesize from left and space, close parenhesize, comma, dot or semicolon from right. You can add more conditions eg quotes, double-quotes etc.

Select Table1.id, Table2.kwd 
  From Table1
 Cross Join Table2 
 Where patindex(N'%[ (]'+Table2.kwd+N'[ ,.;)]%',N' '+cast(Table1.txt as nvarchar(max))+N' ')>0
Order by id, kwd
Niikola
@Niikola, and how to get number of occurrences of each keyword?
Michael Pakhantsov
This is killing my database - for Top 10 executing time reach 9 minutes :)You wrote: "I defined word as everything that is separated by space or open parenthesize from left and space"But, every keyword is one row from Table 2
giker
It is heavy query as it has to do cross join between keywords and text tables plus it has to perform patindex function which is not champion of speed. When I said "I defined word as..." I mean how do you define word in your ntext field because I do not know if words are space delimited, comma delimited, tab delimited or by regular meaning of word in printed text, which could have combination of delimiters. Anyway, if you have SQL 2008 use solution proposed by Martin Smith (below), although it would take a lot of resources too.
Niikola
+2  A: 

What version of SQL Server? If SQL2008 you can do (probably after casting from text to nvarchar(max))

WITH Table1 AS
(
SELECT 1 AS Id, N'how now brown cow' AS txt UNION ALL
SELECT 2, N'she sells sea shells upon the sea shore' UNION ALL
SELECT 3, N'red lorry yellow lorry' UNION ALL
SELECT 4, N'the quick brown fox jumped over the lazy dog'
),
Table2 AS
(
SELECT 'lorry' as keyword UNION ALL
SELECT 'yellow' as keyword UNION ALL
SELECT 'brown' as keyword
)

SELECT Table1.id,display_term, COUNT(*) As Cnt
FROM Table1
CROSS APPLY sys.dm_fts_parser('"' + REPLACE(txt,'"','""') + '"', 1033, 0,0)
JOIN Table2 t2 ON t2.keyword=display_term
WHERE TXT IS NOT NULL
GROUP BY Table1.id,display_term
ORDER BY Cnt DESC

Returns

id          display_term                   Cnt
----------- ------------------------------ -----------
3           lorry                          2
3           yellow                         1
4           brown                          1
1           brown                          1
Martin Smith
Wow, I missed sys.dm_fts_parser function and it's great
Niikola
Thanks for that, but I can't run it. I'm getting:"Argument data type ntext is invalid for argument 1 of replace "function."
giker
pass `cast(YourColumnName as nvarchar(max))` to the replace function.
Martin Smith
Thanks again. Now it's working but results are far from what I was expecting. Now it's breaking my content to single words, but in my keywords list I have also phrases (2 and more words) and I want to know for each keyword number of occurrences in the content.
giker
Maybe you should have mentioned in the question that you have key phrases rather than key words! I haven't tried this but the `sys.dm_fts_parser` allows you to reference a thesaurus - Maybe you could create a custom thesaurus that maps all your phrases to single word versions (e.g. by replacing all spaces with empty strings). Failing that you could probably do something with `row_number()` and the occurrence column returned by the parser function.
Martin Smith