views:

49

answers:

3

I want to compute and count all trigrams of a string in SQL Server.

For example if the string is hello I want the following output:

Trigram Count
------- -----
hel     1
ell     1
llo     1
lo-     1

Wikipedia page on n-gram

+1  A: 

I still don't know what an n-gram is but based on Ed's answer is this what you need?

declare @string varchar(max) = 'hello'
declare @n int = 3

set @string = @string + REPLICATE('-',@n - (len(@string) % @n))

;with n as
(
SELECT 1 AS i
UNION ALL
SELECT i+1 
FROM n
WHERE i <= (LEN(@string)-@n)
)
select SUBSTRING(@string, i, @n), COUNT(*)
from n
group by  SUBSTRING(@string, i, @n)
option (maxrecursion 0)
Martin Smith
lo- must be part of the output
jozi
My question was "why is it"? i.e. can you do a better job of explaining it for people that don't know what trigrams are.
Martin Smith
http://en.wikipedia.org/wiki/Trigram
jozi
... and for people that can't be bothered to read the Wikipedia article. Is Ed's interpretation correct?
Martin Smith
From my reading, underscore equals a space character, so should not occur if there are no spaces in the input string. Replacing all spaces wirth an underscore in the input should do the trick.
RedFilter
+1  A: 

Based on Martin Smith's answer - added logic to pad the string out with - to a number of characters divisible by 3

declare @string varchar(max) = 'hello'

SET @string = (SELECT CASE LEN(@string) % 3
                          WHEN 1 THEN @string + '--'
                          WHEN 2 THEN @string + '-'
                          ELSE @string
                      END )
;with n as
(
SELECT 1 AS i
UNION ALL
SELECT i+1 
FROM n
WHERE i < (LEN(@string)-2)
)
select SUBSTRING(@string, i, 3) AS Trigram, COUNT(*) AS Count
from n
group by  SUBSTRING(@string, i, 3)
option (maxrecursion 0)
Ed Harper
+1  A: 

Borrowing from Ed and Martin, I think this is a correct implementation:

declare @string varchar(max) = 'here kitty kitty' 

SET @string = replace(@string, ' ', '-') --Wikipedia says this should be underscore, not dash
;with n as 
( 
    SELECT 1 AS i 
    UNION ALL 
    SELECT i + 1  
    FROM n 
    WHERE i < (LEN(@string)-2) 
) 
select SUBSTRING(@string, i, 3) AS Trigram, COUNT(*) AS Count 
from n 
group by SUBSTRING(@string, i, 3) 
option (maxrecursion 0) 
RedFilter
I've no idea whether or not that is correct but +1. Did you read the Wikipedia article? Edit: I see that you did!
Martin Smith
I read the concise trigram page (http://en.wikipedia.org/wiki/Trigram). This algorithm gives the same output as that page.
RedFilter