views:

85

answers:

2

Hello,

I have a table in my DB containning a free text field column.

I would like to know the frequency each word appears over all the rows, or maybe even calc a TF-IDF for all words, where my documents are that field's values per row.

Is it possible to calculate this using an Sql Query? if not or there's a simpler way could you please direct me to it?

Many Thanks,

Jon

+2  A: 

Solution for SQL Server 2008:

here is the table:

CREATE TABLE MyTable (id INT, txt VARCHAR(MAX));

here is SQL query:

SELECT sum(case when TSplitted.txt_word = 'searched' then 1 else 0 end) as cnt_searched
     , count(*) as cnt_all
FROM MyTable MYT 
INNER JOIN Fn_Split(MYT.id,' ',MYT.txt) TSplitted on MYT.id=TSplitted.id

here is table valued function Fn_Split(@id int, @separator VARCHAR(32), @string VARCHAR(MAX)) (taken from here):

CREATE FUNCTION Fn_Split (@id int, @separator VARCHAR(32), @string VARCHAR(MAX))

RETURNS @t TABLE
    (
        ret_id INT
       ,txt_word VARCHAR(MAX)
    )   
AS
    BEGIN
        DECLARE @xml XML
        SET @XML = N'<root><r>' + REPLACE(@s, @separator, '</r><r>') + '</r></root>'

        INSERT INTO @t(ret_id, val)
        SELECT @id, r.value('.','VARCHAR(5)') as Item
        FROM @xml.nodes('//root/r') AS RECORDS(r)

        RETURN
    END
Tim
+2  A: 

In SQL Server 2008 depending on your needs you could apply full text indexing to the column then query the sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document table valued functions to get the occurrence count.

Edit: Actually even without creating a persistent full text index you can still leverage the parser

WITH testTable AS
(
SELECT 1 AS Id, N'how now brown cow' AS txt UNION ALL
SELECT 2, N'she sells sea shells upon the sea shore' UNION ALL
SELECT 3, N'red lorry yellow lorry' UNION ALL
SELECT 4, N'the quick brown fox jumped over the lazy dog'
)

SELECT display_term, COUNT(*) As Cnt
FROM testTable
CROSS APPLY sys.dm_fts_parser('"' + REPLACE(txt,'"','""') + '"', 1033, 0,0)
WHERE TXT IS NOT NULL
GROUP BY display_term
HAVING COUNT(*) > 1
ORDER BY Cnt DESC

Returns

display_term                   Cnt
------------------------------ -----------
the                            3
brown                          2
lorry                          2
sea                            2
Martin Smith
You rock!!! That's an amazing solution which would have taken me days to find. (I only had to add one line to filter fields with null text otherwise it would return an error "Null or empty full-text predicate", the line I add was "WHERE tbl1.txt_field != '""' ")
Jon
Thanks, I'll incorporate that into my answer.
Martin Smith