views:

1107

answers:

5

We have a web application that uses SQL Server 2008 as the database. Our users are able to do full-text searches on particular columns in the database. SQL Server's full-text functionality does not seem to provide support for hit highlighting. Do we need to build this ourselves or is there perhaps some libraries or knowledge around on how to do this?

BTW the application is written in C# so a .Net solution would be ideal but not necessary as we could translate.

A: 

It is unclear to me what your problem is. MSS will have hit highlighting no matter what backend you use. Also, the C# tag is irrelevant to the problem space...

RedDeckWins
+1  A: 

You might be missing the point of the database in this instance. Its job is to return the data to you that satisfies the conditions you gave it. I think you will want to implement the highlighting probably using regex in your web control.

Here is something a quick search would reveal.

http://www.dotnetjunkies.com/PrintContent.aspx?type=article&id=195E323C-78F3-4884-A5AA-3A1081AC3B35

WIDBA
Thanks for the reply. While I realize that it is outside the scope of the database to do the highlighting, perhaps the database should provide hit locations etc rather than having to rely on regex etc which could be difficult/inaccurate when you consider the effects of stemming, stop words etc.
Phil Haselden
+1  A: 

Some details:

   search_kiemeles=replace(lcase(search),"""","")
   do while not rs.eof  'The search result loop
    hirdetes=rs("hirdetes")
    data=RegExpValueA("([A-Za-zöüóőúéáűíÖÜÓŐÚÉÁŰÍ0-9]+)",search_kiemeles) 'Give back all the search words in an array, I need non-english characters also
    For i=0 to Ubound(data,1)
     hirdetes = RegExpReplace(hirdetes,"("&NoAccentRE(data(i))&")","<em>$1</em>")
    Next
    response.write hirdetes
    rs.movenext
   Loop
   ...

Functions

'All Match to Array
Function RegExpValueA(patrn, strng)
 Dim regEx
 Set regEx = New RegExp   ' Create a regular expression.
 regEx.IgnoreCase = True   ' Set case insensitivity.
 regEx.Global = True
 Dim Match, Matches, RetStr
 Dim data()
 Dim count
 count = 0
 Redim data(-1)  'VBSCript Ubound array bug workaround
 if isnull(strng) or strng="" then
  RegExpValueA = data
  exit function
 end if
 regEx.Pattern = patrn   ' Set pattern.
 Set Matches = regEx.Execute(strng)   ' Execute search.
 For Each Match in Matches   ' Iterate Matches collection.
  count = count + 1
  Redim Preserve data(count-1)
   data(count-1) = Match.Value
 Next
 set regEx = nothing
 RegExpValueA = data
End Function

'Replace non-english chars
Function NoAccentRE(accent_string)
 NoAccentRE=accent_string
 NoAccentRE=Replace(NoAccentRE,"a","§")
 NoAccentRE=Replace(NoAccentRE,"á","§")
 NoAccentRE=Replace(NoAccentRE,"§","[aá]")
 NoAccentRE=Replace(NoAccentRE,"e","§")
 NoAccentRE=Replace(NoAccentRE,"é","§")
 NoAccentRE=Replace(NoAccentRE,"§","[eé]")
 NoAccentRE=Replace(NoAccentRE,"i","§")
 NoAccentRE=Replace(NoAccentRE,"í","§")
 NoAccentRE=Replace(NoAccentRE,"§","[ií]")
 NoAccentRE=Replace(NoAccentRE,"o","§")
 NoAccentRE=Replace(NoAccentRE,"ó","§")
 NoAccentRE=Replace(NoAccentRE,"ö","§")
 NoAccentRE=Replace(NoAccentRE,"ő","§")
 NoAccentRE=Replace(NoAccentRE,"§","[oóöő]")
 NoAccentRE=Replace(NoAccentRE,"u","§")
 NoAccentRE=Replace(NoAccentRE,"ú","§")
 NoAccentRE=Replace(NoAccentRE,"ü","§")
 NoAccentRE=Replace(NoAccentRE,"ű","§")
 NoAccentRE=Replace(NoAccentRE,"§","[uúüű]")
end function
xnagyg
+1  A: 

It looks like you could parse the output of the new SQL Server 2008 stored procedure sys.dm_fts_parser and use regex, but I haven't looked at it too closely.

Ishmael
A: 

Expanding on Ishmael's idea, it's not the final solution, but I think it's a good way to start.

Firstly we need to get the list of words that have been retrieved with the full-text engine:

declare @SearchPattern nvarchar(1000) = 'FORMSOF (INFLECTIONAL, " ' + @SearchString + ' ")' 
declare @SearchWords table (Word varchar(100), Expansion_type int)
insert into @SearchWords
select distinct display_term, expansion_type
from sys.dm_fts_parser(@SearchPattern, 1033, 0, 0)
where special_term = 'Exact Match'

There is already quite a lot one can expand on, for example the search pattern is quite basic; also there are probably better ways to filter out the words you don't need, but it least it gives you a list of stem words etc. that would be matched by full-text search.

After you get the results you need, you can use RegEx to parse through the result set (or preferably only a subset to speed it up, although I haven't yet figured out a good way to do so). For this I simply use two while loops and a bunch of temporary table and variables:

declare @FinalResults table 
while (select COUNT(*) from @PrelimResults) > 0
begin
    select top 1 @CurrID = [UID], @Text = Text from @PrelimResults
    declare @TextLength int = LEN(@Text )
    declare @IndexOfDot int = CHARINDEX('.', REVERSE(@Text ), @TextLength - dbo.RegExIndexOf(@Text, '\b' + @FirstSearchWord + '\b') + 1)
    set @Text = SUBSTRING(@Text, case @IndexOfDot when 0 then 0 else @TextLength - @IndexOfDot + 3 end, 300)

    while (select COUNT(*) from @TempSearchWords) > 0
    begin
        select top 1 @CurrWord = Word from @TempSearchWords
        set @Text = dbo.RegExReplace(@Text, '\b' + @CurrWord + '\b',  '<b>' + SUBSTRING(@Text, dbo.RegExIndexOf(@Text, '\b' + @CurrWord + '\b'), LEN(@CurrWord) + 1) + '</b>')
        delete from @TempSearchWords where Word = @CurrWord
    end

    insert into @FinalResults
    select * from @PrelimResults where [UID] = @CurrID
    delete from @PrelimResults where [UID] = @CurrID
end

Several notes:
1. Nested while loops probably aren't the most efficient way of doing it, however nothing else comes to mind. If I were to use cursors, it would essentially be the same thing?
2. @FirstSearchWord here to refers to the first instance in the text of one of the original search words, so essentially the text you are replacing is only going to be in the summary. Again, it's quite a basic method, some sort of text cluster finding algorithm would probably be handy.
3. To get RegEx in the first place, you need CLR user-defined functions.

Shagglez