views:

381

answers:

2

Hi, I'm working on a knowledge base project using SQL Server 2008 Full Text Search Engine. Project included in articles and files where each article has multiple files.In those articles whole content is pure html.

Right now,I successfully created fulltext catalog and index on SQL Server 2008 and my database is version 10 compatible.

Here are my questions:

1)Is it possible to ignore html tags,more clearly texts containing in "<...>", while searching in these articles,because if i wish to search for div,table etc. there should be no result returned?

2)Articles will be updated anytime,so full text index must be updated when a new record is inserted.Is it enough to set only "TRACK CHANGES AUTOMATIC" while creating full text catalog?

3)We may use FILESTREAM feature hereafter,does SQL Server 2008 have a good performance on files using full text index? What specific document types does SQL Server 2008 good on indexing?

Regards

+1  A: 

Please check for these:

1) In SQL Server Full Text, we can define noise words/Stopwords. You can edit the Noise world file and then you have to rebuild the catalog. So you can put all the html tags as noise. Please check

http://msdn.microsoft.com/en-us/library/ms142551.aspx

2) With track changes it automatically include the changes in current full text search, but the ranking of these newly added article gets changed from the previous. So until and unless you master index is synced it will give up and down with ranking.

3) As far as i know we can implement custom filters, stemmers and word breakers and can plug into SQL Server full text search.By default i may not know the complete list, but it does doc and pdf.

For more information on SQL Server full text search 2008 please check:

http://technet.microsoft.com/en-us/library/cc721269.aspx

Nitin Midha
Stopwords are not supposed to filter html tags.If there is a text like "<form onclick="doSome();">" what will I give as a stopword to filter it like,"form","<form>" etc. ??
Myra
I thought you want to just ignore basic html tags. For this situation i would like to recommend to implement Custom Word Breaker and Stemmer by implementing IWordBreaker, IStemmer and if required IFilter interfaces and then plug into your sql server.
Nitin Midha
Would you give an example of implementing these interfaces you mentioned ?
Myra
You can find information on these atIWordBreaker: http://msdn.microsoft.com/en-us/library/ms691079(VS.85).aspxIFilter: http://msdn.microsoft.com/en-us/library/ms691105(VS.85).aspxIStemmer: http://msdn.microsoft.com/en-us/library/ms690983(VS.85).aspxSample Implementation of IWordBreaker: http://blogs.msdn.com/michkap/archive/2005/03/14/395199.aspxRegistering with SQL Server: http://blogs.msdn.com/shajan/default.aspxHope these helps.
Nitin Midha
Since you're the only one who's answered,I accept your answer,Thank you Nitin Midha
Myra
Thanks, I hope it has helped you in someway, that is why you have accepted the answer. If this answer does not satisfy your needs, then please leave it open so that others may have interest and someone would come up with a better solution :). Thanks .....
Nitin Midha
+1  A: 

there is a filter for .htm and .html files.

to see if you have the filter installed run this sql:

SELECT * FROM sys.fulltext_document_types

you should see:

.htm E0CA5340-4534-11CF-B952-00AA0051FE20 C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll 12.0.6828.0 Microsoft Corporation

.html E0CA5340-4534-11CF-B952-00AA0051FE20 C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll 12.0.6828.0 Microsoft Corporation

so, if you can convert your articles column to varbinary(max), then you can add a full text index on it and specify a doc type of '.html'

once the index has populated, you can verify the keywords using this sql:

SELECT display_term, column_id, document_count FROM sys.dm_fts_index_keywords (DB_ID('your_db'), OBJECT_ID('your_table'))

DC
Your first upvote from me :)
Myra