In my database I have a field which contains an HTML document. This document must be searchable, but the HTML tags themselves must not be matched. For example, in a document like this, only the text content should be found:

<html>
  <head>
    <title>Bar</title>
  </head>
  <body>
   <p>
     this content may be found
   </p>
  </body>
</html>

The document stored in the database is not necessarily valid XHTML. What is the best way to search the content? Should I use regular expressions? If so, what would the expression look like? And if not, what should I use instead?

+2  A: 

You could try turning on Full-Text Search or use something like Lucene.Net to index the content for you.

Joel Coehoorn
+2  A: 

How many records are there? I expect you might have to use full-text search and an IFilter to do this efficiently. HTML does not lend itself well to regex: even something very simple can quickly become very hard to do.
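To illustrate the regex problem, here is a minimal sketch in Python (chosen only for brevity; the thread itself is about SQL Server and .NET). The pattern and the function name `strip_tags_naively` are made up for this example. It shows an obvious "remove anything between angle brackets" expression going wrong on an attribute value that contains `>`:

```python
import re

# Hypothetical naive pattern: "a tag is anything between < and >".
NAIVE_TAG = re.compile(r"<[^>]+>")


def strip_tags_naively(html):
    """Remove everything the naive pattern considers a tag."""
    return NAIVE_TAG.sub("", html)


# A '>' inside a quoted attribute value ends the match too early,
# so part of the tag leaks into the "text" that would be searched.
print(repr(strip_tags_naively('<p title="a > b">hello</p>')))

# Script bodies are not tags, so they survive and become searchable.
print(repr(strip_tags_naively("<script>var x = 1;</script>text")))
```

Each special case (quoted attributes, scripts, comments, unclosed tags) needs another patch to the expression, which is why a real parser or an indexing filter tends to be the safer route.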

If the volume isn't huge, you could iterate over the records with an external parsing application, using something like the HTML Agility Pack (for .NET) or any other DOM of your choice.

But the FTS/IFilter would be my first choice.
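For the external-parser route, a rough sketch of the idea follows. It uses Python's standard `html.parser` rather than the HTML Agility Pack, purely as an illustration; the names `TextExtractor`, `visible_text` and `matches` are hypothetical, and skipping `script`/`style` bodies is an assumption about what should count as searchable text. The parser is lenient, so it does not require well-formed XHTML.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring tags and script/style bodies."""

    SKIPPED = {"script", "style"}  # assumption: these are not searchable

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space and collapse runs of whitespace.
        # Note: this also separates words split by inline tags.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    """Return the text content of an HTML string, without markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def matches(html, term):
    """Case-insensitive substring search over the text content only."""
    return term.lower() in visible_text(html).lower()


doc = ("<html><head><title>Bar</title></head>"
       "<body><p>this content may be found</p></body></html>")
print(matches(doc, "content may"))  # text inside <p>
print(matches(doc, "body"))         # only occurs as a tag name
```

With a few hundred records per table, running something like this over each row (or storing the extracted text in a separate column and searching that) may be enough; note that this sketch also treats the `<title>` text as searchable.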

Marc Gravell
The search has to cover 5 tables, each with a few hundred records. How do I use FTS and an IFilter?
Martijn
It'll be somewhere under: http://msdn.microsoft.com/en-us/library/ms142571.aspx
Marc Gravell
Looks to be under the "Management" node in Management Studio.
Marc Gravell