I'm currently looking at indexing an ASP website from SharePoint, and I need to replicate the old "advanced search" schema that the users are familiar with. To do this I need to index a few meta tags from the web pages. That part is easy, and for the text fields I can use them in the search as well. For date meta tags like "expired" or "published", however, I'm having problems: the meta tags are crawled as "text", but I need SharePoint to parse them as datetime. I've seen a few posts on TechNet [1] asking for the same thing, but with no answer.

[1]: https://forums.microsoft.com/TechNet/ShowPost.aspx?PostID=2614064&SiteID=17 TechNet

+1  A: 

The web crawler built into search is rudimentary and you won't be able to easily extend it to include meta tags. Allegedly you can write your own protocol handler and crawl the ASP pages in their own content source, and allegedly that works. I don't think anyone actually writes their own protocol handlers, though.

You're going to be disappointed with what the SharePoint crawler offers, which is why there are no answers on the official forum either--because the real answer is "Can't do that easily, sorry."

You may be able to hack something up by writing a custom web service (ASMX or WCF-based) that itself crawls the ASP pages' meta tags. From there, you could pull the web service results into the BDC which is searchable, and then in the search results/BDC data you can have a link to the original page. It's like a Rube Goldberg device, I know, but trust me when I say it will be easier than figuring out how to write a protocol handler.
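As a rough illustration of the "crawl the meta tags yourself" step, here is a minimal Python sketch, not the ASMX/WCF service itself. It assumes the pages expose tags such as `<meta name="published" content="...">`; the class and function names (`MetaTagCollector`, `extract_meta`), the sample page, and the date values are all made up for the example. Exposing the result through a web service and wiring it into the BDC is left out.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of meta tag name -> raw (string) content."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return collector.meta


# Hypothetical page; the tag names come from the question, the values are invented.
sample_page = """
<html><head>
<meta name="published" content="2008-03-01">
<meta name="expired" content="2009-03-01">
<title>Sample</title>
</head><body></body></html>
"""

meta = extract_meta(sample_page)
```

The values are still plain strings at this point, so the service would need to parse them into real dates before handing them to the BDC.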

Actually the crawler does find the meta tags, as I said in my question. The problem is that it does not support mapping/conversion of value types.
noocyte
+2  A: 

You're not doing anything wrong; this is how the product works. To add to what was said earlier, it's not easy to customize.

The proper way to approach this is to create a custom protocol handler for HTML. This is a custom COM Object that implements a few interfaces. The MOSS 2007 SDK has a protocol handler reference.

When we did this, we created an ini file so we could define the type we wanted META fields crawled as (String, Int, DateTime). Then when you added the custom properties everything was properly parsed. Then you can use the custom properties like you would normally.
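To give an idea of the ini-driven typing, here is a small Python sketch. It is not the protocol handler (that is a COM object), only the lookup-and-convert idea. The section name `meta_types`, the field names, the date format, and the function names are assumptions for illustration, not what our actual file looked like.

```python
import configparser
from datetime import datetime

# Hypothetical ini content: meta field name -> type it should be crawled as.
INI_TEXT = """
[meta_types]
published = DateTime
expired = DateTime
pagecount = Int
author = String
"""

# Assumed date format of the meta tag content; adjust to what the pages emit.
DATE_FORMAT = "%Y-%m-%d"

CONVERTERS = {
    "string": str,
    "int": int,
    "datetime": lambda value: datetime.strptime(value, DATE_FORMAT),
}


def load_type_map(ini_text):
    """Read the ini text and return {field name: lowercase type name}."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {name: type_name.lower()
            for name, type_name in parser["meta_types"].items()}


def convert_meta(raw_meta, type_map):
    """Convert raw string values; fields missing from the ini stay strings."""
    typed = {}
    for name, value in raw_meta.items():
        type_name = type_map.get(name.lower(), "string")
        typed[name] = CONVERTERS[type_name](value.strip())
    return typed


type_map = load_type_map(INI_TEXT)
typed = convert_meta(
    {"published": "2008-03-01", "pagecount": "12", "author": "noocyte"},
    type_map,
)
```

A value that does not match its declared type raises `ValueError` here; a real handler would more likely log the problem and fall back to text.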

jwmiller5
You could also just wrap the HTML IFilter, which is responsible for extracting the properties and sending them downstream.
Lars Fastrup