I want my SharePoint site to allow a user to search content in a known collection of RSS feeds. I can think of a few conceptual ways to do this:

  • Crawl the feeds at their source (yikes!)
  • Pull the full articles into my SharePoint site, then let my crawler crawl them
  • Make use of an existing index (like Google's)
  • Search the full articles on demand, using something like a Google utility (my preference)

So: can I somehow, from my SharePoint site, allow a user to search the full articles from a couple dozen named RSS feeds?

thanks

Cary
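
For reference, the pull-the-articles and search-on-demand options can be prototyped outside SharePoint with nothing but the Python standard library. This is a minimal sketch that searches the titles and descriptions of a known set of feeds; the feed URLs are placeholders, it assumes RSS 2.0 XML, and searching the full articles rather than the summaries would additionally mean fetching each item's link before matching:

    # A sketch of searching a known set of RSS feeds with only the Python
    # standard library. Feed URLs are placeholders; assumes RSS 2.0 XML.
    import urllib.request
    import xml.etree.ElementTree as ET

    FEEDS = [
        "http://example.com/feed1.rss",   # placeholder
        "http://example.com/feed2.rss",   # placeholder
    ]

    def search_feeds(query):
        """Return (title, link) pairs whose title or description mention query."""
        query = query.lower()
        hits = []
        for url in FEEDS:
            with urllib.request.urlopen(url) as resp:
                root = ET.fromstring(resp.read())
            for item in root.iterfind("./channel/item"):  # RSS 2.0 layout
                title = item.findtext("title", default="")
                desc = item.findtext("description", default="")
                if query in title.lower() or query in desc.lower():
                    hits.append((title, item.findtext("link", default="")))
        return hits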

+1  A: 

I don't see why there would be a problem with crawling the feeds at their source; that seems reasonable.

It is fairly easy to create a content source pointing at the feed and select an appropriate indexing schedule. If that does not work, you can try a more complicated approach.

Be aware that copying the content of another website to host on your own could have copyright implications (not to mention the risk that any inflammatory content would appear to be published on your own site).

--update--

Try reading the target site's robots.txt (if it even has one) to see whether it specifies a desired crawl frequency. Otherwise it depends on the depth of the site you would be crawling.

If you are crawling just the RSS feed XML, I suspect you could do that every hour without annoying anyone. If you reach into each article as well, you may want to limit that. It really depends on any relationship you have with the target site and the type of site you are hitting.
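
If you want to check programmatically what a site's robots.txt declares before picking a schedule, Python's standard library can do it. A minimal sketch, with a placeholder site URL and a hypothetical user-agent name (crawl_delay and request_rate need Python 3.6+):

    # A sketch of checking a site's robots.txt before picking a crawl
    # schedule. The site URL and user-agent name are placeholders.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    agent = "MyFeedCrawler"  # hypothetical user-agent
    print("May fetch feed:", rp.can_fetch(agent, "http://example.com/feed.rss"))
    print("Crawl-delay:", rp.crawl_delay(agent))  # None if not declared
    rate = rp.request_rate(agent)                 # None if not declared
    if rate:
        print("Rate:", rate.requests, "requests per", rate.seconds, "seconds")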

Check out this article for a little more info on how SharePoint deals with robots.txt.

(P.S. the target site did not put the articles on the web so that no one would read them.)

Nat
Thanks Nat. Regarding crawling the feeds: I was thinking of the performance impact on the target site. Obviously major search engines do it, but my site is not a major search engine. Is that type of behavior frowned upon? I was also not seriously considering downloading content, though my client had suggested it, but you raise additional good reasons not to, so thanks.
How often were you thinking of crawling?
Nat
Thanks again Nat. I saw your edit last week, but didn't see this question until just now. Crawling external sites once a day would be more than good enough.
I can't imagine it would make much difference to the website's traffic at once a day. Just don't schedule it for out-of-hours or your results may come back as "site is under scheduled maintenance" one day :)
Nat
A: 

The out-of-the-box crawler will respect robots.txt, and there are provisions for crawler impact rules that will lessen the chance that SharePoint hammers the external site.

Mark Mascolino
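
Crawler impact rules are configured in Central Administration rather than in code, but what they amount to is a politeness throttle: fewer simultaneous requests, or a pause between them. If you were fetching the feeds yourself instead, the equivalent is a sketch like this (the delay value and any URLs passed in are placeholders):

    # A sketch of the throttling a crawler impact rule provides: fetch one
    # URL at a time with a fixed pause between requests.
    import time
    import urllib.request

    DELAY_SECONDS = 5  # assumed politeness delay between requests

    def fetch_politely(urls):
        pages = {}
        for url in urls:
            with urllib.request.urlopen(url) as resp:
                pages[url] = resp.read()
            time.sleep(DELAY_SECONDS)  # wait before hitting the site again
        return pages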
A: 

Sorry, this is not an answer but a further question: can I trigger a workflow on the crawled external RSS feed when a new item is published in the feed? I want to either send an email or add some more metadata to the post content. The second option means I would need to store the RSS feed content locally.

Hemant
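
Not a workflow answer, but the locally-stored-content option usually boils down to remembering which item GUIDs have already been seen and acting on the new ones. A minimal sketch of that polling approach; the feed URL, state file, addresses, and SMTP host are all placeholders:

    # A sketch of detecting newly published feed items and emailing about
    # them. Feed URL, state file, addresses, and SMTP host are placeholders.
    import urllib.request
    import xml.etree.ElementTree as ET
    import smtplib
    from email.message import EmailMessage
    from pathlib import Path

    FEED_URL = "http://example.com/feed.rss"
    STATE_FILE = Path("seen_guids.txt")  # locally stored feed state

    def check_feed():
        seen = set(STATE_FILE.read_text().splitlines()) if STATE_FILE.exists() else set()
        with urllib.request.urlopen(FEED_URL) as resp:
            root = ET.fromstring(resp.read())
        new_items = []
        for item in root.iterfind("./channel/item"):
            guid = item.findtext("guid") or item.findtext("link") or ""
            if guid and guid not in seen:
                new_items.append((item.findtext("title", default=""), guid))
                seen.add(guid)
        STATE_FILE.write_text("\n".join(sorted(seen)))
        return new_items

    def notify(new_items):
        msg = EmailMessage()
        msg["Subject"] = "%d new feed item(s)" % len(new_items)
        msg["From"] = "crawler@example.com"
        msg["To"] = "me@example.com"
        msg.set_content("\n".join("%s (%s)" % (t, g) for t, g in new_items))
        with smtplib.SMTP("mail.example.com") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        items = check_feed()
        if items:
            notify(items)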