views: 517
answers: 2

Short question:

Has anybody got any C# code to parse robots.txt and then evaluate URLs against it, to see whether they would be excluded or not?

Long question:

I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two modes: a user mode (like a traditional sitemap) and an 'admin' mode.

The admin mode will show all possible URLs on the site, including customized entry URLs or URLs for a specific outside partner, such as example.com/oprah for anyone who sees our site on Oprah. I want to track published links somewhere other than in an Excel spreadsheet.

I have to assume that someone might publish the /oprah link on their blog or elsewhere. We don't actually want this 'mini-Oprah site' to be indexed, because that would let non-Oprah viewers find the special Oprah offers.

So at the same time I was creating the sitemap, I also added URLs such as /oprah to our robots.txt file so they would be excluded from crawling.

Then (and this is the actual question) I thought, 'wouldn't it be nice to show on the sitemap whether or not files are indexed and visible to robots?' This would be quite simple: just parse robots.txt and then evaluate a link against it.

However, this is a 'bonus feature' and I certainly don't have time to go off and write it (even though it's probably not that complex), so I was wondering if anyone has already written any code to parse robots.txt?
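To illustrate the level of checking I have in mind, here is a minimal sketch (the names are just placeholders, and it only handles the '*' user-agent and plain Disallow prefixes, with no Allow rules or wildcard matching):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Naive robots.txt checker: collects Disallow rules for the "*" user-agent
    // and tests a path against them. No Allow rules, wildcards, or per-bot groups.
    public class SimpleRobotsTxt
    {
        private readonly List<string> _disallowed = new List<string>();

        public SimpleRobotsTxt(string robotsTxtContent)
        {
            bool inWildcardGroup = false;
            foreach (var rawLine in robotsTxtContent.Split('\n'))
            {
                var line = rawLine.Split('#')[0].Trim();      // strip comments and whitespace
                var parts = line.Split(new[] { ':' }, 2);
                if (parts.Length != 2) continue;

                var field = parts[0].Trim().ToLowerInvariant();
                var value = parts[1].Trim();

                if (field == "user-agent")
                    inWildcardGroup = (value == "*");
                else if (field == "disallow" && inWildcardGroup && value.Length > 0)
                    _disallowed.Add(value);
            }
        }

        // True if the path (e.g. "/oprah") falls under a Disallow rule.
        public bool IsDisallowed(string path)
        {
            return _disallowed.Any(rule =>
                path.StartsWith(rule, StringComparison.OrdinalIgnoreCase));
        }
    }

On the sitemap page that would boil down to something like new SimpleRobotsTxt(File.ReadAllText(robotsPath)).IsDisallowed("/oprah") for each link (File.ReadAllText needs System.IO).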

+4  A: 

Hate to say it, but just Google "C# robots.txt parser" and click the first hit. It's a CodeProject article about "Searcharoo", a simple search engine implemented in C#, and it contains a class, Searcharoo.Indexer.RobotsTxt, described as:

  1. Check for, and if present, download and parse the robots.txt file on the site
  2. Provide an interface for the Spider to check each Url against the robots.txt rules
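
(Not from the article itself, just a rough sketch of what step 1 amounts to, with made-up names: fetch /robots.txt from the site root and treat a 404 as "nothing excluded".)

    using System;
    using System.Net;

    public static class RobotsTxtFetcher
    {
        // Returns the robots.txt text for a site, or null if the site has none
        // (no robots.txt means nothing is excluded).
        public static string TryDownload(Uri siteRoot)
        {
            var robotsUri = new Uri(siteRoot, "/robots.txt");
            using (var client = new WebClient())
            {
                try
                {
                    return client.DownloadString(robotsUri);
                }
                catch (WebException ex)
                {
                    var http = ex.Response as HttpWebResponse;
                    if (http != null && http.StatusCode == HttpStatusCode.NotFound)
                        return null;
                    throw;
                }
            }
        }
    }

The downloaded text would then be handed to whatever parser does step 2, the per-URL check.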
markus
Oops, I'll admit I didn't search Google this time. Ironically, this question is now the first match for 'c# robots.txt' :-) I'll see if I can extract what I need from that. Thanks.
Simon_Weaver
I hope you're not stuck in an infinite loop now ;-) Funny, they even show exactly the Google part of my answer as the preview text. I didn't realize Google had become so fast by now, even for non-news sites; very interesting.
markus
A: 

A bit of self-promotion, but since I needed a similar parser and couldn't find anything I was happy with, I created my own:

http://nrobots.codeplex.com/

I'd love any feedback.

SaguiItay