robots.txt

How to set up a robots.txt which only allows the default page of a site

Say I have a site at http://website.com. I would really like to allow bots to see the home page, but every other page needs to be blocked, as it is pointless to spider them. In other words, http://website.com & http://website.com/ should be allowed, but http://website.com/anything and http://website.com/someendpoint.ashx should be blocked. Further...
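A hedged sketch of a robots.txt for this; note that `Allow` and the `$` end-of-URL anchor are extensions honored by Google and Bing, not part of the original 1994 robots.txt standard, so other crawlers may interpret them differently:

```
User-agent: *
Allow: /$
Disallow: /
```

For crawlers that support it, `Allow: /$` matches only the bare `/` path and, being the more specific rule, takes precedence over `Disallow: /`.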

Googlebots Ignoring robots.txt?

I have a site with the following robots.txt in the root: User-agent: * Disabled: / User-agent: Googlebot Disabled: / User-agent: Googlebot-Image Disallow: / And pages within this site are getting scanned by Googlebots all day long. Is there something wrong with my file or with Google? ...
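For reference, the field names recognized by the robots.txt protocol are `User-agent` and `Disallow` (plus later extensions such as `Allow` and `Sitemap`); a directive spelled any other way is simply ignored by crawlers. A minimal well-formed block looks like:

```
User-agent: *
Disallow: /
```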

Possible to prevent search engine spiders from infinitely crawling paging links on search results?

Our SEO team would like to open up our main dynamic search results page to spiders and remove the 'nofollow' from the meta tags. It is currently accessible to spiders via allowing the path in robots.txt, but with a 'nofollow' clause in the meta tag which prevents spiders from going beyond the first page. <meta name="robots" content="in...
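As a hypothetical illustration of the values involved (the actual tag in the question is truncated): `nofollow` tells spiders not to follow links found on the page, independently of whether the page itself is indexed:

```
<!-- spiders may index this page but will not follow its links -->
<meta name="robots" content="index, nofollow">

<!-- spiders may index this page and follow its paging links -->
<meta name="robots" content="index, follow">
```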

Robots.txt to disallow everything and allow only specific parts of the site/pages. Is "allow" supported by crawlers like Ultraseek and FAST?

Hi, Just wanted to know if it is possible to disallow the whole site for crawlers and allow only specific webpages or sections? Is "allow" supported by crawlers like FAST and Ultraseek? Kind regards, ...
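A hedged sketch of the disallow-everything-except pattern (`/public/` is a made-up path):

```
User-agent: *
Allow: /public/
Disallow: /
```

`Allow` is a later extension, so support and rule precedence vary between crawlers, which is exactly what the question is asking about for FAST and Ultraseek. Crawlers with first-match semantics need the `Allow` line before the `Disallow`; Google instead applies the most specific (longest) matching rule regardless of order.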

Googlebot not respecting Robots.txt

For some reason when I check on Google Webmaster Tool's "Analyze robots.txt" to see which urls are blocked by our robots.txt file, it's not what I'm expecting. Here is a snippet from the beginning of our file: Sitemap: http://[omitted]/sitemap_index.xml User-agent: Mediapartners-Google Disallow: /scripts User-agent: * Disallow: /scrip...

robots.txt: disallow all but a select few, why not?

I've been thinking a while about disallowing every crawler except Ask, Google, Microsoft, and Yahoo! from my site. The reasoning behind this is that I've never seen any traffic being generated by any of the other web-crawlers out there. My questions are: Is there any reason not to? Has anybody done this? Did you notice any negative e...
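A sketch of such a whitelist, using the crawler tokens those engines used at the time (Teoma for Ask, Slurp for Yahoo!, msnbot for Microsoft); an empty `Disallow:` means "nothing is disallowed":

```
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: Teoma
Disallow:

User-agent: *
Disallow: /
```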

How to prevent robots.txt passing from staging env to production?

It has happened in the past that one of our IT specialists accidentally moved the robots.txt from staging to production, blocking Google and others from indexing our customers' site in production. Is there a good way of managing this situation? Thanks in advance. ...
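One way to manage it is a deployment-time smoke test that refuses to ship a robots.txt that blocks the site root for all crawlers. A minimal sketch using Python's stdlib `urllib.robotparser` (the file contents are inlined here for illustration; a real check would read the artifact about to be deployed):

```python
from urllib.robotparser import RobotFileParser

# Typical staging file (block everything) vs. a normal production file.
staging_txt = "User-agent: *\nDisallow: /\n"
production_txt = "User-agent: *\nDisallow: /scripts/\n"

def blocks_site_root(robots_txt: str) -> bool:
    """Return True if this robots.txt denies all crawlers the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", "http://example.com/")

print(blocks_site_root(staging_txt))     # True  -> abort the deploy
print(blocks_site_root(production_txt))  # False -> safe to ship
```

Wired into a CI step, the deploy fails before the staging file ever reaches production.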

Anybody got any C# code to parse robots.txt and evaluate URLs against it

Short question: Has anybody got any C# code to parse robots.txt and then evaluate URLs against it, to see whether they would be excluded or not? Long question: I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two modes, a user mode (like a traditional sitemap) and an 'admin' mode. The admin mode ...
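Not C#, but as a sketch of the same idea, Python's standard library ships exactly this in `urllib.robotparser`; the rules and URLs below are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice you would fetch it from
# the site's /robots.txt (or use RobotFileParser.set_url() plus read()).
rules = """User-agent: *
Disallow: /admin/
Disallow: /scripts/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Evaluate candidate sitemap URLs against the parsed rules.
print(parser.can_fetch("*", "http://example.com/index.html"))  # True
print(parser.can_fetch("*", "http://example.com/admin/edit"))  # False
```

The same filter could drop the 'admin' URLs from the public sitemap before it is generated.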

robots.txt: Disallow bots to access a given "url depth"

I have links with this structure: http://www.example.com/tags/blah http://www.example.com/tags/blubb http://www.example.com/tags/blah/blubb (for all items that match BOTH tags) I want google & co to spider all links that have ONE tag in the URL, but NOT the URLs that have two or more tags. Currently I use the html meta tag "robots" ...
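For crawlers that support the `*` wildcard (a Google/Bing extension, not part of the base standard), a sketch matching the structure above:

```
User-agent: *
Disallow: /tags/*/
```

This blocks /tags/blah/blubb (a second path segment after /tags/) while leaving /tags/blah crawlable, since the pattern requires a second slash.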

will googlebot index my site?

Hi, in my robots.txt file, I have the following lines: User-agent: Googlebot-Mobile Disallow: / User-agent:GoogleBot Disallow: / Sitemap: http://mydomain.com/sitemapindex.xml I know that if I put the first 4 lines, Googlebot won't index the site, but what if I put the last line, Sitemap: http://mydomain.com/sitemapindex.xml, will go...

Robots.txt block access to all https:// pages

What would the syntax be to block all bots from any https:// pages? I have an old site that no longer has an SSL certificate, and I want to block access to all https:// pages ...
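robots.txt is fetched separately per scheme and host, so the usual approach is to serve a different file for https requests. A hedged Apache mod_rewrite sketch (the rewritten filename is made up):

```
RewriteEngine On
RewriteCond %{HTTPS} on
RewriteRule ^robots\.txt$ /robots_ssl.txt [L]
```

robots_ssl.txt would then contain just `User-agent: *` and `Disallow: /`, blocking everything over https while the plain-http robots.txt stays permissive.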

Google indexed my test folders on my website :( How do I restrict the web crawlers!

Help Help! Google indexed a test folder on my website which no one except me was supposed to know about :(! How do I restrict Google from indexing certain links and folders? ...
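A minimal sketch, assuming the folder is called /test/ (substitute the real path):

```
User-agent: *
Disallow: /test/
```

Note that this only stops future crawling; pages already in the index additionally need a removal request or a `noindex` meta tag to disappear.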

Ethics of Robots.txt

I have a serious question. I'm not trying to start a flamewar or to incite any violence--but here goes. Is it ever ethical to ignore the presence of a robots.txt file on a website? These are some of the considerations I've got in mind: 1.) If someone puts a web site up they're expecting some visits. Granted, web crawlers are using...

Is the sitemap.axd accepted by all search engines?

I am currently generating a sitemap file dynamically using an HttpHandler, with a path set to sitemap.axd. This then returns xml content. No one at my office is certain if all search engines accept this extension or if they need .xml to parse. I know that I can submit it to Google through the webmaster tools and use robots.txt to indicate...
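For the robots.txt part, the `Sitemap` directive can point at any URL, extension included; what engines generally care about is that the response is valid sitemap XML, not the file suffix. A sketch with a made-up host:

```
Sitemap: http://www.example.com/sitemap.axd
```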

Restrict robot access for (specific) query string (parameter) values?

Using robots.txt, is it possible to restrict robot access for (specific) query string (parameter) values? i.e. http://www.url.com/default.aspx #allow http://www.url.com/default.aspx?id=6 #allow http://www.url.com/default.aspx?id=7 #disallow ...
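A sketch for that exact example. Plain robots.txt rules are prefix matches, so `Disallow: /default.aspx?id=7` would also block id=70, id=71, and so on; crawlers that honor the `$` extension can be pinned to the single value:

```
User-agent: *
Disallow: /default.aspx?id=7$
```

For crawlers without `$` support, drop the anchor and accept that every id beginning with 7 is blocked.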

How can I prevent the googlebot from crawling Ajaxified Links?

I've got a bunch of ajaxified links that do things like vote up, vote down, flag a post - standard community moderation stuff. Problem is that the googlebot crawls those links, and votes up, votes down, and flags items. Will adding this to robots.txt prevent the googlebot from crawling those links? Or is there something else I need to...
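Assuming the moderation endpoints live under paths like /vote and /flag (made-up names here), a robots.txt sketch would be:

```
User-agent: *
Disallow: /vote
Disallow: /flag
```

Well-behaved bots respect this, but the more robust fix is to make state-changing actions POST requests, since crawlers only follow GET links.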

Google's robots.txt: Is scraping your positions = ignoring it?

I have found a post http://stackoverflow.com/questions/999056/ethics-of-robots-txt/999088#999088 discussing the matter of robots.txt on web sites. Generally, I agree with the principles. However, there are commercial tools checking Google positions by - very likely - scraping Google for results, due to the lack of an API (in case someone doesn't ...

Google Sitemap and Robots.txt Issue

Hi, We have a sitemap at our site, http://www.gamezebo.com/sitemap.xml Some of the URLs in the sitemap are being reported in Webmaster Central as being blocked by our robots.txt, see gamezebo.com/robots.txt ! Although these URLs are not disallowed in robots.txt. There are other such URLs as well, for example, gamezebo.com/gamelin...

how to disallow all dynamic urls in robots.txt

how to disallow all dynamic urls in robots.txt Disallow: /?q=admin/ Disallow: /?q=aggregator/ Disallow: /?q=comment/reply/ Disallow: /?q=contact/ Disallow: /?q=logout/ Disallow: /?q=node/add/ Disallow: /?q=search/ Disallow: /?q=user/password/ Disallow: /?q=user/register/ Disallow: /?q=user/login/ I want to disallow all things that st...
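Since robots.txt rules are prefix matches, all of those lines collapse into one; a sketch:

```
User-agent: *
Disallow: /?q=
```

Every URL whose path-and-query begins with /?q= is then blocked. For crawlers with wildcard support, `Disallow: /*?q=` would also catch a q parameter appearing after a longer path.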

robots.txt and wildcard at the end of disallow

I need to disallow indexing of 2 pages, one of them dynamic: site.com/news.php site.com/news.php?id=__ site.com/news-all.php What should I write in robots.txt: User-agent: * Disallow: /news or Disallow: /news* or Disallow: /news.php* Disallow: /news-all.php Should one use a wildcard at the end or not? ...
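Since Disallow values are prefix matches, a trailing * is redundant for crawlers that support wildcards and a literal character for those that don't, so it is best omitted. A sketch covering all three URLs:

```
User-agent: *
Disallow: /news.php
Disallow: /news-all.php
```

`/news.php` already covers `/news.php?id=__` by prefix matching; the shorter `Disallow: /news` would also work, but it would additionally block any other path starting with /news.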