robots.txt

How do I get http://mysite.com/ indexed in a search engine, not http://mysite.com/index.php, and nothing more?

I want only http://mysite.com/ indexed in the search engine, not http://mysite.com/index.php. I only want to allow indexing of the main page of the website, and nothing more. I do not want the bot to follow any of the links on the main page. My meta tags include the following: <meta name="robots" content="index, nofollo...
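A common way to express "index only the main page" in robots.txt (a sketch, not from the question; the `$` end-of-URL anchor is an extension honored by Googlebot and other major crawlers, not part of the original robots.txt standard):

```
User-agent: *
Allow: /$
Disallow: /
```

Combined with the question's `nofollow` meta tag, this blocks crawling of everything except the root URL itself.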

Want 301 redirects for everything except robots.txt

Hi all, we have moved our website to a new domain and want all pages of the old site to be removed from search engines. It's the same site, same content, just a new domain, so search engines are taking their time, maybe because of duplicate content. We have added a 301 in the old site's .htaccess pointing to the new site: Redirect 301 / http://new-domain.com/ N...
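The redirect-everything-except-robots.txt setup the title asks about can be sketched with mod_rewrite instead of a blanket `Redirect 301` (the directives are standard Apache mod_rewrite; the domain is the question's placeholder):

```apache
RewriteEngine On
# Leave robots.txt alone so the old host can still serve its own rules
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# 301 everything else to the same path on the new domain
RewriteRule ^(.*)$ http://new-domain.com/$1 [R=301,L]
```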

Will this robots.txt only allow googlebot to index my site?

Will this robots.txt file only allow Googlebot to index my site's index.php file? Caveat: I have an .htaccess redirect so that people who type in http://www.example.com/index.php are redirected to simply http://www.example.com/. So, this is my robots.txt file content... User-agent: Googlebot Allow: /index.php Disallow: / User-agent: ...
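Laid out, the first record of the file quoted in the excerpt reads as follows (the second `User-agent:` group is truncated in the excerpt and omitted here):

```
User-agent: Googlebot
Allow: /index.php
Disallow: /
```

Under Google's longest-match precedence, the more specific `Allow: /index.php` overrides `Disallow: /` for that one path.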

Will Googlebot index my index.php even with an .htaccess redirect?

Using a redirect statement in my .htaccess file, people who type the following into the address bar... http://example.com/index.php ...are redirected to... http://example.com/ I also have a noindex, nofollow meta tag on all my website's pages. My question is: given that redirect behavior and that meta data, will Googlebot index my mainp...

robots.txt: what encoding?

I am about to create a robots.txt file using Notepad. How should I save the file: UTF-8, ANSI, or something else? Also, should the filename have a capital R? And in the file I am specifying a sitemap location; should that be with a capital S? User-agent: * Sitemap: http://www.domain.se/sitemap.xml Thanks ...

Bot/Spider Trap Ideas

I have a client whose domain seems to be getting hit pretty hard by what appears to be a DDoS. In the logs it's normal-looking user agents with random IPs, but they're flipping through pages too fast to be human. They also don't appear to be requesting any images. I can't seem to find any pattern, and my suspicion is it's a fleet of Window...

One robots.txt to allow crawling of only the live website; everything else should be disallowed

I need guidance on using robots.txt. The problem is as follows: I have one live website, "www.faisal.com" (or "faisal.com"), and two testing web servers, "faisal.jupiter.com" and "faisal.dev.com". I want one robots.txt to handle them all; I don't want crawlers to index pages from "faisal.jupiter.com" and "faisal.dev.com"...
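Since all three hosts share one codebase, one approach (a sketch; `robots-deny.txt` is a hypothetical file containing `User-agent: *` / `Disallow: /`) is to serve a different robots.txt on the test hosts via mod_rewrite:

```apache
RewriteEngine On
# On the test hosts only, answer robots.txt requests with the deny-all file
RewriteCond %{HTTP_HOST} ^faisal\.(jupiter|dev)\.com$ [NC]
RewriteRule ^robots\.txt$ /robots-deny.txt [L]
```

The live host falls through the condition and serves its normal robots.txt untouched.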

Spring SimpleUrlHandlerMapping to robots.txt

I am using the Spring Framework. The following is the mapping of URLs to controllers: <bean id="urlMapping" class="org.springframework.web.servlet.handler.SimpleUrlHandlerMapping"> <property name="mappings"> <props> <prop key="/controller.web">webController</prop> <prop key="/robots.txt">robotsController</prop> </props> </property> </bean> Whe...

robots.txt and mod_rewrite in .htaccess

In the robots.txt file, I am about to disallow some sections of my site. For instance, I don't want my "terms and conditions" page to be indexed by search engines: User-agent: * Disallow: /terms The real path to the file is actually /data/terms_and_conditions.html, but I have used .htaccess to rewrite the URL. Now to my question: should I ...
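One reading of the setup described above: crawlers only ever request the public, rewritten URLs, so the rule targets the public path rather than the internal one:

```
User-agent: *
# Matches the public URL /terms, not the internal /data/terms_and_conditions.html
Disallow: /terms
```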

How can robots.txt disallow all URLs except URLs that are in the sitemap?

I need to have control over which URLs are allowed to be indexed. To do this I want to allow Google to index only URLs that are listed in my sitemap(s), and disallow Google from indexing anything else. The easiest way to solve this would be to configure robots.txt to disallow everything: User-agent: * Disallow: / And ...
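Under Google's longest-match precedence, specific `Allow` rules can carve exceptions out of a blanket `Disallow: /`. A sketch (the `/products/` prefix is hypothetical, standing in for whatever paths the sitemap actually lists):

```
User-agent: *
Disallow: /
# For Googlebot the longest matching rule wins, so this overrides the Disallow
Allow: /products/
Sitemap: http://www.example.com/sitemap.xml
```

Note that merely listing a URL in the sitemap does not override a `Disallow` rule that matches it.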

How to block search engines from indexing all URLs beginning with origin.domainname.com

Hello all, I have www.domainname.com and origin.domainname.com pointing to the same codebase. Is there a way I can prevent all URLs under origin.domainname.com from getting indexed? Is there some rule in robots.txt to do it? Both URLs point to the same folder. Also, I tried redirecting origin.domainname.com to www.domainn...
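Since both hostnames serve the same folder, a per-host robots.txt can be simulated with mod_rewrite (a sketch; `robots-origin.txt` is a hypothetical deny-all file containing `User-agent: *` / `Disallow: /`):

```apache
RewriteEngine On
# Only requests arriving via the origin hostname get the blocking file
RewriteCond %{HTTP_HOST} ^origin\.domainname\.com$ [NC]
RewriteRule ^robots\.txt$ /robots-origin.txt [L]
```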

Why does Google index this?

On this webpage: http://www.alvolante.it/news/pompe_benzina_%E2%80%9Ctruccate%E2%80%9D_autostrada-308391044 there is this image: http://immagini.alvolante.it/sites/default/files/imagecache/anteprima_100/images/rifornimento_benzina.jpg Why is this image indexed if the robots.txt contains "Disallow: /sites/"? You can see that is ...

Why does Google index this?

Possible Duplicate: Why does Google index this? On this webpage: http://www.alvolante.it/news/pompe_benzina_%E2%80%9Ctruccate%E2%80%9D_autostrada-308391044 there is this image: http://immagini.alvolante.it/sites/default/files/imagecache/anteprima_100/images/rifornimento_benzina.jpg Why is this image indexed if the robots.tx...

Sitemap for a site with a large number of dynamic subdomains

Hi, I'm running a site that allows users to create subdomains. I'd like to submit these user subdomains to search engines via sitemaps. However, according to the sitemaps protocol (and Google Webmaster Tools), a single sitemap can include URLs from a single host only. What is the best approach? At the moment I have the following stru...

How can I block Alexa Toolbar users?

Any ideas how I can block Alexa Toolbar users? I don't want to appear in the rankings while we are in beta... I see you can block their search engine with User-agent: ia_archiver Disallow: / but I can't find any documentation on how to pull yourself from actually being ranked. I read earlier that someone tried to email them and they r...

Can I allow indexing (by search engines) of restricted content without making it public?

Hi, I have a site with some restricted content. I want my site to appear in search results, but I do not want it to become public. Is there a way I can allow crawlers to crawl my site but prevent them from making it public? The closest solution I have found is Google First Click Free, but even that requires me to show the c...

How to disallow access to a URL called without parameters using robots.txt

I would like to deny web robots access to a URL like this: http://www.example.com/export while allowing this kind of URL instead: http://www.example.com/export?foo=value1 A spider bot is calling /export without a query string, causing a lot of errors in my log. Is there a way to manage this filter in robots.txt? ...
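The `$` end-of-URL anchor, an extension supported by Googlebot, Bingbot, and other major crawlers (not part of the original robots.txt standard), can express exactly this distinction:

```
User-agent: *
# Matches only the bare /export URL; /export?foo=value1 is not blocked
Disallow: /export$
```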

Is it a good idea to block bots from a website through robots.txt?

Hello, here is a copy of my robots.txt content: Sitemap: http://www.go4film.com/sitemap.xml User-agent: ia_archiver Disallow: / User-agent: robtexbot Disallow: / User-agent: Googlebot Allow: / Here I only allow Googlebot, and block Alexa and everything else. So could someone please tell me, if I block search engines like Yahoo, Ask, Bing ...
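As quoted, the file only names ia_archiver, robtexbot, and Googlebot; any other crawler matches no group and is therefore unrestricted. A file that actually allows only Googlebot needs a catch-all group, for example (a sketch):

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```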

What's up with Facebook policies vs. graph.facebook.com/robots.txt?

Facebook's developer principles and policies, and its general terms of use, seem to forbid automated data collection, but graph.facebook.com/robots.txt seems to allow it: User-agent: * Disallow: Does anybody know how to make sense of this? ...