robots.txt

How to disallow search pages from robots.txt

I need to keep search pages such as http://example.com/startup?page=2 from being indexed. I want http://example.com/startup to be indexed, but not http://example.com/startup?page=2, page 3, and so on. Edit: I forgot to add one detail: "startup" can be any value, e.g. http://example.com/XXXXX?page Thanks in advance ...
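
One way to reason about this is a wildcard rule such as `Disallow: /*?page=`. The `*` wildcard is an extension honored by the major search engines and described in RFC 9309, but not every crawler supports it. The sketch below is not a crawler's actual implementation; `rule_to_regex` and `is_blocked` are hypothetical helpers that only mimic how such a rule would match a path plus query string.

```python
import re

# Hypothetical robots.txt content this sketch models:
#   User-agent: *
#   Disallow: /*?page=

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule containing '*' into a regex.

    Rules match from the start of the path, so use the result with .match().
    """
    parts = [re.escape(part) for part in rule.split("*")]
    return re.compile(".*".join(parts))

PAGED_RULE = rule_to_regex("/*?page=")

def is_blocked(path_and_query: str) -> bool:
    """Return True when the path (including query string) matches the rule."""
    return PAGED_RULE.match(path_and_query) is not None

for candidate in ["/startup", "/startup?page=2", "/XXXXX?page=3"]:
    print(candidate, "blocked" if is_blocked(candidate) else "allowed")
```

Because the rule starts with `/*`, the first path segment can be anything, which covers the "startup can be random" case.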

How do I modify robots.txt in Plone?

I've got a Plone site that I administer and I'd like to add some pages to the Disallow list of its robots.txt. It appears that Plone automatically generates a robots.txt file, and I can't find any way to modify it. I've also tried adding a 'robots.txt' file to the root of the app, but it says that "robots.txt is reserved". Does anyone know how t...

How do you dynamically edit robots.txt in a load balanced environment?

Looks like we are going to have to start load balancing our webservers here soon. We have a feature request to edit robots.txt dynamically which is not a problem for one host -- however once we get our load balancer up and going -- it sounds like I will have to scp the file over to the other host(s). This sounds extremely 'bad'. How wo...

How to allow crawlers access to index.php only, using robots.txt?

If I want to allow crawlers to access only index.php, will this work? User-agent: * Disallow: / Allow: /index.php ...
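
Whether this works can depend on how a given crawler resolves conflicting rules. RFC 9309 and Google's documentation describe choosing the most specific (longest) matching rule, so rule order does not matter there. Python's standard-library `urllib.robotparser`, by contrast, applies the first rule that matches, which makes it a handy stand-in for order-sensitive parsers. The sketch below compares both orderings; `parse_rules` and the `AnyBot` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.strip().splitlines())
    return parser

# The rules as written in the question: Disallow comes first.
AS_ASKED = """
User-agent: *
Disallow: /
Allow: /index.php
"""

# Same rules with Allow listed first, which is safer for first-match parsers.
ALLOW_FIRST = """
User-agent: *
Allow: /index.php
Disallow: /
"""

as_asked = parse_rules(AS_ASKED)
allow_first = parse_rules(ALLOW_FIRST)

for label, parser in [("as asked", as_asked), ("allow first", allow_first)]:
    for path in ["/index.php", "/other.php"]:
        print(label, path, parser.can_fetch("AnyBot", path))
```

Listing `Allow` before `Disallow` gives the same result under both first-match and longest-match evaluation, so it is the more portable ordering.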

Blocked links in sitemap

I'm using an online sitemap generator tool that generates links even for pages that are blocked in robots.txt. Do these blocked links affect site ranking? Is there any way to overcome this? ...

robots.txt syntax question.

Hello guys. I have a few doubts about this robots file. User-agent: * Disallow: /administrator/ Disallow: /css/ Disallow: /func/ Disallow: /images/ Disallow: /inc/ Disallow: /js/ Disallow: /login/ Disallow: /recover/ Disallow: /Scripts/ Disallow: /store/com-handler/ Disallow: /store/img/ Disallow: /store/theme/ Disallow: /store/StoreSys...

Robots.txt http://mydomain.com vs. http://www.mydomain.com

I have a situation where we have two code bases that need to stay intact. Example: the old site, http://mysite.com, and a new site, http://www.mysite.com. The old site (no WWW) supports some legacy code and has the rule: User-agent: * Disallow: / But in the new version (with WWW) there is no robots.txt. Is Google looking to the old (no WWW) robots...
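
Under the Robots Exclusion Protocol (RFC 9309), a robots.txt file applies only to the scheme and host it is served from, so the bare domain and the `www` host are looked up separately. A minimal sketch of that lookup, where `robots_url` is a hypothetical helper and the page paths are invented:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL.

    Only the scheme and host (plus port, if any) are kept; the path is
    always /robots.txt at the root of that host.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

for page in ["http://mysite.com/legacy/page", "http://www.mysite.com/new/page"]:
    print(page, "->", robots_url(page))
```

The two hosts resolve to different robots.txt URLs, so a `Disallow: /` on one does not by itself carry over to the other.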

robots.txt to restrict search engines indexing specified keywords for privacy

I have a large directory of individual names, along with generic, publicly available, category-specific information, that I want indexed as much as possible in search engines. Listing these names on the site itself is not a concern to people, but some don't want to be in search results when they "Google" themselves. We want to continue ...

Can I use robots.txt while handling my site with htaccess?

I am using htaccess on my site so that all requests are redirected to the index page in my root directory. No other file on my site can be accessed because my htaccess restricts it. My question is: when I use a robots.txt file, will the search engines be able to reach the robots.txt file on my domain? Or must I modify my...

How can a robots.txt ignore anything with action=history in it?

I have a MediaWiki, and I don't think I want Google indexing the history of any page. How can a robots.txt disallow URLs with action=history in the query string? ...

Can I do a 301 redirect in robots.txt?

I have a site which has a whole host of legacy links, which now need to be mapped to new links. We need to update search engine results so that the legacy ones disappear and are replaced. Because of the CMS we can't do this programmatically, but I was wondering if we could set up a 301 redirect in the robots.txt file, which would update...

Implementing "Report this content" and detecting spammer or robot triggered event

I'm creating a forum for a website, and plan on implementing a "Report this content" function. In all honesty, I'm not sure how useful (lit. necessary) the feature will be, since a user account (created by admin) will be required for posting, but the solution interests me. So in short, this is the scenario: For all users, there will b...

Allow SE indexing on index.html only.

What would be the shortest way to block * and allow only the major search engines to index just the index page of the site? User-agent: * Disallow: / User-agent: Googlebot Disallow: / Allow: index.html User-agent: Slurp Disallow: / Allow: index.html User-agent: msn Disallow: / Allow: index.html Would this work? ...
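
Two details in the rules above are worth checking: robots.txt paths are matched as prefixes of the URL path, so `Allow: index.html` without a leading slash will not match `/index.html`, and some parsers are sensitive to rule order. The sketch below uses Python's standard-library `urllib.robotparser` (a first-match parser, not Googlebot's own logic) on one group only; `parse_rules`, `OtherBot`, and `/about.html` are made-up names for illustration.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.strip().splitlines())
    return parser

# The Googlebot group as written in the question.
AS_ASKED = """
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /
Allow: index.html
"""

# Assumed fix: leading slash on the path, Allow listed before Disallow.
ADJUSTED = """
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /index.html
Disallow: /
"""

as_asked = parse_rules(AS_ASKED)
adjusted = parse_rules(ADJUSTED)

print(as_asked.can_fetch("Googlebot", "/index.html"))
print(adjusted.can_fetch("Googlebot", "/index.html"))
print(adjusted.can_fetch("Googlebot", "/about.html"))
print(adjusted.can_fetch("OtherBot", "/index.html"))
```

Note that `Allow: /index.html` covers only that path; a request for the bare `/` would still fall under `Disallow: /`.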

robots.txt ignore all folders but crawl all files in root

Hi all, should I then do User-agent: * Disallow: / Is it as simple as that? Or will that also stop the files in the root from being crawled? Basically, that is what I am after: crawling all the files/pages in the root, but not any of the folders. Or am I going to have to specify each folder explicitly, i.e. disallow: /admin disallow: /thi...

Block msnbot completely from certain directory

OK, bear with me here as I explain my convoluted problem... I run a video site, and I have set up custom SEO-friendly permalinks that look something like this: /tv/view/352/title-of-video The number is the ID of the video and that's all the PHP script needs to fetch it. The "title-of-video" is completely superfluous and is just there for ...

Web bot in C++/PHP.

Hello, I've recently started learning PHP, but I have broad knowledge of C++. I've been wondering how to make a web bot, and now I would really like to make one. I won't be using this robot for spamming or anything, just as a test of what PHP/C++ can do online. I was wondering how I could go about doing this and if you have any article...

Prevent Google from indexing

Hi, what's the best way to prevent Google from showing a folder in its search results, e.g. www.example.com/support? What should I do if I want the support folder to disappear from Google? The first thing I did was place a 'robots.txt' file with this code: User-agent: * Disallow: /support/etc but the result is a tot...

Asterisk in robots.txt

Wondering if the following will work for Google in robots.txt: Disallow: /*.action I need to exclude all URLs ending with .action. Is this correct? ...
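
In the wildcard syntax that Google documents (and that RFC 9309 describes), `*` matches any run of characters and a trailing `$` anchors the rule to the end of the URL. Without `$`, a rule is still a prefix match, so `/*.action` also covers URLs that merely contain `.action`. The sketch below imitates that matching with a regex; `rule_to_regex`, `matches`, and the sample paths are hypothetical, and this is not Googlebot's actual implementation.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt rule with '*' and an optional trailing '$'."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    # A trailing '$' in the rule means "the URL must end here".
    return re.compile(pattern + ("$" if anchored else ""))

def matches(rule: str, path: str) -> bool:
    """Return True when the rule matches the path from its first character."""
    return rule_to_regex(rule).match(path) is not None

SAMPLES = ["/do/save.action", "/do/save.action?id=1", "/do/save.actions"]

for rule in ["/*.action", "/*.action$"]:
    for path in SAMPLES:
        print(rule, path, matches(rule, path))
```

If only URLs that truly end in `.action` should be excluded, the anchored form `Disallow: /*.action$` expresses that more precisely.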

SEO chaos from changing robots.txt file in WordPress site

Hi there, I recently edited the robots.txt file on my site using a WordPress plugin. However, since I did this, Google seems to have removed my site from their search page. I'd appreciate it if I could get an expert opinion on why this is so, and a possible solution. I'd initially done it to increase my search ranking by limiting the pages ...

How to "merge" page "\Default.aspx" and "\"?

Our site is developed in ASP.NET. We want to block the Default.aspx page from Google and other search engines. How can we "close" the Default.aspx page so that it is not accessible? Or is there another way to solve the problem so that we don't create duplicate content? ...