robots.txt

How To Get Indexed Again After Removal of Robots.txt

While building a website, I created a robots.txt on the server to prevent the unfinished application from getting into Google's index... Now that I am done with the site, I removed the robots.txt, and I expected that my site would show up on Google, since the robots.txt is gone! But this is not happening! I have removed the robots.txt ...

Non-indexed file (?) still found in Google

How is it possible that my page /admin/login.asp is found in Google with the query "inurl:admin/login.asp", while it isn't with the "site:www.domain.xx" query? I have this line in my robots.txt: User-agent: * Disallow: /admin/ And this in the HTML code of the page: <meta name="robots" content="noindex, nofollow" /> Any ideas?...
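A common explanation worth checking here: Disallow: /admin/ stops Googlebot from fetching the page at all, so it never sees the noindex meta tag, yet the bare URL can still be indexed from external links, which is why inurl: finds it. One sketch of a fix is to let the crawler fetch just that one page so the meta tag can take effect (Allow and longest-match precedence are extensions supported by Google and Bing, later standardized in RFC 9309):

```
User-agent: *
# the longer, more specific Allow wins over the Disallow for this one page,
# so Googlebot can fetch it and obey its noindex meta tag
Allow: /admin/login.asp
Disallow: /admin/
```

Even then, expect the already-indexed URL to take some time to drop out.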

Meta tag vs robots.txt

Is it better to use meta tags* or the robots.txt file for informing spiders/crawlers to include or exclude a page? Are there any issues in using both the meta tags and the robots.txt? *E.g.: <meta name="robots" content="index, follow"> ...

What's a good & complete PHP/MySQL Screen Scraper project?

Requirements: written in PHP; control over the code (open source would be awesome, purchasing code is an option too). Optional features: obeys robots.txt; automatic rate limiting; scrapes based on rules into a data object; an admin interface, or configurable back end, to set up new rules; something like CSS selectors to pick out data in th...

How can I gather all links on a site without content?

I would like to get all URLs a site links to (on the same domain) without downloading all of the content with something like wget. Is there a way to tell wget to just list the links it WOULD download? For a little background on what I'm using this for, in case someone can come up with a better solution: I'm trying to build a robots.txt file t...
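wget can crawl without saving files via --spider with -r, and its log output can be post-processed for URLs, though the exact grep incantation varies by wget version. The link-extraction step itself is small enough to sketch directly with the Python standard library (mysite.example is a placeholder domain; the network-fetching loop is left out, you would call feed() with each downloaded page):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects href targets of <a> tags, resolved against a base URL,
    keeping only links on the same domain."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    absolute = urljoin(self.base_url, value)
                    # discard off-site links by comparing the host part
                    if urlparse(absolute).netloc == urlparse(self.base_url).netloc:
                        self.links.add(absolute)

html = '<a href="/page1">one</a> <a href="http://other.example/x">off-site</a>'
collector = LinkCollector("http://mysite.example/")
collector.feed(html)
print(sorted(collector.links))  # → ['http://mysite.example/page1']
```

From the collected set you can read off the paths to list in robots.txt.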

Parsing individual lines in a robots.txt file with C#

I'm working on an application to parse robots.txt. I wrote myself a method that pulled the file from a webserver and threw the output into a textbox. I would like the output to display a single line of text for every line that's in the file, just as it would appear if you were looking at the robots.txt normally; however, the output in my te...
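A likely cause, worth checking: robots.txt served from many hosts uses bare "\n" line endings, while a Windows multiline TextBox only breaks lines on "\r\n", so everything lands on one line. On the C# side the usual fix is replacing "\n" with Environment.NewLine before assigning the text. The same idea sketched in Python, whose splitlines() accepts any ending:

```python
# robots.txt bodies may arrive with "\n", "\r\n", or bare "\r" line endings;
# splitlines() handles all of them, yielding one element per directive line
raw = "User-agent: *\nDisallow: /admin/\r\nDisallow: /tmp/"
lines = raw.splitlines()
for line in lines:
    print(line)
```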

How do I disallow specific page from robots.txt

I am creating two pages on my site that are very similar but serve different purposes. One is to thank users for leaving a comment and the other is to encourage users to subscribe. I don't want the duplicate content but I do want the pages to be available. Can I set the sitemap to hide one? Would I do this in the robots.txt file? The ...
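Omitting a page from the sitemap does not hide it; crawlers can still reach it through links. To keep one page out of the crawl, a robots.txt sketch works (the path below is hypothetical; substitute the real one). Note that Disallow rules are prefix matches, and robots.txt blocks crawling rather than guaranteeing de-indexing; for a duplicate-content concern, a meta noindex tag or a rel=canonical pointing at the preferred page is often the better fit:

```
User-agent: *
# hypothetical path; as a prefix match this also covers query-string variants
Disallow: /thank-you-for-commenting
```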

Robots.txt Disallow Certain Folder Names

I want to disallow robots from crawling any folder, at any position in the URL, with the name this-folder. Examples to disallow: http://mysite.com/this-folder/ http://mysite.com/houses/this-folder/ http://mysite.com/some-other/this-folder/ http://mysite.com/no-robots/this-folder/ This is my attempt: Disallow: /.*this-folder/ Will ...
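robots.txt patterns are not regular expressions: in /.*this-folder/ the . is read literally, and * is a glob-style wildcard (a Google/Bing extension, now in RFC 9309), not a regex repeat. A sketch that should cover the folder at any depth:

```
User-agent: *
# the root-level case
Disallow: /this-folder/
# "*" matches any characters, including "/", so this covers any deeper position
Disallow: /*/this-folder/
```

Using /*this-folder/ without the inner slash would be broader, since it would also match names like /not-this-folder/.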

How do I disallow a folder in robots.txt but except for certain files?

I have a situation where I want to disallow the crawling of certain pages within a directory. This directory contains a large number of files but there are a few files that I need to still be indexed. I will have a very large robots file if I need to go through disallowing each page individually. Is there a way to disallow a folder in ro...
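The Allow directive (a Google/Bing extension, now part of RFC 9309) handles exactly this case: under longest-match precedence, a more specific Allow overrides a broader Disallow, so you only list the exceptions. The file names below are hypothetical:

```
User-agent: *
Disallow: /big-directory/
# the longer, more specific rules win, so these files stay crawlable
Allow: /big-directory/keep-me.html
Allow: /big-directory/keep-me-too.html
```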

Robots.txt in ASP.NET MVC

I am trying to figure out what to add to my robots.txt file. Specifically, what does the directive Allow: /$ do in a robots.txt file? Edit: Also, how do I allow a site to have its /index page indexed when using ASP.NET MVC? ...
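On the first question: in the wildcard syntax Google and Bing support (later standardized in RFC 9309), $ anchors the pattern at the end of the URL, so Allow: /$ matches exactly the root URL "/" and nothing deeper. It is typically paired with a blanket Disallow, as in this sketch:

```
User-agent: *
Allow: /$      # matches only the bare root URL "/"
Disallow: /    # blocks everything else; the more specific Allow wins for "/"
```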

Google Maps API key and robots.txt

Edit: I learned that my error was unrelated to the robots file. Disregard. I just learned the hard way that Google blocks access to the Maps API if you have a restrictive robots.txt file. I recently created a robots file with "Disallow: /". Now my site can no longer use Maps. Rats. I removed the robots file, but I still cannot use...

Robots.txt, WordPress - block directory

In my robots.txt I have this: Disallow: /lo lo is a directory with a script I want blocked. The problem is that "Disallow: /lo" also blocks a post of mine: /lonely-cars-etc/ How do I block the lo directory correctly? Please take a look at my robots.txt; maybe there are other problems I don't know about. User-agent: * Disallow: /bin...
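Disallow rules are prefix matches, which is why /lo also catches /lonely-cars-etc/. Adding the trailing slash restricts the match to the directory itself, as in this sketch:

```
User-agent: *
# the trailing slash means only the /lo/ directory (and its contents) matches,
# not other paths that merely start with "lo"
Disallow: /lo/
```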

Robots.txt, disallow multilanguage URL

I have a sign-in page that is not supposed to be publicly discoverable: there is no link to it, so you have to enter the URL manually and then sign in. The URL is multilanguage, however, so it can be "/SV/Account/Logon" or "/EN/Account/Logon", etc. Can I disable indexing of this URL for all languages? ...
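If the crawlers you care about support the * wildcard (Google and Bing do; it is now in RFC 9309), one line can cover every language prefix; otherwise, list each prefix explicitly. A sketch assuming the path shape in the question:

```
User-agent: *
# "*" covers /SV/, /EN/, and any other language prefix
Disallow: /*/Account/Logon
```

Keep in mind robots.txt only discourages crawling; a page that must stay hidden needs server-side protection, since the robots.txt itself advertises the path.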

In robots.txt, what will Disallow: /?s block?

What will this line block when the search engine crawls the website? Disallow: /?s ...
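Rules are prefix matches against the path plus query string, so Disallow: /?s blocks any URL beginning with /?s, which on a WordPress site is typically the search-results URLs (e.g. /?s=term). Python's urllib.robotparser can be used to check the matching against sample URLs:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# feed the rule directly instead of fetching it from a server
rp.parse([
    "User-agent: *",
    "Disallow: /?s",
])

# "Disallow: /?s" is a prefix match on the path-plus-query of each URL
print(rp.can_fetch("*", "http://example.com/?s=hello"))  # blocked → False
print(rp.can_fetch("*", "http://example.com/page"))      # allowed → True
```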

Robots.txt not working

I have used robots.txt to restrict one of the folders on my site. The folder contains sites that are under construction, and Google has indexed all of those sites while they are in the testing phase. So I used robots.txt. I first submitted the site, and robots.txt is enabled. Now the status is success for www.mysite.com/robots.txt, but the Goog...

robots.txt: how to stop engines from crawling URLs that contain “http: in the URL

Disallow: /*“http: is what I've been using; my guess is I may need to escape the quotation mark somehow. In Google Webmaster Tools, it's not even reading that quotation mark (where it allows you to see the robots.txt file and test it on a few URLs). Google Webmaster Tools displays the robots.txt file without the quotes for ...
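Paths in robots.txt are generally matched against the percent-encoded form of the URL, so a curly left quote (U+201C) would appear as %E2%80%9C and a plain ASCII double quote as %22. Both lines below are guesses to verify in the robots.txt tester, depending on which character actually appears in the bad URLs:

```
User-agent: *
# curly left double quotation mark, UTF-8 percent-encoded
Disallow: /*%E2%80%9Chttp:
# plain ASCII double quote, in case that is what the broken links contain
Disallow: /*%22http:
```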

Help to create robots.txt correctly

I have dynamic URLs like these: mydomain.com/?pg=login mydomain.com/?pg=reguser mydomain.com/?pg=aboutus mydomain.com/?pg=termsofuse When a page is requested, e.g. mydomain.com/?pg=login, index.php includes the login.php file. Some of the URLs are converted to static URLs like mydomain.com/aboutus.html mydomain.com/termsofuse.htm...
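Since all of the dynamic pages share the /?pg= prefix and Disallow rules are prefix matches, one line can cover them while leaving the static .html URLs crawlable. A sketch:

```
User-agent: *
# blocks /?pg=login, /?pg=aboutus, etc.;
# static URLs like /aboutus.html are unaffected
Disallow: /?pg=
```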

Regex for robots.txt - disallow something within a directory, but not the directory itself...

I'm using wordpress with custom permalinks, and I want to disallow my posts but leave my category pages accessible to spiders. Here are some examples of what the URLs look like: Category page: somesite dot com /2010/category-name/ Post: somesite dot com /2010/category-name/product-name/ So, I'm curious if there is some type o...
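One approach using the * wildcard (a Google/Bing extension, now in RFC 9309): a category page has two path segments and a post has three, so requiring an extra slash-terminated segment matches only posts. This sketch assumes posts always end with a trailing slash and carry the year prefix shown; repeat the line per year if the archive spans several:

```
User-agent: *
# the pattern requires two more "/" after /2010/, which a category page like
# /2010/category-name/ does not have, while /2010/category-name/product-name/ does
Disallow: /2010/*/*/
```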

Parsing a robots.txt file using C++

Is there any library to check robots.txt? If not, how can I write one in C++ with Boost.Regex? Please explain with some examples... ...
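I can't point to a standard C++ robots.txt parser with confidence, but the format is line-oriented and does not really need regular expressions. As a porting reference, here is a minimal Python sketch of the structure (it simplifies the spec: consecutive User-agent lines should share one rule group, which this version does not handle):

```python
def parse_robots(text):
    """Parse robots.txt text into {user_agent: [(directive, value), ...]}.

    Minimal sketch: '#' starts a comment, each record line is 'field: value',
    and a User-agent line opens the group that following Allow/Disallow
    lines belong to.
    """
    rules = {}
    current_agents = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            current_agents = [value]  # simplification: one agent per group
            rules.setdefault(value, [])
        elif field in ("allow", "disallow"):
            for agent in current_agents:
                rules[agent].append((field, value))
    return rules

sample = """
User-agent: *
Disallow: /private/   # keep crawlers out
Allow: /private/faq.html
"""
print(parse_robots(sample))
```

The per-line tokenizing maps directly onto std::getline and std::string::find in C++, with no Boost.Regex required.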

I want my robots.txt to allow indexing of only my index file in Google. How does this look?

Will the following do the trick? User-agent: Google Disallow: /_/ Disallow: /library/ Disallow: /media/ Disallow: /www/ User-agent: * Disallow: / ...
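Two possible gotchas in the snippet above: Google's crawler matches the token "Googlebot", not "Google", so the first group may never apply; and listing directories only works if nothing indexable lives outside them. A commonly cited whitelist sketch that allows only the root URL for every crawler ($ end-anchoring is a Google/Bing extension, later standardized in RFC 9309):

```
User-agent: *
# "$" anchors the match at the end of the URL, so only the bare root "/" stays crawlable
Allow: /$
Disallow: /
```

This assumes the index page is served at the site root; if it lives at a path like /index.html, add a matching Allow line ending in $ for that path as well.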