Say I have a site at http://website.com. I would really like to allow bots to see the home page, but every other page needs to be blocked, as it is pointless to spider. In other words,
http://website.com & http://website.com/ should be allowed, but
http://website.com/anything and http://website.com/someendpoint.ashx should be blocked.
Further...
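A rule pair often suggested for this case is `Allow: /$` followed by `Disallow: /`. It relies on the `$` end-of-URL anchor, which Google documents as an extension; other crawlers may not honour it. The sketch below is not any crawler's real implementation. It is a minimal matcher, assuming longest-match precedence with ties going to allow, and the names `RULES`, `rule_matches` and `is_allowed` are made up for illustration.

```python
import re

# Hypothetical rule set for "home page only". Assumes the crawler
# supports the `$` end anchor and the `*` wildcard extensions.
RULES = [
    ("allow", "/$"),
    ("disallow", "/"),
]


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the pattern, then turn the escaped `*` back into "match anything".
    regex = re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += r"\Z"
    return re.match(regex, path) is not None


def is_allowed(path: str, rules=RULES) -> bool:
    """Apply the longest matching rule; allow wins ties; no match means allowed."""
    best_len = -1
    allowed = True
    for kind, pattern in rules:
        if not rule_matches(pattern, path):
            continue
        longer = len(pattern) > best_len
        tie_for_allow = len(pattern) == best_len and kind == "allow"
        if longer or tie_for_allow:
            best_len = len(pattern)
            allowed = kind == "allow"
    return allowed


for path in ["/", "/anything", "/someendpoint.ashx"]:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Both http://website.com and http://website.com/ are requested as the path `/`, so a single `Allow: /$` rule covers them.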
How would you make the contents of Flex RIA applications accessible to Google, so that Google can index the content and show links to the right items in your Flex RIA? Consider an online shop, created in Flex, where the offered items should be indexed by Google. A link on Google should then open the corresponding product in the RIA.
...
This will require a little setup. Trust me that this is for a good cause.
The Background
A friend of mine has run a non-profit public interest website for two years. The site is designed to counteract misinformation about a certain public person. Of course, over the last two years those of us who support what he is doing have relent...
I have a site with the following robots.txt in the root:
User-agent: *
Disabled: /
User-agent: Googlebot
Disabled: /
User-agent: Googlebot-Image
Disallow: /
And pages within this site are getting scanned by Googlebots all day long. Is there something wrong with my file or with Google?
...
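One way to see how a parser reads a file like this is to feed it to Python's standard `urllib.robotparser`. This is only a sketch: it assumes the three groups were separated by blank lines in the original file, the URL is a placeholder, and Python's parser is not Googlebot's, so the result is indicative rather than authoritative.

```python
from urllib.robotparser import RobotFileParser

# The file from the question; blank lines between groups are an assumption.
ROBOTS_TXT = """\
User-agent: *
Disabled: /

User-agent: Googlebot
Disabled: /

User-agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL, not taken from the question.
url = "http://example.com/some/page"
googlebot_allowed = parser.can_fetch("Googlebot", url)
image_bot_allowed = parser.can_fetch("Googlebot-Image", url)

print("Googlebot allowed:", googlebot_allowed)
print("Googlebot-Image allowed:", image_bot_allowed)
```

In this parser, `Disabled` is not a recognised directive and is skipped, so only the `Googlebot-Image` group ends up carrying a rule. The standard directive name is `Disallow`.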
For some reason, when I check Google Webmaster Tools' "Analyze robots.txt" to see which URLs are blocked by our robots.txt file, the result is not what I'm expecting. Here is a snippet from the beginning of our file:
Sitemap: http://[omitted]/sitemap_index.xml
User-agent: Mediapartners-Google
Disallow: /scripts
User-agent: *
Disallow: /scrip...
Hi,
I have a site which has been developed completely in Flash. The site owners do not want to shift to a more text/HTML-based site, so I am planning to create an alternative HTML/text-based site which Googlebot will get redirected to (by checking the user agent). My question is: is this officially allowed by Google?
If not t...
Hi,
In my robots.txt file, I have the following lines:
User-agent: Googlebot-Mobile
Disallow: /
User-agent:GoogleBot
Disallow: /
Sitemap: http://mydomain.com/sitemapindex.xml
I know that if I put the first 4 lines, Googlebot won't index the site, but what if I put the last line Sitemap: http://mydomain.com/sitemapindex.xml, will go...
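As a rough illustration of how the `Sitemap` line sits alongside the `Disallow` groups, the sketch below parses the file above with Python's standard `urllib.robotparser` (`site_maps()` needs Python 3.8 or later). The page URL is a placeholder, and this shows only how one parser reads the file, not what Googlebot will actually do with the sitemap.

```python
from urllib.robotparser import RobotFileParser

# The lines from the question, unchanged.
ROBOTS_TXT = """\
User-agent: Googlebot-Mobile
Disallow: /
User-agent:GoogleBot
Disallow: /
Sitemap: http://mydomain.com/sitemapindex.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Sitemap lines are collected independently of any User-agent group.
sitemaps = parser.site_maps()

# Placeholder page URL, used only to query the Disallow rules.
page_allowed = parser.can_fetch("Googlebot", "http://mydomain.com/page.html")

print("Sitemaps listed:", sitemaps)
print("Googlebot may fetch page:", page_allowed)
```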
I have recently started using Google Webmaster Tools.
I was quite surprised to see just how many links Google is trying to index.
http://www.example.com/?c=123
http://www.example.com/?c=82
http://www.example.com/?c=234
http://www.example.com/?c=991
These are all campaigns that exist as links from partner sites.
For right now they'...
I found this question very interesting: Programmatic Bot Detection
I have a very similar question, but I'm not bothered about 'badly behaved bots'.
I am tracking (in addition to Google Analytics) the following per visit:
Entry URL
Referer
UserAgent
Adwords (by means of query string)
Whether or not the user made a purchase
etc.
The...
Does anyone know whether it is possible to set up any property that tells Googlebot to crawl the site only during a specific day or time period (e.g. during the weekend only)?
thanks,
...
When does Google re-crawl a site?
And why does Google have two versions of the same page in its cache?
http://forum.portal.edu.ro/index.php?showtopic=112733
The cached pages are: forum.portal.edu.ro/index.php?showtopic=112733&st=25/
forum.portal.edu.ro/index.php?showtopic=112733&st=50
...
Looking through my search logs from time to time, I notice that by far the biggest user of my search engine is Googlebot. What gives? Is it looking for content that might not be directly accessible through navigation? If so, how does it know which words and phrases to look for (they're surprisingly relevant)? Does it check the most ...
I am a beginner web developer and I have a site, JammuLinks.com, built in PHP. It is a local city listings search engine. Basically, I've written search pages which take in a parameter, fetch the records from the database and display them. So it is dynamically generating the content. However, if you look at the bottom of the site, I have...
Googlebot has been occasionally indexing one of our sites with a bad query string parameter. I am not sure how it is getting this query string parameter (there don't appear to be any sites linking to us with bad links, and nothing in our site is inserting the bad value). The bad parameter causes the site to throw a 500 error, as we expec...
I'm running a site with a lot of content, but little traffic, on a middle-of-the-road dedicated server.
Occasionally, Googlebot will stampede us, resulting in Apache maxing out its memory, and causing the server to crash.
How can I avoid this?
...
I want to know when google is crawling the site, preferably by sending myself an email.
Is there any way to do this that won't adversely affect performance?
...
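One low-overhead option is to leave request handling untouched and scan the web server's access log out of band, for example from a scheduled job, then send the notification from there. The sketch below assumes the log uses the Combined Log Format with the user agent as the last quoted field. `find_googlebot_hits` and the sample lines are made up for illustration, and sending the email is left out. User-agent strings can be spoofed, so a match here only means the client claimed to be Googlebot.

```python
import re

# Combined Log Format: ... "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def find_googlebot_hits(lines):
    """Return (ip, time, request) for each line whose user agent names Googlebot."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and "googlebot" in match.group("agent").lower():
            hits.append(
                (match.group("ip"), match.group("time"), match.group("request"))
            )
    return hits


# Made-up sample lines, for illustration only.
SAMPLE = [
    '66.249.66.1 - - [10/Oct/2009:13:55:36 +0000] "GET /page HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2009:13:56:01 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

hits = find_googlebot_hits(SAMPLE)
for ip, time, request in hits:
    print(f"{time}: {ip} requested {request!r}")
```

Because the scan runs separately from the web server, page requests pay no extra cost. The trade-off is that notifications lag by however often the job runs.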
Hello,
I have a cron job PHP script that I set up not too long ago. However, I noticed that the PHP file was executed (without the cron job activating). It appears that this happened when Googlebot crawled the file, because I noticed that the following engine visited my page:
http://www.google.com/bot.html
My question is:
1) Is ...
I have already tried Google's microformat testing tool, but it's not clear to me that it works the same way as Googlebot -- it seems reasonable that Googlebot would have more features than a simple web-based testing tool.
So, I'm wondering -- does anyone have any real-world experience in successfully getting Googlebot to parse microfor...
I've got a bunch of ajaxified links that do things like vote up, vote down, flag a post - standard community moderation stuff.
Problem is that the googlebot crawls those links, and votes up, votes down, and flags items.
Will adding this to robots.txt prevent the googlebot from crawling those links? Or is there something else I need to...
Does anyone know in which programming language the Googlebot was written?
Or, more generally, in which language are efficient web-crawlers written?
I've seen many written in Java, but it doesn't seem to me to be the most appropriate language for developing a web crawler because it creates far too much overhead (tried with Heritrix web-crawler, ...