I am looking for a good open-source bot that checks the quality criteria often required for Google indexing.
For example:
find duplicate titles
find invalid links (JSpider does this, and I think many other tools do too)
find pages that are exactly the same but served under different URLs
etc., where "etc." means Google's quality requirements.
...
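Two of the checks listed above (duplicate titles, and identical pages under different URLs) can be sketched in a few lines. This is only a rough illustration, assuming the pages have already been fetched into a dict of URL to HTML; the names `extract_title` and `find_duplicates` are hypothetical and do not come from any existing tool.

```python
import hashlib
import re
from collections import defaultdict

# Naive <title> matcher; a real crawler would use a proper HTML parser.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the page title with whitespace collapsed, or '' if none."""
    match = TITLE_RE.search(html)
    return " ".join(match.group(1).split()) if match else ""


def find_duplicates(pages):
    """pages: dict mapping URL -> HTML string (assumed already fetched).

    Returns (duplicate_titles, identical_pages):
      duplicate_titles: dict of title -> list of URLs sharing that title
      identical_pages:  list of URL groups whose HTML is byte-identical
    """
    by_title = defaultdict(list)
    by_body = defaultdict(list)
    for url, html in sorted(pages.items()):
        by_title[extract_title(html)].append(url)
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        by_body[digest].append(url)
    duplicate_titles = {t: u for t, u in by_title.items() if len(u) > 1}
    identical_pages = [u for u in by_body.values() if len(u) > 1]
    return duplicate_titles, identical_pages
```

Hashing the raw HTML only catches exact copies; near-duplicates would need some normalization first.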
Googlebot (Googlebot/2.1) appears to crawl URLs on newly added sites in an order corresponding to the length of the URL:
.. GET /ivjwiej/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..
.. GET /voeoovo/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..
.. GET /zeooviee/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googl...
I'd like to find out how current Google's cached copy of a large set of pages is. I think I need to
look in the logs for IPs,
check for the user-agent "googlebot", then
export a list of each page and when it was last visited.
I imagine this could be a cron job that runs weekly. If this is right, how would I write the s...
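The steps listed above could look roughly like the sketch below. It assumes the server writes the common "combined" access log format, and it matches on the user-agent string only, which anyone can spoof. The function name `last_googlebot_visits` is made up for illustration.

```python
import re
from datetime import datetime

# Assumed "combined" log format:
# ip - - [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def last_googlebot_visits(lines):
    """Return {path: datetime of the most recent Googlebot request}."""
    last_seen = {}
    for line in lines:
        match = LOG_RE.match(line)
        if not match or "googlebot" not in match.group("agent").lower():
            continue  # unparseable line or a different client
        when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        path = match.group("path")
        if path not in last_seen or when > last_seen[path]:
            last_seen[path] = when
    return last_seen


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python last_visits.py < /var/log/access.log
    for path, when in sorted(last_googlebot_visits(sys.stdin).items()):
        print(f"{when.isoformat()}\t{path}")
```

A weekly cron entry could pipe the rotated log into this script and save the output. Note that the last crawl time is only an approximation of how fresh the cached copy is.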
I can't seem to find any information on how Google determines whether you are cloaking your content. How, from a technical standpoint, do you think they determine this? Are they sending in crawlers other than Googlebot and comparing the output to the Googlebot results? Do they have a team of human beings comparing? Or can they somehow tel...
I wrote a site using ASP.NET MVC, and although it is not completely SEO-optimized at this point, I figured it is a good start. What I'm finding is that when I use Google's Webmaster Tools to fetch my site (to see what Googlebot sees), it sees this:
HTTP/1.1 200 OK
Cache-Control: public, max-age=1148
Content-Type: application/xhtml+xml; ...
The question I have is a bit of an ethical one.
I read here that Google gives a little more influence to sites that are optimized to load quickly. Obviously this makes Google's job easier, uses fewer resources, and is a better experience for everyone, so why not reward it?
The actual process of finding bottlenecks and improving page ...
Some time ago, Google used to update its index and backlinks every 3-4 months, and it used to be a big update. Recently I noticed that the updates are much more frequent. Has anyone else noticed these sorts of changes in Google's crawling, indexing, and backlink updates?
...
I'm in the process of localizing a website. I was going to set a cookie holding the preferred language and then display the site in that language. If no cookie was set, it would use the preferred-language header sent by the user's browser, and if the header was not set either, it would default to English.
So - how does ...
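The fallback order described above (cookie, then the browser's Accept-Language header, then English) can be sketched as follows. The supported-language list and both function names are assumptions made for illustration, and the header parsing is deliberately simplified.

```python
SUPPORTED = ("en", "fr", "de")  # hypothetical set of site languages
DEFAULT = "en"                  # final fallback named in the question


def parse_accept_language(header):
    """Return language tags from an Accept-Language value, best first."""
    choices = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without an explicit q-value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # malformed q-value: treat as not acceptable
        choices.append((quality, tag.strip().lower()))
    # Stable sort keeps the header's order among equal q-values.
    choices.sort(key=lambda choice: -choice[0])
    return [tag for quality, tag in choices if quality > 0]


def choose_language(cookie_lang, accept_language):
    """Cookie wins, then the Accept-Language header, then the default."""
    if cookie_lang in SUPPORTED:
        return cookie_lang
    for tag in parse_accept_language(accept_language or ""):
        primary = tag.split("-")[0]  # "de-CH" -> "de"
        if primary in SUPPORTED:
            return primary
    return DEFAULT
```

A client that sends neither a cookie nor the header falls through to the default: `choose_language(None, None)` returns `"en"`.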
I have login/signup popups on my site which are in a hidden div by default.
According to http://stackoverflow.com/questions/1547426/google-seo-and-hidden-elements Googlebot should NOT see it.
But Google Webmaster Tools says that the keywords "email" and "password" are the top keywords across the site.
Why is that? Why does Googlebot see them?
Should...
Hi all,
I've built a web site using jQuery to make nice transitions between content.
The code works this way: there are 2 images (body and footer).
When I click on a link (instead of going to another page), I fade out the 2 images and change the src attribute of both. When the new images are loaded, I fade them back in.
I'm using SWFAddress ...
In the hope of tempting Googlebot with fresh content, I've implemented a homepage news ticker which displays the 20 most recent headlines on our site.
The implementation I have chosen is a <ul> with each headline being a <li>.
Initially all the <li> elements have no style, but JavaScript kicks in on page load and gives all but one of the...
I am bouncing between posting this here and on Superuser. Please excuse me if you feel this does not belong here.
I am observing the behavior described here: Googlebot is requesting random URLs on my site, like aecgeqfx.html or sutwjemebk.html. I am sure that I am not linking to these URLs from anywhere on my site.
I suspect this may be ...
I've done plenty of ASP.NET and PHP development, but I'm less familiar with how to track this sort of thing down in CF. My naive first angle of attack was to search for any reference to Google in any of the source code. No luck.
I'm running the site on IIS7. Google, Bing and Yahoo all apparently "see" nothing on my site.
Update: I ...
Say I register a domain and develop it into a complete website. How, and from where, does Googlebot learn that the new domain is up? Does it always start with the domain registry?
If it starts with the registry, does that mean that anyone can have complete access to the registry's database? Thanks for any insight.
...
I have a User class in my web app that represents a user currently logged in.
Every time a user visits a page, a User instance is populated based on authentication data supplied in cookies.
A User instance is created even if an anonymous user logs in - and a corresponding new record is created in the User table in the database.
This ap...
I'm going to block all bots except the big search engines. One of my blocking methods will be to check the Accept-Language header: if a request has no Accept-Language, the bot's IP address will be blocked until 2037. Googlebot does not send Accept-Language, so I want to verify it with a DNS lookup:
<?php
gethostbyaddr($_SERVER['REMOTE_ADDR']);
?...
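A reverse lookup alone can be faked by whoever controls the reverse DNS zone for an IP, so the usual approach is to confirm the hostname with a forward lookup as well. The sketch below shows that two-step check in Python rather than PHP. The accepted suffixes are an assumption based on Google's published guidance for verifying Googlebot, and the function name is hypothetical. The lookup functions are parameters only so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for genuine Googlebot reverse DNS records.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(
    ip,
    reverse_lookup=socket.gethostbyaddr,
    forward_lookup=socket.gethostbyname_ex,
):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname.lower().endswith(GOOGLE_SUFFIXES):
        return False  # reverse name is not in a Google crawler domain
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The forward lookup must map back to the IP that made the request.
    return ip in addresses
```

Since DNS lookups are slow, it would make sense to cache the result per IP instead of resolving on every request. `socket.gethostbyname_ex` only covers IPv4, so IPv6 clients would need `socket.getaddrinfo` instead.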
How is it possible that my page /admin/login.asp is found in Google with the query "inurl:admin/login.asp" while it isn't with the "site:www.domain.xx" query?
I have these lines in my robots.txt:
User-agent: *
Disallow: /admin/
And this in the HTML code of the page:
<meta name="robots" content="noindex, nofollow" />
Any ideas?...
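One plausible reading of the two rules quoted above: a crawler that honors the Disallow line never fetches /admin/login.asp, so it never gets to read the noindex meta tag on that page, while the URL itself can still be known from links elsewhere. The sketch below only checks how the quoted robots.txt rules evaluate, using Python's standard `urllib.robotparser`. The helper name `crawl_allowed` is made up, and this does not reproduce Google's actual indexing logic.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted in the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def crawl_allowed(path, user_agent="Googlebot"):
    """Return True if the quoted rules let this user agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/admin/login.asp", "/index.html"):
        print(path, "crawlable" if crawl_allowed(path) else "blocked")
```

If the page is blocked from crawling, the meta tag in its HTML is invisible to the crawler, so the two mechanisms work against each other here.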
On a website, there are many pages with a component for users to leave comments. To reduce page load time and since few users use the commenting system, the commenting component is loaded via AJAX after the page is loaded. The issue: how can we get Google to index dynamic content that is loaded via AJAX on page load?
Many other pages on...
I just added rel="nofollow" to some links.
Does anyone know how long it takes for Google to stop following links after "nofollow" is added?
I added it an hour ago and still see Googlebot crawling the "nofollow" links.
...
I would like to generate HTML Snapshots using Watir, hosted on Heroku.
Google's Full Specification for Making AJAX Applications Crawlable suggests using HtmlUnit... see How do I create an HTML snapshot? point #3.
HtmlUnit is a Java-only headless browser emulator, and unfortunately JRuby is not an option on Heroku. So HtmlUn...