views: 165

answers: 5

Hello,

If you go to the WordPress admin and then Settings -> Privacy, there are two options asking whether you want to allow your blog to be indexed by search engines. One of them is:

I would like to block search engines, but allow normal visitors

How does WordPress actually block search bots/crawlers from indexing this site when the site is live?

+3  A: 

With a robots.txt file (if WordPress is installed in the site root):

 User-agent: *
 Disallow: /

or (from here)

I would like to block search engines, but allow normal visitors - checking this option has these results:

  • Causes <meta name='robots' content='noindex,nofollow' /> to be generated into the <head> section (if wp_head is used) of your site's source, causing search engine spiders to ignore your site.

  • Causes hits to robots.txt to send back:

        User-agent: * 
        Disallow: / 
    

    Note: The above only works if WordPress is installed in the site root and no robots.txt exists.

  • Stops pings to ping-o-matic and any other RPC ping services specified in the Update Services of Administration > Settings > Writing. This works by having the function privacy_ping_filter() remove the sites to ping from the list. This filter is added by having add_filter('option_ping_sites','privacy_ping_filter'); in the default-filters. When the generic_ping function attempts to get the "ping_sites" option, this filter blocks it from returning anything.

  • Hides the Update Services option entirely on the Administration > Settings > Writing panel with the message "WordPress is not notifying any Update Services because of your blog's privacy settings."
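To see how the robots.txt output above affects a well-behaved crawler, here is a quick sketch using Python's standard urllib.robotparser module (the blog path is made up for illustration):

```python
import urllib.robotparser

# The robots.txt body WordPress sends back when the privacy option is on
robots_txt = "User-agent: *\nDisallow: /"

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Every compliant crawler, whatever its user agent, is told to stay out
print(parser.can_fetch("Googlebot", "/2010/05/my-post/"))  # False
print(parser.can_fetch("*", "/"))                          # False
```

Disallow with a bare "/" matches every path on the site, which is why a single two-line file is enough to de-list the whole blog.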

Andy
+1  A: 

I don't know for sure, but it probably generates a robots.txt file that specifies rules for search engines.

thetaiko
+1  A: 

Using a Robots Exclusion file.

Example:

User-agent: Google-Bot
Disallow: /private/
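For what it's worth, you can check how a compliant crawler interprets rules like the one above with Python's standard urllib.robotparser module (the file paths here are just examples):

```python
import urllib.robotparser

rules = "User-agent: Google-Bot\nDisallow: /private/"

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Google-Bot is kept out of /private/ but may crawl everything else;
# agents with no matching record are unrestricted.
print(parser.can_fetch("Google-Bot", "/private/secret.html"))  # False
print(parser.can_fetch("Google-Bot", "/index.html"))           # True
print(parser.can_fetch("OtherBot", "/private/secret.html"))    # True
```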
St. John Johnson
+5  A: 

According to the codex, it's just robots meta tags, robots.txt and suppression of pingbacks:

Causes <meta name='robots' content='noindex,nofollow' /> to be generated into the <head> section (if wp_head is used) of your site's source, causing search engine spiders to ignore your site.

Causes hits to robots.txt to send back:

User-agent: *

Disallow: /

Note: The above only works if WordPress is installed in the site root and no robots.txt exists.

These are "guidelines" that all friendly bots will follow. A malicious spider searching for e-mail addresses or forms to spam will not be affected by these settings.
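The robots meta tag works the same way on the HTML side: a compliant spider parses the page head and skips indexing when it finds noindex. A minimal sketch of that check with Python's standard html.parser (the class name is mine):

```python
from html.parser import HTMLParser

class RobotsMetaChecker(HTMLParser):
    """Detects a <meta name='robots'> tag whose content includes 'noindex'."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if (attr.get("name", "").lower() == "robots"
                    and "noindex" in attr.get("content", "").lower()):
                self.noindex = True

# The tag WordPress emits when the privacy option is enabled
page = "<html><head><meta name='robots' content='noindex,nofollow' /></head></html>"
checker = RobotsMetaChecker()
checker.feed(page)
print(checker.noindex)  # True
```

A crawler that honors the tag would drop the page at this point; one that doesn't simply never runs a check like this.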

Pekka
+1  A: 

You can't actually block bots and crawlers from searching through a publicly available site; if a person with a browser can see it, then a bot or crawler can see it (caveat below).

However, there is something called the Robots Exclusion Standard (or robots.txt standard), which allows you to indicate to well-behaved bots and crawlers that they shouldn't index your site. This site, as well as Wikipedia, provides more information.

The caveat to the above comment (that a bot can see whatever your browser can see) is this: most simple bots do not include a JavaScript engine, so anything the browser renders as a result of JavaScript code will probably not be seen by a bot. I would suggest that you don't rely on this as a way to avoid indexing, since the robots.txt standard does not depend on the presence of JavaScript to work correctly.

One last comment: bots are free to ignore this standard; those that do are badly behaved. The bottom line is that anything that can read your HTML can do what it likes with it.

Dancrumb