Recently, search engines have become able to index dynamic content on social networking sites, and I would like to understand how this is done. Does a site like Facebook create static pages that are updated semi-frequently? Does Google attempt to store every possible username?

As I understand it, a page like www.facebook.com/username is not an actual file stored on disk but shorthand for a query like "select username from users", whose result is then displayed on the page. How does Google know about every user? This gets even more complicated when things like tweets are involved.

EDIT: I guess I didn't really ask what I wanted to know. Do I need to be as big as Twitter or Facebook for Google to set up special ways to crawl my site? Will Google automatically find my users' profiles if I allow anyone to view them? If not, what do I have to do to make that work?

+1  A: 

As far as I know, Google isn't able to read and store the actual contents of profiles, because the Google bot doesn't have a Facebook account, and doing so would be a huge privacy breach.

The bot works by hitting facebook.com and then following every link it can find. Whatever content it sees on a page it hits, it stores. So even if it follows a dynamic URL like www.facebook.com/username, it will just remember whatever it saw when it went there. Hopefully, in that particular case, that isn't all the private data of said user.
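To make the "follow every link, store what you see" idea concrete, here is a minimal sketch of a breadth-first crawler. It is not how Google's crawler is actually built; the site contents in `FAKE_SITE`, the `example.com` URLs, and the `crawl` and `LinkExtractor` names are all made up for illustration, and pages are read from a dictionary instead of the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical site: URL -> HTML the server would return for it.
# Whether that HTML came from a file or a database query is invisible here.
FAKE_SITE = {
    "http://example.com/": '<a href="/login">Log in</a> <a href="/directory">People</a>',
    "http://example.com/login": "<form>...</form>",
    "http://example.com/directory": '<a href="/alice">alice</a> <a href="/bob">bob</a>',
    "http://example.com/alice": '<h1>alice</h1> <a href="/bob">friend</a>',
    "http://example.com/bob": "<h1>bob</h1>",
}


def crawl(start_url, fetch):
    """Breadth-first crawl: store each page once, then queue its links."""
    index = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in index:
            continue  # already stored this page
        html = fetch(url)
        if html is None:
            continue  # page does not exist
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            queue.append(urljoin(url, href))  # resolve relative links
    return index


index = crawl("http://example.com/", FAKE_SITE.get)
print(sorted(index))
```

In this toy setup the profile pages are reached only because the hypothetical `/directory` page links to them; a profile that nothing links to would never be visited.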

Additionally, Facebook can and does provide special instructions that search bots can follow, so that Google results don't include a bunch of login pages.
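One common form such instructions take is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved bot might check a URL before fetching it. The rules in `ROBOTS_TXT` are invented for illustration (they are not Facebook's actual rules), and `ExampleBot` and `allowed` are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real bot would download it from
# http://<site>/robots.txt before crawling.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /settings/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleBot"):
    """Return True if the rules above permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


for url in (
    "http://example.com/alice",
    "http://example.com/login",
    "http://example.com/settings/privacy",
):
    print(url, allowed(url))
```

Note that robots.txt is advisory: it tells cooperative bots what to skip, but it does not protect private data by itself.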

Tesserex
But if it goes to facebook.com, it sees links like signup and login; there are no user profiles on the home page. So if I have a site that Google doesn't know has user profiles, how do I let Google know my users are at mysite.com/username?
Lumpy
+1  A: 
  1. Profiles can be linked to from outside the site.
  2. The site may provide a sitemap.
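As a rough illustration of the second point, a sitemap is an XML file listing the URLs you want crawlers to know about. The sketch below generates one for profile pages at mysite.com/username; the `build_sitemap` function and the usernames are hypothetical, and a real site would pull the names from its user table and publish the result at a URL it tells search engines about.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, usernames):
    """Return sitemap XML with one <url><loc> entry per public profile."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for name in usernames:
        url = ET.SubElement(urlset, "url")
        # quote() keeps unusual characters in usernames URL-safe.
        ET.SubElement(url, "loc").text = f"{base_url}/{quote(name)}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical usernames standing in for a query against the users table.
sitemap_xml = build_sitemap("http://mysite.com", ["alice", "bob"])
print(sitemap_xml)
```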
vartec
+4  A: 

In the case of tweets in particular, Google isn't 'crawling' for them in the traditional sense; they've integrated with Twitter to provide the search results in real-time.

In the more general case of your question, dynamic content did not start with Facebook or Twitter, though it may seem that way. Google crawls a URL; the URL provides HTML data; Google indexes it. Whether a dynamic query renders the page or it comes from a cache of static HTML makes little difference to the indexing process in theory. In practice, there's a lot more to it (see Michael B's comment below).
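A tiny sketch of the "in theory" point: from the crawler's side, a page built from a query and a page read from a static file are both just a string of HTML. The `USERS` dictionary stands in for a database table, and `render_profile` and `STATIC_PAGE` are hypothetical names used only for this comparison.

```python
from html import escape

# Stand-in for a database table of users (hypothetical data).
USERS = {"alice": {"name": "Alice", "bio": "Likes hiking"}}

# Stand-in for a pre-built static file on disk.
STATIC_PAGE = "<html><body><h1>Alice</h1><p>Likes hiking</p></body></html>"


def render_profile(username):
    """Build the profile page on request, as a dynamic site would."""
    user = USERS.get(username)
    if user is None:
        return None  # a real server would answer 404 here
    return (
        f"<html><body><h1>{escape(user['name'])}</h1>"
        f"<p>{escape(user['bio'])}</p></body></html>"
    )


dynamic_page = render_profile("alice")
# The crawler receives identical bytes either way.
print(dynamic_page == STATIC_PAGE)
```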

And see Vartec's succinct post on how Google might find all those public Facebook profiles without actually logging in and poking around FB.

OK, that was vastly oversimplified, but let's see what else people have to say.

LesterDove
In practice, Google *does* care quite a bit whether a page is built dynamically, because otherwise the crawlers could easily get trapped in combinatorial explosions or infinite amounts of dynamically generated spam content.
Michael Borgwardt
Agreed, I oversimplified my oversimplification. Edit forthcoming...
LesterDove
So if I have a website with user profiles, there is nothing I need to do in order to have Google crawl my users' profiles?
Lumpy
LesterDove