As always, knowing the structure of the underlying transaction data--the atomic components used to build a DW--is the first and biggest step.
There are essentially two options, based on how you retrieve the data. One of these, already mentioned in a prior answer to this question, is to access your GA data via the GA API. Data retrieved this way is pretty close to the form in which it appears in a GA Report, rather than transactional data. The advantage of using this as your data source is that your "ETL" is very simple: parsing the data out of the XML container is about all that's needed.
The second option involves grabbing the data much closer to the source.
None of this is complicated, but a few lines of background are perhaps helpful here.
The GA Web Dashboard is created by parsing/filtering a GA transaction log (the container that holds the GA data that corresponds to one Profile in one Account).
Each line in this log represents a single transaction and is delivered to the GA server in the form of an HTTP Request from the client.
Appended to that Request (which is nominally for a single-pixel GIF) is a single string that contains all of the data returned from the _trackPageview function call plus data from the client DOM, GA cookies set for this client, and the contents of the Browser's location bar (http://www....).
Though this Request is from the client, it is invoked by the GA script (which resides on the client) immediately after execution of GA's primary data-collecting function (_trackPageview).
So working directly with this transaction data is probably the most natural way to build a Data Warehouse; another advantage is that you avoid the additional overhead of an intermediate API.
The individual lines of the GA log are not normally available to GA users. Still, it's simple to get them. These two steps should suffice:
First, modify the GA tracking code on each page of your Site so that it sends a copy of each GIF Request (one line in the GA logfile) to your own server. Specifically, immediately before the call to _trackPageview(), add this line:
pageTracker._setLocalRemoteServerMode();
Next, just put a single-pixel GIF image in your document root and call it "__utm.gif".
So now your server activity log will contain these individual transaction lines, each built from a string appended to an HTTP Request for the GA tracking pixel as well as from other data in the Request (e.g., the User Agent string). The appended string is just a concatenation of key-value pairs; each key begins with the letters "utm" (probably short for "Urchin Tracking Module"). Not every utm parameter appears in every GIF Request; which ones appear depends on the transaction. Several of them, for instance, are used only for e-commerce transactions.
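As a rough sketch of the first ETL step, the snippet below pulls the appended string out of one line of a server log. It assumes the server writes the common "combined" access-log layout, in which the request appears in quotes as "GET /path?query HTTP/1.1"; the function name `extract_utm_query` and the sample log line are hypothetical, so adjust the pattern to whatever your own server actually writes.

```python
import re
from typing import Optional

# Assumption: "combined"-style access log, where the request line is quoted
# as "GET /path?query HTTP/1.1". Other log formats need a different pattern.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (?P<target>\S+) HTTP/[\d.]+"')


def extract_utm_query(log_line: str) -> Optional[str]:
    """Return the raw query string of a __utm.gif request, or None."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return None
    # Split on the first "?" only; later "?" characters belong to the values.
    path, _, query = match.group("target").partition("?")
    if not path.endswith("/__utm.gif"):
        return None
    return query


# Hypothetical log line, shortened for readability.
sample = (
    '203.0.113.7 - - [19/May/2010:08:00:51 +0000] '
    '"GET /__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8 HTTP/1.1" '
    '200 35 "http://example.com/" "Mozilla/5.0"'
)
print(extract_utm_query(sample))  # utmwv=1&utmn=1669045322&utmcs=UTF-8
```

Requests for anything other than the tracking pixel return None, so the function can be used as a filter while streaming through the log.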
Here's an actual GIF Request (account ID has been sanitized, otherwise it's intact):
http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.0%20r45&utmcn=1&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmhn=lindenlab.hrmdirect.com&utmr=http://lindenlab.com/employment&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X&utmcc=_utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B_utmb%3D87045125%3B%2B_utmc%3D87045125%3B%2B_utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
As you can see, this string is composed of a set of key-value pairs, each separated by an "&". Two trivial steps make it much easier to read: (i) splitting the string on the ampersand, and (ii) replacing each GIF parameter (key) with a short descriptive phrase:
gatc_version 1
GIF_req_unique_id 1669045322
language_encoding UTF-8
screen_resolution 1280x800
screen_color_depth 24-bit
browser_language en-us
java_enabled 1
flash_version 10.0%20r45
campaign_session_new 1
page_title Position%20Listings%20%7C%20Linden%20Lab
host_name lindenlab.hrmdirect.com
referral_url http://lindenlab.com/employment
page_request /employment/openings.php?sort=da
account_string UA-XXXXXX-X
cookies _utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B_utmb%3D87045125%3B%2B_utmc%3D87045125%3B%2B_utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
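The two steps that produce the listing above can be sketched in a few lines of Python. The descriptive names are taken from that listing; the table covers only the parameters present in this one request, so treat it as illustrative rather than complete. The function name `label_utm_params` is hypothetical.

```python
# Descriptive names copied from the listing above. Only the parameters seen
# in the sample request are mapped; unknown keys are passed through as-is.
UTM_LABELS = {
    "utmwv": "gatc_version",
    "utmn": "GIF_req_unique_id",
    "utmcs": "language_encoding",
    "utmsr": "screen_resolution",
    "utmsc": "screen_color_depth",
    "utmul": "browser_language",
    "utmje": "java_enabled",
    "utmfl": "flash_version",
    "utmcn": "campaign_session_new",
    "utmdt": "page_title",
    "utmhn": "host_name",
    "utmr": "referral_url",
    "utmp": "page_request",
    "utmac": "account_string",
    "utmcc": "cookies",
}


def label_utm_params(query: str) -> dict:
    """Split a GIF Request query string on "&" and relabel each key."""
    labeled = {}
    for pair in query.split("&"):
        if not pair:  # skip the empty segment left by a doubled "&&"
            continue
        # Partition on the first "=" so values containing "=" stay whole.
        key, _, value = pair.partition("=")
        labeled[UTM_LABELS.get(key, key)] = value
    return labeled


# Shortened version of the request shown above.
query = "utmwv=1&utmsr=1280x800&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X"
for name, value in label_utm_params(query).items():
    print(name, value)
```

Values are left URL-encoded here so the output matches the listing; passing each value through `urllib.parse.unquote` would turn, for example, `10.0%20r45` into `10.0 r45`.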
The cookies are also simple to parse (see Google's concise description here): for instance,
__utma is the unique-visitor cookie,
__utmb, __utmc are session cookies, and
__utmz is the referral type.
The GA cookies store the majority of the data that record each interaction by a user (e.g., clicking a tagged download link, clicking a link to another page on the Site, making a subsequent visit the next day, etc.). So for instance, the __utma cookie is composed of groups of integers, each group separated by a "."; the last group is the visit count for that user (a "1" in this case).
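Here is a minimal sketch of how the cookie string might be unpacked. It URL-decodes the value, splits it into individual cookies, and then breaks __utma into its dot-separated groups. Only the meaning of the last group (the visit count) comes from the description above; the names given to the other groups follow the commonly described __utma layout and should be treated as assumptions to verify against Google's documentation. The function names are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import unquote


def split_ga_cookies(utmcc: str) -> dict:
    """URL-decode the cookie string and return {cookie_name: value}."""
    cookies = {}
    # Once decoded, the cookies in the sample request are separated by ";+".
    for chunk in unquote(utmcc).split(";+"):
        if not chunk:
            continue
        name, _, value = chunk.partition("=")
        # Strip leading underscores so "_utma" and "__utma" are treated alike.
        cookies[name.lstrip("_")] = value
    return cookies


def parse_utma(value: str) -> dict:
    """Break the __utma value into its dot-separated integer groups."""
    groups = [int(g) for g in value.split(".")]
    return {
        # Assumed field names for the leading groups (not stated above).
        "domain_hash": groups[0],
        "visitor_id": groups[1],
        "first_visit": datetime.fromtimestamp(groups[2], tz=timezone.utc),
        "previous_visit": datetime.fromtimestamp(groups[3], tz=timezone.utc),
        "current_visit": datetime.fromtimestamp(groups[4], tz=timezone.utc),
        # The last group is the visit count, as described above.
        "visit_count": groups[-1],
    }


# First two cookies from the sample request.
sample_utmcc = (
    "_utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1"
    "%3B%2B_utmb%3D87045125%3B%2B"
)
utma = parse_utma(split_ga_cookies(sample_utmcc)["utma"])
print(utma["visit_count"])  # 1
print(utma["first_visit"])  # 2010-05-19 08:00:51+00:00
```

If the timestamp reading is right, all three timestamps being equal is consistent with a visit count of 1, i.e. a first-time visitor.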