I have clickstream data such as referring URLs, top landing pages, top exit pages, and metrics such as page views, number of visits, and bounces, all in Google Analytics. There is no database yet where all this information is stored. I am required to build a data warehouse from scratch (which I believe is known as a "web-house") from this data. So I need to extract data from Google Analytics and load it into a warehouse on a daily, automated basis. My questions are:

1) Is it possible? The data grows every day (some of it in terms of metrics or measures such as visits, some of it in terms of new dimension members such as new referring sites), so how would the process of loading the warehouse work?

2) Which ETL tool would help me achieve this? I believe Pentaho has a way to pull data out of Google Analytics; has anyone used it? How does that process work? Any references or links would be appreciated, in addition to answers.

+1  A: 

You can use Google's Data Export API, or a service such as the one we've built specifically for this need: www.analyticspros.com/products/analytics-data-warehouse.html.

Best,

-Caleb Whitmore www.analyticspros.com / www.analyticsformarketers.com

Caleb Whitmore
Hi Caleb, thanks for your response. I had come across this a few minutes ago but wasn't quite sure I understood it thoroughly. It states: "Analytics Backup sends the tracking data into a cloud-based data capture and warehouse environment engineered specifically for the purpose of backing up and archiving Google Analytics raw tracking data." So this means that it would store all Google Analytics data in a warehouse structure/environment, and that data could then be used by a BI tool for reporting? Also, can extraction of data be automated on a scheduled basis? Thanks again.
nkaur301
+1  A: 

As always, knowing the structure of the underlying transaction data--the atomic components used to build a DW--is the first and biggest step.

There are essentially two options, based on how you retrieve the data. One of these, already mentioned in a prior answer to this question, is to access your GA data via the GA API. This data is pretty close to the form in which it appears in the GA reports, rather than transactional data. The advantage of using this as your data source is that your "ETL" is very simple: parsing the data out of the XML container is about all that's needed.
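To make that concrete, here is a minimal sketch of the "parse the XML container" step. The `dxp:` element names and namespace below are modeled on the GA Data Export API's Atom feed format, but treat the exact schema shown as an illustrative assumption rather than a verbatim sample of Google's output:

```python
import xml.etree.ElementTree as ET

# Illustrative fragment modeled on the Data Export API's Atom feed;
# each <entry> carries dimension and metric elements as attributes.
SAMPLE_FEED = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dxp="http://schemas.google.com/analytics/2009">
  <entry>
    <dxp:dimension name="ga:source" value="google"/>
    <dxp:metric name="ga:visits" value="42" type="integer"/>
  </entry>
</feed>"""

NS = {"atom": "http://www.w3.org/2005/Atom",
      "dxp": "http://schemas.google.com/analytics/2009"}

def parse_feed(xml_text):
    """Flatten each <entry> into a {name: value} dict -- this is the whole 'ETL'."""
    rows = []
    root = ET.fromstring(xml_text)
    for entry in root.findall("atom:entry", NS):
        row = {}
        for el in entry.findall("dxp:dimension", NS) + entry.findall("dxp:metric", NS):
            row[el.get("name")] = el.get("value")
        rows.append(row)
    return rows

print(parse_feed(SAMPLE_FEED))  # [{'ga:source': 'google', 'ga:visits': '42'}]
```

Each flattened row maps directly onto a fact-table insert, which is why this route keeps the ETL so light.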

The second option involves grabbing the data much closer to the source.

Nothing complicated, still, a few lines of background are perhaps helpful here.

  • The GA Web Dashboard is created by parsing/filtering a GA transaction log (the container that holds the GA data that corresponds to one Profile in one Account).

  • Each line in this log represents a single transaction and is delivered to the GA server in the form of an HTTP Request from the client.

  • Appended to that Request (which is nominally for a single-pixel GIF) is a single string that contains all of the data returned from the _trackPageview function call, plus data from the client DOM, the GA cookies set for this client, and the contents of the browser's location bar (http://www....).

  • Though this Request comes from the client, it is invoked by the GA script (which resides on the client) immediately after execution of GA's primary data-collecting function (_trackPageview).

So working directly with this transaction data is probably the most natural way to build a data warehouse; another advantage is that you avoid the additional overhead of an intermediate API.

The individual lines of the GA log are not normally available to GA users. Still, it's simple to get them. These two steps should suffice:

  1. Modify the GA tracking code on each page of your site so that it sends a copy of each GIF Request (one line in the GA logfile) to your own server. Specifically, immediately before the call to _trackPageview(), add this line:

    pageTracker._setLocalRemoteServerMode();
    
  2. Next, just put a single-pixel gif image in your document root and call it "__utm.gif".

So now your server activity log will contain these individual transaction lines, again built from the string appended to an HTTP Request for the GA tracking pixel as well as from other data in the Request (e.g., the User-Agent string). The former string is just a concatenation of key-value pairs; each key begins with the letters "utm" (probably for "urchin tracker"). Not every utm parameter appears in every GIF Request; several of them, for instance, are used only for e-commerce transactions, so it depends on the transaction.
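Pulling those GA hit lines out of your server's access log is then a simple filter. The log lines below are hypothetical examples in common log format; only the requests for __utm.gif carry the GA payload:

```python
# Hypothetical access-log lines in common log format; only the request
# for the GA tracking pixel (__utm.gif) carries the utm payload.
SAMPLE_LOG = [
    '1.2.3.4 - - [19/May/2010:10:00:00] "GET /index.html HTTP/1.1" 200 512',
    '1.2.3.4 - - [19/May/2010:10:00:01] "GET /__utm.gif?utmwv=1&utmsr=1280x800 HTTP/1.1" 200 35',
]

def ga_hits(lines):
    """Keep only the requests for the GA tracking pixel."""
    return [line for line in lines if "__utm.gif?" in line]

print(ga_hits(SAMPLE_LOG))  # just the one __utm.gif line
```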

Here's an actual GIF Request (account ID has been sanitized, otherwise it's intact):

http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.0%20r45&utmcn=1&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmhn=lindenlab.hrmdirect.com&utmr=http://lindenlab.com/employment&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X&utmcc=_utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B_utmb%3D87045125%3B%2B_utmc%3D87045125%3B%2B_utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B

As you can see, this string comprises a set of key-value pairs, each separated by an "&". Just two trivial steps, (i) splitting this string on the ampersand and (ii) replacing each GIF parameter (key) with a short descriptive phrase, make it much easier to read:

gatc_version              1
GIF_req_unique_id         1669045322
language_encoding         UTF-8
screen_resolution         1280x800
screen_color_depth        24-bit
browser_language          en-us
java_enabled              1
flash_version             10.0%20r45
campaign_session_new      1
page_title                Position%20Listings%20%7C%20Linden%20Lab
host_name                 lindenlab.hrmdirect.com
referral_url              http://lindenlab.com/employment
page_request              /employment/openings.php?sort=da
account_string            UA-XXXXXX-X
cookies                   _utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B_utmb%3D87045125%3B%2B_utmc%3D87045125%3B%2B_utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
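The two trivial steps, split on the ampersand and rename the keys, can be sketched in a few lines of Python. The query string below is a trimmed version of the sample request, and the friendly names mirror the table above; note that `parse_qs` also URL-decodes the values for free:

```python
from urllib.parse import parse_qs

# Trimmed version of the sample GIF Request string above.
QUERY = "utmwv=1&utmsr=1280x800&utmul=en-us&utmdt=Position%20Listings&utmac=UA-XXXXXX-X"

# Mapping of utm parameters to descriptive names, as in the table above.
FRIENDLY = {
    "utmwv": "gatc_version",
    "utmsr": "screen_resolution",
    "utmul": "browser_language",
    "utmdt": "page_title",
    "utmac": "account_string",
}

def decode_hit(query):
    """Split on '&', URL-decode, and rename each utm key."""
    raw = parse_qs(query)  # step (i): split into key-value pairs
    return {FRIENDLY.get(k, k): v[0] for k, v in raw.items()}  # step (ii): rename

print(decode_hit(QUERY))
# {'gatc_version': '1', 'screen_resolution': '1280x800', ...}
```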

The cookies are also simple to parse (see Google's concise description here): for instance,

  • __utma is the unique-visitor cookie,

  • __utmb, __utmc are session cookies, and

  • __utmz is the referral type.

The GA cookies store the majority of the data recording each interaction by a user (e.g., clicking a tagged download link, clicking a link to another page on the site, a subsequent visit the next day, etc.). So, for instance, the __utma cookie comprises groups of integers, each group separated by a "."; the last group is the visit count for that user (a "1" in this case).
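Using the __utma value from the sample request above, that parse is a one-line split. The field labels in the dict follow the description in this answer (domain hash, visitor id, three visit timestamps, then the visit count):

```python
# The __utma value from the sample GIF Request, already URL-decoded.
UTMA = "87045125.1669045322.1274256051.1274256051.1274256051.1"

def parse_utma(value):
    """Split the __utma cookie on '.' into its named integer groups."""
    parts = value.split(".")
    return {
        "domain_hash": parts[0],
        "visitor_id": parts[1],
        "first_visit_ts": int(parts[2]),
        "previous_visit_ts": int(parts[3]),
        "current_visit_ts": int(parts[4]),
        "visit_count": int(parts[5]),  # the "1" noted above
    }

print(parse_utma(UTMA)["visit_count"])  # 1
```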

doug
+1  A: 

You can pull GA (Google Analytics) data anytime through their API and build your own data warehouse (DW). Before you start, you may want to sit with the business users and get a clear understanding of the business requirements. In a DW environment it is extremely important to have clearly set goals and an understanding of business-user requirements, because you will maintain a history of transactions that lives for a long time and is used often.

Assuming the business user defines the KPIs (Key Performance Indicators), metrics, dimensions, and granularity you need to proceed, you can check the different dimensions and metrics that are available through the GA API at code.google.com/apis/analytics/docs/. Then it is just a matter of making the right API calls and getting what you need. DW activity involves data cleaning, extraction, transformation, and loading (ETL, or ELT), along with summarizing the facts along different dimensions. Since this data is much cleaner than what one would encounter in disparate systems (web logs, external vendors, Excel or flat files, etc.), you can simply load it through any ETL tool (for example, Talend, Pentaho, SSIS, etc.) or through an application of your choice (Perl, Java, Ruby, C#, etc.).

For the daily load, you need to design an incremental loading process that runs during low-traffic periods (nightly loads), pulling only the recent data, de-duping any duplicates, cleansing any non-conforming data, handling erroneous rows, etc.
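A minimal sketch of such an idempotent nightly load, using SQLite for illustration: the table name and columns here are hypothetical, and in practice the rows would come from the GA API pull for the previous day. Keying the table on (day, source) makes a re-run overwrite rather than duplicate:

```python
import sqlite3

# Hypothetical fact table; the primary key on (day, source) is what
# makes repeated loads of the same day idempotent.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fact_daily_traffic (
                  day TEXT, source TEXT, visits INTEGER,
                  PRIMARY KEY (day, source))""")

def load(rows):
    """Upsert a batch of (day, source, visits) rows; re-runs don't duplicate."""
    conn.executemany(
        "INSERT OR REPLACE INTO fact_daily_traffic VALUES (?, ?, ?)", rows)
    conn.commit()

load([("2010-05-19", "google", 120), ("2010-05-19", "direct", 45)])
load([("2010-05-19", "google", 121)])  # corrected re-pull: replaces, no dupes

print(conn.execute("SELECT COUNT(*) FROM fact_daily_traffic").fetchone()[0])  # 2
```

The same upsert-on-a-natural-key pattern carries over to whatever warehouse engine you actually use.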

I have provided a sample GA API application at http://www.hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html; it will give you the basic information you need to get started.

Hope that helps,

Shiva -- hiregion.com

Shiva
Thanks Shiva. I tried using Talend to extract the Google Analytics data with its tGoogleAnalyticsInput component. However, it asks me to specify columns in the Data section, which I am not able to add. If you have any idea about this, it would be great if you could give some insight into how to configure that component. I appreciate all your help.
nkaur301
A: 

Automation Anywhere works well for data extraction. You can extract data from Analytics and load the extracted data into any database. The biggest advantage is that it saves a lot of time; you can get it done in minutes, fully automated. A free trial and details on data extraction are available.

Techsmart