I have clickstream data such as referring URLs, top landing pages, and top exit pages, plus metrics such as page views, number of visits, and bounces, all in Google Analytics. There is no database yet where all this information might be stored. I am required to build a data warehouse from scratch (which I believe is known as a web-house) from this data. So I need to extract data from Google Analytics and load it into a warehouse on a daily, automated basis. My questions are:
1) Is it possible? Every day the data grows (some of it in metrics or measures such as visits, some of it in new values such as new referring sites), so how would the process of loading the warehouse work?
2) What ETL tool would help me achieve this? I believe Pentaho has a way to pull data out of Google Analytics; has anyone used it? How does that process go? Any references or links would be appreciated in addition to answers.
As always, knowing the structure of the underlying transaction data--the atomic components used to build a DW--is the first and biggest step.
There are essentially two options, based on how you retrieve the data. One of these, already mentioned in a prior answer to this question, is to access your GA data via the GA API. The data it returns is pretty close to the form in which it appears in the GA Report, rather than transactional data. The advantage of using this as your data source is that your "ETL" is very simple: parsing the data from the XML container is about all that's needed.
The second option involves grabbing the data much closer to the source.
It's nothing complicated; still, a few lines of background are perhaps helpful here.
The GA Web Dashboard is created by parsing/filtering a GA transaction log (the container that holds the GA data that corresponds to one Profile in one Account).
Each line in this log represents a single transaction and is delivered to the GA server in the form of an HTTP Request from the client.
Appended to that Request (which is nominally for a single-pixel GIF) is a single string that contains all of the data returned from the _trackPageview function call, plus data from the client DOM, the GA cookies set for this client, and the contents of the browser's location bar (http://www....).
Though this Request is from the client, it is invoked by the GA script (which resides on the client) immediately after execution of GA's primary data-collecting function (_trackPageview).
So working directly with this transaction data is probably the most natural way to build a Data Warehouse; another advantage is that you avoid the additional overhead of an intermediate API.
The individual lines of the GA log are not normally available to GA users. Still, it's simple to get them. These two steps should suffice:
First, modify the GA tracking code on each page of your site so that it sends a copy of each GIF Request (one line in the GA logfile) to your own server. Specifically, immediately before the call to _trackPageview(), add this line:
pageTracker._setLocalRemoteServerMode();
Next, just put a single-pixel gif image in your document root and call it "__utm.gif".
So now your server activity log will contain these individual transaction lines, again built from a string appended to an HTTP Request for the GA tracking pixel as well as from other data in the Request (e.g., the User Agent string). The appended string is just a concatenation of key-value pairs; each key begins with the letters "utm" (probably for "Urchin tracker"). Not every utm parameter appears in every GIF Request; several of them, for instance, are used only for e-commerce transactions, so which ones you see depends on the transaction.
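As a rough sketch of how you might pull those lines back out of the server log: the snippet below assumes an Apache-style "combined" access log in which the request shows up as "GET /__utm.gif?<query> HTTP/1.1". The log format, the regex, the function name, and the two sample lines are all my own assumptions for illustration (the IP address and timestamps are made up), so adjust them to whatever your server actually writes.

```python
import re

# Assumption: the access log quotes the request line, e.g.
#   "GET /__utm.gif?utmwv=1&utmn=... HTTP/1.1"
GIF_REQUEST = re.compile(r'"GET /__utm\.gif\?(?P<query>\S*) HTTP/[\d.]+"')


def extract_gif_queries(lines):
    """Yield the query string of every __utm.gif request found in `lines`."""
    for line in lines:
        match = GIF_REQUEST.search(line)
        if match:
            yield match.group("query")


# Hypothetical log lines, for illustration only.
sample_log = [
    '203.0.113.7 - - [19/May/2010:08:00:51 +0000] '
    '"GET /__utm.gif?utmwv=1&utmn=1669045322&utmhn=lindenlab.hrmdirect.com HTTP/1.1" '
    '200 35 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [19/May/2010:08:00:52 +0000] '
    '"GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

queries = list(extract_gif_queries(sample_log))
```

Only the first sample line is a tracking-pixel request, so only its query string is kept; ordinary page requests are skipped.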
Here's an actual GIF Request (account ID has been sanitized, otherwise it's intact):
As you can see, this string consists of a set of key-value pairs, each separated by an "&". Two trivial steps, (i) splitting this string on the ampersand and (ii) replacing each GIF parameter (key) with a short descriptive phrase, make it much easier to read:
gatc_version 1
GIF_req_unique_id 1669045322
language_encoding UTF-8
screen_resolution 1280x800
screen_color_depth 24-bit
browser_language en-us
java_enabled 1
flash_version 10.0%20r45
campaign_session_new 1
page_title Position%20Listings%20%7C%20Linden%20Lab
host_name lindenlab.hrmdirect.com
referral_url http://lindenlab.com/employment
page_request /employment/openings.php?sort=da
account_string UA-XXXXXX-X
cookies __utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
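Those two steps (split on the ampersand, relabel the keys) could be sketched roughly as follows. Note the assumptions: the utm* key names in the mapping are the ones I understand the legacy GA tracking code to use (they are not shown in the listing above, so check them against your own logs), the sample string is rebuilt from a few of the listed values rather than copied from the original request, and the percent-decoding is an extra step that the listing above does not apply.

```python
from urllib.parse import unquote

# Assumed mapping from GA GIF parameters to the descriptive labels used above.
UTM_LABELS = {
    "utmwv": "gatc_version",
    "utmn": "GIF_req_unique_id",
    "utmcs": "language_encoding",
    "utmsr": "screen_resolution",
    "utmsc": "screen_color_depth",
    "utmul": "browser_language",
    "utmje": "java_enabled",
    "utmfl": "flash_version",
    "utmcn": "campaign_session_new",
    "utmdt": "page_title",
    "utmhn": "host_name",
    "utmr": "referral_url",
    "utmp": "page_request",
    "utmac": "account_string",
    "utmcc": "cookies",
}


def parse_gif_query(query):
    """Split a GIF Request query string into a dict with readable keys."""
    record = {}
    for pair in query.split("&"):          # step (i): split on the ampersand
        if not pair:
            continue
        key, _, value = pair.partition("=")  # split on the first "=" only
        label = UTM_LABELS.get(key, key)     # step (ii): relabel known keys
        record[label] = unquote(value)       # extra: decode %20, %7C, etc.
    return record


# Sample rebuilt from a few of the values listed above (not the full request).
sample = (
    "utmwv=1&utmn=1669045322&utmsr=1280x800&utmfl=10.0%20r45"
    "&utmdt=Position%20Listings%20%7C%20Linden%20Lab"
    "&utmhn=lindenlab.hrmdirect.com"
)
record = parse_gif_query(sample)
```

Unknown keys are passed through unchanged, so e-commerce or other parameters you haven't mapped yet are not lost.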
The cookies are also simple to parse (see Google's concise description here): for instance,
__utma is the unique-visitor cookie,
__utmb, __utmc are session cookies, and
__utmz is the referral type.
The GA cookies store the majority of the data that record each interaction by a user (e.g., clicking a tagged download link, clicking a link to another page on the site, a subsequent visit the next day, etc.). So for instance, the __utma cookie consists of groups of integers, each group separated by a "."; the last group is the visit count for that user (a "1" in this case).
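A minimal sketch of the cookie parsing, using the cookies value from the listing above. The function name is mine, and the layout I assume for __utmz (four dot-separated fields followed by "|"-separated campaign pairs) is inferred from this one sample rather than taken from Google's documentation, so treat it as a starting point.

```python
from urllib.parse import unquote

# The "cookies" value from the listing above, still percent-encoded.
RAW_COOKIES = (
    "__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B"
    "__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B"
    "__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7C"
    "utmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B"
)


def parse_ga_cookies(raw):
    """Decode the cookies string and return {cookie_name: cookie_value}."""
    cookies = {}
    for part in unquote(raw).split(";"):
        part = part.strip("+ ")              # cookies are joined by ";+"
        if not part:
            continue
        name, _, value = part.partition("=")  # value may itself contain "="
        cookies[name] = value
    return cookies


cookies = parse_ga_cookies(RAW_COOKIES)

# Last dot-separated group of __utma is the visit count (per the text above).
visit_count = int(cookies["__utma"].split(".")[-1])

# Assumed __utmz layout: 4 leading fields, then "|"-separated key=value pairs.
campaign_part = cookies["__utmz"].split(".", 4)[4]
campaign = dict(pair.split("=", 1) for pair in campaign_part.split("|"))
```

For this sample, visit_count comes out as 1 and campaign["utmcsr"] as "lindenlab.com", which matches the referral shown in the listing.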