views:

67

answers:

3

After being stumped by an earlier question: SO google-analytics-domain-data-without-filtering

I've been experimenting with a very basic analytics system of my own.

MySQL table:

hit_id, subsite_id, timestamp, ip, url

The subsite_id lets me drill down to a folder (as explained in the previous question).
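
For reference, a minimal sketch of how such a table might be defined; the column types and indexes here are my assumptions, not taken from the actual setup:

    CREATE TABLE hits (
        hit_id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
        subsite_id  INT UNSIGNED NOT NULL,
        `timestamp` DATETIME     NOT NULL,
        ip          VARCHAR(45)  NOT NULL,  -- wide enough for IPv6
        url         VARCHAR(255) NOT NULL,
        PRIMARY KEY (hit_id),
        KEY idx_subsite_time (subsite_id, `timestamp`)
    ) ENGINE=InnoDB;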

I can now get the following metrics:

  • Page Views - Grouped by subsite_id and date
  • Unique Page Views - Grouped by subsite_id, date, url, IP (not necessarily how Google does it! See the query sketch after this list.)
  • The usual "most visited page", "likely time to visit" etc etc.
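
A rough sketch of those two grouped queries, assuming the hits table sketched above and one possible reading of "unique" as distinct IPs per subsite/day/url:

    -- Page views per subsite per day
    SELECT subsite_id, DATE(`timestamp`) AS day, COUNT(*) AS page_views
    FROM hits
    GROUP BY subsite_id, DATE(`timestamp`);

    -- Unique page views: distinct IPs per subsite/day/url
    SELECT subsite_id, DATE(`timestamp`) AS day, url,
           COUNT(DISTINCT ip) AS unique_page_views
    FROM hits
    GROUP BY subsite_id, DATE(`timestamp`), url;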

I've now compared my data to that in Google Analytics and found that Google reports lower values for each metric, i.e. my own setup is counting more hits than Google.

So I've started discounting IPs from various web crawlers: Google, Yahoo & Dotbot so far.
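
One simple way to apply that exclusion, assuming a hypothetical crawler_ips table holding the addresses being discounted:

    -- crawler_ips is a hypothetical table of known crawler addresses
    SELECT subsite_id, DATE(`timestamp`) AS day, COUNT(*) AS page_views
    FROM hits
    WHERE ip NOT IN (SELECT ip FROM crawler_ips)
    GROUP BY subsite_id, DATE(`timestamp`);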

Short Questions:

  1. Is it worth collating a list of all major crawlers to discount, and is such a list likely to change regularly?
  2. Are there any other obvious filters that Google will be applying to GA data?
  3. What other data would you collect that might be of use further down the line?
  4. What variables does Google use to work out entrance search keywords to a site?

The data is only going to be used internally for our own "subsite ranking system", but I would like to show my users some basic data (page views, most popular pages etc) for their reference.

+1  A: 

Lots of people block Google Analytics for privacy reasons.

Martin Smith
Interesting! I doubt this is a large proportion of our traffic though. It's definitely not a technical community.
Jenkz
It takes about 2 seconds to install AdBlock in Firefox, technical community not required. This blocks Google Analytics by default.
mxmissile
Most of my users haven't heard of Firefox, have no idea what "installing" something is and certainly wouldn't have a clue what AdBlock would do or how to get it. It's 90% Internet Explorer straight out of the box. But I take your point :)
Jenkz
A: 

Under-reporting by the client-side rig versus server-side seems to be the usual outcome of these comparisons.

Here's how I've tried to reconcile the disparity when I've come across these studies:

Data Sources recorded in server-side collection but not client-side:

  • hits from mobile devices that don't support JavaScript (this is probably a significant source of disparity between the two collection techniques--e.g., a Jan '07 comScore study showed that 19% of UK internet users access the Internet from a mobile device)

  • hits from spiders, bots (which you mentioned already)

Data Sources/Events that server-side collection tends to record with greater fidelity (far fewer false negatives) compared with JavaScript page tags:

  • hits from users behind firewalls, particularly corporate firewalls--firewalls can block the page tag, and some are configured to reject/delete cookies.

  • hits from users who have disabled JavaScript in their browsers--about five percent, according to W3C data

  • hits from users who exit the page before it loads. Again, this is a larger source of disparity than you might think. The most frequently cited study to support this was conducted by Stone Temple Consulting: two otherwise identical sites ran the same web analytics system and differed only in where the JS tracking code was placed (at the bottom of the pages on one site, at the top on the other); the difference in recorded unique visitor traffic was 4.3%.


FWIW, here's the scheme I use to remove/identify spiders, bots, etc. (a rough query sketch follows the list):

  1. monitor requests for our robots.txt file, then filter all other requests from the same IP address + user agent (not all spiders will request robots.txt, of course, but with minuscule error, any request for this resource is probably a bot).

  2. compare user agents and IP addresses against published lists: iab.net and user-agents.org publish the two lists that seem to be the most widely used for this purpose.

  3. pattern analysis: nothing sophisticated here; we look at (i) page views as a function of time (i.e., clicking a lot of links and spending ~200 ms on each page is probative); (ii) the path by which the 'user' traverses our site, and whether it is systematic and complete or nearly so (like following a back-tracking algorithm); and (iii) precisely-timed visits (e.g., 3 am each day).
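
A rough sketch of steps 1 and 3(i) against a table like the one in the question (which has no user-agent column, so these work on IP alone); the thresholds are arbitrary and the queries are illustrative only:

    -- Step 1: drop hits from any IP that has ever requested robots.txt
    SELECT h.*
    FROM hits h
    WHERE h.ip NOT IN (
        SELECT DISTINCT ip FROM hits WHERE url = '/robots.txt'
    );

    -- Step 3(i): IPs with many hits and a very short average gap between them are suspect
    SELECT ip,
           COUNT(*) AS total_hits,
           TIMESTAMPDIFF(SECOND, MIN(`timestamp`), MAX(`timestamp`)) / COUNT(*) AS avg_secs_per_hit
    FROM hits
    GROUP BY ip
    HAVING total_hits > 50 AND avg_secs_per_hit < 1;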

doug
Thanks for the detail doug.
Jenkz
A: 

The biggest reasons are that users have to have JavaScript enabled and must load the entire page, since the tracking code is often in the footer. AWStats and other server-side solutions like yours will record everything. Plus, Analytics does a really good job of identifying bots and scrapers.

mdvaldosta