views:

29

answers:

3

I've got a problem with performance of my reporting database (tables have millions of records, 50+), when I want to calculate distinct on column that indicates a visitor uniqueness, let's say some hashkey.

For example: I have these columns: hashkey, name, surname, visit_datetime, site, gender, etc...

I need to get distinct in time span of 1 year, less than in 5 sec:

SELECT COUNT(DISTINCT hashkey) FROM table WHERE visit_datetime BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD' 

This query will be fast for short time ranges, but if it be bigger than one month, than it can takes more than 30s.

Is there a better technology to calculate something like this than relational databases?

I'm wondering what google analytics use to do theirs unique visitors calculating on the fly.

A: 

You could have another table store the count of unique visitors for each day, updated daily by a cron function or something.

Sam Dufel
This is not a solution, cause I need uniqueness for some period of time, not only for particular day.
sparrovv
+2  A: 

For reporting and analytics, the type of thing you're describing, these sorts of statistics tend to be pulled out, aggregated, and stored in a data warehouse or something. They are stored in a fashion meant for performance reasons in lieu of nice relational storage techniques optimized for OLTP (online transaction processing). This pre-aggregated technique is called OLAP (online analytical processing).

Ziplin
That said, there could be some technology better suited to this if this is the only analysis you do.
Ziplin
Right now I need only this analysis, but I can't say that in future I would need more... I have very little knowledge about OLAP, could you point me some good resources, from where I can start?
sparrovv
Well this answer pretty much hit my limit on knowledge of OLAP :P Try breezing over the references for the wiki page - http://en.wikipedia.org/wiki/OLAP#References
Ziplin
In the meantime Dufel's answer provides a good simple starting point for you. If possible, I'd personally use a database job over a cron job, but that's me.
Ziplin
A: 

Google Analytics uses a first-party cookie, which you can see if you log Request Headers using LiveHTTPHeaders, etc.

All GA analytics parameters are packed into the Request URL, e.g.,

utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B">http://www.google-analytics.com/_utm.gif?utmwv=4&utmn=769876874&utmhn=example.com&utmcs=ISO-8859-1&utmsr=1280x1024&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=9.0%20%20r115&utmcn=1&utmdt=GATC012%20setting%20variables&utmhid=2059107202&utmr=0&utmp=/auto/GATC012.html?utm_source=www.gatc012.org&utm_campaign=campaign+gatc012&utm_term=keywords+gatc012&utm_content=content+gatc012&utm_medium=medium+gatc012&utmac=UA-30138-1&utmcc=__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B...

Within that URL is a piece that keyed to __utmcc, these are the GA cookies. Within _utmcc, is a string keyed to _utma, which is string comprised of six fields each delimited by a '.'. The second field is the Visitor ID, a random number generated and set by the GA server after looking for GA cookies and not finding them:

__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1

In this example, 1774621898 is the Visitor ID, intended by Google Analytics as a unique identifier of each visitor

So you can see the flaws of technique to identify unique visitors--entering the Site using a different browser, or a different device, or after deleting the cookies, will cause you to appear to GA as a unique visitor (i.e., it looks for its cookies and doesn't find any, so it sets them).

There is an excellent article by EFF on this topic--i.e., how uniqueness can be established, and with what degree of certainty, and how it can be defeated.

Finally, once technique i have used to determine whether someone has visited our Site before (assuming the hard case, which is that they have deleted their cookies, etc.) is to examine the client request for our favicon. The directories that store favicons are quite often overlooked--whether during a manual sweep or programmatically using a script.

doug