Counting distinct visitors is not an easy task. In web analytics, for example, a visitor can visit on both Monday and Thursday, but when counting unique visitors over that week I'd expect that visitor to be counted only once.

A count(distinct userid) over 10M visits in a month can't run very fast, because it can't be built from pre-aggregated results: count distinct is not an additive measure, so daily unique counts can't simply be summed into a weekly or monthly count.
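To make the point concrete, here is a tiny sketch with made-up data (the user names and days are purely illustrative) showing that summing per-day unique counts overstates the weekly figure:

```python
# Made-up visit log: (day, user id). All names are illustrative.
visits = [
    ("Mon", "alice"),
    ("Mon", "bob"),
    ("Thu", "alice"),   # alice returns later in the week
    ("Thu", "carol"),
]

# Pre-aggregate: the set of unique users for each day.
daily = {}
for day, user in visits:
    daily.setdefault(day, set()).add(user)

# Adding up the daily unique counts double-counts returning visitors.
sum_of_daily = sum(len(users) for users in daily.values())

# The true weekly figure needs the user ids themselves, not the daily counts.
weekly = len({user for _, user in visits})

print(sum_of_daily, weekly)
```

The daily counts alone are not enough to recover the weekly number; some representation of who visited has to be kept.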

My question is: How do Google Analytics and other web analytics platforms return unique visitors so fast? I assume statistical estimations are used. What kind? How?

A: 

They set a cookie with a reasonable expiration. If you have the cookie already, you've come back.

great_llama
The question is not specifically about cookies. It is more general: how do you estimate a distinct count for a high-cardinality column?
"How do Google Analytics and other web analytics platforms return unique visitors so fast?" They don't count distinct.
great_llama
... rather, they don't do it at reporting time.
great_llama
I assume they have a batch operation that precalculates everything needed for the report to run fast. What is that precalculation?
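Nothing in this thread says what Google Analytics actually does, so the following is only an illustration of one kind of precalculation that would fit: a mergeable cardinality sketch. It is a minimal K-minimum-values (KMV) estimator, assuming a batch job builds one small sketch per day and the report merges them. The names K, build_sketch, merge and estimate are hypothetical, not taken from any platform.

```python
import hashlib
import heapq

K = 1024  # assumed sketch size; relative error is roughly 1/sqrt(K)

def _hash01(user_id: str) -> float:
    """Map an id to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def build_sketch(user_ids):
    """Batch step: keep only the K smallest distinct hash values (ascending)."""
    return heapq.nsmallest(K, {_hash01(u) for u in user_ids})

def merge(a, b):
    """Combine two sketches; the result is the sketch of the union."""
    return heapq.nsmallest(K, set(a) | set(b))

def estimate(sketch):
    """Estimate the number of distinct ids behind a sketch."""
    if len(sketch) < K:
        return len(sketch)  # fewer than K distinct values: count is exact
    # If n hashes are spread uniformly over [0, 1), the K-th smallest
    # sits near K / n, so n is approximately (K - 1) / kth_smallest.
    return int((K - 1) / sketch[-1])
```

Unlike plain counts, these sketches can be combined: merging the seven daily sketches gives an estimate for the week without rescanning the raw visits, at the cost of a few percent of error for this K.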