We keep track of user agent strings on our website. I want to do some statistics on them, to see how many IE6 users we have (so we know what we have to develop against), and also how many mobile users we have.

So we have log entries like this:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727)

Ideally, it would be pretty neat to see all the 'meaningful' strings, which probably just means strings longer than a certain length. For instance, I might like to see how many entries contain FunWebProducts, or .NET CLR, or .NET CLR 1.0.3705, but I don't want to see how many contain a semicolon. So I'm not necessarily looking for unique strings, but for all strings, including substrings. I would want to see the count of all Mozilla entries, knowing that this includes the counts for Mozilla/5.0 and Mozilla/4.0. It would be nice if there were a nested display for this, starting with the shortest strings and working its way down. Something perhaps like:

4,2093 Mozilla
 1,093 Mozilla/5.0
    468 Mozilla/5.0 (Windows;
     47 Mozilla/5.0 (Windows; U 
 2,398 Mozilla/4.0

This sounds like computer science homework. What would this be called? Does something like this already exist, or do I have to write my own?

A: 

If you break each string up into the major name (the part before the opening paren) and then store each semicolon-separated part as a child record, you can do whatever analysis you want. For example, store it in a relational database:

BrowserID BrowserText
--------- -----------
1   Mozilla/4.0
2   Mozilla/5.0

FeatureID FeatureText
--------- -----------
1   compatible
2   MSIE 7.0
3   Windows NT 5.1
4   FunWebProducts
5   .NET CLR 1.0.3705
6   .NET CLR 1.1.4322
7   Media Center PC 4.0
8   .NET CLR 2.0.50727

Then log references to browser and parts and you can do any type of analysis you want.
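
As a rough illustration of that split, here is a minimal Python sketch. It keeps the counts in memory instead of in database tables, and it only looks at the first parenthesised group. The function names `split_user_agent` and `count_parts` are made up for this example.

```python
from collections import Counter


def split_user_agent(ua):
    """Split a user agent string into (browser, [features]).

    The browser is the text before the first "(", and the features are
    the semicolon-separated parts inside the first pair of parentheses.
    """
    browser, _, rest = ua.partition("(")
    inside = rest.split(")", 1)[0]
    features = [part.strip() for part in inside.split(";") if part.strip()]
    return browser.strip(), features


def count_parts(user_agents):
    """Count how many log entries mention each browser and each feature."""
    browsers = Counter()
    features = Counter()
    for ua in user_agents:
        browser, parts = split_user_agent(ua)
        browsers[browser] += 1
        # Use a set so a feature is counted once per log entry.
        features.update(set(parts))
    return browsers, features


log = [
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; "
    ".NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; "
    ".NET CLR 2.0.50727)",
]
browsers, features = count_parts(log)
```

With the two log entries from the question, `browsers["Mozilla/4.0"]` and `features["FunWebProducts"]` are both 2, while `features[".NET CLR 1.0.3705"]` is 1.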

Sam
Tokenizing on semicolons alone won't do; I have strings like `Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-us) AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9`
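
One possible way around this, sketched in Python under some assumptions: treat each parenthesised group as a unit and split only its contents on semicolons, while everything outside parentheses is split on whitespace. Counting every leading run of tokens then gives the kind of nested totals the question asks for. The names `tokenize` and `prefix_counts` are hypothetical, and nested parentheses are not handled.

```python
import re
from collections import Counter

# A token is either a whole "(...)" group or a run of non-space characters.
TOKEN_RE = re.compile(r"\([^)]*\)|\S+")


def tokenize(ua):
    """Break a user agent string into product tokens and comment parts."""
    tokens = []
    for match in TOKEN_RE.finditer(ua):
        text = match.group(0)
        if text.startswith("("):
            # Inside parentheses, semicolons separate the parts.
            inner = text.strip("()")
            tokens.extend(p.strip() for p in inner.split(";") if p.strip())
        else:
            tokens.append(text)
    return tokens


def prefix_counts(user_agents):
    """Count every leading run of tokens across all user agent strings."""
    counts = Counter()
    for ua in user_agents:
        tokens = tokenize(ua)
        for end in range(1, len(tokens) + 1):
            counts[tuple(tokens[:end])] += 1
    return counts


safari = ("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-us) "
          "AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9")
safari_tokens = tokenize(safari)
```

Here `safari_tokens` keeps `KHTML, like Gecko` and `Intel Mac OS X 10_5_8` as single tokens. Sorting the keys of `prefix_counts(...)` and indenting each by its length would give a nested listing; splitting product tokens further on `/` would be needed to roll `Mozilla/4.0` and `Mozilla/5.0` up under `Mozilla`.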