views:

636

answers:

6

The following shows the structure of some data I have (format: a list of lists)

data = 
[ 
  [1,2008-12-01],
  [1,2008-12-01],
  [2,2008-12-01]
  ... (the lists continue)
]

The dates range from 2008-12-01 to 2008-12-25.

The first field identifies a user by id, the second field (a date field) shows when this user visited a page on my site.

I need to analyse this data so that i get the following results

25 users visted on 1 day
100 users visted on 2 days
300 users visted on 4 days
... upto 25 days

I am using python and don't know where to start !

EDIT

I'm sorry it seems I wasnt clear enough about what I needed as a few people have given answers that are not what I'm looking for.

I need to find out how many users visited on all the days e.g.
10 users visited on 25 days (or every day)

Then I'd like the to list the same for each frequency of days from 1 - 25. So as per my original example above
25 users visited for only one day (out of the 25)
100 users visited on 2 days (out of the 25)
etc

I DONT need to know how many visited on each day
thanks

+4  A: 

Your result is a dictionary, right?

{ userNumber: setOfDays }

How about this to get started.

from collections import defaultdict
visits = defaultdict(set)
for user, date in someList:
    visits[user].add(date)

This gives you a dictionary with a set of dates on which they visited.

counts = defaultdict(int)
for user in visits:
    v= len(visits[user])
    count[v] += 1

This gives you a dictionary of # visits, # of users with that many visits.

Is that the kind of thing you're looking for?

S.Lott
You declared the first dictionary as 'visits', so you mean:for user in visits: visit = len(visits[users]) count[visit] += 1or even:for visit in visits.itervalues() : count[len(visit)] += 1
hughdbrown
A: 

First, I should mention that you NEED to store the date as a string. Currently, it would do arithmetic on your current entry. So, if you format data like this, it will work better:

data = 
[ 
  [1,"2008-12-01"],
  [1,"2008-12-01"],
  [2,"2008-12-01"]
]

Next, we can do something like this to get the number for each day:

result = {}
for (id, date) in data:
    if date not in result:
        result[date] = 1
    else:
        result[date] += 1

Now you can get the number of users for a specific date by doing something like this:

print result[some_date]
Evan Fosmark
This should use collections.defaultdict, as S.Lott's code does. It's a class that simplifies this kind of dictionary addition.
hughdbrown
A: 

It is unclear what exactly your requirement are. Here's my take:

#!/usr/bin/env python
from collections import defaultdict

data = [ 
  [1,'2008-12-01'],
  [3,'2008-12-25'],
  [1,'2008-12-01'],
  [2,'2008-12-01'],
]

d = defaultdict(set)
for id, day in data:
    d[day].add(id)

for day in sorted(d):
    print('%d user(s) visited on %s' % (len(d[day]), day))

It prints:

2 user(s) visited on 2008-12-01
1 user(s) visited on 2008-12-25
J.F. Sebastian
This gives how many people visited on given days because you are grouping by date. He wants how many times people visited x times. He needs grouping by user id, not date.
hughdbrown
A: 

How about this: this gives you set of days as well as count:

In [39]: from itertools import groupby ##itertools is a part of the standard library.

In [40]: l=[[1, '2008-12-01'],
   ....:  [1, '2008-12-01'],
   ....:  [2, '2008-12-01'],
   ....:  [1, '2008-12-01'],
   ....:  [3, '3008-12-04']]

In [41]: l.sort()

In [42]: l
Out[42]: 
[[1, '2008-12-01'],
 [1, '2008-12-01'],
 [1, '2008-12-01'],
 [2, '2008-12-01'],
 [3, '3008-12-04']]

In [43]: for key, group in groupby(l, lambda x: x[0]):
   ....:     group=list(group)
   ....:     print key,' :: ', len(group), ' :: ', group
   ....:     
   ....:     
1  ::  3  ::  [[1, '2008-12-01'], [1, '2008-12-01'], [1, '2008-12-01']]
2  ::  1  ::  [[2, '2008-12-01']]
3  ::  1  ::  [[3, '3008-12-04']]

user::number of visits :: visit dates

Here the user -1 visits on 2008-12-01 3 times, if you are looking to count only distinct dates then

for key, group in groupby(l, lambda x: x[0]):
    group=list(group)
    print key,' :: ', len(set([(lambda y: y[1])(each) for each  in group])), ' :: ', group
   ....:     
   ....:     
1  ::  1  ::  [[1, '2008-12-01'], [1, '2008-12-01'], [1, '2008-12-01']]
2  ::  1  ::  [[2, '2008-12-01']]
3  ::  1  ::  [[3, '3008-12-04']]
JV
+1  A: 

Rewriting S.Lott's answer in SQL as an exercise, just to check that I got the requirements right...

SELECT * FROM someList;

 userid |    date    
--------+------------
      1 | 2008-12-01
      1 | 2008-12-02
      1 | 2008-12-03
      1 | 2008-12-04
      1 | 2008-12-05
      2 | 2008-12-03
      2 | 2008-12-04
      2 | 2008-12-05
      3 | 2008-12-04
      4 | 2008-12-04
      5 | 2008-12-05
      5 | 2008-12-05

SELECT countdates, COUNT(userid) AS nusers
FROM ( SELECT userid, COUNT (DISTINCT date) AS countdates
             FROM someList
             GROUP BY userid ) AS visits
GROUP BY countdates
HAVING countdates <= 25
ORDER BY countdates;

 countdates | nusers 
------------+--------
          1 |      3
          3 |      1
          5 |      1
Federico Ramponi
+1  A: 

This is probably not the most pythonic or efficient or smartest or whatever way of doing this. But maybe you can confirm if I've understood the requirements correctly:

>>> log=[[1, '2008-12-01'], [1, '2008-12-01'],[2, '2008-12-01'],[2, '2008-12-03'], [1, '2008-12-04'], [3, '2008-12-04'], [4, '2008-12-04']]
>>> all_dates = sorted(set([d for d in [x[1] for x in log]]))
>>> for i in range(0, len(all_dates)):
...     log_slice = [d for d in log if d[1] <= all_dates[i]]
...     num_users = len(set([u for u in [x[0] for x in log_slice]]))
...     print "%d users visited in %d days" % (num_users, i + 1)
... 
2 users visited in 1 days
2 users visited in 2 days
4 users visited in 3 days
>>>
PEZ