tags:

views:

597

answers:

7

I have a MySQL table with approximately 3000 rows per user. One of the columns is a datetime field, which is mutable, so the rows aren't in chronological order.

I'd like to visualize the time distribution in a chart, so I need a number of individual datapoints. 20 datapoints would be enough.

I could do this:

select timefield from entries where uid = ? order by timefield;

and look at every 150th row.

Or I could do 20 separate queries and use limit 1 and offset.

But there must be a more efficient solution...

UPDATE: still unanswered

+1  A: 

Something like this came to my mind

select @rownum:[email protected]+1 rownum, entries.* 
from (select @rownum:=0) r, entries 
where uid = ? and rownum % 150 = 0

I don't have MySQL at my hand but maybe this will help ...

Michal Sznajder
A: 

@Michal

For whatever reason, your example only works when the where @recnum uses a less than operator. I think when the where filters out a row, the rownum doesn't get incremented, and it can't match anything else.

If the original table has an auto incremented id column, and rows were inserted in chronological order, then this should work:

select timefield from entries
where uid = ? and id % 150 = 0 order by timefield;

Of course that doesn't work if there is no correlation between the id and the timefield, unless you don't actually care about getting evenly spaced timefields, just 20 random ones.

Ryan Ahearn
A: 

@Ryan

The rows are not in chronological order. The timefield is mutable - think of it as "last_update".

Michiel de Mare
A: 

Do you really care about the individual data points? Or will using the statistical aggregate functions on the day number instead suffice to tell you what you wish to know?

Scott Noyes
A: 
select
    timefield
from
    entries
where
    rand() = .01 --will return 1% of rows adjust as needed.

--not a mysql expert so I'm not sure how rand() operates in this environment.

jms
that should be "rand() < .01"
nickf
+2  A: 

Michal Sznajder almost had it, but you can't use column aliases in a WHERE clause in SQL. So you have to wrap it as a derived table. I tried this and it returns 20 rows:

SELECT * FROM (
    SELECT @rownum:[email protected]+1 AS rownum, e.*
    FROM (SELECT @rownum := 0) r, entries e) AS e2
WHERE uid = ? AND rownum % 150 = 0;
Bill Karwin
+1  A: 

As far as visualization, I know this is not the periodic sampling you are talking about, but I would look at all the rows for a user and choose an interval bucket, SUM within the buckets and show on a bar graph or similar. This would show a real "distribution", since many occurrences within a time frame may be significant.

SELECT DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket -- choose an appropriate granularity (days used here)
     ,COUNT(*)
FROM entries
WHERE uid = ?
GROUP BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
ORDER BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)

Or if you don't like the way you have to repeat yourself - or if you are playing with different buckets and want to analyze across many users in 3-D (measure in Z against x, y uid, bucket):

SELECT uid
    ,bucket
    ,COUNT(*) AS measure
FROM (
    SELECT uid
        ,DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket
    FROM entries
) AS buckets
GROUP BY uid
    ,bucket
ORDER BY uid
    ,bucket

If I wanted to plot in 3-D, I would probably determine a way to order users according to some meaningful overall metric for the user.

Cade Roux
can you do "GROUP BY bucket ORDER BY bucket"? that seems as though it would be much more efficient (not having to recalculate that column each time)
nickf
No, you cannot, however, the optimizer does not actually re-calculate those expressions, because it knows that the functions are deterministic.
Cade Roux