Say I have a list of datetimes, and we know each datetime to be the recorded time of an event happening.

Is it possible in matplotlib to graph the frequency of this event occurring over time, showing this data in a cumulative graph (so that each point is greater than or equal to all of the points that went before it), without preprocessing this list? (e.g. passing datetime objects directly to some wonderful matplotlib function)

Or do I need to turn this list of datetimes into a list of dictionary items, such as:

{"year": 1998, "month": 12, "date": 15, "events": 92}

and then generate a graph from this list?

Sorry if this seems like a silly question - I'm not all too familiar with matplotlib, and would like to save myself the effort of doing this the latter way if matplotlib can already deal with datetime objects itself.

A: 

I just use ChartDirector from Advanced Software Engineering. It's really easy to work with, especially with dates. They have lots of examples in Python, too.

Khorkrak
It's kind of expensive, though, and I can't imagine it being much easier than Python. (Well, I guess easy is subjective, so that's just my opinion)
David Zaslavsky
+2  A: 

This should work for you:

counts = arange(0, len(list_of_dates))
plot(list_of_dates, counts)

You can of course give any of the usual options to the plot call to make the graph look the way you want it. (I'll point out that matplotlib is very adept at handling dates and times.)
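To make this concrete, here is a minimal sketch, assuming you want one point per event and that the list may arrive unsorted. The helper names `cumulative_event_counts` and `plot_cumulative` are made up for illustration. Counting starts at 1 here (the snippet above starts at 0) so that the last value equals the total number of events.

```python
from datetime import datetime


def cumulative_event_counts(datetimes):
    """Return (sorted_times, counts), where counts[i] is the number of
    events that have occurred up to and including sorted_times[i]."""
    sorted_times = sorted(datetimes)
    counts = list(range(1, len(sorted_times) + 1))
    return sorted_times, counts


def plot_cumulative(times, counts):
    """Draw the cumulative count as a step curve (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so the helper above stays dependency-free

    fig, ax = plt.subplots()
    ax.step(times, counts, where="post")
    fig.autofmt_xdate()  # slant the date labels so they do not overlap
    plt.show()


events = [
    datetime(2010, 6, 14, 9, 30),
    datetime(1974, 2, 8, 12, 0),
    datetime(1974, 2, 8, 18, 45),
]
times, counts = cumulative_event_counts(events)
```

Calling `plot_cumulative(times, counts)` should then show the curve. Sorting matters: with an unsorted list the x values jump back and forth and the line is no longer monotonic in time.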

Another option would be the hist function: its 'cumulative=True' option might be useful. You can create a cumulative histogram showing the number of events that have occurred as of any given date, with something like this:

from matplotlib.pyplot import hist
from matplotlib.dates import date2num
hist(date2num(list_of_dates), cumulative=True)

But this produces a bar chart, which might not be quite what you're looking for, and in any case making the date labels on the horizontal axis display properly will probably require some fudging.

EDIT: I'm getting the sense that what you really want is one point (or bar) per date, with the corresponding y-value being the number of events that have occurred up to (and including?) that date. In that case, I'd suggest doing something like this:

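import itertools
from numpy import cumsum
# list_of_dates must be sorted, because groupby only merges adjacent equal keys
grouped_dates = [(d, len(list(g))) for d, g in itertools.groupby(sorted(list_of_dates), lambda k: k.date())]
dates, counts = zip(*grouped_dates)
counts = cumsum(counts)
step(dates, counts)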

The groupby function from the itertools module will produce the kind of data you're looking for: only a single instance of each date, accompanied by a list (an iterator, actually) of all the datetime objects that have that date. As suggested by Jouni in the comments, the step function will give a graph that steps up at each day on which events occurred, so I'd suggest using that in place of plot.

(Hat tip to EOL for reminding me about cumsum)
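For readers who want to check the grouping step on its own, here is a small sketch that uses only the standard library. The function name `daily_cumulative_counts` and the sample data are assumptions for illustration; `itertools.accumulate` plays the role of `cumsum` here.

```python
from datetime import datetime
from itertools import accumulate, groupby


def daily_cumulative_counts(datetimes):
    """Return (dates, running_totals) with one entry per calendar day
    on which at least one event occurred."""
    days = sorted(dt.date() for dt in datetimes)  # groupby needs sorted input
    per_day = [(day, sum(1 for _ in group)) for day, group in groupby(days)]
    dates = [day for day, _ in per_day]
    running_totals = list(accumulate(count for _, count in per_day))
    return dates, running_totals


events = [
    datetime(2010, 6, 14, 9, 30),
    datetime(1974, 2, 8, 12, 0),
    datetime(1974, 2, 8, 18, 45),
]
dates, totals = daily_cumulative_counts(events)
print(dates)   # [datetime.date(1974, 2, 8), datetime.date(2010, 6, 14)]
print(totals)  # [2, 3]
```

The two returned lists can be passed straight to `step(dates, totals)`.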

If you want to have one point for every day, regardless of whether any events occurred on that day or not, you'll need to alter the above code a bit:

import itertools
from datetime import timedelta
from numpy import asarray
from matplotlib.dates import drange, num2date
# Sort first: groupby only merges adjacent equal keys
date_dict = dict((d, len(list(g))) for d, g in itertools.groupby(sorted(list_of_dates), lambda k: k.date()))
dates = num2date(drange(min(list_of_dates).date(), max(list_of_dates).date() + timedelta(1), timedelta(1)))
counts = asarray([date_dict.get(d.date(), 0) for d in dates]).cumsum()
step(dates, counts)

I don't think it'll really make a difference for the plot produced by the step function though.
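If you would rather build the every-day series without `drange`, the following is one possible sketch using only the standard library. The name `dense_daily_cumulative` is hypothetical, and the function assumes a non-empty list (`min` raises `ValueError` otherwise).

```python
from collections import Counter
from datetime import datetime, timedelta


def dense_daily_cumulative(datetimes):
    """Return (days, running_totals) covering every calendar day from the
    first event to the last, including days on which nothing happened."""
    per_day = Counter(dt.date() for dt in datetimes)
    first, last = min(per_day), max(per_day)
    days, running_totals, total = [], [], 0
    day = first
    while day <= last:
        total += per_day[day]  # a Counter returns 0 for missing days
        days.append(day)
        running_totals.append(total)
        day += timedelta(days=1)
    return days, running_totals


events = [
    datetime(2010, 6, 14, 9, 30),
    datetime(2010, 6, 14, 17, 5),
    datetime(2010, 6, 17, 8, 0),
]
days, totals = dense_daily_cumulative(events)
print(totals)  # [2, 2, 2, 3]
```

A quick sanity check on the result is that `totals[-1] == len(events)`: the last running total must equal the number of recorded events.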

David Zaslavsky
This arange() method doesn't seem to take account of the number of times an event happens on one day, though.. I seem to just get a range of ascending numbers. e.g. http://pastebin.ca/1882575 Perhaps I didn't ask my original question in the most clear way..
ventolin
@ventolin: `arange()` is the same thing as Python's builtin `range()` except that it returns a NumPy array instead of a Python list. It's not supposed to take into account anything about your events. Your wording of the question implies that the list contains one `datetime` object for each occurrence of the event, and I inferred that you would want one point on the graph for each event. If that's not the case, please clarify and I can adjust my answer accordingly.
David Zaslavsky
+1 for matplotlib's `date2num`, and the `cumulative` option in `hist`.
EOL
@David: Aha, sorry I was unclear. The dictionary representation I mentioned might clear things up: What I need is a graph of the number of events on the Y-axis, and time (in regular daily intervals) on the X-axis. 50 events might happen on one day, 2 on the next, and so on, and I need a cumulative graph over time of these events. Reading EOL's response now...
ventolin
Try "step" instead of "plot"?
Jouni K. Seppänen
Using the very last portion of code you pasted above (with one slight modification - I added an extra ')' after 'list(g))', since that line was missing a close-bracket and it seemed like the most sensible place for it to go), I get the following for a list of maybe 200,000 datetime objects: str(dates): [ 732729., 732730., 732731., ..., 733935., 733936., 733937.]) and str(counts): [0 0 0 ..., 0 0 0] . The graph produced reflects this data (simply a horizontal line at 0.0). Any ideas off the top of your head? I don't want to badger you about this, sorry for the incessant questions!
ventolin
OK, I fixed it up, try the new version. Turns out I didn't read the documentation carefully enough; I thought `drange` returned an array of `date` objects but it actually returns floats which need to be converted using `num2date`.
David Zaslavsky
That looks great :) Quite encouraging.. Thanks so much. Just one question - shouldn't the following return True? counts[len(counts)-1] == len(list_of_dates)
ventolin
Yes, it should... I forgot to add `timedelta(1)` to the ending argument of `drange` to get an inclusive range. Try the updated version. (Also, in Python we say `counts[-1]` instead of `counts[len(counts)-1]`)
David Zaslavsky
+1  A: 

So, you start with a list of dates that you want to histogram:

from datetime import datetime
list_of_datetime_datetime_objects = [datetime(2010, 6, 14), datetime(1974, 2, 8), datetime(1974, 2, 8)]

Matplotlib allows you to convert a datetime.datetime object into a simple number, as David mentioned:

from matplotlib.dates import date2num, num2date
num_dates = [date2num(d) for d in list_of_datetime_datetime_objects]

You can then calculate the histogram of your data:

import numpy
histo = numpy.histogram(num_dates)  # Look at the doc for more options (number of bins, etc.)

Since you want the cumulative histogram, you add individual counts together:

cumulative_histo_counts = histo[0].cumsum()

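The histogram plot will need the bin size:

bin_size = histo[1][1]-histo[1][0]

You can then plot the cumulative histogram (align='edge' makes each bar start at its bin's left edge; recent matplotlib versions center bars by default):

from matplotlib import pyplot
pyplot.bar(histo[1][:-1], cumulative_histo_counts, width=bin_size, align='edge')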

Alternatively, you might want a curve instead of a histogram:

# pyplot.plot(histo[1][1:], cumulative_histo_counts)

If you want dates on the x axis instead of numbers, you can convert the numbers back to dates and ask matplotlib to use date strings as ticks, instead of numbers:

from matplotlib import ticker

# The format for the x axis is set to the chosen string, as defined from a numerical date:
pyplot.gca().xaxis.set_major_formatter(ticker.FuncFormatter(lambda numdate, _: num2date(numdate).strftime('%Y-%d-%m')))
# The formatting proper is done:
pyplot.gcf().autofmt_xdate()
# To show the result:
pyplot.show()  # or draw(), if you don't want to block

Here, gca() and gcf() return the current axes and figure, respectively.

Of course, you can adapt the way you display dates, in the call to strftime() above.
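As a small illustration of adapting the format, note that the string used above, '%Y-%d-%m', prints year-day-month. The sketch below compares it with the ISO order. The helper `use_iso_date_ticks` is a made-up name; it relies on matplotlib's `DateFormatter` class as an alternative to the `FuncFormatter` lambda, and assumes the x values are matplotlib date numbers.

```python
from datetime import datetime

sample = datetime(2010, 6, 14)

# The format string used in the answer: year-day-month.
print(sample.strftime('%Y-%d-%m'))  # 2010-14-06
# ISO 8601 order: year-month-day.
print(sample.strftime('%Y-%m-%d'))  # 2010-06-14
# Abbreviated month name; the text depends on the current locale.
short_label = sample.strftime('%d %b %Y')


def use_iso_date_ticks(ax):
    """Label the x axis of a matplotlib Axes with ISO dates (requires matplotlib)."""
    from matplotlib.dates import DateFormatter

    ax.xaxis.set_major_formatter(DateFormatter('%Y-%m-%d'))
```

With the plot from this answer, that would be called as `use_iso_date_ticks(pyplot.gca())` before `pyplot.show()`.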

To go beyond your question, I would like to mention that Matplotlib's gallery is a very good source of information: you can generally quickly find what you need by just finding images that look like what you're trying to do, and looking at their source code.

EOL
Trying this, I get http://paste.pocoo.org/show/225396/ ... Is this because after processing, the number of points on the X-axis isn't the same as the number on the Y-axis? Or am I wildly off track?
ventolin
You're right. I updated the code in the answer and it works on my machine.
EOL