Assuming I have a list of shifts for an event (in the format start date/time, end date/time), is there some sort of algorithm I could use to create a generalized summary of the schedule? It is quite common for most of the shifts to fall into some sort of common recurrence pattern (e.g. Mondays from 9:00 am to 1:00 pm, Tuesdays from 10:00 am to 3:00 pm, etc.). However, there can (and will) be exceptions to this rule (e.g. one of the shifts fell on a holiday and was rescheduled for the next day). It would be fine to exclude those from my "summary", as I'm looking to provide a more general answer about when this event usually occurs.

I guess I'm looking for some sort of statistical method to determine the day and time occurrences and create a description based on the most frequent occurrences found in the list. Is there a general algorithm for something like this? Has anyone created something similar?

Ideally I'm looking for a solution in C# or VB.NET, but don't mind porting from any other language.

Thanks in advance!

A: 

Could we see an example data set? If it is really "clean" data then you could simply find the mode of the start and end times.

Peter Hanneman
Just to clarify: there is no real "example" set; the start and end dates/times will always be valid (although I'm guessing this isn't what you're asking), and the set could be any arbitrary list of dates and times. For smaller sets (say, fewer than 10) I'll probably just display the occurrences themselves, but I'm looking to generalize large sets with this method.
DanP
No, I was inquiring to see whether there was true repetition of times. But to further clarify: you have a file containing the shift times for a single event, and you just want to output the most frequently occurring shift start and end points across what time duration? A week? Do you want to "roll up" the results for each day individually, so that an event might have 3 days of the week for which we find the most frequently occurring start and end times?
Peter Hanneman
I suppose the range would be arbitrary; probably from the first current (or future) date/time occurrence to the last one in the list. And to further clarify, I'm interested in the day of week, start time and end time for grouping.
DanP
+1  A: 

Hi,

I don't think any ready-made algorithm exists, so unfortunately you need to come up with something yourself. Because the problem is not really well defined from a mathematical perspective, it will require testing on some "real" data that is reasonably representative, and a fair bit of tweaking.

I would start by dividing your shifts into weekdays (because, if I understand correctly, you are after a weekly view), so that for each weekday we have the shifts that happen on that day. Then, for each day, I would group the shifts that happen at (roughly) the same time; here you need to come up with some heuristic, e.g. both start and end times deviate from the group average by no more than 15 or 30 minutes. Now we need another heuristic to decide whether a group is relevant: if a 1pm-3pm shift on a Monday happened only once, it is probably not relevant, but if it happened on at least 70% of the Mondays covered by the data, then it is. The relevant groups for each day of the week then form the schedule you are after.
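A rough sketch of this heuristic in Python (the 30-minute tolerance, the 70% threshold, and all names are illustrative; shifts are assumed to be (start, end) datetime pairs):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def minutes(dt):
    return dt.hour * 60 + dt.minute

def summarize(shifts, tolerance=30, relevance=0.7):
    # shifts: list of (start, end) datetime pairs;
    # returns (weekday, mean start, mean end) tuples, times in minutes-of-day
    first = min(start for start, end in shifts)
    last = max(start for start, end in shifts)
    weeks = (last - first).days // 7 + 1           # weeks spanned by the data

    by_day = defaultdict(list)
    for start, end in shifts:
        by_day[start.weekday()].append((minutes(start), minutes(end)))

    summary = []
    for day, times in by_day.items():
        groups = []                                # lists of similar (start, end) pairs
        for s, e in times:
            for g in groups:
                mean_s = sum(x[0] for x in g) / len(g)
                mean_e = sum(x[1] for x in g) / len(g)
                if abs(s - mean_s) <= tolerance and abs(e - mean_e) <= tolerance:
                    g.append((s, e))
                    break
            else:
                groups.append([(s, e)])
        for g in groups:
            # a group is "relevant" if it occurred on at least 70% of the weeks
            if len(g) >= relevance * weeks:
                summary.append((day,
                                round(sum(x[0] for x in g) / len(g)),
                                round(sum(x[1] for x in g) / len(g))))
    return sorted(summary)
```

A one-off rescheduled shift falls into a group of size 1 and is filtered out by the relevance check.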

Grzenio
Yes, this is pretty much what I'm thinking I'll have to do...more or less, I'll need to group by day of week / start time / end time and determine if the number of items in a given "group" is relevant given the overall number of items.
DanP
A: 

One option would be to label all the start times as +1 and the end times as -1, then create a three-column table of times (both starts and ends), labels (+1 or -1), and the number of staff at that time (start at zero and add or subtract staff using the label), and sort the whole thing in time order.

This time series now is a summary descriptor of your staff levels and the labels are also a series as well. Now you can apply time series statistics to both to look for daily, weekly or monthly patterns.
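A minimal Python sketch of this bookkeeping (the function name is illustrative; times can be datetimes or any comparable values):

```python
def staff_levels(shifts):
    # shifts: list of (start, end) pairs
    events = [(start, +1) for start, _ in shifts] + [(end, -1) for _, end in shifts]
    events.sort()                              # whole table in time order
    table, staff = [], 0
    for time, label in events:
        staff += label                         # +1 on a start, -1 on an end
        table.append((time, label, staff))     # time, label, staff level
    return table
```

The third column is a step function of staffing over time, which you can then resample and feed to time-series methods.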

Grembo
+7  A: 

You may use Cluster Analysis.

Clustering is a way to segregate a set of data into similar components (subsets). The "similarity" concept involves some definition of "distance" between points. Many standard distance formulas exist, among them the usual Euclidean distance.
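For instance, if each shift is encoded as a point (day, start hour, end hour), a minimal Euclidean distance looks like this in Python (a toy illustration):

```python
def euclidean(p, q):
    # p, q: equal-length tuples, e.g. (day, start hour, end hour)
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```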

Practical Case

Before pointing you to the quirks of the trade, let's show a practical case for your problem, so you may get involved in the algorithms and packages, or discard them upfront.

For simplicity, I modelled the problem in Mathematica, because Cluster Analysis is included in the software and very straightforward to set up.

First, generate the data. The format is { DAY, START TIME, END TIME }.
A random offset (+half hour, zero, or -half hour) is added to the start and end times to show the capability of the algorithm to cope with "noise".

There are three days, three shifts per day and one extra (the last one) "anomalous" shift, which starts at 7 AM and ends at 9 AM (poor guys!).

There are 150 events in each "normal" shift and only two in the exceptional one.

As you can see, some shifts are not very far apart from each other.

I include the code in Mathematica, in case you have access to the software. I'm trying to avoid using the functional syntax, to make the code easier to read for "foreigners". BTW, English is not my first language, so please forgive the errors and mistakes.

Here is the data generation code:

Rn[] := 0.5 * RandomInteger[{-1, 1}];

monshft1 = Table[{ 1 , 10 + Rn[] , 15 + Rn[] }, {150}];  (* 1 *)
monshft2 = Table[{ 1 , 12 + Rn[] , 17 + Rn[] }, {150}];  (* 2 *)
wedshft1 = Table[{ 3 , 10 + Rn[] , 15 + Rn[] }, {150}];  (* 3 *)
wedshft2 = Table[{ 3 , 14 + Rn[] , 17 + Rn[] }, {150}];  (* 4 *)
frishft1 = Table[{ 5 , 10 + Rn[] , 15 + Rn[] }, {150}];  (* 5 *)
frishft2 = Table[{ 5 , 11 + Rn[] , 15 + Rn[] }, {150}];  (* 6 *)
monexcp  = Table[{ 1 , 7  + Rn[] , 9  + Rn[] }, {2}];    (* 7 *)

Now we join the data, obtaining one big dataset:

data = Join[monshft1, monshft2, wedshft1, wedshft2, frishft1, frishft2, monexcp];

Let's run a cluster analysis for the data:

clusters = FindClusters[data, 7, Method->{"Agglomerate","Linkage"->"Complete"}]

"Agglomerate" and "Linkage" -> "Complete" are two fine-tuning options of the clustering methods implemented in Mathematica; they just specify that we are trying to find very compact clusters.

I specified 7 clusters to detect. If the right number of shifts is unknown, you can try several reasonable values and compare the results, or let the algorithm select the most appropriate value.

We can get a chart of the results, with each cluster in a different color (don't mind the code):

ListPointPlot3D[ clusters, 
           PlotStyle->{{PointSize[Large], Pink},    {PointSize[Large], Green},   
                       {PointSize[Large], Yellow},  {PointSize[Large], Red},  
                       {PointSize[Large], Black},   {PointSize[Large], Blue},   
                       {PointSize[Large], Purple},  {PointSize[Large], Brown}},  
                       AxesLabel -> {"DAY", "START TIME", "END TIME"}]  

And the result is:

(figure: 3D scatter plot of the data, one color per cluster, axes DAY / START TIME / END TIME)

Where you can see our seven clusters clearly apart.

That solves part of your problem: identifying the data. Now you also want to be able to label it.

So, we'll get each cluster and take means (rounded):

Table[Round[Mean[clusters[[i]]]], {i, 7}]  

The result is:

Day   Start  End
{"1", "10", "15"},
{"1", "12", "17"},
{"3", "10", "15"},
{"3", "14", "17"},
{"5", "10", "15"},
{"5", "11", "15"},
{"1",  "7",  "9"}

And with that you get your seven classes again.

Now, perhaps you want to classify the shifts regardless of the day. If the same people do the same task at the same time every day, it's not very useful to call it "the Monday shift from 10 to 15", because it also happens on Wednesdays and Fridays (as in our example).

Let's analyze the data disregarding the first column:

clusters=
 FindClusters[Take[data, All, -2],Method->{"Agglomerate","Linkage"->"Complete"}];

In this case, we are not selecting the number of clusters to retrieve, leaving the decision to the package.

The result is

(figure: scatter plot of start/end times, one color per each of the five clusters)

You can see that five clusters have been identified.

Let's try to "label" them as before:

Grid[Table[Round[Mean[clusters[[i]]]], {i, 5}]]

The result is:

 START  END
{"10", "15"},
{"12", "17"},
{"14", "17"},
{"11", "15"},
{ "7",  "9"}

Which is exactly what we "suspected": there are events repeated at the same time each day that could be grouped together.
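For readers without Mathematica, the same analysis can be sketched with SciPy's hierarchical clustering; this mirrors the data above, with `complete` linkage playing the role of the "Agglomerate"/"Complete" options (a sketch, not the original code):

```python
import random
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

random.seed(1)

def rn():
    return 0.5 * random.choice([-1, 0, 1])    # +-half hour of noise

# {DAY, START, END} triples mirroring the Mathematica data;
# the two anomalous 7 AM - 9 AM shifts come last
data = []
for day, start, end, n in [(1, 10, 15, 150), (1, 12, 17, 150),
                           (3, 10, 15, 150), (3, 14, 17, 150),
                           (5, 10, 15, 150), (5, 11, 15, 150),
                           (1, 7, 9, 2)]:
    data += [[day, start + rn(), end + rn()] for _ in range(n)]

X = np.array(data)
Z = linkage(X, method="complete")                  # agglomerative, complete linkage
labels = fcluster(Z, t=7, criterion="maxclust")    # ask for (at most) 7 clusters

# label each cluster with its rounded mean, as in the Mathematica version
centers = sorted(tuple(int(v) for v in np.round(X[labels == k].mean(axis=0)))
                 for k in set(labels))
print(centers)
```

Because the Friday shifts overlap once the noise is added, the recovered clusters may not match the "true" shifts exactly; the clearly separated anomalous shift, however, reliably ends up in its own cluster.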

Edit: Overnight Shifts and Normalization

If you have (or plan to have) shifts that start on one day and end on the following one, it's better to model them as

{Start-Day Start-Hour Length}  // Correct!

than

{Start-Day Start-Hour End-Day End-Hour}  // Incorrect!  

That's because, as with any statistical method, correlation between the attributes must be dealt with explicitly, or the method fails miserably: the end hour is strongly correlated with the start hour, while the shift length is nearly independent of it. The principle could run something like "keep your candidate data normalized"; both concepts are almost the same (the attributes should be independent).

--- Edit end ---

By now I guess you understand pretty well what kind of things you can do with this kind of analysis.

Some references

  1. Of course, Wikipedia; its "References" and "Further reading" sections are a good guide.
  2. A nice video here showing the capabilities of Statsoft; you can get many ideas there about other things you can do with the algorithm.
  3. Here is a basic explanation of the algorithms involved.
  4. Here you can find the impressive Cluster Analysis functionality of R (R is a VERY good option).
  5. Finally, here you can find a long list of free and commercial software for statistics in general, including clustering.

HTH!

belisarius
@belisarius - thank you for the comprehensive answer; this is the type of guidance I was looking for!
DanP
@DanP Glad to see you appreciate the work. One last comment: use an already-developed package/library; avoid developing your own if possible. There are a lot of numerical quirks and tricks involved in the math.
belisarius
@belisarius - Do you have a suggestion of a library for .NET? Also, would you be willing to provide some additional pseudo code that showed your algorithm applied directly to my case?
DanP
@DanP I suggest using R, which is robust and can communicate quite well in the MS environment. See http://www.omegahat.org/ for options for using R from outside. If you can drop a text file with a sample of your data (something like DAY, START TIME, END TIME) somewhere in the web, with a few lines explaining it, perhaps I could try to run the clustering and see what happens. There is no more pseudocode than the one that invokes the library, the rest is for coping with the specific problems in your data (and I don't even suspect what they are!)
belisarius
@belisarius - again, thanks for your hard work on this one. I think I have enough of a reference point to start with, so no need to bother you further on this one. I will make sure you receive (a WELL deserved) bounty for this answer...
DanP
Thanks for the great answer belisarius!
Patricia
@belisarius; thank you for your additional edit; you're correct in your assumption about shifts spanning days!
DanP