views:

162

answers:

7

Hi:

Thanks for the answers, I have not used StackOverflow before so I was suprised by the number of answers and the speed of them - its fantastic.

I have not been through the answers properly yet, but thought I should add some information to the problem specification. See the image below.

I can't post an image in this because i don't have enough points but you can see an image at http://journal.acquitane.com/2010-01-20/image003.jpg

This image may describe more closely what I'm trying to achieve. So you can see on the horizontal lines across the page are price points on the chart. Now where you get a clustering of lines within 0.5% of each, this is considered to be a good thing and why I want to identify those clusters automatically. You can see on the chart that there is a cluster at S2 & MR1, R2 & WPP1.

So everyday I produce these price points and then I can identify manually those that are within 0.5%. - but the purpose of this question is how to do it with a python routine.

I have reproduced the list again (see below) with labels. Just be aware that the list price points don't match the price points in the image because they are from two different days.

[YR3,175.24,8] [SR3,147.85,6] [YR2,144.13,8] [SR2,130.44,6] [YR1,127.79,8] [QR3,127.42,5] [SR1,120.94,6] [QR2,120.22,5] [MR3,118.10,3] [WR3,116.73,2] [DR3,116.23,1] [WR2,115.93,2] [QR1,115.83,5] [MR2,115.56,3] [DR2,115.53,1] [WR1,114.79,2] [DR1,114.59,1] [WPP,113.99,2] [DPP,113.89,1] [MR1,113.50,3] [DS1,112.95,1] [WS1,112.85,2] [DS2,112.25,1] [WS2,112.05,2] [DS3,111.31,1] [MPP,110.97,3] [WS3,110.91,2] [50MA,110.87,4] [MS1,108.91,3] [QPP,108.64,5] [MS2,106.37,3] [MS3,104.31,3] [QS1,104.25,5] [SPP,103.53,6] [200MA,99.42,7] [QS2,97.05,5] [YPP,96.68,8] [SS1,94.03,6] [QS3,92.66,5] [YS1,80.34,8] [SS2,76.62,6] [SS3,67.12,6] [YS2,49.23,8] [YS3,32.89,8]

I did make a mistake with the original list in that Group C is wrong and should not be included. Thanks for pointing that out.

Also the 0.5% is not fixed this value will change from day to day, but I have just used 0.5% as an example for spec'ing the problem.

Thanks Again. Mark

PS. I will get cracking on checking the answers now now.

Hi:

I need to do some manipulation of stock prices. I have just started using Python, (but I think I would have trouble implementing this in any language). I'm looking for some ideas on how to implement this nicely in python.

Thanks Mark

Problem: I have a list of lists (FloorLevels (see below)) where the sublist has two items (stockprice, weight). I want to put the stockprices into groups when they are within 0.5% of each other. A groups strength will be determined by its total weight. For example:

Group-A
115.93,2
115.83,5
115.56,3
115.53,1
-------------
TotalWeight:12
-------------
Group-B
113.50,3
112.95,1
112.85,2
-------------
TotalWeight:6
-------------    

FloorLevels[   
[175.24,8]
[147.85,6]
[144.13,8]
[130.44,6]
[127.79,8]
[127.42,5]
[120.94,6]
[120.22,5]
[118.10,3]
[116.73,2]
[116.23,1]
[115.93,2]
[115.83,5]
[115.56,3]
[115.53,1]
[114.79,2]
[114.59,1]
[113.99,2]
[113.89,1]
[113.50,3]
[112.95,1]
[112.85,2]
[112.25,1]
[112.05,2]
[111.31,1]
[110.97,3]
[110.91,2]
[110.87,4]
[108.91,3]
[108.64,5]
[106.37,3]
[104.31,3]
[104.25,5]
[103.53,6]
[99.42,7]
[97.05,5]
[96.68,8]
[94.03,6]
[92.66,5]
[80.34,8]
[76.62,6]
[67.12,6]
[49.23,8]
[32.89,8]
]
+2  A: 

A stock s belong in a group G if for each stock t in G, s * 1.05 >= t and s / 1.05 <= t, right?

How do we add the stocks to each group? If we have the stocks 95, 100, 101, and 105, and we start a group with 100, then add 101, we will end up with {100, 101, 105}. If we did 95 after 100, we'd end up with {100, 95}.

Do we just need to consider all possible permutations? If so, your algorithm is going to be inefficient.

Eli
Yep, seems to be a poorly defined problem.
Greg
@Eli: Thanks, need to produce a unknown number of groups where each member of the group is within 0.5% of each other.@Greg" Thanks, I agree, I can see what I want, but am having trouble actually spec'ing the problem in a readable manner.
mark
+2  A: 

You need to specify your problem in more detail. Just what does "put the stockprices into groups when they are within 0.5% of each other" mean?

Possibilities:

(1) each member of the group is within 0.5% of every other member of the group

(2) sort the list and split it where the gap is more than 0.5%

Note that 116.23 is within 0.5% of 115.93 -- abs((116.23 / 115.93 - 1) * 100) < 0.5 -- but you have put one number in Group A and one in Group C.

Simple example: a, b, c = (0.996, 1, 1.004) ... Note that a and b fit, b and c fit, but a and c don't fit. How do you want them grouped, and why? Is the order in the input list relevant?

Possibility (1) produces ab,c or a,bc ... tie-breaking rule, please
Possibility (2) produces abc (no big gaps, so only one group)

John Machin
@John Machin: John thanks. (1) is the case I'm after. Group C is wrong - please ignore it. This is where its a bit messy, a,b,c should be in a group even though a,c don't fit but they are linked by being within 0.5% of b.
mark
@mark: You say (1) is the case that you want (each member of the group is within 0.5% of **every** other member of the group) but then you say that a,b,c should be in a group: this is possibility (2) i.e. each group member is within 5% of **some** member(s) of the group -- please make up your mind and edit your question to show unambiguously what you want.
John Machin
A: 

For a given set of stock prices, there is probably more than one way to group stocks that are within 0.5% of each other. Without some additional rules for grouping the prices, there's no way to be sure an answer will do what you really want.

Chris Upchurch
@Chris, Thanks, it is a list of price points for a single stock. I don't have any additional rules - only the rule about the 0.5%. But anyway it seems that the problem falls into a clustering type problem which is why I'm leaning towards Alex Martellis solution above.
mark
+1  A: 

You won't be able to classify them into hard "groups". If you have prices (1.0,1.05, 1.1) then the first and second should be in the same group, and the second and third should be in the same group, but not the first and third.

A quick, dirty way to do something that you might find useful:

def make_group_function(tolerance = 0.05):
    from math import log10, floor
    # I forget why this works. 
    tolerance_factor = -1.0/(-log10(1.0 + tolerance))
    # well ... since you might ask
    # we want: log(x)*tf - log(x*(1+t))*tf = -1, 
    # so every 5% change has a different group. The minus is just so groups 
    # are ascending .. it looks a bit nicer.
    #
    # tf = -1/(log(x)-log(x*(1+t)))
    # tf = -1/(log(x/(x*(1+t))))
    # tf = -1/(log(1/(1*(1+t)))) # solved .. but let's just be more clever
    # tf = -1/(0-log(1*(1+t)))
    # tf = -1/(-log((1+t))
    def group_function(value):
        # don't just use int - it rounds up below zero, and down above zero
        return int(floor(log10(value)*tolerance_factor))
    return group_function

Usage:

group_function = make_group_function()
import random
groups = {}
for i in range(50):
    v = random.random()*500+1000
    group = group_function(v)
    if group in groups:
        groups[group].append(v)
    else:
        groups[group] = [v]

for group in sorted(groups):
    print 'Group',group
    for v in sorted(groups[group]):
        print v
    print
wisty
mark
+3  A: 

I suggest a repeated use of k-means clustering -- let's call it KMC for short. KMC is a simple and powerful clustering algorithm... but it needs to "be told" how many clusters, k, you're aiming for. You don't know that in advance (if I understand you correctly) -- you just want the smallest k such that no two items "clustered together" are more than X% apart from each other. So, start with k equal 1 -- everything bunched together, no clustering pass needed;-) -- and check the diameter of the cluster (a cluster's "diameter", from the use of the term in geometry, is the largest distance between any two members of a cluster).

If the diameter is > X%, set k += 1, perform KMC with k as the number of clusters, and repeat the check, iteratively.

In pseudo-code:

def markCluster(items, threshold):
    k = 1
    clusters = [items]
    maxdist = diameter(items)
    while maxdist > threshold:
        k += 1
        clusters = Kmc(items, k)
        maxdist = max(diameter(c) for c in clusters)
    return clusters

assuming of course we have suitable diameter and Kmc Python functions.

Does this sound like the kind of thing you want? If so, then we can move on to show you how to write diameter and Kmc (in pure Python if you have a relatively limited number of items to deal with, otherwise maybe by exploiting powerful third-party add-on frameworks such as numpy) -- but it's not worthwhile to go to such trouble if you actually want something pretty different, whence this check!-)

Alex Martelli
@Alex Martelli. Hi Alex: I like this, I thought this was a clustering problem and was looking at clustering algorithms before putting this question up. I think this will be a good fit can you expand on the Diameter and KMC functions please.
mark
@mark, net of optimization issues, `diameter(aset)` is just `max(distance(a,b) for a in aset for b in aset)`, for whatever `distance` you choose to define. K-means clustering OTOH requires more code than can be reasonably squeezed here, I'll be glad to expound on that if you ask in a separate question (but the wikipedia URL I point to is already pretty good, mind!).
Alex Martelli
OK, thanks again.
mark
A: 

Hi --

apart from the proper way to pick which values fit together, this is a problem where a little Object Orientation dropped in can make it a lot easier to deal with.

I made two classes here, with a minimum of desirable behaviors, but which can make the classification a lot easier -- you get a single point to play with it on the Group class.

I can see the code bellow is incorrect, in the sense the limtis for group inclusion varies as new members are added -- even it the separation crieteria remaisn teh same, you heva e torewrite the get_groups method to use a multi-pass approach. It should nto be hard -- but the code would be too long to be helpfull here, and i think this snipped is enoguh to get you going:

from copy import copy

class Group(object):
    def __init__(self,data=None, name=""):
        if data:
            self.data = data
        else:
            self.data = []
        self.name = name

    def get_mean_stock(self):
        return sum(item[0] for item in self.data) / len(self.data)

    def fits(self, item):
        if 0.995 < abs(item[0]) / self.get_mean_stock() < 1.005:
            return True
        return False

    def get_weight(self):
        return sum(item[1] for item in self.data)

    def __repr__(self):
        return "Group-%s\n%s\n---\nTotalWeight: %d\n\n" % (
            self.name,
            "\n".join("%.02f, %d" % tuple(item) for item in self.data ),
            self.get_weight())


class StockGrouper(object):
    def __init__(self, data=None):
        if data:
            self.floor_levels = data
        else:
            self.floor_levels = []

    def get_groups(self):
        groups = []
        floor_levels = copy(self.floor_levels)
        name_ord = ord("A") - 1
        while floor_levels:
            seed = floor_levels.pop(0)
            name_ord += 1
            group = Group([seed], chr(name_ord))
            groups.append(group)
            to_remove = []
            for i, item in enumerate(floor_levels):
                if group.fits(item):
                    group.data.append(item)
                    to_remove.append(i)
            for i in reversed(to_remove):
                floor_levels.pop(i)
        return groups

testing:

floor_levels = [  [stock. weight] ,... <paste the data above> ]
s = StockGrouper(floor_levels)
s.get_groups()
jsbueno
@jsbeuno, Thanks for your effort, at this point I'm leaning towards the clustering solution by Alex Martelli.
mark
A: 

For the grouping element, could you use itertools.groupby()? As the data is sorted, a lot of the work of grouping it is already done, and then you could test if the current value in the iteration was different to the last by <0.5%, and have itertools.groupby() break into a new group every time your function returned false.

@yorksranter: Thanks, I'm not familiar with itertools.groupby(). At this point, I'm thinking that Alex Martelli's solution using a clustering algorithm looks good.
mark