views: 208
answers: 3

Hi, I have what seems on the surface a simple problem which I wish to solve using Ruby. I have a bunch of colours with associated photo ids, e.g.

[[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

and I wish to process the data so that it is in this format:

2 photos for red,green
3 photos for red
1 photo for yellow

A few things to note:

  1. The photo/photos matching the most colours come first in the list; if the number of colours matched is the same (as for red and yellow above), then put the highest count first.

  2. The count for red is 3, as two photos have red and green, and a third has only red. I don't display a result for the colour green on its own, as all green photos are already accounted for by the entry for red and green.

  3. Ultimately, I only ever need to display the top 5 results, no matter how large the dataset is.
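
To make the requirements above concrete, here is a minimal sketch of the whole transform in plain Ruby (assuming Ruby 2.4+ for `Hash#sum`; a naive quadratic pass over the distinct colour sets, not tuned for the speeds asked about below):

```ruby
# Sketch of the transform described above (assumes Ruby 2.4+ for Hash#sum).
data = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

# one sorted colour set per photo id
sets = data.group_by { |id, _| id }.map { |_, pairs| pairs.map { |_, c| c }.sort.uniq }

# tally duplicates of each distinct colour set
counts = Hash.new(0)
sets.each { |s| counts[s] += 1 }

# a set's total also counts photos whose colours are a strict superset,
# which is why red ends up at 3 and green never appears on its own
totals = counts.map do |set, n|
  extra = counts.sum { |other, m| other != set && (set - other).empty? ? m : 0 }
  [set, n + extra]
end

# most colours first, then highest count, top 5 only
totals.sort_by { |set, n| [-set.size, -n] }.first(5).each do |set, n|
  puts "#{n} photo#{'s' if n > 1} for #{set.join(',')}"
end
```

Note that colours within a set print alphabetically (green,red rather than red,green), matching the behaviour of the full algorithm below.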

I have written an algorithm that achieves this goal (see below), but I would appreciate any guidance on how I can make it faster, and then more elegant. Speed is the primary concern, as I will be operating on lots of data (on the order of a million rows). Then, if possible, it would be nice to make it more elegant; I don't think I write elegant Ruby code, as I have a C++ background.

I am aware of embedding C and C++ code in Ruby for performance gains, but I would really like to achieve this using only Ruby.

Thanks very much

beginning = Time.now

ARR = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],[4,"green"],[4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]

# Group the colours by their id.
groups = ARR.group_by {|x| x[0]}

# output for profiling.
puts "After Group BY: #{Time.now - beginning} seconds."

# Remove the id's, as they are no longer useful. Sort the colours alphabetically.
sorted_groups = []
groups.each do |i,j|
  sorted_groups << j.map!{ |x|  x[1]}.sort
end

# Order the colours, so the group containing the most colours comes first.
# Do a secondary sort alphabetically, so that all identical groups are next to each other. 
sorted_groups_in_order = sorted_groups.sort_by { |s| [s.length,s] }.reverse

# Traverse the groups in order to find the index that marks the position of results_to_return unique groups.
# This is to make subsequent processing more efficient, as it will only operate on a smaller subset.
results_to_return = 5
temp = sorted_groups_in_order[0]
combination_count = 0
index = 0

sorted_groups_in_order.each do |e|
  combination_count += 1 if e != temp
  break if combination_count == results_to_return

  index += 1
  temp = e
end

# Iterate through the subset, and count the duplicates.
tags_with_count = Hash.new(0)
sorted_groups_in_order[0..index].each do |v|
  tags_with_count[v] += 1
end

# Sort by the number of colours in each subset, the most colours go first.
tags_with_count = tags_with_count.sort { |q,w| w[0].size <=> q[0].size }

# if colour subsets are found in colour supersets, then increment the subset count to reflect this.
tags_with_count.reverse.each_with_index do |object,index|
  tags_with_count.reverse.each_with_index do |object2,index2|
    if (index2 < index) && (object[0]&object2[0] == object2[0])
      object2[1] += object[1]
    end
  end
end

# Sort by the number of colours in each subset, the most colours go first.
# Perform a secondary sort by the count value.
tags_with_count = tags_with_count.sort_by { |s| [s[0].length,s[1]] }.reverse

# print our results.
tags_with_count.each do |l|
  puts l.inspect
end

# output for profiling.
puts "Time elapsed: #{Time.now - beginning} seconds."
+1  A: 
klochner
Many thanks for taking the time to reply; I hadn't come across Array#combination before. The results that your solution yields are not quite right: specifically, they don't address point 2 I make in the original question. I did some basic profiling of your solution, and for the example data given it actually works faster than my original algorithm by about 5ms. I then tested it on a data set of about 63,000 and yours took 6 seconds whereas mine took 0.3 seconds. I guess this is down to the relative efficiency of group_by compared to Array#combination. Thanks once again for your assistance.
Jon
For Ruby 1.8.6, you can use the 'backports' gem for all 1.8.7 features (and more) instead of the combination gem.
Marc-André Lafortune
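
A hedged sketch of that suggestion: guarding the require lets the same snippet run unchanged on 1.8.6 (with the gem installed) and on later Rubies, where the methods are built in.

```ruby
begin
  require 'backports'   # on Ruby 1.8.6, supplies 1.8.7+ methods (gem install backports)
rescue LoadError
  # on 1.8.7 and later these methods are already built in
end

pairs  = [1, 2, 3].combination(2).to_a                              # 1.8.7+ method
groups = [[1, "red"], [1, "green"], [2, "red"]].group_by { |id, _| id }  # 1.8.7+ method
```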
A: 

Requires 1.8.7+ for group_by

a = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

groups = a .
  group_by {|e|e[0]} .
  collect do |id, photos|
    [id, photos.inject([]){|all,(id,colour)| all << colour}.sort.uniq]
  end .
  group_by {|e|e[1]}

groups.each {|colours, list| groups[colours] = list.length}
h = Hash.new {|h,k| h[k]=[0,0]}

groups.each do |colours, count|
  colours.each do |colour|
    h[colour][0] += 1  # how many times a colour appears
    h[colour][1] += count  # how many photos the colour appears in
  end
end

h.each do |colour, (n,total)|
  groups.update({[colour] => total}) if n > 1
end

groups.each {|colours, count| puts "#{count} photos for #{colours.join ','}"}

outputs

2 photos for green,red
3 photos for red
1 photos for yellow
glenn jackman
Thanks for replying Glenn, your solution looks really elegant! But I will need to study it some more to get my head round what it is doing :-) I did some basic profiling, and it appears to be faster than mine in all cases, which is great. There is one point I foolishly forgot to make in my original question: I only ever return the top 5 results (when they are ordered by number of colours). It would be easy to just return the top 5 from your end result, but I think that if it is done as early as possible, it will reduce subsequent processing and speed the whole thing up even more!
Jon
@Jon, in this case, you can't know the top 5 results until the hash `h` is populated (to compute the totals for each colour). So that's not particularly early.
glenn jackman
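
Concretely (a sketch, assuming `groups` holds the final colour-set-to-count mapping produced at the end of the answer above), the cut to five can only happen once every total is known:

```ruby
# final mapping as produced at the end of the answer above
groups = { ["green", "red"] => 2, ["red"] => 3, ["yellow"] => 1 }

# only now can the list be trimmed: most colours first, then highest count
top5 = groups.sort_by { |colours, count| [-colours.size, -count] }.first(5)
top5.each { |colours, count| puts "#{count} photos for #{colours.join(',')}" }
```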
The top 5 results are the ones that have the most colours, not the ones that have the greatest count.
Jon
Your algorithm gives incorrect output on the new data set.
klochner
@klochner, you're right, I don't think this one is quite right.
Jon
+1  A: 
klochner
Thanks again klochner; it's interesting how your solution is faster for small data sets, but mine is faster for larger ones. I'm trying to dissect your answer to combine the best from yours with mine.
Jon
Can you give me an estimate of #colors, #photos, and colors per photo in the large set?
klochner
Hi klochner, sorry for not replying sooner; I only just noticed your additional comment. Here is the larger data set that I am using: http://dl.dropbox.com/u/2306276/63285.rb
Jon
I don't think yours is correct as specified, but I may still be confused as to exactly what you're trying to do. Try your algorithm on the dataset here: http://gist.github.com/309607 and tell me if you think your algorithm is correct. I *think* you're erring in (a) restricting to the top 5, even though multiple photos could tie for 5th largest, and (b) not including photos outside the top 5 when looking for colour matches.
klochner
I'll explain the purpose of the algorithm: if you go to eBay and search for "dog playstation plate pen", you'll get some results listed under "Retry your search with fewer keywords." This algorithm will allow me to do a similar thing. I see the advantages of your method, because if the number of colours is the same for two groups, then it sorts by the highest number of duplicates. This desired effect will come at a price, though, in terms of speed. I'll have a play with it some more and see if I can get your results with my speeds. Thanks
Jon
It sounds like you should be pre-compiling a frequency map for different tag combinations, and just updating the map when new items are added. No?
klochner
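
That suggestion might look something like this minimal sketch (the class and method names are hypothetical; it only maintains raw per-set counts incrementally and leaves the superset roll-up to query time):

```ruby
# Hypothetical incremental index for colour combinations: update counts
# as photos arrive rather than recomputing everything from scratch.
class ColourIndex
  def initialize
    @counts = Hash.new(0)
  end

  # record one photo's full colour set as it arrives
  def add(colours)
    @counts[colours.sort.uniq] += 1
  end

  # top n sets: most colours first, then highest count
  def top(n = 5)
    @counts.sort_by { |set, count| [-set.size, -count] }.first(n)
  end
end

index = ColourIndex.new
index.add(%w[red green])
index.add(%w[red])
index.add(%w[red green])
```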