views: 208
answers: 3

Hi, I have what seems on the surface a simple problem which I wish to solve using Ruby. I have a bunch of colours with associated photo ids, e.g.

[[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

and I wish to process the data so that it is in this format:

2 photos for red,green
3 photos for red
1 photo for yellow

A few things to note:

  1. The photo/photos matching the most colours come first in the list; if the number of colours matched is the same (as for red and yellow above), then put the highest count first.

  2. The count for red is 3, as two photos have red and green, and a third has only red. I don't display a result for the colour green on its own, as all green photos are already accounted for by the entry for red and green.

  3. Ultimately, I only ever need to display the top 5 results, no matter how large the dataset is.
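
To make the requirements above concrete, here is a minimal sketch of the whole transform in plain Ruby (assuming Ruby 2.4+ for `Hash#sum`; a naive quadratic pass over the distinct colour sets, not tuned for the speeds asked about below):

```ruby
# Sketch of the transform described above (assumes Ruby 2.4+ for Hash#sum).
data = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

# one sorted colour set per photo id
sets = data.group_by { |id, _| id }.map { |_, pairs| pairs.map { |_, c| c }.sort.uniq }

# tally duplicates of each distinct colour set
counts = Hash.new(0)
sets.each { |s| counts[s] += 1 }

# a set's total also counts photos whose colours are a strict superset,
# which is why red ends up at 3 and green never appears on its own
totals = counts.map do |set, n|
  extra = counts.sum { |other, m| other != set && (set - other).empty? ? m : 0 }
  [set, n + extra]
end

# most colours first, then highest count, top 5 only
totals.sort_by { |set, n| [-set.size, -n] }.first(5).each do |set, n|
  puts "#{n} photo#{'s' if n > 1} for #{set.join(',')}"
end
```

Note that colours within a set print alphabetically (green,red rather than red,green), matching the behaviour of the full algorithm below.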

I have written an algorithm that achieves this goal (see below), but I would appreciate any guidance on how I can make it faster, and then more elegant. Speed is the primary concern, as I will be operating on lots of data (on the order of a million rows). Then, if possible, it would be nice to make it more elegant; I don't think I write elegant Ruby code, as I have a C++ background.

I am aware of embedding C and C++ code in Ruby for performance gains, but I would really like to achieve this using only Ruby.

Thanks very much

beginning = Time.now

ARR = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],[4,"green"],[4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]

# Group the colours by their id.
groups = ARR.group_by {|x| x[0]}

# output for profiling.
puts "After Group BY: #{Time.now - beginning} seconds."

# Remove the id's, as they are no longer useful. Sort the colours alphabetically.
sorted_groups = []
groups.each do |i,j|
  sorted_groups << j.map!{ |x|  x[1]}.sort
end

# Order the colours, so the group containing the most colours comes first.
# Do a secondary sort alphabetically, so that all identical groups are next to each other. 
sorted_groups_in_order = sorted_groups.sort_by { |s| [s.length,s] }.reverse

# Traverse the groups in order to find the index that marks the position of results_to_return unique groups.
# This is to make subsequent processing more efficient, as it will only operate on a smaller subset.
results_to_return = 5
temp = sorted_groups_in_order[0]
combination_count = 0
index = 0

sorted_groups_in_order.each do |e|
  combination_count += 1 if e != temp
  break if combination_count == results_to_return

  index += 1
  temp = e
end

# Iterate through the subset, and count the duplicates.
tags_with_count = Hash.new(0)
sorted_groups_in_order[0..index].each do |v|
  tags_with_count[v] += 1
end

# Sort by the number of colours in each subset, the most colours go first.
tags_with_count = tags_with_count.sort { |q,w| w[0].size <=> q[0].size }

# if colour subsets are found in colour supersets, then increment the subset count to reflect this.
tags_with_count.reverse.each_with_index do |object,index|
  tags_with_count.reverse.each_with_index do |object2,index2|
    if (index2 < index) && (object[0]&object2[0] == object2[0])
      object2[1] += object[1]
    end
  end
end

# Sort by the number of colours in each subset, the most colours go first.
# Perform a secondary sort by the count value.
tags_with_count = tags_with_count.sort_by { |s| [s[0].length,s[1]] }.reverse

# print our results.
tags_with_count.each do |l|
  puts l.inspect
end

# output for profiling.
puts "Time elapsed: #{Time.now - beginning} seconds."
+1  A: 
klochner
Many thanks for taking the time to reply; I hadn't come across Array#combination before. The results that your solution yields are not quite right: specifically, they don't address point 2 I make in the original question. I did some basic profiling of your solution, and for the example data given it actually works faster than my original algorithm by about 5ms. I then tested it on a data set of about 63,000 and yours took 6 seconds whereas mine took 0.3 seconds. I guess this is down to the relative efficiency of group_by compared to Array#combination. Thanks once again for your assistance.
Jon
For Ruby 1.8.6, you can use the 'backports' gem for all 1.8.7 features (and more) instead of the combination gem.
Marc-André Lafortune
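
A hedged sketch of that suggestion: guarding the require lets the same snippet run unchanged on 1.8.6 (with the gem installed) and on later Rubies, where the methods are built in.

```ruby
begin
  require 'backports'   # on Ruby 1.8.6, supplies 1.8.7+ methods (gem install backports)
rescue LoadError
  # on 1.8.7 and later these methods are already built in
end

pairs  = [1, 2, 3].combination(2).to_a                              # 1.8.7+ method
groups = [[1, "red"], [1, "green"], [2, "red"]].group_by { |id, _| id }  # 1.8.7+ method
```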
A: 

Requires 1.8.7+ for group_by

a = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

groups = a .
  group_by {|e|e[0]} .
  collect do |id, photos|
    [id, photos.inject([]){|all,(id,colour)| all << colour}.sort.uniq]
  end .
  group_by {|e|e[1]}

groups.each {|colours, list| groups[colours] = list.length}
h = Hash.new {|h,k| h[k]=[0,0]}

groups.each do |colours, count|
  colours.each do |colour|
    h[colour][0] += 1  # how many times a colour appears
    h[colour][1] += count  # how many photos the colour appears in
  end
end

h.each do |colour, (n,total)|
  groups.update({[colour] => total}) if n > 1
end

groups.each {|colours, count| puts "#{count} photos for #{colours.join ','}"}

outputs

2 photos for green,red
3 photos for red
1 photos for yellow
glenn jackman
Thanks for replying Glenn, your solution looks really elegant! But I will need to study it some more to get my head round what it is doing :-) I did some basic profiling, and it appears to be faster than mine in all cases, which is great. There is one point I foolishly forgot to make in my original question: I only ever return the top 5 results (when they are ordered by number of colours). It would be easy to just return the top 5 from your end result, but I think that if it is done as early as possible, it will reduce subsequent processing and speed the whole thing up even more!
Jon
@Jon, in this case, you can't know the top 5 results until the hash `h` is populated (to compute the totals for each colour). So that's not particularly early.
glenn jackman
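
Concretely (a sketch, assuming `groups` holds the final colour-set-to-count mapping produced at the end of the answer above), the cut to five can only happen once every total is known:

```ruby
# final mapping as produced at the end of the answer above
groups = { ["green", "red"] => 2, ["red"] => 3, ["yellow"] => 1 }

# only now can the list be trimmed: most colours first, then highest count
top5 = groups.sort_by { |colours, count| [-colours.size, -count] }.first(5)
top5.each { |colours, count| puts "#{count} photos for #{colours.join(',')}" }
```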
The top 5 results are the ones that have the most colours, not the ones that have the greatest count.
Jon
Your algorithm gives incorrect output on the new data set.
klochner
@klochner, you're right, I don't think this one is quite right.
Jon
+1  A: 
klochner
Thanks again klochner; it's interesting how your solution is faster for small data sets, but mine is faster for larger ones. I'm trying to dissect your answer to combine the best from yours with mine.
Jon
Can you give me an estimate of #colors, #photos, and colors per photo in the large set?
klochner
Hi klochner, sorry for not replying sooner; I only just noticed your additional comment. Here is the larger data set that I am using: http://dl.dropbox.com/u/2306276/63285.rb
Jon
I don't think yours is correct as specified, but I may still be confused as to exactly what you're trying to do. Try your algorithm on the dataset here: http://gist.github.com/309607 and tell me if you think your algorithm is correct. I *think* you're erring in (a) restricting to the top 5, even though multiple photos could tie for 5th largest, and (b) not including photos outside the top 5 when looking for colour matches.
klochner
I'll explain the purpose of the algorithm: if you go to eBay and search for "dog playstation plate pen", you'll get some results listed under "Retry your search with fewer keywords." This algorithm will allow me to do a similar thing. I see the advantages of your method, because if the number of colours is the same for two groups, then it sorts by the highest number of duplicates. This desired effect will come at a price, though, in terms of speed. I'll have a play with it some more and see if I can get your results with my speeds. Thanks
Jon
It sounds like you should be pre-compiling a frequency map for different tag combinations, and just updating the map when new items are added. No?
klochner
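
That suggestion might look something like this minimal sketch (the class and method names are hypothetical; it only maintains raw per-set counts incrementally and leaves the superset roll-up to query time):

```ruby
# Hypothetical incremental index for colour combinations: update counts
# as photos arrive rather than recomputing everything from scratch.
class ColourIndex
  def initialize
    @counts = Hash.new(0)
  end

  # record one photo's full colour set as it arrives
  def add(colours)
    @counts[colours.sort.uniq] += 1
  end

  # top n sets: most colours first, then highest count
  def top(n = 5)
    @counts.sort_by { |set, count| [-set.size, -count] }.first(n)
  end
end

index = ColourIndex.new
index.add(%w[red green])
index.add(%w[red])
index.add(%w[red green])
```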