Hi, i have what seems on the surface a simple problem which i wish to solve using ruby, i have a bunch of colours with associated photo id's, e.g
[[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]
and i wish to process the data so that it is in this format:
2 photos for red,green
3 photos for red
1 photo for yellow
A few things to note:
The photo/photos matching the most colours are first in the list, if the number of colours matched are the same (as for red and yellow above) then put the highest count first.
The count for red is 3, as 2 photos have red and green, and a third has only red. I don't display a result for the colour green on its own, as all green photos are accounted for with the entry for red and green.
Ultimately, i only ever need to display the top 5 results, no matter how large the dataset is.
I have written an algorithm that achieves this goal (see below) but i would appreciate any guidance on how i can make it faster, and then more elegant. Speed is the primary concern, i will be operating on lots of data (orders of a million), then if possible if it could be made more elegant that would be nice - i don't think i write elegant ruby code, i have a c++ background.
I am aware of embedding c and c++ code in ruby for performance gains, but i would really like to achieve this only using ruby.
Thanks very much
beginning = Time.now
ARR = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],[4,"green"],[4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]
# Group the colours by their id.
groups = ARR.group_by {|x| x[0]}
# output for profiling.
puts "After Group BY: #{Time.now - beginning} seconds."
# Remove the id's, as they are no longer useful. Sort the colours alphabetically.
sorted_groups = []
groups.each do |i,j|
sorted_groups << j.map!{ |x| x[1]}.sort
end
# Order the colours, so the group containing the most colours comes first.
# Do a secondary sort alphabetically, so that all identical groups are next to each other.
sorted_groups_in_order = sorted_groups.sort_by { |s| [s.length,s] }.reverse
# Traverse the groups in order to find the index that marks the position of results_to_return unique groups.
# This is to make subsequent processing more efficient, as it will only operate on a smaller subset.
results_to_return = 5
temp = sorted_groups_in_order[0]
combination_count = 0
index = 0
sorted_groups_in_order.each do |e|
combination_count +=1 if e != temp
break if combination_count == results_to_return
index += 1
temp = e
end
# Iterate through the subset, and count the duplicates.
tags_with_count = Hash.new(0)
sorted_groups_in_order[0..index].each do |v|
tags_with_count.store(v,tags_with_count[v]+1)
end
# Sort by the number of colours in each subset, the most colours go first.
tags_with_count = tags_with_count.sort { |q,w| w[0].size <=> q[0].size }
# if colour subsets are found in colour supersets, then increment the subset count to reflect this.
tags_with_count.reverse.each_with_index do |object,index|
tags_with_count.reverse.each_with_index do |object2,index2|
if (index2 < index) && (object[0]&object2[0] == object2[0])
object2[1] += object[1]
end
end
end
# Sort by the number of colours in each subset, the most colours go first.
# Perform a secondary sort by the count value.
tags_with_count = tags_with_count.sort_by { |s| [s[0].length,s[1]] }.reverse
# print our results.
tags_with_count.each do |l|
puts l.inspect
end
# output for profiling.
puts "Time elapsed: #{Time.now - beginning} seconds."