I would like to analyse data in my database to find out how many times certain words appear. Ideally I would like a list of the top 20 words used in a particular column. What would be the easiest way of going about this.
+1
A:
Create an autovivified hash and then loop through the rows populating the hash and incrementing the value each time you get the same key (word). Then sort the hash by value.
ennuikiller
2009-09-13 18:02:14
+1
A:
A word counter...
I wasn't sure if you were asking how to get rails to work on this or how to count words, but I went ahead and did a column-oriented ruby wordcounter anyway.
(BTW, at first I did try the autovivified hash, what a cool trick.)
# col: a column name or number
# strings: a String, Array of Strings, Array of Array of Strings, etc.
def count(col, *strings)
(@h ||= {})[col = col.to_s] ||= {}
[*strings].flatten.each { |s|
s.split.each { |s|
@h[col][s] ||= 0
@h[col][s] += 1
}
}
end
def formatOneCol a
limit = 2
a.sort { |e1,e2| e2[1]<=>e1[1] }.each { |results|
printf("%9d %s\n", results[1], results[0])
return unless (limit -= 1) > 0
}
end
def formatAllCols
@h.sort.each { |a|
printf("\n%9s\n", "Col " + a[0])
formatOneCol a[1]
}
end
count(1,"how now")
count(1,["how", "now", "brown"])
count(1,[["how", "now"], ["brown", "cow"]])
count(2,["you see", "see you",["how", "now"], ["brown", "cow"]])
count(2,["see", ["see", ["see"]]])
count("A_Name Instead","how now alpha alpha alpha")
formatAllCols
$ ruby count.rb
Col 1
3 how
3 now
Col 2
5 see
2 you
Col A_Name Instead
3 alpha
1 how
$
DigitalRoss
2009-09-13 20:23:47
A:
digitalross answer looks too verbose to me, also, as you tag ruby-on-rails and said you use DB.. i'm assuming you need an activerecord model so i'm giving you a full solution
in your model:
def self.top_strs(column_symbol, top_num)
h = Hash.new(0)
find(:all, :select => column_symbol).each do |obj|
obj.send(column_symbol).split.each do |word|
h[word] += 1
end
end
h.map.sort_by(&:second).reverse[0..top_num]
end
for example, model Comment, column body:
Comment.top_strs(:body, 20)
amikazmi
2009-09-13 23:01:24
Thanks, looks like a much nicer solution. Once problem I get when using this is the following:ArgumentError: wrong number of arguments (1 for 0)in the sort_by call. Also, what is the symbol :second you pass to it?
KJF
2009-09-15 17:22:22
I fixed it by using this:p h.sort_by {|a,b| b}.reverse[0..top_num]
KJF
2009-09-15 17:30:15
amikazmi
2009-09-15 17:32:11