views:

92

answers:

3

I would like to analyse data in my database to find out how many times certain words appear. Ideally I would like a list of the top 20 words used in a particular column. What would be the easiest way of going about this.

+1  A: 

Create an autovivified hash and then loop through the rows populating the hash and incrementing the value each time you get the same key (word). Then sort the hash by value.

ennuikiller
+1  A: 

A word counter...

I wasn't sure if you were asking how to get rails to work on this or how to count words, but I went ahead and did a column-oriented ruby wordcounter anyway.

(BTW, at first I did try the autovivified hash, what a cool trick.)


# col: a column name or number
# strings: a String, Array of Strings, Array of Array of Strings, etc.
def count(col, *strings) 
  (@h ||= {})[col = col.to_s] ||= {}
  [*strings].flatten.each { |s|
    s.split.each { |s|
      @h[col][s] ||= 0
      @h[col][s]  += 1
    }
  }
end
def formatOneCol a
  limit = 2
  a.sort { |e1,e2| e2[1]<=>e1[1] }.each { |results|
    printf("%9d %s\n", results[1], results[0])
    return unless (limit -= 1) > 0
  }
end
def formatAllCols
  @h.sort.each { |a|
    printf("\n%9s\n", "Col " + a[0])
    formatOneCol a[1]
  }
end

count(1,"how now")
count(1,["how", "now", "brown"])
count(1,[["how", "now"], ["brown", "cow"]])
count(2,["you see", "see you",["how", "now"], ["brown", "cow"]])
count(2,["see", ["see", ["see"]]])
count("A_Name Instead","how now alpha alpha alpha")

formatAllCols


$ ruby count.rb

    Col 1
        3 how
        3 now

    Col 2
        5 see
        2 you

Col A_Name Instead
        3 alpha
        1 how
$
DigitalRoss
A: 

digitalross answer looks too verbose to me, also, as you tag ruby-on-rails and said you use DB.. i'm assuming you need an activerecord model so i'm giving you a full solution

in your model:

def self.top_strs(column_symbol, top_num)
  h = Hash.new(0)
  find(:all, :select => column_symbol).each do |obj|
    obj.send(column_symbol).split.each do |word|
      h[word] += 1
    end
  end

  h.map.sort_by(&:second).reverse[0..top_num]
end

for example, model Comment, column body:

Comment.top_strs(:body, 20)
amikazmi
Thanks, looks like a much nicer solution. Once problem I get when using this is the following:ArgumentError: wrong number of arguments (1 for 0)in the sort_by call. Also, what is the symbol :second you pass to it?
KJF
I fixed it by using this:p h.sort_by {|a,b| b}.reverse[0..top_num]
KJF
amikazmi