I have a Ruby-on-Rails model:

class Candidate < ActiveRecord::Base
  validates_presence_of :application_essay
  validate :validate_length_of_application_essay

  protected

  def validate_length_of_application_essay
    return if application_essay.blank? # don't add a second error message if they didn't fill it out
    errors.add(:application_essay, :too_long) unless ...
  end
end

Without dropping into C, what is the fastest way to check that the application_essay contains no more than 500 words? You can assume that most essays will be at least 200 words, are unlikely to be more than 5000 words, and are in English (or the pseudo-English sometimes called "business-ese"). You can also classify anything you want as a "word" as long as your classification would be immediately obvious to a typical user. (NB: this is not the place to debate what a "typical user" is :) )

+2  A: 

You're not going to get any faster than a linear search, sorry (unless this is for some sort of text editor, where you can keep track of the count incrementally).
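A minimal sketch of such a linear scan, assuming a "word" is any run of non-whitespace characters. The helper name at_most_words? is hypothetical, not part of Rails or of the question. The only state it keeps is a running count and whether the scan is currently inside a word, and it stops as soon as the limit is exceeded:

```ruby
# Hypothetical helper: true if `text` has at most `limit` words,
# where a word is assumed to be a run of non-whitespace characters.
def at_most_words?(text, limit)
  count = 0
  in_word = false
  text.each_char do |ch|
    if ch =~ /\s/
      in_word = false              # whitespace ends the current word
    elsif !in_word
      in_word = true               # first character of a new word
      count += 1
      return false if count > limit  # early exit: no need to scan the rest
    end
  end
  true
end
```

The early exit only helps for essays that are over the limit. A pure-Ruby loop like this may well be slower than a built-in such as split, so it is worth benchmarking before relying on it.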

BlueRaja - Danny Pflughoeft
But a linear search for what? Spaces? Word-boundaries? What's the minimum amount of information I have to keep track of as I do the linear search? And if I'm just looking for whitespace-groups, wouldn't a divide-and-conquer strategy take me from O(n) to O(log(n))?
James A. Rosen
BlueRaja - Danny Pflughoeft
+1  A: 

You could estimate the typical length of a word and guess the number of words by dividing the character count by it.

Some hints here: http://blogamundo.net/lab/wordlengths/

You could try something like 5.1 and see how accurate you are by running a few tests.

Actually, dividing by 6.1 is probably better, since each word is also followed by whitespace.

Keep in mind that you would be assuming your text is not just a huge amount of whitespace or something similar. But if you are really only interested in making sure it has no more than x words, you could use a low estimate for the word length, maybe 5: if the text has fewer than x times 5 characters, you can be pretty sure it does not have more than x words.

For anything longer than that, you are probably better off doing a linear search, as stated in the other answers. A linear search isn't that bad at all. It just depends on what you want to do.
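The estimate-then-count idea above could be sketched as follows. The method name within_word_limit? and the chars_per_word parameter are assumptions for illustration; the default of 5 follows the suggestion above and is only a heuristic, while a value of 2 (one letter plus one separator per word) turns the shortcut into a guarantee.

```ruby
# Sketch of "estimate first, count only if needed".
# `chars_per_word` is an assumed tuning value, not a measured one:
#   5 is the heuristic suggested above and can wrongly accept texts made of
#     very short words;
#   2 is the theoretical floor, so the shortcut can never be wrong.
def within_word_limit?(text, limit, chars_per_word = 5)
  # Short texts are accepted without counting anything.
  return true if text.length < limit * chars_per_word
  # Otherwise fall back to an exact whitespace-based count.
  text.split.length <= limit
end
```

For example, within_word_limit?(essay, 500) skips the count for any essay under 2500 characters. Length alone can never prove that a text is over the limit, so longer texts always need the exact count.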

HansDampf
I did think of that. I have no idea what to use as a "typical word," but I'm not really opposed to the concept if I could find a reasonable value.
James A. Rosen
I updated my post.
HansDampf
And regarding your comment on the other answer: I don't think you can get it faster than linear. To find the words there is no way around checking every single character, which means you need at least n operations for a text of length n.
HansDampf
A: 

Here is a nice article that you might like:

http://dotnetperls.com/word-count

anijhaw
+1  A: 

I would just use something like:

string.split(" ").length <= 500

What performance issue are you seeing? A string of 500 or so words shouldn't be much of a problem.

Toby Hede
+1  A: 

There's a plugin for that, though I haven't used it myself :)

http://code.google.com/p/validates-word-count/

That plugin replaces each run of adjacent "word characters" with a single character, then removes all non-word characters and counts what is left. I'm not sure whether it's the fastest approach, though.
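The approach described above might look roughly like this. It is a sketch of the description, not the plugin's actual code; the method name approximate_word_count is hypothetical, and "word characters" are assumed to mean the regex class \w.

```ruby
# Sketch of the described technique (not taken from the plugin itself).
def approximate_word_count(text)
  collapsed = text.gsub(/\w+/, "x")  # each run of word characters becomes one "x"
  collapsed.gsub(/\W/, "").length    # drop everything else, then count the "x"s
end
```

Under this definition an apostrophe splits a word, so "don't" counts as two. text.scan(/\w+/).length should give the same number in a single pass.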

ba