views:

746

answers:

13

I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently - better even. The two languages I'm primarily interested in are PHP and Javascript, but I'd be interested in seeing how quickly this could be done in any other major language today as well (mostly C#, Java, etc).

  1. Return only words with an occurrence greater than X
  2. Return only words with a length greater than Y
  3. Ignore common terms like "and, is, the, etc"
  4. Feel free to strip punctuation prior to processing (ie. "John's" becomes "John")
  5. Return results in a collection/array

Extra Credit

  1. Keep Quoted Statements together, (ie. "They were 'too good to be true' apparently")
    Where 'too good to be true' would be the actual statement

Extra-Extra Credit

  1. Can your script determine words that should be kept together based upon their frequency of being found together? This being done without knowing the words beforehand. Example:
    "The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has lead to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."
    Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?

Source text: http://sampsonresume.com/labs/c.txt

Answer Format

  1. It would be great to see the results of your code, output, in addition to how long the operation lasted.
+3  A: 

C# 3.0 (with LINQ)

Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.

public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
    var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
        "for", "by", "an", "be", "may", "has", "can", "its"};
    var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
    var occurrences = words.Distinct().Except(commonWords).Select(w =>
        new { Word = w, Count = words.Count(s => s == w) });
    return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
        .ToDictionary(wo => wo.Word, wo => wo.Count);
}

This is however far from the most efficient method, being O(n^2) with the number of words, rather than O(n), which is optimal in this case I believe. I'll see if I can creater a slightly longer method that is more efficient.

Here are the results of the function run on the sample text (min occurences: 3, min length: 2).

  3 x such
  4 x code
  4 x which
  4 x declarations
  5 x function
  4 x statements
  3 x new
  3 x types
  3 x keywords
  7 x statement
  3 x language
  3 x expression
  3 x execution
  3 x programming
  4 x operators
  3 x variables

And my test program:

static void Main(string[] args)
{
    string sampleText;
    using (var client = new WebClient())
        sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
    var keywords = GetKeywords(sampleText, 3, 2);
    foreach (var entry in keywords)
        Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
    Console.ReadKey(true);
}
Noldorin
+1 Fast and simple!
Groo
+1 That was quick, Noldorin - nice work.
Jonathan Sampson
Thanks. :) I'll see if I can make a few changes for the sake of efficiency though, and also to get the "extra credit".
Noldorin
I'll also try to get some statistics from the sample text.
Noldorin
Can't find "variables" word... bugz..
Kamarey
@Kamarey: Ah yes, I missed out brackets from the punctuation list. Well spotted.
Noldorin
Reason for down-votes please?
Noldorin
+1  A: 

In C#:

  1. Use LINQ, specifically groupby, then filter by group count, and return a flattened (selectmany) list.

  2. Use LINQ, filter by length.

  3. Use LINQ, filter with 'badwords'.Contains.

leppie
You want them all together? As seen with the less than optimal solution below, you can easily do that :)
leppie
+4  A: 

F#: 304 chars

let f =
    let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
    fun length occurrence msg ->
        System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
        |> Seq.countBy (fun a -> a)
        |> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)
Juliet
Awesome. I don't think I've seen a working block of F# to this date :) Glad to finally get acquainted.
Jonathan Sampson
Nice solution. I was going to try one in F# myself, as I knew the Seq module functions would help (in particular `countBy`), but this is great.
Noldorin
Slight problem is that punctuation does mess up some words (i.e. commas, periods as last chars of words).
Noldorin
Noldorin: Fixed to handle punctuation :)
Juliet
Ah, I missed the Regex.Split method somehow. That would be a good way of getting around quotes too. Anyway, deserves the up-vote. :)
Noldorin
+3  A: 
Alex Feinman
Makes me want to toy around with Perl a bit!
Jonathan Sampson
Since this isn't really trying to be as small as possible, I decided to make it a bit more maintainable. Which only increased the length by a few percent.
Brad Gilbert
@Brad, if only you'd fix the bugs too, like the one in the last line. :-/
Alex Feinman
I added my own examples for sorting, and filtering.
Brad Gilbert
Brad, why didn't you add your own answer instead of editing Alex's?
Jeff Atwood
I did post my own answer, and Alex asked me to fix some of the bugs. There were so many changes I would do that I decided to just leave his code alone, and add my own sorting code.
Brad Gilbert
A: 
Kuroki Kaze
Looks like you treated "The" and "the" (and other 'ignore-words') as different terms. I see a couple 'ignore-words' in your output. :)
Jonathan Sampson
Yeah :) Well, we always can "lowercase" all words, but it'll ruin case in names and quotes. Proper stemming would be a good idea, thoough it's a bit harder to implement.
Kuroki Kaze
+1  A: 

REBOL

Verbose, perhaps, so definitely not a winner, but gets the job done.

min-length: 0
min-count: 0

common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]

add-word: func [
    word [string!]
    /local
        count
        letter
        non-letter
        temp
        rules
        match
][    
    ; Strip out punctuation
    temp: copy {}
    letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
    non-letter: complement letter
    rules: [
        some [
            copy match letter (append temp match)
            |
            non-letter
        ]
    ]
    parse/all word rules
    word: temp

    ; If we end up with nothing, bail
    if 0 == length? word [
        exit
    ]

    ; Check length
    if min-length > length? word [
        exit
    ]

    ; Ignore common words
    ignore: 
    if find common-words word [
        exit
    ]

    ; OK, its good. Add it.
    either found? count: select words word [
        words/(word): count + 1
    ][
        repend words [word 1]
    ]
]

rules: [
    some [
        {"}
        copy word to {"} (add-word word)
        {"}
        |
        copy word to { } (add-word word)
        { }
    ]
    end
]

words: copy []
parse/all read %c.txt rules

result: copy []
foreach word words [
    if string? word [
        count: words/:word
        if count >= min-count [
            append result word
        ]
    ]
]

sort result
foreach word result [ print word ]

The output is:

act
actions
all
allows
also
any
appear
arbitrary
arguments
assign
assigned
based
be
because
been
before
below
between
braces
branches
break
builtin
but
C
C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
call
called
calls
can
care
case
char
code
columnbased
comma
Comments
common
compiler
conditional
consisting
contain
contains
continue
control
controlflow
criticized
Cs
curly brackets
declarations
define
definitions
degree
delimiters
designated
directly
dowhile
each
effect
effects
either
enclosed
enclosing
end
entry
enum
evaluated
evaluation
evaluations
even
example
executed
execution
exert
expression
expressionExpressions
expressions
familiarity
file
followed
following
format
FORTRAN
freeform
function
functions
goto
has
high
However
identified
ifelse
imperative
include
including
initialization
innermost
int
integer
interleaved
Introduction
iterative
Kernighan
keywords
label
language
languages
languagesAlthough
leave
limit
lineEach
loop
looping
many
may
mimicked
modify
more
most
name
needed
new
next
nonstructured
normal
object
obtain
occur
often
omitted
on
operands
operator
operators
optimization
order
other
perhaps
permits
points
programmers
programming
provides
rather
reinitialization
reliable
requires
reserve
reserved
restrictions
results
return
Ritchie
say
scope
Sections
see
selects
semicolon
separate
sequence
sequence point
sequential
several
side
single
skip
sometimes
source
specify
statement
statements
storage
struct
Structured
structuresAs
such
supported
switch
syntax
testing
textlinebased
than
There
This
turn
type
types
union
Unlike
unspecified
use
used
uses
using
usually
value
values
variable
variables
variety
which
while
whitespace
widespread
will
within
writing
Gregory Higley
I think you also need the count in the output... or am I missing something?
fretje
Nope, you're right! I'll fix it in a few minutes.
Gregory Higley
And now 9 months have passed...
Wallacoloo
+2  A: 

C# code:

IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
    // common words, that will be ignored
    var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
    // regular expression to find quoted text
    var regex = new Regex("\"[^\"]\"", RegexOptions.Compiled);

    return
        // remove quoted text (it will be processed later)
        regex.Replace(text, "")
        // remove case dependency
        .ToLower()
        // split text by all these chars
        .Split(".,'\\/[]{}()`~@#$%^&*-=+?!;:<>| \n\r".ToCharArray())
        // add quoted text
        .Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
        // group words by the word and count them
        .GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
        // apply filter(min word count and word length) and remove common words 
        .Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}

Output for ProcessText(text, 3, 2) call:

3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators
Kamarey
+6  A: 

Perl in only 43 characters.

perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'

Here is an example of it's use:

echo a a a b b c  d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'

---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1

If you need to list only the lowercase versions, it requires two more characters.

perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'

For it to work on the specified text requires 58 characters.

curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'
real    0m0.679s
user    0m0.304s
sys     0m0.084s

Here is the last example expanded a bit.

#! perl
use 5.010;
use YAML;

while( my $line = <> ){
  for my $elem ( split '\W+', $line ){
    $_{ lc $elem }++
  }
  END{
    say Dump \%_;
  }
}
Brad Gilbert
Good lord, man! I'm curious about a break-down explanation if you have time in the future ;)
Jonathan Sampson
I didn't really read the whole question before writing this example, that's why the first example is so incomplete.
Brad Gilbert
Now I know why people tell Perl looks like an explosion in an ASCII factory...
ya23
Cute use of YAML as a pretty-print output...
Alex Feinman
I used YAML, because it's fairly easy to read, and it helped shorten the code by not having to pretty-print the data myself.
Brad Gilbert
+3  A: 

Ruby

When "minified", this implementation becomes 165 characters long. It uses array#inject to give a starting value (a Hash object with a default of 0) and then loop through the elements, which are then rolled into the hash; the result is then selected from the minimum frequency.

Note that I didn't count the size of the words to skip, that being an external constant. When the constant is counted too, the solution is 244 characters long.

Apostrophes and dashes aren't stripped, but included; their use modifies the word and therefore cannot be stripped simply without removal of all information beyond the symbol.

Implementation

CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
def get_keywords(text, minFreq=0, minLen=2)
  text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
    inject(Hash.new(0)) do |result,w|
      w.downcase!
      result[w] += 1 unless CommonWords.include?(w)
      result
    end.select { |k,n| n >= minFreq }
end

Test Rig

require 'net/http'

keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com','/labs/c.txt'), 3)
keywords.sort.each { |name,count| puts "#{name} x #{count} times" }

Test Results

code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times
The Wicked Flea
+1  A: 

Python (258 chars as is, including 66 chars for first line and 30 chars for punctuation removal) :

W="and is the as of to or in for by an be may has can its".split()
x=3;y=2;d={}
for l in open('c.txt') :
    for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
        if w not in W: d[w]=d.get(w,0)+1
for w,n in d.items() :
    if n>y and len(w)>x : print n,w

output :

4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types
makapuf
I love seeing these tiny solutions - impresses me how amazing programming can be :)
Jonathan Sampson
+11  A: 

GNU scripting

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr

Results:

  7 be
  6 to
[...]
  1 2.
  1 -

With occurence greater than X:

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'

Return only words with a length greater than Y (put Y+1 dots in second grep):

sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c

Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored')

sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c

Feel free to strip punctuation prior to processing (ie. "John's" becomes "John"):

sed -e 's/[,.:"\']//g;s/ /\n/g' | grep -v '^ *$' | sort | uniq -c

Return results in a collection/array: it is already like an array for shell: first column is count, second is word.

liori
Absolutely amazing! About as short as the Perl solution yet completely readable. +1 for doing it completely outside the box.
gooli
Sometimes when I don't care about any rules, I just do os.popen(r'this kind of shell code') instead of doing everything in python. It is much faster to write for one-shot scripts... and the performance is still pretty good.
liori
Stunning. Simply beautiful!
Jonathan Sampson
+2  A: 

Another Python solution, at 247 chars. The actual code is a single line of highly dense Python line of 134 chars that computes the whole thing in a single expression.

x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
    gb(sorted(open("c.txt").read().lower().split())))
    if l>x and len(w)>y and w not in W)

A much longer version with plenty of comments for you reading pleasure:

# High and low count boundaries.
x = 3
y = 2

# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()

# A special function that groups similar strings in a list into a 
# (string, grouper) pairs. Grouper is a generator of occurences (see below).
from itertools import groupby

# Reads the entire file, converts it to lower case and splits on whitespace 
# to create a list of words
sortedWords = sorted(open("c.txt").read().lower().split())

# Using the groupby function, groups similar words together.
# Since grouper is a generator of occurences we need to use len(list(grouper)) 
# to get the word count by first converting the generator to a list and then
# getting the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))

# Filters the words by number of occurences and common words using yet another 
# list comprehension.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)

# Creates a dictionary from the list of tuples.
result = dict(filteredWordCounts)

print result

The main trick here is using the itertools.groupby function to count the occurrences on a sorted list. Don't know if it really saves characters, but it does allow all the processing to happen in a single expression.

Results:

{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}
gooli
I'm so impressed with these languages. Thanks for the contribution! Nice work!
Jonathan Sampson
You cannot safely assume that `'may'` should not be indexed.
Sinan Ünür
you only refer to gb once(after importing), you could change `from itertools import groupby as gb` to `from itertools import*` and then replace the other gb with groupby
Wallacoloo
A: 

This is not going to win any golfing awards but it does keep quoted phrases together and takes into account stop words (and leverages CPAN modules Lingua::StopWords and Text::ParseWords).

In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.

You might also want to look at Lingua::CollinsParser.

#!/usr/bin/perl

use strict; use warnings;

use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;

my $stop = getStopWords('en');

my %words;

while ( my $line = <> ) {
    chomp $line;
    next unless $line =~ /\S/;
    next unless my @words = parse_line(' ', 1, $line);

    ++ $words{to_S $_} for
        grep { length and not $stop->{$_} }
        map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
        @words;
}

print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { $words{$_} > 3 } keys %words;

print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { 11 < length } keys %words;

Output:

=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1
Sinan Ünür