ansaurus

Question

Answer 1

+25 A:

Perl, 237 229 209 chars

(Updated again to beat the Ruby version with more dirty golf tricks, replacing split/[^a-z/,lc with lc=~/[a-z]+/g, and eliminating a check for empty string in another place. These were inspired by the Ruby version, so credit where credit is due.)

Update: now with Perl 5.10! Replace print with say, and use ~~ to avoid a map. This has to be invoked on the command line as perl -E '<one-liner>' alice.txt. Since the entire script is on one line, writing it as a one-liner shouldn't present any difficulty :).

 @s=qw/the and of to a i it in or is/;$c{$_}++foreach grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>;@s=sort{$c{$b}<=>$c{$a}}keys%c;$f=76-length$s[0];say" "."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "foreach@s[0..21];

Note that this version normalizes for case. This doesn't shorten the solution any, since removing ,lc (for lower-casing) requires you to add A-Z to the split regex, so it's a wash.
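To illustrate why switching from split to a global match removes the need for the empty-string check, here is a minimal Python sketch (not the Perl from this answer; the sample sentence is made up). Splitting on every non-letter leaves empty strings wherever two separators are adjacent, while matching runs of letters returns only the words.

```python
import re

line = "Alice's Adventures -- in Wonderland!"

# Splitting on every non-letter leaves empty strings between adjacent separators.
split_words = re.split(r"[^a-z]", line.lower())

# Matching runs of letters yields only the words themselves.
matched_words = re.findall(r"[a-z]+", line.lower())

print(split_words)
print(matched_words)
```

Filtering the empty strings out of the split result gives the same list as the match, which is the check the golfed version no longer needs.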

If you're on a system where a newline is one character and not two, you can shorten this by another two chars by using a literal newline in place of \n. However, I haven't written the above sample that way, since it's "clearer" (ha!) that way.

Here is a mostly correct, but not remotely short enough, Perl solution:

use strict;
use warnings;

my %short = map { $_ => 1 } qw/the and of to a i it in or is/;
my %count = ();

$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-zA-Z]/ } (<>);
my @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
my $widest = 76 - (length $sorted[0]);

print " " . ("_" x $widest) . "\n";
foreach (@sorted)
{
    my $width = int(($count{$_} / $count{$sorted[0]}) * $widest);
    print "|" . ("_" x $width) . "| $_ \n";
}

The following is about as short as it can get while remaining relatively readable. (392 chars).

%short = map { $_ => 1 } qw/the and of to a i it in or is/;
%count;

$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-z]/, lc } (<>);
@sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
$widest = 76 - (length $sorted[0]);

print " " . "_" x $widest . "\n";
print"|" . "_" x int(($count{$_} / $count{$sorted[0]}) * $widest) . "| $_ \n" foreach @sorted;

JSBangs 2010-07-02 21:29:35

Has a few bugs right now; fixing and shortening.

JSBangs 2010-07-02 21:35:14

This doesn't cover the case when the second word is much longer than the first, right?

Joey 2010-07-03 10:19:00

Both `foreach`s can be written as `for`s. That's 8 chars down. Then you have the `grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>`, which I believe could be written as `grep{!(/$_/i~~@s)}<>=~/[a-z]+/g` to go 4 more down. Replace the `" "` with `$"` and you're down 1 more...

Zaid 2010-07-04 18:05:55

`sort{$c{$b}-$c{$a}}...` to save two more. You can also just pass `%c` instead of `keys %c` to the `sort` function and save four more.

mobrule 2010-07-05 23:45:05

Answer 2

+27 A:

C# - 510 451 436 446 434 426 422 chars (minified)

Not that short, but now probably correct! Note, the previous version did not show the first line of the bars, did not scale the bars correctly, downloaded the file instead of getting it from stdin, and did not include all the required C# verbosity. You could easily shave many strokes if C# didn't need so much extra crap. Maybe Powershell could do better.

using C=System.Console;   // alias for Console
using System.Linq;  // for GroupBy, OrderBy, Take, Select, Max, etc.

class Class // must define a class
{
    static void Main()  // must define a Main
    {
        // split into words
        var allwords = System.Text.RegularExpressions.Regex.Split(
                // convert stdin to lowercase
                C.In.ReadToEnd().ToLower(),
                // eliminate stopwords and non-letters
                @"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+")
            .GroupBy(x => x)    // group by words
            .OrderBy(x => -x.Count()) // sort descending by count
            .Take(22);   // take first 22 words

        // compute length of longest bar + word
        var lendivisor = allwords.Max(y => y.Count() / (76.0 - y.Key.Length));

        // prepare text to print
        var toPrint = allwords.Select(x=> 
            new { 
                // remember bar pseudographics (will be used in two places)
                Bar = new string('_',(int)(x.Count()/lendivisor)), 
                Word=x.Key 
            })
            .ToList();  // convert to list so we can index into it

        // print top of first bar
        C.WriteLine(" " + toPrint[0].Bar);
        toPrint.ForEach(x =>  // for each word, print its bar and the word
            C.WriteLine("|" + x.Bar + "| " + x.Word));
    }
}

422 chars with lendivisor inlined (which makes it 22 times slower) in the below form (newlines used for select spaces):

using System.Linq;using C=System.Console;class M{static void Main(){var
a=System.Text.RegularExpressions.Regex.Split(C.In.ReadToEnd().ToLower(),@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+").GroupBy(x=>x).OrderBy(x=>-x.Count()).Take(22);var
b=a.Select(x=>new{p=new string('_',(int)(x.Count()/a.Max(y=>y.Count()/(76d-y.Key.Length)))),t=x.Key}).ToList();C.WriteLine(" "+b[0].p);b.ForEach(x=>C.WriteLine("|"+x.p+"| "+x.t));}}
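The single regex above does two jobs at once: it treats stop words and non-word characters alike as separators, so whatever is left between separators is a word to count. A rough Python sketch of the same idea follows (the sample sentence is invented, and dropping the empty strings at the ends is an addition of this sketch, not something the C# does):

```python
import re

# Same pattern as in the C# answer: stop words and non-word characters
# are both treated as separators. i[tns]? covers "i", "it", "in" and "is".
SEPARATORS = r"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+"

text = "The cat is in a hat, and it sat."
pieces = re.split(SEPARATORS, text.lower())

# A separator at the very start or end leaves an empty string; drop those.
words = [p for p in pieces if p]

print(pieces)
print(words)
```

Note that a stop word only counts as a separator when it stands alone between word boundaries, so "alice" is not split at its leading "a".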

Paul Creasey 2010-07-02 21:37:32

Clever one, this. I like it.

Arve Systad 2010-07-02 22:07:45

+1 for the smart-ass downloading the file inline. :)

sarnold 2010-07-03 00:24:29

Steal the short URL from Matt's answer.

indiv 2010-07-03 00:31:50

The spec said the file must be piped in or passed as an args. If you were to assume that args[0] contained the local file name, you could shorten it considerably by using args[0] instead of (new WebClient()).DownloadString(@"http://www.gutenberg.org/files/11/11.txt") -> it would save you approx 70 characters

thorkia 2010-07-03 01:19:00

Here is a version replacing the WebClient call with args[0], a call to StreamReader, and removing a few extra spaces. Total char count = 413:

var a=Regex.Replace((new StreamReader(args[0])).ReadToEnd(),"[^a-zA-Z]"," ").ToLower().Split(' ').Where(x=>!(new[]{"the","and","of","to","a","i","it","in","or","is"}).Contains(x)).GroupBy(x=>x).Select(g=>new{w=g.Key,c=g.Count()}).OrderByDescending(x=>x.c).Skip(1).Take(22).ToList();var m=a.OrderByDescending(x=>x.c).First();a.ForEach(x=>Console.WriteLine("|"+new String('_',x.c*(80-m.w.Length-4)/m.c)+"| "+x.w));

thorkia 2010-07-03 01:41:32

"new StreamReader" without "using" is dirty. File.ReadAllText(args[0]) or Console.In.ReadToEnd() are much better. In the latter case you can even remove the argument from your Main(). :)

Rotsor 2010-07-03 03:10:59

The bar widths are incorrect. "with"'s bar is shorter than "at"'s.

Rotsor 2010-07-03 03:52:31

Rotsor: As far as I can tell, "with" and "at" have the same width of bar, which they should because they have the same frequency.

Gabe 2010-07-03 05:40:54

You use Console.WriteLine a number of times. Save some more chars by aliasing `using C=System.Console;` and then in your code `C.WriteLine(..)`, or a different char since you already have C as a class name.

John K 2010-07-03 06:13:13

This is an awesome example of the power of LINQ. Just imagine that in Java.

Zoomzoom83 2010-07-05 02:01:20

@Zoomzoom83: It would be great to have but it would probably *still* be two orders of magnitude longer. We're talking about Java, after all ;) (and it will probably only show up in Java 8 which set its release date *after* Duke Nukem Forever).

Joey 2010-07-05 07:55:28

Answer 3

+9 A:

F#, 452 chars

Straightforward: get a sequence a of word-count pairs, find the best word-count-per-column multiplier k, then print results.

let a=
 stdin.ReadToEnd().Split(" .?!,\":;'\r\n".ToCharArray(),enum 1)
 |>Seq.map(fun s->s.ToLower())|>Seq.countBy id
 |>Seq.filter(fun(w,n)->not(set["the";"and";"of";"to";"a";"i";"it";"in";"or";"is"].Contains w))
 |>Seq.sortBy(fun(w,n)-> -n)|>Seq.take 22
let k=a|>Seq.map(fun(w,n)->float(78-w.Length)/float n)|>Seq.min
let u n=String.replicate(int(float(n)*k)-2)"_"
printfn" %s "(u(snd(Seq.nth 0 a)))
for(w,n)in a do printfn"|%s| %s "(u n)w
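The scaling step is the part most entries get wrong, so here is a hedged Python sketch of the same approach (the constants 78 and 2 are copied from the F# above; the function name and sample data are hypothetical). The multiplier k is the smallest columns-per-occurrence ratio over all words, so the word with the least room decides the scale.

```python
def render_chart(pairs):
    """pairs: (word, count) tuples, most frequent first."""
    # Columns of bar available per occurrence, limited by the tightest word.
    k = min((78 - len(word)) / count for word, count in pairs)

    def bar(count):
        return "_" * (int(count * k) - 2)

    lines = [" " + bar(pairs[0][1])]          # top edge of the first bar
    for word, count in pairs:
        lines.append("|" + bar(count) + "| " + word + " ")
    return lines


sample = [("she", 500), ("superlongstringword", 450), ("said", 400)]
for line in render_chart(sample):
    print(line)
```

Because k is a minimum over every word, no line can exceed 80 characters even when a long word is nearly as frequent as the top one.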

Example (I have different freq counts than you, unsure why):

% app.exe < Alice.txt

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| t
|____________________________| s
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| had
|______________________| for
|_____________________| but
|_____________________| be
|____________________| not
|___________________| they
|__________________| so

Brian 2010-07-02 21:52:47

@Brian: turns out my own solution was indeed a little off (due to a little different spec), the solutions correspond now ;-)

ChristopheD 2010-07-02 22:15:21

+1 for the only correct bar scaling implementation so far

Rotsor 2010-07-03 04:40:17

(@Rotsor: Ironic, given that mine is the oldest solution.)

Brian 2010-07-03 08:34:09

I bet you could shorten it quite a bit by merging the split, map, and filter stages. I'd also expect that you wouldn't need so many `float`s.

Gabe 2010-07-03 11:59:40

Isn't nesting functions usually shorter than using the pipeline operator `|>`?

Joey 2010-07-03 12:40:32

Answer 4

+7 A:

Gawk -- 336 (originally 507) characters

(after fixing the output formatting; fixing the contractions thing; tweaking; tweaking again; removing a wholly unnecessary sorting step; tweaking yet again; and again (oops this one broke the formatting); tweak some more; taking up Matt's challenge I desperately tweak some more; found another place to save a few, but gave two back to fix the bar length bug)

Heh heh! I am momentarily ahead of Matt's JavaScript solution (counter challenge! ;)) and AKX's Python.

The problem seems to call out for a language that implements native associative arrays, so of course I've chosen one with a horribly deficient set of operators on them. In particular, you cannot control the order in which awk offers up the elements of a hash map, so I repeatedly scan the whole map to find the currently most numerous item, print it and delete it from the array.

It is all terribly inefficient, and with all the golfifications I've made it has gotten to be pretty awful as well.
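The repeated-scan selection described above can be sketched in Python as follows (a rough equivalent, not a translation of the awk; the function name and sample counts are made up). Each round makes one full pass over the map, takes the current maximum, and deletes it.

```python
def top_n_by_rescanning(counts, n):
    """Return up to n (word, count) pairs, highest count first."""
    remaining = dict(counts)  # work on a copy so the caller's map is untouched
    result = []
    while remaining and len(result) < n:
        # One full pass over the map to find the current maximum.
        best = max(remaining, key=remaining.get)
        result.append((best, remaining[best]))
        del remaining[best]
    return result


counts = {"she": 5, "you": 4, "said": 3, "alice": 2}
print(top_n_by_rescanning(counts, 3))
```

This costs one scan of the whole map per printed row (22 scans here), which is wasteful compared with sorting once but needs no sort function at all.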

Minified:

{gsub("[^a-zA-Z]"," ");for(;NF;NF--)a[tolower($NF)]++}
END{split("the and of to a i it in or is",b," ");
for(w in b)delete a[b[w]];d=1;for(w in a){e=a[w]/(78-length(w));if(e>d)d=e}
for(i=22;i;--i){e=0;for(w in a)if(a[w]>e)e=a[x=w];l=a[x]/d-2;
t=sprintf(sprintf("%%%dc",l)," ");gsub(" ","_",t);if(i==22)print" "t;
print"|"t"| "x;delete a[x]}}

Line breaks are for clarity only: they are not necessary and should not be counted.

Output:

$ gawk -f wordfreq.awk.min < 11.txt 
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so
$ sed 's/you/superlongstring/gI' 11.txt | gawk -f wordfreq.awk.min
 ______________________________________________________________________
|______________________________________________________________________| she
|_____________________________________________________________| superlongstring
|__________________________________________________________| said
|__________________________________________________| alice
|____________________________________________| was
|_________________________________________| that
|_________________________________| as
|______________________________| her
|___________________________| with
|___________________________| at
|__________________________| s
|__________________________| t
|________________________| on
|________________________| all
|_____________________| this
|_____________________| for
|_____________________| had
|____________________| but
|___________________| be
|___________________| not
|__________________| they
|_________________| so

Readable; 633 characters (originally 949):

{
    gsub("[^a-zA-Z]"," ");
    for(;NF;NF--)
    a[tolower($NF)]++
}
END{
    # remove "short" words
    split("the and of to a i it in or is",b," ");
    for (w in b) 
    delete a[b[w]];
    # Find the bar ratio
    d=1;
    for (w in a) {
    e=a[w]/(78-length(w));
    if (e>d)
        d=e
    }
    # Print the entries highest count first
    for (i=22; i; --i){               
    # find the highest count
    e=0;
    for (w in a) 
        if (a[w]>e)
        e=a[x=w];
        # Print the bar
    l=a[x]/d-2;
    # make a string of "_" the right length
    t=sprintf(sprintf("%%%dc",l)," ");
    gsub(" ","_",t);
    if (i==22) print" "t;
    print"|"t"| "x;
    delete a[x]
    }
}
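The nested sprintf used to build the bar may be easier to follow in isolation. This Python sketch mimics it with the % operator (the function name is hypothetical): the inner format produces a format string such as "%5c", the outer one pads a single space to that width, and the spaces are then turned into underscores.

```python
def underscores(length):
    fmt = "%%%dc" % length   # builds a format string, e.g. "%5c"
    padded = fmt % " "       # one space right-aligned in a field of that width
    return padded.replace(" ", "_")


print(underscores(5))
```

In Python one would simply write "_" * length; the detour is only needed in awk because it has no string repetition operator.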

dmckee 2010-07-02 22:54:33

Nice work, good you included an indented / commented version ;-)

ChristopheD 2010-07-03 12:31:33

Answer 5

+6 A:

sh (+curl), partial* solution

This is incomplete, but for the hell of it, here's the word-frequency counting half of the problem in 192 bytes:

curl -s http://www.gutenberg.org/files/11/11.txt|sed -e 's@[^a-z]@\n@gi'|tr '[:upper:]' '[:lower:]'|egrep -v '(^[^a-z]*$|\b(the|and|of|to|a|i|it|in|or|is)\b)' |sort|uniq -c|sort -n|tail -n 22
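For readers less used to shell pipelines, the counting half above corresponds roughly to the following Python sketch (an approximation under the assumption that the text is already in a string; the function name is made up). The reversal at the end mirrors sort -n followed by tail, which lists the most frequent word last.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}


def top_words(text, n=22):
    # sed + tr: break the text into lower-case runs of letters.
    words = re.findall(r"[a-z]+", text.lower())
    # egrep -v: drop the stop words; sort | uniq -c: count what is left.
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # most_common gives descending order; reverse to mirror `sort -n | tail`.
    return list(reversed(counts.most_common(n)))


print(top_words("The cat and the hat. The cat sat.", 2))
```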

Frank Farmer 2010-07-02 22:55:19

Answer 6

+11 A:

JavaScript 1.8 (SpiderMonkey) - 354

x={};p='|';e=' ';z=[];c=77
while(l=readline())l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y)x[y]?x[y].c++:z.push(x[y]={w:y,c:1}))
z=z.sort(function(a,b)b.c-a.c).slice(0,22)
for each(v in z){v.r=v.c/z[0].c
c=c>(l=(77-v.w.length)/v.r)?l:c}for(k in z){v=z[k]
s=Array(v.r*c|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

Sadly, the for([k,v]in z) from the Rhino version doesn't seem to want to work in SpiderMonkey, and readFile() is a little easier than using readline(), but moving up to 1.8 allows us to use expression closures (function bodies without braces or return) to cut a few more characters.

Adding whitespace for readability:

x={};p='|';e=' ';z=[];c=77
while(l=readline())
  l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,
   function(y) x[y] ? x[y].c++ : z.push( x[y] = {w: y, c: 1} )
  )
z=z.sort(function(a,b) b.c - a.c).slice(0,22)
for each(v in z){
  v.r=v.c/z[0].c
  c=c>(l=(77-v.w.length)/v.r)?l:c
}
for(k in z){
  v=z[k]
  s=Array(v.r*c|0).join('_')
  if(!+k)print(e+s+e)
  print(p+s+p+e+v.w)
}

Usage: js golf.js < input.txt

Output:

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|____________________________| s
|____________________________| t
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| for
|______________________| had
|______________________| but
|_____________________| be
|_____________________| not
|___________________| they
|___________________| so

(base version - doesn't handle bar widths correctly)

JavaScript (Rhino) - 405 395 387 377 368 343 304 chars

~~I think my sorting logic is off, but.. I duno.~~ Brainfart fixed.

Minified (abusing \n's interpreted as a ; sometimes):

x={};p='|';e=' ';z=[]
readFile(arguments[0]).toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y){x[y]?x[y].c++:z.push(x[y]={w:y,c:1})})
z=z.sort(function(a,b){return b.c-a.c}).slice(0,22)
for([k,v]in z){s=Array((v.c/z[0].c)*70|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

Matt 2010-07-02 23:05:58

Ah, sir. I believe this is your gauntlet. Have your second speak to mine.

dmckee 2010-07-03 04:09:00

BTW-- I like the `i[tns]?` bit. Very sneaky.

dmckee 2010-07-03 04:10:02

@dmckee - well played, I don't think I can beat your 336, enjoy your much-deserved upvote :)

Matt 2010-07-03 13:31:02

You can definitely beat 336... There is a 23 character cut available -- `.replace(/[^\w ]/g, e).split(/\s+/).map(` can be replaced with `.replace(/\w+/g,` and use the same function your `.map` did... Also not sure if Rhino supports `function(a,b)b.c-a.c` instead of your sort function (SpiderMonkey does), but that will shave `{return }` ... `b.c-a.c` is a better sort than `a.c<b.c` btw... Editing a SpiderMonkey version at the bottom with these changes

gnarf 2010-07-03 20:29:53

I moved my SpiderMonkey version up to the top since it conforms to the bar width constraint... Also managed to cut out a few more chars in your original version by using a negative lookahead regexp to deny words allowing for a single replace(), and golfed a few ifs with `?:` Great base to work from though!

gnarf 2010-07-03 23:48:07

This will not eliminate stop words when surrounded by digits or underscores such as in `foo_the123` where only `foo` should remain.

Joey 2010-07-04 15:38:11

Answer 7

+8 A:

Python 2.6, 347 chars

import re
W,x={},"a and i in is it of or the to".split()
[W.__setitem__(w,W.get(w,0)-1)for w in re.findall("[a-z]+",file("11.txt").read().lower())if w not in x]
W=sorted(W.items(),key=lambda p:p[1])[:22]
bm=(76.-len(W[0][0]))/W[0][1]
U=lambda n:"_"*int(n*bm)
print "".join(("%s\n|%s| %s "%((""if i else" "+U(n)),U(n),w))for i,(w,n)in enumerate(W))
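One trick in the code above is worth spelling out: the counts are stored as negative numbers, so a plain ascending sort already puts the most frequent word first and no reverse flag is needed. A small standalone sketch of that idea (sample words invented, written without the golfing):

```python
words = ["she", "you", "she", "said", "you", "she"]

neg_counts = {}
for w in words:
    # Count downwards: each occurrence subtracts one.
    neg_counts[w] = neg_counts.get(w, 0) - 1

# Ascending sort on negative counts puts the most frequent word first.
ranked = sorted(neg_counts.items(), key=lambda p: p[1])
print(ranked)

# Ratios of two negative counts are positive, so bar scaling still works.
print(ranked[1][1] / ranked[0][1])
```

The bar arithmetic is unaffected because every width is computed from a count divided by another count, and the two signs cancel.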

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|_____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| s 
|____________________________| t 
|__________________________| on 
|__________________________| all 
|_______________________| this 
|_______________________| for 
|_______________________| had 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so

AKX 2010-07-02 23:27:39

You can lose the line `bm=(76.-len(W[0][0]))/W[0][1]` since you're only using bm once (make the next line `U=lambda n:"_"*int(n*(76.-len(W[0][0]))/W[0][1])`, shaves off 5 characters. Also: why would you use a 2-character variable name in code golfing? ;-)

ChristopheD 2010-07-03 14:28:36

On the last line the space after print isn't necessary, shaves off one character

ChristopheD 2010-07-03 14:29:11

Doesn't consider the case when the second-most frequent word is very long, right?

Joey 2010-07-04 23:12:15

@ChristopheD: Because I had been staring at that code for a little too long. :P Good catch. @Johannes: That could be fixed too, yes. Not sure all other implementations did it when I wrote this either.

AKX 2010-07-05 10:11:20

Answer 8

+18 A:

Ruby, 215, 216, 218, 221, 224, 236, 237 chars

update 1: Hurray! It's a tie with JS Bangs' solution. Can't think of a way to cut down any more :)

update 2: Played a dirty golf trick. Changed each to map to save 1 character :)

update 3: Changed File.read to IO.read +2. Array.group_by wasn't very fruitful, changed to reduce +6. Case insensitive check is not needed after lower casing with downcase in regex +1. Sorting in descending order is easily done by negating the value +6. Total savings +15

update 4: [0] rather than .first, +3. (@Shtééf)

update 5: Expand variable l in-place, +1. Expand variable s in-place, +2. (@Shtééf)

update 6: Use string addition rather than interpolation for the first line, +2. (@Shtééf)

w=(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take 22;m=76-w[0][0].size;puts' '+'_'*m;w.map{|x,f|puts"|#{'_'*(f*1.0/w[0][1]*m)}| #{x} "}

update 7: I went through a whole lot of hoopla to detect the first iteration inside the loop, using instance variables. All I got is +1, though perhaps there is potential. Preserving the previous version, because I believe this one is black magic. (@Shtééf)

(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take(22).map{|x,f|@f||(@f=f;puts' '+'_'*(@m=76-x.size));puts"|#{'_'*(f*1.0/@f*@m)}| #{x} "}

Readable version

string = File.read($_).downcase

words = string.scan(/[a-z]+/i)
allowed_words = words - %w{the and of to a i it in or is}
sorted_words = allowed_words.group_by{ |x| x }.map{ |x,y| [x, y.size] }.sort{ |a,b| b[1] <=> a[1] }.take(22)
highest_frequency = sorted_words.first
highest_frequency_count = highest_frequency[1]
highest_frequency_word = highest_frequency[0]

word_length = highest_frequency_word.size
widest = 76 - word_length

puts " #{'_' * widest}"    
sorted_words.each do |word, freq|
  width = (freq * 1.0 / highest_frequency_count) * widest
  puts "|#{'_' * width}| #{word} "
end
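One caveat with scaling every bar against the first word only, as the readable version does: a long word further down the list can push its line past 80 columns. The following Python sketch reproduces just that width calculation (it is not the Ruby above, and the sample words and counts are invented) to show the effect.

```python
def naive_lines(pairs):
    """Scale every bar against the first word only."""
    top_word, top_count = pairs[0]
    widest = 76 - len(top_word)
    lines = [" " + "_" * widest]
    for word, count in pairs:
        width = int(count / top_count * widest)
        lines.append("|" + "_" * width + "| " + word + " ")
    return lines


pairs = [("she", 100), ("superlongstringexample", 99)]
for line in naive_lines(pairs):
    print(len(line), line)
```

The first word's line is exactly 80 characters, but the second is longer, because its bar is almost as wide while its label is much longer.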

To use:

echo "Alice.txt" | ruby -ln GolfedWordFrequencies.rb

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|_____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| s 
|____________________________| t 
|__________________________| on 
|__________________________| all 
|_______________________| this 
|_______________________| for 
|_______________________| had 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so

Anurag 2010-07-03 00:03:27

Whoa, Ruby is beating Perl.

chpwn 2010-07-03 00:40:23

Isn't "p" a shortcut for "puts" ? That could shave a few.

rfusca 2010-07-03 03:21:25

Nice. Your use of `scan`, though, gave me a better idea, so I got ahead again :).

JSBangs 2010-07-03 04:35:21

@rfusca, `p` puts quotes around the output so it wouldn't match OP's.

Anurag 2010-07-03 08:56:53

@JS Looks like it's going to be a cat and mouse game, until J comes along :)

Anurag 2010-07-03 09:10:51

You miscounted. At update 3, you were at 224 characters. I brought you back to 221, any way. :) That reduce trick is black magic. :o

Shtééf 2010-07-03 09:23:48

@Shtééf - thanks :) .. last thing we want in code-golf is miscounting on the higher side.. lol :o)

Anurag 2010-07-03 09:27:24

Isn't using the shell to "read" and pipe the file cheating?

mP 2010-07-03 09:56:22

The question states that the input can be piped in. Also, we are just piping in the file name, not its contents.

Anurag 2010-07-03 10:01:24

Okay, I am now totally done with this. You may now shout at me. :) I sure hope that Perl version doesn't become much shorter.

Shtééf 2010-07-03 10:42:12

You need to scale the bars so the longest word plus its bar fits on 80 characters. As Brian suggested, a long second word will break your program.

Gabe 2010-07-03 11:36:23

I wonder why this is still gathering votes. The solution is incorrect (in the general case) and two way shorter Ruby solutions are here by now.

Joey 2010-07-03 11:44:16

Now, correct me if I'm wrong, but instead of using "downcase", why don't you use the regexp case-insensitive flag? That saves 6-7 bytes, does it not?

st0le 2010-07-03 12:37:00

How about [0..21] instead of .take 22?

Adam 2010-07-03 14:08:08

Answer 9

+1 A:

Java, slowly getting shorter (~~1500~~ ~~1358~~ ~~1241~~ ~~1020~~ ~~913~~ 890 chars)

Stripped even more white space and var name length. Removed generics where possible, removed inline class and try/catch block. Too bad, my 900 version had a bug.

removed another try / catch block

import java.net.*;import java.util.*;import java.util.regex.*;import org.apache.commons.io.*;public class G{public static void main(String[]a)throws Exception{String text=IOUtils.toString(new URL(a[0]).openStream()).toLowerCase().replaceAll("\\b(the|and|of|to|a|i[tns]?|or)\\b","");final Map<String,Integer>p=new HashMap();Matcher m=Pattern.compile("\\b\\w+\\b").matcher(text);Integer b;while(m.find()){String w=m.group();b=p.get(w);p.put(w,b==null?1:b+1);}List<String>v=new Vector(p.keySet());Collections.sort(v,new Comparator(){public int compare(Object l,Object m){return p.get(m)-p.get(l);}});boolean t=true;float r=0;for(String w:v.subList(0,22)){if(t){t=false;r=p.get(w)/(float)(80-(w.length()+4));System.out.println(" "+new String(new char[(int)(p.get(w)/r)]).replace('\0','_'));}System.out.println("|"+new String(new char[(int)(((Integer)p.get(w))/r)]).replace('\0','_')+"|"+w);}}}

Readable version:

import java.net.*;
import java.util.*;
import java.util.regex.*;
import org.apache.commons.io.*;

public class G{

    public static void main(String[] a) throws Exception{
        String text =
            IOUtils.toString(new URL(a[0]).openStream())
                .toLowerCase()
                .replaceAll("\\b(the|and|of|to|a|i[tns]?|or)\\b", "");
        final Map<String, Integer> p = new HashMap();
        Matcher m = Pattern.compile("\\b\\w+\\b").matcher(text);
        Integer b;
        while(m.find()){
            String w = m.group();
            b = p.get(w);
            p.put(w, b == null ? 1 : b + 1);
        }
        List<String> v = new Vector(p.keySet());
        Collections.sort(v, new Comparator(){

            public int compare(Object l, Object m){
                return p.get(m) - p.get(l);
            }
        });
        boolean t = true;
        float r = 0;
        for(String w : v.subList(0, 22)){
            if(t){
                t = false;
                r = p.get(w) / (float) (80 - (w.length() + 4));
                System.out.println(" "
                    + new String(new char[(int) (p.get(w) / r)]).replace('\0',
                        '_'));
            }
            System.out.println("|"
                + new String(new char[(int) (((Integer) p.get(w)) / r)]).replace('\0',
                    '_') + "|" + w);
        }
    }
}

seanizer 2010-07-03 00:54:35

not quite golf, either =/

Justin L. 2010-07-03 01:07:25

I like goofball high-character-count golf submissions. It's good to break up the monotony of line noise with something readable and almost laughably verbose.

John Y 2010-07-03 02:54:53

@John: I disagree. Even if you are going to use a verbose language (see my fortran 77 entries in some earlier code golfs for instance) you should code it as tightly as the language allows. Code golf isn't about good practices; indeed it is very nearly the antithesis of good practice.

dmckee 2010-07-03 06:14:41

@dmckee: I completely understand and accept your viewpoint. Still, I personally like to see just about any submission. Variety is the spice of life, and to me that even includes differing (even opposing) spirit and ideals in code golf. Better to dance, but dance "poorly" (for whatever definition of dance), than to stand in the corner or worse yet, not even show up.

John Y 2010-07-03 14:14:51

Answer 10

+1 A:

Javascript, 348 characters

After I finished mine, I stole some ideas from Matt :3

t=prompt().toLowerCase().replace(/\b(the|and|of|to|a|i[tns]?|or)\b/gm,'');r={};o=[];t.replace(/\b([a-z]+)\b/gm,function(a,w){r[w]?++r[w]:r[w]=1});for(i in r){o.push([i,r[i]])}m=o[0][1];o=o.slice(0,22);o.sort(function(F,D){return D[1]-F[1]});for(B in o){F=o[B];L=new Array(~~(F[1]/m*(76-F[0].length))).join('_');print(' '+L+'\n|'+L+'| '+F[0]+' \n')}

Requires print and prompt function support.

M28 2010-07-03 01:01:53

This will have some problems with strings like `the_foo`, right? (Because then `\b` breaks apart)

Joey 2010-07-05 07:52:47

Answer 11

+25 A:

belisarius 2010-07-03 02:43:14

You think "RegularExpression" is bad? I cried when I typed "System.Text.RegularExpressions.Regex.Split" into the C# version, up until I saw the Objective-C code: "stringWithContentsOfFile", "enumerateSubstringsInRange", "NSStringEnumerationByWords", "sortedArrayUsingComparator", and so on.

Gabe 2010-07-04 16:39:03

@Gabe Thanks ... I feel better now. In Spanish we say "mal de muchos, consuelo de tontos" .. Something like "Many troubled, fools relieved" :D

belisarius 2010-07-04 17:20:53

The `|i|` is redundant in your regex because you already have `.|`.

Gabe 2010-07-04 23:29:52

@Gabe yes, thnx. Corrected.

belisarius 2010-07-05 00:02:40

I like that Spanish saying. The closest thing I can think of in English is "misery loves company". Here's my translation attempt: "It's a fool who, when suffering, takes consolation in thinking of others in the same situation." Amazing work on the Mathematica implementation, btw.

dreeves 2010-07-05 03:01:07

@dreeves Foolishness surpass the language barrier easily ... Glad to see you like my little Mathematica program, I'm just starting to learn the language

belisarius 2010-07-05 04:24:31

Pared it down to 199...

Michael Pilat 2010-07-09 04:14:27

@Michael Pilat Wow! I've a lot to learn ... wonderful!

belisarius 2010-07-09 04:39:07

The 199 version will not interpret things like `the_foo` according to spec, right?

Joey 2010-07-11 08:45:24

@Johannes Rössel Good eye! The bug is in all versions due to Mathematica matching the underscore as a letter char (Why did they do that??!!). The regexp should be something like "(_|\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+" , but \\W also recognizes digits as letters, so perhaps an utterly correct version is a little longer.

belisarius 2010-07-17 16:38:33

@belisarius: Actually all regex engines consider `\w` as something like `[a-zA-Z0-9_]` or maybe `[\p{L}\p{Nd}_]` for Unicode-aware engines. And since `\b` is considered a boundary between `\w` and `\W` this doesn't work according to the spec here. But many solutions have that problem and it took me quite a few characters to get that part right in my solution. As for the *why* I think it fits with what many programming languages allow as identifiers. You can simply match them with `\w+` (doesn't *quite* work, but close enough for most hackish solutions).

Joey 2010-07-17 19:36:06
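To make the `\w` / `\b` point above concrete, here is a small sketch. It uses Python's `re` module purely for illustration (none of the answers in this exchange are in Python), and the sample string is built from the `the_foo` example mentioned in the comments.

```python
import re

sample = "the_foo or123bar"

# \w covers letters, digits and underscore, so these stay glued together.
word_tokens = re.findall(r"\w+", sample.lower())

# Matching letters only treats '_' and digits as separators.
letter_tokens = re.findall(r"[a-z]+", sample.lower())

# \b only fires between a \w and a non-\w character, so the "the" inside
# "the_foo" is never seen as a standalone word.
stop_hits = re.findall(r"\bthe\b", sample.lower())

print(word_tokens)    # ['the_foo', 'or123bar']
print(letter_tokens)  # ['the', 'foo', 'or', 'bar']
print(stop_hits)      # []
```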

@Johannes Rössel Yep, maybe because "_" is usually an allowed char in many languages. Curiously, in Mathematica it is a special character and cannot be used in identifiers :D

belisarius 2010-07-18 00:30:58

Answer 12

+3 A:

Java - 896 chars

931 chars

1233 chars made unreadable

1977 chars "uncompressed"

Update: I have aggressively reduced the character count. Omits single-letter words per updated spec.

I envy C# and LINQ so much.

import java.util.*;import java.io.*;import static java.util.regex.Pattern.*;class g{public static void main(String[] a)throws Exception{PrintStream o=System.out;Map<String,Integer> w=new HashMap();Scanner s=new Scanner(new File(a[0])).useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));while(s.hasNext()){String z=s.next().trim().toLowerCase();if(z.equals(""))continue;w.put(z,(w.get(z)==null?0:w.get(z))+1);}List<Integer> v=new Vector(w.values());Collections.sort(v);List<String> q=new Vector();int i,m;i=m=v.size()-1;while(q.size()<22){for(String t:w.keySet())if(!q.contains(t)&&w.get(t).equals(v.get(i)))q.add(t);i--;}int r=80-q.get(0).length()-4;String l=String.format("%1$0"+r+"d",0).replace("0","_");o.println(" "+l);o.println("|"+l+"| "+q.get(0)+" ");for(i=m-1;i>m-22;i--){o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");}}}

"Readable":

import java.util.*;
import java.io.*;
import static java.util.regex.Pattern.*;
class g
{
   public static void main(String[] a)throws Exception
      {
      PrintStream o = System.out;
      Map<String,Integer> w = new HashMap();
      Scanner s = new Scanner(new File(a[0]))
         .useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));
      while(s.hasNext())
      {
         String z = s.next().trim().toLowerCase();
         if(z.equals(""))
            continue;
         w.put(z,(w.get(z) == null?0:w.get(z))+1);
      }
      List<Integer> v = new Vector(w.values());
      Collections.sort(v);
      List<String> q = new Vector();
      int i,m;
      i = m = v.size()-1;
      while(q.size()<22)
      {
         for(String t:w.keySet())
            if(!q.contains(t)&&w.get(t).equals(v.get(i)))
               q.add(t);
         i--;
      }
      int r = 80-q.get(0).length()-4;
      String l = String.format("%1$0"+r+"d",0).replace("0","_");
      o.println(" "+l);
      o.println("|"+l+"| "+q.get(0)+" ");
      for(i = m-1; i > m-22; i--)
      {
         o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");
      }
   }
}

Output of Alice:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

Output of Don Quixote (also from Gutenberg):

 ________________________________________________________________________
|________________________________________________________________________| that
|________________________________________________________| he
|______________________________________________| for
|__________________________________________| his
|________________________________________| as
|__________________________________| with
|_________________________________| not
|_________________________________| was
|________________________________| him
|______________________________| be
|___________________________| don
|_________________________| my
|_________________________| this
|_________________________| all
|_________________________| they
|________________________| said
|_______________________| have
|_______________________| me
|______________________| on
|______________________| so
|_____________________| you
|_____________________| quixote

Jonathon 2010-07-03 03:19:20

Wholly carp, is there really no way to make it shorter in Java? I hope you guys get paid by number of characters and not by functionality :-)

Nas Banov 2010-07-03 04:43:26

Java is absolute shit, wow

Pierreten 2010-09-04 08:14:11

Answer 13

+2 A:

Java - 991 chars (incl. newlines and indentation)

I took the code of @seanizer, fixed a bug (he omitted the 1st output line), made some improvements to make the code more 'golfy'.

import java.util.*;
import java.util.regex.*;
import org.apache.commons.io.IOUtils;
public class WF{
 public static void main(String[] a)throws Exception{
  String t=IOUtils.toString(new java.net.URL(a[0]).openStream());
  class W implements Comparable<W> {
   String w;int f=1;W(String W){w=W;}public int compareTo(W o){return o.f-f;}
   String d(float r){char[]c=new char[(int)(f/r)];Arrays.fill(c,'_');return "|"+new String(c)+"| "+w;}
  }
  Map<String,W>M=new HashMap<String,W>();
  Matcher m=Pattern.compile("\\b\\w+\\b").matcher(t.toLowerCase());
  while(m.find()){String w=m.group();W W=M.get(w);if(W==null)M.put(w,new W(w));else W.f++;}
  M.keySet().removeAll(Arrays.asList("the,and,of,to,a,i,it,in,or,is".split(",")));
  List<W>L=new ArrayList<W>(M.values());Collections.sort(L);int l=76-L.get(0).w.length();
  System.out.println(" "+new String(new char[l]).replace('\0','_'));
  for(W w:L.subList(0,22))System.out.println(w.d((float)L.get(0).f/(float)l));
 }
}

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so

BalusC 2010-07-03 05:09:21

`new String(new char[l]).replace('\0','_')` that's a nice trick to remember, thanks.

seanizer 2010-07-03 06:18:40

Answer 14

+5 A:

Java - 886 865 756 744 742 744 752 742 714 680 chars

Updates before first 742: improved regex, removed superfluous parameterized types, removed superfluous whitespace.
Update 742 > 744 chars: fixed the fixed-length hack. It's only dependent on the 1st word, not other words (yet). Found several places to shorten the code (\\s in regex replaced by a literal space and ArrayList replaced by Vector). I'm now looking for a short way to remove the Commons IO dependency and read from stdin.
Update 744 > 752 chars: I removed the commons dependency. It now reads from stdin. Paste the text in stdin and hit Ctrl+Z to get result.
Update 752 > 742 chars: I removed public and a space, made classname 1 char instead of 2 and it's now ignoring one-letter words.
Update 742 > 714 chars: Updated as per comments of Carl: removed redundant assignment (742 > 730), replaced m.containsKey(k) by m.get(k)!=null (730 > 728), introduced substringing of line (728 > 714).
Update 714 > 680 chars: Updated as per comments of Rotsor: improved bar size calculation to remove unnecessary casting and improved split() to remove unnecessary replaceAll().

import java.util.*;class F{public static void main(String[]a)throws Exception{StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);}}
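For readers who want the idea without the golfing, the integer-only bar calculation can be sketched as follows. This is an ungolfed Python illustration, not a translation of the Java: the function name `bar_chart` and its parameters are made up here, and it matches `[a-z]+` instead of reproducing the Java `split()` regex.

```python
import re
from collections import Counter

STOP = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}

def bar_chart(text, top=22, width=76):
    """Return the chart lines; bar lengths use integer math only."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if len(w) > 1 and w not in STOP]      # one-letter words ignored
    ranked = sorted(Counter(words).items(), key=lambda kv: -kv[1])[:top]
    if not ranked:
        return []
    first_word, first_count = ranked[0]
    full = width - len(first_word)                 # like c=76-l.get(0).length()
    lines = [" " + "_" * full]
    for word, count in ranked:
        # count*full//first_count mirrors m.get(w)*c/m.get(l.get(0)):
        # multiply first, then divide, so no float cast is needed.
        lines.append("|" + "_" * (count * full // first_count) + "| " + word)
    return lines
```

Multiplying before dividing keeps everything in integers, which is the same reason the 680-char version could drop its casts.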

Python 2.x, latitudinarian approach = 227 183 chars

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)if w not in'andithetoforinis')[:22]
for l,w in r:print(78-len(r[0][1]))*l/r[0][0]*'=',w

Allowing for freedom in the implementation, I constructed a string concatenation that contains all the words requested for exclusion (the, and, of, to, a, i, it, in, or, is) - plus it also excludes the two infamous "words" s and t from the example - and I threw in for free the exclusion for an, for, he. I tried all concatenations of those words against a corpus of the words from Alice, the King James Bible and the Jargon File to see if there are any words that will be mis-excluded by the string. And that is how I ended up with two exclusion strings: itheandtoforinis and andithetoforinis.
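The substring trick can be checked in isolation. The sketch below is written for Python 3 (the golfed code above is Python 2), and the helper name `is_excluded` is made up for illustration.

```python
EXCLUDE = "andithetoforinis"

def is_excluded(word):
    # Substring test: true for any word that appears anywhere inside EXCLUDE.
    # The empty string is a substring of every string, so blank tokens
    # produced by re.split are dropped as well.
    return word in EXCLUDE

required = "the and of to a i it in or is".split()
extras = "s t an for he".split()

print(all(is_excluded(w) for w in required + extras))  # True
print(is_excluded("alice"), is_excluded("that"))       # False False
```

The price of the trick is that any other substring, such as "tof" or "nd", is excluded too, which is why the string had to be tested against real word lists.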

PS. borrowed from other solutions to shorten the code.

=========================================================================== she 
================================================================= you
============================================================== said
====================================================== alice
================================================ was
============================================ that
===================================== as
================================= her
============================== at
============================== with
=========================== on
=========================== all
======================== this
======================== had
======================= but
====================== be
====================== not
===================== they
==================== so
=================== very
=================== what
================= little

Rant

Regarding words to ignore, one would think those would be taken from list of the most used words in English. That list depends on the text corpus used. Per one of the most popular lists (http://en.wikipedia.org/wiki/Most_common_words_in_English, http://www.english-for-students.com/Frequently-Used-Words.html, http://www.sporcle.com/games/common_english_words.php), top 10 words are: the be(am/are/is/was/were) to of and a in that have I

The top 10 words from the Alice in Wonderland text are the and to a of it she i you said
The top 10 words from the Jargon File (v4.4.7) are the a of to and in is that or for

So the question is why "or" was included in the problem's ignore list, when it's ~30th in popularity, while the word "that" (8th most used) is not, etc. Hence I believe the ignore list should be provided dynamically (or could be omitted).

Alternative idea would be simply to skip the top 10 words from the result - which actually would shorten the solution (elementary - have to show only the 11th to 32nd entries).
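A minimal sketch of that alternative, in Python 3 and not golfed. The function name `chart_words` and its `skip`/`show` parameters are assumptions for illustration only.

```python
import re
from collections import Counter

def chart_words(text, skip=10, show=22):
    """Rank words by frequency, drop the `skip` most common, keep `show`."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # With the defaults this keeps the 11th to 32nd entries.
    return counts.most_common(skip + show)[skip:]

print(chart_words("aa aa aa bb bb cc", skip=1, show=2))  # [('bb', 2), ('cc', 1)]
```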

Python 2.x, punctilious approach = 277 243 chars

The chart drawn in the above code is simplified (using only one character for the bars). If one wants to reproduce exactly the chart from the problem description (which was not required), this code will do it:

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)-set(sys.argv))[:22]
h=min(9*l/(77-len(w))for l,w in r)
print'',9*r[0][0]/h*'_'
for l,w in r:print'|'+9*l/h*'_'+'|',w

I take issue with the somewhat random choice of the 10 words to exclude (the, and, of, to, a, i, it, in, or, is), so those are to be passed as command line parameters, like so:
python WordFrequencyChart.py the and of to a i it in or is <"Alice's Adventures in Wonderland.txt"

This is 213 chars + 30 if we account for the "original" ignore list passed on command line = 243

PS. The second code also does "adjustment" for the lengths of all top words, so none of them will overflow in degenerate case.

 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|_____________________________________________________| said
|______________________________________________| alice
|_________________________________________| was
|______________________________________| that
|_______________________________| as
|____________________________| her
|__________________________| at
|__________________________| with
|_________________________| s
|_________________________| t
|_______________________| on
|_______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|___________________| not
|_________________| they
|_________________| so

Nas Banov 2010-07-03 06:32:16
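The "adjustment for the lengths of all top words" mentioned in the PS can be sketched on its own. This is not the golfed formula from the answer; it is a Python 3 illustration that assumes each line is `|` + bar + `| ` + word with no trailing space, and the name `bar_lengths` is hypothetical.

```python
from fractions import Fraction

def bar_lengths(ranked, max_line=80):
    """ranked: list of (word, count) pairs, most frequent first."""
    # Each line is "|" + bar + "| " + word, i.e. len(bar) + 3 + len(word).
    # The scale is limited by whichever word leaves the least room per count.
    scale = min(Fraction(max_line - 3 - len(w), c) for w, c in ranked)
    return [int(c * scale) for w, c in ranked]

sample = [("she", 100), ("superlongstringstring", 90), ("said", 50)]
print(bar_lengths(sample))  # [62, 56, 31]
```

With this sample the long second word is the limiting one: its line comes out at exactly 80 characters and the first bar shrinks accordingly.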

Nice solution so far although the word ignore list isn't implemented (yet) and the bars are a bit rudimentary at the moment.

ChristopheD 2010-07-03 07:05:37

@ChristopheD: it was there, but there was no "user guide". Just added a bunch of text

Nas Banov 2010-07-03 09:47:19

Regarding your list of languages and solutions: Please look for solutions that use splitting along `\W` or use `\b` in a regex because those are very likely *not* according to spec, meaning they won't split on digits or `_` and they might also not remove stop words from strings such as `the_foo_or123bar`. They may not appear in the test text but the specification is pretty clear on that case.

Joey 2010-07-11 09:10:51

Answer 16

+2 A:

Gotta love the big ones...Objective-C (~~1070~~ ~~931~~ 905 chars)

#define S NSString
#define C countForObject
#define O objectAtIndex
#define U stringWithCString
main(int g,char**b){id c=[NSCountedSet set];S*d=[S stringWithContentsOfFile:[S U:b[1]]];id p=[NSPredicate predicateWithFormat:@"SELF MATCHES[cd]'(the|and|of|to|a|i[tns]?|or)|[^a-z]'"];[d enumerateSubstringsInRange:NSMakeRange(0,[d length])options:NSStringEnumerationByWords usingBlock:^(S*s,NSRange x,NSRange y,BOOL*z){if(![p evaluateWithObject:s])[c addObject:[s lowercaseString]];}];id s=[[c allObjects]sortedArrayUsingComparator:^(id a,id b){return(NSComparisonResult)([c C:b]-[c C:a]);}];g=[c C:[s O:0]];int j=76-[[s O:0]length];char*k=malloc(80);memset(k,'_',80);S*l=[S U:k length:80];printf(" %s\n",[[l substringToIndex:j]cString]),[[s subarrayWithRange:NSMakeRange(0,22)]enumerateObjectsUsingBlock:^(id a,NSUInteger x,BOOL*y){printf("|%s| %s\n",[[l substringToIndex:[c C:a]*j/g]cString],[a cString]);}];}

Switched to using a lot of deprecated APIs, removed some memory management that wasn't needed, more aggressive whitespace removal

 _________________________________________________________________________
|_________________________________________________________________________| she
|______________________________________________________________| said
|__________________________________________________________| you
|____________________________________________________| alice
|________________________________________________| was
|_______________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|________________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| so
|___________________| very
|__________________| what
|_________________| they

Joshua Weinberg 2010-07-03 07:29:46

Note that the spec calls for ignoring 's, so "don't" parses as two words "don" and "t". You'll see in the reference implementation that "s" and "t" are represented in the top 22...

dmckee 2010-07-03 09:12:54
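A quick way to see the apostrophe behaviour, sketched in Python with a made-up sample sentence:

```python
import re

# Letters-only matching splits at the apostrophe, leaving stray "t" and "s".
tokens = re.findall(r"[a-z]+", "Don't touch Alice's hat".lower())
print(tokens)  # ['don', 't', 'touch', 'alice', 's', 'hat']
```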

Kudos for doing it in obj-c (not a language you see often in code golfing)!

ChristopheD 2010-07-03 11:58:55

@Christophe: And here we see exactly *why* we don't see it that often ;)

Joey 2010-07-03 13:38:31

Try `#define S NSString`, `#define C countForObject`, and use these two appropriately. Also replace `calloc(80,1)` with simply `malloc(80)`, since you're setting the contents straight afterwards. Also, reuse the `a` parameter, to save on an `int` declaration. This should get it less than 1,000 chars...

brone 2010-07-03 16:47:16

@brone thanks for the idea, took those, and some other extra stuff I saw, well below 1000 now

Joshua Weinberg 2010-07-03 17:20:21

Use `id` instead of `NSCountedSet*` etc!

KennyTM 2010-07-10 05:47:30

Holy hell, how did I not think of that....edited to fix, 905 :)

Joshua Weinberg 2010-07-10 05:59:32

Answer 17

+33 A:

Ruby 207 213 211 210 207 203 201 200 chars

An improvement on Anurag, incorporating suggestion from rfusca. Also removes argument to sort and a few other minor golfings.

w=(STDIN.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort.take 22;k,l=w[0];m=76.0-l.size;puts' '+'_'*m;w.map{|f,x|puts"|#{'_'*(m*f/k)}| #{x} "}

Execute as:

ruby GolfedWordFrequencies.rb < Alice.txt

Edit: put 'puts' back in, needs to be there to avoid having quotes in output.
Edit2: Changed File->IO
Edit3: removed /i
Edit4: Removed parentheses around (f*1.0), recounted
Edit5: Use string addition for the first line; expand s in-place.
Edit6: Made m float, removed 1.0. EDIT: Doesn't work, changes lengths. EDIT: No worse than before
Edit7: Use STDIN.read.

archgoon 2010-07-03 08:55:16
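The sorting trick (building `[-count, word]` pairs so that `sort` needs no block) carries over to other languages. Here is a hedged Python sketch of the same idea; `top_words` is a made-up name.

```python
from collections import Counter

def top_words(words, n=22):
    # Pairs of (-count, word): an ordinary ascending sort puts the most
    # frequent word first and breaks ties alphabetically.
    return sorted((-c, w) for w, c in Counter(words).items())[:n]

print(top_words("b a b c a b".split(), 2))  # [(-3, 'b'), (-2, 'a')]
```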

+1 - love the sorting part, very clever :)

Anurag 2010-07-03 09:39:30

Hey, small optimization compared to coming up with the bulk of it in the first place. :)

archgoon 2010-07-03 10:01:17

Nice! Added two of the changes I also made in Anurag's version. Shaves off another 4.

Shtééf 2010-07-03 10:57:32

The solution has deviated from the original output, I'm going to try and figure out where that happened.

archgoon 2010-07-03 11:06:11

Huh, note that the last two are the same length (in our and several other versions), but the original questioner has them as different. Anurag's original solution has this issue. It's going to be a pain tracking it down. I'm putting back in the 76.0 trick, since it isn't the problem.

archgoon 2010-07-03 11:11:34

@archgoon: I applaud your noble effort, but string addition is not shorter for the loop. It's only shorter because you took out the trailing space. But don't feel bad, it doesn't make Perl look any better. ;)

Shtééf 2010-07-03 11:26:53

@Shtééf, ah, that explains why you didn't do it already ;). At least we got the proper count. Congratulations to you.

archgoon 2010-07-03 11:33:22

How about [0..21] instead of .take 22?

Adam 2010-07-03 14:07:41

There's a shorter variant of this down further.

archgoon 2010-07-03 22:16:52

Answer 18

+32 A:

206

shell, grep, tr, grep, sort, uniq, sort, head, perl

~ % wc -c wfg
209 wfg
~ % cat wfg
egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|of|to|a|i|it|in|or|is'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
~ % # usage:
~ % sh wfg < 11.txt

~~hm, just seen above: sort -nr -> sort -n and then head -> tail => 208 :)~~
update2: erm, of course the above is silly, as it will be reversed then. So, 209.
update3: optimized the exclusion regexp -> 206

egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
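One way to convince yourself that the shortened exclusion regexp still matches exactly the same whole words is to compare the two forms directly. The sketch below does this in Python, using `fullmatch` as a stand-in for `egrep -w`; the list of extra test words is arbitrary.

```python
import re

long_form = re.compile(r"the|and|of|to|a|i|it|in|or|is")
short_form = re.compile(r"the|and|o[fr]|to|a|i[tns]?")

stop_words = "the and of to a i it in or is".split()
others = "alice if on at its".split()

for w in stop_words + others:
    # Both patterns must agree on every whole word.
    assert bool(long_form.fullmatch(w)) == bool(short_form.fullmatch(w))
```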

for fun, here's a perl-only version (much faster):

~ % wc -c pgolf
204 pgolf
~ % cat pgolf
perl -lne'$1=~/^(the|and|o[fr]|to|.|i[tns])$/i||$f{lc$1}++while/\b([a-z]+)/gi}{@w=(sort{$f{$b}<=>$f{$a}}keys%f)[0..21];$Q=$f{$_=$w[0]};$B=76-y///c;print" "."_"x$B;print"|"."_"x($B*$f{$_}/$Q)."| $_"for@w'
~ % # usage:
~ % sh pgolf < 11.txt

stor 2010-07-03 09:15:35

Impressive golfing!

ChristopheD 2010-07-03 11:49:07

Most impressive indeed.

Camilo Martin 2010-07-03 14:15:12

Answer 19

+33 A:

GolfScript, 177 175 173 167 164 163 144 131 130 chars

Slow - 3 minutes for the sample text (130)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{"
|"\~1*2/0*'| '@}/

Explanation:

{           #loop through all characters
 32|.       #convert to lowercase and duplicate
 123%97<    #determine if is a letter
 n@if       #return either the letter or a newline
}%          #return an array (of ints)
]''*        #convert array to a string with magic
n%          #split on newline, removing blanks (stack is an array of words now)
"oftoitinorisa"   #push this string
2/          #split into groups of two, i.e. ["of" "to" "it" "in" "or" "is" "a"]
-           #remove any occurrences from the text
"theandi"3/-#remove "the", "and", and "i"
$           #sort the array of words
(1@         #takes the first word in the array, pushes a 1, reorders stack
            #the 1 is the current number of occurrences of the first word
{           #loop through the array
 .3$>1{;)}if#increment the count or push the next word and a 1
}/
]2/         #gather stack into an array and split into groups of 2
{~~\;}$     #sort by the latter element - the count of occurrences of each word
22<         #take the first 22 elements
.0=~:2;     #store the highest count
,76\-:1     #store the length of the first line
'_':0*' '\@ #make the first line
{           #loop through each word
"
|"\~        #start drawing the bar
1*2/0       #divide by zero
*'| '@      #finish drawing the bar
}/
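For readers who do not speak GolfScript, the counting half of the explanation above can be followed in this Python sketch. It assumes ASCII input and is not a line-for-line port; `golfscript_steps` is a made-up name, and the chart drawing is left out.

```python
from itertools import groupby

def golfscript_steps(text, top=22):
    # 32| sets bit 5 (ASCII upper -> lower); anything that is not a-z
    # becomes a newline, mirroring `123%97<n@if`.
    chars = []
    for ch in text:
        code = ord(ch) | 32
        chars.append(chr(code) if code % 123 >= 97 else "\n")
    words = [w for w in "".join(chars).split("\n") if w]   # n% drops blanks
    stop = {"of", "to", "it", "in", "or", "is", "a", "the", "and", "i"}
    words = sorted(w for w in words if w not in stop)       # `-` then `$`
    # Run-length count over the sorted list, like the {.3$>1{;)}if}/ loop.
    counts = [(w, len(list(g))) for w, g in groupby(words)]
    counts.sort(key=lambda wc: -wc[1])                      # most frequent first
    return counts[:top]

print(golfscript_steps("The cat AND the Cat; a dog!"))  # [('cat', 2), ('dog', 1)]
```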

"Correct" (hopefully). (143)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{"
|"\~1*^/0*'| '@}/

Less slow - half a minute. (162)

'"'/' ':S*n/S*'"#{%q
'\+"
.downcase.tr('^a-z','
')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{"
|"\~1*2/0*'| '@}/

Output visible in revision logs.

Nabb 2010-07-03 09:52:29

About GolfScript: http://www.golfscript.com/golfscript/

Assaf Lavie 2010-07-03 16:12:56

Not correct, in that if the second word is really long it will wrap to the next line.

Gabe 2010-07-03 19:48:58

Figures golfscript wins :P

RCIX 2010-07-04 22:45:23

"divide by zero" ...GolfScript allows that?

JAB 2010-07-06 16:20:11

I like the explanation!

Cornelius 2010-07-13 12:48:27

Answer 20

+19 A:

Windows PowerShell, 199 chars

$x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *
filter f($w){' '+'_'*$w
$x[-1..-22]|%{"|$('_'*($w*$_.Count/$x[-1].Count))| "+$_.Name}}
f(76..1|?{!((f $_)-match'.'*80)})[0]

(The last line break isn't necessary, but included here for readability.)

(Current code and my test files available in my SVN repository. I hope my test cases catch most common errors (bar length, problems with regex matching and a few others))

Assumptions:

US ASCII as input. It probably gets weird with Unicode.
At least two non-stop words in the text

History

Relaxed version (137), since that's counted separately by now, apparently:

($x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *)[-1..-22]|%{"|$('_'*(76*$_.Count/$x[-1].Count))| "+$_.Name}

doesn't close the first bar
doesn't account for word length of non-first word

Variations of the bar lengths of one character compared to other solutions is due to PowerShell using rounding instead of truncation when converting floating-point numbers into integers. Since the task required only proportional bar length this should be fine, though.

Compared to other solutions I took a slightly different approach in determining the longest bar length by simply trying out and taking the highest such length where no line is longer than 80 characters.
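The try-every-width approach can be sketched outside PowerShell as well. The Python below is an illustration under stated assumptions: the name `widest_fit` is hypothetical, bars are truncated rather than rounded (unlike PowerShell), and a line is rejected once it reaches 80 characters, mirroring the `'.'*80` match.

```python
def widest_fit(ranked, limit=80, start=76):
    """ranked: (word, count) pairs sorted by descending count."""
    top = ranked[0][1]

    def render(width):
        lines = [" " + "_" * width]
        for word, count in ranked:
            lines.append("|" + "_" * (width * count // top) + "| " + word)
        return lines

    # Try 76, 75, ... and keep the first width where every line fits.
    for width in range(start, 0, -1):
        lines = render(width)
        if all(len(line) < limit for line in lines):
            return lines
    return []

chart = widest_fit([("she", 100), ("superlongstringstring", 90)])
print(len(chart[0]) - 1)  # 62
```

Rendering the whole chart up to 76 times is wasteful, but it handles a long word in any position without a separate formula.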

An older version explained can be found here.

Joey 2010-07-03 10:51:43

Impressive, seems Powershell is a suitable environment for golfing. Your approach considering the bar length is exactly what I tried to describe (not so brilliantly, I admit) in the spec.

ChristopheD 2010-07-03 11:51:31

@ChristopheD: In my experience (Anarchy Golf, some Project Euler tasks and some more tasks just for the fun of it), PowerShell is usually only slightly worse than Ruby and often tied with or better than Perl and Python. No match for GolfScript, though. But as far as I can see, this might be the shortest solution that correctly accounts for bar lengths ;-)

Joey 2010-07-03 11:57:44

Apparently I was right. Powershell *can* do better -- much better! Please provide an expanded version with comments.

Gabe 2010-07-03 12:02:57

Johannes: Did you try `-split("\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z]")`? It works for me.

Gabe 2010-07-03 12:20:03

Don't forget to interpolate the output string: `"|$('_'*($w*$_.count/$x[0].count))| $($_.name) "` (or eliminate the last space, as it's sort of automatic). And you can use `-split("(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z])+")` to save a few more by not including blanks (or use `[-2..-23]`).

Gabe 2010-07-03 12:47:56

Note that without the trailing space you need to match `.{80}`. And you can guarantee that blanks will always be first like this: `"\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z]()"` (the empty capturing group ensures a blank for every word)

Gabe 2010-07-03 13:10:56

So if you have `"$input"` do you still need the `()`? Also, now that you eliminated the trailing space, you can save a couple strokes by not interpolating the name: `"|$('_'*($w*$_.count/$x[0].count))| "+$_.name`. We'll get to 200 yet!

Gabe 2010-07-03 14:25:40

With the elimination of .ToString, it's now back under 200!

Gabe 2010-07-03 21:20:50

@Gabe: Yay, thank you. And good catch on `function` vs. `filter`. I thought about using `filter`, I just didn't think of the fact that filters can take arguments too. For me it was a comparison between `function f($w){...}` and `filter{$w=$_;...}` (since I definitely need a loop in the function and therefore can't leave the argument as `$_`. Nice trick to remember, thanks :-). Still, I think this approach has been golfed almost to death by now. [And I notice we killed it somewhere in between ... my other two test cases don't run anymore – debugging ...]

Joey 2010-07-03 22:35:03

One could argue that you're making it a little *too* general, but I'm not going to complain as long as it's still under 200.

Gabe 2010-07-04 15:00:54

@Gabe: Well, I've revisited my assumptions concerning at least one non-stop word already. But the `\b` problem was clearly against the spec and only happened to work for the test input.

Joey 2010-07-04 15:33:55

Answer 21

+35 A:

Ruby 1.9, 185 chars

(heavily based on the other Ruby solutions)

w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
k,l=w[0]
puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]

Instead of using any command line switches like the other solutions, you can simply pass the filename as argument. (i.e. ruby1.9 wordfrequency.rb Alice.txt)

Since I'm using character-literals here, this solution only works in Ruby 1.9.

Edit: Replaced semicolons by line breaks for "readability". :P

Edit 2: Shtééf pointed out I forgot the trailing space - fixed that.

Edit 3: Removed the trailing space again ;)

Ventero 2010-07-03 11:38:27

It's missing the trailing space, after each word.

Shtééf 2010-07-03 12:11:42

Aww shoot, disregard that. Looks like the golf was just updated, trailing space no longer required. :)

Shtééf 2010-07-03 12:19:36

Does not seem to accommodate 'superlongstringstring' in 2nd or later position? (see problem description)

Nas Banov 2010-07-06 04:49:11

That looks really maintainable.

Zombies 2010-07-14 18:43:18

Answer 22

+2 A:

R 449 chars

can probably get shorter...

bar <- function(w, l)
    {
    b <- rep("-", l)
    s <- rep(" ", l)
    cat(" ", b, "\n|", s, "| ", w, "\n ", b, "\n", sep="")
    }

f <- "alice.txt"
e <- c("the", "and", "of", "to", "a", "i", "it", "in", "or", "is", "")
w <- unlist(lapply(readLines(file(f)), strsplit, s=" "))
w <- tolower(w)
w <- unlist(lapply(w, gsub, pa="[^a-z]", r=""))
u <- unique(w[!w %in% e])
n <- unlist(lapply(u, function(x){length(w[w==x])}))
o <- rev(order(n))
n <- n[o]
m <- 77 - max(unlist(lapply(u[1:22], nchar)))
n <- floor(m*n/n[1])
u <- u[o]

for (i in 1:22)
    bar(u[i], n[i])

nico 2010-07-03 12:27:28

@Johannes Rössel: It is dynamic, just scaled to 100% = 60px = max length. E.g.: 1st word = 50 occurrences, 2nd word = 25 occurrences. 1st bar = 60 px, 2nd bar = 30 px

nico 2010-07-03 14:46:11

@Johannes Rössel: Ok, I didn't read the part that said you should maximise the length, thought it just needed to fit 80 chars... now it works as intended :) Thanks for spotting that

nico 2010-07-03 16:37:13

Well, it's the one thing most often done wrong in the answers here, I think. Took me also quite a while to figure out an elegant way of doing so.

Joey 2010-07-03 16:41:50

Answer 23

+1 A:

Groovy, 424 389 378 321 chars

replaced b=map.get(a) with b=map[a], replaced split with matcher / iterator

def r,s,m=[:],n=0;def p={println it};def w={"_".multiply it};(new URL(this.args[0]).text.toLowerCase()=~/\b\w+\b/).each{s=it;if(!(s==~/(the|and|of|to|a|i[tns]?|or)/))m[s]=m[s]==null?1:m[s]+1};m.keySet().sort{a,b->m[b]<=>m[a]}.subList(0,22).each{k->if(n++<1){r=(m[k]/(76-k.length()));p" "+w(m[k]/r)};p"|"+w(m[k]/r)+"|"+k}

(executed as groovy script with the URL as cmd line arg. No imports required!)

Readable version here:

def r,s,m=[:],n=0;
def p={println it};
def w={"_".multiply it};
(new URL(this.args[0]).text.toLowerCase()
        =~ /\b\w+\b/
        ).each{
        s=it;
        if (!(s ==~/(the|and|of|to|a|i[tns]?|or)/))
            m[s] = m[s] == null ? 1 : m[s] + 1
        };
    m.keySet()
        .sort{
            a,b -> m[b] <=> m[a]
        }
        .subList(0,22).each{
            k ->
                if( n++ < 1 ){
                    r=(m[k]/(76-k.length()));
                    p " " + w(m[k]/r)
                };
                p "|" + w(m[k]/r) + "|" + k
}

seanizer 2010-07-03 13:12:32

Answer 24

+5 A:

Scala, 368 chars

First, a legible version in 592 characters:

object Alice {
  def main(args:Array[String]) {
    val s = io.Source.fromFile(args(0))
    val words = s.getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase)
    val freqs = words.foldLeft(Map[String, Int]())((countmap, word)  => countmap + (word -> (countmap.getOrElse(word, 0)+1)))
    val sortedFreqs = freqs.toList.sort((a, b)  => a._2 > b._2)
    val top22 = sortedFreqs.take(22)
    val highestWord = top22.head._1
    val highestCount = top22.head._2
    val widest = 76 - highestWord.length
    println(" " + "_" * widest)
    top22.foreach(t => {
      val width = Math.round((t._2 * 1.0 / highestCount) * widest).toInt
      println("|" + "_" * width + "| " + t._1)
    })
  }
}

The console output looks like this:

$ scalac alice.scala 
$ scala Alice aliceinwonderland.txt
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

We can do some aggressive minifying and get it down to 415 characters:

object A{def main(args:Array[String]){val l=io.Source.fromFile(args(0)).getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase).foldLeft(Map[String, Int]())((c,w)=>c+(w->(c.getOrElse(w,0)+1))).toList.sort((a,b)=>a._2>b._2).take(22);println(" "+"_"*(76-l.head._1.length));l.foreach(t=>println("|"+"_"*Math.round((t._2*1.0/l.head._2)*(76-l.head._1.length)).toInt+"| "+t._1))}}

The console session looks like this:

$ scalac a.scala 
$ scala A aliceinwonderland.txt
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

I'm sure a Scala expert could do even better.

Update: In the comments Thomas gave an even shorter version, at 368 characters:

object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}}

Legibly, at 375 characters:

object Alice {
  def main(a:Array[String]) {
    val t = (Map[String, Int]() /: (
      for (
        x <- io.Source.fromFile(a(0)).getLines;
        y <- "(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(x)
      ) yield y.toLowerCase
    ).toList)((c, x) => c + (x -> (c.getOrElse(x, 0) + 1))).toList.sortBy(_._2).reverse.take(22)
    val w = 76 - t.head._1.length
    print (" "+"_"*w)
    t.map(s => "\n|" + "_" * (s._2 * w / t.head._2) + "| " + s._1).foreach(print)
  }
}

pr1001 2010-07-03 14:02:20

383 chars: `object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}}`

Thomas Jung 2010-07-06 11:53:50

Of course, the ever handy for comprehension! Nice!

pr1001 2010-07-07 11:52:17

Answer 25

+2 A:

Python 2.6, 273 269 267 266 characters.

(Edit: Props to ChristopheD for character-shaving suggestions)

import sys,re
t=re.findall('[a-z]+',"".join(sys.stdin).lower())
d=sorted((t.count(w),w)for w in set(t)-set("the and of to a i it in or is".split()))[:-23:-1]
r=min((78.-len(m[1]))/m[0]for m in d)
print'','_'*(int(d[0][0]*r-2))
for(a,b)in d:print"|"+"_"*(int(a*r-2))+"|",b

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so

Astatine 2010-07-03 14:04:26

You could drop the square brackets in `r=min([(78.0-len(m[1]))/m[0] for m in d])` (shaves off 2 characters: `min((78.0-len(m[1]))/m[0] for m in d)`). The same goes for the square brackets in line three: `sorted([...`

ChristopheD 2010-07-03 14:13:02

Also in line three and four you can lose an unneeded space just before `for` (shaves off 2 characters).

ChristopheD 2010-07-03 14:14:44

Aha! Thanks for that.

Astatine 2010-07-03 14:26:03

I like the way you abuse this `print'',` to print the starting space on the first line; clever ;-)

ChristopheD 2010-07-03 14:34:53

Just realised I didn't need a following zero to declare a float on the fourth line. Is this the only Python entry that takes into account that some words might be significantly longer than the most common one?

Astatine 2010-07-03 16:54:31
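
For anyone following the long-word point, here is a minimal Python 3 sketch of the idea behind `r=min(...)` in the answer above: pick one scale factor for the whole chart so that the row with the longest "bar plus word" still fits. The names `scale` and `bars` and the sample counts are made up for illustration, and the budget of 76 columns shared between bar and word is an assumption taken from the other answers in this thread.

```python
def scale(pairs, width=76):
    """Underscores per occurrence such that every bar plus its word fits in width.

    pairs is a list of (count, word) tuples.
    """
    return min((width - len(word)) / count for count, word in pairs)


def bars(pairs, width=76):
    """Return (bar_length, word) for each pair, all using the same scale."""
    r = scale(pairs, width)
    return [(int(count * r), word) for count, word in pairs]


# Hypothetical counts: the second word is long enough to limit the scale
sample = [(100, "she"), (90, "superlongstringstring"), (50, "said")]
for n, word in bars(sample):
    print("|" + "_" * n + "| " + word)
```

With this sample the second row, not the first, decides the scale, which is the case the simpler "76 minus the length of the top word" entries do not cover.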

Instead of 78 you can use 76, saving two "-2"s; instead of m[0],m[1] you can use w and r by doing "for w,r in d". You can use \w instead of [a-z]. sys.stdin.read() is shorter. I like the idea of using commas!

6502 2010-07-05 23:13:49

Good points; however \w matches underscores, which is why I didn't use it.

Astatine 2010-07-07 11:15:21
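
A quick Python 3 sketch of the difference being discussed, in case it helps: `\w` also matches digits and underscores, while `[a-z]` (after lower-casing) keeps letters only. The sample string is made up.

```python
import re

text = "snake_case and CamelCase, don't 42".lower()

# \w matches letters, digits and the underscore
words_w = re.findall(r"\w+", text)
# [a-z] matches lowercase ASCII letters only
words_az = re.findall(r"[a-z]+", text)

print(words_w)
print(words_az)
```

Either way "don't" is split at the apostrophe, which would explain the lone `s` and `t` rows in several of the outputs in this thread.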

Answer 26

+11 A:

perl, 205 191 189 characters/ 205 characters (fully implemented)

Some parts were inspired by the earlier perl/ruby submissions, a couple similar ideas were arrived at independently, the others are original. Shorter version also incorporates some things I saw/learned from other submissions.

Original:

$k{$_}++for grep{$_!~/^(the|and|of|to|a|i|it|in|or|is)$/}map{lc=~/[a-z]+/g}<>;@t=sort{$k{$b}<=>$k{$a}}keys%k;$l=76-length$t[0];printf" %s
",'_'x$l;printf"|%s| $_
",'_'x int$k{$_}/$k{$t[0]}*$l for@t[0..21];

~~Latest version down to~~ 191 characters:

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-y///c)/$k{$_=$e[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@e[0,0..21]

Latest version down to 189 characters:

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@_=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-m//)/$k{$_=$_[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@_[0,0..21]

This version (205 char) accounts for the lines with words longer than what would be found later.

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;($r)=sort{$a<=>$b}map{(76-y///c)/$k{$_}}@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
";}@e[0,0..21]

pdehaan 2010-07-03 14:37:48

Answer 27

+10 A:

Perl: 203 202 201 198 195 208 203 / 231 chars

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&warn" "."_"x($z*$y)."\n";printf"|%.78s\n","_"x($z*$y)."| $_"}(sort{$x{$b}<=>$x{$a}}keys%x)[0..21]

Alternate, full implementation including indicated behaviour (global bar-squishing) for the pathological case in which the secondary word is both popular and long enough to combine to over 80 chars (this implementation is 231 chars):

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;@e=(sort{$x{$b}<=>$x{$a}}keys%x)[0..21];for(@e){$p=(76-y///c)/$x{$_};($y&&$p>$y)||($y=$p)}warn" "."_"x($x{$e[0]}*$y)."\n";for(@e){warn"|"."_"x($x{$_}*$y)."| $_\n"}

The specification didn't state anywhere that this had to go to STDOUT, so I used perl's warn() instead of print - four characters saved there. Used map instead of foreach, but I feel like there could still be some more savings in the split(join()). Still, got it down to 203 - might sleep on it. At least Perl's now under the "shell, grep, tr, grep, sort, uniq, sort, head, perl" char count for now ;)

PS: Reddit says "Hi" ;)

Update: Removed join() in favour of assignment and implicit scalar conversion join. Down to 202. Also please note I have taken advantage of the optional "ignore 1-letter words" rule to shave 2 characters off, so bear in mind the frequency count will reflect this.

Update 2: Swapped out assignment and implicit join for killing $/ to get the file in one gulp using <> in the first place. Same size, but nastier. Swapped out if(!$y){} for $y||{}&&, saved 1 more char => 201.

Update 3: Took control of lowercasing early (lc<>) by moving lc out of the map block - Swapped out both regexes to no longer use /i option, as no longer needed. Swapped explicit conditional x?y:z construct for traditional perlgolf || implicit conditional construct - /^...$/i?1:$x{$_}++ for /^...$/||$x{$_}++ Saved three characters! => 198, broke the 200 barrier. Might sleep soon... perhaps.

Update 4: Sleep deprivation has made me insane. Well. More insane. Figuring that this only has to parse normal happy text files, I made it give up if it hits a null. Saved two characters. Replaced "length" with the 1-char shorter (and much more golfish) y///c - you hear me, GolfScript?? I'm coming for you!!! sob

Update 5: Sleep dep made me forget about the 22row limit and subsequent-line limiting. Back up to 208 with those handled. Not too bad, 13 characters to handle it isn't the end of the world. Played around with perl's regex inline eval, but having trouble getting it to both work and save chars... lol. Updated the example to match current output.

Update 6: Removed unneeded braces protecting (...)for, since the syntactic candy ++ allows shoving it up against the for happily. Thanks to input from Chas. Owens (reminding my tired brain), got the character class i[tns] solution in there. Back down to 203.

Update 7: Added second piece of work, full implementation of specs (including the full bar-squishing behaviour for secondary long-words, instead of truncation which most people are doing, based on the original spec without the pathological example case)

Examples:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

Alternative implementation in pathological case example:

 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| with
|_________________________| at
|_______________________| on
|______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| so
|________________| very
|________________| what

Syntaera 2010-07-03 15:33:14

You can shorten the regex for the stop words by collapsing `is|in|it|i` into `i[snt]?` – and then there's no difference with the optional rule anymore. (Hm, I never would have thought about telling a Perl guy how to do Regex :D) – only problem now: I have to look how I can shave off three bytes from my own solution to be better than Perl again :-|

Joey 2010-07-03 16:44:35

Ok, disregard part of what I said earlier. Ignoring one-letter words is indeed a byte shorter than not doing it.

Joey 2010-07-03 16:55:23
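
To make the two regex tricks concrete, here is a small Python 3 check (the candidate word list is made up): `i[tns]?` accepts exactly the same stop words as spelling out `i|it|in|is`, whereas the `.|i[tns]` form additionally throws away every other one-letter word.

```python
import re

long_form = re.compile(r"^(the|and|of|to|a|i|it|in|or|is)$")
short_form = re.compile(r"^(the|and|of|to|a|i[tns]?|or)$")
# Variant that drops every one-letter word via "."
drop_single = re.compile(r"^(the|and|of|to|.|i[tns]|or)$")

candidates = ["the", "and", "of", "to", "a", "i", "it", "in", "or", "is",
              "s", "t", "its", "into", "alice"]

# Words on which the long and the collapsed form disagree (none expected)
differs = [w for w in candidates
           if bool(long_form.match(w)) != bool(short_form.match(w))]

# Words rejected only by the one-letter variant
extra = [w for w in candidates
         if drop_single.match(w) and not long_form.match(w)]

print(differs)
print(extra)
```

So the shorter pattern changes the counts only through stray single letters such as `s` and `t`.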

Every byte counts ;) I considered doing the newline trick, but I figured it was actually the same number of bytes, even if it was fewer printable characters. Still working on seeing if I can shrink it some more :)

Syntaera 2010-07-03 17:16:39

Ah well, case normalization threw me back to 209. I don't see what else I could cut. Although PowerShell *can* be shorter than Perl. ;-)

Joey 2010-07-03 18:01:43

I don't see where you restrict the output to the top 22 words, nor where you make sure that a long second word doesn't wrap.

Gabe 2010-07-03 19:19:14

Had to sleep on it :)

Syntaera 2010-07-04 00:53:36

You can save even more by using `say`: `perl -E '$/=\0;map{/^(the|and|of|to|.|it|in|or|is|)$/||$x{$_}++}split(/[^a-z]/,lc<>);map{$z=$x{$_};$y||{$y=(76-y///c)/$z}say"|"."_"x($z*$y)."| $_"}sort{$x{$b}<=>$x{$a}}keys%x'`

Chas. Owens 2010-07-04 03:43:19

Even more by using a character class and using for instead of `map` where possible:`perl -E '$/=\0;/^(the|and|of|to|.|i[tns]|or)$/||$x{$_}++for split(/[^a-z]/,lc<>);map{$z=$x{$_};$y||{$y=(76-y///c)/$z}say"|"."_"x($z*$y)."| $_"}sort{$x{$b}<=>$x{$a}}keys%x'`

Chas. Owens 2010-07-04 03:51:25

Thanks for that - I came to the same conclusion about for, but also got rid of split(), just using a bare regex instead for it. Back down to 203!

Syntaera 2010-07-04 04:50:54

Answer 28

+2 A:

C++, 647 chars

I don't expect to score highly by using C++, but never mind that. I'm pretty sure it hits all the requirements. Note that I used the C++0x auto keyword for variable declaration, so adjust your compiler appropriately if you decide to test my code.

Minimised version

#include <iostream>
#include <cstring>
#include <map>
using namespace std;
#define C string
#define S(x)v=F/a,cout<<#x<<C(v,'_')
#define F t->first
#define G t->second
#define O &&F!=
#define L for(i=22;i-->0;--t)
int main(){map<C,int>f;char d[230];int i=1,v;for(;i<256;i++)d[i<123?i-1:i-27]=i;d[229]=0;char w[99];while(cin>>w){for(i=0;w[i];i++)w[i]=tolower(w[i]);char*p=strtok(w,d);while(p)++f[p],p=strtok(0,d);}multimap<int,C>c;for(auto t=f.end();--t!=f.begin();)if(F!="the"O"and"O"of"O"to"O"a"O"i"O"it"O"in"O"or"O"is")c.insert(pair<int,C>(G,F));auto t=--c.end();float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;t=--c.end();S( );L S(\n|)<<"| "<<G;}

Here's a second version that is more "C++" by using string, not char[] and strtok. It's a bit larger, at 669 (+22 vs above), but I can't get it smaller at the moment so thought I'd post it anyway.

#include <iostream>
#include <map>
using namespace std;
#define C string
#define S(x)v=F/a,cout<<#x<<C(v,'_')
#define F t->first
#define G t->second
#define O &&F!=
#define L for(i=22;i-->0;--t)
#define E e=w.find_first_of(d,g);g=w.find_first_not_of(d,e);
int main(){map<C,int>f;int i,v;C w,x,d="abcdefghijklmnopqrstuvwxyz";while(cin>>w){for(i=w.size();i-->0;)w[i]=tolower(w[i]);unsigned g=0,E while(g-e>0){x=w.substr(e,g-e),++f[x],E}}multimap<int,C>c;for(auto t=f.end();--t!=f.begin();)if(F!="the"O"and"O"of"O"to"O"a"O"i"O"it"O"in"O"or"O"is")c.insert(pair<int,C>(G,F));auto t=--c.end();float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;t=--c.end();S( );L S(\n|)<<"| "<<G;}

I've removed the full version, because I can't be bothered to keep updating it with my tweaks to the minimised version. See edit history if you're interested in the (possibly outdated) long version.

DMA57361 2010-07-03 15:47:17

If you're going to put an arbitrary limit on word length, you might as well make it 999 instead of 1024 and save a stroke.

Gabe 2010-07-03 21:38:38

If you use `float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;` you can eliminate a #define and shave a few strokes.

Gabe 2010-07-03 21:40:36

@Gabe - thanks for that second one, trimmed a few extra away. As for `word`, having an arbitrary length doesn't really feel right - but I'm not sure of the best way to extract `cin` into a `char` array, as opposed to a `string`, without the risk of breaking in the middle of a word (ie, if I just pulled it in 80-char chunks). But I've put finding a "better" solution until probably tomorrow.

DMA57361 2010-07-03 22:45:02

Isn't `d[i-27]=0;` the same as `d[229]=0;`?

Gabe 2010-07-03 23:01:17

Why did you decide to use a char buffer instead of a string?

EvilTeach 2010-07-03 23:23:57

You could save a space by making `L{A=F/(76.0-G.length()),a=a>A?a:A;}` into `L A=F/(76.0-G.length()),a=a>A?a:A;`.

Gabe 2010-07-04 07:37:25

@EvilTeach - so I could use strtok. I'm not aware of a C++ string tokenising function (see http://stackoverflow.com/questions/53849/how-do-i-tokenize-a-string-in-c) that would take "lots" of delimiters, and needed a reliable method to split words like "don't" on the punctuation. @Gabe - nice catch (again, thanks!) on d[229], as for the second suggestion - you'd already given that earlier and I obviously hadn't paid sufficient attention...

DMA57361 2010-07-04 14:29:08

Answer 29

+1 A:

Python, 320 characters

import sys
i="the and of to a i it in or is".split()
d={}
for j in filter(lambda x:x not in i,sys.stdin.read().lower().split()):d[j]=d.get(j,0)+1
w=sorted(d.items(),key=lambda x:x[1])[:-23:-1]
m=sorted(dict(w).values())[-1]
print" %s\n"%("_"*(76-m)),"\n".join(map(lambda x:("|%s| "+x[0])%("_"*((76-m)*x[1]/w[0][1])),w))

dhruvbird 2010-07-03 16:38:16

Answer 30

+8 A:

Python 3.1 - 245 229 characters

I guess using Counter is kind of cheating :) I just read about it about a week ago, so this was the perfect chance to see how it works.

import re,collections
o=collections.Counter([w for w in re.findall("[a-z]+",open("!").read().lower())if w not in"a and i in is it of or the to".split()]).most_common(22)
print('\n'.join('|'+76*v//o[0][1]*'_'+'| '+k for k,v in o))

Prints out:

|____________________________________________________________________________| she
|__________________________________________________________________| you
|_______________________________________________________________| said
|_______________________________________________________| alice
|_________________________________________________| was
|_____________________________________________| that
|_____________________________________| as
|__________________________________| her
|_______________________________| with
|_______________________________| at
|______________________________| s
|_____________________________| t
|____________________________| on
|___________________________| all
|________________________| this
|________________________| for
|________________________| had
|________________________| but
|______________________| be
|______________________| not
|_____________________| they
|____________________| so

Some of the code was "borrowed" from AKX's solution.

sdolan 2010-07-03 17:13:20

The first line is missing. And the bar length isn't correct.

Joey 2010-07-03 17:38:55

Missed the first bar requirement and I see my bars are off. Thanks for the feedback. This is my first time :)

sdolan 2010-07-03 17:52:12
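
For reference, an ungolfed Python 3 sketch of how the `Counter` approach could emit the missing top line and leave room for the word. This is not sdolan's code: the function name `chart`, the `STOP` set and the sample sentence are made up, and it scales by the top word only, as most entries in this thread do.

```python
import re
from collections import Counter

STOP = set("the and of to a i it in or is".split())


def chart(text, rows=22, width=76):
    words = [w for w in re.findall("[a-z]+", text.lower()) if w not in STOP]
    top = Counter(words).most_common(rows)
    if not top:
        return ""
    first_word, first_count = top[0]
    # Longest bar: what is left of the width after the top word
    longest = width - len(first_word)
    # Top line: a space, then the roof of the first bar
    lines = [" " + "_" * longest]
    # Integer arithmetic: multiply first, then floor-divide
    lines += ["|" + "_" * (n * longest // first_count) + "| " + w
              for w, n in top]
    return "\n".join(lines)


print(chart("the cat sat on the mat and the cat ran"))
```

It takes the text as a string, so reading from a file named "!" or from stdin is left to the caller.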

In your code it seems that `open('!')` reads from stdin - which version/OS is that on? Or do you have to name the file '!'?

Nas Banov 2010-07-03 21:22:44

Name the file "!" :) Sorry that was pretty unclear, and I should have mentioned it.

sdolan 2010-07-03 21:56:53

Answer 31

+1 A:

MATLAB 404 ~~410 bytes~~ ~~357 bytes.~~ ~~390 bytes.~~

This version is a bit longer, however, it will properly scale the length of the bars if there is a word that is ridiculously long so that none of the columns go over 80.

So, my code is 357 bytes without re-scaling, and 410 long with re-scaling.

A=textscan(fopen('11.txt'),'%s','delimiter',' 0123456789,.!?-_*^:;=+\\/(){}[]@&#$%~`|"''');
s=lower(A{1});s(cellfun('length', s)<2)=[];s(ismember(s,{'the','and','of','to','it','in','or','is'}))=[];
[w,~,i]=unique(s);N=hist(i,max(i)); [j,k]=sort(N,'descend'); b=k(1:22); n=cellfun('length',w(b));
q=80*N(b)'/N(k(1))+n; q=floor(q*78/max(q)-n); for i=1:22, fprintf('%s| %s\n',repmat('_',1,q(i)),w{k(i)});end

Results:

___________________________________________________________________________| she
_________________________________________________________________| you
______________________________________________________________| said
_______________________________________________________| alice
________________________________________________| was
____________________________________________| that
_____________________________________| as
_________________________________| her
______________________________| at
______________________________| with
____________________________| on
___________________________| all
_________________________| this
________________________| for
________________________| had
________________________| but
_______________________| be
_______________________| not
_____________________| they
____________________| so
___________________| very
___________________| what

For example, replacing all instances of "you" in the Alice in Wonderland text with "superlongstringofridiculousness", my code will correctly scale the results:

_____________________________________________________________| she
_______________________________________________| superlongstringofridiculousness
__________________________________________________| said
____________________________________________| alice
_______________________________________| was
____________________________________| that
______________________________| as
___________________________| her
_________________________| at
________________________| with
______________________| on
_____________________| all
___________________| this
___________________| for
___________________| had
___________________| but
__________________| be
__________________| not
________________| they
________________| so
_______________| very
_______________| what

Here is the code written a little bit more legibly:

A=textscan(fopen('11.txt'),'%s','delimiter',' 0123456789,.!?-_*^:;=+\\/(){}[]@&#$%~`|"''');
s=lower(A{1});s(cellfun('length', s)<2)=[];s(ismember(s,{'the','and','of','to','it','in','or','is'}))=[];
[w,~,i]=unique(s);N=hist(i,max(i)); [j,k]=sort(N,'descend');
b=k(1:22);
n=cellfun('length',w(b));
q=ceil(80*N(b)/N(k(1)))'+n;
q=floor(q*78/max(q)-n);

for i=1:22,
  fprintf('%s| %s\n',repmat('_',1,q(i)),w{k(i)}); 
end

reso 2010-07-03 19:40:12

Kudos for implementing the spec completely! (I would upvote but I've run out of votes for today...)

ChristopheD 2010-07-03 21:01:59

shouldn't the bar for "superlongstringofridiculousness" be longer than the bar for "said"?

Bwmat 2010-07-03 21:57:01

@Bwmat: ahh!! good eye! back to the drawing board...

reso 2010-07-04 07:51:43

@reso: you could save 18 chars by replacing the delimiter string with: `char([32:64 91:96 123:126])`

Amro 2010-07-30 20:57:06
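
To see what those three ranges cover, here is a small Python 3 sketch that builds the same character set (the variable names are made up). It comes out as every printable ASCII character that is not a letter, so it is a superset of the hand-typed delimiter string.

```python
import string

# Inclusive code-point ranges, mirroring 32:64, 91:96 and 123:126
ranges = [(32, 64), (91, 96), (123, 126)]
delims = "".join(chr(c) for start, end in ranges for c in range(start, end + 1))

print(repr(delims))
print(len(delims))
```

Note that the set includes the apostrophe, so "don't" still splits into "don" and "t".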

@Amro: hey, thanks for the tip, that is great. One day I will go back and fix the bug that Bwmat spotted and add that to it as well

reso 2010-08-23 15:43:19

Answer 32

+11 A:

Haskell - 366 351 344 337 333 characters

(One line break in main added for readability, and no line break needed at end of last line.)

import Data.List
import Data.Char
l=length
t=filter
m=map
f c|isAlpha c=toLower c|0<1=' '
h w=(-l w,head w)
x!(q,w)='|':replicate(minimum$m(q?)x)'_'++"| "++w
q?(g,w)=q*(77-l w)`div`g
b x=m(x!)x
a(l:r)=(' ':t(=='_')l):l:r
main=interact$unlines.a.b.take 22.sort.m h.group.sort
  .t(`notElem`words"the and of to a i it in or is").words.m f

How it works is best seen by reading the argument to interact backwards:

map f lowercases alphabetics, replaces everything else with spaces.
words produces a list of words, dropping the separating whitespace.
filter (`notElem` words "the and of to a i it in or is") discards all entries with forbidden words.
group . sort sorts the words, and groups identical ones into lists.
map h maps each list of identical words to a tuple of the form (-frequency, word).
take 22 . sort sorts the tuples by descending frequency (the first tuple entry), and keeps only the first 22 tuples.
b maps tuples to bars (see below).
a prepends the first line of underscores, to complete the topmost bar.
unlines joins all these lines together with newlines.
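
For readers who don't speak Haskell, the counting half of the steps above can be sketched in Python 3 roughly as follows. This is a loose translation, not the author's code; `top_words` and `STOP` are made-up names, and the bar drawing (`b` and `a`) is left out.

```python
from itertools import groupby

STOP = "the and of to a i it in or is".split()


def top_words(text, limit=22):
    # map f: lowercase the letters, turn everything else into a space
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    # words, then filter out the forbidden words
    kept = [w for w in cleaned.split() if w not in STOP]
    # group . sort, then map h: one (-frequency, word) tuple per distinct word
    pairs = [(-len(list(group)), word) for word, group in groupby(sorted(kept))]
    # take 22 . sort: ascending on -frequency puts the most frequent first
    return sorted(pairs)[:limit]


print(top_words("The cat and the hat: a cat, a hat, a cat!"))
```

Storing the negated count means a plain ascending sort is enough, with ties falling back to alphabetical order of the word.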

The tricky bit is getting the bar length right. I assumed that only underscores counted towards the length of the bar, so || would be a bar of zero length. The function b maps (x!) over x, where x is the list of histogram entries. The entire list is passed to !, so that each invocation of ! can compute the scale factor for itself by calling ?. In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.

Note the trick of using -frequency. This removes the need to reverse the sort, since sorting -frequency in ascending order places the words with the largest frequency first. Later, in the function ?, one -frequency value is divided by another, which cancels the negation out.

Thomas 2010-07-03 19:46:33

Very nice work (would upvote but ran out of votes for today with all the great answers in this thread).

ChristopheD 2010-07-03 19:50:52

This hurts my eyes in a way that's painful even to think about describing, but I learned a lot of Haskell by reverse-engineering it into legible code. Well done, sir. :-)

Owen S. 2010-07-04 05:53:35

It's actually fairly idiomatic Haskell still, albeit not really efficient. The short names make it look far worse than it really is.

Thomas 2010-07-04 08:26:36

@Thomas: You can say that again. :-)

Owen S. 2010-07-04 08:34:04

u q(g,w)=q*div(77-l w)g -- can save you 2 chars

Edward Kmett 2010-07-05 22:25:41

@MtnViewMark: Nice work! I didn't know that `words` discards runs of whitespace, nor that you can put `|` conditions onto one line. And I can't believe I put a two-letter variable name in there...

Thomas 2010-07-06 07:16:29

Can't move the `div`, actually! Try it: the output is wrong. The reason is that doing the `div` before the `*` loses precision.

MtnViewMark 2010-07-06 21:21:38
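
A small Python 3 illustration of the precision point, using `//`, which floors like Haskell's `div`. The function names and the sample frequencies are made up; the 77 and the negated frequencies follow the answer.

```python
def bar_mul_first(q, g, word):
    # Grouping used in the answer: multiply, then floor-divide
    return q * (77 - len(word)) // g


def bar_div_first(q, g, word):
    # Grouping of the suggested variant: the quotient is floored too early
    return q * ((77 - len(word)) // g)


# q and g are negated frequencies: this word occurs 50 times, the top word 100
good = bar_mul_first(-50, -100, "she")
bad = bar_div_first(-50, -100, "she")
print(good, bad)
```

Here the early division floors 74 / -100 down to -1, so the scale information is gone before the multiplication happens.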

Ah, whoops, got precedences wrong. Should've tested before editing :P

Thomas 2010-07-07 12:54:30

@trinithis: It's shorter alright, but now I don't understand how it works any longer! I'm afraid you moved beyond my understanding of Haskell. Why is a bang pattern needed? What does the question mark even mean?

Thomas 2010-07-10 10:40:34

It's not a bang pattern :D. All I did was change binary functions to infix operators. I just chose to use `?` and `!` for the operator names.

trinithis 2010-07-10 15:45:49

Ah, brilliant :D

Thomas 2010-07-11 09:51:25

Answer 33

A:

Python, 250 chars

borrowing from all the other Python snippets

import re,sys
t=re.findall("\w+","".join(sys.stdin).lower())
W=sorted((-t.count(w),w)for w in set(t)-set("the and of to a i it in or is".split()))[:22]
Z,U=W[0],lambda n:"_"*int(n*(76.-len(Z[1]))/Z[0])
print"",U(Z[0])
for(n,w)in W:print"|"+U(n)+"|",w

If you're cheeky and put the words to avoid as arguments, 223 chars

import re,sys
t=re.findall("\w+","".join(sys.stdin).lower())
W=sorted((-t.count(w),w)for w in set(t)-set(sys.argv[1:]))[:22]
Z,U=W[0],lambda n:"_"*int(n*(76.-len(Z[1]))/Z[0])
print"",U(Z[0])
for(n,w)in W:print"|"+U(n)+"|",w

Output is:

$ python alice4.py  the and of to a i it in or is < 11.txt 
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so

Will 2010-07-03 23:04:36

This doesn't handle the problem of having the scale determined by a word that is not the most frequent one.

6502 2010-07-04 08:04:36

Answer 34

+7 A:

PHP CLI version (450 chars)

This solution takes into account the last requirement, which most purists have conveniently chosen to ignore. That cost 170 characters!

Usage: php.exe <this.php> <file.txt>

Minified:

<?php $a=array_count_values(array_filter(preg_split('/[^a-z]/',strtolower(file_get_contents($argv[1])),-1,1),function($x){return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);}));arsort($a);$a=array_slice($a,0,22);function R($a,$F,$B){$r=array();foreach($a as$x=>$f){$l=strlen($x);$r[$x]=$b=$f*$B/$F;if($l+$b>76)return R($a,$f,76-$l);}return$r;}$c=R($a,max($a),76-strlen(key($a)));foreach($a as$x=>$f)echo '|',str_repeat('-',$c[$x]),"| $x\n";?>

Human readable:

<?php

// Read:
$s = strtolower(file_get_contents($argv[1]));

// Split:
$a = preg_split('/[^a-z]/', $s, -1, PREG_SPLIT_NO_EMPTY);

// Remove unwanted words:
$a = array_filter($a, function($x){
       return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);
     });

// Count:
$a = array_count_values($a);

// Sort:
arsort($a);

// Pick top 22:
$a=array_slice($a,0,22);


// Recursive function to adjust bar widths
// according to the last requirement:
function R($a,$F,$B){
    $r = array();
    foreach($a as $x=>$f){
        $l = strlen($x);
        $r[$x] = $b = $f * $B / $F;
        if ( $l + $b > 76 )
            return R($a,$f,76-$l);
    }
    return $r;
}

// Apply the function:
$c = R($a,max($a),76-strlen(key($a)));


// Output:
foreach ($a as $x => $f)
    echo '|',str_repeat('-',$c[$x]),"| $x\n";

?>

Output:

|-------------------------------------------------------------------------| she
|---------------------------------------------------------------| you
|------------------------------------------------------------| said
|-----------------------------------------------------| alice
|-----------------------------------------------| was
|-------------------------------------------| that
|------------------------------------| as
|--------------------------------| her
|-----------------------------| at
|-----------------------------| with
|--------------------------| on
|--------------------------| all
|-----------------------| this
|-----------------------| for
|-----------------------| had
|-----------------------| but
|----------------------| be
|---------------------| not
|--------------------| they
|--------------------| so
|-------------------| very
|------------------| what

When there is a long word, the bars are adjusted properly:

|--------------------------------------------------------| she
|---------------------------------------------------| thisisareallylongwordhere
|-------------------------------------------------| you
|-----------------------------------------------| said
|-----------------------------------------| alice
|------------------------------------| was
|---------------------------------| that
|---------------------------| as
|-------------------------| her
|-----------------------| with
|-----------------------| at
|--------------------| on
|--------------------| all
|------------------| this
|------------------| for
|------------------| had
|-----------------| but
|-----------------| be
|----------------| not
|---------------| they
|---------------| so
|--------------| very

Lazarus Inepologlou 2010-07-03 23:17:32

Answer 35

+27 A:

Transact SQL set based solution (SQL Server 2005) 1063 892 873 853 827 820 783 683 647 644 630 characters

Thanks to Gabe for some useful suggestions to reduce the character count.

NB: Line breaks were added only to avoid scrollbars; only the last line break is required.

DECLARE @ VARCHAR(MAX),@F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A',
SINGLE_BLOB)x;WITH N AS(SELECT 1 i,LEFT(@,1)L UNION ALL SELECT i+1,SUBSTRING
(@,i+1,1)FROM N WHERE i<LEN(@))SELECT i,L,i-RANK()OVER(ORDER BY i)R INTO #D
FROM N WHERE L LIKE'[A-Z]'OPTION(MAXRECURSION 0)SELECT TOP 22 W,-COUNT(*)C
INTO # FROM(SELECT DISTINCT R,(SELECT''+L FROM #D WHERE R=b.R FOR XML PATH
(''))W FROM #D b)t WHERE LEN(W)>1 AND W NOT IN('the','and','of','to','it',
'in','or','is')GROUP BY W ORDER BY C SELECT @F=MIN(($76-LEN(W))/-C),@=' '+
REPLICATE('_',-MIN(C)*@F)+' 'FROM # SELECT @=@+' 
|'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @

Readable Version

DECLARE @  VARCHAR(MAX),
        @F REAL
SELECT @=BulkColumn
FROM   OPENROWSET(BULK'A',SINGLE_BLOB)x; /*  Loads text file from path
                                             C:\WINDOWS\system32\A  */

/*Recursive common table expression to
generate a table of numbers from 1 to string length
(and associated characters)*/
WITH N AS
     (SELECT 1 i,
             LEFT(@,1)L

     UNION ALL

     SELECT i+1,
            SUBSTRING(@,i+1,1)
     FROM   N
     WHERE  i<LEN(@)
     )
  SELECT   i,
           L,
           i-RANK()OVER(ORDER BY i)R
           /*Will group characters
           from the same word together*/
  INTO     #D
  FROM     N
  WHERE    L LIKE'[A-Z]'OPTION(MAXRECURSION 0)
             /*Assuming case insensitive accent sensitive collation*/

SELECT   TOP 22 W,
         -COUNT(*)C
INTO     #
FROM     (SELECT DISTINCT R,
                          (SELECT ''+L
                          FROM    #D
                          WHERE   R=b.R FOR XML PATH('')
                          )W
                          /*Reconstitute the word from the characters*/
         FROM             #D b
         )
         T
WHERE    LEN(W)>1
AND      W NOT IN('the',
                  'and',
                  'of' ,
                  'to' ,
                  'it' ,
                  'in' ,
                  'or' ,
                  'is')
GROUP BY W
ORDER BY C

/*Just noticed this looks risky as it relies on the order of evaluation of the 
 variables. I'm not sure that's guaranteed but it works on my machine :-) */
SELECT @F=MIN(($76-LEN(W))/-C),
       @ =' '      +REPLICATE('_',-MIN(C)*@F)+' '
FROM   #

SELECT @=@+' 
|'+REPLICATE('_',-C*@F)+'| '+W
             FROM     #
             ORDER BY C

PRINT @

Output

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| You
|____________________________________________________________| said
|_____________________________________________________| Alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| This
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| So
|___________________| very
|__________________| what

And with the long string

 _______________________________________________________________ 
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| Alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| at
|_________________________| with
|_______________________| on
|______________________| all
|____________________| This
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| So
|________________| very
|________________| what
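
The i-RANK()OVER(ORDER BY i) expression is what glues letters back into words: once the non-letters are filtered out, position minus rank stays constant across a run of adjacent letters and jumps at every gap. A small Python sketch of the same idea follows; the function name and the sample string are illustrative assumptions, not part of the T-SQL answer.

```python
from itertools import groupby
from string import ascii_letters


def words_by_index_minus_rank(text):
    # Keep (position, letter) pairs only; positions are 1-based, as in SUBSTRING.
    letters = [(i, ch) for i, ch in enumerate(text, start=1) if ch in ascii_letters]
    # rank is the 1-based position among the surviving rows, so
    # position - rank is identical for letters that were adjacent.
    keyed = [(i - rank, ch) for rank, (i, ch) in enumerate(letters, start=1)]
    # The key never decreases, so grouping consecutive equal keys is enough.
    return ["".join(ch for _, ch in run)
            for _, run in groupby(keyed, key=lambda pair: pair[0])]


print(words_by_index_minus_rank("Hi, the cat!"))
```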

Martin Smith 2010-07-03 23:48:56

I gave you a +1 because you did it in T-SQL, and to quote Team America - "You have balls. I like balls."

fortheworld 2010-07-04 00:03:30

I took the liberty of converting some spaces into newlines to make it more readable. Hopefully I didn't mess things up. I also minified it a bit more.

Gabe 2010-07-04 07:33:36

@Gabe Thanks. I ended up largely rewriting it though. It is now shorter and quicker than before.

Martin Smith 2010-07-04 12:06:22

That code is screaming at me! :O

Joey 2010-07-04 14:27:12

One good way to save is by changing `0.000` to just `0`, then using `-C` instead of `1.0/C`. And making `FLOAT` into `REAL` will save a stroke too. The biggest thing, though, is that it looks like you have lots of `AS` instances that should be optional.

Gabe 2010-07-04 15:13:33

@Gabe - Thanks for the tips. I was able to replace float with real and get rid of some of the 'AS's; the two that remain are both required. The -C thing didn't work. The row with 0 on it is the top of the top bar. This ended up positioned at the bottom, and I would have needed to replace 0.000 with a large-magnitude negative number to get it in the right place. Thanks though!

Martin Smith 2010-07-04 19:24:00

How about this: `SELECT $0 O, ' '+REPLICATE('_', MAX(C)*@F)+' ' [ ] FROM # UNION SELECT $1/C, '|'+REPLICATE('_',C*@F)+'| '+W FROM # ORDER BY 1`

Gabe 2010-07-04 20:02:09

@Gabe - Yep, that works. I'll implement the `$` thing, thanks. The problem, though, is that it returns an additional column in the output that isn't part of the spec (hence the need for the additional temp table step).

Martin Smith 2010-07-04 20:52:42

OK, how about `SELECT [ ] FROM (SELECT $0 O, ' '+REPLICATE('_', MAX(C)*@F)+' ' [ ] FROM # UNION SELECT $1/C, '|'+REPLICATE('_',C*@F)+'| '+W FROM #)X ORDER BY O`?

Gabe 2010-07-04 23:06:59

@Gabe - Nice! That brings it down to comfortably less than 800. Thanks for your help!

Martin Smith 2010-07-05 00:32:20

Oh. My. *God.* +1

Andrew Heath 2010-07-05 00:51:40

You don't need to declare `@F` where it's used. You can declare it up with `@` and save a whole `DECLARE` worth of chars.

Gabe 2010-07-05 04:09:21

Is `i-ROW_NUMBER` the same as `RANK`? Can the second CTE be moved into the `FROM` clause where it's used? Can the #D table query be made into a CTE? Can the #t table query be made into a CTE, or at least put into the `FROM` clause of the `SELECT TOP 22` query?

Gabe 2010-07-05 04:44:24

@Gabe - Thanks, all good points. Made some other simplifications as well and collectively knocked another 100 off. The '#D' needs to be a temp table. At the moment it takes about 12 seconds on my machine. Swapping to a CTE slowed it down massively (I cancelled the query after 2 minutes, so I don't know how long it would have taken, or indeed if it would have finished at all).

Martin Smith 2010-07-05 06:40:56

HA! Take that, Java!

Gabe 2010-07-05 16:11:28

@Gabe - High Five! Good golfing with you. Cheers for the assistance.

Martin Smith 2010-07-05 16:19:51

Answer 36

+1 A:

R, 298 chars

f=scan("stdin","ch")
u=unlist
s=strsplit
a=u(s(u(s(tolower(f),"[^a-z]")),"^(the|and|of|to|it|in|or|is|.|)$"))
v=unique(a)
r=sort(sapply(v,function(i) sum(a==i)),T)[2:23]  #the first item is an empty string, just skipping it
w=names(r)
q=(78-max(nchar(w)))*r/max(r)
cat(" ",rep("_",q[1])," \n",sep="")
for(i in 1:22){cat("|",rep("_",q[i]),"| ",w[i],"\n",sep="")}

The output is:

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

And if "you" is replaced by something longer:

 ____________________________________________________________ 
|____________________________________________________________| she
|____________________________________________________| veryverylongstring
|__________________________________________________| said
|___________________________________________| alice
|______________________________________| was
|___________________________________| that
|_____________________________| as
|__________________________| her
|________________________| at
|________________________| with
|______________________| on
|_____________________| all
|___________________| this
|___________________| for
|___________________| had
|__________________| but
|__________________| be
|__________________| not
|________________| they
|________________| so
|_______________| very
|_______________| what

Andrei 2010-07-04 02:57:14

This is not doing the maximum scaling

6502 2010-07-04 08:19:24

Answer 37

+94 A:

LabVIEW 51 nodes, 5 structures, 10 diagrams

Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.

Edit, explanation, program flows from left to right:

Edit: added some offsets for the chart's annotations and scaling, updated image links

Underflow 2010-07-04 05:07:54

Wow, I've never before seen an example of visual programming that looks useful! I'd thought there was a kind of consensus that it was impossible or not worth it.

JDonner 2010-07-04 05:49:53

It IS not worth it

M28 2010-07-04 06:18:43

LabVIEW's very happy in its hardware control and measurement niche, but really pretty awful for string manipulation.

Underflow 2010-07-04 06:23:28

No 3D yet? ... :D

belisarius 2010-07-05 04:50:22

Best code golf answer I've seen. +1 for thinking outside the box!

Blair Holloway 2010-07-06 01:48:21

Gotta count the elements for us...every box and widget you had to drag to the screen counts.

dmckee 2010-07-06 05:52:42

@dmckee Good call. Most metrics are based on node count, so I'll add that.

Underflow 2010-07-07 00:16:08

@Underflow: Fair enough. I'm not sure it is a precise comparison, but it is something.

dmckee 2010-07-07 00:24:28

Would it be possible to add a link to a bigger version of those charts?

Svish 2010-07-07 12:35:37

@Svish Switched to a different host for the images. Hopefully it helps.

Underflow 2010-07-08 05:33:15

Holy shit, this is one of the most awesome things I have ever seen.

Jesse Dhillon 2010-07-13 09:26:38

This is Coding?!

Anraiki 2010-07-26 16:09:58

This is sr8 bucknnasty

Pierreten 2010-09-04 08:10:30

Answer 38

+1 A:

Python 290, 255, 253

290 characters in python (text read from standard input)

import sys,re
c={}
for w in re.findall("[a-z]+",sys.stdin.read().lower()):c[w]=c.get(w,0)+1-(","+w+","in",a,i,the,and,of,to,it,in,or,is,")
r=sorted((-v,k)for k,v in c.items())[:22]
sf=max((76.0-len(k))/v for v,k in r)
print" "+"_"*int(r[0][0]*sf)
for v,k in r:print"|"+"_"*int(v*sf)+"| "+k

but... after reading other solutions I all of a sudden realized that efficiency was not a requirement; so here is another, shorter and much slower one (255 characters)

import sys,re
w=re.findall("\w+",sys.stdin.read().lower())
r=sorted((-w.count(x),x)for x in set(w)-set("the and of to a i it in or is".split()))[:22]
f=max((76.-len(k))/v for v,k in r)
print" "+"_"*int(f*r[0][0])
for v,k in r:print"|"+"_"*int(f*v)+"| "+k

and after some more reading other solutions...

import sys,re
w=re.findall("\w+",sys.stdin.read().lower())
r=sorted((-w.count(x),x)for x in set(w)-set("the and of to a i it in or is".split()))[:22]
f=max((76.-len(k))/v for v,k in r)
print"","_"*int(f*r[0][0])
for v,k in r:print"|"+"_"*int(f*v)+"|",k

And now this solution is almost byte-per-byte identical to Astatine's one :-D
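
The speed difference between the first and the later versions comes from how the counting is done: one dictionary pass over the words versus one full list scan per distinct word. The sketch below shows both in readable Python 3; the function names are assumptions, and both use [a-z]+ (rather than \w+) so that their results are directly comparable.

```python
import re

STOP = set("the and of to a i it in or is".split())


def count_single_pass(text):
    # One pass over the words: roughly O(number of words).
    counts = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in STOP:
            counts[word] = counts.get(word, 0) + 1
    return counts


def count_with_list_count(text):
    # list.count rescans the whole list for every distinct word:
    # roughly O(number of words * number of distinct words).
    words = re.findall(r"[a-z]+", text.lower())
    return {word: words.count(word) for word in set(words) - STOP}


text = "The cat and the hat. A cat is in the hat!"
print(count_single_pass(text))
print(count_with_list_count(text))
```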

6502 2010-07-04 07:26:43

I worked out a very similar solution. Looking at yours there seems to be ways to merge both, you thought of some tricks I didn't...

kriss 2010-07-05 02:24:42

Answer 39

+6 A:

C (828)

It looks a lot like obfuscated code, and uses glib for strings, lists and hashes. The char count according to wc -m is 828. It does not consider single-char words. To calculate the max length of the bar, it considers the longest word among all words, not only the first 22. Is this a deviation from the spec?

It does not handle failures and it does not release used memory.
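
To make the question about the bar length concrete, here is a rough Python sketch comparing the two ways of picking the scale: from the longest word anywhere in the text, or only from the words that are actually plotted. The function names, the limit of 76 (borrowed from several other answers here, where this C code uses 77) and the sample counts are assumptions for illustration.

```python
def scale_from_longest_word(all_counts, limit=76):
    # Reserve room for the longest word in the whole text,
    # whether or not it ends up in the chart.
    longest = max(len(word) for word in all_counts)
    top = max(all_counts.values())
    return (limit - longest) / top


def scale_from_plotted_words(all_counts, n=22, limit=76):
    # Only the n most frequent words are drawn, so only they constrain the scale.
    plotted = sorted(all_counts.items(), key=lambda kv: -kv[1])[:n]
    return min((limit - len(word)) / count for word, count in plotted)


# Hypothetical counts: the long word is rare and never plotted when n=2.
counts = {"she": 100, "you": 80, "incomprehensibilities": 1}
print(scale_from_longest_word(counts))
print(scale_from_plotted_words(counts, n=2))
```

With these numbers the first approach gives shorter bars than necessary, because a word that is never drawn still takes up room.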

#include <glib.h>
#define S(X)g_string_##X
#define H(X)g_hash_table_##X
GHashTable*h;int m,w=0,z=0;y(const void*a,const void*b){int*A,*B;A=H(lookup)(h,a);B=H(lookup)(h,b);return*B-*A;}void p(void*d,void*u){int *v=H(lookup)(h,d);if(w<22){g_printf("|");*v=*v*(77-z)/m;while(--*v>=0)g_printf("=");g_printf("| %s\n",d);w++;}}main(c){int*v;GList*l;GString*s=S(new)(NULL);h=H(new)(g_str_hash,g_str_equal);char*n[]={"the","and","of","to","it","in","or","is"};while((c=getchar())!=-1){if(isalpha(c))S(append_c)(s,tolower(c));else{if(s->len>1){for(c=0;c<8;c++)if(!strcmp(s->str,n[c]))goto x;if((v=H(lookup)(h,s->str))!=NULL)++*v;else{z=MAX(z,s->len);v=g_malloc(sizeof(int));*v=1;H(insert)(h,g_strdup(s->str),v);}}x:S(truncate)(s,0);}}l=g_list_sort(H(get_keys)(h),y);m=*(int*)H(lookup)(h,g_list_first(l)->data);g_list_foreach(l,p,NULL);}

ShinTakezou 2010-07-04 10:16:55

Newlines do count as characters, but you can strip any from lines that are not preprocessor instructions. For a golf, I wouldn't consider not freeing memory a bad practice.

Shtééf 2010-07-04 10:31:58

ok... put all in a line (except preproc macros) and given a vers without freeing mem (and with two other spaces removed... a little bit of improvement can be made on the "obfuscation", e.g. `*v=*v*(77-lw)/m` will give 929 ... but I think it can be ok unless I find a way to do it a lot shorter)

ShinTakezou 2010-07-04 10:48:19

I think you can move at least the `int c` into the `main` declaration and `main` is implicitly `int` (as are any untyped arguments, afaik): `main(c){...}`. You could probably also just write `0` instead of `NULL`.

Joey 2010-07-04 11:27:26

doing it... of course will trigger some warning with the `-Wall` or with `-std=c99` flag on... but I suppose this is pointless for a code-golf, right?

ShinTakezou 2010-07-04 11:36:32

uff, sorry for short-gap time edits, ... I should change `Without freeing memory stuff, it reaches 866 (removed some other unuseful space)` to something else to let not think people that the difference with the free-memory version is all in that: now the no-free-memory version has a lot of more "improvements".

ShinTakezou 2010-07-04 11:48:43

still some improvements can be done shortening names of variables+function

ShinTakezou 2010-07-05 06:25:21

@Shin: BTW--you can have more than one answer to a single question. Scroll to the very bottom of the page to find the [Add Another Answer] button. I suppose it's moved down because the expectation is that multiple answers will be the exception, not the rule.

dmckee 2010-07-06 05:59:05

@dmckee thanks, I am going to disentangle C and Smalltalk!

ShinTakezou 2010-07-06 11:54:35

Answer 40

+5 A:

Common LISP, 670 characters

I'm a LISP newbie, and this is an attempt using a hash table for counting (so probably not the most compact method).

(flet((r()(let((x(read-char t nil)))(and x(char-downcase x)))))(do((c(
make-hash-table :test 'equal))(w NIL)(x(r)(r))y)((not x)(maphash(lambda
(k v)(if(not(find k '("""the""and""of""to""a""i""it""in""or""is"):test
'equal))(push(cons k v)y)))c)(setf y(sort y #'> :key #'cdr))(setf y
(subseq y 0(min(length y)22)))(let((f(apply #'min(mapcar(lambda(x)(/(-
76.0(length(car x)))(cdr x)))y))))(flet((o(n)(dotimes(i(floor(* n f)))
(write-char #\_))))(write-char #\Space)(o(cdar y))(write-char #\Newline)
(dolist(x y)(write-char #\|)(o(cdr x))(format t "| ~a~%"(car x))))))
(cond((char<= #\a x #\z)(push x w))(t(incf(gethash(concatenate 'string(
reverse w))c 0))(setf w nil)))))

It can be run, for example, with cat alice.txt | clisp -C golf.lisp.

The readable form is:

(flet ((r () (let ((x (read-char t nil)))
               (and x (char-downcase x)))))
  (do ((c (make-hash-table :test 'equal))  ; the word count map
       w y                                 ; current word and final word list
       (x (r) (r)))  ; iteration over all chars
       ((not x)

        ; make a list with (word . count) pairs removing stopwords
        (maphash (lambda (k v)
                   (if (not (find k '("" "the" "and" "of" "to"
                                      "a" "i" "it" "in" "or" "is")
                                  :test 'equal))
                       (push (cons k v) y)))
                 c)

        ; sort and truncate the list
        (setf y (sort y #'> :key #'cdr))
        (setf y (subseq y 0 (min (length y) 22)))

        ; find the scaling factor
        (let ((f (apply #'min
                        (mapcar (lambda (x) (/ (- 76.0 (length (car x)))
                                               (cdr x)))
                                y))))
          ; output
          (flet ((outx (n) (dotimes (i (floor (* n f))) (write-char #\_))))
             (write-char #\Space)
             (outx (cdar y))
             (write-char #\Newline)
             (dolist (x y)
               (write-char #\|)
               (outx (cdr x))
               (format t "| ~a~%" (car x))))))

       ; add alphabetic to current word, and bump word counter
       ; on non-alphabetic
       (cond
        ((char<= #\a x #\z)
         (push x w))
        (t
         (incf (gethash (concatenate 'string (reverse w)) c 0))
         (setf w nil)))))
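
For readers who do not speak Lisp, the same character-by-character approach can be sketched in Python: letters are appended to the current word, and any other character bumps the counter for whatever has been collected so far, including the empty string, which is filtered out together with the stop words. The function name and the trailing-space sentinel (used so that a final word is not lost) are additions for this sketch, not part of the answer above.

```python
STOPWORDS = {"", "the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}


def count_words_charwise(text, stopwords=STOPWORDS):
    counts = {}
    current = []  # characters of the word being read
    # The appended space is a sentinel that flushes the last word.
    for ch in text.lower() + " ":
        if "a" <= ch <= "z":
            current.append(ch)
        else:
            # Non-letter: count the collected word (possibly empty) and reset.
            word = "".join(current)
            counts[word] = counts.get(word, 0) + 1
            current = []
    # Stop words, and the empty "word", are removed at the end.
    return {w: c for w, c in counts.items() if w not in stopwords}


print(count_words_charwise("The cat; the CAT sat"))
```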

6502 2010-07-04 15:53:54

have you tried installing a custom reader macro to shave off some input size?

Aaron 2010-07-05 01:55:07

@Aaron actually it wasn't trivial for me even just getting this working... :-) For the actual golfing part I just used one-letter variables and that's all. Anyway, besides the somewhat high verbosity that is inherent in CL for this scale of problem ("concatenate 'string", "setf" or "gethash" are killers... in Python they are "+", "=", "[]"), I still felt this was a lot worse than I would have expected, even on a logical level. In a sense I have a feeling that Lisp is OK but Common Lisp is so-so, and this goes beyond naming (re-reading it, a very unfair comment, as my experience with CL is close to zero).

6502 2010-07-05 06:18:05

true. scheme would make the golfing a bit easier, with the single namespace. instead of string-append all over the place, you could (letrec ((a string-append)(b gethash)) ... (a "x" "yz") ...)

Aaron 2010-07-07 16:37:03

Answer 41

+2 A:

Shell, 228 characters, with the 80-char constraint working

tr A-Z a-z|tr -Cs a-z "\n"|sort|egrep -v "^(the|and|of|to|a|i|it|in|or|is)$" |uniq -c|sort -r|head -22>g
n=1
while :
do
awk '{printf "|%0*s| %s\n",$1*'$n'/1e3,"",$2;}' g|tr 0 _>o 
egrep -q .{80} o&&break
n=$((n+1))
done
cat o

I'm surprised nobody seems to have used the amazing * feature of printf.
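
The * in a printf format takes the field width from the argument list, so a bar of any length is a single zero-padded field with the zeros swapped for underscores afterwards. Python's %-formatting supports the same *, which makes the trick easy to try out; the sketch below also mimics the loop that grows the scale until some line reaches 80 characters. The function names and sample rows are assumptions for illustration.

```python
def bar_line(width, word):
    # "%0*d" consumes two arguments: the field width, then the number 0,
    # which is zero-padded to that width. The zeros then become underscores.
    return ("|%0*d| %s" % (width, 0, word)).replace("0", "_")


def render(rows, limit=80):
    # rows: (count, word) pairs. Grow the scale n step by step, as the
    # shell loop does, and stop as soon as one line reaches the limit.
    n = 1
    while True:
        lines = [bar_line(int(count * n / 1000), word) for count, word in rows]
        if any(len(line) >= limit for line in lines):
            return lines
        n += 1


for line in render([(100, "she"), (50, "you")]):
    print(line)
```

Note that a zero-padded 0 is never shorter than one digit, so even a bar of width 0 comes out as a single underscore with this trick.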

cat 11-very.txt | golf.sh

|__________________________________________________________________________| she
|________________________________________________________________| you
|_____________________________________________________________| said
|______________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|________________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so

cat 11 | golf.sh

|_________________________________________________________________| she
|_________________________________________________________| verylongstringstring
|______________________________________________________| said
|_______________________________________________| alice
|__________________________________________| was
|_______________________________________| that
|________________________________| as
|_____________________________| her
|___________________________| with
|___________________________| at
|__________________________| s
|_________________________| t
|________________________| on
|_______________________| all
|_____________________| this
|_____________________| for
|_____________________| had
|____________________| but
|___________________| be
|___________________| not
|__________________| they
|__________________| so

mb14 2010-07-04 15:58:56

Missing the very first line in the output (the top line of the first bar). Also couldn't you just sort ascending and then use the last 22 lines instead? Dunno whether that would make it shorter here but for me it was a serious consideration.

Joey 2010-07-04 21:15:35

I know about the first. I just don't see a simple way to do it, and I wasn't sure whether it was really mandatory. I could indeed drop the reverse sort, but then the output would be inverted (she on the last line).

mb14 2010-07-04 21:47:40

Answer 42

+1 A:

Object Rexx 4.0 with PC-Pipes

Where the PC-Pipes library can be found.
This solution ignores single letter words.


address rxpipe 'pipe (end ?) < Alice.txt',
   '|regex split /[^a-zA-Z]/', -- split at non-alphabetic characters
   '|locate 2',                -- discard words shorter than 2 chars
   '|xlate lower',             -- translate all words to lower case
   ,                           -- discard words that match the list
   '|regex not match /^(the||and||of||to||it||in||or||is)$/',
   '|l:lookup autoadd before count',  -- accumulate and count words
 '? l:',                       -- no master records to feed into lookup 
 '? l:',                       -- list of counted words comes here
   ,                           -- columns 1-10 hold count, 11-n hold word
   '|sort 1.10 d',             -- sort in descending order by count
   '|take 22',                 -- take first 22 records only
   '|array wordlist',          -- store into a rexx array
   '|count max',               -- get length of longest record 
   '|var maxword'              -- save into a rexx variable

parse value wordlist[1] with count 11 .  -- get frequency of first word
barunit = count % (76-(maxword-10))      -- frequency units per chart bar char

say ' '||copies('_', (count+barunit)%barunit)  -- first line of the chart
do cntwd over wordlist                    
  parse var cntwd count 11 word          -- get word frequency and the word
  say '|'||copies('_', (count+barunit)%barunit)||'| '||word||' '
end

The output produced

 ________________________________________________________________________________
|________________________________________________________________________________| she
|_____________________________________________________________________| you
|___________________________________________________________________| said
|__________________________________________________________| alice
|____________________________________________________| was
|________________________________________________| that
|________________________________________| as
|____________________________________| her
|_________________________________| at
|_________________________________| with
|______________________________| on
|_____________________________| all
|__________________________| this
|__________________________| for
|__________________________| had
|__________________________| but
|________________________| be
|________________________| not
|_______________________| they
|______________________| so
|_____________________| very
|_____________________| what

James Johnosn 2010-07-05 01:50:49

How long is the solution (number of characters) - this is a code-golf?

Nas Banov 2010-07-05 23:54:32

Answer 43

+2 A:

Yet another python 2.x - 206 chars (or 232 with 'width bar')

I believe this one is fully compliant with the question. The ignore list is there, and it fully checks the line length (see the example where I replaced Alice with Aliceinwonderlandbylewiscarroll throughout the text, making the fifth item the longest line). Even the filename is provided from the command line instead of being hardcoded (hardcoding it would remove about 10 chars). It has one drawback (but I believe it's ok with the question): as it computes an integer divisor to make lines shorter than 80 chars, the longest line is shorter than 80 characters, not exactly 80 characters. The Python 3.x version does not have this defect (but is way longer).

Also I believe it is not so hard to read.

import sys,re
t=re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",sys.stdin.read().lower())
b=sorted((-t.count(x),x)for x in set(t))[:22]
for l,w in b:print"|"+l/min(z/(78-len(e))for z,e in b)*'-'+"|",w

|----------------------------------------------------------------| she
|--------------------------------------------------------| you
|-----------------------------------------------------| said
|----------------------------------------------| aliceinwonderlandbylewiscarroll
|-----------------------------------------| was
|--------------------------------------| that
|-------------------------------| as
|----------------------------| her
|--------------------------| at
|--------------------------| with
|-------------------------| s
|-------------------------| t
|-----------------------| on
|-----------------------| all
|---------------------| this
|--------------------| for
|--------------------| had
|--------------------| but
|-------------------| be
|-------------------| not
|------------------| they
|-----------------| so
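
The integer divisor mentioned above can be spelled out as follows. The golfed code works on negated counts with Python 2's flooring division; the sketch below is a Python 3 rewrite with positive counts, where the function names and the sample rows are assumptions. The divisor is the smallest integer that keeps every bar within limit - len(word) characters, which is why the longest line usually falls a little short of the maximum.

```python
def integer_divisor(rows, limit=78):
    # rows: (count, word) pairs, every word shorter than limit.
    # -(-a // b) is ceiling division on integers.
    return max(-(-count // (limit - len(word))) for count, word in rows)


def bars(rows, limit=78):
    divisor = integer_divisor(rows, limit)
    # Integer division rounds each bar down, so no bar exceeds its room.
    return ["|" + "-" * (count // divisor) + "| " + word for count, word in rows]


rows = [(1000, "she"), (10, "at")]  # hypothetical counts
print(integer_divisor(rows))
for line in bars(rows):
    print(line)
```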

It is not clear whether we must print the max bar alone on its own line (as in the sample output). Below is another version that does so, at 232 chars.

import sys,re
t=re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",sys.stdin.read().lower())
b=sorted((-t.count(x),x)for x in set(t))[:22]
f=min(z/(78-len(e))for z,e in b)
print"",b[0][0]/f*'-'
for y,w in b:print"|"+y/f*'-'+"|",w

Python 3.x - 256 chars

Using the Counter class from Python 3.x, there were high hopes of making it shorter (as Counter does everything that we need here). It turns out it's not better. Below is my attempt (256 chars):

import sys,re,collections as c
b=c.Counter(re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",
sys.stdin.read().lower())).most_common(22)
F=lambda p,x,w:print(p+'-'*int(x/max(z/(77.-len(e))for e,z in b))+w)
F(" ",b[0][1],"")
for w,y in b:F("|",y,"| "+w)

The problem is that collections and most_common are very long words and even Counter is not short... really, not using Counter makes code only 2 characters longer ;-(

Python 3.x also introduces other constraints: dividing two integers no longer yields an integer (so we have to cast to int), print is now a function (parentheses must be added), etc. That's why it comes out 22 characters longer than the Python 2.x version, but way faster. Maybe a more experienced Python 3.x coder will have ideas to shorten the code.
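
Outside of golf, Counter does make the counting step short and clear. A readable sketch, assuming the same stop list as the other answers and using an illustrative function name:

```python
import re
from collections import Counter

STOP = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}


def top_words(text, n=22):
    words = re.findall(r"[a-z]+", text.lower())
    # most_common(n) returns (word, count) pairs, highest count first.
    return Counter(w for w in words if w not in STOP).most_common(n)


print(top_words("the cat and the hat a cat", 2))
```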

kriss 2010-07-05 02:13:24

That's a clever way of sorting from high to low.

Wallacoloo 2010-07-07 03:39:34

Answer 44

+1 A:

Ruby, 205

This Ruby version handles "superlongstringstring". (The first two lines are almost identical to the previous Ruby programs.)

It must be run this way:

ruby -n0777 golf.rb Alice.txt

W=($_.upcase.scan(/\w+/)-%w(THE AND OF TO A I IT
IN OR IS)).group_by{|x|x}.map{|k,v|[-v.size,k]}.sort[0,22]
u=proc{|m|"_"*(W.map{|n,s|(76.0-s.size)/n}.max*m)}
puts" "+u[W[0][0]],W.map{|n,s|"|%s| "%u[n]+s}

The third line creates a closure or lambda that yields a correctly scaled string of underscores:

u = proc{|m|
  "_" *
    (W.map{|n,s| (76.0 - s.size)/n}.max * m)
}

.max is used instead of .min because the numbers are negative.
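
The negated counts serve two purposes: a plain ascending sort puts the most frequent word first, and the scale factor is then the maximum, rather than the minimum, of the per-word ratios. A Python sketch of the same arithmetic, with illustrative names and made-up counts:

```python
def scale_from_negated(rows, limit=76):
    # rows hold (-count, word). Every ratio is negative, so the largest
    # one is the one closest to zero, i.e. the tightest constraint.
    return max((limit - len(word)) / neg for neg, word in rows)


# Sorting (-count, word) pairs ascending yields descending counts.
rows = sorted([(-90, "x" * 25), (-100, "she")])
factor = scale_from_negated(rows)

# factor and the counts are both negative, so their product is a positive width.
print(rows[0][1], int(factor * rows[0][0]))
```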

William James 2010-07-05 05:06:41

Implementing the full spec and still very short (213 characters at the moment according to `wc -c`), nice work!

ChristopheD 2010-07-05 05:11:16

Answer 45

+3 A:

Scala 2.8, 311 314 320 330 332 336 341 375 characters

including long word adjustment. Ideas borrowed from the other solutions.

Now as a script (a.scala):

val t="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile(argv(0)).mkString.toLowerCase).toSeq.groupBy(w=>w).mapValues(_.size).toSeq.sortBy(-_._2)take 22
def b(p:Int)="_"*(p*(for((w,c)<-t)yield(76.0-w.size)/c).min).toInt
println(" "+b(t(0)._2))
for(p<-t)printf("|%s| %s \n",b(p._2),p._1)

Run with

scala -howtorun:script a.scala alice.txt

BTW, the edit from 314 to 311 characters actually removes only 1 character. Someone got the counting wrong before (Windows CRs?).

mkneissl 2010-07-05 22:52:24

Answer 46

+1 A:

Bourne shell, 213/240 characters

Improving on the shell version posted earlier, I can get it down to 213 characters:

tr A-Z a-z|tr -Cs a-z \\n|sort|egrep -v '^(the|and|of|to|a|i|it|in|or|is)$'|uniq -c|sort -rn|sed 22q>g
n=1
>o
until egrep -q .{80} o
do
awk '{printf "|%0*d| %s\n",$1*'$n'/1e3,0,$2}' g|tr 0 _>o 
((n++))
done
cat o

In order to get the upper outline on the top bar, I had to expand it to 240 characters:

tr A-Z a-z|tr -Cs a-z \\n|sort|egrep -v "^(the|and|of|to|a|i|it|in|or|is)$"|uniq -c|sort -r|sed 1p\;22q>g
n=1
>o
until egrep -q .{80} o
do
awk '{printf "|%0*d| %s\n",$1*'$n'/1e3,0,NR==1?"":$2}' g|sed '1s,|, ,g'|tr 0 _>o 
((n++))
done
cat o

2010-07-06 00:29:29

Answer 47

+1 A:

shell, grep, tr, grep, sort, uniq, sort, head, perl - 194 chars

Adding some -i flags may drop the overly long tr A-Z a-z| step; the spec said nothing about the case displayed, and uniq -ci drops any case differences.

egrep -oi [a-z]+|egrep -wiv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -ci|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'

That's minus 11 for the tr plus 2 for the -i's compared to the original 206 chars.

edit: minus 3 for the \\b which can be left out as pattern matching will commence on a boundary anyway.

sort gives lower case first, and uniq -ci takes the first occurrence, so the only real change in output will be that Alice retains her upper case initial.
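
To illustrate that last point, here is a small Python sketch of what sort | uniq -ci does to mixed-case input. It assumes a locale-style sort that orders case-insensitively and puts lower case first on ties (a plain byte-order sort would behave differently). The helper name uniq_ci is made up for this example.

```python
from itertools import groupby

def uniq_ci(words):
    """Mimic `sort | uniq -ci`: count case-insensitively, keep the first spelling seen."""
    # Assumed sort order: case-insensitive, lower case before upper case on ties.
    ordered = sorted(words, key=lambda w: (w.lower(), w != w.lower()))
    result = []
    for _, group in groupby(ordered, key=str.lower):
        spellings = list(group)
        # The first spelling in the group is the one that gets displayed.
        result.append((len(spellings), spellings[0]))
    return result

print(uniq_ci(["Alice", "said", "Said", "Alice", "said"]))
# [(2, 'Alice'), (3, 'said')]
```

"Alice" never occurs in lower case here, so its capitalised spelling is the one that survives, while "said" wins over "Said".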

mvds 2010-07-06 01:14:43

The bar length constraint isn't working.

Joey 2010-07-06 14:05:05

Answer 48

+1 A:

Go, 613 chars, could probably be much smaller:

package main
import(r "regexp";. "bytes";. "io/ioutil";"os";st "strings";s "sort";. "container/vector")
type z struct{c int;w string}
func(e z)Less(o interface{})bool{return o.(z).c<e.c}
func main(){b,_:=ReadAll(os.Stdin);g:=r.MustCompile
c,m,x:=g("[A-Za-z]+").AllMatchesIter(b,0),map[string]int{},g("the|and|of|it|in|or|is|to")
for w:=range c{w=ToLower(w);if len(w)>1&&!x.Match(w){m[string(w)]++}}
o,y:=&Vector{},0
for k,v:=range m{o.Push(z{v,k});if v>y{y=v}}
s.Sort(o)
for i,v:=range *o{if i>21{break};x:=v.(z);c:=int(float(x.c)/float(y)*80)
u:=st.Repeat("_",c);if i<1{println(" "+u)};println("|"+u+"| "+x.w)}}

I feel so dirty.

Pat 2010-07-06 01:49:12

Answer 49

+1 A:

perl, 188 characters

The perl version above (as well as any regexp splitting based version) can get a few bytes shorter by including the list of forbidden words as negative lookahead assertions, rather than as a separate list. Furthermore the trailing semicolon can be left out.

I also included some other suggestions (- instead of <=>, for/foreach, dropped "keys") to get to

$c{$_}++for grep{$_}map{lc=~/\b(?!(?:the|and|a|of|or|i[nts]?|to)\b)[a-z]+/g}<>;@s=sort{$c{$b}-$c{$a}}%c;$f=76-length$s[0];say$"."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "for@s[0..21]

I don't know perl, but I presume that the (?!(?:...)\b) may lose the ?: if the handling around it is fixed.
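
For readers who don't speak Perl either, the negative lookahead trick can be tried out in Python, whose re module supports the same syntax. This is only a sketch of the idea with the stop-word alternation copied from the one-liner above; the names WORD and words are made up for the example.

```python
import re

# At each word boundary, refuse to match if a complete stop word starts here.
WORD = re.compile(r"\b(?!(?:the|and|a|of|or|i[nts]?|to)\b)[a-z]+")

def words(text):
    """Return the lower-cased words of `text`, minus the stop words."""
    return WORD.findall(text.lower())

print(words("The cat and the hat sat in a theatre"))
# ['cat', 'hat', 'sat', 'theatre']
```

Note that "theatre" survives: the \b inside the lookahead requires the stop word to end at a word boundary, so only whole words are rejected.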

mvds 2010-07-06 02:46:04

This throws a syntax error for me: »String found where operator expected at c.pl line 1, near "say"|"" / syntax error at c.pl line 1, near "say"|"" / Search pattern not terminated at c.pl line 1.« (Perl 5.10.1). Also, from the code it looks like the bar length constraint isn't working. And it may well be that strings such as `foo_the_bar` won't get their stop words removed (because of `\b`).
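
The `\b` concern can be checked in isolation. The sketch below uses Python's re module, which, like Perl, counts `_` as a word character; it only shows how `\b` and a letters-only tokenizer disagree on `foo_the_bar` and does not exercise the one-liner itself.

```python
import re

# `_` is a word character, so there is no boundary on either side of "the".
print(re.findall(r"\bthe\b", "foo_the_bar"))   # []
print(re.findall(r"\bthe\b", "foo the bar"))   # ['the']

# A letters-only tokenizer, by contrast, splits at the underscores.
print(re.findall(r"[a-z]+", "foo_the_bar"))    # ['foo', 'the', 'bar']
```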

Joey 2010-07-06 14:08:19

Answer 50

+2 A:

Scala, 327 characters

This was adapted from mkneissl's answer inspired by a Python version, though it is bigger. I'm leaving it here in case someone can make it shorter.

val f="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile("11.txt").mkString.toLowerCase).toSeq
val t=f.toSet[String].map(x=> -f.count(x==)->x).toSeq.sorted take 22
def b(p:Int)="_"*(-p/(for((c,w)<-t)yield-c/(76.0-w.size)).max).toInt
println(" "+b(t(0)._1))
for(p<-t)printf("|%s| %s \n",b(p._1),p._2)

Daniel 2010-07-06 02:47:26

Answer 51

+5 A:

Perl, 185 char

~~200 (slightly broken)~~ ~~199~~ ~~197~~ ~~195~~ ~~193~~ ~~187~~ 185 characters. Last two newlines are significant. Complies with the spec.

map$X{+lc}+=!/^(.|the|and|to|i[nst]|o[rf])$/i,/[a-z]+/gfor<>;
$n=$n>($:=$X{$_}/(76-y+++c))?$n:$:for@w=(sort{$X{$b}-$X{$a}}%X)[0..21];
die map{$U='_'x($X{$_}/$n);" $U
"x!$z++,"|$U| $_
"}@w

First line loads counts of valid words into %X.

The second line computes minimum scaling factor so that all output lines will be <= 80 characters.

The third line (contains two newline characters) produces the output.
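
The scaling step on the second line can be sketched in Python as follows. This illustrates the idea only and is not a translation of the golfed Perl: scale_factor and chart are hypothetical names, the input is assumed to be (word, count) pairs already sorted by descending count, and the sample counts are invented.

```python
def scale_factor(counts):
    """Largest count-per-available-column ratio over all words.

    76 = 80 minus the two '|' characters and the two spaces around the word.
    """
    return max(c / (76 - len(w)) for w, c in counts)

def chart(counts):
    """Build the chart lines; every bar is scaled by the same factor."""
    n = scale_factor(counts)
    bars = [(w, "_" * int(c / n)) for w, c in counts]
    lines = [" " + bars[0][1]]                       # top outline of the first bar
    lines += ["|%s| %s " % (bar, w) for w, bar in bars]
    return lines

# Made-up counts, purely for illustration.
sample = [("she", 553), ("you", 481), ("extraordinarily", 400)]
for line in chart(sample):
    print(line)
```

Dividing every count by the largest ratio means the word that is tightest on space just fits, and every other line is shorter.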

mobrule 2010-07-06 03:17:02

This won't remove stop words from strings such as "foo_the_bar". Line length is also one too long (re-read the spec: "bar + space + word **+ space** <= 80 chars")

Joey 2010-07-06 14:02:37

Answer 52

+1 A:

GNU Smalltalk (386)

I think it can be made a little bit shorter, but I still have no idea how.

|q s f m|q:=Bag new. f:=FileStream stdin. m:=0.[f atEnd]whileFalse:[s:=f nextLine.(s notNil)ifTrue:[(s tokenize:'\W+')do:[:i|(((i size)>1)&({'the'.'and'.'of'.'to'.'it'.'in'.'or'.'is'}includes:i)not)ifTrue:[q add:(i asLowercase)]. m:=m max:(i size)]]].(q:=q sortedByCount)from:1to:22 do:[:i|'|'display.((i key)*(77-m)//(q first key))timesRepeat:['='display].('| %1'%{i value})displayNl]

ShinTakezou 2010-07-06 11:55:50

Answer 53

+4 A:

Clojure 282 strict

(let[[[_ m]:as s](->>(slurp *in*).toLowerCase(re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")frequencies(sort-by val >)(take 22))[b](sort(map #(/(- 76(count(key %)))(val %))s))p #(do(print %1)(dotimes[_(* b %2)](print \_))(apply println %&))](p " " m)(doseq[[k v]s](p \| v \| k)))

Somewhat more legibly:

(let[[[_ m]:as s](->> (slurp *in*)
                   .toLowerCase
                   (re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")
                   frequencies
                   (sort-by val >)
                   (take 22))
     [b] (sort (map #(/ (- 76 (count (key %)))(val %)) s))
     p #(do
          (print %1)
          (dotimes[_(* b %2)] (print \_))
          (apply println %&))]
  (p " " m)
  (doseq[[k v] s] (p \| v \| k)))

Alex Taggart 2010-07-07 12:21:34

Answer 54

+1 A:

Clojure - 611 chars (not minimized)

I tried to write the code in as idiomatic Clojure as I could manage so late at night. I am not too proud of the draw-chart function, but I guess the code speaks volumes about the succinctness of Clojure.

(ns word-freq
(:require [clojure.contrib.io :as io]))

(defn word-freq
  [f]
  (take 22 (->> f
                io/read-lines ;;; slurp should work too, but I love map/red
                (mapcat (fn [l] (map #(.toLowerCase %) (re-seq #"\w+" l))))
                (remove #{"the" "and" "of" "to" "a" "i" "it" "in" "or" "is"})
                (reduce #(assoc %1 %2 (inc (%1 %2 0))) {})
                (sort-by (comp - val)))))

(defn draw-chart
  [fs]
  (let [[[w f] & _] fs]
    (apply str
           (interpose \newline
                      (map (fn [[k v]] (apply str (concat "|" (repeat (int (* (- 76 (count w)) (/ v f 1))) "_") "| " k " ")) ) fs)))))

;;; (println (draw-chart (word-freq "/Users/ghoseb/Desktop/alice.txt")))

Output:

|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| t 
|____________________________| s 
|__________________________| on 
|__________________________| all 
|_______________________| for 
|_______________________| had 
|_______________________| this 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so

I know, this doesn't follow the spec, but hey, this is some very clean Clojure code which is already so small :)

Baishampayan Ghose 2010-07-08 18:53:08

Answer 55

+1 A:

Lua solution: 478 characters.

t,u={},{}for l in io.lines()do
for w in l:gmatch("%a+")do
w=w:lower()if not(" the and of to a i it in or is "):find(" "..w.." ")then
t[w]=1+(t[w]or 0)end
end
end
for k,v in next,t do
u[#u+1]={k,v}end
table.sort(u,function(a,b)return a[2]>b[2]end)m,n=u[1][2],math.min(#u,22)for w=80,1,-1 do
s=""for i=1,n do
a,b=u[i][1],w*u[i][2]/m
if b+#a>=78 then s=nil break end
s2=("_"):rep(b)if i==1 then
s=s.." " ..s2.."\n"end
s=s.."|"..s2.."| "..a.."\n"end
if s then print(s)break end end

Readable version:

t,u={},{}
for line in io.lines() do
    for w in line:gmatch("%a+") do
        w = w:lower()
        if not (" the and of to a i it in or is "):find(" "..w.." ") then
            t[w] = 1 + (t[w] or 0)
        end
    end
end
for k, v in pairs(t) do
    u[#u+1]={k, v}
end

table.sort(u, function(a, b)
    return a[2] > b[2]
end)

local max = u[1][2]
local n = math.min(#u, 22)

for w = 80, 1, -1 do
    s=""
    for i = 1, n do
        f = u[i][2]
        word = u[i][1]
        width = w * f / max
        if width + #word >= 78 then
            s=nil
            break
        end
        s2=("_"):rep(width)
        if i==1 then
            s=s.." " .. s2 .."\n"
        end
        s=s.."|" .. s2 .. "| " .. word.."\n"
    end
    if s then
        print(s)
        break
    end
end

Kristofer 2010-07-09 08:32:19

Answer 56

+1 A:

TCL 554 Strict

foreach w [regexp -all -inline {[a-z]+} [string tolower [read stdin]]] {if {[lsearch {the and of to it in or is a i} $w]>=0} {continue};if {[catch {incr Ws($w)}]} {set Ws($w) 1}}
set T [lrange [lsort -decreasing -stride 2 -index 1 -integer [array get Ws]] 0 43]
foreach {w c} $T {lappend L [string length $w];lappend C $c}
set N [tcl::mathfunc::max {*}$L]
set C [lsort -integer $C]
set M [lindex $C end]
puts " [string repeat _ [expr {int((76-$N) * [lindex $T 1] / $M)}]] "
foreach {w c} $T {puts "|[string repeat _ [expr {int((76-$N) * $c / $M)}]]| $w"}

Or, more legibly

foreach w [regexp -all -inline {[a-z]+} [string tolower [read stdin]]] {
    if {[lsearch {the and of to a i it in or is} $w] >= 0} { continue }
    if {[catch {incr words($w)}]} {
        set words($w) 1
    }
}
set topwords [lrange [lsort -decreasing -stride 2 -index 1 -integer [array get words]] 0 43]
foreach {word count} $topwords {
    lappend lengths [string length $word]
    lappend counts $count
}
set maxlength [lindex [lsort -integer $lengths] end]
set counts [lsort -integer $counts]
set mincount [lindex $counts 0].0
set maxcount [lindex $counts end].0
puts " [string repeat _ [expr {int((76-$maxlength) * [lindex $topwords 1] / $maxcount)}]] "
foreach {word count} $topwords {
    set barlength [expr {int((76-$maxlength) * $count / $maxcount)}]
    puts "|[string repeat _ $barlength]| $word"
}

RHSeeger 2010-07-11 05:34:07
