The challenge:

Build an ASCII chart of the most commonly used words in a given text.

The rules:

  • Only accept a-z and A-Z (alphabetic characters) as part of a word.
  • Ignore casing (She == she for our purpose).
  • Ignore the following words (quite arbitrary, I know): the, and, of, to, a, i, it, in, or, is
  • Clarification: considering don't: this would be taken as 2 different 'words' in the ranges a-z and A-Z: (don and t).

  • Optionally (it's too late to be formally changing the specifications now) you may choose to drop all single-letter 'words' (this could potentially make for a shortening of the ignore list too).
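
The word rules above can be sketched in a few lines of Python. This is only an illustration of the rules, not a golfed entry; the names `STOP_WORDS` and `tokenize` are made up for this sketch.

```python
import re

# The ignore list from the rules above.
STOP_WORDS = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}

def tokenize(text):
    """Return lower-cased runs of a-z, with the ignored words dropped."""
    # Lower-casing first means the pattern only needs a-z.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

# "don't" splits into "don" and "t", as the clarification requires.
print(tokenize("She said: don't!"))
```

Dropping single-letter words (the optional relaxation) would only need an extra `len(w) > 1` condition in the list comprehension.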

Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii) and build us a word frequency chart with the following characteristics:

  • Display the chart (also see the example below) for the 22 most common words (ordered by descending frequency).
  • The bar width represents the number of occurrences (frequency) of the word (proportionally). Append one space and print the word.
  • Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should always be <= 80 characters (make sure you account for possible differing bar and word lengths, e.g. the second most common word could be a lot longer than the first while not differing so much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).
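
One way to read the fit rule is as a single scale factor: each row is "|" + bar + "|" + space + word + space, so a word of length n leaves at most 76 - n columns for its bar, and the largest usable scale is the minimum of (76 - n) / count over all charted words. The sketch below assumes that reading; the name `chart` and the `width` parameter are made up here, and `Fraction` is used only to keep the flooring exact.

```python
from fractions import Fraction

def chart(freqs, width=80):
    """Render rows for freqs, a list of (word, count) in descending count order."""
    # The two pipes and two spaces cost 4 columns; the word costs len(word).
    scale = min(Fraction(width - 4 - len(word), count) for word, count in freqs)
    # Floor each bar so no row can exceed the limit.
    bars = [(word, "_" * int(count * scale)) for word, count in freqs]
    lines = [" " + bars[0][1]]  # top edge of the first bar
    lines += ["|%s| %s " % (bar, word) for word, bar in bars]
    return lines

for line in chart([("she", 553), ("superlongstringstring", 481), ("said", 462)]):
    print(line)
```

With only `she` (553) and `you` (481) charted, this gives bars of 73 and 63 underscores. Entries that round instead of flooring can differ by one column.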

An example:

The text for the example can be found here (Alice's Adventures in Wonderland, by Lewis Carroll).

This specific text would yield the following chart:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|____________________________________________________| alice 
|______________________________________________| was 
|__________________________________________| that 
|___________________________________| as 
|_______________________________| her 
|____________________________| with 
|____________________________| at 
|___________________________| s 
|___________________________| t 
|_________________________| on 
|_________________________| all 
|______________________| this 
|______________________| for 
|______________________| had 
|_____________________| but 
|____________________| be 
|____________________| not 
|___________________| they 
|__________________| so 


For your information: these are the frequencies the above chart is built upon:

[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358),
('that', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227),
('s', 219), ('t', 218), ('on', 204), ('all', 200), ('this', 181),
('for', 179), ('had', 178), ('but', 175), ('be', 167), ('not', 166),
('they', 155), ('so', 152)]

A second example (to check if you implemented the complete spec): Replace every occurrence of you in the linked Alice in Wonderland file with superlongstringstring:

 ________________________________________________________________
|________________________________________________________________| she 
|_______________________________________________________| superlongstringstring 
|_____________________________________________________| said 
|______________________________________________| alice 
|________________________________________| was 
|_____________________________________| that 
|______________________________| as 
|___________________________| her 
|_________________________| with 
|_________________________| at 
|________________________| s 
|________________________| t 
|______________________| on 
|_____________________| all 
|___________________| this 
|___________________| for 
|___________________| had 
|__________________| but 
|_________________| be 
|_________________| not 
|________________| they 
|________________| so 

The winner:

Shortest solution (by character count, per language). Have fun!


Edit: Table summarizing the results so far (2010-07-06) (added by user Nas Banov):

Language          Relaxed  Strict
=========         =======  ======
GolfScript          130     143
Perl                        185
Windows PowerShell  148     199
Mathematica                 199
Ruby                185     205
Unix Toolchain      194     228
Python              183     232
Clojure                     282
Scala                       311
Haskell                     333
Awk                         336
R                   298
Javascript          304     354
Groovy              321
Matlab                      404
C#                          422
Smalltalk           386
PHP                         450
F#                          452

The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency etc). "Relaxed" means some liberties were taken to shorten the solution.

Only solutions shorter than 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to signify various solutions that use traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).

+25  A: 

Perl, 237 229 209 chars

(Updated again to beat the Ruby version with more dirty golf tricks, replacing split/[^a-z]/,lc with lc=~/[a-z]+/g, and eliminating a check for empty string in another place. These were inspired by the Ruby version, so credit where credit is due.)

Update: now with Perl 5.10! Replace print with say, and use ~~ to avoid a map. This has to be invoked on the command line as perl -E '<one-liner>' alice.txt. Since the entire script is on one line, writing it as a one-liner shouldn't present any difficulty :).

 @s=qw/the and of to a i it in or is/;$c{$_}++foreach grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>;@s=sort{$c{$b}<=>$c{$a}}keys%c;$f=76-length$s[0];say" "."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "foreach@s[0..21];

Note that this version normalizes for case. This doesn't shorten the solution any, since removing ,lc (for lower-casing) requires you to add A-Z to the split regex, so it's a wash.

If you're on a system where a newline is one character and not two, you can shorten this by another two chars by using a literal newline in place of \n. However, I haven't written the above sample that way, since it's "clearer" (ha!) that way.


Here is a mostly correct, but not remotely short enough, Perl solution:

use strict;
use warnings;

my %short = map { $_ => 1 } qw/the and of to a i it in or is/;
my %count = ();

$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-zA-Z]/ } (<>);
my @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
my $widest = 76 - (length $sorted[0]);

print " " . ("_" x $widest) . "\n";
foreach (@sorted)
{
    my $width = int(($count{$_} / $count{$sorted[0]}) * $widest);
    print "|" . ("_" x $width) . "| $_ \n";
}

The following is about as short as it can get while remaining relatively readable. (392 chars).

%short = map { $_ => 1 } qw/the and of to a i it in or is/;
%count;

$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-z]/, lc } (<>);
@sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
$widest = 76 - (length $sorted[0]);

print " " . "_" x $widest . "\n";
print"|" . "_" x int(($count{$_} / $count{$sorted[0]}) * $widest) . "| $_ \n" foreach @sorted;
JSBangs
Has a few bugs right now; fixing and shortening.
JSBangs
This doesn't cover the case when the second word is much longer than the first, right?
Joey
Both `foreach` s can be written as `for` s. That's 8 chars down. Then you have the `grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>`, which I believe could be written as `grep{!(/$_/i~~@s)}<>=~/[a-z]+/g` to go 4 more down. Replace the `" "` with `$"` and you're down 1 more...
Zaid
`sort{$c{$b}-$c{$a}}...` to save two more. You can also just pass `%c` instead of `keys %c` to the `sort` function and save four more.
mobrule
+27  A: 

C# - 510 451 436 446 434 426 422 chars (minified)

Not that short, but now probably correct! Note, the previous version did not show the first line of the bars, did not scale the bars correctly, downloaded the file instead of getting it from stdin, and did not include all the required C# verbosity. You could easily shave many strokes if C# didn't need so much extra crap. Maybe Powershell could do better.

using C=System.Console;   // alias for Console
using System.Linq;  // for Split, GroupBy, Select, OrderBy, etc.

class Class // must define a class
{
    static void Main()  // must define a Main
    {
        // split into words
        var allwords = System.Text.RegularExpressions.Regex.Split(
                // convert stdin to lowercase
                C.In.ReadToEnd().ToLower(),
                // eliminate stopwords and non-letters
                @"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+")
            .GroupBy(x => x)    // group by words
            .OrderBy(x => -x.Count()) // sort descending by count
            .Take(22);   // take first 22 words

        // compute length of longest bar + word
        var lendivisor = allwords.Max(y => y.Count() / (76.0 - y.Key.Length));

        // prepare text to print
        var toPrint = allwords.Select(x=> 
            new { 
                // remember bar pseudographics (will be used in two places)
                Bar = new string('_',(int)(x.Count()/lendivisor)), 
                Word=x.Key 
            })
            .ToList();  // convert to list so we can index into it

        // print top of first bar
        C.WriteLine(" " + toPrint[0].Bar);
        toPrint.ForEach(x =>  // for each word, print its bar and the word
            C.WriteLine("|" + x.Bar + "| " + x.Word));
    }
}

422 chars with lendivisor inlined (which makes it 22 times slower) in the below form (newlines used for select spaces):

using System.Linq;using C=System.Console;class M{static void Main(){var
a=System.Text.RegularExpressions.Regex.Split(C.In.ReadToEnd().ToLower(),@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+").GroupBy(x=>x).OrderBy(x=>-x.Count()).Take(22);var
b=a.Select(x=>new{p=new string('_',(int)(x.Count()/a.Max(y=>y.Count()/(76d-y.Key.Length)))),t=x.Key}).ToList();C.WriteLine(" "+b[0].p);b.ForEach(x=>C.WriteLine("|"+x.p+"| "+x.t));}}
Paul Creasey
Clever one, this. I like it.
Arve Systad
+1 for the smart-ass downloading the file inline. :)
sarnold
Steal the short URL from Matt's answer.
indiv
The spec said the file must be piped in or passed as an args. If you were to assume that args[0] contained the local file name, you could shorten it considerably by using args[0] instead of (new WebClient()).DownloadString(@"http://www.gutenberg.org/files/11/11.txt") -> it would save you approx 70 characters
thorkia
Here is a version replacing the WebClient call with args 0, a call to StreamReader, and removing a few extra spaces. Total char count = 413: `var a=Regex.Replace((new StreamReader(args[0])).ReadToEnd(),"[^a-zA-Z]"," ").ToLower().Split(' ').Where(x=>!(new[]{"the","and","of","to","a","i","it","in","or","is"}).Contains(x)).GroupBy(x=>x).Select(g=>new{w=g.Key,c=g.Count()}).OrderByDescending(x=>x.c).Skip(1).Take(22).ToList();var m=a.OrderByDescending(x=>x.c).First();a.ForEach(x=>Console.WriteLine("|"+new String('_',x.c*(80-m.w.Length-4)/m.c)+"| "+x.w));`
thorkia
"new StreamReader" without "using" is dirty. File.ReadAllText(args[0]) or Console.In.ReadToEnd() are much better. In the latter case you can even remove the argument from your Main(). :)
Rotsor
The bar widths are incorrect. "with"'s bar is shorter than "at"'s.
Rotsor
Rotsor: As far as I can tell, "with" and "at" have the same width of bar, which they should because they have the same frequency.
Gabe
You use Console.WriteLine a number of times. Save some more chars by aliasing `using C=System.Console;` and then in your code `C.WriteLine(..)`, or a different char since you already have C as a class name.
John K
This is an awesome example of the power of LINQ. Just imagine that in Java.
Zoomzoom83
@Zoomzoom83: It would be great to have but it would probably *still* be two orders of magnitude longer. We're talking about Java, after all ;) (and it will probably only show up in Java 8 which set its release date *after* Duke Nukem Forever).
Joey
+9  A: 

F#, 452 chars

Straightforward: get a sequence a of word-count pairs, find the best word-count-per-column multiplier k, then print results.

let a=
 stdin.ReadToEnd().Split(" .?!,\":;'\r\n".ToCharArray(),enum 1)
 |>Seq.map(fun s->s.ToLower())|>Seq.countBy id
 |>Seq.filter(fun(w,n)->not(set["the";"and";"of";"to";"a";"i";"it";"in";"or";"is"].Contains w))
 |>Seq.sortBy(fun(w,n)-> -n)|>Seq.take 22
let k=a|>Seq.map(fun(w,n)->float(78-w.Length)/float n)|>Seq.min
let u n=String.replicate(int(float(n)*k)-2)"_"
printfn" %s "(u(snd(Seq.nth 0 a)))
for(w,n)in a do printfn"|%s| %s "(u n)w

Example (I have different freq counts than you, unsure why):

% app.exe < Alice.txt

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| t
|____________________________| s
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| had
|______________________| for
|_____________________| but
|_____________________| be
|____________________| not
|___________________| they
|__________________| so
Brian
@Brian: turns out my own solution was indeed a little off (due to a little different spec), the solutions correspond now ;-)
ChristopheD
+1 for the only correct bar scaling implementation so far
Rotsor
(@Rotsor: Ironic, given that mine is the oldest solution.)
Brian
I bet you could shorten it quite a bit by merging the split, map, and filter stages. I'd also expect that you wouldn't need so many `float`s.
Gabe
Isn't nesting functions usually shorter than using the pipeline operator `|>`?
Joey
+7  A: 

Gawk -- 336 (originally 507) characters

(after fixing the output formatting; fixing the contractions thing; tweaking; tweaking again; removing a wholly unnecessary sorting step; tweaking yet again; and again (oops, this one broke the formatting); tweaking some more; taking up Matt's challenge, I desperately tweaked some more; found another place to save a few, but gave two back to fix the bar length bug)

Heh heh! I am momentarily ahead of Matt's JavaScript counter challenge! ;) and AKX's Python.

The problem seems to call out for a language that implements native associative arrays, so of course I've chosen one with a horribly deficient set of operators on them. In particular, you cannot control the order in which awk offers up the elements of a hash map, so I repeatedly scan the whole map to find the currently most numerous item, print it and delete it from the array.

It is all terribly inefficient, and with all the golfifications I've made it has gotten to be pretty awful as well.
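
The scan-print-delete loop described above can be sketched in Python as follows. This only illustrates the selection idea, not the awk code itself; the name `top_n_by_scanning` is made up here.

```python
def top_n_by_scanning(counts, n=22):
    """Pick the n most frequent words by scanning the whole map once per pick."""
    remaining = dict(counts)  # work on a copy; entries are deleted as they are emitted
    result = []
    for _ in range(min(n, len(remaining))):
        best = None
        # Full pass over the map to find the current maximum.
        for word in remaining:
            if best is None or remaining[word] > remaining[best]:
                best = word
        result.append((best, remaining[best]))
        del remaining[best]  # so the next pass finds the runner-up
    return result

print(top_n_by_scanning({"she": 553, "so": 152, "you": 481}, 2))
```

Each pick costs a pass over every distinct word, so the total work grows with 22 times the vocabulary size. That is wasteful, but it needs no sort at all.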

Minified:

{gsub("[^a-zA-Z]"," ");for(;NF;NF--)a[tolower($NF)]++}
END{split("the and of to a i it in or is",b," ");
for(w in b)delete a[b[w]];d=1;for(w in a){e=a[w]/(78-length(w));if(e>d)d=e}
for(i=22;i;--i){e=0;for(w in a)if(a[w]>e)e=a[x=w];l=a[x]/d-2;
t=sprintf(sprintf("%%%dc",l)," ");gsub(" ","_",t);if(i==22)print" "t;
print"|"t"| "x;delete a[x]}}

Line breaks are for clarity only: they are not necessary and should not be counted.


Output:

$ gawk -f wordfreq.awk.min < 11.txt 
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so
$ sed 's/you/superlongstring/gI' 11.txt | gawk -f wordfreq.awk.min
 ______________________________________________________________________
|______________________________________________________________________| she
|_____________________________________________________________| superlongstring
|__________________________________________________________| said
|__________________________________________________| alice
|____________________________________________| was
|_________________________________________| that
|_________________________________| as
|______________________________| her
|___________________________| with
|___________________________| at
|__________________________| s
|__________________________| t
|________________________| on
|________________________| all
|_____________________| this
|_____________________| for
|_____________________| had
|____________________| but
|___________________| be
|___________________| not
|__________________| they
|_________________| so

Readable; 633 characters (originally 949):

{
    gsub("[^a-zA-Z]"," ");
    for(;NF;NF--)
    a[tolower($NF)]++
}
END{
    # remove "short" words
    split("the and of to a i it in or is",b," ");
    for (w in b) 
    delete a[b[w]];
    # Find the bar ratio
    d=1;
    for (w in a) {
    e=a[w]/(78-length(w));
    if (e>d)
        d=e
    }
    # Print the entries highest count first
    for (i=22; i; --i){               
    # find the highest count
    e=0;
    for (w in a) 
        if (a[w]>e)
        e=a[x=w];
        # Print the bar
    l=a[x]/d-2;
    # make a string of "_" the right length
    t=sprintf(sprintf("%%%dc",l)," ");
    gsub(" ","_",t);
    if (i==22) print" "t;
    print"|"t"| "x;
    delete a[x]
    }
}
dmckee
Nice work, good you included an indented / commented version ;-)
ChristopheD
+6  A: 

*sh (+curl), partial solution

This is incomplete, but for the hell of it, here's the word-frequency counting half of the problem in 192 bytes:

curl -s http://www.gutenberg.org/files/11/11.txt|sed -e 's@[^a-z]@\n@gi'|tr '[:upper:]' '[:lower:]'|egrep -v '(^[^a-z]*$|\b(the|and|of|to|a|i|it|in|or|is)\b)' |sort|uniq -c|sort -n|tail -n 22
Frank Farmer
+11  A: 

JavaScript 1.8 (SpiderMonkey) - 354

x={};p='|';e=' ';z=[];c=77
while(l=readline())l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y)x[y]?x[y].c++:z.push(x[y]={w:y,c:1}))
z=z.sort(function(a,b)b.c-a.c).slice(0,22)
for each(v in z){v.r=v.c/z[0].c
c=c>(l=(77-v.w.length)/v.r)?l:c}for(k in z){v=z[k]
s=Array(v.r*c|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

Sadly, the for([k,v]in z) from the Rhino version doesn't seem to want to work in SpiderMonkey, and readFile() is a little easier than using readline() but moving up to 1.8 allows us to use function closures to cut a few more lines....

Adding whitespace for readability:

x={};p='|';e=' ';z=[];c=77
while(l=readline())
  l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,
   function(y) x[y] ? x[y].c++ : z.push( x[y] = {w: y, c: 1} )
  )
z=z.sort(function(a,b) b.c - a.c).slice(0,22)
for each(v in z){
  v.r=v.c/z[0].c
  c=c>(l=(77-v.w.length)/v.r)?l:c
}
for(k in z){
  v=z[k]
  s=Array(v.r*c|0).join('_')
  if(!+k)print(e+s+e)
  print(p+s+p+e+v.w)
}

Usage: js golf.js < input.txt

Output:

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|____________________________| s
|____________________________| t
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| for
|______________________| had
|______________________| but
|_____________________| be
|_____________________| not
|___________________| they
|___________________| so

(base version - doesn't handle bar widths correctly)

JavaScript (Rhino) - 405 395 387 377 368 343 304 chars

I think my sorting logic is off, but.. I dunno. Brainfart fixed.

Minified (abusing \n's interpreted as a ; sometimes):

x={};p='|';e=' ';z=[]
readFile(arguments[0]).toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y){x[y]?x[y].c++:z.push(x[y]={w:y,c:1})})
z=z.sort(function(a,b){return b.c-a.c}).slice(0,22)
for([k,v]in z){s=Array((v.c/z[0].c)*70|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}
Matt
Ah, sir. I believe this is your gauntlet. Have your second speak to mine.
dmckee
BTW-- I like the `i[tns]?` bit. Very sneaky.
dmckee
@dmckee - well played, I don't think I can beat your 336, enjoy your much-deserved upvote :)
Matt
You can definitely beat 336... There is a 23 character cut available -- `.replace(/[^\w ]/g, e).split(/\s+/).map(` can be replaced with `.replace(/\w+/g,` and use the same function your `.map` did... Also not sure if Rhino supports `function(a,b)b.c-a.c` instead of your sort function (spidermonkey does), but that will shave `{return }` ... `b.c-a.c` is a better sort than `a.c<b.c` btw... Editing a Spidermonkey version at the bottom with these changes
gnarf
I moved my SpiderMonkey version up to the top since it conforms to the bar width constraint... Also managed to cut out a few more chars in your original version by using a negative lookahead regexp to deny words allowing for a single replace(), and golfed a few ifs with `?:` Great base to work from though!
gnarf
This will not eliminate stop words when surrounded by digits or underscores such as in `foo_the123` where only `foo` should remain.
Joey
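
The last comment's point can be reproduced with Python's `re` module, assuming its `\b` and `\w` behave like the JavaScript ones for ASCII input. The function names below are made up for this sketch; the stop-word pattern is the one from the answer, made non-capturing.

```python
import re

STOP = r"the|and|of|to|a|i[tns]?|or"

def words_via_word_boundaries(text):
    # Same shape as the golfed regex: a \w+ run that is not exactly a stop word.
    # Digits and underscores count as \w, so they glue letter runs together.
    return re.findall(r"\b(?!(?:%s)\b)\w+" % STOP, text.lower())

def words_via_letter_runs(text):
    # Spec-style: split on anything outside a-z first, then drop stop words.
    runs = re.findall(r"[a-z]+", text.lower())
    return [w for w in runs if not re.fullmatch(STOP, w)]

print(words_via_word_boundaries("foo_the123"))  # the whole token survives
print(words_via_letter_runs("foo_the123"))      # only "foo" remains
```

The `i[tns]?` alternative covers i, it, in and is in one go, which is why it saves characters over listing them separately.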
+8  A: 

Python 2.6, 347 chars

import re
W,x={},"a and i in is it of or the to".split()
[W.__setitem__(w,W.get(w,0)-1)for w in re.findall("[a-z]+",file("11.txt").read().lower())if w not in x]
W=sorted(W.items(),key=lambda p:p[1])[:22]
bm=(76.-len(W[0][0]))/W[0][1]
U=lambda n:"_"*int(n*bm)
print "".join(("%s\n|%s| %s "%((""if i else" "+U(n)),U(n),w))for i,(w,n)in enumerate(W))

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|_____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| s 
|____________________________| t 
|__________________________| on 
|__________________________| all 
|_______________________| this 
|_______________________| for 
|_______________________| had 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so 
AKX
You can lose the line `bm=(76.-len(W[0][0]))/W[0][1]` since you're only using bm once (make the next line `U=lambda n:"_"*int(n*(76.-len(W[0][0]))/W[0][1])`, shaves off 5 characters. Also: why would you use a 2-character variable name in code golfing? ;-)
ChristopheD
On the last line the space after print isn't necessary, shaves off one character
ChristopheD
Doesn't consider the case when the second-most frequent word is very long, right?
Joey
@ChristopheD: Because I had been staring at that code for a little too long. :P Good catch. @Johannes: That could be fixed too, yes. Not sure all other implementations did it when I wrote this either.
AKX
+18  A: 

Ruby, 215, 216, 218, 221, 224, 236, 237 chars

update 1: Hurray! It's a tie with JS Bangs' solution. Can't think of a way to cut down any more :)

update 2: Played a dirty golf trick. Changed each to map to save 1 character :)

update 3: Changed File.read to IO.read +2. Array.group_by wasn't very fruitful, changed to reduce +6. Case insensitive check is not needed after lower casing with downcase in regex +1. Sorting in descending order is easily done by negating the value +6. Total savings +15

update 4: [0] rather than .first, +3. (@Shtééf)

update 5: Expand variable l in-place, +1. Expand variable s in-place, +2. (@Shtééf)

update 6: Use string addition rather than interpolation for the first line, +2. (@Shtééf)

w=(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take 22;m=76-w[0][0].size;puts' '+'_'*m;w.map{|x,f|puts"|#{'_'*(f*1.0/w[0][1]*m)}| #{x} "}
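
For readers who don't speak Ruby, here is a rough Python rendering of two tricks from update 3: folding the word list into a count table with reduce, and getting descending order by sorting on the negated count. The names `count_words` and `top` are made up for this sketch.

```python
from functools import reduce

def count_words(words):
    # Fold the word list into a dict of counts, much like reduce(Hash.new 0).
    def step(counts, word):
        counts[word] = counts.get(word, 0) + 1
        return counts  # the accumulator must be returned, as with ";m" in the Ruby
    return reduce(step, words, {})

def top(words, n=22):
    # An ascending sort on the negated count yields descending frequency.
    return sorted(count_words(words).items(), key=lambda kv: -kv[1])[:n]

print(top(["b", "a", "b", "c", "a", "b"], 2))
```

Python's sort is stable, so words with equal counts stay in first-seen order here.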

update 7: I went through a whole lot of hoopla to detect the first iteration inside the loop, using instance variables. All I got is +1, though perhaps there is potential. Preserving the previous version, because I believe this one is black magic. (@Shtééf)

(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take(22).map{|x,f|@f||(@f=f;puts' '+'_'*(@m=76-x.size));puts"|#{'_'*(f*1.0/@f*@m)}| #{x} "}

Readable version

string = File.read($_).downcase

words = string.scan(/[a-z]+/i)
allowed_words = words - %w{the and of to a i it in or is}
sorted_words = allowed_words.group_by{ |x| x }.map{ |x,y| [x, y.size] }.sort{ |a,b| b[1] <=> a[1] }.take(22)
highest_frequency = sorted_words.first
highest_frequency_count = highest_frequency[1]
highest_frequency_word = highest_frequency[0]

word_length = highest_frequency_word.size
widest = 76 - word_length

puts " #{'_' * widest}"    
sorted_words.each do |word, freq|
  width = (freq * 1.0 / highest_frequency_count) * widest
  puts "|#{'_' * width}| #{word} "
end

To use:

echo "Alice.txt" | ruby -ln GolfedWordFrequencies.rb

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|_____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| s 
|____________________________| t 
|__________________________| on 
|__________________________| all 
|_______________________| this 
|_______________________| for 
|_______________________| had 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so 
Anurag
Whoa, Ruby is beating Perl.
chpwn
Isn't "p" a shortcut for "puts" ? That could shave a few.
rfusca
Nice. Your use of `scan`, though, gave me a better idea, so I got ahead again :).
JSBangs
@rfusca, `p` puts quotes around the output so it wouldn't match OP's.
Anurag
@JS Looks like it's going to be a cat and mouse game, until J comes along :)
Anurag
You miscounted. At update 3, you were at 224 characters. I brought you back to 221, any way. :) That reduce trick is black magic. :o
Shtééf
@Shtééf - thanks :) .. last thing we want in code-golf is miscounting on the higher side.. lol :o)
Anurag
Isn't using the shell to "read" and pipe the file cheating?
mP
The question states that the input can be piped in. Also, we are just piping in the file name, not its contents.
Anurag
Okay, I am now totally done with this. You may now shout at me. :) I sure hope that Perl version doesn't become much shorter.
Shtééf
You need to scale the bars so the longest word plus its bar fits on 80 characters. As Brian suggested, a long second word will break your program.
Gabe
I wonder why this is still gathering votes. The solution is incorrect (in the general case) and two way shorter Ruby solutions are here by now.
Joey
Now, correct me if I'm wrong, but instead of using "downcase", why don't you use the regexp case-insensitive flag? That saves 6-7 bytes, does it not?
st0le
How about [0..21] instead of .take 22?
Adam
+1  A: 

Java, slowly getting shorter (1500 1358 1241 1020 913 890 chars)

Stripped even more white space and shortened variable names. Removed generics where possible, and removed the inline class and a try/catch block. Too bad, my 900 version had a bug.

removed another try / catch block

import java.net.*;import java.util.*;import java.util.regex.*;import org.apache.commons.io.*;public class G{public static void main(String[]a)throws Exception{String text=IOUtils.toString(new URL(a[0]).openStream()).toLowerCase().replaceAll("\\b(the|and|of|to|a|i[tns]?|or)\\b","");final Map<String,Integer>p=new HashMap();Matcher m=Pattern.compile("\\b\\w+\\b").matcher(text);Integer b;while(m.find()){String w=m.group();b=p.get(w);p.put(w,b==null?1:b+1);}List<String>v=new Vector(p.keySet());Collections.sort(v,new Comparator(){public int compare(Object l,Object m){return p.get(m)-p.get(l);}});boolean t=true;float r=0;for(String w:v.subList(0,22)){if(t){t=false;r=p.get(w)/(float)(80-(w.length()+4));System.out.println(" "+new String(new char[(int)(p.get(w)/r)]).replace('\0','_'));}System.out.println("|"+new String(new char[(int)(((Integer)p.get(w))/r)]).replace('\0','_')+"|"+w);}}}

Readable version:

import java.net.*;
import java.util.*;
import java.util.regex.*;
import org.apache.commons.io.*;

public class G{

    public static void main(String[] a) throws Exception{
        String text =
            IOUtils.toString(new URL(a[0]).openStream())
                .toLowerCase()
                .replaceAll("\\b(the|and|of|to|a|i[tns]?|or)\\b", "");
        final Map<String, Integer> p = new HashMap();
        Matcher m = Pattern.compile("\\b\\w+\\b").matcher(text);
        Integer b;
        while(m.find()){
            String w = m.group();
            b = p.get(w);
            p.put(w, b == null ? 1 : b + 1);
        }
        List<String> v = new Vector(p.keySet());
        Collections.sort(v, new Comparator(){

            public int compare(Object l, Object m){
                return p.get(m) - p.get(l);
            }
        });
        boolean t = true;
        float r = 0;
        for(String w : v.subList(0, 22)){
            if(t){
                t = false;
                r = p.get(w) / (float) (80 - (w.length() + 4));
                System.out.println(" "
                    + new String(new char[(int) (p.get(w) / r)]).replace('\0',
                        '_'));
            }
            System.out.println("|"
                + new String(new char[(int) (((Integer) p.get(w)) / r)]).replace('\0',
                    '_') + "|" + w);
        }
    }
}
seanizer
not quite golf, either =/
Justin L.
I like goofball high-character-count golf submissions. It's good to break up the monotony of line noise with something readable and almost laughably verbose.
John Y
@John: I disagree. Even if you are going to use a verbose language (see my fortran 77 entries in some earlier code golfs for instance) you should code it as tightly as the language allows. Code golf isn't about good practices; indeed it is very nearly the antithesis of good practice.
dmckee
@dmckee: I completely understand and accept your viewpoint. Still, I personally like to see just about any submission. Variety is the spice of life, and to me that even includes differing (even opposing) spirit and ideals in code golf. Better to dance, but dance "poorly" (for whatever definition of dance), than to stand in the corner or worse yet, not even show up.
John Y
+1  A: 

Javascript, 348 characters

After I finished mine, I stole some ideas from Matt :3

t=prompt().toLowerCase().replace(/\b(the|and|of|to|a|i[tns]?|or)\b/gm,'');r={};o=[];t.replace(/\b([a-z]+)\b/gm,function(a,w){r[w]?++r[w]:r[w]=1});for(i in r){o.push([i,r[i]])}m=o[0][1];o=o.slice(0,22);o.sort(function(F,D){return D[1]-F[1]});for(B in o){F=o[B];L=new Array(~~(F[1]/m*(76-F[0].length))).join('_');print(' '+L+'\n|'+L+'| '+F[0]+' \n')}

Requires print and prompt function support.

M28
This will have some problems with strings like `the_foo`, right? (Because then `\b` breaks apart)
Joey
+25  A: 
belisarius
You think "RegularExpression" is bad? I cried when I typed "System.Text.RegularExpressions.Regex.Split" into the C# version, up until I saw the Objective-C code: "stringWithContentsOfFile", "enumerateSubstringsInRange", "NSStringEnumerationByWords", "sortedArrayUsingComparator", and so on.
Gabe
@Gabe Thanks ... I feel better now. In Spanish we say "mal de muchos, consuelo de tontos" .. Something like "Many troubled, fools relieved" :D
belisarius
The `|i|` is redundant in your regex because you already have `.|`.
Gabe
@Gabe yes, thnx. Corrected.
belisarius
I like that Spanish saying. The closest thing I can think of in English is "misery loves company". Here's my translation attempt: "It's a fool who, when suffering, takes consolation in thinking of others in the same situation." Amazing work on the Mathematica implementation, btw.
dreeves
@dreeves Foolishness surpasses the language barrier easily ... Glad to see you like my little Mathematica program, I'm just starting to learn the language
belisarius
Pared it down to 199...
Michael Pilat
@Michael Pilat Wow! I've a lot to learn ... wonderful!
belisarius
The 199 version will not interpret things like `the_foo` according to spec, right?
Joey
@Johannes Rössel Good eye! The bug is in all versions due to Mathematica matching the underscore as a letter char (Why did they do that??!!). The regexp should be something like "(_|\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+" , but \\W also recognizes digits as letters, so perhaps an utterly correct version is a little longer.
belisarius
@belisarius: Actually all regex engines consider `\w` as something like `[a-zA-Z0-9_]` or maybe `[\p{L}\p{Nd}_]` for Unicode-aware engines. And since `\b` is considered a boundary between `\w` and `\W` this doesn't work according to the spec here. But many solutions have that problem and it took me quite a few characters to get that part right in my solution. As for the *why* I think it fits with what many programming languages allow as identifiers. You can simply match them with `\w+` (doesn't *quite* work, but close enough for most hackish solutions).
Joey
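To make the `\w` / `\b` point above concrete, here is a minimal Python 3 sketch (not taken from any of the submissions; the sample string and the variable names are only illustrative). It contrasts a `\b\w+\b` match with a match restricted to a-z, which is what the challenge rules ask for.

```python
import re

# Illustrative input: underscores and digits sit between the letters.
sample = "the_foo_or123bar"

# \w covers letters, digits and underscore, so \b never fires inside
# the string and the whole thing is reported as a single "word".
loose = re.findall(r"\b\w+\b", sample)

# Matching runs of a-z only treats every other character as a separator,
# so stop words such as "the" and "or" become visible again.
strict = re.findall(r"[a-z]+", sample.lower())

print(loose)
print(strict)
```

With the strict pattern the stop-word filter can be applied afterwards to plain words, instead of relying on `\b` inside the regex.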
@Johannes Rössel Yep, maybe because "_" is usually an allowed char in many languages. Curiously, in Mathematica it is a special character and cannot be used in identifiers :D
belisarius
+3  A: 

Java - 896 chars

931 chars

1233 chars made unreadable

1977 chars "uncompressed"


Update: I have aggressively reduced the character count. Omits single-letter words per updated spec.

I envy C# and LINQ so much.

import java.util.*;import java.io.*;import static java.util.regex.Pattern.*;class g{public static void main(String[] a)throws Exception{PrintStream o=System.out;Map<String,Integer> w=new HashMap();Scanner s=new Scanner(new File(a[0])).useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));while(s.hasNext()){String z=s.next().trim().toLowerCase();if(z.equals(""))continue;w.put(z,(w.get(z)==null?0:w.get(z))+1);}List<Integer> v=new Vector(w.values());Collections.sort(v);List<String> q=new Vector();int i,m;i=m=v.size()-1;while(q.size()<22){for(String t:w.keySet())if(!q.contains(t)&&w.get(t).equals(v.get(i)))q.add(t);i--;}int r=80-q.get(0).length()-4;String l=String.format("%1$0"+r+"d",0).replace("0","_");o.println(" "+l);o.println("|"+l+"| "+q.get(0)+" ");for(i=m-1;i>m-22;i--){o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");}}}

"Readable":

import java.util.*;
import java.io.*;
import static java.util.regex.Pattern.*;
class g
{
   public static void main(String[] a)throws Exception
      {
      PrintStream o = System.out;
      Map<String,Integer> w = new HashMap();
      Scanner s = new Scanner(new File(a[0]))
         .useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));
      while(s.hasNext())
      {
         String z = s.next().trim().toLowerCase();
         if(z.equals(""))
            continue;
         w.put(z,(w.get(z) == null?0:w.get(z))+1);
      }
      List<Integer> v = new Vector(w.values());
      Collections.sort(v);
      List<String> q = new Vector();
      int i,m;
      i = m = v.size()-1;
      while(q.size()<22)
      {
         for(String t:w.keySet())
            if(!q.contains(t)&&w.get(t).equals(v.get(i)))
               q.add(t);
         i--;
      }
      int r = 80-q.get(0).length()-4;
      String l = String.format("%1$0"+r+"d",0).replace("0","_");
      o.println(" "+l);
      o.println("|"+l+"| "+q.get(0)+" ");
      for(i = m-1; i > m-22; i--)
      {
         o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");
      }
   }
}

Output of Alice:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

Output of Don Quixote (also from Gutenberg):

 ________________________________________________________________________
|________________________________________________________________________| that
|________________________________________________________| he
|______________________________________________| for
|__________________________________________| his
|________________________________________| as
|__________________________________| with
|_________________________________| not
|_________________________________| was
|________________________________| him
|______________________________| be
|___________________________| don
|_________________________| my
|_________________________| this
|_________________________| all
|_________________________| they
|________________________| said
|_______________________| have
|_______________________| me
|______________________| on
|______________________| so
|_____________________| you
|_____________________| quixote
Jonathon
Wholly carp, is there really no way to make it shorter in Java? I hope you guys get paid by number of characters and not by functionality :-)
Nas Banov
Java is absolute shit, wow
Pierreten
+2  A: 

Java - 991 chars (incl newlines and indentations)

I took the code of @seanizer, fixed a bug (he omitted the 1st output line), and made some improvements to make the code more 'golfy'.

import java.util.*;
import java.util.regex.*;
import org.apache.commons.io.IOUtils;
public class WF{
 public static void main(String[] a)throws Exception{
  String t=IOUtils.toString(new java.net.URL(a[0]).openStream());
  class W implements Comparable<W> {
   String w;int f=1;W(String W){w=W;}public int compareTo(W o){return o.f-f;}
   String d(float r){char[]c=new char[(int)(f/r)];Arrays.fill(c,'_');return "|"+new String(c)+"| "+w;}
  }
  Map<String,W>M=new HashMap<String,W>();
  Matcher m=Pattern.compile("\\b\\w+\\b").matcher(t.toLowerCase());
  while(m.find()){String w=m.group();W W=M.get(w);if(W==null)M.put(w,new W(w));else W.f++;}
  M.keySet().removeAll(Arrays.asList("the,and,of,to,a,i,it,in,or,is".split(",")));
  List<W>L=new ArrayList<W>(M.values());Collections.sort(L);int l=76-L.get(0).w.length();
  System.out.println(" "+new String(new char[l]).replace('\0','_'));
  for(W w:L.subList(0,22))System.out.println(w.d((float)L.get(0).f/(float)l));
 }
}

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so

BalusC
`new String(new char[l]).replace('\0','_')` that's a nice trick to remember, thanks.
seanizer
+5  A: 

Java - 886 865 756 744 742 744 752 742 714 680 chars

  • Updates before first 742: improved regex, removed superfluous parameterized types, removed superfluous whitespace.

  • Update 742 > 744 chars: fixed the fixed-length hack. It's only dependent on the 1st word, not other words (yet). Found several places to shorten the code (\\s in regex replaced by a literal space, and ArrayList replaced by Vector). I'm now looking for a short way to remove the Commons IO dependency and read from stdin.

  • Update 744 > 752 chars: I removed the commons dependency. It now reads from stdin. Paste the text in stdin and hit Ctrl+Z to get result.

  • Update 752 > 742 chars: I removed public and a space, made classname 1 char instead of 2 and it's now ignoring one-letter words.

  • Update 742 > 714 chars: Updated as per comments of Carl: removed redundant assignment (742 > 730), replaced m.containsKey(k) by m.get(k)!=null (730 > 728), introduced substringing of line (728 > 714).

  • Update 714 > 680 chars: Updated as per comments of Rotsor: improved bar size calculation to remove unnecessary casting and improved split() to remove unnecessary replaceAll().


import java.util.*;class F{public static void main(String[]a)throws Exception{StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);}}

More readable version:

import java.util.*;
class F{
 public static void main(String[]a)throws Exception{
  StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));
  final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);
  List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});
  int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);
  for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);
 }
}

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

It really sucks that Java doesn't have String#join() and closures (yet).

Edit by Rotsor:

I have made several changes to your solution:

  • Replaced List with a String[]
  • Reused the 'args' argument instead of declaring my own String array. Also used it as an argument to .toArray()
  • Replaced StringBuffer with a String (yes, yes, terrible performance)
  • Replaced Java sorting with a selection-sort with early halting (only first 22 elements have to be found)
  • Aggregated some int declaration into a single statement
  • Implemented the non-cheating algorithm finding the most limiting line of output. Implemented it without FP.
  • Fixed the problem of the program crashing when there were fewer than 22 distinct words in the text
  • Implemented a new algorithm of reading input, which is fast and only 9 characters longer than the slow one.
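The "most limiting line" idea from the list above can be sketched in Python 3 as follows. This is only an illustration of the approach, not a translation of the golfed Java: the function name `bar_lengths` and the sample data in the usage note are hypothetical, and it assumes every word is shorter than 76 characters.

```python
def bar_lengths(top):
    """top: list of (word, count) pairs, sorted by descending count.

    Returns (word, bar_length) pairs scaled so that every line
    "|" + bar + "| " + word + " " fits in 80 columns.
    """
    best_room, best_count = None, None
    for word, count in top:
        # Four columns go to "|", "|", and the two spaces around the word.
        room = 76 - len(word)
        # Keep the pair with the smallest room/count ratio. The comparison
        # is cross-multiplied, so no floating point is involved.
        if best_room is None or count * best_room > room * best_count:
            best_room, best_count = room, count
    # Integer division rounds down, so no bar exceeds its available room.
    return [(word, count * best_room // best_count) for word, count in top]
```

For example, `bar_lengths([("she", 100), ("superlongstringstring", 90), ("said", 50)])` lets the long second word set the scale rather than the first one, which is the case the "cheating" versions get wrong.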

The condensed code is 688 711 684 characters long:

import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;(j=System.in.read())>0;w+=(char)j);for(String W:w.toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(W,m.get(W)!=null?m.get(W)+1:1);l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}

The fast version (720 693 characters)

import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}

More readable version:

import java.util.*;class F{public static void main(String[]l)throws Exception{
    Map<String,Integer>m=new HashMap();String w="";
    int i=0,k=0,j=8,x,y,g=22;
    for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{
        if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";
    }}
    l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;
    for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}
    for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}
    String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');
    System.out.println(" "+s);
    for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}
}

The version without behaviour improvements is 615 characters:

import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);for(;i<g;++i)for(j=i;++j<l.length;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}i=76-l[0].length();String s=new String(new char[i]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/m.get(l[0]))+"| "+w);}}}
BalusC
Couldn't you just use the fully-qualified name to `IOUtils` instead of importing it? As far as I can see you're using it only once anyway.
Joey
You kind of cheated by assuming that the longest bar will be exactly 75 characters. You have to make sure that no bar+word is longer than 80 chars.
Gabe
You're missing a space after the word. ;)
st0le
@st0le: not anymore.
Joey
Jonathon
@Gabe and @Jon: I removed the 75 char fix, it's now however only dependent on 1st word. I removed the Commons IO dependency as well.
BalusC
`m.get(w)==null` is a shorter check than `m.containsKey(w)`, and you don't actually need to declare `t` - you can dump it straight into the for-each construct. I think it might also be shorter to assign the `new String(new char[c]).replace...` to a string, and use substring in the second call to get a slice of it.
Carl
@Carl: thanks, I now see in the update history that you actually updated it, but I totally missed it and have overwritten it, sorry! :)
BalusC
Looks like you could shave some characters by making `b` a String instead of a StringBuffer. I don't want to think about what the performance would be, though (especially since you're adding one character at a time).
Michael Myers
@mmyers: Code golf isn't all about performance! You're right, I was too much used to a StringBuilder/Buffer for this kind of job. Updating...
BalusC
@mmyers: It didn't work nicely, it's indeed too slow. Pressing Ctrl+Z would abort it immediately. It took about 10 seconds to gather the example input instead of a subsecond.
BalusC
What is the purpose of `final` before `Map`? It seems extraneous to me.
Gabe
Here is a trick: Instead of writing System.out.println(...), define System x; and write later x.out.println(...)
Landei
@Gabe: so that it can be accessed in the anonymous `Comparator` class. @Landei: that's not possible in Java.
BalusC
Can you make `m` a `Map<String,Float>`? I would expect that to save 4 strokes.
Gabe
You can save some space by placing the code in an initializer: enum X{X{{..code goes here..}}}
ealf
@ealf: you need *somewhere* a `main()` anyway to execute the code.
BalusC
I *think* there is a way to shave some characters by doing something like: `final Map<String,Integer>m=new HashMap();new BufferedReader(new InputStreamReader(System.in)){{for(String w:readLine().toLowerCase().replaceAll("\\b(.|the|and|of|to|i[tns]|or)\\b|\\W"," ").split(" +"))m.put(w,m.get(w)!=null?m.get(w)+1:1);}};`
Carl
though that assumes no words are hyphen split over lines, which doesn't seem to be accounted for elsewhere, so seems safe enough an assumption.
Carl
@Carl: Unfortunately, the input can consist of multiple lines. I already considered that.
BalusC
hrm, how about sticking a `while(ready())` in front of the `for`?
Carl
Why do you need so much casting here? `(int)(((float)m.get(w)/m.get(l.get(0)))*c)` You can do this: `m.get(w)*c/m.get(l.get(0))`, saving 16 characters, which has the added benefit of being always exact and not FP-dangerous. However, l.get(0) is cheating. This solution will not work with the 'superlongstring' example.
Rotsor
You can replace `.replaceAll("\\b(.|the|and|of|to|i[tns]|or)\\b|\\W"," ").split(" +")` with `.split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+")`, saving 18 characters!
Rotsor
The solution with the two changes above and StringBuffer replaced with a String is just 641 characters long. Nobody required linear time, so I think String instead of StringBuffer is acceptable.
Rotsor
@Rotsor: Thank you very much for the hints :) I'll update the answer soon. And yes, I am cheating. But that requirement came in later and it's going to be another >100 chars extra ...
BalusC
I have changed it to a version without cheating, also fixing a bug of IndexOutOfBoundsException. I will append it to your post. Feel free to do anything you want with it.
Rotsor
The requirement didn't come later by the way, it was there from the beginning, but there just was no example to show this bug.
Rotsor
What an ugly fucking language java is.
Pierreten
+11  A: 

Python 2.x, latitudinarian approach = 227 183 chars

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)if w not in'andithetoforinis')[:22]
for l,w in r:print(78-len(r[0][1]))*l/r[0][0]*'=',w

Allowing for freedom in the implementation, I constructed a string concatenation that contains all the words requested for exclusion (the, and, of, to, a, i, it, in, or, is) - plus it also excludes the two infamous "words" s and t from the example - and I threw in for free the exclusion of an, for, he. I tried all concatenations of those words against a corpus of the words from Alice, the King James Bible and the Jargon file to see if there are any words that would be mis-excluded by the string. And that is how I ended up with two exclusion strings: itheandtoforinis and andithetoforinis.
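The substring trick described above can be checked with a short Python 3 sketch. The helper name `mis_excluded` and the sample word list in the usage note are made up for illustration; only the exclusion string and the ignore list come from the answer.

```python
excl = "andithetoforinis"
ignore = "the and of to a i it in or is".split()

# Every word on the official ignore list occurs as a substring of excl,
# so the test `w in excl` drops all of them.
assert all(w in excl for w in ignore)

def mis_excluded(words, excl=excl, ignore=ignore):
    """Words the substring test drops although they are not on the ignore list."""
    return sorted({w for w in words if w in excl and w not in ignore})
```

Calling `mis_excluded(["s", "t", "an", "for", "he", "alice", "not"])` reports the extra casualties mentioned in the text (s, t, an, for, he) and leaves ordinary words alone. Note that the empty string is a substring of everything, so any empty tokens produced by the split are dropped by the same test.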

PS. borrowed from other solutions to shorten the code.

=========================================================================== she 
================================================================= you
============================================================== said
====================================================== alice
================================================ was
============================================ that
===================================== as
================================= her
============================== at
============================== with
=========================== on
=========================== all
======================== this
======================== had
======================= but
====================== be
====================== not
===================== they
==================== so
=================== very
=================== what
================= little

Rant

Regarding words to ignore, one would think those would be taken from a list of the most used words in English. That list depends on the text corpus used. Per some of the most popular lists (http://en.wikipedia.org/wiki/Most_common_words_in_English, http://www.english-for-students.com/Frequently-Used-Words.html, http://www.sporcle.com/games/common_english_words.php), the top 10 words are: the be(am/are/is/was/were) to of and a in that have I

The top 10 words from the Alice in Wonderland text are the and to a of it she i you said
The top 10 words from the Jargon File (v4.4.7) are the a of to and in is that or for

So the question is why "or" was included in the problem's ignore list, when it's ~30th in popularity, while the word "that" (8th most used) is not, etc. Hence I believe the ignore list should be provided dynamically (or could be omitted).

An alternative idea would be simply to skip the top 10 words from the result - which would actually shorten the solution (elementary - it only has to show the 11th to 32nd entries).
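A rough Python 3 sketch of that alternative, written for readability rather than golfed. The function name `chart_words` and its parameters are hypothetical; it relies on `collections.Counter.most_common`, and words with equal counts come out in an unspecified but deterministic order.

```python
import re
from collections import Counter

def chart_words(text, skip=10, show=22):
    """Return the (word, count) pairs ranked skip+1 .. skip+show.

    No ignore list is used: the `skip` most frequent words are dropped.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(skip + show)[skip:]
```

With the defaults this yields the 11th to 32nd most frequent words, and no stop-word list appears in the code at all.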


Python 2.x, punctilious approach = 277 243 chars

The chart drawn in the above code is simplified (using only one character for the bars). If one wants to reproduce exactly the chart from the problem description (which was not required), this code will do it:

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)-set(sys.argv))[:22]
h=min(9*l/(77-len(w))for l,w in r)
print'',9*r[0][0]/h*'_'
for l,w in r:print'|'+9*l/h*'_'+'|',w

I take issue with the somewhat random choice of the 10 words to exclude (the, and, of, to, a, i, it, in, or, is), so those are to be passed as command line parameters, like so:
python WordFrequencyChart.py the and of to a i it in or is <"Alice's Adventures in Wonderland.txt"

This is 213 chars + 30 if we account for the "original" ignore list passed on command line = 243

PS. The second code also does "adjustment" for the lengths of all top words, so none of them will overflow in degenerate case.

 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|_____________________________________________________| said
|______________________________________________| alice
|_________________________________________| was
|______________________________________| that
|_______________________________| as
|____________________________| her
|__________________________| at
|__________________________| with
|_________________________| s
|_________________________| t
|_______________________| on
|_______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|___________________| not
|_________________| they
|_________________| so
Nas Banov
Nice solution so far although the word ignore list isn't implemented (yet) and the bars are a bit rudimentary at the moment.
ChristopheD
@ChristopheD: it was there, but there was no "user guide". Just added a bunch of text
Nas Banov
Regarding your list of languages and solutions: Please look for solutions that use splitting along `\W` or use `\b` in a regex because those are very likely *not* according to spec, meaning they won't split on digits or `_` and they might also not remove stop words from strings such as `the_foo_or123bar`. They may not appear in the test text but the specification is pretty clear on that case.
Joey
+2  A: 

Gotta love the big ones...Objective-C (1070 931 905 chars)

#define S NSString
#define C countForObject
#define O objectAtIndex
#define U stringWithCString
main(int g,char**b){id c=[NSCountedSet set];S*d=[S stringWithContentsOfFile:[S U:b[1]]];id p=[NSPredicate predicateWithFormat:@"SELF MATCHES[cd]'(the|and|of|to|a|i[tns]?|or)|[^a-z]'"];[d enumerateSubstringsInRange:NSMakeRange(0,[d length])options:NSStringEnumerationByWords usingBlock:^(S*s,NSRange x,NSRange y,BOOL*z){if(![p evaluateWithObject:s])[c addObject:[s lowercaseString]];}];id s=[[c allObjects]sortedArrayUsingComparator:^(id a,id b){return(NSComparisonResult)([c C:b]-[c C:a]);}];g=[c C:[s O:0]];int j=76-[[s O:0]length];char*k=malloc(80);memset(k,'_',80);S*l=[S U:k length:80];printf(" %s\n",[[l substringToIndex:j]cString]),[[s subarrayWithRange:NSMakeRange(0,22)]enumerateObjectsUsingBlock:^(id a,NSUInteger x,BOOL*y){printf("|%s| %s\n",[[l substringToIndex:[c C:a]*j/g]cString],[a cString]);}];}

Switched to using a lot of deprecated APIs, removed some memory management that wasn't needed, more aggressive whitespace removal

 _________________________________________________________________________
|_________________________________________________________________________| she
|______________________________________________________________| said
|__________________________________________________________| you
|____________________________________________________| alice
|________________________________________________| was
|_______________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|________________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| so
|___________________| very
|__________________| what
|_________________| they
Joshua Weinberg
Note that the spec calls for treating ' as a separator, so "don't" parses as two words, "don" and "t". You'll see in the reference implementation that "s" and "t" are represented in the top 22...
dmckee
Kudos for doing it in obj-c (not a language you see often in code golfing)!
ChristopheD
@Christophe: And here we see exactly *why* we don't see it that often ;)
Joey
Try `#define S NSString`, `#define C countForObject`, and use these two appropriately. Also replace `calloc(80,1)` with simply `malloc(80)`, since you're setting the contents straight afterwards. Also, reuse the `a` parameter, to save on an `int` declaration. This should get it less than 1,000 chars...
brone
@brone thanks for the idea, took those, and some other extra stuff I saw, well below 1000 now
Joshua Weinberg
Use `id` instead of `NSCountedSet*` etc!
KennyTM
Holy hell, how did I not think of that....edited to fix, 905 :)
Joshua Weinberg
+33  A: 

Ruby 207 213 211 210 207 203 201 200 chars

An improvement on Anurag's version, incorporating a suggestion from rfusca. Also removes the argument to sort and includes a few other minor golfings.

w=(STDIN.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort.take 22;k,l=w[0];m=76.0-l.size;puts' '+'_'*m;w.map{|f,x|puts"|#{'_'*(m*f/k)}| #{x} "}

Execute as:

ruby GolfedWordFrequencies.rb < Alice.txt

Edit: put 'puts' back in, needs to be there to avoid having quotes in output.
Edit2: Changed File->IO
Edit3: removed /i
Edit4: Removed parentheses around (f*1.0), recounted
Edit5: Use string addition for the first line; expand s in-place.
Edit6: Made m float, removed 1.0. EDIT: Doesn't work, changes lengths. EDIT: No worse than before
Edit7: Use STDIN.read.
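For readers less fluent in golfed Ruby, the counting and sorting part of the one-liner above corresponds roughly to this Python 3 sketch. It is an illustration only: the names `STOP` and `top_words` are made up, and the bar drawing is left out.

```python
import re
from collections import Counter

# The ignore list from the challenge.
STOP = set("the and of to a i it in or is".split())

def top_words(text, n=22):
    """Return up to n (-count, word) pairs, most frequent first."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    # Pairs of (-count, word) come out in descending-count order from a
    # plain sort, so no comparator or key function is needed. Words with
    # equal counts are ordered alphabetically.
    return sorted((-c, w) for w, c in Counter(words).items())[:n]
```

Negating the count is what allows the Ruby version to call `sort` without an argument.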

archgoon
+1 - love the sorting part, very clever :)
Anurag
Hey, small optimization compared to coming up with the bulk of it in the first place. :)
archgoon
Nice! Added two of the changes I also made in Anurag's version. Shaves off another 4.
Shtééf
The solution has deviated from the original output, I'm going to try and figure out where that happened.
archgoon
Huh, note that the last two are the same length (in our and several other versions), but the original questioner has them as different. Anurag's original solution has this issue. It's going to be a pain tracking it down. I'm putting back in the 76.0 trick, since it isn't the problem.
archgoon
@archgoon: I applaud your noble effort, but string addition is not shorter for the loop. It's only shorter because you took out the trailing space. But don't feel bad, it doesn't make Perl look any better. ;)
Shtééf
@Shtééf, ah, that explains why you didn't do it already ;). At least we got the proper count. Congratulations to you.
archgoon
How about [0..21] instead of .take 22?
Adam
There's a shorter variant of this down further.
archgoon
+32  A: 

206

shell, grep, tr, grep, sort, uniq, sort, head, perl

~ % wc -c wfg
209 wfg
~ % cat wfg
egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|of|to|a|i|it|in|or|is'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
~ % # usage:
~ % sh wfg < 11.txt

hm, just seen above: sort -nr -> sort -n and then head -> tail => 208 :)
update2: erm, of course the above is silly, as it will be reversed then. So, 209.
update3: optimized the exclusion regexp -> 206

egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
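
A quick way to convince yourself that the shortened exclusion regexp still covers exactly the ten stop words is a small check like the following. It is a sketch in Python rather than egrep; `is_stop_word` is a hypothetical helper, and `fullmatch` is assumed to stand in for `egrep -w` on one-word lines.

```python
import re

STOP_WORDS = "the and of to a i it in or is".split()

# The shortened alternation from the 206-character version.
SHORT = re.compile(r"the|and|o[fr]|to|a|i[tns]?")

def is_stop_word(word):
    """True if the whole word matches the shortened pattern."""
    return SHORT.fullmatch(word) is not None

# Every original stop word is still excluded ...
print(all(is_stop_word(w) for w in STOP_WORDS))
# ... and near misses such as "on", "at" or "if" are kept.
print([w for w in ("on", "at", "if", "into") if is_stop_word(w)])
```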



for fun, here's a perl-only version (much faster):

~ % wc -c pgolf
204 pgolf
~ % cat pgolf
perl -lne'$1=~/^(the|and|o[fr]|to|.|i[tns])$/i||$f{lc$1}++while/\b([a-z]+)/gi}{@w=(sort{$f{$b}<=>$f{$a}}keys%f)[0..21];$Q=$f{$_=$w[0]};$B=76-y///c;print" "."_"x$B;print"|"."_"x($B*$f{$_}/$Q)."| $_"for@w'
~ % # usage:
~ % sh pgolf < 11.txt
stor
Impressive golfing!
ChristopheD
Most impressive indeed.
Camilo Martin
+33  A: 

GolfScript, 177 175 173 167 164 163 144 131 130 chars

Slow - 3 minutes for the sample text (130)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{"
|"\~1*2/0*'| '@}/

Explanation:

{           #loop through all characters
 32|.       #convert to lowercase and duplicate
 123%97<    #determine if is a letter
 n@if       #return either the letter or a newline
}%          #return an array (of ints)
]''*        #convert array to a string with magic
n%          #split on newline, removing blanks (stack is an array of words now)
"oftoitinorisa"   #push this string
2/          #split into groups of two, i.e. ["of" "to" "it" "in" "or" "is" "a"]
-           #remove any occurrences from the text
"theandi"3/-#remove "the", "and", and "i"
$           #sort the array of words
(1@         #takes the first word in the array, pushes a 1, reorders stack
            #the 1 is the current number of occurrences of the first word
{           #loop through the array
 .3$>1{;)}if#increment the count or push the next word and a 1
}/
]2/         #gather stack into an array and split into groups of 2
{~~\;}$     #sort by the latter element - the count of occurrences of each word
22<         #take the first 22 elements
.0=~:2;     #store the highest count
,76\-:1     #store the length of the first line
'_':0*' '\@ #make the first line
{           #loop through each word
"
|"\~        #start drawing the bar
1*2/0       #divide by zero
*'| '@      #finish drawing the bar
}/
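
For readers who do not speak GolfScript, the steps explained above can be sketched roughly as follows in Python. This is an illustrative reading of the explanation, assuming ASCII input; `count_words` is a hypothetical name and the code is not a line-by-line translation.

```python
def count_words(text, stop_words):
    # Map every character to a lowercase letter or a newline.
    chars = []
    for ch in text:
        low = ord(ch) | 32            # sets bit 5: 'A' (65) becomes 'a' (97)
        chars.append(chr(low) if 97 <= low <= 122 else "\n")

    # Split on newlines, dropping blanks and stop words.
    words = [w for w in "".join(chars).split("\n")
             if w and w not in stop_words]

    # Sort, then count runs of equal neighbours.
    words.sort()
    runs = []                         # list of [word, count]
    for w in words:
        if runs and runs[-1][0] == w:
            runs[-1][1] += 1          # same word again: bump its count
        else:
            runs.append([w, 1])       # new word: start a run at 1

    # Most frequent first, keep the top 22.
    runs.sort(key=lambda wc: -wc[1])
    return runs[:22]

print(count_words("She said: don't! She", {"the", "and"}))
```

Sorting first and then counting runs avoids needing a hash table, which is what makes the approach compact in a stack language.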

"Correct" (hopefully). (143)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{"
|"\~1*^/0*'| '@}/

Less slow - half a minute. (162)

'"'/' ':S*n/S*'"#{%q
'\+"
.downcase.tr('^a-z','
')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{"
|"\~1*2/0*'| '@}/

Output visible in revision logs.

Nabb
About GolfScript: http://www.golfscript.com/golfscript/
Assaf Lavie
Not correct, in that if the second word is really long it will wrap to the next line.
Gabe
Figures golfscript wins :P
RCIX
"divide by zero" ...GolfScript allows that?
JAB
I like the explanation!
Cornelius
+19  A: 

Windows PowerShell, 199 chars

$x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *
filter f($w){' '+'_'*$w
$x[-1..-22]|%{"|$('_'*($w*$_.Count/$x[-1].Count))| "+$_.Name}}
f(76..1|?{!((f $_)-match'.'*80)})[0]

(The last line break isn't necessary, but included here for readability.)

(Current code and my test files available in my SVN repository. I hope my test cases catch most common errors (bar length, problems with regex matching and a few others))

Assumptions:

  • US ASCII as input. It probably gets weird with Unicode.
  • At least two non-stop words in the text

History

Relaxed version (137), since that's counted separately by now, apparently:

($x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *)[-1..-22]|%{"|$('_'*(76*$_.Count/$x[-1].Count))| "+$_.Name}
  • doesn't close the first bar
  • doesn't account for word length of non-first word

Bar lengths may differ by one character from other solutions because PowerShell rounds instead of truncating when converting floating-point numbers to integers. Since the task requires only proportional bar lengths, this should be fine.

Compared to other solutions I took a slightly different approach in determining the longest bar length by simply trying out and taking the highest such length where no line is longer than 80 characters.
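
The trial-and-error approach described above can be sketched in Python like this. It is a minimal illustration, not the PowerShell code itself: `widest_fit` is a hypothetical name, the input is assumed to be a list of `(word, count)` pairs with the most frequent word first, and Python's `round` is used as a stand-in for PowerShell's float-to-int conversion.

```python
def widest_fit(top, limit=80):
    """Return the largest bar width (for the top word) at which every
    line, including its trailing space, fits into `limit` characters."""
    peak = top[0][1]
    for width in range(76, 0, -1):
        lines = ["|" + "_" * round(width * count / peak) + "| " + word
                 for word, count in top]
        # +1 accounts for the trailing space required by the spec.
        if all(len(line) + 1 <= limit for line in lines):
            return width
    return 1

print(widest_fit([("she", 100), ("you", 50)]))
print(widest_fit([("she", 100), ("superlongstringstring", 90)]))
```

Trying every width is wasteful compared to computing the limit directly, but with at most 76 candidates and 22 lines the cost is negligible, and the code stays short.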

An older version explained can be found here.

Joey
Impressive, seems Powershell is a suitable environment for golfing. Your approach considering the bar length is exactly what I tried to describe (not so brilliantly, I admit) in the spec.
ChristopheD
@ChristopheD: In my experience (Anarchy Golf, some Project Euler tasks and some more tasks just for the fun of it), PowerShell is usually only slightly worse than Ruby and often tied with or better than Perl and Python. No match for GolfScript, though. But as far as I can see, this might be the shortest solution that correctly accounts for bar lengths ;-)
Joey
Apparently I was right. Powershell *can* do better -- much better! Please provide an expanded version with comments.
Gabe
Johannes: Did you try `-split("\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z]")`? It works for me.
Gabe
Don't forget to interpolate the output string: `"|$('_'*($w*$_.count/$x[0].count))| $($_.name) "` (or eliminate the last space, as it's sort of automatic). And you can use `-split("(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z])+")` to save a few more by not including blanks (or use `[-2..-23]`).
Gabe
Note that without the trailing space you need to match `.{80}`. And you can guarantee that blanks will always be first like this: `"\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z]()"` (the empty capturing group ensures a blank for every word)
Gabe
So if you have `"$input"` do you still need the `()`? Also, now that you eliminated the trailing space, you can save a couple strokes by not interpolating the name: `"|$('_'*($w*$_.count/$x[0].count))| "+$_.name`. We'll get to 200 yet!
Gabe
With the elimination of .ToString, it's now back under 200!
Gabe
@Gabe: Yay, thank you. And good catch on `function` vs. `filter`. I thought about using `filter`, I just didn't think of the fact that filters can take arguments too. For me it was a comparison between `function f($w){...}` and `filter{$w=$_;...}` (since I definitely need a loop in the function and therefore can't leave the argument as `$_`. Nice trick to remember, thanks :-). Still, I think this approach has been golfed almost to death by now. [And I notice we killed it somewhere in between ... my other two test cases don't run anymore – debugging ...]
Joey
One could argue that you're making it a little *too* general, but I'm not going to complain as long as it's still under 200.
Gabe
@Gabe: Well, I've revisited my assumptions concerning at least one non-stop word already. But the `\b` problem was clearly against the spec and only happened to work for the test input.
Joey
+35  A: 

Ruby 1.9, 185 chars

(heavily based on the other Ruby solutions)

w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
k,l=w[0]
puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]

Instead of using any command line switches like the other solutions, you can simply pass the filename as argument. (i.e. ruby1.9 wordfrequency.rb Alice.txt)

Since I'm using character-literals here, this solution only works in Ruby 1.9.

Edit: Replaced semicolons by line breaks for "readability". :P

Edit 2: Shtééf pointed out I forgot the trailing space - fixed that.

Edit 3: Removed the trailing space again ;)

Ventero
It's missing the trailing space, after each word.
Shtééf
Aww shoot, disregard that. Looks like the golf was just updated, trailing space no longer required. :)
Shtééf
Does not seem to accommodate 'superlongstringstring' in 2nd or later position? (see problem description)
Nas Banov
That looks really maintainable.
Zombies
+2  A: 

R 449 chars

can probably get shorter...

bar <- function(w, l)
    {
    b <- rep("-", l)
    s <- rep(" ", l)
    cat(" ", b, "\n|", s, "| ", w, "\n ", b, "\n", sep="")
    }

f <- "alice.txt"
e <- c("the", "and", "of", "to", "a", "i", "it", "in", "or", "is", "")
w <- unlist(lapply(readLines(file(f)), strsplit, s=" "))
w <- tolower(w)
w <- unlist(lapply(w, gsub, pa="[^a-z]", r=""))
u <- unique(w[!w %in% e])
n <- unlist(lapply(u, function(x){length(w[w==x])}))
o <- rev(order(n))
n <- n[o]
m <- 77 - max(unlist(lapply(u[1:22], nchar)))
n <- floor(m*n/n[1])
u <- u[o]

for (i in 1:22)
    bar(u[i], n[i])
nico
@Johannes Rössel: It is dynamic, just scaled to 100% = 60px = max length. E.g.: 1st word = 50 occurrences, 2nd word = 25 occurrences. 1st bar = 60 px, 2nd bar = 30 px
nico
@Johannes Rössel: Ok, I didn't read the part that said you should maximise the length, thought it just needed to fit 80 chars... now it works as intended :) Thanks for spotting that
nico
Well, it's the one thing most often done wrong in the answers here, I think. Took me also quite a while to figure out an elegant way of doing so.
Joey
+1  A: 

Groovy, 424 389 378 321 chars

replaced b=map.get(a) with b=map[a], replaced split with matcher / iterator

def r,s,m=[:],n=0;def p={println it};def w={"_".multiply it};(new URL(this.args[0]).text.toLowerCase()=~/\b\w+\b/).each{s=it;if(!(s==~/(the|and|of|to|a|i[tns]?|or)/))m[s]=m[s]==null?1:m[s]+1};m.keySet().sort{a,b->m[b]<=>m[a]}.subList(0,22).each{k->if(n++<1){r=(m[k]/(76-k.length()));p" "+w(m[k]/r)};p"|"+w(m[k]/r)+"|"+k}

(executed as groovy script with the URL as cmd line arg. No imports required!)

Readable version here:

def r,s,m=[:],n=0;
def p={println it};
def w={"_".multiply it};
(new URL(this.args[0]).text.toLowerCase()
        =~ /\b\w+\b/
        ).each{
        s=it;
        if (!(s ==~/(the|and|of|to|a|i[tns]?|or)/))
            m[s] = m[s] == null ? 1 : m[s] + 1
        };
    m.keySet()
        .sort{
            a,b -> m[b] <=> m[a]
        }
        .subList(0,22).each{
            k ->
                if( n++ < 1 ){
                    r=(m[k]/(76-k.length()));
                    p " " + w(m[k]/r)
                };
                p "|" + w(m[k]/r) + "|" + k
}
seanizer
+5  A: 

Scala, 368 chars

First, a legible version in 592 characters:

object Alice {
  def main(args:Array[String]) {
    val s = io.Source.fromFile(args(0))
    val words = s.getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase)
    val freqs = words.foldLeft(Map[String, Int]())((countmap, word)  => countmap + (word -> (countmap.getOrElse(word, 0)+1)))
    val sortedFreqs = freqs.toList.sort((a, b)  => a._2 > b._2)
    val top22 = sortedFreqs.take(22)
    val highestWord = top22.head._1
    val highestCount = top22.head._2
    val widest = 76 - highestWord.length
    println(" " + "_" * widest)
    top22.foreach(t => {
      val width = Math.round((t._2 * 1.0 / highestCount) * widest).toInt
      println("|" + "_" * width + "| " + t._1)
    })
  }
}

The console output looks like this:

$ scalac alice.scala 
$ scala Alice aliceinwonderland.txt
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

We can do some aggressive minifying and get it down to 415 characters:

object A{def main(args:Array[String]){val l=io.Source.fromFile(args(0)).getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase).foldLeft(Map[String, Int]())((c,w)=>c+(w->(c.getOrElse(w,0)+1))).toList.sort((a,b)=>a._2>b._2).take(22);println(" "+"_"*(76-l.head._1.length));l.foreach(t=>println("|"+"_"*Math.round((t._2*1.0/l.head._2)*(76-l.head._1.length)).toInt+"| "+t._1))}}

The console session looks like this:

$ scalac a.scala 
$ scala A aliceinwonderland.txt
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

I'm sure a Scala expert could do even better.

Update: In the comments Thomas gave an even shorter version, at 368 characters:

object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}}

Legibly, at 375 characters:

object Alice {
  def main(a:Array[String]) {
    val t = (Map[String, Int]() /: (
      for (
        x <- io.Source.fromFile(a(0)).getLines
        y <- "(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(x)
      ) yield y.toLowerCase
    ).toList)((c, x) => c + (x -> (c.getOrElse(x, 0) + 1))).toList.sortBy(_._2).reverse.take(22)
    val w = 76 - t.head._1.length
    print (" "+"_"*w)
    t.map(s => "\n|" + "_" * (s._2 * w / t.head._2) + "| " + s._1).foreach(print)
  }
}
pr1001
383 chars: `object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}}`
Thomas Jung
Of course, the ever handy for comprehension! Nice!
pr1001
+2  A: 

Python 2.6, 273 269 267 266 characters.

(Edit: Props to ChristopheD for character-shaving suggestions)

import sys,re
t=re.findall('[a-z]+',"".join(sys.stdin).lower())
d=sorted((t.count(w),w)for w in set(t)-set("the and of to a i it in or is".split()))[:-23:-1]
r=min((78.-len(m[1]))/m[0]for m in d)
print'','_'*(int(d[0][0]*r-2))
for(a,b)in d:print"|"+"_"*(int(a*r-2))+"|",b

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so
Astatine
You could drop the square brackets in `r=min([(78.0-len(m[1]))/m[0] for m in d])` (shaves off 2 characters: `min((78.0-len(m[1]))/m[0] for m in d)`). The same goes for the square brackets in line three: `sorted([...`
ChristopheD
Also in line three and four you can lose an unneeded space just before `for` (shaves off 2 characters).
ChristopheD
Aha! Thanks for that.
Astatine
I like the way you abuse this `print'',` to print the starting space on the first line; clever ;-)
ChristopheD
Just realised I didn't need a following zero to declare a float on the fourth line. Is this the only Python entry that takes into account that some words might be significantly longer than the most common one?
Astatine
instead of 78 you can use 76 and save the two "-2"; instead of m[0],m[1] you can use w and r by doing "for w,r in d". you can use \w instead of [a-z]. sys.stdin.read() is shorter. I like the idea of using commas!
6502
Good points; however \w matches underscores, which is why I didn't use it.
Astatine
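
The point about `\w` is easy to see with a small example. This is a sketch using Python 3's `re` module; the sample text is made up for illustration.

```python
import re

text = "snake_case isn't 2 words".lower()

# \w also matches digits and underscores, so these leak into the words.
loose = re.findall(r"\w+", text)

# [a-z] keeps strictly to the alphabetic characters the rules ask for.
strict = re.findall(r"[a-z]+", text)

print(loose)
print(strict)
```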
+11  A: 

perl, 205 191 189 characters/ 205 characters (fully implemented)

Some parts were inspired by the earlier perl/ruby submissions, a couple similar ideas were arrived at independently, the others are original. Shorter version also incorporates some things I saw/learned from other submissions.

Original:

$k{$_}++for grep{$_!~/^(the|and|of|to|a|i|it|in|or|is)$/}map{lc=~/[a-z]+/g}<>;@t=sort{$k{$b}<=>$k{$a}}keys%k;$l=76-length$t[0];printf" %s
",'_'x$l;printf"|%s| $_
",'_'x int$k{$_}/$k{$t[0]}*$l for@t[0..21];

Latest version down to 191 characters:

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-y///c)/$k{$_=$e[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@e[0,0..21]

Latest version down to 189 characters:

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@_=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-m//)/$k{$_=$_[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@_[0,0..21]

This version (205 chars) also handles later lines whose words are longer than the first line's word, so that no line overflows.

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;($r)=sort{$a<=>$b}map{(76-y///c)/$k{$_}}@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
";}@e[0,0..21]
pdehaan
+10  A: 

Perl: 203 202 201 198 195 208 203 / 231 chars

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&warn" "."_"x($z*$y)."\n";printf"|%.78s\n","_"x($z*$y)."| $_"}(sort{$x{$b}<=>$x{$a}}keys%x)[0..21]

Alternate, full implementation including indicated behaviour (global bar-squishing) for the pathological case in which the secondary word is both popular and long enough to combine to over 80 chars (this implementation is 231 chars):

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;@e=(sort{$x{$b}<=>$x{$a}}keys%x)[0..21];for(@e){$p=(76-y///c)/$x{$_};($y&&$p>$y)||($y=$p)}warn" "."_"x($x{$e[0]}*$y)."\n";for(@e){warn"|"."_"x($x{$_}*$y)."| $_\n"}
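
The global bar-squishing idea can be sketched in Python as follows: every word allows at most `(76 - len(word)) / count` underscores per occurrence, and the smallest of these ratios is used for all bars. This is an illustration under assumptions, not the Perl code: `scale_factor` and `chart` are hypothetical names, the input is a list of `(word, count)` pairs with the most frequent first, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

def scale_factor(top, limit=80):
    # A line is "|" + bar + "| " + word + " ", i.e. bar + len(word) + 4
    # characters, so each word allows (limit - 4 - len(word)) / count
    # underscores per occurrence. The tightest word decides for all.
    return min(Fraction(limit - 4 - len(word), count) for word, count in top)

def chart(top):
    r = scale_factor(top)
    lines = [" " + "_" * int(top[0][1] * r)]      # top edge of the first bar
    lines += ["|" + "_" * int(count * r) + "| " + word
              for word, count in top]
    return lines

for line in chart([("she", 100), ("superlongstringstring", 90), ("said", 80)]):
    print(line)
```

Taking the minimum ratio gives the widest chart in which no line exceeds the limit, which is what distinguishes this from simply truncating the long lines.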

The specification didn't state anywhere that this had to go to STDOUT, so I used perl's warn() instead of print - four characters saved there. Used map instead of foreach, but I feel like there could still be some more savings in the split(join()). Still, got it down to 203 - might sleep on it. At least Perl's now under the "shell, grep, tr, grep, sort, uniq, sort, head, perl" char count for now ;)

PS: Reddit says "Hi" ;)

Update: Removed join() in favour of assignment and implicit scalar conversion join. Down to 202. Also please note I have taken advantage of the optional "ignore 1-letter words" rule to shave 2 characters off, so bear in mind the frequency count will reflect this.

Update 2: Swapped out assignment and implicit join for killing $/ to get the file in one gulp using <> in the first place. Same size, but nastier. Swapped out if(!$y){} for $y||{}&&, saved 1 more char => 201.

Update 3: Took control of lowercasing early (lc<>) by moving lc out of the map block - Swapped out both regexes to no longer use /i option, as no longer needed. Swapped explicit conditional x?y:z construct for traditional perlgolf || implicit conditional construct - /^...$/i?1:$x{$_}++ for /^...$/||$x{$_}++. Saved three characters! => 198, broke the 200 barrier. Might sleep soon... perhaps.

Update 4: Sleep deprivation has made me insane. Well. More insane. Figuring that this only has to parse normal happy text files, I made it give up if it hits a null. Saved two characters. Replaced "length" with the 1-char shorter (and much more golfish) y///c - you hear me, GolfScript?? I'm coming for you!!! sob

Update 5: Sleep dep made me forget about the 22row limit and subsequent-line limiting. Back up to 208 with those handled. Not too bad, 13 characters to handle it isn't the end of the world. Played around with perl's regex inline eval, but having trouble getting it to both work and save chars... lol. Updated the example to match current output.

Update 6: Removed unneeded braces protecting (...)for, since the syntactic candy ++ allows shoving it up against the for happily. Thanks to input from Chas. Owens (reminding my tired brain), got the character class i[tns] solution in there. Back down to 203.

Update 7: Added second piece of work, full implementation of specs (including the full bar-squishing behaviour for secondary long-words, instead of truncation which most people are doing, based on the original spec without the pathological example case)

Examples:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

Alternative implementation in pathological case example:

 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| with
|_________________________| at
|_______________________| on
|______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| so
|________________| very
|________________| what
Syntaera
You can shorten the regex for the stop words by collapsing `is|in|it|i` into `i[snt]?` – and then there's no difference with the optional rule anymore. (Hm, I never would have thought about telling a Perl guy how to do Regex :D) – only problem now: I have to look how I can shave off three bytes from my own solution to be better than Perl again :-|
Joey
Ok, disregard part of what I said earlier. Ignoring one-letter words is indeed a byte shorter than not doing it.
Joey
Every byte counts ;) I considered doing the newline trick, but I figured it was actually the same number of bytes, even if it was fewer printable characters. Still working on seeing if I can shrink it some more :)
Syntaera
Ah well, case normalization threw me back to 209. I don't see what else I could cut. Although PowerShell *can* be shorter than Perl. ;-)
Joey
I don't see where you restrict the output to the top 22 words, nor where you make sure that a long second word doesn't wrap.
Gabe
Had to sleep on it :)
Syntaera
You can save even more by using `say`: `perl -E '$/=\0;map{/^(the|and|of|to|.|it|in|or|is|)$/||$x{$_}++}split(/[^a-z]/,lc<>);map{$z=$x{$_};$y||{$y=(76-y///c)/$z}say"|"."_"x($z*$y)."| $_"}sort{$x{$b}<=>$x{$a}}keys%x'`
Chas. Owens
Even more by using a character class and using for instead of `map` where possible:`perl -E '$/=\0;/^(the|and|of|to|.|i[tns]|or)$/||$x{$_}++for split(/[^a-z]/,lc<>);map{$z=$x{$_};$y||{$y=(76-y///c)/$z}say"|"."_"x($z*$y)."| $_"}sort{$x{$b}<=>$x{$a}}keys%x'`
Chas. Owens
Thanks for that - I came to the same conclusion about for, but also got rid of split(), just using a bare regex instead for it. Back down to 203!
Syntaera
+2  A: 

C++, 647 chars

I don't expect to score highly by using C++, but never mind that. I'm pretty sure it hits all the requirements. Note that I used the C++0x auto keyword for variable declaration, so adjust your compiler appropriately if you decide to test my code.

Minimised version

#include <iostream>
#include <cstring>
#include <map>
using namespace std;
#define C string
#define S(x)v=F/a,cout<<#x<<C(v,'_')
#define F t->first
#define G t->second
#define O &&F!=
#define L for(i=22;i-->0;--t)
int main(){map<C,int>f;char d[230];int i=1,v;for(;i<256;i++)d[i<123?i-1:i-27]=i;d[229]=0;char w[99];while(cin>>w){for(i=0;w[i];i++)w[i]=tolower(w[i]);char*p=strtok(w,d);while(p)++f[p],p=strtok(0,d);}multimap<int,C>c;for(auto t=f.end();--t!=f.begin();)if(F!="the"O"and"O"of"O"to"O"a"O"i"O"it"O"in"O"or"O"is")c.insert(pair<int,C>(G,F));auto t=--c.end();float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;t=--c.end();S( );L S(\n|)<<"| "<<G;}

Here's a second version that is more "C++" by using string, not char[] and strtok. It's a bit larger, at 669 (+22 vs above), but I can't get it smaller at the moment so thought I'd post it anyway.

#include <iostream>
#include <map>
using namespace std;
#define C string
#define S(x)v=F/a,cout<<#x<<C(v,'_')
#define F t->first
#define G t->second
#define O &&F!=
#define L for(i=22;i-->0;--t)
#define E e=w.find_first_of(d,g);g=w.find_first_not_of(d,e);
int main(){map<C,int>f;int i,v;C w,x,d="abcdefghijklmnopqrstuvwxyz";while(cin>>w){for(i=w.size();i-->0;)w[i]=tolower(w[i]);unsigned g=0,E while(g-e>0){x=w.substr(e,g-e),++f[x],E}}multimap<int,C>c;for(auto t=f.end();--t!=f.begin();)if(F!="the"O"and"O"of"O"to"O"a"O"i"O"it"O"in"O"or"O"is")c.insert(pair<int,C>(G,F));auto t=--c.end();float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;t=--c.end();S( );L S(\n|)<<"| "<<G;}

I've removed the full version, because I can't be bothered to keep updating it with my tweaks to the minimised version. See edit history if you're interested in the (possibly outdated) long version.

DMA57361
If you're going to put an arbitrary limit on word length, you might as well make it 999 instead of 1024 and save a stroke.
Gabe
If you use `float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;` you can eliminate a #define and shave a few strokes.
Gabe
@Gabe - thanks for that second one, trimmed a few extra away. As for `word`, having an arbitrary length doesn't really feel right - but I'm not sure of the best way to extract `cin` into a `char` array, as opposed to a `string`, without the risk of breaking in the middle of a word (ie, if I just pulled it in 80-char chunks). But I've put finding a "better" solution until probably tomorrow.
DMA57361
Isn't `d[i-27]=0;` the same as `d[229]=0;`?
Gabe
Why did you decide to use a char buffer instead of a string?
EvilTeach
You could save a space by making `L{A=F/(76.0-G.length()),a=a>A?a:A;}` into `L A=F/(76.0-G.length()),a=a>A?a:A;`.
Gabe
@EvilTeach - so I could use strtok. I'm not aware of a C++ string tokenising function (see http://stackoverflow.com/questions/53849/how-do-i-tokenize-a-string-in-c) that would take "lots" of delimiters, and needed a reliable method to split words like "don't" on the punctuation. @Gabe - nice catch (again, thanks!) on d[229], as for the second suggestion - you'd already given that earlier and I obviously hadn't paid sufficient attention...
DMA57361
+1  A: 

Python, 320 characters

import sys
i="the and of to a i it in or is".split()
d={}
for j in filter(lambda x:x not in i,sys.stdin.read().lower().split()):d[j]=d.get(j,0)+1
w=sorted(d.items(),key=lambda x:x[1])[:-23:-1]
m=sorted(dict(w).values())[-1]
print" %s\n"%("_"*(76-m)),"\n".join(map(lambda x:("|%s| "+x[0])%("_"*((76-m)*x[1]/w[0][1])),w))
dhruvbird
+8  A: 

Python 3.1 - 245 229 characters

I guess using Counter is kind of cheating :) I just read about it about a week ago, so this was the perfect chance to see how it works.

import re,collections
o=collections.Counter([w for w in re.findall("[a-z]+",open("!").read().lower())if w not in"a and i in is it of or the to".split()]).most_common(22)
print('\n'.join('|'+76*v//o[0][1]*'_'+'| '+k for k,v in o))

Prints out:

|____________________________________________________________________________| she
|__________________________________________________________________| you
|_______________________________________________________________| said
|_______________________________________________________| alice
|_________________________________________________| was
|_____________________________________________| that
|_____________________________________| as
|__________________________________| her
|_______________________________| with
|_______________________________| at
|______________________________| s
|_____________________________| t
|____________________________| on
|___________________________| all
|________________________| this
|________________________| for
|________________________| had
|________________________| but
|______________________| be
|______________________| not
|_____________________| they
|____________________| so

Some of the code was "borrowed" from AKX's solution.

sdolan
The first line is missing. And the bar length isn't correct.
Joey
Missed the first bar requirement and I see my bars are off. Thanks for the feedback. This is my first time :)
sdolan
in your code seems that `open('!')` reads from stdin - which version/OS is that on? or do you have to name the file '!'?
Nas Banov
Name the file "!" :) Sorry that was pretty unclear, and I should have mentioned it.
sdolan
+1  A: 

MATLAB 404 410 bytes 357 bytes. 390 bytes.

This version is a bit longer, however, it will properly scale the length of the bars if there is a word that is ridiculously long so that none of the columns go over 80.

So, my code is 357 bytes without re-scaling, and 410 long with re-scaling.

A=textscan(fopen('11.txt'),'%s','delimiter',' 0123456789,.!?-_*^:;=+\\/(){}[]@&#$%~`|"''');
s=lower(A{1});s(cellfun('length', s)<2)=[];s(ismember(s,{'the','and','of','to','it','in','or','is'}))=[];
[w,~,i]=unique(s);N=hist(i,max(i)); [j,k]=sort(N,'descend'); b=k(1:22); n=cellfun('length',w(b));
q=80*N(b)'/N(k(1))+n; q=floor(q*78/max(q)-n); for i=1:22, fprintf('%s| %s\n',repmat('_',1,q(i)),w{k(i)});end

Results:

___________________________________________________________________________| she
_________________________________________________________________| you
______________________________________________________________| said
_______________________________________________________| alice
________________________________________________| was
____________________________________________| that
_____________________________________| as
_________________________________| her
______________________________| at
______________________________| with
____________________________| on
___________________________| all
_________________________| this
________________________| for
________________________| had
________________________| but
_______________________| be
_______________________| not
_____________________| they
____________________| so
___________________| very
___________________| what

For example, replacing all instances of "you" in the Alice in Wonderland text with "superlongstringofridiculousness", my code will correctly scale the results:

_____________________________________________________________| she
_______________________________________________| superlongstringofridiculousness
__________________________________________________| said
____________________________________________| alice
_______________________________________| was
____________________________________| that
______________________________| as
___________________________| her
_________________________| at
________________________| with
______________________| on
_____________________| all
___________________| this
___________________| for
___________________| had
___________________| but
__________________| be
__________________| not
________________| they
________________| so
_______________| very
_______________| what

Here is the code written a little bit more legibly:

A=textscan(fopen('11.txt'),'%s','delimiter',' 0123456789,.!?-_*^:;=+\\/(){}[]@&#$%~`|"''');
s=lower(A{1});s(cellfun('length', s)<2)=[];s(ismember(s,{'the','and','of','to','it','in','or','is'}))=[];
[w,~,i]=unique(s);N=hist(i,max(i)); [j,k]=sort(N,'descend');
b=k(1:22);
n=cellfun('length',w(b));
q=ceil(80*N(b)/N(k(1)))'+n;
q=floor(q*78/max(q)-n);

for i=1:22,
  fprintf('%s| %s\n',repmat('_',1,q(i)),w{k(i)}); 
end
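
The ordering problem Bwmat points out in the comments below can be reproduced with a small Python transliteration of the two `q=` lines. This is only a sketch: the function name and the sample counts are invented, and it assumes the rows are already sorted by descending frequency.

```python
from math import ceil, floor

def matlab_style_bars(rows):
    """rows: (word, count) pairs, most frequent first."""
    top = rows[0][1]
    # First q: proportional bar length plus the word length.
    raw = [ceil(80 * c / top) + len(w) for w, c in rows]
    m = max(raw)
    # Second q: rescale the whole line (bar + word) to 78, then subtract the word.
    return [floor(r * 78 / m - len(w)) for r, (w, _) in zip(raw, rows)]

rows = [("she", 100), ("x" * 31, 90), ("said", 85)]
print(matlab_style_bars(rows))
```

Because the word length is scaled together with the bar, the long word with count 90 ends up with a shorter bar than "said" with count 85, so the bars are no longer proportional to the frequencies.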
reso
Kudos for implementing the spec completely! (I would upvote but I've run out of votes for today...)
ChristopheD
shouldn't the bar for "superlongstringofridiculousness" be longer than the bar for "said"?
Bwmat
@Bwmat: ahh!! good eye! back to the drawing board...
reso
@reso: you could save 18 chars by replacing the delimiter string with: `char([32:64 91:96 123:126])`
Amro
@Amro: hey, thanks for the tip, that is great. One day I will go back and fix the bug that Bwmat spotted and add that to it as well.
reso
+11  A: 

Haskell - 366 351 344 337 333 characters

(One line break in main added for readability, and no line break needed at end of last line.)

import Data.List
import Data.Char
l=length
t=filter
m=map
f c|isAlpha c=toLower c|0<1=' '
h w=(-l w,head w)
x!(q,w)='|':replicate(minimum$m(q?)x)'_'++"| "++w
q?(g,w)=q*(77-l w)`div`g
b x=m(x!)x
a(l:r)=(' ':t(=='_')l):l:r
main=interact$unlines.a.b.take 22.sort.m h.group.sort
  .t(`notElem`words"the and of to a i it in or is").words.m f

How it works is best seen by reading the argument to interact backwards:

  • map f lowercases alphabetics, replaces everything else with spaces.
  • words produces a list of words, dropping the separating whitespace.
  • filter (`notElem` words "the and of to a i it in or is") discards all entries with forbidden words.
  • group . sort sorts the words, and groups identical ones into lists.
  • map h maps each list of identical words to a tuple of the form (-frequency, word).
  • take 22 . sort sorts the tuples by descending frequency (the first tuple entry), and keeps only the first 22 tuples.
  • b maps tuples to bars (see below).
  • a prepends the first line of underscores, to complete the topmost bar.
  • unlines joins all these lines together with newlines.
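
The counting half of this pipeline (everything up to `take 22 . sort`) can be restated in Python as a rough sketch. The name `top_words` and the parameter `n` are made up here; the Haskell original works on stdin, whereas this takes a string.

```python
import string
from itertools import groupby

STOP = "the and of to a i it in or is".split()

def top_words(text, n=22):
    # map f: keep lowercase letters, turn everything else into a space
    cleaned = "".join(c if c in string.ascii_lowercase else " " for c in text.lower())
    # words, then the stop-word filter
    words = [w for w in cleaned.split() if w not in STOP]
    # group . sort: identical words become adjacent, then are grouped
    groups = groupby(sorted(words))
    # map h: each group becomes (-frequency, word)
    pairs = [(-len(list(g)), w) for w, g in groups]
    # take 22 . sort: ascending on -frequency puts the most frequent first
    return sorted(pairs)[:n]
```

Ties are broken alphabetically, because the tuples are compared on the word once the negated counts are equal.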

The tricky bit is getting the bar length right. I assumed that only underscores counted towards the length of the bar, so || would be a bar of zero length. The function b maps (x!) over x, where x is the list of (-frequency, word) tuples. The entire list is passed to !, so that each invocation of ! can compute the scale factor for itself by calling ? (in earlier revisions these two operators were the named functions c and u). In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.

Note the trick of using -frequency. This removes the need to reverse the sort, since sorting -frequency in ascending order places the words with the largest frequency first. Later, in the operator ? (the function u of earlier revisions), two -frequency values are combined (one multiplies, the other divides), which cancels the negation out.
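
The integer-only scaling can be sketched in Python as follows. This is an illustration of the formula `q*(77-l w) div g` only; the function names are invented, and the rows are assumed to be the (-frequency, word) tuples described above.

```python
def bar_lengths(rows):
    """rows: (negated frequency, word) pairs, most frequent first."""
    def scaled(q):
        # For every row (g, w), the bar for q is limited by what that row's
        # word leaves free. Multiplying before dividing keeps the precision
        # of the integer division; the two negative signs cancel.
        return min(q * (77 - len(w)) // g for g, w in rows)
    return [(scaled(q), w) for q, w in rows]

print(bar_lengths([(-100, "she"), (-50, "you")]))
```

Dividing first, as in `q * ((77 - len(w)) // g)`, would truncate the per-unit width to an integer before scaling, which is the loss of precision discussed in the comments.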

Thomas
Very nice work (would upvote but ran out of votes for today with all the great answers in this thread).
ChristopheD
This hurts my eyes in a way that's painful even to think about describing, but I learned a lot of Haskell by reverse-engineering it into legible code. Well done, sir. :-)
Owen S.
It's actually fairly idiomatic Haskell still, albeit not really efficient. The short names make it look far worse than it really is.
Thomas
@Thomas: You can say that again. :-)
Owen S.
u q(g,w)=q*div(77-l w)g -- can save you 2 chars
Edward Kmett
@MtnViewMark: Nice work! I didn't know that `words` discards runs of whitespace, nor that you can put `|` conditions onto one line. And I can't believe I put a two-letter variable name in there...
Thomas
Can't move the `div`, actually! Try it: the output is wrong. The reason is that doing the `div` before the `*` loses precision.
MtnViewMark
Ah, whoops, got precedences wrong. Should've tested before editing :P
Thomas
@trinithis: It's shorter alright, but now I don't understand how it works any longer! I'm afraid you moved beyond my understanding of Haskell. Why is a bang pattern needed? What does the question mark even mean?
Thomas
It's not a bang pattern :D. All I did was change binary functions to infix operators. I just chose to use `?` and `!` for the operator names.
trinithis
Ah, brilliant :D
Thomas
A: 

Python, 250 chars

borrowing from all the other Python snippets

import re,sys
t=re.findall("\w+","".join(sys.stdin).lower())
W=sorted((-t.count(w),w)for w in set(t)-set("the and of to a i it in or is".split()))[:22]
Z,U=W[0],lambda n:"_"*int(n*(76.-len(Z[1]))/Z[0])
print"",U(Z[0])
for(n,w)in W:print"|"+U(n)+"|",w

If you're cheeky and put the words to avoid as arguments, 223 chars

import re,sys
t=re.findall("\w+","".join(sys.stdin).lower())
W=sorted((-t.count(w),w)for w in set(t)-set(sys.argv[1:]))[:22]
Z,U=W[0],lambda n:"_"*int(n*(76.-len(Z[1]))/Z[0])
print"",U(Z[0])
for(n,w)in W:print"|"+U(n)+"|",w

Output is:

$ python alice4.py  the and of to a i it in or is < 11.txt 
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
Will
This doesn't handle the problem of having the scale determined by a word that is not the most frequent one.
6502
+7  A: 

PHP CLI version (450 chars)

This solution takes into account the last requirement, which most purists have conveniently chosen to ignore. That cost 170 characters!

Usage: php.exe <this.php> <file.txt>

Minified:

<?php $a=array_count_values(array_filter(preg_split('/[^a-z]/',strtolower(file_get_contents($argv[1])),-1,1),function($x){return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);}));arsort($a);$a=array_slice($a,0,22);function R($a,$F,$B){$r=array();foreach($a as$x=>$f){$l=strlen($x);$r[$x]=$b=$f*$B/$F;if($l+$b>76)return R($a,$f,76-$l);}return$r;}$c=R($a,max($a),76-strlen(key($a)));foreach($a as$x=>$f)echo '|',str_repeat('-',$c[$x]),"| $x\n";?>

Human readable:

<?php

// Read:
$s = strtolower(file_get_contents($argv[1]));

// Split:
$a = preg_split('/[^a-z]/', $s, -1, PREG_SPLIT_NO_EMPTY);

// Remove unwanted words:
$a = array_filter($a, function($x){
       return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);
     });

// Count:
$a = array_count_values($a);

// Sort:
arsort($a);

// Pick top 22:
$a=array_slice($a,0,22);


// Recursive function to adjust bar widths
// according to the last requirement:
function R($a,$F,$B){
    $r = array();
    foreach($a as $x=>$f){
        $l = strlen($x);
        $r[$x] = $b = $f * $B / $F;
        if ( $l + $b > 76 )
            return R($a,$f,76-$l);
    }
    return $r;
}

// Apply the function:
$c = R($a,max($a),76-strlen(key($a)));


// Output:
foreach ($a as $x => $f)
    echo '|',str_repeat('-',$c[$x]),"| $x\n";

?>
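
The recursive adjustment in `R` can be restated in Python as a rough sketch. The names `fit_bars`, `ref_count`, `ref_width` and `limit` are made up; the logic follows the PHP above, where the scale is first anchored on the most frequent word and re-anchored on any word whose bar plus word would exceed 76 characters.

```python
def fit_bars(counts, ref_count, ref_width, limit=76):
    """counts: dict of word -> frequency, most frequent first."""
    widths = {}
    for word, count in counts.items():
        # Scale so that a word with ref_count occurrences gets ref_width.
        width = count * ref_width / ref_count
        widths[word] = width
        if len(word) + width > limit:
            # This word overflows: restart with the scale anchored on it.
            return fit_bars(counts, count, limit - len(word), limit)
    return widths

counts = {"she": 100, "x" * 30: 90}
widths = fit_bars(counts, max(counts.values()), 76 - len("she"))
print({w: int(v) for w, v in widths.items()})
```

Each restart strictly lowers the scale factor, so the recursion ends once every line fits.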

Output:

|-------------------------------------------------------------------------| she
|---------------------------------------------------------------| you
|------------------------------------------------------------| said
|-----------------------------------------------------| alice
|-----------------------------------------------| was
|-------------------------------------------| that
|------------------------------------| as
|--------------------------------| her
|-----------------------------| at
|-----------------------------| with
|--------------------------| on
|--------------------------| all
|-----------------------| this
|-----------------------| for
|-----------------------| had
|-----------------------| but
|----------------------| be
|---------------------| not
|--------------------| they
|--------------------| so
|-------------------| very
|------------------| what

When there is a long word, the bars are adjusted properly:

|--------------------------------------------------------| she
|---------------------------------------------------| thisisareallylongwordhere
|-------------------------------------------------| you
|-----------------------------------------------| said
|-----------------------------------------| alice
|------------------------------------| was
|---------------------------------| that
|---------------------------| as
|-------------------------| her
|-----------------------| with
|-----------------------| at
|--------------------| on
|--------------------| all
|------------------| this
|------------------| for
|------------------| had
|-----------------| but
|-----------------| be
|----------------| not
|---------------| they
|---------------| so
|--------------| very
Lazarus Inepologlou
+27  A: 

Transact SQL set based solution (SQL Server 2005) 1063 892 873 853 827 820 783 683 647 644 630 characters

Thanks to Gabe for some useful suggestions to reduce the character count.

NB: Line breaks were added to avoid scrollbars; only the last line break is required.

DECLARE @ VARCHAR(MAX),@F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A',
SINGLE_BLOB)x;WITH N AS(SELECT 1 i,LEFT(@,1)L UNION ALL SELECT i+1,SUBSTRING
(@,i+1,1)FROM N WHERE i<LEN(@))SELECT i,L,i-RANK()OVER(ORDER BY i)R INTO #D
FROM N WHERE L LIKE'[A-Z]'OPTION(MAXRECURSION 0)SELECT TOP 22 W,-COUNT(*)C
INTO # FROM(SELECT DISTINCT R,(SELECT''+L FROM #D WHERE R=b.R FOR XML PATH
(''))W FROM #D b)t WHERE LEN(W)>1 AND W NOT IN('the','and','of','to','it',
'in','or','is')GROUP BY W ORDER BY C SELECT @F=MIN(($76-LEN(W))/-C),@=' '+
REPLICATE('_',-MIN(C)*@F)+' 'FROM # SELECT @=@+' 
|'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @

Readable Version

DECLARE @  VARCHAR(MAX),
        @F REAL
SELECT @=BulkColumn
FROM   OPENROWSET(BULK'A',SINGLE_BLOB)x; /*  Loads text file from path
                                             C:\WINDOWS\system32\A  */

/*Recursive common table expression to
generate a table of numbers from 1 to string length
(and associated characters)*/
WITH N AS
     (SELECT 1 i,
             LEFT(@,1)L

     UNION ALL

     SELECT i+1,
            SUBSTRING(@,i+1,1)
     FROM   N
     WHERE  i<LEN(@)
     )
  SELECT   i,
           L,
           i-RANK()OVER(ORDER BY i)R
           /*Will group characters
           from the same word together*/
  INTO     #D
  FROM     N
  WHERE    L LIKE'[A-Z]'OPTION(MAXRECURSION 0)
             /*Assuming case insensitive accent sensitive collation*/

SELECT   TOP 22 W,
         -COUNT(*)C
INTO     #
FROM     (SELECT DISTINCT R,
                          (SELECT ''+L
                          FROM    #D
                          WHERE   R=b.R FOR XML PATH('')
                          )W
                          /*Reconstitute the word from the characters*/
         FROM             #D b
         )
         T
WHERE    LEN(W)>1
AND      W NOT IN('the',
                  'and',
                  'of' ,
                  'to' ,
                  'it' ,
                  'in' ,
                  'or' ,
                  'is')
GROUP BY W
ORDER BY C

/*Just noticed this looks risky as it relies on the order of evaluation of the 
 variables. I'm not sure that's guaranteed but it works on my machine :-) */
SELECT @F=MIN(($76-LEN(W))/-C),
       @ =' '      +REPLICATE('_',-MIN(C)*@F)+' '
FROM   #

SELECT @=@+' 
|'+REPLICATE('_',-C*@F)+'| '+W
             FROM     #
             ORDER BY C

PRINT @
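
The `i-RANK()OVER(ORDER BY i)` step is what groups characters into words: for consecutive letter positions the difference between position and rank stays constant. A small Python sketch of the same idea follows; the function name is invented, and unlike the query it keeps the original casing rather than relying on a case-insensitive collation.

```python
import string
from itertools import groupby

def words_by_islands(text):
    # Keep (position, letter) for alphabetic characters only, 1-based like the CTE.
    letters = [(i, ch) for i, ch in enumerate(text, 1) if ch in string.ascii_letters]
    # position - rank is constant while positions are consecutive, so every
    # run of adjacent letters (one word) shares the same key R.
    keyed = [(i - rank, ch) for rank, (i, ch) in enumerate(letters, 1)]
    return ["".join(ch for _, ch in run)
            for _, run in groupby(keyed, key=lambda p: p[0])]

print(words_by_islands("She said: don't"))
```

In the query the regrouping is done with `FOR XML PATH('')` per distinct R value; `groupby` plays that role here.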

Output

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| You
|____________________________________________________________| said
|_____________________________________________________| Alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| This
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| So
|___________________| very
|__________________| what

And with the long string

 _______________________________________________________________ 
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| Alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| at
|_________________________| with
|_______________________| on
|______________________| all
|____________________| This
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| So
|________________| very
|________________| what
Martin Smith
I gave you a +1 because you did it in T-SQL, and to quote Team America - "You have balls. I like balls."
fortheworld
I took the liberty of converting some spaces into newlines to make it more readable. Hopefully I didn't mess things up. I also minified it a bit more.
Gabe
@Gabe Thanks. I ended up largely rewriting it though. It is now shorter and quicker than before.
Martin Smith
That code is screaming at me! :O
Joey
One good way to save is by changing `0.000` to just `0`, then using `-C` instead of `1.0/C`. And making `FLOAT` into `REAL` will save a stroke too. The biggest thing, though, is that it looks like you have lots of `AS` instances that should be optional.
Gabe
@Gabe - Thanks for the tips. I was able to replace float with real and get rid of some of the 'AS's the two that remain are both required. The -C thing didn't work. The row with 0 on it is the top of the top bar. This ended up positioned at the bottom and I would have needed to replace 0.000 with a large magnitude negative number to get it at the right place. Thanks though!
Martin Smith
How about this: `SELECT $0 O, ' '+REPLICATE('_', MAX(C)*@F)+' ' [ ] FROM # UNION SELECT $1/C, '|'+REPLICATE('_',C*@F)+'| '+W FROM # ORDER BY 1`
Gabe
@Gabe - Yep, that works. I'll implement the `$` thing, thanks. The problem, though, is that it returns an additional column in the output that isn't part of the spec (hence the need for the additional temp table step).
Martin Smith
OK, how about `SELECT [ ] FROM (SELECT $0 O, ' '+REPLICATE('_', MAX(C)*@F)+' ' [ ] FROM # UNION SELECT $1/C, '|'+REPLICATE('_',C*@F)+'| '+W FROM #)X ORDER BY O`?
Gabe
@Gabe - Nice! That brings it down to comfortably less than 800. Thanks for your help!
Martin Smith
Oh. My. *God.* +1
Andrew Heath
You don't need to declare `@F` where it's used. You can declare it up with `@` and save a whole `DECLARE` worth of chars.
Gabe
Is `i-ROW_NUMBER` the same as `RANK`? Can the second CTE be moved into the `FROM` clause where it's used? Can the #D table query be made into a CTE? Can the #t table query be made into a CTE, or at least put into the `FROM` clause of the `SELECT TOP 22` query?
Gabe
@Gabe - Thanks, all good points. Made some other simplifications as well and collectively knocked another 100 off. The '#D' needs to be a temp table. At the moment it takes about 12 seconds on my machine. Swapping to a CTE slowed it down massively (I cancelled the query after 2 minutes, so I don't know how long it would have taken, or indeed if it would have finished at all).
Martin Smith
HA! Take that, Java!
Gabe
@Gabe - High Five! Good golfing with you. Cheers for the assistance.
Martin Smith
+1  A: 

R, 298 chars

f=scan("stdin","ch")
u=unlist
s=strsplit
a=u(s(u(s(tolower(f),"[^a-z]")),"^(the|and|of|to|it|in|or|is|.|)$"))
v=unique(a)
r=sort(sapply(v,function(i) sum(a==i)),T)[2:23]  #the first item is an empty string, just skipping it
w=names(r)
q=(78-max(nchar(w)))*r/max(r)
cat(" ",rep("_",q[1])," \n",sep="")
for(i in 1:22){cat("|",rep("_",q[i]),"| ",w[i],"\n",sep="")}

The output is:

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

And if "you" is replaced by something longer:

 ____________________________________________________________ 
|____________________________________________________________| she
|____________________________________________________| veryverylongstring
|__________________________________________________| said
|___________________________________________| alice
|______________________________________| was
|___________________________________| that
|_____________________________| as
|__________________________| her
|________________________| at
|________________________| with
|______________________| on
|_____________________| all
|___________________| this
|___________________| for
|___________________| had
|__________________| but
|__________________| be
|__________________| not
|________________| they
|________________| so
|_______________| very
|_______________| what
Andrei
This is not doing the maximum scaling
6502
+94  A: 

LabVIEW 51 nodes, 5 structures, 10 diagrams

Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count. [images: code, results]

Edit, explanation, program flows from left to right: [image: code with 'splainin]

Edit: added some offsets for the chart's annotations and scaling, updated image links

Underflow
Wow, I've never before seen an example of visual programming that looks useful! I'd thought the consensus was that it was impossible or not worth it.
JDonner
It IS not worth it
M28
LabVIEW's very happy in its hardware control and measurement niche, but really pretty awful for string manipulation.
Underflow
No 3D yet? ... :D
belisarius
Best code golf answer I've seen. +1 for thinking outside the box!
Blair Holloway
Gotta count the elements for us...every box and widget you had to drag to the screen counts.
dmckee
@dmckee Good call. Most metrics are based on node count, so I'll add that.
Underflow
@Underflow: Fair enough. I'm not sure it is a precise comparison, but it is something.
dmckee
Would it be possible to add a link to a bigger version of those charts?
Svish
@Svish Switched to a different host for the images. Hopefully it helps.
Underflow
Holy shit, this is one of the most awesome things I have ever seen.
Jesse Dhillon
This is Coding?!
Anraiki
This is sr8 bucknnasty
Pierreten
+1  A: 

Python 290, 255, 253


290 characters in python (text read from standard input)

import sys,re
c={}
for w in re.findall("[a-z]+",sys.stdin.read().lower()):c[w]=c.get(w,0)+1-(","+w+","in",a,i,the,and,of,to,it,in,or,is,")
r=sorted((-v,k)for k,v in c.items())[:22]
sf=max((76.0-len(k))/v for v,k in r)
print" "+"_"*int(r[0][0]*sf)
for v,k in r:print"|"+"_"*int(v*sf)+"| "+k

but... after reading other solutions I suddenly realized that efficiency was not a requirement, so here is another, shorter and much slower one (255 characters)

import sys,re
w=re.findall("\w+",sys.stdin.read().lower())
r=sorted((-w.count(x),x)for x in set(w)-set("the and of to a i it in or is".split()))[:22]
f=max((76.-len(k))/v for v,k in r)
print" "+"_"*int(f*r[0][0])
for v,k in r:print"|"+"_"*int(f*v)+"| "+k

and after some more reading other solutions...

import sys,re
w=re.findall("\w+",sys.stdin.read().lower())
r=sorted((-w.count(x),x)for x in set(w)-set("the and of to a i it in or is".split()))[:22]
f=max((76.-len(k))/v for v,k in r)
print"","_"*int(f*r[0][0])
for v,k in r:print"|"+"_"*int(f*v)+"|",k

And now this solution is almost byte-per-byte identical to Astatine's one :-D
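
The difference between the first version and the later ones is in how the words are counted. A Python 3 sketch of both approaches, with invented function names, and using `[a-z]+` in both for comparability (the later versions above use `\w+`):

```python
import re

STOP = "the and of to a i it in or is".split()

def count_one_pass(text):
    # First version: one dictionary update per word. Subtracting the boolean
    # (True == 1) keeps stop words at a count of 0.
    counts = {}
    for w in re.findall("[a-z]+", text.lower()):
        counts[w] = counts.get(w, 0) + 1 - (w in STOP)
    return sorted((-v, k) for k, v in counts.items())[:22]

def count_rescan(text):
    # Later versions: list.count rescans the whole list for each distinct word.
    words = re.findall("[a-z]+", text.lower())
    return sorted((-words.count(x), x) for x in set(words) - set(STOP))[:22]
```

Note that the one-pass variant leaves the stop words in the result with a count of zero, which only matters when the text has fewer than 22 other distinct words.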

6502
I worked out a very similar solution. Looking at yours there seems to be ways to merge both, you thought of some tricks I didn't...
kriss
+6  A: 

C (828)

It looks a lot like obfuscated code, and uses glib for strings, lists and hash tables. The character count according to wc -m is 828. It does not consider single-char words. To calculate the maximum length of the bar, it considers the longest word among all words, not only the first 22. Is this a deviation from the spec?

It does not handle failures and it does not release used memory.

#include <glib.h>
#define S(X)g_string_##X
#define H(X)g_hash_table_##X
GHashTable*h;int m,w=0,z=0;y(const void*a,const void*b){int*A,*B;A=H(lookup)(h,a);B=H(lookup)(h,b);return*B-*A;}void p(void*d,void*u){int *v=H(lookup)(h,d);if(w<22){g_printf("|");*v=*v*(77-z)/m;while(--*v>=0)g_printf("=");g_printf("| %s\n",d);w++;}}main(c){int*v;GList*l;GString*s=S(new)(NULL);h=H(new)(g_str_hash,g_str_equal);char*n[]={"the","and","of","to","it","in","or","is"};while((c=getchar())!=-1){if(isalpha(c))S(append_c)(s,tolower(c));else{if(s->len>1){for(c=0;c<8;c++)if(!strcmp(s->str,n[c]))goto x;if((v=H(lookup)(h,s->str))!=NULL)++*v;else{z=MAX(z,s->len);v=g_malloc(sizeof(int));*v=1;H(insert)(h,g_strdup(s->str),v);}}x:S(truncate)(s,0);}}l=g_list_sort(H(get_keys)(h),y);m=*(int*)H(lookup)(h,g_list_first(l)->data);g_list_foreach(l,p,NULL);}
ShinTakezou
Newlines do count as characters, but you can strip any from lines that are not preprocessor instructions. For a golf, I wouldn't consider not freeing memory a bad practice.
Shtééf
ok... put all in a line (except preproc macros) and given a vers without freeing mem (and with two other spaces removed... a little bit of improvement can be made on the "obfuscation", e.g. `*v=*v*(77-lw)/m` will give 929 ... but I think it can be ok unless I find a way to do it a lot shorter)
ShinTakezou
I think you can move at least the `int c` into the `main` declaration and `main` is implicitly `int` (as are any untyped arguments, afaik): `main(c){...}`. You could probably also just write `0` instead of `NULL`.
Joey
doing it... of course will trigger some warning with the `-Wall` or with `-std=c99` flag on... but I suppose this is pointless for a code-golf, right?
ShinTakezou
uff, sorry for short-gap time edits, ... I should change `Without freeing memory stuff, it reaches 866 (removed some other unuseful space)` to something else to let not think people that the difference with the free-memory version is all in that: now the no-free-memory version has a lot of more "improvements".
ShinTakezou
still some improvements can be done shortening names of variables+function
ShinTakezou
@Shin: BTW--you can have more than one answer to a single question. Scroll to the very bottom of the page to find the [Add Another Answer] button. I suppose it's moved down because the expectation is that multiple answers will be the exception, not the rule.
dmckee
@dmckee thanks, I am going to disentangle C and Smalltalk!
ShinTakezou
+5  A: 

Common LISP, 670 characters

I'm a LISP newbie, and this is an attempt using a hash table for counting (so probably not the most compact method).

(flet((r()(let((x(read-char t nil)))(and x(char-downcase x)))))(do((c(
make-hash-table :test 'equal))(w NIL)(x(r)(r))y)((not x)(maphash(lambda
(k v)(if(not(find k '("""the""and""of""to""a""i""it""in""or""is"):test
'equal))(push(cons k v)y)))c)(setf y(sort y #'> :key #'cdr))(setf y
(subseq y 0(min(length y)22)))(let((f(apply #'min(mapcar(lambda(x)(/(-
76.0(length(car x)))(cdr x)))y))))(flet((o(n)(dotimes(i(floor(* n f)))
(write-char #\_))))(write-char #\Space)(o(cdar y))(write-char #\Newline)
(dolist(x y)(write-char #\|)(o(cdr x))(format t "| ~a~%"(car x))))))
(cond((char<= #\a x #\z)(push x w))(t(incf(gethash(concatenate 'string(
reverse w))c 0))(setf w nil)))))

It can be run with, for example, cat alice.txt | clisp -C golf.lisp.

In readable form:

(flet ((r () (let ((x (read-char t nil)))
               (and x (char-downcase x)))))
  (do ((c (make-hash-table :test 'equal))  ; the word count map
       w y                                 ; current word and final word list
       (x (r) (r)))  ; iteration over all chars
       ((not x)

        ; make a list with (word . count) pairs removing stopwords
        (maphash (lambda (k v)
                   (if (not (find k '("" "the" "and" "of" "to"
                                      "a" "i" "it" "in" "or" "is")
                                  :test 'equal))
                       (push (cons k v) y)))
                 c)

        ; sort and truncate the list
        (setf y (sort y #'> :key #'cdr))
        (setf y (subseq y 0 (min (length y) 22)))

        ; find the scaling factor
        (let ((f (apply #'min
                        (mapcar (lambda (x) (/ (- 76.0 (length (car x)))
                                               (cdr x)))
                                y))))
          ; output
          (flet ((outx (n) (dotimes (i (floor (* n f))) (write-char #\_))))
             (write-char #\Space)
             (outx (cdar y))
             (write-char #\Newline)
             (dolist (x y)
               (write-char #\|)
               (outx (cdr x))
               (format t "| ~a~%" (car x))))))

       ; add alphabetic to current word, and bump word counter
       ; on non-alphabetic
       (cond
        ((char<= #\a x #\z)
         (push x w))
        (t
         (incf (gethash (concatenate 'string (reverse w)) c 0))
         (setf w nil)))))
6502
have you tried installing a custom reader macro to shave off some input size?
Aaron
@Aaron Actually, it wasn't trivial for me even just getting this working... :-) For the actual golfing part I just used one-letter variables and that's all. Anyway, besides the somewhat high verbosity that is inherent in CL for this scale of problem ("concatenate 'string", "setf" or "gethash" are killers... in Python they are "+", "=", "[]"), I still felt this was a lot worse than I would have expected, even on a logical level. In a sense I've a feeling that Lisp is OK but Common Lisp is so-so, and this goes beyond naming (re-reading it, this is a very unfair comment, as my experience with CL is close to zero).
6502
true. scheme would make the golfing a bit easier, with the single namespace. instead of string-append all over the place, you could (letrec ((a string-append)(b gethash)) ... (a "x" "yz") ...)
Aaron
+2  A: 

Shell, 228 characters, with the 80-character constraint working

tr A-Z a-z|tr -Cs a-z "\n"|sort|egrep -v "^(the|and|of|to|a|i|it|in|or|is)$" |uniq -c|sort -r|head -22>g
n=1
while :
do
awk '{printf "|%0*s| %s\n",$1*'$n'/1e3,"",$2;}' g|tr 0 _>o 
egrep -q .{80} o&&break
n=$((n+1))
done
cat o

I'm surprised nobody seems to have used the amazing * feature of printf.
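
For comparison, Python's printf-style formatting supports the same `*` width. The sketch below mirrors the awk line; the function name `bar` is made up, and spaces are used as padding instead of zeros before being turned into underscores.

```python
def bar(count, n):
    # "%*s" takes the field width from the argument list, like printf's "*".
    width = int(count * n / 1e3)
    padded = "%*s" % (width, "")
    return "|" + padded.replace(" ", "_") + "|"

print(bar(100, 50))
```

As in the script, `n` is the scale in thousandths, so a count of 100 with `n` equal to 50 gives a bar of five underscores.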

cat 11 | golf.sh

|__________________________________________________________________________| she
|________________________________________________________________| you
|_____________________________________________________________| said
|______________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|________________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so

cat 11-very.txt | golf.sh

|_________________________________________________________________| she
|_________________________________________________________| verylongstringstring
|______________________________________________________| said
|_______________________________________________| alice
|__________________________________________| was
|_______________________________________| that
|________________________________| as
|_____________________________| her
|___________________________| with
|___________________________| at
|__________________________| s
|_________________________| t
|________________________| on
|_______________________| all
|_____________________| this
|_____________________| for
|_____________________| had
|____________________| but
|___________________| be
|___________________| not
|__________________| they
|__________________| so
mb14
Missing the very first line in the output (the top line of the first bar). Also couldn't you just sort ascending and then use the last 22 lines instead? Dunno whether that would make it shorter here but for me it was a serious consideration.
Joey
I know about the first point. I just don't see a simple way to do it, and I wasn't sure it was really mandatory. I could indeed skip the reverse, but then the output would be inverted (she on the last line).
mb14
+1  A: 

Object Rexx 4.0 with PC-Pipes

Where the PC-Pipes library can be found.
This solution ignores single-letter words.


address rxpipe 'pipe (end ?) < Alice.txt',
   '|regex split /[^a-zA-Z]/', -- split at non-alphabetic characters
   '|locate 2',                -- discard words shorter than 2 chars
   '|xlate lower',             -- translate all words to lower case
   ,                           -- discard list words that match list
   '|regex not match /^(the||and||of||to||it||in||or||is)$/',
   '|l:lookup autoadd before count',  -- accumulate and count words
 '? l:',                       -- no master records to feed into lookup 
 '? l:',                       -- list of counted words comes here
   ,                           -- columns 1-10 hold count, 11-n hold word
   '|sort 1.10 d',             -- sort in descending order by count
   '|take 22',                 -- take first 22 records only
   '|array wordlist',          -- store into a rexx array
   '|count max',               -- get length of longest record 
   '|var maxword'              -- save into a rexx variable

parse value wordlist[1] with count 11 .  -- get frequency of first word
barunit = count % (76-(maxword-10))      -- frequency units per chart bar char

say ' '||copies('_', (count+barunit)%barunit)  -- first line of the chart
do cntwd over wordlist                    
  parse var cntwd count 11 word          -- get word frequency and the word
  say '|'||copies('_', (count+barunit)%barunit)||'| '||word||' '
end
The output produced:
 ________________________________________________________________________________
|________________________________________________________________________________| she
|_____________________________________________________________________| you
|___________________________________________________________________| said
|__________________________________________________________| alice
|____________________________________________________| was
|________________________________________________| that
|________________________________________| as
|____________________________________| her
|_________________________________| at
|_________________________________| with
|______________________________| on
|_____________________________| all
|__________________________| this
|__________________________| for
|__________________________| had
|__________________________| but
|________________________| be
|________________________| not
|_______________________| they
|______________________| so
|_____________________| very
|_____________________| what
James Johnosn
How long is the solution (number of characters)? This is code golf, after all.
Nas Banov
+2  A: 

Yet another python 2.x - 206 chars (or 232 with 'width bar')

I believe this one is fully compliant with the question. The ignore list is there, and it fully checks line length (see the example, where I replaced Alice with Aliceinwonderlandbylewiscarroll throughout the text, making the fifth item the longest line). Even the filename is provided from the command line instead of being hardcoded (hardcoding it would remove about 10 chars). It has one drawback (which I believe is acceptable under the question): because it computes an integer divider to keep lines shorter than 80 chars, the longest line ends up shorter than 80 characters rather than exactly 80 characters. The Python 3.x version does not have this defect (but is much longer).

Also I believe it is not so hard to read.

import sys,re
t=re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",sys.stdin.read().lower())
b=sorted((-t.count(x),x)for x in set(t))[:22]
for l,w in b:print"|"+l/min(z/(78-len(e))for z,e in b)*'-'+"|",w

|----------------------------------------------------------------| she
|--------------------------------------------------------| you
|-----------------------------------------------------| said
|----------------------------------------------| aliceinwonderlandbylewiscarroll
|-----------------------------------------| was
|--------------------------------------| that
|-------------------------------| as
|----------------------------| her
|--------------------------| at
|--------------------------| with
|-------------------------| s
|-------------------------| t
|-----------------------| on
|-----------------------| all
|---------------------| this
|--------------------| for
|--------------------| had
|--------------------| but
|-------------------| be
|-------------------| not
|------------------| they
|-----------------| so

It is not clear whether we must print the max bar alone on its own line (as in the sample output). Below is another version that does so, at 232 chars.

import sys,re
t=re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",sys.stdin.read().lower())
b=sorted((-t.count(x),x)for x in set(t))[:22]
f=min(z/(78-len(e))for z,e in b)
print"",b[0][0]/f*'-'
for y,w in b:print"|"+y/f*'-'+"|",w
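The integer-divider idea described above can be sketched in ungolfed form. This is only an illustration in Python 3: the names `divider` and `room` are mine, the counts are made up, and `room=76` is my assumption, chosen so that both pipes, both spaces and the word fit in 80 columns (the golfed code uses 78).

```python
def divider(pairs, room=76):
    # pairs: list of (word, count). For each word, the smallest whole number
    # of occurrences per bar character that keeps its line in bounds is
    # ceil(count / (room - len(word))); the chart needs the largest of these.
    # Assumes every word is shorter than `room` characters.
    return max(-(-count // (room - len(word))) for word, count in pairs)


# Illustrative counts only, not measured from the text.
pairs = [("she", 553), ("aliceinwonderlandbylewiscarroll", 403)]
d = divider(pairs)
for word, count in pairs:
    print("|" + "-" * (count // d) + "| " + word + " ")
```

Because the divider is rounded up to a whole number, the longest line usually lands somewhat below 80 characters, which is the drawback mentioned above.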

Python 3.x - 256 chars

With the Counter class from Python 3.x, there were high hopes of making it shorter (as Counter does everything that we need here). It turns out it's no better. Below is my attempt, at 266 chars:

import sys,re,collections as c
b=c.Counter(re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",
sys.stdin.read().lower())).most_common(22)
F=lambda p,x,w:print(p+'-'*int(x/max(z/(77.-len(e))for e,z in b))+w)
F(" ",b[0][1],"")
for w,y in b:F("|",y,"| "+w)

The problem is that collections and most_common are very long words and even Counter is not short... really, not using Counter makes code only 2 characters longer ;-(

Python 3.x also introduces other constraints: dividing two integers no longer yields an integer (so we have to cast to int), print is now a function (so parentheses must be added), etc. That's why it comes out 22 characters longer than the Python 2.x version, but it is much faster. Maybe a more experienced Python 3.x coder will have ideas for shortening the code.

kriss
That's a clever way of sorting from high to low.
Wallacoloo
+1  A: 

Ruby, 205


This Ruby version handles "superlongstringstring". (The first two lines are almost identical to the previous Ruby programs.)

It must be run this way:

ruby -n0777 golf.rb Alice.txt


W=($_.upcase.scan(/\w+/)-%w(THE AND OF TO A I IT
IN OR IS)).group_by{|x|x}.map{|k,v|[-v.size,k]}.sort[0,22]
u=proc{|m|"_"*(W.map{|n,s|(76.0-s.size)/n}.max*m)}
puts" "+u[W[0][0]],W.map{|n,s|"|%s| "%u[n]+s}

The third line creates a closure or lambda that yields a correctly scaled string of underscores:

u = proc{|m|
  "_" *
    (W.map{|n,s| (76.0 - s.size)/n}.max * m)
}

.max is used instead of .min because the numbers are negative.
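To see why negated counts turn .min into .max, here is a rough Python sketch of the same two steps. The function names and the `room` parameter are hypothetical, and the word list is invented for the demonstration.

```python
from collections import Counter


def top_words(words, n=22):
    # Sorting (-count, word) tuples in ascending order puts the most
    # frequent word first, with no reverse sort needed.
    return sorted((-c, w) for w, c in Counter(words).items())[:n]


def bar_width(neg_count, pairs, room=76.0):
    # Every ratio is negative, so max() picks the one closest to zero,
    # which is the most restrictive scale. Multiplying by the (negative)
    # count makes the width positive again.
    scale = max((room - len(w)) / n for n, w in pairs)
    return int(scale * neg_count)


pairs = top_words(["she"] * 10 + ["you"] * 5 + ["said"] * 5)
for n, w in pairs:
    print("|" + "_" * bar_width(n, pairs) + "| " + w)
```

With positive counts the same expression would need min() instead, since the tightest scale would then be the smallest ratio.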

William James
Implementing the full spec and still very short (213 characters at the moment according to `wc -c`), nice work!
ChristopheD
+3  A: 

Scala 2.8, 311 314 320 330 332 336 341 375 characters

Including long-word adjustment. Ideas borrowed from the other solutions.

Now as a script (a.scala):

val t="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile(argv(0)).mkString.toLowerCase).toSeq.groupBy(w=>w).mapValues(_.size).toSeq.sortBy(-_._2)take 22
def b(p:Int)="_"*(p*(for((w,c)<-t)yield(76.0-w.size)/c).min).toInt
println(" "+b(t(0)._2))
for(p<-t)printf("|%s| %s \n",b(p._2),p._1)

Run with

scala -howtorun:script a.scala alice.txt

BTW, the edit from 314 to 311 characters actually removes only 1 character. Someone got the counting wrong before (Windows CRs?).

mkneissl
+1  A: 

Bourne shell, 213/240 characters

Improving on the shell version posted earlier, I can get it down to 213 characters:

tr A-Z a-z|tr -Cs a-z \\n|sort|egrep -v '^(the|and|of|to|a|i|it|in|or|is)$'|uniq -c|sort -rn|sed 22q>g
n=1
>o
until egrep -q .{80} o
do
awk '{printf "|%0*d| %s\n",$1*'$n'/1e3,0,$2}' g|tr 0 _>o 
((n++))
done
cat o

In order to get the upper outline on the top bar, I had to expand it to 240 characters:

tr A-Z a-z|tr -Cs a-z \\n|sort|egrep -v "^(the|and|of|to|a|i|it|in|or|is)$"|uniq -c|sort -r|sed 1p\;22q>g
n=1
>o
until egrep -q .{80} o
do
awk '{printf "|%0*d| %s\n",$1*'$n'/1e3,0,NR==1?"":$2}' g|sed '1s,|, ,g'|tr 0 _>o 
((n++))
done
cat o
+1  A: 

shell, grep, tr, grep, sort, uniq, sort, head, perl - 194 chars

Adding some -i flags may drop the overly long tr A-Z a-z| step; the spec said nothing about the case displayed, and uniq -ci drops any case differences.

egrep -oi [a-z]+|egrep -wiv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -ci|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'

That's minus 11 for the tr plus 2 for the -i's compared to the original 206 chars.

edit: minus 3 for the \\b which can be left out as pattern matching will commence on a boundary anyway.

sort gives lower case first, and uniq -ci takes the first occurrence, so the only real change in output will be that Alice retains her upper case initial.

mvds
The bar length constraint isn't working.
Joey
+1  A: 

Go, 613 chars, could probably be much smaller:

package main
import(r "regexp";. "bytes";. "io/ioutil";"os";st "strings";s "sort";. "container/vector")
type z struct{c int;w string}
func(e z)Less(o interface{})bool{return o.(z).c<e.c}
func main(){b,_:=ReadAll(os.Stdin);g:=r.MustCompile
c,m,x:=g("[A-Za-z]+").AllMatchesIter(b,0),map[string]int{},g("the|and|of|it|in|or|is|to")
for w:=range c{w=ToLower(w);if len(w)>1&&!x.Match(w){m[string(w)]++}}
o,y:=&Vector{},0
for k,v:=range m{o.Push(z{v,k});if v>y{y=v}}
s.Sort(o)
for i,v:=range *o{if i>21{break};x:=v.(z);c:=int(float(x.c)/float(y)*80)
u:=st.Repeat("_",c);if i<1{println(" "+u)};println("|"+u+"| "+x.w)}}

I feel so dirty.

Pat
+1  A: 

perl, 188 characters

The perl version above (as well as any regexp splitting based version) can get a few bytes shorter by including the list of forbidden words as negative lookahead assertions, rather than as a separate list. Furthermore the trailing semicolon can be left out.
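As an illustration of the lookahead idea, here is a small Python sketch using the same pattern as the Perl one-liner below. The names `WORD` and `words` are mine, and the sample sentence is made up.

```python
import re

# The stop words sit inside a negative lookahead, so a run of letters is
# matched only when it is not exactly one of them; "i[nts]?" covers
# i, in, it and is.
WORD = re.compile(r"\b(?!(?:the|and|a|of|or|i[nts]?|to)\b)[a-z]+")


def words(text):
    return WORD.findall(text.lower())


print(words("The cat and I sat in it on a mat"))
# -> ['cat', 'sat', 'on', 'mat']
```

The trailing `\b` inside the lookahead is what keeps longer words such as "theory" from being rejected just because they start with a stop word.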

I also included some other suggestions (- instead of <=>, for/foreach, dropped "keys") to get to

$c{$_}++for grep{$_}map{lc=~/\b(?!(?:the|and|a|of|or|i[nts]?|to)\b)[a-z]+/g}<>;@s=sort{$c{$b}-$c{$a}}%c;$f=76-length$s[0];say$"."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "for@s[0..21]

I don't know perl, but I presume that the (?!(?:...)\b) may lose the ?: if the handling around it is fixed.

mvds
This throws a syntax error for me: »String found where operator expected at c.pl line 1, near "say"|""syntax error at c.pl line 1, near "say"|""Search pattern not terminated at c.pl line 1.« (Perl 5.10.1). Also the code looks like the bar length constraint isn't working. And it may also well be that strings such as `foo_the_bar` won't get the stop words removed (because of `\b`).
Joey
+2  A: 

Scala, 327 characters

This was adapted from mkneissl's answer inspired by a Python version, though it is bigger. I'm leaving it here in case someone can make it shorter.

val f="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile("11.txt").mkString.toLowerCase).toSeq
val t=f.toSet[String].map(x=> -f.count(x==)->x).toSeq.sorted take 22
def b(p:Int)="_"*(-p/(for((c,w)<-t)yield-c/(76.0-w.size)).max).toInt
println(" "+b(t(0)._1))
for(p<-t)printf("|%s| %s \n",b(p._1),p._2)
Daniel
+5  A: 

Perl, 185 char

200 (slightly broken) 199 197 195 193 187 185 characters. Last two newlines are significant. Complies with the spec.

map$X{+lc}+=!/^(.|the|and|to|i[nst]|o[rf])$/i,/[a-z]+/gfor<>;
$n=$n>($:=$X{$_}/(76-y+++c))?$n:$:for@w=(sort{$X{$b}-$X{$a}}%X)[0..21];
die map{$U='_'x($X{$_}/$n);" $U
"x!$z++,"|$U| $_
"}@w

First line loads counts of valid words into %X.

The second line computes the minimum scaling factor (word count per bar character) such that all output lines will be <= 80 characters.

The third line (contains two newline characters) produces the output.
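The three steps can be sketched roughly as follows in Python. This is not a translation of the Perl: the names `STOP` and `chart` are hypothetical, it filters on the full ignore list from the question rather than dropping all single-letter words, and the 4 columns reserved per line (two pipes and two spaces) are my reading of the spec.

```python
import re
from collections import Counter

STOP = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}


def chart(text, top=22, width=80):
    # Step 1: count the valid words.
    found = re.findall(r"[a-z]+", text.lower())
    ranked = Counter(w for w in found if w not in STOP).most_common(top)
    if not ranked:
        return []
    # Step 2: occurrences per bar character; the largest value needed by
    # any single word keeps every line within `width`.
    # Assumes every word is shorter than width - 4 characters.
    n = max(c / (width - 4 - len(w)) for w, c in ranked)
    # Step 3: build the lines, with the top outline above the first bar.
    lines = []
    for i, (w, c) in enumerate(ranked):
        b = "_" * int(c / n)
        if i == 0:
            lines.append(" " + b)
        lines.append("|%s| %s " % (b, w))
    return lines


print("\n".join(chart("she said she you she you")))
```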

mobrule
This won't remove stop words from strings such as "foo_the_bar". Line length is also one too long (re-read the spec: "bar + space + word **+ space** <= 80 chars")
Joey
+1  A: 

GNU Smalltalk (386)

I think it can be made a little shorter, but I still have no idea how.

|q s f m|q:=Bag new. f:=FileStream stdin. m:=0.[f atEnd]whileFalse:[s:=f nextLine.(s notNil)ifTrue:[(s tokenize:'\W+')do:[:i|(((i size)>1)&({'the'.'and'.'of'.'to'.'it'.'in'.'or'.'is'}includes:i)not)ifTrue:[q add:(i asLowercase)]. m:=m max:(i size)]]].(q:=q sortedByCount)from:1to:22 do:[:i|'|'display.((i key)*(77-m)//(q first key))timesRepeat:['='display].('| %1'%{i value})displayNl]
ShinTakezou
+4  A: 

Clojure 282 strict

(let[[[_ m]:as s](->>(slurp *in*).toLowerCase(re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")frequencies(sort-by val >)(take 22))[b](sort(map #(/(- 76(count(key %)))(val %))s))p #(do(print %1)(dotimes[_(* b %2)](print \_))(apply println %&))](p " " m)(doseq[[k v]s](p \| v \| k)))

Somewhat more legibly:

(let[[[_ m]:as s](->> (slurp *in*)
                   .toLowerCase
                   (re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")
                   frequencies
                   (sort-by val >)
                   (take 22))
     [b] (sort (map #(/ (- 76 (count (key %)))(val %)) s))
     p #(do
          (print %1)
          (dotimes[_(* b %2)] (print \_))
          (apply println %&))]
  (p " " m)
  (doseq[[k v] s] (p \| v \| k)))
Alex Taggart
+1  A: 

Clojure - 611 chars (not minimized)

I tried to write the code in as idiomatic Clojure as I could manage so late at night. I am not too proud of the draw-chart function, but I guess the code will speak volumes about the succinctness of Clojure.

(ns word-freq
(:require [clojure.contrib.io :as io]))

(defn word-freq
  [f]
  (take 22 (->> f
                io/read-lines ;;; slurp should work too, but I love map/red
                (mapcat (fn [l] (map #(.toLowerCase %) (re-seq #"\w+" l))))
                (remove #{"the" "and" "of" "to" "a" "i" "it" "in" "or" "is"})
                (reduce #(assoc %1 %2 (inc (%1 %2 0))) {})
                (sort-by (comp - val)))))

(defn draw-chart
  [fs]
  (let [[[w f] & _] fs]
    (apply str
           (interpose \newline
                      (map (fn [[k v]] (apply str (concat "|" (repeat (int (* (- 76 (count w)) (/ v f 1))) "_") "| " k " ")) ) fs)))))

;;; (println (draw-chart (word-freq "/Users/ghoseb/Desktop/alice.txt")))

Output:

|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| t 
|____________________________| s 
|__________________________| on 
|__________________________| all 
|_______________________| for 
|_______________________| had 
|_______________________| this 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so

I know, this doesn't follow the spec, but hey, this is some very clean Clojure code which is already so small :)

Baishampayan Ghose
+1  A: 

Lua solution: 478 characters.

t,u={},{}for l in io.lines()do
for w in l:gmatch("%a+")do
w=w:lower()if not(" the and of to a i it in or is "):find(" "..w.." ")then
t[w]=1+(t[w]or 0)end
end
end
for k,v in next,t do
u[#u+1]={k,v}end
table.sort(u,function(a,b)return a[2]>b[2]end)m,n=u[1][2],math.min(#u,22)for w=80,1,-1 do
s=""for i=1,n do
a,b=u[i][1],w*u[i][2]/m
if b+#a>=78 then s=nil break end
s2=("_"):rep(b)if i==1 then
s=s.." " ..s2.."\n"end
s=s.."|"..s2.."| "..a.."\n"end
if s then print(s)break end end

Readable version:

t,u={},{}
for line in io.lines() do
    for w in line:gmatch("%a+") do
        w = w:lower()
        if not (" the and of to a i it in or is "):find(" "..w.." ") then
            t[w] = 1 + (t[w] or 0)
        end
    end
end
for k, v in pairs(t) do
    u[#u+1]={k, v}
end

table.sort(u, function(a, b)
    return a[2] > b[2]
end)

local max = u[1][2]
local n = math.min(#u, 22)

for w = 80, 1, -1 do
    s=""
    for i = 1, n do
        f = u[i][2]
        word = u[i][1]
        width = w * f / max
        if width + #word >= 78 then
            s=nil
            break
        end
        s2=("_"):rep(width)
        if i==1 then
            s=s.." " .. s2 .."\n"
        end
        s=s.."|" .. s2 .. "| " .. word.."\n"
    end
    if s then
        print(s)
        break
    end
end
Kristofer
+1  A: 

TCL 554 Strict

foreach w [regexp -all -inline {[a-z]+} [string tolower [read stdin]]] {if {[lsearch {the and of to it in or is a i} $w]>=0} {continue};if {[catch {incr Ws($w)}]} {set Ws($w) 1}}
set T [lrange [lsort -decreasing -stride 2 -index 1 -integer [array get Ws]] 0 43]
foreach {w c} $T {lappend L [string length $w];lappend C $c}
set N [tcl::mathfunc::max {*}$L]
set C [lsort -integer $C]
set M [lindex $C end]
puts " [string repeat _ [expr {int((76-$N) * [lindex $T 1] / $M)}]] "
foreach {w c} $T {puts "|[string repeat _ [expr {int((76-$N) * $c / $M)}]]| $w"}

Or, more legibly

foreach w [regexp -all -inline {[a-z]+} [string tolower [read stdin]]] {
    if {[lsearch {the and of to a i it in or is} $w] >= 0} { continue }
    if {[catch {incr words($w)}]} {
        set words($w) 1
    }
}
set topwords [lrange [lsort -decreasing -stride 2 -index 1 -integer [array get words]] 0 43]
foreach {word count} $topwords {
    lappend lengths [string length $word]
    lappend counts $count
}
set maxlength [lindex [lsort -integer $lengths] end]
set counts [lsort -integer $counts]
set mincount [lindex $counts 0].0
set maxcount [lindex $counts end].0
puts " [string repeat _ [expr {int((76-$maxlength) * [lindex $topwords 1] / $maxcount)}]] "
foreach {word count} $topwords {
    set barlength [expr {int((76-$maxlength) * $count / $maxcount)}]
    puts "|[string repeat _ $barlength]| $word"
}
RHSeeger