What is a good/best way to count the number of characters, words, and lines of a text file using Perl (without using wc)?

+2  A: 

Non-serious answer:

system("wc foo");
Paul Tomblin
ITYM: my ($lines, $words, $chars) = split(' ', `wc foo`);
ysth
+4  A: 

The Word Count tool counts characters, words and lines in text files

TStamper
+5  A: 

Here's the Perl code. Counting words is somewhat subjective; here a word is any run of characters that isn't whitespace.

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\s+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");
bmdhacks
Nicely concise but not obfuscated.
Paul Tomblin
For word count, you want: scalar(split); that will split on /\s+/ and drop leading empty fields, just like awk does.
glenn jackman
As a note sort of related to glenn's, you can say "length;" instead of "length $_;" and Perl will default to using $_. However, using the defaults on split() is more beneficial, as it even has a default regex.
Chris Lutz
@Paul Tomblin: here, are you happy now: perl -ne 'END{print"$. $c $w\n"}$c+=length;$w+=split'
Chas. Owens
open(FILE, '<', 'file.txt')
Brad Gilbert
@Brad Gilbert why not go all the way: open my $fh, "<", "file.txt" or die "could not open file: $!";
Chas. Owens
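Pulling the suggestions from these comments together (lexical filehandle, three-argument open, awk-style split), a minimal sketch might look like the following. The helper name count_text and the in-memory sample text are illustrative assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# Count lines, words and characters read from an already-open filehandle.
# count_text is a hypothetical helper name used only for this sketch.
sub count_text {
    my ($fh) = @_;
    my ($lines, $words, $chars) = (0, 0, 0);
    while (my $line = <$fh>) {
        $lines++;
        $chars += length $line;          # includes the trailing newline
        my @fields = split ' ', $line;   # ' ' skips leading whitespace, like awk
        $words += @fields;               # array in numeric context = field count
    }
    return ($lines, $words, $chars);
}

# Demo on an in-memory filehandle; pass a path such as "file.txt" instead of \$text.
my $text = "  The  quick  brown fox\njumps\n";
open my $fh, '<', \$text or die "Could not open: $!";
my ($lines, $words, $chars) = count_text($fh);
close $fh;
print "lines=$lines words=$words chars=$chars\n";
```

Assigning the split result to an array first sidesteps the old "implicit split to @_" behaviour of split in scalar context on older Perls.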
+4  A: 

A variation on bmdhacks' answer that will probably produce better results is to use \s+ (or even better \W+) as the delimiter. Consider the string "The  quick  brown fox" (additional spaces if it's not obvious). Using a delimiter of a single whitespace character will give a word count of six not four. So, try:

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\W+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");

Using \W+ as the delimiter will stop punctuation (amongst other things) from counting as words.

Nic Gibson
Using \W will split "nit-picking" into two words. I don't know if this is correct behavior or not, but I always thought of hyphenated words as one word rather than two.
Chris Lutz
It's one of those 'you pays your money, you makes your choice' things. Personally, I usually roll my own regex that fits the definition of 'word' I need at the time. Quite often split can be less than helpful because it is a negative match. A normal regex matches the characters you *do* want, generally a better idea. You could certainly do the same sort of thing using m/.../g and calling it in list context.
Nic Gibson
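As a rough illustration of the positive-match approach described above, the sketch below counts matches of m/.../g in list context. The pattern is only one possible definition of a word (an assumption for this example): a run of word characters, optionally joined by single hyphens or apostrophes, so "nit-picking" stays one word.

```perl
use strict;
use warnings;

# Count words by matching what a word *is* rather than splitting on what it is not.
# count_words is a hypothetical helper; the pattern is an assumption, adjust to taste.
sub count_words {
    my ($text) = @_;
    # The empty-list assignment forces list context; assigning that to a
    # scalar yields the number of matches.
    my $count = () = $text =~ /\w+(?:[-']\w+)*/g;
    return $count;
}

print count_words("Nit-picking isn't hard, is it?"), "\n";
```

Changing the definition of a word then means editing one regex rather than reasoning about which delimiters split would leave behind.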
+1  A: 

Reading the file in fixed-size chunks may be more efficient than reading line-by-line. The wc binary does this.

#!/usr/bin/env perl
use strict;
use warnings;

use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        next;    # skip to the next file
    };

    my ($chars, $words, $lines) = (0, 0, 0);

    my $in_word = 0;    # true if the previous block ended in the middle of a word
    while ((my $size = sysread($fh, my $buf, BLOCK_SIZE)) > 0) {
        $chars += $size;                           # bytes read, like wc -c
        $words += () = $buf =~ /\S+/g;             # runs of non-whitespace in this block
        $words-- if $in_word && $buf =~ /\A\S/;    # word continued from the previous block
        $lines += () = $buf =~ /\n/g;              # newlines, like wc -l

        $in_word = $buf =~ /\S\z/ ? 1 : 0;
    }

    print "\t$lines\t$words\t$chars\t$file\n";
}
ephemient
I'm not sure this gives you any benefit. Under the hood, as it were, perl's <> operator is using buffered IO. All you have done here is rewrite something built-in with something that has to be interpreted.
Nic Gibson
True. At least with my installation of 5.8.8, Perl buffers 4096 bytes at a time, and there's no performance benefit to doing this manually -- as you suspected, if anything, it's actually worse. I like reminding people to think low-level though :)
ephemient
A: 

To count characters rather than bytes, consider this:
(Try it with Chinese or Cyrillic letters in a file saved as UTF-8.)

use utf8;

my $file='file.txt';
my $LAYER = ':encoding(UTF-8)';
open( my $fh, '<', $file )
  || die( "$file couldn't be opened: $!" );
binmode( $fh, $LAYER );
read $fh, my $txt, -s $file;
close $fh;

print length $txt,$/;
use bytes;
print length $txt,$/;
Berov
Perl defaults to using the system locale. If your system is modern, the system locale will be a UTF-8 encoding, and thus Perl IO will be UTF-8 by default. If not, you probably should be using the system locale and not forcing UTF-8 mode...
ephemient
Wrong, ephemient. Perl defaults to the system locale, but prints characters 128-255 as "?" for backwards compatibility. To print proper UTF-8, one should say binmode($fh, ":utf8"); before using the filehandle. In this case, "use utf8;" is useless - it tells Perl that the source code can be in UTF-8, which is unnecessary unless you have variable names like $áccent or $ümlats.
Chris Lutz
@Chris Both my Perl 5.8 and 5.10 are documented as having `-C SDL` as the default, and `perl -e 'print "\xe2\x81\x89\n"'` produces "⁉" as expected -- not "???" as you seem to expect.
ephemient
I think those three hex values are combining into one UTF-8 character. And UTF-8 characters will print in Perl. Just not the ones from 128-255. Trying any one of those three hex codes individually on my machine gives me "?", whereas prefixing it with binmode(STDOUT, ":utf8"); gives me "â" for \xe2 and non printing characters for the other two. And as far as I can tell, I have no default setting of "-C" anything.
Chris Lutz
`echo $'\xe2\x80\x99' | perl -ne'print length,$/'` outputs 4 while `echo $'\xe2\x80\x99' | perl -CSDL -ne'print length,$/'` outputs 2, so I must be misremembering and Chris is correct.
ephemient
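To make the byte-versus-character distinction from this exchange concrete, here is a small sketch using the core Encode module on the same three bytes as the echo example above (without the trailing newline that echo adds). The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Three bytes that together encode a single character (U+2019) in UTF-8.
my $bytes = "\xe2\x80\x99";

# Decoding turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

print length($bytes), "\n";   # length of the undecoded byte string
print length($chars), "\n";   # length of the decoded character string
```

Opening the file with an :encoding(UTF-8) layer performs the same decoding as the data is read, which is why length then reports characters.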
+2  A: 

Once Upon A Time there was the Perl Power Tools project whose goal was to reconstruct all the Unix bin utilities, primarily for those on operating systems deprived of Unix (in the days before Cygwin). And yes, they did wc. The implementation is overkill, but it is POSIX compliant.

It gets a little ridiculous when you look at the simple implementation of true and the extra fancy GNU compliant one.

Schwern
Most of the fancy 'true' implementation is POD. Still ridiculous.
Chris Lutz
A: 

I stumbled upon this while googling for a character count solution. Admittedly, I know next to nothing about Perl, so some of this may be off base, but here are my tweaks of newt's solution.

First, there is a built-in line count variable ($.) anyway, so I just used that. This is probably a bit more efficient, I guess. As it is, the character count includes newline characters, which is probably not what you want, so I chomped $_. Perl also complained about the way the split() is done (implicit split, see: http://stackoverflow.com/questions/2436160/why-does-perl-complain-use-of-implicit-split-to-is-deprecated ) so I tweaked that. My input files are UTF-8, so I opened them as such. That probably helps get the correct character count if the input file contains non-ASCII characters.

Here's the code:

open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);
my @wordcounter;
while (<FILE>) {
    chomp($_);
    $chars += length($_);
    @wordcounter = split(/\W+/, $_);
    $words += @wordcounter;
}
$lines = $.;
close FILE;
print "\nlines=$lines, words=$words, chars=$chars\n";
elef