ansaurus

Question

How do I count the characters, words, and lines in a file, using Perl?

Answer 1

+2 A:

Non-serious answer:

system("wc foo");

Paul Tomblin 2009-04-23 14:50:44

ITYM: my ($lines, $words, $chars) = split(' ', `wc foo`);

ysth 2009-04-26 22:56:18

Answer 2

+4 A:

The Word Count tool counts characters, words and lines in text files

TStamper 2009-04-23 14:50:48

Answer 3

+5 A:

Here's the Perl code. Counting words is somewhat subjective; here, a word is any run of characters that isn't whitespace.

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\s+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");

bmdhacks 2009-04-23 14:54:52

Nicely concise but not obfuscated.

Paul Tomblin 2009-04-23 14:56:25

For word count, you want: scalar(split); that will split on /\s+/ and drop leading empty fields, just like awk does.

glenn jackman 2009-04-23 15:33:47
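The difference glenn points out shows up on any line with leading whitespace. A minimal sketch, assuming a made-up sample line ($line is not from the thread): splitting on /\s+/ keeps an empty first field, while the special ' ' pattern (the default when split is called with no arguments) skips it.

```perl
use strict;
use warnings;

# Hypothetical sample line with leading whitespace
my $line = "  the quick brown fox\n";

# /\s+/ produces an empty leading field before "the"
my @naive = split /\s+/, $line;

# ' ' is the awk-like mode: leading whitespace is skipped first
my @awk = split ' ', $line;

printf "%d %d\n", scalar @naive, scalar @awk;   # prints "5 4"
```

Assigning to an array before counting avoids relying on what split returns in scalar context, which older Perls warn about.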

As a note sort of related to glenn's, you can say "length;" instead of "length $_;" and Perl will default to using $_. However, using the defaults on split() is more beneficial, as it even has a default regex.

Chris Lutz 2009-04-23 15:48:48

@Paul Tomblin: here, are you happy now: perl -ne 'END{print"$. $c $w\n"}$c+=length;$w+=split'

Chas. Owens 2009-04-23 15:55:09

open(FILE, '<', 'file.txt')

Brad Gilbert 2009-04-23 16:54:46

@Brad Gilbert why not go all the way: open my $fh, "<", "file.txt" or die "could not open file: $!";

Chas. Owens 2009-04-24 19:40:12
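The suggestions in these comments (lexical filehandle, three-argument open, awk-style split) could be combined roughly as follows. This is only a sketch: count_fh is a hypothetical helper name, and the demo reads from an in-memory string ($data) rather than file.txt so that it runs anywhere.

```perl
use strict;
use warnings;

# Hypothetical helper: count lines, words and characters from an open filehandle.
sub count_fh {
    my ($fh) = @_;
    my ($lines, $words, $chars) = (0, 0, 0);
    while (my $line = <$fh>) {
        $lines++;
        $chars += length $line;           # includes the newline, like wc
        my @fields = split ' ', $line;    # awk-style split, no empty leading field
        $words += @fields;
    }
    return ($lines, $words, $chars);
}

# Demo on an in-memory file; for a real file use: open my $fh, '<', 'file.txt'
my $data = "  the quick\nbrown fox\n";
open my $fh, '<', \$data or die "Could not open in-memory file: $!";
my ($lines, $words, $chars) = count_fh($fh);
close $fh;

print "lines=$lines words=$words chars=$chars\n";   # lines=2 words=4 chars=22
```

Without an encoding layer on the filehandle, length counts bytes rather than characters, which only matters for non-ASCII input.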

Answer 4

+4 A:

A variation on bmdhacks' answer that will probably produce better results is to use \s+ (or, even better, \W+) as the delimiter. Consider the string "The  quick  brown fox" (with two additional spaces, if it's not obvious). Using a single whitespace character as the delimiter will give a word count of six, not four. So, try:

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\W+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");

Using \W+ as the delimiter will stop punctuation (amongst other things) from counting as words.

Nic Gibson 2009-04-23 15:08:59

Using \W will split "nit-picking" into two words. I don't know if this is correct behavior or not, but I always thought of hyphenated words as one word rather than two.

Chris Lutz 2009-04-23 15:50:28

It's one of those 'you pays your money, you makes your choice' things. Personally, I usually roll my own regex that fits the definition of 'word' I need at the time. Quite often split can be less than helpful because it is a negative match. A normal regex matches the characters you *do* want, generally a better idea. You could certainly do the same sort of thing using m/.../g and calling it in list context.

Nic Gibson 2009-04-23 17:19:08
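The positive-match approach described in the comment above can be sketched like this. The pattern is just one possible definition of a word, chosen for illustration (word characters, optionally joined by hyphens or apostrophes), and $text is a made-up sample.

```perl
use strict;
use warnings;

# Hypothetical sample text
my $text = "Nit-picking isn't hard, is it?";

# m//g in list context returns every match of the "word" pattern
my @matched = $text =~ /\w+(?:['-]\w+)*/g;

# For comparison: splitting on \W+ breaks at the hyphen and the apostrophe
my @pieces = split /\W+/, $text;

printf "match=%d split=%d\n", scalar @matched, scalar @pieces;   # match=5 split=7
```

Changing the definition of a word then means changing only the pattern, not the surrounding loop.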

Answer 5

+1 A:

Reading the file in fixed-size chunks may be more efficient than reading line-by-line. The wc binary does this.

#!/usr/bin/env perl

use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        next;
    };

    my ($chars, $words, $lines) = (0, 0, 0);

    my ($new_word, $new_line);
    while ((my $size = sysread $fh, local $_, BLOCK_SIZE) > 0) {
        $chars += $size;
        $words += () = /\s+/g;  # list context counts every whitespace run
        $words-- if $new_word && /\A\s/;
        $lines += () = /\n/g;

        $new_word = /\s\Z/;
        $new_line = /\n\Z/;
    }
    # No adjustment for a trailing newline: like wc, every "\n" counts as one line.

    print "\t$lines\t$words\t$chars\t$file\n";
}

ephemient 2009-04-23 15:44:28

I'm not sure this gives you any benefit. Under the hood, as it were, perl's <> operator is using buffered IO. All you have done here is rewrite something built-in with something that has to be interpreted.

Nic Gibson 2009-04-23 17:20:19

True. At least with my installation of 5.8.8, Perl buffers 4096 bytes at a time, and there's no performance benefit to doing this manually -- as you suspected, if anything, it's actually worse. I like reminding people to think low-level though :)

ephemient 2009-04-23 17:35:50

Answer 6

A:

To be able to count CHARS and not bytes, consider this:
(Try it with Chinese or Cyrillic letters and file saved in utf8)

use utf8;

my $file='file.txt';
my $LAYER = ':encoding(UTF-8)';
open( my $fh, '<', $file )
  || die( "$file couldn't be opened: $!" );
binmode( $fh, $LAYER );
read $fh, my $txt, -s $file;
close $fh;

print length $txt,$/;
use bytes;
print length $txt,$/;

Berov 2009-04-23 20:13:35

Perl defaults to using the system locale. If your system is modern, the system locale will be a UTF-8 encoding, and thus Perl IO will be UTF-8 by default. If not, you probably should be using the system locale and not forcing UTF-8 mode...

ephemient 2009-04-23 21:00:14

Wrong, ephemient. Perl defaults to the system locale, but prints characters 128-255 as "?" for backwards compatibility. To print proper UTF-8, one should say binmode($fh, ":utf8"); before using the filehandle. In this case, "use utf8;" is useless - it tells Perl that the source code can be in UTF-8, which is unnecessary unless you have variable names like $áccent or $ümlats.

Chris Lutz 2009-04-24 09:47:13

@Chris Both my Perl 5.8 and 5.10 are documented as having `-C SDL` as the default, and `perl -e 'print "\xe2\x81\x89\n"'` produces "⁉" as expected -- not "???" as you seem to expect.

ephemient 2009-04-24 15:24:07

I think those three hex values are combining into one UTF-8 character. And UTF-8 characters will print in Perl. Just not the ones from 128-255. Trying any one of those three hex codes individually on my machine gives me "?", whereas prefixing it with binmode(STDOUT, ":utf8"); gives me "â" for \xe2 and non printing characters for the other two. And as far as I can tell, I have no default setting of "-C" anything.

Chris Lutz 2009-04-24 19:17:43

`echo $'\xe2\x80\x99' | perl -ne'print length,$/'` outputs 4 while `echo $'\xe2\x80\x99' | perl -CSDL -ne'print length,$/'` outputs 2, so I must be misremembering and Chris is correct.

ephemient 2009-04-24 20:47:40
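The byte-versus-character difference in that last experiment can be reproduced without any command-line switches by decoding explicitly. A minimal sketch using the core Encode module; the variable names are made up for illustration, and the three bytes are the same ones used in the comment above, minus the newline that echo adds.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The raw UTF-8 bytes for U+2019 (right single quotation mark)
my $bytes = "\xe2\x80\x99";

# Decoding turns the three bytes into a single character
my $chars = decode('UTF-8', $bytes);

print length($bytes), " ", length($chars), "\n";   # prints "3 1"
```

Opening the file with an :encoding(UTF-8) layer, as Berov's answer does, performs the same decoding on every read.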

Answer 7

+2 A:

Once Upon A Time there was the Perl Power Tools project whose goal was to reconstruct all the Unix bin utilities, primarily for those on operating systems deprived of Unix (in the days before Cygwin). And yes, they did wc. The implementation is overkill, but it is POSIX compliant.

It gets a little ridiculous when you look at the simple implementation of true and the extra fancy GNU compliant one.

Schwern 2009-04-24 09:33:40

Most of the fancy 'true' implementation is POD. Still ridiculous.

Chris Lutz 2009-04-24 09:50:45

Answer 8

A:

I stumbled upon this while googling for a character count solution. Admittedly, I know next to nothing about Perl, so some of this may be off base, but here are my tweaks of newt's solution.

First, there is a built-in line count variable anyway, so I just used that. This is probably a bit more efficient, I guess. As it is, the character count includes newline characters, which is probably not what you want, so I chomped $_. Perl also complained about the way the split() is done (implicit split, see: http://stackoverflow.com/questions/2436160/why-does-perl-complain-use-of-implicit-split-to-is-deprecated ) so I tweaked that. My input files are UTF-8, so I opened them as such. That probably helps get the correct character count if the input file contains non-ASCII characters.

Here's the code:

open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);
my @wordcounter;
while (<FILE>) {
    chomp($_);
    $chars += length($_);
    @wordcounter = split(/\W+/, $_);
    $words += @wordcounter;
}
$lines = $.;
close FILE;
print "\nlines=$lines, words=$words, chars=$chars\n";

elef 2010-08-29 16:44:55
