What is a good/best way to count the number of characters, words, and lines of a text file using Perl (without using wc)?
The Word Count tool counts characters, words and lines in text files
Here's the Perl code. Counting words can be somewhat subjective, but I just say a word is any run of characters that isn't whitespace.
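open(FILE, "<file.txt") or die "Could not open file: $!";
my ($lines, $words, $chars) = (0,0,0);
while (<FILE>) {
    $lines++;
    $chars += length($_);                 # includes the trailing newline
    # Number of whitespace-separated fields on this line. A line that
    # starts with whitespace yields one extra, empty, leading field.
    $words += scalar(split(/\s+/, $_));
}
close(FILE);
print("lines=$lines words=$words chars=$chars\n");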
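One edge case worth checking with this approach is a line that begins with whitespace: split with /\s+/ keeps an empty leading field, while the special pattern ' ' skips leading whitespace first. A minimal sketch of the difference follows; the two sub names are made up for illustration.

```perl
use strict;
use warnings;

# Fields produced by split /\s+/ (an empty leading field is kept).
sub count_regex_split {
    my ($line) = @_;
    my @fields = split /\s+/, $line;
    return scalar @fields;
}

# Fields produced by split ' ' (leading whitespace is skipped, as in awk).
sub count_awk_split {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return scalar @fields;
}

my $line = "  indented line of text\n";
printf "split /\\s+/ : %d\n", count_regex_split($line);   # 5
printf "split ' '   : %d\n", count_awk_split($line);      # 4
```

If "any run of non-whitespace" is the definition of a word, the ' ' form matches it more closely for indented lines.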
A variation on bmdhacks' answer that will probably produce better results is to use \s+ (or even better \W+) as the delimiter. Consider the string "The quick brown fox" with extra spaces between some of the words (the rendering may collapse them). Using a delimiter of a single whitespace character will give a word count of six, not four. So, try:
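open(FILE, "<file.txt") or die "Could not open file: $!";
my ($lines, $words, $chars) = (0,0,0);
while (<FILE>) {
    $lines++;
    $chars += length($_);                 # includes the trailing newline
    # Split on runs of non-word characters, so punctuation is never a word.
    # Note: older Perls (before 5.12) warn about an implicit split to @_ here.
    $words += scalar(split(/\W+/, $_));
}
close(FILE);
print("lines=$lines words=$words chars=$chars\n");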
Using \W+ as the delimiter will stop punctuation (amongst other things) from counting as words.
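To see what the choice of delimiter changes, here is a small sketch that counts the non-empty fields for both patterns on two sample strings. count_fields and the sample strings are made up for illustration. Note that \W+ also splits on apostrophes, so a contraction counts as two words.

```perl
use strict;
use warnings;

# Count the non-empty fields produced by splitting $text on $pattern.
sub count_fields {
    my ($pattern, $text) = @_;
    my @fields = grep { length } split $pattern, $text;
    return scalar @fields;
}

my $punctuated = "Stop - look, listen.";
print count_fields(qr/\s+/, $punctuated), "\n";   # 4: "Stop", "-", "look,", "listen."
print count_fields(qr/\W+/, $punctuated), "\n";   # 3: "Stop", "look", "listen"

my $contraction = "don't";
print count_fields(qr/\s+/, $contraction), "\n";  # 1
print count_fields(qr/\W+/, $contraction), "\n";  # 2: "don", "t"
```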
Reading the file in fixed-size chunks may be more efficient than reading line-by-line. The wc binary does this.
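#!/usr/bin/env perl
use strict;
use warnings;
use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        next;                       # skip to the next file
    };
    my ($chars, $words, $lines) = (0, 0, 0);
    my $in_word = 0;                # true if the previous chunk ended inside a word
    my $buf;
    while (my $size = sysread $fh, $buf, BLOCK_SIZE) {
        $chars += $size;            # sysread returns bytes, so this is a byte count
        # Count runs of non-whitespace in this chunk.
        $words += () = $buf =~ /\S+/g;
        # The first run continues a word already counted in the previous chunk.
        $words-- if $in_word && $buf =~ /\A\S/;
        $lines += ($buf =~ tr/\n//);    # a line is a newline character, as in wc
        $in_word = $buf =~ /\S\z/ ? 1 : 0;
    }
    close $fh;
    print "\t$lines\t$words\t$chars\t$file\n";
}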
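The tricky part of chunked reading is a word that straddles two reads. A minimal sketch of one way to handle it: count runs of non-whitespace in each chunk, and subtract one when a chunk starts by continuing the word the previous chunk ended in. count_words_in_chunks is a hypothetical helper, and the chunks are plain strings standing in for successive sysread results.

```perl
use strict;
use warnings;

# Count words across a list of chunks, treated as consecutive pieces of one file.
sub count_words_in_chunks {
    my (@chunks) = @_;
    my ($words, $in_word) = (0, 0);
    for my $chunk (@chunks) {
        next if $chunk eq '';
        my $runs = () = $chunk =~ /\S+/g;   # runs of non-whitespace in this chunk
        $words += $runs;
        # The first run is the tail of a word that was already counted.
        $words-- if $in_word && $chunk =~ /\A\S/;
        $in_word = $chunk =~ /\S\z/ ? 1 : 0;
    }
    return $words;
}

print count_words_in_chunks("hello wor", "ld again\n"), "\n";   # 3
print count_words_in_chunks("hello ", "world\n"), "\n";         # 2
```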
To count characters rather than bytes, consider this (try it with Chinese or Cyrillic text in a file saved as UTF-8):
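use strict;
use warnings;
use Encode qw(encode);

my $file = 'file.txt';

# Decode the file as UTF-8 while reading, so that length() counts characters.
open(my $fh, '<:encoding(UTF-8)', $file)
    or die "$file couldn't be opened: $!";
my $txt = do { local $/; <$fh> };    # slurp the whole file
close $fh;

print length($txt), "\n";            # number of characters
# Number of bytes when re-encoded as UTF-8. Encode::encode is used here in
# place of "use bytes", which perldoc discourages outside of debugging.
print length(encode('UTF-8', $txt)), "\n";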
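The same distinction can be checked without a file. A small sketch, assuming the text is already a decoded Perl string; char_and_byte_length is a made-up helper, and the sample is the six-letter Cyrillic word "привет" written with code point escapes so the script itself stays ASCII.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return (characters, UTF-8 bytes) for a decoded string.
sub char_and_byte_length {
    my ($text) = @_;
    my $chars = length($text);
    my $bytes = length(encode('UTF-8', $text));
    return ($chars, $bytes);
}

my $word = "\x{43f}\x{440}\x{438}\x{432}\x{435}\x{442}";
my ($chars, $bytes) = char_and_byte_length($word);
print "chars=$chars bytes=$bytes\n";   # chars=6 bytes=12
```

Each of these Cyrillic letters takes two bytes in UTF-8, which is why the byte count is double the character count.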
Once Upon A Time there was the Perl Power Tools project whose goal was to reconstruct all the Unix bin utilities, primarily for those on operating systems deprived of Unix (in the days before Cygwin). And yes, they did wc. The implementation is overkill, but it is POSIX compliant.
It gets a little ridiculous when you look at the simple implementation of true and the extra fancy GNU compliant one.
I stumbled upon this while googling for a character count solution. Admittedly, I know next to nothing about Perl, so some of this may be off base, but here are my tweaks of newt's solution.
First, there is a built-in line count variable ($.) anyway, so I just used that. This is probably a bit more efficient, I guess. As it is, the character count includes newline characters, which is probably not what you want, so I chomped $_. Perl also complained about the way the split() is done (implicit split, see: http://stackoverflow.com/questions/2436160/why-does-perl-complain-use-of-implicit-split-to-is-deprecated ) so I tweaked that. My input files are UTF-8, so I opened them as such. That probably helps get the correct character count if the input file contains non-ASCII characters.
Here's the code:
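open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";
my ($lines, $words, $chars) = (0,0,0);
my @wordcounter;
while (<FILE>) {
    chomp($_);                  # don't count the newline as a character
    $chars += length($_);
    # Split into an array to avoid the implicit-split warning; grep drops the
    # empty leading field produced when a line starts with a non-word character.
    @wordcounter = grep { length } split(/\W+/, $_);
    $words += @wordcounter;
}
$lines = $.;                    # $. is the number of the last line read from FILE
close FILE;
print "\nlines=$lines, words=$words, chars=$chars\n";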
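To try the $. and chomp behaviour without creating a file, here is a minimal sketch that reads from an in-memory filehandle. count_lines_and_chars is a made-up name, and the input is assumed to be plain ASCII text.

```perl
use strict;
use warnings;

# Return (lines, characters excluding newlines) for a block of text.
sub count_lines_and_chars {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Could not open in-memory file: $!";
    my $chars = 0;
    while (my $line = <$fh>) {
        chomp $line;                # strip the newline before measuring
        $chars += length $line;
    }
    my $lines = $.;                 # line number of the last line read from $fh
    close $fh;
    return ($lines, $chars);
}

my ($lines, $chars) = count_lines_and_chars("one\ntwo\nthree\n");
print "lines=$lines chars=$chars\n";   # lines=3 chars=11
```

Note that $. has to be read before the filehandle is closed, because close resets it.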