views: 1122
answers: 6

I have to use Perl in a Windows environment at work, and I need to be able to find out the number of rows that a large CSV file contains (about 1.4 GB). Any idea how to do this with minimum waste of resources?

Thanks

PS This must be done within the Perl script and we're not allowed to install any new modules onto the system.

+5  A: 

Yes, don't use Perl.

Instead, use the simple utility for counting lines: wc.exe.

It's part of a suite of utilities ported from the Unix originals.

http://unxutils.sourceforge.net/

For example;

PS D:\> wc test.pl
     12      26     271 test.pl
PS D:\>

Where 12 == number of lines, 26 == number of words, 271 == number of characters.

If you really have to use Perl:

D:\>perl -lne "END{print $.;}" < test.pl
12
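
If the count still has to come from inside the Perl script, a compromise is to shell out to wc and capture its output. A minimal sketch, assuming wc.exe (UnxUtils) is on the PATH; "large.csv" is a placeholder name:

my $out = `wc -l large.csv`;       # output looks like "  123456 large.csv"
my ($lines) = $out =~ /(\d+)/;     # pull out the leading count
print "$lines\n";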
Ed Guiness
Sure, wc would be the way to go on *nix where it will already be installed -- but is it really worth downloading a separate executable to do something that takes a short line of Perl?
j_random_hacker
Yes, because Cygwin is a must-have for any Windows dev environment.
KenE
This isn't Cygwin but still a must-have. Counting lines in a file is such a common activity that it's definitely worth having this utility.
Ed Guiness
@KenE: Is that sarcasm? FTR UnxUtils are non-Cygwin-based.
j_random_hacker
@edg: I see you added a Perl suggestion, +1.
j_random_hacker
Just use "wc -l filename". This is good advice; a very simple C parser may be faster than one written in Perl, and with wc you don't have to write the parser at all. I use the GnuWin32 tools every day; they are worth getting even if you decide to write a Perl parser: http://gnuwin32.sourceforge.net
daotoad
FTR it seems slightly silly that the asker is "required" to use Perl for this... Also I'd expect wc to be marginally faster (definitely faster startup, but that doesn't matter much for huge files).
j_random_hacker
For me, installing MSYS (http://www.mingw.org/ - I used to use unxutils) AND perl are the first things I do on any Windows system I have to work on.
runrig
Thank god the OP is using Perl. Imagine him using Java and not being allowed to install anything else.
innaM
@Manni: Or imagine if he was only allowed to use COBOL on a locked-down abacus that's on fire. :-P
j_random_hacker
Not sarcasm - I mean Cygwin or some equivalent which gives you these tools. Cygwin happens to be what I'm familiar with.
KenE
+14  A: 

Do you mean lines or rows? A cell may contain line breaks, which would add lines to the file but not rows. If you are guaranteed that no cells contain newlines, then just use the technique in the Perl FAQ. Otherwise, you will need a proper CSV parser like Text::xSV.
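
For the no-embedded-newlines case, here is a sketch along the lines of the perlfaq5 recipe (read the file in large blocks and count newlines), which stays gentle on memory even for a 1.4 GB file; the file name is a placeholder:

# Roughly the perlfaq5 recipe: read fixed-size blocks and count newlines.
# Assumes one row per physical line; "large.csv" is a placeholder name.
# (A final line with no trailing newline would not be counted.)
my $lines = 0;
open my $fh, '<:raw', 'large.csv' or die "large.csv: $!";
my $buffer;
while (sysread $fh, $buffer, 65536) {
    $lines += ($buffer =~ tr/\n//);
}
close $fh;
print "$lines\n";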

jiggy
I apologise, I meant rows.
Alex Wong
You should amend your question, since every other comment is just doing line counting.
jiggy
+1, good point, but it's also worth mentioning that there is no "official" CSV format -- just a collection of loosely-defined, somewhat incompatible formats that disagree on things like how to quote commas and whether line breaks are allowed in cells. Many tools assume 1 row == 1 line.
j_random_hacker
+4  A: 
perl -lne "END { print $. }" myfile.csv

This only reads one line at a time, so it doesn't waste any memory unless each line is enormously long.

j_random_hacker
Lines are not the same thing as CSV rows. Consider fields with embedded newlines, for instance.
brian d foy
@brian: That's true. But it's also true that working with CSV files containing fields with embedded newlines is destined to be painful, because there's no universal agreement across tools on whether or how such things should be encoded -- unfortunately, CSV is not a "standard."
j_random_hacker
A: 

Upvote for edg's answer; another option is to install Cygwin to get wc and a bunch of other handy utilities on Windows.

KenE
IME, Cygwin adds too much complication unless you want to run a pseudo-Unix environment. MinGW and MSYS provide a lighter-weight system that works well for software development. For simple command-line tools, GnuWin32 offers a good selection with low-impact, simple installers.
daotoad
Thanks for the tip - I'll give those a try sometime!
KenE
wc is not the answer since it counts lines, which is not the same as a CSV row. See Axeman's answer.
brian d foy
+3  A: 

This one-liner handles newlines within the rows:

  1. It looks for lines with an odd number of quotes.
  2. It assumes that doubled quotes are the way of indicating a quote within a field.
  3. It uses the awesome flip-flop operator.

    perl -ne 'BEGIN{$re=qr/^[^"]*(?:"[^"]*"[^"]*)*?"[^"]*$/;}END{print"Count: $t\n";}$t++ unless /$re/../$re/'
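
For readers who find the one-liner dense, here is a longhand sketch of the same idea (tracking quote parity explicitly instead of using the flip-flop), written as a script so shell quoting is not an issue. It is an illustration under the doubled-quote convention, not a drop-in replacement:

use strict;
use warnings;

# A physical line starts a new logical row unless the previous line
# left us inside an unclosed quoted field. Doubled quotes ("") add
# two to the count, so they never change the parity.
my $count     = 0;
my $in_quoted = 0;    # true while a quoted field spans physical lines

while (my $line = <>) {
    $count++ unless $in_quoted;         # first physical line of a row
    my $quotes = () = $line =~ /"/g;    # number of " on this line
    $in_quoted ^= $quotes % 2;          # an odd count flips the state
}

print "Count: $count\n";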
    

Consider:

  • wc is not going to work. It's awesome for counting lines, but not CSV rows
  • You should install--or fight to install--Text::CSV or some similar standard package for proper handling (a rough sketch follows after this list).
  • This may get you there, nonetheless.
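
For reference, here is a rough sketch of the proper-parser route with Text::CSV, should the fight to install it ever be won; the file name is a placeholder, and Text::CSV_XS is the faster drop-in:

use strict;
use warnings;
use Text::CSV;

# getline() pulls one logical CSV record at a time, so a row whose
# quoted field contains embedded newlines still counts as one row.
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

open my $fh, '<', 'large.csv' or die "large.csv: $!";   # placeholder name
my $rows = 0;
$rows++ while $csv->getline($fh);
close $fh;

print "Rows: $rows\n";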


EDIT: It slipped my mind that this was Windows:

perl -ne "BEGIN{$re=qr/^[^\"]*(?:\"[^\"]*\"[^\"]*)*?\"[^\"]*$/;}END{print qq/Count: $t\n/;};$t++ unless $pq and $pq = /$re/../$re/;"

The weird thing is that The Broken OS's shell interprets && as its own conditional-exec operator, and I couldn't do anything to change its mind! If I escaped it, it would just pass it through that way to perl.

Axeman
A: 

I was being idiotic; the simple way to do it in the script is:

# Read the file one line at a time, counting as we go.
my $rowCount = 0;
open( my $extract, '<', $extractFileName )
    or die "Cannot read row count of $extractFileName: $!";
while (<$extract>) {
    $rowCount++;
}
close($extract);
Alex Wong