I need to read a 200 MB "space"-separated file line-by-line and collect its contents into an array.

Every time I run the script, Perl throws an "out of memory" exception, but I don't understand why!

Some advice please?

#!/usr/bin/perl -w
use strict;
use warnings;

open my $fh, "<", "../cnai_all.csd";
my @parse = ();

while (<$fh>) {
     my @words = split(/\s/,$_);
     push (@parse, \@words);
}

print scalar @parse;

The cnai file looks like this: it contains 11,000 rows with 4,200 values per line, separated by "space".

VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 
VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 
VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 
VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 

The code above is just a stripped-down sample.
The final script will store all values in a hash and write them to a database later.

But first, I have to solve that memory problem!

+2  A: 

Your while loop does not read from the file. You should have

<$fh>

or something inside the parentheses.

antti.huima
Markdown formatting issue in the question. Edited.
Greg Bacon
+6  A: 

That would be because... you're running out of memory!

You're not merely storing 200MB of data. You're creating a new list data structure for each line, with all of its associated overhead, and also creating a bunch of separate string objects for each word, with all of their associated overhead.

Edit: As an example of the kind of overhead we're talking about here, each and every value (and that includes strings) has the following overhead:

/* start with 2 sv-head building blocks */
#define _SV_HEAD(ptrtype) \
    ptrtype sv_any;     /* pointer to body */   \
    U32     sv_refcnt;  /* how many references to us */ \
    U32     sv_flags    /* what we are */

#define _SV_HEAD_UNION \
    union {             \
    char*   svu_pv;     /* pointer to malloced string */    \
    IV      svu_iv;         \
    UV      svu_uv;         \
    SV*     svu_rv;     /* pointer to another SV */     \
    SV**    svu_array;      \
    HE**    svu_hash;       \
    GP* svu_gp;         \
    }   sv_u


struct STRUCT_SV {      /* struct sv { */
    _SV_HEAD(void*);
    _SV_HEAD_UNION;
};

So that's at least 4 32-bit values per Perl object.
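
To put a rough number on that overhead, the Devel::Size module from CPAN can report how much memory a Perl structure actually occupies. A minimal sketch, assuming Devel::Size is installed and using a synthetic row shaped like the ones described in the question:

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Size qw(total_size);   # CPAN module, assumed installed

# Build one synthetic row roughly the shape described in the question:
# 4200 space-separated fields.
my $line  = join ' ', ('VALUE_GROUP_A') x 4200;
my @words = split ' ', $line;

printf "raw line:         %d bytes\n", length $line;
printf "array of scalars: %d bytes\n", total_size(\@words);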

Jonathan Feinberg
But that shouldn't be *that* much overhead, should it? Unless it's a really degenerate file, I'd be surprised if it could more than, say, double the effective size in memory.
fennec
Hu hu hu...degenerate files....
Paul Nathan
Those aren't list structures. They are **arrays**. In most languages, there is no distinction. In Perl, there is one.
daotoad
I've just finished talking to floopy-doo for an hour (we work in the same company). Turns out the problem went away as soon as we stopped relying on $_ to carry the lines from the while loop to the split statement... STRANGE!
lexu
daotoad, that distinction is *completely* irrelevant to the point at hand.
Jonathan Feinberg
@lexu I've tried "my $line = $_; split /\s/, $line" and "while (defined(my $line = <$fh>))" but saw the same behavior. Please provide more specific information about the workaround.
Greg Bacon
+4  A: 

Generally this means you are running out of memory for Perl, but it's possible you aren't running out of system memory. First up, there are ways you can get more information on perl's memory usage in the perldebguts documentation -- though you may find yourself recompiling perl to use them. (Also note the warning in that doc about perl's hunger for memory...)

However, on many operating systems it is possible for memory limits to be set per-process or per-user. If, for example, you're using Linux (or another POSIX system), you might need to alter your ulimits. Type 'ulimit -a' and look at your memory sizes; it's possible your 'max memory size' is below the memory in your machine -- or you have a limited data seg size. You can then reset it with the appropriate option, e.g. ulimit -d 1048576 for a 1 GB data seg size limit.
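
If you would rather inspect or raise that limit from inside the script, the BSD::Resource module from CPAN wraps the same getrlimit/setrlimit calls the shell uses; a sketch under that assumption (note that ulimit -d reports kilobytes, while getrlimit/setrlimit work in bytes):

#!/usr/bin/perl
use strict;
use warnings;
use BSD::Resource qw(getrlimit setrlimit RLIMIT_DATA);   # CPAN module, assumed installed

# Current soft and hard limits on the data segment, in bytes.
my ($soft, $hard) = getrlimit(RLIMIT_DATA);
printf "data seg size: soft=%s hard=%s\n", $soft, $hard;

# Try to raise the soft limit to 1 GB; this can only succeed up to the hard limit.
setrlimit(RLIMIT_DATA, 1024 ** 3, $hard)
    or warn "could not raise RLIMIT_DATA\n";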

Of course, there is another option: process the file line-by-line, if your situation allows it. (The example code above can be rewritten in such a way.)

Arthur Shipkowski
A: 

The database you are using probably has a bulk import function. I would try that first.

If you need to do something with each row before putting it into the database (assuming the operations do not require referencing other rows), you should insert the row into the database as soon as processing is complete (turn AutoCommit off) rather than trying to store all the data in memory.

If the processing of each row depends on information in other rows, then you can use Tie::File to treat the input file as an array of lines. Again, don't try to store the contents of each line in memory. When processing is complete, ship it off to the database.
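
A minimal Tie::File sketch (the file name comes from the question; Tie::File ships with modern perls, everything else here is illustrative):

#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

# The tied array presents the file as an array of lines, but rows are read
# from disk on demand instead of being held in memory all at once.
tie my @rows, 'Tie::File', '../cnai_all.csd'
    or die "cannot tie ../cnai_all.csd: $!";

print scalar(@rows), " rows\n";

# A row later in the file can be consulted while an earlier one is processed.
my @first = split ' ', $rows[0];
my @last  = split ' ', $rows[-1];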

Sinan Ünür
The problem is in extracting the data from a somewhat (ha, I've heard stronger terms) convoluted comma-delimited format, since the last line in the file (12,000+ lines of 4,500 fields each) can change/augment something that was "omitted" in the first line. Since the content of the file goes to 12 different DB tables, I suggested going over the file in two passes: separate it into 12 groups first, then collect/compact each file into records in pass two. I'll read up on Tie::File, thx for the suggestion.
lexu
+4  A: 

Your description of the data in cnai_all.csd as having many rows suggests that each line can be processed independently. If so, rather than reading all 46,200,000 values into core at once, use

while (<$fh>) {
  my @words = split /\s/, $_;
  insert_row \@words;
}

where insert_row is a sub you'd define to insert that row into your database.
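
One possible shape for insert_row, sketched here with DBI and a made-up SQLite table (the DSN, table, and column names are illustrative, not taken from the question; DBD::SQLite is assumed installed):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Illustrative connection and schema -- swap in your real DSN and table.
my $dbh = DBI->connect('dbi:SQLite:dbname=cnai.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS cnai (cell TEXT, nw TEXT, bsc TEXT)');
my $sth = $dbh->prepare('INSERT INTO cnai (cell, nw, bsc) VALUES (?, ?, ?)');

sub insert_row {
    my ($words) = @_;
    $sth->execute( @{$words}[0 .. 2] );   # keep only the columns the table needs
}

while (<>) {
    insert_row([ split ]);
}

$dbh->commit;      # commit once at the end (or in batches), since AutoCommit is off
$dbh->disconnect;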

Note that split /\s/ is often a mistake. The perlfunc documentation on split explains:

As a special case, specifying a PATTERN of space (' ') will split on white space just as split with no arguments does. Thus, split(' ') can be used to emulate awk's default behavior, whereas split(/ /) will give you as many null initial fields as there are leading spaces. A split on /\s+/ is like a split(' ') except that any leading whitespace produces a null first field. A split with no arguments really does a split(' ', $_) internally.

In the nominal case, everything is fine:

  DB<1> x split /\s/, "foo bar baz"
0  'foo'
1  'bar'
2  'baz'

But what if there are multiple spaces between fields? Does that mean an empty field or just a "wide" separator?

  DB<2> x split /\s/, "foo  bar baz"
0  'foo'
1  ''
2  'bar'
3  'baz'

What about leading whitespace?

  DB<3> x split /\s/, " foo bar baz"
0  ''
1  'foo'
2  'bar'
3  'baz'

The default behavior of split isn't arbitrary. Let the tool work for you!

while (<$fh>) {
  insert_row [ split ];
}
Greg Bacon
+2  A: 

At last I have found a more suitable solution for my problem:

While researching other parsers I've had to develop, I learned about the DBD::CSV module.

DBD::CSV allows me to select only the needed columns (out of >4,000) of the "whitespace"-separated fields. This reduces memory usage and performs quite well.

More at DBD-CSV @ CPAN.org

Thanks to gbacon I've changed my strategy from reading the whole file in one go to reading it part by part. DBD::CSV makes this possible without much coding.

#!/usr/bin/perl  -w

use strict;
use warnings;
use DBI;
## -------------------------------------------------------------------------##

## -------------------------------------------------------------------------##
## SET GLOBAL CONFIG #############
my $globalConfig = {
                _DIR => qq{../Data},
                _FILES => {
                    'cnai_all.csd'   => '_TEST'
                    }               
                };
## -------------------------------------------------------------------------##


## -------------------------------------------------------------------------##
my $sTime = time();

my $sepChar = " ";
my $csv_dbh = DBI->connect("DBI:CSV:f_dir=".$globalConfig->{_DIR}.";");

$csv_dbh->{csv_eol} ="\n";
#$csv_dbh->{csv_quote_char} ="'";
#$csv_dbh->{csv_escape_char} ="\\";
$csv_dbh->{csv_null} = 1;
$csv_dbh->{csv_quote_char} = '"';
$csv_dbh->{csv_escape_char} = '"';
$csv_dbh->{csv_sep_char} = "$sepChar";
$csv_dbh->{csv_always_quote} = 0;
$csv_dbh->{csv_quote_space} = 0;
$csv_dbh->{csv_binary} = 0;
$csv_dbh->{csv_keep_meta_info} = 0;
$csv_dbh->{csv_allow_loose_quotes} = 0;
$csv_dbh->{csv_allow_loose_escapes} = 0;
$csv_dbh->{csv_allow_whitespace} = 0;
$csv_dbh->{csv_blank_is_undef} = 0;
$csv_dbh->{csv_empty_is_undef} = 0;
$csv_dbh->{csv_verbatim} = 0;
$csv_dbh->{csv_auto_diag} = 0;


my @list = $csv_dbh->func('list_tables');
my $sth = $csv_dbh->prepare("SELECT CELL,NW,BSC,n_cell_0 FROM cnai_all.tmp");


#print join ("\n",@list);

print "\n-------------------\n";

$sth->execute();
while (my $row = $sth->fetchrow_hashref) {
    # just print the hash reference
    print "$row\n";
}
$sth->finish();

print "\n finish after ".(time()-$sTime)." sec ";

On my machine this runs in roughly 20 seconds and uses no more than 10 MB of memory.

Floopy-Doo