I'm trying to write a piece of code that reads a file line by line and stores each line, up to a certain amount of input data. I want to guard against a malicious end-user putting something like a gig of data on one line, in addition to guarding against sucking in an abnormally large file. Doing `$str = <FILE>` will still read in a whole line, and that could be very long and blow up my memory.

C's fgets lets me do this by letting me specify a number of bytes to read on each call, effectively splitting one long line into chunks of my max length. Is there a similar way to do this in Perl? I saw something about sv_gets but am not sure how to use it (though I only did a cursory Google search).

The goal of this exercise is to avoid having to do additional parsing / buffering after reading data. fgets stops after N bytes or when a newline is reached.

EDIT: I think I confused some people. I want to read X lines, each with max length Y. I don't want to read more than Z bytes total, and I would prefer not to read all Z bytes at once. I guess I could just do that and split the lines, but I'm wondering if there's some other way. If that's the best way, then using the read function and doing a manual parse is my easiest bet.

Thanks.

+1  A: 

Use the `read` function (documented in perlfunc).
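
Something like this rough sketch, perhaps (the chunk size, the total cap, and the process_chunk() helper are placeholders of mine, not part of the answer):

my $max_total = 1024 * 1024;    # overall cap on input, e.g. 1 MB
my $total     = 0;

while ($total < $max_total) {
    # read() returns the number of bytes read, 0 at EOF, undef on error
    my $bytes = read $fh, my $chunk, 4096;
    last unless $bytes;
    $total += $bytes;
    process_chunk($chunk);    # hypothetical helper
}

Note that read() pays no attention to newlines, so the chunks still have to be split into lines afterward.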

Konerak
The beauty of `fgets` is that it either reads N bytes or stops at a newline. I don't think `read` stops at a newline.
SB
+4  A: 
sub heres_what_id_do($$) {
    my ($fh, $len) = @_;
    my $buf = '';

    # Read up to $len characters, stopping early at EOF or at a newline.
    # The newline itself is not included in the returned string.
    for (my $i = 0; $i < $len; ++$i) {
        my $ch = getc $fh;
        last if !defined $ch || $ch eq "\n";
        $buf .= $ch;
    }

    return $buf;
}

Not very "Perlish" but who cares? :) The OS (and possibly Perl itself) will do all the necessary buffering underneath.
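
A possible way to call it (the file name and the 4096-character limit are placeholders of mine):

open my $fh, '<', 'input.txt' or die "open: $!";
until (eof $fh) {
    # returns at most 4096 characters; the newline itself is stripped
    my $line = heres_what_id_do($fh, 4096);
    print "got: $line\n";
}
close $fh;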

j_random_hacker
`== '\n'` should be `eq "\n"`. `getc` makes this much simpler than using `read` to get a single character. Benchmarking shows it's slower than mine by about 15%. Interestingly, the 3-arg `for` is significantly faster than `for my $i (0..$len-1)` but not than `my $i; my $end = $len-1; for $i (0..$end)` (which brings it up to parity with mine), indicating that Perl's `for (0..$foo)` iterator optimization is easily defeated.
Schwern
Thanks for the edit, Schwern. It's embarrassing, but I didn't know Perl actually had `getc()`! I'll edit to use that.
j_random_hacker
A: 

You can implement fgets() yourself trivially. Here's one with C's semantics:

sub fgets {
    my ($n, $c) = ($_[1], '');
    ($_[0]) = ('');    # $_[0] aliases the caller's buffer; clear it first
    for (; defined($c) && $c ne "\n" && $n > 0; $n--) {
        $_[0] .= ($c = getc($_[2]));    # appends before the next defined() check
    }
    defined($c) && $_[0];    # returns the buffer, or false if EOF was hit
}

Here's one with PHP's semantics:

sub fgets {
    my ($n, $c, $x) = ($_[1], '', '');
    for (; defined($c) && $c ne "\n" && $n > 0; $n--) {
        $x .= ($c = getc($_[0]));
    }
    ($x ne '') && $x;    # the line itself, or false if nothing was read
}
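
To make the difference in calling conventions concrete, here is a sketch of how each would be used (assuming only one of the two definitions exists in a given script; $fh and the 512-byte limit are placeholders):

# C-style version: pass the buffer in, as with fgets(buf, n, stream) in C
my $buf;
fgets($buf, 512, $fh);
print $buf;

# PHP-style version: the line comes back as the return value (false at EOF)
my $line = fgets($fh, 512);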

If you're trying to implement resource limits (i.e. trying to prevent an untrusted client from eating up all your memory) you really should not be doing it this way. Use ulimit to set up those resource limits before calling your script. A good sysadmin will set up resource limits anyway, but they like it when programmers make startup scripts that set reasonable limits.
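
ulimit itself is a shell builtin; if you would rather set the limits from inside the Perl script at startup, one option (my suggestion, not something this answer specifies) is the BSD::Resource module from CPAN:

use BSD::Resource qw(setrlimit RLIMIT_AS);

# Illustrative cap: limit this process's address space to 64 MB
# (soft and hard limits) before it reads any untrusted input.
my $cap = 64 * 1024 * 1024;
setrlimit(RLIMIT_AS, $cap, $cap)
    or die "couldn't set RLIMIT_AS: $!";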

If you're trying to limit input before you proxy this data to another site (say, limiting SMTP input lines because you know remote sites might not support more than 511 characters), then just check the length of the line after reading it from `<INPUT>` with `length()`.
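
For instance (a minimal sketch; the handle name and the 511-character limit come from the paragraph above, and the error handling is up to you):

while (my $line = <INPUT>) {
    if (length($line) > 511) {    # whatever limit the remote side enforces
        # oversized line: truncate it, reject it, or drop the connection
        next;
    }
    # ... proxy $line onward ...
}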

geocar
Can't... understand... code! It throws a warning at EOF because it concatenates before checking whether $c is defined. While it mirrors C's fgets very admirably, it's not very Perlish. For all its inscrutability, it's no faster than mine or j_random's.
Schwern
@Schwern: Then `no warnings` if the warning bothers you.
geocar
+4  A: 

Perl has no built-in fgets, but File::GetLineMaxLength implements it.

If you want to do it yourself, it's pretty straightforward with `getc`.

sub fgets {
    my($fh, $limit) = @_;

    my $str;
    for (1..$limit) {
        my $char = getc $fh;
        last unless defined $char;
        $str .= $char;
        last if $char eq "\n";
    }

    return $str;    # undef if EOF arrived before any characters were read
}

Concatenating each character to $str is efficient, as Perl reallocates opportunistically: if a string's buffer holds 16 bytes and you concatenate one more character, Perl reallocates it to 32 bytes (32 goes to 64, 64 to 128, and so on) and remembers the length. The next 15 concatenations then require no reallocations or calls to strlen().
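
To get everything the question's edit asks for (at most X lines, each at most Y bytes, no more than Z bytes in total), you could wrap this fgets in a loop along these lines (the limits are placeholders of mine):

# Placeholder limits: X lines, Y bytes per line, Z bytes overall
my ($max_lines, $max_len, $max_total) = (100, 4096, 1_000_000);

my @lines;
my $total = 0;
while (@lines < $max_lines && $total < $max_total) {
    my $line = fgets($fh, $max_len);
    last unless defined $line;    # EOF
    $total += length $line;
    push @lines, $line;    # an over-long line shows up as several chunks
}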

Schwern
I think this is clean, and I saw another one of your answers that discussed preallocating a string in Perl. Combining the two gets rid of the inefficiencies (if any) of constant reallocation since I only need to allocate the max length one time.
SB
Thanks. I don't think preallocation is going to buy you much. In fact, it'll probably be slower, since it's likely slower to preallocate a string in Perl than to let perl do it. You'll also waste a lot of memory, since every string will use the maximum amount. Benchmarking bears this out. If you really want this to be as fast as possible, write an XS wrapper around fgets(). It's fairly trivial (by XS standards).
Schwern
What I meant was to preallocate the string outside the calls to fgets and pass it by reference for your fgets to append to. Though I'm not sure what happens when I assign the string to another. I might as well just let it allocate itself.
SB
@SB I tried that; it's about 5% slower. My guess is that the dereferencing inside the loop slows things down more than you save by preallocating. Using an alias to $_[2] like geocar's doesn't help either (though it doesn't hurt). The rule of thumb for Perl optimization is that you can't beat perl with Perl. You can see the benchmark program here: http://gist.github.com/417919. I don't think you're going to make this much faster by micro-optimizing; there's just a certain amount of overhead to looping over each character of a file in Perl.
Schwern
+1, but I hate to see people worried about a 5% change in speed when writing in an interpreted language.
j_random_hacker
@j_random_hacker Well, it's not really the 5%; it's that the one with the worse interface isn't faster.
Schwern
+2  A: 

As an exercise, I've implemented a wrapper around C's fgets() function. It falls back to a Perl implementation for complicated filehandles (defined as "anything without a fileno") to cover tied handles and the like. File::fgets is on its way to CPAN now; you can pull a copy from the repository.
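
If I'm reading the module's interface right, basic usage looks something like this sketch (the file name and the limit are placeholders, not from the answer):

use File::fgets;

open my $fh, '<', 'input.txt' or die "open: $!";
while (defined( my $chunk = fgets($fh, 1024) )) {
    # each $chunk is at most 1024 bytes, ending early if a newline appears
    print $chunk;
}
close $fh;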

Some basic benchmarking shows it's over 10x faster than any of the implementations here. However, I can't say it's bug-free or that it doesn't leak memory (my XS skills are not that great), but it's better tested than anything else here.

Schwern