views: 1405
answers: 8

I'm trying to optimize handling of large datasets using mmap. A dataset is in the gigabyte range. The idea was to mmap the whole file into memory, allowing multiple processes to work on the dataset concurrently (read-only). It isn't working as expected though.

As a simple test I mmap the file (using Perl's Sys::Mmap module, specifically the "mmap" sub, which I believe maps directly to the underlying C function) and have the process sleep. When doing this, the code spends more than a minute before returning from the mmap call, despite the test doing nothing - not even a read - from the mmap'ed file.

Guessing, I thought maybe Linux required the whole file to be read when first mmap'ed, so after the file had been mapped in the first process (while it was sleeping), I invoked a simple test in another process which tried to read the first few megabytes of the file.

Surprisingly, it seems the second process also spends a lot of time before returning from the mmap call, about the same time as mmap'ing the file the first time.

I've made sure that MAP_SHARED is being used and that the process that mapped the file the first time is still active (that it has not terminated, and that the mmap hasn't been unmapped).

I expected an mmap'ed file would allow me to give multiple worker processes effective random access to the large file, but if every mmap call requires reading the whole file first, it's a bit harder. I haven't tested whether access is fast after the initial delay in long-running processes, but I expected that using MAP_SHARED from another, separate process would be sufficient.

My theory was that mmap would return more or less immediately, and that Linux would load the blocks more or less on demand, but the behaviour I am seeing is the opposite, indicating that it requires reading through the whole file on each call to mmap.

Any idea what I'm doing wrong, or if I've completely misunderstood how mmap is supposed to work?

A: 

That does sound surprising. Why not try a pure C version?

Or try your code on a different OS/perl version.

Rhythmic Fistman
I've looked at the Perl OS interface, and it calls the C version more or less directly, but unless I figure it out I will probably test a C version as well. As for OS/Perl version, I've tested on two systems, both x86_64. One is Ubuntu 8.04.2 (Linux 2.6.24-22, Perl 5.8.8) and the other Ubuntu 9.04 (Linux 2.6.28-13, Perl 5.10.0). Same behaviour. The second system was a laptop, and I can definitely confirm that there is serious disk I/O involved when mmap is called from my tests.
Marius Kjeldahl
+7  A: 

If you have a relatively recent version of Perl, you shouldn't be using Sys::Mmap. You should be using PerlIO's mmap layer.

Can you post the code you are using?
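For example, a minimal sketch of the :mmap layer (assuming a perl where that layer is available; the file name is illustrative):

#!/usr/bin/perl
use strict; use warnings;

# Open through the PerlIO mmap layer; reads then go through normal
# file operations, with the data paged in via mmap behind the scenes.
open (my $fh, "<:mmap", "test.bin")
    || die "open: $!";

read ($fh, my $buf, 1024)
    || die "read: $!";
print length ($buf), " bytes read\n";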

Chas. Owens
Agreed, the PerlIO mmap layer is probably preferable, as it would also allow the same code to run with or without mmap'ing by simply adding or removing the mmap attribute. Regardless, I found the problem, posted the code, problem solved.
Marius Kjeldahl
Make that problem solved up to 2 GB. For larger files Perl still has problems; see my other answer related to this.
Marius Kjeldahl
+10  A: 

Ok, found the problem. As suspected, neither Linux nor Perl was to blame. To open and access the file I do something like this:

#!/usr/bin/perl
# Create 1 GB file if you do not have one:
# dd if=/dev/urandom of=test.bin bs=1048576 count=1000
use strict; use warnings;
use Sys::Mmap;

open (my $fh, "<test.bin")
    || die "open: $!";

my $t = time;
print STDERR "mmapping.. ";
mmap (my $mh, 0, PROT_READ, MAP_SHARED, $fh)
    || die "mmap: $!";
my $str = unpack ("A1024", substr ($mh, 0, 1024));
print STDERR " ", time-$t, " seconds\nsleeping..";

sleep (60*60);

If you test that code, there are no delays like those I found in my original code, and after creating the minimal sample (always do that, right?) the reason suddenly became obvious.

The error was that, in my code, I treated the $mh scalar as a handle, something which is lightweight and can be moved around easily (read: passed by value). Turns out it's actually a gigabyte-long string, definitely not something you want to move around without creating an explicit reference (Perl lingo for a "pointer"/handle value). So if you need to store it in a hash or similar, make sure you store \$mh, and dereference it when you need to use it, like ${$hash->{mh}}, typically as the first parameter to substr or similar.
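
For illustration, a minimal sketch of keeping a reference to the mapped scalar instead of copying it (the hash key is hypothetical):

# Store a reference to the mapped scalar, not the scalar itself;
# assigning $mh anywhere would copy the whole gigabyte-sized string.
my $worker = { mh => \$mh };

# Dereference it when reading, e.g. the first kilobyte:
my $str = unpack ("A1024", substr (${$worker->{mh}}, 0, 1024));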

Marius Kjeldahl
+1 for following up with a detailed explanation.
RichieHindle
Use the 3-arg form of open().
Brad Gilbert
+2  A: 

On 32-bit systems the address space available for mmap()s is rather limited (and varies from OS to OS). Be aware of that if you're using multi-gigabyte files and you are only testing on a 64-bit system. (I would have preferred to write this in a comment, but I don't have enough reputation points yet.)

knweiss
+1. Looks like a valid answer which addresses the asked question to me, so thank you for not posting it as a comment.
Dave Sherohman
As I've posted in my other answer, even on 64-bit systems there are still problems for larger files (>2 GB). Your answer is correct though. I'm already 64-bit on all my machines, even the laptop, so it's not an issue for me.
Marius Kjeldahl
A: 

See Wide Finder for Perl performance with mmap. But there is one big pitfall: if your dataset is on a classical HD and you read from multiple processes, you can easily fall into random access and your I/O can drop to unacceptable levels (20-40 times slower).

Hynek -Pichi- Vychodil
What I am trying to do is random access by design, from multiple processes, making sure only the parts of the file most often accessed remain in memory at all times. What pattern would you suggest if random access from multiple processes and a huge file is required?
Marius Kjeldahl
If you *really* need random access to a huge file, there is no better solution.
Hynek -Pichi- Vychodil
+1  A: 

One thing that can help performance is the use of madvise(2), probably most easily done via Inline::C. madvise lets you tell the kernel what your access pattern will be like (e.g. sequential, random, etc.).
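
For illustration, a rough sketch of one way to do this with Inline::C (the helper name is made up, and it assumes the scalar's string buffer points at the page-aligned start of the Sys::Mmap mapping):

use strict; use warnings;
use Sys::Mmap;
use Inline C => <<'END_C';
#include <sys/mman.h>

/* Hypothetical helper: tell the kernel the buffer backing the scalar
   (assumed to be the mmap'ed region) will be accessed randomly. */
int advise_random(SV *sv) {
    STRLEN len;
    char *addr = SvPV(sv, len);
    return madvise(addr, len, MADV_RANDOM);
}
END_C

open (my $fh, "<", "test.bin") || die "open: $!";
mmap (my $mh, 0, PROT_READ, MAP_SHARED, $fh) || die "mmap: $!";
advise_random ($mh) == 0 || warn "madvise failed: $!";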

A: 

Ok, here's another update. Using Sys::Mmap or PerlIO's ":mmap" attribute both work fine in Perl, but only up to 2 GB files (the magic 32-bit limit). Once the file is more than 2 GB, the following problems appear:

Using Sys::Mmap and substr for accessing the file, it seems that substr only accepts a 32-bit int for the position parameter, even on systems where Perl supports 64-bit integers. There's at least one bug posted about it:

#62646: Maximum string length with substr

Using open(my $fh, "<:mmap", "bigfile.bin"), once the file is larger than 2 GB, it seems Perl will either hang or insist on reading the whole file on the first read (not sure which; I never ran it long enough to see if it completed), leading to dead slow performance.

I haven't found any workaround to either of these, and I'm currently stuck with slow (non-mmap'ed) file operations for working on these files. Unless I find a workaround, I may have to implement the processing in C or another language with better support for mmap'ing huge files.

Marius Kjeldahl
Try using mmap from Sys::Mmap directly to create a sliding window in the scalar.
Chas. Owens
Thanks, that's certainly a workaround. It would necessitate keeping track of the pointer into the file and map/unmapping when necessary, which probably affects performance. But it's probably still faster than straight file IO.
Marius Kjeldahl
Did some benchmarking, and it confirms that with dynamic mapping/unmapping using a segment size of 2 GB, and assuming segment switches are fairly infrequent, mmap with unmapping/remapping is some 30-40% faster than straight file I/O on a 3 GB file. On a 2 GB file the differences are smaller, but I suspect that's because my laptop cached most of the file during the random accesses anyway. So at least I have a solution that works, although not as cleanly as I would have hoped. No need for further optimization at this stage though.
Marius Kjeldahl
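For reference, a rough sketch of the sliding-window approach suggested above (the window size, names and usage are illustrative, it assumes a 64-bit perl, and requests straddling a window edge or the end of the file would need extra handling):

use strict; use warnings;
use Sys::Mmap;

# A 1 GB window keeps every substr offset well below the 32-bit limit.
use constant WINDOW => 1024 * 1024 * 1024;

my ($win_start, $win) = (-1, undef);

# Hypothetical helper: return $len bytes starting at absolute $pos,
# remapping the window only when $pos falls outside the current one.
sub read_at {
    my ($fh, $pos, $len) = @_;
    my $start = $pos - ($pos % WINDOW);   # window-aligned, so page-aligned
    if ($start != $win_start) {
        munmap ($win) if defined $win;    # drop the old mapping
        mmap ($win, WINDOW, PROT_READ, MAP_SHARED, $fh, $start)
            || die "mmap: $!";
        $win_start = $start;
    }
    return substr ($win, $pos - $win_start, $len);
}

open (my $fh, "<", "bigfile.bin") || die "open: $!";
my $chunk = read_at ($fh, 2_500_000_000, 1024);   # read past the 2 GB mark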
A: 

If I may plug my own module: I'd advise using File::Map instead of Sys::Mmap. It's much easier to use, and is less crash-prone than Sys::Mmap.
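
For reference, a minimal sketch of the File::Map interface as I understand its documented map_file/advise functions (the file name is illustrative):

use strict; use warnings;
use File::Map qw(map_file advise);

# Map the whole file read-only; $map then behaves like a read-only
# string backed by the mapping.
map_file my $map, "bigfile.bin";
advise $map, "random";   # hint the kernel about the access pattern

my $chunk = substr ($map, 0, 1024);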

Leon Timmermans
Here's a suggestion for a new, very useful feature, based on my observations of Perl described in this thread (memory-mapped files only working up to 2 GB): if the user maps a file larger than 2 GB, use a segmented approach with a "custom" read function that automatically unmaps/remaps as necessary. At least until the 2 GB Perl "bug" is fixed.
Marius Kjeldahl