I'd be very grateful if you could compare O’Rourke's winning Perl solution to Lundh's Python solution, as I don't know Perl well enough to understand what's going on there. More specifically, I'd like to know what gave the Perl version its 3x advantage: algorithmic superiority, the quality of its C extensions, or other factors?

Wide Finder: Results

+5  A: 

Perl is heavily optimized for text processing. There are so many factors that it's hard to say exactly what makes the difference. Text is represented completely differently internally (UTF-8 versus UTF-16/UTF-32), and the regular expression engines are completely different too. Python's regular expression engine is a custom one and is not used as heavily as Perl's. Very few developers work on it (I think it's largely unmaintained), in contrast to Perl's engine, which is basically the "core of the language".

After all, Perl is the text processing language.

Armin Ronacher
The sample logs are ASCII, as far as I know, and the Python version uses byte strings without any Unicode conversion. So I believe there is no "UTF-8 versus UTF-16" issue here.
Constantin
I agree with Constantin. I don't see what Unicode has to do with it.
Leon Timmermans
+9  A: 

Perl's better regex implementation is one part of the story. However, that can't explain why the Perl implementation scales better: the difference becomes bigger with more processors. For some reason the Python implementation has an issue there.

Leon Timmermans
+1  A: 

The Perl implementation uses the mmap system call. That call maps the contents of a file into a region of the process's address space and returns a pointer to it, so to the program the file appears as a normal segment of memory or buffer. This has performance advantages over normal file I/O (read): one is that no user-space library calls are needed to get access to the data; another is that there are often fewer copy operations (e.g. moving data between kernel and user space).

Perl's strings and regular expressions are 8-bit byte based (as opposed to UTF-16 for Java, for example), so Perl's native 'character type' has the same encoding as the mmapped file.

When the regular expression engine then operates on the mmap-backed variable, it accesses the file data directly via the mmapped memory region, without going through Perl's I/O functions or even libc's I/O functions.
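
To make the idea concrete, here is a rough Python sketch of running a regex straight over an mmapped file. It is not taken from either contest entry: the function name `count_hits` and the pattern are my own assumptions, loosely modelled on the Wide Finder task of counting article fetches in a web server log.

```python
import mmap
import re
import sys
from collections import Counter

# Hypothetical pattern (an assumption, not copied from either solution):
# capture the article path of each "GET /ongoing/When/..." request.
PATTERN = re.compile(rb"GET /ongoing/When/\d{3}x/(\d{4}/\d{2}/\d{2}/[^ .]+) ")


def count_hits(path):
    """Count matches by scanning the whole file as one memory-mapped buffer."""
    counts = Counter()
    with open(path, "rb") as f:
        # Length 0 maps the entire file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            # The regex engine reads the mapped pages directly; there is no
            # read() loop and no splitting into lines.
            for m in PATTERN.finditer(buf):
                counts[m.group(1)] += 1
    return counts


if __name__ == "__main__":
    for key, n in count_hits(sys.argv[1]).most_common(10):
        print(n, key.decode("ascii", "replace"))
```

Note that the pattern is a bytes pattern, so no decoding happens anywhere between the file and the match.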

The use of mmap is probably largely responsible for the performance difference vs. the Python version using the normal Python I/O libraries, which additionally introduce the overhead of looking for line breaks.

The Perl program also supports a -J option to parallelize the processing: open "-|" causes a fork(), after which the file handle in the parent is connected to the child's stdout. The child processes serialize their results to stdout, and the parent deserializes them to coordinate and summarize the results.
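
The same fork-and-pipe pattern can be sketched in Python. This is only an illustration under my own assumptions, not a translation of the Perl program: it uses `os.pipe()` plus `os.fork()` where Perl's open "-|" does both in one step, `pickle` as the serialization format, and a hypothetical `count_words` job standing in for the real log analysis. It is Unix-only because of `os.fork()`.

```python
import os
import pickle
from collections import Counter


def count_words(chunk):
    """Hypothetical per-worker job: count whitespace-separated tokens."""
    return Counter(chunk.split())


def fork_worker(chunk):
    """Fork a child for one chunk; return (pid, readable end of its pipe)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: compute, serialize the result into the pipe, then exit
        # immediately so none of the parent's code runs in the child.
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "wb") as out:
                pickle.dump(count_words(chunk), out)
        finally:
            os._exit(0)
    # Parent: keep only the read end.
    os.close(write_fd)
    return pid, os.fdopen(read_fd, "rb")


def parallel_count(chunks):
    """Start one child per chunk, then merge their deserialized results."""
    workers = [fork_worker(chunk) for chunk in chunks]
    total = Counter()
    for pid, pipe in workers:
        with pipe:
            total.update(pickle.load(pipe))
        os.waitpid(pid, 0)  # reap the child
    return total


if __name__ == "__main__":
    print(parallel_count([b"a b a", b"b c"]))
```

Because the workers are separate processes, they do not share an interpreter lock; the only coordination cost is serializing each partial result once.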

Kyle Burton
The Python version also uses mmap, and Python's regex engine also operates on the mmap directly.
Constantin
A: 

The Perl implementation uses the mmap system call.

This. It avoids buffer copying and provides async I/O.

sean
The Python version is mmap-based too. But can you elaborate on "mmap provides async I/O for Perl version"?
Constantin