views:

794

answers:

5

Background: I'm trying to create a pure D language implementation of functionality that's roughly equivalent to C's memchr but uses arrays and indices instead of pointers. The reason is so that std.string will work with compile time function evaluation. For those of you unfamiliar w/ D, functions can be evaluated at compile time if certain restrictions are met. One restriction is that they can't use pointers. Another is that they can't call C functions or use inline assembly language. Having the string library work at compile time is useful for some compile time code gen hacks.

Question: How does memchr work under the hood to perform as fast as it does? On Win32, anything that I've been able to create in pure D using simple loops is at least 2x slower even w/ obvious optimization techniques such as disabling bounds checking, loop unrolling, etc. What kinds of non-obvious tricks are available for something as simple as finding a character in a string?

+6  A: 

I would suggest taking a look at GNU libc's source. As for most functions, it will contain both a generic optimized C version of the function, and optimized assembly language versions for as many supported architectures as possible, taking advantage of machine specific tricks.

Chris Young
Thanks, except this is LGPL code and D's standard library is supposed to be permissively licensed. I don't want that to be an issue.
dsimcha
Well, I was suggesting you look at it for technique inspiration, rather than for copying the source.
Chris Young
It is roughly 150 lines of code, about half or more of which are comments, so it explains the optimizations in a fair amount of detail.
Chris Young
http://stackoverflow.com/questions/405208/gpl-code-what-counts-as-a-derivative-work I'm the OP on this question.
dsimcha
I've read that question before, and it's a good question. The answer is that you can't copyright techniques, only the code itself. Also, the FSF are vehemently proposed to software patents or the idea that algorithms and ideas can be effectively copyrighted and their use restricted.
Chris Young
+5  A: 

This implementation of memchr from newlib is one example of someone's optimizing memchr: it's reading and testing 4 bytes at a time (apart from memchr, other functions in the newlib library are here).

Incidentally, most of the the source code for the MSVC run-time library is available, as an optional part of the MSVC installation (so, you could look at that).

ChrisW
i was going to answer with the newlib code of memchr - until i clicked your link and saw it's also about newlib :)
Johannes Schaub - litb
if you like, you can link them to this: http://sourceware.org/cgi-bin/cvsweb.cgi/src/newlib/libc/string/?cvsroot=src , cvs directory containing all the sweet fast string functions of newlib, including memchr.c
Johannes Schaub - litb
+1  A: 

Here is FreeBSD's (BSD-licensed) memchr() from memchr.c. FreeBSD's online source code browser is a good reference for time-tested, BSD-licensed code examples.

void *
memchr(s, c, n)
    const void *s;
    unsigned char c;
    size_t n;
{
    if (n != 0) {
     const unsigned char *p = s;

     do {
      if (*p++ == c)
       return ((void *)(p - 1));
     } while (--n != 0);
    }
    return (NULL);
}
cpeterso
Yeah, I found this too. Nothing fancy here, though, that would explain the ridiculous speed difference.
dsimcha
+2  A: 

memchr like memset and memcpy generally reduce to fairly small amount of machine code. You are unlikely to be able to reproduce that kind of speed without inlining similar assembly code. One major issue to consider in an implementation is data alignment.

One generic technique you may be able to use is to insert a sentinel at the end of the string being searched, which guarantees that you will find it. It allows you to move the test for end of string from inside the loop, to after the loop.

EvilTeach
+3  A: 

On the x86 platform, the faster implementations usually compile down to the rep scasb (REPeated SCAn String for Byte) instruction. (Some compilers may be able to optimize C loops to that, some not.)

CyberShadow
Do you think that'll be faster than the newlib implemention which reads 4 bytes at a time? I don't know but those 8088/8086-era opcodes aren't necessarily the fastest anymore.
ChrisW
Well, given that here we use one instruction (with a prefix) to search an entire string - versus a hand-written loop with a whole bunch of instructions every 4 bytes - I'd think that, intuitively, the one-opcode variant would be faster :)
CyberShadow