views: 891
answers: 9
Summary: I want to take advantage of compiler optimizations and processor instruction sets, but still have a portable application (running on different processors). Normally I could indeed compile 5 times and let the user choose the right one to run.

My question is: how can I automate this, so that the processor is detected at runtime and the right executable is executed without the user having to choose it?


I have an application with a lot of low level math calculations. These calculations will typically run for a long time.

I would like to take advantage of as much optimization as possible, preferably also of (not always supported) instruction sets. On the other hand I would like my application to be portable and easy to use (so I would not like to compile 5 different versions and let the user choose).

Is there a possibility to compile 5 different versions of my code and dynamically run the most optimized version at execution time? By 5 different versions I mean versions built with different instruction sets and different processor-specific optimizations.

I don't care about the size of the application.

At this moment I'm using gcc on Linux (my code is in C++), but I'm also interested in this for the Intel compiler and for the MinGW compiler for compilation to Windows.

The executable doesn't have to be able to run on different OSes, but ideally it would also be possible to automatically select between 32-bit and 64-bit versions.

Edit: Please give clear pointers on how to do it, preferably with small code examples or links to explanations. From my point of view I need a very generic solution, applicable to any random C++ project I have later.

Edit: I assigned the bounty to ShuggyCoUk; he had a great number of pointers to look out for. I would have liked to split it between multiple answers, but that is not possible. I haven't implemented this yet, so the question is still 'open'! Please keep adding and/or improving answers, even though there is no bounty to be given anymore.

Thanks everybody!

+6  A: 

Can you use a script?

You could detect the CPU using a script, and dynamically load the executable that is most optimized for the architecture. It can choose 32-bit/64-bit versions too.

If you are using Linux you can query the CPU with

cat /proc/cpuinfo

You could probably do this with a bash/perl/python script, or with Windows Scripting Host on Windows. You probably don't want to force the user to install a script engine; one that works on the OS out of the box would be best, IMHO.

In fact, on Windows you would probably want to write a small C# app so you can more easily query the architecture. The C# app could then just spawn whichever executable is fastest.

Alternatively you could put your different versions of the code in DLLs or shared objects, then dynamically load them based on the detected architecture. As long as they have the same call signature it should work.

Byron Whitlock
You really don't need script for detecting the CPU -- you can do it with native OS-dependent system calls.
Adam Rosenfield
But if you use a script it becomes portable across OSes and 64/32-bit architectures.
Byron Whitlock
Considering that he's already writing (quite deliberately) OS-dependent code, I don't think it is necessary to ensure that the OS-detection is portable. Though having that part of the application be portable would probably make things easier.
Brian
+16  A: 

Yes, it's possible. Compile all your differently optimised versions as different dynamic libraries with a common entry point, and provide an executable stub that loads and runs the correct library at run-time, via the entry point, depending on a config file or other information.

anon
Thanks! Do you maybe have some more specific pointers on how to compile in that way? And what should the stub look like?
Peter Smit
Under Windows, can you fire up a 64-bit DLL from a 32-bit process? I didn't think you could, but would love to see how you could do it :)
Goz
Then one might provide another layer: a 32-bit loader that, having detected itself running on a 64-bit arch, execs a 64-bit runner, which in turn loads the 64-bit library.
Pavel Shved
Well, essentially that's what I was thinking: fire up a 32-bit process that detects everything it needs to, and then, instead of loading a new DLL, fires off a new process, be that process 32-bit or 64-bit.
Goz
+2  A: 

Since you mention you are using GCC, I'll assume your code is in C (or C++).

Neil Butterworth already suggested making separate dynamic libraries, but that requires some non-trivial cross-platform considerations (manually loading dynamic libraries is different on Linux, Windows, OSX, etc., and getting it right will likely take some time).

A cheap solution is to simply write all of your variants using unique names, and use a function pointer to select the proper one at runtime.

I suspect the extra dereference caused by the function pointer will be amortized by the actual work you are doing (but you'll want to confirm that).

Also, getting different compiler optimizations will likely require different .c/.cpp files, as well as some twiddling of your build tool. But it's probably less overall work than separate libraries (which would need all of this already in one form or another).

jhoule
This is a horrible suggestion and you would have to be nuts to use it. I don't often make such statements, but in this case I feel I must. Do not do this.
anon
I absolutely don't want to have different .cpp files. That is a nightmare to maintain! If I have some optimization for specific platforms in my code I think ifdefs will serve me.
Peter Smit
OK, I feel like I need to defend myself a little here, considering the strength of those comments. First off, my understanding is that you want to compile various versions of a math-intensive routine for the same architecture (e.g. x86), but with different implementations/optimizations (SSE, -O1/O2/O3, etc.). I believe GCC's "-mtune" and "-mfpmath" cannot be controlled by the preprocessor, so you might have to recompile the same .cpp to generate different .o files. Neil's suggestion is to have those end up in different dynamic libraries. Mine was to have them all in the same binary (cont.).
jhoule
What I suggested avoids implementing a cross-platform plugin system. You basically can recompile the same piece of code with different options, but the linker will complain about dupes. Give them different names (generated from the same source with macros if you want), and you have multiple routines doing the same work slightly differently. Having separate .cpp files might be overkill: I just assumed it was easier for the build tool. My main point was just that you could have multiple C routines selected as rapidly as a C++ method call by using a ptr-to-func. That is what DLL entry points are too!
jhoule
My point was mainly about different .cpp files, which I would find a horrible solution. Macros that rename my routines would also clutter my code, but it would indeed be a solution. However, I keep looking for a more generic solution.
Peter Smit
+4  A: 

Have a look at liboil: http://liboil.freedesktop.org/wiki/ . It can dynamically select implementations of multimedia-related computations at run-time. You may find you can use liboil itself and not just its techniques.

camh
A: 

I hope this doesn't expose my ignorance too much but I'm a bit puzzled by both the original question and some of the answers ...

Peter -- do you want a different implementation for each of the platforms on which you will run the program, with each implementation optimised for that platform? That's a perfectly sensible thing to want to do, but typically for 5 platforms I'd expect to have 5 compilations and 5 executables. Which one the user runs then depends on access to the platforms -- perhaps something a job management system would handle, or simply a matter of which system a user logs onto in the morning.

Or, do you mean that you want the same program optimised 5 different ways and installed on one platform? And then choose, at run time, the implementation most optimal for the current job? I think that this would be much harder to implement -- how would the user (or system) know which version was optimal before running the job?

As I say, I think I've misunderstood the question, so if this response is wide of the mark, excuse my ignorance!

Regards

Mark

High Performance Mark
I want to take advantage of compiler optimizations and processor instruction sets, but still have a portable app (running on different processors). Normally I could indeed compile 5 times and let the user choose the right one to run. My question is how I can automate this, so that the processor is detected at runtime and the right executable is executed without the user having to choose it.
Peter Smit
I think he is referring to different processors all implementing the x86 instruction set. For (an obsolete) example, every processor from the Pentium 1 to the Core i7 can run the same basic x86 instructions, but only some can run SSE operations, fewer can run SSE2, fewer still SSE3, and so on. These are associated with useful compiler optimization flags like /arch:SSE2. It's possible to compile a library with each of five different compiler flag permutations, and then select at runtime which version you want to branch into.
Crashworks
@Crashworks, that's correct! I guess it's also possible, and my question is: what is the way to do it?
Peter Smit
Ahhh, now I think I understand: if your app were to be delivered as a single binary file, that file would contain all the processor-specific variants. And if I do understand correctly, I wouldn't do it that way! I think this is an install-time decision, not a run-time decision. But that's just my opinion and not a helpful answer to your original question, so ignore me.
High Performance Mark
+3  A: 

Since you didn't specify whether you have limits on the number of files, I propose another solution: compile 5 executables, and then create a sixth executable that launches the appropriate binary. Here is some pseudocode, for Linux

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAXPATH 1024

/* returns e.g. "myapp_sse2" (name illustrative) -- implementation not shown */
extern char* determine_name_of_specific_version(void);

int main(int argc, char* argv[])
{
    char target_path[MAXPATH];
    char** new_argv;
    char* specific_version = determine_name_of_specific_version();
    snprintf(target_path, sizeof(target_path),
             "/usr/lib/myapp/versions/%s", specific_version);

    /* copy argv and append the terminating NULL */
    new_argv = malloc(sizeof(char*) * (argc + 1));
    memcpy(new_argv, argv, argc * sizeof(char*));
    new_argv[argc] = NULL;
    /* optionally set new_argv[0] to target_path */

    execv(target_path, new_argv);
    return 1; /* only reached if execv fails */
}

On the plus side, this approach allows you to transparently provide the user with both 32-bit and 64-bit binaries, unlike any of the library methods that have been proposed. On the minus side, there is no execv in Win32 (but there is a good emulation in Cygwin); on Windows, you have to create a new process, rather than re-execing the current one.

Martin v. Löwis
+1  A: 

You mentioned the Intel compiler. That is funny, because it can do something like this by default. However, there is a catch. The Intel compiler doesn't insert checks for the appropriate SSE functionality. Instead, it checks whether you have a particular Intel chip. There is still a slow default case. As a result, AMD CPUs do not get suitable SSE-optimized versions. There are hacks floating around that will replace the Intel check with a proper SSE check.

The 32/64-bit difference will require two executables. Both the ELF and PE formats store this information in the executable's header. It's not too hard to start the 32-bit version by default, check if you are on a 64-bit system, and then restart the 64-bit version. But it may be easier to create an appropriate symlink at installation time.

MSalters
What is this Intel functionality called? Or do you have links to documentation and the mentioned hacks?
Peter Smit
+4  A: 

If you wish this to work cleanly on Windows and take full advantage, on 64-bit capable platforms, of the additional 1. address space and 2. registers (likely of more use to you), you must have, at a minimum, a separate process for the 64-bit ones.

You can achieve this by having a separate executable with the relevant PE64 header. Simply using CreateProcess will launch it with the relevant bitness (unless the launched executable is in some redirected location, there is no need to worry about WoW64 folder redirection).

Given this limitation on Windows, it is likely that simply 'chaining along' to the relevant executable will be the simplest option for all the different variants, as well as making it simpler to test an individual one.

It also means your 'main' executable is free to be totally separate depending on the target operating system (since detecting the CPU/OS capabilities is, by its nature, very OS-specific), and you can then do most of the rest of your code as shared objects/DLLs. You can also 'share' the same files between two different architectures if you currently do not feel there is any point in using the differing capabilities.

I would suggest making the main executable capable of being forced into a specific choice, so you can see what happens with 'lesser' versions on a more capable machine (or what errors come up if you try something different).

Other possibilities given this model are:

  • Statically linking to different versions of the standard runtimes (ones with/without thread safety) and using them appropriately if you are running without any SMP/SMT capabilities.
  • Detecting whether multiple cores are present and whether they are real or hyper-threaded (and also whether the OS knows how to schedule effectively in those cases).
  • Checking the performance of things like the system timer/high-performance timers and using code optimized for this behaviour, say if you do anything where you wait for a certain amount of time to expire and thus can know your best possible granularity.
  • Optimizing your choice of code based on cache sizes/other load on the box. If you are using unrolled loops then more aggressive unrolling options may depend on having a certain amount of level 1/2 cache.
  • Compiling conditionally to use doubles/floats depending on the architecture. This is less important on Intel hardware, but if you are targeting certain ARM CPUs, some have actual floating-point hardware support and others require emulation. The optimal code would change heavily, even to the extent that you just use conditional compilation rather than relying on the optimizing compiler(1).
  • Making use of co-processor hardware like CUDA-capable graphics cards.
  • Detecting virtualization and altering behaviour (perhaps trying to avoid file system writes).

As to doing this check, you have a few options, the most useful one on Intel being the cpuid instruction.

Alternatively, re-implement or update an existing detection library using the available documentation on the features you need.

There are quite a lot of separate documents to work through to find out how to detect things.

A large part of what you would be paying for in the CPU-Z library is someone doing all this (and the nasty little issues involved) for you.


  1. be careful with this - it is hard to beat decent optimizing compilers on this
ShuggyCoUk
+1  A: 

Let's break the problem down into its two constituent parts: 1) creating platform-dependent optimized code, and 2) building on multiple platforms.

The first problem is pretty straightforward. Encapsulate the platform dependent code in a set of functions. Create a different implementation of each function for each platform. Put each implementation in its own file or set of files. It's easiest for the build system if you put each platform's code in a separate directory.

For part two I suggest you look at GNU Autotools (Automake, Autoconf, and Libtool). If you've ever downloaded and built a GNU program from source code, you know you have to run ./configure before running make. The purpose of the configure script is to 1) verify that your system has all of the required libraries and utilities needed to build and run the program, and 2) customize the Makefiles for the target platform. Autotools is the set of utilities for generating the configure script.

Using autoconf, you can create little macros to check that the machine supports all of the CPU instructions your platform-dependent code needs. In most cases, the macros already exist; you just have to copy them into your autoconf script. Then automake and autoconf can set up the Makefiles to pull in the appropriate implementation.

All this is a bit much to create an example for here, and it takes a little time to learn. But the documentation is all out there; there is even a free book available online. And the process is applicable to your future projects. For multi-platform support, this is really the most robust and easiest way to go, I think. A lot of the suggestions posted in other answers are things that Autotools deals with (CPU detection, static & shared library support) without you having to think about it too much. The only wrinkle you might have to deal with is finding out whether Autotools is available for MinGW. I know it is part of Cygwin, if you can go that route instead.

Steve K