Here is a list of obvious and less obvious tips for optimizing a program to take advantage of the 3DNow! technology.
If you don't mind writing assembly code, the main resources you should consult are the 3DNow! Technology Manual and the AMD-K6®-2 Processor Code Optimization Application Note on the AMD site.
In brief, 3DNow! processors (currently only the AMD K6-2, although other CPUs might support it in the future) behave like a Pentium II MMX processor with a few additional assembly instructions. Oh, yes, we have already heard that story... when Intel added the MMX instruction set to the Pentium. It was a fake, because the MMX set is almost useless: the only algorithms that benefit from it are JPEG and MPEG compression/decompression and some alpha blending algorithms. Moreover, the MMX instructions corrupt the FPU registers; as a result, after a group of MMX instructions (and before any floating point instruction) the FPU state must be cleared with the EMMS instruction (this operation is also called context switching). This operation is quite slow, so, in practice:
I can tell you, 3DNow! really makes the difference!
The most interesting new instructions concern floating point calculations. Yes, we already have the FPU for those. But the K6-2 can perform two floating point operations with a single 3DNow! instruction and, in some cases, the CPU may execute two such instructions at the same time. So you can process a 4-component vector in a time comparable to a single FPU operation! That is a 75% performance benefit. Well... it's not always that good, because you have some overhead from the dreaded context switches, but you can typically expect about a 50% gain in many FPU-bound algorithms.
Another thing that 3D programmers will surely appreciate is the ability to compute approximate square roots and divisions. They are extremely fast, and the error is so small that a 3D application can probably ignore it completely. I wrote an approximate 4-vector norm function which is 6 times faster than a regular (non-3DNow!) one.
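To make the "two operations per instruction" idea concrete, here is a minimal plain C model of it. This is only a sketch of the data flow, not 3DNow! code: the type packed2 and the functions packed_mul and scale4 are hypothetical names invented for this illustration, and a real implementation would use the 3DNow! instructions through an assembler or a compiler that supports them.

```c
/* Hypothetical model of one 64-bit 3DNow! register: two packed floats. */
typedef struct {
    float lo;
    float hi;
} packed2;

/* Models a single packed multiply: both lanes are handled by one "instruction". */
static packed2 packed_mul(packed2 a, packed2 b)
{
    packed2 r;
    r.lo = a.lo * b.lo;
    r.hi = a.hi * b.hi;
    return r;
}

/* Scales a 4-component vector with two packed multiplies instead of four
   scalar ones. On the real hardware the two multiplies may also be issued
   at the same time. */
static void scale4(float *v, float k)
{
    packed2 kk = { k, k };
    packed2 a = { v[0], v[1] };
    packed2 b = { v[2], v[3] };

    a = packed_mul(a, kk);
    b = packed_mul(b, kk);

    v[0] = a.lo; v[1] = a.hi;
    v[2] = b.lo; v[3] = b.hi;
}
```

Calling scale4(v, 3.0f) on a four-element float array multiplies every component by 3 using two modelled packed operations.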
The last interesting instruction is called FEMMS, which means "Fast EMMS": the name says it all.
One important thing has to be noted: the 3DNow! instruction set works only on single precision floating point numbers. If you use double precision numbers, then you have to use the plain FPU. By the way: the native Direct3D data type is the float, OpenGL can use both floats and doubles, and many accelerator cards convert doubles to floats at some point during rendering. So: use floats when possible!
This may seem obvious, but... if you want to benefit from your 3DNow! CPU, you should forget Visual C++ 5.0/6.0. Not only does it not optimize for 3DNow!, you cannot even write 3DNow! assembly instructions with it. By the way, did you realize that VC does not even optimize for MMX?
As far as I know, the only compiler that optimizes for 3DNow! is CodeWarrior Pro 4 from Metrowerks. It's a really good compiler. With the same program you can compile C/C++/EC++, Pascal and Java for both Win32 and Mac. The only thing is that the IDE is a little Mac-ish... but you will get used to it in a few days. The CodeWarrior in-line assembler supports both the MMX and 3DNow! instruction sets. But, if you are lazy, the optimizer can produce decent MMX/3DNow! optimized code from your C/C++ source.
If you think that the following two functions do the same thing, you should go back and read a good C/C++ book:
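/* Scales v by 3 using a double precision constant. */
void scale1(float* v)
{
    v[0] = v[0] * 3.0;
    v[1] = v[1] * 3.0;
}

/* Scales v by 3 using a single precision constant. */
void scale2(float* v)
{
    v[0] = v[0] * 3.0f;
    v[1] = v[1] * 3.0f;
}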
Both functions scale the 2-component vector v by a factor of 3. But the first function is needlessly inefficient. The constant 3.0 is a double precision constant, while 3.0f is a single precision one. So the expression
v[0] = v[0] * 3.0
has the following semantics:
1. convert v[0] to double
2. multiply by 3.0 (double/double multiplication)
3. convert the result to float
4. store the result in v[0]
while the expression
v[0] = v[0] * 3.0f
has the following one:
1. multiply v[0] by 3.0f (float/float multiplication)
2. store the result in v[0]
The difference is now apparent. That is the theory. In practice, the x86 FPU does all calculations in extended real format (an 80-bit format that has no equivalent in C). So the operations actually performed by an x86 are:
1. convert v[0] to extended
2. convert 3.0 (3.0f) to extended
3. multiply (extended/extended multiplication)
4. convert the result to float
5. store the result in v[0]
In this scenario the two functions are handled in almost the same way. Yes, the conversion from double to extended is slightly slower than the conversion from float to extended, but the multiplication is so much slower than either that the difference is negligible. That is why many programmers, good and bad, ignore compiler warnings such as "implicit conversion from double to float" or "truncation from 'const double ' to 'float '". The distinction between good and bad programmers is that the former know that the x86 FPU treats both functions the same way, while the latter are unaware that there should be any difference, and think that compiler warnings are just a nuisance.
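Whatever the FPU does internally, the C type of each expression is still different, and that is what the compiler has to respect. A quick way to check this is sizeof, which reports the size of an expression's type without evaluating it. The sketch below is only an illustration; the helper names size_with_double_constant and size_with_float_constant are made up for this example.

```c
#include <stddef.h>

/* Size of the type of (float * double constant): the float operand is
   promoted, so the expression has type double. */
static size_t size_with_double_constant(void)
{
    float x = 1.0f;
    return sizeof(x * 3.0);
}

/* Size of the type of (float * float constant): no promotion is needed,
   so the expression has type float. */
static size_t size_with_float_constant(void)
{
    float x = 1.0f;
    return sizeof(x * 3.0f);
}
```

On common x86 compilers, where a float takes 4 bytes and a double takes 8, the first helper returns 8 and the second returns 4.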
If you program for 3DNow! you can no longer be so careless. As 3DNow! only does single precision mathematics, function scale2 may be optimized to benefit from the new 3DNow! instructions, while function scale1 cannot be optimized without breaking the C semantics. Moral: