Does __fastcall make a difference for C++ classes?
We're running through a routine round of code reviews of the audio engine, and I noticed the following code (obscured):
HRESULT __fastcall CSomeClass::SomeMethod(SomeParameters);
I looked at it a couple of times, because it seemed like it was wrong. The thing that caught my eye was the "__fastcall" declaration. __fastcall is a Microsoft C++ extension that allows the compiler to put the first 2 DWORD parameters to the routine into the ECX and EDX registers (obviously it's 32-bit x86 only).
But when compiling C++ code, the default calling convention is "thiscall", and in the thiscall convention, the "this" pointer is passed in the ECX register, which seems to collide with the __fastcall declaration.
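To make the question concrete, here is a minimal sketch of the three kinds of declaration involved. The names (Widget, ThiscallSum, FastcallSum, FreeFastcallSum) and the DEMO_FASTCALL macro are made up for illustration and are not from the code under review; the macro is only there so the sketch still compiles on compilers and targets where __fastcall means nothing.

```cpp
// Hypothetical macro: __fastcall only applies to 32-bit x86 builds with
// MSVC-compatible compilers, so it expands to nothing everywhere else.
#if defined(_MSC_VER) && defined(_M_IX86)
#define DEMO_FASTCALL __fastcall
#else
#define DEMO_FASTCALL
#endif

// Free function: with __fastcall, a goes in ECX and b goes in EDX.
int DEMO_FASTCALL FreeFastcallSum(int a, int b)
{
    return a + b;
}

struct Widget
{
    int base;

    // Default member convention (thiscall on 32-bit x86 MSVC):
    // "this" travels in ECX, a and b are pushed on the stack.
    int ThiscallSum(int a, int b)
    {
        return base + a + b;
    }

    // __fastcall on a member: "this" also wants a register, so where
    // do a and b end up? That is the question being asked here.
    int DEMO_FASTCALL FastcallSum(int a, int b)
    {
        return base + a + b;
    }
};
```

All three compute the same kind of result; the only thing that can differ is how the arguments reach the callee.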
So does it make a difference? I could have left a code review comment and made the person who owned the code run through the exercise, but I figured why not figure out the answer myself? And, to be honest, I found the path to the answer almost more interesting than the answer itself.
As I usually do in these cases, I wrote a tiny little test application to test it out:
class fctest
{
    int _member;
public:
    fctest(void);
    ~fctest(void);
    // Explicitly marked __fastcall.
    int __fastcall FastcallFunction(int *param1, int *param2)
    {
        return *param1 * *param2;
    }
    // Uses the default member calling convention (thiscall).
    int ThiscallFunction(int *param1, int *param2)
    {
        return *param1 * *param2;
    }
};
int _tmain(int argc, _TCHAR* argv[])
{
    fctest test;
    int param1, param2;
    int result;
    result = test.FastcallFunction(&param1, &param2);
    result = test.ThiscallFunction(&param1, &param2);
    return 0;
}
I compiled it for "Retail", and then I looked at the generated output. Somewhat to my surprise, the code generated was:
main:
xor eax, eax
ret
Yup, the compiler had optimized out my entire program. Crud, back to the drawing board.
Try #2:
int _tmain(int argc, _TCHAR* argv[])
{
    fctest test;
    int param1, param2;
    int result;
    result = test.FastcallFunction(&param1, &param2);
    printf("%d: %d: %d", param1, param2, result);
    result = test.ThiscallFunction(&param1, &param2);
    printf("%d: %d: %d", param1, param2, result);
    return 0;
}
This one was somewhat better:
main:
mov eax, [sp]
imul eax, [sp+4]
<call to printf #1>
<call to printf #2>
xor eax, eax
ret
Hmm, that wasn't much of an improvement. The compiler realized that FastcallFunction and ThiscallFunction did the same thing and not only did it inline the call, but it optimized out the 2nd call.
Try #3:
int _tmain(int argc, _TCHAR* argv[])
{
    fctest test;
    int param1, param2;
    int result;
    param1 = rand();
    param2 = rand();
    result = test.FastcallFunction(&param1, &param2);
    printf("%d: %d: %d", param1, param2, result);
    param1 = rand();
    param2 = rand();
    result = test.ThiscallFunction(&param1, &param2);
    printf("%d: %d: %d", param1, param2, result);
    return 0;
}
Try #3's code:
main:
call rand
mov [sp], eax
call rand
mov [sp], eax
mov eax, [sp]
imul eax, [sp+4]
<call to printf #1>
call rand
mov [sp], eax
call rand
mov [sp], eax
mov eax, [sp]
imul eax, [sp+4]
<call to printf #2>
xor eax, eax
ret
Much better, now at least the work for both calls shows up in the output. But the stupid functions are STILL inlined, so I haven't learned anything yet.
Try #4: I moved fctest into its own source file (I'm not going to show the source code for this one).
The code for this one finally got it right:
param1 = rand();
00401029 call rand (401131h)
0040102E mov dword ptr [esp+4],eax
param2 = rand();
00401032 call rand (401131h)
00401037 mov dword ptr [esp],eax
result = test.FastcallFunction(&param1, &param2);
0040103A lea eax,[esp]
0040103D push eax
0040103E lea edx,[esp+8]
00401042 lea ecx,[esp+0Ch]
00401046 call fctest::FastcallFunction (4010E0h)
printf("%d: %d: %d", param1, param2, result);
0040104B mov ecx,dword ptr [esp]
param1 = rand();
00401062 call rand (401131h)
00401067 mov dword ptr [esp+4],eax
param2 = rand();
0040106B call rand (401131h)
00401070 mov dword ptr [esp],eax
result = test.ThiscallFunction(&param1, &param2);
00401073 lea eax,[esp]
00401076 push eax
00401077 lea ecx,[esp+8]
0040107B push ecx
0040107C lea ecx,[esp+10h]
00401080 call fctest::ThiscallFunction (4010F0h)
So what's in all this gobbledygook?
Well, the relevant parts are the instructions from 0x40103A to 0x401046 and from 0x401073 to 0x401080. Side by side, they are:
0040103A lea eax,[esp]                              | 00401073 lea eax,[esp]
0040103D push eax                                   | 00401076 push eax
0040103E lea edx,[esp+8]                            | 00401077 lea ecx,[esp+8]
                                                    | 0040107B push ecx
00401042 lea ecx,[esp+0Ch]                          | 0040107C lea ecx,[esp+10h]
00401046 call fctest::FastcallFunction (4010E0h)    | 00401080 call fctest::ThiscallFunction (4010F0h)
Note that in both functions, the ECX register is loaded with the address of "test" (the "this" pointer). But in the fastcall function, the 1st parameter is loaded into the EDX register; in the thiscall function, it's pushed onto the stack. The 2nd parameter is pushed onto the stack in both cases.
So yes, __fastcall makes a difference for C++ classes. Not as much as it does for C functions, but it DOES make a difference.
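If you want to repeat the experiment without splitting the class into a second source file, one alternative (not what I did above) is to ask the compiler not to inline the two methods. The sketch below assumes two made-up macros, DEMO_NOINLINE and DEMO_FASTCALL, that map to __declspec(noinline) and __fastcall on 32-bit x86 MSVC and to the closest equivalent or nothing elsewhere. I've dropped the unused _member field and the constructor and destructor to keep it short. Treat it as a starting point: with whole-program optimization the compiler may still rearrange things, so check the disassembly rather than trusting the annotations.

```cpp
// Hypothetical helper macros, not part of the original test program.
#if defined(_MSC_VER)
#define DEMO_NOINLINE __declspec(noinline)
#else
#define DEMO_NOINLINE __attribute__((noinline))
#endif

#if defined(_MSC_VER) && defined(_M_IX86)
#define DEMO_FASTCALL __fastcall
#else
#define DEMO_FASTCALL   // no effect outside 32-bit x86 MSVC builds
#endif

class fctest
{
public:
    // Explicit __fastcall, with inlining suppressed so a real call is emitted.
    DEMO_NOINLINE int DEMO_FASTCALL FastcallFunction(int *param1, int *param2)
    {
        return *param1 * *param2;
    }

    // Default member convention (thiscall on 32-bit x86 MSVC), also not inlined.
    DEMO_NOINLINE int ThiscallFunction(int *param1, int *param2)
    {
        return *param1 * *param2;
    }
};
```

Calling both methods from main with values from rand(), as in Try #3, and printing the results keeps the optimizer from throwing the calls away.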
Comments
- Anonymous
October 10, 2005
Maybe next time turn off optimization instead of rewriting your test case over and over again? Or does turning off optimization disable __fastcall? - Anonymous
October 10, 2005
Joe, that's a good point, and the answer is "I don't know". The optimizer introduces a LOT of changes to the system, it might make a difference here, it might not. But without turning the optimizer on, I wouldn't be certain how the code behaves with the optimizer enabled.
I might have been able to guess that the compiler wouldn't optimize the thiscall code but I wouldn't be sure. Actually, in this case, it probably would be safe, but... - Anonymous
October 10, 2005
Great article!
I am wondering how you are going to deal with the code after the review. As you pointed out, the performance gain is not significant. If the code is left untouched, some time later it may confuse another maintainer. Of course this is not important enough, I just want to know if this kind of small optimization is encouraged at MS (personally I am reluctant to do it). - Anonymous
October 10, 2005
This should make sense if you think about it. C++ member functions take an implicit "this" as the first parameter. The first declared parameter is really the second actual parameter. - Anonymous
October 10, 2005
Nicholas, yup, I should have pointed that out, but...
yawl, actually the perf gain CAN be significant, especially on 32-bit x86 machines where the register set is so tightly constrained. For x64 and ia64 machines it's not as important because there are more general purpose registers.
IIRC, back in the NT4 days, the entire NT kernel was recompiled with __fastcall and it got something like a 10% overall speedup.
So being able to save the transfer through memory of even one parameter to a routine can result in huge perf improvements. - Anonymous
October 10, 2005
Since it's a calling convention, the optimizer can't actually do anything different with it. The only difference the optimizer made was getting in the way of your testing, since it realized everything was local.
For a function/method that will be visible to callers the optimizer can't analyze (e.g. exported functions), the calling convention will always be followed to the letter. Anything different would just not work.
For "everything local" cases, it's really an exploration of the optimizer, not the calling convention ;) - Anonymous
October 10, 2005
On a different note, it was fun to see the path around the optimizer. And of course, the conclusion was worth it :) - Anonymous
October 11, 2005
Any time you have cross-DLL calling you really need to have your calling convention called out clearly in your headers.
Otherwise maybe you compiled with /Gd (__cdecl default) and someone else compiles /Gz (__stdcall default) and lo and behold, you can't actually run the code. - Anonymous
October 11, 2005
I was just about to say that if you put one function in a DLL and call it from another then you'll test exactly what you were trying for and the optimizer can't delete it.
But Michael Grier brought up something that looks like yet another Windows bug, detouring what I was just about to say. When the Setup APIs report that some Setup API was compiled with the wrong calling convention, it looks like Windows is reporting a Windows bug. There's a chance that some higher level application might be stepping on something so the Setup APIs might just be victims, but it looks like that chance is a rather small one. It looks like some Setup API DLL neglected to have its calling convention called out clearly in its headers, and Windows diagnoses itself. If Microsoft would test its own stuff on checked builds, maybe Microsoft would find the same. - Anonymous
October 13, 2005
This question might be out of the intended scope but...
Which convention does .NET use? - Anonymous
October 13, 2005
Considering the way .NET programs have to declare their uses of unmanaged DLLs, it sure looks like __stdcall and __thiscall.