Minimize the size of your program - low level
NOTICE: The following is not intended for real-world application. It is just an intellectual exercise to minimize the size of the program. The generated PE file may or may not be valid as it depends on behavior of specific architecture, OS and toolset.
The previous post: Minimize the size of your program - high level
Do you know the smallest x86 PE file in the world?
Here is the answer: https://www.phreedom.org/solar/code/tinype/
- Smallest possible PE file: 97 bytes
- Smallest possible PE file on Windows 2000: 133 bytes
- Smallest PE file that downloads a file over WebDAV and executes it: 133 bytes
BTW: You can use IDA Pro, WinDBG or OllyDbg to open and debug the tiny PE :P
The main idea of tiny PE is simple: Striping unnecessary bytes and collapsing data structure by overlapping. So the overhead of PE format will be minimized (however, striping will lead to invalid PE header). Here, we are more concerned about how to decrease the size of our code.
To investigate the code generated by the compiler, you can use the /FAsc flag.
First, we can apply the technique used in tiny PE to collapse the PE Header. There're lots of fields that are not used by the current OS loader. It is unsafe to assume that they are not used in real-world application, but as far as our main purpose here is to minimize the size, we will move our data into these fields to save space. Another benefit is that, the address difference of these data will be within 0x80, which is suitable for our further optimization of the code.
Now let's begin our optimization!
1. Algorithm
a. Local variable initialization
Normally, the initialization involves lots of bytes of code.
unsigned long keyT[4]={0xCBDCEDFE,0x8798A9BA,0x43546576,0x00102132};
The above code will be translated to assembly like:
mov DWORD PTR [ebp+offset8], imm32
It is 7 bytes. Then 7*4 = 28 bytes to initialize the whole array.
Instead, we can change the initialization into for-loop.
for (size_t i=0;i<16;++i) ((char *)keyT)[i]=0xFE-i*0x11;
((char *)keyT)[15]++;
The above code can be implemented as:
lea edx, [ebp+keyT-1]
xor ecx, ecx
mov cl, 16
mov al, 0FEh
loc_4:
inc edx
mov [edx], al
sub al, 11h
loop loc_4
inc byte [edx]
28 -> 18 bytes!
b. memcpy
By default, the compiler will call the API memcpy in msvcrt.dll. But the cost is high. The alternative is to use intrinsic version of memcpy which takes advantage of "rep movs" instruction instead.
If we take a closer look at the algorithm, it turns out that the code simply rotates keyT array t bytes left.
memcpy(Buf1,(char *)keyT+t,0x10-t);
memcpy((char *)Buf1+0x10-t,keyT,t);
memcpy(keyT,Buf1,16);
That means most of the parameter for memcpy can be reused, which will save lots of store/load.
;ebx == t, edi == end of keyT
xor ecx, ecx
mov cl, 10h
sub cl, bl
mov esi, edi
sub esi, ecx
lea edi, [ebp+Buf1]
push edi
rep movsb
mov cl, bl
lea esi, [ebp+keyT]
rep movsb
pop esi
mov cl, 10h
rep movsb
Thanks to the value reuse, the last two memcpy only use 12 bytes, and 35 bytes in total! In comparison, 54 bytes before optimization, not including more than 10 bytes for PE format overhead related to reference memcpy API
2. 8-bit vs 32-bit
In 32-bit environment, all the data and address are 32-bit. Then the 32-bit immediate numbers and long jump instructions can contribute to large portion of our code.
We can try to make these value small enough, so they can be held in 8-bit data type.
For example, when we call:
printf("%08X",keyT[k]);
The compiler will generate the code like this:
push hexstr+ImageBase ; "%08X"
call [printf+ImageBase]
Because we arrange the positions of hexstr and printf within 80 bytes, we can use the following code instead:
mov eax, hexstr+ImageBase
push eax
add al, printf-hexstr
call [eax]
Because printf uses cdecl calling convention, the caller is responsible to balance the stack. When we pop up the arguments from stack, we can reuse the pushed "eax" for "printf("\n");"
add al, newline-hexstr
push eax
add al, printf-newline
call [eax]
22 -> 17 bytes!
We can also split the long jmp into two short jmps (5 -> 4 bytes)
3. Misc
- Prevent using 16-bit instruction.
- Use "loop" instruction to do the loop.
- Use enter/leave the setup stack.
- Use "lods/stos" to load and store from the memory.
- Use "pusha/popa" for multiple push/pop.
- Sometimes, the instruction for "eax" will be a little shorter than for other registers.
- Sometimes, 8-bit instruction will be a little shorter than the 32-bit version.
- Return value is not needed here, and OS doesn't require us to save the registers. So we can omit both the "xor eax, eax" at the end of the program and the "push/pop" instructions to save the register (this is unsafe for real-world application)
- Register allocation is also very important. If chosen properly, we can share their values and prevent many load/store.
- Take advantage of al, ah and alike. Otherwise the limited number of registers on X86 will be a big issue.
For more detailed information on how to optimize your code for size, please see the excellent optimization manual by Agner Fog.
BTW: Don't forget to set the "Debug Directory Size" in the PE Header to 0, otherwise the program will crash on WinXP. Because that data member overlaps with our code, that means we have to insert instructions like this into our code:
jmp short skip_debug
times 4 db 0 ;ensure debug directory size is 0
skip_debug:
Here is the source (you can compile it using "nasmw -f bin") and the result:
PE Header | Code | Data | Total size | |
Before optimization | 480 | 272 | 144 | 896 |
After optimization | 136 | 180 | 7 | 323 |
Comments
- Anonymous
June 09, 2009
PingBack from http://cellulitecreamsite.info/story.php?id=7039