Understanding Arm64EC ABI and assembly code
Arm64EC ("Emulation Compatible") is a new application binary interface (ABI) for building apps for Windows 11 on Arm. For an overview of Arm64EC and how to start building Win32 apps as Arm64EC, see Using Arm64EC to build apps for Windows 11 on Arm devices.
The purpose of this document is to provide a detailed view of the Arm64EC ABI with enough information for an application developer to write and debug code compiled for Arm64EC, including low-level/assembler debugging and writing assembly code targeting the Arm64EC ABI.
Design of Arm64EC
Arm64EC was designed to deliver native-level functionality and performance, while providing transparent and direct interoperability with x64 code running under emulation.
Arm64EC is mostly additive to the Classic Arm64 ABI. Very little of the Classic ABI was changed, but portions were added to enable x64 interoperability.
In this document, the original, standard Arm64 ABI shall be referred to as "Classic ABI". This avoids the ambiguity inherent to overloaded terms like "Native". Arm64EC, to be clear, is every bit as native as the original ABI.
Arm64EC vs Arm64 Classic ABI
The following list points out where Arm64EC has diverged from Arm64 Classic ABI.
These are small changes when seen in perspective of how much the entire ABI defines.
Register mapping and blocked registers
For there to be type-level interoperability with x64 code, Arm64EC code is compiled with the same pre-processor architecture definitions as x64 code.
In other words, _M_AMD64
and _AMD64_
are defined. One of the types affected by this rule is the CONTEXT
structure. The CONTEXT
structure defines the state of the CPU at a given point. It is used for things like Exception Handling
and GetThreadContext
APIs. Existing x64 code expects the CPU context to be represented as an x64 CONTEXT
structure or, in other words, the CONTEXT
structure as it is defined during x64 compilation.
This structure must be used to represent the CPU context while executing x64 code, as well as Arm64EC code. Existing code would not understand a novel concept, such as the CPU register set changing from function to function. If the x64 CONTEXT
structure is used to represent Arm64 execution states, this implies Arm64 registers are effectively mapped into x64 registers.
It also implies that any Arm64 registers which cannot be fitted into the x64 CONTEXT
must not be used, as their values can be lost anytime an operation using CONTEXT
occurs (and some can be asynchronous and unexpected, such as the Garbage Collection operation of a Managed Language Runtime, or an APC).
The mapping rules between Arm64EC and x64 registers are represented by the ARM64EC_NT_CONTEXT
structure in the Windows headers, present in the SDK. This structure is essentially a union of the CONTEXT
structure, exactly as it is defined for x64, but with an extra Arm64 register overlay.
For example, RCX
maps to X0
, RDX
to X1
, RSP
to SP
, RIP
to PC
, etc. We can also see how the registers x13
, x14
, x23
, x24
, x28
, v16
-v31
have no representation and, thus, cannot be used in Arm64EC.
This register usage restriction is the first difference between the Arm64 Classic and EC ABIs.
Call checkers
Call checkers have been a part of Windows ever since Control Flow Guard (CFG) was introduced in Windows 8.1. Call checkers are address sanitizers for function pointers (before these things were called address sanitizers). Every time code is compiled with the option /guard:cf
the compiler will generate an extra call to the checker function just before every indirect call/jump. The checker function itself is provided by Windows and, for CFG, it performs a validity check against the known-to-be-good call targets. This information is also included in binaries compiled with /guard:cf
.
This is an example of a call checker use in Classic Arm64:
mov x15, <target>
adrp x16, __guard_check_icall_fptr
ldr x16, [x16, __guard_check_icall_fptr]
blr x16 ; check target function
blr x15 ; call function
In the CFG case, the call checker will simply return if the target is valid, or fast-fail the process if it is not. Call checkers have custom calling conventions. They take the function pointer in a register not used by the normal calling convention and preserving all normal calling-convention registers. This way, they don't introduce register spillage around them.
Call checkers are optional on all other Windows ABIs, but mandatory on Arm64EC. On Arm64EC, call checkers accumulate the task of verifying the architecture of the function being called. They verify whether the call is another EC ("Emulation Compatible") function or an x64 function that must be executed under emulation. In many cases, this can only be verified at runtime.
Arm64EC call checkers build on top of the existing Arm64 checkers, but they have a slightly different custom calling convention. They take an extra parameter and they may modify the register containing the target address. For example, if the target is x64 code, control must be transferred to the emulation scaffolding logic first.
In Arm64EC, the same call checker use would become:
mov x11, <target>
adrp x9, __os_arm64x_check_icall_cfg
ldr x9, [x9, __os_arm64x_check_icall_cfg]
adrp x10, <name of the exit thunk>
add x10, x10, <name of the exit thunk>
blr x9 ; check target function
blr x11 ; call function
Slight differences from Classic Arm64 include:
- Symbol name for the call checker is different.
- The target address is supplied in
x11
instead ofx15
. - The target address (
x11
) is[in, out]
instead of[in]
. - There is an extra parameter, provided through
x10
, called an "Exit Thunk".
An Exit Thunk is a funclet which transforms function parameters from Arm64EC calling convention to x64 calling convention.
The Arm64EC call checker is located through a different symbol than is used for the other ABIs in Windows. On the Classic Arm64 ABI, the symbol for the call checker is __guard_check_icall_fptr
. This symbol will be present in Arm64EC, but it is there for x64 statically linked code to use, not Arm64EC code itself. Arm64EC code will use either __os_arm64x_check_icall
or __os_arm64x_check_icall_cfg
.
On Arm64EC, call checkers are not optional. However, CFG is still optional, as is the case for other ABIs. CFG may be disabled at compile-time, or there may be a legitimate reason to not perform a CFG check even when CFG is enabled (e.g. the function pointer never resides in RW memory). For an indirect call with CFG check, the __os_arm64x_check_icall_cfg
checker should be used. If CFG is disabled or unnecessary, __os_arm64x_check_icall
should be used instead.
Below is a summary table of the call checker usage on Classic Arm64, x64 and Arm64EC noting the fact that an Arm64EC binary can have two options depending on the architecture of the code.
Binary | Code | Unprotected indirect call | CFG protected indirect call |
---|---|---|---|
x64 | x64 | no call checker | __guard_check_icall_fptr or __guard_dispatch_icall_fptr |
Arm64 Classic | Arm64 | no call checker | __guard_check_icall_fptr |
Arm64EC | x64 | no call checker | __guard_check_icall_fptr or __guard_dispatch_icall_fptr |
Arm64EC | __os_arm64x_check_icall |
__os_arm64x_check_icall_cfg |
Independently of the ABI, having CFG enabled code (code with reference to the CFG call-checkers), does not imply CFG protection at runtime. CFG protected binaries can run down-level, on systems not supporting CFG: the call-checker is initialized with a no-op helper at compile time. A process may also have CFG disabled by configuration. When CFG is disabled (or OS support is not present) on previous ABIs the OS will simply not update the call-checker when the binary is loaded. On Arm64EC, if CFG protection is disabled, the OS will set __os_arm64x_check_icall_cfg
the same as __os_arm64x_check_icall
, which will still provide the needed target architecture check in all cases, but not CFG protection.
As with CFG in Classic Arm64, the call to the target function (x11
) must immediately follow the call to the Call Checker. The address of the Call Checker must be placed in a volatile register and neither it, nor the address of the target function, should ever be copied to another register or spilled to memory.
Stack Checkers
__chkstk
is used automatically by the compiler every time a function allocates an area on the stack larger than a page. To avoid skipping over the stack guard page protecting the end of the stack, __chkstk
is called to make sure all pages in the allocated area are probed.
__chkstk
is usually called from the prolog of the function. For that reason, and for optimal code generation, it uses a custom calling convention.
This implies that x64 code and Arm64EC code need their own, distinct, __chkstk
functions, as Entry and Exit thunks assume standard calling conventions.
x64 and Arm64EC share the same symbol namespace so there can't be two functions named __chkstk
. To accommodate compatibility with pre-existing x64 code, __chkstk
name will be associated with the x64 stack checker. Arm64EC code will use __chkstk_arm64ec
instead.
The custom calling convention for __chkstk_arm64ec
is the same as for Classic Arm64 __chkstk
: x15
provides the size of the allocation in bytes, divided by 16. All non-volatile registers, as well as all volatile registers involved in the standard calling convention are preserved.
Everything said above about __chkstk
applies equally to __security_check_cookie
and its Arm64EC counterpart: __security_check_cookie_arm64ec
.
Variadic calling convention
Arm64EC follows the Classic Arm64 ABI calling convention, except for Variadic functions (aka varargs, aka functions with the ellipsis (. . .) parameter keyword).
For the variadic specific case, Arm64EC follows a calling convention very similar to x64 variadic, with only a few differences. Below are the major rules for Arm64EC variadic:
- Only the first 4 registers are used for parameter passing:
x0
,x1
,x2
,x3
. Remaining parameters are spilled onto the stack. This follows the x64 variadic calling convention exactly, and differs from Arm64 Classic, where registersx0
->x7
are used. - Floating Point / SIMD parameters being passed by register will use a General-Purpose register, not a SIMD one. This is similar to Arm64 Classic, and differs from x64, where FP/SIMD parameters are passed in both a General-Purpose and SIMD register. For example, for a function
f1(int, …)
being called asf1(int, double)
, on x64, the second parameter will be assigned to bothRDX
andXMM1
. On Arm64EC, the second parameter will be assigned to justx1
. - When passing structures by value through a register, x64 size rules apply: Structures with sizes exactly 1, 2, 4 and 8 will be loaded directly into the General-Purpose register. Structures with other sizes are spilled onto the stack, and a pointer to the spilled location is assigned to the register. This essentially demotes by-value into by-reference, at the low-level. On the Classic Arm64 ABI, structures of any size up to 16 bytes are assigned directly to General-Purposed registers.
- X4 register is loaded with a pointer to the first parameter passed via stack (the 5th parameter). This does not include structures spilled due to the size restrictions outlined above.
- X5 register is loaded with the size, in bytes, of all parameters passed by stack (size of all parameters, starting with the 5th). This does not include structures passed by value spilled due to the size restrictions outlined above.
In the following example: pt_nova_function
below takes parameters in a non-variadic form, therefore following the Classic Arm64 calling convention. It then calls pt_va_function
with the exact same parameters but in a variadic call instead.
struct three_char {
char a;
char b;
char c;
};
void
pt_va_function (
double f,
...
);
void
pt_nova_function (
double f,
struct three_char tc,
__int64 ull1,
__int64 ull2,
__int64 ull3
)
{
pt_va_function(f, tc, ull1, ull2, ull3);
}
pt_nova_function
takes 5 parameters which will be assigned following the Classic Arm64 calling convention rules:
- 'f ' is a double. It will be assigned to d0.
- 'tc' is a struct, with a size of 3 bytes. It will be assigned to x0.
- ull1 is an 8-byte integer. It will be assigned to x1.
- ull2 is an 8-byte integer. It will be assigned to x2.
- ull3 is an 8-byte integer. It will be assigned to x3.
pt_va_function
is a variadic function, so it will follow the Arm64EC variadic rules outlined above:
- 'f ' is a double. It will be assigned to x0.
- 'tc' is a struct, with a size of 3 bytes. It will be spilled onto the stack and its location loaded into x1.
- ull1 is an 8-byte integer. It will be assigned to x2.
- ull2 is an 8-byte integer. It will be assigned to x3.
- ull3 is an 8-byte integer. It will be assigned directly to the stack.
- x4 is loaded with the location of ull3 in the stack.
- x5 is loaded with the size of ull3.
The following shows a possible compilation output for pt_nova_function
, which illustrates the parameter assignment differences outlined above.
stp fp,lr,[sp,#-0x30]!
mov fp,sp
sub sp,sp,#0x10
str x3,[sp] ; Spill 5th parameter
mov x3,x2 ; 4th parameter to x3 (from x2)
mov x2,x1 ; 3rd parameter to x2 (from x1)
str w0,[sp,#0x20] ; Spill 2nd parameter
add x1,sp,#0x20 ; Address of 2nd parameter to x1
fmov x0,d0 ; 1st parameter to x0 (from d0)
mov x4,sp ; Address of the 1st in-stack parameter to x4
mov x5,#8 ; Size of the in-stack parameter area
bl pt_va_function
add sp,sp,#0x10
ldp fp,lr,[sp],#0x30
ret
ABI additions
To achieve transparent interoperability with x64 code, many additions have been made to the Classic Arm64 ABI. They handle the calling conventions differences between Arm64EC and x64.
The following list includes these additions:
Entry and Exit Thunks
Entry and Exit Thunks take care of translating the Arm64EC calling convention (mostly the same as Classic Arm64) into the x64 calling convention, and vice-versa.
A common misconception is that calling conventions can be converted by following a single rule applied to all function signatures. The reality is that calling conventions have parameter assignment rules. These rules depend on the parameter type and are different from ABI to ABI. A consequence is that translation between ABIs will be specific to each function signature, varying with the type of each parameter.
Consider the following function:
int fJ(int a, int b, int c, int d);
Parameter assignment will occur as follows:
- Arm64: a -> x0, b -> x1, c -> x2, d -> x3
- x64: a -> RCX, b -> RDX, c -> R8, d -> r9
- Arm64 -> x64 translation: x0 -> RCX, x1 -> RDX, x2 -> R8, x3 -> R9
Now consider a different function:
int fK(int a, double b, int c, double d);
Parameter assignment will occur as follows:
- Arm64: a -> x0, b -> d0, c -> x1, d -> d1
- x64: a -> RCX, b -> XMM1, c -> R8, d -> XMM3
- Arm64 -> x64 translation: x0 -> RCX, d0 -> XMM1, x1 -> R8, d1 -> XMM3
These examples demonstrate that parameter assignment and translation vary by type, but also the types of the preceding parameters in the list are depended upon. This detail is illustrated by the 3rd parameter. In both functions, the parameter's type is "int", but the resulting translation is different.
Entry and Exit Thunks exist for this reason and are specifically tailored for each individual function signature.
Both types of thunks are, themselves, functions. Entry Thunks are automatically invoked by the emulator when x64 functions call into Arm64EC functions (execution Enters Arm64EC). Exit Thunks are automatically invoked by the call checkers when Arm64EC functions call into x64 functions (execution Exits Arm64EC).
When compiling Arm64EC code, the compiler will generate an Entry Thunk for each Arm64EC function, matching its signature. The compiler will also generate an Exit Thunk for every function an Arm64EC function calls.
Consider the following example:
struct SC {
char a;
char b;
char c;
};
int fB(int a, double b, int i1, int i2, int i3);
int fC(int a, struct SC c, int i1, int i2, int i3);
int fA(int a, double b, struct SC c, int i1, int i2, int i3) {
return fB(a, b, i1, i2, i3) + fC(a, c, i1, i2, i3);
}
When compiling the code above targeting Arm64EC, the compiler will generate:
- Code for 'fA'.
- Entry Thunk for 'fA'
- Exit Thunk for 'fB'
- Exit Thunk for 'fC'
The fA
Entry Thunk is generated in case fA
and called from x64 code. Exit Thunks for fB
and fC
are generated in case fB
and/or fC
and turn out to be x64 code.
The same Exit Thunk may be generated multiple times, given the compiler will generate them at the call site rather than the function itself. This may result in a considerable amount of redundant thunks so, in reality, the compiler will apply trivial optimization rules to make sure only the required thunks make it into the final binary.
For example, in a binary where Arm64EC function A
calls Arm64EC function B
, B
is not exported and its address is never known outside of A
. It is safe to eliminate the Exit Thunk from A
to B
, along with the Entry Thunk for B
. It is also safe to alias together all Exit and Entry thunks which contain the same code, even if they were generated for distinct functions.
Exit Thunks
Using the example functions fA
, fB
and fC
above, this is how the compiler would generate both fB
and fC
Exit Thunks:
Exit Thunk to int fB(int a, double b, int i1, int i2, int i3);
$iexit_thunk$cdecl$i8$i8di8i8i8:
stp fp,lr,[sp,#-0x10]!
mov fp,sp
sub sp,sp,#0x30
adrp x8,__os_arm64x_dispatch_call_no_redirect
ldr xip0,[x8]
str x3,[sp,#0x20] ; Spill 5th param (i3) into the stack
fmov d1,d0 ; Move 2nd param (b) from d0 to XMM1 (x1)
mov x3,x2 ; Move 4th param (i2) from x2 to R9 (x3)
mov x2,x1 ; Move 3rd param (i1) from x1 to R8 (x2)
blr xip0 ; Call the emulator
mov x0,x8 ; Move return from RAX (x8) to x0
add sp,sp,#0x30
ldp fp,lr,[sp],#0x10
ret
Exit Thunk to int fC(int a, struct SC c, int i1, int i2, int i3);
$iexit_thunk$cdecl$i8$i8m3i8i8i8:
stp fp,lr,[sp,#-0x20]!
mov fp,sp
sub sp,sp,#0x30
adrp x8,__os_arm64x_dispatch_call_no_redirect
ldr xip0,[x8]
str w1,[sp,#0x40] ; Spill 2nd param (c) onto the stack
add x1,sp,#0x40 ; Make RDX (x1) point to the spilled 2nd param
str x4,[sp,#0x20] ; Spill 5th param (i3) into the stack
blr xip0 ; Call the emulator
mov x0,x8 ; Move return from RAX (x8) to x0
add sp,sp,#0x30
ldp fp,lr,[sp],#0x20
ret
In the fB
case, we can see how the presence of a 'double' parameter will cause the remaining GP register assignment to reshuffle, a result of Arm64 and x64's different assignment rules. We can also see x64 only assigns 4 parameters to registers, so the 5th parameter must be spilled onto the stack.
In the fC
case, the second parameter is a structure of 3-byte length. Arm64 will allow any size structure to be assigned to a register directly. x64 only allows sizes 1, 2, 4 and 8. This Exit Thunk must then transfer this struct
from the register onto the stack and assign a pointer to the register instead. This still consumes one register (to carry the pointer) so it does not change assignments for the remaining registers: no register reshuffling happens for the 3rd and 4th parameter. Just as for the fB
case, the 5th parameter must be spilled onto the stack.
Additional considerations for Exit Thunks:
- The compiler will name them not by the function name they translate from->to, but rather the signature they address. This makes it easier to find redundancies.
- The Exit Thunk is called with the register
x9
carrying the address of the target (x64) function. This is set by the call checker and passes through the Exit Thunk, undisturbed, into the emulator.
After rearranging the parameters, the Exit Thunk then calls into the emulator via __os_arm64x_dispatch_call_no_redirect
.
It is worth, at this point, reviewing the function of the call checker, and detail about its own custom ABI. This is what an indirect call to fB
would look like:
mov x11, <target>
adrp x9, __os_arm64x_check_icall_cfg
ldr x9, [x9, __os_arm64x_check_icall_cfg]
adrp x10, $iexit_thunk$cdecl$i8$i8di8i8i8 ; fB function's exit thunk
add x10, x10, $iexit_thunk$cdecl$i8$i8di8i8i8
blr x9 ; check target function
blr x11 ; call function
When calling the call checker:
x11
supplies the address of the target function to call (fB
in this case). It may not be known, at this point, if the target function is Arm64EC or x64.x10
supplies an Exit Thunk matching the signature of the function being called (fB
in this case).
The data returned by the call checker will depend on the target function being Arm64EC or x64.
If the target is Arm64EC:
x11
will return the address of the Arm64EC code to call. This may or may not be the same value that was provided in.
If the target is x64 code:
x11
will return the address of the Exit Thunk. This is copied from input provided inx10
.x10
will return the address of the Exit Thunk, undisturbed from input.x9
will return the target x64 function. This may or may not be the same value it was provided in viax11
.
Call checkers will always leave calling convention parameter registers undisturbed, so the calling code should follow the call to the call checker immediately with blr x11
(or br x11
in case of a tail-call). These are the registers call checkers. They will always preserve above and beyond standard non-volatile registers: x0
-x8
, x15
(chkstk
) and q0
-q7
.
Entry Thunks
Entry Thunks take care of the transformations required from the x64 to the Arm64 calling conventions. This is, essentially, the reverse of Exit Thunks but there are a few more aspects to consider.
Consider the previous example of compiling fA
, an Entry Thunk is generated so that fA
can be called by x64 code.
Entry Thunk for int fA(int a, double b, struct SC c, int i1, int i2, int i3)
$ientry_thunk$cdecl$i8$i8dm3i8i8i8:
stp q6,q7,[sp,#-0xA0]! ; Spill full non-volatile XMM registers
stp q8,q9,[sp,#0x20]
stp q10,q11,[sp,#0x40]
stp q12,q13,[sp,#0x60]
stp q14,q15,[sp,#0x80]
stp fp,lr,[sp,#-0x10]!
mov fp,sp
ldrh w1,[x2] ; Load 3rd param (c) bits [15..0] directly into x1
ldrb w8,[x2,#2] ; Load 3rd param (c) bits [16..23] into temp w8
bfi w1,w8,#0x10,#8 ; Merge 3rd param (c) bits [16..23] into x1
mov x2,x3 ; Move the 4th param (i1) from R9 (x3) to x2
fmov d0,d1 ; Move the 2nd param (b) from XMM1 (d1) to d0
ldp x3,x4,[x4,#0x20] ; Load the 5th (i2) and 6th (i3) params
; from the stack into x3 and x4 (using x4)
blr x9 ; Call the function (fA)
mov x8,x0 ; Move the return from x0 to x8 (RAX)
ldp fp,lr,[sp],#0x10
ldp q14,q15,[sp,#0x80] ; Restore full non-volatile XMM registers
ldp q12,q13,[sp,#0x60]
ldp q10,q11,[sp,#0x40]
ldp q8,q9,[sp,#0x20]
ldp q6,q7,[sp],#0xA0
adrp xip0,__os_arm64x_dispatch_ret
ldr xip0,[xip0,__os_arm64x_dispatch_ret]
br xip0
The address of the target function is provided by the emulator in x9
.
Before calling the Entry Thunk, the x64 emulator will pop the return address from the stack into the LR
register. It is then expected that LR
will be pointing at x64 code when control is transferred to the Entry Thunk.
The emulator may also perform another adjustment to the stack, depending on the following: Both Arm64 and x64 ABIs define a stack alignment requirement where the stack must be aligned to 16-bytes at the point a function is called. When running Arm64 code, hardware enforces this rule, but there is no hardware enforcement for x64. While running x64 code, erroneously calling functions with an unaligned stack may go unnoticed indefinitely, until some 16-byte alignment instruction is used (some SSE instructions do) or Arm64EC code is called.
To address this potential compatibility problem, before calling the Entry Thunk, the emulator will always align-down the Stack Pointer to 16-bytes and store its original value in the x4
register. This way Entry Thunks always start executing with an aligned stack but can still correctly reference the parameters passed on the stack, via x4
.
When it comes to non-volatile SIMD registers, there is a significant difference between the Arm64 and x64 calling conventions. On Arm64, the low 8 bytes (64 bits) of the register are considered non-volatile. In other words, only the Dn
part of the Qn
registers is non-volatile. On x64, the entire 16 bytes of the XMMn
register is considered non-volatile. Furthermore, on x64, XMM6
and XMM7
are non-volatile registers whereas D6 and D7 (the corresponding Arm64 registers) are volatile.
To address these SIMD register manipulation asymmetries, Entry Thunks must explicitly save all SIMD registers which are considered non-volatile in x64. This is only needed on Entry Thunks (not Exit Thunks) because x64 is stricter than Arm64. In other words, register saving/preservation rules in x64 exceed the Arm64 requirements in all cases.
To address the correct recovery of these register values when unwinding the stack (e.g. setjmp + longjmp, or throw + catch), a new unwind opcode was introduced: save_any_reg (0xE7)
. This new 3-byte unwind opcode enables saving any General Purpose or SIMD register (including the ones considered volatile) and including full-sized Qn
registers. This new opcode is used for the Qn
register spills/fill operations above. save_any_reg
is compatible with save_next_pair (0xE6)
.
For reference, below is the corresponding unwind information belonging to the Entry Thunk presented above:
Prolog unwind:
06: E76689.. +0004 stp q6,q7,[sp,#-0xA0]! ; Actual=stp q6,q7,[sp,#-0xA0]!
05: E6...... +0008 stp q8,q9,[sp,#0x20] ; Actual=stp q8,q9,[sp,#0x20]
04: E6...... +000C stp q10,q11,[sp,#0x40] ; Actual=stp q10,q11,[sp,#0x40]
03: E6...... +0010 stp q12,q13,[sp,#0x60] ; Actual=stp q12,q13,[sp,#0x60]
02: E6...... +0014 stp q14,q15,[sp,#0x80] ; Actual=stp q14,q15,[sp,#0x80]
01: 81...... +0018 stp fp,lr,[sp,#-0x10]! ; Actual=stp fp,lr,[sp,#-0x10]!
00: E1...... +001C mov fp,sp ; Actual=mov fp,sp
+0020 (end sequence)
Epilog #1 unwind:
0B: 81...... +0044 ldp fp,lr,[sp],#0x10 ; Actual=ldp fp,lr,[sp],#0x10
0C: E74E88.. +0048 ldp q14,q15,[sp,#0x80] ; Actual=ldp q14,q15,[sp,#0x80]
0F: E74C86.. +004C ldp q12,q13,[sp,#0x60] ; Actual=ldp q12,q13,[sp,#0x60]
12: E74A84.. +0050 ldp q10,q11,[sp,#0x40] ; Actual=ldp q10,q11,[sp,#0x40]
15: E74882.. +0054 ldp q8,q9,[sp,#0x20] ; Actual=ldp q8,q9,[sp,#0x20]
18: E76689.. +0058 ldp q6,q7,[sp],#0xA0 ; Actual=ldp q6,q7,[sp],#0xA0
1C: E3...... +0060 nop ; Actual=90000030
1D: E3...... +0064 nop ; Actual=ldr xip0,[xip0,#8]
1E: E4...... +0068 end ; Actual=br xip0
+0070 (end sequence)
After the Arm64EC function returns, the __os_arm64x_dispatch_ret
routine is used to re-enter the emulator, back to x64 code (pointed to by LR
).
Arm64EC functions have the 4 bytes before the first instruction in the function reserved for storing information to be used at runtime. It is in these 4 bytes that the relative address of Entry Thunk for the function can be found. When performing a call from an x64 function to an Arm64EC function, the emulator will read the 4 bytes before the start of the function, mask-out the lower two bits and add that amount to the address of the function. This will produce the address of the Entry Thunk to call.
Adjustor Thunks
Adjustor Thunks are signature-less functions which simply transfer control to (tail-call) another function, after performing some transformation to one of the parameters. The type of the parameter(s) being transformed is known, but all the remaining parameters can be anything and, in any number – Adjustor Thunks will not touch any register potentially holding a parameter and will not touch the stack. This is what makes Adjustor Thunks signature-less functions.
Adjustor Thunks can be generated automatically by the compiler. This is common, for example, with C++ multiple-inheritance, where any virtual method may be delegated to the parent class, unmodified, aside from an adjustment to the this
pointer.
Below is a real-world example:
[thunk]:CObjectContext::Release`adjustor{8}':
sub x0,x0,#8
b CObjectContext::Release
The thunk subtracts 8 bytes to the this
pointer and forwards the call to the parent class.
In summary, Arm64EC functions callable from x64 functions must have an associated Entry Thunk. The Entry Thunk is signature specific. Arm64 signature-less functions, such as Adjustor Thunks need a different mechanism which can handle signature-less functions.
The Entry Thunk of an Adjustor Thunk uses the __os_arm64x_x64_jump
helper to defer the execution of the real Entry Thunk work (adjust the parameters from one convention to the other) to the next call. It is at this time that the signature becomes apparent. This includes the option of not doing calling convention adjustments at all, if the target of the Adjustor Thunk turns out to be an x64 function. Remember that by the time an Entry Thunk starts running, the parameters are in their x64 form.
In the example above, consider how the code looks in Arm64EC.
Adjustor Thunk in Arm64EC
[thunk]:CObjectContext::Release`adjustor{8}':
sub x0,x0,#8
adrp x9,CObjectContext::Release
add x11,x9,CObjectContext::Release
stp fp,lr,[sp,#-0x10]!
mov fp,sp
adrp xip0, __os_arm64x_check_icall
ldr xip0,[xip0, __os_arm64x_check_icall]
blr xip0
ldp fp,lr,[sp],#0x10
br x11
Adjustor Thunk's Entry Trunk
[thunk]:CObjectContext::Release$entry_thunk`adjustor{8}':
sub x0,x0,#8
adrp x9,CObjectContext::Release
add x9,x9,CObjectContext::Release
adrp xip0,__os_arm64x_x64_jump
ldr xip0,[xip0,__os_arm64x_x64_jump]
br xip0
Fast-Forward Sequences
Some applications make run-time modifications to functions residing in binaries that they do not own but depend on – commonly operating system binaries – for the purpose of detouring execution when the function is called. This is also known as hooking.
At the high-level, the hooking process is simple. In detail, however, hooking is architecture specific and quite complex given the potential variations the hooking logic must address.
In general terms, the process involves the following:
- Determine the address of the function to hook.
- Replace the first instruction of the function with a jump to the hook routine.
- When the hook is done, come back to the original logic, which includes running the displaced original instruction.
The variations arise from things like:
- The size of the 1st instruction: It is a good idea to replace it with a JMP which is the same size or smaller, to avoid replacing the top of the function while other thread may be running it in flight.
- The type of the first instruction: If the first instruction has some PC relative nature to it, relocating it may require changing things like the displacement fields. Since they are likely to overflow when an instruction is moved to a distant place, this may require providing equivalent logic with different instructions altogether.
Due to all of this complexity, robust and generic hooking logic is rare to find. Frequently the logic present in applications can only cope with a limited set of cases that the application is expecting to encounter in the specific APIs it is interested in. It is not hard to imagine how much of an application compatibility problem this is. Even a simple change in the code or compiler optimizations may render applications unusable if the code no longer looks exactly as expected.
What would happen to these applications if they were to encounter Arm64 code when setting up a hook? They would most certainly fail.
Fast-Forward Sequence (FFS) functions address this compatibility requirement in Arm64EC.
FFS are very small x64 functions which contain no real logic and tail-call to the real Arm64EC function. They are optional but enabled by default for all DLL exports and for any function decorated with __declspec(hybrid_patchable)
.
For these cases, when code obtains a pointer to a given function, either by GetProcAddress
in the export case, or by &function
in the __declspec(hybrid_patchable)
case, the resulting address will contain x64 code. That x64 code will pass for a legitimate x64 function, satisfying most of the hooking logic currently available.
Consider the following example (error handling omitted for brevity):
auto module_handle =
GetModuleHandleW(L"api-ms-win-core-processthreads-l1-1-7.dll");
auto pgma =
(decltype(&GetMachineTypeAttributes))
GetProcAddress(module_handle, "GetMachineTypeAttributes");
hr = (*pgma)(IMAGE_FILE_MACHINE_Arm64, &MachineAttributes);
The function pointer value in the pgma
variable will contain the address of GetMachineTypeAttributes
's FFS.
This is an example of a Fast-Forward Sequence:
kernelbase!EXP+#GetMachineTypeAttributes:
00000001`800034e0 488bc4 mov rax,rsp
00000001`800034e3 48895820 mov qword ptr [rax+20h],rbx
00000001`800034e7 55 push rbp
00000001`800034e8 5d pop rbp
00000001`800034e9 e922032400 jmp 00000001`80243810
The FFS x64 function has a canonical prolog and epilog, ending with a tail-call (jump) to the real GetMachineTypeAttributes
function in Arm64EC code:
kernelbase!GetMachineTypeAttributes:
00000001`80243810 d503237f pacibsp
00000001`80243814 a9bc7bfd stp fp,lr,[sp,#-0x40]!
00000001`80243818 a90153f3 stp x19,x20,[sp,#0x10]
00000001`8024381c a9025bf5 stp x21,x22,[sp,#0x20]
00000001`80243820 f9001bf9 str x25,[sp,#0x30]
00000001`80243824 910003fd mov fp,sp
00000001`80243828 97fbe65e bl kernelbase!#__security_push_cookie
00000001`8024382c d10083ff sub sp,sp,#0x20
[...]
It would be quite inefficient if it was required to run 5 emulated x64 instructions between two Arm64EC functions. FFS functions are special. FFS functions don't really run if they remain unaltered. The call-checker helper will efficiently check if the FFS hasn't been changed. If that is the case, the call will be transferred directly to the real destination. If the FFS has been changed in any possible way, then it will no longer be an FFS. Execution will be transferred to the altered FFS and run whichever code may be there, emulating the detour and any hooking logic.
When the hook transfers execution back to the end of the FFS, it will eventually reach the tail-call to the Arm64EC code, which will then execute after the hook, just as the application is expecting it would.
Authoring Arm64EC in Assembly
Windows SDK headers and the C compiler can simplify the job of authoring Arm64EC assembly. For example, the C compiler can be used to generate Entry and Exit Thunks for functions not compiled from C code.
Consider the example of an equivalent to the following function fD
that must be authored in Assembly (ASM). This function can be called by both Arm64EC and x64 code and the pfE
function pointer may point at either Arm64EC or x64 code as well.
typedef int (PF_E)(int, double);
extern PF_E * pfE;
int fD(int i, double d) {
return (*pfE)(i, d);
}
Writing fD
in ASM would look something like:
#include "ksarm64.h"
IMPORT __os_arm64x_check_icall_cfg
IMPORT |$iexit_thunk$cdecl$i8$i8d|
IMPORT pfE
NESTED_ENTRY_COMDAT A64NAME(fD)
PROLOG_SAVE_REG_PAIR fp, lr, #-16!
adrp x11, pfE ; Get the global function
ldr x11, [x11, pfE] ; pointer pfE
adrp x9, __os_arm64x_check_icall_cfg ; Get the EC call checker
ldr x9, [x9, __os_arm64x_check_icall_cfg] ; with CFG
adrp x10, |$iexit_thunk$cdecl$i8$i8d| ; Get the Exit Thunk for
add x10, x10, |$iexit_thunk$cdecl$i8$i8d| ; int f(int, double);
blr x9 ; Invoke the call checker
blr x11 ; Invoke the function
EPILOG_RESTORE_REG_PAIR fp, lr, #16!
EPILOG_RETURN
NESTED_END
end
In the example above:
- Arm64EC uses the same procedure declaration and prolog/epilog macros as Arm64.
- Function names should be wrapped by the
A64NAME
macro. When compiling C/C++ code as Arm64EC, the compiler marks theOBJ
asARM64EC
containing Arm64EC code. This does not happen withARMASM
. When compiling ASM code there is an alternate way to inform the linker that the produced code is Arm64EC. This is by prefixing the function name with#
. TheA64NAME
macro performs this operation when_ARM64EC_
is defined and leaves the name unchanged when_ARM64EC_
is not defined. This makes it possible to share source code between Arm64 and Arm64EC. - The
pfE
function pointer must first be run through the EC call checker, together with the appropriate Exit Thunk, in case the target function is x64.
Generating Entry and Exit Thunks
The next step is to generate the Entry Thunk for fD
and the Exit Thunk for pfE
. The C compiler can perform this task with minimal effort, using the _Arm64XGenerateThunk
compiler keyword.
void _Arm64XGenerateThunk(int);
int fD2(int i, double d) {
UNREFERENCED_PARAMETER(i);
UNREFERENCED_PARAMETER(d);
_Arm64XGenerateThunk(2);
return 0;
}
int fE(int i, double d) {
UNREFERENCED_PARAMETER(i);
UNREFERENCED_PARAMETER(d);
_Arm64XGenerateThunk(1);
return 0;
}
The _Arm64XGenerateThunk
keyword tells the C compiler to use the function signature, ignore the body, and generate either an Exit Thunk (when the parameter is 1) or an Entry Thunk (when the parameter is 2).
It is recommended to place thunk generation in its own C file. Being in isolated files makes it simpler to confirm the symbol names by dumping the corresponding OBJ
symbols or even disassembly.
Custom Entry Thunks
Macros have been added to the SDK to help authoring custom, hand-coded, Entry Thunks. One case where this can be used is when authoring custom Adjustor Thunks.
Most Adjustor Thunks are generated by the C++ compiler, but they can also be generated manually. This can be found in cases where a generic callback transfers control to the real callback, identified by one of the parameters.
Below is an example in Arm64 Classic code:
NESTED_ENTRY MyAdjustorThunk
PROLOG_SAVE_REG_PAIR fp, lr, #-16!
ldr x15, [x0, 0x18]
adrp x16, __guard_check_icall_fptr
ldr x16, [x16, __guard_check_icall_fptr]
blr xip0
EPILOG_RESTORE_REG_PAIR fp, lr, #16
EPILOG_END br x15
NESTED_END
In this example, the target function address is retrieved from the element of a structure, provided by reference, through the 1st parameter. Because the structure is writable, the target address must be validated through Control Flow Guard (CFG).
The below example demonstrates how the equivalent Adjustor Thunk would look when ported to Arm64EC:
NESTED_ENTRY_COMDAT A64NAME(MyAdjustorThunk)
PROLOG_SAVE_REG_PAIR fp, lr, #-16!
ldr x11, [x0, 0x18]
adrp xip0, __os_arm64x_check_icall_cfg
ldr xip0, [xip0, __os_arm64x_check_icall_cfg]
blr xip0
EPILOG_RESTORE_REG_PAIR fp, lr, #16
EPILOG_END br x11
NESTED_END
The code above does not supply an Exit Thunk (in register x10). This is not possible since the code can be executed for many different signatures. This code is taking advantage of the caller having set x10 to the Exit Thunk. The caller would have made the call targeting an explicit signature.
The above code does need an Entry Thunk to address the case when the caller is x64 code. This is how to author the corresponding Entry Thunk, using the macro for custom Entry Thunks:
ARM64EC_CUSTOM_ENTRY_THUNK A64NAME(MyAdjustorThunk)
ldr x9, [x0, 0x18]
adrp xip0, __os_arm64x_x64_jump
ldr xip0, [xip0, __os_arm64x_x64_jump]
br xip0
LEAF_END
Unlike other functions, this Entry Thunk does not eventually transfer control to the associated function (the Adjustor Thunk). In this case, the functionality itself (performing the parameter adjustment) is embedded into the Entry Thunk and control is transferred directly to the end target, via the __os_arm64x_x64_jump
helper.
Dynamically Generating (JIT Compiling) Arm64EC code
In Arm64EC processes there are two types of executable memory: Arm64EC code and x64 code.
The operating system extracts this information from the loaded binaries. x64 binaries are all x64 and Arm64EC contain a range-table for Arm64EC vs x64 code pages.
What about Dynamically Generated code? Just-in-time (JIT) compilers generate code, at runtime, which is not backed by any binary file.
Usually this implies:
- Allocating writable memory (
VirtualAlloc
). - Producing the code into the allocated memory.
- Re-protecting the memory from Read-Write to Read-Execute (
VirtualProtect
). - Add unwind function entries for all non-trivial (non-leaf) generated functions (
RtlAddFunctionTable
orRtlAddGrowableFunctionTable
).
For trivial compatibility reasons, any application performing these steps in an Arm64EC process will result in the code being considered x64 code. This will happen for any process using the unmodified x64 Java Runtime, .NET runtime, JavaScript engine, etc.
To generate Arm64EC dynamic code, the process is mostly the same with only two differences:
- When allocating the memory, use newer
VirtualAlloc2
(instead ofVirtualAlloc
orVirtualAllocEx
) and provide theMEM_EXTENDED_PARAMETER_EC_CODE
attribute. - When adding function entries:
- They must be in Arm64 format. When compiling Arm64EC code, the
RUNTIME_FUNCTION
type will match the x64 format. For Arm64 format when compiling Arm64EC, use theARM64_RUNTIME_FUNCTION
type instead. - Do not use the older
RtlAddFunctionTable
API. Always use the newerRtlAddGrowableFunctionTable
API instead.
- They must be in Arm64 format. When compiling Arm64EC code, the
Below is an example of memory allocation:
MEM_EXTENDED_PARAMETER Parameter = { 0 };
Parameter.Type = MemExtendedParameterAttributeFlags;
Parameter.ULong64 = MEM_EXTENDED_PARAMETER_EC_CODE;
HANDLE process = GetCurrentProcess();
ULONG allocationType = MEM_RESERVE;
DWORD protection = PAGE_EXECUTE_READ | PAGE_TARGETS_INVALID;
address = VirtualAlloc2 (
process,
NULL,
numBytesToAllocate,
allocationType,
protection,
&Parameter,
1);
And an example of adding one unwind function entry:
ARM64_RUNTIME_FUNCTION FunctionTable[1];
FunctionTable[0].BeginAddress = 0;
FunctionTable[0].Flags = PdataPackedUnwindFunction;
FunctionTable[0].FunctionLength = nSize / 4;
FunctionTable[0].RegF = 0; // no D regs saved
FunctionTable[0].RegI = 0; // no X regs saved beyond fp,lr
FunctionTable[0].H = 0; // no home for x0-x7
FunctionTable[0].CR = PdataCrChained; // stp fp,lr,[sp,#-0x10]!
// mov fp,sp
FunctionTable[0].FrameSize = 1; // 16 / 16 = 1
this->DynamicTable = NULL;
Result == RtlAddGrowableFunctionTable(
&this->DynamicTable,
reinterpret_cast<PRUNTIME_FUNCTION>(FunctionTable),
1,
1,
reinterpret_cast<ULONG_PTR>(pBegin),
reinterpret_cast<ULONG_PTR>(reinterpret_cast<PBYTE>(pBegin) + nSize)
);
Windows on Arm