aapcs64
aapcs64
aapcs64
Architecture (AArch64)
2023Q3
1.2 Keywords
Procedure call, function call, calling conventions, data layout
2
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
1.4 Licence
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of
this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866,
Mountain View, CA 94042, USA.
Grant of Patent License. Subject to the terms and conditions of this license (both the Public License and this Patent
License), each Licensor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and
otherwise transfer the Licensed Material, where such license applies only to those patent claims licensable by such
Licensor that are necessarily infringed by their contribution(s) alone or by combination of their contribution(s) with the
Licensed Material to which such contribution(s) was submitted. If You institute patent litigation against any entity
(including a cross-claim or counterclaim in a lawsuit) alleging that the Licensed Material or a contribution incorporated
within the Licensed Material constitutes direct or contributory patent infringement, then any licenses granted to You
under this license for that Licensed Material shall terminate as of the date such litigation is filed.
1.6 Contributions
Contributions to this project are licensed under an inbound=outbound model such that any such contributions are
licensed by the contributor under the same terms as those in the Licence section.
1.8 Copyright
Copyright (c) 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
3
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Contents
1 Preamble 2
1.1 Abstract 2
1.2 Keywords 2
1.3 Latest release and defects report 2
1.4 Licence 3
1.5 About the license 3
1.6 Contributions 3
1.7 Trademark notice 3
1.8 Copyright 3
2 About this document 7
2.1 Change control 7
2.1.1 Current status and anticipated changes 7
2.1.2 Change history 7
2.1.3 References 8
2.2 Terms and abbreviations 9
3 Scope 12
4 Introduction 13
4.1 Design goals 13
4.2 Conformance 13
5 Data types and alignment 14
5.1 Fundamental Data Types 14
5.2 Half-precision Floating Point 15
5.3 Decimal Floating Point 15
5.4 Short Vectors 16
5.5 Scalable Vectors 16
5.6 Scalable Predicates 16
5.7 Pointers 16
5.8 Byte order ("Endianness") 17
5.9 Composite Types 17
5.9.1 Aggregates 18
5.9.2 Unions 18
5.9.3 Arrays 18
5.9.4 Bit-fields subdivision 18
5.9.5 Homogeneous Aggregates 18
5.10 Pure Scalable Types (PSTs) 19
6 The Base Procedure Call Standard 20
6.1 Machine Registers 20
4
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
6.1.1 General-purpose Registers 20
6.1.2 SIMD and Floating-Point registers 21
6.1.3 Scalable vector registers 22
6.1.4 Scalable Predicate Registers 22
6.2 SME state 22
6.3 Threads and processes 22
6.4 Memory and the Stack 23
6.4.1 Memory addresses 23
6.4.2 Properties of a thread 23
6.4.3 TPIDR2_EL0 23
6.4.4 Categories of memory 24
6.4.5 The Stack 24
6.4.6 The Frame Pointer 25
6.5 Subroutine calls 26
6.5.1 Use of IP0 and IP1 by the linker 26
6.5.2 Normal returns 26
6.6 The ZA lazy saving scheme 27
6.6.1 Overview 27
6.6.2 ZA save buffers 27
6.6.3 ZA states 28
6.6.4 Setting up a lazy save buffer 29
6.6.5 Abandoning a lazy save 30
6.6.6 Committing a lazy save 30
6.6.7 Preserving ZA 30
6.6.8 Complying with the lazy saving scheme 31
6.6.9 Changes to the TPIDR2 block 32
6.7 Types of subroutine interface 32
6.7.1 PSTATE.SM interfaces 32
6.7.2 ZA interfaces 33
6.8 Parameter passing 33
6.8.1 Variadic subroutines 33
6.8.2 Parameter passing rules 33
6.9 Result return 36
6.10 Interworking 36
7 The standard variants 37
7.1 Half-precision format compatibility 37
7.2 sizeof(long), sizeof(wchar_t), pointers 37
7.3 size_t, ptrdiff_t 37
8 Support routines 38
5
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
8.1 SME support routines 38
8.1.1 __arm_sme_state 38
8.1.2 __arm_tpidr2_save 39
8.1.3 __arm_za_disable 40
8.1.4 __arm_tpidr2_restore 40
9 Pseudo-code examples 41
9.1 SME pseudo-code examples 41
10 Arm C and C++ language mappings 43
10.1 Data types 43
10.1.1 Arithmetic types 43
10.1.2 Types varying by data model 45
10.1.3 Enumerated types 46
10.1.4 Additional types 46
10.1.5 Definition of va_list 46
10.1.6 Volatile data types 47
10.1.7 Structure, union and class layout 47
10.1.8 Bit-fields 47
10.2 Argument passing conventions 50
10.3 setjmp and longjmp 50
10.4 Exceptions 51
11 APPENDIX Support for Advanced SIMD Extensions 53
12 APPENDIX Support for Scalable vectors 54
13 APPENDIX C++ mangling 55
14 APPENDIX Variable argument lists 56
14.1 Register save areas 56
14.2 The va_list type 56
14.3 The va_start() macro 56
14.4 The va_arg() macro 58
15 Footnotes 60
6
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
2 About this document
2.1 Change control
7
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Issue Date Change
2019Q4 30th January 2020 Github release with an open source license.
Major changes:
2020Q2 1st July 2020 Add requirements for stack space with MTE tags. Extend the AAPCS64 to
support SVE types and registers. Conform aapcs64 volatile bit-fields rules to
C/C++.
2020Q3 1st October 2020 Specify ABI handling for 8.7-A's new FPCR bits.
2023Q3 6th October 2023 In Data Types include _BitInt(N) in language mapping.
2.1.3 References
This document refers to, or is referred to by, the following documents:
AAPCS64 Source for this document Procedure Call Standard for the Arm 64-bit
Architecture
CPPABI64 IHI 0059 C++ ABI for the Arm 64-bit Architecture
8
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Ref URL or other reference Title
1. The specifications to which an executable must conform in order to execute in a specific execution
environment. For example, the Linux ABI for the Arm Architecture.
2. A particular aspect of the specifications to which independently produced relocatable files must conform in
order to be statically linkable and executable. For example, the CPPABI64, AAELF64, ...
Arm-based
... based on the Arm architecture ...
Floating point
Depending on context floating point means or qualifies: (a) floating-point arithmetic conforming to IEEE 754 2008;
(b) the Armv8 floating point instruction set; (c) the register set shared by (b) and the Armv8 SIMD instruction set.
Q-o-I
Quality of Implementation – a quality, behavior, functionality, or mechanism not required by this standard, but
which might be provided by systems conforming to it. Q-o-I is often used to describe the toolchain-specific means
by which a standard requirement is met.
MTE
The Arm architecture's Memory Tagging Extension.
SIMD
Single Instruction Multiple Data – A term denoting or qualifying: (a) processing several data items in parallel
under the control of one instruction; (b) the Armv8 SIMD instruction set: (c) the register set shared by (b) and the
Armv8 floating point instruction set.
SIMD and floating point
The Arm architecture’s SIMD and Floating Point architecture comprising the floating point instruction set, the
SIMD instruction set and the register set shared by them.
SME
9
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
The Arm architecture's Scalable Matrix Extension.
SVE
The Arm architecture's Scalable Vector Extension.
SVL
Streaming Vector Length; that is, the number of bits in a Scalable Vector when the processor is in streaming
mode.
SVL.B
As for SVL, but measured in bytes rather than bits.
T32
The instruction set named Thumb in the Armv7 architecture; T32 uses 16-bit and 32-bit instructions.
VG
The number of 64-bit “vector granules” in an SVE vector; in other words, the number of bits in an SVE vector
register divided by 64.
ILP32
SysV-like data model where int, long int and pointer are 32-bit.
LP64
SysV-like data model where int is 32-bit, but long int and pointer are 64-bit.
LLP64
Windows-like data model where int and long int are 32-bit, but long long int and pointer are 64-bit.
This document uses the following terms:
Routine, subroutine
A fragment of program to which control can be transferred that, on completing its task, returns control to its caller
at an instruction following the call. Routine is used for clarity where there are nested calls: a routine is the caller
and a subroutine is the callee.
Procedure
A routine that returns no result value.
Function
A routine that returns a result value.
Activation stack, call-frame stack
The stack of routine activation records (call frames).
Activation record, call frame
The memory used by a routine for saving registers and holding local variables (usually allocated on a stack, once
per activation of the routine).
PIC, PID
Position-independent code, position-independent data.
Argument, parameter
The terms argument and parameter are used interchangeably. They may denote a formal parameter of a routine
given the value of the actual parameter when the routine is called, or an actual parameter, according to context.
Externally visible [interface]
[An interface] between separately compiled or separately assembled routines.
Variadic routine
A routine is variadic if the number of arguments it takes, and their type, is determined by the caller instead of the
callee.
Global register
A register whose value is neither saved nor destroyed by a subroutine. The value may be updated, but only in a
manner defined by the execution environment.
10
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Program state
The state of the program’s memory, including values in machine registers.
Scratch register, temporary register, caller-saved register
A register used to hold an intermediate value during a calculation (usually, such values are not named in the
program source and have a limited lifetime). If a function needs to preserve the value held in such a register over
a call to another function, then the calling function must save and restore the value.
Callee-saved register
A register whose value must be preserved over a function call. If the function being called (the callee) needs to
use the register, then it is responsible for saving and restoring the old value.
SysV
Unix System V. A variant of the Unix Operating System. Although this specification refers to SysV, many other
operating systems, such as Linux or BSD use similar conventions.
Platform
A program execution environment such as that defined by an operating system or run-time environment. A
platform defines the specific variant of the ABI and may impose additional constraints. Linux is a platform in this
sense.
More specific terminology is defined when it is first used.
11
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
3 Scope
The AAPCS64 defines how subroutines can be separately written, separately compiled, and separately assembled to
work together. It describes a contract between a calling routine and a called routine, or between a routine and its
execution environment, that defines:
• Obligations on the caller to create a program state in which the called routine may start to execute.
• Obligations on the called routine to preserve the program state of the caller across the call.
• The rights of the called routine to alter the program state of its caller.
• Obligations on all routines to preserve certain global invariants.
This standard specifies the base for a family of Procedure Call Standard (PCS) variants generated by choices that
reflect arbitrary, but historically important, choice among:
• Byte order.
• Size and format of data types: pointer, long int and wchar_t and the format of half-precision floating-point values.
Here we define three data models (see The standard variants and Arm C and C++ language mappings for
details):
• ILP32: (Beta) SysV-like variant where int, long int and pointer are 32-bit.
• LP64: SysV-like variant where int is 32-bit, but long int and pointer are 64-bit.
• LLP64: Windows-like variant where int and long int are 32-bit, but long long int and pointer are 64-bit.
• Whether floating-point operations use floating-point hardware resources or are implemented by calls to
integer-only routines 2.
This specification does not standardize the representation of publicly visible C++-language entities that are not also C
language entities (these are described in CPPABI64) and it places no requirements on the representation of language
entities that are not visible across public interfaces.
12
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
4 Introduction
The AAPCS64 is the first revision of Procedure Call standard for the Arm 64-bit Architecture. It forms part of the
complete ABI specification for the Arm 64-bit Architecture.
4.2 Conformance
The AAPCS64 defines how separately compiled and separately assembled routines can work together. There is an
externally visible interface between such routines. It is common that not all the externally visible interfaces to software
are intended to be publicly visible or open to arbitrary use. In effect, there is a mismatch between the machine-level
concept of external visibility—defined rigorously by an object code format—and a higher level, application-oriented
concept of external visibility—which is system specific or application specific.
Conformance to the AAPCS64 requires that 3:
• At all times, stack limits and basic stack alignment are observed (Universal stack constraints).
• At each call where the control transfer instruction is subject to a BL-type relocation at static link time, rules on the
use of IP0 and IP1 are observed (Use of IP0 and IP1 by the Linker).
• The routines of each publicly visible interface conform to the relevant procedure call standard variant.
• The data elements 4 of each publicly visible interface conform to the data layout rules.
13
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
5 Data types and alignment
5.1 Fundamental Data Types
Table 1, shows the fundamental data types (Machine Types) of the machine.
Natural
Byte Alignment
Type class Machine type size (bytes) Note
Signed byte 1 1
Unsigned half-word 2 2
Signed half-word 2 2
Unsigned word 4 4
Signed word 4 4
Unsigned double-word 8 8
Signed double-word 8 8
Unsigned quad-word 16 16
Signed quad-word 16 16
Double precision 8 8
Quad precision 16 16
128-bit decimal fp 16 16
128-bit vector 16 16
14
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Natural
Byte Alignment
Type class Machine type size (bytes) Note
The first two formats are mutually exclusive. The base standard of the AAPCS specifies use of the IEEE 754-2008
variant, and a procedure call variant that uses the Arm Alternative format is permitted.
Note
There is no support in the AArch64 ISA for Decimal Floating Point, so all operations must be emulated in software.
15
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
5.4 Short Vectors
A short vector is a machine type that is composed of repeated instances of one fundamental integral or floating-point
type. It may be 8 or 16 bytes in total size. A short vector has a base type that is the fundamental integral or
floating-point type from which it is composed, but its alignment is always the same as its total size. The number of
elements in the short vector is always such that the type is fully packed. For example, an 8-byte short vector may
contain 8 unsigned byte elements, 4 unsigned half-word elements, 2 single-precision floating-point elements, or any
other combination where the product of the number of elements and the size of an individual element is equal to 8.
Similarly, for 16-byte short vectors the product of the number of elements and the size of the individual elements must
be 16.
Elements in a short vector are numbered such that the lowest numbered element (element 0) occupies the lowest
numbered bit (bit zero) in the vector and successive elements take on progressively increasing bit positions in the
vector. When a short vector transferred between registers and memory it is treated as an opaque object. That is a
short vector is stored in memory as if it were stored with a single STR of the entire register; a short vector is loaded
from memory using the corresponding LDR instruction. On a little-endian system this means that element 0 will always
contain the lowest addressed element of a short vector; on a big-endian system element 0 will contain the
highest-addressed element of a short vector.
A language binding may define extended types that map directly onto short vectors. Short vectors are not otherwise
created spontaneously (for example because a user has declared an aggregate consisting of eight consecutive
byte-sized objects).
5.7 Pointers
Code and data pointers are either 64-bit or 32-bit unsigned types 5. A NULL pointer is always represented by
all-bits-zero.
All 64 bits in a 64-bit pointer are always significant. When tagged addressing is enabled, a tag is part of a pointer’s
value for the purposes of pointer arithmetic. The result of subtracting or comparing two pointers with different tags is
unspecified. See also Memory addresses, below. A 32-bit pointer does not support tagged addressing.
Note
(Beta)
16
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
The A64 load and store instructions always use the full 64-bit base register and perform a 64-bit address
calculation. Care must be taken within ILP32 to ensure that the upper 32 bits of a base register are zero and 32-bit
register offsets are sign-extended to 64 bits (immediate offsets are implicitly extended).
• In a little-endian view of memory the least significant byte of a data object is at the lowest byte address the data
object occupies in memory.
• In a big-endian view of memory the least significant byte of a data object is at the highest byte address the data
object occupies in memory.
MSB LSB M+ 3
M+ 2
M+ 1
M+ 0
M+ 3
M+ 2
M+ 1
MSB LSB M+ 0
31 0
• An aggregate, where the members are laid out sequentially in memory (possibly with inter-member padding).
• A union, where each of the members has the same address.
• An array, which is a repeated sequence of some other type (its base type).
17
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
The definitions are recursive; that is, each of the types may contain a Composite Type as a member.
• The member alignment of an element of a composite type is the alignment of that member after the application
of any language alignment modifiers to that member
• The natural alignment of a composite type is the maximum of each of the member alignments of the 'top-level'
members of the composite type i.e. before any alignment adjustment of the entire composite is applied
5.9.1 Aggregates
5.9.2 Unions
5.9.3 Arrays
18
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
5.9.5.2 Homogeneous Short-Vector Aggregates (HVA)
A Homogeneous Short-Vector Aggregate (HVA) is a Homogeneous Aggregate with a Fundamental Data Type that is a
Short-Vector type and at most four uniquely addressable members.
As with Homogeneous Aggregates, these rules apply after data layout is completed and without regard to access
control or other source language restrictions. However, there are several notable differences from Homogeneous
Aggregates:
• A Pure Scalable Type may contain a mixture of different Fundamental Data Types. For example, an aggregate
that contains a scalable vector of 8-bit elements, a scalable predicate, and a scalable vector of 16-bit elements is
a Pure Scalable Type.
• Alignment and padding do not play a role when determining whether something is a Pure Scalable Type. (In fact,
a Pure Scalable Type that contains both predicate types and vector types will often contain padding.)
• Pure Scalable Types are never unions and never contain unions.
Note
Composite Types have at least one member and the type of each member is either a Fundamental Data Type or
another Composite Type. Since all Fundamental Data Types have nonzero size, it follows that all members of a
Composite Type have nonzero size.
Any language-level members that have zero size must therefore disappear in the language-to-ABI mapping and do
not affect whether the containing type is a Pure Scalable Type.
19
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
6 The Base Procedure Call Standard
The base standard defines a machine-level calling standard for the A64 instruction set. It assumes the availability of
the vector registers for passing floating-point and SIMD arguments. Application code is expected to conform to one of
three data models defined in this standard; ILP32, LP64 or LLP64.
r18 The Platform Register, if needed; otherwise a temporary register. See notes.
r17 IP1 The second intra-procedure-call temporary register (can be used by call
veneers and PLT code); at other times may be used as a temporary register.
r16 IP0 The first intra-procedure-call scratch register (can be used by call veneers and
PLT code); at other times may be used as a temporary register.
The first eight registers, r0-r7, are used to pass argument values into a subroutine and to return result values from a
function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine
calls).
Registers r16 (IP0) and r17 (IP1) may be used by a linker as a scratch register between a routine and any subroutine it
calls (for details, see Use of IP0 and IP1 by the linker). They can also be used within a routine to hold intermediate
values between subroutine calls.
20
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
The role of register r18 is platform specific. If a platform ABI has need of a dedicated general-purpose register to carry
inter-procedural state (for example, the thread context) then it should use this register for that purpose. If the platform
ABI has no such requirements, then it should use r18 as an additional temporary register. The platform ABI
specification must document the usage for this register.
Note
Software developers creating platform-independent code are advised to avoid using r18 if at all possible. Most
compilers provide a mechanism to prevent specific registers from being used for general allocation; portable
hand-coded assembler should avoid it entirely. It should not be assumed that treating the register as callee-saved
will be sufficient to satisfy the requirements of the platform. Virtualization code must, of course, treat the register as
they would any other resource provided to the virtual machine.
A subroutine invocation must preserve the contents of the registers r19-r29 and SP. All 64 bits of each value stored in
r19-r29 must be preserved, even when using the ILP32 data model (Beta).
In all variants of the procedure call standard, registers r16, r17, r29 and r30 have special roles. In these roles they are
labeled IP0, IP1, FP and LR when being used for holding addresses (that is, the special name implies accessing the
register as a 64-bit entity).
Note
The special register names (IP0, IP1, FP and LR) should be used only in the context in which they are special. It is
recommended that disassemblers always use the architectural names for the registers.
The NZCV register is a global condition flag register with the following properties:
• The N, Z, C and V flags are undefined on entry to and return from a public interface.
Note
Unlike in AArch32, in AArch64 the 128-bit and 64-bit views of a SIMD and Floating-Point register do not overlap
multiple registers in a narrower view, so q1, d1 and s1 all refer to the same entry in the register bank.
The first eight registers, v0-v7, are used to pass argument values into a subroutine and to return result values from a
function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine
calls).
Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do
not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value
stored in v8-v15 need to be preserved 8; it is the responsibility of the caller to preserve larger values.
The FPSR is a status register that holds the cumulative exception bits of the floating-point unit. It contains the fields
IDC, IXC, UFC, OFC, DZC, IOC and QC. These fields are not preserved across a public interface and may have any
value on entry to a subroutine.
The FPCR is used to control the behavior of the floating-point unit. It is a global register with the following properties.
• The exception-control bits (8-12), rounding mode bits (22-23), flush-to-zero bits (24), and the AH and FIZ bits
(0-1) may be modified by calls to specific support functions that affect the global state of the application.
• The NEP bit (bit 2) must be zero on entry to and return from a public interface.
21
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
• All other bits are reserved and must not be modified. It is not defined whether the bits read as zero or one, or
whether they are preserved across a public interface.
Decimal Floating-Point emulation code requires additional control bits which cannot be stored in the FPCR. Since the
information must be held for each thread of execution, the state must be held in thread-local storage on platforms
where multi-threaded code is supported. The exact location of such information is platform specific.
22
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
the platform supports multiple processes but has no separate concept of threads, each process will have a single
thread of execution. If a platform has no concurrency or preemption then there will be a single thread and process that
executes all instructions.
Each thread has its own register state, defined by the contents of the underlying machine registers. A process has a
program state defined by its threads' register states and by the contents of the memory that the process can access.
The memory that a process can access, without causing a run-time fault, may vary during the execution of its threads.
6.4.3 TPIDR2_EL0
(Alpha)
This section only applies to threads that have access to TPIDR2_EL0.
Conforming software must ensure that, all times during the execution of a thread, TPIDR2_EL0 is in one of two states:
• TPIDR2_EL0 is null.
• TPIDR2_EL0 points to a “TPIDR2 block” with the format below, and the thread has read access to every byte of
the block.
23
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Byte offset Type Referred to in the AAPCS64 as
Note that the field names are just a notational convenience. Language bindings may choose different names.
The reserved parts of the block are defined to be zero by this revision of the AAPCS64. All nonzero values are
reserved for use by future revisions of the AAPCS64.
If TPIDR2_EL0 is nonnull and if any reserved byte in the first 16 bytes of the TPIDR2 block has a nonzero value, the
thread must do one of the following:
Byte offsets of 16 and greater are reserved for use by future revisions of the AAPCS64.
Negative offsets from TPIDR2_EL0 are reserved for use by the platform. The block of data stored at negative offsets is
therefore referred to as the “platform TPIDR2 block”; this block starts at a platform-defined offset from TPIDR2_EL0
and ends at TPIDR2_EL0.
In the rest of this document, za_save_buffer and num_za_save_slices (without qualification) refer to the fields
of a TPIDR2 block at address TPIDR2_EL0. BLK.za_save_buffer and BLK.num_za_save_slices instead refer
to the fields of a TPIDR2 block at address BLK.
See Changes to the TPIDR2 block for additional requirements relating to the TPIDR2 block.
• Code (the program being executed), which must be readable, but need not be writable, by the process.
• Read-only static data.
• Writable static data.
• The heap.
• Stacks, with one stack for each thread.
Each category of memory can contain multiple individual regions. These individual regions do not need to be
contiguous and regions of one memory class can be interspersed with regions of another memory class.
Writable static data may be further sub-divided into initialized, zero-initialized, and uninitialized data.
The heap is an area (or areas) of memory that the process manages itself (for example, with the C malloc function). It
is typically used to create dynamic data objects.
Each individual stack must occupy a single, contiguous region of memory. However, as noted above, multiple stacks
do not need to be organized contiguously.
A process must always have access to code and stacks, and it may have access to any of the other categories of
memory.
A conforming program must only execute instructions that are in areas of memory designated to contain code.
24
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
The stack is defined in terms of three values:
• a base
• a limit
• the current stack extent, stored in the special-purpose register SP
The SP moves from the base to the limit as the stack grows, and from the limit to the base as the stack shrinks. In
practice, an application might not be able to determine the value of either the base or the limit.
In the description below, the base, limit, and current stack extent for a thread T are denoted T.base, T.limit, and T.SP
respectively.
The stack implementation is full-descending, so that for each thread T:
• T.limit < T.base and the stack occupies the area of memory delimited by the half-open internal [T.limit, T.base).
• The active region of T's stack is the area of memory delimited by the half-open interval [T.SP, T.base). The
active region is empty when T.SP is equal to T.base.
• The inactive region of T's stack is the area of memory denoted by the half-open interval [T.limit, T.SP). The
inactive region is empty when T.SP is equal to T.limit.
The stack may have a fixed size or be dynamically extendable (by adjusting the stack-limit downwards).
The rules for maintenance of the stack are divided into two parts: a set of constraints that must be observed at all
times, and an additional constraint that must be observed at a public interface.
• T.limit ≤ T.SP ≤ T.base. T's stack pointer must lie within the extent of the memory occupied by S.
• No thread is permitted to access (for reading or for writing) the inactive region of S.
• If MTE is enabled, then the tag stored in T.SP must match the tag set on the inactive region of S.
Additionally, at any point at which memory is accessed via SP, the hardware requires that
25
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Note
There will always be a short period during construction or destruction of each frame record during which the frame
pointer will point to the caller’s record.
A platform shall mandate the minimum level of conformance with respect to the maintenance of frame records. The
options are, in decreasing level of functionality:
• It may require the frame pointer to address a valid frame record at all times, except that small subroutines which
do not modify the link register may elect not to create a frame record
• It may require the frame pointer to address a valid frame record at all times, except that any subroutine may elect
not to create a frame record
• It may permit the frame pointer register to be used as a general-purpose callee-saved register, but provide a
platform-specific mechanism for external agents to reliably detect this condition
• It may elect not to maintain a frame chain and to use the frame pointer register as a general-purpose
callee-saved register.
Note
R_AARCH64_CALL26, and R_AARCH64_JUMP26 are the ELF relocation types with this property.
26
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
The main distinction is between a “normal return” and an “exceptional return”, where “exceptional return” includes
things like thread cancelation and C++ exception handling.
6.6.1 Overview
(Alpha)
SME provides a piece of storage called “ZA” that can be enabled and disabled using a processor state bit called
“PSTATE.ZA”. This storage is SVL.B × SVL.B bytes in size. It can be accessed both horizontally and vertically, with
each horizontal and vertical “slice” containing SVL.B bytes.
The size of ZA can therefore vary between implementations of the Arm Architecture. For a 256-bit SME
implementation, ZA can hold the same amount of data as the 32 Scalable Vector Registers. For a 512-bit SME
implementation, ZA can handle twice as much data as the vector registers, and so on.
Suppose that a subroutine S1 with live data in ZA calls a subroutine S2 that has no knowledge of S1. If the AAPCS64
defined ZA to be “call-preserved” (“callee-saved”), S1 would need to save and restore ZA around S1's own use of ZA,
in case S1's caller also had live data in ZA. If the AAPCS64 defined ZA to be “call-clobbered” (“caller-saved”), S1
would need to save and restore ZA around the call to S2, in case S2 also used ZA. However, nested uses of ZA are
expected to be rare, so these saves and restores would usually be wasted work.
The AAPCS64 therefore defines a “lazy saving” scheme that often reduces the total number of saves and restores
compared to the two approaches above. Informally, the scheme allows “ZA is call-preserved” to become a dynamic
rather than a static property: if S2 complies with the lazy saving scheme, S1 can test after the call to S2 whether the
call did in fact preserve ZA. If the call did not preserve ZA, S1 is able to restore the old contents of ZA from a known
buffer.
The procedure is as follows:
27
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
6.6.3 ZA states
(Alpha)
The AAPCS64 uses the term “ZA_LIVE” to refer the number of leading horizontal slices of ZA that might have useful
contents. That is, horizontal slice ZA[i] can have useful contents only if i is less than ZA_LIVE.
The term “live contents of ZA” refers to the first ZA_LIVE horizontal slices of ZA.
The current state of ZA depends on PSTATE.ZA and on the following fields of the TPIDR2 block:
za_save_buffer
a 64-bit data pointer at byte offset 0 from TPIDR2_EL0
num_za_save_slices
an unsigned halfword at byte offset 8 from TPIDR2_EL0
At any given time during the execution of a thread, ZA must be in one of three states:
off
PSTATE.ZA is 0 and either:
• TPIDR2_EL0 is null; or
• both of the following are true:
There are several reasons why the thread might be in this state. Example scenarios include:
• The thread has set PSTATE.ZA to 1 in preparation for using ZA, but it has not yet put any data into ZA.
• The thread no longer has useful data in ZA, but it has not yet cleared PSTATE.ZA.
• The thread is actively using ZA and it is not executing a call that would benefit from the lazy saving scheme.
dormant
All of the following are true:
• PSTATE.ZA is 1
• TPIDR2_EL0 is nonnull
• za_save_buffer is nonnull
• za_save_buffer points to a ZA save buffer of size W×16×SVL.B for some cardinal W
• 0 < num_za_save_slices ≤ W×16
• only the first num_za_save_slices horizontal slices in ZA have useful contents (that is, ZA_LIVE is equal
to num_za_save_slices); and
• the lazy saving scheme is in use for those ZA contents.
28
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
A thread that is in this state must have read and write access to every byte of za_save_buffer. The thread
may store the live contents of ZA to this buffer at any time. A thread must not store any other data into
za_save_buffer while the thread is in this state.
That is, if i < SVL.B and if j < W×16, the thread may store byte i of horizontal slice ZA[j] to the following address at
any time:
za_save_buffer + j * SVL.B + i
The thread must not store any other value to that address.
A consequence of the above requirements is that:
• If the thread needs to set PSTATE.ZA to 1 and set za_save_buffer to a nonnull value, it must set
PSTATE.ZA to 1 first.
• If the thread needs to set PSTATE.ZA to 0 and set TPIDR2_EL0 to null, it must set TPIDR2_EL0 to null first.
Therefore, it is not possible for a running thread to enter the ZA dormant state directly from the ZA off state; the thread
must go through the ZA active state first. The same is true in reverse: it is not possible for a running thread to enter the
ZA off state directly from the ZA dormant state; the thread must go through the ZA active state first.
The following table summarizes the three ZA states:
PSTATE.ZA 0 1 1
off Turn ZA on, for example using SMSTART or SMSTART ZA. active
active Turn ZA off, for example using SMSTOP or SMSTOP ZA. off
active Set up a lazy save buffer for the current ZA contents. dormant
dormant Commit the lazy save, storing the current ZA contents to active
za_save_buffer.
29
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Both BLK and B would typically be on the stack, but they do not need to be.
Note that TPIDR2_EL0 must necessarily be null before this procedure; see the requirements for the ZA active state for
details.
• Store the first num_za_save_slices horizontal slices of ZA to za_save_buffer. The thread can (at its
option) store to any other part of za_save_buffer as well. In particular, the thread can store slices in groups of
16 regardless of whether num_za_save_slices is a multiple of 16.
The easiest way of doing this while following the requirements for reserved bytes in the TPIDR2 block is to call
__arm_tpidr2_save; see SME support routines for details.
• Set TPIDR2_EL0 to null. At this point ZA becomes active and its contents can be changed. It also becomes
possible to turn ZA off.
If a call to a subroutine S returns normally, the call is said to “commit a lazy save” for that particular return if all the
following conditions are true:
• ZA is dormant on entry to S, which implies that TPIDR2_EL0 has a nonzero value (“BLK”) on entry to S.
• On return from S, all the following conditions are true:
• ZA is off.
• BLK.za_save_buffer holds the data that was stored in the first BLK.num_za_save_slices horizontal
slices of ZA on entry to S.
More generally, a call to a subroutine S is said to “commit a lazy save” if the call returns normally at least once and if
the call commits a lazy save for one such return.
6.6.7 Preserving ZA
(Alpha)
If a call to a subroutine S returns normally, the call is said to “preserve ZA” for that particular return if one of the
following conditions is true:
• ZA is dormant.
• TPIDR2_EL0 has the same value (“BLK”) as it did on entry to S.
• The contents of BLK on return from S are the same as they were on entry to S.
30
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
• The first BLK.num_za_save_slices horizontal slices of ZA have the same contents on return from S as
they did on entry to S.
• ZA is active on entry to S and all the following conditions are true on return from S:
• ZA is active.
• Every byte of ZA has the same value on return from S as it did on entry to S.
More generally, a call to a subroutine S is said to “preserve ZA” if the call preserves ZA every time that the call returns
normally. A call trivially satisfies this requirement if the call never returns normally.
A subroutine S can (at its option) choose to guarantee that every possible call to S preserves ZA. S itself is then said
to “preserve ZA”.
A call trivially satisfies this requirement if the call never returns normally.
A subroutine S is said to “comply with the lazy saving scheme” if every possible call to S complies with the lazy saving
scheme.
From a quality-of-implementation perspective, the following considerations might affect the choice between committing
a lazy save and preserving ZA:
• Keeping PSTATE.ZA set to 1 for a (subjectively) “long” time might increase the chances that higher exception
levels will need to save and restore ZA.
• Keeping PSTATE.ZA set to 1 for a (subjectively) “long” time might be less energy-efficient than committing the
lazy save.
• Committing the lazy save will almost certainly require the caller to restore ZA, whereas preserving ZA will not.
As a general rule, a subroutine S that complies with the lazy saving scheme is encouraged to do the following:
• Commit the lazy save if preserving ZA would require S to restore ZA. For example, this would be true if S directly
changes PSTATE.ZA or ZA.
• Clear PSTATE.ZA soon after a lazy save, unless S is about to use ZA for something else.
• Rely on the lazy saving scheme for any calls that S makes. For example, if S is simply a wrapper around a call to
another subroutine S2 and if S2 also complies with the lazy saving scheme, S is encouraged to delegate the
handling of the lazy saving scheme to S2.
• Use __arm_tpidr2_save when committing a lazy save. This ensures that the code will handle future
extensions safely or abort.
The intention is that the vast majority of SME-unaware subroutines would naturally comply with the lazy saving
scheme and so would not need to become SME-aware. Counterexamples include the C library subroutine longjmp;
see setjmp and longjmp for details.
31
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
6.6.9 Changes to the TPIDR2 block
(Alpha)
If, for a particular call C to a subroutine S:
then conforming software must ensure that, every time that S returns normally from C, BLK still has the same contents
as it did on entry to S. For example, this means that:
• No subroutine is permitted to modify BLK directly until S returns from C for the final time.
• No subroutine is permitted to induce another subroutine to modify BLK until S returns from C for the final time.
This requirement applies to callers of S as well as to S and its callees.
For example, the C function memset complies with the lazy saving scheme. The following artificial pseudo-code is
therefore non-conforming, because it induces memset to modify BLK before memset has returned:
Non-streaming 0 0
Streaming 1 1
Every subroutine has exactly one PSTATE.SM interface. A subroutine's PSTATE.SM interface is independent of all
other aspects of its interface. Callers must know which PSTATE.SM interface a callee has.
All subroutines that were written before the introduction of SME are retroactively classified as having a non-streaming
interface.
In the table above, the “PSTATE.SM on entry” column describes a requirement on callers: it is the caller's
responsibility to ensure that PSTATE.SM has a valid value on entry to a callee. The “PSTATE.SM on normal return”
column describes a requirement on callees: callees must ensure that PSTATE.SM has a valid value before returning
to their caller.
If a subroutine has a streaming-compatible interface, it can call __arm_sme_state to determine whether the current
thread has access to SME and, if so, what the current value of PSTATE.SM is.
32
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
6.7.2 ZA interfaces
(Alpha)
As noted in ZA states, there are three possible ZA states: off, dormant, and active. A subroutine's “ZA interface”
specifies the possible states of ZA on entry to a subroutine and the possible states of ZA on a normal return. The
AAPCS64 defines two types of ZA interface:
Every subroutine has exactly one ZA interface. A subroutine's ZA interface is independent of all other aspects of its
interface. Callers must know which ZA interface a callee has.
All subroutines that were written before the introduction of SME are retroactively classified as having a private-ZA
interface.
Every subroutine with a private-ZA interface must comply with the lazy saving scheme.
The shared-ZA interface is so called because it allows the subroutine to share ZA contents with its caller. This can be
useful if an SME operation is split into several cooperating subroutines.
Subroutines with a private-ZA interface and subroutines with a shared-ZA interface can both (at their option) choose to
guarantee that they preserve ZA.
• A mapping from the type of a source language argument onto a machine type.
• The marshaling of machine types to produce the final parameter list.
The mapping from a source language type onto a machine type is specific for each language and is described
separately (the C and C++ language bindings are described in Arm C and C++ language mappings). The result is an
ordered list of arguments that are to be passed to the subroutine.
For a caller, sufficient stack space to hold stacked argument values is assumed to have been allocated prior to
marshaling: in practice the amount of stack space required cannot be known until after the argument marshaling has
been completed. A callee is permitted to modify any stack space used for receiving parameter values from the caller.
33
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Stage A - Initialization
This stage is performed exactly once, before processing of the arguments commences.
A.2 The Next SIMD and Floating-point Register Number (NSRN) is set to zero.
A.3 The Next Scalable Predicate Register Number (NPRN) is set to zero.
A.4 The next stacked argument address (NSAA) is set to the current stack-pointer value (SP).
For each argument in the list the first matching rule from the following list is applied. If no rule matches the
argument is used unmodified.
B.1 If the argument type is a Pure Scalable Type, no change is made at this stage.
B.2 If the argument type is a Composite Type whose size cannot be statically determined by
both the caller and the callee, the argument is copied to memory and the argument is
replaced by a pointer to the copy. (There are no such types in C/C++ but they exist in other
languages or in language extensions).
B.3 If the argument type is an HFA or an HVA, then the argument is used unmodified.
B.4 If the argument type is a Composite Type that is larger than 16 bytes, then the argument is
copied to memory allocated by the caller and the argument is replaced by a pointer to the
copy.
B.5 If the argument type is a Composite Type then the size of the argument is rounded up to the
nearest multiple of 8 bytes.
B.6 If the argument is an alignment adjusted type its value is passed as a copy of the actual
value. The copy will have an alignment defined as follows:
• For a Fundamental Data Type, the alignment is the natural alignment of that type,
after any promotions.
• For a Composite Type, the alignment of the copy will have 8-byte alignment if its
natural alignment is ≤ 8 and 16-byte alignment if its natural alignment is ≥ 16.
The alignment of the copy is used for applying marshaling rules.
For each argument in the list the following rules are applied in turn until the argument has been allocated. When an
argument is assigned to a register any unused bits in the register have unspecified value. When an argument is
assigned to a stack slot any unused padding bytes have unspecified value.
C.1 If the argument is a Half-, Single-, Double- or Quad- precision Floating-point or short vector type and
the NSRN is less than 8, then the argument is allocated to the least significant bits of register v[NSRN].
The NSRN is incremented by one. The argument has now been allocated.
C.2 If the argument is an HFA or an HVA and there are sufficient unallocated SIMD and Floating-point
registers (NSRN + number of members ≤ 8), then the argument is allocated to SIMD and Floating-point
registers (with one register per member of the HFA or HVA). The NSRN is incremented by the number
of registers used. The argument has now been allocated.
C.3 If the argument is an HFA or an HVA then the NSRN is set to 8 and the size of the argument is rounded
up to the nearest multiple of 8 bytes.
34
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Stage C – Assignment of arguments to registers and stack
C.4 If the argument is an HFA, an HVA, a Quad-precision Floating-point or short vector type then the NSAA
is rounded up to the next multiple of 8 if its natural alignment is ≤ 8 or the next multiple of 16 if its natural
alignment is ≥ 16.
C.5 If the argument is a Half- or Single- precision Floating Point type, then the size of the argument is set to
8 bytes. The effect is as if the argument had been copied to the least significant bits of a 64-bit register
and the remaining bits filled with unspecified values.
C.6 If the argument is an HFA, an HVA, a Half-, Single-, Double- or Quad- precision Floating-point or short
vector type, then the argument is copied to memory at the adjusted NSAA. The NSAA is incremented
by the size of the argument. The argument has now been allocated.
C.7 If the argument is a Pure Scalable Type that consists of NV Scalable Vector Types and NP Scalable
Predicate Types, if the argument is named, if NSRN+NV ≤ 8, and if NPRN+NP ≤ 4, then the Scalable
Vector Types are allocated in order to z[NSRN]…z[NSRN+NV-1] inclusive and the Scalable Predicate
Types are allocated in order to p[NPRN]…p[NPRN+NP-1] inclusive. The NSRN is incremented by NV
and the NPRN is incremented by NP. The argument has now been allocated.
C.8 If the argument is a Pure Scalable Type that has not been allocated by the rules above, then the
argument is copied to memory allocated by the caller and the argument is replaced by a pointer to the
copy (as for B.4 above). The argument is then allocated according to the rules below.
C.9 If the argument is an Integral or Pointer Type, the size of the argument is less than or equal to 8 bytes
and the NGRN is less than 8, the argument is copied to the least significant bits in x[NGRN]. The NGRN
is incremented by one. The argument has now been allocated.
C.10 If the argument has an alignment of 16 then the NGRN is rounded up to the next even number.
C.11 If the argument is an Integral Type, the size of the argument is equal to 16 and the NGRN is less than
7, the argument is copied to x[NGRN] and x[NGRN+1]. x[NGRN] shall contain the lower addressed
double-word of the memory representation of the argument. The NGRN is incremented by two. The
argument has now been allocated.
C.12 If the argument is a Composite Type and the size in double-words of the argument is not more than 8
minus NGRN, then the argument is copied into consecutive general-purpose registers, starting at
x[NGRN]. The argument is passed as though it had been loaded into the registers from a
double-word-aligned address with an appropriate sequence of LDR instructions loading consecutive
registers from memory (the contents of any unused parts of the registers are unspecified by this
standard). The NGRN is incremented by the number of registers used. The argument has now been
allocated.
C.14 The NSAA is rounded up to the larger of 8 or the Natural Alignment of the argument’s type.
C.15 If the argument is a composite type then the argument is copied to memory at the adjusted NSAA. The
NSAA is incremented by the size of the argument. The argument has now been allocated.
C.16 If the size of the argument is less than 8 bytes then the size of the argument is set to 8 bytes. The effect
is as if the argument was copied to the least significant bits of a 64-bit register and the remaining bits
filled with unspecified values.
C.17 The argument is copied to memory at the adjusted NSAA. The NSAA is incremented by the size of the
argument. The argument has now been allocated.
It should be noted that the above algorithm makes provision for languages other than C and C++ in that it provides for
passing arrays by value and for passing arguments of dynamic size. The rules are defined in a way that allows the
caller to be always able to statically determine the amount of stack space that must be allocated for arguments that are
not passed in registers, even if the routine is variadic.
35
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Several observations can be made:
• The address of the first stacked argument is defined to be the initial value of SP. Therefore, the total amount of
stack space needed by the caller for argument passing cannot be determined until all the arguments in the list
have been processed.
• Floating-point and short vector types are passed in SIMD and Floating-point registers or on the stack; never in
general-purpose registers (except when they form part of a small structure that is neither an HFA nor an HVA).
• Unlike in the 32-bit AAPCS, named integral values must be narrowed by the callee rather than the caller.
• Unlike in the 32-bit AAPCS, half-precision floating-point values can be passed directly (and HFAs of
half-precision floats are also permitted).
• Any part of a register or a stack slot that is not used for an argument (padding bits) has unspecified content at
the callee entry point.
• The rules here do not require narrow arguments to subroutines to be widened. However a language may require
widening in some or all circumstances (for example, in C, unprototyped and variadic functions require
single-precision values to be converted to double-precision and char and short values to be converted to int.
• HFAs and HVAs are special cases of a composite type. If they are passed as parameters in registers then each
uniquely addressable element goes in its own register. However, if they are not allocated to registers then they
are always passed on the stack (never in general-purpose registers) and they are laid out in exactly the same
way as any other composite.
• Both before and after the layout of each argument, then NSAA will have a minimum alignment of 8.
would require that arg be passed as a value in a register (or set of registers) according to the rules in Parameter
passing, then the result is returned in the same registers as would be used for such an argument.
• Otherwise, the caller shall reserve a block of memory of sufficient size and alignment to hold the result. The
address of the memory block shall be passed as an additional argument to the function in x8. The callee may
modify the result memory block at any point during the execution of the subroutine (there is no requirement for
the callee to preserve the value stored in x8).
6.10 Interworking
Interworking between the 32-bit AAPCS and the AAPCS64 is not supported within a single process. (In AArch64, all
inter-operation between 32-bit and 64-bit machine states takes place across a change of exception level).
Interworking between data model variants of AAPCS64 (although technically possible) is not defined within a single
process.
36
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
7 The standard variants
7.1 Half-precision format compatibility
The set of values that can be represented in Arm Alternative format differs from the set that can be represented in
IEEE754-2008 format rendering code built to use either format incompatible with code that uses the other.
Nevertheless, most code will make no use of either format and will therefore be compatible with both variants.
37
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
8 Support routines
8.1 SME support routines
(Alpha)
Every platform that supports SME must provide the following runtime support routines:
__arm_sme_state
Provides a safe way of detecting:
__arm_tpidr2_save
Provides a safe way to commit a lazy save
__arm_za_disable
Provides a safe way to turn ZA off without losing data.
__arm_tpidr2_restore
Provides a simple way of restoring lazily-saved ZA data.
8.1.1 __arm_sme_state
(Alpha)
Platforms that support SME must provide a function with the following properties:
• Bit 63 of X0 is set to one if and only if the current thread has access to SME.
• Bit 62 of X0 is set to one if and only if the current thread has access to TPIDR2_EL0.
• Bits 2 to 61 of X0 are zero for this revision of the AAPCS64, but are reserved for future expansion.
• If the current thread has access to SME:
38
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
• If the current thread has access to TPIDR2_EL0, X1 contains the current value of TPIDR2_EL0.
Otherwise, X1 is zero.
• The only memory modified by the function (if any) is stack memory below the incoming SP.
8.1.2 __arm_tpidr2_save
(Alpha)
Platforms that support SME must provide a subroutine to commit a lazy save, with the subroutine having the following
properties:
• If the current thread does not have access to TPIDR2_EL0, the subroutine does nothing.
• If TPIDR2_EL0 is null, the subroutine does nothing.
• Otherwise:
• If any of the reserved bytes in the first 16 bytes of the TPIDR2 block are nonzero, the subroutine
either:
Note that the subroutine does not change TPIDR2_EL0 or PSTATE.ZA. If ZA was dormant on entry then it remains
dormant on return.
Note
The idea here is to make as many registers call-preserved as possible, so that the save does not require much
spilling in the caller.
Aborting for unrecognized reserved bytes prevents older runtimes from silently mishandling any future TPIDR2
state.
39
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
8.1.3 __arm_za_disable
(Alpha)
Platforms that support SME must provide a subroutine to set PSTATE.ZA to 0, with the subroutine having the following
properties:
• If the current thread does not have access to SME, the subroutine does nothing.
• Otherwise, the subroutine behaves as if it did the following:
• Call __arm_tpidr2_save.
• Set TPIDR2_EL0 to null.
• Set PSTATE.ZA to 0.
8.1.4 __arm_tpidr2_restore
(Alpha)
Platforms that support SME must provide a subroutine to restore data after a lazy save, with the subroutine having the
following properties:
40
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
• If BLK.za_save_buffer points to a ZA save buffer B and if BLK.num_za_save_slices contains the
value NS, then the subroutine restores the first NS horizontal slices of ZA from B. The subroutine can (at
its option) restore from any other part of B as well. In particular, the subroutine can restore slices in groups
of 16 regardless of whether NS is a multiple of 16.
• The only memory modified by the subroutine (if any) is stack memory below the incoming SP.
9 Pseudo-code examples
9.1 SME pseudo-code examples
(Alpha)
This section gives examples of various sequences that conform to the SME PCS rules. In each case, comments in the
code give the set of preconditions that are assumed to hold at the start of the sequence and the set of postconditions
that hold at the end of the sequence. These comments help to document the example's intended use case; the
example will only be useful in situations where the given preconditions apply and where the given postconditions
describe the desired outcome.
If S has a private-ZA interface, the following pseudo-code describes a conforming way for S to flush any dormant ZA
state before S uses ZA itself:
// Current state:
// PSTATE.SM == SM (might be 0 or 1)
// PSTATE.ZA == 0 or 1
// TPIDR2_EL0 might or might not be null
// Current state:
// PSTATE.SM == SM
// PSTATE.ZA == 1
// TPIDR2_EL0 is null
If S has a private-ZA interface, the following pseudo-code describes a conforming way for S to clear PSTATE.ZA. This
procedure is useful for things like the C subroutine longjmp (see setjmp and longjmp) and exception unwinders (see
Exceptions).
// Current state:
// PSTATE.SM == SM (might be 0 or 1)
// PSTATE.ZA == 0 or 1
// TPIDR2_EL0 might or might not be null
// Current state:
41
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
// PSTATE.SM == SM
// PSTATE.ZA == 0
// TPIDR2_EL0 is null
The following pseudo-code shows a basic example of a subroutine S1 calling a subroutine S2, where:
// Current state:
// PSTATE.SM == 1
// PSTATE.ZA == 1
// TPIDR2_EL0 is null
// the first NS horizontal slices of ZA contain useful contents
za_save_buffer_type B;
__arm_tpidr2_block BLK = {}; // Zero initialize.
BLK.za_save_buffer = &B;
BLK.num_za_save_slices = NS;
TPIDR2_EL0 = &BLK;
S2();
if (TPIDR2_EL0) {
// No need to restore ZA. Abandon the lazy save.
assert(TPIDR2_EL0 == &BLK);
TPIDR2_EL0 = nullptr;
} else {
// Restore the saved ZA contents.
__arm_tpidr2_restore(&BLK);
}
// Current state:
// PSTATE.SM == 1
// PSTATE.ZA == 1
// TPIDR2_EL0 is null
The same approach works if S1 calls multiple subroutines in succession. For example:
// Current state:
// PSTATE.SM == 1
// PSTATE.ZA == 1
// TPIDR2_EL0 is null
// the first NS slices of ZA contain useful contents
za_save_buffer_type B;
__arm_tpidr2_block BLK = {};
BLK.za_save_buffer = &B;
BLK.num_za_save_slices = NS;
42
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
TPIDR2_EL0 = &BLK;
if (TPIDR2_EL0) {
// No need to restore ZA. Abandon the lazy save.
assert(TPIDR2_EL0 == &BLK);
TPIDR2_EL0 = nullptr;
} else {
// Restore the saved ZA contents.
__arm_tpidr2_restore(&BLK);
}
// Current state:
// PSTATE.SM == 1
// PSTATE.ZA == 1
// TPIDR2_EL0 is null
43
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
C/C++ Type Machine Type Notes
[signed] long signed word or signed double- word See Types Varying by Data Model
unsigned long unsigned word or unsigned See Types Varying by Data Model
double-word
__fp16 half precision (IEEE754-2008 format or Arm extension. See Types Varying by
Arm Alternative Format) Data Model
long double _Imaginary quad precision (IEEE 754- 2008) C99 Only
long double _Complex 2 quad precision (IEEE 754-2008) C99 Only. Layout is
44
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
C/C++ Type Machine Type Notes
_BitInt(N <= 128) Smallest of the signed Fundamental C2x Only. Significant bits are allocated
Integral Data Types where byte-size*8 from least significant end of the
>= N. Machine Type. Non-significant bits
within the Machine Type are
unspecified.
unsigned _BitInt(N <= Smallest of the unsigned Fundamental C2x Only. Significant bits are allocated
128) Integral Data Types where byte-size*8 from least significant end of the
>= N. Machine Type. Non-significant bits
within the Machine Type are
unspecified.
_BitInt(N > 128) Mapped as if C2x Only. Significant bits are allocated
unsigned __int128[M] array from least significant end of the
where M*128 >= N. Last element Machine Type. The lower addressed
contains sign bit. quad-word contains the least significant
bits of the type on a little-endian view
and the most significant bits of the type
in a big-endian view. Non-significant
bits within the last quad-word are
unspecified.
unsigned _Bitint(N > 1 Mapped as if C2x Only. Significant bits are allocated
28) unsigned __int128[M] where from least significant end of the
M*128 >= N. Machine Type. The lower addressed
quad-word contains the least significant
bits of the type on a little-endian view
and the most significant bits of the type
in a big-endian view. Non-significant
bits within the last quad-word are
unspecified.
A platform ABI may specify a different combination of primitive variants but we discourage this.
45
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
C/C++ Type Machine Type Notes
T * 32-bit data pointer 64-bit data pointer 64-bit data pointer Any data type T
T (*F)() 32-bit code pointer 64-bit code pointer 64-bit code pointer Any function type F
T& 32-bit data pointer 64-bit data pointer 64-bit data pointer C++ reference
struct __va_list { A va_list may address any object in a parameter list. In C++,
va_list __va_list is in namespace std. See APPENDIX Variable
void *__stack;
void *__gr_top; argument lists.
void *__vr_top;
int __gr_offs;
int __vr_offs;
}
46
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
10.1.6 Volatile data types
A data type declaration may be qualified with the volatile type qualifier. The compiler may not remove any access to a
volatile data type unless it can prove that the code containing the access will never be executed; however, a compiler
may ignore a volatile qualification of an automatic variable whose address is never taken unless the function calls
setjmp(). A volatile qualification on a structure or union shall be interpreted as applying the qualification recursively to
each of the fundamental data types of which it is composed. Access to a volatile-qualified fundamental data type must
always be made by accessing the whole type.
The behavior of assigning to or from an entire structure or union that contains volatile-qualified members is undefined.
Likewise, the behavior is undefined if a cast is used to change either the qualification or the size of the type.
The memory system underlying the processor may have a restricted bus width to some or all of memory. The only
guarantee applying to volatile types in these circumstances are that each byte of the type shall be accessed exactly
once for each access mandated above, and that any bytes containing volatile data that lie outside the type shall not be
accessed. Nevertheless, a compiler shall use an instruction that will access the type exactly.
10.1.8 Bit-fields
A bit-field may have any integral type (including enumerated and bool types). A sequence of bit-fields is laid out in the
order declared using the rules below. For each bit-field, the type of its container is:
• Its declared type if its size is no larger than the size of its declared type.
• The largest integral type no larger than its size if its size is larger than the size of its declared type (see
Over-sized bit-fields).
The container type contributes to the alignment of the containing aggregate in the same way a plain (not bit-field)
member of that type would, without exception for zero-sized or anonymous bit-fields.
Note
The C++ standard states that an anonymous bit-field is not a member, so it is unclear whether or not an anonymous
bit-field of non-zero size should contribute to an aggregate’s alignment. Under this ABI it does.
The content of each bit-field is contained by exactly one instance of its container type. Initially, we define the layout of
fields that are no bigger than their container types.
CA(F) = &(container(F));
This address will always be at the Natural Alignment of the container type, that is
CA(F) % sizeof(container(F)) == 0.
47
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
• For big-endian data types K(F) is the offset from the most significant bit of the container to the most significant
bit of the bit-field.
• For little-endian data types K(F) is the offset from the least significant bit of the container to the least significant
bit of the bit-field.
A bit-field can be extracted by loading its container, shifting and masking by amounts that depend on the byte order,
K(F), the container size, and the field width, then sign extending if needed.
The bit-address of F, BA(F), can now be defined as:
For a bit address BA falling in a container of width C and alignment A (<= C) (both expressed in bits), define the
unallocated container bits (UCB) to be:
UCB(BA, C, A) = C - (BA % A)
TRUNCATE(X,Y) = Y * FLOOR(X/Y)
NCBA(BA, A) = TRUNCATE(BA + A – 1, A)
Note
The AAPCS64 does not allow exported interfaces to contain packed structures or bit-fields. However a scheme for
laying out packed bit-fields can be achieved by reducing the alignment, A, in the above rules to below that of the
natural container type. ARMCC uses an alignment of A=8 in these cases, but GCC uses an alignment of A=1.
• Load the (naturally aligned) container at byte address TRUNCATE(BA(F), C) / 8 into a 64-bit register R
• Set Q = MAX(64, C)
48
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
• Little-endian, set R = (R << ((Q – W) – (BA MOD C))) >> (Q – W).
• Big-endian, set R = (R << (Q – C +(BA MOD C))) >> (Q – W).
See Volatile bit-fields -- preserving number and width of container accesses for volatile bit-fields.
• Selecting a new container width C’ which is the width of the fundamental integer data type with the largest size
less than or equal to W. The alignment of this container will be A’. Note that C’ ≥ C and A’ ≥ A.
• If C’ > UCB(CBA, C’, A’) setting CBA = NCBA(CBA, A’). This ensures that the bit-field will be placed at
the start of the next container type.
• Allocating a normal (undersized) bit-field using the values (C, C’, A’) for (W, C, A).
• Setting CBA = CBA + W – C.
Each segment of an oversized bit-field can be accessed simply by accessing its container type.
Note
Any tail-padding added to a structure that immediately precedes a bit-field member is part of the structure and must
be taken into account when determining the CBA.
When a non-bit-field member follows a bit-field it is placed at the lowest acceptable address following the allocated
bit-field.
Note
When laying out fundamental data types it is possible to consider them all to be bit-fields with a width equal to the
container size. The rules in Bit-fields no larger than their container can then be applied to determine the precise
address within a structure.
Note
This ABI does not place any restrictions on the access widths of bit-fields where the container overlaps with a
non-bit-field member or where the container overlaps with any zero length bit-field placed between two other
49
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
bit-fields. This is because the C/C++ memory model defines these as being separate memory locations, which can
be accessed by two threads simultaneously. For this reason, compilers must be permitted to use a narrower
memory access width (including splitting the access into multiple instructions) to avoid writing to a different memory
location. For example, in struct S { int a:24; char b; }; a write to a must not also write to the location
occupied by b, this requires at least two memory accesses in all current Arm architectures. In the same way, in
struct S { int a:24; int:0; int b:8; };, writes to a or b must not overwrite each other.
Multiple accesses to the same volatile bit-field, or to additional volatile bit-fields within the same container may not be
merged. For example, an increment of a volatile bit-field must always be implemented as two reads and a write.
Note
Note the volatile access rules apply even when the width and alignment of the bit-field imply that the access could
be achieved more efficiently using a narrower type. For a write operation the read must always occur even if the
entire contents of the container will be replaced.
If the containers of two volatile bit-fields overlap then access to one bit-field will cause an access to the other. For
example, in struct S {volatile int a:8; volatile char b:2}; an access to a will also cause an access
to b, but not vice-versa.
If the container of a non-volatile bit-field overlaps a volatile bit-field then it is undefined whether access to the
non-volatile field will cause the volatile field to be accessed.
• For C++, an implicit this parameter is passed as an extra argument that immediately precedes the first user
argument. Other rules for marshaling C++ arguments are described in CPPABI64.
• For unprototyped (i.e. pre-ANSI or K&R C) and variadic functions, in addition to the normal conversions and
promotions, arguments of type __fp16 are converted to type double.
• The rules for passing Pure Scalable Types depend on whether the arguments are named. It is an error to pass
such types to an unprototyped function.
The argument list is then processed according to the standard rules for procedure calls (see Parameter passing) or the
appropriate variant.
• ZA must be in the “off” state when setjmp returns to its caller via a longjmp.
50
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Note
A consequence of this is that any existing open-coded copies of longjmp become invalid.
10.4 Exceptions
(Alpha)
If:
then the platform must ensure that all the following conditions are true on entry to EH:
• PSTATE.SM is 0.
• PSTATE.ZA is 0.
• TPIDR2_EL0 is null.
The platform subroutines for throwing synchronous exceptions would typically have a private-ZA non-streaming
interface. Such subroutines can meet the requirements above by using __arm_za_disable to turn ZA off before
entering EH.
An asynchronous exception might be thrown when PSTATE.SM is 1. In this case, the platform code that is responsible
for implementing exceptions must clear PSTATE.SM before entering the exception handler. The platform code can
again use __arm_za_disable to turn ZA off
The intention is to allow a subroutine that implements the lazy saving scheme to catch exceptions without being aware
of ZA.
A consequence of this definition is that the subroutine S above can choose to use the lazy saving scheme to manage
ZA contents that are live between I and EH. For example, the following pseudo-code describes a conforming way for S
to retain ZA contents across an exception:
// Current state:
// PSTATE.SM == 1
// PSTATE.ZA == 1
// TPIDR2_EL0 is null
// the first NS slices of ZA contain useful contents
za_save_buffer_type B;
__arm_tpidr2_block BLK = {};
BLK.za_save_buffer = &B;
BLK.num_za_save_slices = NS;
TPIDR2_EL0 = &BLK;
try {
…
} catch (…) {
// Current state:
// PSTATE.SM == 0
// PSTATE.ZA == 0
// TPIDR2_EL0 is null
51
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
__arm_tpidr2_restore(&BLK);
…
}
52
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
11 APPENDIX Support for Advanced
SIMD Extensions
The AARCH64 architecture supports a number of short-vector operations. To facilitate accessing these types from C
and C++ a number of extended types need to be added to the language.
Following the conventions used for adding types to C99 a number of additional types (internal types) are defined
unconditionally. To facilitate use in applications a header file is also defined (arm_neon.h) that maps these internal
types onto more user-friendly names. These types are listed in Table 7: Short vector extended types 9.
The header file arm_neon.h also defines a number of intrinsic functions that can be used with the types defined
below. The list of intrinsic functions and their specification is beyond the scope of this document.
53
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Internal type arm_neon.h type Base Type Elements
54
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Internal type arm_sve.h type Base type Elements
void f(int8x8_t)
is mangled as
_Z1fu10__Int8x8_t
A platform ABI may instead choose to treat the types as normal structure types, without a u prefix. For example, a
platform ABI may choose to mangle the function above as:
_Z1f10__Int8x8_t
instead.
The SVE tuple types are mangled using their arm_sve.h names (svBASExN_t).
55
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
14 APPENDIX Variable argument lists
Languages such as C and C++ permit routines that take a variable number of arguments (that is, the number of
parameters is controlled by the caller rather than the callee). Furthermore, they may then pass some or even all of
these parameters as a block to further subroutines to process the list. If a routine shares any of its optional arguments
with other routines then a parameter control block needs to be created as specified in Arm C and C++ language
mappings. The remainder of this appendix is informative.
• __stack: set to the address following the last (highest addressed) named incoming argument on the stack,
rounded upwards to a multiple of 8 bytes, or if there are no named arguments on the stack, then the value of the
stack pointer when the function was entered.
• __gr_top: set to the address of the byte immediately following the general register argument save area, the
end of the save area being aligned to a 16 byte boundary.
• __vr_top: set to the address of the byte immediately following the FP/SIMD register argument save area, the
end of the save area being aligned to a 16 byte boundary.
• __gr_offs: set to 0 – ((8 – named_gr) * 8).
• __vr_offs: set to 0 – ((8 – named_vr) * 16).
If it is known that a va_list structure is never used to access arguments that could be passed in the FP/SIMD
argument registers, then no FP/SIMD argument registers need to be saved, and the __vr_top and __vr_offs fields
initialised to zero. Furthermore, if in this case the general register argument save area is located immediately below
the value of the stack pointer on entry, then the __stack field may be set to the address of the anonymous argument
in the general register argument save area and the __gr_top and __gr_offs fields also set to zero, permitting a
56
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
simplified implementation of va_arg which simply advances the __stack pointer through the argument save area
and into the incoming stacked arguments. This simplification may not be used in the reverse case where anonymous
arguments are known to be in FP/SIMD registers but not in general registers.
Although this standard does not mandate a particular stack frame organisation beyond what is required to meet the
stack constraints described in The Stack, the following figure illustrates one possible stack layout for a variadic routine
which invokes the va_start macro.
LR'' LR''
Caller
Caller
FP'' FP''
FP' Dynamic Allocation
FP' Dynamic Allocation
on entry
Stack Arg Area Stack Arg Area
(outgoing) (incoming)
SP' SP'
on entry
Stack grows
down GR Arg Save Area
Callee-Saved Registers
Callee
Local Variables
LR'
FP'
FP
while running Dynamic Allocation
Focussing on just the top of callee’s stack frame, the following figure illustrates graphically how the __va_list
structure might be initialised by va_start to identify the three potential locations of the next anonymous argument.
57
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
next stack arg
__stack
Stack Arg Area
SP' __gr_top
+ __gr_offs
on entry GR Arg Save Area
next gen.reg arg va_list
__vr_top
+ __vr_offs
VR Arg Save Area
next fp/simd reg arg
The va_list
58
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
goto on_stack; // reg save area empty
nreg = (sizeof(type) + 15) / 16;
ap.__vr_offs = offs + (nreg * 16);
if (ap.__vr_offs > 0)
goto on_stack; // overflowed reg save area
#ifdef BIG_ENDIAN
if (classof(type) != "aggregate" && sizeof(type) < 16)
offs += 16 - sizeof(type);
#endif
return *(type *)(ap.__vr_top + offs);
}
on_stack:
intptr_t arg = ap.__stack;
if (alignof(type) > 8)
arg = (arg + 15) & -16;
ap.__stack = (void *)((arg + sizeof(type) + 7) & -8);
#ifdef BIG_ENDIAN
if (classof(type) != "aggregate" && sizeof(type) < 8)
arg += 8 - sizeof(type);
#endif
return *(type *)arg;
}
Review note: The above pseudo code does not currently handle composite types that are passed by value, and
where a copy is made and reference created to the copy. This will be corrected in a future revision of this
standard.
If type is a Pure Scalable Type, the pseudo-code above should be used to obtain a type*, which should then be
dereferenced to get the required value.
It is expected that the implementation of the va_arg macro will be specialized by the compiler for the type, size and
alignment of the type. By way of example the following sample code illustrates one possible expansion of
va_arg(ap,int) for the LP64 data model, where register x0 holds a pointer to va_list ap, and the argument is
returned in register w1. Further optimizations are possible.
59
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
15 Footnotes
60
Copyright © 2011, 2013, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.