Chapter 5: Assembly Language
- What assembly language is and why a kernel needs it
- How ARM64 instructions are structured
- Data movement, arithmetic, logical, and branch instructions
- How the stack works and how to use it correctly
- System instructions for kernel control
- How to write inline assembly in C
- How assembly and C work together in our kernel
5.1 What is Assembly Language?
Assembly language is a human-readable way to write machine code. Each assembly instruction corresponds to one CPU instruction. Unlike C, where one line of code can do many things, one line of assembly does exactly one thing: add two numbers, load from memory, or jump to another address.
Every CPU architecture has its own assembly language. ARM64 assembly is specific to the ARMv8-A architecture in 64-bit mode. Code written for x86 will not work on ARM64, and vice versa.
A program called an assembler converts assembly code into machine code (binary
instructions that the CPU executes). We use the GNU assembler (aarch64-none-elf-as),
which is part of our cross-compiler toolchain.
Why Does a Kernel Need Assembly?
Most of our kernel is written in C, but some things require assembly:
- Kernel entry point. When the CPU starts, it is not running C code yet. The stack pointer is not set up, the BSS section is not cleared, and there is no C runtime environment. We need a small assembly stub to prepare everything before jumping to C code.
- Exception vectors. When an interrupt or exception occurs, the CPU jumps to a fixed address. The code at that address must be assembly because it needs to save and restore registers that C cannot access directly.
- Context switching. Switching between processes requires saving and restoring all registers, including system registers. C cannot do this directly.
- Atomic operations. Some hardware-level atomic operations require specific assembly instructions like LDXR and STXR.
- System register access. Reading and writing CPU control registers requires the MRS and MSR instructions, which are not available in standard C.
In short: C is for logic, assembly is for control. We use each where it belongs.
5.2 ARM64 Instruction Format
Every ARM64 instruction is exactly 4 bytes (32 bits) long. This is called a fixed instruction width. The CPU decodes the bits to determine what operation to perform.
A typical instruction has three parts: an opcode (what to do), operands (what to do it with), and sometimes flags (modifiers that change the behavior).
/* General form: opcode destination, source1, source2 */
add x0, x1, x2 /* x0 = x1 + x2 */
/* Opcode: add */
/* Destination: x0 */
/* Source 1: x1 */
/* Source 2: x2 */
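To make the fixed 32-bit layout concrete, here is a minimal C sketch that pulls the register fields out of an encoded instruction word. It assumes the "ADD (shifted register)" encoding, where `add x0, x1, x2` assembles to 0x8B020020. The struct and function names are hypothetical helpers for illustration, not part of our kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an A64 "ADD (shifted register)" instruction word. */
struct add_fields {
    unsigned sf;   /* bit 31: 1 = 64-bit (x registers), 0 = 32-bit (w registers) */
    unsigned rm;   /* bits 20:16: second source register */
    unsigned imm6; /* bits 15:10: shift amount applied to rm */
    unsigned rn;   /* bits 9:5: first source register */
    unsigned rd;   /* bits 4:0: destination register */
};

/* Decode one instruction word (hypothetical helper for this sketch). */
static struct add_fields decode_add(uint32_t insn)
{
    struct add_fields f;

    /* Bits 30:24 must be 0001011 and bit 21 must be 0 for this encoding. */
    assert((insn & 0x7F200000u) == 0x0B000000u);

    f.sf   = (insn >> 31) & 0x1;
    f.rm   = (insn >> 16) & 0x1F;
    f.imm6 = (insn >> 10) & 0x3F;
    f.rn   = (insn >> 5)  & 0x1F;
    f.rd   = insn & 0x1F;
    return f;
}
```

Decoding 0x8B020020 with this helper gives rd = 0, rn = 1, and rm = 2, which matches the destination and sources listed above.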
Instructions fall into three main categories:
| Category | Purpose | Examples |
|---|---|---|
| Data processing | Arithmetic, logic, shifts on registers | ADD, SUB, AND, ORR, LSL |
| Memory access | Load from memory, store to memory | LDR, STR, STP, LDP |
| Branch | Change the flow of execution | B, BL, RET, CBZ |
Our kernel uses instructions from all three categories. Understanding them means you can read and write the assembly parts of our kernel.
5.3 Data Movement Instructions
Moving Values Between Registers
mov x0, x1 /* x0 = x1 (copy x1 into x0) */
mov x0, #42 /* x0 = 42 (load immediate value) */
mvn x0, x1 /* x0 = ~x1 (bitwise NOT of x1) */
MOV copies a value from one register to another, or loads a small constant
(called an immediate) into a register. A single MOV can only encode certain
immediates (for example, a 16-bit value optionally shifted left by 16, 32, or
48 bits). For larger constants, we use LDR with a literal pool (see below).
Loading and Storing Memory
ARM64 is a load-store architecture. This means only load and store instructions
(LDR, STR, and variants such as LDP and STP) access memory. Everything else works on registers.
ldr x0, [x1] /* x0 = memory[x1] (load from address in x1) */
str x0, [x1] /* memory[x1] = x0 (store to address in x1) */
ldr w0, [x1] /* w0 = memory[x1] (32-bit load) */
str w0, [x1] /* memory[x1] = w0 (32-bit store) */
ldrb w0, [x1] /* w0 = memory[x1] (8-bit load, zero-extended) */
strb w0, [x1] /* memory[x1] = w0 (8-bit store) */
ldrh w0, [x1] /* w0 = memory[x1] (16-bit load, zero-extended) */
strh w0, [x1] /* memory[x1] = w0 (16-bit store) */
Addressing Modes
ARM64 provides several ways to calculate the memory address:
/* Base register only */
ldr x0, [x1] /* address = x1 */
/* Base + offset (immediate) */
ldr x0, [x1, #16] /* address = x1 + 16 */
str x0, [x1, #-8] /* address = x1 - 8 */
/* Pre-index: update base register before access */
ldr x0, [x1, #16]! /* x1 = x1 + 16; then x0 = memory[x1] */
/* Post-index: update base register after access */
ldr x0, [x1], #16 /* x0 = memory[x1]; then x1 = x1 + 16 */
/* Register offset */
ldr x0, [x1, x2] /* address = x1 + x2 */
/* Scaled register offset (shift left by log2(size)) */
ldr x0, [x1, x2, lsl #3] /* address = x1 + (x2 * 8) */
The pre-index and post-index modes are especially useful for stack operations. We use
[sp, #-16]! to push and [sp], #16 to pop.
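The difference between offset, pre-index, and post-index addressing is easiest to see by modelling the base register as a variable. The following C sketch is only a model under stated assumptions: `memory` is a toy array standing in for RAM, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 8
static uint64_t memory[MEM_WORDS]; /* toy memory, byte addresses start at 0 */

/* Read the 64-bit word at a byte address in the toy memory. */
static uint64_t load64(uint64_t addr)
{
    assert(addr % 8 == 0 && addr / 8 < MEM_WORDS);
    return memory[addr / 8];
}

/* ldr xt, [xn, #off]   : offset addressing, base register unchanged */
static uint64_t ldr_offset(const uint64_t *base, int64_t off)
{
    return load64(*base + (uint64_t)off);
}

/* ldr xt, [xn, #off]!  : pre-index, base updated before the access */
static uint64_t ldr_pre(uint64_t *base, int64_t off)
{
    *base += (uint64_t)off;
    return load64(*base);
}

/* ldr xt, [xn], #off   : post-index, base updated after the access */
static uint64_t ldr_post(uint64_t *base, int64_t off)
{
    uint64_t value = load64(*base);
    *base += (uint64_t)off;
    return value;
}
```

With the base at 8, `ldr_offset(&x1, 16)` and `ldr_pre(&x1, 16)` both read the word at address 24, but only the pre-index form leaves the base at 24 afterwards. `ldr_post` reads at the current base and then advances it.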
Literal Pools and Large Constants
ARM64 instructions are 32 bits, so they cannot hold a full 64-bit address. To load a large constant, we store it in a literal pool (a table of constants embedded in the code) and use PC-relative addressing:
ldr x0, =0x09000000 /* assembler puts 0x09000000 in literal pool, */
/* generates a PC-relative load */
The assembler places the constant value near the instruction and generates a load from that location. This is how our kernel loads the UART base address.
Load and Store Pair
stp x0, x1, [sp, #-16]! /* push x0 and x1 onto stack (16 bytes) */
ldp x0, x1, [sp], #16 /* pop x0 and x1 from stack */
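STP (store pair) and LDP (load pair) operate on two registers
at once. They are the standard way to push and pop values on the stack. The exclamation
mark ! means "write back": the computed address is written back to the base register.
In the STP above, SP is decremented by 16 before the store. In the post-index LDP,
SP is incremented by 16 after the load.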
5.4 Arithmetic Instructions
add x0, x1, x2 /* x0 = x1 + x2 */
add x0, x1, #16 /* x0 = x1 + 16 */
sub x0, x1, x2 /* x0 = x1 - x2 */
sub x0, x1, #8 /* x0 = x1 - 8 */
mul x0, x1, x2 /* x0 = x1 * x2 */
udiv x0, x1, x2 /* x0 = x1 / x2 (unsigned) */
sdiv x0, x1, x2 /* x0 = x1 / x2 (signed) */
neg x0, x1 /* x0 = -x1 */
The ADDS and SUBS variants update the condition flags (NZCV)
based on the result. Use the plain ADD/SUB when you do not need
the flags.
adds x0, x1, x2 /* x0 = x1 + x2; update flags */
subs x0, x1, x2 /* x0 = x1 - x2; update flags */
cmp x0, x1 /* compare: sets flags like x0 - x1 (without storing result) */
cmn x0, x1 /* compare negative: sets flags like x0 + x1 */
CMP is the most common way to test values before a conditional branch. It
performs a subtraction and discards the result, only updating the flags.
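To see where the four flags come from, the following C sketch computes NZCV for a 64-bit subtraction (SUBS, CMP) and addition (ADDS, CMN). It is a model for illustration only; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct nzcv { bool n, z, c, v; };

/* Flags produced by "subs/cmp a, b" on 64-bit values. */
static struct nzcv flags_sub(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    struct nzcv f;
    f.n = (r >> 63) & 1;                    /* result is negative when read as signed */
    f.z = (r == 0);                         /* result is zero */
    f.c = (a >= b);                         /* carry set means "no borrow" */
    f.v = (((a ^ b) & (a ^ r)) >> 63) & 1;  /* signed overflow */
    return f;
}

/* Flags produced by "adds/cmn a, b" on 64-bit values. */
static struct nzcv flags_add(uint64_t a, uint64_t b)
{
    uint64_t r = a + b;
    struct nzcv f;
    f.n = (r >> 63) & 1;
    f.z = (r == 0);
    f.c = (r < a);                          /* unsigned carry out of bit 63 */
    f.v = ((~(a ^ b) & (a ^ r)) >> 63) & 1; /* signed overflow */
    return f;
}
```

For example, `flags_sub(3, 5)` sets N and clears C, because the result is negative as a signed value and the unsigned subtraction borrows.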
5.5 Logical and Shift Instructions
and x0, x1, x2 /* x0 = x1 & x2 (bitwise AND) */
orr x0, x1, x2 /* x0 = x1 | x2 (bitwise OR) */
eor x0, x1, x2 /* x0 = x1 ^ x2 (bitwise XOR) */
bic x0, x1, x2 /* x0 = x1 & ~x2 (bit clear: AND with NOT of x2) */
lsl x0, x1, #4 /* x0 = x1 << 4 (logical shift left) */
lsr x0, x1, #4 /* x0 = x1 >> 4 (logical shift right) */
asr x0, x1, #4 /* x0 = x1 >> 4 (arithmetic shift right, sign-extends) */
BIC is especially useful for clearing bits in hardware registers. Many ARM64
instructions can also include a shift as part of the operation:
add x0, x1, x2, lsl #3 /* x0 = x1 + (x2 << 3) */
orr x0, x1, x2, lsr #2 /* x0 = x1 | (x2 >> 2) */
This is called a shifted register operand. It lets us combine a shift and an arithmetic operation in a single instruction, which saves both code space and execution time.
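The C sketch below mirrors a few of these instructions so the results can be checked by hand. It is a model, not kernel code: the function names are hypothetical, and ASR is emulated on an unsigned value because right-shifting a negative signed integer is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* bic: keep the bits of a that are NOT set in mask */
static uint64_t bic(uint64_t a, uint64_t mask)
{
    return a & ~mask;
}

/* lsr: vacated high bits are filled with zeros */
static uint64_t lsr(uint64_t a, unsigned n)
{
    assert(n < 64);
    return a >> n;
}

/* asr: vacated high bits are filled with copies of the sign bit */
static uint64_t asr(uint64_t a, unsigned n)
{
    assert(n < 64);
    uint64_t shifted = a >> n;
    if ((a >> 63) && n > 0)
        shifted |= ~UINT64_C(0) << (64 - n);
    return shifted;
}

/* add x0, x1, x2, lsl #3: address of element i in an array of 8-byte items */
static uint64_t element_address(uint64_t base, uint64_t index)
{
    return base + (index << 3);
}
```

The shifted register form is what a compiler would typically use for array indexing: `element_address(0x1000, 3)` yields 0x1018, the address of the fourth 8-byte element.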
5.6 Control Flow Instructions
Unconditional Branch
b loop /* jump to label 'loop' (unconditional) */
b 0x40000100 /* jump to address 0x40000100 (encoded as a PC-relative offset; target must be within 128 MB) */
Conditional Branch
Conditional branches use the condition flags set by a previous CMP,
ADDS, or SUBS instruction:
cmp x0, #5
b.eq label /* branch if x0 == 5 (equal) */
b.ne label /* branch if x0 != 5 (not equal) */
b.lt label /* branch if x0 < 5 (signed less than) */
b.le label /* branch if x0 <= 5 (signed less or equal) */
b.gt label /* branch if x0 > 5 (signed greater than) */
b.ge label /* branch if x0 >= 5 (signed greater or equal) */
b.hi label /* branch if x0 > 5 (unsigned higher) */
b.lo label /* branch if x0 < 5 (unsigned lower) */
Common condition codes (HS can also be written CS, and LO can also be written CC):
| Code | Meaning | Flags Tested |
|---|---|---|
| EQ | Equal | Z == 1 |
| NE | Not equal | Z == 0 |
| LT | Signed less than | N != V |
| LE | Signed less or equal | N != V or Z == 1 |
| GT | Signed greater than | Z == 0 and N == V |
| GE | Signed greater or equal | N == V |
| LO | Unsigned lower | C == 0 |
| HS | Unsigned higher or same | C == 1 |
| HI | Unsigned higher | C == 1 and Z == 0 |
| MI | Negative (minus) | N == 1 |
| PL | Positive or zero (plus) | N == 0 |
| VS | Overflow (signed overflow) | V == 1 |
| VC | No overflow | V == 0 |
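The table translates directly into code. The following C sketch evaluates a condition code against a set of flags, and includes a small flag calculation for `cmp a, b` so that it can be tried on real values. All names are hypothetical and only the codes used for comparisons are covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct nzcv { bool n, z, c, v; };

enum cond { EQ, NE, LT, LE, GT, GE, LO, HS, HI };

/* Flags as set by "cmp a, b" (a subtraction whose result is discarded). */
static struct nzcv cmp_flags(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    struct nzcv f;
    f.n = (r >> 63) & 1;
    f.z = (r == 0);
    f.c = (a >= b);
    f.v = (((a ^ b) & (a ^ r)) >> 63) & 1;
    return f;
}

/* True if a "b.<cc>" branch would be taken with these flags. */
static bool cond_holds(enum cond cc, struct nzcv f)
{
    switch (cc) {
    case EQ: return f.z;
    case NE: return !f.z;
    case LT: return f.n != f.v;
    case LE: return f.n != f.v || f.z;
    case GT: return !f.z && f.n == f.v;
    case GE: return f.n == f.v;
    case LO: return !f.c;
    case HS: return f.c;
    case HI: return f.c && !f.z;
    }
    assert(!"unknown condition code");
    return false;
}
```

Comparing the bit pattern of -1 against 5 shows why signed and unsigned codes differ: LT holds (as signed, -1 is less than 5) and HI holds as well (as unsigned, the same bits are a very large number).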
Compare and Branch
CBZ and CBNZ combine a comparison against zero with a branch in one
instruction. They save an instruction compared with CMP followed by B.EQ or B.NE, and they do not modify the condition flags:
cbz x0, label /* branch to label if x0 == 0 */
cbnz x0, label /* branch to label if x0 != 0 */
/* Equivalent to: */
cmp x0, #0
b.eq label
Function Calls
bl my_function /* branch and link: save return address in x30, then jump */
ret /* return: jump to address in x30 */
/* Example: calling a function */
_start:
bl kernel_main /* call kernel_main; x30 = return address */
b . /* infinite loop ('.' means current address) */
kernel_main:
/* function body */
ret /* return to _start */
BL saves the address of the next instruction into register x30
(the link register) and then jumps to the target. RET
jumps back to the address in x30. This is how function calls work at the
assembly level.
5.7 The Stack
The stack is a region of memory used for temporary storage. It grows
downward: as you push data, the stack pointer (SP) decreases. When you pop,
it increases.
In ARM64, the stack pointer must be 16-byte aligned (a multiple of 16) whenever it is used as the base address of a memory access. When SP alignment checking is enabled, violating this rule raises an SP alignment fault.
/* Push two registers (16 bytes) */
stp x29, x30, [sp, #-16]! /* subtract 16 from SP, then store x29 and x30 */
/* Pop two registers */
ldp x29, x30, [sp], #16 /* load x29 and x30, then add 16 to SP */
/* Push a single register (still uses 16 bytes for alignment) */
str x0, [sp, #-16]! /* allocate 16 bytes, store x0 at the new SP (lowest address) */
ldr x0, [sp], #16 /* load x0, then deallocate 16 bytes */
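The push and pop sequences above can be modelled in C to check the bookkeeping. This is a sketch under stated assumptions: `stack_words` is a toy 256-byte stack, `sp` is a byte offset into it rather than a real address, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_BYTES 256

/* Toy stack: sp is a byte offset and starts at the top (highest offset). */
static uint64_t stack_words[STACK_BYTES / 8];
static uint64_t sp = STACK_BYTES;

/* Model of: stp a, b, [sp, #-16]! */
static void push_pair(uint64_t a, uint64_t b)
{
    assert(sp >= 16);            /* room left on the stack */
    sp -= 16;                    /* pre-index: move SP down first */
    stack_words[sp / 8]     = a; /* first register goes at the lower address */
    stack_words[sp / 8 + 1] = b;
    assert(sp % 16 == 0);        /* SP stays 16-byte aligned */
}

/* Model of: ldp a, b, [sp], #16 */
static void pop_pair(uint64_t *a, uint64_t *b)
{
    assert(sp + 16 <= STACK_BYTES); /* something to pop */
    *a = stack_words[sp / 8];
    *b = stack_words[sp / 8 + 1];
    sp += 16;                       /* post-index: move SP up afterwards */
}
```

Pushing two pairs moves `sp` from 256 to 224, and popping returns the pairs in reverse order, which is the last-in, first-out behaviour the real stack relies on.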
The stack is used for three main purposes:
- Saving return addresses. When a function calls another function, it must save its own return address (x30) on the stack.
- Saving callee-saved registers. If a function uses x19-x28, it must save them on entry and restore them on exit.
- Local variables. If a function has more local variables than fit in registers, they go on the stack.
Frame Pointer
The frame pointer (x29) points to the beginning of a function's stack frame. It is used for debugging and unwinding the call stack (for example, when printing a backtrace):
my_function:
stp x29, x30, [sp, #-16]! /* save frame pointer and link register */
mov x29, sp /* set new frame pointer to current SP */
sub sp, sp, #32 /* allocate 32 bytes for local variables */
/* ... function body ... */
add sp, sp, #32 /* deallocate local variables */
ldp x29, x30, [sp], #16 /* restore frame pointer and link register */
ret
The frame pointer creates a linked list of stack frames. Each frame pointer points to the previous one, forming a chain that debuggers can walk to produce a backtrace.
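A backtrace routine only has to follow that chain. The C sketch below shows the idea on plain data structures. It assumes the outermost frame record holds a null saved frame pointer, which is a convention the startup code would have to establish and is not shown in this chapter. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout of the pair written by "stp x29, x30, [sp, #-16]!". */
struct frame_record {
    struct frame_record *caller_fp; /* saved x29: the caller's frame record */
    uint64_t return_address;        /* saved x30 */
};

/* Follow the chain starting at fp and store up to max return addresses
 * in out. Returns the number of frames visited. */
static size_t backtrace(const struct frame_record *fp, uint64_t *out, size_t max)
{
    size_t depth = 0;

    assert(out != NULL);
    while (fp != NULL && depth < max) {
        out[depth++] = fp->return_address;
        fp = fp->caller_fp;
    }
    return depth;
}
```

In a real kernel the starting pointer would be the current value of x29. A production unwinder would also validate each pointer before dereferencing it, which this sketch does not do.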
5.8 System Instructions
These instructions control the CPU itself. Most of them are only available at EL1 and above. The exceptions include the barrier instructions and SVC, which can also be executed at EL0.
System Register Access
mrs x0, CurrentEL /* read CurrentEL into x0 */
msr SCTLR_EL1, x0 /* write x0 to SCTLR_EL1 */
isb /* instruction synchronization barrier */
dsb sy /* data synchronization barrier (full system) */
dmb sy /* data memory barrier (full system) */
- MRS: Move system register to general-purpose register (read)
- MSR: Move general-purpose register to system register (write)
- ISB: Flush the instruction pipeline. Use after changing system registers that affect instruction execution.
- DSB: Wait for all memory accesses to complete. Use before TLB maintenance.
- DMB: Ensure memory access ordering. Use in synchronization code.
Exception-Related Instructions
svc #0 /* supervisor call: trigger a system call from EL0 to EL1 */
eret /* exception return: return from EL1 to EL0 */
hvc #0 /* hypervisor call (EL1 to EL2) */
smc #0 /* secure monitor call (EL1 or EL2 to EL3) */
In our kernel, SVC is used by user-space programs to make system calls, and
ERET is used by the kernel to return to user space after handling an exception.
Power Management
wfi /* wait for interrupt */
wfe /* wait for event */
sev /* send event (wake up cores waiting with WFE) */
WFI puts the CPU into a low-power state until an interrupt occurs. Our kernel
uses WFI in the idle loop when no processes are ready to run.
5.9 Labels, Directives, and the Assembler
Assembly code uses labels to mark locations and directives to control the assembler:
.section .text._start /* directive: put following code in this section */
.global _start /* directive: make _start visible to the linker */
_start: /* label: marks the entry point address */
ldr x0, =_stack_end
mov sp, x0
bl kernel_main
.section .rodata /* directive: read-only data section */
msg:
.asciz "Hello" /* directive: null-terminated string */
.section .data /* directive: writable data section */
counter:
.quad 0 /* directive: 64-bit value initialized to 0 */
.section .bss /* directive: zero-initialized data */
.align 4 /* directive: align to 16 bytes */
_stack_start:
.skip 4096 /* directive: reserve 4096 bytes */
_stack_end:
Common directives:
| Directive | Purpose |
|---|---|
| .section | Switch to a specific section (text, data, bss, rodata) |
| .global | Make a label visible to the linker |
| .align N | Align to 2^N bytes |
| .byte, .word, .quad | Emit data of specific sizes |
| .asciz | Emit a null-terminated string |
| .skip | Reserve N bytes (like BSS space) |
| .rept ... .endr | Repeat a block of instructions |
| .macro ... .endm | Define an assembler macro |
| .equ | Define a numeric constant |
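Because .align takes an exponent rather than a byte count on AArch64, it helps to work out a few values by hand. The C sketch below performs the same rounding the assembler applies to the location counter. The function name is a hypothetical helper for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of 2^n, as ".align n" does for the
 * assembler's location counter on AArch64. */
static uint64_t align_up_pow2(uint64_t addr, unsigned n)
{
    assert(n < 64);
    uint64_t mask = (UINT64_C(1) << n) - 1;
    return (addr + mask) & ~mask;
}
```

For instance, `align_up_pow2(0x1001, 4)` gives 0x1010 (the next 16-byte boundary), while an address that is already aligned is returned unchanged.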
5.10 Inline Assembly in C
Sometimes we need assembly code inside a C function. GCC provides the asm
keyword (also spelled __asm__, which works even in strict ISO C modes) for this, known as inline assembly.
Basic Inline Assembly
/* Execute a WFI instruction */
__asm__("wfi");
/* Read CurrentEL */
uint64_t el;
__asm__("mrs %0, CurrentEL" : "=r" (el));
/* Write to a system register */
__asm__("msr SCTLR_EL1, %0" : : "r" (value));
The syntax is: asm("instructions" : outputs : inputs : clobbers).
In the examples above:
- %0 refers to the first operand (output or input)
- "=r" means "output in any general-purpose register"
- "r" means "input from any general-purpose register"
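Operand numbering is easier to follow with one output and two inputs. The sketch below wraps a plain ADD instruction, which is safe to execute at any exception level. The function name is hypothetical, and the `#else` branch is an assumption added so the example also builds on a non-ARM64 host.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t add_with_asm(uint64_t a, uint64_t b)
{
    uint64_t sum;
#if defined(__aarch64__)
    /* Operands are numbered outputs first, then inputs:
     * %0 = sum (output), %1 = a (input), %2 = b (input). */
    __asm__("add %0, %1, %2" : "=r" (sum) : "r" (a), "r" (b));
#else
    sum = a + b; /* fallback so the sketch also builds on other hosts */
#endif
    return sum;
}
```

The compiler picks the registers, substitutes them for %0, %1, and %2, and copies the result into `sum` afterwards. No clobber list is needed here because the instruction changes nothing except its output register.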
Extended Inline Assembly
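/* Atomic compare-and-swap (LDXR/STXR) */
#include <stdint.h>

int atomic_cas(uint64_t *ptr, uint64_t expected, uint64_t desired) {
    uint64_t result; /* value loaded from *ptr */
    uint32_t status; /* STXR status: 0 means the store succeeded */
    __asm__ __volatile__(
        "1: ldxr %x0, [%2]\n"       /* load exclusive */
        "   cmp %x0, %3\n"          /* does it match expected? */
        "   b.ne 2f\n"              /* no: give up */
        "   stxr %w1, %4, [%2]\n"   /* yes: try to store desired */
        "   cbnz %w1, 1b\n"         /* store failed: retry */
        "2:"
        : "=&r" (result), "=&r" (status)
        : "r" (ptr), "r" (expected), "r" (desired)
        : "memory", "cc"
    );
    return result == expected;
}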
This uses __volatile__ to prevent the compiler from optimizing away the
assembly block, and a local label format (1:, 2:) with
b.ne 2f (forward) and cbnz ... 1b (backward).
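A compare-and-swap is usually called in a retry loop. The sketch below shows that pattern for an atomic add. To keep it portable it uses the C11 `<stdatomic.h>` function `atomic_compare_exchange_weak` as a stand-in for the LDXR/STXR version above, so it is not the kernel's implementation. The function name `atomic_add_cas` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add delta to *counter using a compare-and-swap retry loop.
 * Returns the value the counter held before the addition. */
static uint64_t atomic_add_cas(_Atomic uint64_t *counter, uint64_t delta)
{
    uint64_t old = atomic_load(counter);

    /* On failure, atomic_compare_exchange_weak writes the current value
     * of *counter into old, so each retry works with fresh data. */
    while (!atomic_compare_exchange_weak(counter, &old, old + delta)) {
        /* another thread changed the counter; try again */
    }
    return old;
}
```

The loop has the same shape as the assembly: read the current value, compute the new one, and store it only if nobody else modified the location in between.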
When to Use Inline vs Separate Assembly
| Situation | Use |
|---|---|
| Single instruction (wfi, dsb, isb) | Inline assembly |
| Short sequence with C operand access | Inline assembly |
| Kernel entry point (_start) | Separate .S file |
| Exception vector table | Separate .S file |
| Context switching code | Separate .S file |
| Code over 10-15 instructions | Separate .S file |
As a general rule: a single instruction or a short sequence goes in inline assembly. Larger blocks go in
a separate .S file. This keeps the C code readable and the assembly code
maintainable.
5.11 Our Implementation
Now let us look at how assembly language is used in our actual kernel code.
The Entry Point (start.S)
We saw this file in Chapter 1. Now we understand every line:
.section .text._start /* Place this code in its own section */
.global _start /* Export _start so the linker can find it */
_start:
ldr x0, =_stack_end /* Load the address of _stack_end into x0 */
mov sp, x0 /* Set the stack pointer to that address */
bl kernel_main /* Call kernel_main (saves return address in x30) */
halt:
wfi /* Wait for interrupt (should never reach here) */
b halt /* If WFI returns, wait again (loop forever) */
Breaking it down:
- _start is the entry point. The linker script tells the linker that the kernel binary starts here.
- The CPU starts with no valid stack pointer. We load the address of _stack_end (which is the top of the stack area defined in the linker script) and set SP to it.
- BL kernel_main calls our C code. The return address is saved in x30.
- If kernel_main ever returns, we execute WFI to save power, then loop.
The Exception Vector Table
When an exception occurs, the CPU looks at a table of 16 entries called the exception vector table. Each entry is 128 bytes (32 instructions) of assembly code. We will write this when we get to exception handling (Chapter 11):
.align 11 /* must be 2KB aligned */
vectors:
/* Current EL with SP0 (SP_EL0) */
.align 7 /* each entry is 128-byte aligned */
b el1_sync_sp0 /* synchronous */
.align 7
b el1_irq_sp0 /* IRQ */
.align 7
b el1_fiq_sp0 /* FIQ */
.align 7
b el1_serror_sp0 /* SError */
/* Current EL with SPx (SP_EL1) */
.align 7
b el1_sync_sp1 /* synchronous */
.align 7
b el1_irq_sp1 /* IRQ */
.align 7
b el1_fiq_sp1 /* FIQ */
.align 7
b el1_serror_sp1 /* SError */
/* Lower EL using AArch64 */
.align 7
b el0_sync_64 /* synchronous */
.align 7
b el0_irq_64 /* IRQ */
.align 7
b el0_fiq_64 /* FIQ */
.align 7
b el0_serror_64 /* SError */
/* Lower EL using AArch32 (not used) */
.align 7
b el0_sync_32 /* synchronous */
.align 7
b el0_irq_32 /* IRQ */
.align 7
b el0_fiq_32 /* FIQ */
.align 7
b el0_serror_32 /* SError */
Each branch goes to a handler function that saves registers, handles the exception, and returns. We will fill these handlers in Chapter 11. The important thing now is that the entire table is assembly -- you cannot write this in C.
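The position of each entry follows from the layout: four groups of four entries, 128 bytes each. The C sketch below computes the byte offset of an entry from the start of the table. The enum and function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Groups and exception types in the order they appear in the table. */
enum vector_group { CUR_EL_SP0, CUR_EL_SPX, LOWER_EL_A64, LOWER_EL_A32 };
enum vector_type  { VEC_SYNC, VEC_IRQ, VEC_FIQ, VEC_SERROR };

#define VECTOR_ENTRY_BYTES 128u /* 32 instructions of 4 bytes each */

/* Byte offset of an entry from the start of the vector table. */
static uint32_t vector_offset(enum vector_group g, enum vector_type t)
{
    assert(g <= LOWER_EL_A32 && t <= VEC_SERROR);
    return ((uint32_t)g * 4u + (uint32_t)t) * VECTOR_ENTRY_BYTES;
}
```

An IRQ taken at the current EL with SPx lands at offset 0x280, and a synchronous exception from a lower EL in AArch64 state lands at 0x400. The last entry starts at 0x780, so the whole table occupies 0x800 bytes (2 KB), which is why it needs 2 KB alignment.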
Calling Convention for Our Kernel
We follow the ARM64 Procedure Call Standard (AAPCS64) for all function calls:
- Arguments 1-8 in x0-x7
- Return value in x0
- Callee-saved registers: x19-x28, plus the frame pointer (x29). A function that makes calls must also save the link register (x30)
- Stack must be 16-byte aligned at each call boundary
- The stack grows downward (full descending)
When we write assembly functions that C code calls, we must follow this convention. Here is an example of an assembly function that C can call:
.global cpu_get_current_el
cpu_get_current_el:
mrs x0, CurrentEL /* read current exception level */
and x0, x0, #0xC /* mask bits 3:2 (EL field) */
lsr x0, x0, #2 /* shift right by 2 to get 0, 1, 2, or 3 */
ret /* return value is in x0 */
.global cpu_wait_for_interrupt
cpu_wait_for_interrupt:
wfi
ret
C code calls these functions like any other function:
uint64_t el = cpu_get_current_el();
cpu_wait_for_interrupt();
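The masking and shifting in cpu_get_current_el can be checked on a host machine by applying the same two steps to a raw register value. This is a sketch for illustration; `el_from_current_el` is a hypothetical helper, not a function in our kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Same computation as cpu_get_current_el, applied to a raw CurrentEL value:
 * keep bits 3:2, then shift them down to bits 1:0. */
static unsigned el_from_current_el(uint64_t raw)
{
    return (unsigned)((raw & 0xC) >> 2);
}
```

A raw value of 0x4 decodes to EL1 and 0x8 decodes to EL2, matching the comments in the assembly version.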
Assembly Patterns in Our Kernel
These patterns appear throughout our kernel:
/* 1. Critical section: mask interrupts */
msr DAIFSet, #2 /* set I bit (mask IRQs) */
/* ... critical code ... */
msr DAIFClr, #2 /* clear I bit (unmask IRQs) */
/* 2. Read-modify-write a system register */
mrs x0, SCTLR_EL1 /* read */
orr x0, x0, #1 /* set bit 0 (enable MMU) */
msr SCTLR_EL1, x0 /* write */
isb /* synchronize */
/* 3. Memory barrier before page table switch */
dsb sy /* ensure all previous memory accesses complete */
msr TTBR0_EL1, x0 /* switch page table */
isb /* flush pipeline and ensure new translation is used */
/* 4. Save context on exception entry */
sub sp, sp, #(34 * 8) /* allocate 34 slots (272 bytes, keeps SP 16-byte aligned) */
stp x0, x1, [sp, #16*0] /* save x0, x1 */
stp x2, x3, [sp, #16*1] /* save x2, x3 */
/* ... save x4-x29 ... */
mrs x0, ELR_EL1 /* x0 is already saved, so use it as scratch */
stp x30, x0, [sp, #16*15] /* save x30 and the exception return address */
mrs x0, SPSR_EL1 /* save saved processor state */
str x0, [sp, #16*16]
/* 5. Restore context on exception return */
ldp x30, x0, [sp, #16*15] /* restore x30; x0 = saved ELR_EL1 */
msr ELR_EL1, x0
ldr x0, [sp, #16*16] /* restore SPSR_EL1 */
msr SPSR_EL1, x0
/* ... restore x4-x29 ... */
ldp x2, x3, [sp, #16*1]
ldp x0, x1, [sp, #16*0] /* restore x0 last because it was the scratch register */
add sp, sp, #(34 * 8) /* deallocate context storage */
eret /* return to where the exception came from */
These patterns form the backbone of our kernel's low-level operations. As we progress through the book, we will build each pattern into working code.
5.12 Exercises
Exercise 1: Read Assembly Code
Here is a simple assembly function. Write what it does in C:
func:
mov x1, #0
loop:
cmp x1, #10
b.ge done
add x0, x0, x1
add x1, x1, #1
b loop
done:
ret
Exercise 2: Write Assembly
Write an assembly function called strcpy_asm that copies a null-terminated
string from x0 (source) to x1 (destination). Use LDRB, STRB,
CBZ, and post-index addressing.
Exercise 3: Stack Operations
Write an assembly function that takes three arguments (x0, x1, x2), saves them on the stack, calls another function, and then returns the original value of x0. Use the frame pointer.
Exercise 4: Inline Assembly
Write a C function uint64_t read_ttbr0_el1(void) that reads the TTBR0_EL1
register using inline assembly and returns the value.
Exercise 5: Loop with Conditional
Write an assembly function that counts the number of set bits in x0 and returns the
count in x0. Use a loop, LSR, and AND to test each bit.
This is also called a population count (popcount).
Exercise 6: Understanding Our start.S (Challenge)
Modify start.S to also clear the BSS section before calling kernel_main. BSS starts
at label _bss_start and ends at _bss_end. You will need to
write a loop that stores zero to each 8-byte word in that range.
5.13 Summary
In this chapter, we learned what assembly language is and why a kernel needs it. The kernel uses assembly for things that C cannot do directly: setting up the initial environment, handling exceptions, switching between processes, and controlling CPU hardware.
We covered the main categories of ARM64 instructions:
- Data movement: MOV, LDR, STR, STP, LDP with various addressing modes
- Arithmetic: ADD, SUB, MUL, UDIV, SDIV and their flag-setting variants
- Logical: AND, ORR, EOR, BIC with optional shifts
- Control flow: B, B.cond, BL, RET, CBZ, CBNZ
- System: MRS, MSR, SVC, ERET, WFI, DSB, ISB
We examined how the stack works, how to use frame pointers, and how to push and pop registers correctly. We saw how inline assembly lets us use assembly instructions directly in C code for short operations.
Finally, we looked at the actual assembly code in our kernel: the entry point, exception vector table, and common patterns. These will appear throughout the rest of the book as we build each subsystem.
In the next chapter, we will learn how to write C code specifically for kernel development, including the freestanding environment, volatile access, and linker scripts.