ARM64 OS Handbook

Chapter 5: Assembly Language

What You Will Learn in This Chapter
  • What assembly language is and why a kernel needs it
  • How ARM64 instructions are structured
  • Data movement, arithmetic, logical, and branch instructions
  • How the stack works and how to use it correctly
  • System instructions for kernel control
  • How to write inline assembly in C
  • How assembly and C work together in our kernel

5.1 What is Assembly Language?

Assembly language is a human-readable way to write machine code. Each assembly instruction corresponds to one CPU instruction. Unlike C, where one line of code can do many things, one line of assembly does exactly one thing: add two numbers, load from memory, or jump to another address.

Every CPU architecture has its own assembly language. ARM64 assembly is specific to the ARMv8-A architecture in 64-bit mode. Code written for x86 will not work on ARM64, and vice versa.

A program called an assembler converts assembly code into machine code (binary instructions that the CPU executes). We use the GNU assembler (aarch64-none-elf-as), which is part of our cross-compiler toolchain.

Why Does a Kernel Need Assembly?

Most of our kernel is written in C, but some things require assembly:

  • Kernel entry point. When the CPU starts, it is not running C code yet. The stack pointer is not set up, the BSS section is not cleared, and there is no C runtime environment. We need a small assembly stub to prepare everything before jumping to C code.
  • Exception vectors. When an interrupt or exception occurs, the CPU jumps to a fixed address. The code at that address must be assembly because it needs to save and restore registers that C cannot access directly.
  • Context switching. Switching between processes requires saving and restoring all registers, including system registers. C cannot do this directly.
  • Atomic operations. Some hardware-level atomic operations require specific assembly instructions like LDXR and STXR.
  • System register access. Reading and writing CPU control registers requires the MRS and MSR instructions, which are not available in standard C.

In short: C is for logic, assembly is for control. We use each where it belongs.

5.2 ARM64 Instruction Format

Every ARM64 instruction is exactly 4 bytes (32 bits) long. This is called a fixed instruction width. The CPU decodes the bits to determine what operation to perform.

A typical instruction has three parts: an opcode (what to do), operands (what to do it with), and sometimes flags (modifiers that change the behavior).

/* General form: opcode destination, source1, source2 */
add x0, x1, x2    /* x0 = x1 + x2 */

/* Opcode: add */
/* Destination: x0 */
/* Source 1: x1 */
/* Source 2: x2 */
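
To make the fixed 32-bit width concrete, here is a minimal C sketch that packs the fields of the register form of ADD into one instruction word. The helper name encode_add_reg is hypothetical and not part of our kernel; the field layout assumes the A64 encoding of ADD (shifted register) with a shift amount of zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode "add xd, xn, xm" (64-bit form, no shift).
 * Field layout, from bit 31 down to bit 0:
 *   sf=1 | op=0 | S=0 | 01011 | shift=00 | 0 | Rm | imm6=0 | Rn | Rd */
uint32_t encode_add_reg(unsigned rd, unsigned rn, unsigned rm)
{
    assert(rd < 32 && rn < 32 && rm < 32);  /* register fields are 5 bits wide */
    return 0x8B000000u
         | ((uint32_t)rm << 16)   /* second source register */
         | ((uint32_t)rn << 5)    /* first source register */
         | (uint32_t)rd;          /* destination register */
}
```

With these fields, encode_add_reg(0, 1, 2) yields 0x8B020020, the word for add x0, x1, x2.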

Instructions fall into three main categories:

Category | Purpose | Examples
Data processing | Arithmetic, logic, shifts on registers | ADD, SUB, AND, ORR, LSL
Memory access | Load from memory, store to memory | LDR, STR, STP, LDP
Branch | Change the flow of execution | B, BL, RET, CBZ

Our kernel uses instructions from all three categories. Understanding them means you can read and write the assembly parts of our kernel.

5.3 Data Movement Instructions

Moving Values Between Registers

mov x0, x1        /* x0 = x1 (copy x1 into x0) */
mov x0, #42       /* x0 = 42 (load immediate value) */
mvn x0, x1        /* x0 = ~x1 (bitwise NOT of x1) */

MOV copies a value from one register to another, or loads a small constant (called an immediate) into a register. A single MOV can encode only certain immediates, for example a 16-bit value optionally shifted left by 16, 32, or 48 bits. For constants that do not fit, we use LDR with a literal pool (see below).

Loading and Storing Memory

ARM64 is a load-store architecture. This means only load and store instructions (LDR, STR, and their variants such as LDP and STP) access memory. Everything else works on registers.

ldr x0, [x1]          /* x0 = memory[x1]        (load from address in x1) */
str x0, [x1]          /* memory[x1] = x0        (store to address in x1) */

ldr w0, [x1]          /* w0 = memory[x1]        (32-bit load) */
str w0, [x1]          /* memory[x1] = w0        (32-bit store) */

ldrb w0, [x1]         /* w0 = memory[x1]        (8-bit load, zero-extended) */
strb w0, [x1]         /* memory[x1] = w0        (8-bit store) */

ldrh w0, [x1]         /* w0 = memory[x1]        (16-bit load, zero-extended) */
strh w0, [x1]         /* memory[x1] = w0        (16-bit store) */

Addressing Modes

ARM64 provides several ways to calculate the memory address:

/* Base register only */
ldr x0, [x1]                  /* address = x1 */

/* Base + offset (immediate) */
ldr x0, [x1, #16]             /* address = x1 + 16 */
str x0, [x1, #-8]             /* address = x1 - 8 */

/* Pre-index: update base register before access */
ldr x0, [x1, #16]!            /* x1 = x1 + 16; then x0 = memory[x1] */

/* Post-index: update base register after access */
ldr x0, [x1], #16             /* x0 = memory[x1]; then x1 = x1 + 16 */

/* Register offset */
ldr x0, [x1, x2]              /* address = x1 + x2 */

/* Scaled register offset (shift left by log2(size)) */
ldr x0, [x1, x2, lsl #3]     /* address = x1 + (x2 * 8) */

The pre-index and post-index modes are especially useful for stack operations. We use [sp, #-16]! to push and [sp], #16 to pop.
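
The difference between pre-index and post-index is only the order of the address update and the access. The following C sketch models both on a small simulated memory. The function names and the use of byte offsets as "addresses" are assumptions made for illustration, and the sketch does no upper bounds checking.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated memory of 64-bit words; an "address" is a byte offset into it. */
static uint64_t load64(const uint64_t *mem, int64_t addr)
{
    assert(addr >= 0 && addr % 8 == 0);  /* keep the sketch to aligned loads */
    return mem[addr / 8];
}

/* ldr x0, [x1, #off]!  : update the base first, then load from the new base */
uint64_t ldr_pre_index(const uint64_t *mem, int64_t *base, int64_t off)
{
    *base += off;
    return load64(mem, *base);
}

/* ldr x0, [x1], #off   : load from the old base, then update the base */
uint64_t ldr_post_index(const uint64_t *mem, int64_t *base, int64_t off)
{
    uint64_t value = load64(mem, *base);
    *base += off;
    return value;
}
```

Starting from base 0 with offset 16, both functions leave the base at 16, but the pre-index version returns the word at offset 16 while the post-index version returns the word at offset 0.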

Literal Pools and Large Constants

ARM64 instructions are 32 bits, so they cannot hold a full 64-bit address. To load a large constant, we store it in a literal pool (a table of constants embedded in the code) and use PC-relative addressing:

ldr x0, =0x09000000    /* assembler puts 0x09000000 in literal pool, */
                         /* generates a PC-relative load */

The assembler places the constant value near the instruction and generates a load from that location. This is how our kernel loads the UART base address.

Load and Store Pair

stp x0, x1, [sp, #-16]!   /* push x0 and x1 onto stack (16 bytes) */
ldp x0, x1, [sp], #16     /* pop x0 and x1 from stack */

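STP (store pair) and LDP (load pair) operate on two registers at once. They are the standard way to push and pop values on the stack. The exclamation mark ! in the STP means "write back": the computed address (sp - 16) is written to the base register, and the store uses that new address. The post-index LDP needs no !, because the #16 after the brackets always updates the base register after the load.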

5.4 Arithmetic Instructions

add x0, x1, x2       /* x0 = x1 + x2 */
add x0, x1, #16      /* x0 = x1 + 16 */
sub x0, x1, x2       /* x0 = x1 - x2 */
sub x0, x1, #8       /* x0 = x1 - 8 */
mul x0, x1, x2       /* x0 = x1 * x2 */
udiv x0, x1, x2      /* x0 = x1 / x2 (unsigned) */
sdiv x0, x1, x2      /* x0 = x1 / x2 (signed) */
neg x0, x1           /* x0 = -x1 */

The ADDS and SUBS variants update the condition flags (NZCV) based on the result. Use the plain ADD/SUB when you do not need the flags.

adds x0, x1, x2      /* x0 = x1 + x2; update flags */
subs x0, x1, x2      /* x0 = x1 - x2; update flags */

cmp x0, x1           /* compare: sets flags like x0 - x1 (without storing result) */
cmn x0, x1           /* compare negative: sets flags like x0 + x1 */

CMP is the most common way to test values before a conditional branch. It performs a subtraction and discards the result, only updating the flags.
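
As a rough model of what CMP computes, the C sketch below derives the four NZCV flags for a 64-bit subtraction. The struct and function names are hypothetical; the flag definitions are the usual ones for a subtraction (C set means no borrow).

```c
#include <stdbool.h>
#include <stdint.h>

struct nzcv {
    bool n;  /* result is negative (bit 63 set) */
    bool z;  /* result is zero */
    bool c;  /* no borrow occurred: a >= b as unsigned values */
    bool v;  /* signed overflow */
};

/* Flags produced by "cmp a, b", which behaves like "subs xzr, a, b". */
struct nzcv flags_after_cmp(uint64_t a, uint64_t b)
{
    uint64_t result = a - b;  /* unsigned arithmetic wraps modulo 2^64 */
    struct nzcv f;
    f.n = (result >> 63) != 0;
    f.z = (result == 0);
    f.c = (a >= b);
    /* Overflow: a and b differ in sign, and the result's sign differs from a's. */
    f.v = (((a ^ b) & (a ^ result)) >> 63) != 0;
    return f;
}
```

For example, comparing 3 with 5 sets N and clears Z, C, and V, which is why both "signed less than" and "unsigned lower" hold afterwards.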

5.5 Logical and Shift Instructions

and x0, x1, x2       /* x0 = x1 & x2 (bitwise AND) */
orr x0, x1, x2       /* x0 = x1 | x2 (bitwise OR) */
eor x0, x1, x2       /* x0 = x1 ^ x2 (bitwise XOR) */
bic x0, x1, x2       /* x0 = x1 & ~x2 (bit clear: AND with NOT of x2) */
lsl x0, x1, #4       /* x0 = x1 << 4 (logical shift left) */
lsr x0, x1, #4       /* x0 = x1 >> 4 (logical shift right) */
asr x0, x1, #4       /* x0 = x1 >> 4 (arithmetic shift right, sign-extends) */

BIC is especially useful for clearing bits in hardware registers. Many ARM64 instructions can also include a shift as part of the operation:

add x0, x1, x2, lsl #3   /* x0 = x1 + (x2 << 3) */
orr x0, x1, x2, lsr #2   /* x0 = x1 | (x2 >> 2) */

This is called a shifted register operand. It lets us combine a shift and an arithmetic operation in a single instruction, which saves both code space and execution time.
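
For readers who think in C first, the two ideas above map onto ordinary expressions. This is only an illustrative sketch with hypothetical function names; it shows what the instructions compute, not how the kernel is written.

```c
#include <assert.h>
#include <stdint.h>

/* bic x0, x1, x2 : keep the bits of value that are NOT set in mask */
uint64_t bic(uint64_t value, uint64_t mask)
{
    return value & ~mask;
}

/* add x0, x1, x2, lsl #shift : base plus a shifted index in one operation */
uint64_t add_lsl(uint64_t base, uint64_t index, unsigned shift)
{
    assert(shift < 64);  /* the shift amount must be smaller than the register width */
    return base + (index << shift);
}
```

add_lsl(0x1000, 3, 3) gives 0x1018, the address of element 3 in an array of 8-byte values that starts at 0x1000, and bic(0xFF, 0x05) gives 0xFA.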

5.6 Control Flow Instructions

Unconditional Branch

b loop           /* jump to label 'loop' (unconditional) */
b 0x40000100     /* jump to address 0x40000100 (encoded as a PC-relative offset) */

Conditional Branch

Conditional branches use the condition flags set by a previous CMP, ADDS, or SUBS instruction:

cmp x0, #5
b.eq label        /* branch if x0 == 5 (equal) */
b.ne label        /* branch if x0 != 5 (not equal) */
b.lt label        /* branch if x0 < 5  (signed less than) */
b.le label        /* branch if x0 <= 5 (signed less or equal) */
b.gt label        /* branch if x0 > 5  (signed greater than) */
b.ge label        /* branch if x0 >= 5 (signed greater or equal) */
b.hi label        /* branch if x0 > 5  (unsigned higher) */
b.lo label        /* branch if x0 < 5  (unsigned lower) */

Condition codes and the flags they test:

Code | Meaning | Flags Tested
EQ | Equal | Z == 1
NE | Not equal | Z == 0
LT | Signed less than | N != V
LE | Signed less or equal | N != V or Z == 1
GT | Signed greater than | Z == 0 and N == V
GE | Signed greater or equal | N == V
LO | Unsigned lower | C == 0
LS | Unsigned lower or same | C == 0 or Z == 1
HS | Unsigned higher or same | C == 1
HI | Unsigned higher | C == 1 and Z == 0
MI | Negative (minus) | N == 1
PL | Positive or zero (plus) | N == 0
VS | Overflow (signed overflow) | V == 1
VC | No overflow | V == 0
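
The table can be read as a set of boolean expressions over the four flags. The following C sketch writes them out one per code. The function cond_holds is a hypothetical helper for experimenting on a host machine; the CPU does this test in hardware.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does the condition code hold for the given flags?
 * Each flag argument is 0 or 1. Returns false for codes not listed here. */
bool cond_holds(const char *code, int n, int z, int c, int v)
{
    if (strcmp(code, "EQ") == 0) return z == 1;
    if (strcmp(code, "NE") == 0) return z == 0;
    if (strcmp(code, "LT") == 0) return n != v;
    if (strcmp(code, "LE") == 0) return n != v || z == 1;
    if (strcmp(code, "GT") == 0) return z == 0 && n == v;
    if (strcmp(code, "GE") == 0) return n == v;
    if (strcmp(code, "LO") == 0) return c == 0;
    if (strcmp(code, "HS") == 0) return c == 1;
    if (strcmp(code, "HI") == 0) return c == 1 && z == 0;
    if (strcmp(code, "MI") == 0) return n == 1;
    if (strcmp(code, "PL") == 0) return n == 0;
    if (strcmp(code, "VS") == 0) return v == 1;
    if (strcmp(code, "VC") == 0) return v == 0;
    return false;  /* code not covered by this sketch */
}
```

With the flags left by comparing 3 against 5 (N = 1, Z = 0, C = 0, V = 0), LT and LO hold while GE and HI do not.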

Compare and Branch

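CBZ and CBNZ combine a comparison against zero with a branch in one instruction. They save an instruction compared with CMP followed by B.EQ or B.NE, and they do not modify the condition flags:

cbz x0, label       /* branch to label if x0 == 0 */
cbnz x0, label      /* branch to label if x0 != 0 */

/* cbz x0, label behaves like this pair, except that cmp also updates the flags: */
cmp x0, #0
b.eq label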

Function Calls

bl my_function      /* branch and link: save return address in x30, then jump */
ret                 /* return: jump to address in x30 */

/* Example: calling a function */
_start:
    bl kernel_main   /* call kernel_main; x30 = return address */
    b .              /* infinite loop ('.' means current address) */

kernel_main:
    /* function body */
    ret              /* return to _start */

BL saves the address of the next instruction into register x30 (the link register) and then jumps to the target. RET jumps back to the address in x30. This is how function calls work at the assembly level.

5.7 The Stack

The stack is a region of memory used for temporary storage. It grows downward: as you push data, the stack pointer (SP) decreases. When you pop, it increases.

In ARM64, the stack must always be 16-byte aligned. This means the stack pointer must be a multiple of 16 whenever it is used as the base address of a memory access. When SP alignment checking is enabled, violating this rule raises an SP alignment fault.

/* Push two registers (16 bytes) */
stp x29, x30, [sp, #-16]!    /* subtract 16 from SP, then store x29 and x30 */

/* Pop two registers */
ldp x29, x30, [sp], #16      /* load x29 and x30, then add 16 to SP */

/* Push a single register (still uses 16 bytes for alignment) */
str x0, [sp, #-16]!           /* allocate 16 bytes, store x0 at the new SP (upper 8 bytes unused) */
ldr x0, [sp], #16             /* load x0, then deallocate 16 bytes */
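
The push and pop sequences above can be modeled in C on a simulated stack. This is a sketch under stated assumptions: the struct sim_stack, its 512-byte size, and the function names are all hypothetical, and SP is represented as a byte offset rather than a real address.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64  /* size of the simulated stack, in 8-byte words */

struct sim_stack {
    uint64_t mem[STACK_WORDS];
    uint64_t sp;  /* byte offset into mem; an empty stack has sp at the top */
};

void stack_init(struct sim_stack *s)
{
    s->sp = sizeof s->mem;  /* 512, a multiple of 16 */
}

/* stp a, b, [sp, #-16]! : move SP down by 16, then store both values */
void push_pair(struct sim_stack *s, uint64_t a, uint64_t b)
{
    assert(s->sp >= 16);       /* room for one more pair */
    s->sp -= 16;
    s->mem[s->sp / 8] = a;
    s->mem[s->sp / 8 + 1] = b;
    assert(s->sp % 16 == 0);   /* SP stays 16-byte aligned */
}

/* ldp a, b, [sp], #16 : load both values, then move SP up by 16 */
void pop_pair(struct sim_stack *s, uint64_t *a, uint64_t *b)
{
    assert(s->sp + 16 <= sizeof s->mem);  /* stack is not empty */
    *a = s->mem[s->sp / 8];
    *b = s->mem[s->sp / 8 + 1];
    s->sp += 16;
}
```

Pushing (1, 2) and then (3, 4) lowers sp by 32 bytes; popping returns (3, 4) first and (1, 2) second, and sp ends where it started.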

The stack is used for three main purposes:

  • Saving return addresses. When a function calls another function, it must save its own return address (x30) on the stack.
  • Saving callee-saved registers. If a function uses x19-x28, it must save them on entry and restore them on exit.
  • Local variables. If a function has more local variables than fit in registers, they go on the stack.

Frame Pointer

The frame pointer (x29) points to the beginning of a function's stack frame. It is used for debugging and unwinding the call stack (for example, when printing a backtrace):

my_function:
    stp x29, x30, [sp, #-16]!    /* save frame pointer and link register */
    mov x29, sp                    /* set new frame pointer to current SP */
    sub sp, sp, #32                /* allocate 32 bytes for local variables */

    /* ... function body ... */

    add sp, sp, #32                /* deallocate local variables */
    ldp x29, x30, [sp], #16       /* restore frame pointer and link register */
    ret

The frame pointer creates a linked list of stack frames. Each frame pointer points to the previous one, forming a chain that debuggers can walk to produce a backtrace.
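
A backtrace routine only needs to follow that chain. The C sketch below shows the idea on a host machine. The struct frame_record and the function walk_frames are hypothetical names, and the sketch assumes the outermost frame stores a NULL frame pointer, which the startup code would have to arrange.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout of the 16 bytes stored by "stp x29, x30, [sp, #-16]!". */
struct frame_record {
    struct frame_record *caller_fp;  /* saved x29: the caller's frame record */
    uintptr_t return_address;        /* saved x30 */
};

/* Walk the chain starting at fp and copy up to max return addresses into out.
 * Stops at a NULL frame pointer. Returns the number of entries written. */
size_t walk_frames(const struct frame_record *fp, uintptr_t *out, size_t max)
{
    size_t count = 0;
    while (fp != NULL && count < max) {
        out[count++] = fp->return_address;
        fp = fp->caller_fp;
    }
    return count;
}
```

On real hardware the starting pointer would come from x29; here it can be any hand-built chain of records, which makes the logic easy to test.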

5.8 System Instructions

These instructions control the CPU itself. Many of them are privileged: most system registers can only be accessed at EL1 or higher. Some, such as the barriers and SVC, can also be executed at EL0.

System Register Access

mrs x0, CurrentEL        /* read CurrentEL into x0 */
msr SCTLR_EL1, x0        /* write x0 to SCTLR_EL1 */

isb                       /* instruction synchronization barrier */
dsb sy                    /* data synchronization barrier (full system) */
dmb sy                    /* data memory barrier (full system) */

  • MRS: Move system register to general-purpose register (read)
  • MSR: Move general-purpose register to system register (write)
  • ISB: Flush the instruction pipeline. Use after changing system registers that affect instruction execution.
  • DSB: Wait for all memory accesses to complete. Use before TLB maintenance.
  • DMB: Ensure memory access ordering. Use in synchronization code.

Exception-Related Instructions

svc #0                  /* supervisor call: trigger a system call from EL0 to EL1 */
eret                     /* exception return: return to the interrupted level, for example EL1 to EL0 */
hvc #0                  /* hypervisor call (EL1 to EL2) */
smc #0                  /* secure monitor call (EL1 or EL2 to EL3) */

In our kernel, SVC is used by user-space programs to make system calls, and ERET is used by the kernel to return to user space after handling an exception.

Power Management

wfi                      /* wait for interrupt */
wfe                      /* wait for event */
sev                      /* send event (wake up cores waiting with WFE) */

WFI puts the CPU into a low-power state until an interrupt occurs. Our kernel uses WFI in the idle loop when no processes are ready to run.

5.9 Labels, Directives, and the Assembler

Assembly code uses labels to mark locations and directives to control the assembler:

.section .text._start    /* directive: put following code in this section */
.global _start            /* directive: make _start visible to the linker */

_start:                   /* label: marks the entry point address */
    ldr x0, =_stack_end
    mov sp, x0
    bl kernel_main

.section .rodata          /* directive: read-only data section */
msg:
    .asciz "Hello"        /* directive: null-terminated string */

.section .data            /* directive: writable data section */
counter:
    .quad 0               /* directive: 64-bit value initialized to 0 */

.section .bss             /* directive: zero-initialized data */
.align 4                  /* directive: align to 16 bytes */
_stack_start:
    .skip 4096            /* directive: reserve 4096 bytes */
_stack_end:

Common directives:

Directive | Purpose
.section | Switch to a specific section (text, data, bss, rodata)
.global | Make a label visible to the linker
.align N | Align to 2^N bytes
.byte, .word, .quad | Emit data of specific sizes
.asciz | Emit a null-terminated string
.skip | Reserve N bytes (like BSS space)
.rept ... .endr | Repeat a block of instructions
.macro ... .endm | Define an assembler macro
.equ | Define a numeric constant
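
Because .align takes a power of two on this target, it is easy to misread .align 4 as "4 bytes". The small C sketch below computes the address at which the assembler would continue after .align N, assuming the 2^N interpretation given in the table. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of 2^n, as ".align n" does here. */
uint64_t align_up_pow2(uint64_t addr, unsigned n)
{
    assert(n < 64);
    uint64_t mask = ((uint64_t)1 << n) - 1;  /* low n bits set */
    return (addr + mask) & ~mask;
}
```

For example, align_up_pow2(0x1001, 4) is 0x1010 (16-byte alignment), and align_up_pow2(0x801, 11) is 0x1000 (2 KB alignment).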

5.10 Inline Assembly in C

Sometimes we need assembly code inside a C function. GCC provides the asm() keyword for this, known as inline assembly.

Basic Inline Assembly

/* Execute a WFI instruction */
__asm__("wfi");

/* Read CurrentEL */
uint64_t el;
__asm__("mrs %0, CurrentEL" : "=r" (el));

/* Write to a system register */
__asm__("msr SCTLR_EL1, %0" : : "r" (value));

The syntax is: asm("instructions" : outputs : inputs : clobbers). In the examples above:

  • %0 refers to the first operand (output or input)
  • "=r" means "output in any general-purpose register"
  • "r" means "input from any general-purpose register"

Extended Inline Assembly

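/* Atomic compare-and-swap (LDXR/STXR) */
#include <stdint.h>

int atomic_cas(uint64_t *ptr, uint64_t expected, uint64_t desired) {
    uint64_t result;   /* value loaded from *ptr */
    uint32_t status;   /* STXR status: 0 = store succeeded, nonzero = retry */
    __asm__ __volatile__(
        "1: ldxr %x0, [%2]\n"       /* load *ptr and mark it for exclusive access */
        "   cmp %x0, %3\n"
        "   b.ne 2f\n"              /* value differs: give up without storing */
        "   stxr %w1, %4, [%2]\n"   /* try to store desired; status goes to %w1 */
        "   cbnz %w1, 1b\n"         /* exclusive access was lost: retry */
        "2:"
        : "=&r" (result), "=&r" (status)
        : "r" (ptr), "r" (expected), "r" (desired)
        : "memory", "cc"
    );
    return result == expected;
}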

This uses __volatile__ to prevent the compiler from optimizing away the assembly block, and a local label format (1:, 2:) with b.ne 2f (forward) and cbnz ... 1b (backward).
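
For comparison, the same compare-and-swap semantics can be written without inline assembly by using C11 atomics. This is a different technique from the LDXR/STXR loop above, shown only as a portable sketch: it assumes a compiler that provides <stdatomic.h>, the function name atomic_cas_c11 is hypothetical, and the default ordering used here is sequentially consistent.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Portable sketch: returns true if *ptr held expected and was replaced by desired. */
bool atomic_cas_c11(_Atomic uint64_t *ptr, uint64_t expected, uint64_t desired)
{
    /* On failure the call writes the observed value into expected. That is a
     * local copy here, so the caller's argument is not modified. */
    return atomic_compare_exchange_strong(ptr, &expected, desired);
}
```

This version is convenient for testing the surrounding logic on a host machine, while the inline assembly version shows exactly which instructions run on the target.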

When to Use Inline vs Separate Assembly

Situation | Use
Single instruction (wfi, dsb, isb) | Inline assembly
Short sequence with C operand access | Inline assembly
Kernel entry point (_start) | Separate .S file
Exception vector table | Separate .S file
Context switching code | Separate .S file
Code over 10-15 instructions | Separate .S file

As a general rule: one or two instructions go in inline assembly. Larger blocks go in a separate .S file. This keeps the C code readable and the assembly code maintainable.

5.11 Our Implementation

Now let us look at how assembly language is used in our actual kernel code.

The Entry Point (start.S)

We saw this file in Chapter 1. Now we understand every line:

start.S
.section .text._start     /* Place this code in its own section */
.global _start             /* Export _start so the linker can find it */

_start:
    ldr x0, =_stack_end    /* Load the address of _stack_end into x0 */
    mov sp, x0              /* Set the stack pointer to that address */
    bl kernel_main          /* Call kernel_main (saves return address in x30) */
    wfi                     /* Wait for interrupt (should never reach here) */
    b _start                /* If WFI returns, branch back to _start and run the startup code again */

Breaking it down:

  • _start is the entry point. The linker script tells the linker that the kernel binary starts here.
  • The CPU starts with no valid stack pointer. We load the address of _stack_end (which is the top of the stack area defined in the linker script) and set SP to it.
  • BL kernel_main calls our C code. The return address is saved in x30.
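  • If kernel_main ever returns, we execute WFI to save power. When an interrupt wakes the CPU, the branch to _start runs the startup sequence again, so the CPU never runs off the end of the code.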

The Exception Vector Table

When an exception occurs, the CPU looks at a table of 16 entries called the exception vector table. Each entry is 128 bytes (32 instructions) of assembly code. We will write this when we get to exception handling (Chapter 11):

.align 11                  /* must be 2KB aligned */
vectors:
    /* Current EL with SP0 (SP_EL0) */
    .align 7                /* each entry is 128-byte aligned */
    b el1_sync_sp0          /* synchronous */
    .align 7
    b el1_irq_sp0           /* IRQ */
    .align 7
    b el1_fiq_sp0           /* FIQ */
    .align 7
    b el1_serror_sp0        /* SError */

    /* Current EL with SPx (SP_EL1) */
    .align 7
    b el1_sync_sp1          /* synchronous */
    .align 7
    b el1_irq_sp1           /* IRQ */
    .align 7
    b el1_fiq_sp1           /* FIQ */
    .align 7
    b el1_serror_sp1        /* SError */

    /* Lower EL using AArch64 */
    .align 7
    b el0_sync_64           /* synchronous */
    .align 7
    b el0_irq_64            /* IRQ */
    .align 7
    b el0_fiq_64            /* FIQ */
    .align 7
    b el0_serror_64         /* SError */

    /* Lower EL using AArch32 (not used) */
    .align 7
    b el0_sync_32           /* synchronous */
    .align 7
    b el0_irq_32            /* IRQ */
    .align 7
    b el0_fiq_32            /* FIQ */
    .align 7
    b el0_serror_32         /* SError */

Each branch goes to a handler function that saves registers, handles the exception, and returns. We will fill these handlers in Chapter 11. The important thing now is that the entire table is assembly: you cannot write this in C.
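
Since every entry is 128 bytes and the entries come in four groups of four, the offset of any entry from the start of the table is simple arithmetic. The C sketch below spells it out; the enum and function names are hypothetical and chosen to mirror the comments in the table above.

```c
#include <assert.h>
#include <stdint.h>

enum vector_group { CUR_EL_SP0 = 0, CUR_EL_SPX = 1, LOWER_EL_A64 = 2, LOWER_EL_A32 = 3 };
enum vector_type  { VEC_SYNC = 0, VEC_IRQ = 1, VEC_FIQ = 2, VEC_SERROR = 3 };

/* Byte offset of an entry from the start of the table:
 * each group spans 4 * 128 = 0x200 bytes, each entry 0x80 bytes. */
uint32_t vector_offset(enum vector_group group, enum vector_type type)
{
    assert(group <= LOWER_EL_A32 && type <= VEC_SERROR);
    return (uint32_t)group * 0x200u + (uint32_t)type * 0x80u;
}
```

An IRQ taken from a lower exception level running AArch64 therefore lands at offset 0x480, and the last entry of the table starts at 0x780, which is why the whole table needs 2 KB.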

Calling Convention for Our Kernel

We follow the ARM64 Procedure Call Standard (AAPCS64) for all function calls:

  • Arguments 1-8 in x0-x7
  • Return value in x0
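  • Callee-saved registers: x19-x28, plus the frame pointer x29. The link register x30 is overwritten by BL, so a function that makes calls must save it first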
  • Stack must be 16-byte aligned at each call boundary
  • The stack grows downward (full descending)

When we write assembly functions that C code calls, we must follow this convention. Here is an example of an assembly function that C can call:

cpu_asm.S
.global cpu_get_current_el
cpu_get_current_el:
    mrs x0, CurrentEL     /* read current exception level */
    and x0, x0, #0xC      /* mask bits 3:2 (EL field) */
    lsr x0, x0, #2        /* shift right by 2 to get 0, 1, 2, or 3 */
    ret                    /* return value is in x0 */

.global cpu_wait_for_interrupt
cpu_wait_for_interrupt:
    wfi
    ret

C code calls these functions like any other function:

uint64_t el = cpu_get_current_el();
cpu_wait_for_interrupt();
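
The mask-and-shift step in cpu_get_current_el can be checked on a host machine by applying the same arithmetic to sample register values. The function below is a hypothetical stand-in for that purpose; it does not read the real register.

```c
#include <stdint.h>

/* Same mask and shift as cpu_get_current_el, applied to a raw CurrentEL value.
 * The exception level lives in bits 3:2 of the register. */
unsigned decode_current_el(uint64_t current_el_reg)
{
    return (unsigned)((current_el_reg & 0xCu) >> 2);
}
```

A raw value of 0x4 decodes to EL1 and 0x8 decodes to EL2.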

Assembly Patterns in Our Kernel

These patterns appear throughout our kernel:

/* 1. Critical section: mask interrupts */
msr DAIFSet, #2          /* set I bit (mask IRQs) */
/* ... critical code ... */
msr DAIFClr, #2          /* clear I bit (unmask IRQs) */

/* 2. Read-modify-write a system register */
mrs x0, SCTLR_EL1        /* read */
orr x0, x0, #1           /* set bit 0 (enable MMU) */
msr SCTLR_EL1, x0        /* write */
isb                       /* synchronize */

/* 3. Memory barrier before page table switch */
dsb sy                   /* ensure all previous memory accesses complete */
msr TTBR0_EL1, x0        /* switch page table */
isb                      /* flush pipeline and ensure new translation is used */

/* 4. Save context on exception entry */
sub sp, sp, #(34 * 8)    /* allocate space for 34 registers */
stp x0, x1, [sp, #16*0]  /* save x0, x1 */
stp x2, x3, [sp, #16*1]  /* save x2, x3 */
/* ... save x4-x29 ... */
str x30, [sp, #16*16]    /* save x30 (link register) */
mrs x0, ELR_EL1          /* save exception return address */
str x0, [sp, #16*15]
mrs x0, SPSR_EL1         /* save saved processor state */
str x0, [sp, #16*15 + 8]

/* 5. Restore context on exception return */
ldr x0, [sp, #16*15]     /* restore ELR_EL1 first, while x0 is still free */
msr ELR_EL1, x0
ldr x0, [sp, #16*15 + 8] /* restore SPSR_EL1 */
msr SPSR_EL1, x0
ldr x30, [sp, #16*16]    /* restore x30 (link register) */
/* ... restore x4-x29 ... */
ldp x2, x3, [sp, #16*1]
ldp x0, x1, [sp, #16*0]  /* restore x0 last: it was the scratch register above */
add sp, sp, #(34 * 8)    /* deallocate context storage */
eret                      /* return to where the exception came from */

These patterns form the backbone of our kernel's low-level operations. As we progress through the book, we will build each pattern into working code.
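
When C code later needs to inspect a saved context, a struct that mirrors the assembly offsets keeps the two sides in step. The following is only a sketch: the name trap_frame and its field names are hypothetical, and the layout is inferred from the offsets used in pattern 4 above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the 34-slot frame built by pattern 4. */
struct trap_frame {
    uint64_t x[30];    /* x0-x29, stored as pairs at offsets 16*0 .. 16*14 */
    uint64_t elr;      /* ELR_EL1 at offset 16*15 */
    uint64_t spsr;     /* SPSR_EL1 at offset 16*15 + 8 */
    uint64_t rest[2];  /* remaining two slots of the 34 * 8 bytes */
};

/* Compile-time checks that the C layout matches the assembly offsets. */
_Static_assert(offsetof(struct trap_frame, elr) == 16 * 15, "ELR_EL1 slot");
_Static_assert(offsetof(struct trap_frame, spsr) == 16 * 15 + 8, "SPSR_EL1 slot");
_Static_assert(sizeof(struct trap_frame) == 34 * 8, "frame size");
```

If someone changes an offset in the assembly without updating the struct, or the other way around, the build fails instead of the kernel silently reading the wrong slot.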

5.12 Exercises

Exercise 1: Read Assembly Code

Here is a simple assembly function. Write what it does in C:

func:
    mov x1, #0
loop:
    cmp x1, #10
    b.ge done
    add x0, x0, x1
    add x1, x1, #1
    b loop
done:
    ret

Exercise 2: Write Assembly

Write an assembly function called strcpy_asm that copies a null-terminated string from x0 (source) to x1 (destination). Use LDRB, STRB, CBZ, and post-index addressing.

Exercise 3: Stack Operations

Write an assembly function that takes three arguments (x0, x1, x2), saves them on the stack, calls another function, and then returns the original value of x0. Use the frame pointer.

Exercise 4: Inline Assembly

Write a C function uint64_t read_ttbr0_el1(void) that reads the TTBR0_EL1 register using inline assembly and returns the value.

Exercise 5: Loop with Conditional

Write an assembly function that counts the number of set bits in x0 and returns the count in x0. Use a loop, LSR, and AND to test each bit. This is also called a population count (popcount).

Exercise 6: Understanding Our start.S (Challenge)

Modify start.S to also clear the BSS section before calling kernel_main. BSS starts at label _bss_start and ends at _bss_end. You will need to write a loop that stores zero to each 8-byte word in that range.

5.13 Summary

In this chapter, we learned what assembly language is and why a kernel needs it. The kernel uses assembly for things that C cannot do directly: setting up the initial environment, handling exceptions, switching between processes, and controlling CPU hardware.

We covered the main categories of ARM64 instructions:

  • Data movement: MOV, LDR, STR, STP, LDP with various addressing modes
  • Arithmetic: ADD, SUB, MUL, UDIV, SDIV and their flag-setting variants
  • Logical: AND, ORR, EOR, BIC with optional shifts
  • Control flow: B, B.cond, BL, RET, CBZ, CBNZ
  • System: MRS, MSR, SVC, ERET, WFI, DSB, ISB

We examined how the stack works, how to use frame pointers, and how to push and pop registers correctly. We saw how inline assembly lets us use assembly instructions directly in C code for short operations.

Finally, we looked at the actual assembly code in our kernel: the entry point, exception vector table, and common patterns. These will appear throughout the rest of the book as we build each subsystem.

In the next chapter, we will learn how to write C code specifically for kernel development, including the freestanding environment, volatile access, and linker scripts.