Chapter 7: Boot Process

What You Will Learn in This Chapter

The complete boot sequence from power-on to kernel execution
The role of firmware, bootloader, and kernel in the boot chain
How exception levels change during boot
What the CPU state is when our kernel starts
How QEMU boots our kernel with the -kernel flag
The device tree and how to read it
What our _start code must do before calling kernel_main

7.1 The Boot Sequence Overview

When you press the power button, the CPU comes out of reset with most of its registers holding undefined values. A sequence of events must occur before our kernel can run. This sequence is called the boot chain.

            graph LR
                A[Power On] --> B[ROM Firmware]
                B --> C[EL3 Boot Loader]
                C --> D[EL2 Boot Loader]
                D --> E[EL1 Kernel]
                E --> F[EL0 Applications]

Figure 7.1: The ARM64 boot chain. Each stage runs at the same or a lower exception level than the previous one.

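Each stage initializes some hardware, prepares for the next stage, and then, except for the first hand-off, which stays at EL3, drops to a lower exception level before jumping to the next stage. This is called exception level dropping. Once software has dropped to a lower exception level, it can re-enter a higher one only by taking an exception, for example an SMC or HVC call into the firmware or hypervisor.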

Stage	EL	What It Does
ROM Firmware	EL3	Basic CPU init, loads first boot loader from flash
EL3 Boot Loader	EL3	DRAM init, loads EL2 boot loader (e.g., U-Boot)
EL2 Boot Loader	EL2	Loads kernel from disk/network, passes device tree
Kernel	EL1	Our kernel: MMU, scheduler, drivers, system calls
Applications	EL0	User-space programs running under kernel control

7.2 Firmware and the Boot ROM

When power is first applied, the CPU starts executing code from a fixed address in ROM (read-only memory). This is the boot ROM, built into the SoC itself. It cannot be modified.

The boot ROM does the minimum needed to get the system started:

Initializes the CPU (caches off, MMU off, all interrupts masked)
Sets up the stack pointer for EL3
Reads boot configuration (boot device priority, etc.)
Loads the next-stage boot loader from flash/EEPROM into SRAM
Jumps to it at EL3

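On a Raspberry Pi, the boot ROM runs on the VideoCore GPU rather than on the ARM cores: on the Pi 3 and earlier it loads a file called bootcode.bin from the SD card, while the Pi 4 and 5 load their second-stage boot loader from an on-board SPI EEPROM. On QEMU, no boot ROM runs when we use -kernel; firmware runs only if we supply an image ourselves with -bios (for example QEMU_EFI.fd for UEFI).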

7.3 The Boot Loader

The boot loader is a program that loads our kernel into memory and prepares the environment for it. On ARM64 systems, common boot loaders include:

U-Boot: the most common open-source boot loader for ARM boards
UEFI: modern firmware interface, standard on ARM servers and available for boards such as the Raspberry Pi 4 through community firmware
ARM Trusted Firmware (TF-A): reference EL3 firmware for ARM
QEMU's built-in loader: when using -kernel, QEMU acts as a simple boot loader itself

The boot loader's responsibilities:

Initialize DRAM (memory) so there is somewhere to load the kernel
Load the kernel image from storage (SD card, network, flash) into DRAM
Load the device tree into memory (tells the kernel about hardware)
Set up CPU registers before jumping to the kernel
Drop exception level to EL1 or EL2 before entering the kernel

How QEMU Handles This

When we run:

qemu-system-aarch64 -M virt -cpu cortex-a72 -nographic -kernel kernel.elf

QEMU does the following:

Creates a virtual ARM64 machine with the virt platform
Generates a device tree describing that machine
Loads the segments of kernel.elf at the addresses specified in the ELF program headers (0x40000000)
Sets the program counter of CPU 0 to the ELF entry point
Starts CPU 0 directly at EL1 (at EL2 with -machine virt,virtualization=on); no separate firmware stage runs unless we supply one with -bios

This means our kernel starts at EL1 with:

MMU disabled (all addresses are physical)
Data and instruction caches disabled
Stack pointer undefined (we must set it up)
All interrupts masked (DAIF bits set)
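Device tree address in x0 only under the Linux boot protocol (non-ELF images); for an ELF image, QEMU documents that the device tree is placed at the start of RAM instead, so x0 should not be trusted blindly
No CPU ID in any register (the Linux boot protocol reserves x1 to x3 as zero); each core reads its ID from MPIDR_EL1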

7.4 The Device Tree (FDT)

A device tree is a data structure that describes the hardware to the kernel. It tells the kernel what devices exist, where their registers are in memory, how interrupts are wired, and other configuration details.

The kernel receives the device tree as a Flattened Device Tree (FDT), also called a Device Tree Blob (DTB). It is a binary format, but it can be represented as text in a Device Tree Source (DTS) file.

When QEMU boots our kernel, it can pass a device tree. Under the Linux boot protocol, the address of the device tree is placed in register x0 before our kernel starts. A simplified excerpt of such a device tree, in DTS form, looks like this:

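/dts-v1/;

/ {
    model = "linux,dummy-virt";
    compatible = "linux,dummy-virt";
    #address-cells = <2>;   /* each address is two 32-bit cells */
    #size-cells = <2>;      /* each size is two 32-bit cells */

    memory@40000000 {
        device_type = "memory";
        reg = <0x00000000 0x40000000 0x00000000 0x40000000>;  /* 1 GB at 0x40000000 */
    };

    uart@9000000 {
        compatible = "arm,pl011";
        reg = <0x00000000 0x09000000 0x00000000 0x00001000>;  /* 4 KB at 0x09000000 */
        interrupts = <0x00000000 0x00000001 0x00000004>;      /* SPI 1, level-triggered */
    };

    cpus {
        #address-cells = <1>;
        #size-cells = <0>;

        cpu@0 {
            device_type = "cpu";
            compatible = "arm,cortex-a72";
            reg = <0x00000000>;
        };
    };
};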

The device tree allows the same kernel binary to run on different hardware with different memory sizes, different UART addresses, or different numbers of CPUs. Instead of hard-coding addresses, the kernel reads them from the device tree.
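
To make the cell layout concrete, here is a minimal sketch of how a reg property can be decoded once its 32-bit cells have been converted to host byte order. The helper names dt_read_cells and dt_decode_reg are hypothetical, and the sketch assumes two address cells and two size cells, as in the UART node above.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine `cells` consecutive 32-bit cells (already in host byte order)
 * into one 64-bit value, most significant cell first. */
static uint64_t dt_read_cells(const uint32_t *cell, size_t cells)
{
    uint64_t value = 0;
    for (size_t i = 0; i < cells; i++) {
        value = (value << 32) | cell[i];
    }
    return value;
}

/* Hypothetical helper: split one reg entry into base and size,
 * assuming #address-cells = 2 and #size-cells = 2. */
static void dt_decode_reg(const uint32_t reg[4], uint64_t *base, uint64_t *size)
{
    *base = dt_read_cells(&reg[0], 2);
    *size = dt_read_cells(&reg[2], 2);
}
```

Applied to the UART's reg cells, this yields a base of 0x09000000 and a size of 0x1000, which matches the address we currently hard-code.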

For now, our kernel hard-codes the UART address (0x09000000). Later, we will write a device tree parser so the kernel can discover hardware dynamically.
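
Before any parsing, a kernel would usually check that the pointer it was given really refers to a device tree blob. The following is a minimal sketch of that check, not our eventual parser. It relies on two facts from the Devicetree Specification: header fields are big-endian, and the blob starts with the magic value 0xd00dfeed followed by its total size. The function names fdt_be32 and fdt_total_size are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define FDT_MAGIC 0xd00dfeedu

/* Read a big-endian 32-bit value byte by byte (safe for any alignment). */
static uint32_t fdt_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Return the blob's total size in bytes, or 0 if the magic is wrong.
 * Header layout: magic at offset 0, totalsize at offset 4. */
static uint32_t fdt_total_size(const void *blob)
{
    const uint8_t *p = blob;
    if (p == NULL || fdt_be32(p) != FDT_MAGIC) {
        return 0;
    }
    return fdt_be32(p + 4);
}
```

A return value of 0 tells the caller that the address in hand does not point at a valid blob, which is a cheap way to avoid trusting a stale or missing pointer.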

7.5 CPU State at Kernel Entry

When the boot loader jumps to our kernel entry point (_start), the CPU is in a specific state. Understanding this state is critical because our startup code must handle it correctly.

Component	State at Entry	What We Must Do
Exception level	EL1 (or EL2 if booted by EL2 loader)	If at EL2, drop to EL1
MMU	Disabled (all addresses physical)	Keep disabled until we set up page tables
Data cache	Disabled	Keep disabled until MMU is on
Instruction cache	Disabled (may be enabled)	Can enable early for performance
Stack pointer	Undefined (SP_EL1 is not set up)	Set SP_EL1 immediately
Interrupts	All masked (DAIF bits set)	Keep masked until we have handlers
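x0	Device tree address under the Linux boot protocol (not guaranteed for ELF images)	Save before using (we pass to kernel_main)
x1 to x3	Reserved (zero) under the Linux boot protocol; the CPU ID is not passed in a register	Read the CPU ID from MPIDR_EL1
Other registers	Undefined	Do not assume any value
BSS section	Not guaranteed to be zero (may contain garbage)	Zero it before using any global variables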
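
The MMU and cache rows of this table correspond to individual bits of the SCTLR_EL1 system register: M (bit 0), C (bit 2), and I (bit 12). As a hedged sketch, the decoder below takes a raw register value and reports which features are on. The struct and function names are hypothetical. On the target the value would come from an mrs instruction reading sctlr_el1; here it is a plain argument, so the logic can be tried on any host.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCTLR_M (1ull << 0)    /* MMU enable */
#define SCTLR_C (1ull << 2)    /* data cache enable */
#define SCTLR_I (1ull << 12)   /* instruction cache enable */

/* Hypothetical summary of the entry state described in the table. */
struct entry_state {
    bool mmu_on;
    bool dcache_on;
    bool icache_on;
};

/* Decode a raw SCTLR_EL1 value into the three flags. */
static struct entry_state decode_sctlr(uint64_t sctlr)
{
    struct entry_state s;
    s.mmu_on    = (sctlr & SCTLR_M) != 0;
    s.dcache_on = (sctlr & SCTLR_C) != 0;
    s.icache_on = (sctlr & SCTLR_I) != 0;
    return s;
}
```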

Our current _start code handles the most critical items:

_start:
    ldr x0, =_stack_end    /* load top of stack address */
    mov sp, x0              /* set stack pointer */
    bl kernel_main          /* jump to C code */
halt:
    wfi                     /* if kernel_main returns, sleep forever */
    b halt

This is minimal. Later, we will need to add BSS clearing, exception level checking, and multicore handling.

7.6 BSS Clearing

The BSS section contains global and static variables that are initialized to zero. In a normal C program, the C runtime startup code (crt0) zeros BSS before calling main. In our freestanding kernel, we must do this ourselves.

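Our linker script defines two symbols, _bss_start and _bss_end, that mark the BSS region. The startup code zeroes everything between them:

/* Before calling kernel_main, zero the BSS section */
_start:
    ldr x0, =_stack_end
    mov sp, x0

    /* Clear BSS */
    ldr x0, =_bss_start
    ldr x1, =_bss_end
    mov x2, xzr              /* zero */
1:
    cmp x0, x1
    b.hs 2f                  /* done when x0 >= x1 (unsigned compare for addresses) */
    str x2, [x0], #8         /* store zero and advance by 8 bytes */
    b 1b
2:
    bl kernel_main
3:
    wfi                      /* if kernel_main returns, sleep forever */
    b 3b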

This loop stores 8 bytes of zero to each 8-byte word in the BSS range, so the linker script must align both ends of the BSS region to 8 bytes. Without this step, a global variable that should start at zero may contain garbage, causing unpredictable behavior.
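
For readers more comfortable with C, the same loop can be sketched as below. This is an illustration of the logic only: the real clearing has to stay in assembly, or at least run before any C code touches a global. The function name zero_words is hypothetical, and both pointers are assumed to be 8-byte aligned.

```c
#include <stdint.h>

/* Zero every 64-bit word in [start, end). Both pointers are assumed to
 * be 8-byte aligned, mirroring the 8-byte stores of the assembly loop. */
static void zero_words(uint64_t *start, uint64_t *end)
{
    /* volatile discourages the compiler from replacing the loop with a
     * call to memset, which a freestanding kernel may not provide. */
    volatile uint64_t *p = start;
    while (p < end) {
        *p++ = 0;
    }
}
```

The half-open range means the word at the end address is left untouched, just as the assembly loop stops as soon as the cursor reaches the end symbol.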

7.7 Exception Level Drop (EL2 to EL1)

Some boot configurations start our kernel at EL2 instead of EL1. Since our kernel is designed to run at EL1, we need to detect this and drop to EL1 if necessary.

_start:
    /* Check current exception level */
    mrs x0, CurrentEL
    lsr x0, x0, #2
    cmp x0, #2               /* Are we at EL2? */
    b.ne setup_el1

    /* We are at EL2. Configure EL2 and drop to EL1. */
    /* Set up a minimal EL2 environment... */

    /* EL1 must run in AArch64 state: set HCR_EL2.RW (bit 31) */
    mov x0, #(1 << 31)
    msr HCR_EL2, x0

    /* Set SPSR_EL2 to boot to EL1 */
    mov x0, #0x3C5           /* EL1h, all interrupts masked */
    msr SPSR_EL2, x0

    /* Set the return address to our EL1 startup code */
    adr x0, setup_el1
    msr ELR_EL2, x0

    /* Return to EL1 */
    eret

setup_el1:
    /* Now at EL1 */
    ldr x0, =_stack_end
    mov sp, x0

    /* Clear BSS */
    ldr x0, =_bss_start
    ldr x1, =_bss_end
    mov x2, xzr
1:  cmp x0, x1
    b.hs 2f                  /* unsigned compare for addresses */
    str x2, [x0], #8
    b 1b
2:
    bl kernel_main
3:
    wfi                      /* if kernel_main returns, sleep forever */
    b 3b

The ERET instruction loads the exception return address from ELR_EL2 and the processor state from SPSR_EL2, and then jumps to the return address at the specified exception level.
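
The constant 0x3C5 written to SPSR_EL2 is easier to trust once it is built from its fields. In the sketch below, the four mask bits D, A, I, and F occupy bits 9 to 6, and the mode field M[3:0] is 0b0101 for EL1h (EL1 using SP_EL1). The macro and function names are illustrative. The second helper repeats the shift used on CurrentEL, whose exception level lives in bits 3:2.

```c
#include <stdint.h>

#define SPSR_MODE_EL1H 0x5u        /* M[3:0]: EL1, using SP_EL1 */
#define SPSR_F (1u << 6)           /* mask FIQ */
#define SPSR_I (1u << 7)           /* mask IRQ */
#define SPSR_A (1u << 8)           /* mask SError */
#define SPSR_D (1u << 9)           /* mask debug exceptions */

/* Value to write to SPSR_EL2 before ERET: EL1h with D, A, I, F masked. */
static uint64_t spsr_el1h_all_masked(void)
{
    return SPSR_D | SPSR_A | SPSR_I | SPSR_F | SPSR_MODE_EL1H;
}

/* Extract the exception level (0 to 3) from a raw CurrentEL value. */
static unsigned current_el(uint64_t currentel_reg)
{
    return (unsigned)((currentel_reg >> 2) & 0x3u);
}
```

Composing the value this way makes it easy to see what would change if, say, we wanted to return to EL1 with IRQs already unmasked.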

7.8 Multicore Considerations

QEMU virt defaults to 1 CPU, but can be configured for more:

qemu-system-aarch64 -M virt -cpu cortex-a72 -smp 4 -nographic -kernel kernel.elf

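On some platforms every core leaves reset together, so all CPUs start executing at the kernel entry point simultaneously. On QEMU virt the secondary CPUs start powered off and run only after the primary requests them through PSCI, but the entry code should cope with either behavior. We need to: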

Identify which CPU is the primary (boot core, CPU 0)
Send secondary CPUs to a spin loop (wait for work)
Let only the primary CPU continue with initialization

_start:
    /* The CPU ID is not passed in a register; read it from MPIDR_EL1. */
    mrs x2, mpidr_el1
    and x2, x2, #0xFF        /* Aff0 (bits 7:0): core number on QEMU virt */
    cbz x2, primary_cpu      /* CPU 0 is primary, proceed */

secondary_cpu:
    /* Secondary CPUs spin here until the kernel wakes them */
    wfe
    b secondary_cpu

primary_cpu:
    ldr x0, =_stack_end
    mov sp, x0
    /* ... clear BSS, call kernel_main ... */

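We will implement a proper multicore wake-up mechanism later in the book when we discuss scheduling: SEV (send event) for cores parked in a WFE loop, and PSCI CPU_ON for platforms such as QEMU virt where secondary cores start powered off.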
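
On AArch64, a core identifies itself through the affinity fields of MPIDR_EL1: Aff0 in bits 7:0, Aff1 in bits 15:8, Aff2 in bits 23:16, and Aff3 in bits 39:32. The sketch below decodes a raw register value. The function names are hypothetical, and treating Aff0 as the core number is an assumption that holds for small QEMU virt configurations but not for every SoC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Affinity fields of MPIDR_EL1: Aff3 [39:32], Aff2..Aff0 [23:0]. */
#define MPIDR_AFF_MASK 0xFF00FFFFFFull

/* Core number within its cluster (Aff0). */
static unsigned mpidr_aff0(uint64_t mpidr)
{
    return (unsigned)(mpidr & 0xFFu);
}

/* The primary core is taken to be the one whose affinity fields are all zero. */
static bool mpidr_is_primary(uint64_t mpidr)
{
    return (mpidr & MPIDR_AFF_MASK) == 0;
}
```

Masking is necessary because bit 31 of the register always reads as one, so comparing the raw value against zero would never identify a primary core.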

7.9 The Boot Flow on QEMU virt

Let us trace the exact boot flow when we run our kernel on QEMU virt:

QEMU starts and creates the virtual machine with the virt platform, including the GIC (interrupt controller), the PL011 UART, and RAM starting at 0x40000000
QEMU generates a device tree describing this machine
QEMU's internal loader reads our kernel.elf, parses the ELF headers, and loads the segments at the addresses specified in the program headers
Because we pass -kernel without -bios, no ROM firmware or ARM Trusted Firmware (TF-A) stage runs; QEMU itself services PSCI (Power State Coordination Interface) calls
QEMU resets CPU 0 with its program counter at our ELF entry point, at EL1 by default
Our _start code executes: set stack, clear BSS, call kernel_main

We can observe this boot flow using QEMU's tracing:

# Trace the boot process
qemu-system-aarch64 -M virt -cpu cortex-a72 -nographic \
    -kernel kernel.elf -d cpu_reset,int -D qemu_trace.log
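
The loader step in the flow above depends on only a few fields of the ELF header. As a rough sketch of what any ELF loader has to check (this is not QEMU's actual code, and elf64_entry is a hypothetical name), the function below validates a 64-bit little-endian AArch64 header and extracts the entry point, which sits at byte offset 24.

```c
#include <stddef.h>
#include <stdint.h>

#define EM_AARCH64_ID 183u   /* e_machine value for AArch64 */

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint64_t le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Return 1 and store the entry point if `image` starts with a 64-bit
 * little-endian AArch64 ELF header; return 0 otherwise. */
static int elf64_entry(const uint8_t *image, size_t len, uint64_t *entry)
{
    if (len < 64) return 0;                      /* ELF64 header is 64 bytes */
    if (image[0] != 0x7f || image[1] != 'E' ||
        image[2] != 'L'  || image[3] != 'F') return 0;
    if (image[4] != 2) return 0;                 /* EI_CLASS: 64-bit */
    if (image[5] != 1) return 0;                 /* EI_DATA: little-endian */
    if (le16(image + 18) != EM_AARCH64_ID) return 0;   /* e_machine */
    *entry = le64(image + 24);                   /* e_entry */
    return 1;
}
```

For our kernel.elf the extracted value should be the address of _start. The same value can be checked on the host with readelf -h kernel.elf.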

7.10 Our Implementation

Let us now write a complete start.S that handles all the boot requirements we have discussed:

start.S

.section .text._start
.global _start

_start:
    /* Save boot parameters */
    mov x20, x0               /* save device tree address */
    mrs x21, mpidr_el1        /* CPU ID comes from MPIDR_EL1, not x1 */
    and x21, x21, #0xFF       /* Aff0 (bits 7:0): core number on QEMU virt */

    /* Check if we need to drop from EL2 to EL1 */
    mrs x0, CurrentEL
    lsr x0, x0, #2
    cmp x0, #2
    b.ne 1f

    /* Drop from EL2 to EL1 */
    mov x0, #(1 << 31)        /* HCR_EL2.RW: EL1 runs in AArch64 state */
    msr HCR_EL2, x0
    mov x0, #0x3C5            /* EL1h, DAIF masked */
    msr SPSR_EL2, x0
    adr x0, 1f
    msr ELR_EL2, x0
    eret

1:
    /* Set stack pointer for EL1 */
    ldr x0, =_stack_end
    mov sp, x0

    /* Clear BSS section */
    ldr x0, =_bss_start
    ldr x1, =_bss_end
    mov x2, xzr
2:  cmp x0, x1
    b.hs 3f                   /* unsigned compare for addresses */
    str x2, [x0], #8
    b 2b
3:
    /* Restore boot parameters and enter C code */
    mov x0, x20               /* device tree address */
    mov x1, x21               /* CPU ID */
    bl kernel_main

    /* If kernel_main returns, halt */
halt:
    wfi
    b halt

This start.S now does five things:

Saves boot parameters from the boot loader
Detects and handles EL2 start (drops to EL1)
Sets up the stack pointer for C code
Zeroes the BSS section
Passes the device tree address and CPU ID to kernel_main

Correspondingly, our kernel_main signature changes to accept these parameters:

#include <stdint.h>

void kernel_main(uint64_t dtb_addr, uint64_t cpu_id) {
    (void)dtb_addr;   /* unused until we add a device tree parser */

    /* Now we know which CPU we are and where the device tree is */
    if (cpu_id == 0) {
        /* Primary CPU does full initialization */
        uart_init();
        uart_puts("Primary CPU booting...\r\n");
    } else {
        /* Secondary CPUs wait */
        while (1) __asm__ volatile("wfe");
    }
}

This is our foundation. In Chapter 9 (Kernel Entry Point), we will refine the startup sequence further with cache enabling, exception vector installation, and early memory initialization.

7.11 Exercises

Exercise 1: Trace the Boot Flow

When you run qemu-system-aarch64 -M virt -nographic -kernel kernel.elf, list every software component that executes between power-on and the first instruction of our _start. Use QEMU documentation and online resources.

Exercise 2: Read CurrentEL

Add code to kernel_main that reads the current exception level and prints it. If it is EL2, print a warning. Build and run to verify our kernel starts at EL1.

Exercise 3: Dump the Device Tree

Use QEMU's -M virt,dumpdtb=qemu-virt.dtb option to extract the device tree binary. Then use dtc -I dtb -O dts qemu-virt.dtb to convert it to human-readable DTS format. Identify the UART, GIC, and memory nodes and note their addresses.

Exercise 4: BSS Bug Hunting

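Remove the BSS-clearing loop from start.S and add a global variable int counter = 0; in kernel.c that increments in kernel_main. Build and run several times, printing the value at the start of kernel_main. On QEMU you will probably see zero every time, because freshly allocated guest RAM is typically zero-filled. Write a short explanation of why real hardware gives no such guarantee and why the kernel must still clear BSS itself.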

Exercise 5: Boot with SMP

Run QEMU with -smp 4 and add code to start.S that prints "CPU X booting" for each CPU that reaches _start. The primary CPU should print the message; secondary CPUs should spin. Count how many messages you see.

Exercise 6: EL2 Detection (Challenge)

Modify QEMU's boot to start our kernel at EL2 instead of EL1. One way is to enable the virtualization extensions with -machine virt,virtualization=on (secure=on would instead start a bare-metal image at EL3); another is a custom EL2 loader. Then verify that our EL2-to-EL1 drop code works correctly. Hint: look at QEMU's -bios option for providing an EL2 loader.

7.12 Summary

In this chapter, we traced the complete boot sequence from power-on to our kernel entry point. The boot chain goes through multiple stages, each running at the same or a higher exception level than the next. The firmware at EL3 initializes the system, the boot loader at EL2 loads our kernel into memory, and our kernel runs at EL1.

We learned about the device tree, which describes hardware to the kernel in a platform-independent way. While we currently hard-code hardware addresses, we will eventually parse the device tree to discover devices dynamically.

We examined the exact CPU state at kernel entry: MMU off, caches off, interrupts masked, stack undefined, BSS uninitialized. Our _start code must handle each of these before it can safely call C code.

Finally, we built a robust start.S that saves boot parameters, drops from EL2 to EL1 if needed, sets up the stack, clears BSS, and calls kernel_main with the device tree address and CPU ID.

In the next chapter, we will look at building a complete boot loader that can load our kernel from disk or over a serial connection.