ARM64 OS Handbook

Chapter 6: C for Kernel Code

What You Will Learn in This Chapter
  • What a freestanding C environment is and how it differs from hosted
  • Why the standard library is not available in a kernel
  • How to use volatile for memory-mapped I/O
  • Essential GCC built-ins and function attributes
  • How linker scripts work and how to write one
  • Kernel-safe C coding patterns
  • How our kernel is structured and built

6.1 Freestanding vs Hosted Environment

C programs normally run in a hosted environment. This means the program runs on top of an operating system, and the C compiler assumes that the standard library (libc) is available. Functions like printf, malloc, memcpy, and strlen are all provided by libc.

A kernel runs in a freestanding environment. There is no operating system beneath it, so there is no standard library. The kernel must provide everything itself, or do without it.

Feature | Hosted (user space) | Freestanding (kernel)
Entry point | main() | Any label (we use _start)
Standard library | Full libc (printf, malloc, etc.) | None (must implement or omit)
Runtime startup | C runtime (crt0) initializes libc | We write our own startup assembly
Memory allocator | malloc/free provided by libc | We must build our own
Floating point | Available | Avoid (no FPU context save in kernel)
Stack size | Large (MB) | Small (we define it, often 4-16 KB)

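We tell GCC that we are building for a freestanding environment with the -ffreestanding flag. This flag does two things: it tells the compiler not to assume that a hosted libc exists (the program need not start at main, and __STDC_HOSTED__ is defined as 0), and it implies -fno-builtin, so a call to a function named memcpy or strlen is no longer assumed to have its standard library meaning. The __builtin_-prefixed forms remain available. Note that GCC may still emit calls to memcpy, memmove, memset, and memcmp on its own, so a freestanding program must provide those four functions.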

aarch64-none-elf-gcc -c -ffreestanding -O2 -Wall -Wextra kernel.c -o kernel.o
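One way to see which environment a translation unit is being compiled for is the standard __STDC_HOSTED__ macro. The sketch below is only an illustration; the function name build_environment is hypothetical and not part of our kernel.

```c
/* __STDC_HOSTED__ is defined by the C standard:
 * 1 for a hosted implementation, 0 for a freestanding one
 * (GCC defines it as 0 when -ffreestanding is given). */
const char *build_environment(void) {
#if __STDC_HOSTED__
    return "hosted";
#else
    return "freestanding";
#endif
}
```

Compiled with the command above, the function would return "freestanding"; compiled as an ordinary user-space program, it would return "hosted".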

6.2 Life Without the Standard Library

Many familiar functions are simply not available. Here is how we deal with the most important ones:

No printf

We cannot use printf because it requires a complex formatting engine and file I/O support. Instead, we write directly to the UART:

void uart_putc(char c) {
    volatile char *uart = (volatile char *)0x09000000;
    *uart = c;
}

void uart_puts(const char *s) {
    while (*s) uart_putc(*s++);
}

If we need formatted output (like printing numbers), we implement the conversion ourselves:

void uart_puthex(uint64_t n) {
    const char *hex = "0123456789abcdef";
    for (int i = 15; i >= 0; i--) {
        uart_putc(hex[(n >> (i * 4)) & 0xf]);
    }
}

void uart_putdec(uint64_t n) {
    char buf[20];
    int i = 0;
    if (n == 0) { uart_putc('0'); return; }
    while (n > 0) { buf[i++] = '0' + (n % 10); n /= 10; }
    while (i > 0) uart_putc(buf[--i]);
}
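The two functions above share the same idea: produce digits least-significant first, then emit them in reverse. A minimal sketch of a generalized version that formats into a caller-supplied buffer for any base from 2 to 16 follows. The name format_u64 and its interface are assumptions made for illustration, not part of our kernel.

```c
#include <stdint.h>

/* Hypothetical helper: format n in the given base (2..16) into out.
 * out must hold at least 65 bytes (64 binary digits plus the NUL).
 * Returns the number of characters written, excluding the NUL. */
int format_u64(uint64_t n, unsigned base, char *out) {
    static const char digits[] = "0123456789abcdef";
    char tmp[64];
    int len = 0;

    if (base < 2 || base > 16) { out[0] = '\0'; return 0; }

    /* Produce digits least-significant first, like uart_putdec does. */
    do {
        tmp[len++] = digits[n % base];
        n /= base;
    } while (n > 0);

    /* Reverse into the caller's buffer and terminate it. */
    for (int i = 0; i < len; i++)
        out[i] = tmp[len - 1 - i];
    out[len] = '\0';
    return len;
}
```

Because the result lands in a buffer, the same helper can feed uart_puts or any other sink, and the do/while loop handles n == 0 without a special case.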

No malloc

A kernel needs dynamic memory allocation, but malloc is part of libc. We build our own memory allocator (Chapters 15-21). In the early stages, we use static allocation or a simple page allocator.

No string functions

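Functions like memcpy, memset, strcmp, and strlen are simple but necessary. The compiler may also emit calls to memcpy and memset on its own (for example, when copying or initializing a struct), so the kernel has to define them. We implement them ourselves: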

void *memcpy(void *dst, const void *src, unsigned long n) {
    for (unsigned long i = 0; i < n; i++)
        ((unsigned char *)dst)[i] = ((const unsigned char *)src)[i];
    return dst;
}

void *memset(void *dst, int c, unsigned long n) {
    for (unsigned long i = 0; i < n; i++)
        ((unsigned char *)dst)[i] = (unsigned char)c;
    return dst;
}

int strcmp(const char *a, const char *b) {
    while (*a && *a == *b) { a++; b++; }
    return (unsigned char)*a - (unsigned char)*b;
}

unsigned long strlen(const char *s) {
    unsigned long len = 0;
    while (*s++) len++;
    return len;
}

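Most of these can be optimized using GCC built-ins, which we will cover in Section 6.5. One caveat: when optimizing, GCC can recognize a byte-by-byte copy or fill loop and replace it with a call to memcpy or memset, which would make our own implementation call itself. If that happens, compiling this file with -fno-tree-loop-distribute-patterns avoids the transformation.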
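String copying is not shown above. An unbounded strcpy is risky in a kernel, so a bounded copy is often preferred. The following is a minimal sketch modeled on the BSD strlcpy convention; the name k_strlcpy is hypothetical and chosen to avoid clashing with any library function.

```c
/* Hypothetical bounded copy: copies at most size-1 characters from src
 * and always NUL-terminates dst when size > 0.
 * Returns the length of src, so callers can detect truncation
 * (truncation happened if the return value is >= size). */
unsigned long k_strlcpy(char *dst, const char *src, unsigned long size) {
    unsigned long i = 0;

    if (size > 0) {
        while (i + 1 < size && src[i] != '\0') {
            dst[i] = src[i];
            i++;
        }
        dst[i] = '\0';
    }
    /* Finish measuring src even if we stopped copying. */
    while (src[i] != '\0')
        i++;
    return i;
}
```

A caller would typically pass sizeof of the destination array as size and compare the return value against it.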

6.3 The Volatile Keyword

When we access hardware registers (memory-mapped I/O), we must use the volatile keyword. volatile tells the compiler: "this value can change at any time, outside the control of the program." The compiler will not optimize away reads or writes to volatile variables.

Consider this code without volatile:

/* UART flag register: bit 5 (TXFF) is set while the transmit FIFO is full */
int uart_tx_ready(void) {
    int *status = (int *)0x09000018;
    return !((*status >> 5) & 1);   /* ready when the FIFO is not full */
}

/* The compiler might call this only once and cache the result: */
while (!uart_tx_ready());  /* compiler may hoist the read outside the loop! */

Without volatile, the compiler might read the status register once and reuse the cached value forever. The loop would then either spin forever or exit immediately, depending on that first read.

With volatile:

int uart_tx_ready(void) {
    volatile int *status = (volatile int *)0x09000018;
    return !((*status >> 5) & 1);   /* bit 5 (TXFF) set means the FIFO is full */
}

/* Now every loop iteration reads from the actual hardware register */
while (!uart_tx_ready());  /* correct */

Rules for volatile usage:

  • Always declare pointers to hardware registers as volatile
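  • Use volatile for memory that hardware can change behind the compiler's back (device registers, DMA descriptors); on its own it does not order accesses, so DMA buffers and memory shared between cores also need barriers
  • Do NOT rely on volatile for synchronization between threads or cores (use atomic operations and memory barriers instead)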
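To make the last rule concrete, here is a minimal sketch of publishing a value from one core to another with GCC's __atomic built-ins instead of volatile. The variable and function names are hypothetical; the pattern shown is a plain release/acquire handoff.

```c
#include <stdint.h>

static uint32_t shared_data;   /* payload written by the producer */
static int data_ready;         /* flag; accessed only through __atomic built-ins */

/* Producer: write the payload, then publish the flag with release ordering
 * so the payload store cannot be reordered after the flag store. */
void publish(uint32_t value) {
    shared_data = value;
    __atomic_store_n(&data_ready, 1, __ATOMIC_RELEASE);
}

/* Consumer: returns 1 and stores the payload in *out if data was published,
 * 0 otherwise. Acquire ordering pairs with the release store above. */
int try_consume(uint32_t *out) {
    if (!__atomic_load_n(&data_ready, __ATOMIC_ACQUIRE))
        return 0;
    *out = shared_data;
    return 1;
}
```

A volatile flag would force the compiler to reload it, but it would not stop the CPU from making the flag visible before the payload. The release/acquire pair is what provides that guarantee.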

6.4 Memory-Mapped I/O in C

On ARM64, all hardware peripherals are accessed through memory-mapped I/O (MMIO). Each device's registers appear at specific physical addresses. Reading or writing those addresses communicates with the device.

/* UART registers for the PL011 device on QEMU virt */
#define UART_BASE    0x09000000
#define UART_DR      (UART_BASE + 0x000)   /* Data register (read/write) */
#define UART_FR      (UART_BASE + 0x018)   /* Flag register (read-only) */
#define UART_IBRD    (UART_BASE + 0x024)   /* Integer baud rate divisor */
#define UART_FBRD    (UART_BASE + 0x028)   /* Fractional baud rate divisor */
#define UART_LCR_H   (UART_BASE + 0x02C)   /* Line control register */
#define UART_CR      (UART_BASE + 0x030)   /* Control register */
#define UART_IMSC    (UART_BASE + 0x038)   /* Interrupt mask set/clear */
#define UART_MIS     (UART_BASE + 0x040)   /* Masked interrupt status */
#define UART_ICR     (UART_BASE + 0x044)   /* Interrupt clear register */

/* UART flags */
#define UART_FR_TXFF (1 << 5)              /* Transmit FIFO full */
#define UART_FR_RXFE (1 << 4)              /* Receive FIFO empty */

/* Helper functions using volatile access */
static inline void uart_write32(unsigned long addr, uint32_t val) {
    volatile uint32_t *reg = (volatile uint32_t *)addr;
    *reg = val;
}

static inline uint32_t uart_read32(unsigned long addr) {
    volatile uint32_t *reg = (volatile uint32_t *)addr;
    return *reg;
}

void uart_putc(char c) {
    /* Wait until the transmit FIFO is not full */
    while (uart_read32(UART_FR) & UART_FR_TXFF);
    uart_write32(UART_DR, c);
}

char uart_getc(void) {
    /* Wait until data is available */
    while (uart_read32(UART_FR) & UART_FR_RXFE);
    return uart_read32(UART_DR) & 0xFF;
}

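Notice that every access goes through volatile pointers. This ensures the compiler emits a load or store instruction for every access and does not cache the value or reorder these accesses relative to one another. volatile constrains only the compiler: ordering as seen by the hardware additionally depends on the memory type the region is mapped with and, where needed, on explicit barrier instructions.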
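Drivers often need to change a few bits of a register while leaving the others alone. The helpers below are a minimal sketch of that read-modify-write pattern; their names are hypothetical and they are not part of the code above. Because they take a pointer, they can be exercised against an ordinary variable as well as a real register.

```c
#include <stdint.h>

/* Set the bits in mask: one volatile load followed by one volatile store. */
static inline void mmio_set_bits(volatile uint32_t *reg, uint32_t mask) {
    *reg = *reg | mask;
}

/* Clear the bits in mask, leaving all other bits unchanged. */
static inline void mmio_clear_bits(volatile uint32_t *reg, uint32_t mask) {
    *reg = *reg & ~mask;
}

/* Return 1 if any bit in mask is set, 0 otherwise. */
static inline int mmio_test_bits(volatile uint32_t *reg, uint32_t mask) {
    return (*reg & mask) != 0;
}
```

For example, mmio_test_bits((volatile uint32_t *)UART_FR, UART_FR_TXFF) would report whether the transmit FIFO is full. Note that the set and clear helpers are not atomic: an interrupt handler touching the same register between the load and the store could have its change overwritten.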

6.5 GCC Built-ins and Attributes

GCC provides special functions and attributes that are essential for kernel programming.

Built-in Functions

/* Memory and string operations (optimized, often single instructions) */
__builtin_memcpy(dst, src, n);    /* optimized memcpy */
__builtin_memset(dst, c, n);      /* optimized memset */
__builtin_memcmp(a, b, n);        /* optimized memcmp */

/* Bit manipulation (clz and ctz results are undefined when n == 0) */
int leading = __builtin_clz(n);    /* count leading zeros (32-bit) */
int trailing = __builtin_ctz(n);   /* count trailing zeros (32-bit) */
int popcount = __builtin_popcount(n);  /* count set bits (32-bit) */

/* For 64-bit values: */
int leading64 = __builtin_clzll(n);   /* count leading zeros (64-bit) */
int trailing64 = __builtin_ctzll(n);  /* count trailing zeros (64-bit) */
int popcount64 = __builtin_popcountll(n); /* count set bits (64-bit) */

/* Branch hints for optimization */
if (__builtin_expect(error, 0)) {   /* tell compiler "error is unlikely" */
    handle_error();
}

/* Compile-time assertions (the C11 _Static_assert keyword, not a GCC built-in;
   the static_assert spelling needs <assert.h> before C23) */
_Static_assert(sizeof(uint64_t) == 8, "uint64_t must be 8 bytes");

/* Unreachable code marker */
__builtin_unreachable();

These built-ins compile to efficient ARM64 instructions. For example, __builtin_ctzll compiles to a two-instruction RBIT + CLZ sequence.
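As an illustration of how these pieces combine, the sketch below finds the first free slot in a 64-entry allocation bitmap. The function name and the likely/unlikely macros are assumptions introduced for this example.

```c
#include <stdint.h>

/* Conventional wrappers around __builtin_expect. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Returns the index (0..63) of the lowest clear bit in bitmap,
 * or -1 if every bit is set. */
int first_free_slot(uint64_t bitmap) {
    uint64_t free_bits = ~bitmap;

    /* __builtin_ctzll(0) is undefined, so handle the full case first. */
    if (unlikely(free_bits == 0))
        return -1;
    return __builtin_ctzll(free_bits);
}
```

Inverting the bitmap turns "lowest clear bit" into "lowest set bit", which is exactly what count-trailing-zeros reports.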

Function Attributes

/* Prevent the function from being inlined */
static __attribute__((noinline)) void slow_path(void);

/* Force a function to always be inlined */
static inline __attribute__((always_inline)) void fast_path(void);

/* Place a function in a specific section */
__attribute__((section(".text.init"))) void early_start(void);

/* Aligned data */
uint64_t page_table[512] __attribute__((aligned(4096)));

/* Packed struct (no padding between fields) */
struct __attribute__((packed)) device_id {
    uint16_t vendor;
    uint16_t device;
    uint32_t revision;
};

/* Used to prevent compiler warnings about unused parameters */
void func(int x __attribute__((unused))) { }

/* Mark a function as pure (no side effects; result depends only on its arguments and the memory it reads) */
int hash(const char *s) __attribute__((pure));
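The effect of packed and aligned can be checked at compile time and at run time. The sketch below uses hypothetical struct and variable names; the padded size of 8 assumes an ABI where uint32_t is 4-byte aligned, which holds on AArch64. _Static_assert is the C11 keyword for compile-time checks.

```c
#include <stdint.h>

/* Without packed, the compiler pads after 'type' so 'length' is 4-byte aligned. */
struct header_padded {
    uint8_t  type;
    uint32_t length;
};

/* With packed, the fields are laid out back to back. */
struct __attribute__((packed)) header_packed {
    uint8_t  type;
    uint32_t length;
};

_Static_assert(sizeof(struct header_padded) == 8, "expected 3 bytes of padding");
_Static_assert(sizeof(struct header_packed) == 5, "packed struct should be 5 bytes");

/* A 4 KB-aligned table, as a page table would need. */
static uint64_t example_table[512] __attribute__((aligned(4096)));

/* Returns 1 if the table really starts on a 4 KB boundary. */
int example_table_is_aligned(void) {
    return ((uintptr_t)example_table & 0xFFF) == 0;
}
```

Packed structs suit data whose layout is fixed externally. The cost is that fields may be misaligned, so the compiler can no longer assume natural alignment when accessing them.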

Alignment Checks

The offsetof macro and struct alignment are well-defined in C, but in kernel code we often need to be explicit:

#include <stddef.h>   /* for offsetof */

/* Ensure a struct has the expected size (used for hardware register maps) */
_Static_assert(sizeof(struct uart_regs) == 0x48, "UART struct size mismatch");

/* Ensure a field is at the right offset */
_Static_assert(offsetof(struct uart_regs, cr) == 0x30, "UART CR offset mismatch");
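The checks above refer to a struct uart_regs that this section does not define. One possible definition, reconstructed from the register offsets listed in Section 6.4, is sketched here. Apart from cr, the field names are assumptions, and offsets not used in this chapter are covered by padding members.

```c
#include <stddef.h>
#include <stdint.h>

/* PL011 register block laid out to match the offsets from Section 6.4. */
struct uart_regs {
    volatile uint32_t dr;           /* 0x000 data register */
    volatile uint32_t unused0[5];   /* 0x004 - 0x017 not used in this chapter */
    volatile uint32_t fr;           /* 0x018 flag register */
    volatile uint32_t unused1[2];   /* 0x01C - 0x023 */
    volatile uint32_t ibrd;         /* 0x024 integer baud rate divisor */
    volatile uint32_t fbrd;         /* 0x028 fractional baud rate divisor */
    volatile uint32_t lcr_h;        /* 0x02C line control */
    volatile uint32_t cr;           /* 0x030 control */
    volatile uint32_t unused2;      /* 0x034 */
    volatile uint32_t imsc;         /* 0x038 interrupt mask set/clear */
    volatile uint32_t unused3;      /* 0x03C */
    volatile uint32_t mis;          /* 0x040 masked interrupt status */
    volatile uint32_t icr;          /* 0x044 interrupt clear */
};

/* _Static_assert is the C11 keyword form of a compile-time assertion. */
_Static_assert(sizeof(struct uart_regs) == 0x48, "UART struct size mismatch");
_Static_assert(offsetof(struct uart_regs, fr) == 0x18, "UART FR offset mismatch");
_Static_assert(offsetof(struct uart_regs, cr) == 0x30, "UART CR offset mismatch");
```

With such a struct, a driver could write ((struct uart_regs *)0x09000000)->cr instead of computing each register address by hand, and the assertions would catch a miscounted padding member at build time.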

6.6 The Linker Script in Detail

The linker script tells the linker where to place each section of our kernel in memory. Without it, the linker would use default layout rules that assume a user-space program.

Our linker script (kernel.ld):

ENTRY(_start)                  /* Entry point symbol */

SECTIONS
{
    . = 0x40000000;            /* Start address (QEMU virt load address) */

    .text : {
        *(.text._start)        /* Entry point must come first */
        *(.text*)              /* All other code */
    }

    .rodata : {
        *(.rodata*)            /* Read-only data (strings, constants) */
    }

    .data : {
        *(.data*)              /* Read-write data (global variables) */
    }

    .bss : {
        _bss_start = .;        /* Mark the beginning of BSS */
        *(.bss*)               /* Zero-initialized data */
        *(COMMON)              /* Common symbols (also zero-initialized) */
        _bss_end = .;          /* Mark the end of BSS */
    }

    . = ALIGN(16);             /* Align stack to 16 bytes */
    .stack : {
        _stack_start = .;      /* Bottom of the stack area */
        . += 4K;               /* Reserve 4 KB for stack */
        _stack_end = .;        /* Top of the stack area */
    }
}

Key concepts:

  • ENTRY(_start): tells the linker where execution begins
  • . = 0x40000000: sets the location counter (the address given to whatever is placed next) to our load address
  • .text : { *(.text*) }: collect all .text sections from all object files and place them here
  • _bss_start and _bss_end: symbols that our startup code uses to zero-initialize the BSS section
  • .stack: reserves stack space directly in the binary. This is a simple approach for early boot; a real kernel allocates stack per-process later.
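The _bss_start and _bss_end symbols are visible from C as well as from assembly. If the BSS were cleared from C rather than in the startup assembly, it might look like the sketch below. The function name zero_range is hypothetical, and it takes its bounds as parameters so it can be tried on any buffer.

```c
/* Zero every byte in [start, end). The volatile pointer keeps the compiler
 * from replacing the loop with a call to memset, which may not be usable
 * this early in boot. */
void zero_range(char *start, char *end) {
    for (volatile char *p = start; p < (volatile char *)end; p++)
        *p = 0;
}

/* In the kernel, the bounds would come from the linker script:
 *
 *     extern char _bss_start[], _bss_end[];
 *     zero_range(_bss_start, _bss_end);
 *
 * Declaring the symbols as arrays makes their names evaluate to the
 * addresses the linker assigned, which is what we want here.
 */
```

Linker-defined symbols have an address but no storage of their own, so only their addresses are meaningful. Reading _bss_start as if it were an ordinary variable would fetch whatever bytes happen to sit at the start of BSS.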

Using the linker script, the build process looks like this:

# Step 1: Compile C files to object files
aarch64-none-elf-gcc -c -ffreestanding -O2 -Wall -Wextra kernel.c -o kernel.o

# Step 2: Assemble .S files to object files
aarch64-none-elf-as start.S -o start.o

# Step 3: Link everything using the linker script
aarch64-none-elf-ld -T kernel.ld start.o kernel.o -o kernel.elf

# Step 4: Check the output
aarch64-none-elf-objdump -d kernel.elf    # disassemble
aarch64-none-elf-size kernel.elf          # section sizes

6.7 Kernel-Safe C Patterns

Avoid Floating Point

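Do not use float or double in kernel code. The floating-point unit (FPU) registers are not saved and restored during context switches by default, so using them would corrupt the state of whatever code was interrupted. Even when the kernel does save them, every context switch becomes slower because the 32 128-bit SIMD/FP registers must be saved and restored as well. Avoiding float and double in the source is not always enough: GCC may use these registers on its own, for example to copy structures, so AArch64 kernel code is commonly compiled with -mgeneral-regs-only.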
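Calculations that look like they need fractions can usually be done with scaled integers. As a sketch, here is the PL011 baud-rate divisor (the values 13 and 1 written to IBRD and FBRD later in this chapter) computed without floating point. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* PL011 baud divisor = clock / (16 * baud), stored as an integer part (IBRD)
 * and a 6-bit fraction in units of 1/64 (FBRD).
 * Scaling by 64 first keeps everything in integers:
 *   divisor * 64 = clock * 4 / baud   (rounded to nearest). */
struct baud_divisor {
    uint32_t ibrd;
    uint32_t fbrd;
};

struct baud_divisor compute_baud_divisor(uint32_t clock_hz, uint32_t baud) {
    /* Adding baud / 2 before dividing rounds to the nearest 1/64. */
    uint64_t scaled = ((uint64_t)clock_hz * 4 + baud / 2) / baud;
    struct baud_divisor d = { (uint32_t)(scaled >> 6), (uint32_t)(scaled & 0x3F) };
    return d;
}
```

For a 24 MHz clock and 115200 baud, scaled is 833, which splits into an integer part of 13 and a fraction of 1.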

Avoid Dynamic Allocation in Early Boot

During early boot (before our memory allocator is ready), we cannot dynamically allocate memory. Use statically allocated structures:

/* Statically allocated page table (4 KB aligned) */
static uint64_t page_tables[3][512] __attribute__((aligned(4096)));

/* Fixed-size buffer for kernel messages */
static char kernel_buffer[256];
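When early boot code needs a few variable-sized blocks before the real allocator exists, a bump allocator over a static pool is a common stopgap. The following is a minimal sketch under that assumption; the names early_pool and early_alloc and the 4 KB pool size are illustrative choices, not part of our kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define EARLY_POOL_SIZE 4096

static uint8_t early_pool[EARLY_POOL_SIZE] __attribute__((aligned(16)));
static size_t  early_used;   /* bytes handed out so far */

/* Returns a 16-byte-aligned block of at least 'size' bytes,
 * or NULL when the pool is exhausted. Memory is never freed. */
void *early_alloc(size_t size) {
    size_t rounded = (size + 15) & ~(size_t)15;

    /* The first test catches wrap-around when size is close to SIZE_MAX. */
    if (rounded < size || rounded > EARLY_POOL_SIZE - early_used)
        return NULL;

    void *block = &early_pool[early_used];
    early_used += rounded;
    return block;
}
```

The pool lives in BSS, so it costs nothing in the binary image. Since nothing is ever freed, this is only suitable for allocations that last for the lifetime of the kernel.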

Use Fixed-Width Integer Types

Always use exact-width types from <stdint.h> when interacting with hardware. The size of int and long can vary between platforms:

#include <stdint.h>

uint8_t  byte;     /* exactly 8 bits, unsigned */
uint16_t half;     /* exactly 16 bits, unsigned */
uint32_t word;     /* exactly 32 bits, unsigned */
uint64_t dword;    /* exactly 64 bits, unsigned */

/* For memory addresses, use uint64_t or uintptr_t */
uint64_t phys_addr = 0x40000000;
uintptr_t virt_addr = (uintptr_t)some_pointer;
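Address arithmetic is where these types earn their keep. A typical example is rounding an address to a page boundary, sketched below with uintptr_t. The helper names and the PAGE_SIZE macro are assumptions for this example.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Round an address down to a multiple of align (align must be a power of two). */
static inline uintptr_t align_down(uintptr_t addr, uintptr_t align) {
    return addr & ~(align - 1);
}

/* Round an address up to a multiple of align (align must be a power of two). */
static inline uintptr_t align_up(uintptr_t addr, uintptr_t align) {
    return (addr + align - 1) & ~(align - 1);
}
```

Doing this arithmetic on an integer type rather than on a pointer keeps the bit operations well defined, and the exact width makes it clear how many address bits are involved.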

Use const for Read-Only Data

Read-only data goes into the .rodata section, which can be protected from writes by the MMU:

/* Goes to .rodata, not .data */
const char kernel_version[] = "0.0.1";

/* String literals are already in .rodata */
uart_puts("Hello from kernel");

Check for NULL Pointers

In user space, dereferencing NULL causes a segmentation fault. In kernel space, it can silently corrupt memory or crash the entire system. Always check pointers:

void handle_buffer(struct buffer *buf) {
    if (!buf) return;
    /* safe to use buf */
}

/* For MMIO addresses, use macros that never produce NULL */
#define UART_DR ((volatile uint32_t *)0x09000000)

Use Static and Inline Judiciously

/* Hide internal functions from external linkage */
static void internal_helper(void) { /* ... */ }

/* Small frequently-called functions can be inlined */
static inline uint64_t read_current_el(void) {
    uint64_t el;
    __asm__("mrs %0, CurrentEL" : "=r" (el));
    return (el >> 2) & 3;
}

6.8 Our Build System

As our kernel grows, we need a proper build system. We use a Makefile:

CROSS = aarch64-none-elf-
CC    = $(CROSS)gcc
AS    = $(CROSS)as
LD    = $(CROSS)ld
OBJDUMP = $(CROSS)objdump
SIZE  = $(CROSS)size

CFLAGS   = -ffreestanding -O2 -Wall -Wextra -nostdlib -nostartfiles
LDFLAGS  = -T kernel.ld

OBJS = start.o kernel.o

all: kernel.elf

start.o: start.S
	$(AS) $< -o $@

kernel.o: kernel.c
	$(CC) $(CFLAGS) -c $< -o $@

kernel.elf: $(OBJS)
	$(LD) $(LDFLAGS) $^ -o $@
	$(OBJDUMP) -d $@ > kernel.dis
	$(SIZE) $@

run: kernel.elf
	qemu-system-aarch64 -M virt -cpu cortex-a72 -nographic -kernel $<

debug: kernel.elf
	qemu-system-aarch64 -M virt -cpu cortex-a72 -nographic -kernel $< -s -S

clean:
	rm -f *.o *.elf *.dis

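The -nostdlib and -nostartfiles flags are link-time options: they tell gcc, when it drives the link, not to add the C standard library or the default startup files. This Makefile links with $(LD) directly, which never adds either, so the flags have no effect on the compile step and are kept only as a safeguard in case the link is later done through $(CC). Combined with -ffreestanding, we get a truly bare-metal binary.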

6.9 Our Implementation

Let us see how all these concepts come together in our actual kernel code.

Complete kernel.c

Here is the kernel.c file we have been building toward. It demonstrates all the patterns from this chapter:

kernel.c
#include <stdint.h>
#include <stddef.h>

/* Memory-mapped UART (PL011) on QEMU virt */
#define UART_BASE 0x09000000

static volatile uint32_t * const uart_dr    = (volatile uint32_t *)(UART_BASE + 0x000);
static volatile uint32_t * const uart_fr    = (volatile uint32_t *)(UART_BASE + 0x018);
static volatile uint32_t * const uart_cr    = (volatile uint32_t *)(UART_BASE + 0x030);
static volatile uint32_t * const uart_ibrd  = (volatile uint32_t *)(UART_BASE + 0x024);
static volatile uint32_t * const uart_fbrd  = (volatile uint32_t *)(UART_BASE + 0x028);
static volatile uint32_t * const uart_lcr_h = (volatile uint32_t *)(UART_BASE + 0x02C);
static volatile uint32_t * const uart_imsc  = (volatile uint32_t *)(UART_BASE + 0x038);
static volatile uint32_t * const uart_icr   = (volatile uint32_t *)(UART_BASE + 0x044);

#define UART_FR_TXFF (1 << 5)
#define UART_FR_RXFE (1 << 4)

static void uart_init(void) {
    /* Disable UART while configuring */
    *uart_cr = 0;

    /* Set baud rate: 115200 (assuming 24 MHz UART clock) */
    *uart_ibrd = 13;
    *uart_fbrd = 1;

    /* 8-bit, no parity, 1 stop bit, enable FIFOs */
    *uart_lcr_h = (0x3 << 5) | (1 << 4);

    /* Enable UART, transmit, and receive */
    *uart_cr = (1 << 0) | (1 << 8) | (1 << 9);

    /* Clear any pending interrupts */
    *uart_icr = 0x7FF;
}

static void uart_putc(char c) {
    while (*uart_fr & UART_FR_TXFF);
    *uart_dr = c;
}

static void uart_puts(const char *s) {
    while (*s) uart_putc(*s++);
}

static void uart_puthex(uint64_t n) {
    const char *hex = "0123456789abcdef";
    uart_puts("0x");
    for (int i = 15; i >= 0; i--)
        uart_putc(hex[(n >> (i * 4)) & 0xf]);
}

static void uart_putdec(uint64_t n) {
    char buf[20];
    int i = 0;
    if (n == 0) { uart_putc('0'); return; }
    while (n > 0) { buf[i++] = '0' + (n % 10); n /= 10; }
    while (i > 0) uart_putc(buf[--i]);
}

/* Exception level string for debugging */
static const char *el_name(uint64_t el) {
    switch (el) {
        case 0: return "EL0";
        case 1: return "EL1";
        case 2: return "EL2";
        case 3: return "EL3";
        default: return "UNKNOWN";
    }
}

void kernel_main(void) {
    uint64_t current_el;

    /* Read current exception level using inline assembly */
    __asm__("mrs %0, CurrentEL" : "=r" (current_el));
    current_el = (current_el >> 2) & 3;

    /* Initialize UART hardware */
    uart_init();

    /* Print boot message */
    uart_puts("\r\n");
    uart_puts("========================================\r\n");
    uart_puts("  ARM64 OS Kernel Boot\r\n");
    uart_puts("========================================\r\n");
    uart_puts("\r\n");

    uart_puts("Exception level: ");
    uart_puts(el_name(current_el));
    uart_puts("\r\n");

    uart_puts("Kernel loaded at: ");
    uart_puthex(0x40000000);
    uart_puts("\r\n");

    uart_puts("UART at: ");
    uart_puthex(UART_BASE);
    uart_puts("\r\n");

    uart_puts("\r\nCounting to 10: ");
    for (int i = 1; i <= 10; i++) {
        uart_putdec(i);
        uart_putc(' ');
    }
    uart_puts("\r\n");

    uart_puts("\r\nKernel initialization complete.\r\n");
    uart_puts("Halting.\r\n");

    /* Halt */
    while (1) {
        __asm__("wfi");
    }
}

This kernel demonstrates:

  • Freestanding C with no standard library
  • volatile MMIO access for UART registers
  • Static helper functions (not visible outside this file)
  • Inline assembly for reading system registers
  • Fixed-width integer types from <stdint.h>
  • Explicit string and number formatting (no printf)
  • The while(1) { wfi; } idle pattern

Building and Running

aarch64-none-elf-gcc -c -ffreestanding -O2 -Wall -Wextra kernel.c -o kernel.o
aarch64-none-elf-as start.S -o start.o
aarch64-none-elf-ld -T kernel.ld start.o kernel.o -o kernel.elf
qemu-system-aarch64 -M virt -cpu cortex-a72 -nographic -kernel kernel.elf

Expected output:

========================================
  ARM64 OS Kernel Boot
========================================

Exception level: EL1
Kernel loaded at: 0x0000000040000000
UART at: 0x0000000009000000

Counting to 10: 1 2 3 4 5 6 7 8 9 10

Kernel initialization complete.
Halting.

6.10 Exercises

Exercise 1: Implement memmove

Write a memmove function that handles overlapping memory regions correctly (unlike memcpy which does not). Test it by moving a string within the same buffer so that source and destination overlap.

Exercise 2: Hexadecimal Dump

Write a function hexdump(void *addr, unsigned long len) that prints a hex dump of memory in this format:

0x40000000:  48 65 6C 6C 6F 00 00 00  00 00 00 00 00 00 00 00  Hello...........

Show the address, 16 hex bytes (two groups of 8), and the ASCII representation. Non-printable characters should be shown as dots.

Exercise 3: UART with Newline Conversion

Modify uart_puts so that when it encounters a newline character ('\n'), it also sends a carriage return ('\r'). This prevents the "staircase" effect when outputting to a terminal.

Exercise 4: Read a System Register

Use inline assembly to read the MIDR_EL1 register (Main ID Register), which identifies the CPU. Print the value as a hex number. The register provides the implementer, architecture, and part number of the CPU.

Exercise 5: Simple Stack Check

Add a stack_canary pattern to your kernel. Place a known value at the bottom of the stack (in the BSS area before the stack space). Periodically check if the value has changed. If it has, it means the stack has overflowed into adjacent memory. Print a warning message.

Exercise 6: Memory Test (Challenge)

Write a function that tests a 1 MB region of RAM by writing a pattern to each 64-bit word (0xAAAAAAAAAAAAAAAA, then 0x5555555555555555, then 0x00000000FFFFFFFF) and reading it back. Report any addresses where the read value does not match. This is a simple memory integrity test. Be careful not to test the memory where your kernel code, data, or stack live; a region that starts well above _stack_end is a safe choice.

6.11 Summary

In this chapter, we learned how C programming for a kernel differs from writing user-space applications.

The freestanding environment means no standard library, no C runtime startup, and no assumptions about the environment. We must implement everything ourselves or use GCC built-ins.

We covered the volatile keyword, which is essential for MMIO access. Every hardware register access must go through a volatile pointer to prevent the compiler from optimizing away critical reads and writes.

We examined GCC built-ins and function attributes that are particularly useful for kernel code: __builtin_ctz for bit manipulation, __attribute__((aligned)) for page-aligned data, and __attribute__((section)) for placing code in specific locations.

We dissected the linker script and saw how it controls memory layout. We now understand every line of kernel.ld and how it determines where our code, data, BSS, and stack live in memory.

Finally, we built a complete kernel.c that initializes the UART, reads the exception level, and prints diagnostic information. This kernel is the foundation we will extend throughout the rest of the book.

In the next chapter, we will look at the boot process in detail: what happens from the moment power is applied until our _start code begins executing.