ARM64 OS Handbook

Chapter 20: TLB

What You Will Learn in This Chapter
  • What the Translation Lookaside Buffer (TLB) is and why it exists
  • How the TLB caches page table entries for fast translation
  • TLB structure: L1 (micro-TLB) and L2 (main TLB) on ARM64
  • When the TLB must be invalidated
  • ARM64 TLB maintenance instructions
  • Our TLB management strategy

20.1 What is the TLB?

The Translation Lookaside Buffer (TLB) is a hardware cache inside the CPU that stores recently used page table entries. With 4 KB pages and 48-bit virtual addresses, translating a virtual address requires walking up to 4 levels of page tables (up to 4 memory reads). Doing this walk on every memory access would be slow and consume significant memory bandwidth. The TLB avoids this by caching the results of previous translations.

When the CPU needs to translate a virtual address, it first checks the TLB. If the translation is found (a TLB hit), the physical address is available in a few CPU cycles. If not (a TLB miss), the hardware walks the page tables and loads the translation into the TLB.
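The hit/miss flow described above can be illustrated with a small software model. This is only a sketch for building intuition, not how the hardware is implemented: the 4-entry size, the round-robin replacement, and the `sim_walk_page_tables` stand-in (which simply adds a fixed offset to the page number) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_TLB_ENTRIES 4   /* tiny on purpose, so evictions are easy to observe */
#define SIM_PAGE_SHIFT  12  /* 4 KB pages */

struct sim_tlb_entry {
    bool     valid;
    uint64_t vpn;           /* virtual page number (VA >> 12) */
    uint64_t pfn;           /* physical frame number (PA >> 12) */
};

struct sim_tlb {
    struct sim_tlb_entry entries[SIM_TLB_ENTRIES];
    size_t   next_victim;   /* round-robin replacement index */
    unsigned hits;
    unsigned misses;
};

/* Hypothetical stand-in for the hardware page table walk. */
static uint64_t sim_walk_page_tables(uint64_t vpn) {
    return vpn + 0x100;
}

/* Translate a VA: check the TLB first, walk the tables only on a miss. */
uint64_t sim_translate(struct sim_tlb *tlb, uint64_t va) {
    assert(tlb != NULL);
    uint64_t vpn    = va >> SIM_PAGE_SHIFT;
    uint64_t offset = va & ((1ULL << SIM_PAGE_SHIFT) - 1);

    for (size_t i = 0; i < SIM_TLB_ENTRIES; i++) {
        if (tlb->entries[i].valid && tlb->entries[i].vpn == vpn) {
            tlb->hits++;                          /* TLB hit */
            return (tlb->entries[i].pfn << SIM_PAGE_SHIFT) | offset;
        }
    }

    tlb->misses++;                                /* TLB miss: walk and refill */
    uint64_t pfn = sim_walk_page_tables(vpn);
    struct sim_tlb_entry *victim = &tlb->entries[tlb->next_victim];
    victim->valid = true;
    victim->vpn   = vpn;
    victim->pfn   = pfn;
    tlb->next_victim = (tlb->next_victim + 1) % SIM_TLB_ENTRIES;
    return (pfn << SIM_PAGE_SHIFT) | offset;
}
```

Starting from a zero-initialized `struct sim_tlb`, the first access to a page counts as a miss and every later access to the same page counts as a hit, until the entry is evicted by accesses to more than four other pages.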

20.2 TLB Structure on ARM64

ARM64 CPUs typically implement two levels of TLB:

  • L1 (micro-TLB): small (8-64 entries), very fast, per execution pipeline
  • L2 (main TLB): larger (128-4096 entries), slower, shared between pipelines

Modern Cortex-A cores have separate instruction TLBs (for instruction fetches) and data TLBs (for load/store operations). Some designs also have a unified L2 TLB. Each TLB entry contains:

  • Virtual address tag
  • Physical address
  • Page size (4 KB, 2 MB, or 1 GB)
  • Memory attributes (cache policy, permissions)
  • ASID (Address Space ID, see Section 20.3)

20.3 ASIDs: Avoiding TLB Flushes on Context Switch

Without ASIDs, every context switch would require a complete TLB flush, because the old process's translations are cached and would be incorrect for the new process. ASIDs (Address Space IDs) allow the TLB to distinguish between translations of different processes. Each non-global TLB entry is tagged with an ASID. When the MMU translates an address, it matches only entries tagged with the current ASID (taken from TTBR0_EL1) or entries marked global (nG = 0 in the page table descriptor), which match regardless of ASID.

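#include <stdint.h>

/* TTBR0_EL1 format: bits [63:48] = ASID, bits [47:1] = table address.
 * `table` must hold the physical address of the top-level page table. */
void set_ttbr0(uint64_t *table, uint16_t asid) {
    uint64_t ttbr0 = ((uint64_t)asid << 48) | (uint64_t)table;
    /* The "memory" clobber stops the compiler from moving memory
     * accesses across the address space switch. */
    asm volatile("msr ttbr0_el1, %0; isb" : : "r"(ttbr0) : "memory");
}

/* Invalidate every entry tagged with one ASID on this core. Our kernel
 * calls this for the outgoing process on context switch; it is also
 * required whenever an ASID is reassigned to a different address space. */
void tlb_invalidate_asid(uint16_t asid) {
    asm volatile("tlbi aside1, %0; dsb ish; isb"
                 : : "r"((uint64_t)asid << 48) : "memory");
}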
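The matching rule behind ASID tagging can be written down as a small predicate. This is a conceptual sketch only: the `tagged_entry` structure and `entry_matches` function are hypothetical names for illustration, and real TLBs perform this comparison in hardware. The `global` flag models mappings whose descriptor has nG = 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the tag portion of a TLB entry. */
struct tagged_entry {
    uint64_t vpn;     /* virtual page number */
    uint16_t asid;    /* address space the entry belongs to */
    bool     global;  /* true for nG = 0 mappings: match under any ASID */
};

/* An entry hits only if the page matches AND the entry is either
 * global or tagged with the ASID that is currently active. */
bool entry_matches(const struct tagged_entry *e, uint64_t vpn,
                   uint16_t current_asid) {
    assert(e != NULL);
    if (e->vpn != vpn) {
        return false;
    }
    return e->global || e->asid == current_asid;
}
```

With this rule, two processes can map the same virtual page to different physical pages and both translations can stay cached at once: an entry tagged with ASID 1 is simply ignored while ASID 2 is active.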

20.4 TLB Maintenance Instructions

ARM64 provides the TLBI (Translation Lookaside Buffer Invalidate) instruction family:

Instruction         Effect
TLBI VMALLE1        Invalidate all EL1 TLB entries on the executing core (both TTBR0 and TTBR1)
TLBI VMALLE1IS      Same, but broadcast to the inner-shareable domain (SMP safe)
TLBI ASIDE1, x0     Invalidate all entries for the ASID given in x0[63:48]
TLBI VAAE1, x0      Invalidate a specific VA for all ASIDs
TLBI VAE1, x0       Invalidate a specific VA for the ASID given in x0[63:48] (global entries for that VA are also invalidated)
TLBI IPAS2E1, x0    Invalidate stage 2 entries by IPA (for virtualization)

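Every TLB invalidation must be followed by a data synchronization barrier (DSB ISH) to ensure the invalidation has completed before subsequent memory accesses, and then by an instruction synchronization barrier (ISB) so that the following instructions execute with the updated translations. When the invalidation follows a page table update, a DSB (for example DSB ISHST) is also needed before the TLBI so that the new entry is visible to the hardware table walker.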
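The by-VA and by-ASID instructions in this section take their argument in a register with a fixed layout: the ASID sits in bits [63:48] and the page number VA[55:12] sits in bits [43:0], so the address is shifted right by 12 rather than masked. The helpers below are a sketch of that encoding in portable C; the function names are made up for this example, and bits [47:44] are simply left as zero.

```c
#include <stdint.h>

/* Mask for the 44-bit page number field, bits [43:0]. */
#define TLBI_VA_FIELD_MASK ((1ULL << 44) - 1)

/* Operand for TLBI ASIDE1: only the ASID field is used. */
static inline uint64_t tlbi_asid_operand(uint16_t asid) {
    return (uint64_t)asid << 48;
}

/* Operand for TLBI VAAE1: VA[55:12] in bits [43:0], ASID ignored. */
static inline uint64_t tlbi_va_operand(uint64_t va) {
    return (va >> 12) & TLBI_VA_FIELD_MASK;
}

/* Operand for TLBI VAE1: ASID in [63:48] plus VA[55:12] in [43:0]. */
static inline uint64_t tlbi_va_asid_operand(uint64_t va, uint16_t asid) {
    return ((uint64_t)asid << 48) | ((va >> 12) & TLBI_VA_FIELD_MASK);
}
```

For example, `tlbi_va_asid_operand(0x40001000, 5)` yields `0x0005000000040001`: ASID 5 in the top 16 bits and page number `0x40001` in the low bits. Keeping the encoding in separate helpers makes it easy to unit-test on a host machine, away from the privileged instruction itself.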

20.5 When Must the TLB be Invalidated?

The TLB caches translations from page tables. Any time a page table entry changes, the corresponding TLB entries must be invalidated. Key scenarios:

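  1. Mapping a new page: invalidate the specific VA (or the whole TLB); this is strictly required only when a valid entry is replaced, because ARM64 TLBs do not cache entries that generate translation faults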
  2. Unmapping a page: invalidate the specific VA
  3. Changing permissions: invalidate the specific VA
  4. Context switch (no ASIDs): invalidate all user entries (TLBI VMALLE1 or ASIDE1)
  5. Context switch (with ASIDs): no invalidation needed; just set TTBR0 with new ASID
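  6. Kernel mapping change: invalidate all kernel entries (rare, happens at boot)

/* TLB management functions. Each one starts with DSB ISHST so that the
 * preceding page table write is visible to the table walker, and carries
 * a "memory" clobber so the compiler does not reorder around it. */
void tlb_flush_all(void) {
    asm volatile("dsb ishst; tlbi vmalle1; dsb ish; isb" : : : "memory");
}

/* Flush one VA for an explicit ASID. TLBI VAE1 honors the ASID field;
 * TLBI VAAE1 would ignore it and flush the VA for all ASIDs. */
void tlb_flush_va_asid(uint64_t va, uint16_t asid) {
    /* Operand: bits [63:48] = ASID, bits [43:0] = VA[55:12] */
    uint64_t operand = ((uint64_t)asid << 48) | ((va >> 12) & 0xFFFFFFFFFFFULL);
    asm volatile("dsb ishst; tlbi vae1, %0; dsb ish; isb"
                 : : "r"(operand) : "memory");
}

/* Flush one VA for the current ASID, read back from TTBR0_EL1
 * (assumes TCR_EL1.A1 = 0, i.e. TTBR0_EL1 supplies the ASID). */
void tlb_flush_va(uint64_t va) {
    uint64_t ttbr0;
    asm volatile("mrs %0, ttbr0_el1" : "=r"(ttbr0));
    tlb_flush_va_asid(va, (uint16_t)(ttbr0 >> 48));
}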

20.6 TLB Performance and TLB Pressure

TLB misses are expensive. A miss requires a hardware page table walk (up to 4 memory reads) and stalls the pipeline until the translation is complete. Strategies to reduce TLB pressure:

  • Huge pages: 2 MB or 1 GB block mappings cover more memory with fewer TLB entries
  • TLB prefetching: some CPUs prefetch adjacent TLB entries
  • Page coloring: align pages to reduce TLB conflicts (software optimization)
  • ASIDs: avoid flushing on every context switch
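The benefit of huge pages in the list above comes down to simple arithmetic, sketched here in C. The helper names are made up for this example, and the calculation assumes one TLB entry per mapping and ignores alignment constraints.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/* TLB entries needed to cover region_bytes with mappings of map_size
 * bytes each, rounded up to a whole number of mappings. */
uint64_t tlb_entries_needed(uint64_t region_bytes, uint64_t map_size) {
    assert(map_size != 0);
    return (region_bytes + map_size - 1) / map_size;
}

/* TLB reach: how much memory `entries` TLB entries can cover at once. */
uint64_t tlb_reach(uint64_t entries, uint64_t map_size) {
    return entries * map_size;
}
```

Covering a 1 GB region takes 262,144 entries with 4 KB pages, 512 entries with 2 MB blocks, and a single entry with a 1 GB block. Since even a large main TLB holds only a few thousand entries, the 4 KB case cannot stay resident, while the block mappings fit easily.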

20.7 Our Implementation

Our kernel's TLB management is simple but correct:

  • During boot, we flush the entire TLB after enabling the MMU
  • After each map_page or unmap_page call, we flush the specific VA using TLBI VAE1
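  • During context switch, we flush the old process's entries using TLBI ASIDE1 (we use ASIDs); this is conservative, since Section 20.5 notes that no invalidation is needed as long as each process keeps its own ASID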
  • No TLB flush is needed when switching between kernel threads (same address space)
  • We use DSB ISH after every invalidation (inner-shareable domain for SMP)

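Future optimization: implement lazy TLB invalidation (defer flushes until the page is actually accessed by another process) and make invalidation SMP-safe, either by switching to the broadcast variants of the TLBI instructions (such as TLBI VAE1IS) or by using a TLB shootdown protocol (sending IPIs to other cores so that they perform local TLB flushes).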

20.8 Exercises

Exercise 1: TLB Size Estimation

Assume a TLB has 64 entries and each entry covers 4 KB. What is the maximum memory that can be covered without TLB misses? Answer: 64 × 4 KB = 256 KB. A 2 MB block mapping would cover 2 MB with one entry.

Exercise 2: TLB Flush Benchmark

Write a test that measures the overhead of TLB flushes: map 1000 pages, time a full TLB flush vs flushing each VA individually vs flushing by ASID.

20.9 Summary

The TLB caches page table entries to accelerate address translation. On ARM64, it has two levels (micro-TLB and main TLB) and separate TLBs for instructions and data. ASIDs allow the TLB to cache entries from multiple processes simultaneously, avoiding expensive flushes on every context switch. The TLBI family of instructions provides fine-grained invalidation: by VA, by ASID, or globally. Our kernel uses ASIDs and invalidates only the necessary entries, flushing by VA on mapping changes and by ASID on context switches.