Chapter 20: TLB
- What the Translation Lookaside Buffer (TLB) is and why it exists
- How the TLB caches page table entries for fast translation
- TLB structure: L1 (micro-TLB) and L2 (main TLB) on ARM64
- When the TLB must be invalidated
- ARM64 TLB maintenance instructions
- Our TLB management strategy
20.1 What is the TLB?
The Translation Lookaside Buffer (TLB) is a hardware cache inside the CPU that stores recently used address translations. Without it, translating a virtual address means walking up to 4 levels of page tables (with a 4 KB granule and 48-bit virtual addresses), which costs up to 4 memory reads. Doing that on every memory access would be slow and consume significant memory bandwidth. The TLB avoids this by caching the results of previous translations.
When the CPU needs to translate a virtual address, it first checks the TLB. If the translation is found (a TLB hit), the physical address is available in a few CPU cycles. If not (a TLB miss), the hardware walks the page tables and loads the translation into the TLB.
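The hit/miss flow described above can be sketched as a small software model. This is only an illustration that runs on any host: the names `tlb_model`, `tlb_translate`, and `walk_page_tables` are hypothetical, the 4-entry size is chosen to keep the example small, and the round-robin replacement policy is an assumption (real TLB replacement is implementation-defined).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 4      /* deliberately tiny; real TLBs are larger */
#define PAGE_SHIFT  12     /* 4 KB pages */

struct tlb_model {
    uint64_t vpn[TLB_ENTRIES];   /* cached virtual page numbers */
    uint64_t pfn[TLB_ENTRIES];   /* matching physical frame numbers */
    bool     valid[TLB_ENTRIES];
    size_t   next;               /* round-robin replacement cursor */
    unsigned hits, misses;
};

/* Stand-in for the hardware page table walk (hypothetical fixed mapping). */
static uint64_t walk_page_tables(uint64_t vpn) {
    return vpn + 0x100;
}

/* Translate va: check the TLB first, walk the tables and refill on a miss. */
static uint64_t tlb_translate(struct tlb_model *t, uint64_t va) {
    uint64_t vpn = va >> PAGE_SHIFT;
    uint64_t offset = va & ((1ULL << PAGE_SHIFT) - 1);

    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (t->valid[i] && t->vpn[i] == vpn) {
            t->hits++;                       /* TLB hit: no table walk */
            return (t->pfn[i] << PAGE_SHIFT) | offset;
        }
    }

    t->misses++;                             /* TLB miss: walk and refill */
    uint64_t pfn = walk_page_tables(vpn);
    t->vpn[t->next] = vpn;
    t->pfn[t->next] = pfn;
    t->valid[t->next] = true;
    t->next = (t->next + 1) % TLB_ENTRIES;
    return (pfn << PAGE_SHIFT) | offset;
}
```

Two accesses to the same page produce one miss followed by one hit, which is the behavior the TLB exists to provide.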
20.2 TLB Structure on ARM64
ARM64 CPUs typically implement two levels of TLB:
- L1 (micro-TLB): small (8-64 entries), very fast, per execution pipeline
- L2 (main TLB): larger (128-4096 entries), slower, shared between pipelines
Modern Cortex-A cores have separate instruction TLBs (for instruction fetches) and data TLBs (for load/store operations) at the first level. Many designs back these with a unified L2 TLB. Each TLB entry contains:
- Virtual address tag (ASID-tagged, see below)
- Physical address
- Page size (4 KB, 2 MB, or 1 GB)
- Memory attributes (cache policy, permissions)
- ASID (Address Space ID)
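To make the fields above concrete, here is a minimal sketch of a TLB entry as a C struct with a match predicate. Real entries are implementation-defined and invisible to software, so the layout and the names `tlb_entry`, `tlb_entry_matches`, and `tlb_entry_translate` are hypothetical. The `global` flag is an addition to the list above: it models ARM64 mappings whose nG bit is clear, which match regardless of ASID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative software model of one TLB entry. */
struct tlb_entry {
    uint64_t va_base;     /* virtual base address of the page or block */
    uint64_t pa_base;     /* physical base address */
    uint8_t  page_shift;  /* 12 = 4 KB, 21 = 2 MB, 30 = 1 GB */
    uint16_t asid;        /* address space ID tag */
    bool     global;      /* global entries match every ASID */
    bool     valid;
};

/* Does this entry translate va for the address space identified by asid? */
static bool tlb_entry_matches(const struct tlb_entry *e, uint64_t va, uint16_t asid) {
    if (!e->valid)
        return false;
    if ((va >> e->page_shift) != (e->va_base >> e->page_shift))
        return false;
    return e->global || e->asid == asid;
}

/* Physical address for a va that is already known to match the entry. */
static uint64_t tlb_entry_translate(const struct tlb_entry *e, uint64_t va) {
    uint64_t offset_mask = (1ULL << e->page_shift) - 1;
    return (e->pa_base & ~offset_mask) | (va & offset_mask);
}
```

Because the page size is stored per entry, a single 2 MB entry matches any address inside its 2 MB block, which is why block mappings stretch TLB capacity.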
20.3 ASIDs: Avoiding TLB Flushes on Context Switch
Without ASIDs, every context switch would require a complete TLB flush, because the old process's translations are cached and would be incorrect for the new process. ASIDs (Address Space IDs) allow the TLB to distinguish between translations of different processes. Each TLB entry is tagged with an ASID. When the MMU translates an address, it matches only entries tagged with the current ASID (taken from TTBR0_EL1 in our configuration) or entries marked global, such as kernel mappings.
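#include <stdint.h>

/* TTBR0_EL1 format: bits [63:48] = ASID, bits [47:1] = table base address */
void set_ttbr0(uint64_t *table, uint16_t asid) {
    /* table must hold the physical address of the top-level page table */
    uint64_t ttbr0 = ((uint64_t)asid << 48) | (uint64_t)table;
    asm volatile("msr ttbr0_el1, %0; isb" : : "r"(ttbr0) : "memory");
}
/* On context switch with ASIDs: invalidate only the old process's entries */
void tlb_invalidate_asid(uint16_t asid) {
    /* ASIDE1 takes the ASID in bits [63:48] of its operand. The leading DSB
       makes earlier page table writes visible before the invalidation. */
    asm volatile("dsb ishst; tlbi aside1, %0; dsb ish; isb"
                 : : "r"((uint64_t)asid << 48) : "memory");
}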
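The TTBR0_EL1 packing used by set_ttbr0 can be checked on any host with plain integer helpers. This is a sketch under the layout stated in the comment above; the helper names `ttbr0_pack`, `ttbr0_asid`, and `ttbr0_table` are hypothetical. Note that a full 16-bit ASID assumes TCR_EL1.AS is set; otherwise only the low 8 bits of the field are used.

```c
#include <stdint.h>

#define TTBR_ASID_SHIFT 48
#define TTBR_BADDR_MASK 0x0000FFFFFFFFFFFEULL   /* bits [47:1] */

/* Build a TTBR0_EL1 value from a table physical address and an ASID. */
static uint64_t ttbr0_pack(uint64_t table_pa, uint16_t asid) {
    return ((uint64_t)asid << TTBR_ASID_SHIFT) | (table_pa & TTBR_BADDR_MASK);
}

/* Extract the ASID field, bits [63:48]. */
static uint16_t ttbr0_asid(uint64_t ttbr0) {
    return (uint16_t)(ttbr0 >> TTBR_ASID_SHIFT);
}

/* Extract the table base address, bits [47:1]. */
static uint64_t ttbr0_table(uint64_t ttbr0) {
    return ttbr0 & TTBR_BADDR_MASK;
}
```

For example, a table at physical address 0x40080000 with ASID 0x2A packs to 0x002A000040080000, and both fields can be recovered from that value.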
20.4 TLB Maintenance Instructions
ARM64 provides the TLBI (Translation Lookaside Buffer Invalidate) instruction family:
| Instruction | Effect |
|---|---|
| TLBI VMALLE1 | Invalidate all stage 1 EL1&0 entries on the current core (both TTBR0 and TTBR1) |
| TLBI VMALLE1IS | Same, but broadcast to the inner-shareable domain (SMP safe) |
| TLBI ASIDE1, x0 | Invalidate all entries for a specific ASID |
| TLBI VAAE1, x0 | Invalidate a specific VA for all ASIDs |
| TLBI VAE1, x0 | Invalidate a specific VA for the ASID given in the operand (global entries for that VA also match) |
| TLBI IPAS2E1, x0 | Invalidate stage 2 entries by IPA (for virtualization) |
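Every TLB invalidation must be followed by a data synchronization barrier (DSB ISH), which ensures the invalidation has completed before subsequent memory accesses, and then by an instruction synchronization barrier (ISB), which flushes the pipeline so later instructions use the new translation. When the invalidation follows a page table update, a DSB before the TLBI (for example DSB ISHST) ensures the updated entry is visible to the table walker first.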
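The register operand of the by-VA and by-ASID instructions is not a raw address: the virtual page number (VA bits [55:12]) goes in bits [43:0] and the ASID in bits [63:48]. The following sketch builds those operands as plain integers so the encoding can be checked on any host; the helper names `tlbi_va_operand` and `tlbi_asid_operand` are hypothetical.

```c
#include <stdint.h>

/* Operand layout for by-VA TLBI operations such as VAE1 and VAAE1:
 *   bits [43:0]  = VA[55:12] (the virtual page number)
 *   bits [63:48] = ASID (ignored by the all-ASID VAAE1 form) */
static uint64_t tlbi_va_operand(uint64_t va, uint16_t asid) {
    uint64_t vpn = (va >> 12) & ((1ULL << 44) - 1);
    return vpn | ((uint64_t)asid << 48);
}

/* Operand for ASID-wide operations such as ASIDE1: only the ASID field is used. */
static uint64_t tlbi_asid_operand(uint16_t asid) {
    return (uint64_t)asid << 48;
}
```

Passing an unshifted virtual address as the operand would invalidate the wrong page, so a helper like this is worth keeping next to the inline assembly.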
20.5 When Must the TLB be Invalidated?
The TLB caches translations from page tables. Any time a page table entry changes, the corresponding TLB entry(s) must be invalidated. Key scenarios:
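- Mapping a new page: invalidate the specific VA (or the whole TLB) if the old entry was valid; ARM64 does not cache invalid entries, so a mapping written over an invalid entry strictly needs only the barriers, although flushing is harmless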
- Unmapping a page: invalidate the specific VA
- Changing permissions: invalidate the specific VA
- Context switch (no ASIDs): invalidate all user entries (TLBI VMALLE1 or ASIDE1)
- Context switch (with ASIDs): no invalidation needed; just set TTBR0 with new ASID
- Kernel mapping change: invalidate all kernel entries (rare, happens at boot)
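/* TLB management functions */
/* By-VA TLBI operand: bits [43:0] = VA[55:12], bits [63:48] = ASID */
void tlb_flush_all(void) {
    asm volatile("dsb ishst; tlbi vmalle1; dsb ish; isb" : : : "memory");
}
/* Invalidate one page for one ASID (global entries for the VA also match) */
void tlb_flush_va_asid(uint64_t va, uint16_t asid) {
    uint64_t operand = ((va >> 12) & 0xFFFFFFFFFFFULL) | ((uint64_t)asid << 48);
    asm volatile("dsb ishst; tlbi vae1, %0; dsb ish; isb" : : "r"(operand) : "memory");
}
/* Invalidate one page for the current ASID, read back from TTBR0_EL1 */
void tlb_flush_va(uint64_t va) {
    uint64_t ttbr0;
    asm volatile("mrs %0, ttbr0_el1" : "=r"(ttbr0));
    tlb_flush_va_asid(va, (uint16_t)(ttbr0 >> 48));
}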
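When a whole range of pages changes at once, a kernel has to choose between many per-page invalidations and one full flush. The sketch below shows only that decision logic, as host-testable integer code. The names `pages_in_range` and `should_flush_all` and the threshold of 64 pages are assumptions for illustration, not part of our kernel; a sensible cutoff depends on the TLB size and should be measured.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE           4096ULL
#define FLUSH_ALL_THRESHOLD 64ULL   /* hypothetical cutoff; tune by measurement */

/* Number of 4 KB pages touched by [start, start + len), for len > 0. */
static uint64_t pages_in_range(uint64_t start, uint64_t len) {
    uint64_t first = start & ~(PAGE_SIZE - 1);
    uint64_t last  = (start + len - 1) & ~(PAGE_SIZE - 1);
    return (last - first) / PAGE_SIZE + 1;
}

/* True when one full flush is likely cheaper than per-page invalidations. */
static bool should_flush_all(uint64_t pages) {
    return pages > FLUSH_ALL_THRESHOLD;
}
```

A caller would invoke tlb_flush_all when should_flush_all returns true and otherwise loop over the range, calling tlb_flush_va once per page.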
20.6 TLB Performance and TLB Pressure
TLB misses are expensive. A miss requires a hardware page table walk (up to 4 memory reads) and stalls the pipeline until the translation is complete. Strategies to reduce TLB pressure:
- Huge pages: 2 MB or 1 GB block mappings cover more memory with fewer TLB entries
- TLB prefetching: some CPUs prefetch adjacent TLB entries
- Page coloring: align pages to reduce TLB conflicts (software optimization)
- ASIDs: avoid flushing on every context switch
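The benefit of huge pages is easy to quantify as TLB reach: the amount of memory the TLB can map at once. The following helpers are a minimal sketch assuming every entry maps a page of the same size; the function names are hypothetical.

```c
#include <stdint.h>

/* Memory covered by a TLB whose entries all map pages of the same size. */
static uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size) {
    return entries * page_size;
}

/* Entries needed to cover a working set, rounding up to whole pages. */
static uint64_t tlb_entries_needed(uint64_t working_set, uint64_t page_size) {
    return (working_set + page_size - 1) / page_size;
}
```

A 64 MB working set needs 16384 entries with 4 KB pages but only 32 entries with 2 MB blocks, so the same workload that overflows a typical L2 TLB with small pages fits comfortably with block mappings.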
20.7 Our Implementation
Our kernel's TLB management is simple but correct:
- During boot, we flush the entire TLB after enabling the MMU
- After each map_page or unmap_page call, we flush the specific VA using TLBI VAE1
- During context switch, we flush the old process's entries using TLBI ASIDE1 (we use ASIDs, so this is a conservative choice rather than a requirement)
- No TLB flush is needed when switching between kernel threads (same address space)
- We use DSB ISH after every invalidation (inner-shareable domain for SMP)
Future optimization: implement lazy TLB invalidation (defer flushes until the page is actually accessed by another process) and make invalidation SMP-safe, either with the broadcast IS variants of TLBI (such as TLBI VAE1IS) or with a software TLB shootdown protocol (sending IPIs to other cores to perform local TLB flushes).
20.8 Exercises
Exercise 1: TLB Size Estimation
Assume a TLB has 64 entries and each entry covers 4 KB. What is the maximum memory that can be covered without TLB misses? Answer: 64 × 4 KB = 256 KB. A 2 MB block mapping would cover 2 MB with one entry.
Exercise 2: TLB Flush Benchmark
Write a test that measures the overhead of TLB flushes: map 1000 pages, time a full TLB flush vs flushing each VA individually vs flushing by ASID.
20.9 Summary
The TLB caches page table entries to accelerate address translation. On ARM64, it has two levels (micro-TLB and main TLB) and separate TLBs for instructions and data. ASIDs allow the TLB to cache entries from multiple processes simultaneously, avoiding expensive flushes on every context switch. The TLBI family of instructions provides fine-grained invalidation: by VA, by ASID, or globally. Our kernel uses ASIDs and invalidates only the necessary entries, flushing by VA on mapping changes and by ASID on context switches.