Chapter 35: Storage
- How the kernel interfaces with block storage devices
- The generic block layer abstraction
- The virtio-blk driver for QEMU virtual disks
- SD card interface (for Raspberry Pi)
- Block I/O request queue and completion
- Our storage driver implementation
35.1 The Block Layer
Storage devices are block devices: data is read and written in fixed-size blocks (typically 512 bytes or 4096 bytes). The kernel's block layer abstracts away device-specific details behind a common interface.
/* Generic block device structure */
struct block_device {
    const char *name;
    int block_size;                 /* Block size in bytes (usually 512) */
    uint64_t num_blocks;            /* Total blocks on device */
    struct block_device_ops *ops;   /* Device-specific operations */
    void *private_data;             /* Driver-specific data */
    struct list_head bd_list;       /* For block device list */
};
/* Block device operations (filled in by each driver) */
struct block_device_ops {
    int (*read)(struct block_device *dev, uint64_t lba,
                void *buffer, int count);
    int (*write)(struct block_device *dev, uint64_t lba,
                 const void *buffer, int count);
};
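To make the interface concrete, here is a minimal sketch of a driver that fills in block_device_ops: a RAM-backed disk. It is an illustration only, not part of our kernel. The names ramdisk_read, ramdisk_write, ramdisk_selftest and RAMDISK_BLOCKS are hypothetical, and the structures are redeclared locally (without bd_list) so that the example compiles on its own.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct block_device;

struct block_device_ops {
    int (*read)(struct block_device *dev, uint64_t lba,
                void *buffer, int count);
    int (*write)(struct block_device *dev, uint64_t lba,
                 const void *buffer, int count);
};

/* Local copy of the chapter's structure, minus the list linkage */
struct block_device {
    const char *name;
    int block_size;
    uint64_t num_blocks;
    struct block_device_ops *ops;
    void *private_data;
};

#define RAMDISK_BLOCKS 8                 /* hypothetical size: 8 blocks */
static uint8_t ramdisk_storage[RAMDISK_BLOCKS * 512];

/* Reject transfers that would run past the end of the device */
static int ramdisk_in_range(const struct block_device *dev,
                            uint64_t lba, int count)
{
    return count >= 0 && lba <= dev->num_blocks &&
           (uint64_t)count <= dev->num_blocks - lba;
}

static int ramdisk_read(struct block_device *dev, uint64_t lba,
                        void *buffer, int count)
{
    const uint8_t *base = dev->private_data;
    if (!ramdisk_in_range(dev, lba, count))
        return -1;
    /* LBA to byte offset: lba * block_size */
    memcpy(buffer, base + lba * (uint64_t)dev->block_size,
           (size_t)count * (size_t)dev->block_size);
    return 0;
}

static int ramdisk_write(struct block_device *dev, uint64_t lba,
                         const void *buffer, int count)
{
    uint8_t *base = dev->private_data;
    if (!ramdisk_in_range(dev, lba, count))
        return -1;
    memcpy(base + lba * (uint64_t)dev->block_size, buffer,
           (size_t)count * (size_t)dev->block_size);
    return 0;
}

static struct block_device_ops ramdisk_ops = {
    .read = ramdisk_read,
    .write = ramdisk_write,
};

static struct block_device ramdisk = {
    .name = "ram0",
    .block_size = 512,
    .num_blocks = RAMDISK_BLOCKS,
    .ops = &ramdisk_ops,
    .private_data = ramdisk_storage,
};

/* Callers only ever go through dev->ops, never the driver directly */
static void ramdisk_selftest(void)
{
    uint8_t out[512], in[512];
    memset(out, 0xA5, sizeof out);
    memset(in, 0, sizeof in);
    assert(ramdisk.ops->write(&ramdisk, 2, out, 1) == 0);
    assert(ramdisk.ops->read(&ramdisk, 2, in, 1) == 0);
    assert(memcmp(in, out, sizeof in) == 0);
}
```

Because the caller sees only the ops table, swapping this RAM disk for a hardware driver requires no change to the calling code.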
35.2 The virtio-blk Driver
On QEMU virt, the primary storage device is virtio-blk, a paravirtualized block device. It uses a shared memory ring buffer (virtqueue) for communication between the guest kernel and the QEMU host:
/* virtio-blk device */
struct virtio_blk {
    uint64_t mmio_base;     /* MMIO base from device tree */
    struct virtqueue *vq;   /* Virtqueue for I/O requests */
    uint64_t capacity;      /* Number of 512-byte sectors */
    uint64_t features;      /* Negotiated feature bits */
};
/* virtio-mmio registers (offsets from mmio_base) */
#define VIRTIO_MMIO_MAGIC            0x000
#define VIRTIO_MMIO_VERSION          0x004
#define VIRTIO_MMIO_DEVICE_ID        0x008
#define VIRTIO_MMIO_QUEUE_NUM        0x038
#define VIRTIO_MMIO_QUEUE_READY      0x044
#define VIRTIO_MMIO_QUEUE_NOTIFY     0x050
#define VIRTIO_MMIO_INTERRUPT_STATUS 0x060
#define VIRTIO_MMIO_INTERRUPT_ACK    0x064
/* virtio-blk request: header, then data, then one status byte */
struct virtio_blk_req {
    uint32_t type;      /* 0 = read, 1 = write */
    uint32_t reserved;
    uint64_t sector;    /* LBA starting sector */
    uint8_t data[];     /* Data buffer; the status byte (0 = OK) follows it */
};
/* Read blocks from virtio-blk */
int virtio_blk_read(struct block_device *dev, uint64_t lba,
                    void *buffer, int count) {
    struct virtio_blk *vblk = dev->private_data;
    size_t len = (size_t)count * 512;
    /* Build request: header + data + trailing status byte */
    struct virtio_blk_req *req = kmalloc(sizeof(*req) + len + 1);
    if (!req)
        return -1;
    req->type = 0;      /* READ */
    req->reserved = 0;
    req->sector = lba;
    volatile uint8_t *status = &req->data[len];
    *status = 0xFF;     /* Overwritten by the device on completion */
    /* Submit to virtqueue (the header is device-readable; the data and
       status bytes must be marked device-writable for a read) */
    virtqueue_add(vblk->vq, req, len);
    /* Notify host (queue 0) */
    writel(vblk->mmio_base, VIRTIO_MMIO_QUEUE_NOTIFY, 0);
    /* Wait for completion (polling here; an interrupt handler also works) */
    while (!(readl(vblk->mmio_base, VIRTIO_MMIO_INTERRUPT_STATUS) & 1))
        ;
    /* Acknowledge, so the next request does not see a stale bit */
    writel(vblk->mmio_base, VIRTIO_MMIO_INTERRUPT_ACK, 1);
    /* Copy out the data the device wrote, then free */
    memcpy(buffer, req->data, len);
    int ok = (*status == 0);
    kfree(req);
    return ok ? 0 : -1;
}
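The listing above hides the descriptor layout behind virtqueue_add(). As a rough sketch of what happens underneath, a virtio-blk request is usually described to the device as a chain of three descriptors: the header (device-readable), the data buffer (device-writable for a read), and the status byte (device-writable). The descriptor layout and flag values below follow the virtio split-virtqueue format. The helper blk_build_chain, the struct name blk_req_hdr, and the assumption that kernel pointers can be used directly as device addresses (identity mapping) are ours, for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  1   /* Chain continues in the 'next' field */
#define VIRTQ_DESC_F_WRITE 2   /* Device writes to this buffer */

/* One entry of the split-virtqueue descriptor table (16 bytes) */
struct virtq_desc {
    uint64_t addr;    /* Guest-physical address of the buffer */
    uint32_t len;     /* Buffer length in bytes */
    uint16_t flags;   /* VIRTQ_DESC_F_* */
    uint16_t next;    /* Index of the next descriptor in the chain */
};

/* Fixed-size request header (hypothetical name) */
struct blk_req_hdr {
    uint32_t type;    /* 0 = read, 1 = write */
    uint32_t reserved;
    uint64_t sector;
};

/*
 * Fill desc[0..2] with a header/data/status chain and return the head index.
 * ASSUMPTION: virtual address == physical address (identity mapping).
 */
static uint16_t blk_build_chain(struct virtq_desc *desc,
                                const struct blk_req_hdr *hdr,
                                void *data, uint32_t data_len,
                                uint8_t *status, int is_write)
{
    desc[0].addr  = (uint64_t)(uintptr_t)hdr;
    desc[0].len   = sizeof(*hdr);
    desc[0].flags = VIRTQ_DESC_F_NEXT;            /* device reads header */
    desc[0].next  = 1;

    desc[1].addr  = (uint64_t)(uintptr_t)data;
    desc[1].len   = data_len;
    /* On a read the device fills the buffer; on a write it only reads it */
    desc[1].flags = VIRTQ_DESC_F_NEXT |
                    (is_write ? 0 : VIRTQ_DESC_F_WRITE);
    desc[1].next  = 2;

    desc[2].addr  = (uint64_t)(uintptr_t)status;
    desc[2].len   = 1;
    desc[2].flags = VIRTQ_DESC_F_WRITE;           /* device writes status */
    desc[2].next  = 0;                            /* end of chain */
    return 0;
}

/* Usage: describe a one-sector read starting at sector 8 */
static void blk_chain_example(void)
{
    struct virtq_desc desc[3];
    struct blk_req_hdr hdr = { .type = 0, .reserved = 0, .sector = 8 };
    static uint8_t data[512];
    uint8_t status = 0xFF;

    uint16_t head = blk_build_chain(desc, &hdr, data, sizeof data,
                                    &status, 0);
    assert(head == 0);
    assert(desc[1].flags == (VIRTQ_DESC_F_NEXT | VIRTQ_DESC_F_WRITE));
    assert(desc[2].flags == VIRTQ_DESC_F_WRITE && desc[2].len == 1);
}
```

Keeping the header, data, and status in separate descriptors is what lets the device write into the caller's buffer directly, which would avoid the extra memcpy in the simpler listing.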
35.3 Generic Block I/O Request Queue
Rather than calling the driver directly for every read/write, the kernel uses a generic I/O request queue. This allows scheduling, merging, and caching of requests:
/* Block I/O request */
struct bio_request {
    struct block_device *dev;
    uint64_t lba;
    void *buffer;
    int count;
    int dir;                        /* 0 = read, 1 = write */
    struct semaphore completion;    /* For synchronous I/O */
    int error;
    struct list_head node;
};
/* Submit a synchronous I/O request */
int blk_read(struct block_device *dev, uint64_t lba,
             void *buffer, int count) {
    struct bio_request *req = kmalloc(sizeof(*req));
    if (!req)
        return -1;
    req->dev = dev;
    req->lba = lba;
    req->buffer = buffer;
    req->count = count;
    req->dir = 0;
    req->error = 0;
    sem_init(&req->completion, 0);
    /* Add to request queue */
    spinlock_lock(&blk_request_lock);
    list_add_tail(&blk_request_list, &req->node);
    spinlock_unlock(&blk_request_lock);
    /* Wake up the block I/O thread */
    sem_signal(&blk_request_sem);
    /* Wait for completion (the caller sleeps; other threads keep running) */
    sem_wait(&req->completion);
    int error = req->error;
    kfree(req);
    return error;
}
/* Block I/O kernel thread (processes requests) */
void blk_thread(void) {
    while (1) {
        sem_wait(&blk_request_sem);
        spinlock_lock(&blk_request_lock);
        struct bio_request *req = list_pop(&blk_request_list);
        spinlock_unlock(&blk_request_lock);
        if (req) {
            if (req->dir == 0)
                req->error = req->dev->ops->read(req->dev, req->lba,
                                                 req->buffer, req->count);
            else
                req->error = req->dev->ops->write(req->dev, req->lba,
                                                  req->buffer, req->count);
            sem_signal(&req->completion);
        }
    }
}
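The queue is also what makes request merging possible. The listings above do not merge anything; the following is only a sketch of how a merge check might look, assuming 512-byte sectors and a trimmed-down request type. The names bio_request_lite, bio_try_merge and SECTOR_SIZE are hypothetical. Two requests can be combined when they target the same device, go in the same direction, and are contiguous both on disk and in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct block_device;             /* Opaque here; only compared by pointer */

#define SECTOR_SIZE 512          /* ASSUMPTION: fixed 512-byte sectors */

/* Trimmed-down request: just the fields a merge decision needs */
struct bio_request_lite {
    struct block_device *dev;
    uint64_t lba;
    uint8_t *buffer;
    int count;
    int dir;                     /* 0 = read, 1 = write */
};

/*
 * If b starts exactly where a ends (on disk and in memory), extend a to
 * cover both and return 1. Otherwise leave a untouched and return 0.
 */
static int bio_try_merge(struct bio_request_lite *a,
                         const struct bio_request_lite *b)
{
    if (a->dev != b->dev || a->dir != b->dir)
        return 0;
    if (a->lba + (uint64_t)a->count != b->lba)
        return 0;                /* Not adjacent on disk */
    if (a->buffer + (size_t)a->count * SECTOR_SIZE != b->buffer)
        return 0;                /* Not adjacent in memory */
    a->count += b->count;
    return 1;
}

/* Usage: two back-to-back 2-sector reads become one 4-sector read */
static void bio_merge_example(void)
{
    static uint8_t buf[4 * SECTOR_SIZE];
    struct bio_request_lite a = { NULL, 10, buf, 2, 0 };
    struct bio_request_lite b = { NULL, 12, buf + 2 * SECTOR_SIZE, 2, 0 };

    assert(bio_try_merge(&a, &b) == 1);
    assert(a.lba == 10 && a.count == 4);
}
```

The memory-contiguity check is needed only because each request carries a single buffer pointer; a design with scatter-gather lists could merge requests whose buffers are not adjacent.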
35.4 SD Card Interface (Raspberry Pi)
On real hardware (Raspberry Pi 4/5), storage is typically an SD card attached to an SDHCI-compatible (SD Host Controller Interface) controller, such as the EMMC2 controller on the Pi 4's BCM2711. Driving the card involves:
- Initializing the SD controller (clock, voltage, bus width)
- Sending SD commands (CMD0, CMD8, ACMD41, CMD2, CMD3) to identify the card
- Reading/writing sectors using CMD18 (read multiple) and CMD25 (write multiple)
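The identification sequence above can be written down as a small table. This is a sketch for orientation, not our driver's code: the struct sd_step and the helper names are hypothetical. The command indices and argument values follow the SD physical layer conventions (CMD8 carries a voltage field and a check pattern; ACMD41 is sent after CMD55 and repeated until the card reports ready).

```c
#include <assert.h>
#include <stdint.h>

#define SD_CMD8_ARG   0x000001AAu  /* VHS = 2.7-3.6 V, check pattern 0xAA */
#define SD_ACMD41_ARG 0x40FF8000u  /* HCS (bit 30) + 2.7-3.6 V OCR window */

/* One step of the card identification sequence (hypothetical type) */
struct sd_step {
    uint8_t index;        /* Command index (0..63) */
    uint8_t is_app_cmd;   /* 1 = must be preceded by CMD55 */
    uint32_t arg;         /* 32-bit command argument */
    const char *purpose;
};

static const struct sd_step sd_identify[] = {
    {  0, 0, 0,             "GO_IDLE_STATE: reset the card" },
    {  8, 0, SD_CMD8_ARG,   "SEND_IF_COND: check voltage, detect v2 cards" },
    { 41, 1, SD_ACMD41_ARG, "SD_SEND_OP_COND: repeat until card is ready" },
    {  2, 0, 0,             "ALL_SEND_CID: read the card identification" },
    {  3, 0, 0,             "SEND_RELATIVE_ADDR: card publishes its RCA" },
};

#define SD_IDENTIFY_STEPS (sizeof(sd_identify) / sizeof(sd_identify[0]))

/* CMD8 response (R7): the low 12 bits echo the voltage and check pattern */
static int sd_cmd8_echo_ok(uint32_t r7)
{
    return (r7 & 0xFFFu) == (SD_CMD8_ARG & 0xFFFu);
}

/* CMD3 response (R6): the relative card address is in the upper 16 bits */
static uint16_t sd_rca_from_r6(uint32_t r6)
{
    return (uint16_t)(r6 >> 16);
}

/* Usage: decode two sample responses */
static void sd_identify_example(void)
{
    assert(SD_IDENTIFY_STEPS == 5);
    assert(sd_cmd8_echo_ok(0x000001AAu));
    assert(sd_rca_from_r6(0x12340500u) == 0x1234);
}
```

The RCA returned by CMD3 is what later commands use to address the card, so a driver would normally store it alongside the controller state.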
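In QEMU, an SD card image is attached with -drive file=disk.img,if=sd,format=raw. This only takes effect on machine types that model an SD host controller, such as the raspi boards; the virt machine has no built-in SD controller, so there the disk is normally exposed through virtio-blk instead. Where a controller is present, our kernel detects and initializes the SDHCI controller from the device tree.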
35.5 Our Implementation
Our storage subsystem (drivers/block/) provides:
- Generic block layer: block_device interface with read/write operations
- virtio-blk driver: for QEMU virtual disks (fast paravirtualized I/O)
- SDHCI driver: for SD card access on Raspberry Pi
- Block I/O thread: asynchronous request processing with semaphore synchronization
- Partition support: parses the MBR partition table to detect partitions
- Device registration: block devices appear as /dev/sda, /dev/sdb, etc. via devfs
35.6 Exercises
Exercise 1: MBR Parser
Implement a function that reads the Master Boot Record (LBA 0) and returns the four partition entries with their start sectors and sizes.
Exercise 2: Read Speed Benchmark
Benchmark the read speed of the virtio-blk device by reading 1 MB in various transfer sizes (512 bytes, 4 KB, 64 KB). Compare the throughput.
35.7 Summary
The storage subsystem provides block-level I/O for persistent storage. The generic block layer abstracts device details behind a common interface. The virtio-blk driver (for QEMU) uses paravirtualized I/O via virtqueues. The block I/O thread processes requests asynchronously, allowing other threads to continue while I/O is in progress. Our kernel supports both virtio-blk (QEMU) and SDHCI (Raspberry Pi) with MBR partition parsing.