OVERVIEW
Proxmox exposes several QEMU storage configuration options through its interfaces. This technote aims to quantify the system impact of the asynchronous I/O and I/O threading model options.
Rather than emphasize the maximum performance of the underlying storage solution, this technote focuses on the aggregate performance of an enterprise server running 32 virtual machines to assess the capabilities of Proxmox/QEMU/KVM operating under moderate to heavy load.
The system under test is a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor and Mellanox 100Gbit networking. The server is installed with Proxmox 7.2 and our Proxmox storage plugin version 3.0. The backend storage software is Blockbridge 6. The network storage protocol is NVMe/TCP.
The testbed consists of 32 virtual machines operating on a single host, each configured with four virtual CPUs. Each virtual machine operates as a fio worker. A single virtual disk is attached to each virtual machine for testing. For each set of Proxmox configuration options considered, we execute a battery of concurrent I/O tests with varying queue depths and block sizes. Each data point collected represents the average performance over a 20-minute interval following a 1-minute warm-up.
Our findings, a TL;DR for network-attached storage, and our raw benchmark data are presented in the sections below.
PROXMOX I/O OPTIONS
What is aio native?
In Proxmox, the AIO disk parameter selects the method for implementing asynchronous I/O. Asynchronous I/O allows QEMU to issue multiple transfer requests to the hypervisor without serializing QEMU’s centralized scheduler.
AIO is also the name of a Linux system interface for performing asynchronous I/O, introduced in Linux 2.6. Setting aio=native in Proxmox informs the system to use the Linux AIO system interface for managing asynchronous I/O.
In the Linux AIO model, submission and completion operations are system calls. The primary issue with the Linux AIO implementation is that it can block in a variety of circumstances (i.e., buffered I/O, high queue depth, dm devices).
AIO blocks if anything in the I/O submission path is unable to complete inline. However, when used with raw block devices and caching disabled, AIO will not block. Therefore, it is a good choice for network-attached block storage.
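As a sketch of how this maps to Proxmox configuration, aio=native can be selected per disk when attaching raw block storage, with caching disabled. The VM ID and storage/volume names below are hypothetical placeholders for your environment:

```shell
# Hypothetical example: attach a raw block disk to VM 100 using
# Linux AIO; cache=none (O_DIRECT) is required for aio=native to
# remain non-blocking. "blockbridge" is a placeholder storage name.
qm set 100 --scsi0 blockbridge:vm-100-disk-0,aio=native,cache=none
```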
What is aio io_uring?
io_uring is an alternative Linux system interface for executing concurrent asynchronous I/O. Similar to Linux AIO, it allows QEMU to issue multiple transfer requests without serializing QEMU’s centralized scheduler.
Unlike Linux AIO, io_uring leverages independent shared memory queues for command submission and completion instead of system calls. One architectural goal of io_uring is to minimize latency by reducing system call overhead. However, in our experience, it is often difficult to extract the full performance potential of io_uring in applications like QEMU due to an inherent need to share system resources fairly (i.e., busy polling isn’t a good option).
io_uring has one significant architectural advantage compared to AIO: it is guaranteed not to block. However, this does not necessarily result in operational gains, since Linux AIO is also non-blocking when used with O_DIRECT raw block storage.
What are IOThreads?
When a VM executes an asynchronous disk I/O operation, it issues a request to the hypervisor and waits for an event indicating completion. By default, this happens in the QEMU “main loop.” An IOThread provides a dedicated event loop, operating in a separate thread, for handling I/O. IOThreads offload work from the “main loop” into a separate thread that can execute concurrently.
There are several claimed advantages of IOThreads, including decreased latency, reduced contention, and improved scalability. However, nothing is free. IOThreads are resources that the hypervisor must prioritize and execute.
Note: you must select the virtio-scsi-single SCSI controller to enable IOThreads.
FINDINGS
IOThreads significantly improve performance for most workloads.
IOThreads deliver performance gains exceeding 15% at low queue depth. The performance benefits of an IOThread (for a single storage device) appear to diminish with increasing queue depth. However, in most cases, the benefits outweigh any potential consequences.
The graph below shows the percentage gains, averaged across block sizes, for each queue depth. For example, the data point QD=1 is the average gain measured for the .5KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB block sizes. The graph shows a nearly 15% average gain for both aio=native and aio=io_uring when IOThreads are enabled at QD=1.
aio=native and aio=io_uring offer competitive performance.
aio=native and aio=io_uring offer comparable overall performance. However, to grasp the full picture, we must compare the two with and without IOThreads.
In the absence of IOThreads, aio=io_uring outperforms aio=native in 7 out of 8 queue depths. When we use IOThreads, aio=native wins in 5 out of 8. Note that the values in the graphs are the average of gains across all block sizes for a given queue depth.
Neither AIO model is without quirks. aio=native, with an IOThread, offers the best IOPS for small to medium block sizes at medium queue depth. However, aio=native exhibits a consistent performance anomaly at QD=2. aio=io_uring offers consistent performance at QD=2, but has performance issues when the system is saturated. The anomaly degrades aio=native QD2 performance across all block sizes, giving aio=io_uring as much as a 10% advantage at QD2.
aio=native has a slight latency advantage for QD1 workloads
aio=native consistently outperforms aio=io_uring by a small margin. The benefits are more pronounced when running without an IOThread.
aio=io_uring performance degrades in extreme load conditions.
In the 128K block size tests, we see significant performance degradation at high queue depth with aio=io_uring. The impact on aio=native is less significant.
The graph below shows the average bandwidth for the 128K block size tests operating over a range of queue depths. The system is effectively saturated at queue depth 2. The peak system bandwidth measured is 12.8 GB/s. For higher queue depths, the performance of aio=io_uring drops significantly faster than that of aio=native. The difference in performance at peak test load is approximately 10%.
A dual-port Gen3 100G NIC is limited to 2 million IOPS with default settings
Proxmox is a highly capable platform for demanding storage applications. Using a single NIC, a single Proxmox host can achieve more than 1.5 million random read IOPS with sub-millisecond latency. However, the maximum performance of a single NIC was limited to roughly 2 million IOPS in a configuration where the backend storage is capable of significantly more.
A dual-port Gen3 100G NIC is limited to 12.8 GB/s with default settings
Our benchmark results were limited to 12.8 GB/s of throughput with default settings. With additional tuning parameters (outside the scope of this technote), we confirmed an ability to saturate the NIC’s x16 Gen3 PCIe link with approximately 14.7 GB/s of disk throughput.
SUMMARY
Our test data shows a clear and significant benefit that supports the use of IOThreads. Performance differences between aio=native and aio=io_uring were less significant. Except for the unusual behavior reported in our results for QD=2, aio=native offers slightly better performance (when deployed with an IOThread) and gets our vote for the most conservative pick.
Note that aio=native applies to unbuffered, O_DIRECT, raw block storage only; the disk cache policy must be set to none. Raw block storage types include iSCSI, CEPH/RBD, and NVMe. For other storage types and cache settings, aio=io_uring (plus an IOThread) is preferred because aio=native can block in these configurations.
Recommended Settings
Based on the data, we recommend:
Parameter | Description | Default In 7.2 | Recommended Setting
---|---|---|---
SCSI Controller | SCSI emulation driver type | n/a | virtio-scsi-single
AIO Type | System interface used for asynchronous I/O | io_uring | native
IOThreads | Provides dedicated threads for disk I/O | disabled | enabled
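The recommended settings above can be applied per VM from the Proxmox CLI. A minimal sketch, assuming a hypothetical VM 100 and a placeholder storage named "blockbridge":

```shell
# Switch the VM to the virtio-scsi-single controller, which gives
# each disk its own controller and allows IOThreads to be enabled.
qm set 100 --scsihw virtio-scsi-single

# Reattach the data disk with the recommended AIO, cache, and
# IOThread options (disk/volume names are placeholders).
qm set 100 --scsi0 blockbridge:vm-100-disk-0,aio=native,cache=none,iothread=1
```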
Note: you must select the virtio-scsi-single SCSI controller to enable IOThreads.
ENVIRONMENT
Network Diagram
┌──────────────────────────────┐ ┌─────────────────────┐
│ ┌────┐ | ┌───────────────┐ | │
│ ┌────┐ | PROXMOX 7.2 │── NVME/TCP ─┤ SN2100 - 100G ├──────── ┤ BLOCKBRIDGE 6.X │
│ ┌────┐ | | 100G DUAL PORT │ └───────────────┘ │ QUAD ENGINE │
│ │ 32 │ |─┘ X16 GEN3 │ ┌───────────────┐ │ 2X 100G DUAL PORT │
│ │ VM │─┘ 32 CORE AMD |── NVME/TCP ─┤ SN2100 - 100G ├──────── ┤ 4M IOPS / 25 GB/s │
| └────┘ | └───────────────┘ | |
└──────────────────────────────┘ └─────────────────────┘
Description
Proxmox 7.2 (kernel version 5.15.53-1-pve) is installed on a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor, 512GB of RAM, and a single Mellanox dual-port 100Gbit network adapter. The Mellanox adapter is a x16 Gen3 device with a maximum throughput of 126Gbit/s (limited by the PCIe connectivity). The AMD processor is running with NPS=4 and without hyperthreading.
Thirty-two virtual machines are provisioned on the host. Each VM is installed with Ubuntu 22.04.01 LTS, running Linux kernel version 5.15.0-1018-kvm. The VMs have four virtual CPUs and 4GB of RAM. The logical CPU overcommit is 4:1 (128 provisioned vCPUs running on a 32-core processor).
On each VM, fio-3.28 runs in “server” mode. We use an external controller node to coordinate benchmark runs across the VMs.
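For reference, fio's server/client model works roughly as follows; host names below are hypothetical placeholders:

```shell
# On each VM: run fio as a server listening on the default port (8765).
fio --server --daemonize=/var/run/fio.pid

# On the controller node: drive the same job file on multiple workers
# and aggregate the results.
fio --client=vm01 --client=vm02 read-rand-bs4096-qd32.fio
```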
Each VM has a boot block device containing the root filesystem, separate from the storage under test. For each VM, we provision storage using pvesm alloc and attach it to the VM with qm set. Before each test run, the VMs are power cycled to ensure consistency.
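A sketch of the per-VM provisioning steps described above; the storage name, VM ID, volume name, and size are hypothetical:

```shell
# Allocate a 100G test volume for VM 100 on the storage pool
# (pvesm alloc <storage> <vmid> <filename> <size>).
pvesm alloc blockbridge 100 vm-100-disk-1 100G

# Attach it to the VM as a second SCSI disk, then power cycle
# the VM before testing for consistency.
qm set 100 --scsi1 blockbridge:vm-100-disk-1
qm stop 100 && qm start 100
```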
We test Proxmox configuration changes using a suite of 56 different I/O workloads. Each suite contains varying block sizes and queue depths. Each workload consists of a 1-minute warmup and a 20-minute measurement period. Each suite takes 19.6 hours to complete. A sample workload description appears below:
$ cat read-rand-bs4096-qd32.fio
[global]
rw=randread
direct=1
ioengine=libaio
time_based=1
runtime=1200
ramp_time=60
numjobs=1
[sdb]
filename=/dev/sdb
bs=4096
iodepth=32
To minimize system overhead, the storage protocol is NVMe/TCP. The backend storage system is sized for 25GB/s of throughput and 4 million random IOPS to eliminate any possible bottlenecks at the storage layer.
Software
Proxmox Version
# pveversion
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.53-1-pve)
Linux Kernel Options
BOOT_IMAGE=/boot/vmlinuz-5.15.53-1-pve root=/dev/mapper/pve-root ro quiet
Blockbridge Version
version: 6.1.0
release: 6667.1
build: 4056
Hardware
Server Platform
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R7515
Processor
Processor: 32 x AMD EPYC 7452 32-Core Processor (1 Socket)
Kernel Version: Linux 5.15.53-1-pve #1 SMP PVE 5.15.53-1 (Fri, 26 Aug 2022 16:53:52 +0200)
PVE Manager Version pve-manager/7.2-7/d0dd0e
Processor NUMA Configuration
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7452 32-Core Processor
Stepping: 0
CPU MHz: 3139.938
BogoMIPS: 4690.89
Virtualization: AMD-V
L1d cache: 1 MiB
L1i cache: 1 MiB
L2 cache: 16 MiB
L3 cache: 128 MiB
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
Network Adapter
Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Subsystem: Mellanox Technologies Mellanox ConnectX®-5 MCX516A-CCAT
Flags: bus master, fast devsel, latency 0, IRQ 624, NUMA node 1, IOMMU group 89
Memory at ac000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at ab100000 [disabled] [size=1M]
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
Network Adapter PCI Connectivity
[ 3.341416] mlx5_core 0000:41:00.0: firmware version: 16.26.1040
[ 3.341456] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 3.638556] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 3.638597] mlx5_core 0000:41:00.0: E-Switch: Total vports 4, per vport: max uc(1024) max mc(16384)
[ 3.641492] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
Network Adapter Link
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None RS BASER
Link partner advertised link modes: Not reported
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 100000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000004 (4)
link
Link detected: yes
Network Adapter Interrupt Coalesce Settings
Adaptive RX: on TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a
rx-usecs: 8
rx-frames: 128
rx-usecs-irq: n/a
rx-frames-irq: n/a
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: n/a
tx-frames-irq: n/a
BENCHMARKS
IOPS
IOPS charts show the relative performance of aio=io_uring and aio=native, with and without IOThreads. Each chart presents average IOPS results for eight different queue depths operating with a fixed block size. This section contains graphs for seven different block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K.
IOPS EFFECT OF IOTHREADS
The charts in the following section present the relative gain or loss in IOPS associated with using IOThreads, for both aio=native and aio=io_uring. A positive value indicates that using IOThreads increases IOPS. A negative value indicates that using IOThreads decreases IOPS.
IOPS EFFECT OF AIO MODE
The charts in the following section present the relative gain or loss in IOPS associated with using aio=io_uring, with and without IOThreads. A positive value indicates that aio=io_uring is faster than aio=native. A negative value indicates that aio=io_uring is slower than aio=native.
Note the consistent anomaly affecting aio=native at queue depth two. We’ve yet to uncover the source of the issue! Perhaps it’s Linux kernel specific.
LATENCY
Latency charts show the relative performance of aio=io_uring vs. aio=native, with and without IOThreads. The charts present average latency for eight different queue depths operating with a fixed block size. This section contains graphs for seven different block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K. Lower latency is better.
LATENCY EFFECT OF IOTHREADS
The charts in the following section present the relative gain or loss in latency associated with using IOThreads, for both aio=native and aio=io_uring. A negative value indicates that using IOThreads reduces (i.e., improves) latency.
LATENCY EFFECT OF AIO MODE
The charts in the following section present the relative gain or loss in latency associated with using aio=io_uring, with and without IOThreads. A positive value indicates that aio=io_uring results in higher latency than aio=native. A negative value indicates that aio=io_uring results in lower latency than aio=native.
BANDWIDTH
Bandwidth charts show the relative performance of aio=io_uring and aio=native, with and without IOThreads. The charts present average bandwidth results for eight different queue depths operating with a fixed block size. This section contains graphs for seven different block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K.