OVERVIEW
Proxmox exposes several QEMU storage configuration options through its interfaces. This technote aims to quantify the system impact of the asynchronous I/O and I/O threading model options.
Rather than emphasize the maximum performance of the underlying storage solution, this technote focuses on the aggregate performance of an enterprise server running 32 virtual machines to assess the capabilities of Proxmox/QEMU/KVM operating under moderate to heavy load.
The system under test is a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor and Mellanox 100Gbit networking. The server is installed with Proxmox 7.2 and our Proxmox storage plugin version 3.0. The backend storage software is Blockbridge 6. The network storage protocol is NVMe/TCP.
The testbed consists of 32 virtual machines operating on a single host, each configured with four virtual CPUs. Each virtual machine operates as a fio worker. A single virtual disk is attached to each virtual machine for testing. For each set of Proxmox configuration options considered, we execute a battery of concurrent I/O tests with varying queue depths and block sizes. Each data point collected represents the average performance over a 20-minute interval following a 1-minute warm-up.
Our findings, a TL;DR for network-attached storage, and our raw benchmark data are presented in the sections below.
PROXMOX I/O OPTIONS
What is aio native?
In Proxmox, the AIO disk parameter selects the method for implementing asynchronous I/O. Asynchronous I/O allows QEMU to issue multiple transfer requests to the hypervisor without serializing QEMU’s centralized scheduler.
AIO is also the name of a Linux system interface for performing asynchronous I/O, introduced in Linux 2.6. Setting aio=native in Proxmox informs the system to use the Linux AIO system interface for managing asynchronous I/O.
In the Linux AIO model, submission and completion operations are system calls. The primary issue with the Linux AIO implementation is that it can block in a variety of circumstances (i.e., buffered I/O, high queue depth, dm devices).
AIO blocks if anything in the I/O submission path is unable to complete inline. However, when used with raw block devices and caching disabled, AIO will not block. Therefore, it is a good choice for network-attached block storage.
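As a sketch of how this maps to Proxmox configuration, aio=native can be selected per disk when attaching raw block storage, with caching disabled. The VM ID and storage/volume names below are hypothetical placeholders for your environment:

```shell
# Hypothetical example: attach a raw block disk to VM 100 using
# Linux AIO; cache=none (O_DIRECT) is required for aio=native to
# remain non-blocking. "blockbridge" is a placeholder storage name.
qm set 100 --scsi0 blockbridge:vm-100-disk-0,aio=native,cache=none
```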
What is aio io_uring?
io_uring is an alternative Linux system interface for executing concurrent asynchronous I/O. Similar to Linux AIO, it allows QEMU to issue multiple transfer requests without serializing QEMU’s centralized scheduler.
Unlike Linux AIO, io_uring leverages independent shared memory queues for command submission and completion instead of system calls. One architectural goal of io_uring is to minimize latency by reducing system call overhead. However, in our experience, it is often difficult to extract the full performance potential of io_uring in applications like QEMU due to an inherent need to share system resources fairly (i.e., busy polling isn’t a good option).
io_uring has one significant architectural advantage compared to AIO: it is guaranteed not to block. However, this does not necessarily result in operational gains, since Linux AIO is also non-blocking when used with O_DIRECT raw block storage.
What are IOThreads?
When a VM executes an asynchronous disk I/O operation, it issues a request to the hypervisor and waits for an event indicating completion. By default, this happens in the QEMU “main loop.” An IOThread provides a dedicated event loop, operating in a separate thread, for handling I/O. IOThreads offload work from the “main loop” into a separate thread that can execute concurrently.
There are several claimed advantages of IOThreads, including decreased latency, reduced contention, and improved scalability. However, nothing is free. IOThreads are resources that the hypervisor must prioritize and execute.
Note: you must select the virtio-scsi-single SCSI controller to enable IOThreads.
FINDINGS
IOThreads significantly improve performance for most workloads.
IOThreads deliver performance gains exceeding 15% at low queue depth. The performance benefits of an IOThread (for a single storage device) appear to diminish with increasing queue depth. However, in most cases, the benefits outweigh any potential consequences.
The graph below shows the percentage gains, averaged across block sizes, for each queue depth. For example, the data point QD=1 is the average gain measured for the .5KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB block sizes. The graph shows a nearly 15% average gain for both aio=native and aio=io_uring when IOThreads are enabled at QD=1.
aio=native and aio=io_uring offer competitive performance.
aio=native and aio=io_uring offer comparable overall performance. However, to grasp the full picture, we must compare the two with and without IOThreads.
In the absence of IOThreads, aio=io_uring outperforms aio=native in 7 out of 8 queue depths. When we use IOThreads, aio=native wins in 5 out of 8. Note that the values in the graphs are the average of gains across all block sizes for a given queue depth.
Neither AIO model is without quirks. aio=native, with an IOThread, offers the best IOPS for small to medium block sizes at medium queue depth. However, aio=native exhibits a consistent performance anomaly at QD=2. aio=io_uring offers consistent performance at QD=2, but has performance issues when the system is saturated. The anomaly degrades aio=native QD2 performance across all block sizes, giving aio=io_uring as much as a 10% advantage at QD2.
aio=native has a slight latency advantage for QD1 workloads
aio=native consistently outperforms aio=io_uring by a small margin. The benefits are more pronounced when running without an IOThread.
aio=io_uring performance degrades in extreme load conditions.
In the 128K block size tests, we see significant performance degradation at high queue depth with aio=io_uring. The impact on aio=native is less significant.
The graph below shows the average bandwidth for the 128K block size tests operating over a range of queue depths. The system is effectively saturated at queue depth 2. The peak system bandwidth measured is 12.8 GB/s. For higher queue depths, the performance of aio=io_uring drops significantly faster than that of aio=native. The difference in performance at peak test load is approximately 10%.
A dual-port Gen3 100G NIC is limited to 2 million IOPS with default settings
Proxmox is a highly capable platform for demanding storage applications. Using a single NIC, a single Proxmox host can achieve more than 1.5 million random read IOPS with sub-millisecond latency. However, the maximum performance of a single NIC was limited to roughly 2 million IOPS in a configuration where the backend storage is capable of significantly more.
A dual-port Gen3 100G NIC is limited to 12.8 GB/s with default settings
Our benchmark results were limited to 12.8 GB/s of throughput with default settings. With additional tuning parameters (outside the scope of this technote), we confirmed an ability to saturate the NIC’s x16 Gen3 PCIe link with approximately 14.7 GB/s of disk throughput.
SUMMARY
Our test data shows a clear and significant benefit that supports the use of IOThreads. Performance differences between aio=native and aio=io_uring were less significant. Except for the unusual behavior reported in our results for QD=2, aio=native offers slightly better performance (when deployed with an IOThread) and gets our vote for the most conservative pick.
Note that aio=native applies to unbuffered, O_DIRECT, raw block storage only; the disk cache policy must be set to none. Raw block storage types include iSCSI, CEPH/RBD, and NVMe. For other storage types and cache settings, aio=io_uring (plus an IOThread) is preferred because aio=native can block in these configurations.
Recommended Settings
Based on the data, we recommend:
Parameter | Description | Default In 7.2 | Recommended Setting
---|---|---|---
SCSI Controller | SCSI emulation driver type | n/a | virtio-scsi-single
AIO Type | System interface used for asynchronous I/O | io_uring | native
IOThreads | Provides dedicated threads for disk I/O | disabled | enabled
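The recommended settings above can be applied per VM from the Proxmox CLI. A minimal sketch, assuming a hypothetical VM 100 and a placeholder storage named "blockbridge":

```shell
# Switch the VM to the virtio-scsi-single controller, which gives
# each disk its own controller and allows IOThreads to be enabled.
qm set 100 --scsihw virtio-scsi-single

# Reattach the data disk with the recommended AIO, cache, and
# IOThread options (disk/volume names are placeholders).
qm set 100 --scsi0 blockbridge:vm-100-disk-0,aio=native,cache=none,iothread=1
```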
Note: you must select the virtio-scsi-single SCSI controller to enable IOThreads.
ENVIRONMENT
Network Diagram
┌──────────────────────────────┐ ┌─────────────────────┐
│ ┌────┐ | ┌───────────────┐ | │
│ ┌────┐ | PROXMOX 7.2 │── NVME/TCP ─┤ SN2100 - 100G ├──────── ┤ BLOCKBRIDGE 6.X │
│ ┌────┐ | | 100G DUAL PORT │ └───────────────┘ │ QUAD ENGINE │
│ │ 32 │ |─┘ X16 GEN3 │ ┌───────────────┐ │ 2X 100G DUAL PORT │
│ │ VM │─┘ 32 CORE AMD |── NVME/TCP ─┤ SN2100 - 100G ├──────── ┤ 4M IOPS / 25 GB/s │
| └────┘ | └───────────────┘ | |
└──────────────────────────────┘ └─────────────────────┘
Description
Proxmox 7.2 (kernel version 5.15.53-1-pve) is installed on a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor, 512GB of RAM, and a single Mellanox dual-port 100Gbit network adapter. The Mellanox adapter is a x16 Gen3 device with a maximum throughput of 126Gbit/s (limited by the PCIe connectivity). The AMD processor is running with NPS=4 and without hyperthreading.
Thirty-two virtual machines are provisioned on the host. Each VM is installed with Ubuntu 22.04.01 LTS, running Linux kernel version 5.15.0-1018-kvm. The VMs have four virtual CPUs and 4GB of RAM. The logical CPU overcommit is 4:1 (128 provisioned vCPUs running on a 32-core processor).
On each VM, fio-3.28 runs in “server” mode. We use an external controller node to coordinate benchmark runs across the VMs.
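For reference, fio's server/client model works roughly as follows; host names below are hypothetical placeholders:

```shell
# On each VM: run fio as a server listening on the default port (8765).
fio --server --daemonize=/var/run/fio.pid

# On the controller node: drive the same job file on multiple workers
# and aggregate the results.
fio --client=vm01 --client=vm02 read-rand-bs4096-qd32.fio
```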
Each VM has a boot block device containing the root filesystem, separate from the storage under test. For each VM, we provision storage using pvesm alloc and attach it to the VM with qm set. Before each test run, the VMs are power cycled to ensure consistency.
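A sketch of the per-VM provisioning steps described above; the storage name, VM ID, volume name, and size are hypothetical:

```shell
# Allocate a 100G test volume for VM 100 on the storage pool
# (pvesm alloc <storage> <vmid> <filename> <size>).
pvesm alloc blockbridge 100 vm-100-disk-1 100G

# Attach it to the VM as a second SCSI disk, then power cycle
# the VM before testing for consistency.
qm set 100 --scsi1 blockbridge:vm-100-disk-1
qm stop 100 && qm start 100
```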
We test Proxmox configuration changes using a suite of 56 different I/O workloads. Each suite contains varying block sizes and queue depths. Each workload consists of a 1-minute warmup and a 20-minute measurement period. Each suite takes 19.6 hours to complete. A sample workload description appears below:
$ cat read-rand-bs4096-qd32.fio
[global]
rw=randread
direct=1
ioengine=libaio
time_based=1
runtime=1200
ramp_time=60
numjobs=1
[sdb]
filename=/dev/sdb
bs=4096
iodepth=32
To minimize system overhead, the storage protocol is NVMe/TCP. The backend storage system is sized for 25GB/s of throughput and 4 million random IOPS to eliminate any possible bottlenecks at the storage layer.
Software
Proxmox Version
# pveversion
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.53-1-pve)
Linux Kernel Options
BOOT_IMAGE=/boot/vmlinuz-5.15.53-1-pve root=/dev/mapper/pve-root ro quiet
Blockbridge Version
version: 6.1.0
release: 6667.1
build: 4056
Hardware
Server Platform
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R7515
Processor
Processor: 32 x AMD EPYC 7452 32-Core Processor (1 Socket)
Kernel Version: Linux 5.15.53-1-pve #1 SMP PVE 5.15.53-1 (Fri, 26 Aug 2022 16:53:52 +0200)
PVE Manager Version pve-manager/7.2-7/d0dd0e
Processor NUMA Configuration
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7452 32-Core Processor
Stepping: 0
CPU MHz: 3139.938
BogoMIPS: 4690.89
Virtualization: AMD-V
L1d cache: 1 MiB
L1i cache: 1 MiB
L2 cache: 16 MiB
L3 cache: 128 MiB
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
Network Adapter
Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Subsystem: Mellanox Technologies Mellanox ConnectX®-5 MCX516A-CCAT
Flags: bus master, fast devsel, latency 0, IRQ 624, NUMA node 1, IOMMU group 89
Memory at ac000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at ab100000 [disabled] [size=1M]
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
Network Adapter PCI Connectivity
[ 3.341416] mlx5_core 0000:41:00.0: firmware version: 16.26.1040
[ 3.341456] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 3.638556] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 3.638597] mlx5_core 0000:41:00.0: E-Switch: Total vports 4, per vport: max uc(1024) max mc(16384)
[ 3.641492] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
Network Adapter Link
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None RS BASER
Link partner advertised link modes: Not reported
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 100000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000004 (4)
link
Link detected: yes
Network Adapter Interrupt Coalesce Settings
Adaptive RX: on TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a
rx-usecs: 8
rx-frames: 128
rx-usecs-irq: n/a
rx-frames-irq: n/a
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: n/a
tx-frames-irq: n/a
BENCHMARKS
IOPS
IOPS charts show the relative performance of aio=io_uring and aio=native, with and without IOThreads. Each chart presents average IOPS results for eight different queue depths operating with a fixed block size. This section contains graphs for seven different block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K.
IOPS EFFECT OF IOTHREADS
The charts in the following section present the relative gain or loss in IOPS associated with using IOThreads, for both aio=native and aio=io_uring. A positive value indicates that using IOThreads increases IOPS. A negative value indicates that using IOThreads decreases IOPS.
IOPS EFFECT OF AIO MODE
The charts in the following section present the relative gain or loss in IOPS associated with using aio=io_uring, with and without IOThreads. A positive value indicates that aio=io_uring is faster than aio=native. A negative value indicates that aio=io_uring is slower than aio=native.
Note the consistent anomaly affecting aio=native at queue depth two. We’ve yet to uncover the source of the issue! Perhaps it’s Linux kernel specific.
LATENCY
Latency charts show the relative performance of aio=io_uring vs. aio=native, with and without IOThreads. The charts present average latency for eight different queue depths operating with a fixed block size. This section contains graphs for seven different block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K. Lower latency is better.
LATENCY EFFECT OF IOTHREADS
The charts in the following section present the relative gain or loss in latency associated with using IOThreads, for both aio=native and aio=io_uring. A negative value indicates that using IOThreads reduces (i.e., improves) latency.
LATENCY EFFECT OF AIO MODE
The charts in the following section present the relative gain or loss in latency associated with using aio=io_uring, with and without IOThreads. A positive value indicates that aio=io_uring results in higher latency than aio=native. A negative value indicates that aio=io_uring results in lower latency than aio=native.
BANDWIDTH
Bandwidth charts show the relative performance of aio=io_uring and aio=native, with and without IOThreads. The charts present average bandwidth results for eight different queue depths operating with a fixed block size. This section contains graphs for seven different block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K.