OVERVIEW
This technote is the fourth installment in a series of technical articles devoted to optimizing Windows on Proxmox. In Part 4, we quantify and compare IOPS, bandwidth, and latency across all storage controllers and AIO modes under ideal conditions, using Windows Server 2022 running on Proxmox with iSCSI shared block storage.
Workload Description
Using fio, we executed several random read workloads within a Windows Server 2022 guest, directing them towards a raw physical device, and collected data over a 10-minute interval. The subsequent sections provide comparisons of the raw data. We then use this data to calculate a performance score and present a comparative analysis.
TECHNOTE SERIES
Abstract
Proxmox exposes several QEMU storage configuration options through its management interfaces. This technote series aims to quantify the efficiency of all storage controller types operating in all possible AIO modes that are supported by Windows Server 2022. The purpose is to provide unbiased quantitative guidance for selecting an optimal configuration for Proxmox Windows guests running on shared block storage.
Methodology
Rather than emphasize maximum performance, we take precise measurements of system CPU cycles, userland CPU cycles, and system context switches under fixed workload conditions. The measurements provide the basis for an efficiency comparison from the perspective of the CPU and Operating System scheduler.
The Proxmox system under test is a SuperMicro H13 server with a single AMD Zen4 9554P 64-core processor and 768GiB of DDR5 operating at 4800MT/s. All testing was performed on Proxmox 8.0.3, running Linux Kernel 6.2.16-3. Workloads were generated using a Windows Server 2022 guest with fio executing against a raw physical device configured in the guest. All compatible storage controllers (e.g., sata, ide, virtio, vmware) and aio modes (i.e., native, io_uring, threads) were tested against iSCSI shared block storage.
The storage system used for testing is a Blockbridge NVME-24-SMC-ZEN4 with an AMD Zen4 9554P processor, Micron 7450 Gen4 NVMe devices, and 200Gbit storage connectivity. The Blockbridge storage operating system is version 6.0.
Tests were conducted using a single Windows Server 2022 virtual machine on an idle system. The Windows guest was configured with a specific controller and aio mode combination for each test and rebooted. Each data point collected represents a 10-minute I/O test. Multiple runs for each data point were collected to validate consistency.
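For reference, the lines below sketch how one such controller and AIO combination might appear in a Proxmox VM configuration file. The storage and volume names are placeholders rather than the exact values used in testing; other combinations are selected analogously (e.g., a virtio0, sata0, or ide0 disk entry, or aio=io_uring / aio=threads on the disk line).

# /etc/pve/qemu-server/<vmid>.conf (illustrative excerpt)
scsihw: virtio-scsi-single
scsi0: <storage>:<volume>,aio=native,iothread=1,cache=none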
Significant efforts were made to isolate the CPU workload to ensure accurate measurements. With an awareness of the underlying CPU architecture, we leveraged host CPU affinity, guest CPU affinity, network IRQ affinity, transmit flow steering, receive flow steering, and Linux work queue affinity to ensure all “work” performed during the test runs was measurable by the system and hardware profiling tools, as well as to eliminate any architecture-specific performance artifacts.
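As a rough illustration of the knobs involved (the CPU ranges, IRQ number, interface name, and process ID below are placeholders, not the values used in our runs), this kind of isolation relies on standard Linux affinity interfaces:

# pin the guest's QEMU process to a fixed set of host cores
taskset -cp 0-7 <qemu-pid>
# steer the storage NIC's interrupt to a specific core range
echo 8-15 > /proc/irq/<irq>/smp_affinity_list
# direct receive packet steering (RPS) for a queue to a CPU bitmask
echo ff00 > /sys/class/net/<iface>/queues/rx-0/rps_cpus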
To ensure consistent I/O workloads, we leveraged the programmable QoS features of the Blockbridge storage system. Initial attempts to rate limit storage using inbuilt QEMU features were found to be extraordinarily CPU intensive, vary based on configuration, and unfairly bias the efficiency measurements. Our external rate-limiting approach results in more stable workloads and ensures that efficiency measurements do not include rate-limiting overhead.
Series Links
- Part 1: An Introduction to Supported Windows Storage Controllers, AIO modes, and Efficiency Metrics.
- Part 4: Unloaded Performance Study of Windows Guests on Shared Block Storage.
PERFORMANCE DATA
Average Latency QD1
QD1 latency stands out as the single most valuable indicator of I/O processing potential, providing a comprehensive assessment of end-to-end I/O processing overhead. QD1 latency benchmarks leave nowhere to hide.
Sequential Reads, 512-byte blocks, Single Outstanding I/O
The following chart graphs the average QD1 latency for each controller and AIO mode, measured within the Windows guest using fio over a 10-minute interval. Lower latencies indicate superior performance.
[Chart: average QD1 latency for each storage controller with aio=native, aio=io_uring (green), and aio=threads (orange).]
Upon closer examination of the data above, several valuable observations come to light:
- virtio-blk has lower per-I/O overhead than virtio-scsi, likely due to the absence of SCSI protocol translation.
- aio=io_uring is on the order of a microsecond slower than aio=native in most cases.
- Using iothreads reduces latency across all scenarios.
- aio=threads has higher overhead than aio=io_uring and aio=native.
- vmware-pvscsi has less driver overhead than SATA and IDE.
Workload Description For QD1 Latency Testing
[qd1-latency]
bs=512
rw=read
iodepth=1
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0
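As a usage note, a job file like the one above is saved to disk and passed directly to fio inside the guest; the file and output names here are illustrative:

fio --output-format=json --output=qd1-latency.json qd1-latency.fio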
Average IOPS QD128
Configurations that deliver high IOPS typically exhibit lower random memory accesses and fewer synchronization points, resulting in lower execution overhead. To focus our testing on the critical code paths associated with performance, we use a high queue depth random read workload with the smallest feasible data transfer size.
Random Reads, 512-byte blocks, 128 Queue Depth
The following graph details the average IOPS for each controller and AIO mode, measured within the Windows guest using fio over a 10-minute interval. Greater IOPS values indicate superior performance.
[Chart: average IOPS at QD128 for each storage controller with aio=native, aio=io_uring (green), and aio=threads (orange).]
Note that the IDE and SATA controllers demonstrate notably poor performance in comparison to virtio. This subpar performance is undoubtedly influenced by how the storage drivers communicate with the virtual hardware. Based on our experience, SATA and IDE are hindered by the substantial number of vm_exit() and vm_enter() calls required to support the programmed I/O model of the emulated ICH9 host chipset driver.
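This exit behavior can be observed from the Proxmox host using perf's KVM support. The sketch below is illustrative (the process selection and duration are arbitrary); it records and categorizes VM exits while the guest workload runs:

# record KVM exit events for the guest's QEMU process for ~30 seconds
perf kvm stat record -p <qemu-pid> -- sleep 30
# summarize exits by reason and count
perf kvm stat report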
Workload Description for QD128 IOPS Testing
[iops]
bs=512
rw=randread
iodepth=128
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0
Average Latency QD128
Latency under a heavy load provides insights into the responsiveness of a system when subjected to intense I/O activity. Workloads with high queue depths exert pressure on the effectiveness of I/O submission and completion paths as well as I/O scheduling.
Random Reads, 512-byte blocks, 128 Queue Depth
The chart below shows the average latency under load for each controller and AIO mode, assessed within the Windows guest using fio over a 10-minute interval. Lower latencies are generally preferred.
[Chart: average latency at QD128 for each storage controller with aio=native, aio=io_uring (green), and aio=threads (orange).]
Workload Description For QD128 Latency Testing
[qd128-latency]
bs=512
rw=randread
iodepth=128
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0
Average Bandwidth
For bandwidth measurements, we prefer a random read workload with a large block size. Bandwidth benchmarks are influenced to a lesser extent by the inline latency of control paths and more by the I/O scheduling characteristics.
Random Reads, 1MiB blocks, 32 Queue Depth
The chart below shows the average bandwidth attained for each controller and AIO mode, assessed within the Windows guest using fio over a 10-minute interval. Higher bandwidth values are preferred.
[Chart: average bandwidth for each storage controller with aio=native, aio=io_uring (green), and aio=threads (orange).]
For single-path storage, aio=native exhibits a slight advantage over aio=io_uring and a more noticeable advantage compared to aio=threads. Our research indicates that this is attributable to irregular I/O scheduling associated with userland threads and/or kernel tasklets.
For multipath storage, however, aio=io_uring and aio=threads can offer significant bandwidth improvements compared to aio=native.
Workload Description For Bandwidth Testing
[bandwidth]
bs=1M
rw=randread
iodepth=32
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0
Normalized Scores
For comparison via scoring, we generate normalized values for each performance metric (i.e., IOPS, bandwidth, QD1 latency, and QD128 latency) using unity-based normalization (i.e., min-max feature scaling). This assigns a relative score for each metric in the range of [0,1] as follows:
\[\mathit{normalized\_value} = \frac{val - Min(val)}{Max(val) - Min(val)}\]
Given four metrics to optimize (with equal weight), we generate a score that represents the magnitude of the vector in a 4D space as follows:
\[\mathit{normalized\_score} = \sqrt{\mathit{normalized\_iops}^2 + \mathit{normalized\_bw}^2 + \mathit{normalized\_qd1\_latency}^2 + \mathit{normalized\_qd128\_latency}^2}\]
The table below lists the scores alongside aggregated performance data. Lower scores are better.
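To make the scoring procedure concrete, here is a minimal Python sketch of the calculation. The input numbers are hypothetical placeholders rather than measured results, and the higher-is-better metrics (IOPS and bandwidth) are flipped after scaling, an assumption made here so that the combined score follows the lower-is-better convention stated above.

import math

def min_max(values):
    # unity-based (min-max) normalization into the range [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# hypothetical results per configuration:
# (iops, bandwidth_mbps, qd1_latency_us, qd128_latency_us)
results = {
    "virtio-scsi/native":   (120000, 3000,  90,  950),
    "virtio-blk/io_uring":  (110000, 2900,  88, 1000),
    "vmware-pvscsi/native": ( 50000, 2800, 105, 2600),
}

names = list(results)
iops, bw, qd1, qd128 = (min_max([results[n][i] for n in names]) for i in range(4))

scores = {}
for i, n in enumerate(names):
    # flip IOPS and bandwidth so that 0 is best for every metric (assumption),
    # then take the magnitude of the 4D vector as described above
    scores[n] = math.sqrt((1 - iops[i]) ** 2 + (1 - bw[i]) ** 2
                          + qd1[i] ** 2 + qd128[i] ** 2)

for n, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{n:24s} score={s:.3f}")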
CONCLUSIONS
- While the IOPS and latency data show the critical distinctions in performance among storage controllers, all controller and AIO combinations consistently deliver acceptable bandwidth.
- The virtio-scsi controller with aio=native and an iothread stands out as the top performer overall. Opting for virtio-scsi with aio=io_uring is also a viable option, although it tends to lag slightly in bandwidth-oriented tests.
- virtio-blk excels in QD1 performance, delivering a 2-microsecond improvement over virtio-scsi, likely attributed to reduced protocol overhead in the driver. However, it fails to deliver competitive IOPS scores.
- The most performant storage controller native to Windows Server 2022 is vmware-pvscsi. Both sata and ide deliver comparably poor performance.
- Compared to vmware-pvscsi, virtio-scsi exhibits significant performance advantages, including a 157% increase in IOPS, a 14% reduction in QD1 latency, and a 67% decrease in QD128 latency.
Important Considerations
- Synthetic efficiency scores derived from this dataset are normalized values and cannot be used for comparison with other datasets.
- The data presented in this technote describes performance on an unloaded system as a means of comparing base implementation efficiency.
- The testing was conducted using iSCSI shared block storage with a single path. Performance for NVMe shared block storage and multipath configurations may vary.
ADDITIONAL RESOURCES
- Blockbridge // Proxmox Overview
- Blockbridge // Proxmox Storage Guide
- Blockbridge // Optimizing Proxmox: iothreads, aio, & io_uring
- Blockbridge // Proxmox & ESXi Performance Comparison
- Blockbridge // Low Latency Storage Optimizations For Proxmox, KVM, & QEMU
- Blockbridge // Optimizing Proxmox Storage for Windows