OVERVIEW
This technote is the third installment in a series of technical articles devoted to optimizing Windows on Proxmox. Part 3 explores the computational efficiency of individual storage controller configurations when subjected to a constrained bandwidth workload. We present raw efficiency data, considering CPU utilization and operating system context switches, and employ a scoring system to compare efficiency across different configurations.
Workload Description
Using fio, we ran a random read workload with large block sizes in a Windows Server 2022 guest, targeting a raw physical device. The backend storage's Adaptive QoS controls maintained a consistent workload, delivering 1GiB/s of bandwidth to the guest. Utilization metrics on the hypervisor were captured using the perf tool over a 10-minute interval.
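For illustration, the job file below is a minimal fio sketch that approximates the workload described above; the target device, block size, queue depth, and runtime are assumptions and do not reproduce the exact parameters used in testing.

```
; Hypothetical fio job approximating the test workload; the device path,
; block size, and queue depth are illustrative assumptions.
[randread-large]
; Native Windows asynchronous I/O against a raw physical device in the guest
ioengine=windowsaio
filename=\\.\PhysicalDrive1
; Large-block random reads at a moderate queue depth, bypassing the cache
rw=randread
bs=1M
iodepth=32
direct=1
; Run for a fixed 10-minute interval
time_based=1
runtime=600
```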
The following sections present the raw data, data visualizations, normalized data, derived efficiency scores, and conclusions.
TECHNOTE SERIES
Abstract
Proxmox exposes several QEMU storage configuration options through its management interfaces. This technote series aims to quantify the efficiency of every storage controller type supported by Windows Server 2022, operating in each of its available AIO modes. The purpose is to provide unbiased quantitative guidance for selecting an optimal configuration for Proxmox Windows guests running on shared block storage.
Methodology
Rather than emphasize maximum performance, we take precise measurements of system CPU cycles, userland CPU cycles, and system context switches under fixed workload conditions. The measurements provide the basis for an efficiency comparison from the perspective of the CPU and Operating System scheduler.
The Proxmox system under test is a SuperMicro H13 server with a single AMD Zen4 9554P 64-core processor and 768GiB of DDR5 operating at 4800MT/s. All testing was performed on Proxmox 8.0.3, running Linux Kernel 6.2.16-3. Workloads were generated using a Windows Server 2022 guest with fio executing against a raw physical device configured in the guest. All compatible storage controllers (e.g., sata, ide, virtio, vmware) and aio modes (i.e., native, iouring, threads) were tested against iSCSI shared block storage.
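As a minimal sketch of how a controller and AIO mode combination is selected per disk, the Proxmox CLI commands below use a hypothetical VMID (100), storage name (iscsi-lun), and volume name; they illustrate the mechanism rather than the exact configuration procedure used in testing.

```
# Hypothetical example: select the virtio-scsi controller and attach the
# guest disk with aio=native and a dedicated iothread. The VMID, storage
# name, and volume name are placeholders.
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 iscsi-lun:vm-100-disk-0,aio=native,iothread=1
```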
The storage system used for testing is a Blockbridge NVME-24-SMC-ZEN4 with an AMD Zen4 9554P processor, Micron 7450 Gen4 NVMe devices, and 200Gbit storage connectivity. The Blockbridge storage operating system is version 6.0.
Tests were conducted using a single Windows Server 2022 virtual machine on an idle system. The Windows guest was configured with a specific controller and aio mode combination for each test and rebooted. Each data point collected represents a 10-minute I/O test. Multiple runs for each data point were collected to validate consistency.
Significant efforts were made to isolate the CPU workload to ensure accurate measurements. With an awareness of the underlying CPU architecture, we leveraged host CPU affinity, guest CPU affinity, network IRQ affinity, transmit flow steering, receive flow steering, and Linux work queue affinity to ensure all “work” performed during the test runs was measurable by the system and hardware profiling tools, as well as to eliminate any architecture-specific performance artifacts.
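The commands below sketch the general technique of pinning the guest and steering interrupts to a known CPU set; the CPU range, VMID, pidfile path, and IRQ number are hypothetical and do not reflect the exact layout used in testing.

```
# Hypothetical sketch of host-side isolation. The CPU range, VMID, and IRQ
# number are placeholders; the pidfile path assumes the standard Proxmox
# location for QEMU guests.
QEMU_PID=$(cat /var/run/qemu-server/100.pid)

# Pin all QEMU threads (vCPUs and workers) to CPUs 0-7
taskset -a -c -p 0-7 "$QEMU_PID"

# Steer a network interrupt to the same CPU set
echo 0-7 > /proc/irq/120/smp_affinity_list
```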
To ensure consistent I/O workloads, we leveraged the programmable QoS features of the Blockbridge storage system. Initial attempts to rate limit storage using QEMU's inbuilt features were found to be extraordinarily CPU intensive, to vary based on configuration, and to unfairly bias the efficiency measurements. Our external rate-limiting approach results in more stable workloads and ensures that the efficiency measurements do not include rate-limiting overhead. A sketch of the in-QEMU throttling style we avoided is shown below.
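The command below is a hedged example of Proxmox's built-in per-disk throttle options; the VMID, storage, volume, and limit value are placeholders.

```
# Hypothetical example of an in-QEMU read-bandwidth throttle (MB/s) applied
# through the Proxmox disk options. This style of limiting was avoided in
# this study in favor of external QoS on the storage system.
qm set 100 --scsi0 iscsi-lun:vm-100-disk-0,mbps_rd=1024
```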
Series Links
- Part 1: An Introduction to Supported Windows Storage Controllers, AIO modes, and Efficiency Metrics.
- Part 4: Unloaded Performance Study of Windows Guests on Shared Block Storage.
EFFICIENCY MEASUREMENTS
Context Switches Per MiB
Context switch rates serve as reliable indicators of real-world performance. In a heavily loaded system, controllers with higher context switch rates are more prone to delivering inconsistent bandwidth. In our experience, elevated context switch rates can be linked to inefficient serialization or the necessary segmentation of I/O at the block layer. Regardless of the cause, performance is significantly influenced by the operating system scheduler, and a large number of runnable processes often leads to more inconsistent performance.
The data below presents the average number of context switches per MiB transferred for each controller and asynchronous I/O mode. Context switches were measured using perf and scoped to include only the CPUs executing the guest, QEMU worker threads, kernel iSCSI initiator logic, and related system functions (i.e., IRQs). An adaptive external IOPS QoS limiter was used to ensure a consistent transfer rate of 1GiB/s for each test configuration.
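A hedged sketch of this style of measurement is shown below; the CPU list is a placeholder for the isolated CPU set, and the actual perf invocation and event scoping used in testing are not reproduced here.

```
# Hypothetical example: count context switches and CPU cycles on the CPUs
# dedicated to the guest and its I/O path over a 10-minute window.
# The CPU list (0-7) is a placeholder.
perf stat -e context-switches,cycles -C 0-7 -- sleep 600
```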
(Chart: average context switches per MiB by controller and AIO mode; bars are colored by AIO mode, with one color for aio=native, green for aio=iouring, and orange for aio=threads.)
CPU Utilization
Every combination of guest storage controller, device driver, AIO mode, and hypervisor driver has distinct efficiency characteristics. Using a profiler, we can gauge CPU cycle consumption to identify the most effective configurations in terms of computing performance. In simple terms, the fewer cycles used for I/O, the more cycles remain available for computation (and other I/O tasks).
The data below presents the average number of combined user and system CPU cycles per MiB transferred for each controller and asynchronous I/O mode. CPU cycles were measured using perf and scoped to include only the CPUs executing the guest, QEMU worker threads, kernel iSCSI initiator logic, and related system functions (i.e., IRQs). An adaptive external IOPS QoS limiter was used to ensure a sustained transfer rate of 1GiB/s for each test configuration.
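To make the per-MiB figures concrete: assuming the QoS limiter holds the transfer rate at exactly 1GiB/s for the full 10-minute interval, the total data moved is fixed, and each raw counter is divided by that total:

\[\mathit{total\_MiB} = 1\,\mathrm{GiB/s} \times 600\,\mathrm{s} \times 1024\,\mathrm{MiB/GiB} = 614{,}400\,\mathrm{MiB}\]

\[\mathit{cpu\_cycles\_per\_MiB} = \frac{\mathit{total\_cycles}}{614{,}400}, \qquad \mathit{context\_switches\_per\_MiB} = \frac{\mathit{total\_context\_switches}}{614{,}400}\]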
(Chart: average combined user and system CPU cycles per MiB by controller and AIO mode; bars are colored by AIO mode, with one color for aio=native, green for aio=iouring, and orange for aio=threads.)
Data Analysis
Measured CPU Cycles and Context Switch Scatter Plot
The scatter plot below shows the correlation between context switch efficiency and CPU utilization. For ease of comparison, the data points are styled as follows:
- A symbol corresponds to a storage controller.
- The color of a symbol corresponds to the AIO mode.
Normalized CPU Cycles and Context Switch Scatter Plot
To visualize the correlation between context switches and CPU more accurately, we can normalize the data to eliminate scaling effects. We use unity-based normalization (i.e., min-max feature scaling) to project the data into the range of [0,1] as follows:
\[\mathit{normalized\_value} = \frac{val - Min(val)}{Max(val) - Min(val)}\]
The graph below plots the normalized values for the context switch and CPU cycle efficiency measurements on the same graph. For ease of comparison, the data points are displayed as follows:
- A symbol corresponds to a storage controller.
- The color of a symbol corresponds to the AIO mode.
Normalized Efficiency Scores
In the previous section, we presented a graph of normalized values for context switches and CPU cycles. Using these normalized values, we can derive a single score to compare the efficiency of the configurations under test. Our score is simply the length of the vector extending from the origin to a configuration's data point and is calculated as follows:
\[\mathit{efficiency\_score} = \sqrt{\mathit{normalized\_cpu\_cycles\_per\_MiB}^2 + \mathit{normalized\_context\_switches\_per\_MiB}^2}\]
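As a purely hypothetical illustration (the numbers are invented for clarity and are not measured values), a configuration whose CPU cycles per MiB normalize to 0.25 and whose context switches per MiB normalize to 0.10 would score:

\[\mathit{efficiency\_score} = \sqrt{0.25^2 + 0.10^2} \approx 0.27\]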
The chart below presents a comparison of efficiency scores derived from the normalized CPU cycle and context switch measurement data. Lower scores are better.
(Chart: efficiency scores by controller and AIO mode; bars are colored by AIO mode, with one color for aio=native, green for aio=iouring, and orange for aio=threads.)
Table of Measured Values, Rankings, and Efficiency Scores
The table below presents raw data, normalized scoring, and efficiency rankings for context switches and CPU cycles.
CONCLUSIONS
- The virtio-scsi controller with aio=native achieves the best overall efficiency score.
- aio=native is the most efficient aio mode for each controller type; aio=threads is the least efficient.
- With aio=native, virtio-scsi is 4% more CPU intensive than virtio-blk but generates 25% fewer context switches.
- With virtio-scsi and aio=native, an iothread introduces a small CPU efficiency overhead of 1.5% but reduces context switches by 5%.
- vmware-pvscsi is the most efficient storage controller option natively supported by Windows Server 2022.
- vmware-pvscsi with aio=native consumes 60% less CPU and generates 40% fewer context switches than vmware-pvscsi with aio=iouring.
- The SATA and IDE controllers achieve the worst efficiency scores, primarily due to high context switching rates.
Important Considerations
- The data set in this technote focuses on the system efficiency of a bandwidth-centric workload characterized by moderate queue depths of large I/Os. Data for IOPS-focused workloads may differ.
- Our data presents a complete system view that includes the interactions between system hardware, network drivers, storage drivers, Linux, QEMU, guest storage drivers, and Windows itself.
- System efficiency is a reliable indicator of expected storage performance under moderate to heavy system load. However, it does not necessarily correlate with peak performance on an unloaded system.
- Synthetic efficiency scores derived from this dataset are normalized values and should not be used for comparison with other datasets.
ADDITIONAL RESOURCES
- Blockbridge // Proxmox Overview
- Blockbridge // Proxmox Storage Guide
- Blockbridge // Optimizing Proxmox: iothreads, aio, & io_uring
- Blockbridge // Proxmox & ESXi Performance Comparison
- Blockbridge // Low Latency Storage Optimizations For Proxmox, KVM, & QEMU
- Blockbridge // Optimizing Proxmox Storage for Windows