OVERVIEW

This technote is the third installment in a series of technical articles devoted to optimizing Windows on Proxmox. Part 3 explores the computational efficiency of individual storage controller configurations when subjected to a constrained bandwidth workload. We present raw efficiency data, considering CPU utilization and operating system context switches, and employ a scoring system to compare efficiency across different configurations.

Workload Description

Using fio, we ran a random read workload with large block sizes in a Windows Server 2022 guest, targeting a raw physical device. The backend storage’s Adaptive QoS controls maintained a consistent workload, delivering 1GiB/s of bandwidth to the guest. Utilization metrics on the hypervisor were captured using the perf tool over a 10-minute interval.
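
To make the workload concrete, the sketch below shows one way to drive an equivalent guest-side fio job from Python. The block size, queue depth, fio path, and device path are illustrative assumptions rather than the exact parameters used in our testing; the 1GiB/s rate is enforced by the storage array's QoS, not by fio.

    # Hypothetical guest-side launcher for the workload described above.
    # Block size, queue depth, and target device are assumptions, not the
    # values used in the published tests.
    import subprocess

    FIO = r"C:\fio\fio.exe"           # assumed fio install path in the guest
    TARGET = r"\\.\PhysicalDrive1"    # raw physical device presented to the guest

    subprocess.run([
        FIO,
        "--name=randread-bw",
        f"--filename={TARGET}",
        "--ioengine=windowsaio",      # Windows native asynchronous I/O engine
        "--direct=1",                 # bypass the guest page cache
        "--rw=randread",              # random read workload
        "--bs=1M",                    # "large block size" -- assumed value
        "--iodepth=32",               # moderate queue depth -- assumed value
        "--time_based",
        "--runtime=600",              # 10-minute measurement interval
    ], check=True)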

The following sections present the raw data, data visualizations, normalized data, derived efficiency scores, and conclusions.

TECHNOTE SERIES

Abstract

Proxmox exposes several QEMU storage configuration options through its management interfaces. This technote series aims to quantify the efficiency of every storage controller type supported by Windows Server 2022, operating in each of the available AIO modes. The purpose is to provide unbiased quantitative guidance for selecting an optimal configuration for Proxmox Windows guests running on shared block storage.

Methodology

Rather than emphasize maximum performance, we take precise measurements of system CPU cycles, userland CPU cycles, and system context switches under fixed workload conditions. These measurements provide the basis for an efficiency comparison from the perspective of the CPU and the operating system scheduler.

The Proxmox system under test is a SuperMicro H13 server with a single AMD Zen4 9554P 64-core processor and 768GiB of DDR5 operating at 4800MT/s. All testing was performed on Proxmox 8.0.3, running Linux kernel 6.2.16-3. Workloads were generated using a Windows Server 2022 guest with fio executing against a raw physical device configured in the guest. All compatible storage controllers (e.g., sata, ide, virtio, vmware) and aio modes (i.e., native, iouring, threads) were tested against iSCSI shared block storage.

The storage system used for testing is a Blockbridge NVME-24-SMC-ZEN4 with an AMD Zen4 9554P processor, Micron 7450 Gen4 NVMe devices, and 200Gbit storage connectivity. The Blockbridge storage operating system is version 6.0.

Tests were conducted using a single Windows Server 2022 virtual machine on an idle system. The Windows guest was configured with a specific controller and aio mode combination for each test and rebooted. Each data point collected represents a 10-minute I/O test. Multiple runs for each data point were collected to validate consistency.

Significant efforts were made to isolate the CPU workload to ensure accurate measurements. With an awareness of the underlying CPU architecture, we leveraged host CPU affinity, guest CPU affinity, network IRQ affinity, transmit flow steering, receive flow steering, and Linux work queue affinity to ensure all “work” performed during the test runs was measurable by the system and hardware profiling tools, as well as to eliminate any architecture-specific performance artifacts.
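
For illustration, the minimal sketch below pins every thread of a QEMU process to a fixed set of host cores using Linux's sched_setaffinity interface. The PID and core list are placeholders, and the IRQ, flow-steering, and work-queue affinity settings mentioned above are configured separately and not shown.

    # Minimal host-side sketch: pin all threads of a QEMU process to a fixed
    # CPU set so that guest and worker-thread activity stays on cores covered
    # by the profiling scope. PID and core list are hypothetical placeholders.
    import os

    QEMU_PID = 12345                # hypothetical QEMU process ID
    HOST_CORES = {8, 9, 10, 11}     # hypothetical cores in the NIC-local NUMA domain

    for tid in os.listdir(f"/proc/{QEMU_PID}/task"):
        os.sched_setaffinity(int(tid), HOST_CORES)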

To ensure consistent I/O workloads, we leveraged the programmable QoS features of the Blockbridge storage system. Initial attempts to rate limit storage using QEMU's built-in features proved extraordinarily CPU intensive, varied with configuration, and unfairly biased the efficiency measurements. Our external rate-limiting approach produces more stable workloads and ensures that efficiency measurements do not include rate-limiting overhead.

EFFICIENCY MEASUREMENTS

Context Switches Per MiB

Context switch rates serve as reliable indicators of real-world performance. In a heavily loaded system, controllers with higher context switch rates are more prone to delivering inconsistent bandwidth. In our experience, elevated context switch rates can be linked to inefficient serialization or necessary segmentation of I/O at the block layer. Nevertheless, performance is significantly influenced by the operating system scheduler, and a large number of runnable processes often leads to less consistent results.

The data below presents the average number of context switches per MiB transferred for each controller and asynchronous I/O mode. Context switches were measured using perf and scoped to include only the CPUs executing the guest, QEMU worker threads, kernel iSCSI initiator logic, and related system functions (e.g., IRQs). An adaptive external IOPS QoS limiter was used to ensure a consistent transfer rate of 1GiB/s for each test configuration.

CPU Utilization

Every combination of guest storage controller, device driver, AIO mode, and hypervisor driver has distinct efficiency characteristics. Using a profiler, we can gauge CPU cycle consumption to identify the most effective configurations in terms of computing performance. In simple terms, the fewer cycles used for I/O, the more cycles remain available for computation (and other I/O tasks).

The data below presents the average number of combined user and system CPU cycles per MiB transferred for each controller and asynchronous I/O mode. CPU cycles were measured using perf and scoped to include only the CPUs executing the guest, QEMU worker threads, kernel iSCSI initiator logic, and related system functions (e.g., IRQs). An adaptive external IOPS QoS limiter was used to ensure a sustained transfer rate of 1GiB/s for each test configuration.
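
Both per-MiB metrics reduce to a ratio of a perf counter total to the number of mebibytes transferred during the 10-minute run. The sketch below illustrates the arithmetic; the counter values are placeholders, not measured results.

    # Illustrative derivation of the per-MiB efficiency metrics from perf totals.
    # The counter values below are placeholders, not measured data.
    MIB = 1024 * 1024

    bytes_transferred = 1 * 1024**3 * 600    # 1GiB/s sustained for 600 seconds
    cpu_cycles = 1.25e12                     # combined user + system cycles (example)
    context_switches = 3.0e6                 # context-switch count (example)

    mib_transferred = bytes_transferred / MIB
    cycles_per_mib = cpu_cycles / mib_transferred
    switches_per_mib = context_switches / mib_transferred

    print(f"{cycles_per_mib:,.0f} cycles/MiB, {switches_per_mib:.3f} switches/MiB")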

Data Analysis

Measured CPU Cycles and Context Switch Scatter Plot

The scatter plot below shows the correlation between context switch efficiency and CPU utilization. For ease of comparison, the data points are styled as follows:

  • Each symbol corresponds to a storage controller.
  • The color of each symbol corresponds to the AIO mode.

Normalized CPU Cycles and Context Switch Scatter Plot

To visualize the correlation between context switches and CPU cycles more accurately, we can normalize the data to eliminate scaling effects. We use unity-based normalization (i.e., min-max feature scaling) to project the data onto the range [0, 1] as follows:

\[\mathit{normalized\_value} = \frac{\mathit{val} - \min(\mathit{val})}{\max(\mathit{val}) - \min(\mathit{val})}\]
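
As a minimal sketch, the normalization can be applied per metric across all tested configurations as follows; the input values are illustrative only.

    # Unity-based (min-max) normalization of one metric across configurations.
    def min_max_normalize(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    cycles_per_mib = [1.8e6, 2.4e6, 3.9e6, 7.2e6]    # example measurements
    print(min_max_normalize(cycles_per_mib))         # projected onto [0, 1]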

The graph below plots the normalized values for context switch and CPU cycle efficiency measurements on the same graph. For ease of comparison, the data points are displayed as follows:

  • Each symbol corresponds to a storage controller.
  • The color of each symbol corresponds to the AIO mode.

Normalized Efficiency Scores

In the previous section, we presented a graph of normalized values for context switches and CPU cycles. Using these normalized values, we can derive a single score to compare the efficiency of the configurations under test. The score is simply the length of the vector extending from the origin to each configuration's normalized data point and is calculated as follows:

\[\mathit{efficiency\_score} = \sqrt{\mathit{normalized\_cpu\_cycles\_per\_MiB}^2 + \mathit{normalized\_context\_switches\_per\_MiB}^2}\]
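
In code, the score is the Euclidean distance from the origin in the normalized (CPU cycles, context switches) plane, as in the brief sketch below; the inputs are assumed to be already normalized, and the example values are illustrative.

    # Efficiency score: distance from the origin in the normalized plane.
    import math

    def efficiency_score(norm_cycles_per_mib, norm_switches_per_mib):
        return math.hypot(norm_cycles_per_mib, norm_switches_per_mib)

    print(efficiency_score(0.10, 0.25))   # example normalized pair; lower is better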

The chart below presents a comparison of efficiency scores derived from the normalized CPU cycles and context switch measurement data. Lower scores are better.

Table of Measured Values, Rankings, and Efficiency Scores

The table below presents raw data, normalized scoring, and efficiency rankings for context switches and CPU cycles. Click on any header to sort the data to your liking.

CONCLUSIONS

  • The virtio-scsi controller with aio=native achieves the best overall efficiency score.

  • aio=native is the most efficient aio mode for each controller type; aio=threads is the least efficient.

  • With aio=native, virtio-scsi is 4% more CPU intensive than virtio-blk but generates 25% fewer context switches.

  • With virtio-scsi and aio=native, an iothread introduces a small CPU efficiency overhead of 1.5% but reduces context switches by 5%.

  • vmware-pvscsi is the most efficient storage controller option natively supported by Windows Server 2022.

  • vmware-pvscsi with aio=native consumes 60% less CPU and generates 40% fewer context switches than vmware-pvscsi with aio=iouring.

  • The SATA and IDE controllers achieve the worst efficiency scores primarily due to high context switching rates.

Important Considerations

  • The data set in this technote focuses on the system efficiency of a bandwidth-centric workload characterized by moderate queue depths of large I/Os. Data for IOPS-focused workloads may differ.

  • Our data presents a complete system view that includes the interactions between system hardware, network drivers, storage drivers, Linux, QEMU, guest storage drivers, and Windows itself.

  • System efficiency is a reliable indicator of expected storage performance under moderate to heavy system load. However, it does not necessarily correlate with peak performance on an unloaded system.

  • Synthetic efficiency scores derived from this dataset are normalized values and should not be used for comparison with other datasets.

ADDITIONAL RESOURCES