OVERVIEW

This technote is the second installment in a series of technical articles devoted to optimizing Windows on Proxmox. Part 2 examines the computational efficiency of each storage controller configuration under a constrained IOPS workload. We present raw efficiency data covering CPU utilization and operating system context switches, and apply a scoring system that allows the efficiency of the different configurations to be compared directly.

Workload Description

Using fio, we executed a random read workload within a Windows Server 2022 guest targeting a raw physical device. The backend storage software incorporated Adaptive QoS controls, ensuring a sustained workload of 32K I/Os per second for the guest. To assess system performance, the perf tool was used to capture utilization metrics on the hypervisor throughout a 10-minute testing interval.
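
A minimal sketch of the guest-side job is shown below; the block size, queue depth, and physical device index are assumptions, since the article does not specify them (the rate limit is enforced externally by the storage system, so fio itself is unthrottled).

    # Illustrative sketch only: a 10-minute random-read fio job against a raw
    # physical device inside the Windows guest. Block size, queue depth, and
    # the device index are assumed values, not the authors' exact parameters.
    import subprocess

    fio_cmd = [
        "fio",
        "--name=randread-qos-32k-iops",
        "--filename=\\\\.\\PhysicalDrive1",  # raw physical device (assumed index)
        "--ioengine=windowsaio",             # Windows native asynchronous I/O
        "--rw=randread",                     # random read workload
        "--bs=4k",                           # assumed small block size
        "--iodepth=32",                      # assumed moderate queue depth
        "--direct=1",                        # bypass the guest page cache
        "--time_based",
        "--runtime=600",                     # 10-minute test interval
    ]

    subprocess.run(fio_cmd, check=True)      # rate is capped by storage-side QoS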

TECHNOTE SERIES

Abstract

Proxmox exposes several QEMU storage configuration options through its management interfaces. This technote series aims to quantify the efficiency of every storage controller type supported by Windows Server 2022, operating in each of the available AIO modes. The purpose is to provide unbiased quantitative guidance for selecting an optimal configuration for Proxmox Windows guests running on shared block storage.

Methodology

Rather than emphasize maximum performance, we take precise measurements of system CPU cycles, userland CPU cycles, and system context switches under fixed workload conditions. These measurements provide the basis for an efficiency comparison from the perspective of the CPU and the operating system scheduler.

The Proxmox system under test is a SuperMicro H13 server with a single AMD Zen4 9554P 64-core processor and 768GiB of DDR5 operating at 4800MT/s. All testing was performed on Proxmox 8.0.3, running Linux Kernel 6.2.16-3. Workloads were generated using a Windows Server 2022 guest with fio executing against a raw physical device configured in the guest. All compatible storage controllers (e.g., sata, ide, virtio, vmware) and AIO modes (i.e., native, io_uring, threads) were tested against iSCSI shared block storage.

The storage system used for testing is a Blockbridge NVME-24-SMC-ZEN4 with an AMD Zen4 9554P processor, Micron 7450 Gen4 NVMe devices, and 200Gbit storage connectivity. The Blockbridge storage operating system is version 6.0.

Tests were conducted using a single Windows Server 2022 virtual machine on an otherwise idle system. For each test, the Windows guest was configured with a specific controller and AIO mode combination and then rebooted. Each data point represents a 10-minute I/O test. Multiple runs were collected for each data point to validate consistency.
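
For illustration, a reconfiguration loop along the following lines can cycle the guest through the controller and AIO matrix using the Proxmox qm CLI. The VM ID, volume name, and controller list shown are assumptions, not our actual test harness.

    # Hypothetical reconfiguration loop using the Proxmox qm CLI. The VM ID,
    # volume name, and controller list are illustrative assumptions.
    import subprocess

    VMID = "101"                           # assumed VM ID
    VOLUME = "shared-iscsi:vm-101-disk-0"  # assumed shared block storage volume
    AIO_MODES = ["native", "io_uring", "threads"]

    # (disk bus option, optional SCSI controller model) pairs under test
    CONTROLLERS = [
        ("scsi0", "virtio-scsi-single"),   # virtio-scsi with a dedicated iothread
        ("scsi0", "pvscsi"),               # VMware paravirtual SCSI
        ("virtio0", None),                 # virtio-blk
        ("sata0", None),
        ("ide0", None),
    ]

    for bus, scsihw in CONTROLLERS:
        for aio in AIO_MODES:
            # In practice the previously attached disk entry is detached first,
            # e.g. "qm set <vmid> --delete scsi0".
            cmd = ["qm", "set", VMID, f"--{bus}", f"{VOLUME},aio={aio}"]
            if scsihw is not None:
                cmd += ["--scsihw", scsihw]
            subprocess.run(cmd, check=True)
            # ...reboot the guest and run the 10-minute fio/perf measurement here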

Significant efforts were made to isolate the CPU workload to ensure accurate measurements. With an awareness of the underlying CPU architecture, we leveraged host CPU affinity, guest CPU affinity, network IRQ affinity, transmit flow steering, receive flow steering, and Linux work queue affinity to ensure all “work” performed during the test runs was measurable by the system and hardware profiling tools, as well as to eliminate any architecture-specific performance artifacts.
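
A minimal sketch of this style of host-side pinning, assuming a hypothetical NIC name and CPU set, is shown below; it illustrates the standard Linux interfaces involved rather than our exact tooling.

    # Minimal sketch of host-side affinity pinning using standard Linux
    # procfs/sysfs interfaces (run as root). The NIC name, CPU list, and CPU
    # mask are assumed values chosen for illustration.
    import pathlib

    NIC = "ens1f0"        # assumed storage-facing interface
    CPU_LIST = "8-15"     # assumed CPUs reserved for the guest and I/O path
    CPU_MASK = "ff00"     # hex mask covering the same CPUs

    # Pin the NIC's IRQs to the measured CPUs by matching /proc/interrupts lines.
    for line in pathlib.Path("/proc/interrupts").read_text().splitlines():
        if NIC in line:
            irq = line.split(":", 1)[0].strip()
            pathlib.Path(f"/proc/irq/{irq}/smp_affinity_list").write_text(CPU_LIST)

    # Steer receive (RPS) and transmit (XPS) flows onto the same CPUs.
    for queue in pathlib.Path(f"/sys/class/net/{NIC}/queues").iterdir():
        target = "rps_cpus" if queue.name.startswith("rx") else "xps_cpus"
        (queue / target).write_text(CPU_MASK)

    # The QEMU process can be pinned to the same CPUs as well, for example with
    # "taskset -cp 8-15 <qemu-pid>".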

To ensure consistent I/O workloads, we leveraged the programmable QoS features of the Blockbridge storage system. Initial attempts to rate limit storage using QEMU's built-in features proved extraordinarily CPU-intensive, varied with configuration, and unfairly biased the efficiency measurements. Our external rate-limiting approach produces more stable workloads and ensures that the efficiency measurements do not include rate-limiting overhead.

EFFICIENCY MEASUREMENTS

Context Switches Per I/O

The data below presents the average number of context switches per I/O for each controller and asynchronous I/O mode. Context switches were measured using perf and scoped to include only the CPUs executing the guest, QEMU worker threads, kernel iSCSI initiator logic, and related system functions (e.g., IRQs). An adaptive external IOPS QoS limiter was used to ensure a consistent rate of 32K IOPS for each test configuration.

CPU Utilization

Each combination of guest storage controller, guest device driver, AIO mode, and hypervisor driver has unique efficiency characteristics. Using a profiler, we can measure CPU cycle consumption to determine the best configurations from a compute perspective. Simply put, the fewer cycles consumed performing I/O, the more cycles available for computation (and other I/O)!

The data below presents the average number of combined user and system CPU cycles per I/O for each controller and asynchronous I/O mode. CPU cycles were measured using perf and scoped to include only the CPUs executing the guest, QEMU worker threads, kernel iSCSI initiator logic, and related system functions (e.g., IRQs). An adaptive external IOPS QoS limiter was used to ensure a sustained rate of 32K IOPS for each test configuration.
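
In practical terms, each data point reduces to counting events on the isolated CPUs for the duration of the test and dividing by the total I/O count. The sketch below assumes a hypothetical CPU list and relies on the fixed 32K IOPS rate and 10-minute window.

    # Minimal sketch of the per-I/O derivation using perf. The CPU list is an
    # assumed value; counting is restricted to the CPUs running the guest,
    # QEMU workers, and iSCSI initiator work.
    import subprocess

    CPUS = "8-15"        # assumed isolated CPUs
    IOPS = 32_000        # rate enforced by the external QoS limiter
    RUNTIME = 600        # 10-minute test window (seconds)

    # Count context switches and CPU cycles system-wide, restricted to CPUS.
    result = subprocess.run(
        ["perf", "stat", "-a", "-C", CPUS, "-x", ",",
         "-e", "context-switches,cycles", "--", "sleep", str(RUNTIME)],
        capture_output=True, text=True, check=True,
    )

    # With -x, perf writes one CSV line per event to stderr: value,unit,event,...
    counts = {}
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2] in ("context-switches", "cycles"):
            counts[fields[2]] = int(fields[0])

    total_ios = IOPS * RUNTIME
    print("context switches per I/O:", counts["context-switches"] / total_ios)
    print("CPU cycles per I/O:      ", counts["cycles"] / total_ios)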

Data Analysis

Measured CPU Cycles and Context Switch Scatter Plot

The scatter plot below shows the correlation between context switch efficiency and CPU utilization. For ease of comparison, the data points are styled as follows:

  • Each symbol corresponds to a storage controller.
  • The color of each symbol corresponds to the AIO mode.

Normalized CPU Cycles and Context Switch Scatter Plot

To visualize the correlation between context switches and CPU more accurately, we can normalize the data to eliminate scaling effects. We use unity-based normalization (i.e., min-max feature scaling) to project the data into the range of [0,1] as follows:

\[\mathit{normalized\_value} = \frac{val - Min(val)}{ Max(val) - Min(val)}\]
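
In code, this scaling is a one-liner. The sketch below uses hypothetical measurement values purely to show the projection into [0,1].

    # Unity-based (min-max) normalization of a list of per-I/O measurements.
    # The sample values are hypothetical, not measured results.
    def min_max_normalize(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    cycles_per_io = [52_000, 81_000, 120_000]   # hypothetical raw measurements
    print(min_max_normalize(cycles_per_io))     # -> [0.0, 0.426..., 1.0]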

The graph below plots the normalized context switch and CPU cycle efficiency measurements together. For ease of comparison, the data points are displayed as follows:

  • Each symbol corresponds to a storage controller.
  • The color of each symbol corresponds to the AIO mode.

Normalized Efficiency Scores

In the previous section, we presented a graph of normalized values for context switches and CPU cycles. Using these normalized values, we can derive a single score to compare the efficiency of the configurations under test. Our score is simply the length of the vector extending from the origin to the point defined by the normalized measurements, calculated as follows:

\[\mathit{efficiency\_score} = \sqrt{\mathit{normalized\_cpu\_cycles\_per\_IOP}^2 + \mathit{normalized\_context\_switches\_per\_IOP}^2}\]
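
Equivalently, the score is the Euclidean distance from the origin in the normalized plane. The sketch below uses hypothetical normalized values.

    # Efficiency score: Euclidean distance from the origin in the normalized
    # (CPU cycles per I/O, context switches per I/O) plane. Lower is better.
    import math

    def efficiency_score(norm_cycles_per_io, norm_ctx_switches_per_io):
        return math.hypot(norm_cycles_per_io, norm_ctx_switches_per_io)

    # Hypothetical normalized values for a single configuration.
    print(round(efficiency_score(0.10, 0.05), 3))   # -> 0.112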

The chart below presents a comparison of efficiency scores derived from the normalized CPU cycles and context switch measurement data. Lower scores are better.

Table of Measured Values, Rankings, and Efficiency Scores

The table below presents raw data, normalized scoring, and efficiency rankings for context switches and CPU cycles. Click on any header to sort the data to your liking.

CONCLUSIONS

For IOPS-intensive workloads operating in a Windows Server 2022 guest with iSCSI shared storage:

  • Virtio drivers provide up to 53% better CPU cycle efficiency compared to the best-performing native drivers included with Windows.

  • Virtio drivers result in 81% fewer context switches compared to the best-performing native drivers included with Windows.

  • virtio-scsi with aio=native is 84% more efficient than virtio-scsi with aio=io_uring in terms of CPU cycles and context switches.

  • aio=threads is the least efficient model for asynchronous I/O for all storage controllers, resulting in the highest CPU cycle utilization and context switch rates.

  • virtio-scsi with aio=native outperforms all virtio-blk configurations in terms of CPU cycle and context switch efficiency.

  • An iothread has negligible efficiency overhead when used with virtio-scsi and aio=native.

  • virtio-scsi with aio=native is the optimal configuration in terms of CPU cycle efficiency and context switch overhead.

Important Considerations

  • The data set in this technote focuses on the system efficiency of an IOPS-centric workload characterized by moderate queue depths of smaller-sized I/Os. Data for bandwidth-focused workloads may differ.

  • Our data presents a complete system view that includes the interactions between system hardware, network drivers, storage drivers, Linux, QEMU, guest storage drivers, and Windows itself.

  • System efficiency is a reliable indicator of expected storage performance under moderate to heavy system load. However, it does not necessarily correlate with peak performance on an unloaded system.

  • Synthetic efficiency scores derived from this dataset are normalized values and should not be used for comparison with other datasets.

ADDITIONAL RESOURCES