OVERVIEW

This technote is the fourth installment in a series of technical articles devoted to optimizing Windows on Proxmox. In Part 4, we quantify and compare IOPS, bandwidth, and latency across all storage controllers and AIO modes under ideal conditions, using Windows Server 2022 running on Proxmox with iSCSI shared block storage.

Workload Description

Using fio, we executed a series of read workloads within a Windows Server 2022 guest, directed at a raw physical device, and collected data over a 10-minute interval. The subsequent sections provide comparisons of the raw data. We then use this data to calculate a performance score and present a comparative analysis.

TECHNOTE SERIES

Abstract

Proxmox exposes several QEMU storage configuration options through its management interfaces. This technote series aims to quantify the efficiency of all storage controller types operating in all possible AIO modes that are supported by Windows Server 2022. The purpose is to provide unbiased quantitative guidance for selecting an optimal configuration for Proxmox Windows guests running on shared block storage.

Methodology

Rather than emphasize maximum performance, we take precise measurements of system CPU cycles, userland CPU cycles, and system context switches under fixed workload conditions. The measurements provide the basis for an efficiency comparison from the perspective of the CPU and Operating System scheduler.

The Proxmox system under test is a SuperMicro H13 server with a single AMD Zen4 9554P 64-core processor and 768GiB of DDR5 operating at 4800MT/s. All testing was performed on Proxmox 8.0.3, running Linux Kernel 6.2.16-3. Workloads were generated using a Windows Server 2022 guest with fio executing against a raw physical device configured in the guest. All compatible storage controllers (i.e., ide, sata, virtio-blk, virtio-scsi, and vmware-pvscsi) and AIO modes (i.e., native, io_uring, and threads) were tested against iSCSI shared block storage.

The storage system used for testing is a Blockbridge NVME-24-SMC-ZEN4 with an AMD Zen4 9554P processor, Micron 7450 Gen4 NVMe devices, and 200Gbit storage connectivity. The Blockbridge storage operating system is version 6.0.

Tests were conducted using a single Windows Server 2022 virtual machine on an idle system. The Windows guest was configured with a specific controller and aio mode combination for each test and rebooted. Each data point collected represents a 10-minute I/O test. Multiple runs for each data point were collected to validate consistency.
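
As an illustration, each controller and AIO mode combination maps to a handful of options in the guest's Proxmox VM configuration. The excerpt below is a sketch only (the VMID, storage name, and volume name are hypothetical); each test run varies the controller type, the aio setting, and the iothread flag:

# /etc/pve/qemu-server/100.conf (illustrative excerpt)
scsihw: virtio-scsi-single
scsi1: iscsi-lvm:vm-100-disk-1,aio=native,iothread=1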

Significant efforts were made to isolate the CPU workload to ensure accurate measurements. With an awareness of the underlying CPU architecture, we leveraged host CPU affinity, guest CPU affinity, network IRQ affinity, transmit flow steering, receive flow steering, and Linux work queue affinity to ensure all “work” performed during the test runs was measurable by the system and hardware profiling tools, as well as to eliminate any architecture-specific performance artifacts.

To ensure consistent I/O workloads, we leveraged the programmable QoS features of the Blockbridge storage system. Initial attempts to rate limit storage using the built-in QEMU features were found to be extraordinarily CPU intensive, to vary based on configuration, and to unfairly bias the efficiency measurements. Our external rate-limiting approach results in more stable workloads and ensures that efficiency measurements do not include rate-limiting overhead.
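
For context, the built-in QEMU rate limits referenced above are exposed as per-disk throttle options in the Proxmox VM configuration, as in the hypothetical example below (names and values are illustrative only); these options were not used in our testing:

scsi1: iscsi-lvm:vm-100-disk-1,iops_rd=100000,iops_wr=100000,mbps_rd=1000,mbps_wr=1000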

PERFORMANCE DATA

Average Latency QD1

QD1 latency stands out as the single most valuable indicator of I/O processing potential, providing a comprehensive assessment of end-to-end I/O processing overhead. QD1 latency benchmarks leave nowhere to hide.

Sequential Reads, 512-byte blocks, Single Outstanding I/O

The following chart graphs the average QD1 latency for each controller and AIO mode, measured within the Windows guest using fio over a 10-minute interval. Lower latencies indicate superior performance.

Closer examination of the data above reveals several valuable observations:

  • virtio-blk has lower per-io overhead than virtio-scsi, likely due to the absence of SCSI protocol translation.

  • aio=io_uring is on the order of a microsecond slower than aio=native in most cases.

  • Using an iothread reduces latency in all scenarios.

  • aio=threads has higher overhead than aio=io_uring and aio=native.

  • vmware-pvscsi has less driver overhead than SATA and IDE.

Workload Description For QD1 Latency Testing

[qd1-latency]
bs=512
rw=read
iodepth=1
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0

Average IOPS QD128

Configurations that deliver high IOPS typically exhibit lower random memory accesses and fewer synchronization points, resulting in lower execution overhead. To focus our testing on the critical code paths associated with performance, we use a high queue depth random read workload with the smallest feasible data transfer size.

Random Reads, 512-byte blocks, 128 Queue Depth

The following graph details the average IOPS for each controller and AIO mode, measured within the Windows guest using fio over a 10-minute interval. Higher IOPS values indicate superior performance.

Note that IDE and SATA controllers demonstrate notably poor performance in comparison to virtio. This subpar performance is undoubtedly influenced by how the storage drivers communicate with the virtual hardware. Based on our experience, SATA and IDE are hindered by the substantial number of vm_exit() and vm_enter() calls required to support the programmed I/O model of the emulated ICH9 host chipset driver.

Workload Description for QD128 IOPS Testing

[iops]
bs=512
rw=randread
iodepth=128
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0

Average Latency QD128

Latency under a heavy load provides insights into the responsiveness of a system when subjected to intense I/O activity. Workloads with high queue depths exert pressure on the effectiveness of I/O submission and completion paths as well as I/O scheduling.

Random Reads, 512-byte blocks, 128 Queue Depth

The chart below shows the average latency under load for each controller and AIO mode, assessed within the Windows guest using fio over a 10-minute interval. Lower latencies are generally preferred.

Workload Description For QD128 Latency Testing

[qd128-latency]
bs=512
rw=randread
iodepth=128
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0

Average Bandwidth

For bandwidth measurements, we prefer a random read workload with a large block size. Bandwidth benchmarks are influenced to a lesser extent by the inline latency of control paths and more by the I/O scheduling characteristics.

Random Reads, 1MiB blocks, 32 Queue Depth

The chart below shows the average bandwidth attained for each controller and AIO mode, assessed within the Windows guest using fio over a 10-minute interval. Higher bandwidth values are preferred.


For single-path storage, aio=native exhibits a slight advantage over aio=io_uring and a more noticeable advantage compared to aio=threads. Our research indicates that this is attributable to irregular I/O scheduling associated with userland threads and/or kernel tasklets.

Workload Description For Bandwidth Testing

[bandwidth]
bs=1M
rw=randread
iodepth=32
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0

Normalized Scores

For comparison via scoring, we generate normalized values for each performance metric (i.e., IOPS, bandwidth, QD1 latency, and QD128 latency) using unity-based normalization (i.e., min-max feature scaling). This assigns a relative score for each metric in the range of [0,1] as follows:

\[\mathit{normalized\_value} = \frac{val - \min(val)}{\max(val) - \min(val)}\]

Given four metrics to optimize (with equal weight), we generate a score that represents the magnitude of the vector in a 4D space as follows:

\[\mathit{normalized\_score} = \sqrt{\mathit{normalized\_iops}^2 + \mathit{normalized\_bw}^2 + \mathit{normalized\_qd1\_latency}^2 + \mathit{normalized\_qd128\_latency}^2}\]
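
The following Python sketch illustrates the scoring arithmetic described above; it is not the tooling used to produce the results. It assumes each metric has been oriented so that smaller values are better (for example, by inverting IOPS and bandwidth before scaling), which keeps the result consistent with the lower-is-better scores reported below.

import math

def min_max(values):
    # Unity-based (min-max) normalization into the range [0, 1].
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against identical values
    return [(v - lo) / span for v in values]

def normalized_scores(iops, bw, qd1_lat, qd128_lat):
    # Magnitude of the 4-D vector of normalized metrics, one score per
    # controller/AIO configuration (lower is better under the assumption above).
    n_iops, n_bw = min_max(iops), min_max(bw)
    n_qd1, n_qd128 = min_max(qd1_lat), min_max(qd128_lat)
    return [math.sqrt(a*a + b*b + c*c + d*d)
            for a, b, c, d in zip(n_iops, n_bw, n_qd1, n_qd128)]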

The table below lists the scores alongside aggregated performance data. Lower scores are better.

CONCLUSIONS

  • While the IOPS and latency data show the critical distinctions in performance among storage controllers, all controller and AIO combinations consistently deliver acceptable bandwidth.

  • The virtio-scsi controller with aio=native and an iothread stands out as the top performer overall. Opting for virtio-scsi with aio=io_uring is also viable, although it tends to lag slightly in bandwidth-oriented tests.

  • virtio-blk excels in QD1 performance, delivering a 2-microsecond improvement over virtio-scsi, likely attributable to reduced protocol overhead in the driver. However, it fails to deliver competitive IOPS scores.

  • The most performant storage controller native to Windows Server 2022 is vmware-pvscsi. Both sata and ide deliver comparably poor performance.

  • Compared to vmware-pvscsi, virtio-scsi exhibits significant performance advantages, including a 157% increase in IOPS, a 14% reduction in QD1 latency, and a 67% decrease in QD128 latency.

Important Considerations

  • Synthetic efficiency scores derived from this dataset are normalized values and cannot be used for comparison with other datasets.

  • The data presented in this technote describes performance on an unloaded system as a means of comparing base implementation efficiency.

  • The testing was conducted using iSCSI shared block storage with a single path. Performance for NVMe shared block storage and multipath configurations may vary.

ADDITIONAL RESOURCES