OVERVIEW

This technote marks the first in a series of technical articles dedicated to optimizing Windows storage performance on Proxmox. Part 1 provides a comprehensive guide to Proxmox storage controllers, Windows driver compatibility, Asynchronous I/O modes, and IOThreads. It establishes foundational system concepts for efficiency comparisons in upcoming installments and describes our testing environment.

TECHNOTE SERIES

Abstract

Proxmox exposes several QEMU storage configuration options through its management interfaces. This technote series aims to quantify the efficiency of every storage controller type supported by Windows Server 2022, operating in each available AIO mode. The purpose is to provide unbiased quantitative guidance for selecting an optimal configuration for Proxmox Windows guests running on shared block storage.

Methodology

Rather than emphasize maximum performance, we take precise measurements of system CPU cycles, userland CPU cycles, and system context switches under fixed workload conditions. The measurements provide the basis for an efficiency comparison from the perspective of the CPU and Operating System scheduler.

The Proxmox system under test is a SuperMicro H13 server with a single AMD Zen4 9554P 64-core processor and 768GiB of DDR5 operating at 4800MT/s. All testing was performed on Proxmox 8.0.3, running Linux kernel 6.2.16-3. Workloads were generated using a Windows Server 2022 guest with fio executing against a raw physical device configured in the guest. All compatible storage controllers (e.g., SATA, IDE, VirtIO, VMware) and aio modes (i.e., native, io_uring, threads) were tested against iSCSI shared block storage.

The storage system used for testing is a Blockbridge NVME-24-SMC-ZEN4 with an AMD Zen4 9554P processor, Micron 7450 Gen4 NVMe devices, and 200Gbit storage connectivity. The Blockbridge storage operating system is version 6.0.

Tests were conducted using a single Windows Server 2022 virtual machine on an idle system. The Windows guest was configured with a specific controller and aio mode combination for each test and rebooted. Each data point collected represents a 10-minute I/O test. Multiple runs for each data point were collected to validate consistency.

Significant efforts were made to isolate the CPU workload to ensure accurate measurements. With an awareness of the underlying CPU architecture, we leveraged host CPU affinity, guest CPU affinity, network IRQ affinity, transmit flow steering, receive flow steering, and Linux work queue affinity to ensure all “work” performed during the test runs was measurable by the system and hardware profiling tools, as well as to eliminate any architecture-specific performance artifacts.

To ensure consistent I/O workloads, we leveraged the programmable QoS features of the Blockbridge storage system. Initial attempts to rate limit storage using inbuilt QEMU features were found to be extraordinarily CPU intensive, to vary based on configuration, and to unfairly bias the efficiency measurements. Our external rate-limiting approach results in more stable workloads and ensures that efficiency measurements do not include rate-limiting overhead.

PROXMOX STORAGE CONTROLLERS

What is a storage controller?

A storage controller refers to a system hardware device that interfaces with a storage device (i.e., a drive). Storage controllers can be discrete devices logically connected via PCI (i.e., host bus adapters) or functionality embedded in standard motherboard building blocks.

QEMU’s PC-Q35 hardware emulation presents a guest with an emulated ICH9 host chipset, which supports SATA and IDE devices. By emulating the function of the ICH9, QEMU can provide access to storage devices connected to the ICH9 hub.

QEMU can also present virtual hardware devices logically connected via PCI. All Proxmox “SCSI Hardware” options present as PCI-connected devices. There are significant performance differences between controller implementations that correspond to the interface and emulation complexity of the hardware API.
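
For reference, the controller type is selected per-VM through the scsihw option. The sketch below is illustrative only; the VM ID (100) and storage name (iscsi-lvm) are hypothetical placeholders, not the values used in this series.

# Choose the emulated controller; other valid values include lsi, megasas,
# pvscsi, and virtio-scsi-pci.
qm set 100 --scsihw virtio-scsi-single

# Attach a disk that is presented to the guest through the selected controller.
qm set 100 --scsi1 iscsi-lvm:vm-100-disk-1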

Windows Server 2022 Driver Status

There are three controller options that are natively supported by Windows Server 2022 and do not require driver installation: SATA (via ICH9), IDE (via ICH9), and VMware PVSCSI. VirtIO drivers are open source, freely available for download, and offer significant performance improvements. VirtIO storage controller options include virtio-blk, virtio-scsi, and virtio-scsi-single. The following table presents the storage controller options supported by Proxmox along with driver support status for Windows Server 2022.

Controller            Hardware (circa)                     Windows 2022 Driver Status
ICH9/SATA             Intel I/O Controller Hub 9 (2007)    Native
ICH9/IDE              Intel I/O Controller Hub 9 (2007)    Native
VMware PVSCSI         ESX 4.x and Later (2009)             Native
VIRTIO SCSI           Virtio-scsi (2011)                   3rd Party
VIRTIO SCSI SINGLE    Virtio-scsi (2011)                   3rd Party
VIRTIO BLOCK          Virtio-blk (2007)                    3rd Party
MegaRAID SAS 8708EM2  PCIe 1.0 <-> SAS/SATA RAID (2007)    Not Available
LSI 53C895A           32-bit PCI <-> Ultra2 SCSI (2001)    Not Available
LSI 53C810            32-bit PCI <-> SCSI (2001)           Not Available

PROXMOX AIO MODES

What is aio=native?

The aio disk parameter selects the method for implementing asynchronous I/O. Asynchronous I/O allows QEMU to issue multiple transfer requests to the hypervisor without serializing QEMU’s centralized scheduler.

AIO is also the name of a Linux system interface for performing asynchronous I/O, introduced in Linux 2.6. Setting aio=native in Proxmox informs the system to use the Linux AIO system interface for managing asynchronous I/O.

In the Linux AIO model, submission and completion operations are system calls. The primary issue with the Linux AIO implementation is that it can block in a variety of circumstances (e.g., buffered I/O, high queue depth, dm devices).

AIO blocks if anything in the I/O submission path is unable to complete inline. However, when used with raw block devices and caching disabled, AIO will not block. Therefore, it is a good choice for network-attached block storage.
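
As an illustration (not the exact configuration used in these tests), the AIO mode is selected per-disk in the VM configuration; the other modes are selected the same way by substituting aio=io_uring or aio=threads. The VM ID and storage name below are hypothetical placeholders.

# cache=none keeps the raw device path unbuffered, which is the condition
# under which aio=native does not block on submission.
qm set 100 --scsi1 iscsi-lvm:vm-100-disk-1,aio=native,cache=none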

What is aio=io_uring?

io_uring is an alternative Linux system interface for executing concurrent asynchronous I/O. Similar to Linux AIO, it allows QEMU to issue multiple transfer requests without serializing QEMU’s centralized scheduler.

Unlike Linux AIO, io_uring leverages independent shared memory queues for command submission and completion instead of system calls. One architectural goal of io_uring is to minimize latency by reducing system call overhead. However, in our experience, it is often difficult to extract the full performance potential of io_uring in applications like QEMU due to an inherent need to share system resources fairly (i.e., busy polling isn’t a good option).

io_uring has one significant architectural advantage compared to AIO: it is guaranteed not to block. However, this does not necessarily result in operational gains since Linux AIO is also non-blocking when used with O_DIRECT raw block storage.

What is aio=threads?

The aio=threads model provides concurrent I/O execution using traditional blocking system calls. QEMU actively manages a pool of userland threads to which it dispatches I/O requests. The aio=threads model predates the aio=native and aio=io_uring options.

Although aio=threads is a legacy technology, there are still some exceptional cases where it offers performance benefits, particularly on idle systems.

PROXMOX I/O OFFLOAD

What are IOThreads?

When a VM executes an asynchronous disk I/O operation, it issues a request to the hypervisor and waits for an event indicating completion. By default, this happens in the QEMU “main loop.” An IOThread provides a dedicated event loop, operating in a separate thread, for handling I/O. IOThreads offload work from the “main loop” into a separate thread that can execute concurrently.

There are several claimed advantages of IOThreads, including decreased latency, reduced contention, and improved scalability. However, nothing is free. IOThreads are resources that the hypervisor must prioritize and execute.
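
In Proxmox, an IOThread is enabled per-disk with the iothread flag and is typically paired with the virtio-scsi-single controller so that each disk gets its own controller and dedicated event loop. A minimal sketch, again using a hypothetical VM ID and storage name:

# One dedicated IOThread per disk; takes effect with virtio-scsi-single or virtio-blk.
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi1 iscsi-lvm:vm-100-disk-1,iothread=1,aio=native,cache=none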

EXPLORING EFFICIENCY

Each combination of guest OS, storage controller, device driver, AIO mode, workload, and storage type exhibits unique performance characteristics. The interplay between these components significantly influences storage performance.

A precise method for comparing configurations involves quantifying the work required for a system to complete a specific task. Metrics such as operating system context switches and CPU consumption provide valuable insights into the efficiency of system operations. Not only are these metrics relatively simple to measure, but they also provide straightforward points of comparison.

What is a Context Switch?

A context switch is an operation that saves the state of a running process and restores the state of another process to resume execution. It is a foundational operating system scheduling primitive that permits multitasking.

Context switches occur for a variety of reasons. For example, an operating system scheduler may force a context switch when a process has exceeded its scheduling quantum (i.e., for fairness). Alternatively, the operating system may schedule a process for execution when a blocking I/O dependency has been resolved, such as a disk operation.

Performance Implications of Context Switches

Context switches have become relatively cheap from an execution perspective. As an efficiency measure, we’re less concerned about the execution overhead of the context switch itself and more concerned with the latency implications related to queuing theory. Specifically, when a context switch occurs, any number of runnable processes may execute before we get a chance to run again.

In real-world systems, the number of context switches needed to perform an I/O operation and the number of runnable tasks correlate with I/O latency. Accordingly, we should expect configurations that result in more context switches to exhibit higher latency on loaded systems. Our preference is for consistent low latency. Therefore, lower context switch rates are better.
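
Context-switch rates for a running QEMU process can be observed with standard Linux tools; a brief sketch (the PID 12345 is a placeholder):

# Per-second voluntary (cswch/s) and involuntary (nvcswch/s) context switches
# for the QEMU process with PID 12345.
pidstat -w -p 12345 1

# Total context switches over a 60-second window.
perf stat -e context-switches -p 12345 -- sleep 60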

What is a CPU cycle?

A CPU cycle is a single unit of CPU time corresponding to the inverse of the CPU clock rate. On modern processors, it is possible for simple arithmetic operations (operating on registers) to complete within a single CPU cycle. Meanwhile, operations with memory operands can stall for hundreds of CPU cycles while accessing memory. The key takeaway is that a CPU cycle is a unit of time, not a unit of work.

Understanding System Vs. Userland Cycles

A Proxmox guest is a QEMU process that operates in userland. QEMU storage controllers, device virtualization, and guest device drivers consume userland CPU cycles. When QEMU performs I/O on behalf of the guest, it communicates with the hypervisor’s kernel using system calls. Any kernel logic executed while performing an I/O (i.e., block scheduling, iSCSI, etc.) consumes system CPU cycles. The core difference between system and userland CPU cycles is the privilege level of the CPU. The diagram below provides a rough illustration of the I/O processing stack for an iSCSI block device.

     USER  GUEST
      │      │         ┌─────────────┐
      │      │         │ DEVICE      │
      │      │         │ DRIVER      │
      │      ▼         └┬────────────┘
      │    QEMU         │
      │      │         ┌▼────────────┐   ┌─────────┐
      │      │         │ VIRTUAL     ├───► ASYNC   │
      │      │         │ DEVICE      │   │ I/O     │
      ▼      ▼         └─────────────┘   └┬────────┘
    SYSTEM KERNEL                         │
      │      │                           ┌▼────────┐  ┌─────────┐
      │      │                           │ BLOCK   ├──► SCHED   │
      │      │                           │ LAYER   │  │         │
      │      │                           └─────────┘  └┬────────┘
      │      │                                         │
      │      │                                        ┌▼────────┐  ┌─────────┐
      │      │                                        │ ISCSI   ├──► ISCSI   │
      │      │                                        │ CORE    │  │ TCP     │
      ▼      ▼                                        └─────────┘  └─────────┘
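
System and userland cycles can be separated with perf by applying privilege-level modifiers to the cycles event. A rough sketch, assuming the guest is pinned to CPUs 0-7 as described in the test environment below:

# Count userland (cycles:u) and kernel (cycles:k) cycles on CPUs 0-7
# over a 10-minute measurement window.
perf stat -e cycles:u,cycles:k -C 0-7 -- sleep 600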

TEST ENVIRONMENT

Description

Proxmox 8.0.3 (kernel version 6.2.16-3-pve) is installed on a SuperMicro H13 server containing an AMD Epyc Zen4 9554P 64-Core Processor, 768GiB of 4800MT/s DDR5 RAM (NPS=4), and a Mellanox 200Gbit network adapter. The Mellanox adapter is an x16 Gen4 device with a maximum throughput of 200Gbit/s. The server is running with default settings and hyperthreads enabled.

A single virtual machine is provisioned on the host and installed with Windows 2022 Build 20348.fe_release.210507-1500. The VM has two virtual CPUs and 16GB of RAM. The VM's boot block device, which contains the root filesystem, is separate from the storage under test.

The virtual machine has a VCPU affinity profile that constrains QEMU execution to CCX 0 (i.e., physical CPUs 0-7). Similarly, NIC interrupts and relevant kernel worker threads were affinitized to the first CCX. System and hardware profile measurements were constrained to CCX 0.
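
The following is a hedged illustration of the kind of pinning involved; the VM ID, IRQ number, and CPU range are placeholders rather than the exact values used:

# Pin the guest's vCPU threads to the first CCX (physical CPUs 0-7).
qm set 100 --affinity 0-7

# Steer a NIC interrupt to the same CCX.
echo 0-7 > /proc/irq/200/smp_affinity_list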

We executed a consistent I/O workload for each tested configuration and collected efficiency data. Using system and hardware profiling tools, we collected context switches, system time, user time, IOPS, bandwidth, and latency data. Each data point collected corresponds to a 10-minute I/O test. We executed multiple runs for each data point to validate consistency.

An external rate limiter, built into Blockbridge, was used to ensure each tested configuration sustained the same level of performance. Initial attempts to rate limit storage using inbuilt QEMU features were found to be extraordinarily CPU intensive, to vary based on configuration, and to unfairly bias the efficiency measurements. Our external rate-limiting approach results in more stable workloads and ensures that efficiency measurements do not include rate-limiting overhead.
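
For context, the inbuilt QEMU rate limiting referenced above is exposed through Proxmox per-disk bandwidth and IOPS options, for example (hypothetical values; this approach was not used for the published results):

# QEMU-level throttling: cap the disk at 10,000 read IOPS and 500 MB/s of reads.
qm set 100 --scsi1 iscsi-lvm:vm-100-disk-1,iops_rd=10000,mbps_rd=500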

All testing was performed with fio 3.25-x64 (for Windows) against raw devices using physical device paths as follows:

C:\Users\Administrator\Downloads\fio-3.25-x64-windows.exe <test-config> --filename=\\.\PHYSICALDRIVE1

Sample configurations used for testing appear below:

[iops]
bs=512
rw=randread
iodepth=128
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0

[bandwidth]
bs=1M
rw=randread
iodepth=32
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0

[latency]
bs=512
rw=read
iodepth=1
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0
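
Each job section above would be saved to its own job file and passed to fio in place of <test-config>; for example, assuming the latency job is saved as latency.fio (a hypothetical file name):

C:\Users\Administrator\Downloads\fio-3.25-x64-windows.exe latency.fio --filename=\\.\PHYSICALDRIVE1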

Network Diagram


              /─────────────────────────────┐                              /─────────────────────┐
              │                             │    /────────────────────┐    │                     │
              │  ┌──────┐  PROXMOX 8.0.3    │── ─┤ 200G MELLANOX 200G ├─ ──┤  BLOCKBRIDGE 6.X    │
              │  | WIN  |  200G SINGLE PORT │    └────────────────────/    │  QUAD ENGINE        │
              │  │ 2022 │  X16 GEN4         │                              │  2X 200G DUAL PORT  │
              │  └──────┘  ZEN4 9554P       │                              │  ZEN4 9554P         │
              │                             │                              │                     │
              └─────────────────────────────/                              └─────────────────────/

Software

Proxmox Version

# pveversion
pve-manager/8.0.3/bbf3993334bfa916 (running kernel: 6.2.16-3-pve)

Linux Kernel Options

BOOT_IMAGE=/boot/vmlinuz-6.2.16-3-pve root=/dev/mapper/pve-root ro quiet amd_iommu=disable iommu=pt

Blockbridge Version

version:   6.0.2
release:   6802.1
branch:    production-6.0
timestamp: Jun 28 2023 21:24:39

Hardware And Networking

Server Platform

System Information
        Manufacturer: Supermicro
        Product Name: AS-1115CS-TNR

Processor

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    64
Socket(s):             1
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            25
Model:                 17
Model name:            AMD EPYC 9554P 64-Core Processor
Stepping:              1
CPU MHz:               3748.399
CPU max MHz:           3100.0000
CPU min MHz:           1500.0000
BogoMIPS:              6199.91
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              32768K
NUMA node0 CPU(s):     0-15,64-79
NUMA node1 CPU(s):     16-31,80-95
NUMA node2 CPU(s):     32-47,96-111
NUMA node3 CPU(s):     48-63,112-127

Network Adapter

Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
 Subsystem: Mellanox Technologies Device 0007
 Physical Slot: 1
 Flags: bus master, fast devsel, latency 0, IRQ 2068, NUMA node 3
 Memory at 303dbc000000 (64-bit, prefetchable) [size=32M]
 Expansion ROM at f8500000 [disabled] [size=1M]
 Capabilities: [60] Express Endpoint, MSI 00
                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest
 Capabilities: [48] Vital Product Data
 Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
 Capabilities: [c0] Vendor Specific Information: Len=18 <?>
 Capabilities: [40] Power Management version 3
 Capabilities: [100] Advanced Error Reporting
 Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
 Capabilities: [1c0] #19
 Capabilities: [320] #27
 Capabilities: [370] #26
 Capabilities: [420] #25
 Kernel driver in use: mlx5_core
 Kernel modules: mlx5_core

Network Adapter PCI Connectivity

[    5.436861] mlx5_core 0000:01:00.0: firmware version: 20.37.1014
[    5.436885] mlx5_core 0000:01:00.0: 252.048 Gb/s available PCIe bandwidth (16 GT/s x16 link)
[    5.644051] mlx5_core 0000:01:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[    5.644053] mlx5_core 0000:01:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[    5.650970] mlx5_core 0000:01:00.0: mlx5_pcie_event:296:(pid 920): PCIe slot advertised sufficient power (75W).
[    5.687350] mlx5_core 0000:c1:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
	Supported ports: [ Backplane ]
	Supported link modes:   1000baseT/Full
	                        1000baseKX/Full
	                        10000baseT/Full
	                        10000baseKR/Full
	                        40000baseKR4/Full
	                        40000baseCR4/Full
	                        40000baseSR4/Full
	                        40000baseLR4/Full
	                        25000baseCR/Full
	                        25000baseKR/Full
	                        25000baseSR/Full
	                        50000baseCR2/Full
	                        50000baseKR2/Full
	                        100000baseKR4/Full
	                        100000baseSR4/Full
	                        100000baseCR4/Full
	                        100000baseLR4_ER4/Full
	                        50000baseSR2/Full
	                        1000baseX/Full
	                        10000baseCR/Full
	                        10000baseSR/Full
	                        10000baseLR/Full
	                        10000baseER/Full
	                        50000baseKR/Full
	                        50000baseSR/Full
	                        50000baseCR/Full
	                        50000baseLR_ER_FR/Full
	                        50000baseDR/Full
	                        100000baseKR2/Full
	                        100000baseSR2/Full
	                        100000baseCR2/Full
	                        100000baseLR2_ER2_FR2/Full
	                        100000baseDR2/Full
	                        200000baseKR4/Full
	                        200000baseSR4/Full
	                        200000baseLR4_ER4_FR4/Full
	                        200000baseDR4/Full
	                        200000baseCR4/Full
	Supported pause frame use: Symmetric
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  1000baseT/Full
	                        1000baseKX/Full
	                        10000baseT/Full
	                        10000baseKR/Full
	                        40000baseKR4/Full
	                        40000baseCR4/Full
	                        40000baseSR4/Full
	                        40000baseLR4/Full
	                        25000baseCR/Full
	                        25000baseKR/Full
	                        25000baseSR/Full
	                        50000baseCR2/Full
	                        50000baseKR2/Full
	                        100000baseKR4/Full
	                        100000baseSR4/Full
	                        100000baseCR4/Full
	                        100000baseLR4_ER4/Full
	                        50000baseSR2/Full
	                        1000baseX/Full
	                        10000baseCR/Full
	                        10000baseSR/Full
	                        10000baseLR/Full
	                        10000baseER/Full
	                        50000baseKR/Full
	                        50000baseSR/Full
	                        50000baseCR/Full
	                        50000baseLR_ER_FR/Full
	                        50000baseDR/Full
	                        100000baseKR2/Full
	                        100000baseSR2/Full
	                        100000baseCR2/Full
	                        100000baseLR2_ER2_FR2/Full
	                        100000baseDR2/Full
	                        200000baseKR4/Full
	                        200000baseSR4/Full
	                        200000baseLR4_ER4_FR4/Full
	                        200000baseDR4/Full
	                        200000baseCR4/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Link partner advertised link modes:  Not reported
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: Yes
	Link partner advertised FEC modes: Not reported
	Speed: 200000Mb/s
	Duplex: Full
	Port: Direct Attach Copper
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: d
	Wake-on: d
	Current message level: 0x00000004 (4)

Network Adapter Interrupt Coalesce Settings

Adaptive RX: on  TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a

rx-usecs: 8
rx-frames: 128
rx-usecs-irq: n/a
rx-frames-irq: n/a

tx-usecs: 8
tx-frames: 128
tx-usecs-irq: n/a
tx-frames-irq: n/a

CQE mode RX: on  TX: off

ADDITIONAL RESOURCES