OVERVIEW
This technote marks the first in a series of technical articles dedicated to optimizing Windows storage performance on Proxmox. Part 1 provides a comprehensive guide to Proxmox storage controllers, Windows driver compatibility, Asynchronous I/O modes, and IOThreads. It establishes foundational system concepts for efficiency comparisons in upcoming installments and describes our testing environment.
TECHNOTE SERIES
Abstract
Proxmox exposes several QEMU storage configuration options through its management interfaces. This technote series aims to quantify the efficiency of every storage controller type supported by Windows Server 2022, operating in each of its available AIO modes. The purpose is to provide unbiased quantitative guidance for selecting an optimal configuration for Proxmox Windows guests running on shared block storage.
Methodology
Rather than emphasize maximum performance, we take precise measurements of system CPU cycles, userland CPU cycles, and system context switches under fixed workload conditions. The measurements provide the basis for an efficiency comparison from the perspective of the CPU and Operating System scheduler.
The Proxmox system under test is a SuperMicro H13 server with a single AMD Zen4 9554P 64-core processor and 768GiB of DDR5 operating at 4800MT/s. All testing was performed on Proxmox 8.0.3, running Linux kernel 6.2.16-3. Workloads were generated using a Windows Server 2022 guest with fio executing against a raw physical device configured in the guest. All compatible storage controllers (e.g., SATA, IDE, VirtIO, VMware PVSCSI) and aio modes (native, io_uring, threads) were tested against iSCSI shared block storage.
The storage system used for testing is a Blockbridge NVME-24-SMC-ZEN4 with an AMD Zen4 9554P processor, Micron 7450 Gen4 NVMe devices, and 200Gbit storage connectivity. The Blockbridge storage operating system is version 6.0.
Tests were conducted using a single Windows Server 2022 virtual machine on an idle system. The Windows guest was configured with a specific controller and aio mode combination for each test and rebooted. Each data point collected represents a 10-minute I/O test. Multiple runs for each data point were collected to validate consistency.
Significant efforts were made to isolate the CPU workload to ensure accurate measurements. With an awareness of the underlying CPU architecture, we leveraged host CPU affinity, guest CPU affinity, network IRQ affinity, transmit flow steering, receive flow steering, and Linux work queue affinity to ensure all “work” performed during the test runs was measurable by the system and hardware profiling tools, as well as to eliminate any architecture-specific performance artifacts.
To ensure consistent I/O workloads, we leveraged the programmable QoS features of the Blockbridge storage system. Initial attempts to rate limit storage using QEMU's built-in features were found to be extraordinarily CPU intensive, to vary based on configuration, and to unfairly bias the efficiency measurements. Our external rate-limiting approach results in more stable workloads and ensures that efficiency measurements do not include rate-limiting overhead.
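For reference, the QEMU-side throttling we avoided is exposed in Proxmox as per-disk limits; a minimal sketch follows (the VM ID, storage name, and limit values are illustrative only):
# per-disk rate limits applied by QEMU itself (not used in this study)
qm set 100 --scsi1 blockbridge:vm-100-disk-1,iops_rd=50000,iops_wr=50000,mbps_rd=2000,mbps_wr=2000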
Series Links
- Part 1: An Introduction to Supported Windows Storage Controllers, AIO modes, and Efficiency Metrics.
- Part 4: Unloaded Performance Study of Windows Guests on Shared Block Storage.
PROXMOX STORAGE CONTROLLERS
What is a storage controller?
A storage controller refers to a system hardware device that interfaces with a storage device (i.e., a drive). Storage controllers can be discrete devices logically connected via PCI (i.e., host bus adapters) or functionality embedded in standard motherboard building blocks.
QEMU’s PC-Q35 hardware emulation presents a guest with an emulated ICH9 host chipset, which supports SATA and IDE devices. By emulating the function of the ICH9, QEMU can provide access to storage devices connected to the ICH9 hub.
QEMU can also present virtual hardware devices logically connected via PCI. All Proxmox “SCSI Hardware” options present as PCI connected devices. There are significant performance differences between controller implementations that correspond to the interface and emulation complexity of the hardware API.
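As a point of reference, the controller type is chosen per VM; a minimal sketch using the Proxmox CLI (the VM ID and storage name are hypothetical):
# select the emulated SCSI controller presented to the guest
qm set 100 --scsihw virtio-scsi-single   # other options include pvscsi, megasas, lsi, virtio-scsi-pci
# attach a disk to that controller; sata0/ide0 disks attach to the emulated ICH9 instead
qm set 100 --scsi1 blockbridge:vm-100-disk-1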
Windows Server 2022 Driver Status
There are three controller options natively supported by Windows Server 2022 that do not require driver installation: SATA (via ICH9), IDE (via ICH9), and VMware PVSCSI. VirtIO drivers are open-source, freely available for download, and offer significant performance improvements. VirtIO storage controller options include virtio-blk, virtio-scsi, and virtio-scsi-single. The following table presents the storage controller options supported by Proxmox along with driver support status for Windows 2022.
Controller | Hardware (circa) | Windows 2022 Driver Status
---|---|---
ICH9/SATA | Intel I/O Controller Hub 9 (2007) | Native
ICH9/IDE | Intel I/O Controller Hub 9 (2007) | Native
VMware PVSCSI | ESX 4.x and Later (2009) | Native
VIRTIO SCSI | Virtio-scsi (2011) | 3rd Party
VIRTIO SCSI SINGLE | Virtio-scsi (2011) | 3rd Party
VIRTIO BLOCK | Virtio-blk (2007) | 3rd Party
MegaRAID SAS 8708EM2 | PCIe 1.0 <-> SAS/SATA RAID (2007) | Not Available
LSI 53C895A | 32-bit PCI <-> Ultra2 SCSI (2001) | Not Available
LSI 53C810 | 32-bit PCI <-> SCSI (2001) | Not Available
PROXMOX AIO MODES
What is aio=native?
The aio disk parameter selects the method for implementing asynchronous I/O. Asynchronous I/O allows QEMU to issue multiple transfer requests to the hypervisor without serializing QEMU’s centralized scheduler.
AIO is also the name of a Linux system interface for performing asynchronous I/O, introduced in Linux 2.6. Setting aio=native in Proxmox instructs the system to use the Linux AIO interface for managing asynchronous I/O.
In the Linux AIO model, submission and completion operations are system calls. The primary issue with the Linux AIO implementation is that it can block in a variety of circumstances (e.g., buffered I/O, high queue depths, dm devices).
AIO blocks if anything in the I/O submission path is unable to complete inline. However, when used with raw block devices and caching disabled, AIO will not block. Therefore, it is a good choice for network-attached block storage.
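In Proxmox, the AIO mode is set per disk. A minimal sketch of a VM config line using Linux AIO (the storage and volume names are hypothetical; cache=none preserves the O_DIRECT behavior discussed above):
# /etc/pve/qemu-server/<vmid>.conf excerpt: Linux AIO with caching disabled (O_DIRECT)
scsi1: blockbridge:vm-100-disk-1,aio=native,cache=none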
What is aio=io_uring?
io_uring is an alternative Linux system interface for executing concurrent asynchronous I/O. Similar to Linux AIO, it allows QEMU to issue multiple transfer requests without serializing QEMU’s centralized scheduler.
Unlike Linux AIO, io_uring leverages independent shared memory queues for command submission and completion instead of system calls. One architectural goal of io_uring is to minimize latency by reducing system call overhead. However, in our experience, it is often difficult to extract the full performance potential of io_uring in applications like QEMU due to an inherent need to share system resources fairly (i.e., busy polling isn’t a good option).
io_uring has one significant architectural advantage compared to AIO: it is guaranteed not to block. However, this does not necessarily result in operational gains, since Linux AIO is also non-blocking when used with O_DIRECT raw block storage.
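The difference in submission paths is observable from the host. A rough sketch using strace against the VM's QEMU process (the PID file path follows Proxmox convention and the VM ID is hypothetical; strace adds overhead of its own and is unsuitable for the efficiency measurements described below):
# attach to the VM's QEMU process and count syscalls; interrupt with Ctrl-C to print the summary.
# aio=native shows io_submit/io_getevents pairs; aio=io_uring shows io_uring_enter,
# which can batch submission and completion into a single call
strace -c -f -p $(cat /var/run/qemu-server/100.pid)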
What is aio=threads?
The aio=threads model provides concurrent I/O execution using traditional blocking system calls. QEMU actively manages the pool of userland threads to which it dispatches I/O requests. The aio=threads model predates the aio=native and aio=io_uring options.
Although aio=threads is a legacy technology, there are still some exceptional cases where it offers performance benefits, particularly on idle systems.
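For completeness, a sketch of selecting the thread-pool model on the same hypothetical disk:
# hypothetical VM ID and volume; I/O is dispatched to QEMU's userland thread pool
qm set 100 --scsi1 blockbridge:vm-100-disk-1,aio=threads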
PROXMOX I/O OFFLOAD
What are IOThreads?
When a VM executes an asynchronous disk I/O operation, it issues a request to the hypervisor and waits for an event indicating completion. By default, this happens in the QEMU “main loop.” An IOThread provides a dedicated event loop, operating in a separate thread, for handling I/O. IOThreads offload work from the “main loop” into a separate thread that can execute concurrently.
There are several claimed advantages of IOThreads, including decreased latency, reduced contention, and improved scalability. However, nothing is free. IOThreads are resources that the hypervisor must prioritize and execute.
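As a rough sketch (hypothetical VM ID and volume name), enabling a per-disk IOThread in Proxmox looks like:
# dedicate a separate event loop thread to scsi1
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi1 blockbridge:vm-100-disk-1,iothread=1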
Proxmox requires the virtio-scsi-single SCSI controller to enable IOThreads for SCSI disks.
EXPLORING EFFICIENCY
Each combination of guest OS, storage controller, device driver, AIO mode, workload, and storage type exhibits unique performance characteristics. The interplay between these components significantly influences storage performance.
A precise method for comparing configurations involves quantifying the work required for a system to complete a specific task. Metrics such as operating system context switches and CPU consumption provide valuable insights into the efficiency of system operations. Not only are these metrics relatively simple to measure, but they also provide straightforward points of comparison.
What is a Context Switch?
A context switch is an operation that saves the state of a running process and restores the state of another process to resume execution. It is a foundational operating system scheduling primitive that permits multitasking.
Context switches occur for a variety of reasons. For example, an operating system scheduler may force a context switch when a process has exceeded its scheduling quantum (i.e., for fairness). Alternatively, the operating system may schedule a process for execution when a blocking I/O dependency has been resolved, such as a disk operation.
Performance Implications of Context Switches
Context switches have become relatively cheap from an execution perspective. As an efficiency measure, we’re less concerned about the execution overhead of the context switch itself and more concerned with the latency implications related to queuing theory. Specifically, when a context switch occurs, any number of runnable processes may execute before the switched-out process gets a chance to run again.
In real-world systems, the number of context switches needed to perform an I/O operation and the number of runnable tasks correlate with I/O latency. Accordingly, we should expect configurations that result in more context switches to exhibit higher latency on loaded systems. Our preference is for consistent low latency. Therefore, lower context switch rates are better.
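On the host, context-switch rates for a specific VM are easy to sample. A sketch using pidstat from the sysstat package (the PID file path follows Proxmox convention; the VM ID is hypothetical):
# per-thread voluntary (cswch/s) and involuntary (nvcswch/s) context switches,
# sampled once per second for 60 seconds
pidstat -w -t -p $(cat /var/run/qemu-server/100.pid) 1 60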
What is a CPU cycle?
A CPU cycle is a single unit of CPU time corresponding to the inverse of the CPU clock rate. On modern processors, simple arithmetic operations on registers can complete within a single CPU cycle, while instructions with memory operands can wait hundreds of CPU cycles while accessing memory. The key takeaway is that a CPU cycle is a unit of time, not a unit of work.
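This distinction is easy to see with hardware counters: counting cycles alongside retired instructions shows how much work each cycle actually buys. A sketch using perf (hypothetical VM ID; the PID file path follows Proxmox convention):
# instructions-per-cycle (IPC) for the VM's QEMU process over a 10-second window
perf stat -e cycles,instructions -p $(cat /var/run/qemu-server/100.pid) -- sleep 10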
Understanding System Vs. Userland Cycles
A Proxmox guest is a QEMU process that operates in userland. QEMU storage controllers, device virtualization, and guest device drivers consume userland CPU cycles. When QEMU performs I/O on behalf of the guest, it communicates with the hypervisor’s kernel using system calls. Any kernel logic executed while performing an I/O (i.e., block scheduling, iSCSI, etc.) consumes system CPU cycles. The core difference between system and userland CPU cycles is the privilege level of the CPU. The diagram below provides a rough illustration of the I/O processing stack for an iSCSI block device.
USER GUEST
│ │ ┌─────────────┐
│ │ │ DEVICE │
│ │ │ DRIVER │
│ ▼ └┬────────────┘
│ QEMU │
│ │ ┌▼────────────┐ ┌─────────┐
│ │ │ VIRTUAL ├───► ASYNC │
│ │ │ DEVICE │ │ I/O │
▼ ▼ └─────────────┘ └┬────────┘
SYSTEM KERNEL │
│ │ ┌▼────────┐ ┌─────────┐
│ │ │ BLOCK ├──► SCHED │
│ │ │ LAYER │ │ │
│ │ └─────────┘ └┬────────┘
│ │ │
│ │ ┌▼────────┐ ┌─────────┐
│ │ │ ISCSI ├──► ISCSI │
│ │ │ CORE │ │ TCP │
▼ ▼ └─────────┘ └─────────┘
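The user/kernel split illustrated above can be measured directly with perf event modifiers (same hypothetical VM ID and PID file convention as the earlier sketches):
# cycles:u counts userland work (QEMU, virtual device emulation);
# cycles:k counts kernel work (block layer, iSCSI initiator, TCP)
perf stat -e cycles:u,cycles:k -p $(cat /var/run/qemu-server/100.pid) -- sleep 10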
TEST ENVIRONMENT
Description
Proxmox 8.0.3 (kernel version 6.2.16-3-pve) is installed on a SuperMicro H13 server containing an AMD Epyc Zen4 9554P 64-Core Processor, 768GiB of 4800MT/s DDR5 RAM (NPS=4), and a Mellanox 200Gbit network adapter. The Mellanox adapter is a x16 Gen4 device with a maximum throughput of 200Gbit/s. The server is running with default settings and hyperthreads enabled.
A single virtual machine is provisioned on the host and installed with Windows 2022 Build 20348.fe_release.210507-1500. The VM has two virtual CPUs, 16GB of RAM, and a boot block device containing the root filesystem that is separate from the storage under test.
The virtual machine has a VCPU affinity profile that constrains QEMU execution to CCX 0 (i.e., physical CPUs 0-7). Similarly, NIC interrupts and relevant kernel worker threads were affinitized to the first CCX. System and hardware profile measurements were constrained to CCX 0.
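For reference, a simplified sketch of this style of pinning (the Proxmox affinity option and the IRQ sysfs path are standard; the VM ID, IRQ number, and CPU range shown are illustrative):
# pin the VM's vCPUs and QEMU worker threads to CCX 0 (physical CPUs 0-7)
qm set 100 --affinity 0-7
# steer a NIC queue interrupt onto the same CCX (repeat per queue; IRQ number is a placeholder)
echo 0-7 > /proc/irq/200/smp_affinity_list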
We executed a consistent I/O workload for each tested configuration and collected efficiency data. Using system and hardware profiling tools, we collected context switches, system time, user time, IOPS, bandwidth, and latency data. Each data point collected corresponds to a 10-minute I/O test. We executed multiple runs for each data point to validate consistency.
An external rate limiter, built into Blockbridge, was used to ensure each tested configuration sustained the same level of performance. Initial attempts to rate limit storage using QEMU's built-in features were found to be extraordinarily CPU intensive, to vary based on configuration, and to unfairly bias the efficiency measurements. Our external rate-limiting approach results in more stable workloads and ensures that efficiency measurements do not include rate-limiting overhead.
All testing was performed with fio 3.25-x64 (for Windows) against raw devices using physical device paths as follows:
C:\Users\Administrator\Downloads\fio-3.25-x64-windows.exe <test-config> --filename=\\.\PHYSICALDRIVE1
Sample configurations used for testing appear below:
[iops]
bs=512
rw=randread
iodepth=128
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0
[bandwidth]
bs=1M
rw=randread
iodepth=32
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0
[latency]
bs=512
rw=read
iodepth=1
direct=1
time_based=1
runtime=60000
numjobs=1
thread=1
ioengine=windowsaio
cpus_allowed=0
Network Diagram
/─────────────────────────────┐ /─────────────────────┐
│ │ /────────────────────┐ │ │
│ ┌──────┐ PROXMOX 8.0.3 │── ─┤ 200G MELLANOX 200G ├─ ──┤ BLOCKBRIDGE 6.X │
│ | WIN | 200G SINGLE PORT │ └────────────────────/ │ QUAD ENGINE │
│ │ 2022 │ X16 GEN4 │ │ 2X 200G DUAL PORT │
│ └──────┘ ZEN4 9554P │ │ ZEN4 9554P │
│ │ │ │
└─────────────────────────────/ └─────────────────────/
Software
Proxmox Version
# pveversion
pve-manager/8.0.3/bbf3993334bfa916 (running kernel: 6.2.16-3-pve)
Linux Kernel Options
BOOT_IMAGE=/boot/vmlinuz-6.2.16-3-pve root=/dev/mapper/pve-root ro quiet amd_iommu=disable iommu=pt
Blockbridge Version
version: 6.0.2
release: 6802.1
branch: production-6.0
timestamp: Jun 28 2023 21:24:39
Hardware And Networking
Server Platform
System Information
Manufacturer: Supermicro
Product Name: AS-1115CS-TNR
Processor
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9554P 64-Core Processor
Stepping: 1
CPU MHz: 3748.399
CPU max MHz: 3100.0000
CPU min MHz: 1500.0000
BogoMIPS: 6199.91
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-15,64-79
NUMA node1 CPU(s): 16-31,80-95
NUMA node2 CPU(s): 32-47,96-111
NUMA node3 CPU(s): 48-63,112-127
Network Adapter
Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
Subsystem: Mellanox Technologies Device 0007
Physical Slot: 1
Flags: bus master, fast devsel, latency 0, IRQ 2068, NUMA node 3
Memory at 303dbc000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at f8500000 [disabled] [size=1M]
Capabilities: [60] Express Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 512 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest
Capabilities: [48] Vital Product Data
Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Capabilities: [100] Advanced Error Reporting
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [1c0] #19
Capabilities: [320] #27
Capabilities: [370] #26
Capabilities: [420] #25
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
Network Adapter PCI Connectivity
[ 5.436861] mlx5_core 0000:01:00.0: firmware version: 20.37.1014
[ 5.436885] mlx5_core 0000:01:00.0: 252.048 Gb/s available PCIe bandwidth (16 GT/s x16 link)
[ 5.644051] mlx5_core 0000:01:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 5.644053] mlx5_core 0000:01:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 5.650970] mlx5_core 0000:01:00.0: mlx5_pcie_event:296:(pid 920): PCIe slot advertised sufficient power (75W).
[ 5.687350] mlx5_core 0000:c1:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
Network Adapter Link
Supported ports: [ Backplane ]
Supported link modes: 1000baseT/Full
1000baseKX/Full
10000baseT/Full
10000baseKR/Full
40000baseKR4/Full
40000baseCR4/Full
40000baseSR4/Full
40000baseLR4/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
50000baseCR2/Full
50000baseKR2/Full
100000baseKR4/Full
100000baseSR4/Full
100000baseCR4/Full
100000baseLR4_ER4/Full
50000baseSR2/Full
1000baseX/Full
10000baseCR/Full
10000baseSR/Full
10000baseLR/Full
10000baseER/Full
50000baseKR/Full
50000baseSR/Full
50000baseCR/Full
50000baseLR_ER_FR/Full
50000baseDR/Full
100000baseKR2/Full
100000baseSR2/Full
100000baseCR2/Full
100000baseLR2_ER2_FR2/Full
100000baseDR2/Full
200000baseKR4/Full
200000baseSR4/Full
200000baseLR4_ER4_FR4/Full
200000baseDR4/Full
200000baseCR4/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 1000baseT/Full
1000baseKX/Full
10000baseT/Full
10000baseKR/Full
40000baseKR4/Full
40000baseCR4/Full
40000baseSR4/Full
40000baseLR4/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
50000baseCR2/Full
50000baseKR2/Full
100000baseKR4/Full
100000baseSR4/Full
100000baseCR4/Full
100000baseLR4_ER4/Full
50000baseSR2/Full
1000baseX/Full
10000baseCR/Full
10000baseSR/Full
10000baseLR/Full
10000baseER/Full
50000baseKR/Full
50000baseSR/Full
50000baseCR/Full
50000baseLR_ER_FR/Full
50000baseDR/Full
100000baseKR2/Full
100000baseSR2/Full
100000baseCR2/Full
100000baseLR2_ER2_FR2/Full
100000baseDR2/Full
200000baseKR4/Full
200000baseSR4/Full
200000baseLR4_ER4_FR4/Full
200000baseDR4/Full
200000baseCR4/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Link partner advertised link modes: Not reported
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 200000Mb/s
Duplex: Full
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000004 (4)
Network Adapter Interrupt Coalesce Settings
Adaptive RX: on TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a
rx-usecs: 8
rx-frames: 128
rx-usecs-irq: n/a
rx-frames-irq: n/a
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: n/a
tx-frames-irq: n/a
CQE mode RX: on TX: off
ADDITIONAL RESOURCES
- Blockbridge // Proxmox Overview
- Blockbridge // Proxmox Storage Guide
- Blockbridge // Optimizing Proxmox: iothreads, aio, & io_uring
- Blockbridge // Proxmox & ESXi Performance Comparison
- Blockbridge // Low Latency Storage Optimizations For Proxmox, KVM, & QEMU
- Blockbridge // Optimizing Proxmox Storage for Windows