Storage latency limits the performance of applications that rely on databases such as MySQL, PostgreSQL, and MariaDB. This technote describes how to optimize I/O latency in a performance-critical virtual environment consisting of KVM, QEMU, and Proxmox. Using a step-by-step approach, we explore essential tuning concepts and quantify the effects of configuration changes across a range of block sizes using a QD1 workload. Through tuning, we demonstrate how to reduce latency by up to 40% and increase QD1 IOPS by 65%.

Tests were conducted using Proxmox 7.3 on a 16-core AMD Ryzen 9 5950X processor with Mellanox 25-gigabit networking in a production customer hosting environment. The storage system under test is a DELL-NVME48-ZEN3 running Blockbridge 6. The network storage protocol is NVMe/TCP. For each configuration change, we used fio to measure QD1 latency. Each data point collected represents the average performance over a 10-minute interval following a 5-minute warm-up.

Optimizations are limited to tunable software and hardware parameters, do not involve third-party drivers or software modification, and are suitable for a production environment.


SUMMARY

Tuning Reduces Latency

You can run applications that require high availability and low latency on commodity hardware using Proxmox. A guest can achieve QD1 I/O latency within roughly 10 microseconds of bare metal by optimizing both the host and guest.

The chart below compares non-optimized guest latency with optimized guest latency and includes optimized bare-metal latency as a reference. The data shows that a 40% reduction in QD1 latency is achievable through system tuning.

HARDWARE CONCEPTS

Performance optimization requires an understanding of your system’s processor and memory layout. High-performance packet processing, message passing, and inter-thread synchronization depend on cache-to-cache and memory latencies. The following sections cover the essential concepts needed to understand your hardware.

NUMA Topology

Start by evaluating the system’s NUMA topology. It is important to constrain performance-critical workloads to a single NUMA node to minimize memory latency. Since we’re dealing with network-attached storage, it makes sense to identify which NUMA node the NIC connects to and the associated set of NUMA-local CPUs. This information is conveniently available in sysfs.

# The NUMA node that the NIC is connected to:
root@host:~# cat /sys/class/net/enp45s0f0np0/device/numa_node
-1

# The CPUs that are local to the NIC's NUMA node:
root@host:~# cat /sys/class/net/enp45s0f0np0/device/local_cpulist
0-31

The information above indicates that our CPU has uniform memory access (numa_node is -1) and that all logical CPUs are an equal distance from RAM (every CPU is a local CPU).
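You can cross-check this with lscpu, which reports the NUMA node count and the CPUs local to each node:

root@host:~# lscpu | grep NUMA
NUMA node(s):                    1
NUMA node0 CPU(s):               0-31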

Processor Topology

With most modern CPUs, optimization requires an understanding of the processor’s internal architecture. The logical block diagram below shows an AMD Ryzen 5950X processor. Notice that its 16 cores are distributed across two chiplets.

            ┌────────────────────────────┐
            │   CORE CHIPLET DIE (CCD-0) │
            │ ┌────────────────────────┐ │
            │ │                        │ │           ┌──────────────────────────────┐
            │ │   CORE COMPLEX (CCX)   │ │           │  I/O CONTROLLER DIE (cIOD)   │
            │ │   8 CORE / 16 THREAD   │ │           │ ┌──────────┐  ┌────────────┐ │    ┌────────┐
            │ │     32MB SHARED L3     ├───────────────┤          │  │            │ │    │        │
            │ │                        ├───────────────┤          │──│   MEMORY   ├──────┤  DDR4  │
            │ └────────────────────────┘ │           │ │          │──│ CONTROLLER ├──────┤  DRAM  │
            └────────────────────────────┘           │ │          │  │            │ │    │        │
                                                     │ │ INFINITY │  └────────────┘ │    └────────┘
            ┌────────────────────────────┐           │ │  FABRIC  │  ┌────────────┐ │    ┌────────┐
            │   CORE CHIPLET DIE (CCD-1) │           │ │          │  │            │ │    │        │
            │ ┌────────────────────────┐ │           │ │          │──│    I/O     ├──────┤  25Gb  │
            │ │                        ├───────────────┤          │──│ CONTROLLER ├──────┤  NIC   │
            │ │   CORE COMPLEX (CCX)   ├───────────────┤          │  │            │ │    │        │
            │ │   8 CORE / 16 THREAD   │ │           │ └──────────┘  └────────────┘ │    └────────┘
            │ │     32MB SHARED L3     │ │           └──────────────────────────────┘
            │ │                        │ │
            │ └────────────────────────┘ │
            └────────────────────────────┘

Each core within a chiplet has uniform access to devices and main memory. However, core-to-core communication between chiplets is more expensive than within a chiplet. The benchmark results below demonstrate the penalty using a test that measures synchronization latency between threads pinned to different cores. Results are given in nanoseconds.

CPU: AMD Ryzen 9 5950X 16-Core Processor
Num cores: 16 (hyperthreads disabled)
Num iterations per samples: 10000
Num samples: 300

Single-writer single-reader latency on two shared cache lines

           0    1    2    3    4    5    6    7    8    9   10   11   12   13   14    15
      0    -
      1   45    -
      2   42   44    -
      3   46   49   46    -
      4   42   45   44   46    -
      5   46   49   47   49   46    -
      6   41   45   43   45   43   45    -
      7   46   49   47   49   46   49   47    -
      8  186  187  186  187  186  188  186  193    -
      9  187  192  187  195  187  195  188  195   45    -
     10  186  187  186  187  186  188  187  193   45   46    -
     11  187  194  187  195  188  195  188  195   47   49   47    -
     12  186  187  187  187  186  188  186  192   44   48   45   48    -
     13  187  195  187  195  189  196  191  196   47   50   48   50   46    -
     14  187  187  187  187  187  187  187  189   43   47   45   47   44   47    -
     15  187  192  186  195  187  195  187  195   46   49   47   49   46   50   48    -

     Min  latency: 41.5ns ±0.0 cores: (6,0)
     Max  latency: 195.5ns ±0.0 cores: (13,7)
     Mean latency: 122.6ns

The data above shows sub-50ns latencies when communicating between cores on the same chiplet. Latencies rise to over 190ns when the cores are on different chiplets. That’s a 4.7x penalty for core-to-core synchronization across chiplets. Therefore, peak performance for our I/O benchmarks will be achieved by constraining our workload to either one of the chiplets (i.e., physical cores 0-7 or 8-15). For the sake of simplicity, we’ll make use of the first chiplet (i.e., cores 0-7).
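If you’re unsure where the chiplet boundary falls on your system, the set of CPUs sharing an L3 cache (i.e., one CCX) is listed in sysfs:

# CPUs that share an L3 cache with CPU 0; on the 5950X this should list the
# first chiplet's logical CPUs (cores 0-7 plus their hyperthread siblings).
root@host:~# cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list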

BASELINE PERFORMANCE

Optimized Bare-Metal Latency

Measuring optimized bare-metal latency establishes the best-case performance achievable on the host without virtualization. Use it to establish a lower bound on VM latency since the guest can’t outperform the host.

Non-optimized Guest Latency

Non-optimized guest latency establishes the performance of a non-optimized guest operating on a non-optimized host. This metric represents the default performance of the system.

TUNING PROCEDURE

Virtual I/O processing involves synchronized communication between the guest’s virtual storage controller, QEMU’s I/O processing logic, and the storage device. In the case of network-attached storage, the “storage device” is a NIC. To achieve best-case I/O latency, our optimization efforts will focus on:

  • The physical CPU handling the NVMe/TCP device (i.e., NIC) interrupts
  • The physical CPU running QEMU’s I/O logic
  • The physical CPU running the guest’s VCPU

QEMU IOThreads

Minimum latency requires that your guest VM uses an IOThread to offload I/O processing. With Proxmox, you must use the virtio-scsi-single storage controller. Our testing shows that aio=native and aio=io_uring offer comparable overall performance. Our recommendation is to use aio=native where possible, based on code maturity.

You can verify that your virtual machine is configured correctly by reviewing the configuration using the Proxmox shell. In the example below, the storage pool name is bb-nvme and the VMID is 101. There are two disks: disk-0 and disk-1.

root@host# qm config 101 | grep scsi
boot: order=scsi0
scsi0: bb-nvme:vm-101-disk-0,aio=native,iothread=1,size=80G
scsi1: bb-nvme:vm-101-disk-1,aio=native,iothread=1,size=16G
scsihw: virtio-scsi-single

If your virtual machine is not configured for IOThreads, use the qm set command to update the guest configuration. You will need to stop and start the guest for the changes to fully take effect.

# example: configuring virtio-scsi-single, aio=native, and iothreads
root@host# qm set 101 --scsihw virtio-scsi-single --scsi1 bb-nvme:vm-101-disk-1,aio=native,iothread=1
root@host# qm stop 101
root@host# qm start 101

Performance Impact Of QEMU IOThreads

The graph below shows a comparison of performance with and without IOThreads enabled. The performance with IOThreads enabled is shown in blue. Latency improvements range from 12% to 20%.

NIC Interrupt Modulation

Interrupt Modulation (aka Interrupt Coalescing) is a mechanism to reduce the number of interrupts issued to a CPU. When configured, your NIC will delay sending an interrupt in an attempt to batch multiple notifications with a single interrupt. This can reduce CPU utilization and increase throughput, at the expense of latency.

There are two major types of interrupts: receive and transmit. Receive interrupts allow the NIC to notify the operating system that a packet has arrived. Transmit interrupts signal the operating system that packets were transmitted and resources can be reclaimed.

Optimal values for NIC interrupt coalescing are NIC, CPU, and use-case dependent. By default, Mellanox NICs are optimized for balanced performance. Our goal is minimum latency. Therefore, we must ensure that the NIC does not hold on to packets in an attempt to optimize resources.

On our Mellanox ConnectX-4 NIC, we’ll:

  • disable adaptive receive coalescing
  • set the receive coalescing delay to 1

Our system has a dual-port 25Gb NIC configured in an active-active LACP LAG. We’ll need to modify the coalescing settings for both ports.

The example below shows the default settings for one of our ethernet ports.

root@host:~# ethtool -c enp45s0f1np1
Coalesce parameters for enp45s0f1np1:
Adaptive RX: on  TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a

rx-usecs: 8
rx-frames: 32
rx-usecs-irq: n/a
rx-frames-irq: n/a

tx-usecs: 8
tx-frames: 128
tx-usecs-irq: n/a
tx-frames-irq: n/a

The example below shows how to disable adaptive receive modulation and set the receive modulation interval for both ethernet ports.

root@host:~# ethtool -C enp45s0f0np0 adaptive-rx off
root@host:~# ethtool -C enp45s0f0np0 rx-usecs 1
root@host:~# ethtool -C enp45s0f1np1 adaptive-rx off
root@host:~# ethtool -C enp45s0f1np1 rx-usecs 1
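
Note that ethtool coalescing settings do not persist across reboots. One way to reapply them automatically (a sketch, assuming an ifupdown-style configuration such as Proxmox’s /etc/network/interfaces) is to add post-up hooks for each port:

# /etc/network/interfaces (excerpt, example only)
iface enp45s0f0np0 inet manual
    post-up ethtool -C enp45s0f0np0 adaptive-rx off
    post-up ethtool -C enp45s0f0np0 rx-usecs 1

iface enp45s0f1np1 inet manual
    post-up ethtool -C enp45s0f1np1 adaptive-rx off
    post-up ethtool -C enp45s0f1np1 rx-usecs 1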

Performance Impact Of Interrupt Modulation

The graph below shows incremental improvements achieved by adjusting interrupt modulation. Notice that substantial gains occur only when the I/O size exceeds 8KiB: this correlates with our networking MTU of 9000.

Our findings suggest that our NIC employs heuristics to optimize network latency, even when the adaptive algorithms are disabled. When our I/O reply data fits within a single ethernet frame, the NIC sees no benefit in coalescing interrupts (as there’s a long time between successive packets). However, when our I/O reply data takes multiple ethernet frames to transfer, the NIC invokes the timer-based coalescing logic, likely while holding the second frame in hopes of receiving a third.

NIC Interrupt Affinity

Modern NICs implement multiple packet queues to facilitate Receive Side Scaling (i.e., RSS), Flow Steering, QoS, and more. Typically, each packet queue has an associated Message Signaled Interrupt to notify the operating system of packet-related events. By default, a NIC’s packet queues and interrupts are evenly distributed across CPU cores.

Internally, a NIC uses a hash function to associate a packet flow with a packet queue. Linux assigns responsibility for a packet queue to a physical CPU using an interrupt mask. The seemingly random association of flows to queues (caused by the hash function) and the fair distribution of interrupts over available CPUs leads to unpredictable behavior and performance anomalies in latency-sensitive workloads.

To optimize performance, we’ll need to ensure that our NVMe/TCP flow gets associated with a packet queue that routes to the first chiplet, where our guest will be running. A straightforward approach is to modify the interrupt masks of each packet queue.

Our system has a dual-port 25Gb NIC configured in an active-active LACP LAG. For consistency, we’ll need to specify interrupt affinity for both ports. Using the list of NIC interrupts available in sysfs, we can set interrupt affinity dynamically via the /proc filesystem, as shown below.

# Direct all interrupts for Port 0 (interface enp45s0f0np0) to CPU 1
for irq in /sys/class/net/enp45s0f0np0/device/msi_irqs/* ; \
    do echo 2 > /proc/irq/$(basename $irq)/smp_affinity ; \
done

# Direct all interrupts for Port 1 (interface enp45s0f1np1) to CPU 1
for irq in /sys/class/net/enp45s0f1np1/device/msi_irqs/* ; \
    do echo 2 > /proc/irq/$(basename $irq)/smp_affinity ; \
done
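
Be aware that the irqbalance service, if running, periodically redistributes interrupts and will overwrite manually assigned affinity masks. If you manage interrupt affinity by hand, consider disabling it:

root@host:~# systemctl disable --now irqbalance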

Performance Impact of Interrupt Affinity

Interrupt affinity on its own, with this CPU configuration, is not expected to affect performance significantly. The benefits of interrupt affinity are realized only when the guest’s QEMU threads are pinned to the same chiplet that handles the NIC interrupts (i.e., the next section).

The graph below shows the latency effect of our changes to interrupt affinity.

QEMU VCPU Affinity

A virtual machine is a process comprising several threads spawned by QEMU. As previously established, we need these threads to execute on the same chiplet that handles the NIC interrupts. Proxmox does not have a built-in facility to manage CPU affinity that’s flexible enough to pin specific QEMU threads to specific CPUs. However, you can manually administer affinity using basic tools available in the shell. To get and set the CPU affinity, use taskset.

Determine The PID of QEMU Process

You can find the main PID for your VM using qm list.

root@host:~# qm list
  VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
  101  TestUbuntuVM002      running    8192       100.00       200906

To view a list of all threads associated with your VM, use ps. The example below shows the threads for our test VM configuration, which has four VCPUs and IOThreads enabled.

root@host:~# ps -T -p 200906
    PID    SPID TTY          TIME CMD
 200906  200906 ?        00:00:01 kvm
 200906  200907 ?        00:00:00 call_rcu
 200906  200908 ?        00:00:22 kvm
 200906  200909 ?        00:03:22 kvm
 200906  200929 ?        00:09:17 CPU 0/KVM
 200906  200930 ?        00:00:05 CPU 1/KVM
 200906  200931 ?        00:00:02 CPU 2/KVM
 200906  200932 ?        00:00:03 CPU 3/KVM

Setting VCPU Affinity

Set affinity for the VCPU threads of the guest VM using taskset as shown below. We must confine execution to the physical cores of the first chiplet (i.e., CPUs 0-7). You can pin each VCPU thread to a dedicated core for more predictable results (as shown in the example below).

root@host:~# taskset -p --cpu-list 4 200929
pid 200929's current affinity list: 0-31
pid 200929's new affinity list: 4

root@host:~# taskset -p --cpu-list 5 200930
pid 200930's current affinity list: 0-31
pid 200930's new affinity list: 5

root@host:~# taskset -p --cpu-list 6 200931
pid 200931's current affinity list: 0-31
pid 200931's new affinity list: 6

root@host:~# taskset -p --cpu-list 7 200932
pid 200932's current affinity list: 0-31
pid 200932's new affinity list: 7
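
If you prefer to script the pinning, the sketch below (assuming the example PID of 200906 and target cores 4-7) pins each VCPU thread to a dedicated core in order of appearance:

# Pin each "CPU N/KVM" thread of the VM to cores 4-7, one core per VCPU
VMPID=200906 ; CORE=4
for TID in $(ps -T -p $VMPID | awk '/CPU .\/KVM/ {print $2}') ; do
    taskset -p --cpu-list $CORE $TID
    CORE=$((CORE+1))
done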

QEMU IOThread Affinity

When a guest VM executes a disk I/O operation, the guest OS submits a request to the hypervisor and waits for a completion event. By default, the QEMU main loop handles requests and completions. An IOThread provides a dedicated event loop operating in a separate thread that handles I/O. IOThreads offload work from the “main loop” into a separate thread that executes concurrently, which reduces latency.

Previously, we established CPU 1 for NIC interrupt handling. In theory, you can permit the IOThread to float across all the cores on the first chiplet and remain local to the NIC interrupts. However, to reduce scheduler latency, maximize cache efficiency, and further enable micro-optimizations, we’ll bind the IOThread execution to CPU 2.

Determine The Thread ID Of The IOThread

To find the IOThread's thread ID, use the qm monitor command:

root@host:~# qm monitor 101
Entering Qemu Monitor for VM 101 - type 'help' for help
qm> info iothreads
iothread-virtioscsi1:
  thread_id=200909
  poll-max-ns=32768
  poll-grow=0
  poll-shrink=0
  aio-max-batch=0
iothread-virtioscsi0:
  thread_id=200908
  poll-max-ns=32768
  poll-grow=0
  poll-shrink=0
  aio-max-batch=0

Set CPU Affinity Of the IOThread

To get and set the CPU affinity, use taskset.

root@host:~# taskset -p --cpu-list 2 200909
pid 200909's current affinity list: 0-31
pid 200909's new affinity list: 2
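
If latency to the boot disk also matters, its IOThread (thread_id 200908 in the qm monitor output above) can be pinned the same way, for example to another core on the first chiplet:

root@host:~# taskset -p --cpu-list 3 200908
pid 200908's current affinity list: 0-31
pid 200908's new affinity list: 3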

Performance Impact Of VCPU and IOThread Affinity

The graph below shows the combined latency effect of pinning the guest’s VCPUs and IOThread to the same chiplet that handles the NIC interrupts. A consistent improvement across all I/O sizes correlates with reduced inter-core synchronization latency and improved cache locality.

Guest Halt Polling

A significant source of I/O latency in virtual machines can be attributed to delays in detecting completion events. Typically, when a guest VCPU becomes idle or is otherwise blocked, the guest OS hands control over to the hypervisor allowing it to perform other tasks. This context switch results in significant latency for several reasons:

  • After an I/O completes, our VCPU might not be immediately scheduled if other runnable processes are present; I/O latency becomes correlated with the system load.

  • If our VCPU is not actively polling for events, QEMU must send a notification to schedule the VCPU for execution.

  • We lose the benefit of the processor cache if other workloads pollute it or our VCPU gets scheduled on a different physical CPU.

One solution to minimize the “wakeup latency” is to use the cpuidle_haltpoll driver to avoid yielding the CPU altogether. Instead of returning control to the hypervisor when the guest is idle, the driver polls for events for a short period, reducing completion latency at the expense of CPU cycles.

Install The cpuidle-haltpoll Kernel Module

To load the cpuidle-haltpoll module, use modprobe:

root@guest:~# modprobe cpuidle-haltpoll force=1

Confirm The cpuidle-haltpoll Module Is Loaded

Find it in the output of lsmod:

root@guest:~# lsmod | grep cpuidle
cpuidle_haltpoll       16384  0

Enable the Haltpoll CPU Governor

To enable the haltpoll governor, update the CPU’s current_governor in sysfs.

root@guest:~# echo haltpoll > /sys/devices/system/cpu/cpuidle/current_governor
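
Confirm the governor is active by reading the same sysfs attribute:

root@guest:~# cat /sys/devices/system/cpu/cpuidle/current_governor
haltpoll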

Additional Tuning Parameters

Guest Haltpoll Tuning Parameters are available in sysfs. The default parameters are more than suitable for Blockbridge storage.

root@guest:~# ls -l /sys/module/haltpoll/parameters/
total 0
-rw-r--r-- 1 root root 4096 Jan 12 22:28 guest_halt_poll_allow_shrink
-rw-r--r-- 1 root root 4096 Jan 12 22:28 guest_halt_poll_grow
-rw-r--r-- 1 root root 4096 Jan 12 22:28 guest_halt_poll_grow_start
-rw-r--r-- 1 root root 4096 Jan 12 22:28 guest_halt_poll_ns
-rw-r--r-- 1 root root 4096 Jan 12 22:28 guest_halt_poll_shrink
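
If you do want to experiment, the parameters can be adjusted at runtime. For example, a hypothetical increase of the maximum polling window (the value is in nanoseconds):

root@guest:~# echo 300000 > /sys/module/haltpoll/parameters/guest_halt_poll_ns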

Performance Impact Of Guest Haltpolling

The total impact of the haltpoll governor optimizations is shown in the graph below.

Processor C-States

C-states are a power-saving mechanism for CPUs that are idle. The basic C-states (defined by ACPI) are:

  • C0: Active - executing instructions

  • C1: Halt - not executing instructions, but can return to C0 “instantly”

  • C2: StopClock - similar to C1, with a delayed transition to C0

While a processor waits for an I/O completion (i.e., an interrupt), it is often idle. When idle, it can stop executing instructions and shut down internal subsystems to save power, freeing power budget for other purposes; for example, it may boost the frequency of another core that is busy. The transition from an idle state back to an active state carries a measurable penalty known as exit latency.

The exit latency from C1 to C0 is measured in low single-digit microseconds. The exit latency from C2 to C0 is measured in the low tens of microseconds. A table of exit latencies reported by our 5950X is shown below.

C-STATE    DESCRIPTION    EXIT LATENCY
C0         ACTIVE         0 us
C1         HALT           1 us
C2         STOP-CLOCK     18 us
C3         SLEEP          350 us
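
The exit latency the kernel reports for each idle state is available from cpupower idle-info or directly from sysfs:

# Exit latency (in microseconds) for each idle state of CPU 0
root@host:~# cat /sys/devices/system/cpu/cpu0/cpuidle/state*/latency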

Disable C-States For Selected CPUs

We can reduce storage latency by several microseconds by disabling C-States on the IOThread and NIC interrupt CPUs.

# Disable Processor Idle States for CPUs 1 and 2
root@host:~# cpupower -c 1,2 idle-set -d 2
root@host:~# cpupower -c 1,2 idle-set -d 1
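
You can confirm that the states are disabled by reading the per-state disable flag in sysfs (1 means disabled):

root@host:~# cat /sys/devices/system/cpu/cpu1/cpuidle/state1/disable
1
root@host:~# cat /sys/devices/system/cpu/cpu1/cpuidle/state2/disable
1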

Performance Impact Of Processor C-States

The graph below shows the latency effect of our changes to the processor idle states. Incremental latency improvements are limited to about 1 microsecond since these cores are usually busy enough to operate in C1 and C0.

Processor Vulnerability Mitigation

Transient-execution CPU vulnerabilities can be exploited to extract sensitive data in multi-user systems. The vulnerabilities are a byproduct of how modern CPUs achieve high performance. Operating systems implement software-based techniques to mitigate them. The overhead of these mitigations is significant relative to the performance of high-speed storage.

By default, mitigations are enabled. You can check what mitigations are in place for your CPU as shown below.

# Show Vulnerabilities with Mitigation
root@host:~# lscpu | grep Mitigation
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, ...

If you are operating in a trusted environment, it may be safe to disable mitigations. To do so, add mitigations=off to the Linux kernel command line of the host by adding or modifying the following line in the grub configuration file (/etc/default/grub), then update your grub configuration (update-grub on Debian-based systems, including Proxmox). A reboot is required for the changes to take effect. If successful, your CPU vulnerabilities will show as vulnerable.
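
# /etc/default/grub (example; "quiet" matches the existing kernel options shown in the Software section)
GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"

root@host:~# update-grub
root@host:~# reboot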

# Show Vulnerabilities without Mitigations
root@host:~# lscpu | grep Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; ...
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected

Performance Impact Of Vulnerability Mitigation

QEMU uses syscalls to interact with the Linux kernel to perform guest I/O and send event notifications. Mitigations have a measurable negative impact on syscall performance. The data below shows that latency improvements of up to 2.8 microseconds are possible.

ENVIRONMENT

Network Diagram


           ┌──────────────────────────┐                                        ┌─────────────────────┐
           │                          |             ┌──────────────────┐       |                     │
           │  ┌────┐   PROXMOX 7.3    │── NVME/TCP ─┤ 25G SN3700C 100G ├───────┤  BLOCKBRIDGE 6.X    │
           │  |    |   25G DUAL PORT  │             └──────────────────┘       │  QUAD ENGINE        │
           │  │ VM │   X8 GEN3        │             ┌──────────────────┐       │  2X 100G DUAL PORT  │
           │  └────┘   16 CORE RYZEN  |── NVME/TCP ─┤ 25G SN3700C 100G ├───────┤  4M IOPS / 25 GB/s  │
           |                          |             └──────────────────┘       |                     |
           └──────────────────────────┘                                        └─────────────────────┘

Description

Proxmox 7.3 (kernel version 5.15.83-1-pve) is installed on an ASRockRack 1U4LW-X570 with an AMD Ryzen 5950X 16-Core Processor, 128GB of RAM, and a single Mellanox dual-port 25Gbit network adapter. The Mellanox adapter is an x8 Gen3 device with a maximum throughput of 63Gbit/s. The server is running with default settings and hyperthreads enabled.

The Proxmox host connects to a redundant pair of Mellanox 100G SN3700C switches using an Active-Active LACP LAG. While the Blockbridge storage is 100G connected, the port speed of the host limits performance to 25Gbit. The network MTU is 9000.

A single virtual machine is provisioned on the host. The VM is installed with Ubuntu 23.04, running Linux kernel version 5.19.0-21-generic. The VM has four virtual CPUs and 8GB of RAM. The VM has a boot block device containing the root filesystem separate from the storage under test.

A read-only workload is executed that fits within the encrypted data cache of the storage system to ensure consistency and repeatability of the results. QD1 tests are executed for seven block sizes. Each test consists of a 5-minute warmup followed by a 10-minute measurement period. A sample workload description appears below:

$ cat read-bs4096-qd1.fio
[global]
rw=read
bs=4096
iodepth=1
direct=1
ioengine=libaio
time_based=1
runtime=600
ramp_time=300
numjobs=1
cpus_allowed=0

[device]
filename=/dev/sdb
size=1G

Software

Proxmox Version

# pveversion
pve-manager/7.3-4/d69b70d4 (running kernel: 5.15.83-1-pve)

Linux Kernel Options

BOOT_IMAGE=/boot/vmlinuz-5.15.83-1-pve root=/dev/mapper/pve-root ro quiet

Blockbridge Version

version:   6.0.0
release:   6712.2
build:     4102

Hardware And Networking

Server Platform

System Information
	Manufacturer: ASRockRack
	Product Name: 1U4LW-X570 RPSU

Processor

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           33
Model name:                      AMD Ryzen 9 5950X 16-Core Processor
Stepping:                        2
Frequency boost:                 enabled
CPU MHz:                         3400.000
CPU max MHz:                     5083.3979
CPU min MHz:                     2200.0000
BogoMIPS:                        6787.10
Virtualization:                  AMD-V
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        8 MiB
L3 cache:                        64 MiB
NUMA node0 CPU(s):               0-31

Network Adapter

Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
  Subsystem: Mellanox Technologies Stand-up ConnectX-4 Lx EN, 25GbE dual-port SFP28, PCIe3.0 x8, MCX4121A-ACAT
  Flags: bus master, fast devsel, latency 0, IRQ 121, IOMMU group 25
  Memory at c0000000 (64-bit, prefetchable) [size=32M]
  Expansion ROM at fcd00000 [disabled] [size=1M]
  Capabilities: [60] Express Endpoint, MSI 00
  Capabilities: [48] Vital Product Data
  Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
  Capabilities: [c0] Vendor Specific Information: Len=18 <?>
  Capabilities: [40] Power Management version 3
  Capabilities: [100] Advanced Error Reporting
  Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
  Capabilities: [180] Single Root I/O Virtualization (SR-IOV)
  Capabilities: [1c0] Secondary PCI Express
  Capabilities: [230] Access Control Services
  Kernel driver in use: mlx5_core
  Kernel modules: mlx5_core

Network Adapter PCI Connectivity

[    2.453955] mlx5_core 0000:2d:00.0: firmware version: 14.31.1014
[    2.453985] mlx5_core 0000:2d:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[    2.740344] mlx5_core 0000:2d:00.0: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
[    2.743839] mlx5_core 0000:2d:00.0: Port module event: module 0, Cable plugged
[    3.079562] mlx5_core 0000:2d:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[    3.325403] mlx5_core 0000:2d:00.0: Supported tc offload range - chains: 4294967294, prios: 429496729

Network Adapter Link Settings

Settings for enp45s0f0np0:
  Supported ports: [ Backplane ]
  Supported pause frame use: Symmetric
  Supports auto-negotiation: Yes
  Advertised pause frame use: Symmetric
  Advertised auto-negotiation: Yes
  Advertised FEC modes: None	 RS	 BASER
  Link partner advertised link modes:  Not reported
  Link partner advertised pause frame use: No
  Link partner advertised auto-negotiation: Yes
  Link partner advertised FEC modes: Not reported
  Speed: 25000Mb/s
  Duplex: Full
  Auto-negotiation: on
  Port: Direct Attach Copper
  PHYAD: 0
  Transceiver: internal
  Supports Wake-on: d
  Wake-on: d
      Current message level: 0x00000004 (4)
                             link
  Link detected: yes

Network Adapter Interrupt Coalesce Settings

Adaptive RX: on  TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a

rx-usecs: 8
rx-frames: 128
rx-usecs-irq: n/a
rx-frames-irq: n/a

tx-usecs: 8
tx-frames: 128
tx-usecs-irq: n/a
tx-frames-irq: n/a

ADDITIONAL RESOURCES