Storage latency limits the performance of applications that rely on databases such as MySQL, PostgreSQL, and MariaDB. This technote describes how to optimize I/O latency in a performance-critical virtual environment consisting of KVM, QEMU, and Proxmox. Using a step-by-step approach, we explore essential tuning concepts and quantify the effects of configuration changes across a range of block sizes using a QD1 workload. Through tuning, we demonstrate how to reduce latency by up to 40% and increase QD1 IOPS by 65%.
Tests were conducted using Proxmox 7.3 on a 16-core AMD Ryzen 5950X processor with Mellanox 25-gigabit networking in a production customer hosting environment. The storage system under test is a DELL-NVME48-ZEN3 running Blockbridge 6. The network storage protocol is NVMe/TCP. For each configuration change, we used fio to measure QD1 latency. Each data point collected represents the average performance over a 10-minute interval following a 5-minute warm-up.
Optimizations are limited to tunable software and hardware parameters, do not involve third-party drivers or software modifications, and are fit for a production environment.
SUMMARY
Tuning Reduces Latency
You can run applications that require high availability and low latency on commodity hardware using Proxmox. A guest can achieve QD1 I/O latency within roughly 10 microseconds of bare metal by optimizing both the host and guest.
The chart below compares non-optimized guest latency with optimized guest latency and includes optimized bare-metal latency as a reference. The data shows that a 40% reduction in QD1 latency is achievable through system tuning.
HARDWARE CONCEPTS
Performance optimization requires an understanding of your system’s processor and memory layout. High-performance packet processing, message passing, and inter-thread synchronization depend on cache-to-cache and memory latencies. The following sections cover the essential concepts needed to understand your hardware.
NUMA Topology
Start by evaluating the system’s NUMA topology. It is important to constrain performance critical workloads to a single NUMA node to minimize memory latency. Since we’re dealing with network-attached storage, it makes sense to identify which NUMA node the NIC connects to and the associated set of NUMA-local CPUs. This information is conveniently available in sysfs.
# The NUMA node that the NIC is connected to:
root@host:~# cat /sys/class/net/enp45s0f0np0/device/numa_node
-1
# The CPUs that are local to the NIC's NUMA node:
root@host:~# cat /sys/class/net/enp45s0f0np0/device/local_cpulist
0-31
The information above indicates that our CPU has uniform memory access (numa_node is -1) and that all logical CPUs are an equal distance from RAM (every CPU is a local CPU).
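For a consolidated view of the topology (node count, the CPUs in each node, and the inter-node distance matrix), numactl provides a single command; on a uniform-memory system like ours, it reports one node containing all 32 logical CPUs.
# Summarize the NUMA topology (nodes, per-node CPUs, distance matrix)
root@host:~# numactl --hardware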
Processor Topology
With most modern CPUs, optimization requires an understanding of the processor’s internal architecture. The logical block diagram below shows an AMD Ryzen 5950X processor. Notice that it has cores distributed across two chiplets.
┌────────────────────────────┐
│ CORE CHIPLET DIE (CCD-0) │
│ ┌────────────────────────┐ │
│ │ │ │ ┌──────────────────────────────┐
│ │ CORE COMPLEX (CCX) │ │ │ I/O CONTROLLER DIE (cIOD) │
│ │ 8 CORE / 16 THREAD │ │ │ ┌──────────┐ ┌────────────┐ │ ┌────────┐
│ │ 32MB SHARED L3 ├───────────────┤ │ │ │ │ │ │
│ │ ├───────────────┤ │──│ MEMORY ├──────┤ DDR4 │
│ └────────────────────────┘ │ │ │ │──│ CONTROLLER ├──────┤ DRAM │
└────────────────────────────┘ │ │ │ | │ │ | |
│ │ INFINITY │ └────────────┘ │ └────────┘
┌────────────────────────────┐ │ │ FABRIC │ ┌────────────┐ │ ┌────────┐
│ CORE CHIPLET DIE (CCD-1) │ │ │ │ │ │ │ │ │
│ ┌────────────────────────┐ │ │ │ │──│ I/O ├──────┤ 25Gb │
│ │ ├───────────────┤ │──│ CONTROLLER ├──────┤ NIC │
│ │ CORE COMPLEX (CCX) ├───────────────┤ │ │ │ │ │ │
│ │ 8 CORE / 16 THREAD │ │ │ └──────────┘ └────────────┘ │ └────────┘
│ │ 32MB SHARED L3 │ │ └──────────────────────────────┘
│ │ │ │
│ └────────────────────────┘ │
└────────────────────────────┘
Each core within a chiplet has uniform access to devices and main memory. However, core-to-core communication between chiplets is more expensive than within a chiplet. The benchmark results below demonstrate the penalty using a test that measures synchronization latency between threads pinned to different cores. Results are given in nanoseconds.
CPU: AMD Ryzen 9 5950X 16-Core Processor
Num cores: 16 (hyperthreads disabled)
Num iterations per samples: 10000
Num samples: 300
Single-writer single-reader latency on two shared cache lines
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 -
1 45 -
2 42 44 -
3 46 49 46 -
4 42 45 44 46 -
5 46 49 47 49 46 -
6 41 45 43 45 43 45 -
7 46 49 47 49 46 49 47 -
8 186 187 186 187 186 188 186 193 -
9 187 192 187 195 187 195 188 195 45 -
10 186 187 186 187 186 188 187 193 45 46 -
11 187 194 187 195 188 195 188 195 47 49 47 -
12 186 187 187 187 186 188 186 192 44 48 45 48 -
13 187 195 187 195 189 196 191 196 47 50 48 50 46 -
14 187 187 187 187 187 187 187 189 43 47 45 47 44 47 -
15 187 192 186 195 187 195 187 195 46 49 47 49 46 50 48 -
Min latency: 41.5ns ±0.0 cores: (6,0)
Max latency: 195.5ns ±0.0 cores: (13,7)
Mean latency: 122.6ns
The data above shows sub-50ns latencies when communicating between cores on the same chiplet. Latencies rise to over 190ns when the cores are on different chiplets. That’s a 4.7x penalty for core-to-core synchronization across chiplets. Therefore, peak performance for our I/O benchmarks will be achieved by constraining our workload to either one of the chiplets (i.e., physical cores 0-7 or 8-15). For the sake of simplicity, we’ll make use of the first chiplet (i.e., cores 0-7).
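If you’re unsure where the chiplet boundaries fall on your own CPU, sysfs can tell you: cores in the same CCX share an L3 cache, so grouping logical CPUs by L3 reveals the chiplets. A minimal sketch:
# Group logical CPUs by shared L3 cache; each distinct line is one chiplet
root@host:~# for cpu in /sys/devices/system/cpu/cpu[0-9]* ; \
do cat $cpu/cache/index3/shared_cpu_list ; \
done | sort -u
On our 5950X (with hyperthreads enabled), this prints 0-7,16-23 and 8-15,24-31: sibling hyperthreads share a physical core, so physical cores 0-7 map to the first chiplet.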
BASELINE PERFORMANCE
Optimized Bare-Metal Latency
Measuring optimized bare-metal latency establishes the best-case performance achievable on the host without virtualization. Use it as a lower bound on VM latency, since the guest can’t outperform the host.
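As a sketch of how each data point is collected (the complete job file appears in the Environment section), a QD1 4KiB read measurement against the raw host device looks like the following; the device path /dev/nvme1n1 is illustrative:
# QD1 4KiB read latency on the bare-metal host (device path illustrative)
root@host:~# fio --name=qd1-read --filename=/dev/nvme1n1 --rw=read \
    --bs=4096 --iodepth=1 --direct=1 --ioengine=libaio --size=1G \
    --time_based=1 --ramp_time=300 --runtime=600 --cpus_allowed=0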
Non-optimized Guest Latency
Non-optimized guest latency establishes the performance of a non-optimized guest operating on a non-optimized host. This metric represents the default performance of the system.
TUNING PROCEDURE
Virtual I/O processing involves synchronized communication between the guest’s virtual storage controller, QEMU’s I/O processing logic, and the storage device. In the case of network-attached storage, the “storage device” is a NIC. To achieve best-case I/O latency, our optimization efforts will focus on:
- The physical CPU handling the NVMe/TCP device (i.e., NIC) interrupts
- The physical CPU running QEMU’s I/O logic
- The physical CPU running the guest’s VCPU
QEMU IOThreads
Minimum latency requires that your guest VM uses an IOThread to offload I/O processing. With Proxmox, you must use the virtio-scsi-single storage controller. Our testing shows that aio=native and aio=io_uring offer comparable overall performance. Our recommendation is to use aio=native where possible based on code maturity.
You can verify that your virtual machine is configured correctly by reviewing the configuration using the Proxmox shell. In the example below, the storage pool name is bb-nvme and the VMID is 101. There are two disks: disk-0 and disk-1.
root@host# qm config 101 | grep scsi
boot: order=scsi0
scsi0: bb-nvme:vm-101-disk-0,aio=native,iothread=1,size=80G
scsi1: bb-nvme:vm-101-disk-1,aio=native,iothread=1,size=16G
scsihw: virtio-scsi-single
If your virtual machine is not configured for IOThreads, use the qm set command to update the guest configuration. You will need to stop and start the guest for the changes to fully take effect.
# example: configuring virtio-scsi-single, aio=native, and iothreads
root@host# qm set 101 --scsihw virtio-scsi-single --scsi1 bb-nvme:vm-101-disk-1,aio=native,iothread=1
root@host# qm stop 101
root@host# qm start 101
Performance Impact Of QEMU IOThreads
The graph below shows a comparison of performance with and without IOThreads enabled. The performance with IOThreads enabled is shown in blue. Latency improvements range from 12% to 20%.
NIC Interrupt Modulation
Interrupt Modulation (aka Interrupt Coalescing) is a mechanism to reduce the number of interrupts issued to a CPU. When configured, your NIC will delay sending an interrupt in an attempt to batch multiple notifications with a single interrupt. This can reduce CPU utilization and increase throughput, at the expense of latency.
There are two major types of interrupts: receive and transmit. Receive interrupts allow the NIC to notify the operating system that a packet has arrived. Transmit interrupts signal the operating system that packets were transmitted and resources can be reclaimed.
Optimal values for NIC interrupt coalescing are NIC, CPU, and use-case dependent. By default, Mellanox NICs are optimized for balanced performance. Our goal is minimum latency. Therefore, we must ensure that the NIC does not hold on to packets in an attempt to optimize resources.
On our Mellanox ConnectX-4 NIC, we’ll:
- disable adaptive receive coalescing
- set the receive coalescing delay to 1
Our system has a dual-port 25Gb NIC configured in an active-active LACP LAG. We’ll need to modify the coalescing settings for both ports.
The example below shows the default settings for one of our ethernet ports.
root@host:~# ethtool -c enp45s0f1np1
Coalesce parameters for enp45s0f1np1:
Adaptive RX: on TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a
rx-usecs: 8
rx-frames: 32
rx-usecs-irq: n/a
rx-frames-irq: n/a
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: n/a
tx-frames-irq: n/a
The example below shows how to disable adaptive receive modulation and set the receive modulation interval for both ethernet ports.
root@host:~# ethtool -C enp45s0f0np0 adaptive-rx off
root@host:~# ethtool -C enp45s0f0np0 rx-usecs 1
root@host:~# ethtool -C enp45s0f1np1 adaptive-rx off
root@host:~# ethtool -C enp45s0f1np1 rx-usecs 1
Performance Impact Of Interrupt Modulation
The graph below shows incremental improvements achieved by adjusting interrupt modulation. Notice that substantial gains occur only when the I/O size exceeds 8KiB: this correlates with our networking MTU of 9000.
Our findings suggest that our NIC employs heuristics to optimize network latency, even when the adaptive algorithms are disabled. When our I/O reply data fits within a single ethernet frame, the NIC sees no benefit in coalescing interrupts (as there’s a long time between successive packets). However, when our I/O reply data takes multiple ethernet frames to transfer, the NIC invokes the timer-based coalescing logic, likely while holding the second frame in hopes of receiving a third.
NIC Interrupt Affinity
Modern NICs implement multiple packet queues to facilitate Receive Side Scaling (i.e., RSS), Flow Steering, QoS, and more. Typically, each packet queue has an associated Message Signaled Interrupt to notify the operating system of packet-related events. By default, a NIC’s packet queues and interrupts are evenly distributed across CPU cores.
Internally, a NIC uses a hash function to associate a packet flow with a packet queue. Linux assigns responsibility for a packet queue to a physical CPU using an interrupt mask. The seemingly random association of flows to queues (caused by the hash function) and the fair distribution of interrupts over available CPUs lead to unpredictable behavior and performance anomalies in latency-sensitive workloads.
To optimize performance, we’ll need to ensure that our NVMe/TCP flow gets associated with a packet queue that routes to the first chiplet, where our guest will be running. A straightforward approach is to modify the interrupt masks of each packet queue.
Our system has a dual-port 25Gb NIC configured in an active-active LACP LAG. For consistency, we’ll need to specify interrupt affinity for both ports. Using the list of NIC interrupts available in sysfs, we can set interrupt affinity dynamically via the /proc filesystem, as shown below.
# Direct all interrupts for Port 0 (interface enp45s0f0np0) to CPU 1
# (the mask value 2, i.e., 0x2, selects CPU 1)
for irq in /sys/class/net/enp45s0f0np0/device/msi_irqs/* ; \
do echo 2 > /proc/irq/$(basename $irq)/smp_affinity ; \
done
# Direct all interrupts for Port 1 (interface enp45s0f1np1) to CPU 1
for irq in /sys/class/net/enp45s0f1np1/device/msi_irqs/* ; \
do echo 2 > /proc/irq/$(basename $irq)/smp_affinity ; \
done
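To confirm the changes took effect, read back the kernel’s per-IRQ affinity; smp_affinity_list reports the target CPUs in list form rather than as a hex mask:
# Verify: every IRQ for Port 0 should now report CPU 1
for irq in /sys/class/net/enp45s0f0np0/device/msi_irqs/* ; \
do cat /proc/irq/$(basename $irq)/smp_affinity_list ; \
done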
Performance Impact of Interrupt Affinity
Interrupt affinity on its own, with this CPU configuration, is not expected to affect performance significantly. The benefits of interrupt affinity will be recognized only when the guest’s QEMU threads are pinned to the same chiplet that handles the NIC interrupts (i.e., the next section).
The graph below shows the latency effect of our changes to interrupt affinity.
QEMU VCPU Affinity
A virtual machine is a process composed of several threads spawned by QEMU. As previously established, we need these threads to execute on the same chiplet that handles the NIC interrupts. Proxmox does not have a built-in facility to manage CPU affinity that’s flexible enough to pin specific QEMU threads to specific CPUs. However, you can manually administer affinity using basic tools available in the shell. To get and set the CPU affinity, use taskset.
Determine The PID Of The QEMU Process
You can find the main PID for your VM using qm list.
root@host:~# qm list
VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
101 TestUbuntuVM002 running 8192 100.00 200906
To view a list of all threads associated with your VM, use ps. The example below shows the threads for our test VM configuration, which has four VCPUs and an IOThread.
root@host:~# ps -T -p 200906
PID SPID TTY TIME CMD
200906 200906 ? 00:00:01 kvm
200906 200907 ? 00:00:00 call_rcu
200906 200908 ? 00:00:22 kvm
200906 200909 ? 00:03:22 kvm
200906 200929 ? 00:09:17 CPU 0/KVM
200906 200930 ? 00:00:05 CPU 1/KVM
200906 200931 ? 00:00:02 CPU 2/KVM
200906 200932 ? 00:00:03 CPU 3/KVM
Setting VCPU Affinity
Set affinity for the VCPU threads of the guest VM using taskset as shown below. We must confine execution to the physical cores of the first chiplet (i.e., CPUs 0-7). You can pin each VCPU thread to a dedicated core for more predictable results (as shown in the example below).
root@host:~# taskset -p --cpu-list 4 200929
pid 200929's current affinity list: 0-31
pid 200929's new affinity list: 4
root@host:~# taskset -p --cpu-list 5 200930
pid 200930's current affinity list: 0-31
pid 200930's new affinity list: 5
root@host:~# taskset -p --cpu-list 6 200931
pid 200931's current affinity list: 0-31
pid 200931's new affinity list: 6
root@host:~# taskset -p --cpu-list 7 200932
pid 200932's current affinity list: 0-31
pid 200932's new affinity list: 7
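Thread IDs change every time a VM starts, so you may prefer to script the pinning. The sketch below (not a built-in Proxmox facility; the VMID and base core are assumptions carried over from the examples above) locates each "CPU N/KVM" thread by name and pins VCPU N to core 4+N:
# Pin each VCPU thread of VM 101 to a dedicated core on the first chiplet
VMID=101
BASE=4   # VCPU 0 -> core 4, VCPU 1 -> core 5, ...
PID=$(qm list | awk -v id=$VMID '$1 == id { print $NF }')
ps -T -p $PID -o spid=,comm= | while read -r spid comm; do
  case "$comm" in
    "CPU "*"/KVM")
      n=${comm#CPU }; n=${n%/KVM}
      taskset -p --cpu-list $((BASE + n)) $spid
      ;;
  esac
done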
QEMU IOThread Affinity
When a guest VM executes a disk I/O operation, the guest OS submits a request to the hypervisor and waits for a completion event. By default, the QEMU main loop handles requests and completions. An IOThread provides a dedicated event loop operating in a separate thread that handles I/O. IOThreads offload work from the “main loop” into a separate thread that executes concurrently, which reduces latency.
Previously, we established CPU 1 for NIC interrupt handling. In theory, you can permit the IOThread to float across all the cores on the first chiplet and remain local to the NIC interrupts. However, to reduce scheduler latency, maximize cache efficiency, and further enable micro-optimizations, we’ll bind the IOThread execution to CPU 2.
Determine The PID Of The IOThread
To find the PID, we can use the qm monitor command:
root@host:~# qm monitor 101
Entering Qemu Monitor for VM 101 - type 'help' for help
qm> info iothreads
iothread-virtioscsi1:
thread_id=200909
poll-max-ns=32768
poll-grow=0
poll-shrink=0
aio-max-batch=0
iothread-virtioscsi0:
thread_id=200908
poll-max-ns=32768
poll-grow=0
poll-shrink=0
aio-max-batch=0
Set CPU Affinity Of the IOThread
To get and set the CPU affinity, use taskset.
root@host:~# taskset -p --cpu-list 2 200909
pid 200909's current affinity list: 0-31
pid 200909's new affinity list: 2
Performance Impact Of VCPU and IOThread Affinity
The graph below shows the combined latency effect of pinning the guest’s VCPUs and IOThread to the same chiplet that handles the NIC interrupts. A consistent improvement across all I/O sizes correlates with reduced inter-core synchronization latency and improved cache locality.
Guest Halt Polling
A significant source of I/O latency in virtual machines can be attributed to delays in detecting completion events. Typically, when a guest VCPU becomes idle or is otherwise blocked, the guest OS hands control over to the hypervisor, allowing it to perform other tasks. This context switch results in significant latency for several reasons:
- After an I/O completes, our VCPU might not be immediately scheduled if other runnable processes are present; I/O latency becomes correlated with the system load.
- If our VCPU is not actively polling for events, QEMU must send a notification to schedule the VCPU for execution.
- We lose the benefit of the processor cache if other workloads pollute it or our VCPU gets scheduled on a different physical CPU.
One solution to minimize the “wakeup latency” is to use the cpuidle_haltpoll driver to avoid yielding the CPU altogether. Instead of returning control to the hypervisor when the guest is idle, the driver polls for events for a short period, reducing completion latency at the expense of CPU cycles.
Install The cpuidle-haltpoll Kernel Module
To load the cpuidle-haltpoll module, use modprobe:
root@guest:~# modprobe cpuidle-haltpoll force=1
Note: some distributions do not package the cpuidle_haltpoll driver with the core kernel modules to reduce footprint. For example, you need to install the linux-modules-extra-*-generic package on Ubuntu.
Confirm The cpuidle-haltpoll Module Is Loaded
Find it in the output of lsmod:
root@guest:~# lsmod | grep cpuidle
cpuidle_haltpoll 16384 0
Enable the Haltpoll CPU Governor
To enable the haltpoll governor, update the CPU’s current_governor in sysfs.
root@guest:~# echo haltpoll > /sys/devices/system/cpu/cpuidle/current_governor
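To confirm the change, read the governor back from sysfs:
root@guest:~# cat /sys/devices/system/cpu/cpuidle/current_governor
haltpoll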
Additional Tuning Parameters
Guest haltpoll tuning parameters are available in sysfs. The default parameters are more than suitable for Blockbridge storage.
root@guest:~# ls -l /sys/module/haltpoll/parameters/
total 0
-rw-r--r-- 1 root root 4096 Jan 12 22:28 guest_halt_poll_allow_shrink
-rw-r--r-- 1 root root 4096 Jan 12 22:28 guest_halt_poll_grow
-rw-r--r-- 1 root root 4096 Jan 12 22:28 guest_halt_poll_grow_start
-rw-r--r-- 1 root root 4096 Jan 12 22:28 guest_halt_poll_ns
-rw-r--r-- 1 root root 4096 Jan 12 22:28 guest_halt_poll_shrink
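If you do want to experiment, the parameters are ordinary sysfs files. For example, guest_halt_poll_ns caps how long an idle VCPU polls before yielding to the hypervisor; the value below is illustrative, not a recommendation:
# Read the current maximum poll window (in nanoseconds)
root@guest:~# cat /sys/module/haltpoll/parameters/guest_halt_poll_ns
200000
# Example only: double the maximum poll window
root@guest:~# echo 400000 > /sys/module/haltpoll/parameters/guest_halt_poll_ns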
Performance Impact Of Guest Haltpolling
The total impact of the haltpoll governor optimizations is shown in the graph below.
Processor C-States
C-states are a power-saving mechanism for CPUs that are idle. The basic C-states (defined by ACPI) are:
-
C0: Active - executing instructions
-
C1: Halt - not executing instructions, but can return to C0 “instantly”
-
C2: StopClock - similar to C1, with a delayed transition to C0
While a processor waits for I/O completion (i.e., an interrupt), it is often idle. When a processor is idle, it can optionally stop processing instructions and shut down internal subsystems to save power. This allows the processor to redirect power elsewhere; for example, it may boost the frequency of another core that’s busy. The transition from an idle state back to an active state has a measurable penalty known as exit latency.
The exit latency from C1 to C0 is measured in low single-digit microseconds. The exit latency from C2 to C0 is measured in the low tens of microseconds. A table of exit latencies reported by our 5950X is shown below.
C-STATE | DESCRIPTION | EXIT LATENCY
C0      | ACTIVE      | 0 us
C1      | HALT        | 1 us
C2      | STOP-CLOCK  | 18 us
C3      | SLEEP       | 350 us
Exit latencies are reported by the kernel in /sys/devices/system/cpu/cpu0/cpuidle/state*.
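You can reproduce the table directly from sysfs; each state directory exposes a name and an exit latency in microseconds:
# Print each C-state's name and exit latency (in microseconds) for CPU 0
for s in /sys/devices/system/cpu/cpu0/cpuidle/state* ; \
do echo "$(cat $s/name): $(cat $s/latency) us" ; \
done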
Disable C-States For Selected CPUs
We can reduce storage latency by several microseconds by disabling C-States on the IOThread and NIC interrupt CPUs.
# Disable Processor Idle States for CPUs 1 and 2
root@host:~# cpupower --cpu 1,2 idle-set -d 2
root@host:~# cpupower --cpu 1,2 idle-set -d 1
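You can confirm the result in sysfs; a 1 in a state’s disable file means the kernel will no longer enter that state on that CPU:
# Confirm idle states 1 and 2 are disabled on CPU 1
for s in /sys/devices/system/cpu/cpu1/cpuidle/state[12] ; \
do echo "$(cat $s/name): disabled=$(cat $s/disable)" ; \
done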
Performance Impact Of Processor C-States
The graph below shows the latency effect of our changes to the processor idle states. Incremental latency improvements are limited to about 1 microsecond since these cores are usually busy enough to operate in C1 and C0.
Processor Vulnerability Mitigation
Attacks on transient-execution CPU vulnerabilities can be used to extract sensitive data in multi-user systems. The vulnerabilities are a byproduct of how modern CPUs achieve high performance. Operating systems implement software-based techniques to mitigate them. The overhead of these mitigations is significant relative to the performance of high-speed storage.
By default, mitigations are enabled. You can check what mitigations are in place for your CPU as shown below.
# Show Vulnerabilities with Mitigation
root@host:~# lscpu | grep Mitigation
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, ...
If you are operating in a trusted environment, it may be safe to disable mitigations. To do so, add mitigations=off to the Linux kernel command line parameters of the host: add or modify the following line in the grub configuration file (/etc/default/grub) and update your grub configuration (update-grub on Ubuntu).
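For example (the quiet option matches the host kernel command line shown in the Environment section; preserve any other options you already have):
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"
# Regenerate the grub configuration
root@host:~# update-grub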
A reboot is required for the changes to take effect. If successful, your CPU vulnerabilities will show as Vulnerable.
# Show Vulnerabilities without Mitigations
root@host:~# lscpu | grep Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; ...
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Performance Impact Of Vulnerability Mitigation
QEMU uses syscalls to interact with the Linux kernel to perform guest I/O and send event notifications. Mitigations have a measurable negative impact on syscall performance. The data below shows that latency improvements of up to 2.8 microseconds are possible.
Carefully evaluate the security implications before rolling mitigations=off into production.
ENVIRONMENT
Network Diagram
┌──────────────────────────┐ ┌─────────────────────┐
│ | ┌──────────────────┐ | │
│ ┌────┐ PROXMOX 7.3 │── NVME/TCP ─┤ 25G SN3700C 100G ├───────┤ BLOCKBRIDGE 6.X │
│ | | 25G DUAL PORT │ └──────────────────┘ │ QUAD ENGINE │
│ │ VM │ X8 GEN3 │ ┌──────────────────┐ │ 2X 100G DUAL PORT │
│ └────┘ 16 CORE RYZEN |── NVME/TCP ─┤ 25G SN3700C 100G ├───────┤ 4M IOPS / 25 GB/s │
| | └──────────────────┘ | |
└──────────────────────────┘ └─────────────────────┘
Description
Proxmox 7.3 (kernel version 5.15.83-1-pve) is installed on an ASRockRack 1U4LW-X570 with an AMD Ryzen 5950X 16-Core Processor, 128GB of RAM, and a single Mellanox dual-port 25Gbit network adapter. The Mellanox adapter is an x8 Gen3 device with a maximum throughput of 63Gbit/s. The server is running with default settings and hyperthreads enabled.
The Proxmox host connects to a redundant pair of Mellanox 100G SN3700C switches using an Active-Active LACP LAG. While the Blockbridge storage is 100G connected, the port speed of the host limits performance to 25Gbit. The network MTU is 9000.
A single virtual machine is provisioned on the host. The VM is installed with Ubuntu 23.04, running Linux kernel version 5.19.0-21-generic. The VM has four virtual CPUs and 8GB of RAM. The VM has a boot block device containing the root filesystem separate from the storage under test.
A read-only workload is executed that fits within the encrypted data cache of the storage system to ensure consistency and repeatability of the results. QD1 tests are executed for seven block sizes. Each test consists of a 5-minute warmup followed by a 10-minute measurement period. A sample workload description appears below:
$ cat read-bs4096-qd1.fio
[global]
rw=read
direct=1
ioengine=libaio
time_based=1
runtime=600
ramp_time=300
numjobs=1
cpus_allowed=0
[device]
filename=/dev/sdb
size=1G
Software
Proxmox Version
# pveversion
pve-manager/7.3-4/d69b70d4 (running kernel: 5.15.83-1-pve)
Linux Kernel Options
BOOT_IMAGE=/boot/vmlinuz-5.15.83-1-pve root=/dev/mapper/pve-root ro quiet
Blockbridge Version
version: 6.0.0
release: 6712.2
build: 4102
Hardware And Networking
Server Platform
System Information
Manufacturer: ASRockRack
Product Name: 1U4LW-X570 RPSU
Processor
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 33
Model name: AMD Ryzen 9 5950X 16-Core Processor
Stepping: 2
Frequency boost: enabled
CPU MHz: 3400.000
CPU max MHz: 5083.3979
CPU min MHz: 2200.0000
BogoMIPS: 6787.10
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 512 KiB
L2 cache: 8 MiB
L3 cache: 64 MiB
NUMA node0 CPU(s): 0-31
Network Adapter
Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Subsystem: Mellanox Technologies Stand-up ConnectX-4 Lx EN, 25GbE dual-port SFP28, PCIe3.0 x8, MCX4121A-ACAT
Flags: bus master, fast devsel, latency 0, IRQ 121, IOMMU group 25
Memory at c0000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at fcd00000 [disabled] [size=1M]
Capabilities: [60] Express Endpoint, MSI 00
Capabilities: [48] Vital Product Data
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Capabilities: [100] Advanced Error Reporting
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [180] Single Root I/O Virtualization (SR-IOV)
Capabilities: [1c0] Secondary PCI Express
Capabilities: [230] Access Control Services
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
Network Adapter PCI Connectivity
[ 2.453955] mlx5_core 0000:2d:00.0: firmware version: 14.31.1014
[ 2.453985] mlx5_core 0000:2d:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 2.740344] mlx5_core 0000:2d:00.0: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
[ 2.743839] mlx5_core 0000:2d:00.0: Port module event: module 0, Cable plugged
[ 3.079562] mlx5_core 0000:2d:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[ 3.325403] mlx5_core 0000:2d:00.0: Supported tc offload range - chains: 4294967294, prios: 429496729
Network Adapter Link
Settings for enp45s0f0np0:
Supported ports: [ Backplane ]
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None RS BASER
Link partner advertised link modes: Not reported
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 25000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000004 (4)
link
Link detected: yes
Network Adapter Interrupt Coalesce Settings
Adaptive RX: on TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a
rx-usecs: 8
rx-frames: 128
rx-usecs-irq: n/a
rx-frames-irq: n/a
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: n/a
tx-frames-irq: n/a