OVERVIEW

This technical note presents a performance comparison of iSCSI and NVMe/TCP shared storage in a Proxmox/QEMU virtual environment. Both protocols provide block-level access to storage devices over standard TCP/IP networks, enabling high availability and mobility of virtual machines.
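For orientation, here is a minimal sketch of how a Linux host attaches a remote block device over each protocol using the standard open-iscsi and nvme-cli tools. The portal address, IQN, and NQN below are placeholders; in this testbed, the Blockbridge Proxmox Storage Plugin performs the equivalent steps automatically.

# iSCSI: discover targets on the portal, then log in (open-iscsi)
$ iscsiadm -m discovery -t sendtargets -p 10.0.0.1
$ iscsiadm -m node -T iqn.2005-03.org.example:target0 -p 10.0.0.1 --login

# NVMe/TCP: load the transport and connect to the subsystem (nvme-cli)
$ modprobe nvme-tcp
$ nvme connect -t tcp -a 10.0.0.1 -s 4420 -n nqn.2014-08.org.example:subsys0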

Tests were executed on a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor and Mellanox 100Gbit networking. The server was installed with Proxmox 7.2 and the Blockbridge Proxmox Storage Plugin version 3.0. The backend storage software is Blockbridge 6.

The testbed consists of 32 virtual machines operating on a single host, each configured with four virtual CPUs. Each virtual machine operates as a fio worker. A single virtual disk, either iSCSI or NVMe/TCP, is attached to each virtual machine at a time for testing. We execute a battery of concurrent I/O tests for each disk type with varying queue depths and block sizes. Each data point represents the average performance over a 20-minute interval following a 1-minute warm-up.

To ensure that the host, rather than the storage, is the limiting factor, the backend storage system was sized to accommodate roughly double the maximum load the client could generate using either protocol.

FINDINGS

SMALL I/O IMPROVEMENTS

NVMe/TCP consistently outperforms iSCSI in 4K random I/O tests. The benefits of NVMe/TCP are especially pronounced for medium queue depth workloads. At a queue depth of 4, NVMe/TCP delivers over 50% more IOPS with 33% lower latency.

The graph below shows the percentage improvement in IOPS and latency achieved using NVMe/TCP (vs. iSCSI) for 4K-sized I/Os at varying queue depths.

AVERAGE IOPS IMPROVEMENT

In almost all workloads, NVMe/TCP achieves higher IOPS than iSCSI. This difference is particularly significant for small I/O sizes. With large I/O sizes, the host achieves peak bandwidth quickly. At this point, iSCSI exhibits a slight performance advantage over NVMe/TCP, but the difference is negligible at approximately 0.1%.

AVERAGE IOPS BY BLOCK SIZE

The following graph displays the improvement in IOPS for different block sizes, with the average improvement calculated across various queue depths. For instance, the data point for 512B illustrates an average IOPS improvement of 35.4%, which represents the mean gain in IOPS measured across tests that execute with 1 to 128 concurrent I/Os per virtual machine.
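Spelled out, assuming the improvement figures are simple relative gains averaged over queue depths (our notation, not taken from the report):

   IOPS improvement(bs) = mean over qd in {1, 2, 4, ..., 128} of
                          100% x (IOPS_nvme(bs, qd) - IOPS_iscsi(bs, qd)) / IOPS_iscsi(bs, qd)

The latency figures later in this note follow the same form, with a reduction in latency reported as a positive improvement.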


You can find detailed IOPS comparisons for specific block sizes in the Benchmarks section below.

AVERAGE IOPS BY QUEUE DEPTH

The following graph illustrates the improvement in IOPS for various queue depths, with the average improvement calculated across different block sizes. For instance, the data point for a queue depth of 1 reflects an average IOPS improvement of 18.2%, which is the mean gain measured across block sizes ranging from 512B to 128KiB.

AVERAGE LATENCY IMPROVEMENT

In most workloads, NVMe/TCP offers lower latency than iSCSI. The difference is particularly noticeable at smaller I/O sizes. As the I/O size increases, however, the system quickly reaches peak bandwidth; at that point, iSCSI's advantage over NVMe/TCP is minimal, at approximately 0.1%.

AVERAGE LATENCY BY QUEUE DEPTH

The following graph presents the percentage improvement (i.e., reduction) in latency for each queue depth, with the average improvement calculated across different block sizes. For instance, the data point for QD=1 represents the mean reduction in latency measured across block sizes ranging from 512B to 128KiB.


You can find detailed latency comparisons for specific block sizes in the Benchmarks section below.

ENVIRONMENT

Network Diagram


         ┌──────────────────────────────┐                                       ┌─────────────────────┐
         │      ┌────┐                  │             ┌───────────────┐         │                     │
         │    ┌────┐ │  PROXMOX 7.2     │── NVME/TCP ─┤ SN2100 - 100G ├─────────┤  BLOCKBRIDGE 6.X    │
         │  ┌────┐ │ │  100G DUAL PORT  │             └───────────────┘         │  QUAD ENGINE        │
         │  │ 32 │ │─┘  X16 GEN3        │             ┌───────────────┐         │  2X 100G DUAL PORT  │
         │  │ VM │─┘    32 CORE AMD     │── NVME/TCP ─┤ SN2100 - 100G ├─────────┤  4M IOPS / 25 GB/s  │
         │  └────┘                      │             └───────────────┘         │                     │
         └──────────────────────────────┘                                       └─────────────────────┘

Description

Proxmox 7.2 (kernel version 5.15.53-1-pve) is installed on a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor, 512GB of RAM, and a single Mellanox dual-port 100Gbit network adapter. The Mellanox adapter is an x16 Gen3 device with a maximum throughput of 126Gbit/s (limited by its PCIe connectivity). The AMD processor runs with NUMA-per-socket (NPS) set to 4 and simultaneous multithreading (SMT) disabled.

Thirty-two virtual machines are provisioned on the host. Each VM is installed with Ubuntu 22.04.1 LTS, running Linux kernel version 5.15.0-1018-kvm. The VMs have four virtual CPUs and 4GB of RAM. The logical CPU overcommit is 4:1 (128 provisioned vCPUs running on a 32-core processor).

On each VM, fio-3.28 runs in “server” mode. We use an external controller node to coordinate benchmark runs across the VMs.
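As a sketch of that arrangement (the hostnames and file names are illustrative, not the exact ones used here):

# on each VM: run fio as a passive worker (listens on port 8765 by default)
$ fio --server

# on the controller: fan one job file out to all 32 workers at once;
# hosts.list holds one VM hostname per line
$ fio --client=hosts.list read-rand-bs4096-qd32.fio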

Each VM has a boot block device containing the root filesystem, separate from the storage under test. For each VM, we provision storage using pvesm alloc and attach it to the VM with qm set, as sketched below. Before each test run, the VMs are power cycled to ensure consistency.
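For a single VM, the provisioning steps look roughly like this; the storage pool name, VM ID, and disk size are placeholders rather than the exact values used in the test:

# allocate a 64G raw volume on the shared Blockbridge pool
$ pvesm alloc blockbridge 101 vm-101-disk-1 64G

# attach the volume to VM 101 as its second SCSI device
$ qm set 101 --scsi1 blockbridge:vm-101-disk-1

# power cycle the VM before the run
$ qm stop 101 && qm start 101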

We tested iSCSI and NVMe/TCP block devices using a suite of 56 I/O workloads covering seven block sizes (512B to 128KiB) and eight queue depths (1 to 128). Each workload consists of a 1-minute warmup and a 20-minute measurement period, so a full suite takes 19.6 hours to complete (56 workloads × 21 minutes). A sample workload description appears below:

$ cat read-rand-bs4096-qd32.fio
[global]
rw=randread        # 100% random reads
direct=1           # O_DIRECT, bypass the guest page cache
ioengine=libaio
time_based=1
runtime=1200       # 20-minute measurement period
ramp_time=60       # 1-minute warmup, excluded from results
numjobs=1

[sdb]
filename=/dev/sdb  # the virtual disk under test
bs=4096            # 4KiB I/Os
iodepth=32         # queue depth of 32

The backend storage system is sized for 25GB/s of throughput and 4 million random IOPS to eliminate any possible bottlenecks at the storage layer.

Software

Blockbridge Version

version:   6.1.0
release:   6667.1
build:     4056

Hardware

Server Platform

System Information
	Manufacturer: Dell Inc.
	Product Name: PowerEdge R7515

Processor

Processor: 32 x AMD EPYC 7452 32-Core Processor (1 Socket)
Kernel Version: Linux 5.15.53-1-pve #1 SMP PVE 5.15.53-1 (Fri, 26 Aug 2022 16:53:52 +0200)
PVE Manager Version pve-manager/7.2-7/d0dd0e

Processor NUMA Configuration

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              1
Core(s) per socket:              32
Socket(s):                       1
NUMA node(s):                    4
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7452 32-Core Processor
Stepping:                        0
CPU MHz:                         3139.938
BogoMIPS:                        4690.89
Virtualization:                  AMD-V
L1d cache:                       1 MiB
L1i cache:                       1 MiB
L2 cache:                        16 MiB
L3 cache:                        128 MiB
NUMA node0 CPU(s):               0-7
NUMA node1 CPU(s):               8-15
NUMA node2 CPU(s):               16-23
NUMA node3 CPU(s):               24-31

Network Adapter

Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Subsystem: Mellanox Technologies Mellanox ConnectX®-5 MCX516A-CCAT
Flags: bus master, fast devsel, latency 0, IRQ 624, NUMA node 1, IOMMU group 89
Memory at ac000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at ab100000 [disabled] [size=1M]
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core

Network Adapter PCI Connectivity

[    3.341416] mlx5_core 0000:41:00.0: firmware version: 16.26.1040
[    3.341456] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[    3.638556] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[    3.638597] mlx5_core 0000:41:00.0: E-Switch: Total vports 4, per vport: max uc(1024) max mc(16384)
[    3.641492] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None	 RS	 BASER
Link partner advertised link modes:  Not reported
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 100000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Supports Wake-on: d
Wake-on: d
    Current message level: 0x00000004 (4)
                           link
Link detected: yes

Network Adapter Interrupt Coalesce Settings

Adaptive RX: on  TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a

rx-usecs: 8
rx-frames: 128
rx-usecs-irq: n/a
rx-frames-irq: n/a

tx-usecs: 8
tx-frames: 128
tx-usecs-irq: n/a
tx-frames-irq: n/a

BENCHMARKS

512B BENCHMARK RESULTS

512B IMPROVEMENT SUMMARY

512B IOPS COMPARISON

512B LATENCY COMPARISON

512B BANDWIDTH COMPARISON

4K BENCHMARK RESULTS

4K IMPROVEMENT SUMMARY

4K IOPS COMPARISON

4K LATENCY COMPARISON

4K BANDWIDTH COMPARISON

8K BENCHMARK RESULTS

8K IMPROVEMENT SUMMARY

8K IOPS COMPARISON

8K LATENCY COMPARISON

8K BANDWIDTH COMPARISON

16K BENCHMARK RESULTS

16K IMPROVEMENT SUMMARY

16K IOPS COMPARISON

16K LATENCY COMPARISON

16K BANDWIDTH COMPARISON

32K BENCHMARK RESULTS

32K IMPROVEMENT SUMMARY

32K IOPS COMPARISON

32K LATENCY COMPARISON

32K BANDWIDTH COMPARISON

64K BENCHMARK RESULTS

64K IMPROVEMENT SUMMARY

64K IOPS COMPARISON

64K LATENCY COMPARISON

64K BANDWIDTH COMPARISON

128K BENCHMARK RESULTS

128K IMPROVEMENT SUMMARY

128K IOPS COMPARISON

128K LATENCY COMPARISON

128K BANDWIDTH COMPARISON

ADDITIONAL RESOURCES