OVERVIEW
Proxmox and VMware offer competing virtualization platforms. Proxmox is an open-source product that leverages QEMU and KVM. VMware ESXi is a commercial product built from proprietary software.
This technote compares the performance of Proxmox VE 7.2 and VMware ESXi 7.0 Update 3c for storage-dominant applications. Experiments were conducted on identical hardware configurations operating under moderate to heavy load. Testing focuses on the aggregate storage performance of 32 concurrently active virtual machines.
Our test system is a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor and Mellanox 100Gbit networking. The server is configured for dual-boot: Proxmox 7.2 and ESXi 7.0 Update 3c. Storage is network-attached using NVMe/TCP. The backend storage software is Blockbridge 6.
The testbed consists of 32 Ubuntu virtual machines operating on a single host, each configured with four virtual CPUs. A single virtual disk is attached to each virtual machine for testing. Each virtual machine operates as a fio worker; tests execute concurrently on all 32 virtual machines. For each queue depth and block size, a data point is collected that represents the average performance over a 20-minute interval immediately following a 1-minute warm-up period.
You can review our findings/TLDR here and our raw benchmark data here.
ARCHITECTURE COMPARISON
VMware / PVSCSI / VMFS
VMware presents storage to guests via virtualized SCSI, SATA, or NVMe controllers. Typically, storage is a virtual SCSI disk presented using a VMware Paravirtual SCSI Adapter. The backing storage associated with these virtual disks is generally a file stored on a special-purpose cluster filesystem called VMFS. VMFS provides storage management features, including thin provisioning, snapshots, and cluster mobility.
The diagram below illustrates the flow of an I/O issued by a guest. Note that the existing storage stack is heavily SCSI-centric. NVMe devices fit into this model using a thin SCSI virtualization layer, referred to as a “shim” in the diagram.
GUEST
│ ┌────────┐
│ │ PVSCSI │
│ │ DRIVER │
▼ └┬───────┘
KERNEL │
│ ┌▼───────┐ ┌────────┐ ┌─────────┐ ┌────────┐
│ │ PVSCSI ├──► VMFS ├──► I/O ├──► SCSI │
│ │ DEVICE │ │ │ │ SCHED │ │ DISK │
│ └────────┘ └────────┘ └─────────┘ └┬───────┘
│ │
│ ┌▼───────┐ ┌────────┐ ┌────────┐
│ │ HPP ├──► SCSI ├──► NVME │
│ │ │ │ PATH │ │ SHIM │
│ └────────┘ └────────┘ └┬───────┘
│ │
│ ┌▼───────┐ ┌────────┐
│ │ NVME ├──► NVME │
│ │ CORE │ │ TCP │
▼ └────────┘ └────────┘
Our experience suggests that the centralized I/O scheduler is a significant bottleneck and source of latency. Fortunately, NVMe/TCP uses the more recent “High-Performance Plugin” (HPP) by default. HPP allows guest I/Os to bypass the scheduler as long as the backend storage remains fast. The latency threshold that governs this bypass is set with:
esxcli storage core device latencythreshold set -t [value in milliseconds]
More information is available here.
Proxmox / Virtio-SCSI / RAW
Proxmox typically presents storage to guests as virtualized SCSI devices connected to a virtual SCSI controller implemented using virtio-scsi. When used with network-attached storage, the guest’s virtual SCSI devices are backed by native Linux block devices; there is no intermediate cluster filesystem layer in Proxmox. Thin provisioning, snapshots, encryption, and high availability are implemented by the network-attached storage.
The diagram below illustrates the flow of an I/O issued by a guest. Note that the scheduling domains are remarkably different between Proxmox and VMware. Proxmox schedules I/Os for individual devices, and NVMe devices use the Linux no-op scheduler. VMware schedules I/O for competing VMs, trying to coordinate the efficient use of a physical device’s I/O queuing capabilities.
GUEST
│ ┌─────────────┐
│ │ VIRTIO-SCSI │
│ │ DRIVER │
▼ └┬────────────┘
QEMU │
│ ┌▼────────────┐ ┌─────────┐
│ │ VIRTIO-SCSI ├───► ASYNC │
│ │ DEVICE │ │ I/O │
▼ └─────────────┘ └┬────────┘
KERNEL │
│ ┌▼────────┐ ┌─────────┐
│ │ BLOCK ├──► SCHED │
│ │ LAYER │ │ NOOP │
│ └─────────┘ └┬────────┘
│ │
│ ┌▼────────┐ ┌─────────┐
│ │ NVME ├──► NVME │
│ │ CORE │ │ TCP │
▼ └─────────┘ └─────────┘
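To confirm which scheduler a Linux host applies to a block device, the kernel exposes the choice in sysfs. The sketch below is a minimal illustration, not part of the original test harness: it parses the `/sys/block/<device>/queue/scheduler` format, where the active scheduler appears in square brackets. On current kernels the no-op choice is reported as `none`. The device name passed to `scheduler_for` is whatever your host uses; none is assumed here.

```python
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler name shown in [brackets] in a sysfs scheduler line."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Some devices expose a bare "none" with no brackets.
    return sysfs_text.strip()


def scheduler_for(device: str) -> str:
    """Read the active scheduler for a block device such as 'nvme0n1'."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


if __name__ == "__main__":
    # Sample line in the kernel's format; not captured from the test host.
    print(active_scheduler("[none] mq-deadline"))
```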
FINDINGS
Proxmox Offers Higher IOPS
Proxmox VE beat VMware ESXi in 56 of 57 tests, delivering IOPS performance gains of nearly 50%. Peak gains in individual test cases with large queue depths and small I/O sizes exceed 70%.
The graph below shows the percentage gains (averaged across block sizes) for each queue depth. For example, the datapoint QD=128 is the average gain measured for the .5KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB block sizes at a queue depth of 128. The graph shows an average performance advantage of 48.9% in favor of Proxmox.
Proxmox Has Lower Latency
Proxmox VE reduced latency by more than 30% while simultaneously delivering higher IOPS, besting VMware in 56 of 57 tests.
The graph below shows the latency reduction (averaged across block sizes) for each queue depth. For example, the datapoint QD=128 is the average reduction in latency for the .5KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB block sizes at a queue depth of 128. The graph shows a 32.6% performance advantage in favor of Proxmox.
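The IOPS and latency results are two views of the same measurement. As a rough sanity check (our own reasoning, not a calculation taken from the raw data), Little's law says that with a fixed number of I/Os in flight, average latency equals outstanding I/Os divided by IOPS. The sketch below assumes the queue depth stays full on both platforms and applies that relationship to the reported average IOPS gain.

```python
def expected_latency_reduction(iops_gain_pct: float) -> float:
    """Latency reduction (%) implied by an IOPS gain (%) at constant queue depth.

    With N I/Os always in flight, Little's law gives latency = N / IOPS, so
    scaling IOPS by (1 + g) scales latency by 1 / (1 + g).
    """
    g = iops_gain_pct / 100.0
    return 100.0 * g / (1.0 + g)


if __name__ == "__main__":
    # 48.9% is the average IOPS gain reported in the IOPS findings.
    print(f"{expected_latency_reduction(48.9):.1f}%")
```

The result is about 32.8%, close to the measured 32.6%. An exact match is not expected, because the reported figures are averages of per-test percentages.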
Proxmox Delivers More Bandwidth
Proxmox achieved 38% higher bandwidth than VMware ESXi during peak load conditions: 12.8GB/s for Proxmox versus 9.3GB/s for VMware ESXi.
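The 38% figure follows directly from the two peak numbers. A short worked check, using only the values quoted above:

```python
def percent_gain(new: float, baseline: float) -> float:
    """Percentage by which `new` exceeds `baseline`."""
    return 100.0 * (new - baseline) / baseline


PROXMOX_GBPS = 12.8  # GB/s, peak bandwidth reported for Proxmox
ESXI_GBPS = 9.3      # GB/s, peak bandwidth reported for VMware ESXi

if __name__ == "__main__":
    print(f"{percent_gain(PROXMOX_GBPS, ESXI_GBPS):.1f}%")
```

This evaluates to roughly 37.6%, which rounds to the 38% stated.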
BENCHMARKS
IOPS
The following IOPS charts plot the relative performance of Proxmox VE and VMware ESXi. Each chart presents average IOPS results for eight different queue depths at a fixed block size. Results are presented for seven block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K. Higher IOPS results are better.
LATENCY
The following latency charts plot the average I/O latency measured during the IOPS tests for Proxmox VE and VMware. Each chart shows the average latency for eight queue depths at a fixed block size. Results are presented for seven block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K. Lower latency is better.
BANDWIDTH
The bandwidth charts plot the average data throughput measured during the IOPS tests for Proxmox VE and VMware. Each chart shows the average bandwidth for eight queue depths at a fixed block size. Results are presented for seven block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K. Higher bandwidth is better.
RELATIVE COMPARISONS
IOPS IMPROVEMENT
The charts in the following section present the percentage gain or loss in IOPS associated with using Proxmox VE in place of VMware ESXi. A positive value indicates that Proxmox VE achieves higher IOPS. A negative value indicates that Proxmox VE achieves lower IOPS.
LATENCY REDUCTION
The charts in the following section present the percentage increase or decrease in latency associated with using Proxmox VE in place of VMware ESXi. A positive value indicates that Proxmox VE has lower latency. A negative value indicates that Proxmox VE has higher latency.
AVERAGE IOPS IMPROVEMENT
The graph below shows the average IOPS percentage gain across all block sizes at each queue depth. For example, the datapoint QD=128 is the average gain measured for the .5KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB block sizes. The graph shows an average performance advantage of 48.9% in favor of Proxmox.
AVERAGE LATENCY REDUCTION
The graph below shows the average latency reduction across all block sizes at each queue depth. For example, the datapoint QD=128 is the average reduction in latency for the .5KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB block sizes. The graph shows a 32.6% performance advantage in favor of Proxmox.
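The sign conventions and the per-queue-depth averaging described in this section can be sketched as follows. This is an illustration only: the exact formulas are our assumption (percentage change relative to the ESXi result, then a plain mean over block sizes), and the input numbers are hypothetical placeholders, not measurements from this technote.

```python
from statistics import mean

# Hypothetical placeholder measurements, NOT results from this technote.
# Keys are block sizes; values are (ESXi, Proxmox) readings at one queue depth.
iops = {"4K": (100_000, 150_000), "8K": (80_000, 100_000)}
latency_us = {"4K": (400.0, 300.0), "8K": (500.0, 400.0)}


def iops_gain_pct(esxi: float, pve: float) -> float:
    """Positive when Proxmox delivers more IOPS than ESXi."""
    return 100.0 * (pve - esxi) / esxi


def latency_reduction_pct(esxi: float, pve: float) -> float:
    """Positive when Proxmox has lower latency than ESXi."""
    return 100.0 * (esxi - pve) / esxi


def average_over_block_sizes(samples, metric) -> float:
    """Mean of the per-block-size percentages for a single queue depth."""
    return mean(metric(esxi, pve) for esxi, pve in samples.values())


if __name__ == "__main__":
    print(average_over_block_sizes(iops, iops_gain_pct))
    print(average_over_block_sizes(latency_us, latency_reduction_pct))
```

Averaging the per-block-size percentages weights every block size equally, so a single datapoint cannot dominate simply because its absolute IOPS are larger.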
VMWARE ENVIRONMENT
Network Diagram
┌──────────────────────────────┐ ┌─────────────────────┐
│ ┌────┐ | ┌───────────────┐ | │
│ ┌────┐ | ESXi 7.0-U3C │── NVME/TCP ─┤ SN2100 - 100G ├──────── ┤ BLOCKBRIDGE 6.X │
│ ┌────┐ | | 100G DUAL PORT │ └───────────────┘ │ QUAD ENGINE │
│ │ 32 │ |─┘ X16 GEN3 │ ┌───────────────┐ │ 2X 100G DUAL PORT │
│ │ VM │─┘ 32 CORE AMD |── NVME/TCP ─┤ SN2100 - 100G ├──────── ┤ 4M IOPS / 25 GB/s │
| └────┘ | └───────────────┘ | |
└──────────────────────────────┘ └─────────────────────┘
Description
VMware ESXi 7.0 Update 3c is installed on a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor, 512GB of RAM, and a single Mellanox dual-port 100Gbit network adapter. The Mellanox adapter is a x16 Gen3 device with a maximum throughput of 126Gbit/s (limited by the PCIe connectivity). The AMD processor is running with NPS=4 and without hyperthreading.
Thirty-two virtual machines are provisioned on the host. Each VM is installed with Ubuntu 22.04.1 LTS, running Linux kernel version 5.15.0-1018-kvm. The VMs have four virtual CPUs and 4GB of RAM. The logical CPU overcommit is 4:1 (128 provisioned vCPUs running on a 32-core processor). Each VM has a boot block device containing the root filesystem and a separate device under test.
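The provisioning ratios follow from the figures in this description. A brief worked check, using only the numbers stated above:

```python
VMS = 32            # virtual machines on the host
VCPUS_PER_VM = 4    # virtual CPUs per VM
RAM_GB_PER_VM = 4   # guest memory per VM
HOST_CORES = 32     # physical cores (hyperthreading disabled)
HOST_RAM_GB = 512   # host memory

provisioned_vcpus = VMS * VCPUS_PER_VM
cpu_overcommit = provisioned_vcpus / HOST_CORES
ram_fraction = VMS * RAM_GB_PER_VM / HOST_RAM_GB

if __name__ == "__main__":
    print(f"{provisioned_vcpus} vCPUs -> {cpu_overcommit:.0f}:1 CPU overcommit")
    print(f"guest RAM uses {ram_fraction:.0%} of host memory")
```

CPU is overcommitted 4:1, while guest memory totals 128GB, a quarter of the host's RAM, so memory is not overcommitted.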
To maximize performance and evenly distribute the load, four VMFS6 datastores were used. Each datastore was backed by a single NVMe/TCP device, each on a different Blockbridge dataplane engine. Default settings for multipathing, I/O queue pairs, and queue depth were used. We observed that VMware opened eight queue pairs per storage path, for a combined logical queue depth of 4096 I/Os.
On each VM, fio-3.28 runs in “server” mode. We use an external controller node to coordinate benchmark runs across the VMs.
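For readers unfamiliar with fio's client/server mode: each guest runs `fio --server`, and a controller invokes fio with one `--client=<host>` option per guest, each followed by the job file to run. The sketch below builds such a command line. It is an assumption about how a controller could be driven, not a description of the controller used for these tests, and the guest host names are hypothetical.

```python
from typing import List


def fio_client_command(hosts: List[str], job_file: str) -> List[str]:
    """Build an argv that asks every fio server in `hosts` to run `job_file`."""
    argv = ["fio"]
    for host in hosts:
        # fio pairs each --client option with the job file(s) that follow it.
        argv += [f"--client={host}", job_file]
    return argv


if __name__ == "__main__":
    # Hypothetical guest names; the technote does not list them.
    hosts = [f"vm{n:02d}" for n in range(1, 33)]
    print(" ".join(fio_client_command(hosts, "read-rand-bs4096-qd32.fio")))
```

The returned list can be passed to `subprocess.run` on the controller node.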
Our test suite consists of 56 different I/O workloads that vary block size and queue depth. Each workload consists of a 1-minute warm-up and a 20-minute measurement period, so a full suite takes 19.6 hours to complete. A sample workload description appears below:
$ cat read-rand-bs4096-qd32.fio
[global]
rw=randread
direct=1
ioengine=libaio
time_based=1
runtime=1200
ramp_time=60
numjobs=1
[sdb]
filename=/dev/sdb
bs=4096
iodepth=32
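The workload count and suite duration can be derived from the test matrix and the timing parameters in the sample job. A short worked check:

```python
BLOCK_SIZES = 7         # block sizes listed in the benchmark section
QUEUE_DEPTHS = 8        # queue depths tested per block size
RAMP_SECONDS = 60       # ramp_time in the sample job
RUNTIME_SECONDS = 1200  # runtime in the sample job

workloads = BLOCK_SIZES * QUEUE_DEPTHS
suite_hours = workloads * (RAMP_SECONDS + RUNTIME_SECONDS) / 3600

if __name__ == "__main__":
    print(f"{workloads} workloads, {suite_hours:.1f} hours per suite")
```

Seven block sizes at eight queue depths give 56 workloads, and 56 runs of 21 minutes each total 19.6 hours.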
Required Tuning
Requests Outstanding
ESXi has a special setting that controls the depth of a guest's device I/O queue when other guests are accessing the same storage device. In earlier versions of ESXi, this was set via the global parameter Disk.SchedNumReqOutstanding. Starting in 5.5, control moved to a per-device, esxcli-only parameter. Because we benchmark concurrent virtual machines operating at high queue depths, it is essential to tune the default.
esxcli storage core device set --sched-num-req-outstanding 1024 -d
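Because the setting is per device, it has to be applied to each of the four NVMe/TCP devices that back the datastores. The sketch below generates one command per device. The device identifiers are hypothetical placeholders (the real ones are not given in this technote); substitute the identifiers reported by your own host.

```python
from typing import List


def set_outstanding_requests(device_id: str, limit: int = 1024) -> List[str]:
    """Build the esxcli argv that sets the outstanding-request limit for one device."""
    return [
        "esxcli", "storage", "core", "device", "set",
        "--sched-num-req-outstanding", str(limit),
        "-d", device_id,
    ]


if __name__ == "__main__":
    # Placeholder identifiers, one per datastore device; NOT real device names.
    devices = [f"DEVICE-ID-{n}" for n in range(1, 5)]
    for device in devices:
        print(" ".join(set_outstanding_requests(device)))
```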
I/O Scheduler Bypass
By default, ESXi passes every I/O through an I/O scheduler. This scheduler creates internal queuing, which is highly inefficient with high-speed storage devices.
Setting the latency-sensitive threshold allows VMware to bypass the I/O scheduler, sending I/Os directly from the PSA (i.e., Pluggable Storage Architecture) to HPP (i.e., High-Performance Plugin). This bypass delivers a noticeable boost to performance for NVMe/TCP, which natively leverages HPP for multipath and IO Queue pair selection.
esxcli storage core device latencythreshold set -v 'NVMe' -m 'Blockbridge' -t 10
Software
VMware Version
Product: VMware ESXi
Version: 7.0.3
Build: Releasebuild-19035710
Update: 3
Patch: 20
Blockbridge Version
version: 6.1.0
release: 6667.1
build: 4056
Guest Version
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
Hardware
Server Platform
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R7515
Processor
memorySize = 549330464768,
cpuModel = "AMD EPYC 7452 32-Core Processor ",
cpuMhz = 2346,
numCpuPkgs = 1,
numCpuCores = 32,
numCpuThreads = 32,
numNics = 6,
numHBAs = 19
Processor NUMA Configuration
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7452 32-Core Processor
Stepping: 0
CPU MHz: 3139.938
BogoMIPS: 4690.89
Virtualization: AMD-V
L1d cache: 1 MiB
L1i cache: 1 MiB
L2 cache: 16 MiB
L3 cache: 128 MiB
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
Network Adapter & Link
Name Driver Link Status Speed MTU Description
------ ---------- ----------- ------ ---- -----------
vmnic0 ntg3 Up 1000 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic1 ntg3 Up 1000 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic2 nmlx5_core Up 100000 9000 Mellanox Technologies 100GbE dual-port QSFP28 (MCX516A-CCAT)
vmnic3 nmlx5_core Up 100000 9000 Mellanox Technologies 100GbE dual-port QSFP28 (MCX516A-CCAT)
Network Adapter Interrupt Coalesce Settings
NIC RX microseconds RX maximum frames TX microseconds TX Maximum frames Adaptive RX Adaptive TX
------ --------------- ----------------- --------------- ----------------- ----------- -----------
vmnic0 18 15 72 53 Off Off
vmnic1 18 15 72 53 Off Off
vmnic2 3 64 16 32 On Off
vmnic3 3 64 16 32 On Off
PROXMOX ENVIRONMENT
Network Diagram
┌──────────────────────────────┐ ┌─────────────────────┐
│ ┌────┐ | ┌───────────────┐ | │
│ ┌────┐ | PROXMOX 7.2 │── NVME/TCP ─┤ SN2100 - 100G ├──────── ┤ BLOCKBRIDGE 6.X │
│ ┌────┐ | | 100G DUAL PORT │ └───────────────┘ │ QUAD ENGINE │
│ │ 32 │ |─┘ X16 GEN3 │ ┌───────────────┐ │ 2X 100G DUAL PORT │
│ │ VM │─┘ 32 CORE AMD |── NVME/TCP ─┤ SN2100 - 100G ├──────── ┤ 4M IOPS / 25 GB/s │
| └────┘ | └───────────────┘ | |
└──────────────────────────────┘ └─────────────────────┘
Description
Proxmox 7.2 (kernel version 5.15.53-1-pve) is installed on a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor, 512GB of RAM, and a single Mellanox dual-port 100Gbit network adapter. The Mellanox adapter is a x16 Gen3 device with a maximum throughput of 126Gbit/s (limited by the PCIe connectivity). The AMD processor is running with NPS=4 and without hyperthreading.
Thirty-two virtual machines are provisioned on the host. Each VM is installed with Ubuntu 22.04.1 LTS, running Linux kernel version 5.15.0-1018-kvm. The VMs have four virtual CPUs and 4GB of RAM. The logical CPU overcommit is 4:1 (128 provisioned vCPUs running on a 32-core processor).
On each VM, fio-3.28 runs in “server” mode. We use an external controller node to coordinate benchmark runs across the VMs.
Each VM has a boot block device containing the root filesystem, separate from the storage under test. For each VM, we provision storage using pvesm alloc and attach it to the VM with qm set. Before each test run, the VMs are power cycled to ensure consistency.
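The two provisioning steps can be sketched as a pair of commands per VM. This is a minimal illustration under stated assumptions: the storage ID, the VM ID, the disk size, the `scsi1` slot, and the `vm-<vmid>-disk-1` volume name are all hypothetical choices, not values taken from the test environment.

```python
from typing import List


def provision_commands(vmid: int, storage: str, size: str = "32G",
                       slot: str = "scsi1") -> List[List[str]]:
    """Return argv lists that allocate a volume and attach it to a VM."""
    volume = f"vm-{vmid}-disk-1"  # assumed volume name, for illustration
    alloc = ["pvesm", "alloc", storage, str(vmid), volume, size]
    attach = ["qm", "set", str(vmid), f"--{slot}", f"{storage}:{volume}"]
    return [alloc, attach]


if __name__ == "__main__":
    # "example-storage" and VM ID 101 are placeholders.
    for argv in provision_commands(101, "example-storage"):
        print(" ".join(argv))
```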
Our test suite consists of 56 different I/O workloads that vary block size and queue depth. Each workload consists of a 1-minute warm-up and a 20-minute measurement period, so a full suite takes 19.6 hours to complete. A sample workload description appears below:
$ cat read-rand-bs4096-qd32.fio
[global]
rw=randread
direct=1
ioengine=libaio
time_based=1
runtime=1200
ramp_time=60
numjobs=1
[sdb]
filename=/dev/sdb
bs=4096
iodepth=32
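Since the workloads differ only in block size and queue depth, the job files can be generated from a template. The sketch below reproduces the sample above for any pair of values. The file-naming pattern is inferred from the single sample name, and the generator itself is our illustration rather than the tooling used for these tests.

```python
JOB_TEMPLATE = """\
[global]
rw=randread
direct=1
ioengine=libaio
time_based=1
runtime=1200
ramp_time=60
numjobs=1
[sdb]
filename=/dev/sdb
bs={bs}
iodepth={qd}
"""


def job_name(bs: int, qd: int) -> str:
    """File name following the pattern of the sample job."""
    return f"read-rand-bs{bs}-qd{qd}.fio"


def render_job(bs: int, qd: int) -> str:
    """Fill the template with a block size in bytes and a queue depth."""
    return JOB_TEMPLATE.format(bs=bs, qd=qd)


if __name__ == "__main__":
    print(job_name(4096, 32))
    print(render_job(4096, 32), end="")
```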
Required Tuning
No tuning parameters were required.
Software
Proxmox Version
# pveversion
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.53-1-pve)
Blockbridge Version
version: 6.1.0
release: 6667.1
build: 4056
Guest Version
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
Hardware
Server Platform
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R7515
Processor
Processor: 32 x AMD EPYC 7452 32-Core Processor (1 Socket)
Kernel Version: Linux 5.15.53-1-pve #1 SMP PVE 5.15.53-1 (Fri, 26 Aug 2022 16:53:52 +0200)
PVE Manager Version pve-manager/7.2-7/d0dd0e
Processor NUMA Configuration
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7452 32-Core Processor
Stepping: 0
CPU MHz: 3139.938
BogoMIPS: 4690.89
Virtualization: AMD-V
L1d cache: 1 MiB
L1i cache: 1 MiB
L2 cache: 16 MiB
L3 cache: 128 MiB
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
Network Adapter
Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Subsystem: Mellanox Technologies Mellanox ConnectX®-5 MCX516A-CCAT
Flags: bus master, fast devsel, latency 0, IRQ 624, NUMA node 1, IOMMU group 89
Memory at ac000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at ab100000 [disabled] [size=1M]
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
Network Adapter PCI Connectivity
[ 3.341416] mlx5_core 0000:41:00.0: firmware version: 16.26.1040
[ 3.341456] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 3.638556] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 3.638597] mlx5_core 0000:41:00.0: E-Switch: Total vports 4, per vport: max uc(1024) max mc(16384)
[ 3.641492] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
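The PCIe figure in the log above can be reproduced from the link parameters, which explains why a dual-port 100Gbit adapter tops out near 126Gbit/s. The sketch assumes the standard Gen3 128b/130b line encoding and treats the reported 12.8GB/s peak as decimal gigabytes. The small difference from the kernel's 126.016 Gb/s is likely rounding in its per-lane figure.

```python
GT_PER_SECOND = 8.0   # PCIe Gen3 signalling rate per lane
LANES = 16            # x16 link
ENCODING = 128 / 130  # Gen3 128b/130b line-code efficiency

pcie_gbit = GT_PER_SECOND * LANES * ENCODING  # usable link rate
port_gbit = 2 * 100                           # two 100Gbit ports
peak_gbit = 12.8 * 8                          # 12.8 GB/s peak Proxmox bandwidth

if __name__ == "__main__":
    print(f"PCIe link:  {pcie_gbit:.1f} Gbit/s")
    print(f"Port total: {port_gbit} Gbit/s")
    print(f"Peak load:  {peak_gbit:.1f} Gbit/s")
```

The PCIe link, not the network ports, is the adapter's ceiling, and the peak measured bandwidth of about 102Gbit/s sits below it.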
Network Adapter Link
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None RS BASER
Link partner advertised link modes: Not reported
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 100000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000004 (4)
link
Link detected: yes
Network Adapter Interrupt Coalesce Settings
Adaptive RX: on TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a
rx-usecs: 8
rx-frames: 128
rx-usecs-irq: n/a
rx-frames-irq: n/a
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: n/a
tx-frames-irq: n/a