OVERVIEW

Proxmox and VMware offer competing virtualization platforms. Proxmox is an open-source product that leverages QEMU and KVM. VMware ESXi is a commercial product built from proprietary software.

This technote compares the performance of Proxmox VE 7.2 and VMware ESXi 7.0 Update 3c for storage-dominant applications. Experiments were conducted on identical hardware configurations operating under moderate to heavy load. Testing focuses on the aggregate storage performance of 32 concurrently active virtual machines.

Our test system is a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor and Mellanox 100Gbit networking. The server is configured for dual-boot: Proxmox 7.2 and ESXi 7.0 Update 3c. Storage is network-attached using NVMe/TCP. The backend storage software is Blockbridge 6.

The testbed consists of 32 Ubuntu virtual machines operating on a single host, each configured with four virtual CPUs. A single virtual disk is attached to each virtual machine for testing. Each virtual machine operates as a fio worker; tests execute concurrently on all 32 virtual machines. For each queue depth and block size, a data point is collected that represents the average performance over a 20-minute interval immediately following a 1-minute warm-up period.
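
To make the parameter space concrete, the 56 workloads correspond to seven block sizes crossed with eight queue depths. The sketch below is illustrative only: the exact queue-depth list is our assumption, and the job-file names simply mirror the sample shown in the environment sections.

# Illustrative enumeration of the workload matrix (7 block sizes x 8 queue depths = 56 jobs).
# The queue-depth values are an assumption; job-file names follow the sample shown later.
for bs in 512 4096 8192 16384 32768 65536 131072; do
  for qd in 1 2 4 8 16 32 64 128; do
    echo "read-rand-bs${bs}-qd${qd}.fio"
  done
done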

A summary of our findings appears in the FINDINGS section below; the complete benchmark data follows in the BENCHMARKS and RELATIVE COMPARISONS sections.

ARCHITECTURE COMPARISON

VMware / PVSCSI / VMFS

VMware presents storage to guests via virtualized SCSI, SATA, or NVMe controllers. Typically, storage is a virtual SCSI disk presented using a VMware Paravirtual SCSI Adapter. The backing storage associated with these virtual disks is generally a file stored on a special-purpose cluster filesystem called VMFS. VMFS provides storage management features, including thin provisioning, snapshots, and cluster mobility.
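
For reference, these virtual hardware choices surface in a VM's .vmx configuration. The excerpt below is a hypothetical illustration (the disk file name is a placeholder), showing a Paravirtual SCSI controller backed by a VMDK file on a VMFS datastore:

scsi0.virtualDev = "pvscsi"
scsi0:0.present = "TRUE"
scsi0:0.fileName = "test-vm_1.vmdk"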

The diagram below illustrates the flow of an I/O issued by a guest. Note that the existing storage stack is heavily SCSI-centric. NVMe devices fit into this model using a thin SCSI virtualization layer, referred to as a “shim” in the diagram.

      GUEST
      │           ┌────────┐
      │           │ PVSCSI │
      │           │ DRIVER │
      ▼           └┬───────┘
      KERNEL       │
      │           ┌▼───────┐  ┌────────┐  ┌─────────┐  ┌────────┐
      │           │ PVSCSI ├──►  VMFS  ├──►   I/O   ├──►  SCSI  │
      │           │ DEVICE │  │        │  │  SCHED  │  │  DISK  │
      │           └────────┘  └────────┘  └─────────┘  └┬───────┘
      │                                                 │
      │                                                ┌▼───────┐  ┌────────┐  ┌────────┐
      │                                                │  HPP   ├──►  SCSI  ├──►  NVME  │
      │                                                │        │  │  PATH  │  │  SHIM  │
      │                                                └────────┘  └────────┘  └┬───────┘
      │                                                                         │
      │                                                                        ┌▼───────┐  ┌────────┐
      │                                                                        │  NVME  ├──►  NVME  │
      │                                                                        │  CORE  │  │  TCP   │
      ▼                                                                        └────────┘  └────────┘

Our experience suggests that the centralized I/O scheduler is a significant bottleneck and source of latency. Fortunately, NVMe/TCP uses the more recent High-Performance Plugin (HPP) by default. HPP allows guest I/Os to bypass the scheduler as long as the backend storage remains fast.
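
To confirm that a given device is claimed by HPP (and is therefore eligible for the bypass), ESXi 7.x provides an esxcli listing; the command is shown below, output omitted:

# List devices claimed by the High-Performance Plugin (HPP).
esxcli storage hpp device list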

Proxmox / Virtio-SCSI / RAW

Proxmox typically presents storage to guests as virtualized SCSI devices connected to a virtual SCSI controller implemented using virtio-scsi. When used with network-attached storage, the guest’s virtual SCSI devices are backed by native Linux block devices; there is no intermediate cluster filesystem layer in Proxmox. Thin provisioning, snapshots, encryption, and high availability are implemented by the network-attached storage.
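
In Proxmox terms, this maps to a couple of lines in the VM configuration. The following is a hypothetical excerpt of /etc/pve/qemu-server/101.conf; the storage name, VM ID, and size are placeholders, not our actual values:

scsihw: virtio-scsi-pci
scsi1: blockbridge:vm-101-disk-1,size=64G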

The diagram below illustrates the flow of an I/O issued by a guest. Note that the scheduling domains are remarkably different between Proxmox and VMware. Proxmox schedules I/Os for individual devices, and NVMe devices use the Linux no-op scheduler. VMware schedules I/O for competing VMs, trying to coordinate the efficient use of a physical device’s I/O queuing capabilities.

      GUEST
      │           ┌─────────────┐
      │           │ VIRTIO-SCSI │
      │           │ DRIVER      │
      ▼           └┬────────────┘
      QEMU         │
      │           ┌▼────────────┐   ┌─────────┐
      │           │ VIRTIO-SCSI ├───► ASYNC   │
      │           │ DEVICE      │   │ I/O     │
      ▼           └─────────────┘   └┬────────┘
      KERNEL                         │
      │                             ┌▼────────┐  ┌─────────┐
      │                             │ BLOCK   ├──► SCHED   │
      │                             │ LAYER   │  │ NOOP    │
      │                             └─────────┘  └┬────────┘
      │                                           │
      │                                          ┌▼────────┐  ┌─────────┐
      │                                          │ NVME    ├──► NVME    │
      │                                          │ CORE    │  │ TCP     │
      ▼                                          └─────────┘  └─────────┘
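
On the Proxmox host, the scheduler shown in the diagram above can be verified per device; the NVMe namespace name below is illustrative:

# The active scheduler appears in brackets; "[none]" is the no-op scheduler shown above.
cat /sys/block/nvme0n1/queue/scheduler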

FINDINGS

Proxmox Offers Higher IOPS

Proxmox VE beat VMware ESXi in 56 of 57 tests, delivering IOPS performance gains of nearly 50%. Peak gains in individual test cases with large queue depths and small I/O sizes exceed 70%.

The graph below shows the percentage gains (averaged across block sizes) for each queue depth. For example, the datapoint QD=128 is the average gain measured for the .5KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB block sizes at a queue depth of 128. The graph shows an average performance advantage of 48.9% in favor of Proxmox.
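
Expressed as a formula (our reading of the averaging described above, with the sum running over the seven block sizes), the value plotted at each queue depth is:

\text{gain}(QD) \;=\; \frac{100}{7} \sum_{BS} \frac{\mathrm{IOPS}_{\mathrm{PVE}}(BS,\,QD) - \mathrm{IOPS}_{\mathrm{ESXi}}(BS,\,QD)}{\mathrm{IOPS}_{\mathrm{ESXi}}(BS,\,QD)}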

Proxmox Has Lower Latency

Proxmox VE reduced latency by more than 30% while simultaneously delivering higher IOPS, besting VMware in 56 of 57 tests.

The graph below shows the latency reduction (averaged across block sizes) for each queue depth. For example, the datapoint QD=128 is the average reduction in latency for the .5KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB block sizes at a queue depth of 128. The graph shows a 32.6% average latency advantage in favor of Proxmox.

Proxmox Delivers More Bandwidth

Proxmox achieved 38% higher bandwidth than VMware ESXi during peak load conditions: 12.8GB/s for Proxmox versus 9.3GB/s for VMware ESXi.

BENCHMARKS

IOPS

The following IOPS charts plot the relative performance of Proxmox VE and VMware ESXi. Each chart presents average IOPS results for eight different queue depths operating at a fixed block size. Results are presented for seven block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K. Higher IOPS results are better.

LATENCY

The following latency charts plot the average I/O latency measured during the IOPS tests for Proxmox VE and VMware. Each chart shows the average latency for eight queue depths operating at a fixed block size. Results are presented for seven block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K. Lower latency is better.

BANDWIDTH

The bandwidth charts plot the average data throughput measured during the IOPS tests for Proxmox VE and VMware. Each chart shows the average bandwidth for eight queue depths operating at a fixed block size. Results are presented for seven block sizes: .5K, 4K, 8K, 16K, 32K, 64K, and 128K. Higher bandwidth is better.

RELATIVE COMPARISONS

IOPS IMPROVEMENT

The charts in the following section present the percentage gain or loss in IOPS associated with using Proxmox VE in place of VMware ESXi. A positive value indicates that Proxmox VE achieves higher IOPS. A negative value indicates that Proxmox VE achieves lower IOPS.

LATENCY REDUCTION

The charts in the following section present the percentage increase or decrease in latency associated with using Proxmox VE in place of VMware ESXi. A positive value indicates that Proxmox VE has lower latency. A negative value indicates that Proxmox VE has higher latency.

AVERAGE IOPS IMPROVEMENT

The graph below shows the average IOPS improvement across all block sizes at each queue depth. For example, the datapoint QD=128 is the average gain measured for the .5KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB block sizes at a queue depth of 128. The graph shows an average performance advantage of 48.9% in favor of Proxmox.

AVERAGE LATENCY REDUCTION

The graph below shows the average latency reduction across all block sizes at each queue depth. For example, the datapoint QD=128 is the average reduction in latency for the .5KiB, 4KiB, 8KiB, 16KiB, 32KiB, 64KiB, and 128KiB block sizes at a queue depth of 128. The graph shows a 32.6% average latency advantage in favor of Proxmox.

VMWARE ENVIRONMENT

Network Diagram


         ┌──────────────────────────────┐                                       ┌─────────────────────┐
         │      ┌────┐                  │             ┌───────────────┐         │                     │
         │    ┌────┐ │  ESXi 7.0-U3C    │── NVME/TCP ─┤ SN2100 - 100G ├─────────┤  BLOCKBRIDGE 6.X    │
         │  ┌────┐ │ │  100G DUAL PORT  │             └───────────────┘         │  QUAD ENGINE        │
         │  │ 32 │ │─┘  X16 GEN3        │             ┌───────────────┐         │  2X 100G DUAL PORT  │
         │  │ VM │─┘    32 CORE AMD     │── NVME/TCP ─┤ SN2100 - 100G ├─────────┤  4M IOPS / 25 GB/s  │
         │  └────┘                      │                                       │                     │
         └──────────────────────────────┘                                       └─────────────────────┘

Description

VMware ESXi 7.0 Update 3c is installed on a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor, 512GB of RAM, and a single Mellanox dual-port 100Gbit network adapter. The Mellanox adapter is an x16 Gen3 device with a maximum throughput of 126Gbit/s (limited by the PCIe connectivity). The AMD processor is running with NPS=4 and without hyperthreading.

Thirty-two virtual machines are provisioned on the host. Each VM is installed with Ubuntu 22.04.1 LTS, running Linux kernel version 5.15.0-1018-kvm. The VMs have four virtual CPUs and 4GB of RAM. The logical CPU overcommit is 4:1 (128 provisioned vCPUs running on a 32-core processor). Each VM has a boot block device containing the root filesystem and a separate device under test.

To maximize performance and evenly distribute the load, four VMFS6 datastores were used. Each datastore was backed by a single NVMe/TCP device, each on a different Blockbridge dataplane engine. Default settings for multipathing, I/O queue pairs, and queue depth were used. We observed that VMware opened eight queue pairs per storage path, for a combined logical queue depth of 4096 I/Os.
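
For reference, the NVMe/TCP controllers (one per path) and the namespaces backing the datastores can be listed on the host with standard ESXi 7.x esxcli commands; output is omitted here:

# Enumerate NVMe-oF controllers and the namespaces they expose.
esxcli nvme controller list
esxcli nvme namespace list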

On each VM, fio-3.28 runs in “server” mode. We use an external controller node to coordinate benchmark runs across the VMs.
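
A minimal sketch of the coordination step, assuming a controller-side host list (the file name is ours, not part of the harness):

# On each VM: run fio as a server (listens on TCP port 8765 by default).
fio --server

# On the controller: vm-hosts.txt lists one VM address per line; a single job
# file is fanned out to all 32 VMs, and fio aggregates the results.
fio --client=vm-hosts.txt read-rand-bs4096-qd32.fio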

Our test suite consists of 56 different I/O workloads covering seven block sizes and eight queue depths. Each workload consists of a 1-minute warmup and a 20-minute measurement period; a complete suite takes 19.6 hours to run. A sample workload description appears below:

$ cat read-rand-bs4096-qd32.fio
[global]
rw=randread
direct=1
ioengine=libaio
time_based=1
runtime=1200
ramp_time=60
numjobs=1

[sdb]
filename=/dev/sdb
bs=4096
iodepth=32

Required Tuning

Requests Outstanding

ESXi has a special setting that controls how deep the device I/O queue is for a guest when other guests are accessing the same storage device. In earlier versions of ESXi, this was controlled via the global parameter Disk.SchedNumReqOutstanding. Starting in ESXi 5.5, it is set per device via esxcli only. Because we benchmark many concurrent virtual machines operating at high queue depth, tuning the default is essential.

esxcli storage core device set --sched-num-req-outstanding 1024 -d
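
The effective value can be confirmed in the per-device listing, where it is reported as "No of outstanding IOs with competing worlds" (the device ID is shown as a placeholder):

# Show per-device settings, including the scheduler's outstanding-I/O limit.
esxcli storage core device list -d <device>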

I/O Scheduler Bypass

By default, ESXi passes every I/O through an I/O scheduler. This scheduler creates internal queuing, which is highly inefficient with high-speed storage devices.

Setting the latency-sensitive threshold allows VMware to bypass the I/O scheduler, sending I/Os directly from the PSA (i.e., Pluggable Storage Architecture) to HPP (i.e., High-Performance Plugin). This bypass delivers a noticeable boost to performance for NVMe/TCP, which natively leverages HPP for multipath and IO Queue pair selection.

esxcli storage core device latencythreshold set -v 'NVMe' -m 'Blockbridge' -t 10
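
The same esxcli namespace also exposes a list subcommand that can be used to confirm the configured threshold; treat this as an assumption and consult your build's esxcli help if it differs:

# Display configured latency-sensitive thresholds per device (assumed subcommand).
esxcli storage core device latencythreshold list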

Software

VMware Version

Product: VMware ESXi
Version: 7.0.3
Build: Releasebuild-19035710
Update: 3
Patch: 20

Blockbridge Version

version:   6.1.0
release:   6667.1
build:     4056

Guest Version

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy

Hardware

Server Platform

System Information
	Manufacturer: Dell Inc.
	Product Name: PowerEdge R7515

Processor

memorySize = 549330464768,
cpuModel = "AMD EPYC 7452 32-Core Processor                ",
cpuMhz = 2346,
numCpuPkgs = 1,
numCpuCores = 32,
numCpuThreads = 32,
numNics = 6,
numHBAs = 19

Processor NUMA Configuration

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              1
Core(s) per socket:              32
Socket(s):                       1
NUMA node(s):                    4
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7452 32-Core Processor
Stepping:                        0
CPU MHz:                         3139.938
BogoMIPS:                        4690.89
Virtualization:                  AMD-V
L1d cache:                       1 MiB
L1i cache:                       1 MiB
L2 cache:                        16 MiB
L3 cache:                        128 MiB
NUMA node0 CPU(s):               0-7
NUMA node1 CPU(s):               8-15
NUMA node2 CPU(s):               16-23
NUMA node3 CPU(s):               24-31

Network Adapters

Name    Driver      Link Status   Speed   MTU  Description
------  ----------  -----------  ------  ----  -----------
vmnic0  ntg3        Up             1000  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic1  ntg3        Up             1000  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic2  nmlx5_core  Up           100000  9000  Mellanox Technologies 100GbE dual-port QSFP28 (MCX516A-CCAT)
vmnic3  nmlx5_core  Up           100000  9000  Mellanox Technologies 100GbE dual-port QSFP28 (MCX516A-CCAT)

Network Adapter Interrupt Coalesce Settings

NIC     RX microseconds  RX maximum frames  TX microseconds  TX Maximum frames  Adaptive RX  Adaptive TX
------  ---------------  -----------------  ---------------  -----------------  -----------  -----------
vmnic0  18               15                 72               53                 Off          Off
vmnic1  18               15                 72               53                 Off          Off
vmnic2  3                64                 16               32                 On           Off
vmnic3  3                64                 16               32                 On           Off

PROXMOX ENVIRONMENT

Network Diagram


         ┌──────────────────────────────┐                                       ┌─────────────────────┐
         │      ┌────┐                  │             ┌───────────────┐         │                     │
         │    ┌────┐ │  PROXMOX 7.2     │── NVME/TCP ─┤ SN2100 - 100G ├─────────┤  BLOCKBRIDGE 6.X    │
         │  ┌────┐ │ │  100G DUAL PORT  │             └───────────────┘         │  QUAD ENGINE        │
         │  │ 32 │ │─┘  X16 GEN3        │             ┌───────────────┐         │  2X 100G DUAL PORT  │
         │  │ VM │─┘    32 CORE AMD     │── NVME/TCP ─┤ SN2100 - 100G ├─────────┤  4M IOPS / 25 GB/s  │
         │  └────┘                      │                                       │                     │
         └──────────────────────────────┘                                       └─────────────────────┘

Description

Proxmox 7.2 (kernel version 5.15.53-1-pve) is installed on a Dell PowerEdge R7515 with an AMD EPYC 7452 32-Core Processor, 512GB of RAM, and a single Mellanox dual-port 100Gbit network adapter. The Mellanox adapter is an x16 Gen3 device with a maximum throughput of 126Gbit/s (limited by the PCIe connectivity). The AMD processor is running with NPS=4 and without hyperthreading.

Thirty-two virtual machines are provisioned on the host. Each VM is installed with Ubuntu 22.04.1 LTS, running Linux kernel version 5.15.0-1018-kvm. The VMs have four virtual CPUs and 4GB of RAM. The logical CPU overcommit is 4:1 (128 provisioned vCPUs running on a 32-core processor).

On each VM, fio-3.28 runs in “server” mode. We use an external controller node to coordinate benchmark runs across the VMs.

Each VM has a boot block device containing the root filesystem separate from the storage under test. For each VM, we provision storage using pvesm alloc and attach it to the VM with qm set. Before each test run, the VMs are power cycled to ensure consistency.
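
A hedged sketch of that provisioning flow (the storage name, VM ID, and size are placeholders, not our actual values):

# Allocate a raw volume on the network-attached storage and attach it to the VM
# as a virtio-scsi disk; "blockbridge", VM ID 101, and 64G are illustrative.
pvesm alloc blockbridge 101 vm-101-disk-1 64G
qm set 101 --scsihw virtio-scsi-pci --scsi1 blockbridge:vm-101-disk-1

# Power cycle the VM before each run for consistency.
qm stop 101 && qm start 101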

Our test suite consists of 56 different I/O workloads covering seven block sizes and eight queue depths. Each workload consists of a 1-minute warmup and a 20-minute measurement period; a complete suite takes 19.6 hours to run. A sample workload description appears below:

$ cat read-rand-bs4096-qd32.fio
[global]
rw=randread
direct=1
ioengine=libaio
time_based=1
runtime=1200
ramp_time=60
numjobs=1

[sdb]
filename=/dev/sdb
bs=4096
iodepth=32

Required Tuning

No tuning parameters were required.

Software

Proxmox Version

# pveversion
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.53-1-pve)

Blockbridge Version

version:   6.1.0
release:   6667.1
build:     4056

Guest Version

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy

Hardware

Server Platform

System Information
	Manufacturer: Dell Inc.
	Product Name: PowerEdge R7515

Processor

Processor: 32 x AMD EPYC 7452 32-Core Processor (1 Socket)
Kernel Version: Linux 5.15.53-1-pve #1 SMP PVE 5.15.53-1 (Fri, 26 Aug 2022 16:53:52 +0200)
PVE Manager Version: pve-manager/7.2-7/d0dd0e85

Processor NUMA Configuration

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              1
Core(s) per socket:              32
Socket(s):                       1
NUMA node(s):                    4
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7452 32-Core Processor
Stepping:                        0
CPU MHz:                         3139.938
BogoMIPS:                        4690.89
Virtualization:                  AMD-V
L1d cache:                       1 MiB
L1i cache:                       1 MiB
L2 cache:                        16 MiB
L3 cache:                        128 MiB
NUMA node0 CPU(s):               0-7
NUMA node1 CPU(s):               8-15
NUMA node2 CPU(s):               16-23
NUMA node3 CPU(s):               24-31

Network Adapter

Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Subsystem: Mellanox Technologies Mellanox ConnectX®-5 MCX516A-CCAT
Flags: bus master, fast devsel, latency 0, IRQ 624, NUMA node 1, IOMMU group 89
Memory at ac000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at ab100000 [disabled] [size=1M]
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core

Network Adapter PCI Connectivity

[    3.341416] mlx5_core 0000:41:00.0: firmware version: 16.26.1040
[    3.341456] mlx5_core 0000:41:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[    3.638556] mlx5_core 0000:41:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[    3.638597] mlx5_core 0000:41:00.0: E-Switch: Total vports 4, per vport: max uc(1024) max mc(16384)
[    3.641492] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None	 RS	 BASER
Link partner advertised link modes:  Not reported
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 100000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Supports Wake-on: d
Wake-on: d
    Current message level: 0x00000004 (4)
                           link
Link detected: yes

Network Adapter Interrupt Coalesce Settings

Adaptive RX: on  TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a

rx-usecs: 8
rx-frames: 128
rx-usecs-irq: n/a
rx-frames-irq: n/a

tx-usecs: 8
tx-frames: 128
tx-usecs-irq: n/a
tx-frames-irq: n/a

ADDITIONAL RESOURCES