OVERVIEW

Proxmox supports moving powered-on virtual machines between compute hosts, a process called live migration. Live migration allows you to gracefully relocate VMs to perform host maintenance, improve the network locality of virtual machines that communicate with one another, migrate to new hardware, and more. Live migration works with both local disks and shared storage. Shared storage, such as Ceph or Blockbridge, is generally preferred because it eliminates lengthy transfers of disk images and minimizes migration time.

This technote discusses the differences between secure and insecure virtual machine migration and describes how to eliminate spurious failures during concurrent migration. We explore the different migration modes, describe the known issues, and recommend best practices for rock-solid behavior in production.

PROXMOX LIVE MIGRATION

Live migration is a native QEMU primitive that allows a virtual machine to continue running while its operating state (i.e., memory, CPU registers, firmware, configuration, etc.) is transferred between hosts. During migration, QEMU serializes the runtime state of a virtual machine into a byte stream and sends it to a remote QEMU process via a TCP socket, a Unix domain socket, stdin/stdout, or a file descriptor. Proxmox orchestrates the QEMU migration process and offers two transport modes: secure and insecure.
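At the lowest level, the transfer is initiated through QEMU's QMP migrate command. The following is a minimal sketch of what Proxmox issues on your behalf (the target address and port are hypothetical); you never run this directly:

    { "execute": "migrate",
      "arguments": { "uri": "tcp:192.0.2.20:60000" } }

The URI scheme (tcp:, unix:, etc.) selects the underlying transport and is what Proxmox builds on in the secure and insecure modes described below.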

Secure Migration Overview

                      ┌────────────────────────┐               ┌────────────────────────┐
                      │      SOURCE HOST       │               │       TARGET HOST      │
                      ├────────────────────────┤               ├────────────────────────┤
                      │                        │               │                        │
                      │ ┌────┐   ┌───┐   ┌───┐ │               │ ┌───┐   ┌───┐   ┌────┐ │
                      │ │QEMU├──►│SSH├──►│TCP├─┼── ENCRYPTED ──┼─┤TCP├──►│SSH├──►│QEMU│ │
                      │ └────┘   └───┘   └───┘ │               │ └───┘   └───┘   └────┘ │
                      │                        │               │                        │
                      └────────────────────────┘               └────────────────────────┘

By default, Proxmox uses a secure network transport for virtual machine migration. Secure migration leverages dynamically created SSH tunnels that connect the source and destination hosts. The QEMU virtual machine state is transferred via Unix Domain Socket tunneled over SSH. SSH tunnels are authenticated using public/private keypairs exchanged during cluster installation. SSH provides in-flight confidentiality and data integrity guarantees at the expense of increased CPU utilization.
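No configuration is required to use secure migration; it is the default. For example, to live-migrate a running VM from the current host via the standard CLI (the VM ID and node name are hypothetical):

    qm migrate 100 pve2 --online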

Known Issues With Secure Migration

In customer environments, we’ve found that secure migration failure can often be attributed to migration concurrency. Specifically, when migration job concurrency (i.e., MaxWorkers) exceeds 10, sshd on the target host can randomly drop new connections, resulting in migration failure.

Our team has tracked this issue back to the default sshd configuration value for MaxStartups. The MaxStartups parameter controls the maximum number of concurrent unauthenticated connections to the SSH daemon. With the default value of 10:30:60, sshd begins refusing connections with a probability of 30% once the number of outstanding unauthenticated SSH sessions reaches 10. The probability increases linearly, and all connection attempts are refused once the number of unauthenticated connections reaches the "full" value (i.e., 60).
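To make the failure mode concrete, the table below shows the drop probability implied by the defaults (start=10, rate=30%, full=60), interpolating linearly between the start and full watermarks:

    outstanding unauthenticated connections:   10    22    35    47    60
    probability a new connection is dropped:   30%   47%   65%   82%   100%

In other words, as soon as ten tunnel setups are in flight, roughly one in three new connections is rejected, and each rejection surfaces as a failed migration.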

Insecure Migration Overview

                              ┌────────────────┐               ┌────────────────┐
                              │  SOURCE HOST   │               │  TARGET HOST   │
                              ├────────────────┤               ├────────────────┤
                              │                │               │                │
                              │ ┌────┐   ┌───┐ │               │ ┌───┐   ┌────┐ │
                              │ │QEMU├──►│TCP├─┼── CLEARTEXT ──┼─┤TCP├──►│QEMU│ │
                              │ └────┘   └───┘ │               │ └───┘   └────┘ │
                              │                │               │                │
                              └────────────────┘               └────────────────┘

Insecure migration uses a standard TCP connection to transfer a VM’s state between hosts, offering improved throughput for single-VM transfers compared to the SSH tunnels used for secure migration.

Proxmox statically reserves ports 60000 through 60050 for insecure migration. Port allocations are managed using a time-based reservation scheme. If the migration client requests a port reservation and does not bind to it within 5 seconds, the client is considered to have forfeited its reservation.
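If you suspect a reservation conflict or port exhaustion, you can check which migration ports are currently bound on a host. A quick sketch using ss (the range filter syntax assumes a reasonably recent iproute2):

    ss -tln '( sport >= :60000 and sport <= :60050 )'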

Known Issues With Insecure Migration

In customer environments, we’ve found that spurious insecure migration failures can often be attributed to conflicts resulting from races in port allocation and use. In addition, because the migration port range is limited to 50 ports, concurrency levels greater than 50 are guaranteed to fail port allocation.

GUIDANCE

We recommend avoiding insecure migration entirely. First, the hard-coded reserved port range is insufficient to support maximum concurrency. Second, the algorithm coordinating access to the reserved port range is prone to spurious conflicts resulting in migration failure. Third, the lack of authentication and in-flight confidentiality is not worth the performance gain in a production environment.

Stick with the default secure transport mode. For improved reliability at scale, modify your host’s sshd configuration to prevent connection drops. If you prefer not to modify your SSH configuration, we recommend limiting migration job concurrency to 8.
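The transport mode can be pinned cluster-wide in /etc/pve/datacenter.cfg, which also lets you direct migration traffic onto a dedicated network. A sketch (the CIDR is an example value):

    migration: type=secure,network=10.10.10.0/24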

If you want to operate with more than eight migration jobs in parallel, we recommend changing the default sshd MaxStartups parameters to eliminate spurious migration failures resulting from dropped connections.

The sshd MaxStartups parameter accepts either a single integer or a start:rate:full tuple (the default) that controls how and when connections are dropped. We recommend raising the start and full watermarks, or eliminating randomized connection dropping altogether by specifying a single integer value. Size the values according to the maximum concurrency you want to achieve.
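A minimal sketch of the change in /etc/ssh/sshd_config (the numbers are illustrative; size them to your concurrency target):

    # Begin probabilistic drops at 100 unauthenticated connections;
    # refuse everything at 200.
    MaxStartups 100:30:200

    # Alternatively, a single integer imposes a hard cap with no
    # randomized dropping:
    # MaxStartups 200

Reload sshd after editing; on Debian-based Proxmox hosts the unit is named ssh:

    systemctl reload ssh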

ADDITIONAL RESOURCES

Concurrent Migration Via CLI

The Proxmox CLI makes it easy to manage virtual machine migration, including job concurrency. If you are operating from a shell on a Proxmox host, you can use pvesh to interface with the API without additional authentication.

    pvesh create /nodes/<SOURCE NODE>/migrateall -target <TARGET NODE> \
        [-maxworkers <number of concurrent jobs>] \
        [-vms <space-separated list of VM IDs>]
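For example, to evacuate a node with at most eight concurrent migration jobs (the node names are hypothetical):

    pvesh create /nodes/pve1/migrateall -target pve2 -maxworkers 8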

You can find the API documentation in the Proxmox API Viewer at https://pve.proxmox.com/pve-docs/api-viewer/.