OVERVIEW
Proxmox supports moving powered-on virtual machines between compute hosts, a process called live migration. Live migration allows you to gracefully relocate VMs to perform host maintenance, improve the network locality of virtual machines that communicate with one another, migrate to new hardware, and more. Live migration works with both local disks and shared storage. Shared storage, such as Ceph or Blockbridge, is generally preferred because it eliminates lengthy transfers of disk images and minimizes migration time.
This technote discusses the differences between secure and insecure virtual machine migration and describes how to eliminate spurious failures during concurrent migration. We explore the different migration modes, describe the known issues, and recommend best practices for rock-solid behavior in production.
PROXMOX LIVE MIGRATION
Live migration is a native QEMU primitive that allows a virtual machine to run while concurrently transferring its operating state (i.e., memory, CPU registers, firmware, configuration, etc.) between hosts. During migration, QEMU serializes the runtime state of a virtual machine into a byte stream and sends it to a remote QEMU process via TCP socket, unix domain socket, stdin/stdout, or file descriptor. Proxmox orchestrates the QEMU migration process and offers two transport modes: secure and insecure.
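To make the primitive concrete, here is a minimal sketch of driving a QEMU migration by hand, outside of Proxmox's orchestration. The address and port are hypothetical, and Proxmox performs the equivalent steps on your behalf:

# target host: start a receiving QEMU instance (hypothetical port)
qemu-system-x86_64 [...] -incoming tcp:0.0.0.0:60000

# source host: instruct the running QEMU, via its QMP monitor, to stream state
{ "execute": "migrate", "arguments": { "uri": "tcp:192.0.2.20:60000" } }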
Secure Migration Overview
┌────────────────────────┐ ┌────────────────────────┐
│ SOURCE HOST │ │ TARGET HOST │
├────────────────────────┤ ├────────────────────────┤
│ │ │ │
│ ┌────┐ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ ┌────┐ │
│ │QEMU├──►│SSH├──►│TCP├─┼── ENCRYPTED ──┼─┤TCP├──►│SSH├──►│QEMU│ │
│ └────┘ └───┘ └───┘ │ │ └───┘ └───┘ └────┘ │
│ │ │ │
└────────────────────────┘ └────────────────────────┘
By default, Proxmox uses a secure network transport for virtual machine migration. Secure migration leverages dynamically created SSH tunnels that connect the source and destination hosts. The QEMU virtual machine state is transferred via Unix Domain Socket tunneled over SSH. SSH tunnels are authenticated using public/private keypairs exchanged during cluster installation. SSH provides in-flight confidentiality and data integrity guarantees at the expense of increased CPU utilization.
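Conceptually, the transport resembles OpenSSH's unix domain socket forwarding. The sketch below is illustrative only; the socket paths are hypothetical and Proxmox's actual invocation differs:

# forward a local unix socket to the target's receiving socket over SSH
ssh -N -L /run/migrate-src.sock:/run/migrate-dst.sock root@<TARGET NODE>

The source QEMU writes its state stream into the local socket, and ssh carries it, encrypted, to the corresponding socket on the target host.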
Known Issues With Secure Migration
In customer environments, we’ve found that secure migration failure can often be attributed to migration concurrency. Specifically, when migration job concurrency (i.e., MaxWorkers) exceeds 10, sshd on the target host can randomly drop new connections, resulting in migration failure.
Our team has tracked this issue back to the default sshd configuration value for MaxStartups. The MaxStartups parameter controls the maximum number of concurrent unauthenticated connections to the SSH daemon. Using the default value of 10:30:100, sshd will begin refusing connections with a probability of 30% once the number of outstanding unauthenticated SSH sessions reaches 10. The probability increases linearly, and all connection attempts are refused if the number of unauthenticated connections reaches "full" (i.e., 100).
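Based on our reading of sshd's connection-throttling logic, the drop probability interpolates linearly from the rate up to 100% as pending connections climb from "start" to "full". For example, with the default 10:30:100 and 55 pending unauthenticated connections:

p = 30 + (100 - 30) × (55 - 10) / (100 - 10) = 65%

At that instant, roughly two of every three new migration tunnels would be refused.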
Insecure Migration Overview
┌────────────────┐ ┌────────────────┐
│ SOURCE HOST │ │ TARGET HOST │
├────────────────┤ ├────────────────┤
│ │ │ │
│ ┌────┐ ┌───┐ │ │ ┌───┐ ┌────┐ │
│ │QEMU├──►│TCP├─┼── CLEARTEXT ──┼─┤TCP├──►│QEMU│ │
│ └────┘ └───┘ │ │ └───┘ └────┘ │
│ │ │ │
└────────────────┘ └────────────────┘
Insecure migration uses a standard TCP connection to transfer a VM’s state between hosts, offering improved throughput for single VM transfers compared to SSH tunnels used for secure migration.
Proxmox statically reserves ports 60000 through 60050 for insecure migration. Port allocations are managed using a time-based reservation scheme. If the migration client requests a port reservation and does not bind to it within 5 seconds, the client is considered to have forfeited its reservation. Insecure migration is enabled via the migration variable declared in the Proxmox datacenter.cfg.
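For example, the entry below selects insecure migration for the cluster and, optionally, pins migration traffic to a dedicated network (the CIDR is hypothetical):

# /etc/pve/datacenter.cfg
migration: type=insecure,network=10.10.10.0/24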
Known Issues With Insecure Migration
In customer environments, we’ve found that spurious insecure migration failures can often be attributed to conflicts resulting from races in port allocation and use. In addition, since the migration port range is limited to 50 ports, migration concurrency greater than 50 guarantees failures due to port exhaustion.
GUIDANCE
We recommend avoiding insecure migration entirely. First, the hard-coded reserved port range is insufficient to support maximum concurrency. Second, the algorithm coordinating access to the reserved port range is prone to spurious conflicts resulting in migration failure. Third, the lack of authentication and in-flight confidentiality is not worth the performance gain in a production environment.
Stick with the default secure transport mode. For improved reliability at scale, modify your host’s sshd configuration to prevent connection drops. If you do not want to modify your ssh configuration, we recommend limiting migration job concurrency to 8.
If you want to operate with more than eight migration jobs in parallel, we recommend changing the default sshd MaxStartups parameters to eliminate spurious migration failures resulting from dropped connections.
By default, the sshd MaxStartups parameter is a tuple that controls how and when to drop connections. We recommend adjusting the high and low watermarks or eliminating randomized connection dropping altogether by specifying a single integer value. Adjust the values according to the maximum concurrency you want to achieve. MaxStartups=108:30:128 or MaxStartups=128 are safe bets for most environments.
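A minimal sketch of the change on a migration target host (the value follows the recommendation above; adjust it to your concurrency goal):

# /etc/ssh/sshd_config
MaxStartups 128

# apply the change without disturbing established sessions
systemctl reload ssh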
ADDITIONAL RESOURCES
Concurrent Migration Via CLI
The Proxmox CLI can easily manage virtual machine migration, including job concurrency. If you are operating from a shell on a Proxmox host, you can use pvesh to interface with the API, without authentication.
pvesh create /nodes/<SOURCE NODE>/migrateall -target <TARGET NODE>
    optional: -maxworkers <number of concurrent jobs>
    optional: -vms <space separated list of VM ids>
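For example, assuming hypothetical nodes pve1 and pve2, the following migrates all VMs while honoring the eight-job guidance above:

pvesh create /nodes/pve1/migrateall -target pve2 -maxworkers 8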
You can find the API documentation here.
Forum & Bugzilla Links
- Bugzilla - Deterministic VM migration port
- Forum - Live Migration Thread