Demystifying QEMU cache=none

QEMU’s cache=none mode is often misunderstood. At first glance, it seems simple: data bypasses the host page cache, leaving the guest responsible for durability. In practice, the behavior is more nuanced. While writes to RAW block devices are generally predictable and reliable, QCOW2 introduces additional complexity. Metadata updates, write ordering, and flush handling in QCOW2 can delay or reorder how data is recorded, creating partially lost or torn writes if the VM crashes or loses power unexpectedly.

This article focuses on cache=none, explaining how it interacts with guest writes and storage, why its behavior can create subtle data risks on QCOW2 virtual devices, and what mechanisms are needed to ensure consistency. By the end, readers will understand why cache=none is not simply “no caching,” why raw devices are the safest option, and why QCOW2 devices can corrupt data in surprising ways when a failure occurs.


Context And Scope

Blockbridge’s experience is primarily with enterprise data center workloads, where durability, availability, and consistency are critical considerations. The information in this document reflects that context.

Both QCOW2 and RAW formats are supported on Blockbridge storage systems. The analysis presented here is intended to help readers understand the failure modes of QCOW2 and the technical trade-offs between formats. While RAW may align more closely with enterprise reliability requirements, the optimal choice depends on operational priorities.


What the Existing Documentation Says

Proxmox Documentation

According to the Proxmox documentation, the cache=none mode is described as follows:

  • “Seems to be the best performance and is the default since Proxmox 2.x.”
  • “Host page cache is not used.”
  • “Guest disk cache is set to writeback.”
  • “Warning: like writeback, you can lose data in case of a power failure.”
  • “You need to use the barrier option in your Linux guest’s fstab if kernel < 2.6.37 to avoid FS corruption in case of power failure.”

At first glance, it looks simple: the host page cache is bypassed, performance should be strong, and the guest filesystem takes care of caching and data integrity.

QEMU Documentation

QEMU’s documentation defines cache=none in terms of three attributes:

                     +---------------+-----------------+--------------+----------------+
                     | cache mode    | cache.writeback | cache.direct | cache.no-flush |
                     +---------------+-----------------+--------------+----------------+
                     | none          | on              | on           | off            |
                     +---------------+-----------------+--------------+----------------+
  • cache.writeback=on: QEMU reports write completion to the guest as soon as the host accepts the write (for buffered I/O, when the data is placed in the host page cache). Safe only if the guest issues flushes. With writeback disabled, writes are acknowledged only after they have been flushed to stable storage.
  • cache.direct=on: Performs disk I/O directly (using O_DIRECT) to the backing storage device, bypassing the host page cache. Internal data copies may still occur for alignment or buffering.
  • cache.no-flush=off: Maintains normal flush semantics. Setting this to on disables flushes entirely, removing all durability guarantees.

The QEMU documentation is somewhat more descriptive but also circular. A technical reader with a strong coffee in hand will notice the tension: cache.writeback=on reports I/O completion as soon as the host write() returns, while cache.direct=on ensures that write() bypasses the host page cache entirely. In combination, completion means the request has been handed to the underlying device rather than placed in host memory, and durability still depends on flushes.

What’s Missing

The key gap in the documentation is that the cache mode really only describes how QEMU interacts with the underlying storage devices connected to the host’s kernel.

For raw devices such as NVMe, iSCSI, Ceph, and even LVM-thick, the behavior is straightforward. I/O passes through QEMU to the kernel, possibly with some adjustments for alignment and padding, and the O_DIRECT flag asks Linux to communicate with the device with minimal buffering. This is the most efficient data path, with no caching on the host side.
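
To make that path concrete, here is a minimal C sketch (an illustration, not QEMU code) that opens a block device with O_DIRECT and issues a single aligned write. The device path /dev/nvme0n1 and the 4 KiB alignment are assumptions chosen for the example; O_DIRECT requires the buffer, offset, and length to satisfy the device’s alignment rules, which is why QEMU sometimes adjusts I/O for alignment and padding.

  /* Minimal sketch: direct I/O to a raw block device, bypassing the host
   * page cache. Assumes a 4 KiB logical block size; /dev/nvme0n1 is a
   * placeholder device path. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      const size_t len = 4096;   /* one logical block */
      int fd = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);
      if (fd < 0) { perror("open"); return 1; }

      /* O_DIRECT requires aligned memory, offset, and length. */
      void *buf;
      if (posix_memalign(&buf, 4096, len) != 0) {
          fprintf(stderr, "posix_memalign failed\n");
          return 1;
      }
      memset(buf, 0xAA, len);

      /* The data moves from this buffer to the device without being
       * copied into the host page cache. Completion does not, by itself,
       * prove the device has persisted it to media. */
      if (pwrite(fd, buf, len, 0) != (ssize_t)len) { perror("pwrite"); return 1; }

      free(buf);
      close(fd);
      return 0;
  }

A completed O_DIRECT write only means the kernel handed the request to the device; whether it is durable at that point depends on the device’s cache behavior, which is the subject of the sections below.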

This simplicity and predictability of raw devices can easily give the impression that QCOW2 images interact with storage in the same way, with no caching or other intermediate handling. In reality, QCOW2 behaves radically differently. QCOW2 is implemented entirely within QEMU, and understanding its behavior and consequences is critical. In short, cache=none does not mean “no caching” for QCOW2.


What cache=none Does In Plain Words

cache=none instructs QEMU to open(2) backing files and block devices for virtual disks using the O_DIRECT flag. O_DIRECT allows write() system calls to move data between QEMU’s userspace buffers and the storage device without copying the data into the host’s kernel page/buffer cache.

The cache=none mode also instructs QEMU to expose a virtual disk to the guest that advertises a volatile write cache. This is an indicator that the guest is responsible for issuing FLUSH and/or BARRIER commands to ensure correctness:

  • Flushes ensure data is persisted to stable storage.
  • Barriers enforce ordering constraints between write completions.
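
From the guest’s point of view, the practical consequence is that write completion alone is not durability. The hedged C sketch below (the file path is a placeholder) shows the pattern an application inside the guest must follow: data is stable only after fsync() or fdatasync() returns, because that call is what the guest kernel turns into a FLUSH for a virtual disk that advertises a volatile write cache.

  /* Minimal sketch of guest-side durability: write, then flush.
   * The path /data/journal.bin is a placeholder. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      char record[512];
      memset(record, 0x42, sizeof(record));

      int fd = open("/data/journal.bin", O_WRONLY | O_CREAT, 0644);
      if (fd < 0) { perror("open"); return 1; }

      /* write() only places the data in the guest page cache; even once
       * it reaches the virtual disk, that disk advertises a volatile
       * write cache, so nothing is guaranteed durable yet. */
      if (write(fd, record, sizeof(record)) != (ssize_t)sizeof(record)) {
          perror("write");
          return 1;
      }

      /* fdatasync() is turned into a FLUSH on the virtual disk; only after
       * it returns should the data be treated as persistent across host
       * power loss. */
      if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

      close(fd);
      return 0;
  }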

The Role of Caches in HDDs

So far, this is well-defined. Rotational storage devices have used volatile onboard caches for decades. The primary purpose of these caches was to accumulate enough data to optimize the mechanical seek process, which involves moving the head across the disk surface to align with a specific track on the platter. These optimizations reduced the total head travel distance and allowed the device to coalesce write operations, taking advantage of the platter’s rotation to minimize latency and improve throughput.

The Role of Caches in SSDs

In contrast, nonrotational storage devices such as solid state drives use caches primarily for different reasons. Every solid state drive requires memory to maintain internal flash translation tables that map logical block addresses to physical NAND locations. Consumer-grade solid state drives typically use volatile DRAM caches, which are lost on sudden power loss. Datacenter and enterprise-grade solid state drives include power loss protection circuitry, often in the form of capacitors, to ensure that any cached data and metadata are safely written to nonvolatile media if power is unexpectedly removed.

Cache Lifetime of Real Devices

Despite these differences, all storage devices use their cache only as a temporary buffer to hold incoming data before it is permanently written to the underlying media. They are not designed to retain data in the cache for extended periods, since the fundamental purpose of a storage device is to ensure long-term data persistence on durable media rather than in transient memory.


Exploring the Risks of QEMU/QCOW2 with cache=none

QCOW2 Data Caching via Deferred Metadata

QCOW2 is a copy-on-write image format supporting snapshots and thin provisioning. Its flexibility comes at a cost: QCOW2 requires persistent metadata to track how virtual storage addresses map to physical addresses within a file or device.

When a virtual disk is a QCOW2 image configured with cache=none, QEMU issues writes for all QCOW2 data using O_DIRECT. However, the L1 and L2 metadata remains entirely in QEMU’s volatile memory during normal operation. It is not flushed automatically. Metadata is persisted only when the guest explicitly issues a flush (e.g., fsync()), or during specific QEMU operations such as a graceful shutdown, snapshot commit, or migration.

This means that when a write allocates a cluster or subcluster, the application data is written immediately, while the metadata describing the allocation remains in QEMU memory. The effect is that the existence of the write is cached, which is functionally equivalent to caching the write itself.
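
To make the distinction concrete, here is a deliberately simplified model in C. It is not QEMU’s code or its actual data structures; it only mirrors the behavior described above: the payload is written immediately, while the mapping that makes it reachable stays in memory until an explicit flush writes the table back.

  /* A simplified model of deferred allocation metadata. NOT QEMU's
   * implementation; it only illustrates the behavior described in the
   * text. Error handling is omitted for brevity. */
  #include <fcntl.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>

  #define CLUSTER 65536

  struct mapping_table {
      uint64_t entry[512];   /* guest cluster index -> host file offset */
      bool     dirty;        /* modified in memory, not yet persisted   */
  };

  /* An allocating write: the payload goes to storage immediately
   * (with O_DIRECT in the real case), but the entry that makes it
   * reachable exists only in RAM. */
  static void allocating_write(struct mapping_table *t, int fd,
                               uint64_t guest_idx, uint64_t host_off,
                               const void *buf, size_t len)
  {
      pwrite(fd, buf, len, (off_t)host_off);   /* data is on disk...    */
      t->entry[guest_idx] = host_off;          /* ...but only RAM knows */
      t->dirty = true;                         /* where it lives        */
  }

  /* Only a guest-initiated flush forces the table itself to disk. */
  static void flush_table(struct mapping_table *t, int fd, off_t table_off)
  {
      if (t->dirty) {
          pwrite(fd, t->entry, sizeof(t->entry), table_off);
          fdatasync(fd);
          t->dirty = false;
      }
  }

  int main(void)
  {
      struct mapping_table table = { .dirty = false };
      static char buf[CLUSTER];
      memset(buf, 0xAA, sizeof(buf));

      int fd = open("alloc-model.img", O_RDWR | O_CREAT, 0644);

      /* After this call the payload is on disk, but if the process dies
       * before the flush below, it is unreachable: the write effectively
       * never happened. */
      allocating_write(&table, fd, 7, CLUSTER, buf, sizeof(buf));

      /* The flush is what makes the allocation real. */
      flush_table(&table, fd, 0);

      close(fd);
      return 0;
  }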

An interesting characteristic of QEMU/QCOW2 is that it relies entirely on the guest operating system to issue flush commands to synchronize its metadata. Without explicit flush operations, QEMU can keep its QCOW2 metadata in a volatile state indefinitely. This behavior is notably different from that of real storage devices, which make every reasonable effort to persist data to durable media as quickly as possible to minimize the risk of loss.


Increased Risk with QCOW2 Subcluster Allocation

By default, QCOW2 organizes and manages storage in units called clusters. Clusters are contiguous regions of physical space within an image. Both metadata tables and user data are allocated and stored as clusters.

A defining feature of QCOW is its copy-on-write behavior. When an I/O modifies a region of data after a snapshot, QCOW preserves the original blocks by writing the changes to a new cluster and updating metadata to point to it. If the I/O is smaller than a cluster, the surrounding data is copied into the new location.

To address some of the performance issues associated with copying data, QCOW introduced subcluster allocation using extended metadata. By doubling the metadata overhead, a cluster can be subdivided into smaller subclusters (e.g., 32 subclusters in a 128 KiB cluster), reducing the frequency of copy-on-write operations and improving efficiency for small writes.
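
The arithmetic behind the doubled metadata is simple. As described in the QCOW2 specification, enabling extended L2 entries grows each entry from 64 to 128 bits (adding a 32-bit allocation bitmap and a 32-bit zero bitmap) and splits every cluster into 32 subclusters. The short sketch below, using a few example cluster sizes chosen purely for illustration, computes the resulting subcluster size.

  /* Subcluster sizing with extended L2 entries (l2_extended=on): every
   * cluster is split into 32 subclusters and each L2 entry doubles from
   * 64 to 128 bits. */
  #include <stdio.h>

  int main(void)
  {
      const unsigned subclusters = 32;
      const unsigned cluster_kib[] = { 64, 128, 2048 };   /* example sizes */

      for (unsigned i = 0; i < sizeof(cluster_kib) / sizeof(cluster_kib[0]); i++) {
          printf("cluster %4u KiB -> subcluster %3u KiB, L2 entry 128 bits\n",
                 cluster_kib[i], cluster_kib[i] / subclusters);
      }
      return 0;
  }

For the 128 KiB example above, each subcluster is 4 KiB, which is why a small aligned guest write can allocate its own subcluster without copying neighboring data.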

However, this optimization introduces significant tradeoffs. Enabling l2_extended=on (subcluster allocation) increases metadata churn, especially when snapshots are in use, since each snapshot layer records only the deltas from its parent. More critically, it increases the risk of torn writes and data inconsistency in the event of a crash.

While subcluster tracking improves small-write performance, it comes at the cost of consistency. QCOW has historically struggled with maintaining integrity on unexpected power loss. With larger clusters, these issues were less frequent, less severe, and relatively straightforward to reconcile. Fine-grain allocation amplifies these risks, making data corruption more likely.

To illustrate this, here’s a simple example of data corruption that you can reproduce yourself on a QCOW2-backed virtual disk attached to a guest as a raw block device (i.e., no filesystem); a C sketch that walks through the same steps appears after the diagram:

Example of Lost Writes and Structural Tears:

  1. Take a snapshot, creating a new QCOW2 metadata layer.
  2. Application writes an 8KiB buffer of 0xAA at LBA 1 (4KiB block size).
  3. Application issues a flush to commit the metadata.
  4. Application writes an 8KiB buffer of 0xBB at LBA 0.
  5. VM is abruptly terminated due to host power loss or QEMU process termination.

Result:

  • Until termination, the virtual disk appears consistent to the guest.
  • On power loss, the second write is torn because the data was written, but subcluster metadata describing the allocation was not.

The diagram below illustrates the data hazard step by step:


                   ACTION                                   RESULT OF READ      ONDISK STATE
                  ───────────────────────────────────────  ─────────────────  ─────────────────

                                                           ┌───┬───┬───┬───┐  ┌───┬───┬───┬───┐
                  # SNAPSHOT (GUEST)                       │ - │ - │ - │ - │  │ - │ - │ - │ - │
                                                           └───┴───┴───┴───┘  └───┴───┴───┴───┘

                                                           ┌───┬───┬───┬───┐  ┌───┬───┬───┬───┐
                  # WRITE 0XA,BS=4K,SEEK=1,COUNT=2         │ - │ A │ A │ - │  │ - │ A │ A │ - │
                                                           └───┴───┴───┴───┘  └───┴───┴───┴───┘

                                                           ┌───┬───┬───┬───┐  ┌───┬───┬───┬───┐
                  # FSYNC()                                │ - │ A │ A │ - │  │ - │ A │ A │ - │
                                                           └───┴───┴───┴───┘  └───┴───┴───┴───┘

                                                           ┌───┬───┬───┬───┐  ┌───┬───┬───┬───┐
                  # WRITE 0XB,BS=4K,SEEK=0,COUNT=2         │ B │ B │ A │ - │  │ B │ B │ A │ - │
                                                           └───┴───┴───┴───┘  └───┴───┴───┴───┘

                                                           ┌───┬───┬───┬───┐  ┌───┬───┬───┬───┐
                  # SLEEP 60 (GUEST)                       │ B │ B │ A │ - │  │ B │ B │ A │ - │
                                                           └───┴───┴───┴───┘  └───┴───┴───┴───┘

                                                           ┌───┬───┬───┬───┐  ┌───┬───┬───┬───┐
                  # UNPLANNED GUEST TERMINATION            │ - │ B │ A │ - │  │ B │ B │ A │ - │
                                                           └───┴───┴───┴───┘  └───┴───┴───┴───┘

                ┌──────────────────────────────────────────────────────────────────────────────┐
                │   ┌───┐              ┌───┐              ┌───┐                                │
                │   │ A │ 4K DATA=0XA  │ B │ 4K DATA=0XB  │ - │ 4K DATA (PRIOR TO SNAP)        │
                │   └───┘              └───┘              └───┘                                │
                └──────────────────────────────────────────────────────────────────────────────┘
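
For readers who want to reproduce the scenario from inside a guest, here is a minimal C sketch of the same write sequence. The device path /dev/vdb is a placeholder: it is assumed to be a QCOW2-backed disk with subcluster allocation enabled, snapshotted immediately before the run (step 1), and attached to the guest as a raw block device with no filesystem. Killing the QEMU process (or cutting host power) during the final sleep and rereading blocks 0 and 1 after restart demonstrates the lost allocation.

  /* Reproduction sketch of the lost-write scenario described above.
   * Assumes /dev/vdb is a QCOW2-backed virtual disk, snapshotted just
   * before the run, attached to the guest as a raw block device with
   * 4 KiB logical blocks. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define BLOCK 4096

  int main(void)
  {
      int fd = open("/dev/vdb", O_WRONLY | O_DIRECT);
      if (fd < 0) { perror("open"); return 1; }

      void *buf;
      if (posix_memalign(&buf, BLOCK, 2 * BLOCK) != 0) return 1;

      /* Step 2: write 8 KiB of 0xAA at LBA 1 (blocks 1-2). */
      memset(buf, 0xAA, 2 * BLOCK);
      if (pwrite(fd, buf, 2 * BLOCK, 1 * BLOCK) != 2 * BLOCK) { perror("pwrite A"); return 1; }

      /* Step 3: flush, committing both the data and the QCOW2 metadata
       * that describes the new allocation. */
      if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

      /* Step 4: write 8 KiB of 0xBB at LBA 0 (blocks 0-1). Block 1 lands
       * in an already-allocated subcluster; block 0 requires a new
       * allocation whose metadata now exists only in QEMU's memory. */
      memset(buf, 0xBB, 2 * BLOCK);
      if (pwrite(fd, buf, 2 * BLOCK, 0) != 2 * BLOCK) { perror("pwrite B"); return 1; }

      /* Step 5: no flush. Terminate QEMU (or cut host power) during this
       * window, then reread blocks 0 and 1 after restart. */
      sleep(60);

      free(buf);
      close(fd);
      return 0;
  }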


Why Barriers and Flushes Are Critical

Deterministic write ordering and durability are fundamental primitives that ensure transactional applications and filesystems can recover reliably after a failure. In QEMU, these guarantees are enforced through the use of flush and barrier operations.

A flush forces all buffered writes, whether in the guest or in QEMU, to be committed to stable storage, ensuring that previous writes are durable before new ones proceed. A barrier enforces strict write ordering, ensuring that all writes issued before it are fully committed to storage before any subsequent writes begin.

Without these mechanisms, intermediate devices or virtualization layers can reorder or delay I/O in ways that violate the guest’s expectations, leading to unrecoverable corruption.
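
As an illustration of how applications build ordering and durability out of flushes, the sketch below shows the generic write-ahead pattern that transactional software relies on. The file name and offsets are placeholders rather than any particular database’s format; the point is that the payload must be durable before the commit record that makes it live, and on a virtual disk with a volatile write cache only the fdatasync() calls provide that guarantee.

  /* Generic write-ahead commit pattern (file name and offsets are
   * placeholders). The flush between the two writes guarantees that a
   * commit record never becomes durable before the data it points to. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static int write_all(int fd, const void *buf, size_t len, off_t off)
  {
      return pwrite(fd, buf, len, off) == (ssize_t)len ? 0 : -1;
  }

  int main(void)
  {
      char data[4096], commit[512];
      memset(data, 0xCD, sizeof(data));
      memset(commit, 0x01, sizeof(commit));   /* "transaction N is valid" */

      int fd = open("/data/txlog.bin", O_WRONLY | O_CREAT, 0644);
      if (fd < 0) { perror("open"); return 1; }

      /* 1. Write the transaction payload. */
      if (write_all(fd, data, sizeof(data), 4096) != 0) return 1;

      /* 2. Flush: the payload (and, on QCOW2, the allocation metadata it
       *    created) must reach stable storage before step 3. */
      if (fdatasync(fd) != 0) return 1;

      /* 3. Only now write the commit record that makes the payload live. */
      if (write_all(fd, commit, sizeof(commit), 0) != 0) return 1;

      /* 4. Flush again so the commit record itself is durable. */
      if (fdatasync(fd) != 0) return 1;

      close(fd);
      return 0;
  }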

QCOW2 is particularly sensitive because it relies entirely on guest-initiated flushes for durability. Its metadata and allocation structures do not persist automatically. Delayed or missing flushes in any application can result in inconsistent data and metadata.

The risks for raw devices are substantially lower because they involve no intermediate caching. Writes are issued directly to the underlying storage device, which typically commits data to stable media almost immediately. On enterprise and datacenter-grade storage, these operations are high-speed, low-latency, and durable upon completion, providing strong consistency guarantees even under failure conditions.

In essence, enterprise storage largely eliminates durability concerns and minimizes the potential for reordering, making raw devices a far safer choice for critical workloads. QCOW2 is semantically correct, but it is more prone to data loss on unexpected power failure.

Proxmox’s cache=none documentation warns: “You need to use the barrier option in your Linux guest’s fstab if kernel < 2.6.37 to avoid FS corruption in case of power failure.” With QCOW2, using barriers is not optional. It is absolutely essential to ensure any semblance of consistency after failures. Fortunately, most modern filesystems enable barriers by default.

That said, not all applications rely on filesystems. Many attempt to bypass the filesystem entirely for performance reasons, which can leave them exposed to the same risks if flushes and barriers are not explicitly managed.


Why Isn’t Data Corruption More Widespread?

Widespread data corruption with QCOW2 is relatively uncommon, largely because active journaling filesystems help keep metadata in sync. Silent corruption after power loss is a different matter: as the name implies, it can go unnoticed until long after the failure.

Filesystems such as ext4, XFS, ZFS, and btrfs maintain journals to track metadata changes for each transaction. These journals are flushed regularly, either automatically or on commit, which has the side effect of committing the underlying QCOW2 metadata associated with guest writes.

As a result, many workloads remain synchronized with the virtual disk almost by accident. For example, modifying and saving a file updates the inode’s mtime, triggering a journal transaction. The guest issues a flush, QCOW2 writes the pending metadata, and both the data and its allocation information are made consistent.

Other common operations, such as creating or deleting files, resizing directories, or committing database transactions, generate similar journal flushes. These frequent flushes help prevent inconsistencies, even though QCOW2 itself does not automatically persist metadata.

Workloads that bypass the filesystem, perform large sequential writes without journaling, or disable barriers for performance reasons are much more vulnerable. The risk is also higher for disks with less ambient activity, such as a separate “application disk” added to a VM apart from the root disk. In these cases, QCOW2’s reliance on explicit flushes becomes a significant liability, and unexpected power loss or process termination can result in substantial data corruption.


Application-Level Risks and Delayed Metadata Updates

Even with journaling filesystems, it’s essential to understand that writes flushed from the guest’s page cache are not stable on completion. This includes applications using O_DIRECT. Unless the application explicitly manages flushes, the primary mechanism that forces QCOW2 metadata to disk is the deferred modification time (mtime) and inode update, which typically occurs 5 to 30 seconds after the data is written, depending on the filesystem.

Risks:

  • Writes issued between filesystem journal flushes can be partially persisted and torn if the VM terminates unexpectedly.
  • QCOW2 metadata can remain out of sync with guest data, including allocation tables and L2 cluster mappings.

Delayed metadata, QCOW2’s in-memory caching, and fine-grained subcluster allocation increase the risk of data loss and create complex corruption patterns, where part of a file may be updated while other parts revert. Applications that rely on infrequent flushes or bypass the filesystem are at the highest risk of data loss in QCOW2 environments.


Is QCOW2 with cache=none Safe to Use?

QCOW2 with cache=none is semantically correct, and many modern workloads can operate safely on it. Well-behaved applications, particularly databases using fsync() or journaling filesystems, generally remain consistent.

However, QCOW2 is considerably more vulnerable to complex data loss during unexpected termination, process kills, or power failures. The presence of subcluster allocation dramatically amplifies the potential for torn or inconsistent writes. Applications that work directly on block devices in the guest, bypassing the ambient protection of a journaling filesystem, are especially exposed. Likewise, custom or lightly tested software, or workloads using specialized filesystem options such as lazytime or disabled barriers, face the highest risk of corruption.

Key Points

  • QCOW2 is prone to torn, interleaved, or reordered writes during power loss.
  • Delayed metadata updates, in-memory caching, and fine-grained subcluster allocation amplify the risk and complexity of data corruption.
  • Older filesystems (such as ext2 and FAT) and applications that do not explicitly issue flushes are especially vulnerable and should be avoided entirely.
  • RAW storage types are generally safer, exhibiting less reordering, stronger durability, and fewer lost writes after unexpected failure.

Key Takeaways

While QCOW2 with cache=none functions correctly in most cases, the risk of data corruption during unexpected power loss or VM termination is real. Configurations such as QCOW2 on NFS, as well as newer QCOW2-on-LVM setups, are susceptible to the types of corruption discussed in this technote. The more recent Volume as Snapshot Chains feature introduces additional risk due to subcluster allocation (i.e., l2_extended=on).

For workloads where minimizing data loss is a priority, RAW devices generally provide more reliable consistency and durability. Examples of reliable RAW storage options include Ceph, LVM-Thick, ZFS, native iSCSI, and native NVMe.

Choosing between QCOW2 and RAW should consider workload type, performance requirements, and operational priorities. While RAW is often preferred for workloads requiring durability and consistency, QCOW2 can still be appropriate for less critical workloads or scenarios where its features offer clear advantages.

Application developers should not assume that data in QCOW2 is persistent unless the guest OS has explicitly issued flush operations. If QCOW2 is used, it is advisable to disable subcluster allocation unless the application can reliably recover from partially written or torn blocks.
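
What it takes to recover from partially written or torn blocks is application-specific. One common, generic approach, sketched below purely as an illustration rather than a prescribed format, is to seal each block with a sequence number and checksum so that a torn block is detected on read and can be discarded or rebuilt from a log.

  /* Generic torn-write detection: store a checksum with each block so a
   * partially persisted block is detected on read. This mirrors what many
   * databases do; it is not a prescribed on-disk format. */
  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define PAYLOAD 4088   /* 4096-byte block minus an 8-byte trailer */

  struct block {
      uint8_t  payload[PAYLOAD];
      uint32_t seq;      /* monotonically increasing write sequence */
      uint32_t crc;      /* checksum over payload and seq */
  };

  /* Bitwise CRC-32 (reflected, polynomial 0xEDB88320); enough for a sketch. */
  static uint32_t crc32_calc(const void *p, size_t n)
  {
      const uint8_t *b = p;
      uint32_t c = 0xFFFFFFFFu;
      for (size_t i = 0; i < n; i++) {
          c ^= b[i];
          for (int k = 0; k < 8; k++)
              c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
      }
      return ~c;
  }

  static void block_seal(struct block *blk, uint32_t seq)
  {
      blk->seq = seq;
      blk->crc = crc32_calc(blk, offsetof(struct block, crc));
  }

  /* Returns 1 if the block read back from disk is intact, 0 if torn. */
  static int block_verify(const struct block *blk)
  {
      return blk->crc == crc32_calc(blk, offsetof(struct block, crc));
  }

  int main(void)
  {
      static struct block blk;
      memset(blk.payload, 0xEE, sizeof(blk.payload));
      block_seal(&blk, 42);

      /* Simulate a torn write: part of the payload reverts to older data. */
      memset(blk.payload, 0x00, 512);
      printf("intact after tear? %s\n", block_verify(&blk) ? "yes" : "no");
      return 0;
  }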

ADDITIONAL RESOURCES

External References

  • Proxmox Documentation on Cache=none link
  • QEMU Documentation on Cache=none link
  • QCOW2 Image Format Specification link
  • QCOW2 Subcluster Allocation Presentation link
  • Enabling subcluster allocation in PVE link

Blockbridge Resources