Race condition in KVM gmap invalidation / shadow paging (memory safety and isolation risk)

HIGH
torvalds/linux
Commit: 086aca1030cf
Affected: v7.0-rc5 and earlier (all 7.0-rc versions prior to this commit)
2026-04-16 05:43 UTC

Description

This commit fixes a race condition in KVM's gmap/shadow paging code, addressing partial gmap invalidations in the s390 KVM path, and tightens the user-kernel API surface by converting UAPI flexible-array declarations (described as VLAs in the commit message) to the __DECLARE_FLEX_ARRAY() helper. The core changes introduce an invalidated flag on gmap objects and adjust the shadow/invalidation logic to avoid races in which a gmap could be observed in an inconsistent state during concurrent invalidation and shadow-paging operations. The combination of the s390 race fixes and the UAPI cleanup constitutes a genuine vulnerability fix aimed at preventing race-induced memory-safety and isolation issues in virtualization.

Rationale:
- The diff shows multiple guarded checks that previously relied on sg->parent to determine validity; these have been replaced with sg->invalidated in several hot paths, indicating a race-prone condition when a gmap is invalidated concurrently with shadow operations.
- The gmap structure now carries an invalidated flag, and the code marks gmaps as invalidated during certain transitions, which prevents unsafe pointer usage or premature reuse of shadow state.
- UAPI structures are migrated from bare flexible-array members to __DECLARE_FLEX_ARRAY(), reducing the attack surface from malformed or oversized user-space inputs.

Affected behavior before the fix: concurrent invalidation and updates of gmaps could race with shadow paging, potentially leading to use-after-free-like scenarios or memory-isolation breaches in edge cases.

Impact: the fix reduces the likelihood of memory-safety vulnerabilities in KVM where gmap invalidations race with shadow paging, improving guest memory isolation and stability.
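To make the race pattern concrete, below is a minimal userspace sketch (not kernel code) of why gating on a sticky invalidated flag is safer than re-checking a parent pointer: if the shadow is detached and then re-attached within the race window, the pointer check looks valid again while the flag still records the invalidation. The struct and function names here are hypothetical, chosen to mirror the diff; the interleaving is modeled sequentially for determinism.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a shadow gmap's validity state. */
struct shadow {
	_Atomic(void *) parent;   /* cleared on detach, may be re-set later */
	atomic_bool invalidated;  /* sticky until explicitly re-validated */
};

/* Old-style check: only notices the race while parent happens to be NULL. */
static int update_old(struct shadow *sg)
{
	if (atomic_load(&sg->parent) == NULL)
		return -1;        /* stands in for -EAGAIN */
	return 0;                 /* may proceed on stale shadow state */
}

/* Patched-style check: an invalidation is remembered even if the shadow
 * was re-attached to a parent before this thread looks again. */
static int update_new(struct shadow *sg)
{
	if (atomic_load(&sg->invalidated))
		return -1;
	return 0;
}

/* The race window: detach (invalidate) and immediately re-attach, so the
 * parent pointer alone no longer reveals that an invalidation happened. */
static void invalidate_and_reattach(struct shadow *sg, void *parent)
{
	atomic_store(&sg->parent, NULL);
	atomic_store(&sg->invalidated, true);
	atomic_store(&sg->parent, parent);
}
```

After invalidate_and_reattach(), update_old() incorrectly reports the shadow as usable, while update_new() returns the retry code, matching the fix's behavior of checking sg->invalidated in the hot paths.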

Proof of Concept

Note: a reliable runnable PoC for this kernel-level race is not practical in a short snippet, since it requires orchestrating kernel-internal gmap/shadow-page-table races within a KVM guest/host context. Below is a high-level outline that could guide a reproduction attempt in a controlled lab.

PoC outline (high level, not a runnable exploit):

1) Environment: a Linux host with virtualization support, running a guest via KVM on an affected kernel (one prior to this patch). To later validate the fix, prepare a second host kernel with this commit applied.
2) Prepare a guest workload that repeatedly exercises gmap shadow paging (shadow CRSTE/PTE paths) and can trigger partial gmap invalidations while shadow updates are in flight.
3) Create two kernel threads in the host that operate on the same gmap:
   - Thread A: perform a partial gmap invalidation on a shadow gmap (simulating an invalidation race).
   - Thread B: concurrently perform shadow-page updates (reading/writing PTE/CRSTE entries) that rely on sg->parent state.
4) Observe outcomes:
   - Before the fix: races could lead to inconsistent checks around sg->parent vs. sg->invalidated, potentially causing EAGAIN paths to be skipped or stale pointers to be used, in rare cases enabling memory-safety violations or guest memory-isolation leaks.
   - After the fix: sg->invalidated is checked first in critical paths, and invalidation transitions prevent use-after-free-like states; the race is mitigated.
5) Validation signals:
   - Look for crashes, use-after-free warnings, or unexpected EAGAIN returns that correlate with concurrent invalidation/shadow events.
   - Optionally, instrument the kernel with ftrace to confirm the invalidated-flag gating in the hot paths (e.g., _do_shadow_pte, _do_shadow_crste, _gaccess_do_shadow).

This outline assumes the ability to orchestrate kernel-internal race conditions; a small, safe PoC in user space is not feasible because it would need to touch kernel KVM shadow/gmap internals.

Commit Details

Author: Linus Torvalds

Date: 2026-04-11 18:45 UTC

Message:

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:

 "s390:

   - vsie: Fix races with partial gmap invalidations

  x86:

   - Use __DECLARE_FLEX_ARRAY() for UAPI structures with VLAs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: s390: vsie: Fix races with partial gmap invalidations
  KVM: x86: Use __DECLARE_FLEX_ARRAY() for UAPI structures with VLAs

Triage Assessment

Vulnerability Type: Race condition

Confidence: HIGH

Reasoning:

The commit includes patches labeled as 'Fix races with partial gmap invalidations' in KVM s390 code, addressing race conditions that can affect memory management and isolation. It also updates UAPI structures to use flex arrays, mitigating potential unsafe VLAs in user-kernel interfaces. The primary security-related impact is preventing race conditions that could lead to memory safety or isolation breaches in virtualization.
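For context on the flex-array half of the assessment, the sketch below is a simplified userspace rendition of the pattern behind the kernel's __DECLARE_FLEX_ARRAY() helper (defined in include/uapi/linux/stddef.h; the macro name and empty-struct trick here mirror it, but this is an approximation, and the demo struct is hypothetical). ISO C forbids a flexible array as the sole member of a struct or union; wrapping it with a zero-size empty struct (a GNU C extension) makes such declarations legal while keeping the array at the same offset.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's __DECLARE_FLEX_ARRAY() helper. */
#define DECLARE_FLEX_ARRAY(TYPE, NAME) \
	struct {                       \
		struct { } __empty_##NAME; /* zero-size; GNU C extension */ \
		TYPE NAME[];               \
	}

/* Hypothetical demo struct shaped like struct kvm_msr_list. */
struct msr_list_demo {
	unsigned int nmsrs;
	DECLARE_FLEX_ARRAY(unsigned int, indices);
};
```

Because the empty struct occupies no space, the flexible array still starts right after the preceding member, so the binary layout of the UAPI structures is unchanged by the conversion.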

Verification Assessment

Vulnerability Type: Race condition in KVM gmap invalidation / shadow paging (memory safety and isolation risk)

Confidence: HIGH

Affected Versions: v7.0-rc5 and earlier (all 7.0-rc versions prior to this commit)

Code Diff

diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index 53a8550e7102e8..290e03a13a9562 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -1449,7 +1449,7 @@ static int _do_shadow_pte(struct gmap *sg, gpa_t raddr, union pte *ptep_h, union
 	pgste_set_unlock(ptep_h, pgste);
 	if (rc)
 		return rc;
-	if (!sg->parent)
+	if (sg->invalidated)
 		return -EAGAIN;
 	newpte = _pte(f->pfn, 0, !p, 0);
@@ -1479,7 +1479,7 @@ static int _do_shadow_crste(struct gmap *sg, gpa_t raddr, union crste *host, uni
 	do {
 		/* _gmap_crstep_xchg_atomic() could have unshadowed this shadow gmap */
-		if (!sg->parent)
+		if (sg->invalidated)
 			return -EAGAIN;
 		oldcrste = READ_ONCE(*host);
 		newcrste = _crste_fc1(f->pfn, oldcrste.h.tt, f->writable, !p);
@@ -1492,7 +1492,7 @@ static int _do_shadow_crste(struct gmap *sg, gpa_t raddr, union crste *host, uni
 		if (!newcrste.h.p && !f->writable)
 			return -EOPNOTSUPP;
 	} while (!_gmap_crstep_xchg_atomic(sg->parent, host, oldcrste, newcrste, f->gfn, false));
-	if (!sg->parent)
+	if (sg->invalidated)
 		return -EAGAIN;
 	newcrste = _crste_fc1(f->pfn, oldcrste.h.tt, 0, !p);
@@ -1545,7 +1545,7 @@ static int _gaccess_do_shadow(struct kvm_s390_mmu_cache *mc, struct gmap *sg,
 				     entries[i].pfn, i + 1, entries[i].writable);
 		if (rc)
 			return rc;
-		if (!sg->parent)
+		if (sg->invalidated)
 			return -EAGAIN;
 	}
@@ -1601,6 +1601,7 @@ static inline int _gaccess_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	scoped_guard(spinlock, &parent->children_lock) {
 		if (READ_ONCE(sg->parent) != parent)
 			return -EAGAIN;
+		sg->invalidated = false;
 		rc = _gaccess_do_shadow(vcpu->arch.mc, sg, saddr, walk);
 	}
 	if (rc == -ENOMEM)
diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c
index 645c32c767d24b..0111d31e038656 100644
--- a/arch/s390/kvm/gmap.c
+++ b/arch/s390/kvm/gmap.c
@@ -181,6 +181,7 @@ void gmap_remove_child(struct gmap *child)
 	list_del(&child->list);
 	child->parent = NULL;
+	child->invalidated = true;
 }
 
 /**
@@ -1069,6 +1070,7 @@ static void gmap_unshadow_level(struct gmap *sg, gfn_t r_gfn, int level)
 	if (level > TABLE_TYPE_PAGE_TABLE)
 		align = 1UL << (11 * level + _SEGMENT_SHIFT);
 	kvm_s390_vsie_gmap_notifier(sg, ALIGN_DOWN(gaddr, align), ALIGN(gaddr + 1, align));
+	sg->invalidated = true;
 	if (dat_entry_walk(NULL, r_gfn, sg->asce, 0, level, &crstep, &ptep))
 		return;
 	if (ptep) {
@@ -1174,6 +1176,7 @@ static inline int __gmap_protect_asce_top_level(struct kvm_s390_mmu_cache *mc, s
 	scoped_guard(spinlock, &parent->children_lock) {
 		if (READ_ONCE(sg->parent) != parent)
 			return -EAGAIN;
+		sg->invalidated = false;
 		for (i = 0; i < CRST_TABLE_PAGES; i++) {
 			if (!context->f[i].valid)
 				continue;
diff --git a/arch/s390/kvm/gmap.h b/arch/s390/kvm/gmap.h
index 579399ef54803d..31ea13fda142bc 100644
--- a/arch/s390/kvm/gmap.h
+++ b/arch/s390/kvm/gmap.h
@@ -60,6 +60,7 @@ enum gmap_flags {
 struct gmap {
 	unsigned long flags;
 	unsigned char edat_level;
+	bool invalidated;
 	struct kvm *kvm;
 	union asce asce;
 	struct list_head list;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 0d4538fa6c31ab..5f2b30d0405c87 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -197,13 +197,13 @@ struct kvm_msrs {
 	__u32 nmsrs; /* number of msrs in entries */
 	__u32 pad;
 
-	struct kvm_msr_entry entries[];
+	__DECLARE_FLEX_ARRAY(struct kvm_msr_entry, entries);
 };
 
 /* for KVM_GET_MSR_INDEX_LIST */
 struct kvm_msr_list {
 	__u32 nmsrs; /* number of msrs in entries */
-	__u32 indices[];
+	__DECLARE_FLEX_ARRAY(__u32, indices);
 };
 
 /* Maximum size of any access bitmap in bytes */
@@ -245,7 +245,7 @@ struct kvm_cpuid_entry {
 struct kvm_cpuid {
 	__u32 nent;
 	__u32 padding;
-	struct kvm_cpuid_entry entries[];
+	__DECLARE_FLEX_ARRAY(struct kvm_cpuid_entry, entries);
 };
 
 struct kvm_cpuid_entry2 {
@@ -267,7 +267,7 @@ struct kvm_cpuid_entry2 {
 struct kvm_cpuid2 {
 	__u32 nent;
 	__u32 padding;
-	struct kvm_cpuid_entry2 entries[];
+	__DECLARE_FLEX_ARRAY(struct kvm_cpuid_entry2, entries);
 };
 
 /* for KVM_GET_PIT and KVM_SET_PIT */
@@ -398,7 +398,7 @@ struct kvm_xsave {
 	 * the contents of CPUID leaf 0xD on the host.
 	 */
 	__u32 region[1024];
-	__u32 extra[];
+	__DECLARE_FLEX_ARRAY(__u32, extra);
 };
 
 #define KVM_MAX_XCRS	16
@@ -566,7 +566,7 @@ struct kvm_pmu_event_filter {
 	__u32 fixed_counter_bitmap;
 	__u32 flags;
 	__u32 pad[4];
-	__u64 events[];
+	__DECLARE_FLEX_ARRAY(__u64, events);
 };
 
 #define KVM_PMU_EVENT_ALLOW 0
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 80364d4dbebb0c..3f0d8d3c3dafd0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -11,6 +11,7 @@
 #include <linux/const.h>
 #include <linux/types.h>
 #include <linux/compiler.h>
+#include <linux/stddef.h>
 #include <linux/ioctl.h>
 #include <asm/kvm.h>
 
@@ -542,7 +543,7 @@ struct kvm_coalesced_mmio {
 struct kvm_coalesced_mmio_ring {
 	__u32 first, last;
-	struct kvm_coalesced_mmio coalesced_mmio[];
+	__DECLARE_FLEX_ARRAY(struct kvm_coalesced_mmio, coalesced_mmio);
 };
 
 #define KVM_COALESCED_MMIO_MAX \
@@ -592,7 +593,7 @@ struct kvm_clear_dirty_log {
 /* for KVM_SET_SIGNAL_MASK */
 struct kvm_signal_mask {
 	__u32 len;
-	__u8 sigset[];
+	__DECLARE_FLEX_ARRAY(__u8, sigset);
 };
 
 /* for KVM_TPR_ACCESS_REPORTING */
@@ -1051,7 +1052,7 @@ struct kvm_irq_routing_entry {
 struct kvm_irq_routing {
 	__u32 nr;
 	__u32 flags;
-	struct kvm_irq_routing_entry entries[];
+	__DECLARE_FLEX_ARRAY(struct kvm_irq_routing_entry, entries);
 };
 
 #define KVM_IRQFD_FLAG_DEASSIGN	(1 << 0)
@@ -1142,7 +1143,7 @@ struct kvm_dirty_tlb {
 struct kvm_reg_list {
 	__u64 n; /* number of regs */
-	__u64 reg[];
+	__DECLARE_FLEX_ARRAY(__u64, reg);
 };
 
 struct kvm_one_reg {
@@ -1608,7 +1609,7 @@ struct kvm_stats_desc {
 #ifdef __KERNEL__
 	char name[KVM_STATS_NAME_SIZE];
 #else
-	char name[];
+	__DECLARE_FLEX_ARRAY(char, name);
 #endif
 };