
Commit 3bda682

Author: CKI KWF Bot (committed)
Merge: KVM: arm64: Map GPU device memory as cacheable
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-10/-/merge_requests/1185

Grace-based platforms such as the Grace Hopper/Blackwell Superchips have CPU-accessible, cache-coherent GPU memory. The GPU device memory is essentially DDR memory and retains properties such as cacheability, unaligned accesses, atomics, and handling of executable faults. This requires the device memory to be mapped as NORMAL in stage-2.

Today KVM forces the memory to either NORMAL or DEVICE_nGnRE depending on whether the memory region is added to the kernel. The KVM code is thus restrictive and prevents device memory that is not added to the kernel from being marked as cacheable. This series aims to solve that. A cacheability check is made by consulting the VMA pgprot value. If the pgprot mapping type is cacheable, it is considered safe to map it cacheable in the stage-2, as the KVM S2 will then have the same Normal memory type as the VMA has in the S1, and KVM has no additional responsibility for safety.

Note that when FWB (Force Write Back) is not enabled, the kernel expects to trivially do cache management by flushing the memory, linearly converting a kvm_pte to a phys_addr and on to a KVA. Cache management thus relies on the memory being kernel-mapped. Since the GPU device memory is not kernel-mapped, exit when FWB is not supported. Similarly, ARM64_HAS_CACHE_DIC allows KVM to avoid flushing the icache and turns icache_inval_pou() into a NOP, so the cacheable PFNMAP support is made contingent on these two hardware features. The ability to safely do the cacheable mapping of PFNMAP is exposed through a KVM capability for userspace consumption.

  KVM: arm64: Rename the device variable to s2_force_noncacheable
  KVM: arm64: Update the check to detect device memory
  KVM: arm64: Block cacheable PFNMAP mapping
  KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
  KVM: arm64: Expose new KVM cap for cacheable PFNMAP

  Documentation/virt/kvm/api.rst |  10 +++
  arch/arm64/kvm/arm.c           |   7 ++
  arch/arm64/kvm/mmu.c           | 118 ++++++++++++++++++++++++++-------
  include/linux/kvm_host.h       |   2 +
  include/uapi/linux/kvm.h       |   1 +
  virt/kvm/kvm_main.c            |   5 ++

NOTE: This patch series is a backport from kvm-arm's next branch, as this functionality isn't slated for upstream inclusion until v6.17-rc1, which is too late to create an MR for RHEL-10.1 inclusion. This functionality is needed in a RHEL-10.1 host so that a device-assigned Hopper GPU does not hang a guest when basic nvidia-smi commands are executed to provide functional information about the (assigned) GPU in a guest VM. The nvidia-vgpu vfio-pci variant driver was merged into RHEL-10.1 in an earlier kernel, which enabled Hopper/Blackwell device assignment to a VM; this patch set completes the functionality by making the device usable in the VM.

This series has been in upstream development for over a year and has had significant review by ARM, KVM, and mm maintainers, per the upstream posting: "The changes are heavily influenced by the discussions among maintainers Marc Zyngier and Oliver Upton besides Jason Gunthorpe, Catalin Marinas, David Hildenbrand, Sean Christopherson [1]. Many thanks for their valuable suggestions." The commit IDs used in this backport from the kvm-arm next branch are expected to be the same when eventually pulled into Linus's tree for the v6.17-rc1 merge (famous last words).

JIRA: https://issues.redhat.com/browse/RHEL-73607
Upstream: https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git

Signed-off-by: Donald Dutile <ddutile@redhat.com>
Approved-by: David Hildenbrand <david@redhat.com>
Approved-by: Gavin Shan <gshan@redhat.com>
Approved-by: Sebastian Ott <sebott@redhat.com>
Approved-by: Cornelia Huck <cohuck@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: CKI GitLab Kmaint Pipeline Bot <26919896-cki-kmaint-pipeline-bot@users.noreply.gitlab.com>
2 parents 1ef1899 + e11788e commit 3bda682

File tree

5 files changed: +103, -24 lines

Documentation/virt/kvm/api.rst (12 additions, 1 deletion)

```diff
@@ -8490,7 +8490,7 @@ ENOSYS for the others.
 When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
 type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
 
-7.37 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
+7.42 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
 -------------------------------------
 
 :Architectures: arm64
@@ -8508,6 +8508,17 @@ aforementioned registers before the first KVM_RUN. These registers are VM
 scoped, meaning that the same set of values are presented on all vCPUs in a
 given VM.
 
+7.43 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED
+-------------------------------------------
+
+:Architectures: arm64
+:Target: VM
+:Parameters: None
+
+This capability indicate to the userspace whether a PFNMAP memory region
+can be safely mapped as cacheable. This relies on the presence of
+force write back (FWB) feature support on the hardware.
+
 8. Other capabilities.
 ======================
```

arch/arm64/include/asm/kvm_mmu.h (18 additions, 0 deletions)

```diff
@@ -371,6 +371,24 @@ static inline void kvm_fault_unlock(struct kvm *kvm)
 	read_unlock(&kvm->mmu_lock);
 }
 
+/*
+ * ARM64 KVM relies on a simple conversion from physaddr to a kernel
+ * virtual address (KVA) when it does cache maintenance as the CMO
+ * instructions work on virtual addresses. This is incompatible with
+ * VM_PFNMAP VMAs which may not have a kernel direct mapping to a
+ * virtual address.
+ *
+ * With S2FWB and CACHE DIC features, KVM need not do cache flushing
+ * and CMOs are NOP'd. This has the effect of no longer requiring a
+ * KVA for addresses mapped into the S2. The presence of these features
+ * are thus necessary to support cacheable S2 mapping of VM_PFNMAP.
+ */
+static inline bool kvm_supports_cacheable_pfnmap(void)
+{
+	return cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) &&
+	       cpus_have_final_cap(ARM64_HAS_CACHE_DIC);
+}
+
 #ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
 void kvm_s2_ptdump_create_debugfs(struct kvm *kvm);
 #else
```
arch/arm64/kvm/arm.c (7 additions, 0 deletions)

```diff
@@ -408,6 +408,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES:
 		r = BIT(0);
 		break;
+	case KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED:
+		if (!kvm)
+			r = -EINVAL;
+		else
+			r = kvm_supports_cacheable_pfnmap();
+		break;
 	default:
 		r = 0;
 	}
```

arch/arm64/kvm/mmu.c (65 additions, 23 deletions)

```diff
@@ -193,11 +193,6 @@ int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm,
 	return 0;
 }
 
-static bool kvm_is_device_pfn(unsigned long pfn)
-{
-	return !pfn_is_map_memory(pfn);
-}
-
 static void *stage2_memcache_zalloc_page(void *arg)
 {
 	struct kvm_mmu_memory_cache *mc = arg;
@@ -1466,15 +1461,27 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
+{
+	switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
+	case MT_NORMAL_NC:
+	case MT_DEVICE_nGnRnE:
+	case MT_DEVICE_nGnRE:
+		return false;
+	default:
+		return true;
+	}
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_s2_trans *nested,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  bool fault_is_perm)
 {
 	int ret = 0;
 	bool write_fault, writable, force_pte = false;
-	bool exec_fault, mte_allowed;
-	bool device = false, vfio_allow_any_uc = false;
+	bool exec_fault, mte_allowed, is_vma_cacheable;
+	bool s2_force_noncacheable = false, vfio_allow_any_uc = false;
 	unsigned long mmu_seq;
 	phys_addr_t ipa = fault_ipa;
 	struct kvm *kvm = vcpu->kvm;
@@ -1488,6 +1495,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
 	struct page *page;
+	vm_flags_t vm_flags;
 	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
 
 	if (fault_is_perm)
@@ -1615,6 +1623,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
 
+	vm_flags = vma->vm_flags;
+
+	is_vma_cacheable = kvm_vma_is_cacheable(vma);
+
 	/* Don't use the VMA after the unlock -- it may have vanished */
 	vma = NULL;
 
@@ -1638,18 +1650,39 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (is_error_noslot_pfn(pfn))
 		return -EFAULT;
 
-	if (kvm_is_device_pfn(pfn)) {
-		/*
-		 * If the page was identified as device early by looking at
-		 * the VMA flags, vma_pagesize is already representing the
-		 * largest quantity we can map. If instead it was mapped
-		 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
-		 * and must not be upgraded.
-		 *
-		 * In both cases, we don't let transparent_hugepage_adjust()
-		 * change things at the last minute.
-		 */
-		device = true;
+	/*
+	 * Check if this is non-struct page memory PFN, and cannot support
+	 * CMOs. It could potentially be unsafe to access as cachable.
+	 */
+	if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) {
+		if (is_vma_cacheable) {
+			/*
+			 * Whilst the VMA owner expects cacheable mapping to this
+			 * PFN, hardware also has to support the FWB and CACHE DIC
+			 * features.
+			 *
+			 * ARM64 KVM relies on kernel VA mapping to the PFN to
+			 * perform cache maintenance as the CMO instructions work on
+			 * virtual addresses. VM_PFNMAP region are not necessarily
+			 * mapped to a KVA and hence the presence of hardware features
+			 * S2FWB and CACHE DIC are mandatory to avoid the need for
+			 * cache maintenance.
+			 */
+			if (!kvm_supports_cacheable_pfnmap())
+				return -EFAULT;
+		} else {
+			/*
+			 * If the page was identified as device early by looking at
+			 * the VMA flags, vma_pagesize is already representing the
+			 * largest quantity we can map. If instead it was mapped
+			 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
+			 * and must not be upgraded.
+			 *
+			 * In both cases, we don't let transparent_hugepage_adjust()
+			 * change things at the last minute.
+			 */
+			s2_force_noncacheable = true;
+		}
 	} else if (logging_active && !write_fault) {
 		/*
 		 * Only actually map the page as writable if this was a write
@@ -1658,7 +1691,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		writable = false;
 	}
 
-	if (exec_fault && device)
+	if (exec_fault && s2_force_noncacheable)
 		return -ENOEXEC;
 
 	/*
@@ -1691,7 +1724,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * If we are not forced to use page mapping, check if we are
 	 * backed by a THP and thus use block mapping if possible.
 	 */
-	if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
+	if (vma_pagesize == PAGE_SIZE && !(force_pte || s2_force_noncacheable)) {
 		if (fault_is_perm && fault_granule > PAGE_SIZE)
 			vma_pagesize = fault_granule;
 		else
@@ -1705,7 +1738,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		}
 	}
 
-	if (!fault_is_perm && !device && kvm_has_mte(kvm)) {
+	if (!fault_is_perm && !s2_force_noncacheable && kvm_has_mte(kvm)) {
 		/* Check the VMM hasn't introduced a new disallowed VMA */
 		if (mte_allowed) {
 			sanitise_mte_tags(kvm, pfn, vma_pagesize);
@@ -1721,7 +1754,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (exec_fault)
 		prot |= KVM_PGTABLE_PROT_X;
 
-	if (device) {
+	if (s2_force_noncacheable) {
 		if (vfio_allow_any_uc)
 			prot |= KVM_PGTABLE_PROT_NORMAL_NC;
 		else
@@ -2217,6 +2250,15 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 			ret = -EINVAL;
 			break;
 		}
+
+		/*
+		 * Cacheable PFNMAP is allowed only if the hardware
+		 * supports it.
+		 */
+		if (kvm_vma_is_cacheable(vma) && !kvm_supports_cacheable_pfnmap()) {
+			ret = -EINVAL;
+			break;
+		}
 		hva = min(reg_end, vma->vm_end);
 	} while (hva < reg_end);
```

include/uapi/linux/kvm.h (1 addition, 0 deletions)

```diff
@@ -932,6 +932,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239
 #define KVM_CAP_ARM_EL2 240
 #define KVM_CAP_ARM_EL2_E2H0 241
+#define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
```
