This just allows read/write of three feature bits. ASID is still
ignored. Any writes to TTBR0_EL0 and TTBR1_EL0, including changing
the ASID, will still cause a complete flush of the TLB.
Signed-off-by: Jim MacArthur <jim.macarthur@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Gustavo Romero <gustavo.romero@linaro.org>
Signed-off-by: Jim MacArthur <jim.macarthur@linaro.org>
[PMM: add entry to v8_user_idregs[] list also]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
hw/arm/boot.h is included twice
Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Since commit f2e61edb29 ("hw/loongarch/virt: Use MemTxAttrs interface
for misc ops"), which added a call to g_assert_not_reached() in the path
that handles unimplemented IOCSRs, QEMU aborts when the guest accesses
an unimplemented IOCSR.
This is too severe, since nothing fatal has happened in QEMU itself,
and the guest can probably continue running if we return zero for
these reads, which also matches the behavior observed on a real
3A5000M machine.
Replace the assertion with qemu_log_mask(LOG_UNIMP, ...). Unimplemented
IOCSR accesses can still be examined with the "-d unimp" command line
argument.
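A minimal sketch of the behaviour described above, not the actual QEMU code. The register offset, the function names and the log_unimp() stand-in for qemu_log_mask(LOG_UNIMP, ...) are all hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical register offset, for illustration only. */
#define SKETCH_IOCSR_FEATURES 0x8

/* Counts unimplemented accesses so the effect can be observed. */
static unsigned unimp_count;

/* Stand-in for qemu_log_mask(LOG_UNIMP, ...). */
static void log_unimp(uint64_t addr)
{
    unimp_count++;
    fprintf(stderr, "unimp: IOCSR read at 0x%llx\n",
            (unsigned long long)addr);
}

/* Known registers return a value; unknown ones are logged and read as zero. */
static uint64_t iocsr_read_sketch(uint64_t addr)
{
    switch (addr) {
    case SKETCH_IOCSR_FEATURES:
        return 0x1;          /* made-up value */
    default:
        log_unimp(addr);     /* an assertion here would abort QEMU instead */
        return 0;
    }
}
```

The point of the design is that a guest poking at an unknown register is a guest-visible condition, not an internal QEMU bug, so it is reported rather than asserted.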
Fixes: f2e61edb29 ("hw/loongarch/virt: Use MemTxAttrs interface for misc ops")
Signed-off-by: Yao Zi <me@ziyao.cc>
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
The ADEM/ADEF exceptions need to update CSR_BADV with the value taken
from the virtual address.
Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
The BCE exception needs to update CSR_BADV; the value is env->pc.
Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
According to section 7.4.8 of the Volume 1 manual, the SYS, BRK, INE, IPE,
PPD, FPE, SXD and ASXD exceptions do not need to update CSR_BADV; this patch
corrects that.
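As a rough sketch of the resulting rule (the enum, its values and the function name are hypothetical and do not reflect the real exception codes), the choice of CSR_BADV value per exception could be modelled as:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical exception identifiers; only the names come from the text. */
typedef enum {
    SK_EXC_ADEF,
    SK_EXC_ADEM,
    SK_EXC_BCE,
    SK_EXC_SYS,
    SK_EXC_BRK,
    SK_EXC_INE,
    SK_EXC_IPE,
    SK_EXC_FPE,
    SK_EXC_SXD,
    SK_EXC_ASXD
} SketchExc;

/*
 * Returns true and stores the new BADV value when the exception updates
 * CSR_BADV; returns false and leaves *badv untouched otherwise.
 */
static bool sketch_badv_value(SketchExc exc, uint64_t fault_vaddr,
                              uint64_t pc, uint64_t *badv)
{
    switch (exc) {
    case SK_EXC_ADEF:
    case SK_EXC_ADEM:
        *badv = fault_vaddr;   /* address error: faulting virtual address */
        return true;
    case SK_EXC_BCE:
        *badv = pc;            /* bound check error: current pc */
        return true;
    default:
        return false;          /* SYS, BRK, INE, ... leave BADV alone */
    }
}
```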
Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
When we use the -kernel parameter to start an ELF format kernel that relies on
fdt, we get the following error:
pcieport 0000:00:01.0: of_irq_parse_pci: failed with rc=-22
pcieport 0000:00:01.0: enabling device (0000 -> 0003)
pcieport 0000:00:01.0: PME: Signaling with IRQ 19
pcieport 0000:00:01.0: AER: enabled with IRQ 19
pcieport 0000:00:01.1: of_irq_parse_pci: failed with rc=-22
pcieport 0000:00:01.1: enabling device (0000 -> 0003)
pcieport 0000:00:01.1: PME: Signaling with IRQ 20
pcieport 0000:00:01.1: AER: enabled with IRQ 20
pcieport 0000:00:01.2: of_irq_parse_pci: failed with rc=-22
pcieport 0000:00:01.2: enabling device (0000 -> 0003)
pcieport 0000:00:01.2: PME: Signaling with IRQ 21
pcieport 0000:00:01.2: AER: enabled with IRQ 21
pcieport 0000:00:01.3: of_irq_parse_pci: failed with rc=-22
pcieport 0000:00:01.3: enabling device (0000 -> 0003)
pcieport 0000:00:01.3: PME: Signaling with IRQ 22
pcieport 0000:00:01.3: AER: enabled with IRQ 22
pcieport 0000:00:01.4: of_irq_parse_pci: failed with rc=-22
This is because the description of interrupt-cell is missing from the pcie
irq map, and the description of the interrupt trigger type is missing as
well. Fix this by adding the correct interrupt-cell to the pcie irq map.
Refer to the implementation in arm and add some comments.
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
In the loongarch virt fdt file, the interrupt trigger type is given
directly as magic numbers. Use macro definitions instead, following the
definitions in the Linux kernel.
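For illustration only: the values below mirror the IRQ_TYPE_* definitions in Linux's include/dt-bindings/interrupt-controller/irq.h, while the SKETCH_ names, the two-cell specifier layout and the helper are assumptions, not the names used in the patch:

```c
#include <stdint.h>

/* Values as in Linux's dt-bindings irq.h; the SKETCH_ prefix is made up. */
#define SKETCH_IRQ_TYPE_NONE          0
#define SKETCH_IRQ_TYPE_EDGE_RISING   1
#define SKETCH_IRQ_TYPE_EDGE_FALLING  2
#define SKETCH_IRQ_TYPE_LEVEL_HIGH    4
#define SKETCH_IRQ_TYPE_LEVEL_LOW     8

/*
 * Fill an assumed two-cell interrupt specifier <irq, trigger type>.
 * Conversion to big-endian FDT cells is omitted for brevity.
 */
static void sketch_fill_irq_cells(uint32_t cells[2], uint32_t irq,
                                  uint32_t trigger)
{
    cells[0] = irq;
    cells[1] = trigger;
}
```

A call such as sketch_fill_irq_cells(cells, 19, SKETCH_IRQ_TYPE_LEVEL_HIGH) then documents the trigger type at the call site instead of a bare 4.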
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Merge tag 'pull-vfio-20260113' of https://github.com/legoater/qemu into staging
vfio queue:
* Resolves build errors with gcc 16
* Adjusts the Linux headers for s390x and mshv
* Fixes endianness issue in the VFIO helper functions
* Adds support for live migration with vIOMMU when using IOMMU
  dirty tracking
* Implements a migration blocker to prevent failures when VM
  memory is too large
* Corrects an unmap_bitmap failure in the legacy VFIO backend
* Adds a workaround for an Intel IOMMU erratum
* Implements Intel IOMMU first stage translation for passthrough
  devices, which is also prerequisite work for vSVA
* Updates documentation
# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEEoPZlSPBIlev+awtgUaNDx8/77KEFAmlmEgQACgkQUaNDx8/7
# 7KFm2w/+JwlyiY5jWjzvCBCEEgBdBrb8XzMoSFr2xWNQrNHvE23veeQJcT+5LwQI
# DV74Y3wmWYeAVGGKHVoALVEIJYtjVDOPU5TIyhr4nTMO8/A2j1ylBhsP6ZnWYYkO
# uFe92O3wTHViFY5h9dgm1JsA3Bok52mteAHAE5gsxCNYk6h+ps1a5UZM8wxjtNA2
# yVIvAZvaubnA/0yN02pz5bCOhPpaGpkV69l7nJSHwk2RPuspUR6dWo11P2yjxVDQ
# 7pv7DbLl9qm+xdmOp0ANVPKp9fqBJnBa/ta1Dn1VrQ2iJXnwezy+IdNC1In/HKKy
# ZHe+V/p2JA09xjjmB2fu53DQQIjh/qeCWi0b2vkDZZVvl0hJ+0y9P1GRxhwBhtgK
# /vwvgKGwC3OwXcdrxXNvD4Yy4NJLUtCoN8vmyI41ohLeMfr7/XrmTrf0J4ciPc4T
# 1bAHY2SWkFL59ylN+gt1khlV8zqPYP9S1i08A2wJjvLOwqRJ/LN2tNEh9pWvGmFg
# p5WGTNeZLsfD+ZT10bm083EMAc1va7RTQNjAzb55pxq0ASPl7ZIVAKqazaG9QsaK
# apPxGGYevuWzJVaNYWAqj7y37WDP/w6rKmyRmIBMV+x9+Dv+DGPbGb8oAOjZ0Av5
# 489mHYIONxp//2SvaUSfGpQHACCgEKTHlstdlyw79C84xPzHujE=
# =o9aW
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue 13 Jan 2026 08:36:04 PM AEDT
# gpg: using RSA key A0F66548F04895EBFE6B0B6051A343C7CFFBECA1
# gpg: Good signature from "Cédric Le Goater <clg@redhat.com>" [full]
# gpg: aka "Cédric Le Goater <clg@kaod.org>" [full]
* tag 'pull-vfio-20260113' of https://github.com/legoater/qemu: (41 commits)
tests/rcutorture: Fix build error
tests/qtest: Fix build error
target/riscv: Fix build errors
ppc/vof: Fix build error
update-linux-headers: Remove "asm-s390/unistd_32.h"
include/hw/hyperv: Remove unused 'struct mshv_vp_registers' definition
util/vfio-helper: Fix endianness in PCI config read/write functions
Workaround for ERRATA_772415_SPR17
vfio/listener: Bypass readonly region for dirty tracking
intel_iommu_accel: Implement get_host_iommu_quirks() callback
hw/pci: Introduce pci_device_get_host_iommu_quirks()
vfio/migration: Allow live migration with vIOMMU without VFs using device dirty tracking
vfio/migration: Add migration blocker if VM memory is too large to cause unmap_bitmap failure
vfio/listener: Add missing dirty tracking in region_del
intel_iommu: Fix unmap_bitmap failure with legacy VFIO backend
vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
vfio: Add a backend_flag parameter to vfio_container_query_dirty_bitmap()
vfio/container-legacy: rename vfio_dma_unmap_bitmap() to vfio_legacy_dma_unmap_get_dirty_bitmap()
vfio/iommufd: Query dirty bitmap before DMA unmap
vfio/iommufd: Add framework code to support getting dirty bitmap before unmap
...
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
VEX is only forbidden in real and vm86 mode; 16-bit protected mode supports
it for some unfathomable reason.
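A small sketch of the mode check this implies. The bit positions of CR0.PE and EFLAGS.VM are architectural; the function and macro names are hypothetical and this is not the decoder's actual code:

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CR0_PE     (1u << 0)    /* protected mode enable */
#define SKETCH_EFLAGS_VM  (1u << 17)   /* virtual-8086 mode */

/*
 * VEX is rejected only in real mode and vm86 mode. The code segment size
 * is deliberately not consulted, so 16-bit protected mode is accepted.
 */
static bool sketch_vex_allowed(uint32_t cr0, uint32_t eflags)
{
    bool real_mode = !(cr0 & SKETCH_CR0_PE);
    bool vm86_mode = (eflags & SKETCH_EFLAGS_VM) != 0;

    return !real_mode && !vm86_mode;
}
```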
Cc: qemu-stable@nongnu.org
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VSIB can have either 32-bit or 64-bit addresses; pass a constant mask to
the helper and apply it before the load.
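As a sketch of the idea, with a hypothetical function name and signature rather than the real helper, the per-element address is truncated by the mask before it is used for the load:

```c
#include <stdint.h>

/*
 * Compute one gather element address as base + (index << scale_shift),
 * then truncate it to the address size with a constant mask:
 * 0xffffffff for 32-bit addressing, all ones for 64-bit addressing.
 */
static uint64_t sketch_vsib_addr(uint64_t base, int64_t index,
                                 unsigned scale_shift, uint64_t mask)
{
    uint64_t addr = base + ((uint64_t)index << scale_shift);

    return addr & mask;
}
```

With a 32-bit mask, an address that would overflow past 4GB wraps around instead of reaching into the upper half of the 64-bit space.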
Cc: qemu-stable@nongnu.org
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the vex_special field was not initialized, it was considered to be
X86_VEX_SSEUnaligned (whose value was zero). Add a new value to
fix that.
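The underlying pattern, sketched with hypothetical names (the actual enumerator added by the patch is not given here): reserve the value zero for "not set" so that zero-initialized table entries are not mistaken for a real special case.

```c
/* Hypothetical enum; value 0 is reserved for "field not set". */
typedef enum {
    SKETCH_VEX_SPECIAL_NONE = 0,
    SKETCH_VEX_SSE_UNALIGNED,    /* no longer aliases the default */
    SKETCH_VEX_OTHER
} SketchVexSpecial;

/* Hypothetical decode table entry. */
typedef struct {
    int opcode;
    SketchVexSpecial vex_special;
} SketchOpEntry;

/* True only when the entry explicitly asked for the unaligned behaviour. */
static int sketch_is_sse_unaligned(const SketchOpEntry *e)
{
    return e->vex_special == SKETCH_VEX_SSE_UNALIGNED;
}
```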
Cc: qemu-stable@nongnu.org
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove dead code; it arose when I noticed that, because 0x3? opcodes do
have a pop, case 0x32 works just fine as fcomp (even though 0x?2 is fcom):
there is no need to hack the op to 0x03.
Reported by Coverity as CID 1643922.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The value that is pushed by PUSHF is the full EFLAGS, while CC_OP_EFLAGS
only wants arithmetic flags in CC_SRC. To avoid this, follow what other
helpers do and set CC_SRC/CC_OP directly in helper_read_eflags. This
is basically free and fixes an issue booting Windows 3.11.
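To illustrate the distinction only (this is not helper_read_eflags itself, and the macro and function names are made up), the arithmetic subset of EFLAGS can be isolated with a mask built from the architectural bit positions:

```c
#include <stdint.h>

/* Architectural EFLAGS bit positions of the six arithmetic flags. */
#define SKETCH_FLAG_CF 0x0001u
#define SKETCH_FLAG_PF 0x0004u
#define SKETCH_FLAG_AF 0x0010u
#define SKETCH_FLAG_ZF 0x0040u
#define SKETCH_FLAG_SF 0x0080u
#define SKETCH_FLAG_OF 0x0800u

#define SKETCH_ARITH_FLAGS (SKETCH_FLAG_CF | SKETCH_FLAG_PF | \
                            SKETCH_FLAG_AF | SKETCH_FLAG_ZF | \
                            SKETCH_FLAG_SF | SKETCH_FLAG_OF)

/* Drop IF, DF, IOPL and the other non-arithmetic bits of a full EFLAGS. */
static uint32_t sketch_arith_flags_only(uint32_t eflags)
{
    return eflags & SKETCH_ARITH_FLAGS;
}
```

For example, a full EFLAGS of 0x246 (IF, ZF, PF and the always-set bit 1) reduces to 0x44, which is what a consumer of arithmetic flags expects to see.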
Reported-by: Mark Cave-Ayland <mark.caveayland@nutanix.com>
Fixes: e661e2d7a3 ("target/i386/tcg: update cc_op after PUSHF", 2025-12-27)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Newer gcc compiler (version 16.0.0 20260103 (Red Hat 16.0.0-0) (GCC))
detects an unused variable error:
../tests/unit/rcutorture.c: In function ‘rcu_read_stress_test’:
../tests/unit/rcutorture.c:251:18: error: variable ‘garbage’ set but not used [-Werror=unused-but-set-variable=]
251 | volatile int garbage = 0;
| ^~~~~~~
Since the 'garbage' variable is used to generate memory reads from the
CPU while holding the RCU lock, it cannot be removed. Tag it as
((unused)) instead to silence the compiler warnings/errors.
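A reduced sketch of the pattern, assuming the GCC/Clang __attribute__((unused)) spelling; the function below is hypothetical and much simpler than the real test:

```c
/*
 * The volatile sink forces one memory read per iteration. Its value is
 * never consumed, so it is tagged unused to keep the compiler quiet.
 */
static int sketch_read_stress(const volatile int *data, int n)
{
    volatile int garbage __attribute__((unused)) = 0;
    int reads = 0;

    for (int i = 0; i < n; i++) {
        garbage = data[i];   /* the read is the whole point */
        reads++;
    }
    return reads;
}
```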
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Link: https://lore.kernel.org/qemu-devel/20260112163350.1251114-1-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Newer gcc compiler (version 16.0.0 20260103 (Red Hat 16.0.0-0) (GCC))
detects an unused variable error:
../tests/qtest/libqtest.c: In function ‘qtest_qom_has_concrete_type’:
../tests/qtest/libqtest.c:1044:9: error: variable ‘idx’ set but not used [-Werror=unused-but-set-variable=]
Remove idx.
Cc: Fabiano Rosas <farosas@suse.de>
Cc: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/qemu-devel/20260112123146.1010621-1-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Newer gcc compiler (version 16.0.0 20260103 (Red Hat 16.0.0-0) (GCC))
detects a truncation error:
../target/riscv/cpu.c: In function ‘riscv_isa_write_fdt’:
../target/riscv/cpu.c:2916:35: error: ‘%d’ directive output may be truncated writing between 1 and 11 bytes into a region of size 5 [-Werror=format-truncation=]
2916 | snprintf(isa_base, maxlen, "rv%di", xlen);
| ^~
../target/riscv/cpu.c:2916:32: note: directive argument in the range [-2147483648, 2147483632]
2916 | snprintf(isa_base, maxlen, "rv%di", xlen);
| ^~~~~~~
Since the xlen variable represents the register width (32, 64, 128) in
the RISC-V base ISA name, mask its value with an 8-bit bitmask to
satisfy the size constraints on the snprintf output.
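A stand-alone sketch of the masking approach; the function name and the buffer handling are assumptions, not the code in target/riscv/cpu.c:

```c
#include <stdio.h>

/*
 * Format the base ISA name "rv<xlen>i". Masking xlen to 8 bits bounds
 * the %d output to at most three digits, so the compiler can prove the
 * result fits a small buffer.
 */
static int sketch_format_isa_base(char *buf, size_t maxlen, int xlen)
{
    return snprintf(buf, maxlen, "rv%di", xlen & 0xff);
}
```

The valid widths 32, 64 and 128 all pass through the mask unchanged, so the visible output does not change.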
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Alistair Francis <alistair.francis@wdc.com>
Cc: Weiwei Li <liwei1518@gmail.com>
Cc: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Cc: Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
Reviewed-by: Daniel Henrique Barboza <daniel.barboza@oss.qualcomm.com>
Link: https://lore.kernel.org/qemu-devel/20260112161626.1232639-1-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Newer gcc compiler (version 16.0.0 20260103 (Red Hat 16.0.0-0) (GCC))
detects an unused variable error:
../hw/ppc/vof.c: In function ‘vof_dt_memory_available’:
../hw/ppc/vof.c:642:12: error: variable ‘n’ set but not used [-Werror=unused-but-set-variable=]
Remove 'n'.
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/qemu-devel/20260112124722.1029212-1-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
The "asm/unistd_32.h" file was generated for the 31-bit compatibility
mode on the s390 architecture, and support was removed in v6.19-rc1 by
commit 4ac286c4a8d9 ("s390/syscalls: Switch to generic system call
table generation").
unistd_32.h is no longer generated when running make headers_install.
Remove it.
Reported-by: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Thomas Huth <thuth@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260112155341.1209988-1-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
The 'struct mshv_vp_registers' definition in hvgdk_mini.h is unused in
QEMU and conflicts with the canonical definition in
linux-headers/linux/mshv.h.
Remove the duplicate definition to avoid build conflicts when the Linux
headers are updated.
Cc: Magnus Kulke <magnuskulke@linux.microsoft.com>
Reviewed-by: Magnus Kulke <magnuskulke@linux.microsoft.com>
Link: https://lore.kernel.org/qemu-devel/20260108185012.2568277-1-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
The VFIO pread/pwrite functions use little-endian data format. Currently, the
qemu_vfio_pci_read_config() and qemu_vfio_pci_write_config() don't correctly
convert from CPU native endian format to little-endian (and vice versa) when
using the pread/pwrite functions. Fix this by limiting read/write to 32 bits
and handling endian conversion in qemu_vfio_pci_read_config() and
qemu_vfio_pci_write_config().
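A portable sketch of the conversion involved. QEMU has its own byte swap helpers for this; the two functions below are hypothetical stand-ins that work on raw bytes so they behave the same on little-endian and big-endian hosts:

```c
#include <stdint.h>

/* Decode a 32-bit little-endian value from a raw byte buffer. */
static uint32_t sketch_le32_from_bytes(const uint8_t b[4])
{
    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}

/* Encode a CPU-native 32-bit value as little-endian bytes. */
static void sketch_le32_to_bytes(uint32_t v, uint8_t b[4])
{
    b[0] = (uint8_t)(v & 0xff);
    b[1] = (uint8_t)((v >> 8) & 0xff);
    b[2] = (uint8_t)((v >> 16) & 0xff);
    b[3] = (uint8_t)((v >> 24) & 0xff);
}
```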
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260105222029.2423-1-alifm@linux.ibm.com
[ clg: Fixed typo in subject ]
Signed-off-by: Cédric Le Goater <clg@redhat.com>
On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this erratum, even a readonly
range mapped in the second stage page table can still be written.
Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
Update, Errata Details, SPR17.
Link https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update/
Backup https://cdrdv2.intel.com/v1/dl/getContent/772415
The SPR17 details, copied from the link above:
"Problem: When remapping hardware is configured by system software in
scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the
PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended
Access bit if enabled) in first-stage page-table entries even when
second-stage mappings indicate that corresponding first-stage page-table
is Read-Only.
Implication: Due to this erratum, pages mapped as Read-only in second-stage
page-tables may be modified by remapping hardware Access/Dirty bit updates.
Workaround: None identified. System software enabling nested translations
for a VM should ensure that there are no read-only pages in the
corresponding second-stage mappings."
Introduce a helper vfio_device_get_host_iommu_quirk_bypass_ro to check if
readonly mappings should be bypassed.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/qemu-devel/20260106062808.316574-5-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
When doing dirty tracking or calculating the dirty tracking range, readonly
regions can be bypassed, because the corresponding DMA mappings are readonly
and never become dirty.
This optimizes dirty tracking a bit for passthrough devices.
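A simplified sketch of the range calculation with readonly sections skipped. The structure and function are hypothetical and far smaller than the real listener code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one mapped section. */
typedef struct {
    uint64_t start;
    uint64_t size;
    bool readonly;
} SketchSection;

/*
 * Compute the [min, max] span that needs dirty tracking, ignoring
 * readonly sections. Returns false when nothing needs tracking.
 */
static bool sketch_dirty_range(const SketchSection *s, size_t n,
                               uint64_t *min, uint64_t *max)
{
    bool found = false;

    for (size_t i = 0; i < n; i++) {
        uint64_t end = s[i].start + s[i].size - 1;

        if (s[i].readonly || s[i].size == 0) {
            continue;   /* readonly mappings never become dirty */
        }
        if (!found || s[i].start < *min) {
            *min = s[i].start;
        }
        if (!found || end > *max) {
            *max = end;
        }
        found = true;
    }
    return found;
}
```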
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/qemu-devel/20260106062808.316574-4-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Implement the get_host_iommu_quirks() callback to retrieve the vendor specific
hardware information data and convert it into bitmaps defined with enum
host_iommu_quirks. It will be used by VFIO in a subsequent patch.
Suggested-by: Eric Auger <eric.auger@redhat.com>
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/qemu-devel/20260106062808.316574-3-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
In VFIO core, we call iommufd_backend_get_device_info() to return vendor
specific hardware information data, but it's not good to retrieve this raw
data in VFIO core.
Introduce a new optional PCIIOMMUOps callback, get_host_iommu_quirks(), which
retrieves the vendor specific hardware information data and converts
it into bitmaps defined with enum host_iommu_quirks.
pci_device_get_host_iommu_quirks() is a wrapper that can be called on a PCI
device potentially protected by a vIOMMU.
Suggested-by: Eric Auger <eric.auger@redhat.com>
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/qemu-devel/20260106062808.316574-2-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Commit e46883204c ("vfio/migration: Block migration with vIOMMU")
introduces a migration blocker when vIOMMU is enabled, because we need
to calculate the IOVA ranges for device dirty tracking. But this is
unnecessary for iommu dirty tracking.
Limit the vfio_viommu_preset() check to those devices which use device
dirty tracking. This allows live migration with VFIO devices which use
iommu dirty tracking.
Suggested-by: Jason Zeng <jason.zeng@intel.com>
Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Rohith S R <rohith.s.r@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-10-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
With the default config, the kernel VFIO IOMMU type1 driver limits the dirty
bitmap to 256MB for the unmap_bitmap ioctl, so a guest memory region can be
no larger than 8TB for the ioctl to succeed.
Be conservative here and limit total guest memory to the max value supported
by the unmap_bitmap ioctl, or else add a migration blocker. The IOMMUFD backend
doesn't have such a limit; one can use it if there is a need to migrate such
a large VM.
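The 8TB figure follows from one bitmap bit per page. A small worked sketch, assuming 4K pages and a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Bytes of guest memory that a dirty bitmap can describe:
 * each bitmap byte holds 8 bits, and each bit covers one page.
 */
static uint64_t sketch_bitmap_coverage(uint64_t bitmap_bytes,
                                       uint64_t page_size)
{
    return bitmap_bytes * 8 * page_size;
}
```

With a 256MB bitmap (2^28 bytes) and 4K pages (2^12 bytes) this gives 2^28 * 2^3 * 2^12 = 2^43 bytes, which is 8TB.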
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-9-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
If a VFIO device in the guest switches from a passthrough (PT) domain to a
block domain, the whole memory address space is unmapped, but we passed a NULL
iotlb entry to unmap_bitmap, so the bitmap query didn't happen and we lost
dirty pages.
By constructing an iotlb entry with iova = gpa for unmap_bitmap, it can
set dirty bits correctly.
For an IOMMU address space, we still send a NULL iotlb because VFIO doesn't
know the actual mappings in the guest. It's the vIOMMU's responsibility to send
actual unmapping notifications, e.g., vtd_address_space_unmap_in_dirty_tracking().
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-8-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
If a VFIO device in guest switches from IOMMU domain to block domain,
vtd_address_space_unmap() is called to unmap whole address space.
If that happens during migration, migration fails with legacy VFIO
backend as below:
Status: failed (vfio_container_dma_unmap(0x561bbbd92d90, 0x100000000000, 0x100000000000) = -7 (Argument list too long))
Because legacy VFIO limits the maximum bitmap size to 256MB, which maps to 8TB
on a 4K page system, the unmap_bitmap ioctl fails when a 16TB UNMAP
notification is sent. Normally such a large UNMAP notification comes from an
IOVA range rather than system memory.
Apart from that, vtd_address_space_unmap() sends the UNMAP notification with
translated_addr = 0, because there is no valid translated_addr for unmapping
a whole iommu memory region. This breaks dirty tracking no matter which VFIO
backend is used.
Fix them all by iterating over the DMAMap list to unmap each range with an
active mapping when global_dirty_tracking is active. global_dirty_tracking is
protected by the BQL, so it's safe to reference it directly. If it's not active,
unmapping the whole address space in one go is optimal.
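The control flow can be sketched as follows. All types and names are hypothetical: an array stands in for the DMAMap list and a recording function stands in for sending an UNMAP notification:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one active mapping. */
typedef struct {
    uint64_t iova;
    uint64_t size;
} SketchRange;

/* Largest notification size seen so far, for observation only. */
static uint64_t sketch_largest_unmap;

/* Stand-in for sending one UNMAP notification. */
static void sketch_notify_unmap(uint64_t iova, uint64_t size)
{
    (void)iova;
    if (size > sketch_largest_unmap) {
        sketch_largest_unmap = size;
    }
}

/*
 * With dirty tracking active, unmap each active mapping on its own so
 * every notification stays small and refers to a real mapping.
 * Otherwise unmap the whole address space in one go.
 * Returns the number of notifications sent.
 */
static size_t sketch_unmap_all(const SketchRange *maps, size_t n,
                               bool dirty_tracking, uint64_t as_size)
{
    if (!dirty_tracking) {
        sketch_notify_unmap(0, as_size);
        return 1;
    }
    for (size_t i = 0; i < n; i++) {
        sketch_notify_unmap(maps[i].iova, maps[i].size);
    }
    return n;
}
```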
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Rohith S R <rohith.s.r@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-7-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the last dirty
bitmap query right before unmap, so that no PTE flushes are needed. This
accelerates the query without issue because unmap will tear down the mapping
anyway.
Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Rohith S R <rohith.s.r@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-6-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
This new parameter will be used in a following patch; currently 0 is passed.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Rohith S R <rohith.s.r@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-5-zhenzhong.duan@intel.com
[ clg: Fixed subject typo ]
Signed-off-by: Cédric Le Goater <clg@redhat.com>
This follows the naming style in container-legacy.c, where low level functions
have the vfio_legacy_ prefix.
No functional changes.
Suggested-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-4-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
When an existing mapping is unmapped, there could already be dirty bits
which need to be recorded before the unmap.
If the dirty bitmap query fails, we still need to do the unmapping, or else
a stale mapping remains, which is risky for the guest.
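A sketch of the ordering described above, with a hypothetical state structure and function names. Reporting the query error after still unmapping is an assumption made for the example:

```c
#include <stdbool.h>

/* Hypothetical state of one mapping. */
typedef struct {
    bool query_fails;      /* simulate a failing dirty bitmap query */
    bool mapped;
    bool dirty_recorded;
} SketchMapping;

/* Stand-in for the dirty bitmap query. */
static int sketch_query_dirty(SketchMapping *m)
{
    if (m->query_fails) {
        return -1;
    }
    m->dirty_recorded = true;
    return 0;
}

/* Stand-in for the DMA unmap. */
static int sketch_unmap(SketchMapping *m)
{
    m->mapped = false;
    return 0;
}

/*
 * Query the dirty bits first, then unmap unconditionally so that no
 * stale mapping is left behind. The first error seen is returned.
 */
static int sketch_unmap_with_dirty(SketchMapping *m)
{
    int query_ret = sketch_query_dirty(m);
    int unmap_ret = sketch_unmap(m);

    return query_ret ? query_ret : unmap_ret;
}
```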
Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Rohith S R <rohith.s.r@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-3-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Currently we support device and iommu dirty tracking; device dirty tracking
is preferred.
Add the framework code in iommufd_cdev_unmap() to choose either device or
iommu dirty tracking, just like vfio_legacy_dma_unmap_one().
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Rohith S R <rohith.s.r@intel.com>
Link: https://lore.kernel.org/qemu-devel/20251218062643.624796-2-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Add documentation about using IOMMUFD backed VFIO device with intel_iommu with
x-flts=on.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260106061304.314546-20-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Now that all the infrastructure for running passthrough devices with
first stage translation is in place, enable it.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260106061304.314546-19-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
When x-flts=on, we set up bindings to a nested HWPT in the host. After
migration, the VFIO device binds to the nesting parent HWPT by default.
We need to re-establish the bindings to the nested HWPT, or else device
DMA will break.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260106061304.314546-18-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
This replays guest pasid bindings after context cache invalidation.
Strictly speaking, the programmer should issue a pasid cache invalidation
with proper granularity after issuing a context cache invalidation.
We see old Linux such as 6.7.0-rc2 not following the spec: it sends the
pasid cache invalidation before the context cache invalidation, so QEMU
depends on the context cache invalidation to get the pasid entry and set up
the binding.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260106061304.314546-17-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
When either 'Set Root Table Pointer' or 'Translation Enable' bit is changed,
all pasid bindings on host side become stale and need to be updated.
Introduce a helper function vtd_replay_pasid_bindings_all() to go through all
pasid entries in all passthrough devices to update host side bindings.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260106061304.314546-16-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
This traps the guest PASID-based iotlb invalidation request and propagates it
to the host.
Intel VT-d 3.0 supports nested translation in PASID granularity. Guest SVA
support could be implemented by configuring nested translation on specific
pasid. This is also known as dual stage DMA translation.
Under such configuration, guest owns the GVA->GPA translation which is
configured as first stage page table on host side for a specific pasid, and
host owns GPA->HPA translation. As guest owns first stage translation table,
piotlb invalidation should be propagated to host since host IOMMU will cache
first level page table related mappings during DMA address translation.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260106061304.314546-15-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
This captures the guest PASID table entry modifications and propagates
the changes to host to attach a hwpt with type determined per guest IOMMU
PGTT configuration.
When PGTT=PT, attach PASID_0 to a second stage HWPT(GPA->HPA).
When PGTT=FST, attach PASID_0 to nested HWPT with nesting parent HWPT
coming from VFIO.
Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260106061304.314546-14-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Add some macros and inline functions that will be used by the following patch.
This patch also cleans up the macros below to use extract64(),
just like smmu does, because they are either used in following patches
or used indirectly by the newly introduced inline functions.
VTD_INV_DESC_PIOTLB_IH
VTD_SM_PASID_ENTRY_PGTT
VTD_SM_PASID_ENTRY_DID
VTD_SM_PASID_ENTRY_FSPM
VTD_SM_PASID_ENTRY_FSPTPTR
But we don't aim to change the huge number of bit mask style macro
definitions in this patch; that should be done in a separate patch.
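To show the style of the conversion only: sketch_extract64() below is a local stand-in modelled on QEMU's extract64(value, start, length), and the field position used by the macro is hypothetical, not taken from the VT-d headers.

```c
#include <stdint.h>

/*
 * Extract 'length' bits of 'value' starting at bit 'start'.
 * Requires 0 < length <= 64 - start.
 */
static uint64_t sketch_extract64(uint64_t value, int start, int length)
{
    return (value >> start) & (~0ULL >> (64 - length));
}

/* Bit mask style: the shift and mask are spelled out by hand. */
#define SKETCH_FIELD_MASK_STYLE(val)     (((val) >> 6) & 0x7ULL)

/* extract64 style: start bit and width are explicit arguments. */
#define SKETCH_FIELD_EXTRACT_STYLE(val)  sketch_extract64((val), 6, 3)
```

Both macros yield the same value; the second form states the start bit and width directly, which is easier to check against a specification.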
Suggested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260106061304.314546-13-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
When the guest enables scalable mode and sets up a first stage page table, we
don't want to use the IOMMU MR but rather continue using the system MR for an
IOMMUFD backed host device.
Then the default HWPT in VFIO contains GPA->HPA mappings, which can be reused
as the nesting parent HWPT to construct a nested HWPT in the vIOMMU.
Move vtd_as_key into intel_iommu_internal.h as it's also used by accel code.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260106061304.314546-12-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Currently we don't support nested translation for a passthrough device with
an emulated device under the same PCI bridge, because they require different
address spaces when x-flts=on.
In theory, we could support it if the devices under the same PCI bridge are
all passthrough devices. But an emulated device can be hotplugged under the
same bridge. To simplify, just forbid passthrough devices under a PCI bridge,
no matter whether there are, or will be, emulated devices under the same
bridge. This is acceptable because PCIe bridges are more popular than PCI
bridges now.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260106061304.314546-11-zhenzhong.duan@intel.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>