The test case in the ppe42 functional test triggers a TCG debug
assertion, which causes the test to fail in an --enable-debug
build or when the sanitizers are enabled:
#6 0x00007ffff4a3b517 in __assert_fail
(assertion=0x5555562e7589 "!temp_readonly(ots)", file=0x5555562e5b23 "../../tcg/tcg.c", line=4928, function=0x5555562e8900 <__PRETTY_FUNCTION__.23> "tcg_reg_alloc_mov") at ./assert/assert.c:105
#7 0x0000555555cc2189 in tcg_reg_alloc_mov (s=0x7fff60000b70, op=0x7fff600126f8) at ../../tcg/tcg.c:4928
#8 0x0000555555cc74e0 in tcg_gen_code (s=0x7fff60000b70, tb=0x7fffa802f540, pc_start=4294446080) at ../../tcg/tcg.c:6667
#9 0x0000555555d02abe in setjmp_gen_code
(env=0x555556cbe610, tb=0x7fffa802f540, pc=4294446080, host_pc=0x7fffeea00c00, max_insns=0x7fffee9f9d74, ti=0x7fffee9f9d90)
at ../../accel/tcg/translate-all.c:257
#10 0x0000555555d02d75 in tb_gen_code (cpu=0x555556cba590, s=...) at ../../accel/tcg/translate-all.c:325
#11 0x0000555555cf5922 in cpu_exec_loop (cpu=0x555556cba590, sc=0x7fffee9f9ee0) at ../../accel/tcg/cpu-exec.c:970
#12 0x0000555555cf5aae in cpu_exec_setjmp (cpu=0x555556cba590, sc=0x7fffee9f9ee0) at ../../accel/tcg/cpu-exec.c:1016
#13 0x0000555555cf5b4b in cpu_exec (cpu=0x555556cba590) at ../../accel/tcg/cpu-exec.c:1042
#14 0x0000555555d1e7ab in tcg_cpu_exec (cpu=0x555556cba590) at ../../accel/tcg/tcg-accel-ops.c:82
#15 0x0000555555d1ff97 in rr_cpu_thread_fn (arg=0x555556cba590) at ../../accel/tcg/tcg-accel-ops-rr.c:285
#16 0x00005555561586c9 in qemu_thread_start (args=0x555556ee3c90) at ../../util/qemu-thread-posix.c:393
#17 0x00007ffff4a9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#18 0x00007ffff4b29c6c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
This can be reproduced "by hand":
./build/clang/qemu-system-ppc -display none -vga none \
-machine ppe42_machine -serial stdio \
-device loader,file=$HOME/.cache/qemu/download/03c1ac0fb7f6c025102a02776a93b35101dae7c14b75e4eab36a337e39042ea8 \
-device loader,addr=0xfff80040,cpu-num=0
(assuming you have the image file from the functional test
in your local cache).
This happens for this input:
IN:
0xfff80c00: 07436004 .byte 0x07, 0x43, 0x60, 0x04
which generates (among other things):
not_i32 $0x80000,$0x80000
which the TCG optimization pass turns into:
mov_i32 $0x80000,$0xfff7ffff dead: 1 pref=0xffff
and where we then assert because we tried to write to a constant.
This happens for the CLRBWIBC instruction which ends up in
do_mask_branch() with rb_is_gpr false and invert true. In this case
we will generate code that sets mask to a tcg_constant_tl() but then
uses it as the LHS in tcg_gen_not_tl().
Fix the assertion by doing the invert in the translate time C code
for the "mask is constant" case.
Cc: qemu-stable@nongnu.org
Fixes: f7ec91c239 ("target/ppc: Add IBM PPE42 special instructions")
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Glenn Miles <milesg@linux.ibm.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/qemu-devel/20260212150753.1749448-1-peter.maydell@linaro.org
Signed-off-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
Now that we expose AccessFrequencyRegs, expose HV_X64_MSR_APIC_FREQUENCY as well for the case when the Hyper-V LAPIC is not used.
If the Hyper-V LAPIC is used, this will be handled by the hypervisor instead of the VMM, hence gating it on !whpx_irqchip_in_kernel().
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Link: https://lore.kernel.org/r/20260228214704.19048-8-mohamed@unpredictable.fr
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
At the point in time in which we setup the partition, the vCPUs
aren't available yet.
So enable them by default for now like what the MSHV backend does.
AccessFrequencyRegs is shared for both the LAPIC frequency reporting and the TSC frequency.
To still benefit from the fixed TSC frequency reporting when kernel-irqchip=off, still enable AccessFrequencyRegs anyway.
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Link: https://lore.kernel.org/r/20260228214704.19048-4-mohamed@unpredictable.fr
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose ITS_NO by default, as users using Clearwater Forest and higher
CPU models would not be able to live migrate to lower CPU hosts due to
missing features. In that case, they would not be vulnerable to ITS.
its-no was originally added on [1], but needs to be exposed on the
individual CPU models for the guests to see by default.
Note: Version 1 already exposes ARCH_CAP_BHI_NO, which would already
mark the CPU as invulnerable to ITS (at least in Linux); however,
expose ITS_NO for completeness.
[1] 74978391b2 ("target/i386: Make ITS_NO available to guests")
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Link: https://lore.kernel.org/r/20251106174626.49930-6-jon@nutanix.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose ITS_NO by default, as users using Sierra Forest and higher
CPU models would not be able to live migrate to lower CPU hosts due to
missing features. In that case, they would not be vulnerable to ITS.
its-no was originally added on [1], but needs to be exposed on the
individual CPU models for the guests to see by default.
Note: For SRF, version 2 already exposed BHI_CTRL, which would already
mark the CPU as invulnerable to ITS (at least in Linux); however,
expose ITS_NO for completeness.
[1] 74978391b2 ("target/i386: Make ITS_NO available to guests")
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Link: https://lore.kernel.org/r/20251106174626.49930-5-jon@nutanix.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose ITS_NO by default, as users using Granite Rapids and higher
CPU models would not be able to live migrate to lower CPU hosts due to
missing features. In that case, they would not be vulnerable to ITS.
its-no was originally added on [1], but needs to be exposed on the
individual CPU models for the guests to see by default.
[1] 74978391b2 ("target/i386: Make ITS_NO available to guests")
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Link: https://lore.kernel.org/r/20251106174626.49930-4-jon@nutanix.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose ITS_NO by default, as users using Sapphire Rapids and higher
CPU models would not be able to live migrate to lower CPU hosts due to
missing features. In that case, they would not be vulnerable to ITS.
its-no was originally added on [1], but needs to be exposed on the
individual CPU models for the guests to see by default.
[1] 74978391b2 ("target/i386: Make ITS_NO available to guests")
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Link: https://lore.kernel.org/r/20251106174626.49930-3-jon@nutanix.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add bit definition for Indirect Target Selection (ITS_NO) bit 62, to
allow ITS_NO to be added directly to a CPU model in the future.
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Link: https://lore.kernel.org/r/20251106174626.49930-2-jon@nutanix.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
alpha_cpu_realizefn() did not properly call cpu_reset(), which
corrupted icount. Add the missing function call to fix icount.
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Tested-by: Thomas Huth <thuth@redhat.com>
Link: https://lore.kernel.org/r/20260217-alpha-v1-1-0dcc708c9db3@rsg.ci.i.u-tokyo.ac.jp
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Otherwise, interrupts processed through the cancel vCPU and inject path will not cause the vCPU to go out of its halt state.
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Link: https://lore.kernel.org/r/20260226181930.53170-3-mohamed@unpredictable.fr
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WHvRunVpExitReasonX64Halt _is_ triggered on halt with kernel-irqchip=off as of Windows 11 version 25H2.
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Link: https://lore.kernel.org/r/20260226181930.53170-2-mohamed@unpredictable.fr
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new accelerator option that allows the guest to adjust the PAT.
This is already the case for TDX guests and allows using virtio-gpu
Venus with RADV or NVIDIA drivers.
The quirk is disabled by default. Since this caused problems with
Linux's Bochs video device driver, add a knob to leave it enabled,
and for now do ont enable it by default.
Signed-off-by: Myrsky Lintu <qemu.haziness801@passinbox.com>
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2943
Link: https://lore.kernel.org/r/175527721636.15451.4393515241478547957-1@git.sr.ht
[Add property; for now leave it off by default. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On confidential guests KVM virtual machine file descriptor changes as a
part of the guest reset process. Xen capabilities needs to be re-initialized in
KVM against the new file descriptor.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-29-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We need to make sure that synic CPU feature is not already enabled. If it is,
trying to enable it again will result in the following assertion:
Unexpected error in object_property_try_add() at ../qom/object.c:1268:
qemu-system-x86_64: attempt to add duplicate property 'synic' to object (type 'host-x86_64-cpu')
So enable synic only if its not enabled already.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-27-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the KVM VM file descriptor changes as a part of the confidential guest
reset mechanism, it necessary to create a new confidential guest context and
re-encrypt the VM memory. This happens for SEV-ES and SEV-SNP virtual machines
as a part of SEV_LAUNCH_FINISH, SEV_SNP_LAUNCH_FINISH operations.
A new resettable interface for SEV module has been added. A new reset callback
for the reset 'exit' state has been implemented to perform the above operations
when the VM file descriptor has changed during VM reset.
Tracepoints has been added also for tracing purpose.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-23-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If there is existing launch update data and kernel hashes data, they need to be
freed when initialization code is executed. This is important for resettable
confidential guests where the initialization happens once every reset.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-22-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The various notifiers that are used needs to be installed only once not on
every initialization. This includes the vm state change notifier and others.
This change uses 'cgs->ready' flag to install the notifiers only one time,
the first time.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-21-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
sev_launch_finish() and sev_snp_launch_finish() could be called multiple times
when the confidential guest is being reset/rebooted. The migration
blockers should not be added multiple times, once per invocation. This change
makes sure that the migration blockers are added only one time by adding the
migration blockers to the vm state change handler when the vm transitions to
the running state. Subsequent reboots do not change the state of the vm.
Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-20-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
During reset, when the VM file descriptor is changed, the TDX state needs to be
re-initialized. A notifier callback is implemented to reset the old
state and free memory before the new state is initialized post VM file
descriptor change.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-19-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the confidential virtual machine KVM file descriptor changes due to the
guest reset, some TDX specific setup steps needs to be done again. This
includes finalizing the initial guest launch state again. This change
re-executes some parts of the TDX setup during the device reset phaze using a
resettable interface. This finalizes the guest launch state again and locks
it in. Machine done notifier which was previously used is no longer needed as
the same code is now executed as a part of VM reset.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-18-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A new helper function is introduced that refactors all firmware memory
initialization code into a separate function. No functional change.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-17-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When IGVM is not being used by the confidential guest, the guest firmware has
to be reloaded explicitly again into memory. This is because, the memory into
which the firmware was loaded before reset was encrypted and is thus lost
upon reset. When IGVM is used, it is expected that the IGVM will contain the
guest firmware and the execution of the IGVM directives will set up the guest
firmware memory.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-15-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the kvm file descriptor changes as a part of confidential guest reset,
some architecture specific setups including SEV/SEV-SNP/TDX specific setups
needs to be redone. These changes are implemented as a part of the
kvm_arch_on_vmfd_change() callback which was introduced previously.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-11-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will re-register smram listeners after the VM file descriptors has changed.
We need to unregister them first to make sure addresses and reference counters
work properly.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-10-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This change adds common kvm specific support to handle KVM VM file descriptor
change. KVM VM file descriptor can change as a part of confidential guest reset
mechanism. A new function api kvm_arch_on_vmfd_change() per
architecture platform is added in order to implement architecture specific
changes required to support it. A subsequent patch will add x86 specific
implementation for kvm_arch_on_vmfd_change() as currently only x86 supports
confidential guest reset.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-6-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As a part of the confidential guest reset process, the existing encrypted guest
state must be made mutable since it would be discarded after reset. A new
encrypted and locked guest state must be established after the reset. To this
end, a new boolean member per confidential guest support class
(eg, tdx or sev-snp) is added that will indicate whether its possible to
rebuild guest state:
bool can_rebuild_guest_state;
This is true if rebuilding guest state is possible, false otherwise.
A KVM based confidential guest reset is only possible when
the existing state is locked but its possible to rebuild guest state.
Otherwise, the guest is not resettable.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-3-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_filter_msr() does not check if an msr entry is already present in the
msr_handlers table and installs a new handler unconditionally. If the function
is called again with the same MSR, it will result in duplicate entries in the
table and multiple such calls will fill up the table needlessly. Fix that.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-2-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The nitro accel does not actually make use of CPU emulation or details:
It always uses the host CPU regardless of configuration. Machines for
the nitro accel select the host CPU type as default to have a clear
statement of the above and to have a unified cpu type across all
supported architectures.
The arm64 logic on Linux currently only allows -cpu host for KVM based
virtual machines. Add a special case for nitro so that when the nitro
accel is active, it allows use of the host cpu type.
Signed-off-by: Alexander Graf <graf@amazon.com>
Link: https://lore.kernel.org/r/20260225220807.33092-8-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kvm_get_vcpu_events propogates the state of the pending smi
from the kernel to the cpu->interrupt_request, with the intention
of having un up to date migration state.
Later the opposite is done, the kvm_put_vcpu_events restores the state
of the pending #SMI from the 'cs->interrupt_request'
The only problem is that kvm_get_vcpu_events also resets the SMI
in cpu->interrupt_request when there is no pending #SMI indicated by the kernel,
and that is wrong as the SMI might be still raised by qemu.
While at it, also fix a similar but more theoretical bug with regard to a
latched #INIT while in SMM.
A simple reproducer for this bug is to read an EFI variable in a loop
from within a guest, while at the same time run 'info registers' on
the qemu HMP monitor.
The reads will, once in a while, fail with an 'Invalid argument' error.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20260223221908.361456-1-mlevitsk@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use that to not bump RIP for those cases.
Warn on read/write from/to unmapped MMIO, but not consider that as an exception.
For reads, return 0xFF(s) as the register value in that case.
Leaves a coverage gap for read_val_ext(), to be handled in a later commit.
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Link: https://lore.kernel.org/r/20260223233950.96076-25-mohamed@unpredictable.fr
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For symmetry, save/restore the same set of registers even when not needed.
CR2 save/restore needed as page faults injected to the guest imply modifying CR2.
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Link: https://lore.kernel.org/r/20260223233950.96076-21-mohamed@unpredictable.fr
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
target/i386/emulate doesn't currently properly emulate instructions
which might cause a page fault during their execution. Notably, REP STOS/MOVS
from MMIO to an address which is unmapped until a page fault exception is raised
causes an abort() in vmx_write_mem.
Change the interface between the HW accel backend and target/i386/emulate as a first step towards addressing that.
Adapt the page table walker code to give actionable errors,
while leaving a possibility for backends to provide their own walker.
This removes the usage of the Hyper-V page walker in the mshv backend.
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Link: https://lore.kernel.org/r/20260223233950.96076-20-mohamed@unpredictable.fr
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Optimise vmexits by save/restoring less state in those cases instead of the full state.
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Reviewed-by: Bernhard Beschow <shentey@gmail.com>
Link: https://lore.kernel.org/r/20260223233950.96076-17-mohamed@unpredictable.fr
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>