When the KVM VM file descriptor changes as a part of the confidential guest
reset mechanism, it necessary to create a new confidential guest context and
re-encrypt the VM memory. This happens for SEV-ES and SEV-SNP virtual machines
as a part of SEV_LAUNCH_FINISH, SEV_SNP_LAUNCH_FINISH operations.
A new resettable interface for SEV module has been added. A new reset callback
for the reset 'exit' state has been implemented to perform the above operations
when the VM file descriptor has changed during VM reset.
Tracepoints has been added also for tracing purpose.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-23-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If there is existing launch update data and kernel hashes data, they need to be
freed when initialization code is executed. This is important for resettable
confidential guests where the initialization happens once every reset.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-22-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The various notifiers that are used needs to be installed only once not on
every initialization. This includes the vm state change notifier and others.
This change uses 'cgs->ready' flag to install the notifiers only one time,
the first time.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-21-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
sev_launch_finish() and sev_snp_launch_finish() could be called multiple times
when the confidential guest is being reset/rebooted. The migration
blockers should not be added multiple times, once per invocation. This change
makes sure that the migration blockers are added only one time by adding the
migration blockers to the vm state change handler when the vm transitions to
the running state. Subsequent reboots do not change the state of the vm.
Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-20-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
During reset, when the VM file descriptor is changed, the TDX state needs to be
re-initialized. A notifier callback is implemented to reset the old
state and free memory before the new state is initialized post VM file
descriptor change.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-19-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the confidential virtual machine KVM file descriptor changes due to the
guest reset, some TDX specific setup steps needs to be done again. This
includes finalizing the initial guest launch state again. This change
re-executes some parts of the TDX setup during the device reset phaze using a
resettable interface. This finalizes the guest launch state again and locks
it in. Machine done notifier which was previously used is no longer needed as
the same code is now executed as a part of VM reset.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-18-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A new helper function is introduced that refactors all firmware memory
initialization code into a separate function. No functional change.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-17-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Confidential guests needs to generate a new KVM file descriptor upon virtual
machine reset. Existing VCPUs needs to be reattached to this new
KVM VM file descriptor. As a part of this, new VCPU file descriptors against
this new KVM VM file descriptor needs to be created and re-initialized.
Resources allocated against the old VCPU fds needs to be released. This change
makes this happen.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-16-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When IGVM is not being used by the confidential guest, the guest firmware has
to be reloaded explicitly again into memory. This is because, the memory into
which the firmware was loaded before reset was encrypted and is thus lost
upon reset. When IGVM is used, it is expected that the IGVM will contain the
guest firmware and the execution of the IGVM directives will set up the guest
firmware memory.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-15-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Confidential guest smust reload their bios rom upon reset. This is because
bios memory is encrypted and upon reset, the contents of the old bios memory
is lost and cannot be re-used. To this end, export a new x86 function
x86_bios_rom_reload() to reload the bios again. This function will be used in
the subsequent patches.
Reviewed-by: Bernhard Beschow <shentey@gmail.com>
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-14-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For confidential guests, bios image must be reinitialized upon reset. This
is because bios memory is encrypted and hence once the old confidential
kvm context is destroyed, it cannot be decrypted. It needs to be reinitilized.
Towards that, this change refactors x86_bios_rom_init() code so that
parts of it can be called during confidential guest reset.
No functional chnage.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-13-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the kvm file descriptor changes as a part of confidential guest reset,
some architecture specific setups including SEV/SEV-SNP/TDX specific setups
needs to be redone. These changes are implemented as a part of the
kvm_arch_on_vmfd_change() callback which was introduced previously.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-11-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will re-register smram listeners after the VM file descriptors has changed.
We need to unregister them first to make sure addresses and reference counters
work properly.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-10-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Various subsystems might need to take some steps before the KVM file descriptor
for a virtual machine is changed. So a new boolean attribute is added to the
vmfd_notifier structure which is passed to the notifier callbacks.
vmfd_notifer.pre is true for pre-notification of vmfd change and false for
post notification. Notifier callback implementations can simply check
the boolean value for (vmfd_notifer*)->pre and can take actions for pre or
post vmfd change based on the value.
Subsequent patches will add callback implementations for specific components
that need this pre-notification.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-9-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A notifier callback can be used by various subsystems to perform actions when
KVM file descriptor for a virtual machine changes as a part of confidential
guest reset process. This change adds this notifier mechanism. Subsequent
patches will add specific implementations for various notifier callbacks
corresponding to various subsystems that need to take action when KVM VM file
descriptor changed.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-8-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the KVM VM file descriptor has changed and a new one created, the guest
state is no longer in protected state. Mark it as such.
The guest state becomes protected again when TDX and SEV-ES and SEV-SNP mark
it as such.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-7-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This change adds common kvm specific support to handle KVM VM file descriptor
change. KVM VM file descriptor can change as a part of confidential guest reset
mechanism. A new function api kvm_arch_on_vmfd_change() per
architecture platform is added in order to implement architecture specific
changes required to support it. A subsequent patch will add x86 specific
implementation for kvm_arch_on_vmfd_change() as currently only x86 supports
confidential guest reset.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-6-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After the guest KVM file descriptor has changed as a part of the process of
confidential guest reset mechanism, existing memory needs to be reattached to
the new file descriptor. This change adds a helper function ram_block_rebind()
for this purpose. The next patch will make use of this function.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-5-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a confidential virtual machine is reset, a new guest context in the
accelerator must be generated post reset. Therefore, the old accelerator guest
file handle must be closed and a new one created. To this end, a per-accelerator
callback, "rebuild_guest" is introduced that would get called when a confidential
guest is reset. Subsequent patches will introduce specific implementation of
this callback for KVM accelerator.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-4-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As a part of the confidential guest reset process, the existing encrypted guest
state must be made mutable since it would be discarded after reset. A new
encrypted and locked guest state must be established after the reset. To this
end, a new boolean member per confidential guest support class
(eg, tdx or sev-snp) is added that will indicate whether its possible to
rebuild guest state:
bool can_rebuild_guest_state;
This is true if rebuilding guest state is possible, false otherwise.
A KVM based confidential guest reset is only possible when
the existing state is locked but its possible to rebuild guest state.
Otherwise, the guest is not resettable.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-3-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_filter_msr() does not check if an msr entry is already present in the
msr_handlers table and installs a new handler unconditionally. If the function
is called again with the same MSR, it will result in duplicate entries in the
table and multiple such calls will fill up the table needlessly. Fix that.
Signed-off-by: Ani Sinha <anisinha@redhat.com>
Link: https://lore.kernel.org/r/20260225035000.385950-2-anisinha@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that all pieces are in place to spawn Nitro Enclaves using
a special purpose accelerator and machine model, document how
to use it.
Signed-off-by: Alexander Graf <graf@amazon.com>
Link: https://lore.kernel.org/r/20260225220807.33092-12-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nitro Enclaves can only boot EIF files which are a combination of
kernel, initramfs and cmdline in a single file. When the kernel image is
not an EIF, treat it like a kernel image and assemble an EIF image on
the fly. This way, users can call QEMU with a direct
kernel/initrd/cmdline combination and everything "just works".
Signed-off-by: Alexander Graf <graf@amazon.com>
Reviewed-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
Link: https://lore.kernel.org/r/20260225220807.33092-11-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In follow-up patches we need some EIF file definitions that are
currently in the eif.c file, but want to access them from a separate
device. Move them into the header instead.
Signed-off-by: Alexander Graf <graf@amazon.com>
Reviewed-by: Dorjoy Chowdhury <dorjoychy111@gmail.com>
Link: https://lore.kernel.org/r/20260225220807.33092-10-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a machine model to spawn a Nitro Enclave. Unlike the existing -M
nitro-enclave, this machine model works exclusively with the -accel
nitro accelerator to drive real Nitro Enclave creation. It supports
memory allocation, number of CPU selection, both x86_64 as well as
aarch64, implements the Enclave heartbeat logic and debug serial
console.
To use it, create an EIF file and run
$ qemu-system-x86_64 -accel nitro,debug-mode=on -M nitro -nographic \
-kernel test.eif
or
$ qemu-system-aarch64 -accel nitro,debug-mode=on -M nitro -nographic \
-kernel test.eif
Signed-off-by: Alexander Graf <graf@amazon.com>
Link: https://lore.kernel.org/r/20260225220807.33092-9-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The VBSA test is a subset of the wider Arm architecture compliance
suites (ACS) which validate machines meet particular minimum set of
requirements. The VBSA is for virtual machines so it makes sense we
should check the -M virt machine is compliant.
Fortunately there are prebuilt binaries published via github so all we
need to do is build an EFI partition and place things in the right
place.
There are some additional Linux based tests which are left for later.
Reviewed-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-ID: <20260226185303.1920021-8-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
For now only use the minimal decadency set until all the OpenBSD
mappings can be divined.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-ID: <20260226185303.1920021-7-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
For reasons still not clear to me passing the single dashed
-interactive would confuse the argument parsing enough we tried to
pass "nterative" as a string to the launch command causing failure and
head scratching.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-ID: <20260226185303.1920021-6-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
The bugs have evidently been fixed in the latest release so we can
migrate the laggards into how all-test-cross container and remove the
legacy hacks. They are also packaged for the main architectures so we
don't need to jump through the amd64 hoops.
Suggested-by: John Snow <jsnow@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20260226185303.1920021-3-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Debian 11 was EOL in 2024, and Debian 12 will be EOL this June. This
patch moves all but one of our tests, debian-legacy-test-cross, onto
Debian 13.
This patch does the bare minimum to upgrade these tests and doesn't make
any attempt at optimization or cleanup that may or may not be possible
with this upgrade.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
[AJB: tweak summary line]
Message-ID: <20260226185303.1920021-2-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
The nitro accel does not actually make use of CPU emulation or details:
It always uses the host CPU regardless of configuration. Machines for
the nitro accel select the host CPU type as default to have a clear
statement of the above and to have a unified cpu type across all
supported architectures.
The arm64 logic on Linux currently only allows -cpu host for KVM based
virtual machines. Add a special case for nitro so that when the nitro
accel is active, it allows use of the host cpu type.
Signed-off-by: Alexander Graf <graf@amazon.com>
Link: https://lore.kernel.org/r/20260225220807.33092-8-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nitro Enclaves expect the parent instance to host a vsock heartbeat listener
at port 9000. To host a Nitro Enclave with the nitro accel in QEMU, add
such a heartbeat listener as device model, so that the machine can
easily instantiate it.
Signed-off-by: Alexander Graf <graf@amazon.com>
Link: https://lore.kernel.org/r/20260225220807.33092-7-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nitro Enclaves support a special "debug" mode. When in debug mode, the
Nitro Hypervisor provides a vsock port that the parent can connect to to
receive serial console output of the Enclave. Add a new nitro-serial-vsock
driver that implements short-circuit logic to establish the vsock
connection to that port and feed its data into a chardev, so that a machine
model can use it as serial device.
Signed-off-by: Alexander Graf <graf@amazon.com>
Link: https://lore.kernel.org/r/20260225220807.33092-6-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nitro Enclaves are a confidential compute technology which
allows a parent instance to carve out resources from itself
and spawn a confidential sibling VM next to itself. Similar
to other confidential compute solutions, this sibling is
controlled by an underlying vmm, but still has a higher level
vmm (QEMU) to implement some of its I/O functionality and
lifecycle.
Add an accelerator to drive this interface. In combination with
follow-on patches to enhance the Nitro Enclaves machine model, this
will allow users to run a Nitro Enclave using QEMU.
Signed-off-by: Alexander Graf <graf@amazon.com>
Link: https://lore.kernel.org/r/20260225220807.33092-5-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a dedicated bus for Nitro Enclave vsock devices. In Nitro Enclaves,
communication between parent and enclave/hypervisor happens almost
exclusively through vsock. The nitro-vsock-bus models this dependency
in QEMU, which allows devices in this bus to implement individual services
on top of vsock.
The nitro machine spawns this bus by creating the included
nitro-vsock-bridge sysbus device.
The nitro accel then advertises the Enclave's CID to the bus by calling
nitro_vsock_bridge_start_enclave() on the bridge device as soon as it
knows the CID.
Nitro vsock devices can listen to that event and learn the Enclave's CID
when it is available to perform actions, such as connect to the debug
serial vsock port.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alexander Graf <graf@amazon.com>
Link: https://lore.kernel.org/r/20260225220807.33092-4-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
QEMU is learning to drive the /dev/nitro_enclaves device node. Include
its UAPI header into our local copy of kernel headers so it has all
defines we need to drive it.
Signed-off-by: Alexander Graf <graf@amazon.com>
Link: https://lore.kernel.org/r/20260225220807.33092-3-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We want to enable QEMU to drive the /dev/nitro_enclaves device node. Add
its UAPI header into our kernel sync so we have all defines we need to
drive it.
Signed-off-by: Alexander Graf <graf@amazon.com>
Link: https://lore.kernel.org/r/20260225220807.33092-2-graf@amazon.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kvm_get_vcpu_events propogates the state of the pending smi
from the kernel to the cpu->interrupt_request, with the intention
of having un up to date migration state.
Later the opposite is done, the kvm_put_vcpu_events restores the state
of the pending #SMI from the 'cs->interrupt_request'
The only problem is that kvm_get_vcpu_events also resets the SMI
in cpu->interrupt_request when there is no pending #SMI indicated by the kernel,
and that is wrong as the SMI might be still raised by qemu.
While at it, also fix a similar but more theoretical bug with regard to a
latched #INIT while in SMM.
A simple reproducer for this bug is to read an EFI variable in a loop
from within a guest, while at the same time run 'info registers' on
the qemu HMP monitor.
The reads will, once in a while, fail with an 'Invalid argument' error.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20260223221908.361456-1-mlevitsk@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use that to not bump RIP for those cases.
Warn on read/write from/to unmapped MMIO, but not consider that as an exception.
For reads, return 0xFF(s) as the register value in that case.
Leaves a coverage gap for read_val_ext(), to be handled in a later commit.
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Link: https://lore.kernel.org/r/20260223233950.96076-25-mohamed@unpredictable.fr
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For symmetry, save/restore the same set of registers even when not needed.
CR2 save/restore needed as page faults injected to the guest imply modifying CR2.
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
Link: https://lore.kernel.org/r/20260223233950.96076-21-mohamed@unpredictable.fr
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>