In our long-term experience in Bytedance, we've found that under
the same load, live migration of larger VMs with more devices is
often more difficult to converge (requiring a larger downtime limit).
Through some testing and calculations, we conclude that bitmap sync time
affects the calculation of live migration bandwidth.
When the addresses processed are not aligned, a large number of
clear_dirty ioctl occur (e.g. a 4MB misaligned memory can generate
2048 clear_dirty ioctls from two different memory_listener),
which increases the time required for bitmap_sync and makes it
more difficult for dirty pages to converge.
For a 64C256G vm with 8 vhost-user-net(32 queue per nic) and
16 vhost-user-blk(4 queue per blk), the sync time is as high as *73ms*
(tested with 10GBps dirty rate, the sync time increases as the dirty
page rate increases), Here are each part of the sync time:
- sync from kvm to ram_list: 2.5ms
- vhost_log_sync:3ms
- sync aligned memory from ram_list to RAMBlock: 5ms
- sync misaligned memory from ram_list to RAMBlock: 61ms
Attempt to merge those fragmented clear_dirty ioctls, then syncing
misaligned memory from ram_list to RAMBlock takes only about 1ms,
and the total sync time is only *12ms*.
Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20251218114220.83354-1-xuchuangxclwt@bytedance.com
[peterx: drop var "offset" in physical_memory_sync_dirty_bitmap]
Signed-off-by: Peter Xu <peterx@redhat.com>
Currently the code that reads the qtest protocol commands insists
that every input line has a command. If it receives a line with
nothing but whitespace it will trip an assertion in
qtest_process_command().
This is a little awkward for the case where we are feeding qtest a
set of bug-reproduction commands via standard input or a file,
because it means you need to be careful not to leave a blank line at
the start or the end when cutting and pasting the command sequence
from a bug report.
Change the code to allow and ignore blank lines in the input.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20251106151959.1088095-1-peter.maydell@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
In the qtest_event() QEMUChrEvent handler, we create a timer
and log OPENED on CHR_EVENT_OPENED, and we destroy the timer and
log CLOSED on CHR_EVENT_CLOSED. However, the chardev subsystem
can send us more than one CHR_EVENT_CLOSED if we're reading from
a file chardev:
* the first one happens when we read the last data from the file
* the second one happens when the user hits ^C to exit QEMU
and the chardev is finalized: char_fd_finalize()
This causes us to call g_timer_elapsed() with a NULL timer
(which glib complains about) and print an extra CLOSED log line
with a zero timestamp:
[I +0.063829] CLOSED
qemu-system-aarch64: GLib: g_timer_elapsed: assertion 'timer != NULL' failed
[I +0.000000] CLOSED
Avoid this by ignoring a CHR_EVENT_CLOSED if we have already
processed one.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20251107174306.1408139-1-peter.maydell@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
When the Bus Master bit is disabled in a PCI device's Command Register,
the device's DMA address space becomes unassigned memory (i.e. the
io_mem_unassigned MemoryRegion).
This can lead to deadlocks with IOThreads since io_mem_unassigned
accesses attempt to acquire the Big QEMU Lock (BQL). For example,
virtio-pci devices deadlock in virtio_write_config() ->
virtio_pci_stop_ioeventfd() when waiting for the IOThread while holding
the BQL. The IOThread is unable to acquire the BQL but the vcpu thread
won't release the BQL while waiting for the IOThread.
io_mem_unassigned is trivially thread-safe since it has no state, it
simply rejects all load/store accesses. Therefore it is safe to enable
lockless I/O on io_mem_unassigned to eliminate this deadlock.
Here is the backtrace described above:
Thread 9 (Thread 0x7fccfcdff6c0 (LWP 247832) "CPU 4/KVM"):
#0 0x00007fcd11529d46 in ppoll () from target:/lib64/libc.so.6
#1 0x000056468a1a9bad in ppoll (__fds=<optimized out>, __nfds=<optimized out>, __timeout=0x0, __ss=0x0) at /usr/include/bits/poll2.h:88
#2 0x000056468a18f9d9 in fdmon_poll_wait (ctx=0x5646c6a1dc30, ready_list=0x7fccfcdfb310, timeout=-1) at ../util/fdmon-poll.c:79
#3 0x000056468a18f14f in aio_poll (ctx=<optimized out>, blocking=blocking@entry=true) at ../util/aio-posix.c:730
#4 0x000056468a1ad842 in aio_wait_bh_oneshot (ctx=<optimized out>, cb=cb@entry=0x564689faa420 <virtio_blk_ioeventfd_stop_vq_bh>, opaque=<optimized out>) at ../util/aio-wait.c:85
#5 0x0000564689faaa89 in virtio_blk_stop_ioeventfd (vdev=0x5646c8fd7e90) at ../hw/block/virtio-blk.c:1644
#6 0x0000564689d77880 in virtio_bus_stop_ioeventfd (bus=bus@entry=0x5646c8fd7e08) at ../hw/virtio/virtio-bus.c:264
#7 0x0000564689d780db in virtio_bus_stop_ioeventfd (bus=bus@entry=0x5646c8fd7e08) at ../hw/virtio/virtio-bus.c:256
#8 0x0000564689d7d98a in virtio_pci_stop_ioeventfd (proxy=0x5646c8fcf8e0) at ../hw/virtio/virtio-pci.c:413
#9 virtio_write_config (pci_dev=0x5646c8fcf8e0, address=4, val=<optimized out>, len=<optimized out>) at ../hw/virtio/virtio-pci.c:803
#10 0x0000564689dcb45a in memory_region_write_accessor (mr=mr@entry=0x5646c6dc2d30, addr=3145732, value=value@entry=0x7fccfcdfb528, size=size@entry=2, shift=<optimized out>, mask=mask@entry=65535, attrs=...) at ../system/memory.c:491
#11 0x0000564689dcaeb0 in access_with_adjusted_size (addr=addr@entry=3145732, value=value@entry=0x7fccfcdfb528, size=size@entry=2, access_size_min=<optimized out>, access_size_max=<optimized out>, access_fn=0x564689dcb3f0 <memory_region_write_accessor>, mr=0x5646c6dc2d30, attrs=...) at ../system/memory.c:567
#12 0x0000564689dcb156 in memory_region_dispatch_write (mr=mr@entry=0x5646c6dc2d30, addr=addr@entry=3145732, data=<optimized out>, op=<optimized out>, attrs=attrs@entry=...) at ../system/memory.c:1554
#13 0x0000564689dd389a in flatview_write_continue_step (attrs=..., attrs@entry=..., buf=buf@entry=0x7fcd05b87028 "", mr_addr=3145732, l=l@entry=0x7fccfcdfb5f0, mr=0x5646c6dc2d30, len=2) at ../system/physmem.c:3266
#14 0x0000564689dd3adb in flatview_write_continue (fv=0x7fcadc0d8930, addr=3761242116, attrs=..., ptr=0xe0300004, len=2, mr_addr=<optimized out>, l=<optimized out>, mr=<optimized out>) at ../system/physmem.c:3296
#15 flatview_write (fv=0x7fcadc0d8930, addr=addr@entry=3761242116, attrs=attrs@entry=..., buf=buf@entry=0x7fcd05b87028, len=len@entry=2) at ../system/physmem.c:3327
#16 0x0000564689dd7191 in address_space_write (as=0x56468b433600 <address_space_memory>, addr=3761242116, attrs=..., buf=0x7fcd05b87028, len=2) at ../system/physmem.c:3447
#17 address_space_rw (as=0x56468b433600 <address_space_memory>, addr=3761242116, attrs=attrs@entry=..., buf=buf@entry=0x7fcd05b87028, len=2, is_write=<optimized out>) at ../system/physmem.c:3457
#18 0x0000564689ff1ef6 in kvm_cpu_exec (cpu=cpu@entry=0x5646c6dab810) at ../accel/kvm/kvm-all.c:3248
#19 0x0000564689ff32f5 in kvm_vcpu_thread_fn (arg=arg@entry=0x5646c6dab810) at ../accel/kvm/kvm-accel-ops.c:53
#20 0x000056468a19225c in qemu_thread_start (args=0x5646c6db6190) at ../util/qemu-thread-posix.c:393
#21 0x00007fcd114c5b68 in start_thread () from target:/lib64/libc.so.6
#22 0x00007fcd115364e4 in clone () from target:/lib64/libc.so.6
Thread 3 (Thread 0x7fcd0503a6c0 (LWP 247825) "IO iothread1"):
#0 0x00007fcd114c2d30 in __lll_lock_wait () from target:/lib64/libc.so.6
#1 0x00007fcd114c8fe2 in pthread_mutex_lock@@GLIBC_2.2.5 () from target:/lib64/libc.so.6
#2 0x000056468a192538 in qemu_mutex_lock_impl (mutex=0x56468b432e60 <bql>, file=0x56468a1e26a5 "../system/physmem.c", line=3198) at ../util/qemu-thread-posix.c:94
#3 0x0000564689dc12e2 in bql_lock_impl (file=file@entry=0x56468a1e26a5 "../system/physmem.c", line=line@entry=3198) at ../system/cpus.c:566
#4 0x0000564689ddc151 in prepare_mmio_access (mr=0x56468b433800 <io_mem_unassigned>) at ../system/physmem.c:3198
#5 address_space_lduw_internal_cached_slow (cache=<optimized out>, addr=2, attrs=..., result=0x0, endian=DEVICE_LITTLE_ENDIAN) at ../system/memory_ldst.c.inc:211
#6 address_space_lduw_le_cached_slow (cache=<optimized out>, addr=addr@entry=2, attrs=attrs@entry=..., result=result@entry=0x0) at ../system/memory_ldst.c.inc:253
#7 0x0000564689fd692c in address_space_lduw_le_cached (result=0x0, cache=<optimized out>, addr=2, attrs=...) at /var/tmp/qemu/include/exec/memory_ldst_cached.h.inc:35
#8 lduw_le_phys_cached (cache=<optimized out>, addr=2) at /var/tmp/qemu/include/exec/memory_ldst_phys.h.inc:66
#9 virtio_lduw_phys_cached (vdev=<optimized out>, cache=<optimized out>, pa=2) at /var/tmp/qemu/include/hw/virtio/virtio-access.h:166
#10 vring_avail_idx (vq=0x5646c8fe2470) at ../hw/virtio/virtio.c:396
#11 virtio_queue_split_set_notification (vq=0x5646c8fe2470, enable=0) at ../hw/virtio/virtio.c:534
#12 virtio_queue_set_notification (vq=0x5646c8fe2470, enable=0) at ../hw/virtio/virtio.c:595
#13 0x000056468a18e7a8 in poll_set_started (ctx=ctx@entry=0x5646c6c74e30, ready_list=ready_list@entry=0x7fcd050366a0, started=started@entry=true) at ../util/aio-posix.c:247
#14 0x000056468a18f2bb in poll_set_started (ctx=0x5646c6c74e30, ready_list=0x7fcd050366a0, started=true) at ../util/aio-posix.c:226
#15 try_poll_mode (ctx=0x5646c6c74e30, ready_list=0x7fcd050366a0, timeout=<synthetic pointer>) at ../util/aio-posix.c:612
#16 aio_poll (ctx=0x5646c6c74e30, blocking=blocking@entry=true) at ../util/aio-posix.c:689
#17 0x000056468a032c26 in iothread_run (opaque=opaque@entry=0x5646c69f3380) at ../iothread.c:63
#18 0x000056468a19225c in qemu_thread_start (args=0x5646c6c75410) at ../util/qemu-thread-posix.c:393
#19 0x00007fcd114c5b68 in start_thread () from target:/lib64/libc.so.6
#20 0x00007fcd115364e4 in clone () from target:/lib64/libc.so.6
Buglink: https://issues.redhat.com/browse/RHEL-71933
Reported-by: Peixiu Hou <phou@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20251029185224.420261-1-stefanha@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
migrate_add_blocker_modes() and migration_add_notifier_modes use
variable arguments for a set of migration modes. The variable
arguments get collected into a bitset for processsing. Take a bitset
argument instead, it's simpler.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20251027064503.1074255-3-armbru@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
There's an existing helper function designed to obtain the block size.
Modify ram_block_attribute_create() to use this function for
consistency.
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Link: https://lore.kernel.org/r/20251023095526.48365-3-chenyi.qiang@intel.com
[peterx: fix double spaces, per david]
Signed-off-by: Peter Xu <peterx@redhat.com>
Currently, CoCo VMs can perform conversion at the base page granularity,
which is the granularity that has to be tracked. In relevant setups, the
target page size is assumed to be equal to the host page size, thus
fixing the block size to the host page size.
However, since private memory and shared memory have different backend
at present, users can specify shared memory with a hugetlbfs backend
while private memory with guest_memfd backend only supports 4K page
size. In this scenario, ram_block->page_size is different from the host
page size which will trigger an assertion when retrieving the block
size.
To address this, return the host page size directly to relax the
restriction. This changes fixes a regression of using hugetlbfs backend
for shared memory within CoCo VMs, with or without VFIO devices' presence.
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Link: https://lore.kernel.org/r/20251023095526.48365-2-chenyi.qiang@intel.com
[peterx: fix subject, per david]
Cc: qemu-stable <qemu-stable@nongnu.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
QEMU has a per-thread "bql_locked" variable stored in TLS section, showing
whether the current thread is holding the BQL lock.
It's a pretty handy variable. Function-wise, QEMU have codes trying to
conditionally take bql, relying on the var reflecting the locking status
(e.g. BQL_LOCK_GUARD), or in a GDB debugging session, we could also look at
the variable (in reality, co_tls_bql_locked), to see which thread is
currently holding the bql.
When using that as a debugging facility, sometimes we can observe multiple
threads holding bql at the same time. It's because QEMU's condvar APIs
bypassed the bql_*() API, hence they do not update bql_locked even if they
have released the mutex while waiting.
It can cause confusion if one does "thread apply all p co_tls_bql_locked"
and see multiple threads reporting true.
Fix this by moving the bql status updates into the mutex debug hooks. Now
the variable should always reflect the reality.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20250904223158.1276992-1-peterx@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
cpus_kick_thread() is called via cpu_exit() -> qemu_cpu_kick(),
and also via gdb_syscall_handling(). Access the CPUState field
using atomic accesses. See commit 8ac2ca02744 ("accel: use atomic
accesses for exit_request") for rationale.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org>
Message-Id: <20250925025520.71805-3-philmd@linaro.org>
Libguestfs wants to use qemu to run a captive appliance. When the
program linked to libguestfs exits, we want qemu to be cleaned up.
Libguestfs goes to great lengths to do this at the moment: it either
forks a separate process to ensure clean-up is done, or it asks
libvirt to clean up the qemu process. However this is complicated and
not totally reliable.
On Linux, FreeBSD and macOS, there are mechanisms to ensure a signal
or message is delivered to a process when its parent process goes
away. The qemu test suite even uses this mechanism on Linux (see
PR_SET_PDEATHSIG in tests/qtest/libqtest.c).
In nbdkit we have long had the concept of running nbdkit captively,
and we have the nbdkit --exit-with-parent flag to help
(https://libguestfs.org/nbdkit-captive.1.html#EXIT-WITH-PARENT)
This commit adds the same mechanism. The syntax is:
qemu -run-with exit-with-parent=on [...]
This is not a feature that most typical users of qemu (for running
general purpose, long-lived VMs) should use, so it defaults to off.
The exit-with-parent.[ch] files are copied from nbdkit, where they
have a 3-clause BSD license which is compatible with qemu:
https://gitlab.com/nbdkit/nbdkit/-/tree/master/common/utils?ref_type=heads
Thanks: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
All the functions are about "-audio model=" handling, a simpler
way to setup audio. Rename functions/variables to reflect this better.
audio_register_model_with_cb() dropped "pci" from the name, since it
will be generalized next.
deprecated_register_soundhw() was actually not a function to be
removed since it's used for "-audio model=" aliasing.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
This helper is used next by -audio code.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
The actual backend is "Chardev", CharBackend is the frontend side of
it (whatever talks to the backend), let's rename it for readability.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Link: https://lore.kernel.org/r/20251022074612.1258413-1-marcandre.lureau@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add Error **errp parameter to load_image_targphys(),
load_image_targphys_as(), and get_image_size() to enable better
error reporting when image loading fails.
Pass NULL for errp in all existing call sites to maintain current
behavior. No functional change intended in this patch.
Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Aditya Gupta <adityag@linux.ibm.com>
Tested-by: Aditya Gupta <adityag@linux.ibm.com>
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
Message-ID: <20251024130556.1942835-6-vishalc@linux.ibm.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
The `-soundhw` CLI was removed in commit 039a68373c ("introduce
-audio as a replacement for -soundhw"). Remove outdated comments
and update the document mentioning the old usage.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20251021131825.99390-2-philmd@linaro.org>
MachineClass::get_default_cpu_type() runs once the machine is
created, being able to evaluate runtime checks; it returns the
machine default CPU type.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-Id: <20251020221508.67413-7-philmd@linaro.org>
Binaries can register a QOM type to filter their machines
by filling their TargetInfo::machine_typename field.
Commit 28502121be ("system/vl: Filter machine list available
for a particular target binary") added the filter to
machine_help_func() but missed the other places where the machine
list must be filtered, such QMP 'query-machines' command used by
QTests, and select_machine(). Fix that.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-Id: <20251020220941.65269-2-philmd@linaro.org>
Very few files use the Physical Memory API. Declare its
methods in their own header: "system/physmem.h".
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Message-Id: <20251001175448.18933-19-philmd@linaro.org>
The functions related to the Physical Memory API declared
in "system/ram_addr.h" do not operate on vCPU. Remove the
'cpu_' prefix.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Message-Id: <20251001175448.18933-18-philmd@linaro.org>
cpu_physical_memory_clear_dirty_range() is now only called within
system/physmem.c, by qemu_ram_resize(). Reduce its scope by making
it internal to this file. Since it doesn't involve any CPU, remove
the 'cpu_' prefix. As it operates on a range, rename @start as @addr.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251001175448.18933-16-philmd@linaro.org>
Avoid maintaining large functions in header, rely on the
linker to optimize at linking time.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251001175448.18933-15-philmd@linaro.org>
Avoid maintaining large functions in header, rely on the
linker to optimize at linking time.
Remove the now unneeded "system/xen.h" header.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251001175448.18933-14-philmd@linaro.org>
Avoid maintaining large functions in header, rely on the
linker to optimize at linking time.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251001175448.18933-12-philmd@linaro.org>
Avoid maintaining large functions in header, rely on the
linker to optimize at linking time.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251001175448.18933-11-philmd@linaro.org>
Avoid maintaining large functions in header, rely on the
linker to optimize at linking time.
cpu_physical_memory_all_dirty() doesn't involve any CPU,
remove the 'cpu_' prefix.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251001175448.18933-10-philmd@linaro.org>
Avoid maintaining large functions in header, rely on the
linker to optimize at linking time.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251001175448.18933-9-philmd@linaro.org>
Avoid maintaining large functions in header, rely on the
linker to optimize at linking time.
cpu_physical_memory_get_dirty() doesn't involve any CPU,
remove the 'cpu_' prefix.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251001175448.18933-8-philmd@linaro.org>
The legacy cpu_physical_memory_rw() method is no more used,
remove it.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251002084203.63899-16-philmd@linaro.org>
Following the mechanical changes of commit adeefe0167 ("Avoid
cpu_physical_memory_rw() with a constant is_write argument"),
replace:
- cpu_physical_memory_rw(, is_write=false) -> address_space_read()
- cpu_physical_memory_rw(, is_write=true) -> address_space_write()
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251002084203.63899-15-philmd@linaro.org>
In order to remove cpu_physical_memory_rw() in a pair of commits,
and due to a cyclic dependency between "exec/cpu-common.h" and
"system/memory.h", un-inline cpu_physical_memory_read() and
cpu_physical_memory_write() as a prerequired step.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251002084203.63899-14-philmd@linaro.org>
Rename cpu_flush_icache_range() as address_space_flush_icache_range(),
passing an address space by argument. The single caller, rom_reset(),
already operates on an address space. Use it.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251002084203.63899-7-philmd@linaro.org>
There are no more uses of the legacy cpu_physical_memory_is_io()
method. Remove it.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251002084203.63899-6-philmd@linaro.org>
Factor address_space_is_io() out of cpu_physical_memory_is_io().
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251002084203.63899-3-philmd@linaro.org>
In 2dbaf58bbe we conditionally skipped the increment
of buf because ubsan warns incrementing NULL, and buf
is always NULL for FLUSH_CACHE. However, the existence
of the test for NULL caused Coverity to warn that the
memcpy in the WRITE_DATA case lacked a test for NULL.
Duplicate address_space_write_rom_internal into the two
callers, dropping enum write_rom_type, and simplify.
This eliminates buf in the flush case, and eliminates
the conditional increment of buf in the write case.
Coverity: CID 1621220
Fixes: 2dbaf58bbe ("system/physmem: Silence warning from ubsan")
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20250922192940.2908002-1-richard.henderson@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Rename @start as @offset, since it express an offset within a RAMBlock.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Acked-by: Peter Xu <peterx@redhat.com>
Message-Id: <20251002032812.26069-5-philmd@linaro.org>
Move ramblock_is_pmem() along with the RAM Block API
exposed by the "system/ramblock.h" header. Rename as
ram_block_is_pmem() to keep API prefix consistency.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Acked-by: Peter Xu <peterx@redhat.com>
Message-Id: <20251002032812.26069-3-philmd@linaro.org>
Invalidating the entire address space (i.e. range of [0, ~0ULL]) is a
valid and required operation by vIOMMU implementations. However, such
invalidations currently trigger an assertion unless they originate from
device IOTLB invalidations.
Although in recent Linux guests this case is not exercised by the VTD
implementation due to various optimizations, the assertion will be hit
by upcoming AMD vIOMMU changes to support DMA address translation. More
specifically, when running a Linux guest with VFIO passthrough device,
and a kernel that does not contain commmit 3f2571fed2fa ("iommu/amd:
Remove redundant domain flush from attach_device()").
Remove the assertion altogether and adjust the range to ensure it does
not cross notifier boundaries.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201116165506.31315-6-eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-2-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Remove duplicate entry PRELAUNCH->INMIGRATE from runstate_transitions_def.
Move PRELAUNCH->SUSPENDED entry with all the other PRELAUNCH->XXX entries.
Signed-off-by: Marco Cavenati <Marco.Cavenati@eurecom.fr>
Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Add the cpr-exec migration mode. Usage:
qemu-system-$arch -machine aux-ram-share=on ...
migrate_set_parameter mode cpr-exec
migrate_set_parameter cpr-exec-command \
<arg1> <arg2> ... -incoming <uri-1> \
migrate -d <uri-1>
The migrate command stops the VM, saves state to uri-1,
directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID, and
loads state from uri-1. Guest RAM is preserved in place,
albeit with new virtual addresses.
The new QEMU process is started by exec'ing the command
specified by the @cpr-exec-command parameter. The first word of
the command is the binary, and the remaining words are its
arguments. The command may be a direct invocation of new QEMU,
or may be a non-QEMU command that exec's the new QEMU binary.
This mode creates a second migration channel that is not visible
to the user. At the start of migration, old QEMU saves CPR state
to the second channel, and at the end of migration, it tells the
main loop to call cpr_exec. New QEMU loads CPR state early, before
objects are created.
Because old QEMU terminates when new QEMU starts, one cannot
stream data between the two, so uri-1 must be a type,
such as a file, that accepts all data before old QEMU exits.
Otherwise, old QEMU may quietly block writing to the channel.
Memory-backend objects must have the share=on attribute, but
memory-backend-epc is not supported. The VM must be started with
the '-machine aux-ram-share=on' option, which allows anonymous
memory to be transferred in place to the new process. The memfds
are kept open across exec by clearing the close-on-exec flag, their
values are saved in CPR state, and they are mmap'd in new QEMU.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Link: https://lore.kernel.org/r/1759332851-370353-7-git-send-email-steven.sistare@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>
When we unrealize a CPU object (which happens on vCPU hot-unplug), we
should destroy all the AddressSpace objects we created via calls to
cpu_address_space_init() when the CPU was realized.
Commit 24bec42f3d added a function to do this for a specific
AddressSpace, but did not add any places where the function was
called.
Since we always want to destroy all the AddressSpaces on unrealize,
regardless of the target architecture, we don't need to try to keep
track of how many are still undestroyed, or make the target
architecture code manually call a destroy function for each AS it
created. Instead we can adjust the function to always completely
destroy the whole cpu->ases array, and arrange for it to be called
during CPU unrealize as part of the common code.
Without this fix, AddressSanitizer will report a leak like this
from a run where we hot-plugged and then hot-unplugged an x86 KVM
vCPU:
Direct leak of 416 byte(s) in 1 object(s) allocated from:
#0 0x5b638565053d in calloc (/data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/qemu-system-x86_64+0x1ee153d) (BuildId: c1cd6022b195142106e1bffeca23498c2b752bca)
#1 0x7c28083f77b1 in g_malloc0 (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x637b1) (BuildId: 1eb6131419edb83b2178b682829a6913cf682d75)
#2 0x5b6386999c7c in cpu_address_space_init /data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/../../system/physmem.c:797:25
#3 0x5b638727f049 in kvm_cpu_realizefn /data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/../../target/i386/kvm/kvm-cpu.c:102:5
#4 0x5b6385745f40 in accel_cpu_common_realize /data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/../../accel/accel-common.c:101:13
#5 0x5b638568fe3c in cpu_exec_realizefn /data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/../../hw/core/cpu-common.c:232:10
#6 0x5b63874a2cd5 in x86_cpu_realizefn /data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/../../target/i386/cpu.c:9321:5
#7 0x5b6387a0469a in device_set_realized /data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/../../hw/core/qdev.c:494:13
#8 0x5b6387a27d9e in property_set_bool /data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/../../qom/object.c:2375:5
#9 0x5b6387a2090b in object_property_set /data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/../../qom/object.c:1450:5
#10 0x5b6387a35b05 in object_property_set_qobject /data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/../../qom/qom-qobject.c:28:10
#11 0x5b6387a21739 in object_property_set_bool /data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/../../qom/object.c:1520:15
#12 0x5b63879fe510 in qdev_realize /data_nvme1n1/linaro/qemu-from-laptop/qemu/build/x86-tgts-asan/../../hw/core/qdev.c:276:12
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2517
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250929144228.1994037-4-peter.maydell@linaro.org
Signed-off-by: Peter Xu <peterx@redhat.com>
If an AddressSpace has been created in its own allocated
memory, cleaning it up requires first destroying the AS
and then freeing the memory. Doing this doesn't work:
address_space_destroy(as);
g_free_rcu(as, rcu);
because both address_space_destroy() and g_free_rcu()
try to use the same 'rcu' node in the AddressSpace struct
and the address_space_destroy hook gets overwritten.
Provide a new address_space_destroy_free() function which
will destroy the AS and then free the memory it uses, all
in one RCU callback.
(CC to stable because the next commit needs this function.)
Cc: qemu-stable@nongnu.org
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250929144228.1994037-3-peter.maydell@linaro.org
Signed-off-by: Peter Xu <peterx@redhat.com>
Kirill Martynov reported assertation in cpu_asidx_from_attrs() being hit
when x86_cpu_dump_state() is called to dump the CPU state[*]. It happens
when the CPU is in SMM and KVM emulation failure due to misbehaving
guest.
The root cause is that QEMU i386 never enables the SMM address space for
cpu since KVM SMM support has been added.
Enable the SMM cpu address space under KVM when the SMM is enabled for
the x86machine.
[*] https://lore.kernel.org/qemu-devel/20250523154431.506993-1-stdcalllevi@yandex-team.ru/
Reported-by: Kirill Martynov <stdcalllevi@yandex-team.ru>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Kirill Martynov <stdcalllevi@yandex-team.ru>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250730095253.1833411-2-xiaoyao.li@intel.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make the code common to all accelerators: after seeing cpu->exit_request
set to true, accelerator code needs to reach qemu_process_cpu_events_common().
So for the common cases where they use qemu_process_cpu_events(), go ahead and
clear it in there. Note that the cheap qatomic_set() is enough because
at this point the thread has taken the BQL; qatomic_set_mb() is not needed.
In particular, this is the ordering of the communication between
I/O and vCPU threads is always the same.
In the I/O thread:
(a) store other memory locations that will be checked if cpu->exit_request
or cpu->interrupt_request is 1 (for example cpu->stop or cpu->work_list
for cpu->exit_request)
(b) cpu_exit(): store-release cpu->exit_request, or
(b) cpu_interrupt(): store-release cpu->interrupt_request
>>> at this point, cpu->halt_cond is broadcast and the BQL released
(c) do the accelerator-specific kick (e.g. write icount_decr for TCG,
pthread_kill for KVM, etc.)
In the vCPU thread instead the opposite order is respected:
(c) the accelerator's execution loop exits thanks to the kick
(b) then the inner execution loop checks cpu->interrupt_request
and cpu->exit_request. If needed cpu->interrupt_request is
converted into cpu->exit_request when work is needed outside
the execution loop.
(a) then the other memory locations are checked. Some may need to
be read under the BQL, but the vCPU thread may also take other
locks (e.g. for queued work items) or none at all.
qatomic_set_mb() would only be needed if the halt sleep was done
outside the BQL (though in that case, cpu->exit_request probably
would be replaced by a QemuEvent or something like that).
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do so before extending it to the user-mode emulators, where there is no
such thing as an "I/O thread".
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>