Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
touched in 2017 [1] and, since then, physical machine sizes and VMs
therein have continue to get even bigger, both on average and on the
extremes.
For very large VMs, using 16 threads to preallocate memory can be a
non-trivial bottleneck during VM start-up and migration. Increasing
this limit to 32 threads reduces the time taken for these operations.
Test results from quad socket Intel 8490H (4x 60 cores) show a fairly
linear gain of 50% with the 2x thread count increase.
---------------------------------------------
Idle Guest w/ 2M HugePages | Start-up time
---------------------------------------------
240 vCPU, 7.5TB (16 threads) | 2m41.955s
---------------------------------------------
240 vCPU, 7.5TB (32 threads) | 1m19.404s
---------------------------------------------
Note: Going above 32 threads appears to have diminishing returns at
the point where the memory bandwidth and context switching costs
appear to be a limiting factor to linear scaling. For posterity, on
the same system as above:
- 32 threads: 1m19s
- 48 threads: 1m4s
- 64 threads: 59s
- 240 threads: 50s
Additional thread counts also get less interesting as the amount of
memory is to be preallocated is smaller. Putting that all together,
32 threads appears to be a sane number with a solid speedup on fairly
modern hardware. To go faster, we'd either need to improve the hardware
(CPU/memory) itself or improve clear_pages_*() on the kernel side to
be more efficient.
[1] 1e356fc14b ("mem-prealloc: reduce large guest start-up and migration time.")
Signed-off-by: Jon Kohler <jon@nutanix.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The vreport() function will optionally emit an prefix for error
messages which is output to stderr incrementally. In the event
that two vreport() calls execute concurrently, there is a risk
that the prefix output will interleave. To address this it is
required to take a lock on 'stderr' when outputting errors.
Reported-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The vreport() function will print to HMP if available, otherwise
to stderr. In the event that vreport() is called during execution
of a QMP command, it will print to stderr, but mistakenly omit the
message prefixes (timestamp, guest name, program name).
This new usage of monitor_is_cur_qmp() from vreport() requires that
we add a stub to satisfy linking of non-system emulator binaries.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The monitor_cur_hmp() function will acquire/release mutex locks, which
will trigger trace probes, which can in turn trigger qemu_log() calls.
vreport() calls monitor_cur() multiple times through its execution
both directly and indirectly via error_vprintf().
The result is that the prefix information printed by vreport() gets
interleaved with qemu_log() output, when run outside the context of
an HMP command dispatcher. This can be seen with:
$ qemu-system-x86_64 \
-msg timestamp=on,guest-name=on \
-display none \
-object tls-creds-x509,id=f,dir=fish \
-name fish \
-d trace:qemu_mutex*
2025-09-10T16:30:42.514374Z qemu_mutex_unlock released mutex 0x560b0339b4c0 (/var/home/berrange/src/virt/qemu/include/qemu/lockable.h:56)
2025-09-10T16:30:42.514400Z qemu_mutex_lock waiting on mutex 0x560b033983e0 (/var/home/berrange/src/virt/qemu/include/qemu/lockable.h:56)
2025-09-10T16:30:42.514402Z qemu_mutex_locked taken mutex 0x560b033983e0 (/var/home/berrange/src/virt/qemu/include/qemu/lockable.h:56)
2025-09-10T16:30:42.514404Z qemu_mutex_unlock released mutex 0x560b033983e0 (/var/home/berrange/src/virt/qemu/include/qemu/lockable.h:56)
2025-09-10T16:30:42.516716Z qemu_mutex_lock waiting on mutex 0x560b03398560 (../monitor/monitor.c:91)
2025-09-10T16:30:42.516723Z qemu_mutex_locked taken mutex 0x560b03398560 (../monitor/monitor.c:91)
2025-09-10T16:30:42.516726Z qemu_mutex_unlock released mutex 0x560b03398560 (../monitor/monitor.c:96)
2025-09-10T16:30:42.516728Z qemu_mutex_lock waiting on mutex 0x560b03398560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842057Z qemu_mutex_locked taken mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842058Z qemu_mutex_unlock released mutex 0x564f5e401560 (../monitor/monitor.c:96)
2025-09-10T16:31:04.842055Z 2025-09-10T16:31:04.842060Z qemu_mutex_lock waiting on mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842061Z qemu_mutex_locked taken mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842062Z qemu_mutex_unlock released mutex 0x564f5e401560 (../monitor/monitor.c:96)
2025-09-10T16:31:04.842064Z qemu_mutex_lock waiting on mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842065Z qemu_mutex_locked taken mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842066Z qemu_mutex_unlock released mutex 0x564f5e401560 (../monitor/monitor.c:96)
fish 2025-09-10T16:31:04.842068Z qemu_mutex_lock waiting on mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842069Z qemu_mutex_locked taken mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842070Z qemu_mutex_unlock released mutex 0x564f5e401560 (../monitor/monitor.c:96)
2025-09-10T16:31:04.842072Z qemu_mutex_lock waiting on mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842097Z qemu_mutex_locked taken mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842099Z qemu_mutex_unlock released mutex 0x564f5e401560 (../monitor/monitor.c:96)
qemu-system-x86_64:2025-09-10T16:31:04.842100Z qemu_mutex_lock waiting on mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842102Z qemu_mutex_locked taken mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842103Z qemu_mutex_unlock released mutex 0x564f5e401560 (../monitor/monitor.c:96)
2025-09-10T16:31:04.842105Z qemu_mutex_lock waiting on mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842106Z qemu_mutex_locked taken mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842107Z qemu_mutex_unlock released mutex 0x564f5e401560 (../monitor/monitor.c:96)
Unable to access credentials fish/ca-cert.pem: No such file or directory2025-09-10T16:31:04.842109Z qemu_mutex_lock waiting on mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842110Z qemu_mutex_locked taken mutex 0x564f5e401560 (../monitor/monitor.c:91)
2025-09-10T16:31:04.842111Z qemu_mutex_unlock released mutex 0x564f5e401560 (../monitor/monitor.c:96)
To avoid this interleaving (as well as reduce the huge number of
mutex lock/unlock calls) we need to ensure that monitor_cur_is_hmp()
is only called once at the start of vreport(), and if no HMP is
present, no further monitor APIs can be called.
This implies error_[v]printf() cannot be called from vreport().
Instead we must introduce error_[v]printf_mon() which accept a
pre-acquired Monitor object. In some cases, however, fprintf
can be called directly as output will never be directed to the
monitor.
$ qemu-system-x86_64 \
-msg timestamp=on,guest-name=on \
-display none \
-object tls-creds-x509,id=f,dir=fish \
-name fish \
-d trace:qemu_mutex*
2025-09-10T16:31:22.701691Z qemu_mutex_unlock released mutex 0x5626fd3b84c0 (/var/home/berrange/src/virt/qemu/include/qemu/lockable.h:56)
2025-09-10T16:31:22.701728Z qemu_mutex_lock waiting on mutex 0x5626fd3b53e0 (/var/home/berrange/src/virt/qemu/include/qemu/lockable.h:56)
2025-09-10T16:31:22.701730Z qemu_mutex_locked taken mutex 0x5626fd3b53e0 (/var/home/berrange/src/virt/qemu/include/qemu/lockable.h:56)
2025-09-10T16:31:22.701732Z qemu_mutex_unlock released mutex 0x5626fd3b53e0 (/var/home/berrange/src/virt/qemu/include/qemu/lockable.h:56)
2025-09-10T16:31:22.703989Z qemu_mutex_lock waiting on mutex 0x5626fd3b5560 (../monitor/monitor.c:91)
2025-09-10T16:31:22.703996Z qemu_mutex_locked taken mutex 0x5626fd3b5560 (../monitor/monitor.c:91)
2025-09-10T16:31:22.703999Z qemu_mutex_unlock released mutex 0x5626fd3b5560 (../monitor/monitor.c:96)
2025-09-10T16:31:22.704000Z fish qemu-system-x86_64: Unable to access credentials fish/ca-cert.pem: No such file or directory
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The current unit tests rely on monitor.o not being linked, such
that the monitor stubs get linked instead. Since error_vprintf
is in monitor.o this allows a stub error_vprintf impl to be used
that calls g_test_message.
This takes a different approach, with error_vprintf moving
back to error-report.c such that it is always linked into the
tests. The monitor_vprintf() stub is then changed to use
g_test_message if QTEST_SILENT_ERRORS is set, otherwise it will
return -1 and trigger error_vprintf to call vfprintf.
The end result is functionally equivalent for the purposes of
the unit tests.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
One codepath that could return NULL failed to populate the errp
object.
Reported-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
There are three general patterns to QEMU log output
1. Single complete message calls
qemu_log("Some message\n");
2. Direct use of fprintf
FILE *f = qemu_log_trylock()
fprintf(f, "...");
fprintf(f, "...");
fprintf(f, "...\n");
qemu_log_unlock(f)
3. Mixed use of qemu_log_trylock/qemu_log()
FILE *f = qemu_log_trylock()
qemu_log("....");
qemu_log("....");
qemu_log("....\n");
qemu_log_unlock(f)
When message prefixes are enabled, the timestamp will be
unconditionally emitted for all qemu_log() calls. This
works fine in the 1st case, and has no effect in the 2nd
case. In the 3rd case, however, we get the timestamp
printed over & over in each fragment.
One can suggest that pattern (3) is pointless as it is
functionally identical to (2) but with extra indirection
and overhead. None the less we have a fair bit of code
that does this.
The qemu_log() call itself is nothing more than a wrapper
which does pattern (2) with a single fprintf() call.
One might question whether (2) should include the message
prefix in the same way that (1), but there are scenarios
where this could be inappropriate / unhelpful such as the
CPU register dumps or linux-user strace output.
This patch fixes the problem in pattern (3) by keeping
track of the call depth of qemu_log_trylock() and then
only emitting the the prefix when the starting depth
was zero. In doing this qemu_log_trylock_context() is
also introduced as a variant of qemu_log_trylock()
that emits the prefix. Callers doing to batch output
can thus choose whether a prefix is appropriate or
not.
Fixes: 012842c075 (log: make '-msg timestamp=on' apply to all qemu_log usage)
Reported-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
This will be used to include the thread name in error reports
in a later patch. It returns a const string stored in a thread
local to avoid memory allocation when it is called repeatedly
in a single thread. The thread name should be set at the very
start of the thread execution, which is the case when using
qemu_thread_create.
This uses the official thread APIs for fetching thread names,
so that it captures names of threads spawned by code in 3rd
party libraries, not merely QEMU spawned thrads.
This also addresses the gap from the previous patch for setting
the name of the main thread. A constructor is used to initialize
the 'namebuf' thread-local in the main thread only.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The default main thread name is undefined, so use a constructor to
explicitly set it to 'main'. This constructor is marked to run early
as the thread name is intended to be used in error reporting / logs
which may be triggered very early in QEMU execution.
This is only done on Windows platforms, because on Linux (and possibly
other POSIX platforms) changing the main thread name has a side effect
of changing the process name reported by tools like 'ps' which fetch
from the file /proc/self/task/tid/comm, expecting it to be the binary
name.
The subsequent patch will address POSIX platforms in a different way.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Dr. David Alan Gilbert <dave@treblig.org>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The ability to set the thread name needs to be used in a number
of places, so expose the current impls as public methods.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The call to set the thread name on Win32 platforms is done by the parent
thread, after _beginthreadex() returns. At this point the new child
thread is potentially already executing its start method. To ensure the
thread name is guaranteed to be set before any "interesting" code starts
executing, it must be done in the start method of the child thread itself.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
When thread naming was introduced years ago, it was disabled by
default and put behind a command line flag:
commit 8f480de0c9
Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
Date: Thu Jan 30 10:20:31 2014 +0000
Add 'debug-threads' suboption to --name
This was done based on a concern that something might depend
on the historical thread naming. Thread names, however, were
never promised to be part of QEMU's public API. The defaults
will vary across platforms, so no assumptions should ever be
made about naming.
An opt-in behaviour is also unfortunately incompatible with
RCU which creates its thread from an constructor function
which is run before command line args are parsed. Thus the
RCU thread lacks any name.
libvirt has unconditionally enabled debug-threads=yes on all
VMs it creates for 10 years. Interestingly this DID expose a
bug in libvirt, as it parsed /proc/$PID/stat and could not
cope with a space in the thread name. This was a latent
pre-existing bug in libvirt though, and not a part of QEMU's
API.
Having thread names always available, will allow thread names
to be included in error reports and log messags QEMU prints
by default, which will improve ability to triage QEMU bugs.
Reviewed-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Check to make sure that we have inotify in libc, before looking for it
in libinotify.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Marc-André Lureau <marcandre.lureau@redhat.com>
Cc: Daniel P. Berrange <berrange@redhat.com>
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Warner Losh <imp@bsdimp.com>
This reverts commit ddb4d9d174.
The commit says:
> This reverts commit 55d98e3ede.
>
> The commit introduced a regression in the replay functional test
> on alpha (tests/functional/alpha/test_replay.py), that causes CI
> failures regularly. Thus revert this change until someone has
> figured out what is going wrong here.
Reapply the change as alpha is fixed.
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Link: https://lore.kernel.org/r/20260217-alpha-v1-2-0dcc708c9db3@rsg.ci.i.u-tokyo.ac.jp
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
gsource_check() only looks at the ppoll revents for the io_uring fd,
but CQEs can be posted during gsource_prepare()'s io_uring_submit()
call via kernel task_work processing on syscall exit. These completions
are already sitting in the CQ ring but the ring fd may not be signaled
yet, causing gsource_check() to return false.
Add a fallback io_uring_cq_ready() check so completions that arrive
during submission are dispatched immediately rather than waiting for
the next ppoll() cycle.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Message-ID: <20260213143225.161043-3-axboe@kernel.dk>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
When a vCPU thread handles MMIO (holding BQL), aio_co_enter() runs the
block I/O coroutine inline on the vCPU thread because
qemu_get_current_aio_context() returns the main AioContext when BQL is
held. The coroutine calls luring_co_submit() which queues an SQE via
fdmon_io_uring_add_sqe(), but the actual io_uring_submit() only happens
in gsource_prepare() on the main loop thread.
Since the coroutine ran inline (not via aio_co_schedule()), no BH is
scheduled and aio_notify() is never called. The main loop remains asleep
in ppoll() with up to a 499ms timeout, leaving the SQE unsubmitted until
the next timer fires.
Fix this by calling aio_notify() after queuing the SQE. This wakes the
main loop via the eventfd so it can run gsource_prepare() and submit the
pending SQE promptly.
This is a generic fix that benefits all devices using aio=io_uring.
Without it, AHCI/SATA devices see MUCH worse I/O latency since they use
MMIO (not ioeventfd like virtio) and have no other mechanism to wake the
main loop after queuing block I/O.
This is usually a bit hard to detect, as it also relies on the ppoll
loop not waking up for other activity, and micro benchmarks tend not to
see it because they don't have any real processing time. With a
synthetic test case that has a few usleep() to simulate processing of
read data, it's very noticeable. The below example reads 128MB with
O_DIRECT in 128KB chunks in batches of 16, and has a 1ms delay before
each batch submit, and a 1ms delay after processing each completion.
Running it on /dev/sda yields:
time sudo ./iotest /dev/sda
________________________________________________________
Executed in 25.76 secs fish external
usr time 6.19 millis 783.00 micros 5.41 millis
sys time 12.43 millis 642.00 micros 11.79 millis
while on a virtio-blk or NVMe device we get:
time sudo ./iotest /dev/vdb
________________________________________________________
Executed in 1.25 secs fish external
usr time 1.40 millis 0.30 millis 1.10 millis
sys time 17.61 millis 1.43 millis 16.18 millis
time sudo ./iotest /dev/nvme0n1
________________________________________________________
Executed in 1.26 secs fish external
usr time 6.11 millis 0.52 millis 5.59 millis
sys time 13.94 millis 1.50 millis 12.43 millis
where the latter are consistent. If we run the same test but keep the
socket for the ssh connection active by having activity there, then
the sda test looks as follows:
time sudo ./iotest /dev/sda
________________________________________________________
Executed in 1.23 secs fish external
usr time 2.70 millis 39.00 micros 2.66 millis
sys time 4.97 millis 977.00 micros 3.99 millis
as now the ppoll loop is woken all the time anyway.
After this fix, on an idle system:
time sudo ./iotest /dev/sda
________________________________________________________
Executed in 1.30 secs fish external
usr time 2.14 millis 0.14 millis 2.00 millis
sys time 16.93 millis 1.16 millis 15.76 millis
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Message-Id: <07d701b9-3039-4f9b-99a2-abeae51146a5@kernel.dk>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
[Generalize the comment since this applies to all vCPU thread activity,
not just coroutines, as suggested by Kevin Wolf <kwolf@redhat.com>.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
To be reused in the following commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-ID: <20260201173633.413934-3-vsementsov@yandex-team.ru>
This reverts commit 55d98e3ede.
The commit introduced a regression in the replay functional test
on alpha (tests/functional/alpha/test_replay.py), that causes CI
failures regularly. Thus revert this change until someone has
figured out what is going wrong here.
Buglink: https://gitlab.com/qemu-project/qemu/-/issues/3197
Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20260209120336.41454-1-thuth@redhat.com>
Test that the fix in commit 20aa05edc2 ("util/hexdump: fix
QEMU_HEXDUMP_LINE_WIDTH logic") make sense.
To not break compilation when we build without 'block', move
hexdump.c out of "if have_block" in meson.build.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260202112826.38018-1-philmd@linaro.org>
This commit deals with various .c files that included system
headers that are already pulled in by osdep.h, where the .c
file includes osdep.h already itself.
This commit was created with scripts/clean-includes:
./scripts/clean-includes '--git' 'misc' 'hw/core' 'semihosting' 'target/arm' 'target/i386/kvm/kvm.c' 'target/loongarch' 'target/riscv' 'tools' 'util'
All .c should include qemu/osdep.h first. The script performs three
related cleanups:
* Ensure .c files include qemu/osdep.h first.
* Including it in a .h is redundant, since the .c already includes
it. Drop such inclusions.
* Likewise, including headers qemu/osdep.h includes is redundant.
Drop these, too.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-id: 20260116125830.926296-4-peter.maydell@linaro.org
These files only require "qemu/bswap.h", not "qemu/host-utils.h".
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20260109163730.57087-2-philmd@linaro.org>
All these files indirectly include the "qemu/bswap.h" header.
Make this inclusion explicit to avoid build errors when
refactoring unrelated headers.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20260109164742.58041-4-philmd@linaro.org>
This reverts commit fccb744f41.
This commit introduced dependency of linux-user on qemu-sockets.c.
The latter includes handling of various socket types, while gdbstub
only needs unix sockets. Including different kinds of sockets
makes it more problematic to build linux-user statically.
The original issue - the need to unlink unix socket before binding -
will be addressed in the next change.
Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
As we no longer support i386 as a host architecture,
this abstraction is no longer required.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Replace all uses with the normal qatomic_{read,set}.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This API is no longer used.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Remove instances of __i386__, except from tests and imported headers.
Drop a block containing sanity check and fprintf error message for
i386-on-i386 or x86_64-on-x86_64 emulation. If we really want
something like this, we would do it via some form of compile-time check.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
The VFIO pread/pwrite functions use little-endian data format. Currently, the
qemu_vfio_pci_read_config() and qemu_vfio_pci_write_config() don't correctly
convert from CPU native endian format to little-endian (and vice versa) when
using the pread/pwrite functions. Fix this by limiting read/write to 32 bits
and handling endian conversion in qemu_vfio_pci_read_config() and
qemu_vfio_pci_write_config().
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260105222029.2423-1-alifm@linux.ibm.com
[ clg: Fixed typo in subject ]
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Replaying even trivial s390x kernels hangs, because:
- cpu_post_load() fires the TOD timer immediately.
- s390_tod_load() schedules work for firing the TOD timer.
- If rr loop sees work and then timer, we get one timer expiration.
- If rr loop sees timer and then work, we get two timer expirations.
- Record and replay may diverge due to this race.
- In this particular case divergence makes replay loop spin: it sees that
TOD timer has expired, but cannot invoke its callback, because there
is no recorded CHECKPOINT_CLOCK_VIRTUAL.
- The order in which rr loop sees work and timer depends on whether
and when rr loop wakes up during load_snapshot().
- rr loop may wake up after the main thread kicks the CPU and drops
the BQL, which may happen if it calls, e.g., qemu_cond_wait_bql().
Firing TOD timer twice is duplicate work, but it was introduced
intentionally in commit 7c12f710ba ("s390x/tcg: rearm the CKC timer
during migration") in order to avoid dependency on migration order.
The key culprits here are timers that are armed ready expired. They
break the ordering between timers and CPU work, because they are not
constrained by instruction execution, thus introducing non-determinism
and record-replay divergence.
Fix by converting such timer callbacks to CPU work. Also add TOD clock
updates to the save path, mirroring the load path, in order to have the
same CHECKPOINT_CLOCK_VIRTUAL during recording and replaying.
Link: https://lore.kernel.org/qemu-devel/20251128133949.181828-1-thuth@redhat.com/
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20251201215514.1751994-1-iii@linux.ibm.com>
[thuth: Add SPDX license identifiers to the new stubs files]
Signed-off-by: Thomas Huth <thuth@redhat.com>
Replace
error_setg_errno(errp, errno, MSG, FNAME);
by
error_setg_file_open(errp, errno, FNAME);
where MSG is "Could not open '%s'" or similar.
Also replace equivalent uses of error_setg().
A few messages lose prefixes ("net dump: ", "SEV: ", __func__ ": ").
We could put them back with error_prepend(). Not worth the bother.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dave@treblig.org>
Message-ID: <20251121121438.1249498-11-armbru@redhat.com>
[Conflict with commit 26b4a6ffe7 (monitor/hmp: Merge
hmp-cmds-target.c within hmp-cmds.c) resolved]
qemu_ftruncate64() is a general-purpose utility function that may be
used outside of the block layer. Move it to util/oslib-win32.c where
other Windows-specific utility functions reside.
Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-ID: <20251218085446.462827-3-phind.uet@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
AIO_WAIT_WHILE is used even outside the block layer; move the header file
out of block/ just like the implementation is in util/.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AioContexts are used as a generic event loop even outside the block
layer; move the header file out of block/ just like the implementation
is in util/.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Avoid including all of qdev everywhere (the hw/core/qdev.h header in fact
brings in a lot more headers too), instead declare a couple structs for
which only a pointer type is needed.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Create a new header corresponding to functions defined in
util/aiocb.c, and include it whenever AIOCBs are used but
AioContext is not.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a data race occurred between `worker_thread()` writing and
`thread_pool_completion_bh()` reading shared data in `util/thread-pool.c`.
Signed-off-by: Marc Morcos <marcmorcos@google.com>
Link: https://lore.kernel.org/r/20251213001443.2041258-3-marcmorcos@google.com
[Use qatomic_set for writes to ret->ret. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move RAMBlock functions out of ram_addr.h and cpu-common.h;
move memory API headers out of include/exec and into include/system.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A recent change in glibc 2.42.9000 [1] changes the return type of
strstr() and other string functions to be 'const char *' when the
input is a 'const char *'.
This breaks the build in various files with errors such as :
error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
208 | char *pidstr = strstr(filename, "%");
| ^~~~~~
Fix this by changing the type of the variables that store the result
of these functions to 'const char *'.
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=cd748a63ab1a7ae846175c532a3daab341c62690
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20251209174328.698774-1-clg@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
AioContext has its own io_uring instance for file descriptor monitoring.
The disk I/O io_uring code was developed separately. Originally I
thought the characteristics of file descriptor monitoring and disk I/O
were too different, requiring separate io_uring instances.
Now it has become clear to me that it's feasible to share a single
io_uring instance for file descriptor monitoring and disk I/O. We're not
using io_uring's IOPOLL feature or anything else that would require a
separate instance.
Unify block/io_uring.c and util/fdmon-io_uring.c using the new
aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
block/io_uring.c just needs to submit readv/writev/fsync and most of the
io_uring-specific logic is handled by fdmon-io_uring.c.
There are two immediate advantages:
1. Fewer system calls. There is no need to monitor the disk I/O io_uring
ring fd from the file descriptor monitoring io_uring instance. Disk
I/O completions are now picked up directly. Also, sqes are
accumulated in the sq ring until the end of the event loop iteration
and there are fewer io_uring_enter(2) syscalls.
2. Less code duplication.
Note that error_setg() messages are not supposed to end with
punctuation, so I removed a '.' for the non-io_uring build error
message.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-15-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Introduce the aio_add_sqe() API for submitting io_uring requests in the
current AioContext. This allows other components in QEMU, like the block
layer, to take advantage of io_uring features without creating their own
io_uring context.
This API supports nested event loops just like file descriptor
monitoring and BHs do. This comes at a complexity cost: CQE callbacks
must be placed on a list so that nested event loops can invoke pending
CQE callbacks from parent event loops. If you're wondering why
CqeHandler exists instead of just a callback function pointer, this is
why.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-14-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The ppoll and epoll file descriptor monitoring implementations rely on
the event loop's generic file descriptor, timer, and BH dispatch code to
invoke user callbacks.
The io_uring file descriptor monitoring implementation will need
io_uring-specific dispatch logic for CQE handlers for custom SQEs.
Introduce a new FDMonOps ->dispatch() callback that allows file
descriptor monitoring implementations to invoke user callbacks. The next
patch will use this new callback.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-13-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reduce the level of indentation to make further code changes easier to
read.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-12-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
io_uring may not be available at runtime due to system policies (e.g.
the io_uring_disabled sysctl) or creation could fail due to file
descriptor resource limits.
Handle failure scenarios as follows:
If another AioContext already has io_uring, then fail AioContext
creation so that the aio_add_sqe() API is available uniformly from all
QEMU threads. Otherwise fall back to epoll(7) if io_uring is
unavailable.
Notes:
- Update the comment about selecting the fastest fdmon implementation.
At this point it's not about speed anymore, it's about aio_add_sqe()
API availability.
- Uppercase the error message when converting from error_report() to
error_setg_errno() for consistency (but there are instances of
lowercase in the codebase).
- It's easier to move the #ifdefs from aio-posix.h to aio-posix.c.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-11-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When aio_context_new() -> aio_context_setup() fails at startup it
doesn't really matter whether errors are returned to the caller or the
process terminates immediately.
However, it is not acceptable to terminate when hotplugging --object
iothread at runtime. Refactor aio_context_setup() so that errors can be
propagated. The next commit will set errp when fdmon_io_uring_setup()
fails.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-10-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
g_source_destroy() only removes the GSource from the GMainContext it's
attached to, if any. It does not free it.
Use g_source_unref() instead so that the AioContext (which embeds a
GSource) is freed. There is no need to call g_source_destroy() in
aio_context_new() because the GSource isn't attached to a GMainContext
yet.
aio_ctx_finalize() expects everything to be set up already, so introduce
the new ctx->initialized boolean and do nothing when called with
!initialized. This also requires moving aio_context_setup() down after
event_notifier_init() since aio_ctx_finalize() won't release any
resources that aio_context_setup() acquired.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-9-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no need for aio_context_use_g_source() now that epoll(7) and
io_uring(7) file descriptor monitoring works with the glib event loop.
AioContext doesn't need to be notified that GSource is being used.
On hosts with io_uring support this now enables fdmon-io_uring.c by
default, replacing fdmon-poll.c and fdmon-epoll.c. In other words, the
event loop will use io_uring!
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-8-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
AioContext's glib integration only supports ppoll(2) file descriptor
monitoring. epoll(7) and io_uring(7) disable themselves and switch back
to ppoll(2) when the glib event loop is used. The main loop thread
cannot use epoll(7) or io_uring(7) because it always uses the glib event
loop.
Future QEMU features may require io_uring(7). One example is uring_cmd
support in FUSE exports. Each feature could create its own io_uring(7)
context and integrate it into the event loop, but this is inefficient
due to extra syscalls. It would be more efficient to reuse the
AioContext's existing fdmon-io_uring.c io_uring(7) context because
fdmon-io_uring.c will already be active on systems where Linux io_uring
is available.
In order to keep fdmon-io_uring.c's AioContext operational even when the
glib event loop is used, extend FDMonOps with an API similar to
GSourceFuncs so that file descriptor monitoring can integrate into the
glib event loop.
A quick summary of the GSourceFuncs API:
- prepare() is called each event loop iteration before waiting for file
descriptors and timers.
- check() is called to determine whether events are ready to be
dispatched after waiting.
- dispatch() is called to process events.
More details here: https://docs.gtk.org/glib/struct.SourceFuncs.html
Move the ppoll(2)-specific code from aio-posix.c into fdmon-poll.c and
also implement epoll(7)- and io_uring(7)-specific file descriptor
monitoring code for glib event loops.
Note that it's still faster to use aio_poll() rather than the glib event
loop since glib waits for file descriptor activity with ppoll(2) and
does not support adaptive polling. But at least epoll(7) and io_uring(7)
now work in glib event loops.
Splitting this into multiple commits without temporarily breaking
AioContext proved difficult so this commit makes all the changes. The
next commit will remove the aio_context_use_g_source() API because it is
no longer needed.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-7-stefanha@redhat.com>
[kwolf: Build fixes; fix AioContext.list_lock use after destroy]
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 816a430c51 ("util/aio: Defer disabling poll mode as long as
possible") kept polling enabled when the event loop timeout is 0. Since
there is no timeout the event loop will continue immediately and the
overhead of disabling and re-enabling polling can be avoided.
fdmon-io_uring.c is unable to take advantage of this optimization
because its ->need_wait() function returns true whenever there are new
io_uring SQEs to submit:
if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Polling will be disabled even when timeout == 0.
Extend the optimization to handle the case when need_wait() returns true
and timeout == 0.
Cc: Chao Gao <chao.gao@intel.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-5-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
io_uring_enter(2) only returns -EINTR in some cases when interrupted by
a signal. Therefore the while loop in fdmon_io_uring_wait() is
incomplete and can lead to a spurious early return.
Handle the case when a signal interrupts io_uring_enter(2) but the
syscall returns the number of SQEs submitted (that takes priority over
-EINTR).
This patch probably makes little difference for QEMU, but the test suite
relies on the exact pattern of aio_poll() return values, so it's best to
hide this io_uring syscall interface quirk.
Here is the strace of test-aio receiving 3 SIGCONT signals after this
fix has been applied. Notice how the io_uring_enter(2) return value is 1
the first time because an SQE was submitted, but -EINTR the other times:
eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK) = 9
io_uring_enter(7, 1, 0, 0, NULL, 8) = 1
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffe38a46240) = 0
io_uring_enter(7, 1, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 1
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=596096, si_uid=1000} ---
io_uring_enter(7, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = -1 EINTR (Interrupted system call)
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=596096, si_uid=1000} ---
io_uring_enter(7, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8 <unfinished ...>
<... io_uring_enter resumed>) = -1 EINTR (Interrupted system call)
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=596096, si_uid=1000} ---
io_uring_enter(7, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8 <unfinished ...>
<... io_uring_enter resumed>) = 0
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-4-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>