Move RAMBlock functions out of ram_addr.h and cpu-common.h;
move memory API headers out of include/exec and into include/system.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In our long-term experience in Bytedance, we've found that under
the same load, live migration of larger VMs with more devices is
often more difficult to converge (requiring a larger downtime limit).
Through some testing and calculations, we conclude that bitmap sync time
affects the calculation of live migration bandwidth.
When the addresses processed are not aligned, a large number of
clear_dirty ioctl occur (e.g. a 4MB misaligned memory can generate
2048 clear_dirty ioctls from two different memory_listener),
which increases the time required for bitmap_sync and makes it
more difficult for dirty pages to converge.
For a 64C256G vm with 8 vhost-user-net(32 queue per nic) and
16 vhost-user-blk(4 queue per blk), the sync time is as high as *73ms*
(tested with 10GBps dirty rate, the sync time increases as the dirty
page rate increases), Here are each part of the sync time:
- sync from kvm to ram_list: 2.5ms
- vhost_log_sync:3ms
- sync aligned memory from ram_list to RAMBlock: 5ms
- sync misaligned memory from ram_list to RAMBlock: 61ms
Attempt to merge those fragmented clear_dirty ioctls, then syncing
misaligned memory from ram_list to RAMBlock takes only about 1ms,
and the total sync time is only *12ms*.
Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20251218114220.83354-1-xuchuangxclwt@bytedance.com
[peterx: drop var "offset" in physical_memory_sync_dirty_bitmap]
Signed-off-by: Peter Xu <peterx@redhat.com>
The QERR_INVALID_PARAMETER_VALUE macro is documented as not to be used
in new code. Remove the usage from migration/options.c.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20251215220041.12657-12-farosas@suse.de
Signed-off-by: Peter Xu <peterx@redhat.com>
Some migration parameters are updated immediately once they are set
via migrate-set-parameters. Move that work outside of
migrate_params_apply() and leave that function with the single
responsibility of setting s->parameters and not doing any
side-effects.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20251215220041.12657-9-farosas@suse.de
Signed-off-by: Peter Xu <peterx@redhat.com>
migrate_set_error() currently doesn't take ownership of the error being
passed in. It's not aligned with the error API and meanwhile it also
makes most of the caller free the error explicitly.
Change the API to take the ownership of the Error object instead. This
should save a lot of error_copy() invocations.
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Link: https://lore.kernel.org/r/20251201194510.1121221-8-peterx@redhat.com
[peterx: break line for qemu_savevm_send_packaged, per markus]
Signed-off-by: Peter Xu <peterx@redhat.com>
Currently if you set these flags and have any shared memory object, saving
a snapshot will fail with:
Failed to write bitmap to file: Unable to write to file: Bad address
We need to skip writing RAMBlocks that are backed by shared objects.
Also, we should mark these RAMBlocks as skipped, so the snapshot format stays
readable to tools that later don't know QEMU's command line (for example
scripts/analyze-migration.py). I used bitmap_offset=0 pages_offset=0 for this.
This minor change to snapshot format should be safe, as offset=0 should not
have ever been possible.
Signed-off-by: Pawel Zmarzly <pzmarzly0@gmail.com>
Link: https://lore.kernel.org/r/20251126154734.940066-1-pzmarzly0@gmail.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Snapshots made with mapped-ram and x-ignore-shared flags are
not parsed properly.
The ignore-shared feature adds and extra field in the stream, which
needs to be consumed on the destination side. Even though mapped-ram has
a fixed header format, the ignore-shared is part of the "generic" stream
infomation so the mapped-ram code is currently skipping that be64 read
which incorrectly offsets every subsequent read from the stream.
The current ignore-shared handling can simply be moved earlier in the code
to encompass mapped-ram as well since the ignore-shared doubleword is the
first one read when parsing the ramblock section of the stream.
Co-authored-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Pawel Zmarzly <pzmarzly0@gmail.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20251126121233.542473-1-pzmarzly0@gmail.com
[peterx: enhance commit log per fabiano]
Signed-off-by: Peter Xu <peterx@redhat.com>
Make mapped-ram compatible with loadvm snapshot restoring by explicitly
zeroing memory pages in this case.
Skip zeroing for -incoming and -loadvm migrations to preserve performance.
Signed-off-by: Marco Cavenati <Marco.Cavenati@eurecom.fr>
Link: https://lore.kernel.org/r/20251010115954.1995298-3-Marco.Cavenati@eurecom.fr
Signed-off-by: Peter Xu <peterx@redhat.com>
Very few files use the Physical Memory API. Declare its
methods in their own header: "system/physmem.h".
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Message-Id: <20251001175448.18933-19-philmd@linaro.org>
The functions related to the Physical Memory API declared
in "system/ram_addr.h" do not operate on vCPU. Remove the
'cpu_' prefix.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Message-Id: <20251001175448.18933-18-philmd@linaro.org>
cpu_physical_memory_sync_dirty_bitmap() is now only called within
system/physmem.c, by ramblock_sync_dirty_bitmap(). Reduce its scope
by making it internal to this file. Since it doesn't involve any CPU,
remove the 'cpu_' prefix.
Remove the now unneeded "qemu/rcu.h" and "system/memory.h" headers.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20251001175448.18933-17-philmd@linaro.org>
Move ramblock_is_pmem() along with the RAM Block API
exposed by the "system/ramblock.h" header. Rename as
ram_block_is_pmem() to keep API prefix consistency.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Acked-by: Peter Xu <peterx@redhat.com>
Message-Id: <20251002032812.26069-3-philmd@linaro.org>
Add the cpr-exec migration mode. Usage:
qemu-system-$arch -machine aux-ram-share=on ...
migrate_set_parameter mode cpr-exec
migrate_set_parameter cpr-exec-command \
<arg1> <arg2> ... -incoming <uri-1> \
migrate -d <uri-1>
The migrate command stops the VM, saves state to uri-1,
directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID, and
loads state from uri-1. Guest RAM is preserved in place,
albeit with new virtual addresses.
The new QEMU process is started by exec'ing the command
specified by the @cpr-exec-command parameter. The first word of
the command is the binary, and the remaining words are its
arguments. The command may be a direct invocation of new QEMU,
or may be a non-QEMU command that exec's the new QEMU binary.
This mode creates a second migration channel that is not visible
to the user. At the start of migration, old QEMU saves CPR state
to the second channel, and at the end of migration, it tells the
main loop to call cpr_exec. New QEMU loads CPR state early, before
objects are created.
Because old QEMU terminates when new QEMU starts, one cannot
stream data between the two, so uri-1 must be a type,
such as a file, that accepts all data before old QEMU exits.
Otherwise, old QEMU may quietly block writing to the channel.
Memory-backend objects must have the share=on attribute, but
memory-backend-epc is not supported. The VM must be started with
the '-machine aux-ram-share=on' option, which allows anonymous
memory to be transferred in place to the new process. The memfds
are kept open across exec by clearing the close-on-exec flag, their
values are saved in CPR state, and they are mmap'd in new QEMU.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Link: https://lore.kernel.org/r/1759332851-370353-7-git-send-email-steven.sistare@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>
This is an incremental step in converting vmstate loading
code to report error via Error objects instead of directly
printing it to console/monitor.
It is ensured that loadvm_process_enable_colo() must report an error
in errp, in case of failure.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Arun Menon <armenon@redhat.com>
Tested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Link: https://lore.kernel.org/r/20250918-propagate_tpm_error-v14-21-36f11a6fb9d3@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
The function colo_init_ram_cache() currently returns -errno if
qemu_anon_ram_alloc() fails. However, the subsequent cleanup loop that
calls qemu_anon_ram_free() could potentially alter the value of errno.
This would cause the function to return a value that does not accurately
represent the original allocation failure.
This commit changes the return value to -1 on memory allocation failure.
This ensures that the return value is consistent and is not affected by
any errno changes that may occur during the free process.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Arun Menon <armenon@redhat.com>
Tested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Link: https://lore.kernel.org/r/20250918-propagate_tpm_error-v14-20-36f11a6fb9d3@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
This is an incremental step in converting vmstate loading
code to report error via Error objects instead of directly
printing it to console/monitor.
It is ensured that ram_postcopy_incoming_init() must report an error
in errp, in case of failure.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Arun Menon <armenon@redhat.com>
Tested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Link: https://lore.kernel.org/r/20250918-propagate_tpm_error-v14-14-36f11a6fb9d3@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
This is a follow up on the other commit "migration/ram: avoid to do log
clear in the last round" but for postcopy.
https://lore.kernel.org/r/20250514115827.3216082-1-yanfei.xu@bytedance.com
I can observe more than 10% reduction of average page fault latency during
postcopy phase with this optimization:
Before: 268.00us (+-1.87%)
After: 232.67us (+-2.01%)
The test was done with a 16GB VM with 80 vCPUs, running a workload that
busy random writes to 13GB memory.
Cc: Yanfei Xu <yanfei.xu@bytedance.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-12-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Take notes on start/end state of dirty pages for the whole system.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250613140801.474264-10-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The check over PAGE_DIRTY_FOUND isn't necessary. We could indent one less
and assert that instead.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250613140801.474264-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Now after merging the precopy and postcopy version of complete() hook,
rename the precopy version from save_live_complete_precopy() to
save_complete().
Dropping the "live" when at it, because it's in most cases not live when
happening (in precopy).
No functional change intended.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-7-peterx@redhat.com
[peterx: squash the fixup that covers a few more doc spots, per Juraj]
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The hook is only defined in two vmstate users ("ram" and "block dirty
bitmap"), meanwhile both of them define the hook exactly the same as the
precopy version. Hence, this postcopy version isn't needed.
No functional change intended.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Update ReplayRamDiscard() function to return the result and unify the
ReplayRamPopulate() and ReplayRamDiscard() to ReplayRamDiscardState() at
the same time due to their identical definitions. This unification
simplifies related structures, such as VirtIOMEMReplayData, which makes
it cleaner.
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Link: https://lore.kernel.org/r/20250612082747.51539-4-chenyi.qiang@intel.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Syncing volatile memory provides no benefit, instead it can cause
performance issues in some cases. Only sync memory that is marked as
non-volatile after migration completes on destination.
Signed-off-by: Ben Chaney <bchaney@akamai.com>
Fixes: bd108a44bc (migration: ram: Switch to ram block writeback)
Link: https://lore.kernel.org/r/1CC43F59-336F-4A12-84AD-DB89E0A17A95@akamai.com
Signed-off-by: Peter Xu <peterx@redhat.com>
There won't be any ram sync after the stage of save_complete, therefore
it's unnecessary to do manually protect for dirty pages being sent. Skip
to do this in last round can reduce noticeable downtime.
Signed-off-by: Yanfei Xu <yanfei.xu@bytedance.com>
Tested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250514115827.3216082-1-yanfei.xu@bytedance.com
[peterx: add comments]
Signed-off-by: Peter Xu <peterx@redhat.com>
Enable Multifd and Postcopy migration together.
The migration_ioc_process_incoming() routine checks
magic value sent on each channel and helps to properly
setup multifd and postcopy channels.
The Precopy and Multifd threads work during the initial
guest RAM transfer. When migration moves to the Postcopy
phase, the multifd threads cease to send data on multifd
channels and Postcopy threads on the destination
request/pull data from the source side.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Link: https://lore.kernel.org/r/20250512125124.147064-3-ppandit@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
The preempt mode postcopy has been introduced for a while. From latency
POV, it should always win the vanilla postcopy.
However there's one thing missing when preempt mode is enabled right now,
which is the spatial locality hint when there're page requests from the
destination side.
In vanilla postcopy, as long as a page request was unqueued, it will update
the PSS of the precopy background stream, so that after a page request the
background thread will move the pages after whatever was requested. It's
pretty much a natural behavior when there's only one channel anyway, and
one scanner to send the pages.
Preempt mode didn't follow that, because preempt mode has its own channel
and its own PSS (which doesn't linearly scan the guest memory, but
dedicated to resolve page requested from destination). So the page request
process and the background migration process are completely separate.
This patch adds the hint explicitly for preempt mode. With that, whenever
the preempt mode receives a page request on the source, it will service the
remote page fault in the return path, then it'll provide a hint to the
background thread so that we'll start sending the pages right after the
requested ones in the background, assuming the follow up pages have a
higher chance to be accessed later.
NOTE: since the background migration thread and return path thread run
completely concurrently, it doesn't always mean the hint will be applied
every single time. For example, it's possible that the return path thread
receives multiple page requests in a row without the background thread
getting the chance to consume one. In such case, the preempt thread only
provide the hint if the previous hint has been consumed. After all,
there's no point queuing hints when we only have one linear scanner.
This could measureably improve the simple sequential memory access pattern
during postcopy (when preempt is on). For random accesses, I can measure a
slight increase of remote page fault latency from ~500us -> ~600us, that
could be a trade-off to have such hint mechanism, and after all that's
still greatly improved comparing to vanilla postcopy on random (~10ms).
The patch is verified by our QE team in a video streaming test case, to
reduce the pause of the video from ~1min to a few seconds when switching
over to postcopy with preempt mode.
Reported-by: Xiaohui Li <xiaohli@redhat.com>
Tested-by: Xiaohui Li <xiaohli@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250424220705.195544-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Implement save_postcopy_prepare(), preparing for the enablement
of both multifd and postcopy.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250411114534.3370816-5-ppandit@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
control_save_page() is for RDMA only, unfold it to make the code more
clear.
In addition:
- Similar to other branches style in ram_save_target_page(), involve RDMA
only if the condition 'migrate_rdma()' is true.
- Further simplify the code by removing the RAM_SAVE_CONTROL_NOT_SUPP.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250305062825.772629-6-lizhijian@fujitsu.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
A few functions now end with a label. The next commit will clean them
up.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20250407082643.2310002-3-armbru@redhat.com>
[Straightforward conflict with commit 988ad4cceb (hw/loongarch/virt:
Fix cpuslot::cpu set at last in virt_cpu_plug()) resolved]
Convert the existing includes with sed.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Address an error in RDMA-based migration by ensuring RDMA is prioritized
when saving pages in `ram_save_target_page()`.
Previously, the RDMA protocol's page-saving step was placed after other
protocols due to a refactoring in commit bc38dc2f5f. This led to migration
failures characterized by unknown control messages and state loading errors
destination:
(qemu) qemu-system-x86_64: Unknown control message QEMU FILE
qemu-system-x86_64: error while loading state section id 1(ram)
qemu-system-x86_64: load of migration failed: Operation not permitted
source:
(qemu) qemu-system-x86_64: RDMA is in an error state waiting migration to abort!
qemu-system-x86_64: failed to save SaveStateEntry with id(name): 1(ram): -1
qemu-system-x86_64: rdma migration: recv polling control error!
qemu-system-x86_64: warning: Early error. Sending error.
qemu-system-x86_64: warning: rdma migration: send polling control error
RDMA migration implemented its own protocol/method to send pages to
destination side, hand over to RDMA first to prevent pages being saved by
other protocol.
Fixes: bc38dc2f5f ("migration: refactor ram_save_target_page functions")
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250305062825.772629-2-lizhijian@fujitsu.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
There's no point passing the error into migration cancel only for it
to call migrate_set_error().
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-ID: <20250213175927.19642-2-farosas@suse.de>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Refactor ram_save_target_page legacy and multifd
functions into one. Other than simplifying it,
it frees 'migration_ops' object from usage, so it
is expunged.
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-ID: <20250127120823.144949-3-ppandit@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Add the cpr-transfer migration mode, which allows the user to transfer
a guest to a new QEMU instance on the same host with minimal guest pause
time, by preserving guest RAM in place, albeit with new virtual addresses
in new QEMU, and by preserving device file descriptors. Pages that were
locked in memory for DMA in old QEMU remain locked in new QEMU, because the
descriptor of the device that locked them remains open.
cpr-transfer preserves memory and devices descriptors by sending them to
new QEMU over a unix domain socket using SCM_RIGHTS. Such CPR state cannot
be sent over the normal migration channel, because devices and backends
are created prior to reading the channel, so this mode sends CPR state
over a second "cpr" migration channel. New QEMU reads the cpr channel
prior to creating devices or backends. The user specifies the cpr channel
in the channel arguments on the outgoing side, and in a second -incoming
command-line parameter on the incoming side.
The user must start old QEMU with the the '-machine aux-ram-share=on' option,
which allows anonymous memory to be transferred in place to the new process
by transferring a memory descriptor for each ram block. Memory-backend
objects must have the share=on attribute, but memory-backend-epc is not
supported.
The user starts new QEMU on the same host as old QEMU, with command-line
arguments to create the same machine, plus the -incoming option for the
main migration channel, like normal live migration. In addition, the user
adds a second -incoming option with channel type "cpr". This CPR channel
must support file descriptor transfer with SCM_RIGHTS, i.e. it must be a
UNIX domain socket.
To initiate CPR, the user issues a migrate command to old QEMU, adding
a second migration channel of type "cpr" in the channels argument.
Old QEMU stops the VM, saves state to the migration channels, and enters
the postmigrate state. New QEMU mmap's memory descriptors, and execution
resumes.
The implementation splits qmp_migrate into start and finish functions.
Start sends CPR state to new QEMU, which responds by closing the CPR
channel. Old QEMU detects the HUP then calls finish, which connects the
main migration channel.
In summary, the usage is:
qemu-system-$arch -machine aux-ram-share=on ...
start new QEMU with "-incoming <main-uri> -incoming <cpr-channel>"
Issue commands to old QEMU:
migrate_set_parameter mode cpr-transfer
{"execute": "migrate", ...
{"channel-type": "main"...}, {"channel-type": "cpr"...} ... }
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Link: https://lore.kernel.org/r/1736967650-129648-17-git-send-email-steven.sistare@oracle.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
It's not straightforward to see why src QEMU needs to sync multifd during
setup() phase. After all, there's no page queued at that point.
For old QEMUs, there's a solid reason: EOS requires it to work. While it's
clueless on the new QEMUs which do not take EOS message as sync requests.
One will figure that out only when this is conditionally removed. In fact,
the author did try it out. Logically we could still avoid doing this on
new machine types, however that needs a separate compat field and that can
be an overkill in some trivial overhead in setup() phase.
Let's instead document it completely, to avoid someone else tries this
again and do the debug one more time, or anyone confused on why this ever
existed.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-Id: <20241206224755.1108686-8-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The src flush condition check is over complicated, and it's getting more
out of control if postcopy will be involved.
In general, we have two modes to do the sync: legacy or modern ways.
Legacy uses per-section flush, modern uses per-round flush.
Mapped-ram always uses the modern, which is per-round.
Introduce two helpers, which can greatly simplify the code, and hopefully
make it readable again.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Message-Id: <20241206224755.1108686-7-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Multifd never worked with postcopy, at least yet so far.
Remove the sync processing there, because it's confusing, and they should
never appear. Now if RAM_SAVE_FLAG_MULTIFD_FLUSH is observed, we fail hard
instead of trying to invoke multifd code.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20241206224755.1108686-6-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
RAM_SAVE_FLAG_MULTIFD_FLUSH message should always be correlated to a sync
request on src. Unify such message into one place, and conditionally send
the message only if necessary.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20241206224755.1108686-5-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Firstly, we're going to use the multifd flag soon in multifd code, so ram.c
isn't gonna work.
Secondly, we have a separate RDMA flag dangling around, which is definitely
not obvious. There's one comment that helps, but not too much.
Put all RAM save flags altogether, so nothing will get overlooked.
Add a section explain why we can't use bits over 0x200.
Remove RAM_SAVE_FLAG_FULL as it's already not used in QEMU, as the comment
explained.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20241206224755.1108686-4-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Commit 637280aeb2 ("migration/multifd: Avoid the final FLUSH in
complete()") stopped sending the RAM_SAVE_FLAG_MULTIFD_FLUSH flag at
ram_save_complete(), because the sync on the destination side is not
needed due to the last iteration of find_dirty_block() having already
done it.
However, that commit overlooked that multifd_ram_flush_and_sync() on the
source side is also not needed at ram_save_complete(), for the same
reason.
Moreover, removing the RAM_SAVE_FLAG_MULTIFD_FLUSH but keeping the
multifd_ram_flush_and_sync() means that currently the recv threads will
hang when receiving the MULTIFD_FLAG_SYNC message, waiting for the
destination sync which only happens when RAM_SAVE_FLAG_MULTIFD_FLUSH is
received.
Luckily, multifd is still all working fine because recv side cleanup
code (mostly multifd_recv_sync_main()) is smart enough to make sure even
if recv threads are stuck at SYNC it'll get kicked out. And since this
is the completion phase of migration, nothing else will be sent after
the SYNCs.
This needs to be fixed because in the future VFIO will have data to push
after ram_save_complete() and we don't want the recv thread to be stuck
in the MULTIFD_FLAG_SYNC message.
Remove the unnecessary (and buggy) invocation of
multifd_ram_flush_and_sync().
For very old binaries (multifd_flush_after_each_section==true), the
flush_and_sync is still needed because each EOS received on destination
will enforce all-channel sync once.
Stable branches do not need this patch, as no real bug I can think of
that will go wrong there.. so not attaching Fixes to be clear on the
backport not needed.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20241206224755.1108686-2-peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Headers in include/sysemu/ are not only related to system
*emulation*, they are also used by virtualization. Rename
as system/ which is clearer.
Files renamed manually then mechanical change using sed tool.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Lei Yang <leiyang@redhat.com>
Message-Id: <20241203172445.28576-1-philmd@linaro.org>
Now with the current migration_is_running(), it will report exactly the
opposite of what will be reported by migration_is_idle().
Drop migration_is_idle(), instead use "!migration_is_running()" which
should be identical on functionality.
In reality, most of the idle check is inverted, so it's even easier to
write with "migrate_is_running()" check.
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20241024213056.1395400-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
This helper is mostly the same as migration_is_running(), except that one
has COLO reported as true, the other has CANCELLING reported as true.
Per my past years experience on the state changes, none of them should
matter.
To make it slightly safer, report both COLO || CANCELLING to be true in
migration_is_running(), then drop the other one. We kept the 1st only
because the name is simpler, and clear enough.
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20241024213056.1395400-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
When VM is configured with huge memory, the current throttle logic
doesn't look like to scale, because migration_trigger_throttle()
is only called for each iteration, so it won't be invoked for a long
time if one iteration can take a long time.
The periodic dirty sync aims to fix the above issue by synchronizing
the ramblock from remote dirty bitmap and, when necessary, triggering
the CPU throttle multiple times during a long iteration.
This is a trade-off between synchronization overhead and CPU throttle
impact.
Signed-off-by: Hyman Huang <yong.huang@smartx.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/f61f1b3653f2acf026901103e1c73d157d38b08f.1729146786.git.yong.huang@smartx.com
[peterx: make prev_cnt global, and reset for each migration]
Signed-off-by: Peter Xu <peterx@redhat.com>
The global static variable ram_state in fact is referred to by the
"rs" parameter in migration_bitmap_sync_precopy. For ease of calling
by the callees, use the global variable directly in
migration_bitmap_sync_precopy and remove "rs" parameter.
The migration_bitmap_sync_precopy will be exported in the next commit.
Signed-off-by: Hyman Huang <yong.huang@smartx.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/283c335d61463bf477160da91b24da45cdaf3e43.1729146786.git.yong.huang@smartx.com
Signed-off-by: Peter Xu <peterx@redhat.com>
../migration/ram.c:1873:23: error: ‘dirty’ may be used uninitialized [-Werror=maybe-uninitialized]
When 'block' != NULL, 'dirty' is initialized.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
This patch is part of a series that moves towards a consistent use of
g_assert_not_reached() rather than an ad hoc mix of different
assertion mechanisms.
Signed-off-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20240919044641.386068-31-pierrick.bouvier@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>