
[PATCH-for-9.1,v2,2/3] migration: Remove RDMA protocol handling

Message ID 20240328130255.52257-3-philmd@linaro.org
State New
Series rdma: Remove RDMA subsystem and pvrdma device

Commit Message

Philippe Mathieu-Daudé March 28, 2024, 1:02 p.m. UTC
The whole RDMA subsystem was deprecated in commit e9a54265f5
("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
released in v8.2.

Remove:
 - RDMA handling from migration
 - dependencies on libibumad, libibverbs and librdmacm

Keep the RAM_SAVE_FLAG_HOOK definition since it might appear
in old migration streams.

Cc: Peter Xu <peterx@redhat.com>
Cc: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
---
 MAINTAINERS                                   |    7 -
 docs/devel/migration/main.rst                 |    6 -
 docs/rdma.txt                                 |  420 --
 docs/system/loongarch/virt.rst                |    2 +-
 meson.build                                   |   23 -
 qapi/migration.json                           |   31 +-
 migration/migration-stats.h                   |    6 +-
 migration/migration.h                         |    9 -
 migration/options.h                           |    2 -
 migration/rdma.h                              |   69 -
 migration/migration-stats.c                   |    5 +-
 migration/migration.c                         |   31 -
 migration/options.c                           |   16 -
 migration/qemu-file.c                         |    1 -
 migration/ram.c                               |   86 +-
 migration/rdma.c                              | 4184 -----------------
 migration/savevm.c                            |    2 +-
 meson_options.txt                             |    2 -
 migration/meson.build                         |    1 -
 migration/trace-events                        |   68 +-
 qemu-options.hx                               |    3 -
 .../org.centos/stream/8/build-environment.yml |    1 -
 .../ci/org.centos/stream/8/x86_64/configure   |    2 -
 scripts/ci/setup/build-environment.yml        |    4 -
 scripts/coverity-scan/run-coverity-scan       |    2 +-
 scripts/meson-buildoptions.sh                 |    3 -
 tests/lcitool/projects/qemu.yml               |    3 -
 tests/migration/guestperf/engine.py           |    4 +-
 28 files changed, 14 insertions(+), 4979 deletions(-)
 delete mode 100644 docs/rdma.txt
 delete mode 100644 migration/rdma.h
 delete mode 100644 migration/rdma.c

Comments

Fabiano Rosas March 28, 2024, 2:18 p.m. UTC | #1
Philippe Mathieu-Daudé <philmd@linaro.org> writes:

> The whole RDMA subsystem was deprecated in commit e9a54265f5
> ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> released in v8.2.
>
> Remove:
>  - RDMA handling from migration
>  - dependencies on libibumad, libibverbs and librdmacm
>
> Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
> in old migration streams.
>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Li Zhijian <lizhijian@fujitsu.com>
> Acked-by: Fabiano Rosas <farosas@suse.de>
> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>

Just to be clear, because people raised the point in the last version,
the first link in the deprecation commit links to a thread consisting
entirely of rdma migration patches. I don't see any ambiguity on whether
the deprecation was intended to include migration. There's even an ack
from Juan.

So on the basis of not reverting the previous maintainer's decision, my
Ack stands here.

We also had pretty obvious bugs ([1], [2]) in the past that would have
been caught if we had any kind of testing for the feature, so I can't
even say this thing works currently.

@Peter Xu, @Li Zhijian, what are your thoughts on this?

1- https://lore.kernel.org/r/20230920090412.726725-1-lizhijian@fujitsu.com
2- https://lore.kernel.org/r/CAHEcVy7HXSwn4Ow_Kog+Q+TN6f_kMeiCHevz1qGM-fbxBPp1hQ@mail.gmail.com
Thomas Huth March 28, 2024, 3:22 p.m. UTC | #2
On 28/03/2024 16.01, Peter Xu wrote:
> On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>
>>> The whole RDMA subsystem was deprecated in commit e9a54265f5
>>> ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
>>> released in v8.2.
>>>
>>> Remove:
>>>   - RDMA handling from migration
>>>   - dependencies on libibumad, libibverbs and librdmacm
>>>
>>> Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
>>> in old migration streams.
>>>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Li Zhijian <lizhijian@fujitsu.com>
>>> Acked-by: Fabiano Rosas <farosas@suse.de>
>>> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
>>
>> Just to be clear, because people raised the point in the last version,
>> the first link in the deprecation commit links to a thread comprising
>> entirely of rdma migration patches. I don't see any ambiguity on whether
>> the deprecation was intended to include migration. There's even an ack
>> from Juan.
> 
> Yes I remember that's the plan.
> 
>>
>> So on the basis of not reverting the previous maintainer's decision, my
>> Ack stands here.
>>
>> We also had pretty obvious bugs ([1], [2]) in the past that would have
>> been caught if we had any kind of testing for the feature, so I can't
>> even say this thing works currently.
>>
>> @Peter Xu, @Li Zhijian, what are your thoughts on this?
> 
> Generally I definitely agree with such a removal sooner or later, as that's
> how deprecation works, and even after Juan's left I'm not aware of any
> other new RDMA users.  Personally, I'd slightly prefer postponing it one
> more release which might help a bit of our downstream maintenance, however
> I assume that's not a blocker either, as I think we can also manage it.
> 
> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around.

Since e9a54265f5 was not very clear about the rdma migration code, should we
maybe rather add a separate deprecation note for the migration part, add a
proper warning message to the migration code in case someone tries to use it
there, and then only remove the rdma migration code after two more releases?
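
For illustration, a minimal sketch of the kind of runtime warning meant here
(the placement and exact wording are assumptions, not the actual patch) could
look like:

  /* hypothetical placement: wherever the "rdma:" migration URI is handled */
  #include "qemu/error-report.h"

  static void warn_rdma_migration_deprecated(void)
  {
      /* warn_report() is QEMU's standard helper for user-visible warnings */
      warn_report("RDMA migration support is deprecated and will be removed "
                  "in a future release");
  }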

  Thomas
Xingtao Yao (Fujitsu) via March 29, 2024, 1:53 a.m. UTC | #3
On 28/03/2024 23:01, Peter Xu wrote:
> On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
>> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
>>
>>> The whole RDMA subsystem was deprecated in commit e9a54265f5
>>> ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
>>> released in v8.2.
>>>
>>> Remove:
>>>   - RDMA handling from migration
>>>   - dependencies on libibumad, libibverbs and librdmacm
>>>
>>> Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
>>> in old migration streams.
>>>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Li Zhijian <lizhijian@fujitsu.com>
>>> Acked-by: Fabiano Rosas <farosas@suse.de>
>>> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
>>
>> Just to be clear, because people raised the point in the last version,
>> the first link in the deprecation commit links to a thread comprising
>> entirely of rdma migration patches. I don't see any ambiguity on whether
>> the deprecation was intended to include migration. There's even an ack
>> from Juan.
> 
> Yes I remember that's the plan.
> 
>>
>> So on the basis of not reverting the previous maintainer's decision, my
>> Ack stands here.
>>
>> We also had pretty obvious bugs ([1], [2]) in the past that would have
>> been caught if we had any kind of testing for the feature, so I can't
>> even say this thing works currently.
>>
>> @Peter Xu, @Li Zhijian, what are your thoughts on this?
> 
> Generally I definitely agree with such a removal sooner or later, as that's
> how deprecation works, and even after Juan's left I'm not aware of any
> other new RDMA users.  Personally, I'd slightly prefer postponing it one
> more release which might help a bit of our downstream maintenance, however
> I assume that's not a blocker either, as I think we can also manage it.
> 
> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around. That's also one thing I notice that
> e9a54265f533f didn't yet get acks from RDMA users that we are aware, even
> if they're rare. According to [2] it could be that such user may only rely
> on the release versions of QEMU when it broke things.
> 
> So I'm copying Yu too (while Zhijian is already in the loop), just in case
> someone would like to stand up and speak.


I admit RDMA migration lacked testing (unit/CI tests), which led to a few
obvious bugs being noticed too late.
However, I was a bit surprised when I saw the removal of the RDMA migration:
as far as I was aware, this feature had not been marked as deprecated in a
user-visible way (at least there is no prompt to the end user).


> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around.

Agree.
I didn't immediately express my opinion in V1 because I'm also consulting our
customers about this feature for the future.

Personally, I agree with Peter's idea that "I'd slightly prefer postponing it one
more release which might help a bit of our downstream maintenance".

Thanks
Zhijian

> 
> Thanks,
> 
>>
>> 1- https://lore.kernel.org/r/20230920090412.726725-1-lizhijian@fujitsu.com
>> 2- https://lore.kernel.org/r/CAHEcVy7HXSwn4Ow_Kog+Q+TN6f_kMeiCHevz1qGM-fbxBPp1hQ@mail.gmail.com
>>
>
Daniel P. Berrangé March 29, 2024, 7:44 p.m. UTC | #4
On Fri, Mar 29, 2024 at 11:28:54AM +0100, Philippe Mathieu-Daudé wrote:
> Hi Zhijian,
> 
> On 29/3/24 02:53, Zhijian Li (Fujitsu) wrote:
> > 
> > 
> > On 28/03/2024 23:01, Peter Xu wrote:
> > > On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
> > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes:
> > > > 
> > > > > The whole RDMA subsystem was deprecated in commit e9a54265f5
> > > > > ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> > > > > released in v8.2.
> > > > > 
> > > > > Remove:
> > > > >    - RDMA handling from migration
> > > > >    - dependencies on libibumad, libibverbs and librdmacm
> > > > > 
> > > > > Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
> > > > > in old migration streams.
> > > > > 
> > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > Cc: Li Zhijian <lizhijian@fujitsu.com>
> > > > > Acked-by: Fabiano Rosas <farosas@suse.de>
> > > > > Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> > > > 
> > > > Just to be clear, because people raised the point in the last version,
> > > > the first link in the deprecation commit links to a thread comprising
> > > > entirely of rdma migration patches. I don't see any ambiguity on whether
> > > > the deprecation was intended to include migration. There's even an ack
> > > > from Juan.
> > > 
> > > Yes I remember that's the plan.
> > > 
> > > > 
> > > > So on the basis of not reverting the previous maintainer's decision, my
> > > > Ack stands here.
> > > > 
> > > > We also had pretty obvious bugs ([1], [2]) in the past that would have
> > > > been caught if we had any kind of testing for the feature, so I can't
> > > > even say this thing works currently.
> > > > 
> > > > @Peter Xu, @Li Zhijian, what are your thoughts on this?
> > > 
> > > Generally I definitely agree with such a removal sooner or later, as that's
> > > how deprecation works, and even after Juan's left I'm not aware of any
> > > other new RDMA users.  Personally, I'd slightly prefer postponing it one
> > > more release which might help a bit of our downstream maintenance, however
> > > I assume that's not a blocker either, as I think we can also manage it.
> > > 
> > > IMHO it's more important to know whether there are still users and whether
> > > they would still like to see it around. That's also one thing I notice that
> > > e9a54265f533f didn't yet get acks from RDMA users that we are aware, even
> > > if they're rare. According to [2] it could be that such user may only rely
> > > on the release versions of QEMU when it broke things.
> > > 
> > > So I'm copying Yu too (while Zhijian is already in the loop), just in case
> > > someone would like to stand up and speak.
> > 
> > 
> > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > obvious bugs being noticed too late.
> > However I was a bit surprised when I saw the removal of the RDMA migration. I wasn't
> > aware that this feature has not been marked as deprecated(at least there is no
> > prompt to end-user).
> > 
> > 
> > > IMHO it's more important to know whether there are still users and whether
> > > they would still like to see it around.
> > 
> > Agree.
> > I didn't immediately express my opinion in V1 because I'm also consulting our
> > customers for this feature in the future.
> > 
> > Personally, I agree with Perter's idea that "I'd slightly prefer postponing it one
> > more release which might help a bit of our downstream maintenance"
> 
> Do you mind posting a deprecation patch to clarify the situation?

The key thing the first deprecation patch missed was that it failed
to issue a warning message when RDMA migration was actually used.

With regards,
Daniel
Xingtao Yao (Fujitsu) via April 1, 2024, 7:55 a.m. UTC | #5
Phil,

on 3/29/2024 6:28 PM, Philippe Mathieu-Daudé wrote:
>>
>>
>>> IMHO it's more important to know whether there are still users and 
>>> whether
>>> they would still like to see it around.
>>
>> Agree.
>> I didn't immediately express my opinion in V1 because I'm also 
>> consulting our
>> customers for this feature in the future.
>>
>> Personally, I agree with Perter's idea that "I'd slightly prefer 
>> postponing it one
>> more release which might help a bit of our downstream maintenance"
>
> Do you mind posting a deprecation patch to clarify the situation?
>

No problem, I just posted a deprecation patch, please take a look.
https://lore.kernel.org/qemu-devel/20240401035947.3310834-1-lizhijian@fujitsu.com/T/#u

Thanks
Zhijian
Yu Zhang April 1, 2024, 9:26 p.m. UTC | #6
Hello Peter and Zhijian,

Thank you so much for letting me know about this. I'm also a bit surprised at
the plan for deprecating the RDMA migration subsystem.

> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around.

> I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> obvious bugs being noticed too late.

Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
for this part. As soon as 8.2 was released, I saw that many of the migration test
cases failed and came to realize that there might be a bug between 8.1 and 8.2,
but was unable to confirm and report it quickly to you.

The maintenance of this part could be too costly or difficult from
your point of view.

My concern is that this plan will force a few QEMU users (not sure how many)
like us either to stick to RDMA migration by using an increasingly older
version of QEMU, or to abandon the currently used RDMA migration.

Best regards,
Yu Zhang

On Mon, Apr 1, 2024 at 9:56 AM Zhijian Li (Fujitsu)
<lizhijian@fujitsu.com> wrote:
>
> Phil,
>
> on 3/29/2024 6:28 PM, Philippe Mathieu-Daudé wrote:
> >>
> >>
> >>> IMHO it's more important to know whether there are still users and
> >>> whether
> >>> they would still like to see it around.
> >>
> >> Agree.
> >> I didn't immediately express my opinion in V1 because I'm also
> >> consulting our
> >> customers for this feature in the future.
> >>
> >> Personally, I agree with Perter's idea that "I'd slightly prefer
> >> postponing it one
> >> more release which might help a bit of our downstream maintenance"
> >
> > Do you mind posting a deprecation patch to clarify the situation?
> >
>
> No problem, i just posted a deprecation patch, please take a look.
> https://lore.kernel.org/qemu-devel/20240401035947.3310834-1-lizhijian@fujitsu.com/T/#u
>
> Thanks
> Zhijian
Jinpu Wang April 8, 2024, 2:07 p.m. UTC | #7
Hi Peter,

On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > Hello Peter und Zhjian,
> >
> > Thank you so much for letting me know about this. I'm also a bit surprised at
> > the plan for deprecating the RDMA migration subsystem.
>
> It's not too late, since it looks like we do have users not yet notified
> from this, we'll redo the deprecation procedure even if it'll be the final
> plan, and it'll be 2 releases after this.
>
> >
> > > IMHO it's more important to know whether there are still users and whether
> > > they would still like to see it around.
> >
> > > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > > obvious bugs being noticed too late.
> >
> > Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
> > for this part. As soon as 8.2 was released, I saw that many of the
> > migration test
> > cases failed and came to realize that there might be a bug between 8.1
> > and 8.2, but
> > was unable to confirm and report it quickly to you.
> >
> > The maintenance of this part could be too costly or difficult from
> > your point of view.
>
> It may or may not be too costly, it's just that we need real users of RDMA
> taking some care of it.  Having it broken easily for >1 releases definitely
> is a sign of lack of users.  It is an implication to the community that we
> should consider dropping some features so that we can get the best use of
> the community resources for the things that may have a broader audience.
>
> One thing majorly missing is a RDMA tester to guard all the merges to not
> break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> but just to sanity check the migration+rdma code running all fine.  RDMA
> taught us the lesson so we're requesting CI coverage for all other new
> features that will be merged at least for migration subsystem, so that we
> plan to not merge anything that is not covered by CI unless extremely
> necessary in the future.
>
> For sure CI is not the only missing part, but I'd say we should start with
> it, then someone should also take care of the code even if only in
> maintenance mode (no new feature to add on top).
>
> >
> > My concern is, this plan will forces a few QEMU users (not sure how
> > many) like us
> > either to stick to the RDMA migration by using an increasingly older
> > version of QEMU,
> > or to abandon the currently used RDMA migration.
>
> RDMA doesn't get new features anyway, if there's specific use case for RDMA
> migrations, would it work if such a scenario uses the old binary?  Is it
> possible to switch to the TCP protocol with some good NICs?
We have used RDMA migration with HCAs from Nvidia for years; our
experience is that RDMA migration works better than TCP (over IPoIB).

Switching back to TCP would lead us back to the old problems which were
solved by RDMA migration.

>
> Per our best knowledge, RDMA users are rare, and please let anyone know if
> you are aware of such users.  IIUC the major reason why RDMA stopped being
> the trend is because the network is not like ten years ago; I don't think I
> have good knowledge in RDMA at all nor network, but my understanding is
> it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
> little sense to maintain multiple protocols, considering RDMA migration
> code is so special so that it has the most custom code comparing to other
> protocols.
+cc some guys from Huawei.

I'm surprised that RDMA users are rare; I guess many are just
working with a different code base.
>
> Thanks,
>
> --
> Peter Xu

Thx!
Jinpu Wang
>
Jinpu Wang April 9, 2024, 7:32 a.m. UTC | #8
Hi Peter,

On Mon, Apr 8, 2024 at 6:18 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > Hi Peter,
>
> Jinpu,
>
> Thanks for joining the discussion.
>
> >
> > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > Hello Peter und Zhjian,
> > > >
> > > > Thank you so much for letting me know about this. I'm also a bit surprised at
> > > > the plan for deprecating the RDMA migration subsystem.
> > >
> > > It's not too late, since it looks like we do have users not yet notified
> > > from this, we'll redo the deprecation procedure even if it'll be the final
> > > plan, and it'll be 2 releases after this.
> > >
> > > >
> > > > > IMHO it's more important to know whether there are still users and whether
> > > > > they would still like to see it around.
> > > >
> > > > > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > > > > obvious bugs being noticed too late.
> > > >
> > > > Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
> > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > migration test
> > > > cases failed and came to realize that there might be a bug between 8.1
> > > > and 8.2, but
> > > > was unable to confirm and report it quickly to you.
> > > >
> > > > The maintenance of this part could be too costly or difficult from
> > > > your point of view.
> > >
> > > It may or may not be too costly, it's just that we need real users of RDMA
> > > taking some care of it.  Having it broken easily for >1 releases definitely
> > > is a sign of lack of users.  It is an implication to the community that we
> > > should consider dropping some features so that we can get the best use of
> > > the community resources for the things that may have a broader audience.
> > >
> > > One thing majorly missing is a RDMA tester to guard all the merges to not
> > > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > taught us the lesson so we're requesting CI coverage for all other new
> > > features that will be merged at least for migration subsystem, so that we
> > > plan to not merge anything that is not covered by CI unless extremely
> > > necessary in the future.
> > >
> > > For sure CI is not the only missing part, but I'd say we should start with
> > > it, then someone should also take care of the code even if only in
> > > maintenance mode (no new feature to add on top).
> > >
> > > >
> > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > many) like us
> > > > either to stick to the RDMA migration by using an increasingly older
> > > > version of QEMU,
> > > > or to abandon the currently used RDMA migration.
> > >
> > > RDMA doesn't get new features anyway, if there's specific use case for RDMA
> > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > possible to switch to the TCP protocol with some good NICs?
> > We have used rdma migration with HCA from Nvidia for years, our
> > experience is RDMA migration works better than tcp (over ipoib).
>
> Please bare with me, as I know little on rdma stuff.
>
> I'm actually pretty confused (and since a long time ago..) on why we need
> to operation with rdma contexts when ipoib seems to provide all the tcp
> layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> even if there's rdma/ib hardwares underneath?  Is it because of performance
> improvements so that we must use a separate path comparing to generic
> "tcp:" protocol here?
Using the RDMA protocol with IB verbs, we can leverage the full benefit of RDMA
by talking directly to the NIC, which bypasses kernel overhead and gives lower
CPU utilization and better performance.

IPoIB, on the other hand, is mainly for compatibility with applications that use
TCP, and can't get the full benefit of RDMA.  When you have mixed generations of
IB devices, there are performance issues on IPoIB: we've seen a 40G HCA reach
only 2 Gb/s on IPoIB, while raw RDMA can reach full line speed.

I just ran a simple iperf3 test via IPoIB and an ib_send_bw test on the same hosts:

iperf 3.9
Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
07:19:34 UTC 2024 x86_64
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Time: Tue, 09 Apr 2024 06:55:02 GMT
Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
      Cookie: cer2hexlldrowclq6izh7gbg5toviffqbcwt
      TCP MSS: 0 (default)
[  5] local 2a02:247f:401:4:2:0:a:3 port 5201 connected to
2a02:247f:401:4:2:0:b:3 port 41136
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting
0 seconds, 10 second test, tos 0
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.80 GBytes  15.5 Gbits/sec
[  5]   1.00-2.00   sec  1.85 GBytes  15.9 Gbits/sec
[  5]   2.00-3.00   sec  1.88 GBytes  16.2 Gbits/sec
[  5]   3.00-4.00   sec  1.87 GBytes  16.1 Gbits/sec
[  5]   4.00-5.00   sec  1.88 GBytes  16.2 Gbits/sec
[  5]   5.00-6.00   sec  1.93 GBytes  16.6 Gbits/sec
[  5]   6.00-7.00   sec  2.00 GBytes  17.2 Gbits/sec
[  5]   7.00-8.00   sec  1.93 GBytes  16.6 Gbits/sec
[  5]   8.00-9.00   sec  1.86 GBytes  16.0 Gbits/sec
[  5]   9.00-10.00  sec  1.95 GBytes  16.8 Gbits/sec
[  5]  10.00-10.04  sec  85.2 MBytes  17.3 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate
[  5] (sender statistics not available)
[  5]   0.00-10.04  sec  19.0 GBytes  16.3 Gbits/sec                  receiver
rcv_tcp_congestion cubic
iperf 3.9
Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
07:19:34 UTC 2024 x86_64
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
1 jwang@ps404a-3.stg:~$ sudo ib_send_bw -F -a

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF Device         : mlx5_0
 Number of qps   : 1 Transport type : IB
 Connection type : RC Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x24 QPN 0x0174 PSN 0x300138
 remote address: LID 0x17 QPN 0x004a PSN 0xc54d6f
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          1000             0.00               6.46       3.385977
 4          1000             0.00               10.38     2.721894
 8          1000             0.00               25.69     3.367830
 16         1000             0.00               41.46     2.716859
 32         1000             0.00               102.98    3.374577
 64         1000             0.00               206.12    3.377053
 128        1000             0.00               405.03    3.318007
 256        1000             0.00               821.52    3.364939
 512        1000             0.00               2150.78    4.404803
 1024       1000             0.00               4288.13    4.391044
 2048       1000             0.00               8518.25    4.361346
 4096       1000             0.00               11440.77    2.928836
 8192       1000             0.00               11526.45    1.475385
 16384      1000             0.00               11526.06    0.737668
 32768      1000             0.00               11524.86    0.368795
 65536      1000             0.00               11331.84    0.181309
 131072     1000             0.00               11524.75    0.092198
 262144     1000             0.00               11525.82    0.046103
 524288     1000             0.00               11524.70    0.023049
 1048576    1000             0.00               11510.84    0.011511
 2097152    1000             0.00               11524.58    0.005762
 4194304    1000             0.00               11514.26    0.002879
 8388608    1000             0.00               11511.01    0.001439
---------------------------------------------------------------------------------------

You can see that with IPoIB it reaches 16 Gb/s using TCP (1 stream,
131072-byte blocks), while with RDMA it reaches ~100 Gb/s at 4k+ message
sizes.


>
> >
> > Switching back to TCP will lead us to the old problems which was
> > solved by RDMA migration.
>
> Can you elaborate the problems, and why tcp won't work in this case?  They
> may not be directly relevant to the issue we're discussing, but I'm happy
> to learn more.
>
> What is the NICs you were testing before?  Did the test carry out with
> things like modern ones (50Gbps-200Gbps NICs), or the test was done when
> these hardwares are not common?
We use Mellanox/NVIDIA IB HCAs from 40 Gb/s to 200 Gb/s, mixed
generations, across the globe.
>
> Per my recent knowledge on the new Intel hardwares, at least the ones that
> support QPL, it's easy to achieve single core 50Gbps+.
In good cases, I've also seen 50 Gbps+ on Mellanox HCAs.
>
> https://lore.kernel.org/r/PH7PR11MB5941A91AC1E514BCC32896A6A3342@PH7PR11MB5941.namprd11.prod.outlook.com
>
> Quote from Yuan:
>
>   Yes, I use iperf3 to check the bandwidth for one core, the bandwith is 60Gbps.
>   [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>   [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
>   [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
>
>   And in the live migration test, a multifd thread's CPU utilization is almost 100%
>
> It boils down to what old problems were there with tcp first, though.
Yeah, this is the key reason we use RDMA (low CPU utilization and
better performance).
>
> >
> > >
> > > Per our best knowledge, RDMA users are rare, and please let anyone know if
> > > you are aware of such users.  IIUC the major reason why RDMA stopped being
> > > the trend is because the network is not like ten years ago; I don't think I
> > > have good knowledge in RDMA at all nor network, but my understanding is
> > > it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
> > > little sense to maintain multiple protocols, considering RDMA migration
> > > code is so special so that it has the most custom code comparing to other
> > > protocols.
> > +cc some guys from Huawei.
> >
> > I'm surprised RDMA users are rare,  I guess maybe many are just
> > working with different code base.
>
> Yes, please cc whoever might be interested (or surprised.. :) to know this,
> and let's be open to all possibilities.
>
> I don't think it makes sense if there're a lot of users of a feature then
> we deprecate that without a good reason.  However there's always the
> resource limitation issue we're facing, so it could still have the
> possibility that this gets deprecated if nobody is working on our upstream
> branch. Say, if people use private branches anyway to support rdma without
> collaborating upstream, keeping such feature upstream then may not make
> much sense either, unless there's some way to collaborate.  We'll see.

Is there a document/link about the unit tests/CI for the migration tests? Why
are those tests missing?
Is it hard or very special to set up an environment for that? Maybe we
can help in this regard.
>
> It seems there can still be people joining this discussion.  I'll hold off
> a bit on merging this patch to provide enough window for anyone to chim in.

Thx for discussion and understanding.


Jinpu Wang
>
> Thanks,
>
> > >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> >
> > Thx!
> > Jinpu Wang
> > >
> >
>
> --
> Peter Xu
>
Markus Armbruster April 9, 2024, 9 a.m. UTC | #9
Peter Xu <peterx@redhat.com> writes:

> On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
>> Hi Peter,
>
> Jinpu,
>
> Thanks for joining the discussion.
>
>> 
>> On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@redhat.com> wrote:
>> >
>> > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
>> > > Hello Peter und Zhjian,
>> > >
>> > > Thank you so much for letting me know about this. I'm also a bit surprised at
>> > > the plan for deprecating the RDMA migration subsystem.
>> >
>> > It's not too late, since it looks like we do have users not yet notified
>> > from this, we'll redo the deprecation procedure even if it'll be the final
>> > plan, and it'll be 2 releases after this.

[...]

>> > Per our best knowledge, RDMA users are rare, and please let anyone know if
>> > you are aware of such users.  IIUC the major reason why RDMA stopped being
>> > the trend is because the network is not like ten years ago; I don't think I
>> > have good knowledge in RDMA at all nor network, but my understanding is
>> > it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
>> > little sense to maintain multiple protocols, considering RDMA migration
>> > code is so special so that it has the most custom code comparing to other
>> > protocols.
>> +cc some guys from Huawei.
>> 
>> I'm surprised RDMA users are rare,  I guess maybe many are just
>> working with different code base.
>
> Yes, please cc whoever might be interested (or surprised.. :) to know this,
> and let's be open to all possibilities.
>
> I don't think it makes sense if there're a lot of users of a feature then
> we deprecate that without a good reason.  However there's always the
> resource limitation issue we're facing, so it could still have the
> possibility that this gets deprecated if nobody is working on our upstream
> branch. Say, if people use private branches anyway to support rdma without
> collaborating upstream, keeping such feature upstream then may not make
> much sense either, unless there's some way to collaborate.  We'll see.
>
> It seems there can still be people joining this discussion.  I'll hold off
> a bit on merging this patch to provide enough window for anyone to chim in.

Users are not enough.  Only maintainers are.

At some point, people cared enough about RDMA in QEMU to contribute the
code.  That's why we have the code.

To keep the code, we need people who care enough about RDMA in QEMU to
maintain it.  Without such people, the case for keeping it remains
dangerously weak, and no amount of talk or even benchmarks can change
that.
Peter Xu April 9, 2024, 7:46 p.m. UTC | #10
On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> Hi Peter,
> 
> On Mon, Apr 8, 2024 at 6:18 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > Hi Peter,
> >
> > Jinpu,
> >
> > Thanks for joining the discussion.
> >
> > >
> > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > Hello Peter und Zhjian,
> > > > >
> > > > > Thank you so much for letting me know about this. I'm also a bit surprised at
> > > > > the plan for deprecating the RDMA migration subsystem.
> > > >
> > > > It's not too late, since it looks like we do have users not yet notified
> > > > from this, we'll redo the deprecation procedure even if it'll be the final
> > > > plan, and it'll be 2 releases after this.
> > > >
> > > > >
> > > > > > IMHO it's more important to know whether there are still users and whether
> > > > > > they would still like to see it around.
> > > > >
> > > > > > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > > > > > obvious bugs being noticed too late.
> > > > >
> > > > > Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
> > > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > > migration test
> > > > > cases failed and came to realize that there might be a bug between 8.1
> > > > > and 8.2, but
> > > > > was unable to confirm and report it quickly to you.
> > > > >
> > > > > The maintenance of this part could be too costly or difficult from
> > > > > your point of view.
> > > >
> > > > It may or may not be too costly, it's just that we need real users of RDMA
> > > > taking some care of it.  Having it broken easily for >1 releases definitely
> > > > is a sign of lack of users.  It is an implication to the community that we
> > > > should consider dropping some features so that we can get the best use of
> > > > the community resources for the things that may have a broader audience.
> > > >
> > > > One thing majorly missing is a RDMA tester to guard all the merges to not
> > > > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > > taught us the lesson so we're requesting CI coverage for all other new
> > > > features that will be merged at least for migration subsystem, so that we
> > > > plan to not merge anything that is not covered by CI unless extremely
> > > > necessary in the future.
> > > >
> > > > For sure CI is not the only missing part, but I'd say we should start with
> > > > it, then someone should also take care of the code even if only in
> > > > maintenance mode (no new feature to add on top).
> > > >
> > > > >
> > > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > > many) like us
> > > > > either to stick to the RDMA migration by using an increasingly older
> > > > > version of QEMU,
> > > > > or to abandon the currently used RDMA migration.
> > > >
> > > > RDMA doesn't get new features anyway, if there's specific use case for RDMA
> > > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > > possible to switch to the TCP protocol with some good NICs?
> > > We have used rdma migration with HCA from Nvidia for years, our
> > > experience is RDMA migration works better than tcp (over ipoib).
> >
> > Please bare with me, as I know little on rdma stuff.
> >
> > I'm actually pretty confused (and since a long time ago..) on why we need
> > to operation with rdma contexts when ipoib seems to provide all the tcp
> > layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> > even if there's rdma/ib hardwares underneath?  Is it because of performance
> > improvements so that we must use a separate path comparing to generic
> > "tcp:" protocol here?
> using rdma protocol with ib verbs , we can leverage the full benefit of RDMA by
> talking directly to NIC which bypasses the kernel overhead, less cpu
> utilization and better performance.
> 
> While IPoIB is more for compatibility to  applications using tcp, but
> can't get full benefit of RDMA.  When you have mix generation of IB
> devices, there are performance issue on IPoIB, we've seen 40G HCA can
> only reach 2 Gb/s on IPoIB, but with raw RDMA can reach full line
> speed.
> 
> I just run a simple iperf3 test via ipoib and ib_send_bw on same hosts:
> 
> iperf 3.9
> Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> 07:19:34 UTC 2024 x86_64
> -----------------------------------------------------------
> Server listening on 5201
> -----------------------------------------------------------
> Time: Tue, 09 Apr 2024 06:55:02 GMT
> Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
>       Cookie: cer2hexlldrowclq6izh7gbg5toviffqbcwt
>       TCP MSS: 0 (default)
> [  5] local 2a02:247f:401:4:2:0:a:3 port 5201 connected to
> 2a02:247f:401:4:2:0:b:3 port 41136
> Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting
> 0 seconds, 10 second test, tos 0
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.80 GBytes  15.5 Gbits/sec
> [  5]   1.00-2.00   sec  1.85 GBytes  15.9 Gbits/sec
> [  5]   2.00-3.00   sec  1.88 GBytes  16.2 Gbits/sec
> [  5]   3.00-4.00   sec  1.87 GBytes  16.1 Gbits/sec
> [  5]   4.00-5.00   sec  1.88 GBytes  16.2 Gbits/sec
> [  5]   5.00-6.00   sec  1.93 GBytes  16.6 Gbits/sec
> [  5]   6.00-7.00   sec  2.00 GBytes  17.2 Gbits/sec
> [  5]   7.00-8.00   sec  1.93 GBytes  16.6 Gbits/sec
> [  5]   8.00-9.00   sec  1.86 GBytes  16.0 Gbits/sec
> [  5]   9.00-10.00  sec  1.95 GBytes  16.8 Gbits/sec
> [  5]  10.00-10.04  sec  85.2 MBytes  17.3 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> Test Complete. Summary Results:
> [ ID] Interval           Transfer     Bitrate
> [  5] (sender statistics not available)
> [  5]   0.00-10.04  sec  19.0 GBytes  16.3 Gbits/sec                  receiver
> rcv_tcp_congestion cubic
> iperf 3.9
> Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> 07:19:34 UTC 2024 x86_64
> -----------------------------------------------------------
> Server listening on 5201
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> 1 jwang@ps404a-3.stg:~$ sudo ib_send_bw -F -a
> 
> ************************************
> * Waiting for client to connect... *
> ************************************
> ---------------------------------------------------------------------------------------
>                     Send BW Test
>  Dual-port       : OFF Device         : mlx5_0
>  Number of qps   : 1 Transport type : IB
>  Connection type : RC Using SRQ      : OFF
>  PCIe relax order: ON
>  ibv_wr* API     : ON
>  RX depth        : 512
>  CQ Moderation   : 100
>  Mtu             : 4096[B]
>  Link type       : IB
>  Max inline data : 0[B]
>  rdma_cm QPs : OFF
>  Data ex. method : Ethernet
> ---------------------------------------------------------------------------------------
>  local address: LID 0x24 QPN 0x0174 PSN 0x300138
>  remote address: LID 0x17 QPN 0x004a PSN 0xc54d6f
> ---------------------------------------------------------------------------------------
>  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>  2          1000             0.00               6.46       3.385977
>  4          1000             0.00               10.38     2.721894
>  8          1000             0.00               25.69     3.367830
>  16         1000             0.00               41.46     2.716859
>  32         1000             0.00               102.98    3.374577
>  64         1000             0.00               206.12    3.377053
>  128        1000             0.00               405.03    3.318007
>  256        1000             0.00               821.52    3.364939
>  512        1000             0.00               2150.78    4.404803
>  1024       1000             0.00               4288.13    4.391044
>  2048       1000             0.00               8518.25    4.361346
>  4096       1000             0.00               11440.77    2.928836
>  8192       1000             0.00               11526.45    1.475385
>  16384      1000             0.00               11526.06    0.737668
>  32768      1000             0.00               11524.86    0.368795
>  65536      1000             0.00               11331.84    0.181309
>  131072     1000             0.00               11524.75    0.092198
>  262144     1000             0.00               11525.82    0.046103
>  524288     1000             0.00               11524.70    0.023049
>  1048576    1000             0.00               11510.84    0.011511
>  2097152    1000             0.00               11524.58    0.005762
>  4194304    1000             0.00               11514.26    0.002879
>  8388608    1000             0.00               11511.01    0.001439
> ---------------------------------------------------------------------------------------
> 
> you can see with ipoib, it reaches 16 Gb/s using TCP, 1 streams,
> 131072 byte blocks
> with RDMA at 4k+ message size it reaches 100 Gb/s

I get it now, thank you!

> 
> 
> >
> > >
> > > Switching back to TCP will lead us to the old problems which was
> > > solved by RDMA migration.
> >
> > Can you elaborate the problems, and why tcp won't work in this case?  They
> > may not be directly relevant to the issue we're discussing, but I'm happy
> > to learn more.
> >
> > What is the NICs you were testing before?  Did the test carry out with
> > things like modern ones (50Gbps-200Gbps NICs), or the test was done when
> > these hardwares are not common?
> We use Mellanox/NVidia IB HCA from 40 Gb/s to 200 Gb/s mixed
> generation across globe.
> >
> > Per my recent knowledge on the new Intel hardwares, at least the ones that
> > support QPL, it's easy to achieve single core 50Gbps+.
> In good case, I've also seen 50 Gbps + on Mellanox HCA.

I see. Have you compared the HCAs vs. modern NICs?  As I said, NICs can now
achieve similar performance on paper; I am not sure how they perform in real
life, but it may be worth trying.  I only tried a 100G NIC, and I remember I
could hit 70+ Gbps with multifd migrations at peak bandwidth.
Have you tried that before?

Note that here I didn't want to compare the performance between the two and
find a winner.  The issue we're facing now is that RDMA migration mostly has
its own path all over the place, while the rest of the protocols
(socket, fd, file, etc.) all share a common one.

Then, _if_ modern NICs can work similarly vs. RDMA, I don't yet see a good
reason to keep it.  It could be that technology just improved so we can use
less code to do as well.  It's good news for QEMU's evolution if we can drop
unused code.

For some details there on the rdma complications for migration:

  (1) RDMA is the only protocol that doesn't yet support QIOChannel, while
      migration uses QIOChannels mostly everywhere now, e.g. in multifd;
      it means RDMA won't easily support any new things built on QIOChannels.

  (2) RDMA is the only protocol that is hard-coded almost everywhere in the
      RAM migration code, polluting the core logic with much more code
      internally to support this protocol.

For (1), see migrate_fd_connect() from rdma_start_outgoing_migration().
While the rest protocols all go via migration_channel_connect().

For (2), see all the "rdma_*" functions in migration/ram.c, which is not
something a protocol normally needs - most of the other protocols don't need
such hard-coded hooks.  migration/rdma.c has 4000+ LOC for all this, while,
as a not-so-fair comparison, migration/fd.c has <100 LOC.
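
(To illustrate the kind of coupling meant in (2), with purely hypothetical
names rather than the real functions, the core RAM path effectively ends up
carrying protocol-specific branches like the following sketch:)

  /* Hypothetical sketch only; names do not match the actual QEMU code. */
  static int save_ram_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset)
  {
      if (migration_uses_rdma()) {
          /* RDMA-specific hook wired directly into the core RAM code */
          return rdma_save_page_hook(f, block, offset);
      }
      /* generic path shared by tcp/fd/file/... via QIOChannel */
      return generic_save_page(f, block, offset);
  }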

Then, we found we don't even know who's using it.

I hope I explained why people started this idea, and also why I think that
makes sense at least to me.

> >
> > https://lore.kernel.org/r/PH7PR11MB5941A91AC1E514BCC32896A6A3342@PH7PR11MB5941.namprd11.prod.outlook.com
> >
> > Quote from Yuan:
> >
> >   Yes, I use iperf3 to check the bandwidth for one core, the bandwith is 60Gbps.
> >   [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> >   [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> >   [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
> >
> >   And in the live migration test, a multifd thread's CPU utilization is almost 100%
> >
> > It boils down to what old problems were there with tcp first, though.
> Yeah, this is the key reason we use RDMA. (low cpu ulitization and
> better performance)
> >
> > >
> > > >
> > > > Per our best knowledge, RDMA users are rare, and please let anyone know if
> > > > you are aware of such users.  IIUC the major reason why RDMA stopped being
> > > > the trend is because the network is not like ten years ago; I don't think I
> > > > have good knowledge in RDMA at all nor network, but my understanding is
> > > > it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
> > > > little sense to maintain multiple protocols, considering RDMA migration
> > > > code is so special so that it has the most custom code comparing to other
> > > > protocols.
> > > +cc some guys from Huawei.
> > >
> > > I'm surprised RDMA users are rare,  I guess maybe many are just
> > > working with different code base.
> >
> > Yes, please cc whoever might be interested (or surprised.. :) to know this,
> > and let's be open to all possibilities.
> >
> > I don't think it makes sense if there're a lot of users of a feature then
> > we deprecate that without a good reason.  However there's always the
> > resource limitation issue we're facing, so it could still have the
> > possibility that this gets deprecated if nobody is working on our upstream
> > branch. Say, if people use private branches anyway to support rdma without
> > collaborating upstream, keeping such feature upstream then may not make
> > much sense either, unless there's some way to collaborate.  We'll see.
> 
> Is there document/link about the unittest/CI for migration tests, Why
> are those tests missing?
> Is it hard or very special to set up an environment for that? maybe we
> can help in this regards.

See tests/qtest/migration-test.c.  We put most of our migration tests
there and that's covered in CI.

I think one major issue is that CI systems don't normally have rdma devices.
Can an rdma migration test be carried out without real hardware?

> >
> > It seems there can still be people joining this discussion.  I'll hold off
> > a bit on merging this patch to provide enough window for anyone to chim in.
> 
> Thx for discussion and understanding.

Thanks for all these inputs so far.  These can help us make a wiser and
clearer step no matter which way we choose.
Xingtao Yao (Fujitsu) via April 10, 2024, 2:28 a.m. UTC | #11
on 4/10/2024 3:46 AM, Peter Xu wrote:

>> Is there document/link about the unittest/CI for migration tests, Why
>> are those tests missing?
>> Is it hard or very special to set up an environment for that? maybe we
>> can help in this regards.
> See tests/qtest/migration-test.c.  We put most of our migration tests
> there and that's covered in CI.
>
> I think one major issue is CI systems don't normally have rdma devices.
> Can rdma migration test be carried out without a real hardware?

Yeah, RXE (aka Soft-RoCE) is able to emulate RDMA. For example:
$ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
Then we get a new RDMA interface "rxe_eth0", which is able to do the QEMU
RDMA migration.

Also, the loopback (lo) device can back an emulated RDMA interface "rxe_lo";
however, when I tried (years ago) to do RDMA migration over this interface
(rdma:127.0.0.1:3333), something went wrong, so I gave up enabling the RDMA
migration qtest at that time.
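
For reference, once such an rxe link exists, the RDMA migration path can be
exercised roughly like this (addresses and ports below are just placeholders):

  # destination QEMU
  qemu-system-x86_64 ... -incoming rdma:192.168.0.10:4444
  # source QEMU, in the HMP monitor
  (qemu) migrate -d rdma:192.168.0.10:4444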



Thanks
Zhijian



     

>
>>> It seems there can still be people joining this discussion.  I'll hold off
>>> a bit on merging this patch to provide enough window for anyone to chim in.
>> Thx for discussion and understanding.
> Thanks for all these inputs so far.  These can help us make a wiser and
> clearer step no matter which way we choose.
Jinpu Wang April 11, 2024, 2:42 p.m. UTC | #12
Hi Peter,

On Tue, Apr 9, 2024 at 9:47 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> > Hi Peter,
> >
> > On Mon, Apr 8, 2024 at 6:18 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > > Hi Peter,
> > >
> > > Jinpu,
> > >
> > > Thanks for joining the discussion.
> > >
> > > >
> > > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > > Hello Peter und Zhjian,
> > > > > >
> > > > > > Thank you so much for letting me know about this. I'm also a bit surprised at
> > > > > > the plan for deprecating the RDMA migration subsystem.
> > > > >
> > > > > It's not too late, since it looks like we do have users not yet notified
> > > > > from this, we'll redo the deprecation procedure even if it'll be the final
> > > > > plan, and it'll be 2 releases after this.
> > > > >
> > > > > >
> > > > > > > IMHO it's more important to know whether there are still users and whether
> > > > > > > they would still like to see it around.
> > > > > >
> > > > > > > I admit RDMA migration was lack of testing(unit/CI test), which led to the a few
> > > > > > > obvious bugs being noticed too late.
> > > > > >
> > > > > > Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
> > > > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > > > migration test
> > > > > > cases failed and came to realize that there might be a bug between 8.1
> > > > > > and 8.2, but
> > > > > > was unable to confirm and report it quickly to you.
> > > > > >
> > > > > > The maintenance of this part could be too costly or difficult from
> > > > > > your point of view.
> > > > >
> > > > > It may or may not be too costly, it's just that we need real users of RDMA
> > > > > taking some care of it.  Having it broken easily for >1 releases definitely
> > > > > is a sign of lack of users.  It is an implication to the community that we
> > > > > should consider dropping some features so that we can get the best use of
> > > > > the community resources for the things that may have a broader audience.
> > > > >
> > > > > One thing majorly missing is a RDMA tester to guard all the merges to not
> > > > > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > > > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > > > taught us the lesson so we're requesting CI coverage for all other new
> > > > > features that will be merged at least for migration subsystem, so that we
> > > > > plan to not merge anything that is not covered by CI unless extremely
> > > > > necessary in the future.
> > > > >
> > > > > For sure CI is not the only missing part, but I'd say we should start with
> > > > > it, then someone should also take care of the code even if only in
> > > > > maintenance mode (no new feature to add on top).
> > > > >
> > > > > >
> > > > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > > > many) like us
> > > > > > either to stick to the RDMA migration by using an increasingly older
> > > > > > version of QEMU,
> > > > > > or to abandon the currently used RDMA migration.
> > > > >
> > > > > RDMA doesn't get new features anyway, if there's specific use case for RDMA
> > > > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > > > possible to switch to the TCP protocol with some good NICs?
> > > > We have used rdma migration with HCA from Nvidia for years, our
> > > > experience is RDMA migration works better than tcp (over ipoib).
> > >
> > > Please bare with me, as I know little on rdma stuff.
> > >
> > > I'm actually pretty confused (and since a long time ago..) on why we need
> > > to operation with rdma contexts when ipoib seems to provide all the tcp
> > > layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> > > even if there's rdma/ib hardwares underneath?  Is it because of performance
> > > improvements so that we must use a separate path comparing to generic
> > > "tcp:" protocol here?
> > using rdma protocol with ib verbs , we can leverage the full benefit of RDMA by
> > talking directly to NIC which bypasses the kernel overhead, less cpu
> > utilization and better performance.
> >
> > While IPoIB is more for compatibility to  applications using tcp, but
> > can't get full benefit of RDMA.  When you have mix generation of IB
> > devices, there are performance issue on IPoIB, we've seen 40G HCA can
> > only reach 2 Gb/s on IPoIB, but with raw RDMA can reach full line
> > speed.
> >
> > I just run a simple iperf3 test via ipoib and ib_send_bw on same hosts:
> >
> > iperf 3.9
> > Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> > 07:19:34 UTC 2024 x86_64
> > -----------------------------------------------------------
> > Server listening on 5201
> > -----------------------------------------------------------
> > Time: Tue, 09 Apr 2024 06:55:02 GMT
> > Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
> >       Cookie: cer2hexlldrowclq6izh7gbg5toviffqbcwt
> >       TCP MSS: 0 (default)
> > [  5] local 2a02:247f:401:4:2:0:a:3 port 5201 connected to
> > 2a02:247f:401:4:2:0:b:3 port 41136
> > Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting
> > 0 seconds, 10 second test, tos 0
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-1.00   sec  1.80 GBytes  15.5 Gbits/sec
> > [  5]   1.00-2.00   sec  1.85 GBytes  15.9 Gbits/sec
> > [  5]   2.00-3.00   sec  1.88 GBytes  16.2 Gbits/sec
> > [  5]   3.00-4.00   sec  1.87 GBytes  16.1 Gbits/sec
> > [  5]   4.00-5.00   sec  1.88 GBytes  16.2 Gbits/sec
> > [  5]   5.00-6.00   sec  1.93 GBytes  16.6 Gbits/sec
> > [  5]   6.00-7.00   sec  2.00 GBytes  17.2 Gbits/sec
> > [  5]   7.00-8.00   sec  1.93 GBytes  16.6 Gbits/sec
> > [  5]   8.00-9.00   sec  1.86 GBytes  16.0 Gbits/sec
> > [  5]   9.00-10.00  sec  1.95 GBytes  16.8 Gbits/sec
> > [  5]  10.00-10.04  sec  85.2 MBytes  17.3 Gbits/sec
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > Test Complete. Summary Results:
> > [ ID] Interval           Transfer     Bitrate
> > [  5] (sender statistics not available)
> > [  5]   0.00-10.04  sec  19.0 GBytes  16.3 Gbits/sec                  receiver
> > rcv_tcp_congestion cubic
> > iperf 3.9
> > Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> > 07:19:34 UTC 2024 x86_64
> > -----------------------------------------------------------
> > Server listening on 5201
> > -----------------------------------------------------------
> > ^Ciperf3: interrupt - the server has terminated
> > 1 jwang@ps404a-3.stg:~$ sudo ib_send_bw -F -a
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > ---------------------------------------------------------------------------------------
> >                     Send BW Test
> >  Dual-port       : OFF Device         : mlx5_0
> >  Number of qps   : 1 Transport type : IB
> >  Connection type : RC Using SRQ      : OFF
> >  PCIe relax order: ON
> >  ibv_wr* API     : ON
> >  RX depth        : 512
> >  CQ Moderation   : 100
> >  Mtu             : 4096[B]
> >  Link type       : IB
> >  Max inline data : 0[B]
> >  rdma_cm QPs : OFF
> >  Data ex. method : Ethernet
> > ---------------------------------------------------------------------------------------
> >  local address: LID 0x24 QPN 0x0174 PSN 0x300138
> >  remote address: LID 0x17 QPN 0x004a PSN 0xc54d6f
> > ---------------------------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> >  2          1000             0.00               6.46       3.385977
> >  4          1000             0.00               10.38     2.721894
> >  8          1000             0.00               25.69     3.367830
> >  16         1000             0.00               41.46     2.716859
> >  32         1000             0.00               102.98    3.374577
> >  64         1000             0.00               206.12    3.377053
> >  128        1000             0.00               405.03    3.318007
> >  256        1000             0.00               821.52    3.364939
> >  512        1000             0.00               2150.78    4.404803
> >  1024       1000             0.00               4288.13    4.391044
> >  2048       1000             0.00               8518.25    4.361346
> >  4096       1000             0.00               11440.77    2.928836
> >  8192       1000             0.00               11526.45    1.475385
> >  16384      1000             0.00               11526.06    0.737668
> >  32768      1000             0.00               11524.86    0.368795
> >  65536      1000             0.00               11331.84    0.181309
> >  131072     1000             0.00               11524.75    0.092198
> >  262144     1000             0.00               11525.82    0.046103
> >  524288     1000             0.00               11524.70    0.023049
> >  1048576    1000             0.00               11510.84    0.011511
> >  2097152    1000             0.00               11524.58    0.005762
> >  4194304    1000             0.00               11514.26    0.002879
> >  8388608    1000             0.00               11511.01    0.001439
> > ---------------------------------------------------------------------------------------
> >
> > you can see with ipoib, it reaches 16 Gb/s using TCP, 1 streams,
> > 131072 byte blocks
> > with RDMA at 4k+ message size it reaches 100 Gb/s
>
> I get it now, thank you!
>
> >
> >
> > >
> > > >
> > > > Switching back to TCP will lead us to the old problems which was
> > > > solved by RDMA migration.
> > >
> > > Can you elaborate the problems, and why tcp won't work in this case?  They
> > > may not be directly relevant to the issue we're discussing, but I'm happy
> > > to learn more.
> > >
> > > What is the NICs you were testing before?  Did the test carry out with
> > > things like modern ones (50Gbps-200Gbps NICs), or the test was done when
> > > these hardwares are not common?
> > We use Mellanox/NVidia IB HCA from 40 Gb/s to 200 Gb/s mixed
> > generation across globe.
> > >
> > > Per my recent knowledge on the new Intel hardwares, at least the ones that
> > > support QPL, it's easy to achieve single core 50Gbps+.
> > In good case, I've also seen 50 Gbps + on Mellanox HCA.
>
> I see. Have you compared the HCAs v.s. the modern NICs?  Now NICs can
> achieve similar performance from their spec as I said; I am not sure how
> they perform in real life, but maybe worth trying.  I only tried 100G nic
> and I rem I can hit 70+Gbps with multifd migrations at peak bandwidth.
> Have you tried that before?
Yes, I recently tried a 100 G Ethernet NIC, but so far only with iperf, not
yet with qemu migration.
iperf can reach 90 Gbps with multiple streams.
>
> Note that here I didn't want to compare the performance between the two and
> find a winner.  The issue we're facing now is we have the RDMA migration
> now mostly having its own path all over the place, while the rest protocols
> (socket, fd, file, etc.) all share the rest.
>
> Then, _if_ modern NICs can work similarly v.s. rdma, I don't yet see a good
> reason to keep it.  It could be that technology just improved so we can use
> less code to do as good.  It's a good news to help QEMU evolve by dropping
> unused code.
>
> For some details there on the rdma complications for migration:
>
>   (1) RDMA is the only protocol that doesn't yet support QIOChannel, while
>       migration uses QIOChannels mostly everywhere now.. e.g. in multifd,
>       it means it won't easily support any new things using QIOChannels.
>
>   (2) RDMA is the only protocol that mostly hard-coded everywhere in the
>       RAM migrations, polluting the core logic with much more code
>       internally to support this protocol.
>
> For (1), see migrate_fd_connect() from rdma_start_outgoing_migration().
> While the rest protocols all go via migration_channel_connect().
>
> For (2), see all the "rdma_*" functions in migration/ram.c, where I don't
> think it's common to a protocol - most of the rest protocols don't need
> those hard-coded stuff.  migration/rdma.c has 4000+ LOC for these stuff,
> while to do a not-so-fair comparison, migration/fd.c only has <100 LOC.
>
> Then, we found we don't even know who's using it.
>
> I hope I explained why people started this idea, and also why I think that
> makes sense at least to me.
Yes, I can understand that rdma migration has become more of a burden for
upstream maintainers.
>
> > >
> > > https://lore.kernel.org/r/PH7PR11MB5941A91AC1E514BCC32896A6A3342@PH7PR11MB5941.namprd11.prod.outlook.com
> > >
> > > Quote from Yuan:
> > >
> > >   Yes, I use iperf3 to check the bandwidth for one core, the bandwith is 60Gbps.
> > >   [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > >   [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > >   [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
> > >
> > >   And in the live migration test, a multifd thread's CPU utilization is almost 100%
> > >
> > > It boils down to what old problems were there with tcp first, though.
> > Yeah, this is the key reason we use RDMA. (low cpu utilization and
> > better performance)
> > >
> > > >
> > > > >
> > > > > Per our best knowledge, RDMA users are rare, and please let anyone know if
> > > > > you are aware of such users.  IIUC the major reason why RDMA stopped being
> > > > > the trend is because the network is not like ten years ago; I don't think I
> > > > > have good knowledge in RDMA at all nor network, but my understanding is
> > > > > it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
> > > > > little sense to maintain multiple protocols, considering RDMA migration
> > > > > code is so special so that it has the most custom code comparing to other
> > > > > protocols.
> > > > +cc some guys from Huawei.
> > > >
> > > > I'm surprised RDMA users are rare,  I guess maybe many are just
> > > > working with different code base.
> > >
> > > Yes, please cc whoever might be interested (or surprised.. :) to know this,
> > > and let's be open to all possibilities.
> > >
> > > I don't think it makes sense if there're a lot of users of a feature then
> > > we deprecate that without a good reason.  However there's always the
> > > resource limitation issue we're facing, so it could still have the
> > > possibility that this gets deprecated if nobody is working on our upstream
> > > branch. Say, if people use private branches anyway to support rdma without
> > > collaborating upstream, keeping such feature upstream then may not make
> > > much sense either, unless there's some way to collaborate.  We'll see.
> >
> > Is there document/link about the unittest/CI for migration tests, Why
> > are those tests missing?
> > Is it hard or very special to set up an environment for that? maybe we
> > can help in this regards.
>
> See tests/qtest/migration-test.c.  We put most of our migration tests
> there and that's covered in CI.
Yu is looking into that to see if we can run the CI on our side.
>
> I think one major issue is CI systems don't normally have rdma devices.
> Can rdma migration test be carried out without a real hardware?
As Zhijian mentioned, we can use SoftRoCE (rxe).
>
> > >
> > > It seems there can still be people joining this discussion.  I'll hold off
> > > a bit on merging this patch to provide enough window for anyone to chim in.
> >
> > Thx for discussion and understanding.
>
> Thanks for all these inputs so far.  These can help us make a wiser and
> clearer step no matter which way we choose.
>
> --
> Peter Xu
>
Thx!
Yu Zhang April 11, 2024, 4:36 p.m. UTC | #13
> 1) Either a CI test covering at least the major RDMA paths, or at least
>     periodically tests for each QEMU release will be needed.
We use a batch of regression test cases for the stack, which covers the
tests for QEMU. I ran such tests for most of the QEMU releases planned as
candidates for rollout.

The migration test needs a pair of (either physical or virtual) servers with
an InfiniBand network, which makes it difficult to do on a single server. A
nested VM could be a possible approach, for which we may need a virtual
InfiniBand network. Is SoftRoCE [1] an option? I will try it and let you know.

[1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce

Thanks and best regards!

On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
> > >
> > >
> > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > >
> > > >> Is there document/link about the unittest/CI for migration tests, Why
> > > >> are those tests missing?
> > > >> Is it hard or very special to set up an environment for that? maybe we
> > > >> can help in this regards.
> > > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > > there and that's covered in CI.
> > > >
> > > > I think one major issue is CI systems don't normally have rdma devices.
> > > > Can rdma migration test be carried out without a real hardware?
> > >
> > > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > then we can get a new RDMA interface "rxe_eth0".
> > > This new RDMA interface is able to do the QEMU RDMA migration.
> > >
> > > Also, the loopback(lo) device is able to emulate the RDMA interface
> > > "rxe_lo", however when
> > > I tried(years ago) to do RDMA migration over this
> > > interface(rdma:127.0.0.1:3333) , it got something wrong.
> > > So i gave up enabling the RDMA migration qtest at that time.
> >
> > Thanks, Zhijian.
> >
> > I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> > Maybe someone more familiar with how CI works can chim in.
>
> Some people got dropped on the cc list for unknown reason, I'm adding them
> back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
> accident.
>
> I'll try to summarize what is still missing, and I think these will be
> greatly helpful if we don't want to deprecate rdma migration:
>
>   1) Either a CI test covering at least the major RDMA paths, or at least
>      periodically tests for each QEMU release will be needed.
>
>   2) Some performance tests between modern RDMA and NIC devices are
>      welcomed.  The current knowledge is modern NIC can work similarly to
>      RDMA in performance, then it's debatable why we still maintain so much
>      rdma specific code.
>
>   3) No need to be soild patchsets for this one, but some plan to improve
>      RDMA migration code so that it is not almost isolated from the rest
>      protocols.
>
>   4) Someone to look after this code for real.
>
> For 2) and 3) more info is here:
>
> https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n
>
> Here 4) can be the most important as Markus pointed out.  We just didn't
> get there yet on the discussions, but maybe Markus is right that we should
> talk that first.
>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu April 12, 2024, 2:04 p.m. UTC | #14
Yu,

On Thu, Apr 11, 2024 at 06:36:54PM +0200, Yu Zhang wrote:
> > 1) Either a CI test covering at least the major RDMA paths, or at least
> >     periodically tests for each QEMU release will be needed.
> We use a batch of regression test cases for the stack, which covers the
> test for QEMU. I did such test for most of the QEMU releases planned as
> candidates for rollout.

The least I can think of is a few tests for each release.  That is
definitely too little if a single release can already break things.

> 
> The migration test needs a pair of (either physical or virtual) servers with
> InfiniBand network, which makes it difficult to do on a single server. The
> nested VM could be a possible approach, for which we may need virtual
> InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
> 
> [1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce

Does it require a kernel driver?  The fewer host kernel / hardware /
.. dependencies, the better.

I am wondering whether there could be a library doing everything in
userspace, translating RDMA into e.g. socket messages (so maybe ultimately
that's something like IP->rdma->IP.. just to cover the "rdma" procedures);
then that would work reliably for CI.

Please also see my full list, though, especially entry 4).  Thanks already
for looking for solutions on the tests, but I don't want to waste your time
only to find that tests are not enough even once they are ready.  I think we
need people who understand this stuff well enough, have dedicated time, and
will look after it.

Thanks,
Michael Galaxy April 29, 2024, 1:08 p.m. UTC | #15
Hi All (and Peter),

My name is Michael Galaxy (formerly Hines). Yes, I changed my last name 
(highly irregular for a male) and yes, that's my real last name: 
https://www.linkedin.com/in/mrgalaxy/)

I'm the original author of the RDMA implementation. I've been discussing 
with Yu Zhang for a little bit about potentially handing over 
maintainership of the codebase to his team.

I simply have zero access to RoCE or Infiniband hardware at all,
unfortunately, so I've never been able to run tests or use what I wrote
at work, and as all of you know, if you don't have a way to test
something, then you can't maintain it.

Yu Zhang put a (very kind) proposal forward to me to ask the community 
if they feel comfortable training his team to maintain the codebase (and 
run tests) while they learn about it.

If you don't mind, I'd like to let him send over his (very detailed) 
proposal,

- Michael

On 4/11/24 11:36, Yu Zhang wrote:
>> 1) Either a CI test covering at least the major RDMA paths, or at least
>>      periodically tests for each QEMU release will be needed.
> We use a batch of regression test cases for the stack, which covers the
> test for QEMU. I did such test for most of the QEMU releases planned as
> candidates for rollout.
>
> The migration test needs a pair of (either physical or virtual) servers with
> InfiniBand network, which makes it difficult to do on a single server. The
> nested VM could be a possible approach, for which we may need virtual
> InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
>
> [1]  https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$
>
> Thanks and best regards!
>
> On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx@redhat.com> wrote:
>> On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
>>> On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
>>>>
>>>> on 4/10/2024 3:46 AM, Peter Xu wrote:
>>>>
>>>>>> Is there document/link about the unittest/CI for migration tests, Why
>>>>>> are those tests missing?
>>>>>> Is it hard or very special to set up an environment for that? maybe we
>>>>>> can help in this regards.
>>>>> See tests/qtest/migration-test.c.  We put most of our migration tests
>>>>> there and that's covered in CI.
>>>>>
>>>>> I think one major issue is CI systems don't normally have rdma devices.
>>>>> Can rdma migration test be carried out without a real hardware?
>>>> Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
>>>> $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
>>>> then we can get a new RDMA interface "rxe_eth0".
>>>> This new RDMA interface is able to do the QEMU RDMA migration.
>>>>
>>>> Also, the loopback(lo) device is able to emulate the RDMA interface
>>>> "rxe_lo", however when
>>>> I tried(years ago) to do RDMA migration over this
>>>> interface(rdma:127.0.0.1:3333) , it got something wrong.
>>>> So i gave up enabling the RDMA migration qtest at that time.
>>> Thanks, Zhijian.
>>>
>>> I'm not sure adding an emu-link for rdma is doable for CI systems, though.
>>> Maybe someone more familiar with how CI works can chim in.
>> Some people got dropped on the cc list for unknown reason, I'm adding them
>> back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
>> accident.
>>
>> I'll try to summarize what is still missing, and I think these will be
>> greatly helpful if we don't want to deprecate rdma migration:
>>
>>    1) Either a CI test covering at least the major RDMA paths, or at least
>>       periodically tests for each QEMU release will be needed.
>>
>>    2) Some performance tests between modern RDMA and NIC devices are
>>       welcomed.  The current knowledge is modern NIC can work similarly to
>>       RDMA in performance, then it's debatable why we still maintain so much
>>       rdma specific code.
>>
>>    3) No need to be soild patchsets for this one, but some plan to improve
>>       RDMA migration code so that it is not almost isolated from the rest
>>       protocols.
>>
>>    4) Someone to look after this code for real.
>>
>> For 2) and 3) more info is here:
>>
>> https://urldefense.com/v3/__https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWpIWYBhQ$
>>
>> Here 4) can be the most important as Markus pointed out.  We just didn't
>> get there yet on the discussions, but maybe Markus is right that we should
>> talk that first.
>>
>> Thanks,
>>
>> --
>> Peter Xu
>>
Peter Xu April 29, 2024, 2:56 p.m. UTC | #16
On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> Hi All (and Peter),

Hi, Michael,

> 
> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> (highly irregular for a male) and yes, that's my real last name:
> https://www.linkedin.com/in/mrgalaxy/)
> 
> I'm the original author of the RDMA implementation. I've been discussing
> with Yu Zhang for a little bit about potentially handing over maintainership
> of the codebase to his team.
> 
> I simply have zero access to RoCE or Infiniband hardware at all,
> unfortunately. so I've never been able to run tests or use what I wrote at
> work, and as all of you know, if you don't have a way to test something,
> then you can't maintain it.
> 
> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> they feel comfortable training his team to maintain the codebase (and run
> tests) while they learn about it.

The "while learning" part is fine at least to me.  IMHO the "ownership" to
the code, or say, taking over the responsibility, may or may not need 100%
mastering the code base first.  There should still be some fundamental
confidence to work on the code though as a starting point, then it's about
serious use case to back this up, and careful testings while getting more
familiar with it.

> 
> If you don't mind, I'd like to let him send over his (very detailed)
> proposal,

Yes please, it's exactly the time to share the plan.  The hope is that we
reach a consensus before or around the middle of this release (9.1).
Normally QEMU has a 3~4 month window for each release and the 9.1 schedule is
not yet out, but I think it means we make a decision before or around the
middle of June.

Thanks,

> 
> - Michael
> 
> On 4/11/24 11:36, Yu Zhang wrote:
> > > 1) Either a CI test covering at least the major RDMA paths, or at least
> > >      periodically tests for each QEMU release will be needed.
> > We use a batch of regression test cases for the stack, which covers the
> > test for QEMU. I did such test for most of the QEMU releases planned as
> > candidates for rollout.
> > 
> > The migration test needs a pair of (either physical or virtual) servers with
> > InfiniBand network, which makes it difficult to do on a single server. The
> > nested VM could be a possible approach, for which we may need virtual
> > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
> > 
> > [1]  https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$
> > 
> > Thanks and best regards!
> > 
> > On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > > > On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
> > > > > 
> > > > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > > > > 
> > > > > > > Is there document/link about the unittest/CI for migration tests, Why
> > > > > > > are those tests missing?
> > > > > > > Is it hard or very special to set up an environment for that? maybe we
> > > > > > > can help in this regards.
> > > > > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > > > > there and that's covered in CI.
> > > > > > 
> > > > > > I think one major issue is CI systems don't normally have rdma devices.
> > > > > > Can rdma migration test be carried out without a real hardware?
> > > > > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > > > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > > > then we can get a new RDMA interface "rxe_eth0".
> > > > > This new RDMA interface is able to do the QEMU RDMA migration.
> > > > > 
> > > > > Also, the loopback(lo) device is able to emulate the RDMA interface
> > > > > "rxe_lo", however when
> > > > > I tried(years ago) to do RDMA migration over this
> > > > > interface(rdma:127.0.0.1:3333) , it got something wrong.
> > > > > So i gave up enabling the RDMA migration qtest at that time.
> > > > Thanks, Zhijian.
> > > > 
> > > > I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> > > > Maybe someone more familiar with how CI works can chim in.
> > > Some people got dropped on the cc list for unknown reason, I'm adding them
> > > back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
> > > accident.
> > > 
> > > I'll try to summarize what is still missing, and I think these will be
> > > greatly helpful if we don't want to deprecate rdma migration:
> > > 
> > >    1) Either a CI test covering at least the major RDMA paths, or at least
> > >       periodically tests for each QEMU release will be needed.
> > > 
> > >    2) Some performance tests between modern RDMA and NIC devices are
> > >       welcomed.  The current knowledge is modern NIC can work similarly to
> > >       RDMA in performance, then it's debatable why we still maintain so much
> > >       rdma specific code.
> > > 
> > >    3) No need to be soild patchsets for this one, but some plan to improve
> > >       RDMA migration code so that it is not almost isolated from the rest
> > >       protocols.
> > > 
> > >    4) Someone to look after this code for real.
> > > 
> > > For 2) and 3) more info is here:
> > > 
> > > https://urldefense.com/v3/__https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWpIWYBhQ$
> > > 
> > > Here 4) can be the most important as Markus pointed out.  We just didn't
> > > get there yet on the discussions, but maybe Markus is right that we should
> > > talk that first.
> > > 
> > > Thanks,
> > > 
> > > --
> > > Peter Xu
> > > 
>
Yu Zhang April 29, 2024, 8:45 p.m. UTC | #17
Hello Michael and Peter,

We are very glad about your quick and kind reply regarding our plan to take
over the maintenance of your code. This message presents our plan for
working together.
If we are granted the maintainer's role, our plan is:

1. Create the necessary unit-test cases and get them integrated into
the current QEMU GitLab CI pipeline (a rough sketch of such a test is
given after this list)
2. Review and test code changes from other developers to ensure that
nothing is broken in the changed code before it is merged by the
community
3. Based on our current practice and application scenario, look for
possible improvements when necessary
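
To make item 1 a bit more concrete, below is a minimal sketch of what an
rxe-backed RDMA test could look like in tests/qtest/migration-test.c. The
helper names (MigrateCommon, test_precopy_common, migration_test_add) follow
that file's conventions as we understand them, and the SoftRoCE probe and the
addresses are placeholders, so please treat this as an illustration rather
than working code:

  /* Sketch only: register an RDMA precopy test when a SoftRoCE (rxe) link,
   * e.g. created with "rdma link add rxe0 type rxe netdev eth0", is present. */
  static bool rdma_link_available(void)
  {
      /* Assumption: an rdma device shows up under sysfs once rxe is set up. */
      return g_file_test("/sys/class/infiniband", G_FILE_TEST_IS_DIR);
  }

  static void test_precopy_rdma_plain(void)
  {
      MigrateCommon args = {
          .listen_uri  = "rdma:192.168.1.1:4444",   /* placeholder address */
          .connect_uri = "rdma:192.168.1.1:4444",
      };

      test_precopy_common(&args);
  }

  /* In main(), alongside the existing test registrations: */
  if (rdma_link_available()) {
      migration_test_add("/migration/precopy/rdma/plain", test_precopy_rdma_plain);
  }

The last hunk obviously cannot sit at file scope as written; it is only meant
to show where the availability guard would go.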

Besides that, a patch is attached to announce this change in the community.

With your generous support, we hope that the development community
will make a positive decision for us.

Kind regards,
Yu Zhang@ IONOS Cloud

On Mon, Apr 29, 2024 at 4:57 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > Hi All (and Peter),
>
> Hi, Michael,
>
> >
> > My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> > (highly irregular for a male) and yes, that's my real last name:
> > https://www.linkedin.com/in/mrgalaxy/)
> >
> > I'm the original author of the RDMA implementation. I've been discussing
> > with Yu Zhang for a little bit about potentially handing over maintainership
> > of the codebase to his team.
> >
> > I simply have zero access to RoCE or Infiniband hardware at all,
> > unfortunately. so I've never been able to run tests or use what I wrote at
> > work, and as all of you know, if you don't have a way to test something,
> > then you can't maintain it.
> >
> > Yu Zhang put a (very kind) proposal forward to me to ask the community if
> > they feel comfortable training his team to maintain the codebase (and run
> > tests) while they learn about it.
>
> The "while learning" part is fine at least to me.  IMHO the "ownership" to
> the code, or say, taking over the responsibility, may or may not need 100%
> mastering the code base first.  There should still be some fundamental
> confidence to work on the code though as a starting point, then it's about
> serious use case to back this up, and careful testings while getting more
> familiar with it.
>
> >
> > If you don't mind, I'd like to let him send over his (very detailed)
> > proposal,
>
> Yes please, it's exactly the time to share the plan.  The hope is we try to
> reach a consensus before or around the middle of this release (9.1).
> Normally QEMU has a 3~4 months window for each release and 9.1 schedule is
> not yet out, but I think it means we make a decision before or around
> middle of June.
>
> Thanks,
>
> >
> > - Michael
> >
> > On 4/11/24 11:36, Yu Zhang wrote:
> > > > 1) Either a CI test covering at least the major RDMA paths, or at least
> > > >      periodically tests for each QEMU release will be needed.
> > > We use a batch of regression test cases for the stack, which covers the
> > > test for QEMU. I did such test for most of the QEMU releases planned as
> > > candidates for rollout.
> > >
> > > The migration test needs a pair of (either physical or virtual) servers with
> > > InfiniBand network, which makes it difficult to do on a single server. The
> > > nested VM could be a possible approach, for which we may need virtual
> > > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
> > >
> > > [1]  https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$
> > >
> > > Thanks and best regards!
> > >
> > > On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx@redhat.com> wrote:
> > > > On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > > > > On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
> > > > > >
> > > > > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > > > > >
> > > > > > > > Is there document/link about the unittest/CI for migration tests, Why
> > > > > > > > are those tests missing?
> > > > > > > > Is it hard or very special to set up an environment for that? maybe we
> > > > > > > > can help in this regards.
> > > > > > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > > > > > there and that's covered in CI.
> > > > > > >
> > > > > > > I think one major issue is CI systems don't normally have rdma devices.
> > > > > > > Can rdma migration test be carried out without a real hardware?
> > > > > > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > > > > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > > > > then we can get a new RDMA interface "rxe_eth0".
> > > > > > This new RDMA interface is able to do the QEMU RDMA migration.
> > > > > >
> > > > > > Also, the loopback(lo) device is able to emulate the RDMA interface
> > > > > > "rxe_lo", however when
> > > > > > I tried(years ago) to do RDMA migration over this
> > > > > > interface(rdma:127.0.0.1:3333) , it got something wrong.
> > > > > > So i gave up enabling the RDMA migration qtest at that time.
> > > > > Thanks, Zhijian.
> > > > >
> > > > > I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> > > > > Maybe someone more familiar with how CI works can chim in.
> > > > Some people got dropped on the cc list for unknown reason, I'm adding them
> > > > back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
> > > > accident.
> > > >
> > > > I'll try to summarize what is still missing, and I think these will be
> > > > greatly helpful if we don't want to deprecate rdma migration:
> > > >
> > > >    1) Either a CI test covering at least the major RDMA paths, or at least
> > > >       periodically tests for each QEMU release will be needed.
> > > >
> > > >    2) Some performance tests between modern RDMA and NIC devices are
> > > >       welcomed.  The current knowledge is modern NIC can work similarly to
> > > >       RDMA in performance, then it's debatable why we still maintain so much
> > > >       rdma specific code.
> > > >
> > > >    3) No need to be soild patchsets for this one, but some plan to improve
> > > >       RDMA migration code so that it is not almost isolated from the rest
> > > >       protocols.
> > > >
> > > >    4) Someone to look after this code for real.
> > > >
> > > > For 2) and 3) more info is here:
> > > >
> > > > https://urldefense.com/v3/__https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWpIWYBhQ$
> > > >
> > > > Here 4) can be the most important as Markus pointed out.  We just didn't
> > > > get there yet on the discussions, but maybe Markus is right that we should
> > > > talk that first.
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Peter Xu
> > > >
> >
>
> --
> Peter Xu
>
Michael Galaxy April 29, 2024, 8:56 p.m. UTC | #18
Reviewed-by: Michael Galaxy <mgalaxy@akamai.com>

Thanks Yu Zhang and Peter.

- Michael

On 4/29/24 15:45, Yu Zhang wrote:
> Hello Michael and Peter,
>
> We are very glad at your quick and kind reply about our plan to take
> over the maintenance of your code. The message is for presenting our
> plan and working together.
> If we were able to obtain the maintainer's role, our plan is:
>
> 1. Create the necessary unit-test cases and get them integrated into
> the current QEMU GitLab-CI pipeline
> 2. Review and test the code changes by other developers to ensure that
> nothing is broken in the changed code before being merged by the
> community
> 3. Based on our current practice and application scenario, look for
> possible improvements when necessary
>
> Besides that, a patch is attached to announce this change in the community.
>
> With your generous support, we hope that the development community
> will make a positive decision for us.
>
> Kind regards,
> Yu Zhang@ IONOS Cloud
>
> On Mon, Apr 29, 2024 at 4:57 PM Peter Xu <peterx@redhat.com> wrote:
>> On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
>>> Hi All (and Peter),
>> Hi, Michael,
>>
>>> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
>>> (highly irregular for a male) and yes, that's my real last name:
>>> https://urldefense.com/v3/__https://www.linkedin.com/in/mrgalaxy/__;!!GjvTz_vk!TZmnCE90EK692dSjZGr-2cpOEZBQTBsTO2bW5z3rSbpZgNVCexZkxwDXhmIOWG2GAKZAUovQ5xe5coQ$ )
>>>
>>> I'm the original author of the RDMA implementation. I've been discussing
>>> with Yu Zhang for a little bit about potentially handing over maintainership
>>> of the codebase to his team.
>>>
>>> I simply have zero access to RoCE or Infiniband hardware at all,
>>> unfortunately. so I've never been able to run tests or use what I wrote at
>>> work, and as all of you know, if you don't have a way to test something,
>>> then you can't maintain it.
>>>
>>> Yu Zhang put a (very kind) proposal forward to me to ask the community if
>>> they feel comfortable training his team to maintain the codebase (and run
>>> tests) while they learn about it.
>> The "while learning" part is fine at least to me.  IMHO the "ownership" to
>> the code, or say, taking over the responsibility, may or may not need 100%
>> mastering the code base first.  There should still be some fundamental
>> confidence to work on the code though as a starting point, then it's about
>> serious use case to back this up, and careful testings while getting more
>> familiar with it.
>>
>>> If you don't mind, I'd like to let him send over his (very detailed)
>>> proposal,
>> Yes please, it's exactly the time to share the plan.  The hope is we try to
>> reach a consensus before or around the middle of this release (9.1).
>> Normally QEMU has a 3~4 months window for each release and 9.1 schedule is
>> not yet out, but I think it means we make a decision before or around
>> middle of June.
>>
>> Thanks,
>>
>>> - Michael
>>>
>>> On 4/11/24 11:36, Yu Zhang wrote:
>>>>> 1) Either a CI test covering at least the major RDMA paths, or at least
>>>>>       periodically tests for each QEMU release will be needed.
>>>> We use a batch of regression test cases for the stack, which covers the
>>>> test for QEMU. I did such test for most of the QEMU releases planned as
>>>> candidates for rollout.
>>>>
>>>> The migration test needs a pair of (either physical or virtual) servers with
>>>> InfiniBand network, which makes it difficult to do on a single server. The
>>>> nested VM could be a possible approach, for which we may need virtual
>>>> InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
>>>>
>>>> [1]  https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$
>>>>
>>>> Thanks and best regards!
>>>>
>>>> On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx@redhat.com> wrote:
>>>>> On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
>>>>>> On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
>>>>>>> on 4/10/2024 3:46 AM, Peter Xu wrote:
>>>>>>>
>>>>>>>>> Is there document/link about the unittest/CI for migration tests, Why
>>>>>>>>> are those tests missing?
>>>>>>>>> Is it hard or very special to set up an environment for that? maybe we
>>>>>>>>> can help in this regards.
>>>>>>>> See tests/qtest/migration-test.c.  We put most of our migration tests
>>>>>>>> there and that's covered in CI.
>>>>>>>>
>>>>>>>> I think one major issue is CI systems don't normally have rdma devices.
>>>>>>>> Can rdma migration test be carried out without a real hardware?
>>>>>>> Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
>>>>>>> $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
>>>>>>> then we can get a new RDMA interface "rxe_eth0".
>>>>>>> This new RDMA interface is able to do the QEMU RDMA migration.
>>>>>>>
>>>>>>> Also, the loopback(lo) device is able to emulate the RDMA interface
>>>>>>> "rxe_lo", however when
>>>>>>> I tried(years ago) to do RDMA migration over this
>>>>>>> interface(rdma:127.0.0.1:3333) , it got something wrong.
>>>>>>> So i gave up enabling the RDMA migration qtest at that time.
>>>>>> Thanks, Zhijian.
>>>>>>
>>>>>> I'm not sure adding an emu-link for rdma is doable for CI systems, though.
>>>>>> Maybe someone more familiar with how CI works can chim in.
>>>>> Some people got dropped on the cc list for unknown reason, I'm adding them
>>>>> back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
>>>>> accident.
>>>>>
>>>>> I'll try to summarize what is still missing, and I think these will be
>>>>> greatly helpful if we don't want to deprecate rdma migration:
>>>>>
>>>>>     1) Either a CI test covering at least the major RDMA paths, or at least
>>>>>        periodically tests for each QEMU release will be needed.
>>>>>
>>>>>     2) Some performance tests between modern RDMA and NIC devices are
>>>>>        welcomed.  The current knowledge is modern NIC can work similarly to
>>>>>        RDMA in performance, then it's debatable why we still maintain so much
>>>>>        rdma specific code.
>>>>>
>>>>>     3) No need to be soild patchsets for this one, but some plan to improve
>>>>>        RDMA migration code so that it is not almost isolated from the rest
>>>>>        protocols.
>>>>>
>>>>>     4) Someone to look after this code for real.
>>>>>
>>>>> For 2) and 3) more info is here:
>>>>>
>>>>> https://urldefense.com/v3/__https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWpIWYBhQ$
>>>>>
>>>>> Here 4) can be the most important as Markus pointed out.  We just didn't
>>>>> get there yet on the discussions, but maybe Markus is right that we should
>>>>> talk that first.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> Peter Xu
>>>>>
>> --
>> Peter Xu
>>
Daniel P. Berrangé April 30, 2024, 8 a.m. UTC | #19
On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> >> Hi All (and Peter),
> >
> > Hi, Michael,
> >
> >> 
> >> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> >> (highly irregular for a male) and yes, that's my real last name:
> >> https://www.linkedin.com/in/mrgalaxy/)
> >> 
> >> I'm the original author of the RDMA implementation. I've been discussing
> >> with Yu Zhang for a little bit about potentially handing over maintainership
> >> of the codebase to his team.
> >> 
> >> I simply have zero access to RoCE or Infiniband hardware at all,
> >> unfortunately. so I've never been able to run tests or use what I wrote at
> >> work, and as all of you know, if you don't have a way to test something,
> >> then you can't maintain it.
> >> 
> >> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> >> they feel comfortable training his team to maintain the codebase (and run
> >> tests) while they learn about it.
> >
> > The "while learning" part is fine at least to me.  IMHO the "ownership" to
> > the code, or say, taking over the responsibility, may or may not need 100%
> > mastering the code base first.  There should still be some fundamental
> > confidence to work on the code though as a starting point, then it's about
> > serious use case to back this up, and careful testings while getting more
> > familiar with it.
> 
> How much experience we expect of maintainers depends on the subsystem
> and other circumstances.  The hard requirement isn't experience, it's
> trust.  See the recent attack on xz.
> 
> I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> I'm merely reminding y'all what's at stake.

I think we shouldn't overly obsess[1] about 'xz', because the overwhelmingly
common scenario is that volunteer maintainers are honest people. QEMU is
in a massively better peer review situation. With xz there was basically no
oversight of the new maintainer. With QEMU, we have oversight from 1000's
of people on the list, a huge pool of general maintainers, the specific
migration maintainers, and the release manager merging code.

With a lack of historical experience with QEMU maintainership, I'd suggest
that new RDMA volunteers would start by adding themselves to the "MAINTAINERS"
file with only the 'Reviewer' classification. The main migration maintainers
would still handle pull requests, but wait for an R-b from one of the RDMA
volunteers. After some period of time the RDMA folks could graduate to full
maintainer status if the migration maintainers needed to reduce their load.
I suspect that might prove unnecessary though, given RDMA isn't an area of
code with a high turnover of patches.
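
For illustration, such a Reviewer-only entry would just be a small stanza in
MAINTAINERS along these lines (the name, address and status are placeholders,
not a proposal):

  RDMA Migration
  R: New Volunteer <rdma-volunteer@example.com>
  S: Odd Fixes
  F: migration/rdma.*

Graduating later would then only mean flipping the R: line to M:.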

With regards,
Daniel

[1] If we do want to obsess about something bad though, we should
    look at our handling of binary blobs in the repo and tarballs,
    i.e. the firmware binaries that all get built in an arbitrary
    environment by their respective maintainer. If we need firmware
    blobs in tree, we should strive to come up with a reproducible
    build environment that gives us byte-for-byte identical results,
    so the blobs can be verified. This is rather a tangent from this
    thread though :)
Peter Xu May 1, 2024, 4:16 p.m. UTC | #20
On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > What I worry more is whether this is really what we want to keep rdma in
> > qemu, and that's also why I was trying to request for some serious
> > performance measurements comparing rdma v.s. nics.  And here when I said
> > "we" I mean both QEMU community and any company that will support keeping
> > rdma around.
> > 
> > The problem is if NICs now are fast enough to perform at least equally
> > against rdma, and if it has a lower cost of overall maintenance, does it
> > mean that rdma migration will only be used by whoever wants to keep them in
> > the products and existed already?  In that case we should simply ask new
> > users to stick with tcp, and rdma users should only drop but not increase.
> > 
> > It seems also destined that most new migration features will not support
> > rdma: see how much we drop old features in migration now (which rdma
> > _might_ still leverage, but maybe not), and how much we add mostly multifd
> > relevant which will probably not apply to rdma at all.  So in general what
> > I am worrying is a both-loss condition, if the company might be easier to
> > either stick with an old qemu (depending on whether other new features are
> > requested to be used besides RDMA alone), or do periodic rebase with RDMA
> > downstream only.
> 
> I don't know much about the originals of RDMA support in QEMU and why
> this particular design was taken. It is indeed a huge maint burden to
> have a completely different code flow for RDMA with 4000+ lines of
> custom protocol signalling which is barely understandable.
> 
> I would note that /usr/include/rdma/rsocket.h provides a higher level
> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> type could almost[1] trivially have supported RDMA. There would have
> been almost no RDMA code required in the migration subsystem, and all
> the modern features like compression, multifd, post-copy, etc would
> "just work".
> 
> I guess the 'rsocket.h' shim may well limit some of the possible
> performance gains, but it might still have been a better tradeoff
> to have not quite so good peak performance, but with massively
> less maint burden.

My understanding so far is that RDMA is solely about performance and nothing
else, so it's a question of whether existing rdma users would still want to
use it if it runs slower.

Jinpu mentioned the explicit usage of ib verbs, but I am mostly just
quoting that phrase as I don't really know such details:

https://lore.kernel.org/qemu-devel/CAMGffEm2TWJxOPcNQTQ1Sjytf5395dBzTCMYiKRqfxDzJwSN6A@mail.gmail.com/

So I am not sure whether that applies here too, in that having a qiochannel
wrapper may not allow direct access to those ib verbs.
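
For reference, the rsockets shim Daniel mentions does mirror the BSD sockets
calls almost one to one. A minimal client-side sketch, assuming librdmacm's
<rdma/rsocket.h> and omitting error details, would look like this; apart from
the r-prefixed names it is the plain sockets API, which is why a QIOChannel
wrapper could in principle reuse most of the existing socket channel logic:

  #include <rdma/rsocket.h>   /* librdmacm's sockets-like shim */
  #include <netinet/in.h>
  #include <sys/types.h>

  /* Sketch: connect and push a buffer over rsockets. */
  static int rsocket_send_example(const struct sockaddr_in *dst,
                                  const void *buf, size_t len)
  {
      int fd = rsocket(AF_INET, SOCK_STREAM, 0);
      if (fd < 0) {
          return -1;
      }
      if (rconnect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
          rclose(fd);
          return -1;
      }
      while (len > 0) {
          ssize_t n = rsend(fd, buf, len, 0);
          if (n <= 0) {
              break;
          }
          buf = (const char *)buf + n;
          len -= n;
      }
      rclose(fd);
      return len == 0 ? 0 : -1;
  }

As Daniel's footnote says, the main wrinkle is the event loop: polling would
presumably have to go through rpoll() rather than poll(), since these are not
real kernel fds.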

Thanks,

> 
> With regards,
> Daniel
> 
> [1] "almost" trivially, because the poll() integration for rsockets
>     requires a bit more magic sauce since rsockets FDs are not
>     really FDs from the kernel's POV. Still, QIOCHannel likely can
>     abstract that probme.
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Michael Galaxy May 2, 2024, 1:22 p.m. UTC | #21
Yu Zhang / Jinpu,

Any possibility (at your leisure, and within the disclosure rules of
your company, IONOS) that you could share some of your performance
information to educate the group?

NICs have indeed changed, but not everybody has 100GbE Mellanox cards at
their disposal. Some people don't.

- Michael

On 5/1/24 11:16, Peter Xu wrote:
> On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
>> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
>>> What I worry more is whether this is really what we want to keep rdma in
>>> qemu, and that's also why I was trying to request for some serious
>>> performance measurements comparing rdma v.s. nics.  And here when I said
>>> "we" I mean both QEMU community and any company that will support keeping
>>> rdma around.
>>>
>>> The problem is if NICs now are fast enough to perform at least equally
>>> against rdma, and if it has a lower cost of overall maintenance, does it
>>> mean that rdma migration will only be used by whoever wants to keep them in
>>> the products and existed already?  In that case we should simply ask new
>>> users to stick with tcp, and rdma users should only drop but not increase.
>>>
>>> It seems also destined that most new migration features will not support
>>> rdma: see how much we drop old features in migration now (which rdma
>>> _might_ still leverage, but maybe not), and how much we add mostly multifd
>>> relevant which will probably not apply to rdma at all.  So in general what
>>> I am worrying is a both-loss condition, if the company might be easier to
>>> either stick with an old qemu (depending on whether other new features are
>>> requested to be used besides RDMA alone), or do periodic rebase with RDMA
>>> downstream only.
>> I don't know much about the originals of RDMA support in QEMU and why
>> this particular design was taken. It is indeed a huge maint burden to
>> have a completely different code flow for RDMA with 4000+ lines of
>> custom protocol signalling which is barely understandable.
>>
>> I would note that /usr/include/rdma/rsocket.h provides a higher level
>> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
>> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
>> type could almost[1] trivially have supported RDMA. There would have
>> been almost no RDMA code required in the migration subsystem, and all
>> the modern features like compression, multifd, post-copy, etc would
>> "just work".
>>
>> I guess the 'rsocket.h' shim may well limit some of the possible
>> performance gains, but it might still have been a better tradeoff
>> to have not quite so good peak performance, but with massively
>> less maint burden.
> My understanding so far is RDMA is sololy for performance but nothing else,
> then it's a question on whether rdma existing users would like to do so if
> it will run slower.
>
> Jinpu mentioned on the explicit usages of ib verbs but I am just mostly
> quotting that word as I don't really know such details:
>
> https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/CAMGffEm2TWJxOPcNQTQ1Sjytf5395dBzTCMYiKRqfxDzJwSN6A@mail.gmail.com/__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOew9oW_kg$
>
> So not sure whether that applies here too, in that having qiochannel
> wrapper may not allow direct access to those ib verbs.
>
> Thanks,
>
>> With regards,
>> Daniel
>>
>> [1] "almost" trivially, because the poll() integration for rsockets
>>      requires a bit more magic sauce since rsockets FDs are not
>>      really FDs from the kernel's POV. Still, QIOCHannel likely can
>>      abstract that probme.
>> -- 
>> |: https://urldefense.com/v3/__https://berrange.com__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOfyTmFFUQ$       -o-    https://urldefense.com/v3/__https://www.flickr.com/photos/dberrange__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOf8A5OC0Q$  :|
>> |: https://urldefense.com/v3/__https://libvirt.org__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOf3gffAdg$          -o-            https://urldefense.com/v3/__https://fstop138.berrange.com__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOfPMofYqw$  :|
>> |: https://urldefense.com/v3/__https://entangle-photo.org__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOeQ5jjAeQ$     -o-    https://urldefense.com/v3/__https://www.instagram.com/dberrange__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOfhaDF9WA$  :|
>>
Jinpu Wang May 2, 2024, 1:30 p.m. UTC | #22
Hi Michael, Hi Peter,


On Thu, May 2, 2024 at 3:23 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>
> Yu Zhang / Jinpu,
>
> Any possibility (at your lesiure, and within the disclosure rules of
> your company, IONOS) if you could share any of your performance
> information to educate the group?
>
> NICs have indeed changed, but not everybody has 100ge mellanox cards at
> their disposal. Some people don't.
Our staging env is a 100 Gb/s IB environment.
We will have a new setup with Ethernet (RoCE) in the coming months, and we
will run some performance
comparisons when we have the environment ready.

>
> - Michael

Thx!
Jinpu
>
> On 5/1/24 11:16, Peter Xu wrote:
> > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> >>> What I worry more is whether this is really what we want to keep rdma in
> >>> qemu, and that's also why I was trying to request for some serious
> >>> performance measurements comparing rdma v.s. nics.  And here when I said
> >>> "we" I mean both QEMU community and any company that will support keeping
> >>> rdma around.
> >>>
> >>> The problem is if NICs now are fast enough to perform at least equally
> >>> against rdma, and if it has a lower cost of overall maintenance, does it
> >>> mean that rdma migration will only be used by whoever wants to keep them in
> >>> the products and existed already?  In that case we should simply ask new
> >>> users to stick with tcp, and rdma users should only drop but not increase.
> >>>
> >>> It seems also destined that most new migration features will not support
> >>> rdma: see how much we drop old features in migration now (which rdma
> >>> _might_ still leverage, but maybe not), and how much we add mostly multifd
> >>> relevant which will probably not apply to rdma at all.  So in general what
> >>> I am worrying is a both-loss condition, if the company might be easier to
> >>> either stick with an old qemu (depending on whether other new features are
> >>> requested to be used besides RDMA alone), or do periodic rebase with RDMA
> >>> downstream only.
> >> I don't know much about the originals of RDMA support in QEMU and why
> >> this particular design was taken. It is indeed a huge maint burden to
> >> have a completely different code flow for RDMA with 4000+ lines of
> >> custom protocol signalling which is barely understandable.
> >>
> >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> >> type could almost[1] trivially have supported RDMA. There would have
> >> been almost no RDMA code required in the migration subsystem, and all
> >> the modern features like compression, multifd, post-copy, etc would
> >> "just work".
> >>
> >> I guess the 'rsocket.h' shim may well limit some of the possible
> >> performance gains, but it might still have been a better tradeoff
> >> to have not quite so good peak performance, but with massively
> >> less maint burden.
> > My understanding so far is RDMA is sololy for performance but nothing else,
> > then it's a question on whether rdma existing users would like to do so if
> > it will run slower.
> >
> > Jinpu mentioned on the explicit usages of ib verbs but I am just mostly
> > quotting that word as I don't really know such details:
> >
> > https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/CAMGffEm2TWJxOPcNQTQ1Sjytf5395dBzTCMYiKRqfxDzJwSN6A@mail.gmail.com/__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOew9oW_kg$
> >
> > So not sure whether that applies here too, in that having qiochannel
> > wrapper may not allow direct access to those ib verbs.
> >
> > Thanks,
> >
> >> With regards,
> >> Daniel
> >>
> >> [1] "almost" trivially, because the poll() integration for rsockets
> >>      requires a bit more magic sauce since rsockets FDs are not
> >>      really FDs from the kernel's POV. Still, QIOCHannel likely can
> >>      abstract that probme.
> >> --
> >> |: https://urldefense.com/v3/__https://berrange.com__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOfyTmFFUQ$       -o-    https://urldefense.com/v3/__https://www.flickr.com/photos/dberrange__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOf8A5OC0Q$  :|
> >> |: https://urldefense.com/v3/__https://libvirt.org__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOf3gffAdg$          -o-            https://urldefense.com/v3/__https://fstop138.berrange.com__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOfPMofYqw$  :|
> >> |: https://urldefense.com/v3/__https://entangle-photo.org__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOeQ5jjAeQ$     -o-    https://urldefense.com/v3/__https://www.instagram.com/dberrange__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOfhaDF9WA$  :|
> >>
Peter Xu May 2, 2024, 4:19 p.m. UTC | #23
On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> Hi Michael, Hi Peter,
> 
> 
> On Thu, May 2, 2024 at 3:23 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
> >
> > Yu Zhang / Jinpu,
> >
> > Any possibility (at your lesiure, and within the disclosure rules of
> > your company, IONOS) if you could share any of your performance
> > information to educate the group?
> >
> > NICs have indeed changed, but not everybody has 100ge mellanox cards at
> > their disposal. Some people don't.
> Our staging env is a 100 Gb/s IB environment.
> We will have a new setup in the coming months with Ethernet (RoCE), we
> will run some performance
> comparison when we have the environment ready.

Thanks both.  Please keep us posted.

Just to double check, we're comparing "tcp:" v.s. "rdma:", RoCE is not
involved, am I right?

The other note is that the comparison needs to be with multifd enabled for
the "tcp:" case.  I'd suggest we start with 8 threads if it's 100Gbps.

I think I can still fetch some 100Gbps or even 200Gbps nics around our labs
without even waiting for months.  If you want I can try to see how we can
test together.  And btw I don't think we need a cluster, IIUC we simply
need two hosts, 100G nic on both sides?  IOW, it seems to me we only need
two cards just for experiments, systems that can drive the cards, and a
wire supporting 100G?
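
For the "tcp:" side of such a comparison, a minimal multifd setup sketch in HMP
(assumptions only: port 4444, destination launched with "-incoming defer", and
the 8 channels suggested above) could look like:

  On the destination:
    (qemu) migrate_set_capability multifd on
    (qemu) migrate_set_parameter multifd-channels 8
    (qemu) migrate_incoming tcp:0:4444

  On the source:
    (qemu) migrate_set_capability multifd on
    (qemu) migrate_set_parameter multifd-channels 8
    (qemu) migrate -d tcp:<dst-ip>:4444

The "rdma:" run would then use the same guest and workload with
"migrate -d rdma:<dst-ip>:4444" (and no multifd), so only the transport differs.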

> 
> >
> > - Michael
> 
> Thx!
> Jinpu
> >
> > On 5/1/24 11:16, Peter Xu wrote:
> > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > >>> What I worry more is whether this is really what we want to keep rdma in
> > >>> qemu, and that's also why I was trying to request for some serious
> > >>> performance measurements comparing rdma v.s. nics.  And here when I said
> > >>> "we" I mean both QEMU community and any company that will support keeping
> > >>> rdma around.
> > >>>
> > >>> The problem is if NICs now are fast enough to perform at least equally
> > >>> against rdma, and if it has a lower cost of overall maintenance, does it
> > >>> mean that rdma migration will only be used by whoever wants to keep them in
> > >>> the products and existed already?  In that case we should simply ask new
> > >>> users to stick with tcp, and rdma users should only drop but not increase.
> > >>>
> > >>> It seems also destined that most new migration features will not support
> > >>> rdma: see how much we drop old features in migration now (which rdma
> > >>> _might_ still leverage, but maybe not), and how much we add mostly multifd
> > >>> relevant which will probably not apply to rdma at all.  So in general what
> > >>> I am worrying is a both-loss condition, if the company might be easier to
> > >>> either stick with an old qemu (depending on whether other new features are
> > >>> requested to be used besides RDMA alone), or do periodic rebase with RDMA
> > >>> downstream only.
> > >> I don't know much about the origins of RDMA support in QEMU and why
> > >> this particular design was taken. It is indeed a huge maint burden to
> > >> have a completely different code flow for RDMA with 4000+ lines of
> > >> custom protocol signalling which is barely understandable.
> > >>
> > >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> > >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> > >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> > >> type could almost[1] trivially have supported RDMA. There would have
> > >> been almost no RDMA code required in the migration subsystem, and all
> > >> the modern features like compression, multifd, post-copy, etc would
> > >> "just work".
> > >>
> > >> I guess the 'rsocket.h' shim may well limit some of the possible
> > >> performance gains, but it might still have been a better tradeoff
> > >> to have not quite so good peak performance, but with massively
> > >> less maint burden.
> > > My understanding so far is RDMA is solely for performance but nothing else,
> > > then it's a question on whether rdma existing users would like to do so if
> > > it will run slower.
> > >
> > > Jinpu mentioned the explicit usage of ib verbs but I am just mostly
> > > quoting that word as I don't really know such details:
> > >
> > > https://lore.kernel.org/qemu-devel/CAMGffEm2TWJxOPcNQTQ1Sjytf5395dBzTCMYiKRqfxDzJwSN6A@mail.gmail.com/
> > >
> > > So not sure whether that applies here too, in that having qiochannel
> > > wrapper may not allow direct access to those ib verbs.
> > >
> > > Thanks,
> > >
> > >> With regards,
> > >> Daniel
> > >>
> > >> [1] "almost" trivially, because the poll() integration for rsockets
> > >>      requires a bit more magic sauce since rsockets FDs are not
> > >>      really FDs from the kernel's POV. Still, QIOChannel likely can
> > >>      abstract that problem.
> > >> --
> > >> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > >> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > >> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > >>
>
Jinpu Wang May 2, 2024, 5:10 p.m. UTC | #24
Hi Peter

On Thu, May 2, 2024 at 6:20 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> > Hi Michael, Hi Peter,
> >
> >
> > On Thu, May 2, 2024 at 3:23 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
> > >
> > > Yu Zhang / Jinpu,
> > >
> > > Any possibility (at your leisure, and within the disclosure rules of
> > > your company, IONOS) if you could share any of your performance
> > > information to educate the group?
> > >
> > > NICs have indeed changed, but not everybody has 100GbE Mellanox cards at
> > > their disposal. Some people don't.
> > Our staging env is a 100 Gb/s IB environment.
> > We will have a new setup in the coming months with Ethernet (RoCE), we
> > will run some performance
> > comparison when we have the environment ready.
>
> Thanks both.  Please keep us posted.
>
> Just to double check, we're comparing "tcp:" v.s. "rdma:", RoCE is not
> involved, am I right?
Kind of. Our new hardware is RDMA capable; we can configure it to run
with either the "rdma" or the "tcp" transport, so it gives a more direct
comparison. When running the "rdma" transport, RoCE is involved, e.g. the
rdma-core/ibverbs/rdmacm/vendor verbs drivers are used.
>
> The other note is that the comparison needs to be with multifd enabled for
> the "tcp:" case.  I'd suggest we start with 8 threads if it's 100Gbps.
>
> I think I can still fetch some 100Gbps or even 200Gbps nics around our labs
> without even waiting for months.  If you want I can try to see how we can
> test together.  And btw I don't think we need a cluster, IIUC we simply
> need two hosts, 100G nic on both sides?  IOW, it seems to me we only need
> two cards just for experiments, systems that can drive the cards, and a
> wire supporting 100G?

Yes, the simple setup can be just two hosts directly connected. This reminds
me: I may also be able to find a test setup with a 100 G NIC in the lab; I
will keep you posted.

Regards!
>
> >
> > >
> > > - Michael
> >
> > Thx!
> > Jinpu
> > >
> > > On 5/1/24 11:16, Peter Xu wrote:
> > > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> > > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > > >>> What I worry more is whether this is really what we want to keep rdma in
> > > >>> qemu, and that's also why I was trying to request for some serious
> > > >>> performance measurements comparing rdma v.s. nics.  And here when I said
> > > >>> "we" I mean both QEMU community and any company that will support keeping
> > > >>> rdma around.
> > > >>>
> > > >>> The problem is if NICs now are fast enough to perform at least equally
> > > >>> against rdma, and if it has a lower cost of overall maintenance, does it
> > > >>> mean that rdma migration will only be used by whoever wants to keep them in
> > > >>> the products and existed already?  In that case we should simply ask new
> > > >>> users to stick with tcp, and rdma users should only drop but not increase.
> > > >>>
> > > >>> It seems also destined that most new migration features will not support
> > > >>> rdma: see how much we drop old features in migration now (which rdma
> > > >>> _might_ still leverage, but maybe not), and how much we add mostly multifd
> > > >>> relevant which will probably not apply to rdma at all.  So in general what
> > > >>> I am worrying is a both-loss condition, if the company might be easier to
> > > >>> either stick with an old qemu (depending on whether other new features are
> > > >>> requested to be used besides RDMA alone), or do periodic rebase with RDMA
> > > >>> downstream only.
> > > >> I don't know much about the origins of RDMA support in QEMU and why
> > > >> this particular design was taken. It is indeed a huge maint burden to
> > > >> have a completely different code flow for RDMA with 4000+ lines of
> > > >> custom protocol signalling which is barely understandable.
> > > >>
> > > >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> > > >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> > > >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> > > >> type could almost[1] trivially have supported RDMA. There would have
> > > >> been almost no RDMA code required in the migration subsystem, and all
> > > >> the modern features like compression, multifd, post-copy, etc would
> > > >> "just work".
> > > >>
> > > >> I guess the 'rsocket.h' shim may well limit some of the possible
> > > >> performance gains, but it might still have been a better tradeoff
> > > >> to have not quite so good peak performance, but with massively
> > > >> less maint burden.
> > > > My understanding so far is RDMA is solely for performance but nothing else,
> > > > then it's a question on whether rdma existing users would like to do so if
> > > > it will run slower.
> > > >
> > > > Jinpu mentioned the explicit usage of ib verbs but I am just mostly
> > > > quoting that word as I don't really know such details:
> > > >
> > > > https://lore.kernel.org/qemu-devel/CAMGffEm2TWJxOPcNQTQ1Sjytf5395dBzTCMYiKRqfxDzJwSN6A@mail.gmail.com/
> > > >
> > > > So not sure whether that applies here too, in that having qiochannel
> > > > wrapper may not allow direct access to those ib verbs.
> > > >
> > > > Thanks,
> > > >
> > > >> With regards,
> > > >> Daniel
> > > >>
> > > >> [1] "almost" trivially, because the poll() integration for rsockets
> > > >>      requires a bit more magic sauce since rsockets FDs are not
> > > >>      really FDs from the kernel's POV. Still, QIOChannel likely can
> > > >>      abstract that problem.
> > > >> --
> > > >> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > >> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > >> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > > >>
> >
>
> --
> Peter Xu
>
Jinpu Wang May 3, 2024, 6:40 a.m. UTC | #25
Hi Daniel,

On Wed, May 1, 2024 at 6:00 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > What I worry more is whether this is really what we want to keep rdma in
> > qemu, and that's also why I was trying to request for some serious
> > performance measurements comparing rdma v.s. nics.  And here when I said
> > "we" I mean both QEMU community and any company that will support keeping
> > rdma around.
> >
> > The problem is if NICs now are fast enough to perform at least equally
> > against rdma, and if it has a lower cost of overall maintenance, does it
> > mean that rdma migration will only be used by whoever wants to keep them in
> > the products and existed already?  In that case we should simply ask new
> > users to stick with tcp, and rdma users should only drop but not increase.
> >
> > It seems also destined that most new migration features will not support
> > rdma: see how much we drop old features in migration now (which rdma
> > _might_ still leverage, but maybe not), and how much we add mostly multifd
> > relevant which will probably not apply to rdma at all.  So in general what
> > I am worrying is a both-loss condition, if the company might be easier to
> > either stick with an old qemu (depending on whether other new features are
> > requested to be used besides RDMA alone), or do periodic rebase with RDMA
> > downstream only.
>
> I don't know much about the origins of RDMA support in QEMU and why
> this particular design was taken. It is indeed a huge maint burden to
> have a completely different code flow for RDMA with 4000+ lines of
> custom protocol signalling which is barely understandable.
>
> I would note that /usr/include/rdma/rsocket.h provides a higher level
> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> type could almost[1] trivially have supported RDMA. There would have
> been almost no RDMA code required in the migration subsystem, and all
> the modern features like compression, multifd, post-copy, etc would
> "just work".
I guess at the time rsocket was less mature and less performant
compared to using uverbs directly.



>
> I guess the 'rsocket.h' shim may well limit some of the possible
> performance gains, but it might still have been a better tradeoff
> to have not quite so good peak performance, but with massively
> less maint burden.
I had a brief check of the rsocket changelog; there seems to be some
improvement over time, so it might be worth revisiting this. Due to the
socket abstraction we can't use some features like ODP, so it won't be a
small and easy task.
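
To make that concrete, a minimal client-side sketch of the rsockets API from
/usr/include/rdma/rsocket.h (librdmacm, link with -lrdmacm; the helper name is
made up and error handling is trimmed); it is the ordinary BSD sockets flow
with an "r" prefix:

#include <rdma/rsocket.h>   /* rsocket(), rconnect(), rsend(), rclose(), rpoll() */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helper, not QEMU code: connect to ip:port and send one buffer. */
static int rs_send_buf(const char *ip, int port, const void *buf, size_t len)
{
    struct sockaddr_in dst;
    int fd = rsocket(AF_INET, SOCK_STREAM, 0);            /* socket()  */

    if (fd < 0) {
        return -1;
    }
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    inet_pton(AF_INET, ip, &dst.sin_addr);

    if (rconnect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {  /* connect() */
        rclose(fd);
        return -1;
    }
    /* send()/recv() become rsend()/rrecv(); an event loop must call rpoll()
     * instead of poll(), which is the "magic sauce" from [1] below. */
    if (rsend(fd, buf, len, 0) < 0) {
        rclose(fd);
        return -1;
    }
    rclose(fd);                                           /* close()   */
    return 0;
}
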
> With regards,
> Daniel
Thanks for the suggestion.
>
> [1] "almost" trivially, because the poll() integration for rsockets
>     requires a bit more magic sauce since rsockets FDs are not
>     really FDs from the kernel's POV. Still, QIOChannel likely can
>     abstract that problem.
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Peter Xu May 3, 2024, 2:33 p.m. UTC | #26
On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> I had a brief check in the rsocket changelog, there seems some
> improvement over time,
>  might be worth revisiting this. due to socket abstraction, we can't
> use some feature like
>  ODP, it won't be a small and easy task.

It'll be good to know whether Dan's suggestion would work first, without
rewriting everything so far.  Not sure whether some perf test could
help with the rsocket APIs even without QEMU's involvement (or looking for
test data supporting / invalidating such conversions).

Thanks,
Xingtao Yao (Fujitsu)" via May 6, 2024, 2:06 a.m. UTC | #27
Hi, Peter

RDMA features high bandwidth, low latency (in a non-blocking lossless network), and direct remote
memory access that bypasses the CPU (as you know, CPU resources are expensive for cloud vendors,
which is one of the reasons why we introduced offload cards), which TCP does not have.

In some scenarios where fast live migration is needed (extremely short interruption duration and
migration duration), it is very useful. To this end, we have also developed RDMA support for multifd.

Regards,
-Gonglei

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, May 1, 2024 11:31 PM
> To: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>; Michael Galaxy
> <mgalaxy@akamai.com>; Yu Zhang <yu.zhang@ionos.com>; Zhijian Li (Fujitsu)
> <lizhijian@fujitsu.com>; Jinpu Wang <jinpu.wang@ionos.com>; Elmar Gerdes
> <elmar.gerdes@ionos.com>; qemu-devel@nongnu.org; Yuval Shaia
> <yuval.shaia.ml@gmail.com>; Kevin Wolf <kwolf@redhat.com>; Prasanna
> Kumar Kalever <prasanna.kalever@redhat.com>; Cornelia Huck
> <cohuck@redhat.com>; Michael Roth <michael.roth@amd.com>; Prasanna
> Kumar Kalever <prasanna4324@gmail.com>; integration@gluster.org; Paolo
> Bonzini <pbonzini@redhat.com>; qemu-block@nongnu.org;
> devel@lists.libvirt.org; Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin
> <mst@redhat.com>; Thomas Huth <thuth@redhat.com>; Eric Blake
> <eblake@redhat.com>; Song Gao <gaosong@loongson.cn>; Marc-André
> Lureau <marcandre.lureau@redhat.com>; Alex Bennée
> <alex.bennee@linaro.org>; Wainer dos Santos Moschetta
> <wainersm@redhat.com>; Beraldo Leal <bleal@redhat.com>; Gonglei (Arei)
> <arei.gonglei@huawei.com>; Pannengyuan <pannengyuan@huawei.com>
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote:
> > On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> > > Peter Xu <peterx@redhat.com> writes:
> > >
> > > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > > >> Hi All (and Peter),
> > > >
> > > > Hi, Michael,
> > > >
> > > >>
> > > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my
> > > >> last name (highly irregular for a male) and yes, that's my real last name:
> > > >> https://www.linkedin.com/in/mrgalaxy/)
> > > >>
> > > >> I'm the original author of the RDMA implementation. I've been
> > > >> discussing with Yu Zhang for a little bit about potentially
> > > >> handing over maintainership of the codebase to his team.
> > > >>
> > > >> I simply have zero access to RoCE or Infiniband hardware at all,
> > > >> unfortunately, so I've never been able to run tests or use what I
> > > >> wrote at work, and as all of you know, if you don't have a way to
> > > >> test something, then you can't maintain it.
> > > >>
> > > >> Yu Zhang put a (very kind) proposal forward to me to ask the
> > > >> community if they feel comfortable training his team to maintain
> > > >> the codebase (and run
> > > >> tests) while they learn about it.
> > > >
> > > > The "while learning" part is fine at least to me.  IMHO the
> > > > "ownership" to the code, or say, taking over the responsibility,
> > > > may or may not need 100% mastering the code base first.  There
> > > > should still be some fundamental confidence to work on the code
> > > > though as a starting point, then it's about serious use case to
> > > > back this up, and careful testings while getting more familiar with it.
> > >
> > > How much experience we expect of maintainers depends on the
> > > subsystem and other circumstances.  The hard requirement isn't
> > > experience, it's trust.  See the recent attack on xz.
> > >
> > > I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> > > I'm merely reminding y'all what's at stake.
> >
> > I think we shouldn't overly obsess[1] about 'xz', because the
> > overwhelmingly common scenario is that volunteer maintainers are
> > honest people. QEMU is in a massively better peer review situation.
> > With xz there was basically no oversight of the new maintainer. With
> > QEMU, we have oversight from 1000's of people on the list, a huge pool
> > of general maintainers, the specific migration maintainers, and the release
> manager merging code.
> >
> > With a lack of historical experience with QEMU maintainership, I'd
> > suggest that new RDMA volunteers would start by adding themselves to the
> "MAINTAINERS"
> > file with only the 'Reviewer' classification. The main migration
> > maintainers would still handle pull requests, but wait for a R-b from
> > one of the RDMA volunteers. After some period of time the RDMA folks
> > could graduate to full maintainer status if the migration maintainers needed
> to reduce their load.
> > I suspect that might prove unnecessary though, given RDMA isn't an
> > area of code with a high turnover of patches.
> 
> Right, and we can do that as a start, it also follows our normal rules of starting
> from Reviewers to maintain something.  I even considered Zhijian to be the
> previous rdma goto guy / maintainer no matter what role he used to have in
> the MAINTAINERS file.
> 
> Here IMHO it's more about whether any company would like to stand up and
> provide help, without yet binding that to be able to send pull requests in the
> near future or even longer term.
> 
> What I worry more is whether this is really what we want to keep rdma in
> qemu, and that's also why I was trying to request for some serious
> performance measurements comparing rdma v.s. nics.  And here when I said
> "we" I mean both QEMU community and any company that will support
> keeping rdma around.
> 
> The problem is if NICs now are fast enough to perform at least equally against
> rdma, and if it has a lower cost of overall maintenance, does it mean that rdma
> migration will only be used by whoever wants to keep them in the products and
> existed already?  In that case we should simply ask new users to stick with tcp,
> and rdma users should only drop but not increase.
> 
> It seems also destined that most new migration features will not support
> rdma: see how much we drop old features in migration now (which rdma
> _might_ still leverage, but maybe not), and how much we add mostly multifd
> relevant which will probably not apply to rdma at all.  So in general what I am
> worrying is a both-loss condition, if the company might be easier to either stick
> with an old qemu (depending on whether other new features are requested to
> be used besides RDMA alone), or do periodic rebase with RDMA downstream
> only.
> 
> So even if we want to keep RDMA around I hope with this chance we can at
> least have clear picture on when we should still suggest any new user to use
> RDMA (with the reasons behind).  Or we simply shouldn't suggest any new
> user to use RDMA at all (because at least it'll lose many new migration
> features).
> 
> Thanks,
> 
> --
> Peter Xu
Jinpu Wang May 6, 2024, 10:08 a.m. UTC | #28
Hi Peter, hi Daniel,

On Fri, May 3, 2024 at 4:33 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > I had a brief check in the rsocket changelog, there seems some
> > improvement over time,
> >  might be worth revisiting this. due to socket abstraction, we can't
> > use some feature like
> >  ODP, it won't be a small and easy task.
>
> It'll be good to know whether Dan's suggestion would work first, without
> rewritting everything yet so far.  Not sure whether some perf test could
> help with the rsocket APIs even without QEMU's involvements (or looking for
> test data supporting / invalidate such conversions).
>
I did a quick test with iperf on a 100 G environment and a 40 G
environment; in summary, rsocket works pretty well.

iperf tests between 2 hosts with 40 G (IB):
first a few tests with different numbers of threads on top of the ipoib
interface, later with rsocket preloaded on top of the same ipoib interface.
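
(Assumption: the iperf server on the peer host was started as
"iperf -s -p 52000", with the same LD_PRELOAD of librspreload.so for the
rsocket runs, so that both ends go through rsockets.)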

jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
[  4] 0.0000-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
[SUM] 0.0000-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
[  4] 0.0000-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
[  5] 0.0000-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
[  6] 0.0000-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
[SUM] 0.0000-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000
[  8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000
[  9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000
[ 12] local 10.43.3.146 port 35088 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000
[  4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.0005 sec  2.85 GBytes  2.44 Gbits/sec
[  8] 0.0000-10.0011 sec  2.85 GBytes  2.45 Gbits/sec
[  5] 0.0000-10.0000 sec  2.85 GBytes  2.45 Gbits/sec
[ 12] 0.0000-10.0021 sec  2.85 GBytes  2.44 Gbits/sec
[  3] 0.0000-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
[  7] 0.0000-10.0065 sec  2.50 GBytes  2.14 Gbits/sec
[  9] 0.0000-10.0077 sec  2.52 GBytes  2.16 Gbits/sec
[  6] 0.0000-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
[SUM] 0.0000-10.0003 sec  22.1 GBytes  19.0 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.096/0.226/0.339/0.109 ms (tot/err) = 8/0
jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 16
[  3] local 10.43.3.146 port 49540 connected with 10.43.3.145 port 52000
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  6] local 10.43.3.146 port 49554 connected with 10.43.3.145 port 52000
[  8] local 10.43.3.146 port 49584 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 49552 connected with 10.43.3.145 port 52000
[ 20] local 10.43.3.146 port 49626 connected with 10.43.3.145 port 52000
[  4] local 10.43.3.146 port 49606 connected with 10.43.3.145 port 52000
[  9] local 10.43.3.146 port 49596 connected with 10.43.3.145 port 52000
[ 10] local 10.43.3.146 port 49604 connected with 10.43.3.145 port 52000
[ 26] local 10.43.3.146 port 49678 connected with 10.43.3.145 port 52000
[  7] local 10.43.3.146 port 49556 connected with 10.43.3.145 port 52000
[ 25] local 10.43.3.146 port 49662 connected with 10.43.3.145 port 52000
[ 22] local 10.43.3.146 port 49636 connected with 10.43.3.145 port 52000
[ 11] local 10.43.3.146 port 49612 connected with 10.43.3.145 port 52000
[ 13] local 10.43.3.146 port 49618 connected with 10.43.3.145 port 52000
[ 23] local 10.43.3.146 port 49646 connected with 10.43.3.145 port 52000
[ 15] local 10.43.3.146 port 49688 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[ 11] 0.0000-10.0024 sec  2.28 GBytes  1.95 Gbits/sec
[ 23] 0.0000-10.0022 sec  2.28 GBytes  1.95 Gbits/sec
[ 20] 0.0000-10.0010 sec  2.28 GBytes  1.95 Gbits/sec
[  8] 0.0000-10.0032 sec  2.28 GBytes  1.95 Gbits/sec
[ 26] 0.0000-10.0038 sec  2.28 GBytes  1.95 Gbits/sec
[ 10] 0.0000-10.0002 sec  2.28 GBytes  1.95 Gbits/sec
[  7] 0.0000-10.0033 sec  2.28 GBytes  1.95 Gbits/sec
[ 15] 0.0000-10.0015 sec  2.27 GBytes  1.95 Gbits/sec
[  4] 0.0000-10.0028 sec  2.28 GBytes  1.95 Gbits/sec
[  6] 0.0000-10.0012 sec  2.28 GBytes  1.96 Gbits/sec
[ 13] 0.0000-10.0030 sec  2.28 GBytes  1.95 Gbits/sec
[ 25] 0.0000-10.0051 sec  2.28 GBytes  1.95 Gbits/sec
[  5] 0.0000-10.0001 sec  2.28 GBytes  1.96 Gbits/sec
[  9] 0.0000-10.0017 sec  2.28 GBytes  1.95 Gbits/sec
[ 22] 0.0000-10.0008 sec  2.27 GBytes  1.95 Gbits/sec
[  3] 0.0000-10.0033 sec  2.28 GBytes  1.95 Gbits/sec
[SUM] 0.0000-10.0034 sec  36.4 GBytes  31.3 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.105/0.217/0.401/0.093 ms (tot/err) = 16/0
jwang@ps401a-914.nst:~$
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
52000 -c 10.43.3.145 -P 16
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.3.146 port 48902 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 52777 connected with 10.43.3.145 port 52000
[  9] local 10.43.3.146 port 42911 connected with 10.43.3.145 port 52000
[ 11] local 10.43.3.146 port 56354 connected with 10.43.3.145 port 52000
[ 15] local 10.43.3.146 port 43325 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 37041 connected with 10.43.3.145 port 52000
[  7] local 10.43.3.146 port 58828 connected with 10.43.3.145 port 52000
[ 17] local 10.43.3.146 port 48858 connected with 10.43.3.145 port 52000
[ 13] local 10.43.3.146 port 49256 connected with 10.43.3.145 port 52000
[ 16] local 10.43.3.146 port 35652 connected with 10.43.3.145 port 52000
[  8] local 10.43.3.146 port 48567 connected with 10.43.3.145 port 52000
[ 18] local 10.43.3.146 port 47394 connected with 10.43.3.145 port 52000
[ 19] local 10.43.3.146 port 48065 connected with 10.43.3.145 port 52000
[ 10] local 10.43.3.146 port 39788 connected with 10.43.3.145 port 52000
[  4] local 10.43.3.146 port 46818 connected with 10.43.3.145 port 52000
[ 14] local 10.43.3.146 port 57174 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[ 14] 0.0000-10.0002 sec  2.30 GBytes  1.98 Gbits/sec
[  6] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
[  5] 0.0000-10.0002 sec  2.31 GBytes  1.98 Gbits/sec
[  8] 0.0000-10.0001 sec  2.31 GBytes  1.98 Gbits/sec
[ 11] 0.0000-10.0003 sec  2.31 GBytes  1.98 Gbits/sec
[ 18] 0.0000-10.0003 sec  2.31 GBytes  1.98 Gbits/sec
[  3] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
[  4] 0.0000-10.0005 sec  2.30 GBytes  1.98 Gbits/sec
[ 17] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
[ 15] 0.0000-10.0005 sec  2.31 GBytes  1.98 Gbits/sec
[ 19] 0.0000-10.0001 sec  2.30 GBytes  1.98 Gbits/sec
[  7] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
[ 13] 0.0000-10.0005 sec  2.31 GBytes  1.98 Gbits/sec
[ 10] 0.0000-10.0003 sec  2.30 GBytes  1.98 Gbits/sec
[  9] 0.0000-10.0000 sec  2.31 GBytes  1.98 Gbits/sec
[ 16] 0.0000-10.0002 sec  2.31 GBytes  1.98 Gbits/sec
[SUM] 0.0000-10.0003 sec  36.9 GBytes  31.7 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
88.398/101.706/114.726/24.755 ms (tot/err) = 16/0
jwang@ps401a-914.nst:~$
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
52000 -c 10.43.3.145 -P 1
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.3.146 port 49168 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  34.3 GBytes  29.5 Gbits/sec
jwang@ps401a-914.nst:~$
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
52000 -c 10.43.3.145 -P 2
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.3.146 port 42096 connected with 10.43.3.145 port 52000
[  4] local 10.43.3.146 port 58667 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.0001 sec  18.4 GBytes  15.8 Gbits/sec
[  3] 0.0000-10.0000 sec  18.5 GBytes  15.9 Gbits/sec
[SUM] 0.0000-10.0001 sec  36.9 GBytes  31.7 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
38.155/38.997/39.839/39.839 ms (tot/err) = 2/0
jwang@ps401a-914.nst:~$
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
52000 -c 10.43.3.145 -P 4
------------------------------------------------------------
Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.3.146 port 36100 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 55108 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 41039 connected with 10.43.3.145 port 52000
[  7] local 10.43.3.146 port 34868 connected with 10.43.3.145 port 52000
[ ID] Interval       Transfer     Bandwidth
[  7] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
[  5] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
[  3] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
[  6] 0.0000-10.0001 sec  9.22 GBytes  7.92 Gbits/sec
[SUM] 0.0000-10.0001 sec  36.9 GBytes  31.7 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
51.401/53.644/56.015/30.487 ms (tot/err) = 4/0

You can see that with rsocket it reaches ~30 Gb/s with a single stream,
while ipoib reaches only 2.5 Gb/s (12x less); ipoib scales with more threads
up to ~32 Gb/s, which is the link limit.

In the 100 G environment, rsocket also outperforms ipoib, see below:


jwang@ps404a-59.stg:~$
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
52000 -c 10.43.48.58
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.48.59 port 40588 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  80.7 GBytes  69.4 Gbits/sec
jwang@ps404a-59.stg:~$
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
52000 -c 10.43.48.58 -P 2
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 10.43.48.59 port 41813 connected with 10.43.48.58 port 52000
[  5] local 10.43.48.59 port 60638 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.0000 sec  48.9 GBytes  42.0 Gbits/sec
[  3] 0.0000-10.0000 sec  49.8 GBytes  42.8 Gbits/sec
[SUM] 0.0000-10.0000 sec  98.7 GBytes  84.8 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
6.962/7.764/8.567/8.567 ms (tot/err) = 2/0
jwang@ps404a-59.stg:~$
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
52000 -c 10.43.48.58 -P 4
[  6] local 10.43.48.59 port 58086 connected with 10.43.48.58 port 52000
[  3] local 10.43.48.59 port 49335 connected with 10.43.48.58 port 52000
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  4] local 10.43.48.59 port 44593 connected with 10.43.48.58 port 52000
[  5] local 10.43.48.59 port 60464 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.0000 sec  28.0 GBytes  24.0 Gbits/sec
[  4] 0.0000-10.0000 sec  28.0 GBytes  24.0 Gbits/sec
[  3] 0.0000-10.0000 sec  28.0 GBytes  24.1 Gbits/sec
[  6] 0.0000-10.0000 sec  28.0 GBytes  24.1 Gbits/sec
[SUM] 0.0000-10.0001 sec   112 GBytes  96.3 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
7.344/9.619/12.199/5.271 ms (tot/err) = 4/0
jwang@ps404a-59.stg:~$
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
52000 -c 10.43.48.58 -P 8
[  3] local 10.43.48.59 port 43020 connected with 10.43.48.58 port 52000
[  7] local 10.43.48.59 port 59720 connected with 10.43.48.58 port 52000
[  4] local 10.43.48.59 port 52547 connected with 10.43.48.58 port 52000
[  8] local 10.43.48.59 port 41712 connected with 10.43.48.58 port 52000
[ 10] local 10.43.48.59 port 53126 connected with 10.43.48.58 port 52000
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  6] local 10.43.48.59 port 60311 connected with 10.43.48.58 port 52000
[  5] local 10.43.48.59 port 44103 connected with 10.43.48.58 port 52000
[  9] local 10.43.48.59 port 49007 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  9] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
[  8] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
[  4] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
[  6] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
[ 10] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
[  7] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
[  5] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
[  3] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
[SUM] 0.0000-10.0001 sec   112 GBytes  96.3 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
6.942/12.361/18.109/4.872 ms (tot/err) = 8/0
jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 8
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  4] local 10.43.48.59 port 58176 connected with 10.43.48.58 port 52000
[  5] local 10.43.48.59 port 58180 connected with 10.43.48.58 port 52000
[  3] local 10.43.48.59 port 58178 connected with 10.43.48.58 port 52000
[ 10] local 10.43.48.59 port 58226 connected with 10.43.48.58 port 52000
[ 11] local 10.43.48.59 port 58228 connected with 10.43.48.58 port 52000
[  9] local 10.43.48.59 port 58212 connected with 10.43.48.58 port 52000
[  7] local 10.43.48.59 port 58194 connected with 10.43.48.58 port 52000
[  8] local 10.43.48.59 port 58198 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  9] 0.0000-10.0005 sec  15.8 GBytes  13.5 Gbits/sec
[  4] 0.0000-10.0002 sec  15.8 GBytes  13.6 Gbits/sec
[  3] 0.0000-10.0000 sec  15.8 GBytes  13.6 Gbits/sec
[  5] 0.0000-10.0002 sec  15.8 GBytes  13.6 Gbits/sec
[  8] 0.0000-10.0005 sec  7.89 GBytes  6.78 Gbits/sec
[ 10] 0.0000-10.0000 sec  15.8 GBytes  13.6 Gbits/sec
[ 11] 0.0000-10.0014 sec  7.94 GBytes  6.82 Gbits/sec
[  7] 0.0000-10.0009 sec  15.8 GBytes  13.6 Gbits/sec
[SUM] 0.0000-10.0009 sec   111 GBytes  95.1 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.234/0.325/0.406/0.155 ms (tot/err) = 8/0
jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 4
[  3] local 10.43.48.59 port 42548 connected with 10.43.48.58 port 52000
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  4] local 10.43.48.59 port 42558 connected with 10.43.48.58 port 52000
[  5] local 10.43.48.59 port 42560 connected with 10.43.48.58 port 52000
[  6] local 10.43.48.59 port 42562 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  6] 0.0000-10.0000 sec  27.8 GBytes  23.9 Gbits/sec
[  5] 0.0000-10.0001 sec  27.3 GBytes  23.5 Gbits/sec
[  3] 0.0000-10.0001 sec  27.8 GBytes  23.9 Gbits/sec
[  4] 0.0000-10.0001 sec  27.8 GBytes  23.9 Gbits/sec
[SUM] 0.0000-10.0001 sec   111 GBytes  95.1 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.295/0.340/0.390/0.201 ms (tot/err) = 4/0
jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 2
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  4] local 10.43.48.59 port 44194 connected with 10.43.48.58 port 52000
[  3] local 10.43.48.59 port 44186 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  48.3 GBytes  41.5 Gbits/sec
[  4] 0.0000-10.0000 sec  41.3 GBytes  35.5 Gbits/sec
[SUM] 0.0000-10.0000 sec  89.7 GBytes  77.0 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.227/0.233/0.240/0.240 ms (tot/err) = 2/0
jwang@ps404a-59.stg:~$ pbkvm list
 VM  State  PID  Cores  Mem  VNC  Migration
--------------------------------------------

Total: 0 VMs, Running: 0
jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 1
------------------------------------------------------------
Client connecting to 10.43.48.58, TCP port 52000
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  3] local 10.43.48.59 port 40364 connected with 10.43.48.58 port 52000
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  51.2 GBytes  44.0 Gbits/sec

Thanks!


> Thanks,
>
> --
> Peter Xu
>
Peter Xu May 6, 2024, 3:18 p.m. UTC | #29
On Mon, May 06, 2024 at 02:06:28AM +0000, Gonglei (Arei) wrote:
> Hi, Peter

Hey, Lei,

Happy to see you around again after years.

> RDMA features high bandwidth, low latency (in non-blocking lossless
> network), and direct remote memory access by bypassing the CPU (As you
> know, CPU resources are expensive for cloud vendors, which is one of the
> reasons why we introduced offload cards.), which TCP does not have.

It's another cost to use offload cards, vs. preparing more CPU resources?

> In some scenarios where fast live migration is needed (extremely short
> interruption duration and migration duration) is very useful. To this
> end, we have also developed RDMA support for multifd.

Will any of you upstream that work?  I'm curious how intrusive it would be
to add it to multifd; if it can keep to only 5 exported functions like
what rdma.h does right now, that would be pretty nice.  We also want to make
sure it works with arbitrarily sized loads and buffers, e.g. vfio is
considering adding IO loads to multifd channels too.

One thing to note is that the question here is not purely about a performance
comparison between rdma and nics.  It's about helping us make a decision on
whether to drop rdma; iow, even if rdma performs well, the community still
has the right to drop it if nobody can actively work on and maintain it.
It's just that if nics can perform as well, that is more of a reason to drop
it, unless companies can help to provide good support and work together.

Thanks,
Peter Xu May 6, 2024, 3:28 p.m. UTC | #30
On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote:
> Hi Peter, hi Daniel,

Hi, Jinpu,

Thanks for sharing these test results.  Sounds like great news.

What's your plan next?  Would it then be worthwhile / possible to move QEMU
in that direction?  Would that greatly simplify the rdma code as Dan
mentioned?

Thanks,

> 
> On Fri, May 3, 2024 at 4:33 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > > I had a brief check in the rsocket changelog, there seems some
> > > improvement over time,
> > >  might be worth revisiting this. due to socket abstraction, we can't
> > > use some feature like
> > >  ODP, it won't be a small and easy task.
> >
> > It'll be good to know whether Dan's suggestion would work first, without
> > rewritting everything yet so far.  Not sure whether some perf test could
> > help with the rsocket APIs even without QEMU's involvements (or looking for
> > test data supporting / invalidate such conversions).
> >
> I did a quick test with iperf on 100 G environment and 40 G
> environment, in summary rsocket works pretty well.
> 
> iperf tests between 2 hosts with 40 G (IB),
> first  a few test with different num. of threads on top of ipoib
> interface, later with preload rsocket on top of same ipoib interface.
> 
> jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145
> ------------------------------------------------------------
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  3] 0.0000-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
> jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2
> ------------------------------------------------------------
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> ------------------------------------------------------------
> [  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  3] 0.0000-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
> [  4] 0.0000-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
> [SUM] 0.0000-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
> jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4
> ------------------------------------------------------------
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> ------------------------------------------------------------
> [  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
> [  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
> [  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  3] 0.0000-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
> [  4] 0.0000-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
> [  5] 0.0000-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
> [  6] 0.0000-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
> [SUM] 0.0000-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
> jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8
> ------------------------------------------------------------
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> ------------------------------------------------------------
> [  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000
> [  6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000
> [  8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000
> [  9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000
> [ 12] local 10.43.3.146 port 35088 connected with 10.43.3.145 port 52000
> [  5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000
> [  4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  4] 0.0000-10.0005 sec  2.85 GBytes  2.44 Gbits/sec
> [  8] 0.0000-10.0011 sec  2.85 GBytes  2.45 Gbits/sec
> [  5] 0.0000-10.0000 sec  2.85 GBytes  2.45 Gbits/sec
> [ 12] 0.0000-10.0021 sec  2.85 GBytes  2.44 Gbits/sec
> [  3] 0.0000-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
> [  7] 0.0000-10.0065 sec  2.50 GBytes  2.14 Gbits/sec
> [  9] 0.0000-10.0077 sec  2.52 GBytes  2.16 Gbits/sec
> [  6] 0.0000-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
> [SUM] 0.0000-10.0003 sec  22.1 GBytes  19.0 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.096/0.226/0.339/0.109 ms (tot/err) = 8/0
> jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 16
> [  3] local 10.43.3.146 port 49540 connected with 10.43.3.145 port 52000
> ------------------------------------------------------------
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> ------------------------------------------------------------
> [  6] local 10.43.3.146 port 49554 connected with 10.43.3.145 port 52000
> [  8] local 10.43.3.146 port 49584 connected with 10.43.3.145 port 52000
> [  5] local 10.43.3.146 port 49552 connected with 10.43.3.145 port 52000
> [ 20] local 10.43.3.146 port 49626 connected with 10.43.3.145 port 52000
> [  4] local 10.43.3.146 port 49606 connected with 10.43.3.145 port 52000
> [  9] local 10.43.3.146 port 49596 connected with 10.43.3.145 port 52000
> [ 10] local 10.43.3.146 port 49604 connected with 10.43.3.145 port 52000
> [ 26] local 10.43.3.146 port 49678 connected with 10.43.3.145 port 52000
> [  7] local 10.43.3.146 port 49556 connected with 10.43.3.145 port 52000
> [ 25] local 10.43.3.146 port 49662 connected with 10.43.3.145 port 52000
> [ 22] local 10.43.3.146 port 49636 connected with 10.43.3.145 port 52000
> [ 11] local 10.43.3.146 port 49612 connected with 10.43.3.145 port 52000
> [ 13] local 10.43.3.146 port 49618 connected with 10.43.3.145 port 52000
> [ 23] local 10.43.3.146 port 49646 connected with 10.43.3.145 port 52000
> [ 15] local 10.43.3.146 port 49688 connected with 10.43.3.145 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [ 11] 0.0000-10.0024 sec  2.28 GBytes  1.95 Gbits/sec
> [ 23] 0.0000-10.0022 sec  2.28 GBytes  1.95 Gbits/sec
> [ 20] 0.0000-10.0010 sec  2.28 GBytes  1.95 Gbits/sec
> [  8] 0.0000-10.0032 sec  2.28 GBytes  1.95 Gbits/sec
> [ 26] 0.0000-10.0038 sec  2.28 GBytes  1.95 Gbits/sec
> [ 10] 0.0000-10.0002 sec  2.28 GBytes  1.95 Gbits/sec
> [  7] 0.0000-10.0033 sec  2.28 GBytes  1.95 Gbits/sec
> [ 15] 0.0000-10.0015 sec  2.27 GBytes  1.95 Gbits/sec
> [  4] 0.0000-10.0028 sec  2.28 GBytes  1.95 Gbits/sec
> [  6] 0.0000-10.0012 sec  2.28 GBytes  1.96 Gbits/sec
> [ 13] 0.0000-10.0030 sec  2.28 GBytes  1.95 Gbits/sec
> [ 25] 0.0000-10.0051 sec  2.28 GBytes  1.95 Gbits/sec
> [  5] 0.0000-10.0001 sec  2.28 GBytes  1.96 Gbits/sec
> [  9] 0.0000-10.0017 sec  2.28 GBytes  1.95 Gbits/sec
> [ 22] 0.0000-10.0008 sec  2.27 GBytes  1.95 Gbits/sec
> [  3] 0.0000-10.0033 sec  2.28 GBytes  1.95 Gbits/sec
> [SUM] 0.0000-10.0034 sec  36.4 GBytes  31.3 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.105/0.217/0.401/0.093 ms (tot/err) = 16/0
> jwang@ps401a-914.nst:~$
> LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> 52000 -c 10.43.3.145 -P 16
> ------------------------------------------------------------
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  128 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.43.3.146 port 48902 connected with 10.43.3.145 port 52000
> [  5] local 10.43.3.146 port 52777 connected with 10.43.3.145 port 52000
> [  9] local 10.43.3.146 port 42911 connected with 10.43.3.145 port 52000
> [ 11] local 10.43.3.146 port 56354 connected with 10.43.3.145 port 52000
> [ 15] local 10.43.3.146 port 43325 connected with 10.43.3.145 port 52000
> [  6] local 10.43.3.146 port 37041 connected with 10.43.3.145 port 52000
> [  7] local 10.43.3.146 port 58828 connected with 10.43.3.145 port 52000
> [ 17] local 10.43.3.146 port 48858 connected with 10.43.3.145 port 52000
> [ 13] local 10.43.3.146 port 49256 connected with 10.43.3.145 port 52000
> [ 16] local 10.43.3.146 port 35652 connected with 10.43.3.145 port 52000
> [  8] local 10.43.3.146 port 48567 connected with 10.43.3.145 port 52000
> [ 18] local 10.43.3.146 port 47394 connected with 10.43.3.145 port 52000
> [ 19] local 10.43.3.146 port 48065 connected with 10.43.3.145 port 52000
> [ 10] local 10.43.3.146 port 39788 connected with 10.43.3.145 port 52000
> [  4] local 10.43.3.146 port 46818 connected with 10.43.3.145 port 52000
> [ 14] local 10.43.3.146 port 57174 connected with 10.43.3.145 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [ 14] 0.0000-10.0002 sec  2.30 GBytes  1.98 Gbits/sec
> [  6] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
> [  5] 0.0000-10.0002 sec  2.31 GBytes  1.98 Gbits/sec
> [  8] 0.0000-10.0001 sec  2.31 GBytes  1.98 Gbits/sec
> [ 11] 0.0000-10.0003 sec  2.31 GBytes  1.98 Gbits/sec
> [ 18] 0.0000-10.0003 sec  2.31 GBytes  1.98 Gbits/sec
> [  3] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
> [  4] 0.0000-10.0005 sec  2.30 GBytes  1.98 Gbits/sec
> [ 17] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
> [ 15] 0.0000-10.0005 sec  2.31 GBytes  1.98 Gbits/sec
> [ 19] 0.0000-10.0001 sec  2.30 GBytes  1.98 Gbits/sec
> [  7] 0.0000-10.0004 sec  2.31 GBytes  1.98 Gbits/sec
> [ 13] 0.0000-10.0005 sec  2.31 GBytes  1.98 Gbits/sec
> [ 10] 0.0000-10.0003 sec  2.30 GBytes  1.98 Gbits/sec
> [  9] 0.0000-10.0000 sec  2.31 GBytes  1.98 Gbits/sec
> [ 16] 0.0000-10.0002 sec  2.31 GBytes  1.98 Gbits/sec
> [SUM] 0.0000-10.0003 sec  36.9 GBytes  31.7 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 88.398/101.706/114.726/24.755 ms (tot/err) = 16/0
> jwang@ps401a-914.nst:~$
> LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> 52000 -c 10.43.3.145 -P 1
> ------------------------------------------------------------
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  128 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.43.3.146 port 49168 connected with 10.43.3.145 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  3] 0.0000-10.0000 sec  34.3 GBytes  29.5 Gbits/sec
> jwang@ps401a-914.nst:~$
> LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> 52000 -c 10.43.3.145 -P 2
> ------------------------------------------------------------
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  128 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.43.3.146 port 42096 connected with 10.43.3.145 port 52000
> [  4] local 10.43.3.146 port 58667 connected with 10.43.3.145 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  4] 0.0000-10.0001 sec  18.4 GBytes  15.8 Gbits/sec
> [  3] 0.0000-10.0000 sec  18.5 GBytes  15.9 Gbits/sec
> [SUM] 0.0000-10.0001 sec  36.9 GBytes  31.7 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 38.155/38.997/39.839/39.839 ms (tot/err) = 2/0
> jwang@ps401a-914.nst:~$
> LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> 52000 -c 10.43.3.145 -P 4
> ------------------------------------------------------------
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  128 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.43.3.146 port 36100 connected with 10.43.3.145 port 52000
> [  5] local 10.43.3.146 port 55108 connected with 10.43.3.145 port 52000
> [  6] local 10.43.3.146 port 41039 connected with 10.43.3.145 port 52000
> [  7] local 10.43.3.146 port 34868 connected with 10.43.3.145 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  7] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
> [  5] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
> [  3] 0.0000-10.0000 sec  9.22 GBytes  7.92 Gbits/sec
> [  6] 0.0000-10.0001 sec  9.22 GBytes  7.92 Gbits/sec
> [SUM] 0.0000-10.0001 sec  36.9 GBytes  31.7 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 51.401/53.644/56.015/30.487 ms (tot/err) = 4/0
> 
> You can see that with rsocket a single stream reaches ~30 Gb/s,
> while plain ipoib reaches only ~2.5 Gb/s (about 12x less); ipoib
> scales with more threads up to ~32 Gb/s, which is the link limit.
> 
> In the 100 G environment, rsocket also outperforms ipoib, see below:
> 
> 
> jwang@ps404a-59.stg:~$
> LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> 52000 -c 10.43.48.58
> ------------------------------------------------------------
> Client connecting to 10.43.48.58, TCP port 52000
> TCP window size:  128 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.43.48.59 port 40588 connected with 10.43.48.58 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  3] 0.0000-10.0000 sec  80.7 GBytes  69.4 Gbits/sec
> jwang@ps404a-59.stg:~$
> LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> 52000 -c 10.43.48.58 -P 2
> ------------------------------------------------------------
> Client connecting to 10.43.48.58, TCP port 52000
> TCP window size:  128 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.43.48.59 port 41813 connected with 10.43.48.58 port 52000
> [  5] local 10.43.48.59 port 60638 connected with 10.43.48.58 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  5] 0.0000-10.0000 sec  48.9 GBytes  42.0 Gbits/sec
> [  3] 0.0000-10.0000 sec  49.8 GBytes  42.8 Gbits/sec
> [SUM] 0.0000-10.0000 sec  98.7 GBytes  84.8 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 6.962/7.764/8.567/8.567 ms (tot/err) = 2/0
> jwang@ps404a-59.stg:~$
> LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> 52000 -c 10.43.48.58 -P 4
> [  6] local 10.43.48.59 port 58086 connected with 10.43.48.58 port 52000
> [  3] local 10.43.48.59 port 49335 connected with 10.43.48.58 port 52000
> ------------------------------------------------------------
> Client connecting to 10.43.48.58, TCP port 52000
> TCP window size:  128 KByte (default)
> ------------------------------------------------------------
> [  4] local 10.43.48.59 port 44593 connected with 10.43.48.58 port 52000
> [  5] local 10.43.48.59 port 60464 connected with 10.43.48.58 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  5] 0.0000-10.0000 sec  28.0 GBytes  24.0 Gbits/sec
> [  4] 0.0000-10.0000 sec  28.0 GBytes  24.0 Gbits/sec
> [  3] 0.0000-10.0000 sec  28.0 GBytes  24.1 Gbits/sec
> [  6] 0.0000-10.0000 sec  28.0 GBytes  24.1 Gbits/sec
> [SUM] 0.0000-10.0001 sec   112 GBytes  96.3 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 7.344/9.619/12.199/5.271 ms (tot/err) = 4/0
> jwang@ps404a-59.stg:~$
> LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> 52000 -c 10.43.48.58 -P 8
> [  3] local 10.43.48.59 port 43020 connected with 10.43.48.58 port 52000
> [  7] local 10.43.48.59 port 59720 connected with 10.43.48.58 port 52000
> [  4] local 10.43.48.59 port 52547 connected with 10.43.48.58 port 52000
> [  8] local 10.43.48.59 port 41712 connected with 10.43.48.58 port 52000
> [ 10] local 10.43.48.59 port 53126 connected with 10.43.48.58 port 52000
> ------------------------------------------------------------
> Client connecting to 10.43.48.58, TCP port 52000
> TCP window size:  128 KByte (default)
> ------------------------------------------------------------
> [  6] local 10.43.48.59 port 60311 connected with 10.43.48.58 port 52000
> [  5] local 10.43.48.59 port 44103 connected with 10.43.48.58 port 52000
> [  9] local 10.43.48.59 port 49007 connected with 10.43.48.58 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  9] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
> [  8] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
> [  4] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
> [  6] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
> [ 10] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
> [  7] 0.0000-10.0000 sec  14.0 GBytes  12.0 Gbits/sec
> [  5] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
> [  3] 0.0000-10.0001 sec  14.0 GBytes  12.0 Gbits/sec
> [SUM] 0.0000-10.0001 sec   112 GBytes  96.3 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 6.942/12.361/18.109/4.872 ms (tot/err) = 8/0
> jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 8
> ------------------------------------------------------------
> Client connecting to 10.43.48.58, TCP port 52000
> TCP window size:  165 KByte (default)
> ------------------------------------------------------------
> [  4] local 10.43.48.59 port 58176 connected with 10.43.48.58 port 52000
> [  5] local 10.43.48.59 port 58180 connected with 10.43.48.58 port 52000
> [  3] local 10.43.48.59 port 58178 connected with 10.43.48.58 port 52000
> [ 10] local 10.43.48.59 port 58226 connected with 10.43.48.58 port 52000
> [ 11] local 10.43.48.59 port 58228 connected with 10.43.48.58 port 52000
> [  9] local 10.43.48.59 port 58212 connected with 10.43.48.58 port 52000
> [  7] local 10.43.48.59 port 58194 connected with 10.43.48.58 port 52000
> [  8] local 10.43.48.59 port 58198 connected with 10.43.48.58 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  9] 0.0000-10.0005 sec  15.8 GBytes  13.5 Gbits/sec
> [  4] 0.0000-10.0002 sec  15.8 GBytes  13.6 Gbits/sec
> [  3] 0.0000-10.0000 sec  15.8 GBytes  13.6 Gbits/sec
> [  5] 0.0000-10.0002 sec  15.8 GBytes  13.6 Gbits/sec
> [  8] 0.0000-10.0005 sec  7.89 GBytes  6.78 Gbits/sec
> [ 10] 0.0000-10.0000 sec  15.8 GBytes  13.6 Gbits/sec
> [ 11] 0.0000-10.0014 sec  7.94 GBytes  6.82 Gbits/sec
> [  7] 0.0000-10.0009 sec  15.8 GBytes  13.6 Gbits/sec
> [SUM] 0.0000-10.0009 sec   111 GBytes  95.1 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.234/0.325/0.406/0.155 ms (tot/err) = 8/0
> jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 4
> [  3] local 10.43.48.59 port 42548 connected with 10.43.48.58 port 52000
> ------------------------------------------------------------
> Client connecting to 10.43.48.58, TCP port 52000
> TCP window size:  165 KByte (default)
> ------------------------------------------------------------
> [  4] local 10.43.48.59 port 42558 connected with 10.43.48.58 port 52000
> [  5] local 10.43.48.59 port 42560 connected with 10.43.48.58 port 52000
> [  6] local 10.43.48.59 port 42562 connected with 10.43.48.58 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  6] 0.0000-10.0000 sec  27.8 GBytes  23.9 Gbits/sec
> [  5] 0.0000-10.0001 sec  27.3 GBytes  23.5 Gbits/sec
> [  3] 0.0000-10.0001 sec  27.8 GBytes  23.9 Gbits/sec
> [  4] 0.0000-10.0001 sec  27.8 GBytes  23.9 Gbits/sec
> [SUM] 0.0000-10.0001 sec   111 GBytes  95.1 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.295/0.340/0.390/0.201 ms (tot/err) = 4/0
> jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 2
> ------------------------------------------------------------
> Client connecting to 10.43.48.58, TCP port 52000
> TCP window size:  165 KByte (default)
> ------------------------------------------------------------
> [  4] local 10.43.48.59 port 44194 connected with 10.43.48.58 port 52000
> [  3] local 10.43.48.59 port 44186 connected with 10.43.48.58 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  3] 0.0000-10.0000 sec  48.3 GBytes  41.5 Gbits/sec
> [  4] 0.0000-10.0000 sec  41.3 GBytes  35.5 Gbits/sec
> [SUM] 0.0000-10.0000 sec  89.7 GBytes  77.0 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.227/0.233/0.240/0.240 ms (tot/err) = 2/0
> jwang@ps404a-59.stg:~$ pbkvm list
>  VM  State  PID  Cores  Mem  VNC  Migration
> --------------------------------------------
> 
> Total: 0 VMs, Running: 0
> jwang@ps404a-59.stg:~$ iperf -p 52000 -c 10.43.48.58 -P 1
> ------------------------------------------------------------
> Client connecting to 10.43.48.58, TCP port 52000
> TCP window size:  165 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.43.48.59 port 40364 connected with 10.43.48.58 port 52000
> [ ID] Interval       Transfer     Bandwidth
> [  3] 0.0000-10.0000 sec  51.2 GBytes  44.0 Gbits/sec
> 
> Thanks!
> 
> 
> > Thanks,
> >
> > --
> > Peter Xu
> >
>
Xingtao Yao (Fujitsu)" via May 7, 2024, 1:50 a.m. UTC | #31
Hello,

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Monday, May 6, 2024 11:18 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>; Markus Armbruster
> <armbru@redhat.com>; Michael Galaxy <mgalaxy@akamai.com>; Yu Zhang
> <yu.zhang@ionos.com>; Zhijian Li (Fujitsu) <lizhijian@fujitsu.com>; Jinpu Wang
> <jinpu.wang@ionos.com>; Elmar Gerdes <elmar.gerdes@ionos.com>;
> qemu-devel@nongnu.org; Yuval Shaia <yuval.shaia.ml@gmail.com>; Kevin Wolf
> <kwolf@redhat.com>; Prasanna Kumar Kalever
> <prasanna.kalever@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> Michael Roth <michael.roth@amd.com>; Prasanna Kumar Kalever
> <prasanna4324@gmail.com>; integration@gluster.org; Paolo Bonzini
> <pbonzini@redhat.com>; qemu-block@nongnu.org; devel@lists.libvirt.org;
> Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin <mst@redhat.com>;
> Thomas Huth <thuth@redhat.com>; Eric Blake <eblake@redhat.com>; Song
> Gao <gaosong@loongson.cn>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; Alex Bennée <alex.bennee@linaro.org>;
> Wainer dos Santos Moschetta <wainersm@redhat.com>; Beraldo Leal
> <bleal@redhat.com>; Pannengyuan <pannengyuan@huawei.com>;
> Xiexiangyou <xiexiangyou@huawei.com>
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Mon, May 06, 2024 at 02:06:28AM +0000, Gonglei (Arei) wrote:
> > Hi, Peter
> 
> Hey, Lei,
> 
> Happy to see you around again after years.
> 
Haha, me too.

> > RDMA features high bandwidth, low latency (in non-blocking lossless
> > network), and direct remote memory access by bypassing the CPU (As you
> > know, CPU resources are expensive for cloud vendors, which is one of
> > the reasons why we introduced offload cards.), which TCP does not have.
> 
> It's another cost to use offload cards, v.s. preparing more cpu resources?
> 
A converged software and hardware offload architecture is the way to go for all cloud vendors
(considering the combined benefits in performance, cost, security, and speed of innovation);
it's not just a matter of adding the resources of a DPU card.

> > In some scenarios where fast live migration is needed (extremely short
> > interruption duration and migration duration) is very useful. To this
> > end, we have also developed RDMA support for multifd.
> 
> Will any of you upstream that work?  I'm curious how intrusive would it be
> when adding it to multifd, if it can keep only 5 exported functions like what
> rdma.h does right now it'll be pretty nice.  We also want to make sure it works
> with arbitrary sized loads and buffers, e.g. vfio is considering to add IO loads to
> multifd channels too.
> 

In fact, we sent the patchset to the community in 2021. Please see:
https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/


> One thing to note that the question here is not about a pure performance
> comparison between rdma and nics only.  It's about help us make a decision
> on whether to drop rdma, iow, even if rdma performs well, the community still
> has the right to drop it if nobody can actively work and maintain it.
> It's just that if nics can perform as good it's more a reason to drop, unless
> companies can help to provide good support and work together.
> 

We are happy to provide the necessary review and maintenance work for RDMA
if the community needs it.

CC'ing Chuan Zheng.


Regards,
-Gonglei
Jinpu Wang May 7, 2024, 4:52 a.m. UTC | #32
Hi Peter, hi Daniel,
On Mon, May 6, 2024 at 5:29 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote:
> > Hi Peter, hi Daniel,
>
> Hi, Jinpu,
>
> Thanks for sharing this test results.  Sounds like a great news.
>
> What's your plan next?  Would it then be worthwhile / possible moving QEMU
> into that direction?  Would that greatly simplify rdma code as Dan
> mentioned?
I'm not very familiar with QEMU migration yet, but from the test
results I think it is a feasible direction; we would just need to
build on a reasonably recent rdma-core release, such as v33, with
proper 'fork' support.

Maybe Dan or you could give more detail about what you have in mind
for using rsocket as a replacement going forward.
We will also look into the implementation details in the meantime.
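
For context, the reason a drop-in style conversion is even conceivable is
that rsocket deliberately mirrors the BSD socket API call for call. Below
is a minimal client sketch (illustrative only and untested here, assuming
rdma-core's <rdma/rsocket.h>; the address and port are simply taken from
the tests above):

/*
 * Minimal rsocket client sketch -- illustrative, untested.
 * Build with: cc rs_client.c -lrdmacm
 * Each r* call is the rsocket counterpart of the usual
 * socket()/connect()/send()/recv()/close() sequence.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rsocket.h>

int main(void)
{
    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port   = htons(52000),             /* port used in the tests above */
    };
    inet_pton(AF_INET, "10.43.3.145", &dst.sin_addr);

    int fd = rsocket(AF_INET, SOCK_STREAM, 0);  /* instead of socket() */
    if (fd < 0 || rconnect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("rsocket/rconnect");
        return 1;
    }

    char buf[64] = "ping";
    rsend(fd, buf, 5, 0);                       /* instead of send() */
    rrecv(fd, buf, sizeof(buf), 0);             /* instead of recv() */
    rclose(fd);                                 /* instead of close() */
    return 0;
}

This is also what the librspreload.so runs above rely on: the preload shim
intercepts the ordinary socket calls and redirects them to these r*
equivalents, which is why unmodified binaries such as iperf end up running
over RDMA.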

Thx!
J

>
> Thanks,
>
> >
> > On Fri, May 3, 2024 at 4:33 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > > > I had a brief check in the rsocket changelog, there seems some
> > > > improvement over time,
> > > >  might be worth revisiting this. due to socket abstraction, we can't
> > > > use some feature like
> > > >  ODP, it won't be a small and easy task.
> > >
> > > It'll be good to know whether Dan's suggestion would work first, without
> > > rewritting everything yet so far.  Not sure whether some perf test could
> > > help with the rsocket APIs even without QEMU's involvements (or looking for
> > > test data supporting / invalidate such conversions).
> > >
> > I did a quick test with iperf on 100 G environment and 40 G
> > environment, in summary rsocket works pretty well.
> >
> > iperf tests between 2 hosts with 40 G (IB),
> > first  a few test with different num. of threads on top of ipoib
> > interface, later with preload rsocket on top of same ipoib interface.
> >
> > jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  3] 0.0000-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
> > jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
> > [  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  3] 0.0000-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
> > [  4] 0.0000-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
> > [SUM] 0.0000-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
> > jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
> > [  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
> > [  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  3] 0.0000-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
> > [  4] 0.0000-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
> > [  5] 0.0000-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
> > [  6] 0.0000-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
> > [SUM] 0.0000-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
> > jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000
> > [  6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000
> > [  8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000
> > [  9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000
> > [  3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000
> > [ 12] local 10.43.3.146 port 35088 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000
> > [  4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [  4] 0.0000-10.0005 sec  2.85 GBytes  2.44 Gbits/sec
> > [  8] 0.0000-10.0011 sec  2.85 GBytes  2.45 Gbits/sec
> > [  5] 0.0000-10.0000 sec  2.85 GBytes  2.45 Gbits/sec
> > [ 12] 0.0000-10.0021 sec  2.85 GBytes  2.44 Gbits/sec
> > [  3] 0.0000-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
> > [  7] 0.0000-10.0065 sec  2.50 GBytes  2.14 Gbits/sec
> > [  9] 0.0000-10.0077 sec  2.52 GBytes  2.16 Gbits/sec
> > [  6] 0.0000-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
> > [SUM] 0.0000-10.0003 sec  22.1 GBytes  19.0 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.096/0.226/0.339/0.109 ms (tot/err) = 8/0
> > jwang@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 16
> > [  3] local 10.43.3.146 port 49540 connected with 10.43.3.145 port 52000
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > ------------------------------------------------------------
> > [  6] local 10.43.3.146 port 49554 connected with 10.43.3.145 port 52000
> > [  8] local 10.43.3.146 port 49584 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 49552 connected with 10.43.3.145 port 52000
> > [ 20] local 10.43.3.146 port 49626 connected with 10.43.3.145 port 52000
> > [  4] local 10.43.3.146 port 49606 connected with 10.43.3.145 port 52000
> > [  9] local 10.43.3.146 port 49596 connected with 10.43.3.145 port 52000
> > [ 10] local 10.43.3.146 port 49604 connected with 10.43.3.145 port 52000
> > [ 26] local 10.43.3.146 port 49678 connected with 10.43.3.145 port 52000
> > [  7] local 10.43.3.146 port 49556 connected with 10.43.3.145 port 52000
> > [ 25] local 10.43.3.146 port 49662 connected with 10.43.3.145 port 52000
> > [ 22] local 10.43.3.146 port 49636 connected with 10.43.3.145 port 52000
> > [ 11] local 10.43.3.146 port 49612 connected with 10.43.3.145 port 52000
> > [ 13] local 10.43.3.146 port 49618 connected with 10.43.3.145 port 52000
> > [ 23] local 10.43.3.146 port 49646 connected with 10.43.3.145 port 52000
> > [ 15] local 10.43.3.146 port 49688 connected with 10.43.3.145 port 52000
> > [ ID] Interval       Transfer     Bandwidth
> > [ 11] 0.0000-10.0024 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 23] 0.0000-10.0022 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 20] 0.0000-10.0010 sec  2.28 GBytes  1.95 Gbits/sec
> > [  8] 0.0000-10.0032 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 26] 0.0000-10.0038 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 10] 0.0000-10.0002 sec  2.28 GBytes  1.95 Gbits/sec
> > [  7] 0.0000-10.0033 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 15] 0.0000-10.0015 sec  2.27 GBytes  1.95 Gbits/sec
> > [  4] 0.0000-10.0028 sec  2.28 GBytes  1.95 Gbits/sec
> > [  6] 0.0000-10.0012 sec  2.28 GBytes  1.96 Gbits/sec
> > [ 13] 0.0000-10.0030 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 25] 0.0000-10.0051 sec  2.28 GBytes  1.95 Gbits/sec
> > [  5] 0.0000-10.0001 sec  2.28 GBytes  1.96 Gbits/sec
> > [  9] 0.0000-10.0017 sec  2.28 GBytes  1.95 Gbits/sec
> > [ 22] 0.0000-10.0008 sec  2.27 GBytes  1.95 Gbits/sec
> > [  3] 0.0000-10.0033 sec  2.28 GBytes  1.95 Gbits/sec
> > [SUM] 0.0000-10.0034 sec  36.4 GBytes  31.3 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.105/0.217/0.401/0.093 ms (tot/err) = 16/0
> > jwang@ps401a-914.nst:~$
> > LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so iperf -p
> > 52000 -c 10.43.3.145 -P 16
> > ------------------------------------------------------------
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  128 KByte (default)
> > ------------------------------------------------------------
> > [  3] local 10.43.3.146 port 48902 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 52777 connected with 10.43.3.145 port 52000
> > [  9] local 10.43.3.146 port 42911 connected with 10.43.3.145 port 52000
> > [ 11] local 10.43.3.146 port 56354 connected with 10.43.3.145 port 52000
> > [ 15] local 10.43.3.146 port 43325 connected with 10.43.3.145 port 52000
> > [  6] local 10.43.3.146 port 37041 connected with 10.43.3.145 port 52000
> > [  7] local 10.43.3.146 port 58828 connected with 10.43.3.145 port 52000
> > [ 17] local 10.43.3.146 port 48858 connected with 10.43.3.145 port 52000
> > [ 13] local 10.43.3.146 port 49256 connected with 10.43.3.145 port 52000
> > [ 16] local 10.43.3.146 port 35652 connected with 10.43.3.145 port 52000
> > [remaining rsocket and 100 G iperf output snipped -- identical to the results quoted earlier in the thread]
> >
> > Thanks!
> >
> >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> > >
> >
>
> --
> Peter Xu
>
Peter Xu May 7, 2024, 4:28 p.m. UTC | #33
On Tue, May 07, 2024 at 01:50:43AM +0000, Gonglei (Arei) wrote:
> Hello,
> 
> > [original message headers and Cc list snipped; same as quoted earlier]
> > 
> > On Mon, May 06, 2024 at 02:06:28AM +0000, Gonglei (Arei) wrote:
> > > Hi, Peter
> > 
> > Hey, Lei,
> > 
> > Happy to see you around again after years.
> > 
> Haha, me too.
> 
> > > RDMA features high bandwidth, low latency (in non-blocking lossless
> > > network), and direct remote memory access by bypassing the CPU (As you
> > > know, CPU resources are expensive for cloud vendors, which is one of
> > > the reasons why we introduced offload cards.), which TCP does not have.
> > 
> > It's another cost to use offload cards, v.s. preparing more cpu resources?
> > 
> Software and hardware offload converged architecture is the way to go for all cloud vendors 
> (Including comprehensive benefits in terms of performance, cost, security, and innovation speed), 
> it's not just a matter of adding the resource of a DPU card.
> 
> > > In some scenarios where fast live migration is needed (extremely short
> > > interruption duration and migration duration) is very useful. To this
> > > end, we have also developed RDMA support for multifd.
> > 
> > Will any of you upstream that work?  I'm curious how intrusive would it be
> > when adding it to multifd, if it can keep only 5 exported functions like what
> > rdma.h does right now it'll be pretty nice.  We also want to make sure it works
> > with arbitrary sized loads and buffers, e.g. vfio is considering to add IO loads to
> > multifd channels too.
> > 
> 
> In fact, we sent the patchset to the community in 2021. Pls see:
> https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/

I wasn't aware of that for sure in the past..

Multifd has changed quite a bit in the last 9.0 release, that may not apply
anymore.  One thing to mention is please look at Dan's comment on possible
use of rsocket.h:

https://lore.kernel.org/all/ZjJm6rcqS5EhoKgK@redhat.com/

And Jinpu did help provide an initial test result over the library:

https://lore.kernel.org/qemu-devel/CAMGffEk8wiKNQmoUYxcaTHGtiEm2dwoCF_W7T0vMcD-i30tUkA@mail.gmail.com/

It looks like we have a chance to apply that in QEMU.

> 
> 
> > One thing to note that the question here is not about a pure performance
> > comparison between rdma and nics only.  It's about help us make a decision
> > on whether to drop rdma, iow, even if rdma performs well, the community still
> > has the right to drop it if nobody can actively work and maintain it.
> > It's just that if nics can perform as good it's more a reason to drop, unless
> > companies can help to provide good support and work together.
> > 
> 
> We are happy to provide the necessary review and maintenance work for RDMA
> if the community needs it.
> 
> CC'ing Chuan Zheng.

I'm not sure whether you and Jinpu's team would like to work together and
provide a final solution for rdma over multifd.  It could be much simpler
than the original 2021 proposal if the rsocket API will work out.

Thanks,
Zheng Chuan May 9, 2024, 8:58 a.m. UTC | #34
Hi Peter, Lei, Jinpu,

On 2024/5/8 0:28, Peter Xu wrote:
> On Tue, May 07, 2024 at 01:50:43AM +0000, Gonglei (Arei) wrote:
>> Hello,
>>
>>> [original message headers and Cc list snipped; same as quoted earlier]
>>>
>>> On Mon, May 06, 2024 at 02:06:28AM +0000, Gonglei (Arei) wrote:
>>>> Hi, Peter
>>>
>>> Hey, Lei,
>>>
>>> Happy to see you around again after years.
>>>
>> Haha, me too.
>>
>>>> RDMA features high bandwidth, low latency (in non-blocking lossless
>>>> network), and direct remote memory access by bypassing the CPU (As you
>>>> know, CPU resources are expensive for cloud vendors, which is one of
>>>> the reasons why we introduced offload cards.), which TCP does not have.
>>>
>>> It's another cost to use offload cards, v.s. preparing more cpu resources?
>>>
>> Software and hardware offload converged architecture is the way to go for all cloud vendors 
>> (Including comprehensive benefits in terms of performance, cost, security, and innovation speed), 
>> it's not just a matter of adding the resource of a DPU card.
>>
>>>> In some scenarios where fast live migration is needed (extremely short
>>>> interruption duration and migration duration) is very useful. To this
>>>> end, we have also developed RDMA support for multifd.
>>>
>>> Will any of you upstream that work?  I'm curious how intrusive would it be
>>> when adding it to multifd, if it can keep only 5 exported functions like what
>>> rdma.h does right now it'll be pretty nice.  We also want to make sure it works
>>> with arbitrary sized loads and buffers, e.g. vfio is considering to add IO loads to
>>> multifd channels too.
>>>
>>
>> In fact, we sent the patchset to the community in 2021. Pls see:
>> https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/
> 

Yes, I sent the patchset adding multifd support for RDMA migration, taking it over from my colleague, and
I am sorry for not following up on that work at the time, for various reasons.
I also strongly agree with Lei that the RDMA protocol has some particular advantages over TCP
in certain scenarios, and we do use it in our product.

> I wasn't aware of that for sure in the past..
> 
> Multifd has changed quite a bit in the last 9.0 release, that may not apply
> anymore.  One thing to mention is please look at Dan's comment on possible
> use of rsocket.h:
> 
> https://lore.kernel.org/all/ZjJm6rcqS5EhoKgK@redhat.com/
> 
> And Jinpu did help provide an initial test result over the library:
> 
> https://lore.kernel.org/qemu-devel/CAMGffEk8wiKNQmoUYxcaTHGtiEm2dwoCF_W7T0vMcD-i30tUkA@mail.gmail.com/
> 
> It looks like we have a chance to apply that in QEMU.
> 
>>
>>
>>> One thing to note that the question here is not about a pure performance
>>> comparison between rdma and nics only.  It's about help us make a decision
>>> on whether to drop rdma, iow, even if rdma performs well, the community still
>>> has the right to drop it if nobody can actively work and maintain it.
>>> It's just that if nics can perform as good it's more a reason to drop, unless
>>> companies can help to provide good support and work together.
>>>
>>
>> We are happy to provide the necessary review and maintenance work for RDMA
>> if the community needs it.
>>
>> CC'ing Chuan Zheng.
> 
> I'm not sure whether you and Jinpu's team would like to work together and
> provide a final solution for rdma over multifd.  It could be much simpler
> than the original 2021 proposal if the rsocket API will work out.
> 
> Thanks,
> 
That's good news, to see a socket abstraction for RDMA!
When I developed the series above, the biggest pain point was that RDMA migration has no QIOChannel abstraction, so I had to use a 'fake channel'
for it, which is awkward in the code.
So, as far as I can see, we can approach this as follows (a rough sketch of the idea behind step ii is at the end of this mail):
i. first, evaluate whether rsocket is good enough to satisfy our fundamental QIOChannel abstraction;
ii. if it works, see whether it gives us the opportunity to hide the details of the RDMA protocol
    behind rsocket, removing most of the code in rdma.c and also some of the hacks in the migration main process;
iii. implement the advanced features, such as multifd and multi-URI, for RDMA migration.

Since I am not familiar with rsocket, I need some time to look at it and do a quick verification of RDMA migration based on rsocket.
But, yes, I am willing to get involved in this refactoring work and to see if we can make this migration feature better :)
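
To make step ii slightly more concrete, here is a rough, untested sketch of
the idea. It is not the real QIOChannel/QOM boilerplate QEMU would need,
the "Sketch"-prefixed names are made up for illustration, and it assumes
the rreadv()/rwritev()/rclose() wrappers from rdma-core's <rdma/rsocket.h>:

/*
 * Illustrative sketch only: a channel backend that simply forwards
 * vectored I/O to an rsocket fd.  If this is enough to satisfy the
 * QIOChannel contract (readv/writev/close), the RDMA-specific protocol
 * code in rdma.c should no longer be needed on this path.
 */
#include <sys/types.h>
#include <sys/uio.h>
#include <rdma/rsocket.h>

typedef struct SketchRSocketChannel {
    int fd;   /* fd obtained from rsocket()/raccept(), not a TCP socket */
} SketchRSocketChannel;

static ssize_t sketch_channel_writev(SketchRSocketChannel *c,
                                     const struct iovec *iov, int niov)
{
    return rwritev(c->fd, iov, niov);
}

static ssize_t sketch_channel_readv(SketchRSocketChannel *c,
                                    const struct iovec *iov, int niov)
{
    return rreadv(c->fd, iov, niov);
}

static int sketch_channel_close(SketchRSocketChannel *c)
{
    return rclose(c->fd);
}

In QEMU proper this would presumably sit behind the QIOChannel callbacks,
much like QIOChannelSocket wraps an ordinary socket fd today.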
Peter Xu May 9, 2024, 2:13 p.m. UTC | #35
On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> That's a good news to see the socket abstraction for RDMA!
> When I was developed the series above, the most pain is the RDMA migration has no QIOChannel abstraction and i need to take a 'fake channel'
> for it which is awkward in code implementation.
> So, as far as I know, we can do this by
> i. the first thing is that we need to evaluate the rsocket is good enough to satisfy our QIOChannel fundamental abstraction
> ii. if it works right, then we will continue to see if it can give us opportunity to hide the detail of rdma protocol
>     into rsocket by remove most of code in rdma.c and also some hack in migration main process.
> iii. implement the advanced features like multi-fd and multi-uri for rdma migration.
> 
> Since I am not familiar with rsocket, I need some times to look at it and do some quick verify with rdma migration based on rsocket.
> But, yes, I am willing to involved in this refactor work and to see if we can make this migration feature more better:)

Based on what we have now, it looks like we'd better halt the deprecation
process a bit, so I think we shouldn't need to rush it at least in 9.1
then, and we'll need to see how it goes on the refactoring.

It'll be perfect if rsocket works, otherwise supporting multifd with little
overhead / exported APIs would also be a good thing in general with
whatever approach.  And obviously all based on the facts that we can get
resources from companies to support this feature first.

Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if
any of us can provide some test results please do so.  Many people are
saying RDMA is better, but I yet didn't see any numbers comparing it with
modern TCP networks.  I don't want to have old impressions floating around
even if things might have changed..  When we have consolidated results, we
should share them out and also reflect that in QEMU's migration docs when a
rdma document page is ready.

Chuan, please check the whole thread discussion, it may help to understand
what we are looking for on rdma migrations [1].  Meanwhile please feel free
to sync with Jinpu's team and see how to move forward with such a project.

[1] https://lore.kernel.org/qemu-devel/87frwatp7n.fsf@suse.de/

Thanks,
Jinpu Wang May 13, 2024, 7:30 a.m. UTC | #36
Hi Peter, Hi Chuan,

On Thu, May 9, 2024 at 4:14 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> > That's a good news to see the socket abstraction for RDMA!
> > When I was developed the series above, the most pain is the RDMA migration has no QIOChannel abstraction and i need to take a 'fake channel'
> > for it which is awkward in code implementation.
> > So, as far as I know, we can do this by
> > i. the first thing is that we need to evaluate the rsocket is good enough to satisfy our QIOChannel fundamental abstraction
> > ii. if it works right, then we will continue to see if it can give us opportunity to hide the detail of rdma protocol
> >     into rsocket by remove most of code in rdma.c and also some hack in migration main process.
> > iii. implement the advanced features like multi-fd and multi-uri for rdma migration.
> >
> > Since I am not familiar with rsocket, I need some times to look at it and do some quick verify with rdma migration based on rsocket.
> > But, yes, I am willing to involved in this refactor work and to see if we can make this migration feature more better:)
>
> Based on what we have now, it looks like we'd better halt the deprecation
> process a bit, so I think we shouldn't need to rush it at least in 9.1
> then, and we'll need to see how it goes on the refactoring.
>
> It'll be perfect if rsocket works, otherwise supporting multifd with little
> overhead / exported APIs would also be a good thing in general with
> whatever approach.  And obviously all based on the facts that we can get
> resources from companies to support this feature first.
>
> Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if
> any of us can provide some test results please do so.  Many people are
> saying RDMA is better, but I yet didn't see any numbers comparing it with
> modern TCP networks.  I don't want to have old impressions floating around
> even if things might have changed..  When we have consolidated results, we
> should share them out and also reflect that in QEMU's migration docs when a
> rdma document page is ready.
I also did tests with a Mellanox ConnectX-6 100 G RoCE NIC; the
results are mixed: with fewer than 3 streams native Ethernet is faster,
while with more than 3 streams rsocket performs better.

root@x4-right:~# iperf -c 1.1.1.16 -P 1
------------------------------------------------------------
Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 1.1.1.15 port 44214 connected with 1.1.1.16 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  52.9 GBytes  45.4 Gbits/sec
root@x4-right:~# iperf -c 1.1.1.16 -P 2
[  3] local 1.1.1.15 port 33118 connected with 1.1.1.16 port 5001
[  4] local 1.1.1.15 port 33130 connected with 1.1.1.16 port 5001
------------------------------------------------------------
Client connecting to 1.1.1.16, TCP port 5001
TCP window size: 4.00 MByte (default)
------------------------------------------------------------
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0001 sec  45.0 GBytes  38.7 Gbits/sec
[  4] 0.0000-10.0000 sec  43.9 GBytes  37.7 Gbits/sec
[SUM] 0.0000-10.0000 sec  88.9 GBytes  76.4 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.172/0.189/0.205/0.172 ms (tot/err) = 2/0
root@x4-right:~# iperf -c 1.1.1.16 -P 4
------------------------------------------------------------
Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  5] local 1.1.1.15 port 50748 connected with 1.1.1.16 port 5001
[  4] local 1.1.1.15 port 50734 connected with 1.1.1.16 port 5001
[  6] local 1.1.1.15 port 50764 connected with 1.1.1.16 port 5001
[  3] local 1.1.1.15 port 50730 connected with 1.1.1.16 port 5001
[ ID] Interval       Transfer     Bandwidth
[  6] 0.0000-10.0000 sec  24.7 GBytes  21.2 Gbits/sec
[  3] 0.0000-10.0004 sec  23.6 GBytes  20.3 Gbits/sec
[  4] 0.0000-10.0000 sec  27.8 GBytes  23.9 Gbits/sec
[  5] 0.0000-10.0000 sec  28.0 GBytes  24.0 Gbits/sec
[SUM] 0.0000-10.0000 sec   104 GBytes  89.4 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.104/0.156/0.204/0.124 ms (tot/err) = 4/0
root@x4-right:~# iperf -c 1.1.1.16 -P 8
[  4] local 1.1.1.15 port 55588 connected with 1.1.1.16 port 5001
[  5] local 1.1.1.15 port 55600 connected with 1.1.1.16 port 5001
------------------------------------------------------------
Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[ 10] local 1.1.1.15 port 55628 connected with 1.1.1.16 port 5001
[ 15] local 1.1.1.15 port 55648 connected with 1.1.1.16 port 5001
[  7] local 1.1.1.15 port 55620 connected with 1.1.1.16 port 5001
[  3] local 1.1.1.15 port 55584 connected with 1.1.1.16 port 5001
[ 14] local 1.1.1.15 port 55644 connected with 1.1.1.16 port 5001
[  6] local 1.1.1.15 port 55610 connected with 1.1.1.16 port 5001
[ ID] Interval       Transfer     Bandwidth
[  6] 0.0000-10.0015 sec  8.47 GBytes  7.27 Gbits/sec
[  4] 0.0000-10.0011 sec  8.62 GBytes  7.40 Gbits/sec
[  7] 0.0000-10.0000 sec  18.1 GBytes  15.5 Gbits/sec
[ 14] 0.0000-10.0000 sec  8.69 GBytes  7.46 Gbits/sec
[  5] 0.0000-10.0006 sec  18.5 GBytes  15.9 Gbits/sec
[ 10] 0.0000-10.0006 sec  16.1 GBytes  13.9 Gbits/sec
[  3] 0.0000-10.0000 sec  17.1 GBytes  14.6 Gbits/sec
[ 15] 0.0000-10.0016 sec  8.54 GBytes  7.34 Gbits/sec
[SUM] 0.0000-10.0017 sec   104 GBytes  89.4 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.049/0.095/0.213/0.062 ms (tot/err) = 8/0

root@x4-right:~#
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so  iperf -c
1.1.1.16 -P 1
------------------------------------------------------------
Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 1.1.1.15 port 45596 connected with 1.1.1.16 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  37.8 GBytes  32.5 Gbits/sec
root@x4-right:~#
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so  iperf -c
1.1.1.16 -P 2
[  4] local 1.1.1.15 port 46782 connected with 1.1.1.16 port 5001
------------------------------------------------------------
Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 1.1.1.15 port 43237 connected with 1.1.1.16 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.0000 sec  37.5 GBytes  32.2 Gbits/sec
[  3] 0.0000-10.0000 sec  40.7 GBytes  34.9 Gbits/sec
[SUM] 0.0000-10.0000 sec  78.2 GBytes  67.2 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
5.819/6.579/7.340/7.340 ms (tot/err) = 2/0
root@x4-right:~#
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so  iperf -c
1.1.1.16 -P 4
[  4] local 1.1.1.15 port 60385 connected with 1.1.1.16 port 5001
[  7] local 1.1.1.15 port 55203 connected with 1.1.1.16 port 5001
[  6] local 1.1.1.15 port 35084 connected with 1.1.1.16 port 5001
------------------------------------------------------------
Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 1.1.1.15 port 37253 connected with 1.1.1.16 port 5001
[ ID] Interval       Transfer     Bandwidth
[  6] 0.0000-10.0000 sec  28.4 GBytes  24.4 Gbits/sec
[  4] 0.0000-10.0000 sec  28.3 GBytes  24.3 Gbits/sec
[  7] 0.0000-10.0000 sec  28.4 GBytes  24.4 Gbits/sec
[  3] 0.0000-10.0001 sec  28.2 GBytes  24.3 Gbits/sec
[SUM] 0.0000-10.0001 sec   113 GBytes  97.3 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
5.311/7.579/10.019/4.165 ms (tot/err) = 4/0
root@x4-right:~#
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so  iperf -c
1.1.1.16 -P 8
[  8] local 1.1.1.15 port 33684 connected with 1.1.1.16 port 5001
[ 10] local 1.1.1.15 port 40620 connected with 1.1.1.16 port 5001
------------------------------------------------------------
Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 1.1.1.15 port 56988 connected with 1.1.1.16 port 5001
[  4] local 1.1.1.15 port 51139 connected with 1.1.1.16 port 5001
[ 12] local 1.1.1.15 port 44712 connected with 1.1.1.16 port 5001
[  5] local 1.1.1.15 port 50838 connected with 1.1.1.16 port 5001
[  6] local 1.1.1.15 port 51334 connected with 1.1.1.16 port 5001
[  9] local 1.1.1.15 port 40611 connected with 1.1.1.16 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  13.8 GBytes  11.9 Gbits/sec
[  5] 0.0000-10.0001 sec  13.9 GBytes  11.9 Gbits/sec
[ 12] 0.0000-10.0001 sec  13.8 GBytes  11.9 Gbits/sec
[ 10] 0.0000-10.0001 sec  13.9 GBytes  11.9 Gbits/sec
[  9] 0.0000-10.0000 sec  13.8 GBytes  11.9 Gbits/sec
[  6] 0.0000-10.0000 sec  13.9 GBytes  11.9 Gbits/sec
[  8] 0.0000-10.0000 sec  13.8 GBytes  11.9 Gbits/sec
[  4] 0.0000-10.0001 sec  13.8 GBytes  11.9 Gbits/sec
[SUM] 0.0000-10.0001 sec   111 GBytes  95.1 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
5.973/10.699/15.943/4.251 ms (tot/err) = 8/0
root@x4-right:~#
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so  iperf -c
1.1.1.16 -P 1
------------------------------------------------------------
Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 1.1.1.15 port 36960 connected with 1.1.1.16 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0000 sec  41.1 GBytes  35.3 Gbits/sec
root@x4-right:~#
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so  iperf -c
1.1.1.16 -P 2
[  3] local 1.1.1.15 port 32799 connected with 1.1.1.16 port 5001
------------------------------------------------------------
Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  4] local 1.1.1.15 port 35912 connected with 1.1.1.16 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.0000 sec  36.6 GBytes  31.4 Gbits/sec
[  3] 0.0000-10.0000 sec  36.6 GBytes  31.4 Gbits/sec
[SUM] 0.0000-10.0000 sec  73.2 GBytes  62.9 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
5.172/5.842/6.512/6.512 ms (tot/err) = 2/0
root@x4-right:~#
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/rsocket/librspreload.so  iperf -c 1.1.1.16 -P 4
[  4] local 1.1.1.15 port 53311 connected with 1.1.1.16 port 5001
------------------------------------------------------------
Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  3] local 1.1.1.15 port 37243 connected with 1.1.1.16 port 5001
[  7] local 1.1.1.15 port 60801 connected with 1.1.1.16 port 5001
[  6] local 1.1.1.15 port 49694 connected with 1.1.1.16 port 5001
[ ID] Interval       Transfer     Bandwidth
[  6] 0.0000-10.0000 sec  28.2 GBytes  24.2 Gbits/sec
[  7] 0.0000-10.0000 sec  28.2 GBytes  24.3 Gbits/sec
[  3] 0.0000-10.0000 sec  28.2 GBytes  24.2 Gbits/sec
[  4] 0.0000-10.0000 sec  28.2 GBytes  24.2 Gbits/sec
[SUM] 0.0000-10.0000 sec   113 GBytes  96.9 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
5.570/7.762/10.045/4.265 ms (tot/err) = 4/0
root@x4-right:~#


>
> Chuan, please check the whole thread discussion, it may help to understand
> what we are looking for on rdma migrations [1].  Meanwhile please feel free
> to sync with Jinpu's team and see how to move forward with such a project.
We are happy to work with the community to improve RDMA migration.

>
> [1] https://lore.kernel.org/qemu-devel/87frwatp7n.fsf@suse.de/
>
> Thanks,
Regards!
>
> --
> Peter Xu
>
Yu Zhang May 14, 2024, 3:19 p.m. UTC | #37
Hello Peter and all,

I did a comparison of the VM live-migration speeds between RDMA and
TCP/IP on our servers and plotted the results to get an initial
impression. Unfortunately, the Ethernet NICs are not recent ones, so
the results may not mean much. I can redo the test on servers with
more recent Ethernet NICs and keep you updated.

It seems that the benefits of RDMA become obvious when the VM has
large memory and is running a memory-intensive workload.

Best regards,
Yu Zhang @ IONOS Cloud

On Thu, May 9, 2024 at 4:14 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> > That's a good news to see the socket abstraction for RDMA!
> > When I was developed the series above, the most pain is the RDMA migration has no QIOChannel abstraction and i need to take a 'fake channel'
> > for it which is awkward in code implementation.
> > So, as far as I know, we can do this by
> > i. the first thing is that we need to evaluate the rsocket is good enough to satisfy our QIOChannel fundamental abstraction
> > ii. if it works right, then we will continue to see if it can give us opportunity to hide the detail of rdma protocol
> >     into rsocket by remove most of code in rdma.c and also some hack in migration main process.
> > iii. implement the advanced features like multi-fd and multi-uri for rdma migration.
> >
> > Since I am not familiar with rsocket, I need some times to look at it and do some quick verify with rdma migration based on rsocket.
> > But, yes, I am willing to involved in this refactor work and to see if we can make this migration feature more better:)
>
> Based on what we have now, it looks like we'd better halt the deprecation
> process a bit, so I think we shouldn't need to rush it at least in 9.1
> then, and we'll need to see how it goes on the refactoring.
>
> It'll be perfect if rsocket works, otherwise supporting multifd with little
> overhead / exported APIs would also be a good thing in general with
> whatever approach.  And obviously all based on the facts that we can get
> resources from companies to support this feature first.
>
> Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if
> any of us can provide some test results please do so.  Many people are
> saying RDMA is better, but I yet didn't see any numbers comparing it with
> modern TCP networks.  I don't want to have old impressions floating around
> even if things might have changed..  When we have consolidated results, we
> should share them out and also reflect that in QEMU's migration docs when a
> rdma document page is ready.
>
> Chuan, please check the whole thread discussion, it may help to understand
> what we are looking for on rdma migrations [1].  Meanwhile please feel free
> to sync with Jinpu's team and see how to move forward with such a project.
>
> [1] https://lore.kernel.org/qemu-devel/87frwatp7n.fsf@suse.de/
>
> Thanks,
>
> --
> Peter Xu
>
Michael Galaxy May 16, 2024, 5:29 p.m. UTC | #38
These are very compelling results, no?

(40gbps cards, right? Are the cards active/active? or active/standby?)

- Michael

On 5/14/24 10:19, Yu Zhang wrote:
> Hello Peter and all,
>
> I did a comparison of the VM live-migration speeds between RDMA and
> TCP/IP on our servers
> and plotted the results to get an initial impression. Unfortunately,
> the Ethernet NICs are not the
> recent ones, therefore, it may not make much sense. I can do it on
> servers with more recent Ethernet
> NICs and keep you updated.
>
> It seems that the benefits of RDMA becomes obviously when the VM has
> large memory and is
> running memory-intensive workload.
>
> Best regards,
> Yu Zhang @ IONOS Cloud
>
> On Thu, May 9, 2024 at 4:14 PM Peter Xu <peterx@redhat.com> wrote:
>> On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
>>> That's a good news to see the socket abstraction for RDMA!
>>> When I was developed the series above, the most pain is the RDMA migration has no QIOChannel abstraction and i need to take a 'fake channel'
>>> for it which is awkward in code implementation.
>>> So, as far as I know, we can do this by
>>> i. the first thing is that we need to evaluate the rsocket is good enough to satisfy our QIOChannel fundamental abstraction
>>> ii. if it works right, then we will continue to see if it can give us opportunity to hide the detail of rdma protocol
>>>      into rsocket by remove most of code in rdma.c and also some hack in migration main process.
>>> iii. implement the advanced features like multi-fd and multi-uri for rdma migration.
>>>
>>> Since I am not familiar with rsocket, I need some times to look at it and do some quick verify with rdma migration based on rsocket.
>>> But, yes, I am willing to involved in this refactor work and to see if we can make this migration feature more better:)
>> Based on what we have now, it looks like we'd better halt the deprecation
>> process a bit, so I think we shouldn't need to rush it at least in 9.1
>> then, and we'll need to see how it goes on the refactoring.
>>
>> It'll be perfect if rsocket works, otherwise supporting multifd with little
>> overhead / exported APIs would also be a good thing in general with
>> whatever approach.  And obviously all based on the facts that we can get
>> resources from companies to support this feature first.
>>
>> Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if
>> any of us can provide some test results please do so.  Many people are
>> saying RDMA is better, but I yet didn't see any numbers comparing it with
>> modern TCP networks.  I don't want to have old impressions floating around
>> even if things might have changed..  When we have consolidated results, we
>> should share them out and also reflect that in QEMU's migration docs when a
>> rdma document page is ready.
>>
>> Chuan, please check the whole thread discussion, it may help to understand
>> what we are looking for on rdma migrations [1].  Meanwhile please feel free
>> to sync with Jinpu's team and see how to move forward with such a project.
>>
>> [1] https://lore.kernel.org/qemu-devel/87frwatp7n.fsf@suse.de/
>>
>> Thanks,
>>
>> --
>> Peter Xu
>>
Yu Zhang May 17, 2024, 1:01 p.m. UTC | #39
Hello Michael and Peter,

Exactly, not so compelling, as I did it first only on servers widely
used for production in our data center. The network adapters are

Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
2-port Gigabit Ethernet PCIe
InfiniBand controller: Mellanox Technologies MT27800 Family [ConnectX-5]

which don't meet our purpose: I can choose RDMA or TCP for VM
migration, but RDMA traffic goes through InfiniBand and TCP through
Ethernet on these two hosts. One is standby while the other is active.

Now I'll try on a server with more recent Ethernet and InfiniBand
network adapters. One of them has:
BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)

The comparison between RDMA and TCP on the same NIC could make more sense.

Best regards,
Yu Zhang @ IONOS Cloud







On Thu, May 16, 2024 at 7:30 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>
> These are very compelling results, no?
>
> (40gbps cards, right? Are the cards active/active? or active/standby?)
>
> - Michael
>
> On 5/14/24 10:19, Yu Zhang wrote:
> > Hello Peter and all,
> >
> > I did a comparison of the VM live-migration speeds between RDMA and
> > TCP/IP on our servers
> > and plotted the results to get an initial impression. Unfortunately,
> > the Ethernet NICs are not the
> > recent ones, therefore, it may not make much sense. I can do it on
> > servers with more recent Ethernet
> > NICs and keep you updated.
> >
> > It seems that the benefits of RDMA becomes obviously when the VM has
> > large memory and is
> > running memory-intensive workload.
> >
> > Best regards,
> > Yu Zhang @ IONOS Cloud
> >
> > On Thu, May 9, 2024 at 4:14 PM Peter Xu <peterx@redhat.com> wrote:
> >> On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> >>> That's a good news to see the socket abstraction for RDMA!
> >>> When I was developed the series above, the most pain is the RDMA migration has no QIOChannel abstraction and i need to take a 'fake channel'
> >>> for it which is awkward in code implementation.
> >>> So, as far as I know, we can do this by
> >>> i. the first thing is that we need to evaluate the rsocket is good enough to satisfy our QIOChannel fundamental abstraction
> >>> ii. if it works right, then we will continue to see if it can give us opportunity to hide the detail of rdma protocol
> >>>      into rsocket by remove most of code in rdma.c and also some hack in migration main process.
> >>> iii. implement the advanced features like multi-fd and multi-uri for rdma migration.
> >>>
> >>> Since I am not familiar with rsocket, I need some times to look at it and do some quick verify with rdma migration based on rsocket.
> >>> But, yes, I am willing to involved in this refactor work and to see if we can make this migration feature more better:)
> >> Based on what we have now, it looks like we'd better halt the deprecation
> >> process a bit, so I think we shouldn't need to rush it at least in 9.1
> >> then, and we'll need to see how it goes on the refactoring.
> >>
> >> It'll be perfect if rsocket works, otherwise supporting multifd with little
> >> overhead / exported APIs would also be a good thing in general with
> >> whatever approach.  And obviously all based on the facts that we can get
> >> resources from companies to support this feature first.
> >>
> >> Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if
> >> any of us can provide some test results please do so.  Many people are
> >> saying RDMA is better, but I yet didn't see any numbers comparing it with
> >> modern TCP networks.  I don't want to have old impressions floating around
> >> even if things might have changed..  When we have consolidated results, we
> >> should share them out and also reflect that in QEMU's migration docs when a
> >> rdma document page is ready.
> >>
> >> Chuan, please check the whole thread discussion, it may help to understand
> >> what we are looking for on rdma migrations [1].  Meanwhile please feel free
> >> to sync with Jinpu's team and see how to move forward with such a project.
> >>
> >> [1] https://lore.kernel.org/qemu-devel/87frwatp7n.fsf@suse.de/
> >>
> >> Thanks,
> >>
> >> --
> >> Peter Xu
> >>
"Xingtao Yao (Fujitsu)" via May 28, 2024, 9:06 a.m. UTC | #40
Hi Peter,

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, May 22, 2024 6:15 AM
> To: Yu Zhang <yu.zhang@ionos.com>
> Cc: Michael Galaxy <mgalaxy@akamai.com>; Jinpu Wang
> <jinpu.wang@ionos.com>; Elmar Gerdes <elmar.gerdes@ionos.com>;
> zhengchuan <zhengchuan@huawei.com>; Gonglei (Arei)
> <arei.gonglei@huawei.com>; Daniel P. Berrangé <berrange@redhat.com>;
> Markus Armbruster <armbru@redhat.com>; Zhijian Li (Fujitsu)
> <lizhijian@fujitsu.com>; qemu-devel@nongnu.org; Yuval Shaia
> <yuval.shaia.ml@gmail.com>; Kevin Wolf <kwolf@redhat.com>; Prasanna
> Kumar Kalever <prasanna.kalever@redhat.com>; Cornelia Huck
> <cohuck@redhat.com>; Michael Roth <michael.roth@amd.com>; Prasanna
> Kumar Kalever <prasanna4324@gmail.com>; Paolo Bonzini
> <pbonzini@redhat.com>; qemu-block@nongnu.org; devel@lists.libvirt.org;
> Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin <mst@redhat.com>;
> Thomas Huth <thuth@redhat.com>; Eric Blake <eblake@redhat.com>; Song
> Gao <gaosong@loongson.cn>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; Alex Bennée <alex.bennee@linaro.org>;
> Wainer dos Santos Moschetta <wainersm@redhat.com>; Beraldo Leal
> <bleal@redhat.com>; Pannengyuan <pannengyuan@huawei.com>;
> Xiexiangyou <xiexiangyou@huawei.com>; Fabiano Rosas <farosas@suse.de>
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > Hello Michael and Peter,
> 
> Hi,
> 
> >
> > Exactly, not so compelling, as I did it first only on servers widely
> > used for production in our data center. The network adapters are
> >
> > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > 2-port Gigabit Ethernet PCIe
> 
> Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more reasonable.
> 
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> 
> Appreciate a lot for everyone helping on the testings.
> 
> > InfiniBand controller: Mellanox Technologies MT27800 Family
> > [ConnectX-5]
> >
> > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > on these two hosts. One is standby while the other is active.
> >
> > Now I'll try on a server with more recent Ethernet and InfiniBand
> > network adapters. One of them has:
> > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> >
> > The comparison between RDMA and TCP on the same NIC could make more
> sense.
> 
> It looks to me NICs are powerful now, but again as I mentioned I don't think it's
> a reason we need to deprecate rdma, especially if QEMU's rdma migration has
> the chance to be refactored using rsocket.
> 
> Is there anyone who started looking into that direction?  Would it make sense
> we start some PoC now?
> 

My team has finished the PoC refactoring, and it works well.

Progress:
1.  Implement io/channel-rdma.c.
2.  Add the unit test tests/unit/test-io-channel-rdma.c and verify that it passes.
3.  Remove the original code from migration/rdma.c.
4.  Rewrite the rdma_start_outgoing_migration and rdma_start_incoming_migration logic.
5.  Remove all rdma_xxx functions from migration/ram.c (to keep RDMA live migration from polluting the core logic of live migration).
6.  Use soft-RoCE (software-emulated RDMA) to test RDMA live migration; it succeeds.

We will submit the patchset later.
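
For reference, below is a minimal sketch of the socket-style calls from
librdmacm's <rdma/rsocket.h> that an io/channel-rdma.c backend would
presumably wrap. This is illustrative only, not the PoC code; error
handling is trimmed and the helper name rdma_connect_and_send is made up.

/* Illustrative only: connect over RDMA and send a buffer using the
 * socket-like rsocket API from librdmacm. */
#include <rdma/rsocket.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>

static ssize_t rdma_connect_and_send(const struct sockaddr_in *dst,
                                     const void *buf, size_t len)
{
    int fd = rsocket(AF_INET, SOCK_STREAM, 0);   /* rsocket(), not socket() */
    if (fd < 0) {
        return -1;
    }
    if (rconnect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
        rclose(fd);
        return -1;
    }
    ssize_t n = rsend(fd, buf, len, 0);          /* mirrors send() */
    rclose(fd);
    return n;
}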


Regards,
-Gonglei

> Thanks,
> 
> --
> Peter Xu
Jinpu Wang May 28, 2024, 9:11 a.m. UTC | #41
Hi Gonglei,

On Tue, May 28, 2024 at 11:06 AM Gonglei (Arei) <arei.gonglei@huawei.com> wrote:
>
> Hi Peter,
>
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, May 22, 2024 6:15 AM
> > To: Yu Zhang <yu.zhang@ionos.com>
> > Cc: Michael Galaxy <mgalaxy@akamai.com>; Jinpu Wang
> > <jinpu.wang@ionos.com>; Elmar Gerdes <elmar.gerdes@ionos.com>;
> > zhengchuan <zhengchuan@huawei.com>; Gonglei (Arei)
> > <arei.gonglei@huawei.com>; Daniel P. Berrangé <berrange@redhat.com>;
> > Markus Armbruster <armbru@redhat.com>; Zhijian Li (Fujitsu)
> > <lizhijian@fujitsu.com>; qemu-devel@nongnu.org; Yuval Shaia
> > <yuval.shaia.ml@gmail.com>; Kevin Wolf <kwolf@redhat.com>; Prasanna
> > Kumar Kalever <prasanna.kalever@redhat.com>; Cornelia Huck
> > <cohuck@redhat.com>; Michael Roth <michael.roth@amd.com>; Prasanna
> > Kumar Kalever <prasanna4324@gmail.com>; Paolo Bonzini
> > <pbonzini@redhat.com>; qemu-block@nongnu.org; devel@lists.libvirt.org;
> > Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin <mst@redhat.com>;
> > Thomas Huth <thuth@redhat.com>; Eric Blake <eblake@redhat.com>; Song
> > Gao <gaosong@loongson.cn>; Marc-André Lureau
> > <marcandre.lureau@redhat.com>; Alex Bennée <alex.bennee@linaro.org>;
> > Wainer dos Santos Moschetta <wainersm@redhat.com>; Beraldo Leal
> > <bleal@redhat.com>; Pannengyuan <pannengyuan@huawei.com>;
> > Xiexiangyou <xiexiangyou@huawei.com>; Fabiano Rosas <farosas@suse.de>
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> >
> > On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > > Hello Michael and Peter,
> >
> > Hi,
> >
> > >
> > > Exactly, not so compelling, as I did it first only on servers widely
> > > used for production in our data center. The network adapters are
> > >
> > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > > 2-port Gigabit Ethernet PCIe
> >
> > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more reasonable.
> >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> >
> > Appreciate a lot for everyone helping on the testings.
> >
> > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > [ConnectX-5]
> > >
> > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > > on these two hosts. One is standby while the other is active.
> > >
> > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > network adapters. One of them has:
> > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > >
> > > The comparison between RDMA and TCP on the same NIC could make more
> > sense.
> >
> > It looks to me NICs are powerful now, but again as I mentioned I don't think it's
> > a reason we need to deprecate rdma, especially if QEMU's rdma migration has
> > the chance to be refactored using rsocket.
> >
> > Is there anyone who started looking into that direction?  Would it make sense
> > we start some PoC now?
> >
>
> My team has finished the PoC refactoring which works well.
>
> Progress:
> 1.  Implement io/channel-rdma.c,
> 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it is successful,
> 3.  Remove the original code from migration/rdma.c,
> 4.  Rewrite the rdma_start_outgoing_migration and rdma_start_incoming_migration logic,
> 5.  Remove all rdma_xxx functions from migration/ram.c. (to prevent RDMA live migration from polluting the core logic of live migration),
> 6.  The soft-RoCE implemented by software is used to test the RDMA live migration. It's successful.
>
> We will be submit the patchset later.
>
Thanks for working on this PoC and sharing the progress; we are
looking forward to the patchset.

>
> Regards,
> -Gonglei
Regards!
Jinpu
>
> > Thanks,
> >
> > --
> > Peter Xu
>
Peter Xu May 28, 2024, 3:54 p.m. UTC | #42
On Tue, May 28, 2024 at 09:06:04AM +0000, Gonglei (Arei) wrote:
> Hi Peter,
> 
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, May 22, 2024 6:15 AM
> > To: Yu Zhang <yu.zhang@ionos.com>
> > Cc: Michael Galaxy <mgalaxy@akamai.com>; Jinpu Wang
> > <jinpu.wang@ionos.com>; Elmar Gerdes <elmar.gerdes@ionos.com>;
> > zhengchuan <zhengchuan@huawei.com>; Gonglei (Arei)
> > <arei.gonglei@huawei.com>; Daniel P. Berrangé <berrange@redhat.com>;
> > Markus Armbruster <armbru@redhat.com>; Zhijian Li (Fujitsu)
> > <lizhijian@fujitsu.com>; qemu-devel@nongnu.org; Yuval Shaia
> > <yuval.shaia.ml@gmail.com>; Kevin Wolf <kwolf@redhat.com>; Prasanna
> > Kumar Kalever <prasanna.kalever@redhat.com>; Cornelia Huck
> > <cohuck@redhat.com>; Michael Roth <michael.roth@amd.com>; Prasanna
> > Kumar Kalever <prasanna4324@gmail.com>; Paolo Bonzini
> > <pbonzini@redhat.com>; qemu-block@nongnu.org; devel@lists.libvirt.org;
> > Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin <mst@redhat.com>;
> > Thomas Huth <thuth@redhat.com>; Eric Blake <eblake@redhat.com>; Song
> > Gao <gaosong@loongson.cn>; Marc-André Lureau
> > <marcandre.lureau@redhat.com>; Alex Bennée <alex.bennee@linaro.org>;
> > Wainer dos Santos Moschetta <wainersm@redhat.com>; Beraldo Leal
> > <bleal@redhat.com>; Pannengyuan <pannengyuan@huawei.com>;
> > Xiexiangyou <xiexiangyou@huawei.com>; Fabiano Rosas <farosas@suse.de>
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> > 
> > On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > > Hello Michael and Peter,
> > 
> > Hi,
> > 
> > >
> > > Exactly, not so compelling, as I did it first only on servers widely
> > > used for production in our data center. The network adapters are
> > >
> > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > > 2-port Gigabit Ethernet PCIe
> > 
> > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more reasonable.
> > 
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > 
> > Appreciate a lot for everyone helping on the testings.
> > 
> > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > [ConnectX-5]
> > >
> > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > > on these two hosts. One is standby while the other is active.
> > >
> > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > network adapters. One of them has:
> > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > >
> > > The comparison between RDMA and TCP on the same NIC could make more
> > sense.
> > 
> > It looks to me NICs are powerful now, but again as I mentioned I don't think it's
> > a reason we need to deprecate rdma, especially if QEMU's rdma migration has
> > the chance to be refactored using rsocket.
> > 
> > Is there anyone who started looking into that direction?  Would it make sense
> > we start some PoC now?
> > 
> 
> My team has finished the PoC refactoring which works well. 
> 
> Progress:
> 1.  Implement io/channel-rdma.c,
> 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it is successful,
> 3.  Remove the original code from migration/rdma.c,
> 4.  Rewrite the rdma_start_outgoing_migration and rdma_start_incoming_migration logic,
> 5.  Remove all rdma_xxx functions from migration/ram.c. (to prevent RDMA live migration from polluting the core logic of live migration),
> 6.  The soft-RoCE implemented by software is used to test the RDMA live migration. It's successful.
> 
> We will be submit the patchset later.

That's great news, thank you!
"Xingtao Yao (Fujitsu)" via May 29, 2024, 2:43 a.m. UTC | #43
Hi,

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Tuesday, May 28, 2024 11:55 PM
> > > > Exactly, not so compelling, as I did it first only on servers
> > > > widely used for production in our data center. The network
> > > > adapters are
> > > >
> > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > BCM5720 2-port Gigabit Ethernet PCIe
> > >
> > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> reasonable.
> > >
> > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > >
> > > Appreciate a lot for everyone helping on the testings.
> > >
> > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > [ConnectX-5]
> > > >
> > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > Ethernet on these two hosts. One is standby while the other is active.
> > > >
> > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > network adapters. One of them has:
> > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > >
> > > > The comparison between RDMA and TCP on the same NIC could make
> > > > more
> > > sense.
> > >
> > > It looks to me NICs are powerful now, but again as I mentioned I
> > > don't think it's a reason we need to deprecate rdma, especially if
> > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > >
> > > Is there anyone who started looking into that direction?  Would it
> > > make sense we start some PoC now?
> > >
> >
> > My team has finished the PoC refactoring which works well.
> >
> > Progress:
> > 1.  Implement io/channel-rdma.c,
> > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > Rewrite the rdma_start_outgoing_migration and
> > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > from migration/ram.c. (to prevent RDMA live migration from polluting the
> core logic of live migration), 6.  The soft-RoCE implemented by software is
> used to test the RDMA live migration. It's successful.
> >
> > We will be submit the patchset later.
> 
> That's great news, thank you!
> 
> --
> Peter Xu

For RDMA programming, the current mainstream implementation is to use rdma_cm to establish a connection and then use verbs to transmit data.

rdma_cm and ibverbs each create their own fd, and the two fds have different responsibilities: the rdma_cm fd is used to signal connection-establishment events, while the verbs (completion channel) fd is used to signal new CQEs. When poll/epoll is performed directly on the rdma_cm fd, only a POLLIN event can be observed, which means an rdma_cm event has occurred. When the verbs fd is polled/epolled directly, only a POLLIN event can be observed, which indicates that a new CQE has been generated.

Rsocket is a sub-module shipped with the rdma_cm library and provides RDMA calls whose interfaces closely mirror the socket API. However, the library exposes only the rdma_cm fd (for connection setup events) and does not expose the verbs fd (for readable/writable data events). The only way to wait for those events is the rpoll interface provided by rsocket. QEMU, however, uses ppoll to wait on the fd obtained from raccept, i.e. the rdma_cm fd, and therefore cannot be woken up for verbs fd events. Only hacks can work around this problem at the moment.
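
To make the two-fd split concrete, here is a minimal sketch of how the
fds are normally multiplexed when rdma_cm and verbs are used directly,
without rsocket. It is illustrative only (not QEMU or PoC code); error
handling is omitted and the helper name is made up.

#include <poll.h>
#include <rdma/rdma_cma.h>
#include <infiniband/verbs.h>

/* Wait on both fds: one for connection events, one for completion events. */
static void wait_for_events(struct rdma_event_channel *cm_chan,
                            struct ibv_comp_channel *comp_chan)
{
    struct pollfd pfds[2] = {
        { .fd = cm_chan->fd,   .events = POLLIN },   /* rdma_cm events   */
        { .fd = comp_chan->fd, .events = POLLIN },   /* new CQEs (verbs) */
    };

    poll(pfds, 2, -1);

    if (pfds[0].revents & POLLIN) {
        struct rdma_cm_event *ev;
        rdma_get_cm_event(cm_chan, &ev);          /* connection state change */
        rdma_ack_cm_event(ev);
    }
    if (pfds[1].revents & POLLIN) {
        struct ibv_cq *cq;
        void *ctx;
        ibv_get_cq_event(comp_chan, &cq, &ctx);   /* at least one new CQE */
        ibv_ack_cq_events(cq, 1);
        ibv_req_notify_cq(cq, 0);                 /* re-arm notification */
        /* drain the CQ here with ibv_poll_cq() */
    }
}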

Do you guys have any ideas? Thanks.


Regards,
-Gonglei
Jinpu Wang May 29, 2024, 4:33 a.m. UTC | #44
On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei) <arei.gonglei@huawei.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > Exactly, not so compelling, as I did it first only on servers
> > > > > widely used for production in our data center. The network
> > > > > adapters are
> > > > >
> > > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > >
> > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> > reasonable.
> > > >
> > > >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > > >
> > > > Appreciate a lot for everyone helping on the testings.
> > > >
> > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > [ConnectX-5]
> > > > >
> > > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > > Ethernet on these two hosts. One is standby while the other is active.
> > > > >
> > > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > > network adapters. One of them has:
> > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > > >
> > > > > The comparison between RDMA and TCP on the same NIC could make
> > > > > more
> > > > sense.
> > > >
> > > > It looks to me NICs are powerful now, but again as I mentioned I
> > > > don't think it's a reason we need to deprecate rdma, especially if
> > > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > > >
> > > > Is there anyone who started looking into that direction?  Would it
> > > > make sense we start some PoC now?
> > > >
> > >
> > > My team has finished the PoC refactoring which works well.
> > >
> > > Progress:
> > > 1.  Implement io/channel-rdma.c,
> > > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > Rewrite the rdma_start_outgoing_migration and
> > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > > from migration/ram.c. (to prevent RDMA live migration from polluting the
> > core logic of live migration), 6.  The soft-RoCE implemented by software is
> > used to test the RDMA live migration. It's successful.
> > >
> > > We will be submit the patchset later.
> >
> > That's great news, thank you!
> >
> > --
> > Peter Xu
>
> For rdma programming, the current mainstream implementation is to use rdma_cm to establish a connection, and then use verbs to transmit data.
>
> rdma_cm and ibverbs create two FDs respectively. The two FDs have different responsibilities. rdma_cm fd is used to notify connection establishment events,
> and verbs fd is used to notify new CQEs. When poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin event can be monitored, which means
> that an rdma_cm event occurs. When the verbs fd is directly polled/epolled, only the pollin event can be listened, which indicates that a new CQE is generated.
>
> Rsocket is a sub-module attached to the rdma_cm library and provides rdma calls that are completely similar to socket interfaces. However, this library returns
> only the rdma_cm fd for listening to link setup-related events and does not expose the verbs fd (readable and writable events for listening to data). Only the rpoll
> interface provided by the RSocket can be used to listen to related events. However, QEMU uses the ppoll interface to listen to the rdma_cm fd (gotten by raccept API).
> And cannot listen to the verbs fd event. Only some hacking methods can be used to address this problem.
>
> Do you guys have any ideas? Thanks.
+cc linux-rdma
+cc Sean



>
>
> Regards,
> -Gonglei
Greg Sword May 29, 2024, 6:05 a.m. UTC | #45
On Wed, May 29, 2024 at 12:33 PM Jinpu Wang <jinpu.wang@ionos.com> wrote:
>
> On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei) <arei.gonglei@huawei.com> wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > Exactly, not so compelling, as I did it first only on servers
> > > > > > widely used for production in our data center. The network
> > > > > > adapters are
> > > > > >
> > > > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > >
> > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> > > reasonable.
> > > > >
> > > > >
> > > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > > > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > > > >
> > > > > Appreciate a lot for everyone helping on the testings.
> > > > >
> > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > [ConnectX-5]
> > > > > >
> > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > > > Ethernet on these two hosts. One is standby while the other is active.
> > > > > >
> > > > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > > > network adapters. One of them has:
> > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > > > >
> > > > > > The comparison between RDMA and TCP on the same NIC could make
> > > > > > more
> > > > > sense.
> > > > >
> > > > > It looks to me NICs are powerful now, but again as I mentioned I
> > > > > don't think it's a reason we need to deprecate rdma, especially if
> > > > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > > > >
> > > > > Is there anyone who started looking into that direction?  Would it
> > > > > make sense we start some PoC now?
> > > > >
> > > >
> > > > My team has finished the PoC refactoring which works well.
> > > >
> > > > Progress:
> > > > 1.  Implement io/channel-rdma.c,
> > > > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > > > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > Rewrite the rdma_start_outgoing_migration and
> > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > > > from migration/ram.c. (to prevent RDMA live migration from polluting the
> > > core logic of live migration), 6.  The soft-RoCE implemented by software is
> > > used to test the RDMA live migration. It's successful.
> > > >
> > > > We will be submit the patchset later.
> > >
> > > That's great news, thank you!
> > >
> > > --
> > > Peter Xu
> >
> > For rdma programming, the current mainstream implementation is to use rdma_cm to establish a connection, and then use verbs to transmit data.
> >
> > rdma_cm and ibverbs create two FDs respectively. The two FDs have different responsibilities. rdma_cm fd is used to notify connection establishment events,
> > and verbs fd is used to notify new CQEs. When poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin event can be monitored, which means
> > that an rdma_cm event occurs. When the verbs fd is directly polled/epolled, only the pollin event can be listened, which indicates that a new CQE is generated.
> >
> > Rsocket is a sub-module attached to the rdma_cm library and provides rdma calls that are completely similar to socket interfaces. However, this library returns
> > only the rdma_cm fd for listening to link setup-related events and does not expose the verbs fd (readable and writable events for listening to data). Only the rpoll
> > interface provided by the RSocket can be used to listen to related events. However, QEMU uses the ppoll interface to listen to the rdma_cm fd (gotten by raccept API).
> > And cannot listen to the verbs fd event. Only some hacking methods can be used to address this problem.
> >
> > Do you guys have any ideas? Thanks.
> +cc linux-rdma

Why include the rdma community?

> +cc Sean
>
>
>
> >
> >
> > Regards,
> > -Gonglei
>
Jinpu Wang May 29, 2024, 7:04 a.m. UTC | #46
On Wed, May 29, 2024 at 8:08 AM Greg Sword <gregsword0@gmail.com> wrote:
>
> On Wed, May 29, 2024 at 12:33 PM Jinpu Wang <jinpu.wang@ionos.com> wrote:
> >
> > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei) <arei.gonglei@huawei.com> wrote:
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > Exactly, not so compelling, as I did it first only on servers
> > > > > > > widely used for production in our data center. The network
> > > > > > > adapters are
> > > > > > >
> > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > >
> > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> > > > reasonable.
> > > > > >
> > > > > >
> > > > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > > > > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > > > > >
> > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > >
> > > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > > [ConnectX-5]
> > > > > > >
> > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > > > > Ethernet on these two hosts. One is standby while the other is active.
> > > > > > >
> > > > > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > > > > network adapters. One of them has:
> > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > > > > >
> > > > > > > The comparison between RDMA and TCP on the same NIC could make
> > > > > > > more
> > > > > > sense.
> > > > > >
> > > > > > It looks to me NICs are powerful now, but again as I mentioned I
> > > > > > don't think it's a reason we need to deprecate rdma, especially if
> > > > > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > > > > >
> > > > > > Is there anyone who started looking into that direction?  Would it
> > > > > > make sense we start some PoC now?
> > > > > >
> > > > >
> > > > > My team has finished the PoC refactoring which works well.
> > > > >
> > > > > Progress:
> > > > > 1.  Implement io/channel-rdma.c,
> > > > > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > > > > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > > > > from migration/ram.c. (to prevent RDMA live migration from polluting the
> > > > core logic of live migration), 6.  The soft-RoCE implemented by software is
> > > > used to test the RDMA live migration. It's successful.
> > > > >
> > > > > We will be submit the patchset later.
> > > >
> > > > That's great news, thank you!
> > > >
> > > > --
> > > > Peter Xu
> > >
> > > For rdma programming, the current mainstream implementation is to use rdma_cm to establish a connection, and then use verbs to transmit data.
> > >
> > > rdma_cm and ibverbs create two FDs respectively. The two FDs have different responsibilities. rdma_cm fd is used to notify connection establishment events,
> > > and verbs fd is used to notify new CQEs. When poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin event can be monitored, which means
> > > that an rdma_cm event occurs. When the verbs fd is directly polled/epolled, only the pollin event can be listened, which indicates that a new CQE is generated.
> > >
> > > Rsocket is a sub-module attached to the rdma_cm library and provides rdma calls that are completely similar to socket interfaces. However, this library returns
> > > only the rdma_cm fd for listening to link setup-related events and does not expose the verbs fd (readable and writable events for listening to data). Only the rpoll
> > > interface provided by the RSocket can be used to listen to related events. However, QEMU uses the ppoll interface to listen to the rdma_cm fd (gotten by raccept API).
> > > And cannot listen to the verbs fd event. Only some hacking methods can be used to address this problem.
> > >
> > > Do you guys have any ideas? Thanks.
> > +cc linux-rdma
>
> Why include rdma community?
The rdma community has a lot of people with experience in rdma/rsocket.
>
> > +cc Sean
> >
> >
> >
> > >
> > >
> > > Regards,
> > > -Gonglei
> >
"Xingtao Yao (Fujitsu)" via May 29, 2024, 8:30 a.m. UTC | #47
> -----Original Message-----
> From: Greg Sword [mailto:gregsword0@gmail.com]
> Sent: Wednesday, May 29, 2024 2:06 PM
> To: Jinpu Wang <jinpu.wang@ionos.com>
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Wed, May 29, 2024 at 12:33 PM Jinpu Wang <jinpu.wang@ionos.com>
> wrote:
> >
> > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei) <arei.gonglei@huawei.com>
> wrote:
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > servers widely used for production in our data center. The
> > > > > > > network adapters are
> > > > > > >
> > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > NetXtreme
> > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > >
> > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks
> > > > > > more
> > > > reasonable.
> > > > > >
> > > > > >
> > > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > 15
> > > > > > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > > > > >
> > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > >
> > > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > > [ConnectX-5]
> > > > > > >
> > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for
> > > > > > > VM migration. RDMA traffic is through InfiniBand and TCP
> > > > > > > through Ethernet on these two hosts. One is standby while the other
> is active.
> > > > > > >
> > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev
> > > > > > > 01)
> > > > > > >
> > > > > > > The comparison between RDMA and TCP on the same NIC could
> > > > > > > make more
> > > > > > sense.
> > > > > >
> > > > > > It looks to me NICs are powerful now, but again as I mentioned
> > > > > > I don't think it's a reason we need to deprecate rdma,
> > > > > > especially if QEMU's rdma migration has the chance to be refactored
> using rsocket.
> > > > > >
> > > > > > Is there anyone who started looking into that direction?
> > > > > > Would it make sense we start some PoC now?
> > > > > >
> > > > >
> > > > > My team has finished the PoC refactoring which works well.
> > > > >
> > > > > Progress:
> > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > functions from migration/ram.c. (to prevent RDMA live migration
> > > > > from polluting the
> > > > core logic of live migration), 6.  The soft-RoCE implemented by
> > > > software is used to test the RDMA live migration. It's successful.
> > > > >
> > > > > We will be submit the patchset later.
> > > >
> > > > That's great news, thank you!
> > > >
> > > > --
> > > > Peter Xu
> > >
> > > For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> > >
> > > rdma_cm and ibverbs create two FDs respectively. The two FDs have
> > > different responsibilities. rdma_cm fd is used to notify connection
> > > establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin
> event can be monitored, which means that an rdma_cm event occurs. When
> the verbs fd is directly polled/epolled, only the pollin event can be listened,
> which indicates that a new CQE is generated.
> > >
> > > Rsocket is a sub-module attached to the rdma_cm library and provides
> > > rdma calls that are completely similar to socket interfaces.
> > > However, this library returns only the rdma_cm fd for listening to link
> setup-related events and does not expose the verbs fd (readable and writable
> events for listening to data). Only the rpoll interface provided by the RSocket
> can be used to listen to related events. However, QEMU uses the ppoll
> interface to listen to the rdma_cm fd (gotten by raccept API).
> > > And cannot listen to the verbs fd event. Only some hacking methods can be
> used to address this problem.
> > >
> > > Do you guys have any ideas? Thanks.
> > +cc linux-rdma
> 
> Why include rdma community?
> 

Can rdma/rsocket provide an API to expose the verbs fd? 


Regards,
-Gonglei

> > +cc Sean
> >
> >
> >
> > >
> > >
> > > Regards,
> > > -Gonglei
> >
Jinpu Wang May 29, 2024, 9:17 a.m. UTC | #48
Hi Gonglei,

On Wed, May 29, 2024 at 10:31 AM Gonglei (Arei) <arei.gonglei@huawei.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Greg Sword [mailto:gregsword0@gmail.com]
> > Sent: Wednesday, May 29, 2024 2:06 PM
> > To: Jinpu Wang <jinpu.wang@ionos.com>
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> >
> > On Wed, May 29, 2024 at 12:33 PM Jinpu Wang <jinpu.wang@ionos.com>
> > wrote:
> > >
> > > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei) <arei.gonglei@huawei.com>
> > wrote:
> > > >
> > > > Hi,
> > > >
> > > > > -----Original Message-----
> > > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > > servers widely used for production in our data center. The
> > > > > > > > network adapters are
> > > > > > > >
> > > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > > NetXtreme
> > > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > > >
> > > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks
> > > > > > > more
> > > > > reasonable.
> > > > > > >
> > > > > > >
> > > > >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > 15
> > > > > > > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > > > > > >
> > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > >
> > > > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > > > [ConnectX-5]
> > > > > > > >
> > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for
> > > > > > > > VM migration. RDMA traffic is through InfiniBand and TCP
> > > > > > > > through Ethernet on these two hosts. One is standby while the other
> > is active.
> > > > > > > >
> > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev
> > > > > > > > 01)
> > > > > > > >
> > > > > > > > The comparison between RDMA and TCP on the same NIC could
> > > > > > > > make more
> > > > > > > sense.
> > > > > > >
> > > > > > > It looks to me NICs are powerful now, but again as I mentioned
> > > > > > > I don't think it's a reason we need to deprecate rdma,
> > > > > > > especially if QEMU's rdma migration has the chance to be refactored
> > using rsocket.
> > > > > > >
> > > > > > > Is there anyone who started looking into that direction?
> > > > > > > Would it make sense we start some PoC now?
> > > > > > >
> > > > > >
> > > > > > My team has finished the PoC refactoring which works well.
> > > > > >
> > > > > > Progress:
> > > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > > functions from migration/ram.c. (to prevent RDMA live migration
> > > > > > from polluting the
> > > > > core logic of live migration), 6.  The soft-RoCE implemented by
> > > > > software is used to test the RDMA live migration. It's successful.
> > > > > >
> > > > > > We will be submit the patchset later.
> > > > >
> > > > > That's great news, thank you!
> > > > >
> > > > > --
> > > > > Peter Xu
> > > >
> > > > For rdma programming, the current mainstream implementation is to use
> > rdma_cm to establish a connection, and then use verbs to transmit data.
> > > >
> > > > rdma_cm and ibverbs create two FDs respectively. The two FDs have
> > > > different responsibilities. rdma_cm fd is used to notify connection
> > > > establishment events, and verbs fd is used to notify new CQEs. When
> > poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin
> > event can be monitored, which means that an rdma_cm event occurs. When
> > the verbs fd is directly polled/epolled, only the pollin event can be listened,
> > which indicates that a new CQE is generated.
> > > >
> > > > Rsocket is a sub-module attached to the rdma_cm library and provides
> > > > rdma calls that are completely similar to socket interfaces.
> > > > However, this library returns only the rdma_cm fd for listening to link
> > setup-related events and does not expose the verbs fd (readable and writable
> > events for listening to data). Only the rpoll interface provided by the RSocket
> > can be used to listen to related events. However, QEMU uses the ppoll
> > interface to listen to the rdma_cm fd (gotten by raccept API).
> > > > And cannot listen to the verbs fd event.
I'm confused; see rs_poll_arm:
https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#L3290
For STREAM sockets, rpoll sets up pollfds for both the CQ fd and the CM fd.

> > > >
> > > > Do you guys have any ideas? Thanks.
> > > +cc linux-rdma
> >
> > Why include rdma community?
> >
>
> Can rdma/rsocket provide an API to expose the verbs fd?
Why do we need the verbs fd? It looks like rsocket already handles any
new completions via rs_get_comp during rsend/rrecv.

Another question that comes to mind: Daniel suggested a somewhat different
way of using rsocket: https://lore.kernel.org/qemu-devel/ZjtOreamN8xF9FDE@redhat.com/
Have you considered that?

Thx!
Jinpu




>
>
> Regards,
> -Gonglei
>
> > > +cc Sean
> > >
> > >
> > >
> > > >
> > > >
> > > > Regards,
> > > > -Gonglei
> > >
"Xingtao Yao (Fujitsu)" via May 29, 2024, 9:34 a.m. UTC | #49
> -----Original Message-----
> From: Jinpu Wang [mailto:jinpu.wang@ionos.com]
> Sent: Wednesday, May 29, 2024 5:18 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: Greg Sword <gregsword0@gmail.com>; Peter Xu <peterx@redhat.com>;
> Yu Zhang <yu.zhang@ionos.com>; Michael Galaxy <mgalaxy@akamai.com>;
> Elmar Gerdes <elmar.gerdes@ionos.com>; zhengchuan
> <zhengchuan@huawei.com>; Daniel P. Berrangé <berrange@redhat.com>;
> Markus Armbruster <armbru@redhat.com>; Zhijian Li (Fujitsu)
> <lizhijian@fujitsu.com>; qemu-devel@nongnu.org; Yuval Shaia
> <yuval.shaia.ml@gmail.com>; Kevin Wolf <kwolf@redhat.com>; Prasanna
> Kumar Kalever <prasanna.kalever@redhat.com>; Cornelia Huck
> <cohuck@redhat.com>; Michael Roth <michael.roth@amd.com>; Prasanna
> Kumar Kalever <prasanna4324@gmail.com>; Paolo Bonzini
> <pbonzini@redhat.com>; qemu-block@nongnu.org; devel@lists.libvirt.org;
> Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin <mst@redhat.com>;
> Thomas Huth <thuth@redhat.com>; Eric Blake <eblake@redhat.com>; Song
> Gao <gaosong@loongson.cn>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; Alex Bennée <alex.bennee@linaro.org>;
> Wainer dos Santos Moschetta <wainersm@redhat.com>; Beraldo Leal
> <bleal@redhat.com>; Pannengyuan <pannengyuan@huawei.com>;
> Xiexiangyou <xiexiangyou@huawei.com>; Fabiano Rosas <farosas@suse.de>;
> RDMA mailing list <linux-rdma@vger.kernel.org>; shefty@nvidia.com; Haris
> Iqbal <haris.iqbal@ionos.com>
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> Hi Gonglei,
> 
> On Wed, May 29, 2024 at 10:31 AM Gonglei (Arei) <arei.gonglei@huawei.com>
> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Greg Sword [mailto:gregsword0@gmail.com]
> > > Sent: Wednesday, May 29, 2024 2:06 PM
> > > To: Jinpu Wang <jinpu.wang@ionos.com>
> > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol
> > > handling
> > >
> > > On Wed, May 29, 2024 at 12:33 PM Jinpu Wang <jinpu.wang@ionos.com>
> > > wrote:
> > > >
> > > > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)
> > > > <arei.gonglei@huawei.com>
> > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > > > servers widely used for production in our data center.
> > > > > > > > > The network adapters are
> > > > > > > > >
> > > > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > > > NetXtreme
> > > > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > > > >
> > > > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6
> > > > > > > > looks more
> > > > > > reasonable.
> > > > > > > >
> > > > > > > >
> > > > > >
> > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > 15
> > > > > > > > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > > > > > > >
> > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > >
> > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > Family [ConnectX-5]
> > > > > > > > >
> > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > while the other
> > > is active.
> > > > > > > > >
> > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > (rev
> > > > > > > > > 01)
> > > > > > > > >
> > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > could make more
> > > > > > > > sense.
> > > > > > > >
> > > > > > > > It looks to me NICs are powerful now, but again as I
> > > > > > > > mentioned I don't think it's a reason we need to deprecate
> > > > > > > > rdma, especially if QEMU's rdma migration has the chance
> > > > > > > > to be refactored
> > > using rsocket.
> > > > > > > >
> > > > > > > > Is there anyone who started looking into that direction?
> > > > > > > > Would it make sense we start some PoC now?
> > > > > > > >
> > > > > > >
> > > > > > > My team has finished the PoC refactoring which works well.
> > > > > > >
> > > > > > > Progress:
> > > > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > > > functions from migration/ram.c. (to prevent RDMA live
> > > > > > > migration from polluting the
> > > > > > core logic of live migration), 6.  The soft-RoCE implemented
> > > > > > by software is used to test the RDMA live migration. It's successful.
> > > > > > >
> > > > > > > We will be submit the patchset later.
> > > > > >
> > > > > > That's great news, thank you!
> > > > > >
> > > > > > --
> > > > > > Peter Xu
> > > > >
> > > > > For rdma programming, the current mainstream implementation is
> > > > > to use
> > > rdma_cm to establish a connection, and then use verbs to transmit data.
> > > > >
> > > > > rdma_cm and ibverbs create two FDs respectively. The two FDs
> > > > > have different responsibilities. rdma_cm fd is used to notify
> > > > > connection establishment events, and verbs fd is used to notify
> > > > > new CQEs. When
> > > poll/epoll monitoring is directly performed on the rdma_cm fd, only
> > > a pollin event can be monitored, which means that an rdma_cm event
> > > occurs. When the verbs fd is directly polled/epolled, only the
> > > pollin event can be listened, which indicates that a new CQE is generated.
> > > > >
> > > > > Rsocket is a sub-module attached to the rdma_cm library and
> > > > > provides rdma calls that are completely similar to socket interfaces.
> > > > > However, this library returns only the rdma_cm fd for listening
> > > > > to link
> > > setup-related events and does not expose the verbs fd (readable and
> > > writable events for listening to data). Only the rpoll interface
> > > provided by the RSocket can be used to listen to related events.
> > > However, QEMU uses the ppoll interface to listen to the rdma_cm fd
> (gotten by raccept API).
> > > > > And cannot listen to the verbs fd event.
> I'm confused, the rs_poll_arm
> :https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#
> L3290
> For STREAM, rpoll setup fd for both cq fd and cm fd.
> 
> > > > >
> > > > > Do you guys have any ideas? Thanks.
> > > > +cc linux-rdma
> > >
> > > Why include rdma community?
> > >
> >
> > Can rdma/rsocket provide an API to expose the verbs fd?
> Why do we need verbs fd? looks rsocket during rsend/rrecv is handling the new
> completion if any via rs_get_comp
> 
Actually, I gave the reason in the previous mail. Listing some relevant headers from librdmacm:

/* verbs.h */
struct ibv_comp_channel {
	struct ibv_context     *context;
	int			fd;
	int			refcnt;
};

/* rdma_cma.h */
struct rdma_event_channel {
	int			fd;
};

/* rdma_cma.h */
struct rdma_cm_id {
	struct ibv_context	*verbs;
	struct rdma_event_channel *channel;   //==> it can be gotten by rsocket.h
	void			*context;
	struct ibv_qp		*qp;
	struct rdma_route	 route;
	enum rdma_port_space	 ps;
	uint8_t			 port_num;
	struct rdma_cm_event	*event;
	struct ibv_comp_channel *send_cq_channel;  // ==> can't be gotten so that Qemu can't read the CQE data
	struct ibv_cq		*send_cq;
	struct ibv_comp_channel *recv_cq_channel;
	struct ibv_cq		*recv_cq;
	struct ibv_srq		*srq;
	struct ibv_pd		*pd;
	enum ibv_qp_type	qp_type;
};

/* rsocket.h */
int raccept(int socket, struct sockaddr *addr, socklen_t *addrlen);
int rpoll(struct pollfd *fds, nfds_t nfds, int timeout);
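
To make the mismatch concrete, here is a tiny illustration (not code from QEMU;
rsock_fd is assumed to be an fd handed out by the rsocket API, e.g. by raccept()
above). Watching that fd with plain ppoll() only ever surfaces rdma_cm events,
while rpoll() internally arms and waits on both the CM fd and the CQ fd:

    /* Illustration only (not QEMU code). */
    #define _GNU_SOURCE
    #include <poll.h>
    #include <rdma/rsocket.h>

    /* Plain ppoll() on the fd handed out by rsocket only surfaces rdma_cm
     * (connection management) events; CQE readability never shows up. */
    static int wait_readable_ppoll(int rsock_fd)
    {
        struct pollfd pfd = { .fd = rsock_fd, .events = POLLIN };
        return ppoll(&pfd, 1, NULL, NULL);
    }

    /* rpoll() knows the fd belongs to an rsocket and waits on the CM fd
     * and the verbs CQ fd together. */
    static int wait_readable_rpoll(int rsock_fd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = rsock_fd, .events = POLLIN };
        return rpoll(&pfd, 1, timeout_ms);
    }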


> Another question to my mind is Daniel suggested a bit different way of using
> rsocket: https://lore.kernel.org/qemu-devel/ZjtOreamN8xF9FDE@redhat.com/
> Have you considered that?
> 
We do use the 'rsocket' APIs to refactor the RDMA code in QEMU, and that is where we hit this issue.


Regards,
-Gonglei
Jinpu Wang May 29, 2024, 9:44 a.m. UTC | #50
On Wed, May 29, 2024 at 11:35 AM Gonglei (Arei) <arei.gonglei@huawei.com>
wrote:

>
>
> > -----Original Message-----
> > From: Jinpu Wang [mailto:jinpu.wang@ionos.com]
> > Sent: Wednesday, May 29, 2024 5:18 PM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: Greg Sword <gregsword0@gmail.com>; Peter Xu <peterx@redhat.com>;
> > Yu Zhang <yu.zhang@ionos.com>; Michael Galaxy <mgalaxy@akamai.com>;
> > Elmar Gerdes <elmar.gerdes@ionos.com>; zhengchuan
> > <zhengchuan@huawei.com>; Daniel P. Berrangé <berrange@redhat.com>;
> > Markus Armbruster <armbru@redhat.com>; Zhijian Li (Fujitsu)
> > <lizhijian@fujitsu.com>; qemu-devel@nongnu.org; Yuval Shaia
> > <yuval.shaia.ml@gmail.com>; Kevin Wolf <kwolf@redhat.com>; Prasanna
> > Kumar Kalever <prasanna.kalever@redhat.com>; Cornelia Huck
> > <cohuck@redhat.com>; Michael Roth <michael.roth@amd.com>; Prasanna
> > Kumar Kalever <prasanna4324@gmail.com>; Paolo Bonzini
> > <pbonzini@redhat.com>; qemu-block@nongnu.org; devel@lists.libvirt.org;
> > Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin <mst@redhat.com>;
> > Thomas Huth <thuth@redhat.com>; Eric Blake <eblake@redhat.com>; Song
> > Gao <gaosong@loongson.cn>; Marc-André Lureau
> > <marcandre.lureau@redhat.com>; Alex Bennée <alex.bennee@linaro.org>;
> > Wainer dos Santos Moschetta <wainersm@redhat.com>; Beraldo Leal
> > <bleal@redhat.com>; Pannengyuan <pannengyuan@huawei.com>;
> > Xiexiangyou <xiexiangyou@huawei.com>; Fabiano Rosas <farosas@suse.de>;
> > RDMA mailing list <linux-rdma@vger.kernel.org>; shefty@nvidia.com; Haris
> > Iqbal <haris.iqbal@ionos.com>
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol
> handling
> >
> > Hi Gonglei,
> >
> > On Wed, May 29, 2024 at 10:31 AM Gonglei (Arei) <arei.gonglei@huawei.com
> >
> > wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Greg Sword [mailto:gregsword0@gmail.com]
> > > > Sent: Wednesday, May 29, 2024 2:06 PM
> > > > To: Jinpu Wang <jinpu.wang@ionos.com>
> > > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol
> > > > handling
> > > >
> > > > On Wed, May 29, 2024 at 12:33 PM Jinpu Wang <jinpu.wang@ionos.com>
> > > > wrote:
> > > > >
> > > > > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)
> > > > > <arei.gonglei@huawei.com>
> > > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > > > > servers widely used for production in our data center.
> > > > > > > > > > The network adapters are
> > > > > > > > > >
> > > > > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > > > > NetXtreme
> > > > > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > > > > >
> > > > > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6
> > > > > > > > > looks more
> > > > > > > reasonable.
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > > 15
> > > > > > > > > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > > > > > > > >
> > > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > > >
> > > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > > Family [ConnectX-5]
> > > > > > > > > >
> > > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > > while the other
> > > > is active.
> > > > > > > > > >
> > > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > > (rev
> > > > > > > > > > 01)
> > > > > > > > > >
> > > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > > could make more
> > > > > > > > > sense.
> > > > > > > > >
> > > > > > > > > It looks to me NICs are powerful now, but again as I
> > > > > > > > > mentioned I don't think it's a reason we need to deprecate
> > > > > > > > > rdma, especially if QEMU's rdma migration has the chance
> > > > > > > > > to be refactored
> > > > using rsocket.
> > > > > > > > >
> > > > > > > > > Is there anyone who started looking into that direction?
> > > > > > > > > Would it make sense we start some PoC now?
> > > > > > > > >
> > > > > > > >
> > > > > > > > My team has finished the PoC refactoring which works well.
> > > > > > > >
> > > > > > > > Progress:
> > > > > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > > > > successful, 3.  Remove the original code from
> migration/rdma.c, 4.
> > > > > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > > > > functions from migration/ram.c. (to prevent RDMA live
> > > > > > > > migration from polluting the
> > > > > > > core logic of live migration), 6.  The soft-RoCE implemented
> > > > > > > by software is used to test the RDMA live migration. It's
> successful.
> > > > > > > >
> > > > > > > > We will be submit the patchset later.
> > > > > > >
> > > > > > > That's great news, thank you!
> > > > > > >
> > > > > > > --
> > > > > > > Peter Xu
> > > > > >
> > > > > > For rdma programming, the current mainstream implementation is
> > > > > > to use
> > > > rdma_cm to establish a connection, and then use verbs to transmit
> data.
> > > > > >
> > > > > > rdma_cm and ibverbs create two FDs respectively. The two FDs
> > > > > > have different responsibilities. rdma_cm fd is used to notify
> > > > > > connection establishment events, and verbs fd is used to notify
> > > > > > new CQEs. When
> > > > poll/epoll monitoring is directly performed on the rdma_cm fd, only
> > > > a pollin event can be monitored, which means that an rdma_cm event
> > > > occurs. When the verbs fd is directly polled/epolled, only the
> > > > pollin event can be listened, which indicates that a new CQE is
> generated.
> > > > > >
> > > > > > Rsocket is a sub-module attached to the rdma_cm library and
> > > > > > provides rdma calls that are completely similar to socket
> interfaces.
> > > > > > However, this library returns only the rdma_cm fd for listening
> > > > > > to link
> > > > setup-related events and does not expose the verbs fd (readable and
> > > > writable events for listening to data). Only the rpoll interface
> > > > provided by the RSocket can be used to listen to related events.
> > > > However, QEMU uses the ppoll interface to listen to the rdma_cm fd
> > (gotten by raccept API).
> > > > > > And cannot listen to the verbs fd event.
> > I'm confused, the rs_poll_arm
> > :
> https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#
> > L3290
> > For STREAM, rpoll setup fd for both cq fd and cm fd.
> >
> > > > > >
> > > > > > Do you guys have any ideas? Thanks.
> > > > > +cc linux-rdma
> > > >
> > > > Why include rdma community?
> > > >
> > >
> > > Can rdma/rsocket provide an API to expose the verbs fd?
> > Why do we need verbs fd? looks rsocket during rsend/rrecv is handling
> the new
> > completion if any via rs_get_comp
> >
> Actually I said the reason in the previous mail. Listing some header in
> librdmacm.
>
> /* verbs.h */
> struct ibv_comp_channel {
>         struct ibv_context     *context;
>         int                     fd;
>         int                     refcnt;
> };
>
> /* rdma_cma.h */
> struct rdma_event_channel {
>         int                     fd;
> };
>
> /* rdma_cma.h */
> struct rdma_cm_id {
>         struct ibv_context      *verbs;
>         struct rdma_event_channel *channel;   //==> it can be gotten by
> rsocket.h
>         void                    *context;
>         struct ibv_qp           *qp;
>         struct rdma_route        route;
>         enum rdma_port_space     ps;
>         uint8_t                  port_num;
>         struct rdma_cm_event    *event;
>         struct ibv_comp_channel *send_cq_channel;  // ==> can't be gotten
> so that Qemu can't read the CQE dat
>
OK, but the send_cq_channel is set to the same channel as recv_cq_channel:
https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#L855
and it also uses the same recv_cq as send_cq.


>         struct ibv_cq           *send_cq;
>         struct ibv_comp_channel *recv_cq_channel;
>         struct ibv_cq           *recv_cq;
>         struct ibv_srq          *srq;
>         struct ibv_pd           *pd;
>         enum ibv_qp_type        qp_type;
> };
>
> /* rsocket.h */
> int raccept(int socket, struct sockaddr *addr, socklen_t *addrlen);
> int rpoll(struct pollfd *fds, nfds_t nfds, int timeout);
>
>
> > Another question to my mind is Daniel suggested a bit different way of
> using
> > rsocket: https://lore.kernel.org/qemu-devel/ZjtOreamN8xF9FDE@redhat.com/
> > Have you considered that?
> >
> We do use 'rsocket' APIs to refactor the RDMA code in QEMU and encounter
> the issue.
>
>
> Regards,
> -Gonglei
>
>
"Xingtao Yao (Fujitsu)" via May 29, 2024, 9:47 a.m. UTC | #51
Hi,

> -----Original Message-----
> > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > 15
> > > > > > > > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > > > > > > >
> > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > >
> > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > Family [ConnectX-5]
> > > > > > > > >
> > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > while the other
> > > is active.
> > > > > > > > >
> > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > (rev
> > > > > > > > > 01)
> > > > > > > > >
> > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > could make more
> > > > > > > > sense.
> > > > > > > >
> > > > > > > > It looks to me NICs are powerful now, but again as I
> > > > > > > > mentioned I don't think it's a reason we need to deprecate
> > > > > > > > rdma, especially if QEMU's rdma migration has the chance
> > > > > > > > to be refactored
> > > using rsocket.
> > > > > > > >
> > > > > > > > Is there anyone who started looking into that direction?
> > > > > > > > Would it make sense we start some PoC now?
> > > > > > > >
> > > > > > >
> > > > > > > My team has finished the PoC refactoring which works well.
> > > > > > >
> > > > > > > Progress:
> > > > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > > > functions from migration/ram.c. (to prevent RDMA live
> > > > > > > migration from polluting the
> > > > > > core logic of live migration), 6.  The soft-RoCE implemented
> > > > > > by software is used to test the RDMA live migration. It's successful.
> > > > > > >
> > > > > > > We will be submit the patchset later.
> > > > > >
> > > > > > That's great news, thank you!
> > > > > >
> > > > > > --
> > > > > > Peter Xu
> > > > >
> > > > > For rdma programming, the current mainstream implementation is
> > > > > to use
> > > rdma_cm to establish a connection, and then use verbs to transmit data.
> > > > >
> > > > > rdma_cm and ibverbs create two FDs respectively. The two FDs
> > > > > have different responsibilities. rdma_cm fd is used to notify
> > > > > connection establishment events, and verbs fd is used to notify
> > > > > new CQEs. When
> > > poll/epoll monitoring is directly performed on the rdma_cm fd, only
> > > a pollin event can be monitored, which means that an rdma_cm event
> > > occurs. When the verbs fd is directly polled/epolled, only the
> > > pollin event can be listened, which indicates that a new CQE is generated.
> > > > >
> > > > > Rsocket is a sub-module attached to the rdma_cm library and
> > > > > provides rdma calls that are completely similar to socket interfaces.
> > > > > However, this library returns only the rdma_cm fd for listening
> > > > > to link
> > > setup-related events and does not expose the verbs fd (readable and
> > > writable events for listening to data). Only the rpoll interface
> > > provided by the RSocket can be used to listen to related events.
> > > However, QEMU uses the ppoll interface to listen to the rdma_cm fd
> (gotten by raccept API).
> > > > > And cannot listen to the verbs fd event.
> I'm confused, the rs_poll_arm
> :https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#
> L3290
> For STREAM, rpoll setup fd for both cq fd and cm fd.
> 

Right. But the problem is that QEMU does not use rpoll but glib's ppoll. :(


Regards,
-Gonglei
Haris Iqbal May 29, 2024, 11:13 a.m. UTC | #52
Hello,

I am part of the storage kernel team which develops and maintains the
RDMA block storage in IONOS.
We work closely with Jinpu/Yu, and currently I am supporting Jinpu
with this Qemu RDMA work.

On Wed, May 29, 2024 at 11:49 AM Gonglei (Arei) via
<qemu-devel@nongnu.org> wrote:
>
> Hi,
>
> > -----Original Message-----
> > > >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > > 15
> > > > > > > > > wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > > > > > > > >
> > > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > > >
> > > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > > Family [ConnectX-5]
> > > > > > > > > >
> > > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > > while the other
> > > > is active.
> > > > > > > > > >
> > > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > > (rev
> > > > > > > > > > 01)
> > > > > > > > > >
> > > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > > could make more
> > > > > > > > > sense.
> > > > > > > > >
> > > > > > > > > It looks to me NICs are powerful now, but again as I
> > > > > > > > > mentioned I don't think it's a reason we need to deprecate
> > > > > > > > > rdma, especially if QEMU's rdma migration has the chance
> > > > > > > > > to be refactored
> > > > using rsocket.
> > > > > > > > >
> > > > > > > > > Is there anyone who started looking into that direction?
> > > > > > > > > Would it make sense we start some PoC now?
> > > > > > > > >
> > > > > > > >
> > > > > > > > My team has finished the PoC refactoring which works well.
> > > > > > > >
> > > > > > > > Progress:
> > > > > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > > > > functions from migration/ram.c. (to prevent RDMA live
> > > > > > > > migration from polluting the
> > > > > > > core logic of live migration), 6.  The soft-RoCE implemented
> > > > > > > by software is used to test the RDMA live migration. It's successful.
> > > > > > > >
> > > > > > > > We will be submit the patchset later.
> > > > > > >
> > > > > > > That's great news, thank you!
> > > > > > >
> > > > > > > --
> > > > > > > Peter Xu
> > > > > >
> > > > > > For rdma programming, the current mainstream implementation is
> > > > > > to use
> > > > rdma_cm to establish a connection, and then use verbs to transmit data.
> > > > > >
> > > > > > rdma_cm and ibverbs create two FDs respectively. The two FDs
> > > > > > have different responsibilities. rdma_cm fd is used to notify
> > > > > > connection establishment events, and verbs fd is used to notify
> > > > > > new CQEs. When
> > > > poll/epoll monitoring is directly performed on the rdma_cm fd, only
> > > > a pollin event can be monitored, which means that an rdma_cm event
> > > > occurs. When the verbs fd is directly polled/epolled, only the
> > > > pollin event can be listened, which indicates that a new CQE is generated.
> > > > > >
> > > > > > Rsocket is a sub-module attached to the rdma_cm library and
> > > > > > provides rdma calls that are completely similar to socket interfaces.
> > > > > > However, this library returns only the rdma_cm fd for listening
> > > > > > to link
> > > > setup-related events and does not expose the verbs fd (readable and
> > > > writable events for listening to data). Only the rpoll interface
> > > > provided by the RSocket can be used to listen to related events.
> > > > However, QEMU uses the ppoll interface to listen to the rdma_cm fd
> > (gotten by raccept API).
> > > > > > And cannot listen to the verbs fd event.
> > I'm confused, the rs_poll_arm
> > :https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#
> > L3290
> > For STREAM, rpoll setup fd for both cq fd and cm fd.
> >
>
> Right. But the problem is that QEMU does not use rpoll but glib's ppoll. :(

I have a query on this topic. Are the fds used in socket migration
polled through ppoll?
If yes, can someone point out where? I couldn't find that piece of code.

I could only find that sendmsg/send and recvmsg/recv are being used.

>
>
> Regards,
> -Gonglei
>
Peter Xu May 29, 2024, 4:33 p.m. UTC | #53
Lei,

On Wed, May 29, 2024 at 02:43:46AM +0000, Gonglei (Arei) wrote:
> For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> rdma_cm and ibverbs create two FDs respectively. The two FDs have
> different responsibilities. rdma_cm fd is used to notify connection
> establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a
> pollin event can be monitored, which means that an rdma_cm event
> occurs. When the verbs fd is directly polled/epolled, only the pollin
> event can be listened, which indicates that a new CQE is generated.
>
> Rsocket is a sub-module attached to the rdma_cm library and provides
> rdma calls that are completely similar to socket interfaces. However,
> this library returns only the rdma_cm fd for listening to link
> setup-related events and does not expose the verbs fd (readable and
> writable events for listening to data). Only the rpoll interface provided
> by the RSocket can be used to listen to related events. However, QEMU
> uses the ppoll interface to listen to the rdma_cm fd (gotten by raccept
> API).  And cannot listen to the verbs fd event. Only some hacking methods
> can be used to address this problem.  Do you guys have any ideas? Thanks.

I saw that you mentioned this elsewhere:

> Right. But the problem is that QEMU does not use rpoll but glib's ppoll. :(

So what I'm thinking may not make much sense, as I mentioned I don't think
I know rdma at all.. and my idea also has involvement on coroutine stuff
which I also don't know well. But just in case it shed some light in some
form.

IIUC we do iochannel blockings with this no matter for read/write:

        if (len == QIO_CHANNEL_ERR_BLOCK) {
            if (qemu_in_coroutine()) {
                qio_channel_yield(ioc, G_IO_XXX);
            } else {
                qio_channel_wait(ioc, G_IO_XXX);
            }
            continue;
        }

One thing I'm wondering is whether we can provide a new feature bit for
qiochannel, e.g., QIO_CHANNEL_FEATURE_POLL, so that the iochannel can
define its own poll routine rather than using the default when possible.

I think it may not work if it's in a coroutine, as I guess that'll block
other fds from being waked up.  Hence it should look like this:

        if (len == QIO_CHANNEL_ERR_BLOCK) {
            if (qemu_in_coroutine()) {
                qio_channel_yield(ioc, G_IO_XXX);
            } else if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_POLL)) {
                qio_channel_poll(ioc, G_IO_XXX);
            } else {
                qio_channel_wait(ioc, G_IO_XXX);
            }
            continue;
        }

Maybe we even want to forbid such channel to be used in coroutine already,
as when QIO_CHANNEL_FEATURE_POLL set it may mean that this iochannel simply
won't work with poll() like in rdma's use case.

Then rdma iochannel can implement qio_channel_poll() using rpoll().
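
Just to illustrate the idea (a rough sketch only: the QIOChannelRDMA type with
a plain fd field, the QIO_CHANNEL_RDMA() cast and the poll hook below are made
up for illustration and are not QEMU code; rpoll() is the rsocket call from
<rdma/rsocket.h>), such a per-channel hook could boil down to translating the
GIOCondition into a pollfd and delegating to rpoll():

    /* Hypothetical sketch only. */
    static int qio_channel_rdma_poll(QIOChannel *ioc, GIOCondition condition)
    {
        QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);      /* hypothetical type */
        struct pollfd pfd = {
            .fd     = rioc->fd,                            /* fd handed out by rsocket */
            .events = (condition & G_IO_IN  ? POLLIN  : 0) |
                      (condition & G_IO_OUT ? POLLOUT : 0),
        };

        /* rpoll() waits on both the rdma_cm fd and the verbs CQ fd internally,
         * which a plain ppoll() on rioc->fd cannot do. */
        return rpoll(&pfd, 1, -1 /* block until ready */);
    }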

There's one other dependent issue here in that I _think_ the migration recv
side is still in a coroutine.. so we may need to move that into a thread
first.  IIRC we don't yet have a major blocker to do that, but I didn't
further check either.  I've put that issue aside just to see whether this
may or may not make sense.

Thanks,
Sean Hefty May 30, 2024, 6:23 p.m. UTC | #54
> > For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> >
> > rdma_cm and ibverbs create two FDs respectively. The two FDs have
> > different responsibilities. rdma_cm fd is used to notify connection
> > establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin
> event can be monitored, which means that an rdma_cm event occurs. When
> the verbs fd is directly polled/epolled, only the pollin event can be listened,
> which indicates that a new CQE is generated.
> >
> > Rsocket is a sub-module attached to the rdma_cm library and provides
> > rdma calls that are completely similar to socket interfaces. However,
> > this library returns only the rdma_cm fd for listening to link setup-related
> events and does not expose the verbs fd (readable and writable events for
> listening to data). Only the rpoll interface provided by the RSocket can be used
> to listen to related events. However, QEMU uses the ppoll interface to listen to
> the rdma_cm fd (gotten by raccept API).
> > And cannot listen to the verbs fd event. Only some hacking methods can be
> used to address this problem.
> >
> > Do you guys have any ideas? Thanks.

The current rsocket code allows calling rpoll() with non-rsocket fd's, so an app can use rpoll() directly in place of poll().  It may be easiest to add an rppoll() call to rsockets and call that when using RDMA.
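
As a tiny sketch of that first point (illustration only; rsock_fd and plain_fd
are made-up names for an rsocket fd and an ordinary fd such as a pipe end), an
application can mix both kinds of fd in one array and hand it to rpoll()
wherever it would otherwise have called poll():

    #include <poll.h>
    #include <rdma/rsocket.h>

    /* rpoll() accepts ordinary fds alongside rsocket fds, so a single call
     * can wait on both kinds at once. */
    static int wait_any(int rsock_fd, int plain_fd, int timeout_ms)
    {
        struct pollfd fds[2] = {
            { .fd = rsock_fd, .events = POLLIN },
            { .fd = plain_fd, .events = POLLIN },
        };
        return rpoll(fds, 2, timeout_ms);
    }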

In case the easy path isn't feasible:

An extension could allow extracting the actual fd's under an rsocket, in order to allow a user to call poll()/ppoll() directly.  But it would be non-trivial.

The 'fd' that represents an rsocket happens to be the fd related to the RDMA CM.  That's because an rsocket needs a unique integer value to report as an 'fd' value which will not conflict with any other fd value that the app may have.  I would consider the fd value an implementation detail, rather than something which an app should depend upon.  (For example, the 'fd' value returned for a datagram rsocket is actually a UDP socket fd).

Once an rsocket is in the connected state, it's possible an extended rgetsockopt() or rfcntl() call could return the fd related to the CQ.  But if an app tried to call poll() on that fd, the results would not be as expected.  For example, it's possible for data to be available to receive on the rsocket without the CQ fd being signaled.  Calling poll() on the CQ fd in this state could leave the app hanging.  This is a natural? result of races in the RDMA CQ signaling.  If you look at the rsocket rpoll() implementation, you'll see that it checks for data prior to sleeping.

For an app to safely wait in poll/ppoll on the CQ fd, it would need to invoke some sort of 'pre-poll' routine, which would perform the same checks done in rpoll() prior to blocking.  As a reference to a similar pre-poll routine, see the fi_trywait() call from this man page: 

https://ofiwg.github.io/libfabric/v1.21.0/man/fi_poll.3.html

This is for a different library but deals with the same underlying problem.  Obviously adding an rtrywait() to rsockets is possible but wouldn't align with any socket API equivalent.

- Sean
Dr. David Alan Gilbert June 5, 2024, 12:31 a.m. UTC | #55
* Michael Galaxy (mgalaxy@akamai.com) wrote:
> One thing to keep in mind here (despite me not having any hardware to test)
> was that one of the original goals here
> in the RDMA implementation was not simply raw throughput nor raw latency,
> but a lack of CPU utilization in kernel
> space due to the offload. While it is entirely possible that newer hardware
> w/ TCP might compete, the significant
> reductions in CPU usage in the TCP/IP stack were a big win at the time.
> 
> Just something to consider while you're doing the testing........

I just noticed this thread; some random notes from a somewhat
fragmented memory of this:

  a) Long long ago, I also tried rsocket; 
      https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
     as I remember the library was quite flaky at the time.

  b) A lot of the complexity in the rdma migration code comes from
    emulating a stream to carry the migration control data and interleaving
    that with the actual RAM copy.   I believe the original design used
    a separate TCP socket for the control data, and just used the RDMA
    for the data - that should be a lot simpler (but alas was rejected
    in review early on)

  c) I can't remember the last benchmarks I did; but I think I did
    manage to beat RDMA with multifd; but yes, multifd does eat host CPU
    whereas RDMA barely uses a whisper.

  d) The 'zero-copy-send' option in migrate may well get some of that
     CPU time back; but if I remember we were still bottlenecked on
     the receive side. (I can't remember if zero-copy-send worked with
     multifd?)

  e) Someone made a good suggestion (sorry can't remember who) - that the
     RDMA migration structure was the wrong way around - it should be the
     destination which initiates an RDMA read, rather than the source
     doing a write; then things might become a LOT simpler; you just need
     to send page ranges to the destination and it can pull it.
     That might work nicely for postcopy (see the sketch just below).
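
A bare-bones libibverbs sketch of such a pull (illustration only; it assumes an
already-connected QP, a registered local MR, and a remote address/rkey that the
source advertised out of band):

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Destination pulls one page range from the source with an RDMA READ. */
    static int pull_range(struct ibv_qp *qp, struct ibv_mr *local_mr,
                          void *local_addr, uint32_t len,
                          uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_addr,
            .length = len,
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_READ,
            .send_flags = IBV_SEND_SIGNALED,   /* completion says the range landed */
        };
        struct ibv_send_wr *bad_wr;

        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad_wr);
    }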

Dave

> - Michael
> 
> On 5/9/24 03:58, Zheng Chuan wrote:
> > Hi, Peter,Lei,Jinpu.
> > 
> > On 2024/5/8 0:28, Peter Xu wrote:
> > > On Tue, May 07, 2024 at 01:50:43AM +0000, Gonglei (Arei) wrote:
> > > > Hello,
> > > > 
> > > > > -----Original Message-----
> > > > > From: Peter Xu [mailto:peterx@redhat.com]
> > > > > Sent: Monday, May 6, 2024 11:18 PM
> > > > > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > > > > Cc: Daniel P. Berrangé <berrange@redhat.com>; Markus Armbruster
> > > > > <armbru@redhat.com>; Michael Galaxy <mgalaxy@akamai.com>; Yu Zhang
> > > > > <yu.zhang@ionos.com>; Zhijian Li (Fujitsu) <lizhijian@fujitsu.com>; Jinpu Wang
> > > > > <jinpu.wang@ionos.com>; Elmar Gerdes <elmar.gerdes@ionos.com>;
> > > > > qemu-devel@nongnu.org; Yuval Shaia <yuval.shaia.ml@gmail.com>; Kevin Wolf
> > > > > <kwolf@redhat.com>; Prasanna Kumar Kalever
> > > > > <prasanna.kalever@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> > > > > Michael Roth <michael.roth@amd.com>; Prasanna Kumar Kalever
> > > > > <prasanna4324@gmail.com>; integration@gluster.org; Paolo Bonzini
> > > > > <pbonzini@redhat.com>; qemu-block@nongnu.org; devel@lists.libvirt.org;
> > > > > Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin <mst@redhat.com>;
> > > > > Thomas Huth <thuth@redhat.com>; Eric Blake <eblake@redhat.com>; Song
> > > > > Gao <gaosong@loongson.cn>; Marc-André Lureau
> > > > > <marcandre.lureau@redhat.com>; Alex Bennée <alex.bennee@linaro.org>;
> > > > > Wainer dos Santos Moschetta <wainersm@redhat.com>; Beraldo Leal
> > > > > <bleal@redhat.com>; Pannengyuan <pannengyuan@huawei.com>;
> > > > > Xiexiangyou <xiexiangyou@huawei.com>
> > > > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> > > > > 
> > > > > On Mon, May 06, 2024 at 02:06:28AM +0000, Gonglei (Arei) wrote:
> > > > > > Hi, Peter
> > > > > Hey, Lei,
> > > > > 
> > > > > Happy to see you around again after years.
> > > > > 
> > > > Haha, me too.
> > > > 
> > > > > > RDMA features high bandwidth, low latency (in non-blocking lossless
> > > > > > network), and direct remote memory access by bypassing the CPU (As you
> > > > > > know, CPU resources are expensive for cloud vendors, which is one of
> > > > > > the reasons why we introduced offload cards.), which TCP does not have.
> > > > > It's another cost to use offload cards, v.s. preparing more cpu resources?
> > > > > 
> > > > Software and hardware offload converged architecture is the way to go for all cloud vendors
> > > > (Including comprehensive benefits in terms of performance, cost, security, and innovation speed),
> > > > it's not just a matter of adding the resource of a DPU card.
> > > > 
> > > > > > In some scenarios where fast live migration is needed (extremely short
> > > > > > interruption duration and migration duration) is very useful. To this
> > > > > > end, we have also developed RDMA support for multifd.
> > > > > Will any of you upstream that work?  I'm curious how intrusive would it be
> > > > > when adding it to multifd, if it can keep only 5 exported functions like what
> > > > > rdma.h does right now it'll be pretty nice.  We also want to make sure it works
> > > > > with arbitrary sized loads and buffers, e.g. vfio is considering to add IO loads to
> > > > > multifd channels too.
> > > > > 
> > > > In fact, we sent the patchset to the community in 2021. Pls see:
> > > > https://urldefense.com/v3/__https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/__;!!GjvTz_vk!VfP_SV-8uRya7rBdopv8OUJkmnSi44Ktpqq1E7sr_Xcwt6zvveW51qboWOBSTChdUG1hJwfAl7HZl4NUEGc$
> > Yes, I have sent the patchset of multifd support for rdma migration by taking over my colleague, and also
> > sorry for not keeping on this work at that time due to some reasons.
> > And also I am strongly agree with Lei that the RDMA protocol has some special advantages against with TCP
> > in some scenario, and we are indeed to use it in our product.
> > 
> > > I wasn't aware of that for sure in the past..
> > > 
> > > Multifd has changed quite a bit in the last 9.0 release, that may not apply
> > > anymore.  One thing to mention is please look at Dan's comment on possible
> > > use of rsocket.h:
> > > 
> > > https://urldefense.com/v3/__https://lore.kernel.org/all/ZjJm6rcqS5EhoKgK@redhat.com/__;!!GjvTz_vk!VfP_SV-8uRya7rBdopv8OUJkmnSi44Ktpqq1E7sr_Xcwt6zvveW51qboWOBSTChdUG1hJwfAl7HZ0CFSE-o$
> > > 
> > > And Jinpu did help provide an initial test result over the library:
> > > 
> > > https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/CAMGffEk8wiKNQmoUYxcaTHGtiEm2dwoCF_W7T0vMcD-i30tUkA@mail.gmail.com/__;!!GjvTz_vk!VfP_SV-8uRya7rBdopv8OUJkmnSi44Ktpqq1E7sr_Xcwt6zvveW51qboWOBSTChdUG1hJwfAl7HZxPNcdb4$
> > > 
> > > It looks like we have a chance to apply that in QEMU.
> > > 
> > > > 
> > > > > One thing to note that the question here is not about a pure performance
> > > > > comparison between rdma and nics only.  It's about help us make a decision
> > > > > on whether to drop rdma, iow, even if rdma performs well, the community still
> > > > > has the right to drop it if nobody can actively work and maintain it.
> > > > > It's just that if nics can perform as good it's more a reason to drop, unless
> > > > > companies can help to provide good support and work together.
> > > > > 
> > > > We are happy to provide the necessary review and maintenance work for RDMA
> > > > if the community needs it.
> > > > 
> > > > CC'ing Chuan Zheng.
> > > I'm not sure whether you and Jinpu's team would like to work together and
> > > provide a final solution for rdma over multifd.  It could be much simpler
> > > than the original 2021 proposal if the rsocket API will work out.
> > > 
> > > Thanks,
> > > 
> > That's a good news to see the socket abstraction for RDMA!
> > When I was developed the series above, the most pain is the RDMA migration has no QIOChannel abstraction and i need to take a 'fake channel'
> > for it which is awkward in code implementation.
> > So, as far as I know, we can do this by
> > i. the first thing is that we need to evaluate the rsocket is good enough to satisfy our QIOChannel fundamental abstraction
> > ii. if it works right, then we will continue to see if it can give us opportunity to hide the detail of rdma protocol
> >      into rsocket by remove most of code in rdma.c and also some hack in migration main process.
> > iii. implement the advanced features like multi-fd and multi-uri for rdma migration.
> > 
> > Since I am not familiar with rsocket, I need some times to look at it and do some quick verify with rdma migration based on rsocket.
> > But, yes, I am willing to involved in this refactor work and to see if we can make this migration feature more better:)
> > 
> > 
>
Peter Xu June 5, 2024, 2:59 p.m. UTC | #56
On Wed, Jun 05, 2024 at 10:10:57AM -0400, Peter Xu wrote:
> >   e) Someone made a good suggestion (sorry can't remember who) - that the
> >      RDMA migration structure was the wrong way around - it should be the
> >      destination which initiates an RDMA read, rather than the source
> >      doing a write; then things might become a LOT simpler; you just need
> >      to send page ranges to the destination and it can pull it.
> >      That might work nicely for postcopy.
> 
> I'm not sure whether it'll still be a problem if rdma recv side is based on
> zero-copy.  It would be a matter of whether atomicity can be guaranteed so
> that we don't want the guest vcpus to see a partially copied page during
> on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
> that.

And when thinking about this (given UFFDIO_COPY's inability to do zero-copy...),
the only way this will be able to do zero-copy is to use file memories
(shmem/hugetlbfs), as the page cache can be prepopulated. Then when we do DMA we
go through the page cache, which can be mapped at another virtual address
besides the one the vcpus are using.

Then we can use UFFDIO_CONTINUE (rather than UFFDIO_COPY) to do atomic
updates on the vcpu pgtables, avoiding the copy.  QEMU doesn't have it, but
it looks like there's one more reason we may want to have better use of
shmem.. than anonymous.  And actually when working on 4k faults on 1G
hugetlb I added CONTINUE support.

https://github.com/xzpeter/qemu/tree/doublemap
https://github.com/xzpeter/qemu/commit/b8aff3a9d7654b1cf2c089a06894ff4899740dc5
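
For reference, a minimal sketch of the resolve step (illustration only; it
assumes the userfaultfd was registered with UFFDIO_REGISTER_MODE_MINOR on the
shmem mapping and that the page cache at the faulting offset is already
populated, e.g. through a second mapping):

    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    /* Resolve a minor fault by wiring the existing page-cache page into the
     * faulting (vcpu) mapping; no data copy, unlike UFFDIO_COPY. */
    static int uffd_continue(int uffd, unsigned long fault_addr, size_t page_size)
    {
        struct uffdio_continue cont = {
            .range = {
                .start = fault_addr & ~(page_size - 1),
                .len   = page_size,
            },
            .mode = 0,          /* or UFFDIO_CONTINUE_MODE_DONTWAKE */
        };
        return ioctl(uffd, UFFDIO_CONTINUE, &cont);
    }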

Maybe it's worthwhile on its own now, because it also means we can use that
in multifd to avoid one extra layer of buffering when supporting
multifd+postcopy (which has the same issue here on directly copying data
into guest pages).  It'll also work with things like rdma I think in
similar ways.  It's just that it'll not work on anonymous memory.

I definitely hijacked the thread to somewhere too far away.  I'll stop
here..

Thanks,
Dr. David Alan Gilbert June 5, 2024, 8:48 p.m. UTC | #57
* Peter Xu (peterx@redhat.com) wrote:
> Hey, Dave!

Hey!

> On Wed, Jun 05, 2024 at 12:31:56AM +0000, Dr. David Alan Gilbert wrote:
> > * Michael Galaxy (mgalaxy@akamai.com) wrote:
> > > One thing to keep in mind here (despite me not having any hardware to test)
> > > was that one of the original goals here
> > > in the RDMA implementation was not simply raw throughput nor raw latency,
> > > but a lack of CPU utilization in kernel
> > > space due to the offload. While it is entirely possible that newer hardware
> > > w/ TCP might compete, the significant
> > > reductions in CPU usage in the TCP/IP stack were a big win at the time.
> > > 
> > > Just something to consider while you're doing the testing........
> > 
> > I just noticed this thread; some random notes from a somewhat
> > fragmented memory of this:
> > 
> >   a) Long long ago, I also tried rsocket; 
> >       https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> >      as I remember the library was quite flaky at the time.
> 
> Hmm interesting.  There also looks like a thread doing rpoll().

Yeh, I can't actually remember much more about what I did back then!

> Btw, not sure whether you noticed, but there's the series posted for the
> latest rsocket conversion here:
> 
> https://lore.kernel.org/r/1717503252-51884-1-git-send-email-arei.gonglei@huawei.com

Oh I hadn't; I think all of the stack of qemu's file abstractions had
changed in the ~10 years since I wrote my version!

> I hope Lei and his team has tested >4G mem, otherwise definitely worth
> checking.  Lei also mentioned there're rsocket bugs they found in the cover
> letter, but not sure what's that about.

It would probably be a good idea to keep track of what bugs
are in flight with it, and try it on a few RDMA cards to see
what problems get triggered.
I think I reported a few at the time, but I gave up after
feeling it was getting very hacky.

> Yes, and zero-copy requires multifd for now. I think it's because we didn't
> want to complicate the header processings in the migration stream where it
> may not be page aligned.

Ah yes.

> > 
> >   e) Someone made a good suggestion (sorry can't remember who) - that the
> >      RDMA migration structure was the wrong way around - it should be the
> >      destination which initiates an RDMA read, rather than the source
> >      doing a write; then things might become a LOT simpler; you just need
> >      to send page ranges to the destination and it can pull it.
> >      That might work nicely for postcopy.
> 
> I'm not sure whether it'll still be a problem if rdma recv side is based on
> zero-copy.  It would be a matter of whether atomicity can be guaranteed so
> that we don't want the guest vcpus to see a partially copied page during
> on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
> that.

Yes, but even ignoring that (and the UFFDIO_CONTINUE idea you mention), if
the destination can issue an RDMA read itself, it doesn't need to send messages
to the source to ask for a page fetch; it just goes and grabs it itself,
that's got to be good for latency.

Dave

> 
> Thanks,
> 
> -- 
> Peter Xu
>
Peter Xu June 5, 2024, 9:18 p.m. UTC | #58
On Wed, Jun 05, 2024 at 08:48:28PM +0000, Dr. David Alan Gilbert wrote:
> > > I just noticed this thread; some random notes from a somewhat
> > > fragmented memory of this:
> > > 
> > >   a) Long long ago, I also tried rsocket; 
> > >       https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> > >      as I remember the library was quite flaky at the time.
> > 
> > Hmm interesting.  There also looks like a thread doing rpoll().
> 
> Yeh, I can't actually remember much more about what I did back then!

Heh, that's understandable and fair. :)

> > I hope Lei and his team has tested >4G mem, otherwise definitely worth
> > checking.  Lei also mentioned there're rsocket bugs they found in the cover
> > letter, but not sure what's that about.
> 
> It would probably be a good idea to keep track of what bugs
> are in flight with it, and try it on a few RDMA cards to see
> what problems get triggered.
> I think I reported a few at the time, but I gave up after
> feeling it was getting very hacky.

Agreed.  Maybe we can have a list of that in the cover letter or even
QEMU's migration/rdma doc page.

Lei, if you think that makes sense please do so in your upcoming posts.
There'll need to be a list of things you encountered in the kernel driver,
and it'll be even better if there are further links to read on each problem.

> > > 
> > >   e) Someone made a good suggestion (sorry can't remember who) - that the
> > >      RDMA migration structure was the wrong way around - it should be the
> > >      destination which initiates an RDMA read, rather than the source
> > >      doing a write; then things might become a LOT simpler; you just need
> > >      to send page ranges to the destination and it can pull it.
> > >      That might work nicely for postcopy.
> > 
> > I'm not sure whether it'll still be a problem if rdma recv side is based on
> > zero-copy.  It would be a matter of whether atomicity can be guaranteed so
> > that we don't want the guest vcpus to see a partially copied page during
> > on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
> > that.
> 
> Yes, but even ignoring that (and the UFFDIO_CONTINUE idea you mention), if
> the destination can issue an RDMA read itself, it doesn't need to send messages
> to the source to ask for a page fetch; it just goes and grabs it itself,
> that's got to be good for latency.

Oh, that's pretty internal stuff of rdma to me and beyond my knowledge..
but from what I can tell it sounds very reasonable indeed!

Thanks!
"Xingtao Yao (Fujitsu)" via June 7, 2024, 8:57 a.m. UTC | #59
Hi,

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Thursday, June 6, 2024 5:19 AM
> To: Dr. David Alan Gilbert <dave@treblig.org>
> Cc: Michael Galaxy <mgalaxy@akamai.com>; zhengchuan
> <zhengchuan@huawei.com>; Gonglei (Arei) <arei.gonglei@huawei.com>;
> Daniel P. Berrangé <berrange@redhat.com>; Markus Armbruster
> <armbru@redhat.com>; Yu Zhang <yu.zhang@ionos.com>; Zhijian Li (Fujitsu)
> <lizhijian@fujitsu.com>; Jinpu Wang <jinpu.wang@ionos.com>; Elmar Gerdes
> <elmar.gerdes@ionos.com>; qemu-devel@nongnu.org; Yuval Shaia
> <yuval.shaia.ml@gmail.com>; Kevin Wolf <kwolf@redhat.com>; Prasanna
> Kumar Kalever <prasanna.kalever@redhat.com>; Cornelia Huck
> <cohuck@redhat.com>; Michael Roth <michael.roth@amd.com>; Prasanna
> Kumar Kalever <prasanna4324@gmail.com>; integration@gluster.org; Paolo
> Bonzini <pbonzini@redhat.com>; qemu-block@nongnu.org;
> devel@lists.libvirt.org; Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin
> <mst@redhat.com>; Thomas Huth <thuth@redhat.com>; Eric Blake
> <eblake@redhat.com>; Song Gao <gaosong@loongson.cn>; Marc-André
> Lureau <marcandre.lureau@redhat.com>; Alex Bennée
> <alex.bennee@linaro.org>; Wainer dos Santos Moschetta
> <wainersm@redhat.com>; Beraldo Leal <bleal@redhat.com>; Pannengyuan
> <pannengyuan@huawei.com>; Xiexiangyou <xiexiangyou@huawei.com>
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Wed, Jun 05, 2024 at 08:48:28PM +0000, Dr. David Alan Gilbert wrote:
> > > > I just noticed this thread; some random notes from a somewhat
> > > > fragmented memory of this:
> > > >
> > > >   a) Long long ago, I also tried rsocket;
> > > >
> https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> > > >      as I remember the library was quite flaky at the time.
> > >
> > > Hmm interesting.  There also looks like a thread doing rpoll().
> >
> > Yeh, I can't actually remember much more about what I did back then!
> 
> Heh, that's understandable and fair. :)
> 
> > > I hope Lei and his team has tested >4G mem, otherwise definitely
> > > worth checking.  Lei also mentioned there're rsocket bugs they found
> > > in the cover letter, but not sure what's that about.
> >
> > It would probably be a good idea to keep track of what bugs are in
> > flight with it, and try it on a few RDMA cards to see what problems
> > get triggered.
> > I think I reported a few at the time, but I gave up after feeling it
> > was getting very hacky.
> 
> Agreed.  Maybe we can have a list of that in the cover letter or even QEMU's
> migration/rmda doc page.
> 
> Lei, if you think that makes sense please do so in your upcoming posts.
> There'll need to have a list of things you encountered in the kernel driver and
> it'll be even better if there're further links to read on each problem.
> 
OK, no problem. There are two bugs:

Bug 1:

https://github.com/linux-rdma/rdma-core/commit/23985e25aebb559b761872313f8cab4e811c5a3d#diff-5ddbf83c6f021688166096ca96c9bba874dffc3cab88ded2e9d8b2176faa084cR3302-R3303

This commit introduces a bug that causes QEMU to hang.
When the timeout parameter of rpoll() is not -1 or 0, the program occasionally hangs.

Problem analysis:
During the first rpoll(), rs_poll_enter() at line 3297 performs pollcnt++, so pollcnt is 1.
At line 3302 the timeout expires and the function returns. Note that rs_poll_exit() is not
called here, so --pollcnt does not happen and pollcnt stays at 1.
During the second rpoll(), rs_poll_enter() at line 3297 performs pollcnt++ again, so pollcnt is 2.
If the poll does not time out and returns a value greater than 0, rs_poll_stop() is executed.
Because the if (--pollcnt) condition is false, suspendpoll = 1 is executed.
Control then goes back to the do/while loop inside rpoll(): in rs_poll_enter() the
if (suspendpoll) condition is now true, so it calls pthread_yield() and returns -EBUSY.
The do/while loop in rpoll() continues, and because the if (rs_poll_enter()) condition is
true, rs_poll_enter() is executed again after the continue. As a result, the program hangs.

Root cause: at line 3302, rs_poll_exit() is not executed before the function returns on timeout.


Bug 2:

In rsocket.c there is a receive queue, int accept_queue[2], implemented with a socketpair.
The listen_svc thread in rsocket.c is responsible for receiving connections and writing them
to accept_queue[1]. When raccept() is called, a connection is taken from accept_queue[0].
In the test case, qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN); waits for a readable event
(i.e. waits for a connection). rpoll() checks whether accept_queue[0] has a readable event;
however, this poll does not actually poll accept_queue[0]. Only after the timeout expires
does rpoll() pick up the readable event on accept_queue[0] from rs_poll_arm() again.

Impact:
The accept operation can only complete after 5000 ms. Of course, we can shorten this time by
echoing a millisecond value into /etc/rdma/rsocket/wake_up_interval.


Regards,
-Gonglei

> > > >
> > > >   e) Someone made a good suggestion (sorry can't remember who) -
> that the
> > > >      RDMA migration structure was the wrong way around - it should
> be the
> > > >      destination which initiates an RDMA read, rather than the source
> > > >      doing a write; then things might become a LOT simpler; you just
> need
> > > >      to send page ranges to the destination and it can pull it.
> > > >      That might work nicely for postcopy.
> > >
> > > I'm not sure whether it'll still be a problem if rdma recv side is
> > > based on zero-copy.  It would be a matter of whether atomicity can
> > > be guaranteed so that we don't want the guest vcpus to see a
> > > partially copied page during on-flight DMAs.  UFFDIO_COPY (or
> > > friend) is currently the only solution for that.
> >
> > Yes, but even ignoring that (and the UFFDIO_CONTINUE idea you
> > mention), if the destination can issue an RDMA read itself, it doesn't
> > need to send messages to the source to ask for a page fetch; it just
> > goes and grabs it itself, that's got to be good for latency.
> 
> Oh, that's pretty internal stuff of rdma to me and beyond my knowledge..
> but from what I can tell it sounds very reasonable indeed!
> 
> Thanks!
> 
> --
> Peter Xu
>
diff mbox series

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 91ab5235b8..05226cea0a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3426,13 +3426,6 @@  F: docs/devel/migration.rst
 F: qapi/migration.json
 F: tests/migration/
 F: util/userfaultfd.c
-X: migration/rdma*
-
-RDMA Migration
-R: Li Zhijian <lizhijian@fujitsu.com>
-R: Peter Xu <peterx@redhat.com>
-S: Odd Fixes
-F: migration/rdma*
 
 Migration dirty limit and dirty page rate
 M: Hyman Huang <yong.huang@smartx.com>
diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
index 54385a23e5..70278ce1e3 100644
--- a/docs/devel/migration/main.rst
+++ b/docs/devel/migration/main.rst
@@ -47,12 +47,6 @@  over any transport.
   QEMU interference. Note that QEMU does not flush cached file
   data/metadata at the end of migration.
 
-In addition, support is included for migration using RDMA, which
-transports the page data using ``RDMA``, where the hardware takes care of
-transporting the pages, and the load on the CPU is much lower.  While the
-internals of RDMA migration are a bit different, this isn't really visible
-outside the RAM migration code.
-
 All these migration protocols use the same infrastructure to
 save/restore state devices.  This infrastructure is shared with the
 savevm/loadvm functionality.
diff --git a/docs/rdma.txt b/docs/rdma.txt
deleted file mode 100644
index bd8dd799a9..0000000000
--- a/docs/rdma.txt
+++ /dev/null
@@ -1,420 +0,0 @@ 
-(RDMA: Remote Direct Memory Access)
-RDMA Live Migration Specification, Version # 1
-==============================================
-Wiki: https://wiki.qemu.org/Features/RDMALiveMigration
-Github: git@github.com:hinesmr/qemu.git, 'rdma' branch
-
-Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
-
-An *exhaustive* paper (2010) shows additional performance details
-linked on the QEMU wiki above.
-
-Contents:
-=========
-* Introduction
-* Before running
-* Running
-* Performance
-* RDMA Migration Protocol Description
-* Versioning and Capabilities
-* QEMUFileRDMA Interface
-* Migration of VM's ram
-* Error handling
-* TODO
-
-Introduction:
-=============
-
-RDMA helps make your migration more deterministic under heavy load because
-of the significantly lower latency and higher throughput over TCP/IP. This is
-because the RDMA I/O architecture reduces the number of interrupts and
-data copies by bypassing the host networking stack. In particular, a TCP-based
-migration, under certain types of memory-bound workloads, may take a more
-unpredictable amount of time to complete the migration if the amount of
-memory tracked during each live migration iteration round cannot keep pace
-with the rate of dirty memory produced by the workload.
-
-RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
-over Converged Ethernet) as well as Infiniband-based. This implementation of
-migration using RDMA is capable of using both technologies because of
-the use of the OpenFabrics OFED software stack that abstracts out the
-programming model irrespective of the underlying hardware.
-
-Refer to openfabrics.org or your respective RDMA hardware vendor for
-an understanding on how to verify that you have the OFED software stack
-installed in your environment. You should be able to successfully link
-against the "librdmacm" and "libibverbs" libraries and development headers
-for a working build of QEMU to run successfully using RDMA Migration.
-
-BEFORE RUNNING:
-===============
-
-Use of RDMA during migration requires pinning and registering memory
-with the hardware. This means that memory must be physically resident
-before the hardware can transmit that memory to another machine.
-If this is not acceptable for your application or product, then the use
-of RDMA migration may in fact be harmful to co-located VMs or other
-software on the machine if there is not sufficient memory available to
-relocate the entire footprint of the virtual machine. If so, then the
-use of RDMA is discouraged and it is recommended to use standard TCP migration.
-
-Experimental: Next, decide if you want dynamic page registration.
-For example, if you have an 8GB RAM virtual machine, but only 1GB
-is in active use, then enabling this feature will cause all 8GB to
-be pinned and resident in memory. This feature mostly affects the
-bulk-phase round of the migration and can be enabled for extremely
-high-performance RDMA hardware using the following command:
-
-QEMU Monitor Command:
-$ migrate_set_capability rdma-pin-all on # disabled by default
-
-Performing this action will cause all 8GB to be pinned, so if that's
-not what you want, then please ignore this step altogether.
-
-On the other hand, this will also significantly speed up the bulk round
-of the migration, which can greatly reduce the "total" time of your migration.
-Example performance of this using an idle VM in the previous example
-can be found in the "Performance" section.
-
-Note: for very large virtual machines (hundreds of GBs), pinning all
-*all* of the memory of your virtual machine in the kernel is very expensive
-may extend the initial bulk iteration time by many seconds,
-and thus extending the total migration time. However, this will not
-affect the determinism or predictability of your migration you will
-still gain from the benefits of advanced pinning with RDMA.
-
-RUNNING:
-========
-
-First, set the migration speed to match your hardware's capabilities:
-
-QEMU Monitor Command:
-$ migrate_set_parameter max-bandwidth 40g # or whatever is the MAX of your RDMA device
-
-Next, on the destination machine, add the following to the QEMU command line:
-
-qemu ..... -incoming rdma:host:port
-
-Finally, perform the actual migration on the source machine:
-
-QEMU Monitor Command:
-$ migrate -d rdma:host:port
-
-PERFORMANCE
-===========
-
-Here is a brief summary of total migration time and downtime using RDMA:
-Using a 40gbps infiniband link performing a worst-case stress test,
-using an 8GB RAM virtual machine:
-
-Using the following command:
-$ apt-get install stress
-$ stress --vm-bytes 7500M --vm 1 --vm-keep
-
-1. Migration throughput: 26 gigabits/second.
-2. Downtime (stop time) varies between 15 and 100 milliseconds.
-
-EFFECTS of memory registration on bulk phase round:
-
-For example, in the same 8GB RAM example with all 8GB of memory in
-active use and the VM itself is completely idle using the same 40 gbps
-infiniband link:
-
-1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
-2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
-
-These numbers would of course scale up to whatever size virtual machine
-you have to migrate using RDMA.
-
-Enabling this feature does *not* have any measurable effect on
-migration *downtime*. This is because, without this feature, all of the
-memory will have already been registered in advance during
-the bulk round and does not need to be re-registered during the successive
-iteration rounds.
-
-RDMA Protocol Description:
-==========================
-
-Migration with RDMA is separated into two parts:
-
-1. The transmission of the pages using RDMA
-2. Everything else (a control channel is introduced)
-
-"Everything else" is transmitted using a formal
-protocol now, consisting of infiniband SEND messages.
-
-An infiniband SEND message is the standard ibverbs
-message used by applications running on infiniband hardware.
-The only difference between a SEND message and an RDMA
-message is that SEND messages cause notifications
-to be posted to the completion queue (CQ) on the
-infiniband receiver side, whereas RDMA messages (used
-for VM's ram) do not (to behave like an actual DMA).
-
-Messages in infiniband require two things:
-
-1. registration of the memory that will be transmitted
-2. (SEND only) work requests to be posted on both
-   sides of the network before the actual transmission
-   can occur.
-
-RDMA messages are much easier to deal with. Once the memory
-on the receiver side is registered and pinned, we're
-basically done. All that is required is for the sender
-side to start dumping bytes onto the link.
-
-(Memory is not released from pinning until the migration
-completes, given that RDMA migrations are very fast.)
-
-SEND messages require more coordination because the
-receiver must have reserved space (using a receive
-work request) on the receive queue (RQ) before QEMUFileRDMA
-can start using them to carry all the bytes as
-a control transport for migration of device state.
-
-To begin the migration, the initial connection setup is
-as follows (migration/rdma.c):
-
-1. Receiver and Sender are started (command line or libvirt):
-2. Both sides post two RQ work requests
-3. Receiver does listen()
-4. Sender does connect()
-5. Receiver accept()
-6. Check versioning and capabilities (described later)
-
-At this point, we define a control channel on top of SEND messages
-which is described by a formal protocol. Each SEND message has a
-header portion and a data portion (but together are transmitted
-as a single SEND message).
-
-Header:
-    * Length               (of the data portion, uint32, network byte order)
-    * Type                 (what command to perform, uint32, network byte order)
-    * Repeat               (Number of commands in data portion, same type only)
-
-The 'Repeat' field is here to support future multiple page registrations
-in a single message without any need to change the protocol itself,
-so that the protocol stays compatible across multiple versions of QEMU.
-Version #1 requires that all server implementations of the protocol
-check this field, register all requests found in the array of commands located
-in the data portion, and return an equal number of results in the response.
-The maximum number of repeats is hard-coded to 4096. This is a conservative
-limit based on the maximum size of a SEND message along with empirical
-observations on the maximum future benefit of simultaneous page registrations.
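-
-For illustration, the wire format of this header boils down to a small
-packed structure, a minimal sketch of which follows (it mirrors the
-RDMAControlHeader layout in migration/rdma.c; the identifiers here are
-illustrative, and every field is converted to network byte order before
-it is sent):
-
-    #include <stdint.h>
-    #include <arpa/inet.h>
-
-    /* Control-channel header: three 32-bit fields plus padding,
-     * always carried in network byte order. */
-    typedef struct __attribute__((packed)) {
-        uint32_t len;      /* length of the data portion */
-        uint32_t type;     /* which command to perform */
-        uint32_t repeat;   /* number of commands of the same type */
-        uint32_t padding;
-    } ControlHeader;
-
-    static void header_to_network(ControlHeader *h)
-    {
-        h->len    = htonl(h->len);
-        h->type   = htonl(h->type);
-        h->repeat = htonl(h->repeat);
-    }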
-
-The 'type' field has 12 different command values:
-     1. Unused
-     2. Error                      (sent to the source during bad things)
-     3. Ready                      (control-channel is available)
-     4. QEMU File                  (for sending non-live device state)
-     5. RAM Blocks request         (used right after connection setup)
-     6. RAM Blocks result          (used right after connection setup)
-     7. Compress page              (zap zero page and skip registration)
-     8. Register request           (dynamic chunk registration)
-     9. Register result            ('rkey' to be used by sender)
-    10. Register finished          (registration for current iteration finished)
-    11. Unregister request         (unpin previously registered memory)
-    12. Unregister finished        (confirmation that unpin completed)
-
-A single control message, as hinted above, can contain within the data
-portion an array of many commands of the same type. If there is more than
-one command, then the 'repeat' field will be greater than 1.
-
-After connection setup, messages 5 & 6 are used to exchange ram block
-information and optionally pin all the memory if requested by the user.
-
-After ram block exchange is completed, we have two protocol-level
-functions, responsible for communicating control-channel commands
-using the above list of values:
-
-Logically:
-
-qemu_rdma_exchange_recv(header, expected command type)
-
-1. We transmit a READY command to let the sender know that
-   we are *ready* to receive some data bytes on the control channel.
-2. Before attempting to receive the expected command, we post another
-   RQ work request to replace the one we just used up.
-3. Block on a CQ event channel and wait for the SEND to arrive.
-4. When the send arrives, librdmacm will unblock us.
-5. Verify that the command-type and version received matches the one we expected.
-
-qemu_rdma_exchange_send(header, data, optional response header & data):
-
-1. Block on the CQ event channel waiting for a READY command
-   from the receiver to tell us that the receiver
-   is *ready* for us to transmit some new bytes.
-2. Optionally: if we are expecting a response from the command
-   (that we have not yet transmitted), let's post an RQ
-   work request to receive that data a few moments later.
-3. When the READY arrives, librdmacm will
-   unblock us and we immediately post a RQ work request
-   to replace the one we just used up.
-4. Now, we can actually post the work request to SEND
-   the requested command type of the header we were asked for.
-5. Optionally, if we are expecting a response (as before),
-   we block again and wait for that response using the additional
-   work request we previously posted. (This is used to carry
-   'Register result' commands back to the sender, which
-   hold the rkey needed to perform RDMA. Note that the virtual address
-   corresponding to this rkey was already exchanged at the beginning
-   of the connection, described below.)
-
-All of the remaining command types (not including 'ready')
-described above use the aforementioned two functions to do the hard work:
-
-1. After connection setup, RAMBlock information is exchanged using
-   this protocol before the actual migration begins. This information includes
-   a description of each RAMBlock on the server side as well as the virtual addresses
-   and lengths of each RAMBlock. This is used by the client to determine the
-   start and stop locations of chunks and how to register them dynamically
-   before performing the RDMA operations.
-2. During runtime, once a 'chunk' becomes full of pages ready to
-   be sent with RDMA, the registration commands are used to ask the
-   other side to register the memory for this chunk and respond
-   with the result (rkey) of the registration.
-3. The QEMUFile interfaces also call these functions (described below)
-   when transmitting non-live state, such as device state, or to send
-   their own protocol information during the migration process.
-4. Finally, zero pages are only checked if a page has not yet been registered
-   using chunk registration (or not checked at all and unconditionally
-   written if chunk registration is disabled). This is accomplished using
-   the "Compress" command listed above. If the page *has* been registered,
-   then we check the entire chunk for zero. Only if the entire chunk is
-   zero do we send a compress command to zap the page on the other side.
-
-Versioning and Capabilities
-===========================
-Current version of the protocol is version #1.
-
-The same version applies to both protocol traffic and capabilities
-negotiation (i.e. there is only one version number that is referred to
-by all communication).
-
-librdmacm provides the user with a 'private data' area to be exchanged
-at connection-setup time before any infiniband traffic is generated.
-
-Header:
-    * Version (protocol version validated before send/recv occurs),
-                                               uint32, network byte order
-    * Flags   (bitwise OR of each capability),
-                                               uint32, network byte order
-
-There is no data portion of this header right now, so there is
-no length field. The maximum size of the 'private data' section
-is only 192 bytes per the Infiniband specification, so it's not
-very useful for data anyway. This structure needs to remain small.
-
-This private data area is a convenient place to check for protocol
-versioning because the user does not need to register memory to
-transmit a few bytes of version information.
-
-This is also a convenient place to negotiate capabilities
-(like dynamic page registration).
-
-If the version is invalid, we throw an error.
-
-If the version is new, we only negotiate the capabilities that the
-requested version is able to perform and ignore the rest.
-
-Currently there is only one capability in Version #1: dynamic page registration
-
-Finally, negotiation happens via the Flags field: if the primary-VM
-sets a flag but the destination does not support that capability, the
-destination will return a zero bit for that flag, and the primary-VM
-will understand that the capability is not available and will thus
-disable it on the primary-VM side.
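-
-As a rough sketch of that negotiation (mirroring the RDMACapabilities
-structure and the known_capabilities mask in migration/rdma.c; the
-helper names below are illustrative), the destination simply masks the
-requested flags with the set of capabilities it knows about:
-
-    #include <stdint.h>
-    #include <arpa/inet.h>
-
-    #define CAP_PIN_ALL 0x01        /* the only capability in version 1 */
-
-    typedef struct {
-        uint32_t version;           /* protocol version, currently 1 */
-        uint32_t flags;             /* bitwise OR of capabilities */
-    } Capabilities;                 /* sent as 'private data', NBO */
-
-    /* Destination side: keep only the flags we understand, so the
-     * source sees a zero bit for anything we cannot honour. */
-    static uint32_t negotiate_flags(uint32_t requested)
-    {
-        return requested & CAP_PIN_ALL;
-    }
-
-    static void caps_to_network(Capabilities *cap)
-    {
-        cap->version = htonl(cap->version);
-        cap->flags   = htonl(cap->flags);
-    }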
-
-QEMUFileRDMA Interface:
-=======================
-
-QEMUFileRDMA introduces a couple of new functions:
-
-1. qemu_rdma_get_buffer()               (QEMUFileOps rdma_read_ops)
-2. qemu_rdma_put_buffer()               (QEMUFileOps rdma_write_ops)
-
-These two functions are very short and simply use the protocol
-described above to deliver bytes without changing the upper-level
-users of QEMUFile that depend on a bytestream abstraction.
-
-Finally, how do we handoff the actual bytes to get_buffer()?
-
-Again, because we're trying to "fake" a bytestream abstraction
-using an analogy not unlike individual UDP frames, we have
-to hold on to the bytes received from the control-channel's SEND
-messages in memory.
-
-Each time we receive a complete "QEMU File" control-channel
-message, the bytes from SEND are copied into a small local holding area.
-
-Then, we return the number of bytes requested by get_buffer()
-and leave the remaining bytes in the holding area until get_buffer()
-comes around for another pass.
-
-If the buffer is empty, then we follow the same steps
-listed above and issue another "QEMU File" protocol command,
-asking for a new SEND message to re-fill the buffer.
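-
-A minimal sketch of that holding-area logic follows (the names and the
-buffer size are illustrative, not the actual QEMU identifiers): bytes
-copied out of the last "QEMU File" SEND are handed to the caller
-piecemeal, and a return value of zero tells the caller to issue another
-"QEMU File" command to refill the buffer.
-
-    #include <stddef.h>
-    #include <string.h>
-
-    typedef struct {
-        unsigned char data[512 * 1024];  /* copy of the last SEND payload */
-        size_t len;                      /* bytes currently held */
-        size_t pos;                      /* bytes already consumed */
-    } HoldingArea;
-
-    static size_t holding_area_read(HoldingArea *h, void *buf, size_t want)
-    {
-        size_t avail = h->len - h->pos;
-        size_t n = want < avail ? want : avail;
-
-        memcpy(buf, h->data + h->pos, n);
-        h->pos += n;
-        return n;   /* 0: refill via another "QEMU File" command */
-    }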
-
-Migration of VM's ram:
-======================
-
-At the beginning of the migration (migration/rdma.c),
-the sender and the receiver populate the list of RAMBlocks
-to be registered with each other into a structure.
-Then, using the aforementioned protocol, they exchange a
-description of these blocks with each other, to be used later
-during the iteration of main memory. This description includes
-a list of all the RAMBlocks, their offsets, lengths, and virtual
-addresses, and, in case dynamic page registration was disabled on the
-server-side, their pre-registered RDMA keys.
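-
-Per RAMBlock, that description boils down to one small packed record
-(this sketch mirrors the RDMADestBlock structure in migration/rdma.c;
-the pre-registered rkey is only meaningful when dynamic registration is
-disabled):
-
-    #include <stdint.h>
-
-    /* One entry per RAMBlock, sent from the destination to the source
-     * at connection time, in network byte order. */
-    typedef struct __attribute__((packed)) {
-        uint64_t remote_host_addr;   /* destination virtual address */
-        uint64_t offset;             /* offset in the ram_addr_t space */
-        uint64_t length;             /* length of the block */
-        uint32_t remote_rkey;        /* pre-registered key, if pin-all */
-        uint32_t padding;
-    } DestBlockDesc;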
-
-Main memory is not migrated with the aforementioned protocol,
-but is instead migrated with normal RDMA Write operations.
-
-Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
-Chunk size is not dynamic, but it could be in a future implementation.
-There's nothing to indicate that this is useful right now.
-
-When a chunk is full (or a flush() occurs), the memory backed by
-the chunk is registered with librdmacm and pinned in memory on
-both sides using the aforementioned protocol.
-After pinning, an RDMA Write is generated and transmitted
-for the entire chunk.
-
-Chunks are also transmitted in batches: This means that we
-do not request that the hardware signal the completion queue
-for the completion of *every* chunk. The current batch size
-is about 64 chunks (corresponding to 64 MB of memory).
-Only the last chunk in a batch must be signaled.
-This helps keep everything as asynchronous as possible
-and helps keep the hardware busy performing RDMA operations.
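-
-A small sketch of the chunking and batching arithmetic described above
-(the constants follow this text: 1 MB chunks, roughly 64 chunks per
-signalled batch; the helper names are illustrative):
-
-    #include <stdint.h>
-    #include <stdbool.h>
-
-    #define CHUNK_SHIFT  20u    /* 1 MB chunks */
-    #define SIGNAL_EVERY 64u    /* ~64 MB of writes per signalled batch */
-
-    /* Which chunk of its RAMBlock does a host address fall into? */
-    static uint64_t chunk_index(const uint8_t *block_start,
-                                const uint8_t *host)
-    {
-        return (uint64_t)(host - block_start) >> CHUNK_SHIFT;
-    }
-
-    /* Only the last RDMA write of each batch asks the hardware to post
-     * a completion; the writes in between stay unsignalled. */
-    static bool must_signal(uint64_t writes_posted)
-    {
-        return (writes_posted % SIGNAL_EVERY) == 0;
-    }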
-
-Error-handling:
-===============
-
-Infiniband has what is called a "Reliable, Connected"
-link (one of 4 choices). This is the mode
-we use for RDMA migration.
-
-If a *single* message fails,
-the decision is to abort the migration entirely,
-clean up all the RDMA descriptors, and unregister all
-the memory.
-
-After cleanup, the Virtual Machine is returned to normal
-operation the same way it would be if the TCP
-socket were broken during a non-RDMA based migration.
-
-TODO:
-=====
-1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
-   are not compatible with infiniband memory pinning and will result in
-   an aborted migration (but with the source VM left unaffected).
-2. Use of the recent /proc/<pid>/pagemap would likely speed up
-   the use of KSM and ballooning while using RDMA.
-3. Some form of balloon-device usage tracking would also
-   help alleviate some issues.
-4. Use LRU to provide more fine-grained direction of UNREGISTER
-   requests for unpinning memory in an overcommitted environment.
-5. Expose UNREGISTER support to the user by way of workload-specific
-   hints about application behavior.
diff --git a/docs/system/loongarch/virt.rst b/docs/system/loongarch/virt.rst
index 06d034b8ef..0a8e0766e4 100644
--- a/docs/system/loongarch/virt.rst
+++ b/docs/system/loongarch/virt.rst
@@ -39,7 +39,7 @@  can be accessed by following steps.
 
 .. code-block:: bash
 
-  ./configure --disable-rdma --prefix=/usr \
+  ./configure --prefix=/usr \
               --target-list="loongarch64-softmmu" \
               --disable-libiscsi --disable-libnfs --disable-libpmem \
               --disable-glusterfs --enable-libusb --enable-usb-redir \
diff --git a/meson.build b/meson.build
index d6af3cd53a..bd65abad13 100644
--- a/meson.build
+++ b/meson.build
@@ -1854,21 +1854,6 @@  if numa.found() and not cc.links('''
   endif
 endif
 
-rdma = not_found
-if not get_option('rdma').auto() or have_system
-  libumad = cc.find_library('ibumad', required: get_option('rdma'))
-  rdma_libs = [cc.find_library('rdmacm', has_headers: ['rdma/rdma_cma.h'],
-                               required: get_option('rdma')),
-               cc.find_library('ibverbs', required: get_option('rdma')),
-               libumad]
-  rdma = declare_dependency(dependencies: rdma_libs)
-  foreach lib: rdma_libs
-    if not lib.found()
-      rdma = not_found
-    endif
-  endforeach
-endif
-
 cacard = not_found
 if not get_option('smartcard').auto() or have_system
   cacard = dependency('libcacard', required: get_option('smartcard'),
@@ -2246,7 +2231,6 @@  endif
 config_host_data.set('CONFIG_OPENGL', opengl.found())
 config_host_data.set('CONFIG_PLUGIN', get_option('plugins'))
 config_host_data.set('CONFIG_RBD', rbd.found())
-config_host_data.set('CONFIG_RDMA', rdma.found())
 config_host_data.set('CONFIG_RELOCATABLE', get_option('relocatable'))
 config_host_data.set('CONFIG_SAFESTACK', get_option('safe_stack'))
 config_host_data.set('CONFIG_SDL', sdl.found())
@@ -2399,12 +2383,6 @@  if rbd.found()
                                        dependencies: rbd,
                                        prefix: '#include <rbd/librbd.h>'))
 endif
-if rdma.found()
-  config_host_data.set('HAVE_IBV_ADVISE_MR',
-                       cc.has_function('ibv_advise_mr',
-                                       dependencies: rdma,
-                                       prefix: '#include <infiniband/verbs.h>'))
-endif
 
 have_asan_fiber = false
 if get_option('sanitizers') and \
@@ -4398,7 +4376,6 @@  summary_info += {'Multipath support': mpathpersist}
 summary_info += {'Linux AIO support': libaio}
 summary_info += {'Linux io_uring support': linux_io_uring}
 summary_info += {'ATTR/XATTR support': libattr}
-summary_info += {'RDMA support':      rdma}
 summary_info += {'fdt support':       fdt_opt == 'disabled' ? false : fdt_opt}
 summary_info += {'libcap-ng support': libcap_ng}
 summary_info += {'bpf support':       libbpf}
diff --git a/qapi/migration.json b/qapi/migration.json
index 8c65b90328..9a56d403be 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -221,8 +221,8 @@ 
 #
 # @setup-time: amount of setup time in milliseconds *before* the
 #     iterations begin but *after* the QMP command is issued.  This is
-#     designed to provide an accounting of any activities (such as
-#     RDMA pinning) which may be expensive, but do not actually occur
+#     designed to provide an accounting of any activities which may be
+#     expensive, but do not actually occur
 #     during the iterative migration rounds themselves.  (since 1.6)
 #
 # @cpu-throttle-percentage: percentage of time guest cpus are being
@@ -430,10 +430,6 @@ 
 #     for certain work loads, by sending compressed difference of the
 #     pages
 #
-# @rdma-pin-all: Controls whether or not the entire VM memory
-#     footprint is mlock()'d on demand or all at once.  Refer to
-#     docs/rdma.txt for usage.  Disabled by default.  (since 2.0)
-#
 # @zero-blocks: During storage migration encode blocks of zeroes
 #     efficiently.  This essentially saves 1MB of zeroes per block on
 #     the wire.  Enabling requires source and target VM to support
@@ -547,7 +543,7 @@ 
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
+  'data': ['xbzrle', 'auto-converge', 'zero-blocks',
            { 'name': 'compress', 'features': [ 'deprecated' ] },
            'events', 'postcopy-ram',
            { 'name': 'x-colo', 'features': [ 'unstable' ] },
@@ -606,7 +602,6 @@ 
 #     -> { "execute": "query-migrate-capabilities" }
 #     <- { "return": [
 #           {"state": false, "capability": "xbzrle"},
-#           {"state": false, "capability": "rdma-pin-all"},
 #           {"state": false, "capability": "auto-converge"},
 #           {"state": false, "capability": "zero-blocks"},
 #           {"state": false, "capability": "compress"},
@@ -1654,14 +1649,12 @@ 
 #
 # @exec: Direct the migration stream to another process.
 #
-# @rdma: Migrate via RDMA.
-#
 # @file: Direct the migration stream to a file.
 #
 # Since: 8.2
 ##
 { 'enum': 'MigrationAddressType',
-  'data': [ 'socket', 'exec', 'rdma', 'file' ] }
+  'data': [ 'socket', 'exec', 'file' ] }
 
 ##
 # @FileMigrationArgs:
@@ -1701,7 +1694,6 @@ 
   'data': {
     'socket': 'SocketAddress',
     'exec': 'MigrationExecCommand',
-    'rdma': 'InetSocketAddress',
     'file': 'FileMigrationArgs' } }
 
 ##
@@ -1804,14 +1796,6 @@ 
 #     -> { "execute": "migrate",
 #          "arguments": {
 #              "channels": [ { "channel-type": "main",
-#                              "addr": { "transport": "rdma",
-#                                        "host": "10.12.34.9",
-#                                        "port": "1050" } } ] } }
-#     <- { "return": {} }
-#
-#     -> { "execute": "migrate",
-#          "arguments": {
-#              "channels": [ { "channel-type": "main",
 #                              "addr": { "transport": "file",
 #                                        "filename": "/tmp/migfile",
 #                                        "offset": "0x1000" } } ] } }
@@ -1879,13 +1863,6 @@ 
 #                                                  "/some/sock" ] } } ] } }
 #     <- { "return": {} }
 #
-#     -> { "execute": "migrate-incoming",
-#          "arguments": {
-#              "channels": [ { "channel-type": "main",
-#                              "addr": { "transport": "rdma",
-#                                        "host": "10.12.34.9",
-#                                        "port": "1050" } } ] } }
-#     <- { "return": {} }
 ##
 { 'command': 'migrate-incoming',
              'data': {'*uri': 'str',
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 05290ade76..817c53559a 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -93,10 +93,6 @@  typedef struct {
      * Maximum amount of data we can send in a cycle.
      */
     Stat64 rate_limit_max;
-    /*
-     * Number of bytes sent through RDMA.
-     */
-    Stat64 rdma_bytes;
     /*
      * Number of pages transferred that were full of zeros.
      */
@@ -133,7 +129,7 @@  void migration_rate_set(uint64_t new_rate);
  *
  * Returns how many bytes have we transferred since the beginning of
  * the migration.  It accounts for bytes sent through any migration
- * channel, multifd, qemu_file, rdma, ....
+ * channel, multifd, qemu_file, ....
  */
 uint64_t migration_transferred_bytes(void);
 #endif
diff --git a/migration/migration.h b/migration/migration.h
index 8045e39c26..d097828580 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -162,13 +162,6 @@  struct MigrationIncomingState {
 
     int state;
 
-    /*
-     * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
-     * Used to wake the migration incoming coroutine from rdma code. How much is
-     * it safe - it's a question.
-     */
-    Coroutine *loadvm_co;
-
     /* The coroutine we should enter (back) after failover */
     Coroutine *colo_incoming_co;
     QemuSemaphore colo_incoming_sem;
@@ -463,8 +456,6 @@  struct MigrationState {
      * switchover has been received.
      */
     bool switchover_acked;
-    /* Is this a rdma migration */
-    bool rdma_migration;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
diff --git a/migration/options.h b/migration/options.h
index ab8199e207..c00213973e 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -37,7 +37,6 @@  bool migrate_multifd(void);
 bool migrate_pause_before_switchover(void);
 bool migrate_postcopy_blocktime(void);
 bool migrate_postcopy_preempt(void);
-bool migrate_rdma_pin_all(void);
 bool migrate_release_ram(void);
 bool migrate_return_path(void);
 bool migrate_validate_uuid(void);
@@ -54,7 +53,6 @@  bool migrate_zero_copy_send(void);
 
 bool migrate_multifd_flush_after_each_section(void);
 bool migrate_postcopy(void);
-bool migrate_rdma(void);
 bool migrate_tls(void);
 
 /* capabilities helpers */
diff --git a/migration/rdma.h b/migration/rdma.h
deleted file mode 100644
index a8d27f33b8..0000000000
--- a/migration/rdma.h
+++ /dev/null
@@ -1,69 +0,0 @@ 
-/*
- * RDMA protocol and interfaces
- *
- * Copyright IBM, Corp. 2010-2013
- * Copyright Red Hat, Inc. 2015-2016
- *
- * Authors:
- *  Michael R. Hines <mrhines@us.ibm.com>
- *  Jiuxing Liu <jl@us.ibm.com>
- *  Daniel P. Berrange <berrange@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or
- * later.  See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/sockets.h"
-
-#ifndef QEMU_MIGRATION_RDMA_H
-#define QEMU_MIGRATION_RDMA_H
-
-#include "exec/memory.h"
-
-void rdma_start_outgoing_migration(void *opaque, InetSocketAddress *host_port,
-                                   Error **errp);
-
-void rdma_start_incoming_migration(InetSocketAddress *host_port, Error **errp);
-
-/*
- * Constants used by rdma return codes
- */
-#define RAM_CONTROL_SETUP     0
-#define RAM_CONTROL_ROUND     1
-#define RAM_CONTROL_FINISH    3
-
-/*
- * Whenever this is found in the data stream, the flags
- * will be passed to rdma functions in the incoming-migration
- * side.
- */
-#define RAM_SAVE_FLAG_HOOK     0x80
-
-#define RAM_SAVE_CONTROL_NOT_SUPP -1000
-#define RAM_SAVE_CONTROL_DELAYED  -2000
-
-#ifdef CONFIG_RDMA
-int rdma_registration_handle(QEMUFile *f);
-int rdma_registration_start(QEMUFile *f, uint64_t flags);
-int rdma_registration_stop(QEMUFile *f, uint64_t flags);
-int rdma_block_notification_handle(QEMUFile *f, const char *name);
-int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
-                           ram_addr_t offset, size_t size);
-#else
-static inline
-int rdma_registration_handle(QEMUFile *f) { return 0; }
-static inline
-int rdma_registration_start(QEMUFile *f, uint64_t flags) { return 0; }
-static inline
-int rdma_registration_stop(QEMUFile *f, uint64_t flags) { return 0; }
-static inline
-int rdma_block_notification_handle(QEMUFile *f, const char *name) { return 0; }
-static inline
-int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
-                           ram_addr_t offset, size_t size)
-{
-    return RAM_SAVE_CONTROL_NOT_SUPP;
-}
-#endif
-#endif
diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index f690b98a03..9bc8d7018f 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -62,9 +62,8 @@  void migration_rate_reset(void)
 uint64_t migration_transferred_bytes(void)
 {
     uint64_t multifd = stat64_get(&mig_stats.multifd_bytes);
-    uint64_t rdma = stat64_get(&mig_stats.rdma_bytes);
     uint64_t qemu_file = stat64_get(&mig_stats.qemu_file_transferred);
 
-    trace_migration_transferred_bytes(qemu_file, multifd, rdma);
-    return qemu_file + multifd + rdma;
+    trace_migration_transferred_bytes(qemu_file, multifd);
+    return qemu_file + multifd;
 }
diff --git a/migration/migration.c b/migration/migration.c
index 9fe8fd2afd..8e17914c8b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -25,7 +25,6 @@ 
 #include "sysemu/runstate.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/cpu-throttle.h"
-#include "rdma.h"
 #include "ram.h"
 #include "ram-compress.h"
 #include "migration/global_state.h"
@@ -545,7 +544,6 @@  bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
 {
     g_autoptr(MigrationChannel) val = g_new0(MigrationChannel, 1);
     g_autoptr(MigrationAddress) addr = g_new0(MigrationAddress, 1);
-    InetSocketAddress *isock = &addr->u.rdma;
     strList **tail = &addr->u.exec.args;
 
     if (strstart(uri, "exec:", NULL)) {
@@ -558,12 +556,6 @@  bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
         QAPI_LIST_APPEND(tail, g_strdup("-c"));
 #endif
         QAPI_LIST_APPEND(tail, g_strdup(uri + strlen("exec:")));
-    } else if (strstart(uri, "rdma:", NULL)) {
-        if (inet_parse(isock, uri + strlen("rdma:"), errp)) {
-            qapi_free_InetSocketAddress(isock);
-            return false;
-        }
-        addr->transport = MIGRATION_ADDRESS_TYPE_RDMA;
     } else if (strstart(uri, "tcp:", NULL) ||
                 strstart(uri, "unix:", NULL) ||
                 strstart(uri, "vsock:", NULL) ||
@@ -645,22 +637,6 @@  static void qemu_start_incoming_migration(const char *uri, bool has_channels,
         } else if (saddr->type == SOCKET_ADDRESS_TYPE_FD) {
             fd_start_incoming_migration(saddr->u.fd.str, errp);
         }
-#ifdef CONFIG_RDMA
-    } else if (addr->transport == MIGRATION_ADDRESS_TYPE_RDMA) {
-        if (migrate_compress()) {
-            error_setg(errp, "RDMA and compression can't be used together");
-            return;
-        }
-        if (migrate_xbzrle()) {
-            error_setg(errp, "RDMA and XBZRLE can't be used together");
-            return;
-        }
-        if (migrate_multifd()) {
-            error_setg(errp, "RDMA and multifd can't be used together");
-            return;
-        }
-        rdma_start_incoming_migration(&addr->u.rdma, errp);
-#endif
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_EXEC) {
         exec_start_incoming_migration(addr->u.exec.args, errp);
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
@@ -751,9 +727,7 @@  process_incoming_migration_co(void *opaque)
     migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
                       MIGRATION_STATUS_ACTIVE);
 
-    mis->loadvm_co = qemu_coroutine_self();
     ret = qemu_loadvm_state(mis->from_src_file);
-    mis->loadvm_co = NULL;
 
     trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
 
@@ -1679,7 +1653,6 @@  int migrate_init(MigrationState *s, Error **errp)
     s->iteration_initial_bytes = 0;
     s->threshold_size = 0;
     s->switchover_acked = false;
-    s->rdma_migration = false;
     /*
      * set mig_stats memory to zero for a new migration
      */
@@ -2100,10 +2073,6 @@  void qmp_migrate(const char *uri, bool has_channels,
         } else if (saddr->type == SOCKET_ADDRESS_TYPE_FD) {
             fd_start_outgoing_migration(s, saddr->u.fd.str, &local_err);
         }
-#ifdef CONFIG_RDMA
-    } else if (addr->transport == MIGRATION_ADDRESS_TYPE_RDMA) {
-        rdma_start_outgoing_migration(s, &addr->u.rdma, &local_err);
-#endif
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_EXEC) {
         exec_start_outgoing_migration(s, addr->u.exec.args, &local_err);
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
diff --git a/migration/options.c b/migration/options.c
index bfd7753b69..02fc0b9ae8 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -185,7 +185,6 @@  Property migration_properties[] = {
 
     /* Migration capabilities */
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
-    DEFINE_PROP_MIG_CAP("x-rdma-pin-all", MIGRATION_CAPABILITY_RDMA_PIN_ALL),
     DEFINE_PROP_MIG_CAP("x-auto-converge", MIGRATION_CAPABILITY_AUTO_CONVERGE),
     DEFINE_PROP_MIG_CAP("x-zero-blocks", MIGRATION_CAPABILITY_ZERO_BLOCKS),
     DEFINE_PROP_MIG_CAP("x-compress", MIGRATION_CAPABILITY_COMPRESS),
@@ -323,13 +322,6 @@  bool migrate_postcopy_ram(void)
     return s->capabilities[MIGRATION_CAPABILITY_POSTCOPY_RAM];
 }
 
-bool migrate_rdma_pin_all(void)
-{
-    MigrationState *s = migrate_get_current();
-
-    return s->capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL];
-}
-
 bool migrate_release_ram(void)
 {
     MigrationState *s = migrate_get_current();
@@ -393,13 +385,6 @@  bool migrate_postcopy(void)
     return migrate_postcopy_ram() || migrate_dirty_bitmaps();
 }
 
-bool migrate_rdma(void)
-{
-    MigrationState *s = migrate_get_current();
-
-    return s->rdma_migration;
-}
-
 bool migrate_tls(void)
 {
     MigrationState *s = migrate_get_current();
@@ -458,7 +443,6 @@  INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
     MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER,
     MIGRATION_CAPABILITY_AUTO_CONVERGE,
     MIGRATION_CAPABILITY_RELEASE_RAM,
-    MIGRATION_CAPABILITY_RDMA_PIN_ALL,
     MIGRATION_CAPABILITY_COMPRESS,
     MIGRATION_CAPABILITY_XBZRLE,
     MIGRATION_CAPABILITY_X_COLO,
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index a10882d47f..ad2efb332e 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -32,7 +32,6 @@ 
 #include "trace.h"
 #include "options.h"
 #include "qapi/error.h"
-#include "rdma.h"
 #include "io/channel-file.h"
 
 #define IO_BUF_SIZE 32768
diff --git a/migration/ram.c b/migration/ram.c
index 8deb84984f..c81c8a7cff 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -59,7 +59,6 @@ 
 #include "qemu/iov.h"
 #include "multifd.h"
 #include "sysemu/runstate.h"
-#include "rdma.h"
 #include "options.h"
 #include "sysemu/dirtylimit.h"
 #include "sysemu/kvm.h"
@@ -89,7 +88,7 @@ 
 #define RAM_SAVE_FLAG_EOS      0x10
 #define RAM_SAVE_FLAG_CONTINUE 0x20
 #define RAM_SAVE_FLAG_XBZRLE   0x40
-/* 0x80 is reserved in rdma.h for RAM_SAVE_FLAG_HOOK */
+#define RAM_SAVE_FLAG_HOOK     0x80 /* was reserved by RDMA */
 #define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
 #define RAM_SAVE_FLAG_MULTIFD_FLUSH    0x200
 /* We can't use any flag that is bigger than 0x200 */
@@ -1175,32 +1174,6 @@  static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
     return len;
 }
 
-/*
- * @pages: the number of pages written by the control path,
- *        < 0 - error
- *        > 0 - number of pages written
- *
- * Return true if the pages has been saved, otherwise false is returned.
- */
-static bool control_save_page(PageSearchStatus *pss,
-                              ram_addr_t offset, int *pages)
-{
-    int ret;
-
-    ret = rdma_control_save_page(pss->pss_channel, pss->block->offset, offset,
-                                 TARGET_PAGE_SIZE);
-    if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
-        return false;
-    }
-
-    if (ret == RAM_SAVE_CONTROL_DELAYED) {
-        *pages = 1;
-        return true;
-    }
-    *pages = ret;
-    return true;
-}
-
 /*
  * directly send the page to the stream
  *
@@ -2080,11 +2053,6 @@  static bool save_compress_page(RAMState *rs, PageSearchStatus *pss,
 static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss)
 {
     ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
-    int res;
-
-    if (control_save_page(pss, offset, &res)) {
-        return res;
-    }
 
     if (save_compress_page(rs, pss, offset)) {
         return 1;
@@ -3114,18 +3082,6 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
         }
     }
 
-    ret = rdma_registration_start(f, RAM_CONTROL_SETUP);
-    if (ret < 0) {
-        qemu_file_set_error(f, ret);
-        return ret;
-    }
-
-    ret = rdma_registration_stop(f, RAM_CONTROL_SETUP);
-    if (ret < 0) {
-        qemu_file_set_error(f, ret);
-        return ret;
-    }
-
     migration_ops = g_malloc0(sizeof(MigrationOps));
 
     if (migrate_multifd()) {
@@ -3221,12 +3177,6 @@  static int ram_save_iterate(QEMUFile *f, void *opaque)
             /* Read version before ram_list.blocks */
             smp_rmb();
 
-            ret = rdma_registration_start(f, RAM_CONTROL_ROUND);
-            if (ret < 0) {
-                qemu_file_set_error(f, ret);
-                goto out;
-            }
-
             t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
             i = 0;
             while ((ret = migration_rate_exceeded(f)) == 0 ||
@@ -3278,15 +3228,6 @@  static int ram_save_iterate(QEMUFile *f, void *opaque)
         }
     }
 
-    /*
-     * Must occur before EOS (or any QEMUFile operation)
-     * because of RDMA protocol.
-     */
-    ret = rdma_registration_stop(f, RAM_CONTROL_ROUND);
-    if (ret < 0) {
-        qemu_file_set_error(f, ret);
-    }
-
 out:
     if (ret >= 0
         && migration_is_setup_or_active()) {
@@ -3332,12 +3273,6 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
             migration_bitmap_sync_precopy(rs, true);
         }
 
-        ret = rdma_registration_start(f, RAM_CONTROL_FINISH);
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
-            return ret;
-        }
-
         /* try transferring iterative blocks of memory */
 
         /* flush all remaining blocks regardless of rate limiting */
@@ -3358,12 +3293,6 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
         qemu_mutex_unlock(&rs->bitmap_mutex);
 
         compress_flush_data();
-
-        ret = rdma_registration_stop(f, RAM_CONTROL_FINISH);
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
-            return ret;
-        }
     }
 
     ret = multifd_send_sync_main();
@@ -3576,8 +3505,7 @@  static inline void *colo_cache_from_block_offset(RAMBlock *block,
 /**
  * ram_handle_zero: handle the zero page case
  *
- * If a page (or a whole RDMA chunk) has been
- * determined to be zero, then zap it.
+ * If a page has been determined to be zero, then zap it.
  *
  * @host: host address for the zero page
  * @ch: what the page is filled from.  We only support zero
@@ -4161,10 +4089,6 @@  static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
             return -EINVAL;
         }
     }
-    ret = rdma_block_notification_handle(f, block->idstr);
-    if (ret < 0) {
-        qemu_file_set_error(f, ret);
-    }
 
     return ret;
 }
@@ -4363,12 +4287,6 @@  static int ram_load_precopy(QEMUFile *f)
                 multifd_recv_sync_main();
             }
             break;
-        case RAM_SAVE_FLAG_HOOK:
-            ret = rdma_registration_handle(f);
-            if (ret < 0) {
-                qemu_file_set_error(f, ret);
-            }
-            break;
         default:
             error_report("Unknown combination of migration flags: 0x%x", flags);
             ret = -EINVAL;
diff --git a/migration/rdma.c b/migration/rdma.c
deleted file mode 100644
index 855753c671..0000000000
--- a/migration/rdma.c
+++ /dev/null
@@ -1,4184 +0,0 @@ 
-/*
- * RDMA protocol and interfaces
- *
- * Copyright IBM, Corp. 2010-2013
- * Copyright Red Hat, Inc. 2015-2016
- *
- * Authors:
- *  Michael R. Hines <mrhines@us.ibm.com>
- *  Jiuxing Liu <jl@us.ibm.com>
- *  Daniel P. Berrange <berrange@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or
- * later.  See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include "qapi/error.h"
-#include "qemu/cutils.h"
-#include "exec/target_page.h"
-#include "rdma.h"
-#include "migration.h"
-#include "migration-stats.h"
-#include "qemu-file.h"
-#include "ram.h"
-#include "qemu/error-report.h"
-#include "qemu/main-loop.h"
-#include "qemu/module.h"
-#include "qemu/rcu.h"
-#include "qemu/sockets.h"
-#include "qemu/bitmap.h"
-#include "qemu/coroutine.h"
-#include "exec/memory.h"
-#include <sys/socket.h>
-#include <netdb.h>
-#include <arpa/inet.h>
-#include <rdma/rdma_cma.h>
-#include "trace.h"
-#include "qom/object.h"
-#include "options.h"
-#include <poll.h>
-
-#define RDMA_RESOLVE_TIMEOUT_MS 10000
-
-/* Do not merge data if larger than this. */
-#define RDMA_MERGE_MAX (2 * 1024 * 1024)
-#define RDMA_SIGNALED_SEND_MAX (RDMA_MERGE_MAX / 4096)
-
-#define RDMA_REG_CHUNK_SHIFT 20 /* 1 MB */
-
-/*
- * This is only for non-live state being migrated.
- * Instead of RDMA_WRITE messages, we use RDMA_SEND
- * messages for that state, which requires a different
- * delivery design than main memory.
- */
-#define RDMA_SEND_INCREMENT 32768
-
-/*
- * Maximum size infiniband SEND message
- */
-#define RDMA_CONTROL_MAX_BUFFER (512 * 1024)
-#define RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE 4096
-
-#define RDMA_CONTROL_VERSION_CURRENT 1
-/*
- * Capabilities for negotiation.
- */
-#define RDMA_CAPABILITY_PIN_ALL 0x01
-
-/*
- * Add the other flags above to this list of known capabilities
- * as they are introduced.
- */
-static uint32_t known_capabilities = RDMA_CAPABILITY_PIN_ALL;
-
-/*
- * A work request ID is 64-bits and we split up these bits
- * into 3 parts:
- *
- * bits 0-15 : type of control message, 2^16
- * bits 16-29: ram block index, 2^14
- * bits 30-63: ram block chunk number, 2^34
- *
- * The last two bit ranges are only used for RDMA writes,
- * in order to track their completion and potentially
- * also track unregistration status of the message.
- */
-#define RDMA_WRID_TYPE_SHIFT  0UL
-#define RDMA_WRID_BLOCK_SHIFT 16UL
-#define RDMA_WRID_CHUNK_SHIFT 30UL
-
-#define RDMA_WRID_TYPE_MASK \
-    ((1UL << RDMA_WRID_BLOCK_SHIFT) - 1UL)
-
-#define RDMA_WRID_BLOCK_MASK \
-    (~RDMA_WRID_TYPE_MASK & ((1UL << RDMA_WRID_CHUNK_SHIFT) - 1UL))
-
-#define RDMA_WRID_CHUNK_MASK (~RDMA_WRID_BLOCK_MASK & ~RDMA_WRID_TYPE_MASK)
-
-/*
- * RDMA migration protocol:
- * 1. RDMA Writes (data messages, i.e. RAM)
- * 2. IB Send/Recv (control channel messages)
- */
-enum {
-    RDMA_WRID_NONE = 0,
-    RDMA_WRID_RDMA_WRITE = 1,
-    RDMA_WRID_SEND_CONTROL = 2000,
-    RDMA_WRID_RECV_CONTROL = 4000,
-};
-
-/*
- * Work request IDs for IB SEND messages only (not RDMA writes).
- * This is used by the migration protocol to transmit
- * control messages (such as device state and registration commands)
- *
- * We could use more WRs, but we have enough for now.
- */
-enum {
-    RDMA_WRID_READY = 0,
-    RDMA_WRID_DATA,
-    RDMA_WRID_CONTROL,
-    RDMA_WRID_MAX,
-};
-
-/*
- * SEND/RECV IB Control Messages.
- */
-enum {
-    RDMA_CONTROL_NONE = 0,
-    RDMA_CONTROL_ERROR,
-    RDMA_CONTROL_READY,               /* ready to receive */
-    RDMA_CONTROL_QEMU_FILE,           /* QEMUFile-transmitted bytes */
-    RDMA_CONTROL_RAM_BLOCKS_REQUEST,  /* RAMBlock synchronization */
-    RDMA_CONTROL_RAM_BLOCKS_RESULT,   /* RAMBlock synchronization */
-    RDMA_CONTROL_COMPRESS,            /* page contains repeat values */
-    RDMA_CONTROL_REGISTER_REQUEST,    /* dynamic page registration */
-    RDMA_CONTROL_REGISTER_RESULT,     /* key to use after registration */
-    RDMA_CONTROL_REGISTER_FINISHED,   /* current iteration finished */
-    RDMA_CONTROL_UNREGISTER_REQUEST,  /* dynamic UN-registration */
-    RDMA_CONTROL_UNREGISTER_FINISHED, /* unpinning finished */
-};
-
-
-/*
- * Memory and MR structures used to represent an IB Send/Recv work request.
- * This is *not* used for RDMA writes, only IB Send/Recv.
- */
-typedef struct {
-    uint8_t  control[RDMA_CONTROL_MAX_BUFFER]; /* actual buffer to register */
-    struct   ibv_mr *control_mr;               /* registration metadata */
-    size_t   control_len;                      /* length of the message */
-    uint8_t *control_curr;                     /* start of unconsumed bytes */
-} RDMAWorkRequestData;
-
-/*
- * Negotiate RDMA capabilities during connection-setup time.
- */
-typedef struct {
-    uint32_t version;
-    uint32_t flags;
-} RDMACapabilities;
-
-static void caps_to_network(RDMACapabilities *cap)
-{
-    cap->version = htonl(cap->version);
-    cap->flags = htonl(cap->flags);
-}
-
-static void network_to_caps(RDMACapabilities *cap)
-{
-    cap->version = ntohl(cap->version);
-    cap->flags = ntohl(cap->flags);
-}
-
-/*
- * Representation of a RAMBlock from an RDMA perspective.
- * This is not transmitted, only local.
- * This and subsequent structures cannot be linked lists
- * because we're using a single IB message to transmit
- * the information. It's small anyway, so a list is overkill.
- */
-typedef struct RDMALocalBlock {
-    char          *block_name;
-    uint8_t       *local_host_addr; /* local virtual address */
-    uint64_t       remote_host_addr; /* remote virtual address */
-    uint64_t       offset;
-    uint64_t       length;
-    struct         ibv_mr **pmr;    /* MRs for chunk-level registration */
-    struct         ibv_mr *mr;      /* MR for non-chunk-level registration */
-    uint32_t      *remote_keys;     /* rkeys for chunk-level registration */
-    uint32_t       remote_rkey;     /* rkeys for non-chunk-level registration */
-    int            index;           /* which block are we */
-    unsigned int   src_index;       /* (Only used on dest) */
-    bool           is_ram_block;
-    int            nb_chunks;
-    unsigned long *transit_bitmap;
-    unsigned long *unregister_bitmap;
-} RDMALocalBlock;
-
-/*
- * Also represents a RAMblock, but only on the dest.
- * This gets transmitted by the dest during connection-time
- * to the source VM and then is used to populate the
- * corresponding RDMALocalBlock with
- * the information needed to perform the actual RDMA.
- */
-typedef struct QEMU_PACKED RDMADestBlock {
-    uint64_t remote_host_addr;
-    uint64_t offset;
-    uint64_t length;
-    uint32_t remote_rkey;
-    uint32_t padding;
-} RDMADestBlock;
-
-static const char *control_desc(unsigned int rdma_control)
-{
-    static const char *strs[] = {
-        [RDMA_CONTROL_NONE] = "NONE",
-        [RDMA_CONTROL_ERROR] = "ERROR",
-        [RDMA_CONTROL_READY] = "READY",
-        [RDMA_CONTROL_QEMU_FILE] = "QEMU FILE",
-        [RDMA_CONTROL_RAM_BLOCKS_REQUEST] = "RAM BLOCKS REQUEST",
-        [RDMA_CONTROL_RAM_BLOCKS_RESULT] = "RAM BLOCKS RESULT",
-        [RDMA_CONTROL_COMPRESS] = "COMPRESS",
-        [RDMA_CONTROL_REGISTER_REQUEST] = "REGISTER REQUEST",
-        [RDMA_CONTROL_REGISTER_RESULT] = "REGISTER RESULT",
-        [RDMA_CONTROL_REGISTER_FINISHED] = "REGISTER FINISHED",
-        [RDMA_CONTROL_UNREGISTER_REQUEST] = "UNREGISTER REQUEST",
-        [RDMA_CONTROL_UNREGISTER_FINISHED] = "UNREGISTER FINISHED",
-    };
-
-    if (rdma_control > RDMA_CONTROL_UNREGISTER_FINISHED) {
-        return "??BAD CONTROL VALUE??";
-    }
-
-    return strs[rdma_control];
-}
-
-#if !defined(htonll)
-static uint64_t htonll(uint64_t v)
-{
-    union { uint32_t lv[2]; uint64_t llv; } u;
-    u.lv[0] = htonl(v >> 32);
-    u.lv[1] = htonl(v & 0xFFFFFFFFULL);
-    return u.llv;
-}
-#endif
-
-#if !defined(ntohll)
-static uint64_t ntohll(uint64_t v)
-{
-    union { uint32_t lv[2]; uint64_t llv; } u;
-    u.llv = v;
-    return ((uint64_t)ntohl(u.lv[0]) << 32) | (uint64_t) ntohl(u.lv[1]);
-}
-#endif
-
-static void dest_block_to_network(RDMADestBlock *db)
-{
-    db->remote_host_addr = htonll(db->remote_host_addr);
-    db->offset = htonll(db->offset);
-    db->length = htonll(db->length);
-    db->remote_rkey = htonl(db->remote_rkey);
-}
-
-static void network_to_dest_block(RDMADestBlock *db)
-{
-    db->remote_host_addr = ntohll(db->remote_host_addr);
-    db->offset = ntohll(db->offset);
-    db->length = ntohll(db->length);
-    db->remote_rkey = ntohl(db->remote_rkey);
-}
-
-/*
- * Virtual address of the above structures used for transmitting
- * the RAMBlock descriptions at connection-time.
- * This structure is *not* transmitted.
- */
-typedef struct RDMALocalBlocks {
-    int nb_blocks;
-    bool     init;             /* main memory init complete */
-    RDMALocalBlock *block;
-} RDMALocalBlocks;
-
-/*
- * Main data structure for RDMA state.
- * While there is only one copy of this structure being allocated right now,
- * this is the place where one would start if you wanted to consider
- * having more than one RDMA connection open at the same time.
- */
-typedef struct RDMAContext {
-    char *host;
-    int port;
-
-    RDMAWorkRequestData wr_data[RDMA_WRID_MAX];
-
-    /*
-     * This is used by *_exchange_send() to figure out whether or not
-     * the initial "READY" message has already been received or not.
-     * This is because other functions may potentially poll() and detect
-     * the READY message before send() does, in which case we need to
-     * know if it completed.
-     */
-    int control_ready_expected;
-
-    /* number of outstanding writes */
-    int nb_sent;
-
-    /* store info about current buffer so that we can
-       merge it with future sends */
-    uint64_t current_addr;
-    uint64_t current_length;
-    /* index of ram block the current buffer belongs to */
-    int current_index;
-    /* index of the chunk in the current ram block */
-    int current_chunk;
-
-    bool pin_all;
-
-    /*
-     * infiniband-specific variables for opening the device
-     * and maintaining connection state and so forth.
-     *
-     * cm_id also has ibv_context, rdma_event_channel, and ibv_qp in
-     * cm_id->verbs, cm_id->channel, and cm_id->qp.
-     */
-    struct rdma_cm_id *cm_id;               /* connection manager ID */
-    struct rdma_cm_id *listen_id;
-    bool connected;
-
-    struct ibv_context          *verbs;
-    struct rdma_event_channel   *channel;
-    struct ibv_qp *qp;                      /* queue pair */
-    struct ibv_comp_channel *recv_comp_channel;  /* recv completion channel */
-    struct ibv_comp_channel *send_comp_channel;  /* send completion channel */
-    struct ibv_pd *pd;                      /* protection domain */
-    struct ibv_cq *recv_cq;                 /* receive completion queue */
-    struct ibv_cq *send_cq;                 /* send completion queue */
-
-    /*
-     * If a previous write failed (perhaps because of a failed
-     * memory registration), do not attempt any future work
-     * and remember the error state.
-     */
-    bool errored;
-    bool error_reported;
-    bool received_error;
-
-    /*
-     * Description of ram blocks used throughout the code.
-     */
-    RDMALocalBlocks local_ram_blocks;
-    RDMADestBlock  *dest_blocks;
-
-    /* Index of the next RAMBlock received during block registration */
-    unsigned int    next_src_index;
-
-    /*
-     * Migration on the *destination* has started, in which case we
-     * use the coroutine yield function. The source runs in a thread,
-     * so we don't care.
-     */
-    int migration_started_on_destination;
-
-    int total_registrations;
-    int total_writes;
-
-    int unregister_current, unregister_next;
-    uint64_t unregistrations[RDMA_SIGNALED_SEND_MAX];
-
-    GHashTable *blockmap;
-
-    /* the RDMAContext for return path */
-    struct RDMAContext *return_path;
-    bool is_return_path;
-} RDMAContext;
-
-#define TYPE_QIO_CHANNEL_RDMA "qio-channel-rdma"
-OBJECT_DECLARE_SIMPLE_TYPE(QIOChannelRDMA, QIO_CHANNEL_RDMA)
-
-
-
-struct QIOChannelRDMA {
-    QIOChannel parent;
-    RDMAContext *rdmain;
-    RDMAContext *rdmaout;
-    QEMUFile *file;
-    bool blocking; /* XXX we don't actually honour this yet */
-};
-
-/*
- * Main structure for IB Send/Recv control messages.
- * This gets prepended at the beginning of every Send/Recv.
- */
-typedef struct QEMU_PACKED {
-    uint32_t len;     /* Total length of data portion */
-    uint32_t type;    /* which control command to perform */
-    uint32_t repeat;  /* number of commands in data portion of same type */
-    uint32_t padding;
-} RDMAControlHeader;
-
-static void control_to_network(RDMAControlHeader *control)
-{
-    control->type = htonl(control->type);
-    control->len = htonl(control->len);
-    control->repeat = htonl(control->repeat);
-}
-
-static void network_to_control(RDMAControlHeader *control)
-{
-    control->type = ntohl(control->type);
-    control->len = ntohl(control->len);
-    control->repeat = ntohl(control->repeat);
-}
-
-/*
- * Register a single Chunk.
- * Information sent by the source VM to inform the dest
- * to register an single chunk of memory before we can perform
- * the actual RDMA operation.
- */
-typedef struct QEMU_PACKED {
-    union QEMU_PACKED {
-        uint64_t current_addr;  /* offset into the ram_addr_t space */
-        uint64_t chunk;         /* chunk to lookup if unregistering */
-    } key;
-    uint32_t current_index; /* which ramblock the chunk belongs to */
-    uint32_t padding;
-    uint64_t chunks;            /* how many sequential chunks to register */
-} RDMARegister;
-
-static bool rdma_errored(RDMAContext *rdma)
-{
-    if (rdma->errored && !rdma->error_reported) {
-        error_report("RDMA is in an error state waiting migration"
-                     " to abort!");
-        rdma->error_reported = true;
-    }
-    return rdma->errored;
-}
-
-static void register_to_network(RDMAContext *rdma, RDMARegister *reg)
-{
-    RDMALocalBlock *local_block;
-    local_block  = &rdma->local_ram_blocks.block[reg->current_index];
-
-    if (local_block->is_ram_block) {
-        /*
-         * current_addr as passed in is an address in the local ram_addr_t
-         * space, we need to translate this for the destination
-         */
-        reg->key.current_addr -= local_block->offset;
-        reg->key.current_addr += rdma->dest_blocks[reg->current_index].offset;
-    }
-    reg->key.current_addr = htonll(reg->key.current_addr);
-    reg->current_index = htonl(reg->current_index);
-    reg->chunks = htonll(reg->chunks);
-}
-
-static void network_to_register(RDMARegister *reg)
-{
-    reg->key.current_addr = ntohll(reg->key.current_addr);
-    reg->current_index = ntohl(reg->current_index);
-    reg->chunks = ntohll(reg->chunks);
-}
-
-typedef struct QEMU_PACKED {
-    uint32_t value;     /* if zero, we will madvise() */
-    uint32_t block_idx; /* which ram block index */
-    uint64_t offset;    /* Address in remote ram_addr_t space */
-    uint64_t length;    /* length of the chunk */
-} RDMACompress;
-
-static void compress_to_network(RDMAContext *rdma, RDMACompress *comp)
-{
-    comp->value = htonl(comp->value);
-    /*
-     * comp->offset as passed in is an address in the local ram_addr_t
-     * space, we need to translate this for the destination
-     */
-    comp->offset -= rdma->local_ram_blocks.block[comp->block_idx].offset;
-    comp->offset += rdma->dest_blocks[comp->block_idx].offset;
-    comp->block_idx = htonl(comp->block_idx);
-    comp->offset = htonll(comp->offset);
-    comp->length = htonll(comp->length);
-}
-
-static void network_to_compress(RDMACompress *comp)
-{
-    comp->value = ntohl(comp->value);
-    comp->block_idx = ntohl(comp->block_idx);
-    comp->offset = ntohll(comp->offset);
-    comp->length = ntohll(comp->length);
-}
-
-/*
- * The result of the dest's memory registration produces an "rkey"
- * which the source VM must reference in order to perform
- * the RDMA operation.
- */
-typedef struct QEMU_PACKED {
-    uint32_t rkey;
-    uint32_t padding;
-    uint64_t host_addr;
-} RDMARegisterResult;
-
-static void result_to_network(RDMARegisterResult *result)
-{
-    result->rkey = htonl(result->rkey);
-    result->host_addr = htonll(result->host_addr);
-};
-
-static void network_to_result(RDMARegisterResult *result)
-{
-    result->rkey = ntohl(result->rkey);
-    result->host_addr = ntohll(result->host_addr);
-};
-
-static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
-                                   uint8_t *data, RDMAControlHeader *resp,
-                                   int *resp_idx,
-                                   int (*callback)(RDMAContext *rdma,
-                                                   Error **errp),
-                                   Error **errp);
-
-static inline uint64_t ram_chunk_index(const uint8_t *start,
-                                       const uint8_t *host)
-{
-    return ((uintptr_t) host - (uintptr_t) start) >> RDMA_REG_CHUNK_SHIFT;
-}
-
-static inline uint8_t *ram_chunk_start(const RDMALocalBlock *rdma_ram_block,
-                                       uint64_t i)
-{
-    return (uint8_t *)(uintptr_t)(rdma_ram_block->local_host_addr +
-                                  (i << RDMA_REG_CHUNK_SHIFT));
-}
-
-static inline uint8_t *ram_chunk_end(const RDMALocalBlock *rdma_ram_block,
-                                     uint64_t i)
-{
-    uint8_t *result = ram_chunk_start(rdma_ram_block, i) +
-                                         (1UL << RDMA_REG_CHUNK_SHIFT);
-
-    if (result > (rdma_ram_block->local_host_addr + rdma_ram_block->length)) {
-        result = rdma_ram_block->local_host_addr + rdma_ram_block->length;
-    }
-
-    return result;
-}
-
-static void rdma_add_block(RDMAContext *rdma, const char *block_name,
-                           void *host_addr,
-                           ram_addr_t block_offset, uint64_t length)
-{
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-    RDMALocalBlock *block;
-    RDMALocalBlock *old = local->block;
-
-    local->block = g_new0(RDMALocalBlock, local->nb_blocks + 1);
-
-    if (local->nb_blocks) {
-        if (rdma->blockmap) {
-            for (int x = 0; x < local->nb_blocks; x++) {
-                g_hash_table_remove(rdma->blockmap,
-                                    (void *)(uintptr_t)old[x].offset);
-                g_hash_table_insert(rdma->blockmap,
-                                    (void *)(uintptr_t)old[x].offset,
-                                    &local->block[x]);
-            }
-        }
-        memcpy(local->block, old, sizeof(RDMALocalBlock) * local->nb_blocks);
-        g_free(old);
-    }
-
-    block = &local->block[local->nb_blocks];
-
-    block->block_name = g_strdup(block_name);
-    block->local_host_addr = host_addr;
-    block->offset = block_offset;
-    block->length = length;
-    block->index = local->nb_blocks;
-    block->src_index = ~0U; /* Filled in by the receipt of the block list */
-    block->nb_chunks = ram_chunk_index(host_addr, host_addr + length) + 1UL;
-    block->transit_bitmap = bitmap_new(block->nb_chunks);
-    bitmap_clear(block->transit_bitmap, 0, block->nb_chunks);
-    block->unregister_bitmap = bitmap_new(block->nb_chunks);
-    bitmap_clear(block->unregister_bitmap, 0, block->nb_chunks);
-    block->remote_keys = g_new0(uint32_t, block->nb_chunks);
-
-    block->is_ram_block = local->init ? false : true;
-
-    if (rdma->blockmap) {
-        g_hash_table_insert(rdma->blockmap, (void *)(uintptr_t)block_offset, block);
-    }
-
-    trace_rdma_add_block(block_name, local->nb_blocks,
-                         (uintptr_t) block->local_host_addr,
-                         block->offset, block->length,
-                         (uintptr_t) (block->local_host_addr + block->length),
-                         BITS_TO_LONGS(block->nb_chunks) *
-                             sizeof(unsigned long) * 8,
-                         block->nb_chunks);
-
-    local->nb_blocks++;
-}
-
-/*
- * Memory regions need to be registered with the device and queue pairs set up
- * in advance before the migration starts. This tells us where the RAM blocks
- * are so that we can register them individually.
- */
-static int qemu_rdma_init_one_block(RAMBlock *rb, void *opaque)
-{
-    const char *block_name = qemu_ram_get_idstr(rb);
-    void *host_addr = qemu_ram_get_host_addr(rb);
-    ram_addr_t block_offset = qemu_ram_get_offset(rb);
-    ram_addr_t length = qemu_ram_get_used_length(rb);
-    rdma_add_block(opaque, block_name, host_addr, block_offset, length);
-    return 0;
-}
-
-/*
- * Identify the RAMBlocks and their quantity. They will be used as references
- * to identify chunk boundaries inside each RAMBlock and also be referenced
- * during dynamic page registration.
- */
-static void qemu_rdma_init_ram_blocks(RDMAContext *rdma)
-{
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-    int ret;
-
-    assert(rdma->blockmap == NULL);
-    memset(local, 0, sizeof *local);
-    ret = foreach_not_ignored_block(qemu_rdma_init_one_block, rdma);
-    assert(!ret);
-    trace_qemu_rdma_init_ram_blocks(local->nb_blocks);
-    rdma->dest_blocks = g_new0(RDMADestBlock,
-                               rdma->local_ram_blocks.nb_blocks);
-    local->init = true;
-}
-
-/*
- * Note: If used outside of cleanup, the caller must ensure that the destination
- * block structures are also updated
- */
-static void rdma_delete_block(RDMAContext *rdma, RDMALocalBlock *block)
-{
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-    RDMALocalBlock *old = local->block;
-
-    if (rdma->blockmap) {
-        g_hash_table_remove(rdma->blockmap, (void *)(uintptr_t)block->offset);
-    }
-    if (block->pmr) {
-        for (int j = 0; j < block->nb_chunks; j++) {
-            if (!block->pmr[j]) {
-                continue;
-            }
-            ibv_dereg_mr(block->pmr[j]);
-            rdma->total_registrations--;
-        }
-        g_free(block->pmr);
-        block->pmr = NULL;
-    }
-
-    if (block->mr) {
-        ibv_dereg_mr(block->mr);
-        rdma->total_registrations--;
-        block->mr = NULL;
-    }
-
-    g_free(block->transit_bitmap);
-    block->transit_bitmap = NULL;
-
-    g_free(block->unregister_bitmap);
-    block->unregister_bitmap = NULL;
-
-    g_free(block->remote_keys);
-    block->remote_keys = NULL;
-
-    g_free(block->block_name);
-    block->block_name = NULL;
-
-    if (rdma->blockmap) {
-        for (int x = 0; x < local->nb_blocks; x++) {
-            g_hash_table_remove(rdma->blockmap,
-                                (void *)(uintptr_t)old[x].offset);
-        }
-    }
-
-    if (local->nb_blocks > 1) {
-
-        local->block = g_new0(RDMALocalBlock, local->nb_blocks - 1);
-
-        if (block->index) {
-            memcpy(local->block, old, sizeof(RDMALocalBlock) * block->index);
-        }
-
-        if (block->index < (local->nb_blocks - 1)) {
-            memcpy(local->block + block->index, old + (block->index + 1),
-                sizeof(RDMALocalBlock) *
-                    (local->nb_blocks - (block->index + 1)));
-            for (int x = block->index; x < local->nb_blocks - 1; x++) {
-                local->block[x].index--;
-            }
-        }
-    } else {
-        assert(block == local->block);
-        local->block = NULL;
-    }
-
-    trace_rdma_delete_block(block, (uintptr_t)block->local_host_addr,
-                           block->offset, block->length,
-                            (uintptr_t)(block->local_host_addr + block->length),
-                           BITS_TO_LONGS(block->nb_chunks) *
-                               sizeof(unsigned long) * 8, block->nb_chunks);
-
-    g_free(old);
-
-    local->nb_blocks--;
-
-    if (local->nb_blocks && rdma->blockmap) {
-        for (int x = 0; x < local->nb_blocks; x++) {
-            g_hash_table_insert(rdma->blockmap,
-                                (void *)(uintptr_t)local->block[x].offset,
-                                &local->block[x]);
-        }
-    }
-}
-
-/*
- * Trace RDMA device open, with device details.
- */
-static void qemu_rdma_dump_id(const char *who, struct ibv_context *verbs)
-{
-    struct ibv_port_attr port;
-
-    if (ibv_query_port(verbs, 1, &port)) {
-        trace_qemu_rdma_dump_id_failed(who);
-        return;
-    }
-
-    trace_qemu_rdma_dump_id(who,
-                verbs->device->name,
-                verbs->device->dev_name,
-                verbs->device->dev_path,
-                verbs->device->ibdev_path,
-                port.link_layer,
-                port.link_layer == IBV_LINK_LAYER_INFINIBAND ? "Infiniband"
-                : port.link_layer == IBV_LINK_LAYER_ETHERNET ? "Ethernet"
-                : "Unknown");
-}
-
-/*
- * Trace RDMA gid addressing information.
- * Useful for understanding the RDMA device hierarchy in the kernel.
- */
-static void qemu_rdma_dump_gid(const char *who, struct rdma_cm_id *id)
-{
-    char sgid[33];
-    char dgid[33];
-    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.sgid, sgid, sizeof sgid);
-    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.dgid, dgid, sizeof dgid);
-    trace_qemu_rdma_dump_gid(who, sgid, dgid);
-}
-
-/*
- * As of now, IPv6 over RoCE / iWARP is not supported by linux.
- * We will try the next addrinfo struct, and fail if there are
- * no other valid addresses to bind against.
- *
- * If the user is listening on '[::]', then we will not have opened a device
- * yet and have no way of verifying if the device is RoCE or not.
- *
- * In this case, the source VM will throw an error for ALL types of
- * connections (both IPv4 and IPv6) if the destination machine does not have
- * a regular infiniband network available for use.
- *
- * The only way to guarantee that an error is thrown for broken kernels is
- * for the management software to choose a *specific* interface at bind time
- * and validate what type of hardware it is.
- *
- * Unfortunately, this puts the user in a fix:
- *
- *  If the source VM connects with an IPv4 address without knowing that the
- *  destination has bound to '[::]' the migration will unconditionally fail
- *  unless the management software is explicitly listening on the IPv4
- *  address while using a RoCE-based device.
- *
- *  If the source VM connects with an IPv6 address, then we're OK because we can
- *  throw an error on the source (and similarly on the destination).
- *
- *  But in mixed environments, this will be broken for a while until it is fixed
- *  inside linux.
- *
- * We do provide a *tiny* bit of help in this function: We can list all of the
- * devices in the system and check to see if all the devices are RoCE or
- * Infiniband.
- *
- * If we detect that we have a *pure* RoCE environment, then we can safely
- * throw an error even if the management software has specified '[::]' as the
- * bind address.
- *
- * However, if there are multiple heterogeneous devices, then we cannot make
- * this assumption and the user just has to be sure they know what they are
- * doing.
- *
- * Patches are being reviewed on linux-rdma.
- */
-static int qemu_rdma_broken_ipv6_kernel(struct ibv_context *verbs, Error **errp)
-{
-    /* This bug only exists in linux, to our knowledge. */
-#ifdef CONFIG_LINUX
-    struct ibv_port_attr port_attr;
-
-    /*
-     * Verbs are only NULL if management has bound to '[::]'.
-     *
-     * Let's iterate through all the devices and see if there are any pure IB
-     * devices (non-ethernet).
-     *
-     * If not, then we can safely proceed with the migration.
-     * Otherwise, there are no guarantees until the bug is fixed in linux.
-     */
-    if (!verbs) {
-        int num_devices;
-        struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
-        bool roce_found = false;
-        bool ib_found = false;
-
-        for (int x = 0; x < num_devices; x++) {
-            verbs = ibv_open_device(dev_list[x]);
-            /*
-             * ibv_open_device() is not documented to set errno.  If
-             * it does, it's somebody else's doc bug.  If it doesn't,
-             * the use of errno below is wrong.
-             * TODO Find out whether ibv_open_device() sets errno.
-             */
-            if (!verbs) {
-                if (errno == EPERM) {
-                    continue;
-                } else {
-                    error_setg_errno(errp, errno,
-                                     "could not open RDMA device context");
-                    return -1;
-                }
-            }
-
-            if (ibv_query_port(verbs, 1, &port_attr)) {
-                ibv_close_device(verbs);
-                error_setg(errp,
-                           "RDMA ERROR: Could not query initial IB port");
-                return -1;
-            }
-
-            if (port_attr.link_layer == IBV_LINK_LAYER_INFINIBAND) {
-                ib_found = true;
-            } else if (port_attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
-                roce_found = true;
-            }
-
-            ibv_close_device(verbs);
-
-        }
-
-        if (roce_found) {
-            if (ib_found) {
-                warn_report("migrations may fail:"
-                            " IPv6 over RoCE / iWARP in linux"
-                            " is broken. But since you appear to have a"
-                            " mixed RoCE / IB environment, be sure to only"
-                            " migrate over the IB fabric until the kernel "
-                            " fixes the bug.");
-            } else {
-                error_setg(errp, "RDMA ERROR: "
-                           "You only have RoCE / iWARP devices in your systems"
-                           " and your management software has specified '[::]'"
-                           ", but IPv6 over RoCE / iWARP is not supported in Linux.");
-                return -1;
-            }
-        }
-
-        return 0;
-    }
-
-    /*
-     * If we have a verbs context, that means that something other than '[::]'
-     * was used by the management software for binding, in which case we can
-     * actually warn the user about a potentially broken kernel.
-     */
-
-    /* IB ports start with 1, not 0 */
-    if (ibv_query_port(verbs, 1, &port_attr)) {
-        error_setg(errp, "RDMA ERROR: Could not query initial IB port");
-        return -1;
-    }
-
-    if (port_attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
-        error_setg(errp, "RDMA ERROR: "
-                   "Linux kernel's RoCE / iWARP does not support IPv6 "
-                   "(but patches on linux-rdma in progress)");
-        return -1;
-    }
-
-#endif
-
-    return 0;
-}
-
-/*
- * Figure out which RDMA device corresponds to the requested IP hostname.
- * Also create the initial connection manager identifiers for opening
- * the connection.
- */
-static int qemu_rdma_resolve_host(RDMAContext *rdma, Error **errp)
-{
-    Error *err = NULL;
-    int ret;
-    struct rdma_addrinfo *res;
-    char port_str[16];
-    struct rdma_cm_event *cm_event;
-    char ip[40] = "unknown";
-
-    if (rdma->host == NULL || !strcmp(rdma->host, "")) {
-        error_setg(errp, "RDMA ERROR: RDMA hostname has not been set");
-        return -1;
-    }
-
-    /* create CM channel */
-    rdma->channel = rdma_create_event_channel();
-    if (!rdma->channel) {
-        error_setg(errp, "RDMA ERROR: could not create CM channel");
-        return -1;
-    }
-
-    /* create CM id */
-    ret = rdma_create_id(rdma->channel, &rdma->cm_id, NULL, RDMA_PS_TCP);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not create channel id");
-        goto err_resolve_create_id;
-    }
-
-    snprintf(port_str, 16, "%d", rdma->port);
-    port_str[15] = '\0';
-
-    ret = rdma_getaddrinfo(rdma->host, port_str, NULL, &res);
-    if (ret) {
-        error_setg(errp, "RDMA ERROR: could not rdma_getaddrinfo address %s",
-                   rdma->host);
-        goto err_resolve_get_addr;
-    }
-
-    /* Try all addresses, saving the first error in @err */
-    for (struct rdma_addrinfo *e = res; e != NULL; e = e->ai_next) {
-        Error **local_errp = err ? NULL : &err;
-
-        inet_ntop(e->ai_family,
-            &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
-        trace_qemu_rdma_resolve_host_trying(rdma->host, ip);
-
-        ret = rdma_resolve_addr(rdma->cm_id, NULL, e->ai_dst_addr,
-                RDMA_RESOLVE_TIMEOUT_MS);
-        if (ret >= 0) {
-            if (e->ai_family == AF_INET6) {
-                ret = qemu_rdma_broken_ipv6_kernel(rdma->cm_id->verbs,
-                                                   local_errp);
-                if (ret < 0) {
-                    continue;
-                }
-            }
-            error_free(err);
-            goto route;
-        }
-    }
-
-    rdma_freeaddrinfo(res);
-    if (err) {
-        error_propagate(errp, err);
-    } else {
-        error_setg(errp, "RDMA ERROR: could not resolve address %s",
-                   rdma->host);
-    }
-    goto err_resolve_get_addr;
-
-route:
-    rdma_freeaddrinfo(res);
-    qemu_rdma_dump_gid("source_resolve_addr", rdma->cm_id);
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not perform event_addr_resolved");
-        goto err_resolve_get_addr;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
-        error_setg(errp,
-                   "RDMA ERROR: result not equal to event_addr_resolved %s",
-                   rdma_event_str(cm_event->event));
-        rdma_ack_cm_event(cm_event);
-        goto err_resolve_get_addr;
-    }
-    rdma_ack_cm_event(cm_event);
-
-    /* resolve route */
-    ret = rdma_resolve_route(rdma->cm_id, RDMA_RESOLVE_TIMEOUT_MS);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not resolve rdma route");
-        goto err_resolve_get_addr;
-    }
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not perform event_route_resolved");
-        goto err_resolve_get_addr;
-    }
-    if (cm_event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) {
-        error_setg(errp, "RDMA ERROR: "
-                   "result not equal to event_route_resolved: %s",
-                   rdma_event_str(cm_event->event));
-        rdma_ack_cm_event(cm_event);
-        goto err_resolve_get_addr;
-    }
-    rdma_ack_cm_event(cm_event);
-    rdma->verbs = rdma->cm_id->verbs;
-    qemu_rdma_dump_id("source_resolve_host", rdma->cm_id->verbs);
-    qemu_rdma_dump_gid("source_resolve_host", rdma->cm_id);
-    return 0;
-
-err_resolve_get_addr:
-    rdma_destroy_id(rdma->cm_id);
-    rdma->cm_id = NULL;
-err_resolve_create_id:
-    rdma_destroy_event_channel(rdma->channel);
-    rdma->channel = NULL;
-    return -1;
-}
-
-/*
- * Create protection domain and completion queues
- */
-static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma, Error **errp)
-{
-    /* allocate pd */
-    rdma->pd = ibv_alloc_pd(rdma->verbs);
-    if (!rdma->pd) {
-        error_setg(errp, "failed to allocate protection domain");
-        return -1;
-    }
-
-    /* create receive completion channel */
-    rdma->recv_comp_channel = ibv_create_comp_channel(rdma->verbs);
-    if (!rdma->recv_comp_channel) {
-        error_setg(errp, "failed to allocate receive completion channel");
-        goto err_alloc_pd_cq;
-    }
-
-    /*
-     * Completion queue can be filled by read work requests.
-     */
-    rdma->recv_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
-                                  NULL, rdma->recv_comp_channel, 0);
-    if (!rdma->recv_cq) {
-        error_setg(errp, "failed to allocate receive completion queue");
-        goto err_alloc_pd_cq;
-    }
-
-    /* create send completion channel */
-    rdma->send_comp_channel = ibv_create_comp_channel(rdma->verbs);
-    if (!rdma->send_comp_channel) {
-        error_setg(errp, "failed to allocate send completion channel");
-        goto err_alloc_pd_cq;
-    }
-
-    rdma->send_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
-                                  NULL, rdma->send_comp_channel, 0);
-    if (!rdma->send_cq) {
-        error_setg(errp, "failed to allocate send completion queue");
-        goto err_alloc_pd_cq;
-    }
-
-    return 0;
-
-err_alloc_pd_cq:
-    if (rdma->pd) {
-        ibv_dealloc_pd(rdma->pd);
-    }
-    if (rdma->recv_comp_channel) {
-        ibv_destroy_comp_channel(rdma->recv_comp_channel);
-    }
-    if (rdma->send_comp_channel) {
-        ibv_destroy_comp_channel(rdma->send_comp_channel);
-    }
-    if (rdma->recv_cq) {
-        ibv_destroy_cq(rdma->recv_cq);
-        rdma->recv_cq = NULL;
-    }
-    rdma->pd = NULL;
-    rdma->recv_comp_channel = NULL;
-    rdma->send_comp_channel = NULL;
-    return -1;
-
-}
-
-/*
- * Create queue pairs.
- */
-static int qemu_rdma_alloc_qp(RDMAContext *rdma)
-{
-    struct ibv_qp_init_attr attr = { 0 };
-
-    attr.cap.max_send_wr = RDMA_SIGNALED_SEND_MAX;
-    attr.cap.max_recv_wr = 3;
-    attr.cap.max_send_sge = 1;
-    attr.cap.max_recv_sge = 1;
-    attr.send_cq = rdma->send_cq;
-    attr.recv_cq = rdma->recv_cq;
-    attr.qp_type = IBV_QPT_RC;
-
-    if (rdma_create_qp(rdma->cm_id, rdma->pd, &attr) < 0) {
-        return -1;
-    }
-
-    rdma->qp = rdma->cm_id->qp;
-    return 0;
-}
-
-/* Check whether On-Demand Paging is supported by the RDMA device */
-static bool rdma_support_odp(struct ibv_context *dev)
-{
-    struct ibv_device_attr_ex attr = {0};
-
-    if (ibv_query_device_ex(dev, NULL, &attr)) {
-        return false;
-    }
-
-    if (attr.odp_caps.general_caps & IBV_ODP_SUPPORT) {
-        return true;
-    }
-
-    return false;
-}
-
-/*
- * Call ibv_advise_mr() to avoid RNR NAK errors as far as possible.
- * A responder MR registered with ODP will send an RNR NAK back to
- * the requester in the face of a page fault.
- */
-static void qemu_rdma_advise_prefetch_mr(struct ibv_pd *pd, uint64_t addr,
-                                         uint32_t len,  uint32_t lkey,
-                                         const char *name, bool wr)
-{
-#ifdef HAVE_IBV_ADVISE_MR
-    int ret;
-    int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
-                 IBV_ADVISE_MR_ADVICE_PREFETCH;
-    struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len};
-
-    ret = ibv_advise_mr(pd, advice,
-                        IBV_ADVISE_MR_FLAG_FLUSH, &sg_list, 1);
-    /* ignore the error */
-    trace_qemu_rdma_advise_mr(name, len, addr, strerror(ret));
-#endif
-}
-
-static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma, Error **errp)
-{
-    int i;
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-
-    for (i = 0; i < local->nb_blocks; i++) {
-        int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
-
-        local->block[i].mr =
-            ibv_reg_mr(rdma->pd,
-                    local->block[i].local_host_addr,
-                    local->block[i].length, access
-                    );
-        /*
-         * ibv_reg_mr() is not documented to set errno.  If it does,
-         * it's somebody else's doc bug.  If it doesn't, the use of
-         * errno below is wrong.
-         * TODO Find out whether ibv_reg_mr() sets errno.
-         */
-        if (!local->block[i].mr &&
-            errno == ENOTSUP && rdma_support_odp(rdma->verbs)) {
-                access |= IBV_ACCESS_ON_DEMAND;
-                /* register ODP mr */
-                local->block[i].mr =
-                    ibv_reg_mr(rdma->pd,
-                               local->block[i].local_host_addr,
-                               local->block[i].length, access);
-                trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
-
-                if (local->block[i].mr) {
-                    qemu_rdma_advise_prefetch_mr(rdma->pd,
-                                    (uintptr_t)local->block[i].local_host_addr,
-                                    local->block[i].length,
-                                    local->block[i].mr->lkey,
-                                    local->block[i].block_name,
-                                    true);
-                }
-        }
-
-        if (!local->block[i].mr) {
-            error_setg_errno(errp, errno,
-                             "Failed to register local dest ram block!");
-            goto err;
-        }
-        rdma->total_registrations++;
-    }
-
-    return 0;
-
-err:
-    for (i--; i >= 0; i--) {
-        ibv_dereg_mr(local->block[i].mr);
-        local->block[i].mr = NULL;
-        rdma->total_registrations--;
-    }
-
-    return -1;
-
-}
-
-/*
- * Find the ram block that corresponds to the page requested to be
- * transmitted by QEMU.
- *
- * Once the block is found, also identify which 'chunk' within that
- * block that the page belongs to.
- */
-static void qemu_rdma_search_ram_block(RDMAContext *rdma,
-                                       uintptr_t block_offset,
-                                       uint64_t offset,
-                                       uint64_t length,
-                                       uint64_t *block_index,
-                                       uint64_t *chunk_index)
-{
-    uint64_t current_addr = block_offset + offset;
-    RDMALocalBlock *block = g_hash_table_lookup(rdma->blockmap,
-                                                (void *) block_offset);
-    assert(block);
-    assert(current_addr >= block->offset);
-    assert((current_addr + length) <= (block->offset + block->length));
-
-    *block_index = block->index;
-    *chunk_index = ram_chunk_index(block->local_host_addr,
-                block->local_host_addr + (current_addr - block->offset));
-}
-
-/*
- * Register a chunk with IB. If the chunk was already registered
- * previously, then skip.
- *
- * Also return the keys associated with the registration needed
- * to perform the actual RDMA operation.
- */
-static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
-        RDMALocalBlock *block, uintptr_t host_addr,
-        uint32_t *lkey, uint32_t *rkey, int chunk,
-        uint8_t *chunk_start, uint8_t *chunk_end)
-{
-    if (block->mr) {
-        if (lkey) {
-            *lkey = block->mr->lkey;
-        }
-        if (rkey) {
-            *rkey = block->mr->rkey;
-        }
-        return 0;
-    }
-
-    /* allocate memory to store chunk MRs */
-    if (!block->pmr) {
-        block->pmr = g_new0(struct ibv_mr *, block->nb_chunks);
-    }
-
-    /*
-     * If 'rkey', then we're the destination, so grant access to the source.
-     *
-     * If 'lkey', then we're the source VM, so grant access only to ourselves.
-     */
-    if (!block->pmr[chunk]) {
-        uint64_t len = chunk_end - chunk_start;
-        int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE :
-                     0;
-
-        trace_qemu_rdma_register_and_get_keys(len, chunk_start);
-
-        block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
-        /*
-         * ibv_reg_mr() is not documented to set errno.  If it does,
-         * it's somebody else's doc bug.  If it doesn't, the use of
-         * errno below is wrong.
-         * TODO Find out whether ibv_reg_mr() sets errno.
-         */
-        if (!block->pmr[chunk] &&
-            errno == ENOTSUP && rdma_support_odp(rdma->verbs)) {
-            access |= IBV_ACCESS_ON_DEMAND;
-            /* register ODP mr */
-            block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
-            trace_qemu_rdma_register_odp_mr(block->block_name);
-
-            if (block->pmr[chunk]) {
-                qemu_rdma_advise_prefetch_mr(rdma->pd, (uintptr_t)chunk_start,
-                                            len, block->pmr[chunk]->lkey,
-                                            block->block_name, rkey);
-
-            }
-        }
-    }
-    if (!block->pmr[chunk]) {
-        return -1;
-    }
-    rdma->total_registrations++;
-
-    if (lkey) {
-        *lkey = block->pmr[chunk]->lkey;
-    }
-    if (rkey) {
-        *rkey = block->pmr[chunk]->rkey;
-    }
-    return 0;
-}
-
-/*
- * Register (at connection time) the memory used for control
- * channel messages.
- */
-static int qemu_rdma_reg_control(RDMAContext *rdma, int idx)
-{
-    rdma->wr_data[idx].control_mr = ibv_reg_mr(rdma->pd,
-            rdma->wr_data[idx].control, RDMA_CONTROL_MAX_BUFFER,
-            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
-    if (rdma->wr_data[idx].control_mr) {
-        rdma->total_registrations++;
-        return 0;
-    }
-    return -1;
-}
-
-/*
- * Perform a non-optimized memory unregistration after every transfer
- * for demonstration purposes, only if pin-all is not requested.
- *
- * Potential optimizations:
- * 1. Start a new thread to run this function continuously
- *      - for bit clearing
- *      - and for receipt of unregister messages
- * 2. Use an LRU.
- * 3. Use workload hints.
- */
-static int qemu_rdma_unregister_waiting(RDMAContext *rdma)
-{
-    Error *err = NULL;
-
-    while (rdma->unregistrations[rdma->unregister_current]) {
-        int ret;
-        uint64_t wr_id = rdma->unregistrations[rdma->unregister_current];
-        uint64_t chunk =
-            (wr_id & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
-        uint64_t index =
-            (wr_id & RDMA_WRID_BLOCK_MASK) >> RDMA_WRID_BLOCK_SHIFT;
-        RDMALocalBlock *block =
-            &(rdma->local_ram_blocks.block[index]);
-        RDMARegister reg = { .current_index = index };
-        RDMAControlHeader resp = { .type = RDMA_CONTROL_UNREGISTER_FINISHED,
-                                 };
-        RDMAControlHeader head = { .len = sizeof(RDMARegister),
-                                   .type = RDMA_CONTROL_UNREGISTER_REQUEST,
-                                   .repeat = 1,
-                                 };
-
-        trace_qemu_rdma_unregister_waiting_proc(chunk,
-                                                rdma->unregister_current);
-
-        rdma->unregistrations[rdma->unregister_current] = 0;
-        rdma->unregister_current++;
-
-        if (rdma->unregister_current == RDMA_SIGNALED_SEND_MAX) {
-            rdma->unregister_current = 0;
-        }
-
-
-        /*
-         * Unregistration is speculative (because migration is single-threaded
-         * and we cannot break the protocol's InfiniBand message ordering).
-         * Thus, if the memory is currently being used for transmission,
-         * then abort the attempt to unregister and try again
-         * later the next time a completion is received for this memory.
-         */
-        clear_bit(chunk, block->unregister_bitmap);
-
-        if (test_bit(chunk, block->transit_bitmap)) {
-            trace_qemu_rdma_unregister_waiting_inflight(chunk);
-            continue;
-        }
-
-        trace_qemu_rdma_unregister_waiting_send(chunk);
-
-        ret = ibv_dereg_mr(block->pmr[chunk]);
-        block->pmr[chunk] = NULL;
-        block->remote_keys[chunk] = 0;
-
-        if (ret != 0) {
-            error_report("unregistration chunk failed: %s",
-                         strerror(ret));
-            return -1;
-        }
-        rdma->total_registrations--;
-
-        reg.key.chunk = chunk;
-        register_to_network(rdma, &reg);
-        ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
-                                      &resp, NULL, NULL, &err);
-        if (ret < 0) {
-            error_report_err(err);
-            return -1;
-        }
-
-        trace_qemu_rdma_unregister_waiting_complete(chunk);
-    }
-
-    return 0;
-}
-
-static uint64_t qemu_rdma_make_wrid(uint64_t wr_id, uint64_t index,
-                                         uint64_t chunk)
-{
-    uint64_t result = wr_id & RDMA_WRID_TYPE_MASK;
-
-    result |= (index << RDMA_WRID_BLOCK_SHIFT);
-    result |= (chunk << RDMA_WRID_CHUNK_SHIFT);
-
-    return result;
-}
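/*
 * A minimal sketch of the wr_id packing that qemu_rdma_make_wrid() performs.
 * The shift and mask values below are assumptions chosen for the example
 * only; the real RDMA_WRID_* constants are defined earlier in rdma.c.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define EX_WRID_TYPE_MASK   0xffffULL    /* assumed: low 16 bits = request type */
#define EX_WRID_BLOCK_SHIFT 16           /* assumed: next 16 bits = block index */
#define EX_WRID_CHUNK_SHIFT 32           /* assumed: high 32 bits = chunk index */

static uint64_t ex_make_wrid(uint64_t type, uint64_t block, uint64_t chunk)
{
    uint64_t wrid = type & EX_WRID_TYPE_MASK;

    wrid |= block << EX_WRID_BLOCK_SHIFT;
    wrid |= chunk << EX_WRID_CHUNK_SHIFT;
    return wrid;
}

int main(void)
{
    uint64_t wrid = ex_make_wrid(1, 3, 42);  /* e.g. a write on block 3, chunk 42 */

    /* Completion handling reverses the encoding to locate the bitmap bit. */
    printf("type=%" PRIu64 " block=%" PRIu64 " chunk=%" PRIu64 "\n",
           wrid & EX_WRID_TYPE_MASK,
           (wrid >> EX_WRID_BLOCK_SHIFT) & 0xffffULL,
           wrid >> EX_WRID_CHUNK_SHIFT);
    return 0;
}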
-
-/*
- * Consult the completion queue to see whether a work request
- * (of any kind) has completed.
- * Return the work request ID that completed.
- */
-static int qemu_rdma_poll(RDMAContext *rdma, struct ibv_cq *cq,
-                          uint64_t *wr_id_out, uint32_t *byte_len)
-{
-    int ret;
-    struct ibv_wc wc;
-    uint64_t wr_id;
-
-    ret = ibv_poll_cq(cq, 1, &wc);
-
-    if (!ret) {
-        *wr_id_out = RDMA_WRID_NONE;
-        return 0;
-    }
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    wr_id = wc.wr_id & RDMA_WRID_TYPE_MASK;
-
-    if (wc.status != IBV_WC_SUCCESS) {
-        return -1;
-    }
-
-    if (rdma->control_ready_expected &&
-        (wr_id >= RDMA_WRID_RECV_CONTROL)) {
-        trace_qemu_rdma_poll_recv(wr_id - RDMA_WRID_RECV_CONTROL, wr_id,
-                                  rdma->nb_sent);
-        rdma->control_ready_expected = 0;
-    }
-
-    if (wr_id == RDMA_WRID_RDMA_WRITE) {
-        uint64_t chunk =
-            (wc.wr_id & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
-        uint64_t index =
-            (wc.wr_id & RDMA_WRID_BLOCK_MASK) >> RDMA_WRID_BLOCK_SHIFT;
-        RDMALocalBlock *block = &(rdma->local_ram_blocks.block[index]);
-
-        trace_qemu_rdma_poll_write(wr_id, rdma->nb_sent,
-                                   index, chunk, block->local_host_addr,
-                                   (void *)(uintptr_t)block->remote_host_addr);
-
-        clear_bit(chunk, block->transit_bitmap);
-
-        if (rdma->nb_sent > 0) {
-            rdma->nb_sent--;
-        }
-    } else {
-        trace_qemu_rdma_poll_other(wr_id, rdma->nb_sent);
-    }
-
-    *wr_id_out = wc.wr_id;
-    if (byte_len) {
-        *byte_len = wc.byte_len;
-    }
-
-    return  0;
-}
-
-/* Wait for activity on the completion channel.
- * Returns 0 on success, non-zero on error.
- */
-static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
-                                       struct ibv_comp_channel *comp_channel)
-{
-    struct rdma_cm_event *cm_event;
-
-    /*
-     * Coroutine doesn't start until migration_fd_process_incoming()
-     * so don't yield unless we know we're running inside of a coroutine.
-     */
-    if (rdma->migration_started_on_destination &&
-        migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE) {
-        yield_until_fd_readable(comp_channel->fd);
-    } else {
-        /* This is the source side (we're in a separate thread), or the
-         * destination prior to migration_fd_process_incoming(); after
-         * postcopy, the destination is also in a separate thread.
-         * We can't yield, so we have to poll the fd.
-         * But we need to be able to handle 'cancel' or an error
-         * without hanging forever.
-         */
-        while (!rdma->errored && !rdma->received_error) {
-            GPollFD pfds[2];
-            pfds[0].fd = comp_channel->fd;
-            pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
-            pfds[0].revents = 0;
-
-            pfds[1].fd = rdma->channel->fd;
-            pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
-            pfds[1].revents = 0;
-
-            /* 0.1s timeout, should be fine for a 'cancel' */
-            switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
-            case 2:
-            case 1: /* fd active */
-                if (pfds[0].revents) {
-                    return 0;
-                }
-
-                if (pfds[1].revents) {
-                    if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
-                        return -1;
-                    }
-
-                    if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
-                        cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
-                        rdma_ack_cm_event(cm_event);
-                        return -1;
-                    }
-                    rdma_ack_cm_event(cm_event);
-                }
-                break;
-
-            case 0: /* Timeout, go around again */
-                break;
-
-            default: /* Error of some type -
-                      * I don't trust errno from qemu_poll_ns
-                      */
-                return -1;
-            }
-
-            if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
-                /* Bail out and let the cancellation happen */
-                return -1;
-            }
-        }
-    }
-
-    if (rdma->received_error) {
-        return -1;
-    }
-    return -rdma->errored;
-}
-
-static struct ibv_comp_channel *to_channel(RDMAContext *rdma, uint64_t wrid)
-{
-    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_comp_channel :
-           rdma->recv_comp_channel;
-}
-
-static struct ibv_cq *to_cq(RDMAContext *rdma, uint64_t wrid)
-{
-    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_cq : rdma->recv_cq;
-}
-
-/*
- * Block until the next work request has completed.
- *
- * First poll to see if a work request has already completed,
- * otherwise block.
- *
- * If we encounter completed work requests for IDs other than
- * the one we're interested in, then that's generally an error.
- *
- * The only exception is actual RDMA Write completions. These
- * completions only need to be recorded, but do not actually
- * need further processing.
- */
-static int qemu_rdma_block_for_wrid(RDMAContext *rdma,
-                                    uint64_t wrid_requested,
-                                    uint32_t *byte_len)
-{
-    int num_cq_events = 0, ret;
-    struct ibv_cq *cq;
-    void *cq_ctx;
-    uint64_t wr_id = RDMA_WRID_NONE, wr_id_in;
-    struct ibv_comp_channel *ch = to_channel(rdma, wrid_requested);
-    struct ibv_cq *poll_cq = to_cq(rdma, wrid_requested);
-
-    if (ibv_req_notify_cq(poll_cq, 0)) {
-        return -1;
-    }
-    /* poll cq first */
-    while (wr_id != wrid_requested) {
-        ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
-        if (ret < 0) {
-            return -1;
-        }
-
-        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-        if (wr_id == RDMA_WRID_NONE) {
-            break;
-        }
-        if (wr_id != wrid_requested) {
-            trace_qemu_rdma_block_for_wrid_miss(wrid_requested, wr_id);
-        }
-    }
-
-    if (wr_id == wrid_requested) {
-        return 0;
-    }
-
-    while (1) {
-        ret = qemu_rdma_wait_comp_channel(rdma, ch);
-        if (ret < 0) {
-            goto err_block_for_wrid;
-        }
-
-        ret = ibv_get_cq_event(ch, &cq, &cq_ctx);
-        if (ret < 0) {
-            goto err_block_for_wrid;
-        }
-
-        num_cq_events++;
-
-        if (ibv_req_notify_cq(cq, 0)) {
-            goto err_block_for_wrid;
-        }
-
-        while (wr_id != wrid_requested) {
-            ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
-            if (ret < 0) {
-                goto err_block_for_wrid;
-            }
-
-            wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-            if (wr_id == RDMA_WRID_NONE) {
-                break;
-            }
-            if (wr_id != wrid_requested) {
-                trace_qemu_rdma_block_for_wrid_miss(wrid_requested, wr_id);
-            }
-        }
-
-        if (wr_id == wrid_requested) {
-            goto success_block_for_wrid;
-        }
-    }
-
-success_block_for_wrid:
-    if (num_cq_events) {
-        ibv_ack_cq_events(cq, num_cq_events);
-    }
-    return 0;
-
-err_block_for_wrid:
-    if (num_cq_events) {
-        ibv_ack_cq_events(cq, num_cq_events);
-    }
-
-    rdma->errored = true;
-    return -1;
-}
-
-/*
- * Post a SEND message work request for the control channel
- * containing some data and block until the post completes.
- */
-static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
-                                       RDMAControlHeader *head,
-                                       Error **errp)
-{
-    int ret;
-    RDMAWorkRequestData *wr = &rdma->wr_data[RDMA_WRID_CONTROL];
-    struct ibv_send_wr *bad_wr;
-    struct ibv_sge sge = {
-                           .addr = (uintptr_t)(wr->control),
-                           .length = head->len + sizeof(RDMAControlHeader),
-                           .lkey = wr->control_mr->lkey,
-                         };
-    struct ibv_send_wr send_wr = {
-                                   .wr_id = RDMA_WRID_SEND_CONTROL,
-                                   .opcode = IBV_WR_SEND,
-                                   .send_flags = IBV_SEND_SIGNALED,
-                                   .sg_list = &sge,
-                                   .num_sge = 1,
-                                };
-
-    trace_qemu_rdma_post_send_control(control_desc(head->type));
-
-    /*
-     * We don't actually need to do a memcpy() in here if we used
-     * the "sge" properly, but since we're only sending control messages
-     * (not RAM in a performance-critical path), then it's OK for now.
-     *
-     * The copy makes the RDMAControlHeader simpler to manipulate
-     * for the time being.
-     */
-    assert(head->len <= RDMA_CONTROL_MAX_BUFFER - sizeof(*head));
-    memcpy(wr->control, head, sizeof(RDMAControlHeader));
-    control_to_network((void *) wr->control);
-
-    if (buf) {
-        memcpy(wr->control + sizeof(RDMAControlHeader), buf, head->len);
-    }
-
-
-    ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
-
-    if (ret > 0) {
-        error_setg(errp, "Failed to use post IB SEND for control");
-        return -1;
-    }
-
-    ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
-    if (ret < 0) {
-        error_setg(errp, "rdma migration: send polling control error");
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * Post a RECV work request in anticipation of some future receipt
- * of data on the control channel.
- */
-static int qemu_rdma_post_recv_control(RDMAContext *rdma, int idx,
-                                       Error **errp)
-{
-    struct ibv_recv_wr *bad_wr;
-    struct ibv_sge sge = {
-                            .addr = (uintptr_t)(rdma->wr_data[idx].control),
-                            .length = RDMA_CONTROL_MAX_BUFFER,
-                            .lkey = rdma->wr_data[idx].control_mr->lkey,
-                         };
-
-    struct ibv_recv_wr recv_wr = {
-                                    .wr_id = RDMA_WRID_RECV_CONTROL + idx,
-                                    .sg_list = &sge,
-                                    .num_sge = 1,
-                                 };
-
-
-    if (ibv_post_recv(rdma->qp, &recv_wr, &bad_wr)) {
-        error_setg(errp, "error posting control recv");
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * Block and wait for a RECV control channel message to arrive.
- */
-static int qemu_rdma_exchange_get_response(RDMAContext *rdma,
-                RDMAControlHeader *head, uint32_t expecting, int idx,
-                Error **errp)
-{
-    uint32_t byte_len;
-    int ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RECV_CONTROL + idx,
-                                       &byte_len);
-
-    if (ret < 0) {
-        error_setg(errp, "rdma migration: recv polling control error!");
-        return -1;
-    }
-
-    network_to_control((void *) rdma->wr_data[idx].control);
-    memcpy(head, rdma->wr_data[idx].control, sizeof(RDMAControlHeader));
-
-    trace_qemu_rdma_exchange_get_response_start(control_desc(expecting));
-
-    if (expecting == RDMA_CONTROL_NONE) {
-        trace_qemu_rdma_exchange_get_response_none(control_desc(head->type),
-                                             head->type);
-    } else if (head->type != expecting || head->type == RDMA_CONTROL_ERROR) {
-        error_setg(errp, "Was expecting a %s (%d) control message"
-                ", but got: %s (%d), length: %d",
-                control_desc(expecting), expecting,
-                control_desc(head->type), head->type, head->len);
-        if (head->type == RDMA_CONTROL_ERROR) {
-            rdma->received_error = true;
-        }
-        return -1;
-    }
-    if (head->len > RDMA_CONTROL_MAX_BUFFER - sizeof(*head)) {
-        error_setg(errp, "too long length: %d", head->len);
-        return -1;
-    }
-    if (sizeof(*head) + head->len != byte_len) {
-        error_setg(errp, "Malformed length: %d byte_len %d",
-                   head->len, byte_len);
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * When a RECV work request has completed, the work request's
- * buffer points at the header.
- *
- * This advances the pointer to the data portion
- * of the control message in the work request's buffer, which
- * was populated after the work request finished.
- */
-static void qemu_rdma_move_header(RDMAContext *rdma, int idx,
-                                  RDMAControlHeader *head)
-{
-    rdma->wr_data[idx].control_len = head->len;
-    rdma->wr_data[idx].control_curr =
-        rdma->wr_data[idx].control + sizeof(RDMAControlHeader);
-}
-
-/*
- * This is an 'atomic' high-level operation to deliver a single, unified
- * control-channel message.
- *
- * Additionally, if the user is expecting some kind of reply to this message,
- * they can request a 'resp' response message be filled in by posting an
- * additional work request on behalf of the user and waiting for an additional
- * completion.
- *
- * The extra (optional) response is used during registration to save us from
- * having to perform an *additional* exchange of messages just to provide a
- * response, by instead piggy-backing on the acknowledgement.
- */
-static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
-                                   uint8_t *data, RDMAControlHeader *resp,
-                                   int *resp_idx,
-                                   int (*callback)(RDMAContext *rdma,
-                                                   Error **errp),
-                                   Error **errp)
-{
-    int ret;
-
-    /*
-     * Wait until the dest is ready before attempting to deliver the message
-     * by waiting for a READY message.
-     */
-    if (rdma->control_ready_expected) {
-        RDMAControlHeader resp_ignored;
-
-        ret = qemu_rdma_exchange_get_response(rdma, &resp_ignored,
-                                              RDMA_CONTROL_READY,
-                                              RDMA_WRID_READY, errp);
-        if (ret < 0) {
-            return -1;
-        }
-    }
-
-    /*
-     * If the user is expecting a response, post a WR in anticipation of it.
-     */
-    if (resp) {
-        ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_DATA, errp);
-        if (ret < 0) {
-            return -1;
-        }
-    }
-
-    /*
-     * Post a WR to replace the one we just consumed for the READY message.
-     */
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, errp);
-    if (ret < 0) {
-        return -1;
-    }
-
-    /*
-     * Deliver the control message that was requested.
-     */
-    ret = qemu_rdma_post_send_control(rdma, data, head, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    /*
-     * If we're expecting a response, block and wait for it.
-     */
-    if (resp) {
-        if (callback) {
-            trace_qemu_rdma_exchange_send_issue_callback();
-            ret = callback(rdma, errp);
-            if (ret < 0) {
-                return -1;
-            }
-        }
-
-        trace_qemu_rdma_exchange_send_waiting(control_desc(resp->type));
-        ret = qemu_rdma_exchange_get_response(rdma, resp,
-                                              resp->type, RDMA_WRID_DATA,
-                                              errp);
-
-        if (ret < 0) {
-            return -1;
-        }
-
-        qemu_rdma_move_header(rdma, RDMA_WRID_DATA, resp);
-        if (resp_idx) {
-            *resp_idx = RDMA_WRID_DATA;
-        }
-        trace_qemu_rdma_exchange_send_received(control_desc(resp->type));
-    }
-
-    rdma->control_ready_expected = 1;
-
-    return 0;
-}
-
-/*
- * This is an 'atomic' high-level operation to receive a single, unified
- * control-channel message.
- */
-static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader *head,
-                                   uint32_t expecting, Error **errp)
-{
-    RDMAControlHeader ready = {
-                                .len = 0,
-                                .type = RDMA_CONTROL_READY,
-                                .repeat = 1,
-                              };
-    int ret;
-
-    /*
-     * Inform the source that we're ready to receive a message.
-     */
-    ret = qemu_rdma_post_send_control(rdma, NULL, &ready, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    /*
-     * Block and wait for the message.
-     */
-    ret = qemu_rdma_exchange_get_response(rdma, head,
-                                          expecting, RDMA_WRID_READY, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    qemu_rdma_move_header(rdma, RDMA_WRID_READY, head);
-
-    /*
-     * Post a new RECV work request to replace the one we just consumed.
-     */
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, errp);
-    if (ret < 0) {
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * Write an actual chunk of memory using RDMA.
- *
- * If we're using dynamic registration on the dest-side, we have to
- * send a registration command first.
- */
-static int qemu_rdma_write_one(RDMAContext *rdma,
-                               int current_index, uint64_t current_addr,
-                               uint64_t length, Error **errp)
-{
-    struct ibv_sge sge;
-    struct ibv_send_wr send_wr = { 0 };
-    struct ibv_send_wr *bad_wr;
-    int reg_result_idx, ret, count = 0;
-    uint64_t chunk, chunks;
-    uint8_t *chunk_start, *chunk_end;
-    RDMALocalBlock *block = &(rdma->local_ram_blocks.block[current_index]);
-    RDMARegister reg;
-    RDMARegisterResult *reg_result;
-    RDMAControlHeader resp = { .type = RDMA_CONTROL_REGISTER_RESULT };
-    RDMAControlHeader head = { .len = sizeof(RDMARegister),
-                               .type = RDMA_CONTROL_REGISTER_REQUEST,
-                               .repeat = 1,
-                             };
-
-retry:
-    sge.addr = (uintptr_t)(block->local_host_addr +
-                            (current_addr - block->offset));
-    sge.length = length;
-
-    chunk = ram_chunk_index(block->local_host_addr,
-                            (uint8_t *)(uintptr_t)sge.addr);
-    chunk_start = ram_chunk_start(block, chunk);
-
-    if (block->is_ram_block) {
-        chunks = length / (1UL << RDMA_REG_CHUNK_SHIFT);
-
-        if (chunks && ((length % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
-            chunks--;
-        }
-    } else {
-        chunks = block->length / (1UL << RDMA_REG_CHUNK_SHIFT);
-
-        if (chunks && ((block->length % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
-            chunks--;
-        }
-    }
-
-    trace_qemu_rdma_write_one_top(chunks + 1,
-                                  (chunks + 1) *
-                                  (1UL << RDMA_REG_CHUNK_SHIFT) / 1024 / 1024);
-
-    chunk_end = ram_chunk_end(block, chunk + chunks);
-
-
-    while (test_bit(chunk, block->transit_bitmap)) {
-        (void)count;
-        trace_qemu_rdma_write_one_block(count++, current_index, chunk,
-                sge.addr, length, rdma->nb_sent, block->nb_chunks);
-
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
-
-        if (ret < 0) {
-            error_setg(errp, "Failed to Wait for previous write to complete "
-                    "block %d chunk %" PRIu64
-                    " current %" PRIu64 " len %" PRIu64 " %d",
-                    current_index, chunk, sge.addr, length, rdma->nb_sent);
-            return -1;
-        }
-    }
-
-    if (!rdma->pin_all || !block->is_ram_block) {
-        if (!block->remote_keys[chunk]) {
-            /*
-             * This chunk has not yet been registered, so first check to see
-             * if the entire chunk is zero. If so, tell the other side to
-             * memset() + madvise() the entire chunk without RDMA.
-             */
-
-            if (buffer_is_zero((void *)(uintptr_t)sge.addr, length)) {
-                RDMACompress comp = {
-                                        .offset = current_addr,
-                                        .value = 0,
-                                        .block_idx = current_index,
-                                        .length = length,
-                                    };
-
-                head.len = sizeof(comp);
-                head.type = RDMA_CONTROL_COMPRESS;
-
-                trace_qemu_rdma_write_one_zero(chunk, sge.length,
-                                               current_index, current_addr);
-
-                compress_to_network(rdma, &comp);
-                ret = qemu_rdma_exchange_send(rdma, &head,
-                                (uint8_t *) &comp, NULL, NULL, NULL, errp);
-
-                if (ret < 0) {
-                    return -1;
-                }
-
-                /*
-                 * TODO: Here we are sending something, but we are not
-                 * accounting for anything transferred.  The following is wrong:
-                 *
-                 * stat64_add(&mig_stats.rdma_bytes, sge.length);
-                 *
-                 * because we are using some kind of compression.  I
-                 * would think that head.len would be the more similar
-                 * thing to a correct value.
-                 */
-                stat64_add(&mig_stats.zero_pages,
-                           sge.length / qemu_target_page_size());
-                return 1;
-            }
-
-            /*
-             * Otherwise, tell other side to register.
-             */
-            reg.current_index = current_index;
-            if (block->is_ram_block) {
-                reg.key.current_addr = current_addr;
-            } else {
-                reg.key.chunk = chunk;
-            }
-            reg.chunks = chunks;
-
-            trace_qemu_rdma_write_one_sendreg(chunk, sge.length, current_index,
-                                              current_addr);
-
-            register_to_network(rdma, &reg);
-            ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
-                                    &resp, &reg_result_idx, NULL, errp);
-            if (ret < 0) {
-                return -1;
-            }
-
-            /* try to overlap this single registration with the one we sent. */
-            if (qemu_rdma_register_and_get_keys(rdma, block, sge.addr,
-                                                &sge.lkey, NULL, chunk,
-                                                chunk_start, chunk_end)) {
-                error_setg(errp, "cannot get lkey");
-                return -1;
-            }
-
-            reg_result = (RDMARegisterResult *)
-                    rdma->wr_data[reg_result_idx].control_curr;
-
-            network_to_result(reg_result);
-
-            trace_qemu_rdma_write_one_recvregres(block->remote_keys[chunk],
-                                                 reg_result->rkey, chunk);
-
-            block->remote_keys[chunk] = reg_result->rkey;
-            block->remote_host_addr = reg_result->host_addr;
-        } else {
-            /* already registered before */
-            if (qemu_rdma_register_and_get_keys(rdma, block, sge.addr,
-                                                &sge.lkey, NULL, chunk,
-                                                chunk_start, chunk_end)) {
-                error_setg(errp, "cannot get lkey!");
-                return -1;
-            }
-        }
-
-        send_wr.wr.rdma.rkey = block->remote_keys[chunk];
-    } else {
-        send_wr.wr.rdma.rkey = block->remote_rkey;
-
-        if (qemu_rdma_register_and_get_keys(rdma, block, sge.addr,
-                                                     &sge.lkey, NULL, chunk,
-                                                     chunk_start, chunk_end)) {
-            error_setg(errp, "cannot get lkey!");
-            return -1;
-        }
-    }
-
-    /*
-     * Encode the ram block index and chunk within this wrid.
-     * We will use this information at the time of completion
-     * to figure out which bitmap to check against and then which
-     * chunk in the bitmap to look for.
-     */
-    send_wr.wr_id = qemu_rdma_make_wrid(RDMA_WRID_RDMA_WRITE,
-                                        current_index, chunk);
-
-    send_wr.opcode = IBV_WR_RDMA_WRITE;
-    send_wr.send_flags = IBV_SEND_SIGNALED;
-    send_wr.sg_list = &sge;
-    send_wr.num_sge = 1;
-    send_wr.wr.rdma.remote_addr = block->remote_host_addr +
-                                (current_addr - block->offset);
-
-    trace_qemu_rdma_write_one_post(chunk, sge.addr, send_wr.wr.rdma.remote_addr,
-                                   sge.length);
-
-    /*
-     * ibv_post_send() does not return negative error numbers;
-     * per the specification they are positive.  No idea why.
-     */
-    ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
-
-    if (ret == ENOMEM) {
-        trace_qemu_rdma_write_one_queue_full();
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
-        if (ret < 0) {
-            error_setg(errp, "rdma migration: failed to make "
-                         "room in full send queue!");
-            return -1;
-        }
-
-        goto retry;
-
-    } else if (ret > 0) {
-        error_setg_errno(errp, ret,
-                         "rdma migration: post rdma write failed");
-        return -1;
-    }
-
-    set_bit(chunk, block->transit_bitmap);
-    stat64_add(&mig_stats.normal_pages, sge.length / qemu_target_page_size());
-    /*
-     * We are adding to transferred the amount of data written, but no
-     * overhead at all.  I will assume that RDMA is magical and doesn't
-     * need to transfer (at least) the addresses where it wants to
-     * write the pages.  Here it looks like it should be something
-     * like:
-     *     sizeof(send_wr) + sge.length
-     * but this being RDMA, who knows.
-     */
-    stat64_add(&mig_stats.rdma_bytes, sge.length);
-    ram_transferred_add(sge.length);
-    rdma->total_writes++;
-
-    return 0;
-}
-
-/*
- * Push out any unwritten RDMA operations.
- *
- * We support sending out multiple chunks at the same time.
- * Not all of them need to get signaled in the completion queue.
- */
-static int qemu_rdma_write_flush(RDMAContext *rdma, Error **errp)
-{
-    int ret;
-
-    if (!rdma->current_length) {
-        return 0;
-    }
-
-    ret = qemu_rdma_write_one(rdma, rdma->current_index, rdma->current_addr,
-                              rdma->current_length, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    if (ret == 0) {
-        rdma->nb_sent++;
-        trace_qemu_rdma_write_flush(rdma->nb_sent);
-    }
-
-    rdma->current_length = 0;
-    rdma->current_addr = 0;
-
-    return 0;
-}
-
-static inline bool qemu_rdma_buffer_mergeable(RDMAContext *rdma,
-                    uint64_t offset, uint64_t len)
-{
-    RDMALocalBlock *block;
-    uint8_t *host_addr;
-    uint8_t *chunk_end;
-
-    if (rdma->current_index < 0) {
-        return false;
-    }
-
-    if (rdma->current_chunk < 0) {
-        return false;
-    }
-
-    block = &(rdma->local_ram_blocks.block[rdma->current_index]);
-    host_addr = block->local_host_addr + (offset - block->offset);
-    chunk_end = ram_chunk_end(block, rdma->current_chunk);
-
-    if (rdma->current_length == 0) {
-        return false;
-    }
-
-    /*
-     * Only merge into chunk sequentially.
-     */
-    if (offset != (rdma->current_addr + rdma->current_length)) {
-        return false;
-    }
-
-    if (offset < block->offset) {
-        return false;
-    }
-
-    if ((offset + len) > (block->offset + block->length)) {
-        return false;
-    }
-
-    if ((host_addr + len) > chunk_end) {
-        return false;
-    }
-
-    return true;
-}
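/*
 * A minimal sketch of the coalescing policy implemented by
 * qemu_rdma_buffer_mergeable() and qemu_rdma_write(): strictly sequential
 * writes that stay within the current chunk are merged into one pending
 * RDMA write, and anything else (or an oversized buffer) forces a flush
 * first.  EX_CHUNK_SIZE and EX_MERGE_MAX are assumed values for the
 * example only.
 */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EX_CHUNK_SIZE (1UL << 20)     /* assumed 1 MiB chunks */
#define EX_MERGE_MAX  (2UL << 20)     /* assumed merge limit */

struct pending { uint64_t addr, len; };

static void flush(struct pending *p)
{
    if (p->len) {
        printf("RDMA write: addr=0x%" PRIx64 " len=%" PRIu64 "\n", p->addr, p->len);
        p->len = 0;
    }
}

static void write_coalesced(struct pending *p, uint64_t addr, uint64_t len)
{
    bool sequential = p->len && addr == p->addr + p->len;
    bool same_chunk = p->len &&
        (addr + len - 1) / EX_CHUNK_SIZE == p->addr / EX_CHUNK_SIZE;

    if (!sequential || !same_chunk) {
        flush(p);                     /* cannot merge: push out the old buffer */
        p->addr = addr;
    }
    p->len += len;

    if (p->len >= EX_MERGE_MAX) {
        flush(p);                     /* buffer is large enough: send it now */
    }
}

int main(void)
{
    struct pending p = { 0, 0 };

    write_coalesced(&p, 0x1000, 4096);
    write_coalesced(&p, 0x2000, 4096);   /* sequential: merged into one write */
    write_coalesced(&p, 0x9000, 4096);   /* gap: flushes the merged buffer first */
    flush(&p);                           /* final flush of any remaining data */
    return 0;
}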
-
-/*
- * We're not actually writing here, but doing three things:
- *
- * 1. Identify the chunk the buffer belongs to.
- * 2. If the chunk is full or the buffer doesn't belong to the current
- *    chunk, then start a new chunk and flush() the old chunk.
- * 3. To keep the hardware busy, we also group chunks into batches
- *    and only require that a batch gets acknowledged in the completion
- *    queue instead of each individual chunk.
- */
-static int qemu_rdma_write(RDMAContext *rdma,
-                           uint64_t block_offset, uint64_t offset,
-                           uint64_t len, Error **errp)
-{
-    uint64_t current_addr = block_offset + offset;
-    uint64_t index = rdma->current_index;
-    uint64_t chunk = rdma->current_chunk;
-
-    /* If we cannot merge it, we flush the current buffer first. */
-    if (!qemu_rdma_buffer_mergeable(rdma, current_addr, len)) {
-        if (qemu_rdma_write_flush(rdma, errp) < 0) {
-            return -1;
-        }
-        rdma->current_length = 0;
-        rdma->current_addr = current_addr;
-
-        qemu_rdma_search_ram_block(rdma, block_offset,
-                                   offset, len, &index, &chunk);
-        rdma->current_index = index;
-        rdma->current_chunk = chunk;
-    }
-
-    /* merge it */
-    rdma->current_length += len;
-
-    /* flush it if buffer is too large */
-    if (rdma->current_length >= RDMA_MERGE_MAX) {
-        return qemu_rdma_write_flush(rdma, errp);
-    }
-
-    return 0;
-}
-
-static void qemu_rdma_cleanup(RDMAContext *rdma)
-{
-    Error *err = NULL;
-
-    if (rdma->cm_id && rdma->connected) {
-        if ((rdma->errored ||
-             migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) &&
-            !rdma->received_error) {
-            RDMAControlHeader head = { .len = 0,
-                                       .type = RDMA_CONTROL_ERROR,
-                                       .repeat = 1,
-                                     };
-            warn_report("Early error. Sending error.");
-            if (qemu_rdma_post_send_control(rdma, NULL, &head, &err) < 0) {
-                warn_report_err(err);
-            }
-        }
-
-        rdma_disconnect(rdma->cm_id);
-        trace_qemu_rdma_cleanup_disconnect();
-        rdma->connected = false;
-    }
-
-    if (rdma->channel) {
-        qemu_set_fd_handler(rdma->channel->fd, NULL, NULL, NULL);
-    }
-    g_free(rdma->dest_blocks);
-    rdma->dest_blocks = NULL;
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        if (rdma->wr_data[i].control_mr) {
-            rdma->total_registrations--;
-            ibv_dereg_mr(rdma->wr_data[i].control_mr);
-        }
-        rdma->wr_data[i].control_mr = NULL;
-    }
-
-    if (rdma->local_ram_blocks.block) {
-        while (rdma->local_ram_blocks.nb_blocks) {
-            rdma_delete_block(rdma, &rdma->local_ram_blocks.block[0]);
-        }
-    }
-
-    if (rdma->qp) {
-        rdma_destroy_qp(rdma->cm_id);
-        rdma->qp = NULL;
-    }
-    if (rdma->recv_cq) {
-        ibv_destroy_cq(rdma->recv_cq);
-        rdma->recv_cq = NULL;
-    }
-    if (rdma->send_cq) {
-        ibv_destroy_cq(rdma->send_cq);
-        rdma->send_cq = NULL;
-    }
-    if (rdma->recv_comp_channel) {
-        ibv_destroy_comp_channel(rdma->recv_comp_channel);
-        rdma->recv_comp_channel = NULL;
-    }
-    if (rdma->send_comp_channel) {
-        ibv_destroy_comp_channel(rdma->send_comp_channel);
-        rdma->send_comp_channel = NULL;
-    }
-    if (rdma->pd) {
-        ibv_dealloc_pd(rdma->pd);
-        rdma->pd = NULL;
-    }
-    if (rdma->cm_id) {
-        rdma_destroy_id(rdma->cm_id);
-        rdma->cm_id = NULL;
-    }
-
-    /* the destination side, listen_id and channel is shared */
-    if (rdma->listen_id) {
-        if (!rdma->is_return_path) {
-            rdma_destroy_id(rdma->listen_id);
-        }
-        rdma->listen_id = NULL;
-
-        if (rdma->channel) {
-            if (!rdma->is_return_path) {
-                rdma_destroy_event_channel(rdma->channel);
-            }
-            rdma->channel = NULL;
-        }
-    }
-
-    if (rdma->channel) {
-        rdma_destroy_event_channel(rdma->channel);
-        rdma->channel = NULL;
-    }
-    g_free(rdma->host);
-    rdma->host = NULL;
-}
-
-
-static int qemu_rdma_source_init(RDMAContext *rdma, bool pin_all, Error **errp)
-{
-    int ret;
-
-    /*
-     * Will be validated against destination's actual capabilities
-     * after the connect() completes.
-     */
-    rdma->pin_all = pin_all;
-
-    ret = qemu_rdma_resolve_host(rdma, errp);
-    if (ret < 0) {
-        goto err_rdma_source_init;
-    }
-
-    ret = qemu_rdma_alloc_pd_cq(rdma, errp);
-    if (ret < 0) {
-        goto err_rdma_source_init;
-    }
-
-    ret = qemu_rdma_alloc_qp(rdma);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: rdma migration: error allocating qp!");
-        goto err_rdma_source_init;
-    }
-
-    qemu_rdma_init_ram_blocks(rdma);
-
-    /* Build the hash that maps from offset to RAMBlock */
-    rdma->blockmap = g_hash_table_new(g_direct_hash, g_direct_equal);
-    for (int i = 0; i < rdma->local_ram_blocks.nb_blocks; i++) {
-        g_hash_table_insert(rdma->blockmap,
-                (void *)(uintptr_t)rdma->local_ram_blocks.block[i].offset,
-                &rdma->local_ram_blocks.block[i]);
-    }
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        ret = qemu_rdma_reg_control(rdma, i);
-        if (ret < 0) {
-            error_setg(errp, "RDMA ERROR: rdma migration: error "
-                       "registering %d control!", i);
-            goto err_rdma_source_init;
-        }
-    }
-
-    return 0;
-
-err_rdma_source_init:
-    qemu_rdma_cleanup(rdma);
-    return -1;
-}
-
-static int qemu_get_cm_event_timeout(RDMAContext *rdma,
-                                     struct rdma_cm_event **cm_event,
-                                     long msec, Error **errp)
-{
-    int ret;
-    struct pollfd poll_fd = {
-                                .fd = rdma->channel->fd,
-                                .events = POLLIN,
-                                .revents = 0
-                            };
-
-    do {
-        ret = poll(&poll_fd, 1, msec);
-    } while (ret < 0 && errno == EINTR);
-
-    if (ret == 0) {
-        error_setg(errp, "RDMA ERROR: poll cm event timeout");
-        return -1;
-    } else if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: failed to poll cm event, errno=%i",
-                   errno);
-        return -1;
-    } else if (poll_fd.revents & POLLIN) {
-        if (rdma_get_cm_event(rdma->channel, cm_event) < 0) {
-            error_setg(errp, "RDMA ERROR: failed to get cm event");
-            return -1;
-        }
-        return 0;
-    } else {
-        error_setg(errp, "RDMA ERROR: no POLLIN event, revent=%x",
-                   poll_fd.revents);
-        return -1;
-    }
-}
-
-static int qemu_rdma_connect(RDMAContext *rdma, bool return_path,
-                             Error **errp)
-{
-    RDMACapabilities cap = {
-                                .version = RDMA_CONTROL_VERSION_CURRENT,
-                                .flags = 0,
-                           };
-    struct rdma_conn_param conn_param = { .initiator_depth = 2,
-                                          .retry_count = 5,
-                                          .private_data = &cap,
-                                          .private_data_len = sizeof(cap),
-                                        };
-    struct rdma_cm_event *cm_event;
-    int ret;
-
-    /*
-     * Only negotiate the capability with destination if the user
-     * on the source first requested the capability.
-     */
-    if (rdma->pin_all) {
-        trace_qemu_rdma_connect_pin_all_requested();
-        cap.flags |= RDMA_CAPABILITY_PIN_ALL;
-    }
-
-    caps_to_network(&cap);
-
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, errp);
-    if (ret < 0) {
-        goto err_rdma_source_connect;
-    }
-
-    ret = rdma_connect(rdma->cm_id, &conn_param);
-    if (ret < 0) {
-        error_setg_errno(errp, errno,
-                         "RDMA ERROR: connecting to destination!");
-        goto err_rdma_source_connect;
-    }
-
-    if (return_path) {
-        ret = qemu_get_cm_event_timeout(rdma, &cm_event, 5000, errp);
-    } else {
-        ret = rdma_get_cm_event(rdma->channel, &cm_event);
-        if (ret < 0) {
-            error_setg_errno(errp, errno,
-                             "RDMA ERROR: failed to get cm event");
-        }
-    }
-    if (ret < 0) {
-        goto err_rdma_source_connect;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
-        error_setg(errp, "RDMA ERROR: connecting to destination!");
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_source_connect;
-    }
-    rdma->connected = true;
-
-    memcpy(&cap, cm_event->param.conn.private_data, sizeof(cap));
-    network_to_caps(&cap);
-
-    /*
-     * Verify that the *requested* capabilities are supported by the destination
-     * and disable them otherwise.
-     */
-    if (rdma->pin_all && !(cap.flags & RDMA_CAPABILITY_PIN_ALL)) {
-        warn_report("RDMA: Server cannot support pinning all memory. "
-                    "Will register memory dynamically.");
-        rdma->pin_all = false;
-    }
-
-    trace_qemu_rdma_connect_pin_all_outcome(rdma->pin_all);
-
-    rdma_ack_cm_event(cm_event);
-
-    rdma->control_ready_expected = 1;
-    rdma->nb_sent = 0;
-    return 0;
-
-err_rdma_source_connect:
-    qemu_rdma_cleanup(rdma);
-    return -1;
-}
-
-static int qemu_rdma_dest_init(RDMAContext *rdma, Error **errp)
-{
-    Error *err = NULL;
-    int ret;
-    struct rdma_cm_id *listen_id;
-    char ip[40] = "unknown";
-    struct rdma_addrinfo *res, *e;
-    char port_str[16];
-    int reuse = 1;
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        rdma->wr_data[i].control_len = 0;
-        rdma->wr_data[i].control_curr = NULL;
-    }
-
-    if (!rdma->host || !rdma->host[0]) {
-        error_setg(errp, "RDMA ERROR: RDMA host is not set!");
-        rdma->errored = true;
-        return -1;
-    }
-    /* create CM channel */
-    rdma->channel = rdma_create_event_channel();
-    if (!rdma->channel) {
-        error_setg(errp, "RDMA ERROR: could not create rdma event channel");
-        rdma->errored = true;
-        return -1;
-    }
-
-    /* create CM id */
-    ret = rdma_create_id(rdma->channel, &listen_id, NULL, RDMA_PS_TCP);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not create cm_id!");
-        goto err_dest_init_create_listen_id;
-    }
-
-    snprintf(port_str, 16, "%d", rdma->port);
-    port_str[15] = '\0';
-
-    ret = rdma_getaddrinfo(rdma->host, port_str, NULL, &res);
-    if (ret) {
-        error_setg(errp, "RDMA ERROR: could not rdma_getaddrinfo address %s",
-                   rdma->host);
-        goto err_dest_init_bind_addr;
-    }
-
-    ret = rdma_set_option(listen_id, RDMA_OPTION_ID, RDMA_OPTION_ID_REUSEADDR,
-                          &reuse, sizeof reuse);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: Error: could not set REUSEADDR option");
-        goto err_dest_init_bind_addr;
-    }
-
-    /* Try all addresses, saving the first error in @err */
-    for (e = res; e != NULL; e = e->ai_next) {
-        Error **local_errp = err ? NULL : &err;
-
-        inet_ntop(e->ai_family,
-            &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
-        trace_qemu_rdma_dest_init_trying(rdma->host, ip);
-        ret = rdma_bind_addr(listen_id, e->ai_dst_addr);
-        if (ret < 0) {
-            continue;
-        }
-        if (e->ai_family == AF_INET6) {
-            ret = qemu_rdma_broken_ipv6_kernel(listen_id->verbs,
-                                               local_errp);
-            if (ret < 0) {
-                continue;
-            }
-        }
-        error_free(err);
-        break;
-    }
-
-    rdma_freeaddrinfo(res);
-    if (!e) {
-        if (err) {
-            error_propagate(errp, err);
-        } else {
-            error_setg(errp, "RDMA ERROR: Error: could not rdma_bind_addr!");
-        }
-        goto err_dest_init_bind_addr;
-    }
-
-    rdma->listen_id = listen_id;
-    qemu_rdma_dump_gid("dest_init", listen_id);
-    return 0;
-
-err_dest_init_bind_addr:
-    rdma_destroy_id(listen_id);
-err_dest_init_create_listen_id:
-    rdma_destroy_event_channel(rdma->channel);
-    rdma->channel = NULL;
-    rdma->errored = true;
-    return -1;
-
-}
-
-static void qemu_rdma_return_path_dest_init(RDMAContext *rdma_return_path,
-                                            RDMAContext *rdma)
-{
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        rdma_return_path->wr_data[i].control_len = 0;
-        rdma_return_path->wr_data[i].control_curr = NULL;
-    }
-
-    /*the CM channel and CM id is shared*/
-    rdma_return_path->channel = rdma->channel;
-    rdma_return_path->listen_id = rdma->listen_id;
-
-    rdma->return_path = rdma_return_path;
-    rdma_return_path->return_path = rdma;
-    rdma_return_path->is_return_path = true;
-}
-
-static RDMAContext *qemu_rdma_data_init(InetSocketAddress *saddr, Error **errp)
-{
-    RDMAContext *rdma = NULL;
-
-    rdma = g_new0(RDMAContext, 1);
-    rdma->current_index = -1;
-    rdma->current_chunk = -1;
-
-    rdma->host = g_strdup(saddr->host);
-    rdma->port = atoi(saddr->port);
-    return rdma;
-}
-
-/*
- * QEMUFile interface to the control channel.
- * SEND messages for control only.
- * VM's ram is handled with regular RDMA messages.
- */
-static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
-                                       const struct iovec *iov,
-                                       size_t niov,
-                                       int *fds,
-                                       size_t nfds,
-                                       int flags,
-                                       Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdma;
-    int ret;
-    ssize_t done = 0;
-    size_t len;
-
-    RCU_READ_LOCK_GUARD();
-    rdma = qatomic_rcu_read(&rioc->rdmaout);
-
-    if (!rdma) {
-        error_setg(errp, "RDMA control channel output is not set");
-        return -1;
-    }
-
-    if (rdma->errored) {
-        error_setg(errp,
-                   "RDMA is in an error state waiting migration to abort!");
-        return -1;
-    }
-
-    /*
-     * Push out any writes that
-     * we're queued up for VM's ram.
-     */
-    ret = qemu_rdma_write_flush(rdma, errp);
-    if (ret < 0) {
-        rdma->errored = true;
-        return -1;
-    }
-
-    for (int i = 0; i < niov; i++) {
-        size_t remaining = iov[i].iov_len;
-        uint8_t * data = (void *)iov[i].iov_base;
-        while (remaining) {
-            RDMAControlHeader head = {};
-
-            len = MIN(remaining, RDMA_SEND_INCREMENT);
-            remaining -= len;
-
-            head.len = len;
-            head.type = RDMA_CONTROL_QEMU_FILE;
-
-            ret = qemu_rdma_exchange_send(rdma, &head,
-                                          data, NULL, NULL, NULL, errp);
-
-            if (ret < 0) {
-                rdma->errored = true;
-                return -1;
-            }
-
-            data += len;
-            done += len;
-        }
-    }
-
-    return done;
-}
-
-static size_t qemu_rdma_fill(RDMAContext *rdma, uint8_t *buf,
-                             size_t size, int idx)
-{
-    size_t len = 0;
-
-    if (rdma->wr_data[idx].control_len) {
-        trace_qemu_rdma_fill(rdma->wr_data[idx].control_len, size);
-
-        len = MIN(size, rdma->wr_data[idx].control_len);
-        memcpy(buf, rdma->wr_data[idx].control_curr, len);
-        rdma->wr_data[idx].control_curr += len;
-        rdma->wr_data[idx].control_len -= len;
-    }
-
-    return len;
-}
-
-/*
- * QEMUFile interface to the control channel.
- * RDMA links don't use bytestreams, so we have to
- * return bytes to QEMUFile opportunistically.
- */
-static ssize_t qio_channel_rdma_readv(QIOChannel *ioc,
-                                      const struct iovec *iov,
-                                      size_t niov,
-                                      int **fds,
-                                      size_t *nfds,
-                                      int flags,
-                                      Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdma;
-    RDMAControlHeader head;
-    int ret;
-    ssize_t done = 0;
-    size_t len;
-
-    RCU_READ_LOCK_GUARD();
-    rdma = qatomic_rcu_read(&rioc->rdmain);
-
-    if (!rdma) {
-        error_setg(errp, "RDMA control channel input is not set");
-        return -1;
-    }
-
-    if (rdma->errored) {
-        error_setg(errp,
-                   "RDMA is in an error state waiting migration to abort!");
-        return -1;
-    }
-
-    for (int i = 0; i < niov; i++) {
-        size_t want = iov[i].iov_len;
-        uint8_t *data = (void *)iov[i].iov_base;
-
-        /*
-         * First, we hold on to the last SEND message we
-         * were given and dish out the bytes until we run
-         * out of bytes.
-         */
-        len = qemu_rdma_fill(rdma, data, want, 0);
-        done += len;
-        want -= len;
-        /* Got what we needed, so go to next iovec */
-        if (want == 0) {
-            continue;
-        }
-
-        /* If we got any data so far, then don't wait
-         * for more, just return what we have */
-        if (done > 0) {
-            break;
-        }
-
-
-        /* We've got nothing at all, so lets wait for
-         * more to arrive
-         */
-        ret = qemu_rdma_exchange_recv(rdma, &head, RDMA_CONTROL_QEMU_FILE,
-                                      errp);
-
-        if (ret < 0) {
-            rdma->errored = true;
-            return -1;
-        }
-
-        /*
-         * SEND was received with new bytes, now try again.
-         */
-        len = qemu_rdma_fill(rdma, data, want, 0);
-        done += len;
-        want -= len;
-
-        /* Still didn't get enough, so lets just return */
-        if (want) {
-            if (done == 0) {
-                return QIO_CHANNEL_ERR_BLOCK;
-            } else {
-                break;
-            }
-        }
-    }
-    return done;
-}
-
-/*
- * Block until all the outstanding chunks have been delivered by the hardware.
- */
-static int qemu_rdma_drain_cq(RDMAContext *rdma)
-{
-    Error *err = NULL;
-
-    if (qemu_rdma_write_flush(rdma, &err) < 0) {
-        error_report_err(err);
-        return -1;
-    }
-
-    while (rdma->nb_sent) {
-        if (qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL) < 0) {
-            error_report("rdma migration: complete polling error!");
-            return -1;
-        }
-    }
-
-    qemu_rdma_unregister_waiting(rdma);
-
-    return 0;
-}
-
-
-static int qio_channel_rdma_set_blocking(QIOChannel *ioc,
-                                         bool blocking,
-                                         Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    /* XXX we should make readv/writev actually honour this :-) */
-    rioc->blocking = blocking;
-    return 0;
-}
-
-
-typedef struct QIOChannelRDMASource QIOChannelRDMASource;
-struct QIOChannelRDMASource {
-    GSource parent;
-    QIOChannelRDMA *rioc;
-    GIOCondition condition;
-};
-
-static gboolean
-qio_channel_rdma_source_prepare(GSource *source,
-                                gint *timeout)
-{
-    QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
-    RDMAContext *rdma;
-    GIOCondition cond = 0;
-    *timeout = -1;
-
-    RCU_READ_LOCK_GUARD();
-    if (rsource->condition == G_IO_IN) {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
-    } else {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
-    }
-
-    if (!rdma) {
-        error_report("RDMAContext is NULL when prepare Gsource");
-        return FALSE;
-    }
-
-    if (rdma->wr_data[0].control_len) {
-        cond |= G_IO_IN;
-    }
-    cond |= G_IO_OUT;
-
-    return cond & rsource->condition;
-}
-
-static gboolean
-qio_channel_rdma_source_check(GSource *source)
-{
-    QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
-    RDMAContext *rdma;
-    GIOCondition cond = 0;
-
-    RCU_READ_LOCK_GUARD();
-    if (rsource->condition == G_IO_IN) {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
-    } else {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
-    }
-
-    if (!rdma) {
-        error_report("RDMAContext is NULL when check Gsource");
-        return FALSE;
-    }
-
-    if (rdma->wr_data[0].control_len) {
-        cond |= G_IO_IN;
-    }
-    cond |= G_IO_OUT;
-
-    return cond & rsource->condition;
-}
-
-static gboolean
-qio_channel_rdma_source_dispatch(GSource *source,
-                                 GSourceFunc callback,
-                                 gpointer user_data)
-{
-    QIOChannelFunc func = (QIOChannelFunc)callback;
-    QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
-    RDMAContext *rdma;
-    GIOCondition cond = 0;
-
-    RCU_READ_LOCK_GUARD();
-    if (rsource->condition == G_IO_IN) {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
-    } else {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
-    }
-
-    if (!rdma) {
-        error_report("RDMAContext is NULL when dispatch Gsource");
-        return FALSE;
-    }
-
-    if (rdma->wr_data[0].control_len) {
-        cond |= G_IO_IN;
-    }
-    cond |= G_IO_OUT;
-
-    return (*func)(QIO_CHANNEL(rsource->rioc),
-                   (cond & rsource->condition),
-                   user_data);
-}
-
-static void
-qio_channel_rdma_source_finalize(GSource *source)
-{
-    QIOChannelRDMASource *ssource = (QIOChannelRDMASource *)source;
-
-    object_unref(OBJECT(ssource->rioc));
-}
-
-static GSourceFuncs qio_channel_rdma_source_funcs = {
-    qio_channel_rdma_source_prepare,
-    qio_channel_rdma_source_check,
-    qio_channel_rdma_source_dispatch,
-    qio_channel_rdma_source_finalize
-};
-
-static GSource *qio_channel_rdma_create_watch(QIOChannel *ioc,
-                                              GIOCondition condition)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    QIOChannelRDMASource *ssource;
-    GSource *source;
-
-    source = g_source_new(&qio_channel_rdma_source_funcs,
-                          sizeof(QIOChannelRDMASource));
-    ssource = (QIOChannelRDMASource *)source;
-
-    ssource->rioc = rioc;
-    object_ref(OBJECT(rioc));
-
-    ssource->condition = condition;
-
-    return source;
-}
-
-static void qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
-                                                AioContext *read_ctx,
-                                                IOHandler *io_read,
-                                                AioContext *write_ctx,
-                                                IOHandler *io_write,
-                                                void *opaque)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    if (io_read) {
-        aio_set_fd_handler(read_ctx, rioc->rdmain->recv_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-        aio_set_fd_handler(read_ctx, rioc->rdmain->send_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-    } else {
-        aio_set_fd_handler(write_ctx, rioc->rdmaout->recv_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-        aio_set_fd_handler(write_ctx, rioc->rdmaout->send_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-    }
-}
-
-struct rdma_close_rcu {
-    struct rcu_head rcu;
-    RDMAContext *rdmain;
-    RDMAContext *rdmaout;
-};
-
-/* callback from qio_channel_rdma_close via call_rcu */
-static void qio_channel_rdma_close_rcu(struct rdma_close_rcu *rcu)
-{
-    if (rcu->rdmain) {
-        qemu_rdma_cleanup(rcu->rdmain);
-    }
-
-    if (rcu->rdmaout) {
-        qemu_rdma_cleanup(rcu->rdmaout);
-    }
-
-    g_free(rcu->rdmain);
-    g_free(rcu->rdmaout);
-    g_free(rcu);
-}
-
-static int qio_channel_rdma_close(QIOChannel *ioc,
-                                  Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdmain, *rdmaout;
-    struct rdma_close_rcu *rcu = g_new(struct rdma_close_rcu, 1);
-
-    trace_qemu_rdma_close();
-
-    rdmain = rioc->rdmain;
-    if (rdmain) {
-        qatomic_rcu_set(&rioc->rdmain, NULL);
-    }
-
-    rdmaout = rioc->rdmaout;
-    if (rdmaout) {
-        qatomic_rcu_set(&rioc->rdmaout, NULL);
-    }
-
-    rcu->rdmain = rdmain;
-    rcu->rdmaout = rdmaout;
-    call_rcu(rcu, qio_channel_rdma_close_rcu, rcu);
-
-    return 0;
-}
-
-static int
-qio_channel_rdma_shutdown(QIOChannel *ioc,
-                            QIOChannelShutdown how,
-                            Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdmain, *rdmaout;
-
-    RCU_READ_LOCK_GUARD();
-
-    rdmain = qatomic_rcu_read(&rioc->rdmain);
-    rdmaout = qatomic_rcu_read(&rioc->rdmain);
-
-    switch (how) {
-    case QIO_CHANNEL_SHUTDOWN_READ:
-        if (rdmain) {
-            rdmain->errored = true;
-        }
-        break;
-    case QIO_CHANNEL_SHUTDOWN_WRITE:
-        if (rdmaout) {
-            rdmaout->errored = true;
-        }
-        break;
-    case QIO_CHANNEL_SHUTDOWN_BOTH:
-    default:
-        if (rdmain) {
-            rdmain->errored = true;
-        }
-        if (rdmaout) {
-            rdmaout->errored = true;
-        }
-        break;
-    }
-
-    return 0;
-}
-
-/*
- * Parameters:
- *    @offset == 0 :
- *        This means that 'block_offset' is a full virtual address that does not
- *        belong to a RAMBlock of the virtual machine and instead
- *        represents a private malloc'd memory area that the caller wishes to
- *        transfer.
- *
- *    @offset != 0 :
- *        Offset is an offset to be added to block_offset and used
- *        to also lookup the corresponding RAMBlock.
- *
- *    @size : Number of bytes to transfer
- *
- *    @pages_sent : User-specificed pointer to indicate how many pages were
- *                  sent. Usually, this will not be more than a few bytes of
- *                  the protocol because most transfers are sent asynchronously.
- */
-static int qemu_rdma_save_page(QEMUFile *f, ram_addr_t block_offset,
-                               ram_addr_t offset, size_t size)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    Error *err = NULL;
-    RDMAContext *rdma;
-    int ret;
-
-    RCU_READ_LOCK_GUARD();
-    rdma = qatomic_rcu_read(&rioc->rdmaout);
-
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    qemu_fflush(f);
-
-    /*
-     * Add this page to the current 'chunk'. If the chunk
-     * is full, or the page doesn't belong to the current chunk,
-     * an actual RDMA write will occur and a new chunk will be formed.
-     */
-    ret = qemu_rdma_write(rdma, block_offset, offset, size, &err);
-    if (ret < 0) {
-        error_report_err(err);
-        goto err;
-    }
-
-    /*
-     * Drain the Completion Queue if possible, but do not block,
-     * just poll.
-     *
-     * If nothing to poll, the end of the iteration will do this
-     * again to make sure we don't overflow the request queue.
-     */
-    while (1) {
-        uint64_t wr_id, wr_id_in;
-        ret = qemu_rdma_poll(rdma, rdma->recv_cq, &wr_id_in, NULL);
-
-        if (ret < 0) {
-            error_report("rdma migration: polling error");
-            goto err;
-        }
-
-        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-        if (wr_id == RDMA_WRID_NONE) {
-            break;
-        }
-    }
-
-    while (1) {
-        uint64_t wr_id, wr_id_in;
-        ret = qemu_rdma_poll(rdma, rdma->send_cq, &wr_id_in, NULL);
-
-        if (ret < 0) {
-            error_report("rdma migration: polling error");
-            goto err;
-        }
-
-        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-        if (wr_id == RDMA_WRID_NONE) {
-            break;
-        }
-    }
-
-    return RAM_SAVE_CONTROL_DELAYED;
-
-err:
-    rdma->errored = true;
-    return -1;
-}
-
-int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
-                           ram_addr_t offset, size_t size)
-{
-    if (!migrate_rdma() || migration_in_postcopy()) {
-        return RAM_SAVE_CONTROL_NOT_SUPP;
-    }
-
-    int ret = qemu_rdma_save_page(f, block_offset, offset, size);
-
-    if (ret != RAM_SAVE_CONTROL_DELAYED &&
-        ret != RAM_SAVE_CONTROL_NOT_SUPP) {
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
-        }
-    }
-    return ret;
-}
-
-static void rdma_accept_incoming_migration(void *opaque);
-
-static void rdma_cm_poll_handler(void *opaque)
-{
-    RDMAContext *rdma = opaque;
-    struct rdma_cm_event *cm_event;
-    MigrationIncomingState *mis = migration_incoming_get_current();
-
-    if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
-        error_report("get_cm_event failed %d", errno);
-        return;
-    }
-
-    if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
-        cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
-        if (!rdma->errored &&
-            migration_incoming_get_current()->state !=
-              MIGRATION_STATUS_COMPLETED) {
-            error_report("receive cm event, cm event is %d", cm_event->event);
-            rdma->errored = true;
-            if (rdma->return_path) {
-                rdma->return_path->errored = true;
-            }
-        }
-        rdma_ack_cm_event(cm_event);
-        if (mis->loadvm_co) {
-            qemu_coroutine_enter(mis->loadvm_co);
-        }
-        return;
-    }
-    rdma_ack_cm_event(cm_event);
-}
-
-static int qemu_rdma_accept(RDMAContext *rdma)
-{
-    Error *err = NULL;
-    RDMACapabilities cap;
-    struct rdma_conn_param conn_param = {
-                                            .responder_resources = 2,
-                                            .private_data = &cap,
-                                            .private_data_len = sizeof(cap),
-                                         };
-    RDMAContext *rdma_return_path = NULL;
-    g_autoptr(InetSocketAddress) isock = g_new0(InetSocketAddress, 1);
-    struct rdma_cm_event *cm_event;
-    struct ibv_context *verbs;
-    int ret;
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        goto err_rdma_dest_wait;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_dest_wait;
-    }
-
-    isock->host = g_strdup(rdma->host);
-    isock->port = g_strdup_printf("%d", rdma->port);
-
-    /*
-     * initialize the RDMAContext for return path for postcopy after first
-     * connection request reached.
-     */
-    if ((migrate_postcopy() || migrate_return_path())
-        && !rdma->is_return_path) {
-        rdma_return_path = qemu_rdma_data_init(isock, NULL);
-        if (rdma_return_path == NULL) {
-            rdma_ack_cm_event(cm_event);
-            goto err_rdma_dest_wait;
-        }
-
-        qemu_rdma_return_path_dest_init(rdma_return_path, rdma);
-    }
-
-    memcpy(&cap, cm_event->param.conn.private_data, sizeof(cap));
-
-    network_to_caps(&cap);
-
-    if (cap.version < 1 || cap.version > RDMA_CONTROL_VERSION_CURRENT) {
-        error_report("Unknown source RDMA version: %d, bailing...",
-                     cap.version);
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_dest_wait;
-    }
-
-    /*
-     * Respond with only the capabilities this version of QEMU knows about.
-     */
-    cap.flags &= known_capabilities;
-
-    /*
-     * Enable the ones that we do know about.
-     * Add other checks here as new ones are introduced.
-     */
-    if (cap.flags & RDMA_CAPABILITY_PIN_ALL) {
-        rdma->pin_all = true;
-    }
-
-    rdma->cm_id = cm_event->id;
-    verbs = cm_event->id->verbs;
-
-    rdma_ack_cm_event(cm_event);
-
-    trace_qemu_rdma_accept_pin_state(rdma->pin_all);
-
-    caps_to_network(&cap);
-
-    trace_qemu_rdma_accept_pin_verbsc(verbs);
-
-    if (!rdma->verbs) {
-        rdma->verbs = verbs;
-    } else if (rdma->verbs != verbs) {
-        error_report("ibv context not matching %p, %p!", rdma->verbs,
-                     verbs);
-        goto err_rdma_dest_wait;
-    }
-
-    qemu_rdma_dump_id("dest_init", verbs);
-
-    ret = qemu_rdma_alloc_pd_cq(rdma, &err);
-    if (ret < 0) {
-        error_report_err(err);
-        goto err_rdma_dest_wait;
-    }
-
-    ret = qemu_rdma_alloc_qp(rdma);
-    if (ret < 0) {
-        error_report("rdma migration: error allocating qp!");
-        goto err_rdma_dest_wait;
-    }
-
-    qemu_rdma_init_ram_blocks(rdma);
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        ret = qemu_rdma_reg_control(rdma, i);
-        if (ret < 0) {
-            error_report("rdma: error registering %d control", i);
-            goto err_rdma_dest_wait;
-        }
-    }
-
-    /* Accept the second connection request for return path */
-    if ((migrate_postcopy() || migrate_return_path())
-        && !rdma->is_return_path) {
-        qemu_set_fd_handler(rdma->channel->fd, rdma_accept_incoming_migration,
-                            NULL,
-                            (void *)(intptr_t)rdma->return_path);
-    } else {
-        qemu_set_fd_handler(rdma->channel->fd, rdma_cm_poll_handler,
-                            NULL, rdma);
-    }
-
-    ret = rdma_accept(rdma->cm_id, &conn_param);
-    if (ret < 0) {
-        error_report("rdma_accept failed");
-        goto err_rdma_dest_wait;
-    }
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        error_report("rdma_accept get_cm_event failed");
-        goto err_rdma_dest_wait;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
-        error_report("rdma_accept not event established");
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_dest_wait;
-    }
-
-    rdma_ack_cm_event(cm_event);
-    rdma->connected = true;
-
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, &err);
-    if (ret < 0) {
-        error_report_err(err);
-        goto err_rdma_dest_wait;
-    }
-
-    qemu_rdma_dump_gid("dest_connect", rdma->cm_id);
-
-    return 0;
-
-err_rdma_dest_wait:
-    rdma->errored = true;
-    qemu_rdma_cleanup(rdma);
-    g_free(rdma_return_path);
-    return -1;
-}
-
-static int dest_ram_sort_func(const void *a, const void *b)
-{
-    unsigned int a_index = ((const RDMALocalBlock *)a)->src_index;
-    unsigned int b_index = ((const RDMALocalBlock *)b)->src_index;
-
-    return (a_index < b_index) ? -1 : (a_index != b_index);
-}
-
-/*
- * During each iteration of the migration, we listen for instructions
- * by the source VM to perform dynamic page registrations before they
- * can perform RDMA operations.
- *
- * We respond with the 'rkey'.
- *
- * Keep doing this until the source tells us to stop.
- */
-int rdma_registration_handle(QEMUFile *f)
-{
-    RDMAControlHeader reg_resp = { .len = sizeof(RDMARegisterResult),
-                               .type = RDMA_CONTROL_REGISTER_RESULT,
-                               .repeat = 0,
-                             };
-    RDMAControlHeader unreg_resp = { .len = 0,
-                               .type = RDMA_CONTROL_UNREGISTER_FINISHED,
-                               .repeat = 0,
-                             };
-    RDMAControlHeader blocks = { .type = RDMA_CONTROL_RAM_BLOCKS_RESULT,
-                                 .repeat = 1 };
-    QIOChannelRDMA *rioc;
-    Error *err = NULL;
-    RDMAContext *rdma;
-    RDMALocalBlocks *local;
-    RDMAControlHeader head;
-    RDMARegister *reg, *registers;
-    RDMACompress *comp;
-    RDMARegisterResult *reg_result;
-    static RDMARegisterResult results[RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE];
-    RDMALocalBlock *block;
-    void *host_addr;
-    int ret;
-    int idx = 0;
-
-    if (!migrate_rdma()) {
-        return 0;
-    }
-
-    RCU_READ_LOCK_GUARD();
-    rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    rdma = qatomic_rcu_read(&rioc->rdmain);
-
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    local = &rdma->local_ram_blocks;
-    do {
-        trace_rdma_registration_handle_wait();
-
-        ret = qemu_rdma_exchange_recv(rdma, &head, RDMA_CONTROL_NONE, &err);
-
-        if (ret < 0) {
-            error_report_err(err);
-            break;
-        }
-
-        if (head.repeat > RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE) {
-            error_report("rdma: Too many requests in this message (%d)."
-                            "Bailing.", head.repeat);
-            break;
-        }
-
-        switch (head.type) {
-        case RDMA_CONTROL_COMPRESS:
-            comp = (RDMACompress *) rdma->wr_data[idx].control_curr;
-            network_to_compress(comp);
-
-            trace_rdma_registration_handle_compress(comp->length,
-                                                    comp->block_idx,
-                                                    comp->offset);
-            if (comp->block_idx >= rdma->local_ram_blocks.nb_blocks) {
-                error_report("rdma: 'compress' bad block index %u (vs %d)",
-                             (unsigned int)comp->block_idx,
-                             rdma->local_ram_blocks.nb_blocks);
-                goto err;
-            }
-            block = &(rdma->local_ram_blocks.block[comp->block_idx]);
-
-            host_addr = block->local_host_addr +
-                            (comp->offset - block->offset);
-            if (comp->value) {
-                error_report("rdma: Zero page with non-zero (%d) value",
-                             comp->value);
-                goto err;
-            }
-            ram_handle_zero(host_addr, comp->length);
-            break;
-
-        case RDMA_CONTROL_REGISTER_FINISHED:
-            trace_rdma_registration_handle_finished();
-            return 0;
-
-        case RDMA_CONTROL_RAM_BLOCKS_REQUEST:
-            trace_rdma_registration_handle_ram_blocks();
-
-            /* Sort our local RAM Block list so it's the same as the source,
-             * we can do this since we've filled in a src_index in the list
-             * as we received the RAMBlock list earlier.
-             */
-            qsort(rdma->local_ram_blocks.block,
-                  rdma->local_ram_blocks.nb_blocks,
-                  sizeof(RDMALocalBlock), dest_ram_sort_func);
-            for (int i = 0; i < local->nb_blocks; i++) {
-                local->block[i].index = i;
-            }
-
-            if (rdma->pin_all) {
-                ret = qemu_rdma_reg_whole_ram_blocks(rdma, &err);
-                if (ret < 0) {
-                    error_report_err(err);
-                    goto err;
-                }
-            }
-
-            /*
-             * Dest uses this to prepare to transmit the RAMBlock descriptions
-             * to the source VM after connection setup.
-             * Both sides use the "remote" structure to communicate and update
-             * their "local" descriptions with what was sent.
-             */
-            for (int i = 0; i < local->nb_blocks; i++) {
-                rdma->dest_blocks[i].remote_host_addr =
-                    (uintptr_t)(local->block[i].local_host_addr);
-
-                if (rdma->pin_all) {
-                    rdma->dest_blocks[i].remote_rkey = local->block[i].mr->rkey;
-                }
-
-                rdma->dest_blocks[i].offset = local->block[i].offset;
-                rdma->dest_blocks[i].length = local->block[i].length;
-
-                dest_block_to_network(&rdma->dest_blocks[i]);
-                trace_rdma_registration_handle_ram_blocks_loop(
-                    local->block[i].block_name,
-                    local->block[i].offset,
-                    local->block[i].length,
-                    local->block[i].local_host_addr,
-                    local->block[i].src_index);
-            }
-
-            blocks.len = rdma->local_ram_blocks.nb_blocks
-                                                * sizeof(RDMADestBlock);
-
-
-            ret = qemu_rdma_post_send_control(rdma,
-                                    (uint8_t *) rdma->dest_blocks, &blocks,
-                                    &err);
-
-            if (ret < 0) {
-                error_report_err(err);
-                goto err;
-            }
-
-            break;
-        case RDMA_CONTROL_REGISTER_REQUEST:
-            trace_rdma_registration_handle_register(head.repeat);
-
-            reg_resp.repeat = head.repeat;
-            registers = (RDMARegister *) rdma->wr_data[idx].control_curr;
-
-            for (int count = 0; count < head.repeat; count++) {
-                uint64_t chunk;
-                uint8_t *chunk_start, *chunk_end;
-
-                reg = &registers[count];
-                network_to_register(reg);
-
-                reg_result = &results[count];
-
-                trace_rdma_registration_handle_register_loop(count,
-                         reg->current_index, reg->key.current_addr, reg->chunks);
-
-                if (reg->current_index >= rdma->local_ram_blocks.nb_blocks) {
-                    error_report("rdma: 'register' bad block index %u (vs %d)",
-                                 (unsigned int)reg->current_index,
-                                 rdma->local_ram_blocks.nb_blocks);
-                    goto err;
-                }
-                block = &(rdma->local_ram_blocks.block[reg->current_index]);
-                if (block->is_ram_block) {
-                    if (block->offset > reg->key.current_addr) {
-                        error_report("rdma: bad register address for block %s"
-                            " offset: %" PRIx64 " current_addr: %" PRIx64,
-                            block->block_name, block->offset,
-                            reg->key.current_addr);
-                        goto err;
-                    }
-                    host_addr = (block->local_host_addr +
-                                (reg->key.current_addr - block->offset));
-                    chunk = ram_chunk_index(block->local_host_addr,
-                                            (uint8_t *) host_addr);
-                } else {
-                    chunk = reg->key.chunk;
-                    host_addr = block->local_host_addr +
-                        (reg->key.chunk * (1UL << RDMA_REG_CHUNK_SHIFT));
-                    /* Check for particularly bad chunk value */
-                    if (host_addr < (void *)block->local_host_addr) {
-                        error_report("rdma: bad chunk for block %s"
-                            " chunk: %" PRIx64,
-                            block->block_name, reg->key.chunk);
-                        goto err;
-                    }
-                }
-                chunk_start = ram_chunk_start(block, chunk);
-                chunk_end = ram_chunk_end(block, chunk + reg->chunks);
-                /* avoid "-Waddress-of-packed-member" warning */
-                uint32_t tmp_rkey = 0;
-                if (qemu_rdma_register_and_get_keys(rdma, block,
-                            (uintptr_t)host_addr, NULL, &tmp_rkey,
-                            chunk, chunk_start, chunk_end)) {
-                    error_report("cannot get rkey");
-                    goto err;
-                }
-                reg_result->rkey = tmp_rkey;
-
-                reg_result->host_addr = (uintptr_t)block->local_host_addr;
-
-                trace_rdma_registration_handle_register_rkey(reg_result->rkey);
-
-                result_to_network(reg_result);
-            }
-
-            ret = qemu_rdma_post_send_control(rdma,
-                            (uint8_t *) results, &reg_resp, &err);
-
-            if (ret < 0) {
-                error_report_err(err);
-                goto err;
-            }
-            break;
-        case RDMA_CONTROL_UNREGISTER_REQUEST:
-            trace_rdma_registration_handle_unregister(head.repeat);
-            unreg_resp.repeat = head.repeat;
-            registers = (RDMARegister *) rdma->wr_data[idx].control_curr;
-
-            for (int count = 0; count < head.repeat; count++) {
-                reg = &registers[count];
-                network_to_register(reg);
-
-                trace_rdma_registration_handle_unregister_loop(count,
-                           reg->current_index, reg->key.chunk);
-
-                block = &(rdma->local_ram_blocks.block[reg->current_index]);
-
-                ret = ibv_dereg_mr(block->pmr[reg->key.chunk]);
-                block->pmr[reg->key.chunk] = NULL;
-
-                if (ret != 0) {
-                    error_report("rdma unregistration chunk failed: %s",
-                                 strerror(errno));
-                    goto err;
-                }
-
-                rdma->total_registrations--;
-
-                trace_rdma_registration_handle_unregister_success(reg->key.chunk);
-            }
-
-            ret = qemu_rdma_post_send_control(rdma, NULL, &unreg_resp, &err);
-
-            if (ret < 0) {
-                error_report_err(err);
-                goto err;
-            }
-            break;
-        case RDMA_CONTROL_REGISTER_RESULT:
-            error_report("Invalid RESULT message at dest.");
-            goto err;
-        default:
-            error_report("Unknown control message %s", control_desc(head.type));
-            goto err;
-        }
-    } while (1);
-
-err:
-    rdma->errored = true;
-    return -1;
-}
-
-/* Destination:
- * Called during the initial RAM load section which lists the
- * RAMBlocks by name.  This lets us know the order of the RAMBlocks on
- * the source.  We've already built our local RAMBlock list, but not
- * yet sent the list to the source.
- */
-int rdma_block_notification_handle(QEMUFile *f, const char *name)
-{
-    int curr;
-    int found = -1;
-
-    if (!migrate_rdma()) {
-        return 0;
-    }
-
-    RCU_READ_LOCK_GUARD();
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    RDMAContext *rdma = qatomic_rcu_read(&rioc->rdmain);
-
-    if (!rdma) {
-        return -1;
-    }
-
-    /* Find the matching RAMBlock in our local list */
-    for (curr = 0; curr < rdma->local_ram_blocks.nb_blocks; curr++) {
-        if (!strcmp(rdma->local_ram_blocks.block[curr].block_name, name)) {
-            found = curr;
-            break;
-        }
-    }
-
-    if (found == -1) {
-        error_report("RAMBlock '%s' not found on destination", name);
-        return -1;
-    }
-
-    rdma->local_ram_blocks.block[curr].src_index = rdma->next_src_index;
-    trace_rdma_block_notification_handle(name, rdma->next_src_index);
-    rdma->next_src_index++;
-
-    return 0;
-}
-
-int rdma_registration_start(QEMUFile *f, uint64_t flags)
-{
-    if (!migrate_rdma() || migration_in_postcopy()) {
-        return 0;
-    }
-
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    RCU_READ_LOCK_GUARD();
-    RDMAContext *rdma = qatomic_rcu_read(&rioc->rdmaout);
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    trace_rdma_registration_start(flags);
-    qemu_put_be64(f, RAM_SAVE_FLAG_HOOK);
-    return qemu_fflush(f);
-}
-
-/*
- * Inform dest that dynamic registrations are done for now.
- * First, flush writes, if any.
- */
-int rdma_registration_stop(QEMUFile *f, uint64_t flags)
-{
-    QIOChannelRDMA *rioc;
-    Error *err = NULL;
-    RDMAContext *rdma;
-    RDMAControlHeader head = { .len = 0, .repeat = 1 };
-    int ret;
-
-    if (!migrate_rdma() || migration_in_postcopy()) {
-        return 0;
-    }
-
-    RCU_READ_LOCK_GUARD();
-    rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    rdma = qatomic_rcu_read(&rioc->rdmaout);
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    qemu_fflush(f);
-    ret = qemu_rdma_drain_cq(rdma);
-
-    if (ret < 0) {
-        goto err;
-    }
-
-    if (flags == RAM_CONTROL_SETUP) {
-        RDMAControlHeader resp = {.type = RDMA_CONTROL_RAM_BLOCKS_RESULT };
-        RDMALocalBlocks *local = &rdma->local_ram_blocks;
-        int reg_result_idx, nb_dest_blocks;
-
-        head.type = RDMA_CONTROL_RAM_BLOCKS_REQUEST;
-        trace_rdma_registration_stop_ram();
-
-        /*
-         * Make sure that we parallelize the pinning on both sides.
-         * For very large guests, doing this serially takes a really
-         * long time, so we have to 'interleave' the pinning locally
-         * with the control messages by performing the pinning on this
-         * side before we receive the control response from the other
-         * side that the pinning has completed.
-         */
-        ret = qemu_rdma_exchange_send(rdma, &head, NULL, &resp,
-                    &reg_result_idx, rdma->pin_all ?
-                    qemu_rdma_reg_whole_ram_blocks : NULL,
-                    &err);
-        if (ret < 0) {
-            error_report_err(err);
-            return -1;
-        }
-
-        nb_dest_blocks = resp.len / sizeof(RDMADestBlock);
-
-        /*
-         * The protocol uses two different sets of rkeys (mutually exclusive):
-         * 1. One key to represent the virtual address of the entire ram block.
-         *    (dynamic chunk registration disabled - pin everything with one rkey.)
-         * 2. One to represent individual chunks within a ram block.
-         *    (dynamic chunk registration enabled - pin individual chunks.)
-         *
-         * Once the capability is successfully negotiated, the destination transmits
-         * the keys to use (or sends them later) including the virtual addresses
-         * and then propagates the remote ram block descriptions to his local copy.
-         */
-
-        if (local->nb_blocks != nb_dest_blocks) {
-            error_report("ram blocks mismatch (Number of blocks %d vs %d)",
-                         local->nb_blocks, nb_dest_blocks);
-            error_printf("Your QEMU command line parameters are probably "
-                         "not identical on both the source and destination.");
-            rdma->errored = true;
-            return -1;
-        }
-
-        qemu_rdma_move_header(rdma, reg_result_idx, &resp);
-        memcpy(rdma->dest_blocks,
-            rdma->wr_data[reg_result_idx].control_curr, resp.len);
-        for (int i = 0; i < nb_dest_blocks; i++) {
-            network_to_dest_block(&rdma->dest_blocks[i]);
-
-            /* We require that the blocks are in the same order */
-            if (rdma->dest_blocks[i].length != local->block[i].length) {
-                error_report("Block %s/%d has a different length %" PRIu64
-                             "vs %" PRIu64,
-                             local->block[i].block_name, i,
-                             local->block[i].length,
-                             rdma->dest_blocks[i].length);
-                rdma->errored = true;
-                return -1;
-            }
-            local->block[i].remote_host_addr =
-                    rdma->dest_blocks[i].remote_host_addr;
-            local->block[i].remote_rkey = rdma->dest_blocks[i].remote_rkey;
-        }
-    }
-
-    trace_rdma_registration_stop(flags);
-
-    head.type = RDMA_CONTROL_REGISTER_FINISHED;
-    ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL, NULL, &err);
-
-    if (ret < 0) {
-        error_report_err(err);
-        goto err;
-    }
-
-    return 0;
-err:
-    rdma->errored = true;
-    return -1;
-}
-
-static void qio_channel_rdma_finalize(Object *obj)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(obj);
-    if (rioc->rdmain) {
-        qemu_rdma_cleanup(rioc->rdmain);
-        g_free(rioc->rdmain);
-        rioc->rdmain = NULL;
-    }
-    if (rioc->rdmaout) {
-        qemu_rdma_cleanup(rioc->rdmaout);
-        g_free(rioc->rdmaout);
-        rioc->rdmaout = NULL;
-    }
-}
-
-static void qio_channel_rdma_class_init(ObjectClass *klass,
-                                        void *class_data G_GNUC_UNUSED)
-{
-    QIOChannelClass *ioc_klass = QIO_CHANNEL_CLASS(klass);
-
-    ioc_klass->io_writev = qio_channel_rdma_writev;
-    ioc_klass->io_readv = qio_channel_rdma_readv;
-    ioc_klass->io_set_blocking = qio_channel_rdma_set_blocking;
-    ioc_klass->io_close = qio_channel_rdma_close;
-    ioc_klass->io_create_watch = qio_channel_rdma_create_watch;
-    ioc_klass->io_set_aio_fd_handler = qio_channel_rdma_set_aio_fd_handler;
-    ioc_klass->io_shutdown = qio_channel_rdma_shutdown;
-}
-
-static const TypeInfo qio_channel_rdma_info = {
-    .parent = TYPE_QIO_CHANNEL,
-    .name = TYPE_QIO_CHANNEL_RDMA,
-    .instance_size = sizeof(QIOChannelRDMA),
-    .instance_finalize = qio_channel_rdma_finalize,
-    .class_init = qio_channel_rdma_class_init,
-};
-
-static void qio_channel_rdma_register_types(void)
-{
-    type_register_static(&qio_channel_rdma_info);
-}
-
-type_init(qio_channel_rdma_register_types);
-
-static QEMUFile *rdma_new_input(RDMAContext *rdma)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
-
-    rioc->file = qemu_file_new_input(QIO_CHANNEL(rioc));
-    rioc->rdmain = rdma;
-    rioc->rdmaout = rdma->return_path;
-
-    return rioc->file;
-}
-
-static QEMUFile *rdma_new_output(RDMAContext *rdma)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
-
-    rioc->file = qemu_file_new_output(QIO_CHANNEL(rioc));
-    rioc->rdmaout = rdma;
-    rioc->rdmain = rdma->return_path;
-
-    return rioc->file;
-}
-
-static void rdma_accept_incoming_migration(void *opaque)
-{
-    RDMAContext *rdma = opaque;
-    QEMUFile *f;
-
-    trace_qemu_rdma_accept_incoming_migration();
-    if (qemu_rdma_accept(rdma) < 0) {
-        error_report("RDMA ERROR: Migration initialization failed");
-        return;
-    }
-
-    trace_qemu_rdma_accept_incoming_migration_accepted();
-
-    if (rdma->is_return_path) {
-        return;
-    }
-
-    f = rdma_new_input(rdma);
-    if (f == NULL) {
-        error_report("RDMA ERROR: could not open RDMA for input");
-        qemu_rdma_cleanup(rdma);
-        return;
-    }
-
-    rdma->migration_started_on_destination = 1;
-    migration_fd_process_incoming(f);
-}
-
-void rdma_start_incoming_migration(InetSocketAddress *host_port,
-                                   Error **errp)
-{
-    MigrationState *s = migrate_get_current();
-    int ret;
-    RDMAContext *rdma;
-
-    trace_rdma_start_incoming_migration();
-
-    /* Avoid ram_block_discard_disable(), cannot change during migration. */
-    if (ram_block_discard_is_required()) {
-        error_setg(errp, "RDMA: cannot disable RAM discard");
-        return;
-    }
-
-    rdma = qemu_rdma_data_init(host_port, errp);
-    if (rdma == NULL) {
-        goto err;
-    }
-
-    ret = qemu_rdma_dest_init(rdma, errp);
-    if (ret < 0) {
-        goto err;
-    }
-
-    trace_rdma_start_incoming_migration_after_dest_init();
-
-    ret = rdma_listen(rdma->listen_id, 5);
-
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: listening on socket!");
-        goto cleanup_rdma;
-    }
-
-    trace_rdma_start_incoming_migration_after_rdma_listen();
-    s->rdma_migration = true;
-    qemu_set_fd_handler(rdma->channel->fd, rdma_accept_incoming_migration,
-                        NULL, (void *)(intptr_t)rdma);
-    return;
-
-cleanup_rdma:
-    qemu_rdma_cleanup(rdma);
-err:
-    if (rdma) {
-        g_free(rdma->host);
-    }
-    g_free(rdma);
-}
-
-void rdma_start_outgoing_migration(void *opaque,
-                            InetSocketAddress *host_port, Error **errp)
-{
-    MigrationState *s = opaque;
-    RDMAContext *rdma_return_path = NULL;
-    RDMAContext *rdma;
-    int ret;
-
-    /* Avoid ram_block_discard_disable(), cannot change during migration. */
-    if (ram_block_discard_is_required()) {
-        error_setg(errp, "RDMA: cannot disable RAM discard");
-        return;
-    }
-
-    rdma = qemu_rdma_data_init(host_port, errp);
-    if (rdma == NULL) {
-        goto err;
-    }
-
-    ret = qemu_rdma_source_init(rdma, migrate_rdma_pin_all(), errp);
-
-    if (ret < 0) {
-        goto err;
-    }
-
-    trace_rdma_start_outgoing_migration_after_rdma_source_init();
-    ret = qemu_rdma_connect(rdma, false, errp);
-
-    if (ret < 0) {
-        goto err;
-    }
-
-    /* RDMA postcopy need a separate queue pair for return path */
-    if (migrate_postcopy() || migrate_return_path()) {
-        rdma_return_path = qemu_rdma_data_init(host_port, errp);
-
-        if (rdma_return_path == NULL) {
-            goto return_path_err;
-        }
-
-        ret = qemu_rdma_source_init(rdma_return_path,
-                                    migrate_rdma_pin_all(), errp);
-
-        if (ret < 0) {
-            goto return_path_err;
-        }
-
-        ret = qemu_rdma_connect(rdma_return_path, true, errp);
-
-        if (ret < 0) {
-            goto return_path_err;
-        }
-
-        rdma->return_path = rdma_return_path;
-        rdma_return_path->return_path = rdma;
-        rdma_return_path->is_return_path = true;
-    }
-
-    trace_rdma_start_outgoing_migration_after_rdma_connect();
-
-    s->to_dst_file = rdma_new_output(rdma);
-    s->rdma_migration = true;
-    migrate_fd_connect(s, NULL);
-    return;
-return_path_err:
-    qemu_rdma_cleanup(rdma);
-err:
-    g_free(rdma);
-    g_free(rdma_return_path);
-}
diff --git a/migration/savevm.c b/migration/savevm.c
index 388d7af7cd..939d35d69e 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2970,7 +2970,7 @@  int qemu_loadvm_state(QEMUFile *f)
 
     /* We've got to be careful; if we don't read the data and just shut the fd
      * then the sender can error if we close while it's still sending.
-     * We also mustn't read data that isn't there; some transports (RDMA)
+     * We also mustn't read data that isn't there; some transports
      * will stall waiting for that data when the source has already closed.
      */
     if (ret == 0 && should_send_vmdesc()) {
diff --git a/meson_options.txt b/meson_options.txt
index b5c0bad9e7..79b69d4286 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -196,8 +196,6 @@  option('rbd', type : 'feature', value : 'auto',
        description: 'Ceph block device driver')
 option('opengl', type : 'feature', value : 'auto',
        description: 'OpenGL support')
-option('rdma', type : 'feature', value : 'auto',
-       description: 'Enable RDMA-based migration')
 option('gtk', type : 'feature', value : 'auto',
        description: 'GTK+ user interface')
 option('sdl', type : 'feature', value : 'auto',
diff --git a/migration/meson.build b/migration/meson.build
index 1eeb915ff6..e2cd92c01f 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -36,7 +36,6 @@  if get_option('replication').allowed()
   system_ss.add(files('colo-failover.c', 'colo.c'))
 endif
 
-system_ss.add(when: rdma, if_true: files('rdma.c'))
 if get_option('live_block_migration').allowed()
   system_ss.add(files('block.c'))
 endif
diff --git a/migration/trace-events b/migration/trace-events
index f0e1cb80c7..7db3a5194f 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -193,7 +193,7 @@  process_incoming_migration_co_postcopy_end_main(void) ""
 postcopy_preempt_enabled(bool value) "%d"
 
 # migration-stats
-migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma) "qemu_file %" PRIu64 " multifd %" PRIu64 " RDMA %" PRIu64
+migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd) "qemu_file %" PRIu64 " multifd %" PRIu64
 
 # channel.c
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
@@ -204,72 +204,6 @@  migrate_state_too_big(void) ""
 migrate_global_state_post_load(const char *state) "loaded state: %s"
 migrate_global_state_pre_save(const char *state) "saved state: %s"
 
-# rdma.c
-qemu_rdma_accept_incoming_migration(void) ""
-qemu_rdma_accept_incoming_migration_accepted(void) ""
-qemu_rdma_accept_pin_state(bool pin) "%d"
-qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
-qemu_rdma_block_for_wrid_miss(uint64_t wcomp, uint64_t req) "A Wanted wrid %" PRIu64 " but got %" PRIu64
-qemu_rdma_cleanup_disconnect(void) ""
-qemu_rdma_close(void) ""
-qemu_rdma_connect_pin_all_requested(void) ""
-qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
-qemu_rdma_dest_init_trying(const char *host, const char *ip) "%s => %s"
-qemu_rdma_dump_id_failed(const char *who) "%s RDMA Device opened, but can't query port information"
-qemu_rdma_dump_id(const char *who, const char *name, const char *dev_name, const char *dev_path, const char *ibdev_path, int transport, const char *transport_name) "%s RDMA Device opened: kernel name %s uverbs device name %s, infiniband_verbs class device path %s, infiniband class device path %s, transport: (%d) %s"
-qemu_rdma_dump_gid(const char *who, const char *src, const char *dst) "%s Source GID: %s, Dest GID: %s"
-qemu_rdma_exchange_get_response_start(const char *desc) "CONTROL: %s receiving..."
-qemu_rdma_exchange_get_response_none(const char *desc, int type) "Surprise: got %s (%d)"
-qemu_rdma_exchange_send_issue_callback(void) ""
-qemu_rdma_exchange_send_waiting(const char *desc) "Waiting for response %s"
-qemu_rdma_exchange_send_received(const char *desc) "Response %s received."
-qemu_rdma_fill(size_t control_len, size_t size) "RDMA %zd of %zd bytes already in buffer"
-qemu_rdma_init_ram_blocks(int blocks) "Allocated %d local ram block structures"
-qemu_rdma_poll_recv(uint64_t comp, int64_t id, int sent) "completion %" PRIu64 " received (%" PRId64 ") left %d"
-qemu_rdma_poll_write(uint64_t comp, int left, uint64_t block, uint64_t chunk, void *local, void *remote) "completions %" PRIu64 " left %d, block %" PRIu64 ", chunk: %" PRIu64 " %p %p"
-qemu_rdma_poll_other(uint64_t comp, int left) "other completion %" PRIu64 " received left %d"
-qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
-qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p"
-qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s"
-qemu_rdma_advise_mr(const char *name, uint32_t len, uint64_t addr, const char *res) "Try to advise block %s prefetch at %" PRIu32 "@0x%" PRIx64 ": %s"
-qemu_rdma_resolve_host_trying(const char *host, const char *ip) "Trying %s => %s"
-qemu_rdma_signal_unregister_append(uint64_t chunk, int pos) "Appending unregister chunk %" PRIu64 " at position %d"
-qemu_rdma_signal_unregister_already(uint64_t chunk) "Unregister chunk %" PRIu64 " already in queue"
-qemu_rdma_unregister_waiting_inflight(uint64_t chunk) "Cannot unregister inflight chunk: %" PRIu64
-qemu_rdma_unregister_waiting_proc(uint64_t chunk, int pos) "Processing unregister for chunk: %" PRIu64 " at position %d"
-qemu_rdma_unregister_waiting_send(uint64_t chunk) "Sending unregister for chunk: %" PRIu64
-qemu_rdma_unregister_waiting_complete(uint64_t chunk) "Unregister for chunk: %" PRIu64 " complete."
-qemu_rdma_write_flush(int sent) "sent total: %d"
-qemu_rdma_write_one_block(int count, int block, uint64_t chunk, uint64_t current, uint64_t len, int nb_sent, int nb_chunks) "(%d) Not clobbering: block: %d chunk %" PRIu64 " current %" PRIu64 " len %" PRIu64 " %d %d"
-qemu_rdma_write_one_post(uint64_t chunk, long addr, long remote, uint32_t len) "Posting chunk: %" PRIu64 ", addr: 0x%lx remote: 0x%lx, bytes %" PRIu32
-qemu_rdma_write_one_queue_full(void) ""
-qemu_rdma_write_one_recvregres(int mykey, int theirkey, uint64_t chunk) "Received registration result: my key: 0x%x their key 0x%x, chunk %" PRIu64
-qemu_rdma_write_one_sendreg(uint64_t chunk, int len, int index, int64_t offset) "Sending registration request chunk %" PRIu64 " for %d bytes, index: %d, offset: %" PRId64
-qemu_rdma_write_one_top(uint64_t chunks, uint64_t size) "Writing %" PRIu64 " chunks, (%" PRIu64 " MB)"
-qemu_rdma_write_one_zero(uint64_t chunk, int len, int index, int64_t offset) "Entire chunk is zero, sending compress: %" PRIu64 " for %d bytes, index: %d, offset: %" PRId64
-rdma_add_block(const char *block_name, int block, uint64_t addr, uint64_t offset, uint64_t len, uint64_t end, uint64_t bits, int chunks) "Added Block: '%s':%d, addr: %" PRIu64 ", offset: %" PRIu64 " length: %" PRIu64 " end: %" PRIu64 " bits %" PRIu64 " chunks %d"
-rdma_block_notification_handle(const char *name, int index) "%s at %d"
-rdma_delete_block(void *block, uint64_t addr, uint64_t offset, uint64_t len, uint64_t end, uint64_t bits, int chunks) "Deleted Block: %p, addr: %" PRIu64 ", offset: %" PRIu64 " length: %" PRIu64 " end: %" PRIu64 " bits %" PRIu64 " chunks %d"
-rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
-rdma_registration_handle_finished(void) ""
-rdma_registration_handle_ram_blocks(void) ""
-rdma_registration_handle_ram_blocks_loop(const char *name, uint64_t offset, uint64_t length, void *local_host_addr, unsigned int src_index) "%s: @0x%" PRIx64 "/%" PRIu64 " host:@%p src_index: %u"
-rdma_registration_handle_register(int requests) "%d requests"
-rdma_registration_handle_register_loop(int req, int index, uint64_t addr, uint64_t chunks) "Registration request (%d): index %d, current_addr %" PRIu64 " chunks: %" PRIu64
-rdma_registration_handle_register_rkey(int rkey) "0x%x"
-rdma_registration_handle_unregister(int requests) "%d requests"
-rdma_registration_handle_unregister_loop(int count, int index, uint64_t chunk) "Unregistration request (%d): index %d, chunk %" PRIu64
-rdma_registration_handle_unregister_success(uint64_t chunk) "%" PRIu64
-rdma_registration_handle_wait(void) ""
-rdma_registration_start(uint64_t flags) "%" PRIu64
-rdma_registration_stop(uint64_t flags) "%" PRIu64
-rdma_registration_stop_ram(void) ""
-rdma_start_incoming_migration(void) ""
-rdma_start_incoming_migration_after_dest_init(void) ""
-rdma_start_incoming_migration_after_rdma_listen(void) ""
-rdma_start_outgoing_migration_after_rdma_connect(void) ""
-rdma_start_outgoing_migration_after_rdma_source_init(void) ""
-
 # postcopy-ram.c
 postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
 postcopy_discard_send_range(const char *ramblock, unsigned long start, unsigned long length) "%s:%lx/%lx"
diff --git a/qemu-options.hx b/qemu-options.hx
index f7ef9b4e41..4f390c33ef 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4759,7 +4759,6 @@  ERST
 
 DEF("incoming", HAS_ARG, QEMU_OPTION_incoming, \
     "-incoming tcp:[host]:port[,to=maxport][,ipv4=on|off][,ipv6=on|off]\n" \
-    "-incoming rdma:host:port[,ipv4=on|off][,ipv6=on|off]\n" \
     "-incoming unix:socketpath\n" \
     "                prepare for incoming migration, listen on\n" \
     "                specified protocol and socket address\n" \
@@ -4773,8 +4772,6 @@  DEF("incoming", HAS_ARG, QEMU_OPTION_incoming, \
     QEMU_ARCH_ALL)
 SRST
 ``-incoming tcp:[host]:port[,to=maxport][,ipv4=on|off][,ipv6=on|off]``
-  \ 
-``-incoming rdma:host:port[,ipv4=on|off][,ipv6=on|off]``
     Prepare for incoming migration, listen on a given tcp port.
 
 ``-incoming unix:socketpath``
diff --git a/scripts/ci/org.centos/stream/8/build-environment.yml b/scripts/ci/org.centos/stream/8/build-environment.yml
index 1ead77e2cb..a366bb185b 100644
--- a/scripts/ci/org.centos/stream/8/build-environment.yml
+++ b/scripts/ci/org.centos/stream/8/build-environment.yml
@@ -68,7 +68,6 @@ 
           - pixman-devel
           - python38
           - python3-sphinx
-          - rdma-core-devel
           - redhat-rpm-config
           - snappy-devel
           - spice-glib-devel
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index 868db665f6..5dead834fb 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -103,7 +103,6 @@ 
 --disable-qed \
 --disable-qom-cast-debug \
 --disable-rbd \
---disable-rdma \
 --disable-replication \
 --disable-rng-none \
 --disable-safe-stack \
@@ -175,7 +174,6 @@ 
 --enable-opengl \
 --enable-pie \
 --enable-rbd \
---enable-rdma \
 --enable-seccomp \
 --enable-snappy \
 --enable-smartcard \
diff --git a/scripts/ci/setup/build-environment.yml b/scripts/ci/setup/build-environment.yml
index f344d1a850..0359b1c023 100644
--- a/scripts/ci/setup/build-environment.yml
+++ b/scripts/ci/setup/build-environment.yml
@@ -81,8 +81,6 @@ 
           - libglusterfs-dev
           - libgnutls28-dev
           - libgtk-3-dev
-          - libibumad-dev
-          - libibverbs-dev
           - libiscsi-dev
           - libjemalloc-dev
           - libjpeg-turbo8-dev
@@ -99,7 +97,6 @@ 
           - libpng-dev
           - libpulse-dev
           - librbd-dev
-          - librdmacm-dev
           - libsasl2-dev
           - libsdl2-dev
           - libsdl2-image-dev
@@ -236,7 +233,6 @@ 
           - pixman-devel
           - python38
           - python3-sphinx
-          - rdma-core-devel
           - redhat-rpm-config
           - snappy-devel
           - spice-glib-devel
diff --git a/scripts/coverity-scan/run-coverity-scan b/scripts/coverity-scan/run-coverity-scan
index 43cf770f5e..3dd14c3cc4 100755
--- a/scripts/coverity-scan/run-coverity-scan
+++ b/scripts/coverity-scan/run-coverity-scan
@@ -426,7 +426,7 @@  echo "Configuring..."
     --enable-libusb --enable-usb-redir \
     --enable-libiscsi --enable-libnfs --enable-seccomp \
     --enable-tpm --enable-libssh --enable-lzo --enable-snappy --enable-bzip2 \
-    --enable-numa --enable-rdma --enable-smartcard --enable-virglrenderer \
+    --enable-numa --enable-smartcard --enable-virglrenderer \
     --enable-mpath --enable-glusterfs \
     --enable-virtfs --enable-zstd
 
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 5ace33f167..52c34598ba 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -167,7 +167,6 @@  meson_options_help() {
   printf "%s\n" '  qed             qed image format support'
   printf "%s\n" '  qga-vss         build QGA VSS support (broken with MinGW)'
   printf "%s\n" '  rbd             Ceph block device driver'
-  printf "%s\n" '  rdma            Enable RDMA-based migration'
   printf "%s\n" '  replication     replication support'
   printf "%s\n" '  rutabaga-gfx    rutabaga_gfx support'
   printf "%s\n" '  sdl             SDL user interface'
@@ -442,8 +441,6 @@  _meson_option_parse() {
     --disable-qom-cast-debug) printf "%s" -Dqom_cast_debug=false ;;
     --enable-rbd) printf "%s" -Drbd=enabled ;;
     --disable-rbd) printf "%s" -Drbd=disabled ;;
-    --enable-rdma) printf "%s" -Drdma=enabled ;;
-    --disable-rdma) printf "%s" -Drdma=disabled ;;
     --enable-relocatable) printf "%s" -Drelocatable=true ;;
     --disable-relocatable) printf "%s" -Drelocatable=false ;;
     --enable-replication) printf "%s" -Dreplication=enabled ;;
diff --git a/tests/lcitool/projects/qemu.yml b/tests/lcitool/projects/qemu.yml
index 149b15de57..511e48a5ec 100644
--- a/tests/lcitool/projects/qemu.yml
+++ b/tests/lcitool/projects/qemu.yml
@@ -48,8 +48,6 @@  packages:
  - libfdt
  - libffi
  - libgcrypt
- - libibumad
- - libibverbs
  - libiscsi
  - libjemalloc
  - libjpeg
@@ -58,7 +56,6 @@  packages:
  - libpmem
  - libpng
  - librbd
- - librdmacm
  - libseccomp
  - libselinux
  - libslirp
diff --git a/tests/migration/guestperf/engine.py b/tests/migration/guestperf/engine.py
index 608d7270f6..a704419082 100644
--- a/tests/migration/guestperf/engine.py
+++ b/tests/migration/guestperf/engine.py
@@ -41,7 +41,7 @@  def __init__(self, binary, dst_host, kernel, initrd, transport="tcp",
         self._dst_host = dst_host # Hostname of target host
         self._kernel = kernel # Path to kernel image
         self._initrd = initrd # Path to stress initrd
-        self._transport = transport # 'unix' or 'tcp' or 'rdma'
+        self._transport = transport # 'unix' or 'tcp'
         self._sleep = sleep
         self._verbose = verbose
         self._debug = debug
@@ -427,8 +427,6 @@  def run(self, hardware, scenario, result_dir=os.getcwd()):
 
         if self._transport == "tcp":
             uri = "tcp:%s:9000" % self._dst_host
-        elif self._transport == "rdma":
-            uri = "rdma:%s:9000" % self._dst_host
         elif self._transport == "unix":
             if self._dst_host != "localhost":
                 raise Exception("Running use unix migration transport for non-local host")