diff mbox series

[net-next,v10,02/14] net: page_pool: create hooks for custom page providers

Message ID 20240530201616.1316526-3-almasrymina@google.com
State New
Headers show
Series Device Memory TCP | expand

Commit Message

Mina Almasry May 30, 2024, 8:16 p.m. UTC
From: Jakub Kicinski <kuba@kernel.org>

The page providers which try to reuse the same pages will
need to hold onto the ref, even if page gets released from
the pool - as in releasing the page from the pp just transfers
the "ownership" reference from pp to the provider, and provider
will wait for other references to be gone before feeding this
page back into the pool.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mina Almasry <almasrymina@google.com>

---

- This is implemented by Jakub in his RFC:
https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.com/T/

I take no credit for the idea or implementation; I only added minor
edits to make this workable with device memory TCP, and removed some
hacky test code. This is a critical dependency of device memory TCP
and thus I'm pulling it into this series to make it revewable and
mergeable.

- There is a pending discussion about the acceptance of the page_pool
  memory provider hooks:

https://lore.kernel.org/netdev/20240403002053.2376017-3-almasrymina@google.com/

I'm unsure if the discussion has been resolved yet. Sending the series
anyway to get reviews/feedback on the (unrelated) rest of the series.

Cc: Christoph Hellwig <hch@infradead.org>

v10:
- Renamed alloc_pages -> alloc_netmems. alloc_pages is now
  a preprocessor macro, and reusing the string results in a build error.

RFC v3 -> v1
- Removed unusued mem_provider. (Yunsheng).
- Replaced memory_provider & mp_priv with netdev_rx_queue (Jakub).

---
 include/net/page_pool/types.h | 12 ++++++++++
 net/core/page_pool.c          | 43 +++++++++++++++++++++++++++++++----
 2 files changed, 50 insertions(+), 5 deletions(-)

Comments

'Christoph Hellwig' June 1, 2024, 5:35 a.m. UTC | #1
On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote:
> I'm unsure if the discussion has been resolved yet. Sending the series
> anyway to get reviews/feedback on the (unrelated) rest of the series.

As far as I'm concerned it is not.  I've not seen any convincing
argument for more than page/folio allocator including larger order /
huge page and dmabuf.
Pavel Begunkov June 3, 2024, 2:52 p.m. UTC | #2
On 6/3/24 15:17, Mina Almasry wrote:
> On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote:
>>> I'm unsure if the discussion has been resolved yet. Sending the series
>>> anyway to get reviews/feedback on the (unrelated) rest of the series.
>>
>> As far as I'm concerned it is not.  I've not seen any convincing
>> argument for more than page/folio allocator including larger order /
>> huge page and dmabuf.
>>
> 
> Thanks Christoph, this particular patch series adds dmabuf, so I
> assume no objection there. I assume the objection is that you want the
> generic, extensible hooks removed.
> 
> To be honest, I don't think the hooks are an integral part of the
> design, and at this point I think we've argued for them enough. I
> think we can easily achieve the same thing with just raw if statements
> in a couple of places. We can always add the hooks if and only if we
> actually justify many memory providers.
> 
> Any objections to me removing the hooks and directing to memory
> allocations via simple if statements? Something like (very rough
> draft, doesn't compile):

The question for Christoph is what exactly is the objection here? Why we
would not be using well defined ops when we know there will be more
users? Repeating what I said in the last thread, for io_uring it's used
to implement the flow of buffers from userspace to the kernel, the ABI,
which is orthogonal to the issue of what memory type it is and how it
came there. And even if you mandate unnecessary dmabuf condoms for user
memory in one form or another IMHO for no clear reason, the callbacks
(or yet another if-else) would still be needed.

Sure, Mina can drop and hard code devmem path to easy the pain for
him and delay the discussion, but then shortly after I will be
re-sending same shit. So, what's the convincing argument _not_ to have
it?


> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 92be1aaf18ccc..2cc986455bce6 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool
> *pool, gfp_t gfp)
>                  return netmem;
> 
>          /* Slow-path: cache empty, do real allocation */
> -       if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
> -               netmem = pool->mp_ops->alloc_pages(pool, gfp);
> +       if (unlikely(page_pool_is_dmabuf(pool)))
> +               netmem = mp_dmabuf_devmem_alloc_pages():
>          else
>                  netmem = __page_pool_alloc_pages_slow(pool, gfp);
>          return netmem;
> 
>
'Christoph Hellwig' June 5, 2024, 8:23 a.m. UTC | #3
On Mon, Jun 03, 2024 at 07:17:05AM -0700, Mina Almasry wrote:
> On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote:
> > > I'm unsure if the discussion has been resolved yet. Sending the series
> > > anyway to get reviews/feedback on the (unrelated) rest of the series.
> >
> > As far as I'm concerned it is not.  I've not seen any convincing
> > argument for more than page/folio allocator including larger order /
> > huge page and dmabuf.
> >
> 
> Thanks Christoph, this particular patch series adds dmabuf, so I
> assume no objection there. I assume the objection is that you want the
> generic, extensible hooks removed.

Exactly!  Note that this isn't a review of the dmabuf bits as there
are people more qualified with me.

> To be honest, I don't think the hooks are an integral part of the
> design, and at this point I think we've argued for them enough. I
> think we can easily achieve the same thing with just raw if statements
> in a couple of places. We can always add the hooks if and only if we
> actually justify many memory providers.
> 
> Any objections to me removing the hooks and directing to memory
> allocations via simple if statements? Something like (very rough
> draft, doesn't compile):

I like this approach, thanks! 

You might still want to keep the static key, though.
Pavel Begunkov June 7, 2024, 1:42 p.m. UTC | #4
On 6/3/24 16:43, Mina Almasry wrote:
> On Mon, Jun 3, 2024 at 7:52 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 6/3/24 15:17, Mina Almasry wrote:
>>> On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>>
>>>> On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote:
>>>>> I'm unsure if the discussion has been resolved yet. Sending the series
>>>>> anyway to get reviews/feedback on the (unrelated) rest of the series.
>>>>
>>>> As far as I'm concerned it is not.  I've not seen any convincing
>>>> argument for more than page/folio allocator including larger order /
>>>> huge page and dmabuf.
>>>>
>>>
>>> Thanks Christoph, this particular patch series adds dmabuf, so I
>>> assume no objection there. I assume the objection is that you want the
>>> generic, extensible hooks removed.
>>>
>>> To be honest, I don't think the hooks are an integral part of the
>>> design, and at this point I think we've argued for them enough. I
>>> think we can easily achieve the same thing with just raw if statements
>>> in a couple of places. We can always add the hooks if and only if we
>>> actually justify many memory providers.
>>>
>>> Any objections to me removing the hooks and directing to memory
>>> allocations via simple if statements? Something like (very rough
>>> draft, doesn't compile):
>>
>> The question for Christoph is what exactly is the objection here? Why we
>> would not be using well defined ops when we know there will be more
>> users? Repeating what I said in the last thread, for io_uring it's used
>> to implement the flow of buffers from userspace to the kernel, the ABI,
>> which is orthogonal to the issue of what memory type it is and how it
>> came there. And even if you mandate unnecessary dmabuf condoms for user
>> memory in one form or another IMHO for no clear reason, the callbacks
>> (or yet another if-else) would still be needed.
>>
>> Sure, Mina can drop and hard code devmem path to easy the pain for
>> him and delay the discussion, but then shortly after I will be
>> re-sending same shit.
> 
> You don't need to re-send the same ops again, right? You can add io
> uring support without ops. Something like:
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 92be1aaf18ccc..2cc986455bce6 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool
> *pool, gfp_t gfp)
>                  return netmem;
> 
>          /* Slow-path: cache empty, do real allocation */
> -       if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
> -               netmem = pool->mp_ops->alloc_pages(pool, gfp);
> +       if (unlikely(page_pool_is_dmabuf(pool)))
> +               netmem = mp_dmabuf_devmem_alloc_pages():
> +       else if (unlikely(page_pool_is_iouring(pool)))
> +               netmem = mp_io_uring_alloc_pages():
>         else
>                  netmem = __page_pool_alloc_pages_slow(pool, gfp);
>          return netmem;
> 
> So IMO, the ops themselves, which Christoph is repeatedly nacking, are
> not that important.
> 
> I humbly think the energy should be spent convincing maintainers of
> the use case of io uring memory, not the ops. The ops are a cosmetic

I haven't seen any arguments against from the (net) maintainers so
far. Nor I see any objection against callbacks from them (considering
that either option adds an if).

And just not to confuse folks, it's just user pages, not some
weird special io_uring memory.

> change to the code, and can be added later. Christoph is nacking the
> ops because it gives people too much rope [1].

Yes, it is cosmetic, just as much as removing it is a cosmetic
change. You can apply same "too much rope" argument basically
to anything.

Take io_uring, nothing would change in the process, it'd still
be sent to net and reviewed exactly same way, while being less
clean, with poorer subsystem separation, allowing custom
formats / argument list, etc. I think it's cleaner with callbacks,
Mr. Christoph has other beliefs and keeps coercing to them,
even though from time to time it backfires for the author, just
personal experience.


> But if you disagree and think the ops themselves are important for a
> reason I missed, I'm happy waiting until agreement is reached here.
> Sorry, just voicing my 2 cents.
> 
> [1] https://lore.kernel.org/netdev/ZjjHUh1eINPg1wkn@infradead.org/
>
Pavel Begunkov June 7, 2024, 1:45 p.m. UTC | #5
On 6/5/24 09:24, Christoph Hellwig wrote:
> On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote:
>> The question for Christoph is what exactly is the objection here? Why we
>> would not be using well defined ops when we know there will be more
>> users?
> 
> The point is that there should be no more users.  If you need another

Does that "No more" stops after devmem tcp? Or after io_uring
proposal? For the latter I explained why io_uring has to do it
for good design and that's it's not even related to the memory
type used.

> case you are doing something very wrong.

That's not a very illuminating answer
David Ahern June 7, 2024, 2:27 p.m. UTC | #6
On 6/7/24 7:42 AM, Pavel Begunkov wrote:
> I haven't seen any arguments against from the (net) maintainers so
> far. Nor I see any objection against callbacks from them (considering
> that either option adds an if).

I have said before I do not understand why the dmabuf paradigm is not
sufficient for both device memory and host memory. A less than ideal
control path to put hostmem in a dmabuf wrapper vs extra checks and
changes in the datapath. The former should always be preferred.

I also do not understand why the ifq cache and overloading xdp functions
have stuck around; I always thought both were added by Jonathan to
simplify kernel ports during early POC days.
Jason Gunthorpe June 7, 2024, 2:52 p.m. UTC | #7
On Fri, Jun 07, 2024 at 08:27:29AM -0600, David Ahern wrote:
> On 6/7/24 7:42 AM, Pavel Begunkov wrote:
> > I haven't seen any arguments against from the (net) maintainers so
> > far. Nor I see any objection against callbacks from them (considering
> > that either option adds an if).
> 
> I have said before I do not understand why the dmabuf paradigm is not
> sufficient for both device memory and host memory. A less than ideal
> control path to put hostmem in a dmabuf wrapper vs extra checks and
> changes in the datapath. The former should always be preferred.

I think Pavel explained this - his project is principally to replace
the lifetime policy of pages in the data plane. He wants to change
when a page is considered available for re-allocation because
userspace may continue to use the page after the netstack thinks it is
done with it. It sounds like having a different source of the pages is
the less important part.

IMHO it seems to compose poorly if you can only use the io_uring
lifecycle model with io_uring registered memory, and not with DMABUF
memory registered through Mina's mechanism.

Jason
Pavel Begunkov June 7, 2024, 3:42 p.m. UTC | #8
On 6/7/24 15:27, David Ahern wrote:
> On 6/7/24 7:42 AM, Pavel Begunkov wrote:
>> I haven't seen any arguments against from the (net) maintainers so
>> far. Nor I see any objection against callbacks from them (considering
>> that either option adds an if).
> 
> I have said before I do not understand why the dmabuf paradigm is not
> sufficient for both device memory and host memory. A less than ideal
> control path to put hostmem in a dmabuf wrapper vs extra checks and
> changes in the datapath. The former should always be preferred.

If we're talking about types of memory specifically, I'm not strictly
against wrapping into dmabuf in kernel, but that just doesn't give
anything.
But the main reason for allocations there is the difference in
approaches to the api. With io_uring the allocation callback is
responsible for getting buffers back from the user (via a shared
ring). No locking for the ring, and buffers are already in the
context (napi) where they would be consumed from. Removes some
headaches for the user (like batching before returning buffers),
and should go better with smaller buffers and such.

> I also do not understand why the ifq cache 

I'm not sure what you mean by ifq cache. Can you elaborate?

> and overloading xdp functions

Assuming it's about setup via xdp, it was marked for remaking in
RFCs for longer than desired but it's gone now in our tree (but
maybe not in the latest series).

> have stuck around; I always thought both were added by Jonathan to
> simplify kernel ports during early POC days.
Pavel Begunkov June 7, 2024, 3:46 p.m. UTC | #9
On 6/7/24 16:42, Pavel Begunkov wrote:
> On 6/7/24 15:27, David Ahern wrote:
>> On 6/7/24 7:42 AM, Pavel Begunkov wrote:
>>> I haven't seen any arguments against from the (net) maintainers so
>>> far. Nor I see any objection against callbacks from them (considering
>>> that either option adds an if).
>>
>> I have said before I do not understand why the dmabuf paradigm is not
>> sufficient for both device memory and host memory. A less than ideal
>> control path to put hostmem in a dmabuf wrapper vs extra checks and
>> changes in the datapath. The former should always be preferred.
> 
> If we're talking about types of memory specifically, I'm not strictly
> against wrapping into dmabuf in kernel, but that just doesn't give
> anything.

And the reason I don't have too strong of an opinion on that is
mainly because it's just setup/cleanup path.

> But the main reason for allocations there is the difference in
> approaches to the api. With io_uring the allocation callback is
> responsible for getting buffers back from the user (via a shared
> ring). No locking for the ring, and buffers are already in the
> context (napi) where they would be consumed from. Removes some
> headaches for the user (like batching before returning buffers),
> and should go better with smaller buffers and such.
> 
>> I also do not understand why the ifq cache 
> 
> I'm not sure what you mean by ifq cache. Can you elaborate?
> 
>> and overloading xdp functions
> 
> Assuming it's about setup via xdp, it was marked for remaking in
> RFCs for longer than desired but it's gone now in our tree (but
> maybe not in the latest series).
> 
>> have stuck around; I always thought both were added by Jonathan to
>> simplify kernel ports during early POC days.
>
Mina Almasry June 7, 2024, 4:59 p.m. UTC | #10
On Fri, Jun 7, 2024 at 8:47 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 6/7/24 16:42, Pavel Begunkov wrote:
> > On 6/7/24 15:27, David Ahern wrote:
> >> On 6/7/24 7:42 AM, Pavel Begunkov wrote:
> >>> I haven't seen any arguments against from the (net) maintainers so
> >>> far. Nor I see any objection against callbacks from them (considering
> >>> that either option adds an if).
> >>
> >> I have said before I do not understand why the dmabuf paradigm is not
> >> sufficient for both device memory and host memory. A less than ideal
> >> control path to put hostmem in a dmabuf wrapper vs extra checks and
> >> changes in the datapath. The former should always be preferred.
> >
> > If we're talking about types of memory specifically, I'm not strictly
> > against wrapping into dmabuf in kernel, but that just doesn't give
> > anything.
>
> And the reason I don't have too strong of an opinion on that is
> mainly because it's just setup/cleanup path.
>

I agree wrapping io uring in dmabuf seems to be an unnecessary detour.
I never understood the need or upside to do that, but it could be a
lack of understanding on my part.

However, the concern that David brings up may materialize. I've had to
spend a lot of time minimizing or justifying checks to the code with
page pool benchmarks that detect even 1 cycle regressions. You may be
asked to run the same benchmarks and minimize similar overhead.

The benchmark in question is Jesper's bench_page_pool_simple. I've
forked it and applied it on top of net-next here:
https://github.com/mina/linux/commit/927596f87ab5791a8a6ba8597ba2189747396e54

As io_uring ZC comes close to merging, I suspect it would be good to
run this to understand the regression in the fast path, if any. If
there are no to little regressions, I have no concerns over io uring
memory not being wrapped in dmabufs, and David may agree as well.

--
Thanks,
Mina
David Wei June 10, 2024, 12:37 a.m. UTC | #11
On 2024-06-07 17:52, Jason Gunthorpe wrote:
> IMHO it seems to compose poorly if you can only use the io_uring
> lifecycle model with io_uring registered memory, and not with DMABUF
> memory registered through Mina's mechanism.

By this, do you mean io_uring must be exclusively used to use this
feature?

And you'd rather see the two decoupled, so userspace can register w/ say
dmabuf then pass it to io_uring?

> 
> Jason
Pavel Begunkov June 10, 2024, 1:12 a.m. UTC | #12
On 6/7/24 17:59, Mina Almasry wrote:
> On Fri, Jun 7, 2024 at 8:47 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 6/7/24 16:42, Pavel Begunkov wrote:
>>> On 6/7/24 15:27, David Ahern wrote:
>>>> On 6/7/24 7:42 AM, Pavel Begunkov wrote:
>>>>> I haven't seen any arguments against from the (net) maintainers so
>>>>> far. Nor I see any objection against callbacks from them (considering
>>>>> that either option adds an if).
>>>>
>>>> I have said before I do not understand why the dmabuf paradigm is not
>>>> sufficient for both device memory and host memory. A less than ideal
>>>> control path to put hostmem in a dmabuf wrapper vs extra checks and
>>>> changes in the datapath. The former should always be preferred.
>>>
>>> If we're talking about types of memory specifically, I'm not strictly
>>> against wrapping into dmabuf in kernel, but that just doesn't give
>>> anything.
>>
>> And the reason I don't have too strong of an opinion on that is
>> mainly because it's just setup/cleanup path.
>>
> 
> I agree wrapping io uring in dmabuf seems to be an unnecessary detour.
> I never understood the need or upside to do that, but it could be a
> lack of understanding on my part.
> 
> However, the concern that David brings up may materialize. I've had to
> spend a lot of time minimizing or justifying checks to the code with
> page pool benchmarks that detect even 1 cycle regressions. You may be
> asked to run the same benchmarks and minimize similar overhead.
> 
> The benchmark in question is Jesper's bench_page_pool_simple. I've
> forked it and applied it on top of net-next here:
> https://github.com/mina/linux/commit/927596f87ab5791a8a6ba8597ba2189747396e54
> 
> As io_uring ZC comes close to merging, I suspect it would be good to
> run this to understand the regression in the fast path, if any. If
> there are no to little regressions, I have no concerns over io uring
> memory not being wrapped in dmabufs, and David may agree as well.

That's the easiest part as io_uring only reusing call points
you added for devmem and thus doesn't add anything new on top
to hot paths.
Christian König June 10, 2024, 12:38 p.m. UTC | #13
Am 10.06.24 um 14:16 schrieb Jason Gunthorpe:
> On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote:
>> On 6/10/24 01:37, David Wei wrote:
>>> On 2024-06-07 17:52, Jason Gunthorpe wrote:
>>>> IMHO it seems to compose poorly if you can only use the io_uring
>>>> lifecycle model with io_uring registered memory, and not with DMABUF
>>>> memory registered through Mina's mechanism.
>>> By this, do you mean io_uring must be exclusively used to use this
>>> feature?
>>>
>>> And you'd rather see the two decoupled, so userspace can register w/ say
>>> dmabuf then pass it to io_uring?
>> Personally, I have no clue what Jason means. You can just as
>> well say that it's poorly composable that write(2) to a disk
>> cannot post a completion into a XDP ring, or a netlink socket,
>> or io_uring's main completion queue, or name any other API.
> There is no reason you shouldn't be able to use your fast io_uring
> completion and lifecycle flow with DMABUF backed memory. Those are not
> widly different things and there is good reason they should work
> together.

Well there is the fundamental problem that you can't use io_uring to 
implement the semantics necessary for a dma_fence.

That's why we had to reject the io_uring work on DMA-buf sharing from 
Google a few years ago.

But this only affects the dma_fence synchronization part of DMA-buf, but 
*not* the general buffer sharing.

Regards,
Christian.

>
> Pretending they are totally different just because two different
> people wrote them is a very siloed view.
>
>> The devmem TCP callback can implement it in a way feasible to
>> the project, but it cannot directly post events to an unrelated
>> API like io_uring. And devmem attaches buffers to a socket,
>> for which a ring for returning buffers might even be a nuisance.
> If you can't compose your io_uring completion mechanism with a DMABUF
> provided backing store then I think it needs more work.
>
> Jason
Pavel Begunkov June 10, 2024, 7:20 p.m. UTC | #14
On 6/10/24 16:16, David Ahern wrote:
> On 6/10/24 6:16 AM, Jason Gunthorpe wrote:
>> On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote:
>>> On 6/10/24 01:37, David Wei wrote:
>>>> On 2024-06-07 17:52, Jason Gunthorpe wrote:
>>>>> IMHO it seems to compose poorly if you can only use the io_uring
>>>>> lifecycle model with io_uring registered memory, and not with DMABUF
>>>>> memory registered through Mina's mechanism.
>>>>
>>>> By this, do you mean io_uring must be exclusively used to use this
>>>> feature?
>>>>
>>>> And you'd rather see the two decoupled, so userspace can register w/ say
>>>> dmabuf then pass it to io_uring?
>>>
>>> Personally, I have no clue what Jason means. You can just as
>>> well say that it's poorly composable that write(2) to a disk
>>> cannot post a completion into a XDP ring, or a netlink socket,
>>> or io_uring's main completion queue, or name any other API.
>>
>> There is no reason you shouldn't be able to use your fast io_uring
>> completion and lifecycle flow with DMABUF backed memory. Those are not
>> widly different things and there is good reason they should work
>> together.

Let's not mix up devmem TCP and dmabuf specifically, as I see it
your question was concerning the latter: "... DMABUF memory registered
through Mina's mechanism". io_uring's zcrx can trivially get dmabuf
support in future, as mentioned it's mostly the setup side. ABI,
buffer workflow and some details is a separate issue, and I don't
see how further integration aside from what we're already sharing
is beneficial, on opposite it'll complicate things.

>> Pretending they are totally different just because two different
>> people wrote them is a very siloed view.
io_uring zcrx and devmem? They are not, nobody is saying otherwise,
_very_ similar approaches if anything but with different API, which
is the reason we already use common infra.

>>> The devmem TCP callback can implement it in a way feasible to
>>> the project, but it cannot directly post events to an unrelated
>>> API like io_uring. And devmem attaches buffers to a socket,
>>> for which a ring for returning buffers might even be a nuisance.
>>
>> If you can't compose your io_uring completion mechanism with a DMABUF
>> provided backing store then I think it needs more work.

As per above, it conflates devmem TCP with dmabuf.

> exactly. io_uring, page_pool, dmabuf - all kernel building blocks for
> solutions. This why I was pushing for Mina's set not to be using the
> name `devmem` - it is but one type of memory and with dmabuf it should
> not matter if it is gpu or host (or something else later on - cxl?).
Pavel Begunkov June 10, 2024, 7:32 p.m. UTC | #15
On 6/10/24 16:41, Mina Almasry wrote:
> On Mon, Jun 10, 2024 at 5:38 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> Am 10.06.24 um 14:16 schrieb Jason Gunthorpe:
>>> On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote:
>>>> On 6/10/24 01:37, David Wei wrote:
>>>>> On 2024-06-07 17:52, Jason Gunthorpe wrote:
>>>>>> IMHO it seems to compose poorly if you can only use the io_uring
>>>>>> lifecycle model with io_uring registered memory, and not with DMABUF
>>>>>> memory registered through Mina's mechanism.
>>>>> By this, do you mean io_uring must be exclusively used to use this
>>>>> feature?
>>>>>
>>>>> And you'd rather see the two decoupled, so userspace can register w/ say
>>>>> dmabuf then pass it to io_uring?
>>>> Personally, I have no clue what Jason means. You can just as
>>>> well say that it's poorly composable that write(2) to a disk
>>>> cannot post a completion into a XDP ring, or a netlink socket,
>>>> or io_uring's main completion queue, or name any other API.
>>> There is no reason you shouldn't be able to use your fast io_uring
>>> completion and lifecycle flow with DMABUF backed memory. Those are not
>>> widly different things and there is good reason they should work
>>> together.
>>
>> Well there is the fundamental problem that you can't use io_uring to
>> implement the semantics necessary for a dma_fence.
>>
>> That's why we had to reject the io_uring work on DMA-buf sharing from
>> Google a few years ago.
>>
> 
> Any chance someone can link me to this? io_uring, as far as my
> primitive understanding goes, is not yet very adopted at Google, and
> I'm curious what this effort is.
I'm curious as well, I don't remember it floating anywhere in mailing
lists. The only discussion I recall was about
DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT, but it didn't get through only because
someone pushed for evenfds.
Jason Gunthorpe June 10, 2024, 10:15 p.m. UTC | #16
On Mon, Jun 10, 2024 at 08:20:08PM +0100, Pavel Begunkov wrote:
> On 6/10/24 16:16, David Ahern wrote:

> > > There is no reason you shouldn't be able to use your fast io_uring
> > > completion and lifecycle flow with DMABUF backed memory. Those are not
> > > widly different things and there is good reason they should work
> > > together.
> 
> Let's not mix up devmem TCP and dmabuf specifically, as I see it
> your question was concerning the latter: "... DMABUF memory registered
> through Mina's mechanism". io_uring's zcrx can trivially get dmabuf
> support in future, as mentioned it's mostly the setup side. ABI,
> buffer workflow and some details is a separate issue, and I don't
> see how further integration aside from what we're already sharing
> is beneficial, on opposite it'll complicate things.

Again, I am talking about composability here, duplicating the DMABUF
stuff into io_uring is not composable, it is just duplicating things.

It does not match the view that there should be two distinct layers
here, one that provides the pages and one that manages the
lifecycle. As HCH pushes for pages either come from the allocator and
get to use the struct folio or the come from a dmabuf and they
don't. That is it, the only two choices.

The iouring stuff is trying to confuse the source of the pages with
the lifecycle - which is surely convenient, but is why Christoph is
opposing it.

Jason
'Christoph Hellwig' June 11, 2024, 6:26 a.m. UTC | #17
On Mon, Jun 10, 2024 at 09:16:43AM -0600, David Ahern wrote:
> 
> exactly. io_uring, page_pool, dmabuf - all kernel building blocks for
> solutions. This why I was pushing for Mina's set not to be using the
> name `devmem` - it is but one type of memory and with dmabuf it should
> not matter if it is gpu or host (or something else later on - cxl?).

While not really realted to the rest of the discussion I agree.
It really is dmabuf integration now, so let's call it that?
'Christoph Hellwig' June 11, 2024, 6:34 a.m. UTC | #18
On Fri, Jun 07, 2024 at 02:45:55PM +0100, Pavel Begunkov wrote:
> On 6/5/24 09:24, Christoph Hellwig wrote:
> > On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote:
> > > The question for Christoph is what exactly is the objection here? Why we
> > > would not be using well defined ops when we know there will be more
> > > users?
> > 
> > The point is that there should be no more users.  If you need another
> 
> Does that "No more" stops after devmem tcp?

There should be no other memory source other than the page allocator
and dmabuf.  If you need different life time control for your
zero copy proposal don't mix that up with the contol of the memory
source.
Christian König June 11, 2024, 8:21 a.m. UTC | #19
Am 11.06.24 um 08:25 schrieb Christoph Hellwig:
> On Mon, Jun 10, 2024 at 02:38:18PM +0200, Christian König wrote:
>> Well there is the fundamental problem that you can't use io_uring to
>> implement the semantics necessary for a dma_fence.
> What is the exact problem there?

It's an intentional design decision that dma_fences can be waited on 
with quite a bunch of locks held. Including the DMA-buf reservation 
lock, mmap lock, anything page fault related, shrinker etc...

When you give userspace control over the signaling of a dma_fence then 
that has the same effect as returning to userspace with those locks held 
- you can basically trivially deadlock the system.

I think nearly a dozen implementations fell into that trap: 
https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences

It's well understood and documented by now why this approach doesn't 
work. So not much of an issue any more, we just have to reject 
implementations from time to time which try doing the same thing again.

Regards,
Christian.
Mina Almasry June 11, 2024, 5:48 p.m. UTC | #20
On Mon, Jun 10, 2024 at 11:26 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Jun 10, 2024 at 09:16:43AM -0600, David Ahern wrote:
> >
> > exactly. io_uring, page_pool, dmabuf - all kernel building blocks for
> > solutions. This why I was pushing for Mina's set not to be using the
> > name `devmem` - it is but one type of memory and with dmabuf it should
> > not matter if it is gpu or host (or something else later on - cxl?).
>
> While not really realted to the rest of the discussion I agree.
> It really is dmabuf integration now, so let's call it that?

My mental model is that the feature folks care about is the ability to
use TCP with device memory, and dmabuf is an implementation detail
that is the format that device memory is packaged in. Although not
likely given this discussion, in theory we could want to extend devmem
TCP to support p2pdma for nvme, or some other format if a new one
arises in device drivers. I also think it's more obvious to an end
user what 'devmem TCP' aims to do rather than 'dmabuf TCP' especially
if the user is not a kernel developer familiar with dmabuf.
Mina Almasry June 11, 2024, 6:09 p.m. UTC | #21
On Mon, Jun 10, 2024 at 3:15 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Jun 10, 2024 at 08:20:08PM +0100, Pavel Begunkov wrote:
> > On 6/10/24 16:16, David Ahern wrote:
>
> > > > There is no reason you shouldn't be able to use your fast io_uring
> > > > completion and lifecycle flow with DMABUF backed memory. Those are not
> > > > widly different things and there is good reason they should work
> > > > together.
> >
> > Let's not mix up devmem TCP and dmabuf specifically, as I see it
> > your question was concerning the latter: "... DMABUF memory registered
> > through Mina's mechanism". io_uring's zcrx can trivially get dmabuf
> > support in future, as mentioned it's mostly the setup side. ABI,
> > buffer workflow and some details is a separate issue, and I don't
> > see how further integration aside from what we're already sharing
> > is beneficial, on opposite it'll complicate things.
>
> Again, I am talking about composability here, duplicating the DMABUF
> stuff into io_uring is not composable, it is just duplicating things.
>
> It does not match the view that there should be two distinct layers
> here, one that provides the pages and one that manages the
> lifecycle. As HCH pushes for pages either come from the allocator and
> get to use the struct folio or the come from a dmabuf and they
> don't. That is it, the only two choices.
>
> The iouring stuff is trying to confuse the source of the pages with
> the lifecycle - which is surely convenient, but is why Christoph is
> opposing it.
>

Just curious: in Pavel's effort, io_uring - which is not a device - is
trying to share memory with the page_pool, which is also not a device.
And Pavel is being asked to wrap the memory in a dmabuf. Is dmabuf
going to be the kernel's standard for any memory sharing between any 2
components in the future, even when they're not devices? As in you
expect dmabuf exporters which are not devices to be added to the
kernel? Currently the only dmabuf exporter which is not a device
(AFAIK) is udmabuf, which is used for testing and emulation, not
really a production thing, I think.
Jason Gunthorpe June 12, 2024, 12:06 p.m. UTC | #22
On Tue, Jun 11, 2024 at 11:09:15AM -0700, Mina Almasry wrote:

> Just curious: in Pavel's effort, io_uring - which is not a device - is
> trying to share memory with the page_pool, which is also not a device.
> And Pavel is being asked to wrap the memory in a dmabuf. Is dmabuf
> going to be the kernel's standard for any memory sharing between any 2
> components in the future, even when they're not devices?

dmabuf is how we are refcounting non-struct page memory, there is
nothing about it that says it has to be MMIO memory, or even that the
memory doesn't have struct pages.

All it says is that the memory is alive according to dmabuf
refcounting rules. And the importer obviously don't get to touch the
underlying folios, if any.

Jason
David Ahern June 12, 2024, 3:47 p.m. UTC | #23
On 6/12/24 6:06 AM, Jason Gunthorpe wrote:
> On Tue, Jun 11, 2024 at 11:09:15AM -0700, Mina Almasry wrote:
> 
>> Just curious: in Pavel's effort, io_uring - which is not a device - is
>> trying to share memory with the page_pool, which is also not a device.
>> And Pavel is being asked to wrap the memory in a dmabuf. Is dmabuf
>> going to be the kernel's standard for any memory sharing between any 2
>> components in the future, even when they're not devices?
> 
> dmabuf is how we are refcounting non-struct page memory, there is
> nothing about it that says it has to be MMIO memory, or even that the
> memory doesn't have struct pages.
> 
> All it says is that the memory is alive according to dmabuf
> refcounting rules. And the importer obviously don't get to touch the
> underlying folios, if any.
> 

In addition, the io_uring developers should be considering the use case
of device memory. There is no reason for this design to be limited to
host memory. io_uring should not care (it is not peeking inside the
memory buffers); it is just memory references.

One of io_uring's primary benefits is avoiding system calls. io_uring
works with TCP sockets. Let it work with any dmabuf without concern of
memory type. The performance benefits the Google crowd sees with system
call based apps should be even better with io_uring.

Focus on primitives, building blocks with solid APIs for other
subsystems to leverage and let them be wired up in ways you cannot
imagine today.
Pavel Begunkov June 17, 2024, 6:04 p.m. UTC | #24
On 6/11/24 07:34, Christoph Hellwig wrote:
> On Fri, Jun 07, 2024 at 02:45:55PM +0100, Pavel Begunkov wrote:
>> On 6/5/24 09:24, Christoph Hellwig wrote:
>>> On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote:
>>>> The question for Christoph is what exactly is the objection here? Why we
>>>> would not be using well defined ops when we know there will be more
>>>> users?
>>>
>>> The point is that there should be no more users.  If you need another
>>
>> Does that "No more" stops after devmem tcp?
> 
> There should be no other memory source other than the page allocator
> and dmabuf.  If you need different life time control for your
> zero copy proposal don't mix that up with the contol of the memory
> source.

No idea how I'm mixing it up when I was explaining exactly this
all along as well as that the callback (and presumably the call
site in general) you was so eager to nack is used exactly to
implement the life time control.
Pavel Begunkov June 17, 2024, 7:15 p.m. UTC | #25
On 6/10/24 23:15, Jason Gunthorpe wrote:
> On Mon, Jun 10, 2024 at 08:20:08PM +0100, Pavel Begunkov wrote:
>> On 6/10/24 16:16, David Ahern wrote:
> 
>>>> There is no reason you shouldn't be able to use your fast io_uring
>>>> completion and lifecycle flow with DMABUF backed memory. Those are not
>>>> widly different things and there is good reason they should work
>>>> together.
>>
>> Let's not mix up devmem TCP and dmabuf specifically, as I see it
>> your question was concerning the latter: "... DMABUF memory registered
>> through Mina's mechanism". io_uring's zcrx can trivially get dmabuf
>> support in future, as mentioned it's mostly the setup side. ABI,
>> buffer workflow and some details is a separate issue, and I don't
>> see how further integration aside from what we're already sharing
>> is beneficial, on opposite it'll complicate things.
> 
> Again, I am talking about composability here, duplicating the DMABUF
> stuff into io_uring is not composable, it is just duplicating things.

Ok, then registering, say, a dmabuf via devmem TCP and then using it
in io_uring. Let's say we make devmem TCP API to be able to register
a dmabuf without using it, from where io_uring can take ownership
over it and use in the flow. And I strongly believe the same memory
region/dmabuf should never be used by both at the same time and hence
lifetime of any such memory should be exclusively bound to io_uring.

That leaves the user api, where to add memory you need to create
a netlink socket and pass everything through it, which is an extra
step, and then letting know io_uring that it can use the memory, not
forgetting to eject it from netlink. That's not a good api as far as
it goes with io_uring.

I don't think slight duplicating of registration is a problem when
the upside is much cleaner API. Internals, however, can be easily
shared. We can even say that the net stack should provide helpers
like init_page_pool_from_dmabuf_fd() and now allow poking into
related bits aside from it (initialising net_iov / etc.).

> It does not match the view that there should be two distinct layers
> here, one that provides the pages and one that manages the
> lifecycle. As HCH pushes for pages either come from the allocator and
> get to use the struct folio or the come from a dmabuf and they
> don't. That is it, the only two choices.
> 
> The iouring stuff is trying to confuse the source of the pages with
> the lifecycle - which is surely convenient, but is why Christoph is
> opposing it.
'Christoph Hellwig' June 18, 2024, 6:43 a.m. UTC | #26
On Mon, Jun 17, 2024 at 07:04:43PM +0100, Pavel Begunkov wrote:
> > There should be no other memory source other than the page allocator
> > and dmabuf.  If you need different life time control for your
> > zero copy proposal don't mix that up with the contol of the memory
> > source.
> 
> No idea how I'm mixing it up when I was explaining exactly this
> all along as well as that the callback (and presumably the call
> site in general) you was so eager to nack is used exactly to
> implement the life time control.

And that's exactly my point.  You want to use one callback to mix
allocation source and life time control.  That's the perfect recipe
to create an un-extensible un-composable mess.
Pavel Begunkov June 18, 2024, 11:40 a.m. UTC | #27
On 6/18/24 07:43, Christoph Hellwig wrote:
> On Mon, Jun 17, 2024 at 07:04:43PM +0100, Pavel Begunkov wrote:
>>> There should be no other memory source other than the page allocator
>>> and dmabuf.  If you need different life time control for your
>>> zero copy proposal don't mix that up with the contol of the memory
>>> source.
>>
>> No idea how I'm mixing it up when I was explaining exactly this
>> all along as well as that the callback (and presumably the call
>> site in general) you was so eager to nack is used exactly to
>> implement the life time control.
> 
> And that's exactly my point.  You want to use one callback to mix
> allocation source and life time control.  

No, it only takes the role of life time control and doesn't
care about the source. The allocation source step with
corresponding initialisation happens separately and
priorly, at initialisation time.

> That's the perfect recipe
> to create an un-extensible un-composable mess
diff mbox series

Patch

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index b088d131aeb0d..b038b838f042f 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -51,6 +51,7 @@  struct pp_alloc_cache {
  * @dev:	device, for DMA pre-mapping purposes
  * @netdev:	netdev this pool will serve (leave as NULL if none or multiple)
  * @napi:	NAPI which is the sole consumer of pages, otherwise NULL
+ * @queue:	struct netdev_rx_queue this page_pool is being created for.
  * @dma_dir:	DMA mapping direction
  * @max_len:	max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
  * @offset:	DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
@@ -64,6 +65,7 @@  struct page_pool_params {
 		int		nid;
 		struct device	*dev;
 		struct napi_struct *napi;
+		struct netdev_rx_queue *queue;
 		enum dma_data_direction dma_dir;
 		unsigned int	max_len;
 		unsigned int	offset;
@@ -127,6 +129,13 @@  struct page_pool_stats {
 };
 #endif
 
+struct memory_provider_ops {
+	int (*init)(struct page_pool *pool);
+	void (*destroy)(struct page_pool *pool);
+	struct page *(*alloc_netmems)(struct page_pool *pool, gfp_t gfp);
+	bool (*release_page)(struct page_pool *pool, struct page *page);
+};
+
 struct page_pool {
 	struct page_pool_params_fast p;
 
@@ -193,6 +202,9 @@  struct page_pool {
 	 */
 	struct ptr_ring ring;
 
+	void *mp_priv;
+	const struct memory_provider_ops *mp_ops;
+
 #ifdef CONFIG_PAGE_POOL_STATS
 	/* recycle stats are per-cpu to avoid locking */
 	struct page_pool_recycle_stats __percpu *recycle_stats;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index f4444b4e39e63..251c9356c9202 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -26,6 +26,8 @@ 
 
 #include "page_pool_priv.h"
 
+static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
+
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
@@ -186,6 +188,7 @@  static int page_pool_init(struct page_pool *pool,
 			  int cpuid)
 {
 	unsigned int ring_qsize = 1024; /* Default */
+	int err;
 
 	page_pool_struct_check();
 
@@ -267,7 +270,22 @@  static int page_pool_init(struct page_pool *pool,
 	if (pool->dma_map)
 		get_device(pool->p.dev);
 
+	if (pool->mp_ops) {
+		err = pool->mp_ops->init(pool);
+		if (err) {
+			pr_warn("%s() mem-provider init failed %d\n", __func__,
+				err);
+			goto free_ptr_ring;
+		}
+
+		static_branch_inc(&page_pool_mem_providers);
+	}
+
 	return 0;
+
+free_ptr_ring:
+	ptr_ring_cleanup(&pool->ring, NULL);
+	return err;
 }
 
 static void page_pool_uninit(struct page_pool *pool)
@@ -569,7 +587,10 @@  struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
 		return page;
 
 	/* Slow-path: cache empty, do real allocation */
-	page = __page_pool_alloc_pages_slow(pool, gfp);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		page = pool->mp_ops->alloc_netmems(pool, gfp);
+	else
+		page = __page_pool_alloc_pages_slow(pool, gfp);
 	return page;
 }
 EXPORT_SYMBOL(page_pool_alloc_pages);
@@ -627,10 +648,13 @@  void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 void page_pool_return_page(struct page_pool *pool, struct page *page)
 {
 	int count;
+	bool put;
 
-	__page_pool_release_page_dma(pool, page);
-
-	page_pool_clear_pp_info(page);
+	put = true;
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		put = pool->mp_ops->release_page(pool, page);
+	else
+		__page_pool_release_page_dma(pool, page);
 
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -638,7 +662,10 @@  void page_pool_return_page(struct page_pool *pool, struct page *page)
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);
 
-	put_page(page);
+	if (put) {
+		page_pool_clear_pp_info(page);
+		put_page(page);
+	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -937,6 +964,12 @@  static void __page_pool_destroy(struct page_pool *pool)
 
 	page_pool_unlist(pool);
 	page_pool_uninit(pool);
+
+	if (pool->mp_ops) {
+		pool->mp_ops->destroy(pool);
+		static_branch_dec(&page_pool_mem_providers);
+	}
+
 	kfree(pool);
 }