Message ID | 20240530201616.1316526-3-almasrymina@google.com |
---|---|
State | New |
Headers | show |
Series | Device Memory TCP | expand |
On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote: > > I'm unsure if the discussion has been resolved yet. Sending the series > > anyway to get reviews/feedback on the (unrelated) rest of the series. > > As far as I'm concerned it is not. I've not seen any convincing > argument for more than page/folio allocator including larger order / > huge page and dmabuf. > Thanks Christoph, this particular patch series adds dmabuf, so I assume no objection there. I assume the objection is that you want the generic, extensible hooks removed. To be honest, I don't think the hooks are an integral part of the design, and at this point I think we've argued for them enough. I think we can easily achieve the same thing with just raw if statements in a couple of places. We can always add the hooks if and only if we actually justify many memory providers. Any objections to me removing the hooks and directing to memory allocations via simple if statements? Something like (very rough draft, doesn't compile): diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 92be1aaf18ccc..2cc986455bce6 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp) return netmem; /* Slow-path: cache empty, do real allocation */ - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) - netmem = pool->mp_ops->alloc_pages(pool, gfp); + if (unlikely(page_pool_is_dmabuf(pool))) + netmem = mp_dmabuf_devmem_alloc_pages(): else netmem = __page_pool_alloc_pages_slow(pool, gfp); return netmem;
On 6/3/24 15:17, Mina Almasry wrote: > On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote: >> >> On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote: >>> I'm unsure if the discussion has been resolved yet. Sending the series >>> anyway to get reviews/feedback on the (unrelated) rest of the series. >> >> As far as I'm concerned it is not. I've not seen any convincing >> argument for more than page/folio allocator including larger order / >> huge page and dmabuf. >> > > Thanks Christoph, this particular patch series adds dmabuf, so I > assume no objection there. I assume the objection is that you want the > generic, extensible hooks removed. > > To be honest, I don't think the hooks are an integral part of the > design, and at this point I think we've argued for them enough. I > think we can easily achieve the same thing with just raw if statements > in a couple of places. We can always add the hooks if and only if we > actually justify many memory providers. > > Any objections to me removing the hooks and directing to memory > allocations via simple if statements? Something like (very rough > draft, doesn't compile): The question for Christoph is what exactly is the objection here? Why we would not be using well defined ops when we know there will be more users? Repeating what I said in the last thread, for io_uring it's used to implement the flow of buffers from userspace to the kernel, the ABI, which is orthogonal to the issue of what memory type it is and how it came there. And even if you mandate unnecessary dmabuf condoms for user memory in one form or another IMHO for no clear reason, the callbacks (or yet another if-else) would still be needed. Sure, Mina can drop and hard code devmem path to easy the pain for him and delay the discussion, but then shortly after I will be re-sending same shit. So, what's the convincing argument _not_ to have it? > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index 92be1aaf18ccc..2cc986455bce6 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool > *pool, gfp_t gfp) > return netmem; > > /* Slow-path: cache empty, do real allocation */ > - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) > - netmem = pool->mp_ops->alloc_pages(pool, gfp); > + if (unlikely(page_pool_is_dmabuf(pool))) > + netmem = mp_dmabuf_devmem_alloc_pages(): > else > netmem = __page_pool_alloc_pages_slow(pool, gfp); > return netmem; > >
On Mon, Jun 3, 2024 at 7:52 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > > On 6/3/24 15:17, Mina Almasry wrote: > > On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote: > >> > >> On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote: > >>> I'm unsure if the discussion has been resolved yet. Sending the series > >>> anyway to get reviews/feedback on the (unrelated) rest of the series. > >> > >> As far as I'm concerned it is not. I've not seen any convincing > >> argument for more than page/folio allocator including larger order / > >> huge page and dmabuf. > >> > > > > Thanks Christoph, this particular patch series adds dmabuf, so I > > assume no objection there. I assume the objection is that you want the > > generic, extensible hooks removed. > > > > To be honest, I don't think the hooks are an integral part of the > > design, and at this point I think we've argued for them enough. I > > think we can easily achieve the same thing with just raw if statements > > in a couple of places. We can always add the hooks if and only if we > > actually justify many memory providers. > > > > Any objections to me removing the hooks and directing to memory > > allocations via simple if statements? Something like (very rough > > draft, doesn't compile): > > The question for Christoph is what exactly is the objection here? Why we > would not be using well defined ops when we know there will be more > users? Repeating what I said in the last thread, for io_uring it's used > to implement the flow of buffers from userspace to the kernel, the ABI, > which is orthogonal to the issue of what memory type it is and how it > came there. And even if you mandate unnecessary dmabuf condoms for user > memory in one form or another IMHO for no clear reason, the callbacks > (or yet another if-else) would still be needed. > > Sure, Mina can drop and hard code devmem path to easy the pain for > him and delay the discussion, but then shortly after I will be > re-sending same shit. You don't need to re-send the same ops again, right? You can add io uring support without ops. Something like: diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 92be1aaf18ccc..2cc986455bce6 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp) return netmem; /* Slow-path: cache empty, do real allocation */ - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) - netmem = pool->mp_ops->alloc_pages(pool, gfp); + if (unlikely(page_pool_is_dmabuf(pool))) + netmem = mp_dmabuf_devmem_alloc_pages(): + else if (unlikely(page_pool_is_iouring(pool))) + netmem = mp_io_uring_alloc_pages(): else netmem = __page_pool_alloc_pages_slow(pool, gfp); return netmem; So IMO, the ops themselves, which Christoph is repeatedly nacking, are not that important. I humbly think the energy should be spent convincing maintainers of the use case of io uring memory, not the ops. The ops are a cosmetic change to the code, and can be added later. Christoph is nacking the ops because it gives people too much rope [1]. But if you disagree and think the ops themselves are important for a reason I missed, I'm happy waiting until agreement is reached here. Sorry, just voicing my 2 cents. [1] https://lore.kernel.org/netdev/ZjjHUh1eINPg1wkn@infradead.org/
On Mon, Jun 03, 2024 at 07:17:05AM -0700, Mina Almasry wrote: > On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote: > > > I'm unsure if the discussion has been resolved yet. Sending the series > > > anyway to get reviews/feedback on the (unrelated) rest of the series. > > > > As far as I'm concerned it is not. I've not seen any convincing > > argument for more than page/folio allocator including larger order / > > huge page and dmabuf. > > > > Thanks Christoph, this particular patch series adds dmabuf, so I > assume no objection there. I assume the objection is that you want the > generic, extensible hooks removed. Exactly! Note that this isn't a review of the dmabuf bits as there are people more qualified with me. > To be honest, I don't think the hooks are an integral part of the > design, and at this point I think we've argued for them enough. I > think we can easily achieve the same thing with just raw if statements > in a couple of places. We can always add the hooks if and only if we > actually justify many memory providers. > > Any objections to me removing the hooks and directing to memory > allocations via simple if statements? Something like (very rough > draft, doesn't compile): I like this approach, thanks! You might still want to keep the static key, though.
On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote: > The question for Christoph is what exactly is the objection here? Why we > would not be using well defined ops when we know there will be more > users? The point is that there should be no more users. If you need another case you are doing something very wrong.
On 6/3/24 16:43, Mina Almasry wrote: > On Mon, Jun 3, 2024 at 7:52 AM Pavel Begunkov <asml.silence@gmail.com> wrote: >> >> On 6/3/24 15:17, Mina Almasry wrote: >>> On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote: >>>> >>>> On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote: >>>>> I'm unsure if the discussion has been resolved yet. Sending the series >>>>> anyway to get reviews/feedback on the (unrelated) rest of the series. >>>> >>>> As far as I'm concerned it is not. I've not seen any convincing >>>> argument for more than page/folio allocator including larger order / >>>> huge page and dmabuf. >>>> >>> >>> Thanks Christoph, this particular patch series adds dmabuf, so I >>> assume no objection there. I assume the objection is that you want the >>> generic, extensible hooks removed. >>> >>> To be honest, I don't think the hooks are an integral part of the >>> design, and at this point I think we've argued for them enough. I >>> think we can easily achieve the same thing with just raw if statements >>> in a couple of places. We can always add the hooks if and only if we >>> actually justify many memory providers. >>> >>> Any objections to me removing the hooks and directing to memory >>> allocations via simple if statements? Something like (very rough >>> draft, doesn't compile): >> >> The question for Christoph is what exactly is the objection here? Why we >> would not be using well defined ops when we know there will be more >> users? Repeating what I said in the last thread, for io_uring it's used >> to implement the flow of buffers from userspace to the kernel, the ABI, >> which is orthogonal to the issue of what memory type it is and how it >> came there. And even if you mandate unnecessary dmabuf condoms for user >> memory in one form or another IMHO for no clear reason, the callbacks >> (or yet another if-else) would still be needed. >> >> Sure, Mina can drop and hard code devmem path to easy the pain for >> him and delay the discussion, but then shortly after I will be >> re-sending same shit. > > You don't need to re-send the same ops again, right? You can add io > uring support without ops. Something like: > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index 92be1aaf18ccc..2cc986455bce6 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool > *pool, gfp_t gfp) > return netmem; > > /* Slow-path: cache empty, do real allocation */ > - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) > - netmem = pool->mp_ops->alloc_pages(pool, gfp); > + if (unlikely(page_pool_is_dmabuf(pool))) > + netmem = mp_dmabuf_devmem_alloc_pages(): > + else if (unlikely(page_pool_is_iouring(pool))) > + netmem = mp_io_uring_alloc_pages(): > else > netmem = __page_pool_alloc_pages_slow(pool, gfp); > return netmem; > > So IMO, the ops themselves, which Christoph is repeatedly nacking, are > not that important. > > I humbly think the energy should be spent convincing maintainers of > the use case of io uring memory, not the ops. The ops are a cosmetic I haven't seen any arguments against from the (net) maintainers so far. Nor I see any objection against callbacks from them (considering that either option adds an if). And just not to confuse folks, it's just user pages, not some weird special io_uring memory. > change to the code, and can be added later. Christoph is nacking the > ops because it gives people too much rope [1]. Yes, it is cosmetic, just as much as removing it is a cosmetic change. You can apply same "too much rope" argument basically to anything. Take io_uring, nothing would change in the process, it'd still be sent to net and reviewed exactly same way, while being less clean, with poorer subsystem separation, allowing custom formats / argument list, etc. I think it's cleaner with callbacks, Mr. Christoph has other beliefs and keeps coercing to them, even though from time to time it backfires for the author, just personal experience. > But if you disagree and think the ops themselves are important for a > reason I missed, I'm happy waiting until agreement is reached here. > Sorry, just voicing my 2 cents. > > [1] https://lore.kernel.org/netdev/ZjjHUh1eINPg1wkn@infradead.org/ >
On 6/5/24 09:24, Christoph Hellwig wrote: > On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote: >> The question for Christoph is what exactly is the objection here? Why we >> would not be using well defined ops when we know there will be more >> users? > > The point is that there should be no more users. If you need another Does that "No more" stops after devmem tcp? Or after io_uring proposal? For the latter I explained why io_uring has to do it for good design and that's it's not even related to the memory type used. > case you are doing something very wrong. That's not a very illuminating answer
On 6/7/24 7:42 AM, Pavel Begunkov wrote: > I haven't seen any arguments against from the (net) maintainers so > far. Nor I see any objection against callbacks from them (considering > that either option adds an if). I have said before I do not understand why the dmabuf paradigm is not sufficient for both device memory and host memory. A less than ideal control path to put hostmem in a dmabuf wrapper vs extra checks and changes in the datapath. The former should always be preferred. I also do not understand why the ifq cache and overloading xdp functions have stuck around; I always thought both were added by Jonathan to simplify kernel ports during early POC days.
On Fri, Jun 07, 2024 at 08:27:29AM -0600, David Ahern wrote: > On 6/7/24 7:42 AM, Pavel Begunkov wrote: > > I haven't seen any arguments against from the (net) maintainers so > > far. Nor I see any objection against callbacks from them (considering > > that either option adds an if). > > I have said before I do not understand why the dmabuf paradigm is not > sufficient for both device memory and host memory. A less than ideal > control path to put hostmem in a dmabuf wrapper vs extra checks and > changes in the datapath. The former should always be preferred. I think Pavel explained this - his project is principally to replace the lifetime policy of pages in the data plane. He wants to change when a page is considered available for re-allocation because userspace may continue to use the page after the netstack thinks it is done with it. It sounds like having a different source of the pages is the less important part. IMHO it seems to compose poorly if you can only use the io_uring lifecycle model with io_uring registered memory, and not with DMABUF memory registered through Mina's mechanism. Jason
On 6/7/24 15:27, David Ahern wrote: > On 6/7/24 7:42 AM, Pavel Begunkov wrote: >> I haven't seen any arguments against from the (net) maintainers so >> far. Nor I see any objection against callbacks from them (considering >> that either option adds an if). > > I have said before I do not understand why the dmabuf paradigm is not > sufficient for both device memory and host memory. A less than ideal > control path to put hostmem in a dmabuf wrapper vs extra checks and > changes in the datapath. The former should always be preferred. If we're talking about types of memory specifically, I'm not strictly against wrapping into dmabuf in kernel, but that just doesn't give anything. But the main reason for allocations there is the difference in approaches to the api. With io_uring the allocation callback is responsible for getting buffers back from the user (via a shared ring). No locking for the ring, and buffers are already in the context (napi) where they would be consumed from. Removes some headaches for the user (like batching before returning buffers), and should go better with smaller buffers and such. > I also do not understand why the ifq cache I'm not sure what you mean by ifq cache. Can you elaborate? > and overloading xdp functions Assuming it's about setup via xdp, it was marked for remaking in RFCs for longer than desired but it's gone now in our tree (but maybe not in the latest series). > have stuck around; I always thought both were added by Jonathan to > simplify kernel ports during early POC days.
On 6/7/24 16:42, Pavel Begunkov wrote: > On 6/7/24 15:27, David Ahern wrote: >> On 6/7/24 7:42 AM, Pavel Begunkov wrote: >>> I haven't seen any arguments against from the (net) maintainers so >>> far. Nor I see any objection against callbacks from them (considering >>> that either option adds an if). >> >> I have said before I do not understand why the dmabuf paradigm is not >> sufficient for both device memory and host memory. A less than ideal >> control path to put hostmem in a dmabuf wrapper vs extra checks and >> changes in the datapath. The former should always be preferred. > > If we're talking about types of memory specifically, I'm not strictly > against wrapping into dmabuf in kernel, but that just doesn't give > anything. And the reason I don't have too strong of an opinion on that is mainly because it's just setup/cleanup path. > But the main reason for allocations there is the difference in > approaches to the api. With io_uring the allocation callback is > responsible for getting buffers back from the user (via a shared > ring). No locking for the ring, and buffers are already in the > context (napi) where they would be consumed from. Removes some > headaches for the user (like batching before returning buffers), > and should go better with smaller buffers and such. > >> I also do not understand why the ifq cache > > I'm not sure what you mean by ifq cache. Can you elaborate? > >> and overloading xdp functions > > Assuming it's about setup via xdp, it was marked for remaking in > RFCs for longer than desired but it's gone now in our tree (but > maybe not in the latest series). > >> have stuck around; I always thought both were added by Jonathan to >> simplify kernel ports during early POC days. >
On Fri, Jun 7, 2024 at 8:47 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > > On 6/7/24 16:42, Pavel Begunkov wrote: > > On 6/7/24 15:27, David Ahern wrote: > >> On 6/7/24 7:42 AM, Pavel Begunkov wrote: > >>> I haven't seen any arguments against from the (net) maintainers so > >>> far. Nor I see any objection against callbacks from them (considering > >>> that either option adds an if). > >> > >> I have said before I do not understand why the dmabuf paradigm is not > >> sufficient for both device memory and host memory. A less than ideal > >> control path to put hostmem in a dmabuf wrapper vs extra checks and > >> changes in the datapath. The former should always be preferred. > > > > If we're talking about types of memory specifically, I'm not strictly > > against wrapping into dmabuf in kernel, but that just doesn't give > > anything. > > And the reason I don't have too strong of an opinion on that is > mainly because it's just setup/cleanup path. > I agree wrapping io uring in dmabuf seems to be an unnecessary detour. I never understood the need or upside to do that, but it could be a lack of understanding on my part. However, the concern that David brings up may materialize. I've had to spend a lot of time minimizing or justifying checks to the code with page pool benchmarks that detect even 1 cycle regressions. You may be asked to run the same benchmarks and minimize similar overhead. The benchmark in question is Jesper's bench_page_pool_simple. I've forked it and applied it on top of net-next here: https://github.com/mina/linux/commit/927596f87ab5791a8a6ba8597ba2189747396e54 As io_uring ZC comes close to merging, I suspect it would be good to run this to understand the regression in the fast path, if any. If there are no to little regressions, I have no concerns over io uring memory not being wrapped in dmabufs, and David may agree as well. -- Thanks, Mina
On 2024-06-07 17:27, David Ahern wrote: > I also do not understand why the ifq cache and overloading xdp functions > have stuck around; I always thought both were added by Jonathan to > simplify kernel ports during early POC days. Setting up an Rx queue for ZC w/ a different pp will be done properly using the new queue API that Mina merged recently. Those custom XDP hooks will be gone in a non-RFC patchset.
On 6/10/24 01:37, David Wei wrote: > On 2024-06-07 17:52, Jason Gunthorpe wrote: >> IMHO it seems to compose poorly if you can only use the io_uring >> lifecycle model with io_uring registered memory, and not with DMABUF >> memory registered through Mina's mechanism. > > By this, do you mean io_uring must be exclusively used to use this > feature? > > And you'd rather see the two decoupled, so userspace can register w/ say > dmabuf then pass it to io_uring? Personally, I have no clue what Jason means. You can just as well say that it's poorly composable that write(2) to a disk cannot post a completion into a XDP ring, or a netlink socket, or io_uring's main completion queue, or name any other API. The devmem TCP callback can implement it in a way feasible to the project, but it cannot directly post events to an unrelated API like io_uring. And devmem attaches buffers to a socket, for which a ring for returning buffers might even be a nuisance.
On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote: > On 6/10/24 01:37, David Wei wrote: > > On 2024-06-07 17:52, Jason Gunthorpe wrote: > > > IMHO it seems to compose poorly if you can only use the io_uring > > > lifecycle model with io_uring registered memory, and not with DMABUF > > > memory registered through Mina's mechanism. > > > > By this, do you mean io_uring must be exclusively used to use this > > feature? > > > > And you'd rather see the two decoupled, so userspace can register w/ say > > dmabuf then pass it to io_uring? > > Personally, I have no clue what Jason means. You can just as > well say that it's poorly composable that write(2) to a disk > cannot post a completion into a XDP ring, or a netlink socket, > or io_uring's main completion queue, or name any other API. There is no reason you shouldn't be able to use your fast io_uring completion and lifecycle flow with DMABUF backed memory. Those are not widly different things and there is good reason they should work together. Pretending they are totally different just because two different people wrote them is a very siloed view. > The devmem TCP callback can implement it in a way feasible to > the project, but it cannot directly post events to an unrelated > API like io_uring. And devmem attaches buffers to a socket, > for which a ring for returning buffers might even be a nuisance. If you can't compose your io_uring completion mechanism with a DMABUF provided backing store then I think it needs more work. Jason
On Mon, Jun 10, 2024 at 5:38 AM Christian König <christian.koenig@amd.com> wrote: > > Am 10.06.24 um 14:16 schrieb Jason Gunthorpe: > > On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote: > >> On 6/10/24 01:37, David Wei wrote: > >>> On 2024-06-07 17:52, Jason Gunthorpe wrote: > >>>> IMHO it seems to compose poorly if you can only use the io_uring > >>>> lifecycle model with io_uring registered memory, and not with DMABUF > >>>> memory registered through Mina's mechanism. > >>> By this, do you mean io_uring must be exclusively used to use this > >>> feature? > >>> > >>> And you'd rather see the two decoupled, so userspace can register w/ say > >>> dmabuf then pass it to io_uring? > >> Personally, I have no clue what Jason means. You can just as > >> well say that it's poorly composable that write(2) to a disk > >> cannot post a completion into a XDP ring, or a netlink socket, > >> or io_uring's main completion queue, or name any other API. > > There is no reason you shouldn't be able to use your fast io_uring > > completion and lifecycle flow with DMABUF backed memory. Those are not > > widly different things and there is good reason they should work > > together. > > Well there is the fundamental problem that you can't use io_uring to > implement the semantics necessary for a dma_fence. > > That's why we had to reject the io_uring work on DMA-buf sharing from > Google a few years ago. > Any chance someone can link me to this? io_uring, as far as my primitive understanding goes, is not yet very adopted at Google, and I'm curious what this effort is.
On 6/10/24 16:41, Mina Almasry wrote: > On Mon, Jun 10, 2024 at 5:38 AM Christian König > <christian.koenig@amd.com> wrote: >> >> Am 10.06.24 um 14:16 schrieb Jason Gunthorpe: >>> On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote: >>>> On 6/10/24 01:37, David Wei wrote: >>>>> On 2024-06-07 17:52, Jason Gunthorpe wrote: >>>>>> IMHO it seems to compose poorly if you can only use the io_uring >>>>>> lifecycle model with io_uring registered memory, and not with DMABUF >>>>>> memory registered through Mina's mechanism. >>>>> By this, do you mean io_uring must be exclusively used to use this >>>>> feature? >>>>> >>>>> And you'd rather see the two decoupled, so userspace can register w/ say >>>>> dmabuf then pass it to io_uring? >>>> Personally, I have no clue what Jason means. You can just as >>>> well say that it's poorly composable that write(2) to a disk >>>> cannot post a completion into a XDP ring, or a netlink socket, >>>> or io_uring's main completion queue, or name any other API. >>> There is no reason you shouldn't be able to use your fast io_uring >>> completion and lifecycle flow with DMABUF backed memory. Those are not >>> widly different things and there is good reason they should work >>> together. >> >> Well there is the fundamental problem that you can't use io_uring to >> implement the semantics necessary for a dma_fence. >> >> That's why we had to reject the io_uring work on DMA-buf sharing from >> Google a few years ago. >> > > Any chance someone can link me to this? io_uring, as far as my > primitive understanding goes, is not yet very adopted at Google, and > I'm curious what this effort is. I'm curious as well, I don't remember it floating anywhere in mailing lists. The only discussion I recall was about DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT, but it didn't get through only because someone pushed for evenfds.
On Mon, Jun 10, 2024 at 08:20:08PM +0100, Pavel Begunkov wrote: > On 6/10/24 16:16, David Ahern wrote: > > > There is no reason you shouldn't be able to use your fast io_uring > > > completion and lifecycle flow with DMABUF backed memory. Those are not > > > widly different things and there is good reason they should work > > > together. > > Let's not mix up devmem TCP and dmabuf specifically, as I see it > your question was concerning the latter: "... DMABUF memory registered > through Mina's mechanism". io_uring's zcrx can trivially get dmabuf > support in future, as mentioned it's mostly the setup side. ABI, > buffer workflow and some details is a separate issue, and I don't > see how further integration aside from what we're already sharing > is beneficial, on opposite it'll complicate things. Again, I am talking about composability here, duplicating the DMABUF stuff into io_uring is not composable, it is just duplicating things. It does not match the view that there should be two distinct layers here, one that provides the pages and one that manages the lifecycle. As HCH pushes for pages either come from the allocator and get to use the struct folio or the come from a dmabuf and they don't. That is it, the only two choices. The iouring stuff is trying to confuse the source of the pages with the lifecycle - which is surely convenient, but is why Christoph is opposing it. Jason
On Mon, Jun 10, 2024 at 09:16:43AM -0600, David Ahern wrote: > > exactly. io_uring, page_pool, dmabuf - all kernel building blocks for > solutions. This why I was pushing for Mina's set not to be using the > name `devmem` - it is but one type of memory and with dmabuf it should > not matter if it is gpu or host (or something else later on - cxl?). While not really realted to the rest of the discussion I agree. It really is dmabuf integration now, so let's call it that?
On Fri, Jun 07, 2024 at 02:45:55PM +0100, Pavel Begunkov wrote: > On 6/5/24 09:24, Christoph Hellwig wrote: > > On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote: > > > The question for Christoph is what exactly is the objection here? Why we > > > would not be using well defined ops when we know there will be more > > > users? > > > > The point is that there should be no more users. If you need another > > Does that "No more" stops after devmem tcp? There should be no other memory source other than the page allocator and dmabuf. If you need different life time control for your zero copy proposal don't mix that up with the contol of the memory source.
Am 11.06.24 um 08:25 schrieb Christoph Hellwig: > On Mon, Jun 10, 2024 at 02:38:18PM +0200, Christian König wrote: >> Well there is the fundamental problem that you can't use io_uring to >> implement the semantics necessary for a dma_fence. > What is the exact problem there? It's an intentional design decision that dma_fences can be waited on with quite a bunch of locks held. Including the DMA-buf reservation lock, mmap lock, anything page fault related, shrinker etc... When you give userspace control over the signaling of a dma_fence then that has the same effect as returning to userspace with those locks held - you can basically trivially deadlock the system. I think nearly a dozen implementations fell into that trap: https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences It's well understood and documented by now why this approach doesn't work. So not much of an issue any more, we just have to reject implementations from time to time which try doing the same thing again. Regards, Christian.
On Mon, Jun 10, 2024 at 11:26 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Mon, Jun 10, 2024 at 09:16:43AM -0600, David Ahern wrote: > > > > exactly. io_uring, page_pool, dmabuf - all kernel building blocks for > > solutions. This why I was pushing for Mina's set not to be using the > > name `devmem` - it is but one type of memory and with dmabuf it should > > not matter if it is gpu or host (or something else later on - cxl?). > > While not really realted to the rest of the discussion I agree. > It really is dmabuf integration now, so let's call it that? My mental model is that the feature folks care about is the ability to use TCP with device memory, and dmabuf is an implementation detail that is the format that device memory is packaged in. Although not likely given this discussion, in theory we could want to extend devmem TCP to support p2pdma for nvme, or some other format if a new one arises in device drivers. I also think it's more obvious to an end user what 'devmem TCP' aims to do rather than 'dmabuf TCP' especially if the user is not a kernel developer familiar with dmabuf.
On Mon, Jun 10, 2024 at 3:15 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Mon, Jun 10, 2024 at 08:20:08PM +0100, Pavel Begunkov wrote: > > On 6/10/24 16:16, David Ahern wrote: > > > > > There is no reason you shouldn't be able to use your fast io_uring > > > > completion and lifecycle flow with DMABUF backed memory. Those are not > > > > widly different things and there is good reason they should work > > > > together. > > > > Let's not mix up devmem TCP and dmabuf specifically, as I see it > > your question was concerning the latter: "... DMABUF memory registered > > through Mina's mechanism". io_uring's zcrx can trivially get dmabuf > > support in future, as mentioned it's mostly the setup side. ABI, > > buffer workflow and some details is a separate issue, and I don't > > see how further integration aside from what we're already sharing > > is beneficial, on opposite it'll complicate things. > > Again, I am talking about composability here, duplicating the DMABUF > stuff into io_uring is not composable, it is just duplicating things. > > It does not match the view that there should be two distinct layers > here, one that provides the pages and one that manages the > lifecycle. As HCH pushes for pages either come from the allocator and > get to use the struct folio or the come from a dmabuf and they > don't. That is it, the only two choices. > > The iouring stuff is trying to confuse the source of the pages with > the lifecycle - which is surely convenient, but is why Christoph is > opposing it. > Just curious: in Pavel's effort, io_uring - which is not a device - is trying to share memory with the page_pool, which is also not a device. And Pavel is being asked to wrap the memory in a dmabuf. Is dmabuf going to be the kernel's standard for any memory sharing between any 2 components in the future, even when they're not devices? As in you expect dmabuf exporters which are not devices to be added to the kernel? Currently the only dmabuf exporter which is not a device (AFAIK) is udmabuf, which is used for testing and emulation, not really a production thing, I think.
On Tue, Jun 11, 2024 at 11:09:15AM -0700, Mina Almasry wrote: > Just curious: in Pavel's effort, io_uring - which is not a device - is > trying to share memory with the page_pool, which is also not a device. > And Pavel is being asked to wrap the memory in a dmabuf. Is dmabuf > going to be the kernel's standard for any memory sharing between any 2 > components in the future, even when they're not devices? dmabuf is how we are refcounting non-struct page memory, there is nothing about it that says it has to be MMIO memory, or even that the memory doesn't have struct pages. All it says is that the memory is alive according to dmabuf refcounting rules. And the importer obviously don't get to touch the underlying folios, if any. Jason
On 6/12/24 6:06 AM, Jason Gunthorpe wrote: > On Tue, Jun 11, 2024 at 11:09:15AM -0700, Mina Almasry wrote: > >> Just curious: in Pavel's effort, io_uring - which is not a device - is >> trying to share memory with the page_pool, which is also not a device. >> And Pavel is being asked to wrap the memory in a dmabuf. Is dmabuf >> going to be the kernel's standard for any memory sharing between any 2 >> components in the future, even when they're not devices? > > dmabuf is how we are refcounting non-struct page memory, there is > nothing about it that says it has to be MMIO memory, or even that the > memory doesn't have struct pages. > > All it says is that the memory is alive according to dmabuf > refcounting rules. And the importer obviously don't get to touch the > underlying folios, if any. > In addition, the io_uring developers should be considering the use case of device memory. There is no reason for this design to be limited to host memory. io_uring should not care (it is not peeking inside the memory buffers); it is just memory references. One of io_uring's primary benefits is avoiding system calls. io_uring works with TCP sockets. Let it work with any dmabuf without concern of memory type. The performance benefits the Google crowd sees with system call based apps should be even better with io_uring. Focus on primitives, building blocks with solid APIs for other subsystems to leverage and let them be wired up in ways you cannot imagine today.
On 6/11/24 07:34, Christoph Hellwig wrote: > On Fri, Jun 07, 2024 at 02:45:55PM +0100, Pavel Begunkov wrote: >> On 6/5/24 09:24, Christoph Hellwig wrote: >>> On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote: >>>> The question for Christoph is what exactly is the objection here? Why we >>>> would not be using well defined ops when we know there will be more >>>> users? >>> >>> The point is that there should be no more users. If you need another >> >> Does that "No more" stops after devmem tcp? > > There should be no other memory source other than the page allocator > and dmabuf. If you need different life time control for your > zero copy proposal don't mix that up with the contol of the memory > source. No idea how I'm mixing it up when I was explaining exactly this all along as well as that the callback (and presumably the call site in general) you was so eager to nack is used exactly to implement the life time control.
On 6/10/24 23:15, Jason Gunthorpe wrote: > On Mon, Jun 10, 2024 at 08:20:08PM +0100, Pavel Begunkov wrote: >> On 6/10/24 16:16, David Ahern wrote: > >>>> There is no reason you shouldn't be able to use your fast io_uring >>>> completion and lifecycle flow with DMABUF backed memory. Those are not >>>> widly different things and there is good reason they should work >>>> together. >> >> Let's not mix up devmem TCP and dmabuf specifically, as I see it >> your question was concerning the latter: "... DMABUF memory registered >> through Mina's mechanism". io_uring's zcrx can trivially get dmabuf >> support in future, as mentioned it's mostly the setup side. ABI, >> buffer workflow and some details is a separate issue, and I don't >> see how further integration aside from what we're already sharing >> is beneficial, on opposite it'll complicate things. > > Again, I am talking about composability here, duplicating the DMABUF > stuff into io_uring is not composable, it is just duplicating things. Ok, then registering, say, a dmabuf via devmem TCP and then using it in io_uring. Let's say we make devmem TCP API to be able to register a dmabuf without using it, from where io_uring can take ownership over it and use in the flow. And I strongly believe the same memory region/dmabuf should never be used by both at the same time and hence lifetime of any such memory should be exclusively bound to io_uring. That leaves the user api, where to add memory you need to create a netlink socket and pass everything through it, which is an extra step, and then letting know io_uring that it can use the memory, not forgetting to eject it from netlink. That's not a good api as far as it goes with io_uring. I don't think slight duplicating of registration is a problem when the upside is much cleaner API. Internals, however, can be easily shared. We can even say that the net stack should provide helpers like init_page_pool_from_dmabuf_fd() and now allow poking into related bits aside from it (initialising net_iov / etc.). > It does not match the view that there should be two distinct layers > here, one that provides the pages and one that manages the > lifecycle. As HCH pushes for pages either come from the allocator and > get to use the struct folio or the come from a dmabuf and they > don't. That is it, the only two choices. > > The iouring stuff is trying to confuse the source of the pages with > the lifecycle - which is surely convenient, but is why Christoph is > opposing it.
On Mon, Jun 17, 2024 at 07:04:43PM +0100, Pavel Begunkov wrote: > > There should be no other memory source other than the page allocator > > and dmabuf. If you need different life time control for your > > zero copy proposal don't mix that up with the contol of the memory > > source. > > No idea how I'm mixing it up when I was explaining exactly this > all along as well as that the callback (and presumably the call > site in general) you was so eager to nack is used exactly to > implement the life time control. And that's exactly my point. You want to use one callback to mix allocation source and life time control. That's the perfect recipe to create an un-extensible un-composable mess.
On 6/18/24 07:43, Christoph Hellwig wrote: > On Mon, Jun 17, 2024 at 07:04:43PM +0100, Pavel Begunkov wrote: >>> There should be no other memory source other than the page allocator >>> and dmabuf. If you need different life time control for your >>> zero copy proposal don't mix that up with the contol of the memory >>> source. >> >> No idea how I'm mixing it up when I was explaining exactly this >> all along as well as that the callback (and presumably the call >> site in general) you was so eager to nack is used exactly to >> implement the life time control. > > And that's exactly my point. You want to use one callback to mix > allocation source and life time control. No, it only takes the role of life time control and doesn't care about the source. The allocation source step with corresponding initialisation happens separately and priorly, at initialisation time. > That's the perfect recipe > to create an un-extensible un-composable mess
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index b088d131aeb0d..b038b838f042f 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -51,6 +51,7 @@ struct pp_alloc_cache { * @dev: device, for DMA pre-mapping purposes * @netdev: netdev this pool will serve (leave as NULL if none or multiple) * @napi: NAPI which is the sole consumer of pages, otherwise NULL + * @queue: struct netdev_rx_queue this page_pool is being created for. * @dma_dir: DMA mapping direction * @max_len: max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV * @offset: DMA sync address offset for PP_FLAG_DMA_SYNC_DEV @@ -64,6 +65,7 @@ struct page_pool_params { int nid; struct device *dev; struct napi_struct *napi; + struct netdev_rx_queue *queue; enum dma_data_direction dma_dir; unsigned int max_len; unsigned int offset; @@ -127,6 +129,13 @@ struct page_pool_stats { }; #endif +struct memory_provider_ops { + int (*init)(struct page_pool *pool); + void (*destroy)(struct page_pool *pool); + struct page *(*alloc_netmems)(struct page_pool *pool, gfp_t gfp); + bool (*release_page)(struct page_pool *pool, struct page *page); +}; + struct page_pool { struct page_pool_params_fast p; @@ -193,6 +202,9 @@ struct page_pool { */ struct ptr_ring ring; + void *mp_priv; + const struct memory_provider_ops *mp_ops; + #ifdef CONFIG_PAGE_POOL_STATS /* recycle stats are per-cpu to avoid locking */ struct page_pool_recycle_stats __percpu *recycle_stats; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index f4444b4e39e63..251c9356c9202 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -26,6 +26,8 @@ #include "page_pool_priv.h" +static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers); + #define DEFER_TIME (msecs_to_jiffies(1000)) #define DEFER_WARN_INTERVAL (60 * HZ) @@ -186,6 +188,7 @@ static int page_pool_init(struct page_pool *pool, int cpuid) { unsigned int ring_qsize = 1024; /* Default */ + int err; page_pool_struct_check(); @@ -267,7 +270,22 @@ static int page_pool_init(struct page_pool *pool, if (pool->dma_map) get_device(pool->p.dev); + if (pool->mp_ops) { + err = pool->mp_ops->init(pool); + if (err) { + pr_warn("%s() mem-provider init failed %d\n", __func__, + err); + goto free_ptr_ring; + } + + static_branch_inc(&page_pool_mem_providers); + } + return 0; + +free_ptr_ring: + ptr_ring_cleanup(&pool->ring, NULL); + return err; } static void page_pool_uninit(struct page_pool *pool) @@ -569,7 +587,10 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp) return page; /* Slow-path: cache empty, do real allocation */ - page = __page_pool_alloc_pages_slow(pool, gfp); + if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) + page = pool->mp_ops->alloc_netmems(pool, gfp); + else + page = __page_pool_alloc_pages_slow(pool, gfp); return page; } EXPORT_SYMBOL(page_pool_alloc_pages); @@ -627,10 +648,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) void page_pool_return_page(struct page_pool *pool, struct page *page) { int count; + bool put; - __page_pool_release_page_dma(pool, page); - - page_pool_clear_pp_info(page); + put = true; + if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) + put = pool->mp_ops->release_page(pool, page); + else + __page_pool_release_page_dma(pool, page); /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. @@ -638,7 +662,10 @@ void page_pool_return_page(struct page_pool *pool, struct page *page) count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt); trace_page_pool_state_release(pool, page, count); - put_page(page); + if (put) { + page_pool_clear_pp_info(page); + put_page(page); + } /* An optimization would be to call __free_pages(page, pool->p.order) * knowing page is not part of page-cache (thus avoiding a * __page_cache_release() call). @@ -937,6 +964,12 @@ static void __page_pool_destroy(struct page_pool *pool) page_pool_unlist(pool); page_pool_uninit(pool); + + if (pool->mp_ops) { + pool->mp_ops->destroy(pool); + static_branch_dec(&page_pool_mem_providers); + } + kfree(pool); }