Message ID: 20231208005250.2910004-1-almasrymina@google.com
Series: Device Memory TCP
On Thu, Dec 7, 2023 at 4:52 PM Mina Almasry <almasrymina@google.com> wrote:
>
> Major changes in v1:
> --------------------
>
> 1. Implemented MVP queue API ndos to remove the userspace-visible
>    driver reset.
>
> 2. Fixed issues in the napi_pp_put_page() devmem frag unref path.
>
> 3. Removed RFC tag.
>
> Addressed many smaller comments across all the patches (patches have
> individual change logs).
>
> Full tree including the rest of the GVE driver changes:
> https://github.com/mina/linux/commits/tcpdevmem-v1
>
> Cc: Yunsheng Lin <linyunsheng@huawei.com>
> Cc: Shailend Chand <shailend@google.com>
> Cc: Harshitha Ramamurthy <hramamurthy@google.com>

Welp, I messed up the subject line. It should say [PATCH net-next...]
across all the patches. This may trip up bots and email filters. If
this is annoying, I'll resend with the fixed subject line after the
24hr cooldown period. Sorry about that.
On 12/7/23 5:52 PM, Mina Almasry wrote:
> In tcp_recvmsg_locked(), detect if the skb being received by the user
> is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
> flag - pass it to tcp_recvmsg_devmem() for custom handling.
>
> tcp_recvmsg_devmem() copies any data in the skb header to the linear
> buffer, and returns a cmsg to the user indicating the number of bytes
> returned in the linear buffer.
>
> tcp_recvmsg_devmem() then loops over the inaccessible devmem skb frags,
> and returns to the user a cmsg_devmem indicating the location of the
> data in the dmabuf device memory. cmsg_devmem contains this information:
>
> 1. the offset into the dmabuf where the payload starts: 'frag_offset'.
> 2. the size of the frag: 'frag_size'.
> 3. an opaque token 'frag_token' to return to the kernel when the buffer
>    is to be released.
>
> The pages awaiting freeing are stored in the newly added
> sk->sk_user_pages, and each page passed to userspace is get_page()'d.
> This reference is dropped once userspace indicates that it is done
> reading this page. All pages are released when the socket is destroyed.
>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>
> ---
>
> Changes in v1:
> - Added dmabuf_id to dmabuf_cmsg (David/Stan).
> - Devmem -> dmabuf (David).
> - Change tcp_recvmsg_dmabuf() check to skb->dmabuf (Paolo).
> - Use __skb_frag_ref() & napi_pp_put_page() for refcounting (Yunsheng).
>
> RFC v3:
> - Fixed issue with put_cmsg() failing silently.

What happens if a retransmitted packet is received, or an rx window is
closed and a probe is received, where the kernel drops the skb - are
the iov reference(s) in the skb returned to the pool by the stack and
ready for use again?
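For readers following along, here is a minimal userspace sketch of the
receive flow described in the quoted commit message. The struct
dmabuf_cmsg layout (inferred from the v1 changelog), the MSG_SOCK_DEVMEM
flag value, and the cmsg-type handling are assumptions for illustration
only, not the exact uapi from this series:

/* Hypothetical userspace receive loop for the flow described above.
 * The struct dmabuf_cmsg layout and the cmsg handling are assumptions
 * based on the changelog ("Added dmabuf_id to dmabuf_cmsg"), not the
 * exact uapi from this series.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/types.h>

#ifndef MSG_SOCK_DEVMEM
#define MSG_SOCK_DEVMEM 0x2000000	/* assumed flag value */
#endif

struct dmabuf_cmsg {			/* assumed layout */
	__u64 frag_offset;		/* offset of payload in the dmabuf */
	__u32 frag_size;		/* length of this frag */
	__u32 frag_token;		/* token to return when done reading */
	__u32 dmabuf_id;		/* id of the bound dmabuf */
};

static void recv_devmem_once(int fd, char *linear, size_t linear_len)
{
	char ctrl[CMSG_SPACE(sizeof(struct dmabuf_cmsg)) * 16];
	struct iovec iov = { .iov_base = linear, .iov_len = linear_len };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = ctrl, .msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cm;
	ssize_t ret;

	/* Protocol headers land in the linear buffer (ret bytes);
	 * payload locations arrive as cmsgs pointing into the dmabuf.
	 */
	ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);
	if (ret < 0)
		return;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct dmabuf_cmsg *dc;

		if (cm->cmsg_level != SOL_SOCKET)
			continue;	/* cmsg type check elided/assumed */
		dc = (struct dmabuf_cmsg *)CMSG_DATA(cm);
		printf("frag: dmabuf %u off %llu len %u token %u\n",
		       dc->dmabuf_id, (unsigned long long)dc->frag_offset,
		       dc->frag_size, dc->frag_token);
		/* Consume the payload, then hand dc->frag_token back to
		 * the kernel via the release API so it can drop the page
		 * reference it took on behalf of userspace.
		 */
	}
}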
On 12/7/23 5:52 PM, Mina Almasry wrote:
> Major changes in v1:
> --------------------
>
> 1. Implemented MVP queue API ndos to remove the userspace-visible
>    driver reset.
>
> 2. Fixed issues in the napi_pp_put_page() devmem frag unref path.
>
> 3. Removed RFC tag.
>
> Addressed many smaller comments across all the patches (patches have
> individual change logs).
>
> Full tree including the rest of the GVE driver changes:
> https://github.com/mina/linux/commits/tcpdevmem-v1

Still a lot of DEVMEM references (e.g., socket API). Any reason not to
move those to DMABUF?
On Fri, Dec 8, 2023 at 9:56 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 12/7/23 5:52 PM, Mina Almasry wrote:
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index b8c8be5a912e..30667e4c3b95 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -2120,6 +2120,41 @@ static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx)
> >  	return err;
> >  }
> >
> > +struct page_pool_iov *netdev_alloc_dmabuf(struct netdev_dmabuf_binding *binding)
> > +{
> > +	struct dmabuf_genpool_chunk_owner *owner;
> > +	struct page_pool_iov *ppiov;
> > +	unsigned long dma_addr;
> > +	ssize_t offset;
> > +	ssize_t index;
> > +
> > +	dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE,
>
> Any reason not to allow allocation sizes other than PAGE_SIZE? e.g.,
> 2048 for smaller MTUs or 8192 for larger ones. It can be a property of
> the page_pool and constant across allocations vs allowing different
> sizes for each allocation.

Only for simplicity. Supporting non-PAGE_SIZE allocations is certainly
possible, but in my estimation it's a huge can of worms worthy of its
own series. I find this series complicated enough to implement, review,
and support as-is, and if reasonable I would like to punt that to a
future improvement.

At the minimum, I think the needed changes are (see the sketch after
this list for the first one):

1. The memory provider needs to report the alloc size to the page_pool.

2. The page_pool needs to handle non-PAGE_SIZE memory regions.

3. The drivers need to handle non-PAGE_SIZE memory regions. Drivers
   today handle fragged pages, but that is different because it's a
   PAGE_SIZE region that is fragged; this would be a non-PAGE_SIZE
   region in the first place.

4. Any PAGE_SIZE assumptions in the entire net stack need to be removed.

At Google we mostly use page-aligned MTUs, so we're likely not that
interested in sub-PAGE_SIZE allocations, but we are interested in
n * PAGE_SIZE allocations - as, I hope, a separate followup effort.
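As a rough illustration of change #1 in the list above, a hedged sketch
of how the allocation size might become a property of the binding
instead of a hard-coded PAGE_SIZE. The alloc_size field and the
simplified function body are hypothetical; only gen_pool_alloc_owner()
and the type names come from the quoted diff:

/* Hypothetical sketch: make the allocation size a property of the
 * binding rather than hard-coding PAGE_SIZE. The alloc_size field and
 * the simplified body are illustrative, not from the series.
 */
struct netdev_dmabuf_binding {
	struct gen_pool *chunk_pool;
	size_t alloc_size;		/* e.g. 2048, PAGE_SIZE, 8192 */
	/* ...remaining fields from the real series elided... */
};

struct page_pool_iov *netdev_alloc_dmabuf(struct netdev_dmabuf_binding *binding)
{
	struct dmabuf_genpool_chunk_owner *owner;
	unsigned long dma_addr;

	/* Change #1: allocate binding->alloc_size bytes instead of
	 * PAGE_SIZE, and report that size to the page_pool so the pool
	 * and drivers (changes #2 and #3) can honor it.
	 */
	dma_addr = gen_pool_alloc_owner(binding->chunk_pool,
					binding->alloc_size,
					(void **)&owner);
	if (!dma_addr)
		return NULL;

	/* ...ppiov lookup within the chunk would proceed as in the
	 * quoted diff, with the PAGE_SIZE-based index math replaced by
	 * alloc_size-based math. Elided in this sketch.
	 */
	return NULL;
}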
On Thu, Dec 07, 2023 at 04:52:31PM -0800, Mina Almasry wrote:

[...]

> Today, the majority of the Device-to-Device data transfers the network are

'the network' in the above can be removed.

> implemented as the following low level operations: Device-to-Host copy,
> Host-to-Host network transfer, and Host-to-Device copy.
>

[...]

> ** Part 5: recvmsg() APIs
>
> We define user APIs for the user to send and receive device memory.
>
> Not included with this RFC is the GVE devmem TCP support, just to

no more RFC

> simplify the review. Code available here if desired:
> https://github.com/mina/linux/tree/tcpdevmem
>
> This RFC is built on top of net-next with Jakub's pp-providers changes

no more RFC

[...]

> ** Test Setup
>
> Kernel: net-next with this RFC and memory provider API cherry-picked

no more RFC

> locally.
>
> Hardware: Google Cloud A3 VMs.
>
> NIC: GVE with header split & RSS & flow steering support.
>
On Thu, Dec 07, 2023 at 04:52:32PM -0800, Mina Almasry wrote:
> From: Jakub Kicinski <kuba@kernel.org>
>
> Releasing the DMA mapping will be useful for other types
> of pages, so factor it out. Make sure the compiler inlines it,
> to avoid any regressions.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Mina Almasry <almasrymina@google.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>
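For context, a minimal sketch of the kind of refactor the quoted commit
message describes: pulling the DMA unmap out of the page release path
into an always-inlined helper. The function name and body here are an
approximation reconstructed from page_pool internals, not a verbatim
copy of the patch:

/* Approximate sketch; name and body are reconstructed, not copied
 * from the patch.
 */
static __always_inline
void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
{
	dma_addr_t dma;

	/* Nothing to undo if the pool never mapped this page. */
	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
		return;

	dma = page_pool_get_dma_addr(page);

	/* Drop the mapping taken at allocation time; once unmapped,
	 * the page cannot be returned to the pool's caches.
	 */
	dma_unmap_page_attrs(pool->p.dev, dma,
			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
			     DMA_ATTR_SKIP_CPU_SYNC);
	page_pool_set_dma_addr(page, 0);
}

static void page_pool_return_page(struct page_pool *pool, struct page *page)
{
	__page_pool_release_page_dma(pool, page);
	/* ...rest of the release path (stats, put_page) unchanged... */
}

Factoring the unmap into its own helper lets future non-page memory
types reuse the DMA release logic, while __always_inline keeps the
existing page path free of a new function call.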
On Thu, Dec 14, 2023 at 06:20:27AM +0000, patchwork-bot+netdevbpf@kernel.org wrote:
> Hello:
>
> This series was applied to netdev/net-next.git (main)
> by Jakub Kicinski <kuba@kernel.org>:

Umm, this is still very broken in interaction with other subsystems.
Please don't push ahead so quickly.
On Wed, Dec 13, 2023 at 10:51:25PM -0800, Mina Almasry wrote:
> On Wed, Dec 13, 2023 at 10:49 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Dec 14, 2023 at 06:20:27AM +0000, patchwork-bot+netdevbpf@kernel.org wrote:
> > > Hello:
> > >
> > > This series was applied to netdev/net-next.git (main)
> > > by Jakub Kicinski <kuba@kernel.org>:
> >
> > Umm, this is still very broken in interaction with other subsystems.
> > Please don't push ahead so quickly.
>
> The bot is just a bit optimistic. Only this first patch was applied.
> It does not interact with other subsystems.
>
> - [net-next,v1,01/16] net: page_pool: factor out releasing DMA from
>   releasing the page

Ah, that makes sense. Thanks for the update!