Message ID | 20201202220945.911116-2-arjunroy.kdev@gmail.com |
---|---|
State | Superseded |
Headers | show |
Series | Perf. optimizations for TCP Recv. Zerocopy | expand |
On Wed, 2 Dec 2020 14:09:38 -0800 Arjun Roy <arjunroy.kdev@gmail.com> wrote: > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > index cfcb10b75483..62db78b9c1a0 100644 > --- a/include/uapi/linux/tcp.h > +++ b/include/uapi/linux/tcp.h > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive { > __u32 recv_skip_hint; /* out: amount of bytes to skip */ > __u32 inq; /* out: amount of bytes in read queue */ > __s32 err; /* out: socket error */ > + __u64 copybuf_address; /* in: copybuf address (small reads) */ > + __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ > }; > #endif /* _UAPI_LINUX_TCP_H */ You can't safely grow the size of a userspace API without handling the case of older applications. Logic in setsockopt() would have to handle both old and new sizes of the structure.
On Wed, Dec 2, 2020 at 4:15 PM Stephen Hemminger <stephen@networkplumber.org> wrote: > > On Wed, 2 Dec 2020 14:09:38 -0800 > Arjun Roy <arjunroy.kdev@gmail.com> wrote: > > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > > index cfcb10b75483..62db78b9c1a0 100644 > > --- a/include/uapi/linux/tcp.h > > +++ b/include/uapi/linux/tcp.h > > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive { > > __u32 recv_skip_hint; /* out: amount of bytes to skip */ > > __u32 inq; /* out: amount of bytes in read queue */ > > __s32 err; /* out: socket error */ > > + __u64 copybuf_address; /* in: copybuf address (small reads) */ > > + __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ > > }; > > #endif /* _UAPI_LINUX_TCP_H */ > > You can't safely grow the size of a userspace API without handling the > case of older applications. Logic in setsockopt() would have to handle > both old and new sizes of the structure. Acknowledged, but tcp zerocopy receive is a special case: it does not exist in setsockopt(). Regarding old applications in the getsockopt() case: the current layout should handle older and newer callers as is (it devolves to only use the features the application provides buffer space for). Please note also v3 for this patchset; it's the same as this but I forgot the signed-off-by statements here in v2. Thanks, -Arjun
From: Stephen Hemminger > Sent: 03 December 2020 00:15 > > On Wed, 2 Dec 2020 14:09:38 -0800 > Arjun Roy <arjunroy.kdev@gmail.com> wrote: > > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > > index cfcb10b75483..62db78b9c1a0 100644 > > --- a/include/uapi/linux/tcp.h > > +++ b/include/uapi/linux/tcp.h > > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive { > > __u32 recv_skip_hint; /* out: amount of bytes to skip */ > > __u32 inq; /* out: amount of bytes in read queue */ > > __s32 err; /* out: socket error */ > > + __u64 copybuf_address; /* in: copybuf address (small reads) */ > > + __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ You need to swap the order of the above fields to avoid padding and differing alignments for 32bit and 64bit apps. > > }; > > #endif /* _UAPI_LINUX_TCP_H */ > > You can't safely grow the size of a userspace API without handling the > case of older applications. Logic in setsockopt() would have to handle > both old and new sizes of the structure. You also have to allow for old (working) applications being recompiled with the new headers. So you cannot rely on the fields being zero even if you are passed the size of the structure. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Dec 4, 2020 at 12:01 AM David Laight <David.Laight@aculab.com> wrote: > > From: Stephen Hemminger > > Sent: 03 December 2020 00:15 > > > > On Wed, 2 Dec 2020 14:09:38 -0800 > > Arjun Roy <arjunroy.kdev@gmail.com> wrote: > > > > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > > > index cfcb10b75483..62db78b9c1a0 100644 > > > --- a/include/uapi/linux/tcp.h > > > +++ b/include/uapi/linux/tcp.h > > > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive { > > > __u32 recv_skip_hint; /* out: amount of bytes to skip */ > > > __u32 inq; /* out: amount of bytes in read queue */ > > > __s32 err; /* out: socket error */ > > > + __u64 copybuf_address; /* in: copybuf address (small reads) */ > > > + __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ > > You need to swap the order of the above fields to avoid padding > and differing alignments for 32bit and 64bit apps. I do not think so. Please review this patch series carefully. > > > > }; > > > #endif /* _UAPI_LINUX_TCP_H */ > > > > You can't safely grow the size of a userspace API without handling the > > case of older applications. Logic in setsockopt() would have to handle > > both old and new sizes of the structure. > > You also have to allow for old (working) applications being recompiled > with the new headers. > So you cannot rely on the fields being zero even if you are passed > the size of the structure. This is too late, there is precedent for this. We already mentioned this in tcp_mmap.c reference program. commit bf5525f3a8e3248be5aa5defe5aaadd60e1c1ba1 Author: Eric Dumazet <edumazet@google.com> Date: Tue May 5 20:51:06 2020 -0700 selftests: net: tcp_mmap: clear whole tcp_zerocopy_receive struct We added fields in tcp_zerocopy_receive structure, so make sure to clear all fields to not pass garbage to the kernel. We were lucky because recent additions added 'out' parameters, still we need to clean our reference implementation, before folks copy/paste it. Fixes: c8856c051454 ("tcp-zerocopy: Return inq along with tcp receive zerocopy.") Fixes: 33946518d493 ("tcp-zerocopy: Return sk_err (if set) along with tcp receive zerocopy.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff --git a/tools/testing/selftests/net/tcp_mmap.c b/tools/testing/selftests/net/tcp_mmap.c index 35505b31e5cc092453ea7b72d9dba45bed2d6549..62171fd638c817dabe2d988f3cfae74522112584 100644 --- a/tools/testing/selftests/net/tcp_mmap.c +++ b/tools/testing/selftests/net/tcp_mmap.c @@ -165,9 +165,10 @@ void *child_thread(void *arg) socklen_t zc_len = sizeof(zc); int res; + memset(&zc, 0, sizeof(zc)); zc.address = (__u64)((unsigned long)addr); zc.length = chunk_size; - zc.recv_skip_hint = 0; + res = getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len); if (res == -1) > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) >
On Thu, Dec 3, 2020 at 3:01 PM David Laight <David.Laight@aculab.com> wrote: > > From: Stephen Hemminger > > Sent: 03 December 2020 00:15 > > > > On Wed, 2 Dec 2020 14:09:38 -0800 > > Arjun Roy <arjunroy.kdev@gmail.com> wrote: > > > > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > > > index cfcb10b75483..62db78b9c1a0 100644 > > > --- a/include/uapi/linux/tcp.h > > > +++ b/include/uapi/linux/tcp.h > > > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive { > > > __u32 recv_skip_hint; /* out: amount of bytes to skip */ > > > __u32 inq; /* out: amount of bytes in read queue */ > > > __s32 err; /* out: socket error */ > > > + __u64 copybuf_address; /* in: copybuf address (small reads) */ > > > + __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ > > You need to swap the order of the above fields to avoid padding > and differing alignments for 32bit and 64bit apps. > Just to double check, are you referring to the order of copybuf_address and copybuf_len? If so, it seems that the current ordering is not creating any alignment holes, but flipping it would: https://godbolt.org/z/bdxP6b > > > }; > > > #endif /* _UAPI_LINUX_TCP_H */ > > > > You can't safely grow the size of a userspace API without handling the > > case of older applications. Logic in setsockopt() would have to handle > > both old and new sizes of the structure. > > You also have to allow for old (working) applications being recompiled > with the new headers. > So you cannot rely on the fields being zero even if you are passed > the size of the structure. > I think this should already be taken care of in the current code; the full-sized struct with new fields is being zero-initialized, then we're getting the user-provided optlen, then copying from userspace only that much data. So the newer fields would be zero in that case, so this should handle the case of new kernels but old applications. Does this address the concern, or am I misunderstanding? Thanks, -Arjun > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) >
On Thu, Dec 3, 2020 at 3:19 PM Arjun Roy <arjunroy@google.com> wrote: > > On Thu, Dec 3, 2020 at 3:01 PM David Laight <David.Laight@aculab.com> wrote: > > > > From: Stephen Hemminger > > > Sent: 03 December 2020 00:15 > > > > > > On Wed, 2 Dec 2020 14:09:38 -0800 > > > Arjun Roy <arjunroy.kdev@gmail.com> wrote: > > > > > > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > > > > index cfcb10b75483..62db78b9c1a0 100644 > > > > --- a/include/uapi/linux/tcp.h > > > > +++ b/include/uapi/linux/tcp.h > > > > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive { > > > > __u32 recv_skip_hint; /* out: amount of bytes to skip */ > > > > __u32 inq; /* out: amount of bytes in read queue */ > > > > __s32 err; /* out: socket error */ > > > > + __u64 copybuf_address; /* in: copybuf address (small reads) */ > > > > + __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ > > > > You need to swap the order of the above fields to avoid padding > > and differing alignments for 32bit and 64bit apps. > > > > Just to double check, are you referring to the order of > copybuf_address and copybuf_len? > If so, it seems that the current ordering is not creating any > alignment holes, but flipping it would: https://godbolt.org/z/bdxP6b > > > > > > }; > > > > #endif /* _UAPI_LINUX_TCP_H */ > > > > > > You can't safely grow the size of a userspace API without handling the > > > case of older applications. Logic in setsockopt() would have to handle > > > both old and new sizes of the structure. > > > > You also have to allow for old (working) applications being recompiled > > with the new headers. > > So you cannot rely on the fields being zero even if you are passed > > the size of the structure. > > > > I think this should already be taken care of in the current code; the > full-sized struct with new fields is being zero-initialized, then > we're getting the user-provided optlen, then copying from userspace > only that much data. So the newer fields would be zero in that case, > so this should handle the case of new kernels but old applications. > Does this address the concern, or am I misunderstanding? > Actually, on closer read, perhaps the following is what you have in mind for the old application? struct zerocopy_args args; args.address = ...; args.length = ...; args.recv_skip_hint = ...; args.inq = ...; args.err = ...; getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &args, sizeof(args)); // sizeof(args) is now bigger when recompiled with new headers, but we did not explicitly set the new fields to 0, therefore issues -Arjun > Thanks, > -Arjun > > > David > > > > - > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > > Registration No: 1397386 (Wales) > >
From: Eric > Sent: 03 December 2020 23:15 > > On Fri, Dec 4, 2020 at 12:01 AM David Laight <David.Laight@aculab.com> wrote: > > > > From: Stephen Hemminger > > > Sent: 03 December 2020 00:15 > > > > > > On Wed, 2 Dec 2020 14:09:38 -0800 > > > Arjun Roy <arjunroy.kdev@gmail.com> wrote: > > > > > > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > > > > index cfcb10b75483..62db78b9c1a0 100644 > > > > --- a/include/uapi/linux/tcp.h > > > > +++ b/include/uapi/linux/tcp.h > > > > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive { > > > > __u32 recv_skip_hint; /* out: amount of bytes to skip */ > > > > __u32 inq; /* out: amount of bytes in read queue */ > > > > __s32 err; /* out: socket error */ > > > > + __u64 copybuf_address; /* in: copybuf address (small reads) */ > > > > + __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ > > > > You need to swap the order of the above fields to avoid padding > > and differing alignments for 32bit and 64bit apps. > > I do not think so. Please review this patch series carefully. Late at night. The actual problem is 'tail padding'. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
From: Arjun Roy > Sent: 03 December 2020 23:25 ... > > > You also have to allow for old (working) applications being recompiled > > > with the new headers. > > > So you cannot rely on the fields being zero even if you are passed > > > the size of the structure. > > > > > > > I think this should already be taken care of in the current code; the > > full-sized struct with new fields is being zero-initialized, then > > we're getting the user-provided optlen, then copying from userspace > > only that much data. So the newer fields would be zero in that case, > > so this should handle the case of new kernels but old applications. > > Does this address the concern, or am I misunderstanding? > > > > Actually, on closer read, perhaps the following is what you have in > mind for the old application? > > struct zerocopy_args args; > args.address = ...; > args.length = ...; > args.recv_skip_hint = ...; > args.inq = ...; > args.err = ...; > getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &args, sizeof(args)); > // sizeof(args) is now bigger when recompiled with new headers, but we > did not explicitly set the new fields to 0, therefore issues That's the one... David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Dec 4, 2020 at 1:03 AM David Laight <David.Laight@aculab.com> wrote: > > From: Arjun Roy > > Sent: 03 December 2020 23:25 > ... > > > > You also have to allow for old (working) applications being recompiled > > > > with the new headers. > > > > So you cannot rely on the fields being zero even if you are passed > > > > the size of the structure. > > > > > > > > > > I think this should already be taken care of in the current code; the > > > full-sized struct with new fields is being zero-initialized, then > > > we're getting the user-provided optlen, then copying from userspace > > > only that much data. So the newer fields would be zero in that case, > > > so this should handle the case of new kernels but old applications. > > > Does this address the concern, or am I misunderstanding? > > > > > > > Actually, on closer read, perhaps the following is what you have in > > mind for the old application? > > > > struct zerocopy_args args; > > args.address = ...; > > args.length = ...; > > args.recv_skip_hint = ...; > > args.inq = ...; > > args.err = ...; > > getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &args, sizeof(args)); > > // sizeof(args) is now bigger when recompiled with new headers, but we > > did not explicitly set the new fields to 0, therefore issues > > That's the one... > I'll defer in this case to Eric's previous response in this thread (paraphrased: that there is precedent for this in tcp_mmap.c). -Arjun > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales)
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index cfcb10b75483..62db78b9c1a0 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive { __u32 recv_skip_hint; /* out: amount of bytes to skip */ __u32 inq; /* out: amount of bytes in read queue */ __s32 err; /* out: socket error */ + __u64 copybuf_address; /* in: copybuf address (small reads) */ + __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ }; #endif /* _UAPI_LINUX_TCP_H */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index b2bc3d7fe9e8..887c6e986927 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1743,6 +1743,52 @@ int tcp_mmap(struct file *file, struct socket *sock, } EXPORT_SYMBOL(tcp_mmap); +static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc, + struct sk_buff *skb, u32 copylen, + u32 *offset, u32 *seq) +{ + unsigned long copy_address = (unsigned long)zc->copybuf_address; + struct msghdr msg = {}; + struct iovec iov; + int err; + + if (copy_address != zc->copybuf_address) + return -EINVAL; + + err = import_single_range(READ, (void __user *)copy_address, + copylen, &iov, &msg.msg_iter); + if (err) + return err; + err = skb_copy_datagram_msg(skb, *offset, &msg, copylen); + if (err) + return err; + zc->recv_skip_hint -= copylen; + *offset += copylen; + *seq += copylen; + return (__s32)copylen; +} + +static int tcp_zerocopy_handle_leftover_data(struct tcp_zerocopy_receive *zc, + struct sock *sk, + struct sk_buff *skb, + u32 *seq, + s32 copybuf_len) +{ + u32 offset, copylen = min_t(u32, copybuf_len, zc->recv_skip_hint); + + if (!copylen) + return 0; + /* skb is null if inq < PAGE_SIZE. */ + if (skb) + offset = *seq - TCP_SKB_CB(skb)->seq; + else + skb = tcp_recv_skb(sk, *seq, &offset); + + zc->copybuf_len = tcp_copy_straggler_data(zc, skb, copylen, &offset, + seq); + return zc->copybuf_len < 0 ? 0 : copylen; +} + static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma, struct page **pages, unsigned long pages_to_map, @@ -1776,8 +1822,10 @@ static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma, static int tcp_zerocopy_receive(struct sock *sk, struct tcp_zerocopy_receive *zc) { + u32 length = 0, offset, vma_len, avail_len, aligned_len, copylen = 0; unsigned long address = (unsigned long)zc->address; - u32 length = 0, seq, offset, zap_len; + s32 copybuf_len = zc->copybuf_len; + struct tcp_sock *tp = tcp_sk(sk); #define PAGE_BATCH_SIZE 8 struct page *pages[PAGE_BATCH_SIZE]; const skb_frag_t *frags = NULL; @@ -1785,10 +1833,12 @@ static int tcp_zerocopy_receive(struct sock *sk, struct sk_buff *skb = NULL; unsigned long pg_idx = 0; unsigned long curr_addr; - struct tcp_sock *tp; - int inq; + u32 seq = tp->copied_seq; + int inq = tcp_inq(sk); int ret; + zc->copybuf_len = 0; + if (address & (PAGE_SIZE - 1) || address != zc->address) return -EINVAL; @@ -1797,8 +1847,6 @@ static int tcp_zerocopy_receive(struct sock *sk, sock_rps_record_flow(sk); - tp = tcp_sk(sk); - mmap_read_lock(current->mm); vma = find_vma(current->mm, address); @@ -1806,17 +1854,16 @@ static int tcp_zerocopy_receive(struct sock *sk, mmap_read_unlock(current->mm); return -EINVAL; } - zc->length = min_t(unsigned long, zc->length, vma->vm_end - address); - - seq = tp->copied_seq; - inq = tcp_inq(sk); - zc->length = min_t(u32, zc->length, inq); - zap_len = zc->length & ~(PAGE_SIZE - 1); - if (zap_len) { - zap_page_range(vma, address, zap_len); + vma_len = min_t(unsigned long, zc->length, vma->vm_end - address); + avail_len = min_t(u32, vma_len, inq); + aligned_len = avail_len & ~(PAGE_SIZE - 1); + if (aligned_len) { + zap_page_range(vma, address, aligned_len); + zc->length = aligned_len; zc->recv_skip_hint = 0; } else { - zc->recv_skip_hint = zc->length; + zc->length = avail_len; + zc->recv_skip_hint = avail_len; } ret = 0; curr_addr = address; @@ -1885,13 +1932,18 @@ static int tcp_zerocopy_receive(struct sock *sk, } out: mmap_read_unlock(current->mm); - if (length) { + /* Try to copy straggler data. */ + if (!ret) + copylen = tcp_zerocopy_handle_leftover_data(zc, sk, skb, &seq, + copybuf_len); + + if (length + copylen) { WRITE_ONCE(tp->copied_seq, seq); tcp_rcv_space_adjust(sk); /* Clean up data we have read: This will do ACK frames. */ tcp_recv_skb(sk, seq, &offset); - tcp_cleanup_rbuf(sk, length); + tcp_cleanup_rbuf(sk, length + copylen); ret = 0; if (length == zc->length) zc->recv_skip_hint = 0;
From: Arjun Roy <arjunroy@google.com> When TCP receive zerocopy does not successfully map the entire requested space, it outputs a 'hint' that the caller should recvmsg(). Augment zerocopy to accept a user buffer that it tries to copy this hint into - if it is possible to copy the entire hint, it will do so. This elides a recvmsg() call for received traffic that isn't exactly page-aligned in size. This was tested with RPC-style traffic of arbitrary sizes. Normally, each received message required at least one getsockopt() call, and one recvmsg() call for the remaining unaligned data. With this change, almost all of the recvmsg() calls are eliminated, leading to a savings of about 25%-50% in number of system calls for RPC-style workloads. --- include/uapi/linux/tcp.h | 2 + net/ipv4/tcp.c | 84 ++++++++++++++++++++++++++++++++-------- 2 files changed, 70 insertions(+), 16 deletions(-)