
[v2,1/1] io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL

Message ID 12efc18b6bef3955500080a238197e90ca6a402c.1616268538.git.metze@samba.org
State New
Series [v2,1/1] io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL

Commit Message

Stefan Metzmacher March 20, 2021, 7:33 p.m. UTC
Without that it's not safe to use them in a linked combination with
others.

Now combinations like IORING_OP_SENDMSG followed by IORING_OP_SPLICE
should be possible.

We already handle short reads and writes for the following opcodes:

- IORING_OP_READV
- IORING_OP_READ_FIXED
- IORING_OP_READ
- IORING_OP_WRITEV
- IORING_OP_WRITE_FIXED
- IORING_OP_WRITE
- IORING_OP_SPLICE
- IORING_OP_TEE

Now we have it for these as well:

- IORING_OP_SENDMSG
- IORING_OP_SEND
- IORING_OP_RECVMSG
- IORING_OP_RECV

For IORING_OP_RECVMSG we also check for the MSG_TRUNC and MSG_CTRUNC
flags in order to call req_set_fail_links().

There might be applications around that depend on the behavior
that even short send[msg]()/recv[msg]() returns continue an
IOSQE_IO_LINK chain.

It's very unlikely that such applications pass in MSG_WAITALL,
which is only defined in 'man 2 recvmsg', but not in 'man 2 sendmsg'.

It's expected that the low-level sock_sendmsg() call just ignores
MSG_WAITALL, just as MSG_ZEROCOPY is also ignored unless SO_ZEROCOPY
has been set explicitly.

We also expect the caller to know about the implicit truncation to
MAX_RW_COUNT, which we don't detect.

cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/r/c4e1a4cc0d905314f4d5dc567e65a7b09621aab3.1615908477.git.metze@samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
---
 fs/io_uring.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)
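
The following is a minimal userspace sketch (not part of the patch) of the
pattern this enables: a MSG_WAITALL send linked to a splice, where the splice
gets -ECANCELED if the send completes short. It assumes liburing; sock_fd,
file_fd, pipe_fds, msg and nbytes are placeholders for the application's own
descriptors and message.

#include <liburing.h>
#include <sys/socket.h>

static int send_then_splice(int sock_fd, int file_fd, int pipe_fds[2],
			    struct msghdr *msg, unsigned int nbytes)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret, i;

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret < 0)
		return ret;

	/* send with MSG_WAITALL: with this patch a short send fails the link */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_sendmsg(sqe, sock_fd, msg, MSG_WAITALL);
	sqe->flags |= IOSQE_IO_LINK;

	/* splice file -> pipe; only runs if the send transferred everything */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_splice(sqe, file_fd, 0, pipe_fds[1], -1, nbytes, 0);

	ret = io_uring_submit(&ring);
	if (ret < 0)
		goto out;

	for (i = 0; i < 2; i++) {
		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret < 0)
			break;
		/* the second cqe carries -ECANCELED if the send was short */
		io_uring_cqe_seen(&ring, cqe);
	}
out:
	io_uring_queue_exit(&ring);
	return ret;
}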

Comments

Stefan Metzmacher March 21, 2021, 10:20 a.m. UTC | #1
On 20.03.21 at 23:57, Jens Axboe wrote:
> On 3/20/21 1:33 PM, Stefan Metzmacher wrote:
>> Without that it's not safe to use them in a linked combination with
>> others.
>>
>> Now combinations like IORING_OP_SENDMSG followed by IORING_OP_SPLICE
>> should be possible.
>>
>> We already handle short reads and writes for the following opcodes:
>>
>> - IORING_OP_READV
>> - IORING_OP_READ_FIXED
>> - IORING_OP_READ
>> - IORING_OP_WRITEV
>> - IORING_OP_WRITE_FIXED
>> - IORING_OP_WRITE
>> - IORING_OP_SPLICE
>> - IORING_OP_TEE
>>
>> Now we have it for these as well:
>>
>> - IORING_OP_SENDMSG
>> - IORING_OP_SEND
>> - IORING_OP_RECVMSG
>> - IORING_OP_RECV
>>
>> For IORING_OP_RECVMSG we also check for the MSG_TRUNC and MSG_CTRUNC
>> flags in order to call req_set_fail_links().
>>
>> There might be applications around that depend on the behavior
>> that even short send[msg]()/recv[msg]() returns continue an
>> IOSQE_IO_LINK chain.
>>
>> It's very unlikely that such applications pass in MSG_WAITALL,
>> which is only defined in 'man 2 recvmsg', but not in 'man 2 sendmsg'.
>>
>> It's expected that the low-level sock_sendmsg() call just ignores
>> MSG_WAITALL, just as MSG_ZEROCOPY is also ignored unless SO_ZEROCOPY
>> has been set explicitly.
>>
>> We also expect the caller to know about the implicit truncation to
>> MAX_RW_COUNT, which we don't detect.
> 
> Thanks, I do think this is much better and I feel comfortable getting
> this applied for 5.12 (and stable).
> 

Great thanks!

Related to that I have a question regarding the IOSQE_IO_LINK behavior.
(Assuming I have a dedicated ring for the send-path of each socket.)

Is it possible to just set IOSQE_IO_LINK on every sqe in order to create
an endless chain of requests, so that userspace can pass as many sqes as
possible which all need to be submitted in exactly the correct order? And
if any request is short, all remaining ones get ECANCELED, without the
risk of running any later request out of order.

Are such link chains also possible over multiple io_uring_submit() calls?
Or is there a race between an iothread removing a request from the list
and filling in a cqe with ECANCELED that userspace is not yet aware of,
while userspace starts a new, independent link chain with a request that
ought to be submitted after all the canceled ones?

Or do I have to submit a link chain within just a single __io_uring_flush_sq()
and then strictly wait until I get a cqe for the last request in
the chain?

Thanks!
metze
Jens Axboe March 21, 2021, 1:10 p.m. UTC | #2
On 3/21/21 4:20 AM, Stefan Metzmacher wrote:
> 
> On 20.03.21 at 23:57, Jens Axboe wrote:
>> On 3/20/21 1:33 PM, Stefan Metzmacher wrote:
>>> Without that it's not safe to use them in a linked combination with
>>> others.
>>>
>>> Now combinations like IORING_OP_SENDMSG followed by IORING_OP_SPLICE
>>> should be possible.
>>>
>>> We already handle short reads and writes for the following opcodes:
>>>
>>> - IORING_OP_READV
>>> - IORING_OP_READ_FIXED
>>> - IORING_OP_READ
>>> - IORING_OP_WRITEV
>>> - IORING_OP_WRITE_FIXED
>>> - IORING_OP_WRITE
>>> - IORING_OP_SPLICE
>>> - IORING_OP_TEE
>>>
>>> Now we have it for these as well:
>>>
>>> - IORING_OP_SENDMSG
>>> - IORING_OP_SEND
>>> - IORING_OP_RECVMSG
>>> - IORING_OP_RECV
>>>
>>> For IORING_OP_RECVMSG we also check for the MSG_TRUNC and MSG_CTRUNC
>>> flags in order to call req_set_fail_links().
>>>
>>> There might be applications around that depend on the behavior
>>> that even short send[msg]()/recv[msg]() returns continue an
>>> IOSQE_IO_LINK chain.
>>>
>>> It's very unlikely that such applications pass in MSG_WAITALL,
>>> which is only defined in 'man 2 recvmsg', but not in 'man 2 sendmsg'.
>>>
>>> It's expected that the low-level sock_sendmsg() call just ignores
>>> MSG_WAITALL, just as MSG_ZEROCOPY is also ignored unless SO_ZEROCOPY
>>> has been set explicitly.
>>>
>>> We also expect the caller to know about the implicit truncation to
>>> MAX_RW_COUNT, which we don't detect.
>>
>> Thanks, I do think this is much better and I feel comfortable getting
>> this applied for 5.12 (and stable).
>>
> 
> Great thanks!
> 
> Related to that I have a question regarding the IOSQE_IO_LINK behavior.
> (Assuming I have a dedicated ring for the send-path of each socket.)
> 
> Is it possible to just set IOSQE_IO_LINK on every sqe in order to create
> an endless chain of requests, so that userspace can pass as many sqes as
> possible which all need to be submitted in exactly the correct order? And
> if any request is short, all remaining ones get ECANCELED, without the
> risk of running any later request out of order.
> 
> Are such link chains also possible over multiple io_uring_submit() calls?
> Or is there a race between an iothread removing a request from the list
> and filling in a cqe with ECANCELED that userspace is not yet aware of,
> while userspace starts a new, independent link chain with a request that
> ought to be submitted after all the canceled ones?
> 
> Or do I have to submit a link chain within just a single __io_uring_flush_sq()
> and then strictly wait until I get a cqe for the last request in
> the chain?

A chain can only exist within a single submit attempt, so it will not work
if you need to break it up over multiple io_uring_enter() calls.
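
To illustrate the constraint just described, here is a sketch (assumptions, not
tested code): the whole chain is queued before one io_uring_submit() call, with
IOSQE_IO_LINK on every sqe except the last, and the chain's cqes are reaped
before the next chain is started. prep_one() is a hypothetical application
helper that fills in the actual request.

#include <liburing.h>
#include <errno.h>

void prep_one(struct io_uring_sqe *sqe, unsigned int i);	/* hypothetical helper */

static int submit_one_chain(struct io_uring *ring, unsigned int nr)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	unsigned int i;
	int ret;

	for (i = 0; i < nr; i++) {
		sqe = io_uring_get_sqe(ring);
		if (!sqe)
			return -EBUSY;
		prep_one(sqe, i);		/* application-specific prep */
		if (i + 1 < nr)			/* link all but the last sqe */
			sqe->flags |= IOSQE_IO_LINK;
	}

	ret = io_uring_submit(ring);		/* the chain goes in one submit */
	if (ret < 0)
		return ret;

	/* reap all cqes; a short/failed request cancels the rest (-ECANCELED) */
	for (i = 0; i < nr; i++) {
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;
		io_uring_cqe_seen(ring, cqe);
	}
	return 0;
}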

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 75b791ff21ec..746435e3f534 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4386,6 +4386,7 @@  static int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_async_msghdr iomsg, *kmsg;
 	struct socket *sock;
 	unsigned flags;
+	int min_ret = 0;
 	int ret;
 
 	sock = sock_from_file(req->file);
@@ -4406,6 +4407,9 @@  static int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
 	else if (issue_flags & IO_URING_F_NONBLOCK)
 		flags |= MSG_DONTWAIT;
 
+	if (flags & MSG_WAITALL)
+		min_ret = iov_iter_count(&kmsg->msg.msg_iter);
+
 	ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags);
 	if ((issue_flags & IO_URING_F_NONBLOCK) && ret == -EAGAIN)
 		return io_setup_async_msg(req, kmsg);
@@ -4416,7 +4420,7 @@  static int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
 	if (kmsg->free_iov)
 		kfree(kmsg->free_iov);
 	req->flags &= ~REQ_F_NEED_CLEANUP;
-	if (ret < 0)
+	if (ret < min_ret)
 		req_set_fail_links(req);
 	__io_req_complete(req, issue_flags, ret, 0);
 	return 0;
@@ -4429,6 +4433,7 @@  static int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	struct iovec iov;
 	struct socket *sock;
 	unsigned flags;
+	int min_ret = 0;
 	int ret;
 
 	sock = sock_from_file(req->file);
@@ -4450,6 +4455,9 @@  static int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	else if (issue_flags & IO_URING_F_NONBLOCK)
 		flags |= MSG_DONTWAIT;
 
+	if (flags & MSG_WAITALL)
+		min_ret = iov_iter_count(&msg.msg_iter);
+
 	msg.msg_flags = flags;
 	ret = sock_sendmsg(sock, &msg);
 	if ((issue_flags & IO_URING_F_NONBLOCK) && ret == -EAGAIN)
@@ -4457,7 +4465,7 @@  static int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	if (ret == -ERESTARTSYS)
 		ret = -EINTR;
 
-	if (ret < 0)
+	if (ret < min_ret)
 		req_set_fail_links(req);
 	__io_req_complete(req, issue_flags, ret, 0);
 	return 0;
@@ -4609,6 +4617,7 @@  static int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags)
 	struct socket *sock;
 	struct io_buffer *kbuf;
 	unsigned flags;
+	int min_ret = 0;
 	int ret, cflags = 0;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 
@@ -4640,6 +4649,9 @@  static int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags)
 	else if (force_nonblock)
 		flags |= MSG_DONTWAIT;
 
+	if (flags & MSG_WAITALL)
+		min_ret = iov_iter_count(&kmsg->msg.msg_iter);
+
 	ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.umsg,
 					kmsg->uaddr, flags);
 	if (force_nonblock && ret == -EAGAIN)
@@ -4653,7 +4665,7 @@  static int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags)
 	if (kmsg->free_iov)
 		kfree(kmsg->free_iov);
 	req->flags &= ~REQ_F_NEED_CLEANUP;
-	if (ret < 0)
+	if (ret < min_ret || ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))))
 		req_set_fail_links(req);
 	__io_req_complete(req, issue_flags, ret, cflags);
 	return 0;
@@ -4668,6 +4680,7 @@  static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	struct socket *sock;
 	struct iovec iov;
 	unsigned flags;
+	int min_ret = 0;
 	int ret, cflags = 0;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 
@@ -4699,6 +4712,9 @@  static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	else if (force_nonblock)
 		flags |= MSG_DONTWAIT;
 
+	if (flags & MSG_WAITALL)
+		min_ret = iov_iter_count(&msg.msg_iter);
+
 	ret = sock_recvmsg(sock, &msg, flags);
 	if (force_nonblock && ret == -EAGAIN)
 		return -EAGAIN;
@@ -4707,7 +4723,7 @@  static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 out_free:
 	if (req->flags & REQ_F_BUFFER_SELECTED)
 		cflags = io_put_recv_kbuf(req);
-	if (ret < 0)
+	if (ret < min_ret || ((flags & MSG_WAITALL) && (msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))))
 		req_set_fail_links(req);
 	__io_req_complete(req, issue_flags, ret, cflags);
 	return 0;