Message ID | 61ea0f0faaaaf26dd3c762eabe4420306ced21b9.1630770829.git.linux_oss@crudebyte.com |
---|---|
State | New |
Series | net/9p: increase default msize |
On Sunday, 5 September 2021 01:31:50 CEST Dominique Martinet wrote:
> Hi Christian,
>
> thanks for the follow up, this has been on my todo list forever and despite how simple it is I never took the time for it...
>
> I've applied both patches to my -next branch... We're a bit tight for this merge window (v5.15) but I'll send it to linux mid next week anyway.

That would be great!

> I've also added this patch (sorry for laziness) for TCP, other transports are OK iirc:
>
> From 657e35583c70bed86526cb8eb207abe3d55ea4ea Mon Sep 17 00:00:00 2001
> From: Dominique Martinet <asmadeus@codewreck.org>
> Date: Sun, 5 Sep 2021 08:29:22 +0900
> Subject: [PATCH] net/9p: increase tcp max msize to 1MB
>
> Historically TCP has been limited to 64K buffers, but increasing msize provides huge performance benefits especially as latency increases, so allow for bigger buffers.
>
> Ideally further improvements could change the allocation from the current contiguous chunk in slab (kmem_cache) to some scatter-gather compatible API...
>
> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
>
> diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
> index f4dd0456beaf..007bbcc68010 100644
> --- a/net/9p/trans_fd.c
> +++ b/net/9p/trans_fd.c
> @@ -34,7 +34,7 @@
>  #include <linux/syscalls.h> /* killme */
>
>  #define P9_PORT 564
> -#define MAX_SOCK_BUF (64*1024)
> +#define MAX_SOCK_BUF (1024*1024)
>  #define MAXPOLLWADDR 2
>
>  static struct p9_trans_module p9_tcp_trans;

Yes, makes sense.

Is the TCP transport driver of the Linux 9p client roughly as mature as the virtio transport driver? I still have macOS support for 9p in QEMU on my TODO list, so a decision for a specific transport type for macOS will be needed.

For the next series I mentioned (getting rid of the 512k msize capping) I first need to read up on the Linux kernel code. According to the discussion we already had about that issue, the reason for that capping is that the number of virtio elements is currently hard coded in the Linux client's virtio transport:

On Saturday, 27 February 2021 01:03:40 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Fri, Feb 26, 2021 at 02:49:12PM +0100:
> > Right now the client uses a hard coded amount of 128 elements. So what about replacing VIRTQUEUE_NUM by a variable which is initialized with a value according to the user's requested 'msize' option at init time?
> >
> > According to the virtio specs the max. amount of elements in a virtqueue is 32768. So 32768 * 4k = 128M as new upper limit would already be a significant improvement and would not require too many changes to the client code, right?
>
> The current code inits the chan->sg at probe time (when the driver is loaded) and not at mount time, and it is currently embedded in the chan struct, so that would need allocating at mount time (p9_client_create; either resizing if required or not sharing) but it doesn't sound too intrusive, yes.
>
> I don't see more adherences to VIRTQUEUE_NUM that would hurt trying.

So probably as a first step I would only re-allocate the virtio elements according to the user supplied (i.e. large) 'msize' value if the 9p driver is not in use at that point, and stick to the capping otherwise. That should be fine for many users and would avoid the need for synchronization measures and the traps they might bring. But again, I still need to read more of the Linux client code first.
BTW just in case I have not mentioned it before: there are some developer docs for the QEMU 9p server implementation now:
https://wiki.qemu.org/Documentation/9p

Best regards,
Christian Schoenebeck
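For context on the 512k capping mentioned above: the cap comes from the virtio transport advertising a fixed maximum message size derived from VIRTQUEUE_NUM, to which the client then clamps any larger user-supplied 'msize'. Abridged and slightly paraphrased from net/9p (trans_virtio.c and client.c) around the time of this thread, not part of the patches discussed here:

#define VIRTQUEUE_NUM	128

static struct p9_trans_module p9_virtio_trans = {
	.name = "virtio",
	/* ... */
	.maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),	/* ~500k with 4k pages */
	/* ... */
};

/* in p9_client_create(): msize is clamped to the transport's limit */
if (clnt->msize > clnt->trans_mod->maxsize)
	clnt->msize = clnt->trans_mod->maxsize;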
Christian Schoenebeck wrote on Sun, Sep 05, 2021 at 03:33:11PM +0200:
> > Subject: [PATCH] net/9p: increase tcp max msize to 1MB
>
> Yes, makes sense.
>
> Is the TCP transport driver of the Linux 9p client roughly as mature as the virtio transport driver? I still have macOS support for 9p in QEMU on my TODO list, so a decision for a specific transport type for macOS will be needed.

I'd say it should be just about as stable as virtio. It's definitely getting a lot of testing through things like syzkaller, as it's easy to open an arbitrary fd and pass it to the kernel for fuzzing.

Performance-wise you won't have zero-copy, and the monolithic memory block requirement is the same, so it won't be as good as virtio, but it can probably be used. The problem will more be one of setup -- for user networking we can always use qemu's networking implementation, but for passthrough or tap, qemu won't easily be discoverable from the VM; I'm not sure about that.
Does AF_VSOCK work on macOS? That could probably easily be added to trans_fd...

> For the next series I mentioned (getting rid of the 512k msize capping) I first need to read up on the Linux kernel code. According to the discussion we already had about that issue, the reason for that capping is that the number of virtio elements is currently hard coded in the Linux client's virtio transport:
>
> On Saturday, 27 February 2021 01:03:40 CEST Dominique Martinet wrote:
> > Christian Schoenebeck wrote on Fri, Feb 26, 2021 at 02:49:12PM +0100:
> > > Right now the client uses a hard coded amount of 128 elements. So what about replacing VIRTQUEUE_NUM by a variable which is initialized with a value according to the user's requested 'msize' option at init time?
> > >
> > > According to the virtio specs the max. amount of elements in a virtqueue is 32768. So 32768 * 4k = 128M as new upper limit would already be a significant improvement and would not require too many changes to the client code, right?
> >
> > The current code inits the chan->sg at probe time (when the driver is loaded) and not at mount time, and it is currently embedded in the chan struct, so that would need allocating at mount time (p9_client_create; either resizing if required or not sharing) but it doesn't sound too intrusive, yes.
> >
> > I don't see more adherences to VIRTQUEUE_NUM that would hurt trying.
>
> So probably as a first step I would only re-allocate the virtio elements according to the user supplied (i.e. large) 'msize' value if the 9p driver is not in use at that point, and stick to the capping otherwise. That should be fine for many users and would avoid the need for synchronization measures and the traps they might bring. But again, I still need to read more of the Linux client code first.

Right, looking at it again, p9_virtio_create would already allow that, so you just need to pick the most appropriate channel or create a new one if required; synchronization shouldn't be too much of a problem.

The 9p code is sometimes impressive in its foresight ;)

> BTW just in case I have not mentioned it before: there are some developer docs for the QEMU 9p server implementation now:
> https://wiki.qemu.org/Documentation/9p

Wow, that's an impressive wiki page. I've never been good with documentation (be it writing or using it), but hopefully it'll help people make the first step -- good work!
--
Dominique
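A usage note, not from the thread itself: the larger TCP buffer only matters if a correspondingly large msize is actually requested at mount time, e.g. with Dominique's MAX_SOCK_BUF patch applied something along the lines of

  mount -t 9p -o trans=tcp,version=9p2000.L,msize=1048576 <server-ip> /mnt/9p

where the server address and mount point are placeholders for the actual setup.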
Eric Van Hensbergen wrote on Sun, Sep 05, 2021 at 06:44:13PM -0500:
> there will likely be a tradeoff with tcp in terms of latency to first message, so while absolute bw may be higher, processing time may suffer. 8k was the default msize to more closely match it to jumbo frames on an ethernet. of course all that intuition is close to 30 years out of date….

It's not because the max size is 128k (or 1MB) that this much is sent over the wire every time -- if a message used to fit in 8KB, then its on-the-wire size won't change and speed/latency won't be affected for those.

For messages that do require more than 8KB (read/write/readdir) you can fit more data per message, so for a given userspace request (feed me xyz amount of data) you'll have fewer client-server round trips, and the final user-visible latency will be better as well -- that's why e.g. NFS has been setting a max size of 1MB by default for a while now, and they allow even more (32MB iirc? not sure).

I only did these tests years ago and no longer have access to the notes written back then, but TCP also definitely benefits from > 64k msize as long as there's enough memory available.

The downside (because it's not free) is there though: you need more memory for 9p with big buffers even if we didn't need that much in the first place. The code using a slab now means that the memory is not locked per mount as it used to be, but it also means allocations can fail if there is big memory pressure after the buffers have been released. OTOH as long as it's consistently used the buffers will be recycled, so it's not necessarily too bad performance-wise in hot periods.

Ideally we'd need to rework the transports to allow scatter-gather (iovec or similar API) and work with smaller allocations batched together on send, but I don't have time for something like that... If we did that we could probably get the best of both worlds -- and could consider >1MB, but that's unrealistic as of now, regardless of the transport.
--
Dominique
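To put a rough number on the round-trip argument (assuming the Linux client caps the per-request read/write payload at msize minus its 24-byte I/O header, P9_IOHDRSZ): with msize=8192 each Tread/Rread pair carries at most 8168 bytes of payload, so a 100 MiB sequential read takes on the order of 12,800 round trips, whereas with msize=131072 (131048 bytes of payload) the same read takes roughly 800 -- the size of small messages on the wire is unchanged, only the number of large messages drops.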
On Sunday, 5 September 2021 23:48:04 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Sun, Sep 05, 2021 at 03:33:11PM +0200:
> > > Subject: [PATCH] net/9p: increase tcp max msize to 1MB
> >
> > Yes, makes sense.
> >
> > Is the TCP transport driver of the Linux 9p client roughly as mature as the virtio transport driver? I still have macOS support for 9p in QEMU on my TODO list, so a decision for a specific transport type for macOS will be needed.
>
> I'd say it should be just about as stable as virtio. It's definitely getting a lot of testing through things like syzkaller, as it's easy to open an arbitrary fd and pass it to the kernel for fuzzing.
>
> Performance-wise you won't have zero-copy, and the monolithic memory block requirement is the same, so it won't be as good as virtio, but it can probably be used. The problem will more be one of setup -- for user networking we can always use qemu's networking implementation, but for passthrough or tap, qemu won't easily be discoverable from the VM; I'm not sure about that.
> Does AF_VSOCK work on macOS? That could probably easily be added to trans_fd...

I haven't looked yet into what the latest options on macOS are. It will probably take a while before I have time for that, as it is not high priority for me.

With macOS 11 Apple added many new restrictions on what can still be done outside of user space. For virtualization you would now use the macOS-provided hypervisor, for instance. Apple sent a bunch of patches last year to support their HVF in QEMU.

> > On Saturday, 27 February 2021 01:03:40 CEST Dominique Martinet wrote:
> > > Christian Schoenebeck wrote on Fri, Feb 26, 2021 at 02:49:12PM +0100:
> > > > Right now the client uses a hard coded amount of 128 elements. So what about replacing VIRTQUEUE_NUM by a variable which is initialized with a value according to the user's requested 'msize' option at init time?
> > > >
> > > > According to the virtio specs the max. amount of elements in a virtqueue is 32768. So 32768 * 4k = 128M as new upper limit would already be a significant improvement and would not require too many changes to the client code, right?
> > >
> > > The current code inits the chan->sg at probe time (when the driver is loaded) and not at mount time, and it is currently embedded in the chan struct, so that would need allocating at mount time (p9_client_create; either resizing if required or not sharing) but it doesn't sound too intrusive, yes.
> > >
> > > I don't see more adherences to VIRTQUEUE_NUM that would hurt trying.
> >
> > So probably as a first step I would only re-allocate the virtio elements according to the user supplied (i.e. large) 'msize' value if the 9p driver is not in use at that point, and stick to the capping otherwise. That should be fine for many users and would avoid the need for synchronization measures and the traps they might bring. But again, I still need to read more of the Linux client code first.
>
> Right, looking at it again, p9_virtio_create would already allow that, so you just need to pick the most appropriate channel or create a new one if required; synchronization shouldn't be too much of a problem.
>
> The 9p code is sometimes impressive in its foresight ;)

I just realized this is going to be much more work than I expected.
Apparently struct scatterlist is limited to a value of SG_MAX_SINGLE_ALLOC, which is something around 128 or slightly more:

/*
 * Maximum number of entries that will be allocated in one piece, if
 * a list larger than this is required then chaining will be utilized.
 */
#define SG_MAX_SINGLE_ALLOC  (PAGE_SIZE / sizeof(struct scatterlist))

/*
 * The maximum number of SG segments that we will put inside a
 * scatterlist (unless chaining is used). Should ideally fit inside a
 * single page, to avoid a higher order allocation. We could define this
 * to SG_MAX_SINGLE_ALLOC to pack correctly at the highest order. The
 * minimum value is 32
 */
#define SG_CHUNK_SIZE  128

Which explains the current hard coded default of 128 sglist elements in the 9p virtio transport:

#define VIRTQUEUE_NUM  128

So struct virtio_chan would need to switch from scatterlist to a different type, probably to sg_table. The latter API can automatically chain sg lists if the required amount exceeds SG_MAX_SINGLE_ALLOC (it misuses the last sglist element to build a chained list). However chaining is not supported on some architectures:

int __sg_alloc_table(struct sg_table *table, unsigned int nents,
		     unsigned int max_ents, struct scatterlist *first_chunk,
		     unsigned int nents_first_chunk, gfp_t gfp_mask,
		     sg_alloc_fn *alloc_fn)
{
	...
#ifdef CONFIG_ARCH_NO_SG_CHAIN
	if (WARN_ON_ONCE(nents > max_ents))
		return -EINVAL;
#endif
	...
}

So those architectures would still end up with the current limitation of 128 sglist elements in the 9p virtio transport.

It is also not clear to me whether the Linux sg_table API allows the table to grow after it has already been allocated; say the virtio transport retains the current pre-allocation of hard-coded 128 elements in early p9_virtio_probe() and would then just append more sglists in p9_virtio_create() if needed due to a large msize option provided by the user. It "looks" like sg_alloc_table() might allow appending, but I am not sure.

Another candidate type would be struct sg_append_table, but its API wants a complete list of pages to be provided upfront at table allocation:

/**
 * sg_alloc_append_table_from_pages - Allocate and initialize an append sg
 *                                    table from an array of pages
 * @sgt_append:  The sg append table to use
 * @pages:       Pointer to an array of page pointers
 * @n_pages:     Number of pages in the pages array
 * @offset:      Offset from start of the first page to the start of a buffer
 * @size:        Number of valid bytes in the buffer (after offset)
 * @max_segment: Maximum size of a scatterlist element in bytes
 * @left_pages:  Left pages caller have to set after this call
 * @gfp_mask:    GFP allocation mask
 *
 * Description:
 *    In the first call it allocates and initializes an sg table from a list
 *    of pages, else it reuses the scatterlist from sgt_append. Contiguous
 *    ranges of the pages are squashed into a single scatterlist entry up to
 *    the maximum size specified in @max_segment. A user may provide an offset
 *    at a start and a size of valid data in a buffer specified by the page
 *    array. The returned sg table is released by sg_free_append_table.
 *
 * Returns:
 *    0 on success, negative error on failure
 *
 * Notes:
 *    If this function returns non-0 (eg failure), the caller must call
 *    sg_free_append_table() to clean up any leftover allocations.
 *
 *    In the first call, sgt_append must be initialized.
 */
int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
				     struct page **pages, unsigned int n_pages,
				     unsigned int offset, unsigned long size,
				     unsigned int max_segment,
				     unsigned int left_pages, gfp_t gfp_mask)
{
	...
}

Which does not really fit here (at the moment), as the 9p virtio transport apparently maps the pages on demand, not already when the client/transport is created.

Either way, the functions pack_sg_list() and pack_sg_list_p() in the 9p virtio transport would additionally need to be adjusted to run like:

int pack_sg_list() {
	FOR_EACH(sglist in sgtable) {
		for (i in 0..127) {
			... sglist[i] ...
		}
	}
}

Which is not really a trivial change set. :/

On Monday, 6 September 2021 02:07:56 CEST Dominique Martinet wrote:
> Eric Van Hensbergen wrote on Sun, Sep 05, 2021 at 06:44:13PM -0500:
> > there will likely be a tradeoff with tcp in terms of latency to first message, so while absolute bw may be higher, processing time may suffer. 8k was the default msize to more closely match it to jumbo frames on an ethernet. of course all that intuition is close to 30 years out of date….
>
> It's not because the max size is 128k (or 1MB) that this much is sent over the wire every time -- if a message used to fit in 8KB, then its on-the-wire size won't change and speed/latency won't be affected for those.
>
> For messages that do require more than 8KB (read/write/readdir) you can fit more data per message, so for a given userspace request (feed me xyz amount of data) you'll have fewer client-server round trips, and the final user-visible latency will be better as well -- that's why e.g. NFS has been setting a max size of 1MB by default for a while now, and they allow even more (32MB iirc? not sure).

I can only speak for the setup QEMU server <-> virtio <-> Linux client right now: in this setup there is definitely no latency disadvantage from raising msize. It is actually the exact opposite: low msize values significantly increase overall latency, because for each 9p message there is not only the latency of transferring the message between server and client, there is also latency added on the server side when handling each request, as the server has to dispatch between fs driver worker thread(s) and the server main thread at least twice per 9p request, in some cases even more often per request. That's actually something I'm working on, to reduce this to its theoretical limit of 2 thread hops for every 9p request type on the QEMU server side (WIP).

The only bottleneck situation I can think of when raising msize to a giant value is when a guest app constantly reads/writes with the maximum chunk size according to msize (which 'cat' for instance always does, by reading stat's st_blksize). In this case 9p requests of other threads/processes would need to wait, as that single 9p consumer would occupy all available pages with a single, huge 9p request. That could maybe be addressed by multiplying msize with the number of available CPU cores or something like that when allocating the sg table on the client transport side.

> I only did these tests years ago and no longer have access to the notes written back then, but TCP also definitely benefits from > 64k msize as long as there's enough memory available.
>
> The downside (because it's not free) is there though: you need more memory for 9p with big buffers even if we didn't need that much in the first place.
> The code using a slab now means that the memory is not locked per mount as it used to be, but it also means allocations can fail if there is big memory pressure after the buffers have been released. OTOH as long as it's consistently used the buffers will be recycled, so it's not necessarily too bad performance-wise in hot periods.
>
> Ideally we'd need to rework the transports to allow scatter-gather (iovec or similar API) and work with smaller allocations batched together on send, but I don't have time for something like that... If we did that we could probably get the best of both worlds -- and could consider >1MB, but that's unrealistic as of now, regardless of the transport.

Ok, but that's not an issue for the virtio transport, is it? The virtio transport already uses scatter-gather. AFAICS in the case of virtio the issue might rather be that it claims pages on demand instead of pinning them when the client/virtio channel is created, so it could run out of free pages during its lifetime, I guess.

Best regards,
Christian Schoenebeck
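For illustration, a minimal sketch of the direction discussed in this sub-thread: sizing a chained sg_table from the mount's msize instead of the fixed VIRTQUEUE_NUM and walking it with the regular scatterlist iterators. The helper names are hypothetical, this is not part of the posted patches, and it assumes sg_alloc_table() chaining is available (i.e. not CONFIG_ARCH_NO_SG_CHAIN):

#include <linux/kernel.h>
#include <linux/scatterlist.h>

/* Hypothetical helper: allocate one scatterlist entry per page of msize.
 * sg_alloc_table() chains additional scatterlist chunks internally once
 * nents exceeds SG_MAX_SINGLE_ALLOC (and instead fails with -EINVAL on
 * CONFIG_ARCH_NO_SG_CHAIN architectures). Freed with sg_free_table(). */
static int p9_alloc_msize_sgtable(struct sg_table *sgt, size_t msize)
{
	unsigned int nents = DIV_ROUND_UP(msize, PAGE_SIZE);

	return sg_alloc_table(sgt, nents, GFP_KERNEL);
}

/* A pack_sg_list()-style loop would then iterate the possibly chained
 * table with for_each_sg() instead of indexing a flat array: */
static void p9_fill_sgtable(struct sg_table *sgt, char *data, int count)
{
	struct scatterlist *sg;
	unsigned int i;

	for_each_sg(sgt->sgl, sg, sgt->nents, i) {
		int len = min_t(int, count, PAGE_SIZE);

		sg_set_buf(sg, data, len);	/* one page-sized chunk per entry */
		data += len;
		count -= len;
		if (count <= 0)
			break;
	}
}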
diff --git a/net/9p/client.c b/net/9p/client.c
index 1cb255587fff..213f12ed76cd 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -30,7 +30,7 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/9p.h>
 
-#define DEFAULT_MSIZE 8192
+#define DEFAULT_MSIZE (128 * 1024)
 
 /*
  * Client Option Parsing (code inspired by NFS code)
Let's raise the default msize value to 128k.

The 'msize' option defines the maximum message size allowed for any message being transmitted (in both directions) between 9p server and 9p client during a 9p session.

Currently the default 'msize' is just 8k, which is way too conservative. Such a small 'msize' value has quite a negative performance impact, because individual 9p messages have to be split up far too often into numerous smaller messages to fit into this message size limitation.

A default value of just 8k also has a much higher probability of hitting short-read issues like:
https://gitlab.com/qemu-project/qemu/-/issues/409

Unfortunately user feedback showed that many 9p users are not aware that this option even exists, nor of the negative impact it might have if it is too low.

Link: https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg01003.html
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/client.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
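For users who want a value other than the new default, msize can still be set per mount with the msize option; for a virtio-backed share that might look like the following, where the tag (hostshare) and mount point are placeholders:

  mount -t 9p -o trans=virtio,version=9p2000.L,msize=262144 hostshare /mnt/hostshare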