diff mbox series

[net-next,v3] net: ip: avoid OOM kills with large UDP sends over loopback

Message ID 20210623162328.2197645-1-kuba@kernel.org
State New
Headers show
Series [net-next,v3] net: ip: avoid OOM kills with large UDP sends over loopback | expand

Commit Message

Jakub Kicinski June 23, 2021, 4:23 p.m. UTC
Dave observed number of machines hitting OOM on the UDP send
path. The workload seems to be sending large UDP packets over
loopback. Since loopback has MTU of 64k kernel will try to
allocate an skb with up to 64k of head space. This has a good
chance of failing under memory pressure. What's worse if
the message length is <32k the allocation may trigger an
OOM killer.

This is entirely avoidable, we can use an skb with page frags.

af_unix solves a similar problem by limiting the head
length to SKB_MAX_ALLOC. This seems like a good and simple
approach. It means that UDP messages > 16kB will now
use fragments if underlying device supports SG, if extra
allocator pressure causes regressions in real workloads
we can switch to trying the large allocation first and
falling back.

Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/ipv4/ip_output.c  | 4 +++-
 net/ipv6/ip6_output.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

Comments

Eric Dumazet June 23, 2021, 6:11 p.m. UTC | #1
On 6/23/21 8:05 PM, Eric Dumazet wrote:
> 
> 
> On 6/23/21 6:23 PM, Jakub Kicinski wrote:
>> Dave observed number of machines hitting OOM on the UDP send
>> path. The workload seems to be sending large UDP packets over
>> loopback. Since loopback has MTU of 64k kernel will try to
>> allocate an skb with up to 64k of head space. This has a good
>> chance of failing under memory pressure. What's worse if
>> the message length is <32k the allocation may trigger an
>> OOM killer.
>>
>> This is entirely avoidable, we can use an skb with page frags.
>>
>> af_unix solves a similar problem by limiting the head
>> length to SKB_MAX_ALLOC. This seems like a good and simple
>> approach. It means that UDP messages > 16kB will now
>> use fragments if underlying device supports SG, if extra
>> allocator pressure causes regressions in real workloads
>> we can switch to trying the large allocation first and
>> falling back.
>>
>> Reported-by: Dave Jones <dsj@fb.com>
>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>> ---
> 
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> 
> Thanks !
> 

I am taking this back.

IPv6 side also needs to account for sizeof(struct frag_hdr) ?
Willem de Bruijn June 24, 2021, 2:23 a.m. UTC | #2
On Wed, Jun 23, 2021 at 5:07 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Wed, 23 Jun 2021 21:45:55 +0200 Jesper Dangaard Brouer wrote:

> > > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c

> > > index c3efc7d658f6..790dd28fd198 100644

> > > --- a/net/ipv4/ip_output.c

> > > +++ b/net/ipv4/ip_output.c

> > > @@ -1077,7 +1077,9 @@ static int __ip_append_data(struct sock *sk,

> > >                     if ((flags & MSG_MORE) &&

> > >                         !(rt->dst.dev->features&NETIF_F_SG))

> > >                             alloclen = mtu;

> > > -                   else if (!paged)

> > > +                   else if (!paged &&

> > > +                            (fraglen + hh_len + 15 < SKB_MAX_ALLOC ||

> >

> > What does the number 15 represent here?

>

> No idea, it's there on the allocation line, so I need to include it on

> the size check.

>

> Looking at super old code (2.4.x) it looks like it may have gotten

> copy & pasted mistakenly? The hard headers are rounded up to 16B,

> and there is code which does things like:

>

>         skb_alloc(size + dev->hard_header_len + 15);

>         skb_reserve(skb, (dev->hard_header_len + 15) & ~15);

>

> in other spots. So if I was to guess I'd say someone decided to add the

> 15B "to be safe" even though hh_len already includes the round up here.


The 15 seems to come from alignment indeed. Not sure when it was
introduced, but until 56951b54e87a there is also this

                /*
                 *      Get the memory we require with some space left
for alignment.
                 */

                skb = sock_alloc_send_skb(sk, fraglen+hh_len+15, 0,
flags&MSG_DONTWAIT, &err);
diff mbox series

Patch

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c3efc7d658f6..790dd28fd198 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1077,7 +1077,9 @@  static int __ip_append_data(struct sock *sk,
 			if ((flags & MSG_MORE) &&
 			    !(rt->dst.dev->features&NETIF_F_SG))
 				alloclen = mtu;
-			else if (!paged)
+			else if (!paged &&
+				 (fraglen + hh_len + 15 < SKB_MAX_ALLOC ||
+				  !(rt->dst.dev->features & NETIF_F_SG)))
 				alloclen = fraglen;
 			else {
 				alloclen = min_t(int, fraglen, MAX_HEADER);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff4f9ebcf7f6..ae8dbd6cdab1 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1585,7 +1585,9 @@  static int __ip6_append_data(struct sock *sk,
 			if ((flags & MSG_MORE) &&
 			    !(rt->dst.dev->features&NETIF_F_SG))
 				alloclen = mtu;
-			else if (!paged)
+			else if (!paged &&
+				 (fraglen + hh_len < SKB_MAX_ALLOC ||
+				  !(rt->dst.dev->features & NETIF_F_SG)))
 				alloclen = fraglen;
 			else {
 				alloclen = min_t(int, fraglen, MAX_HEADER);