diff mbox series

[1/2] net/ipv6: always honour route mtu during forwarding

Message ID 20201008033102.623894-1-zenczykowski@gmail.com
State New
Headers show
Series [1/2] net/ipv6: always honour route mtu during forwarding | expand

Commit Message

Maciej Żenczykowski Oct. 8, 2020, 3:31 a.m. UTC
From: Maciej Żenczykowski <maze@google.com>

This matches the new ipv4 behaviour as of commit:
  commit 02a1b175b0e92d9e0fa5df3957ade8d733ceb6a0
  Author: Maciej Żenczykowski <maze@google.com>
  Date:   Wed Sep 23 13:18:15 2020 -0700

  net/ipv4: always honour route mtu during forwarding

The reasoning is similar: There doesn't seem to be any reason
why you would want to ignore route mtu.

There are two potential sources of ipv6 route mtu:
  - manually configured by NET_ADMIN, since you configured
    a route mtu explicitly you probably know best...
  - derived from mtu information from RA messages,
    but this is the network telling you what will work,
    again presumably whatever network admin configured
    the RA content knows best what the network conditions are.

One could argue that RAs can be spoofed, but if we get spoofed
RAs we're *already* screwed, and erroneous mtu information is
less dangerous then the erroneous routes themselves...
(The proper place to do RA filtering is in the switch/router)

Additionally, a reduction from 1500 to 1280 (min ipv6 mtu) is
not very noticable on performance (especially with gro/gso/tso),
while packets getting lost (due to rx buffer overruns) or
generating icmpv6 packet too big errors and needing to be
retransmitted is very noticable (guaranteed impact of full rtt)

It is pretty common to have a higher device mtu to allow receiving
large (jumbo) frames, while having some routes via that interface
(potentially including the default route to the internet) specify
a lower mtu.

There might also be use cases around xfrm/ipsec/tunnels.
Especially for something like sit/6to4/6rd, where you may have one
sit device, but traffic through it will flow over different
underlying paths and thus is per subnet and not per device.

(Note that this function does not honour pmtu, which can be spoofed
via icmpv6 messages, but see also ip6_mtu_from_fib6() which honours
pmtu for ipv6 'locked mtu' routes)

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Sunmeet Gill (Sunny) <sgill@quicinc.com>
Cc: Vinay Paradkar <vparadka@qti.qualcomm.com>
Cc: Tyler Wear <twear@quicinc.com>
Cc: David Ahern <dsahern@kernel.org>
---
 include/net/ip6_route.h | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

Comments

Lorenzo Colitti Oct. 8, 2020, 6:04 a.m. UTC | #1
On Thu, Oct 8, 2020 at 12:31 PM Maciej Żenczykowski
<zenczykowski@gmail.com> wrote:
> diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
> index 2a5277758379..598415743f46 100644
> --- a/include/net/ip6_route.h
> +++ b/include/net/ip6_route.h
> @@ -311,19 +311,13 @@ static inline bool rt6_duplicate_nexthop(struct fib6_info *a, struct fib6_info *
>  static inline unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
>  {
>         struct inet6_dev *idev;
> -       unsigned int mtu;
> +       unsigned int mtu = dst_metric_raw(dst, RTAX_MTU);
> +       if (mtu)
> +               return mtu;

What should happen here if mtu is less than idev->cnf.mtu6? Should the
code pick the minimum? If not: will picking the higher value work, or
will the packet be dropped? I suppose we already have this problem
today if the administrator configures a route with a locked MTU.
Maciej Żenczykowski Oct. 8, 2020, 6:22 a.m. UTC | #2
> On Thu, Oct 8, 2020 at 12:31 PM Maciej Żenczykowski
> <zenczykowski@gmail.com> wrote:
> > diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
> > index 2a5277758379..598415743f46 100644
> > --- a/include/net/ip6_route.h
> > +++ b/include/net/ip6_route.h
> > @@ -311,19 +311,13 @@ static inline bool rt6_duplicate_nexthop(struct fib6_info *a, struct fib6_info *
> >  static inline unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
> >  {
> >         struct inet6_dev *idev;
> > -       unsigned int mtu;
> > +       unsigned int mtu = dst_metric_raw(dst, RTAX_MTU);
> > +       if (mtu)
> > +               return mtu;
>
> What should happen here if mtu is less than idev->cnf.mtu6? Should the
> code pick the minimum? If not: will picking the higher value work, or
> will the packet be dropped? I suppose we already have this problem
> today if the administrator configures a route with a locked MTU.

This feels like a misconfiguration of some sort (ie maybe should be
denied at route config time), but honestly it could maybe potentially
be useful:  for example an ipv6 route out an ipv4 only interface with
ebpf doing translation should/could actually be higher (by 20 or even
28) then device mtu.  (Note: this is not the Android case, as we
translate on ingress, not egress, but a setup could be created that
would work, especially if there was no nat44 in the picture and the
ipv6 dst address would directly select the end host)

And either way... it's the same for v4, and it already behave this way
in other places.
David Ahern Oct. 8, 2020, 4:31 p.m. UTC | #3
On 10/7/20 8:31 PM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> This matches the new ipv4 behaviour as of commit:
>   commit 02a1b175b0e92d9e0fa5df3957ade8d733ceb6a0
>   Author: Maciej Żenczykowski <maze@google.com>
>   Date:   Wed Sep 23 13:18:15 2020 -0700
> 
>   net/ipv4: always honour route mtu during forwarding

just summarize that as:
commit 02a1b175b0e9 ("net/ipv4: always honour route mtu during forwarding")



> diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
> index 2a5277758379..598415743f46 100644
> --- a/include/net/ip6_route.h
> +++ b/include/net/ip6_route.h
> @@ -311,19 +311,13 @@ static inline bool rt6_duplicate_nexthop(struct fib6_info *a, struct fib6_info *
>  static inline unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
>  {
>  	struct inet6_dev *idev;
> -	unsigned int mtu;
> +	unsigned int mtu = dst_metric_raw(dst, RTAX_MTU);

newline here for readability

> +	if (mtu)
> +		return mtu;
>  
> -	if (dst_metric_locked(dst, RTAX_MTU)) {
> -		mtu = dst_metric_raw(dst, RTAX_MTU);
> -		if (mtu)
> -			return mtu;
> -	}
> -
> -	mtu = IPV6_MIN_MTU;
>  	rcu_read_lock();
>  	idev = __in6_dev_get(dst->dev);
> -	if (idev)
> -		mtu = idev->cnf.mtu6;
> +	mtu = idev ? idev->cnf.mtu6 : IPV6_MIN_MTU;
>  	rcu_read_unlock();
>  
>  	return mtu;
> 

besides the nit comments, the change looks fine to me. Please add test
cases to tools/testing/selftests/net/pmtu.sh for this change.
diff mbox series

Patch

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 2a5277758379..598415743f46 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -311,19 +311,13 @@  static inline bool rt6_duplicate_nexthop(struct fib6_info *a, struct fib6_info *
 static inline unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
 {
 	struct inet6_dev *idev;
-	unsigned int mtu;
+	unsigned int mtu = dst_metric_raw(dst, RTAX_MTU);
+	if (mtu)
+		return mtu;
 
-	if (dst_metric_locked(dst, RTAX_MTU)) {
-		mtu = dst_metric_raw(dst, RTAX_MTU);
-		if (mtu)
-			return mtu;
-	}
-
-	mtu = IPV6_MIN_MTU;
 	rcu_read_lock();
 	idev = __in6_dev_get(dst->dev);
-	if (idev)
-		mtu = idev->cnf.mtu6;
+	mtu = idev ? idev->cnf.mtu6 : IPV6_MIN_MTU;
 	rcu_read_unlock();
 
 	return mtu;