[PATCHv2,net-next,00/14] sctp: implement RFC8899: Packetization Layer Path MTU Discovery for SCTP transport

Message ID	cover.1624384990.git.lucien.xin@gmail.com
Headers	show Return-Path: <netdev-owner@kernel.org> From: Xin Long <lucien.xin@gmail.com> To: network dev <netdev@vger.kernel.org>, davem@davemloft.net, kuba@kernel.org, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>, linux-sctp@vger.kernel.org Subject: [PATCHv2 net-next 00/14] sctp: implement RFC8899: Packetization Layer Path MTU Discovery for SCTP transport Date: Tue, 22 Jun 2021 14:04:46 -0400 Message-Id: <cover.1624384990.git.lucien.xin@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	sctp: implement RFC8899: Packetization Layer Path MTU Discovery for SCTP transport \| expand [PATCHv2,net-next,00/14] sctp: implement RFC8899: Packetization Layer Path MTU Discovery for SCTP... [PATCHv2,net-next,01/14] sctp: add pad chunk and its make function and event table [PATCHv2,net-next,02/14] sctp: add probe_interval in sysctl and sock/asoc/transport [PATCHv2,net-next,03/14] sctp: add SCTP_PLPMTUD_PROBE_INTERVAL sockopt for sock/asoc/transport [PATCHv2,net-next,04/14] sctp: add the constants/variables and states and some APIs for transport [PATCHv2,net-next,05/14] sctp: add the probe timer in transport for PLPMTUD [PATCHv2,net-next,06/14] sctp: do the basic send and recv for PLPMTUD probe [PATCHv2,net-next,07/14] sctp: do state transition when PROBE_COUNT == MAX_PROBES on HB send path [PATCHv2,net-next,08/14] sctp: do state transition when a probe succeeds on HB ACK recv path [PATCHv2,net-next,09/14] sctp: do state transition when receiving an icmp TOOBIG packet [PATCHv2,net-next,10/14] sctp: enable PLPMTUD when the transport is ready [PATCHv2,net-next,11/14] sctp: remove the unessessary hold for idev in sctp_v6_err [PATCHv2,net-next,12/14] sctp: extract sctp_v6_err_handle function from sctp_v6_err [PATCHv2,net-next,13/14] sctp: extract sctp_v4_err_handle function from sctp_v4_err [PATCHv2,net-next,14/14] sctp: process sctp over udp icmp err on sctp side

Message ID

cover.1624384990.git.lucien.xin@gmail.com

Headers

From: Xin Long <lucien.xin@gmail.com>
To: network dev <netdev@vger.kernel.org>, davem@davemloft.net,
	kuba@kernel.org, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>,
	linux-sctp@vger.kernel.org
Subject: [PATCHv2 net-next 00/14] sctp: implement RFC8899: Packetization
	Layer Path MTU Discovery for SCTP transport
Date: Tue, 22 Jun 2021 14:04:46 -0400
Message-Id: <cover.1624384990.git.lucien.xin@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

sctp: implement RFC8899: Packetization Layer Path MTU Discovery for SCTP transport | expand

Message

Xin Long June 22, 2021, 6:04 p.m. UTC

Overview(From RFC8899):

  In contrast to PMTUD, Packetization Layer Path MTU Discovery
  (PLPMTUD) [RFC4821] introduces a method that does not rely upon
  reception and validation of PTB messages.  It is therefore more
  robust than Classical PMTUD.  This has become the recommended
  approach for implementing discovery of the PMTU [BCP145].

  It uses a general strategy in which the PL sends probe packets to
  search for the largest size of unfragmented datagram that can be sent
  over a network path.  Probe packets are sent to explore using a
  larger packet size.  If a probe packet is successfully delivered (as
  determined by the PL), then the PLPMTU is raised to the size of the
  successful probe.  If a black hole is detected (e.g., where packets
  of size PLPMTU are consistently not received), the method reduces the
  PLPMTU.

SCTP Probe Packets:

  As the RFC suggested, the probe packets consist of an SCTP common header
  followed by a HEARTBEAT chunk and a PAD chunk. The PAD chunk is used to
  control the length of the probe packet.  The HEARTBEAT chunk is used to
  trigger the sending of a HEARTBEAT ACK chunk to confirm this probe on
  the HEARTBEAT sender.

  The HEARTBEAT chunk also carries a Heartbeat Information parameter that
  includes the probe size to help an implementation associate a HEARTBEAT
  ACK with the size of probe that was sent. The sender use the nonce and
  the probe size to verify the information returned.

Detailed Implementation on SCTP:

                       +------+
              +------->| Base |-----------------+ Connectivity
              |        +------+                 | or BASE_PLPMTU
              |           |                     | confirmation failed
              |           |                     v
              |           | Connectivity    +-------+
              |           | and BASE_PLPMTU | Error |
              |           | confirmed       +-------+
              |           |                     | Consistent
              |           v                     | connectivity
   Black Hole |       +--------+                | and BASE_PLPMTU
    detected  |       | Search |<---------------+ confirmed
              |       +--------+
              |          ^  |
              |          |  |
              |    Raise |  | Search
              |    timer |  | algorithm
              |  expired |  | completed
              |          |  |
              |          |  v
              |   +-----------------+
              +---| Search Complete |
                  +-----------------+

  When PLPMTUD is enabled, it's in Base state, and starts to probe with
  BASE_PLPMTU (1200). If this probe succeeds, it goes to Search state;
  If this probe fails, it goes to Error state under which pl.pmtu goes
  down to MIN_PLPMTU (512) and keeps probing with BASE_PLPMTU until it
  succeeds and goes to Search state.

  During the Search state, the probe size is growing by a Big step (32)
  every time when the last probe succeeds at the beginning. Once a probe
  (such as 1420) fails after trying MAX_PROBES (3) times, the probe_size
  goes back to the last one (1420 - 32 = 1388), meanwhile 'probe_high'
  is set to 1420 and the growing step becomes a Small one (4). Then the
  probe is continuing with a Small step grown each round. Until it gets
  the optimal size (such as 1400) when probe with its next probe size
  (1404) fails, it sync this size to pathmtu and goes to Complete state.

  In Complete state, it will only does a probe check for the pathmtu just
  set, if it fails, which means a Black Hole is detected and it goes back
  to Base state. If it succeeds, it goes back to Search state again, and
  probe is continuing with growing a Small step (1400 + 4). If this probe
  fails, probe_high is set and goes back to 1388 and then Complete state,
  which is kind of a loop normally. However if the env's pathmtu changes
  to a big size somehow, this probe will succeed and then probe continues
  with growing a Big step (1400 + 32) each round until another probe fails.

PTB Messages Process:

  PLPMTUD doesn't rely on these package to find the pmtu, and shouldn't
  trust it either. When processing them, it only changes the probe_size
  to PL_PTB_SIZE(info - hlen) if 'pl.pmtu < PL_PTB_SIZE < the current
  probe_size' druing Search state. As this could help probe_size to get
  to the optimal size faster, for exmaple:

  pl.pmtu = 1388, probe_size = 1420, while the env's pathmtu = 1400.
  When probe_size is 1420, a Toobig packet with 1400 comes back. If probe
  size changes to use 1400, it will save quite a few rounds to get there.
  But of course after having this value, PLPMTUD will still verify it on
  its own before using it.

Patches:

  - Patch 1-6: introduce some new constants/variables from the RFC, systcl
    and members in transport, APIs for the following patches, chunks and
    a timer for the probe sending and some codes for the probe receiving.

  - Patch 7-9: implement the state transition on the tx path, rx path and
    toobig ICMP packet processing. This is the main algorithm part.

  - Patch 10: activate this feature

  - Patch 11-14: improve the process for ICMP packets for SCTP over UDP,
    so that it can also be covered by this feature.

Tests:

  - do sysctl and setsockopt tests for this feature's enabling and disabling.

  - get these pr_debug points for this feature by
      # cat /sys/kernel/debug/dynamic_debug/control | grep PLP
    and enable them on kernel dynamic debug, then play with the pathmtu and
    check if the state transition and plpmtu change match the RFC.

  - do the above tests for SCTP over IPv4/IPv6 and SCTP over UDP.

v1->v2:
  - See Patch 06/14.

Xin Long (14):
  sctp: add pad chunk and its make function and event table
  sctp: add probe_interval in sysctl and sock/asoc/transport
  sctp: add SCTP_PLPMTUD_PROBE_INTERVAL sockopt for sock/asoc/transport
  sctp: add the constants/variables and states and some APIs for
    transport
  sctp: add the probe timer in transport for PLPMTUD
  sctp: do the basic send and recv for PLPMTUD probe
  sctp: do state transition when PROBE_COUNT == MAX_PROBES on HB send
    path
  sctp: do state transition when a probe succeeds on HB ACK recv path
  sctp: do state transition when receiving an icmp TOOBIG packet
  sctp: enable PLPMTUD when the transport is ready
  sctp: remove the unessessary hold for idev in sctp_v6_err
  sctp: extract sctp_v6_err_handle function from sctp_v6_err
  sctp: extract sctp_v4_err_handle function from sctp_v4_err
  sctp: process sctp over udp icmp err on sctp side

 Documentation/networking/ip-sysctl.rst |   8 ++
 include/linux/sctp.h                   |   7 ++
 include/net/netns/sctp.h               |   3 +
 include/net/sctp/command.h             |   1 +
 include/net/sctp/constants.h           |  20 ++++
 include/net/sctp/sctp.h                |  57 ++++++++-
 include/net/sctp/sm.h                  |   6 +-
 include/net/sctp/structs.h             |  19 +++
 include/uapi/linux/sctp.h              |   8 ++
 net/sctp/associola.c                   |   6 +
 net/sctp/debug.c                       |   1 +
 net/sctp/input.c                       | 132 ++++++++++++---------
 net/sctp/ipv6.c                        | 112 +++++++++++-------
 net/sctp/output.c                      |  33 +++++-
 net/sctp/outqueue.c                    |  13 ++-
 net/sctp/protocol.c                    |  21 +---
 net/sctp/sm_make_chunk.c               |  31 ++++-
 net/sctp/sm_sideeffect.c               |  37 ++++++
 net/sctp/sm_statefuns.c                |  37 +++++-
 net/sctp/sm_statetable.c               |  43 +++++++
 net/sctp/socket.c                      | 123 ++++++++++++++++++++
 net/sctp/sysctl.c                      |  35 ++++++
 net/sctp/transport.c                   | 153 ++++++++++++++++++++++++-
 23 files changed, 779 insertions(+), 127 deletions(-)

Comments

Xin Long June 23, 2021, 1:09 a.m. UTC | #1

On Tue, Jun 22, 2021 at 6:13 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Xin Long
> > Sent: 22 June 2021 19:05
> >
> > Overview(From RFC8899):
> >
> >   In contrast to PMTUD, Packetization Layer Path MTU Discovery
> >   (PLPMTUD) [RFC4821] introduces a method that does not rely upon
> >   reception and validation of PTB messages.  It is therefore more
> >   robust than Classical PMTUD.  This has become the recommended
> >   approach for implementing discovery of the PMTU [BCP145].
> >
> >   It uses a general strategy in which the PL sends probe packets to
> >   search for the largest size of unfragmented datagram that can be sent
> >   over a network path.  Probe packets are sent to explore using a
> >   larger packet size.  If a probe packet is successfully delivered (as
> >   determined by the PL), then the PLPMTU is raised to the size of the
> >   successful probe.  If a black hole is detected (e.g., where packets
> >   of size PLPMTU are consistently not received), the method reduces the
> >   PLPMTU.
>
> This seems to take a long time (probably well over a minute)
> to determine the mtu.
I just noticed this is a misread of RFC8899, and the next probe packet
should be sent immediately once the ACK of the last probe is received,
instead of waiting the timeout, which should be for the missing probe.

I will fix this with:

diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index d29b579da904..f3aca1acf93a 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -1275,6 +1275,8 @@ enum sctp_disposition
sctp_sf_backbeat_8_3(struct net *net,
                        return SCTP_DISPOSITION_DISCARD;

                sctp_transport_pl_recv(link);
+               sctp_add_cmd_sf(commands, SCTP_CMD_PROBE_TIMER_UPDATE,
+                               SCTP_TRANSPORT(link));
                return SCTP_DISPOSITION_CONSUME;
        }

diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index f27b856ea8ce..88815b98d9d0 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -215,6 +215,11 @@ void sctp_transport_reset_probe_timer(struct
sctp_transport *transport)
 {
        int scale = 1;

+       if (transport->pl.probe_count == 0) {
+               if (!mod_timer(&transport->probe_timer, jiffies +
transport->rto))
+                       sctp_transport_hold(transport);
+               return;
+       }
        if (timer_pending(&transport->probe_timer))
                return;
        if (transport->pl.state == SCTP_PL_COMPLETE &&

Thanks for the comment.
>
> What is used for the actual mtu while this is in progress?
>
> Does packet loss and packet retransmission cause the mtu
> to be reduced as well?
No, the data packet is not a probe in this implementation.

>
> I can imagine that there is an expectation (from the application)
> that the mtu is that of an ethernet link - perhaps less a PPPoE
> header.
> Starting with an mtu of 1200 will break this assumption and may
> have odd side effects.
Starting searching from mtu of 1200, but the real pmtu will only be updated
when the search is done and optimal mtu is found.
So at the beginning, it will still use the dst mtu as before.

> For TCP/UDP the ICMP segmentation required error is immediate
> and gets used for the retransmissions.
> This code seems to be looking at separate timeouts - so a lot of
> packets could get discarded and application timers expire before
> if determines the correct mtu.
This patch will also process ICMP error msg, and gets the 'mtu' size from it
but before using it, it will verify(probe) it first:

see Patch: sctp: do state transition when receiving an icmp TOOBIG packet

>
> Maybe I missed something about this only being done on inactive
> paths?
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

Xin Long June 23, 2021, 3:59 p.m. UTC | #2

On Wed, Jun 23, 2021 at 5:50 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Xin Long
> > Sent: 23 June 2021 04:49
> ...
> > [103] pl_send: PLPMTUD: state: 1, size: 1200, high: 0 <--[a]
> > [103] pl_recv: PLPMTUD: state: 1, size: 1200, high: 0
> ...
> > [103] pl_send: PLPMTUD: state: 2, size: 1456, high: 0
> > [103] pl_recv: PLPMTUD: state: 2, size: 1456, high: 0  <--[b]
> > [103] pl_send: PLPMTUD: state: 2, size: 1488, high: 0
> > [108] pl_send: PLPMTUD: state: 2, size: 1488, high: 0
> > [113] pl_send: PLPMTUD: state: 2, size: 1488, high: 0
> > [118] pl_send: PLPMTUD: state: 2, size: 1488, high: 0
> > [118] pl_recv: PLPMTUD: state: 2, size: 1456, high: 1488 <---[c]
> > [118] pl_send: PLPMTUD: state: 2, size: 1460, high: 1488
> > [118] pl_recv: PLPMTUD: state: 2, size: 1460, high: 1488 <--- [d]
> > [118] pl_send: PLPMTUD: state: 2, size: 1464, high: 1488
> > [124] pl_send: PLPMTUD: state: 2, size: 1464, high: 1488
> > [129] pl_send: PLPMTUD: state: 2, size: 1464, high: 1488
> > [134] pl_send: PLPMTUD: state: 2, size: 1464, high: 1488
> > [134] pl_recv: PLPMTUD: state: 2, size: 1460, high: 1464 <-- around
> > 30s "search complete from 1200 bytes"
> > [287] pl_send: PLPMTUD: state: 3, size: 1460, high: 0
> > [287] pl_recv: PLPMTUD: state: 3, size: 1460, high: 0
> > [287] pl_send: PLPMTUD: state: 2, size: 1464, high: 0 <-- [aa]
> > [292] pl_send: PLPMTUD: state: 2, size: 1464, high: 0
> > [298] pl_send: PLPMTUD: state: 2, size: 1464, high: 0
> > [303] pl_send: PLPMTUD: state: 2, size: 1464, high: 0
> > [303] pl_recv: PLPMTUD: state: 2, size: 1460, high: 1464  <--[bb]  <--
> > around 15s "re-search complete from current pmtu"
> >
> > So since no interval to send the next probe when the ACK is received
> > for the last one,
> > it won't take much time from [a] to [b], and [c] to [d],
> > and there are at most 2 failures to find the right pmtu, each failure
> > takes 5s * 3 = 15s.
> >
> > when it goes back to search from search complete after a long timeout,
> > it will take only 1 failure to get the right pmtu from [aa] to [bb].
>
> What mtu is being used during the 'failures' ?
> I hope it is the last working one.
Yes, it's always the working one, which was set in search complete state.
More specifically, it changes in 3 cases:
a) at the beginning, it's using the dst->mtu;
b) set to the optimal one in search complete state after searching is done.
c) 'black hole' found, it sets to 1200, and starts to probe from 1200.
d) if still fails with 1200, the mtu is set to MIN_PLPMTU, but still
probes with 1200 until it succeeds.

>
> Also, what actually happen if the network route changes from
> one that supports 1460 bytes to one that only supports 1200
> and where ICMP errors are not generated?
then it's the case c) and d) above.

For the "black hole" detection, I'd like to probe it with the current
mtu and 3 times (15s in total)

>
> The first protocol retry is (probably) after 2 seconds.
> But it will use the 1460 byte mtu and fail again.
yeah, but 15s is the time we have to wait for confirming this is
caused by really the mtu change.

>
> Notwithstanding the standards, what pmtu actually exist
> 'in the wild' for normal networks?
PLPMTUD is trying to probe the max payload of SCTP, not
the real pmtu.

> Are there actually any others apart from 'full sized ethernet'
> and 'PPPoE'?
sctp over UDP if that's what you mean.
With the probe, we don't really care about the outer header, because
what we get with the probe is the real available size for sctp payload.

Like we probe with 1400, if it gets ACKed from the peer, the payload 1400
will be able to go though the path.

> So would it actually better to send two probes one for
> each of those two sizes and see which ones respond?

we could only if regardless of the RFC8899:

   "To avoid excessive load, the interval between individual probe
   packets MUST be at least one RTT..."

>
> (I'm not sure we ever manage to send full length packets.
> Our data is M3UA (mostly SMS) and sent with Nagle disabled.
> So even the customers sending 1000s of SMS/sec are unlikely
> to fill packets.)
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)