mbox series

[net-next,00/10] Add notifications when route hardware flags change

Message ID 20210126132311.3061388-1-idosch@idosch.org
Headers show
Series Add notifications when route hardware flags change | expand

Message

Ido Schimmel Jan. 26, 2021, 1:23 p.m. UTC
From: Ido Schimmel <idosch@nvidia.com>

Routes installed to the kernel can be programmed to capable devices, in
which case they are marked with one of two flags. RTM_F_OFFLOAD for
routes that offload traffic from the kernel and RTM_F_TRAP for routes
that trap packets to the kernel for processing (e.g., host routes).

These flags are of interest to routing daemons since they would like to
delay advertisement of routes until they are installed in hardware. This
allows them to avoid packet loss or misrouted packets. Currently,
routing daemons do not receive any notifications when these flags are
changed, requiring them to poll the kernel tables for changes which is
inefficient.

This series addresses the issue by having the kernel emit RTM_NEWROUTE
notifications whenever these flags change. The behavior is controlled by
two sysctls (net.ipv4.fib_notify_on_flag_change and
net.ipv6.fib_notify_on_flag_change) that default to 0 (no
notifications).

Note that even if route installation in hardware is improved to be more
synchronous, these notifications are still of interest. For example, a
multipath route can change from RTM_F_OFFLOAD to RTM_F_TRAP if its
neighbours become invalid. A routing daemon can choose to withdraw /
replace the route in that case. In addition, the deletion of a route
from the kernel can prompt the installation of an identical route
(already in kernel, with an higher metric) to hardware.

For testing purposes, netdevsim is aligned to simulate a "real" driver
that programs routes to hardware.

Series overview:

Patches #1-#2 align netdevsim to perform route programming in a
non-atomic context

Patches #3-#5 add sysctl to control IPv4 notifications

Patches #6-#8 add sysctl to control IPv6 notifications

Patch #9 extends existing fib tests to set sysctls before running tests

Patch #10 adds test for fib notifications over netdevsim

Amit Cohen (10):
  netdevsim: fib: Convert the current occupancy to an atomic variable
  netdevsim: fib: Perform the route programming in a non-atomic context
  net: ipv4: Pass fib_rt_info as const to fib_dump_info()
  net: ipv4: Publish fib_nlmsg_size()
  net: ipv4: Emit notification when fib hardware flags are changed
  net: Pass 'net' struct as first argument to fib6_info_hw_flags_set()
  net: Do not call fib6_info_hw_flags_set() when IPv6 is disabled
  net: ipv6: Emit notification when fib hardware flags are changed
  selftests: Extend fib tests to run with and without flags
    notifications
  selftests: netdevsim: Add fib_notifications test

 Documentation/networking/ip-sysctl.rst        |  40 ++
 .../ethernet/mellanox/mlxsw/spectrum_router.c |  23 +-
 drivers/net/netdevsim/fib.c                   | 535 ++++++++++++------
 include/net/ip6_fib.h                         |   9 +-
 include/net/netns/ipv4.h                      |   2 +
 include/net/netns/ipv6.h                      |   1 +
 net/ipv4/af_inet.c                            |   2 +
 net/ipv4/fib_lookup.h                         |   3 +-
 net/ipv4/fib_semantics.c                      |   4 +-
 net/ipv4/fib_trie.c                           |  27 +
 net/ipv4/sysctl_net_ipv4.c                    |   9 +
 net/ipv6/af_inet6.c                           |   1 +
 net/ipv6/route.c                              |  44 ++
 net/ipv6/sysctl_net_ipv6.c                    |   9 +
 .../selftests/drivers/net/mlxsw/fib.sh        |  14 +
 .../selftests/drivers/net/netdevsim/fib.sh    |  14 +
 .../net/netdevsim/fib_notifications.sh        | 300 ++++++++++
 17 files changed, 855 insertions(+), 182 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/netdevsim/fib_notifications.sh

Comments

David Ahern Jan. 27, 2021, 4:33 a.m. UTC | #1
On 1/26/21 6:23 AM, Ido Schimmel wrote:
> @@ -889,22 +882,29 @@ static void nsim_nexthop_destroy(struct nsim_nexthop *nexthop)

>  static int nsim_nexthop_account(struct nsim_fib_data *data, u64 occ,

>  				bool add, struct netlink_ext_ack *extack)

>  {

> -	int err = 0;

> +	int i, err = 0;

>  

>  	if (add) {

> -		if (data->nexthops.num + occ <= data->nexthops.max) {

> -			data->nexthops.num += occ;

> -		} else {

> -			err = -ENOSPC;

> -			NL_SET_ERR_MSG_MOD(extack, "Exceeded number of supported nexthops");

> -		}

> +		for (i = 0; i < occ; i++)

> +			if (!atomic64_add_unless(&data->nexthops.num, 1,

> +						 data->nexthops.max)) {


seems like this can be
		if (!atomic64_add_unless(&data->nexthops.num, occ,
					 data->nexthops.max)) {

and then the err_num_decrease is not needed

> +				err = -ENOSPC;

> +				NL_SET_ERR_MSG_MOD(extack, "Exceeded number of supported nexthops");

> +				goto err_num_decrease;

> +			}

>  	} else {

> -		if (WARN_ON(occ > data->nexthops.num))

> +		if (WARN_ON(occ > atomic64_read(&data->nexthops.num)))

>  			return -EINVAL;

> -		data->nexthops.num -= occ;

> +		atomic64_sub(occ, &data->nexthops.num);

>  	}

>  

>  	return err;

> +

> +err_num_decrease:

> +	for (i--; i >= 0; i--)

> +		atomic64_dec(&data->nexthops.num);


and if this path is really needed, why not atomic64_sub here?

> +	return err;

> +

>  }

>  

>  static int nsim_nexthop_add(struct nsim_fib_data *data,

>
David Ahern Jan. 27, 2021, 4:36 a.m. UTC | #2
On 1/26/21 6:23 AM, Ido Schimmel wrote:
> From: Amit Cohen <amcohen@nvidia.com>

> 

> Currently, netdevsim implements dummy FIB offload and marks notified

> routes with RTM_F_TRAP flag. netdevsim does not defer route notifications

> to a work queue because it does not need to program any hardware.

> 

> Given that netdevsim's purpose is to both give an example implementation

> and allow developers to test their code, align netdevsim to a "real"

> hardware device driver like mlxsw and have it also perform the route

> "programming" in a non-atomic context.

> 

> It will be used to test route flags notifications which will be added in

> the next patches.

> 

> The following changes are needed when route handling is performed in WQ:

> - Handle the accounting in the main context, to be able to return an

>   error for adding route when all the routes are used.

>   For FIB_EVENT_ENTRY_REPLACE increase the counter before scheduling

>   the delayed work, and in case that this event replaces an existing route,

>   decrease the counter as part of the delayed work.

> 

> - For IPv6, cannot use fen6_info->rt->fib6_siblings list because it

>   might be changed during handling the delayed work.

>   Save an array with the nexthops as part of fib6_event struct, and take

>   a reference for each nexthop to prevent them from being freed while

>   event is queued.

> 

> - Change GFP_ATOMIC allocations to GFP_KERNEL.

> 

> - Use single work item that is handling a list of ordered routes.

>   Handling routes must be processed in the order they were submitted to

>   avoid logical errors that could lead to unexpected failures.

> 

> Signed-off-by: Amit Cohen <amcohen@nvidia.com>

> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

> ---

>  drivers/net/netdevsim/fib.c | 467 +++++++++++++++++++++++++-----------

>  1 file changed, 327 insertions(+), 140 deletions(-)

> 


Acked-by: David Ahern <dsahern@kernel.org>
David Ahern Jan. 27, 2021, 4:38 a.m. UTC | #3
On 1/26/21 6:23 AM, Ido Schimmel wrote:
> From: Amit Cohen <amcohen@nvidia.com>

> 

> fib_dump_info() does not change 'fri', so pass it as 'const'.

> It will later allow us to invoke fib_dump_info() from

> fib_alias_hw_flags_set().

> 

> Signed-off-by: Amit Cohen <amcohen@nvidia.com>

> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

> ---

>  net/ipv4/fib_lookup.h    | 2 +-

>  net/ipv4/fib_semantics.c | 2 +-

>  2 files changed, 2 insertions(+), 2 deletions(-)

> 


Reviewed-by: David Ahern <dsahern@kernel.org>
David Ahern Jan. 27, 2021, 5:02 a.m. UTC | #4
On 1/26/21 6:23 AM, Ido Schimmel wrote:
> From: Amit Cohen <amcohen@nvidia.com>

> 

> After installing a route to the kernel, user space receives an

> acknowledgment, which means the route was installed in the kernel,

> but not necessarily in hardware.

> 

> The asynchronous nature of route installation in hardware can lead to a

> routing daemon advertising a route before it was actually installed in

> hardware. This can result in packet loss or mis-routed packets until the

> route is installed in hardware.

> 

> It is also possible for a route already installed in hardware to change

> its action and therefore its flags. For example, a host route that is

> trapping packets can be "promoted" to perform decapsulation following

> the installation of an IPinIP/VXLAN tunnel.

> 

> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags

> are changed. The aim is to provide an indication to user-space

> (e.g., routing daemons) about the state of the route in hardware.

> 

> Introduce a sysctl that controls this behavior.

> 

> Keep the default value at 0 (i.e., do not emit notifications) for several

> reasons:

> - Multiple RTM_NEWROUTE notification per-route might confuse existing

>   routing daemons.


are you aware of any routing daemons that are affected? Seems like they
should be able to handle redundant notifications

> - Convergence reasons in routing daemons.

> - The extra notifications will negatively impact the insertion rate.


any numbers on the overhead?

> - Not all users are interested in these notifications.

> 

> Signed-off-by: Amit Cohen <amcohen@nvidia.com>

> Acked-by: Roopa Prabhu <roopa@nvidia.com>

> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

> ---

>  Documentation/networking/ip-sysctl.rst | 20 +++++++++++++++++++

>  include/net/netns/ipv4.h               |  2 ++

>  net/ipv4/af_inet.c                     |  2 ++

>  net/ipv4/fib_trie.c                    | 27 ++++++++++++++++++++++++++

>  net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++

>  5 files changed, 60 insertions(+)

>
David Ahern Jan. 27, 2021, 5:08 a.m. UTC | #5
On 1/26/21 6:23 AM, Ido Schimmel wrote:
> From: Amit Cohen <amcohen@nvidia.com>

> 

> The next patch will emit notification when hardware flags are changed,

> in case that fib_notify_on_flag_change sysctl is set to 1.

> 

> To know sysctl values, net struct is needed.

> This change is consistent with the IPv4 version, which gets 'net' struct

> as its first argument.

> 

> Currently, the only callers of this function are mlxsw and netdevsim.

> Patch the callers to pass net.

> 

> Signed-off-by: Amit Cohen <amcohen@nvidia.com>

> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

> ---

>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  7 ++++---

>  drivers/net/netdevsim/fib.c                        | 14 ++++++++------

>  include/net/ip6_fib.h                              |  5 +++--

>  3 files changed, 15 insertions(+), 11 deletions(-)

> 


Reviewed-by: David Ahern <dsahern@kernel.org>
Amit Cohen Jan. 27, 2021, 10:51 a.m. UTC | #6
>-----Original Message-----

>From: David Ahern <dsahern@gmail.com>

>Sent: Wednesday, January 27, 2021 6:33

>To: Ido Schimmel <idosch@idosch.org>; netdev@vger.kernel.org

>Cc: davem@davemloft.net; kuba@kernel.org; Amit Cohen <amcohen@nvidia.com>; Roopa Prabhu <roopa@nvidia.com>; Donald

>Sharp <sharpd@nvidia.com>; Benjamin Poirier <bpoirier@nvidia.com>; mlxsw <mlxsw@nvidia.com>; Ido Schimmel

><idosch@nvidia.com>

>Subject: Re: [PATCH net-next 01/10] netdevsim: fib: Convert the current occupancy to an atomic variable

>

>On 1/26/21 6:23 AM, Ido Schimmel wrote:

>> @@ -889,22 +882,29 @@ static void nsim_nexthop_destroy(struct

>> nsim_nexthop *nexthop)  static int nsim_nexthop_account(struct nsim_fib_data *data, u64 occ,

>>  				bool add, struct netlink_ext_ack *extack)  {

>> -	int err = 0;

>> +	int i, err = 0;

>>

>>  	if (add) {

>> -		if (data->nexthops.num + occ <= data->nexthops.max) {

>> -			data->nexthops.num += occ;

>> -		} else {

>> -			err = -ENOSPC;

>> -			NL_SET_ERR_MSG_MOD(extack, "Exceeded number of supported nexthops");

>> -		}

>> +		for (i = 0; i < occ; i++)

>> +			if (!atomic64_add_unless(&data->nexthops.num, 1,

>> +						 data->nexthops.max)) {

>

>seems like this can be

>		if (!atomic64_add_unless(&data->nexthops.num, occ,

>					 data->nexthops.max)) {


atomic64_add_unless(x, y, z) adds y to x if x was not already z.
Which means that when for example num=2, occ=2, max=3:
atomic64_add_unless(&data->nexthops.num, occ, data->nexthops.max) won't fail when it should.

This situation is realistic and actually with atomic64_add_unless() selftests/drivers/net/netdevsim/nexthop.sh fails.

>

>and then the err_num_decrease is not needed

>

>> +				err = -ENOSPC;

>> +				NL_SET_ERR_MSG_MOD(extack, "Exceeded number of supported nexthops");

>> +				goto err_num_decrease;

>> +			}

>>  	} else {

>> -		if (WARN_ON(occ > data->nexthops.num))

>> +		if (WARN_ON(occ > atomic64_read(&data->nexthops.num)))

>>  			return -EINVAL;

>> -		data->nexthops.num -= occ;

>> +		atomic64_sub(occ, &data->nexthops.num);

>>  	}

>>

>>  	return err;

>> +

>> +err_num_decrease:

>> +	for (i--; i >= 0; i--)

>> +		atomic64_dec(&data->nexthops.num);

>

>and if this path is really needed, why not atomic64_sub here?

>

>> +	return err;

>> +

>>  }

>>

>>  static int nsim_nexthop_add(struct nsim_fib_data *data,

>>
Amit Cohen Jan. 27, 2021, 11:46 a.m. UTC | #7
>-----Original Message-----

>From: David Ahern <dsahern@gmail.com>

>Sent: Wednesday, January 27, 2021 7:03

>To: Ido Schimmel <idosch@idosch.org>; netdev@vger.kernel.org

>Cc: davem@davemloft.net; kuba@kernel.org; Amit Cohen <amcohen@nvidia.com>; Roopa Prabhu <roopa@nvidia.com>; Donald

>Sharp <sharpd@nvidia.com>; Benjamin Poirier <bpoirier@nvidia.com>; mlxsw <mlxsw@nvidia.com>; Ido Schimmel

><idosch@nvidia.com>

>Subject: Re: [PATCH net-next 05/10] net: ipv4: Emit notification when fib hardware flags are changed

>

>On 1/26/21 6:23 AM, Ido Schimmel wrote:

>> From: Amit Cohen <amcohen@nvidia.com>

>>

>> After installing a route to the kernel, user space receives an

>> acknowledgment, which means the route was installed in the kernel, but

>> not necessarily in hardware.

>>

>> The asynchronous nature of route installation in hardware can lead to

>> a routing daemon advertising a route before it was actually installed

>> in hardware. This can result in packet loss or mis-routed packets

>> until the route is installed in hardware.

>>

>> It is also possible for a route already installed in hardware to

>> change its action and therefore its flags. For example, a host route

>> that is trapping packets can be "promoted" to perform decapsulation

>> following the installation of an IPinIP/VXLAN tunnel.

>>

>> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP

>> flags are changed. The aim is to provide an indication to user-space

>> (e.g., routing daemons) about the state of the route in hardware.

>>

>> Introduce a sysctl that controls this behavior.

>>

>> Keep the default value at 0 (i.e., do not emit notifications) for

>> several

>> reasons:

>> - Multiple RTM_NEWROUTE notification per-route might confuse existing

>>   routing daemons.

>

>are you aware of any routing daemons that are affected? Seems like they should be able to handle redundant notifications


Actually no, we didn't check all the existing daemons, just assume that not everyone will want to activate the notifications at all.
So there is no point in sending notifications for users which aren't interested in them.

>

>> - Convergence reasons in routing daemons.

>> - The extra notifications will negatively impact the insertion rate.

>

>any numbers on the overhead?


For addition of 256k routes in mlxsw, the overhead is 3.6% of the total time.

>

>> - Not all users are interested in these notifications.

>>

>> Signed-off-by: Amit Cohen <amcohen@nvidia.com>

>> Acked-by: Roopa Prabhu <roopa@nvidia.com>

>> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

>> ---

>>  Documentation/networking/ip-sysctl.rst | 20 +++++++++++++++++++

>>  include/net/netns/ipv4.h               |  2 ++

>>  net/ipv4/af_inet.c                     |  2 ++

>>  net/ipv4/fib_trie.c                    | 27 ++++++++++++++++++++++++++

>>  net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++

>>  5 files changed, 60 insertions(+)

>>
David Ahern Jan. 28, 2021, 3:34 a.m. UTC | #8
On 1/26/21 6:23 AM, Ido Schimmel wrote:
> From: Amit Cohen <amcohen@nvidia.com>

> 

> After installing a route to the kernel, user space receives an

> acknowledgment, which means the route was installed in the kernel,

> but not necessarily in hardware.

> 

> The asynchronous nature of route installation in hardware can lead to a

> routing daemon advertising a route before it was actually installed in

> hardware. This can result in packet loss or mis-routed packets until the

> route is installed in hardware.

> 

> It is also possible for a route already installed in hardware to change

> its action and therefore its flags. For example, a host route that is

> trapping packets can be "promoted" to perform decapsulation following

> the installation of an IPinIP/VXLAN tunnel.

> 

> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags

> are changed. The aim is to provide an indication to user-space

> (e.g., routing daemons) about the state of the route in hardware.

> 

> Introduce a sysctl that controls this behavior.

> 

> Keep the default value at 0 (i.e., do not emit notifications) for several

> reasons:

> - Multiple RTM_NEWROUTE notification per-route might confuse existing

>   routing daemons.

> - Convergence reasons in routing daemons.

> - The extra notifications will negatively impact the insertion rate.

> - Not all users are interested in these notifications.

> 

> Signed-off-by: Amit Cohen <amcohen@nvidia.com>

> Acked-by: Roopa Prabhu <roopa@nvidia.com>

> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

> ---

>  Documentation/networking/ip-sysctl.rst | 20 +++++++++++++++++++

>  include/net/netns/ipv4.h               |  2 ++

>  net/ipv4/af_inet.c                     |  2 ++

>  net/ipv4/fib_trie.c                    | 27 ++++++++++++++++++++++++++

>  net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++

>  5 files changed, 60 insertions(+)

> 


Reviewed-by: David Ahern <dsahern@kernel.org>
David Ahern Jan. 28, 2021, 3:42 a.m. UTC | #9
On 1/27/21 3:51 AM, Amit Cohen wrote:
> 

> 

>> -----Original Message-----

>> From: David Ahern <dsahern@gmail.com>

>> Sent: Wednesday, January 27, 2021 6:33

>> To: Ido Schimmel <idosch@idosch.org>; netdev@vger.kernel.org

>> Cc: davem@davemloft.net; kuba@kernel.org; Amit Cohen <amcohen@nvidia.com>; Roopa Prabhu <roopa@nvidia.com>; Donald

>> Sharp <sharpd@nvidia.com>; Benjamin Poirier <bpoirier@nvidia.com>; mlxsw <mlxsw@nvidia.com>; Ido Schimmel

>> <idosch@nvidia.com>

>> Subject: Re: [PATCH net-next 01/10] netdevsim: fib: Convert the current occupancy to an atomic variable

>>

>> On 1/26/21 6:23 AM, Ido Schimmel wrote:

>>> @@ -889,22 +882,29 @@ static void nsim_nexthop_destroy(struct

>>> nsim_nexthop *nexthop)  static int nsim_nexthop_account(struct nsim_fib_data *data, u64 occ,

>>>  				bool add, struct netlink_ext_ack *extack)  {

>>> -	int err = 0;

>>> +	int i, err = 0;

>>>

>>>  	if (add) {

>>> -		if (data->nexthops.num + occ <= data->nexthops.max) {

>>> -			data->nexthops.num += occ;

>>> -		} else {

>>> -			err = -ENOSPC;

>>> -			NL_SET_ERR_MSG_MOD(extack, "Exceeded number of supported nexthops");

>>> -		}

>>> +		for (i = 0; i < occ; i++)

>>> +			if (!atomic64_add_unless(&data->nexthops.num, 1,

>>> +						 data->nexthops.max)) {

>>

>> seems like this can be

>> 		if (!atomic64_add_unless(&data->nexthops.num, occ,

>> 					 data->nexthops.max)) {

> 

> atomic64_add_unless(x, y, z) adds y to x if x was not already z.

> Which means that when for example num=2, occ=2, max=3:

> atomic64_add_unless(&data->nexthops.num, occ, data->nexthops.max) won't fail when it should.

> 


ok, missed that in the description. I thought it was if the total would
equal or be greater than z.
Jakub Kicinski Jan. 29, 2021, 3:04 a.m. UTC | #10
On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote:
> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags

> are changed. The aim is to provide an indication to user-space

> (e.g., routing daemons) about the state of the route in hardware.


What does the daemon in the user space do with it?

The notification will only be generated for the _first_ ASIC which
offloaded the object. Which may be fine for you today but as an uAPI 
it feels slightly lacking.

If the user space just wants to make sure the devices are synced to
notifications from certain stage, wouldn't it be more idiomatic to
provide some "fence" operation?

WDYT? David?
David Ahern Jan. 29, 2021, 3:33 a.m. UTC | #11
On 1/28/21 8:04 PM, Jakub Kicinski wrote:
> On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote:

>> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags

>> are changed. The aim is to provide an indication to user-space

>> (e.g., routing daemons) about the state of the route in hardware.

> 

> What does the daemon in the user space do with it?


You don't want FRR for example to advertise a route to a peer until it
is really programmed in h/w. This notification gives routing daemons
that information.

> 

> The notification will only be generated for the _first_ ASIC which

> offloaded the object. Which may be fine for you today but as an uAPI 

> it feels slightly lacking.

> 

> If the user space just wants to make sure the devices are synced to

> notifications from certain stage, wouldn't it be more idiomatic to

> provide some "fence" operation?

> 

> WDYT? David?

> 


This feature was first discussed I think about 2 years ago - when I was
still with Cumulus, so I already knew the intent and end goal.

I think support for multiple ASICs / NICs doing this kind of offload
will have a whole lot of challenges. I don't think this particular user
notification is going to be a big problem - e.g., you could always delay
the emit until all have indicated the offload.
Jakub Kicinski Jan. 29, 2021, 4:15 a.m. UTC | #12
On Thu, 28 Jan 2021 20:33:22 -0700 David Ahern wrote:
> On 1/28/21 8:04 PM, Jakub Kicinski wrote:

> > On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote:  

> >> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags

> >> are changed. The aim is to provide an indication to user-space

> >> (e.g., routing daemons) about the state of the route in hardware.  

> > 

> > What does the daemon in the user space do with it?  

> 

> You don't want FRR for example to advertise a route to a peer until it

> is really programmed in h/w. This notification gives routing daemons

> that information.


I see. Hm.

> > The notification will only be generated for the _first_ ASIC which

> > offloaded the object. Which may be fine for you today but as an uAPI 

> > it feels slightly lacking.

> > 

> > If the user space just wants to make sure the devices are synced to

> > notifications from certain stage, wouldn't it be more idiomatic to

> > provide some "fence" operation?

> > 

> > WDYT? David?

> 

> This feature was first discussed I think about 2 years ago - when I was

> still with Cumulus, so I already knew the intent and end goal.

> 

> I think support for multiple ASICs / NICs doing this kind of offload

> will have a whole lot of challenges. I don't think this particular user

> notification is going to be a big problem - e.g., you could always delay

> the emit until all have indicated the offload.


My impression from working on this problem in TC is that the definition
of "all" becomes problematic especially if one takes into account
drivers getting reloaded. But I think routing offload has stronger
semantics than TC, so no objections.

We need a respin for the somewhat embarrassing loop in patch 1, tho.
Ido Schimmel Jan. 30, 2021, 3:11 p.m. UTC | #13
On Thu, Jan 28, 2021 at 08:33:22PM -0700, David Ahern wrote:
> On 1/28/21 8:04 PM, Jakub Kicinski wrote:

> > On Tue, 26 Jan 2021 15:23:06 +0200 Ido Schimmel wrote:

> >> Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags

> >> are changed. The aim is to provide an indication to user-space

> >> (e.g., routing daemons) about the state of the route in hardware.

> > 

> > What does the daemon in the user space do with it?

> 

> You don't want FRR for example to advertise a route to a peer until it

> is really programmed in h/w. This notification gives routing daemons

> that information.


Correct. It is in the cover letter:

"These flags are of interest to routing daemons since they would like to
delay advertisement of routes until they are installed in hardware."

Amit is working on follow-up to emit notifications when route offload
fails. This request also comes from the FRR team. Currently we have a
policy inside mlxsw to abort route offload and install a default route
that sends all the traffic to the CPU. It obviously kills the box and
anyway the policy is something user space should decide, not the kernel.

> 

> > 

> > The notification will only be generated for the _first_ ASIC which

> > offloaded the object. Which may be fine for you today but as an uAPI 

> > it feels slightly lacking.

> > 

> > If the user space just wants to make sure the devices are synced to

> > notifications from certain stage, wouldn't it be more idiomatic to

> > provide some "fence" operation?

> > 

> > WDYT? David?

> > 

> 

> This feature was first discussed I think about 2 years ago - when I was

> still with Cumulus, so I already knew the intent and end goal.

> 

> I think support for multiple ASICs / NICs doing this kind of offload

> will have a whole lot of challenges. I don't think this particular user

> notification is going to be a big problem - e.g., you could always delay

> the emit until all have indicated the offload.


I do not have experience with multi-ASIC systems, but my understanding
is that each ASIC has its own copy of the networking stack and the ASICs
are connected via front panel or backplane ports, like distinct
leaf/spine switches. In Linux, such a system can be supported by
registering a devlink instance for each ASIC and reloading each instance
to a separate namespace.

Thanks for reviewing, David
Ido Schimmel Jan. 30, 2021, 3:36 p.m. UTC | #14
On Thu, Jan 28, 2021 at 08:15:45PM -0800, Jakub Kicinski wrote:
> My impression from working on this problem in TC is that the definition

> of "all" becomes problematic especially if one takes into account

> drivers getting reloaded. But I think routing offload has stronger

> semantics than TC, so no objections.


During the teardown phase of the reload, all the routes using the
driver's netdevs (or their uppers) will be deleted by the kernel because
the netdevs will disappear. During the init phase of the reload, the
driver will re-register its FIB notifier and ask for a dump of all the
existing routes (usually only host routes). With this patchset, user
space will receive a notification that these routes are now in hardware.

# ip monitor route
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1 
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1 
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1 
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1 
broadcast 10.209.1.255 dev eth0 table local proto kernel scope link src 10.209.0.191 
local 10.209.0.191 dev eth0 table local proto kernel scope host src 10.209.0.191 
broadcast 10.209.0.0 dev eth0 table local proto kernel scope link src 10.209.0.191 
10.209.0.1 dev eth0 proto dhcp scope link src 10.209.0.191 metric 1024 
10.209.0.0/23 dev eth0 proto kernel scope link src 10.209.0.191 
default via 10.209.0.1 dev eth0 proto dhcp src 10.209.0.191 metric 1024 
<< init phase starts here >>
default via 10.209.0.1 dev eth0 proto dhcp src 10.209.0.191 metric 1024 rt_trap 
10.209.0.0/23 dev eth0 proto kernel scope link src 10.209.0.191 rt_trap 
10.209.0.1 dev eth0 proto dhcp scope link src 10.209.0.191 metric 1024 rt_trap 
broadcast 10.209.0.0 dev eth0 table local proto kernel scope link src 10.209.0.191 rt_trap 
local 10.209.0.191 dev eth0 table local proto kernel scope host src 10.209.0.191 rt_trap 
broadcast 10.209.1.255 dev eth0 table local proto kernel scope link src 10.209.0.191 rt_trap 
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1 rt_trap 
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1 rt_trap 
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1 rt_trap 
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1 rt_trap