[net-next,v2,0/5] seg6: add support for SRv6 End.DT4 behavior

Message ID 20201107153139.3552-1-andrea.mayer@uniroma2.it

Message

Andrea Mayer Nov. 7, 2020, 3:31 p.m. UTC
This patchset provides support for the SRv6 End.DT4 behavior.

The SRv6 End.DT4 behavior is used to implement multi-tenant IPv4 L3 VPNs. It
decapsulates the received packets and performs an IPv4 routing lookup in the
routing table of the tenant. The Linux implementation of SRv6 End.DT4
leverages a VRF device. The behavior is defined in the SRv6 Network
Programming draft [1].

- Patch 1/5 is needed to solve a pre-existing issue with tunneled packets
  when a sniffer is attached;

- Patch 2/5 improves the management of the seg6local attributes used by the
  SRv6 behaviors;

- Patch 3/5 introduces two callbacks used for customizing the
  creation/destruction of a SRv6 behavior;

- Patch 4/5 is the core patch that adds support for the SRv6 End.DT4
  behavior;

- Patch 5/5 adds the selftest for SRv6 End.DT4 behavior.

I would like to thank David Ahern for his support during the development of
this patch set.

Comments, suggestions and improvements are very welcome!

Thanks,
Andrea Mayer

v2
 no changes made: resubmitted after false build report.

v1
 improve comments;

 add new patch 2/5 titled: seg6: improve management of behavior attributes

 seg6: add support for the SRv6 End.DT4 behavior
  - remove the inline keyword in the definition of fib6_config_get_net().

 selftests: add selftest for the SRv6 End.DT4 behavior
  - add check for the vrf sysctl

[1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming

Andrea Mayer (5):
  vrf: add mac header for tunneled packets when sniffer is attached
  seg6: improve management of behavior attributes
  seg6: add callbacks for customizing the creation/destruction of a
    behavior
  seg6: add support for the SRv6 End.DT4 behavior
  selftests: add selftest for the SRv6 End.DT4 behavior

 drivers/net/vrf.c                             |  78 ++-
 net/ipv6/seg6_local.c                         | 370 ++++++++++++-
 .../selftests/net/srv6_end_dt4_l3vpn_test.sh  | 494 ++++++++++++++++++
 3 files changed, 927 insertions(+), 15 deletions(-)
 create mode 100755 tools/testing/selftests/net/srv6_end_dt4_l3vpn_test.sh

Comments

Jakub Kicinski Nov. 10, 2020, 11:12 p.m. UTC | #1
On Sat,  7 Nov 2020 16:31:38 +0100 Andrea Mayer wrote:
> SRv6 End.DT4 is defined in the SRv6 Network Programming [1].
> 
> The SRv6 End.DT4 is used to implement IPv4 L3VPN use-cases in
> multi-tenants environments. It decapsulates the received packets and it
> performs IPv4 routing lookup in the routing table of the tenant.
> 
> The SRv6 End.DT4 Linux implementation leverages a VRF device in order to
> force the routing lookup into the associated routing table.

How does the behavior of DT4 compare to DT6?

The implementation looks quite different.

> To make the End.DT4 work properly, it must be guaranteed that the routing
> table used for routing lookup operations is bound to one and only one
> VRF during the tunnel creation. Such constraint has to be enforced by
> enabling the VRF strict_mode sysctl parameter, i.e:
>  $ sysctl -wq net.vrf.strict_mode=1.
> 
> At JANOG44, LINE corporation presented their multi-tenant DC architecture
> using SRv6 [2]. In the slides, they reported that the Linux kernel is
> missing the support of SRv6 End.DT4 behavior.
> 
> The iproute2 counterpart required for configuring the SRv6 End.DT4
> behavior is already implemented along with the other supported SRv6
> behaviors [3].
> 
> [1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming
> [2] https://speakerdeck.com/line_developers/line-data-center-networking-with-srv6
> [3] https://patchwork.ozlabs.org/patch/799837/
> 
> Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
> ---
>  net/ipv6/seg6_local.c | 205 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 205 insertions(+)
> 
> diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
> index 4b0f155d641d..a41074acd43e 100644
> --- a/net/ipv6/seg6_local.c
> +++ b/net/ipv6/seg6_local.c
> @@ -57,6 +57,14 @@ struct bpf_lwt_prog {
>  	char *name;
>  };
>  
> +struct seg6_end_dt4_info {
> +	struct net *net;
> +	/* VRF device associated to the routing table used by the SRv6 End.DT4
> +	 * behavior for routing IPv4 packets.
> +	 */
> +	int vrf_ifindex;
> +};
> +
>  struct seg6_local_lwt {
>  	int action;
>  	struct ipv6_sr_hdr *srh;
> @@ -66,6 +74,7 @@ struct seg6_local_lwt {
>  	int iif;
>  	int oif;
>  	struct bpf_lwt_prog bpf;
> +	struct seg6_end_dt4_info dt4_info;
>  
>  	int headroom;
>  	struct seg6_action_desc *desc;
> @@ -413,6 +422,194 @@ static int input_action_end_dx4(struct sk_buff *skb,
>  	return -EINVAL;
>  }
>  
> +#ifdef CONFIG_NET_L3_MASTER_DEV
> +

no need for this empty line.

> +static struct net *fib6_config_get_net(const struct fib6_config *fib6_cfg)
> +{
> +	const struct nl_info *nli = &fib6_cfg->fc_nlinfo;
> +
> +	return nli->nl_net;
> +}
> +
> +static int seg6_end_dt4_build(struct seg6_local_lwt *slwt, const void *cfg,
> +			      struct netlink_ext_ack *extack)
> +{
> +	struct seg6_end_dt4_info *info = &slwt->dt4_info;
> +	int vrf_ifindex;
> +	struct net *net;
> +
> +	net = fib6_config_get_net(cfg);
> +
> +	vrf_ifindex = l3mdev_ifindex_lookup_by_table_id(L3MDEV_TYPE_VRF, net,
> +							slwt->table);
> +	if (vrf_ifindex < 0) {
> +		if (vrf_ifindex == -EPERM) {
> +			NL_SET_ERR_MSG(extack,
> +				       "Strict mode for VRF is disabled");
> +		} else if (vrf_ifindex == -ENODEV) {
> +			NL_SET_ERR_MSG(extack, "No such device");

That's what -ENODEV already says.

> +		} else {
> +			NL_SET_ERR_MSG(extack, "Unknown error");

Useless error.

> +			pr_debug("seg6local: SRv6 End.DT4 creation error=%d\n",
> +				 vrf_ifindex);
> +		}
> +
> +		return vrf_ifindex;
> +	}
> +
> +	info->net = net;
> +	info->vrf_ifindex = vrf_ifindex;
> +
> +	return 0;
> +}
> +
> +/* The SRv6 End.DT4 behavior extracts the inner (IPv4) packet and routes the
> + * IPv4 packet by looking at the configured routing table.
> + *
> + * In the SRv6 End.DT4 use case, we can receive traffic (IPv6+Segment Routing
> + * Header packets) from several interfaces and the IPv6 destination address (DA)
> + * is used for retrieving the specific instance of the End.DT4 behavior that
> + * should process the packets.
> + *
> + * However, the inner IPv4 packet is not really bound to any receiving
> + * interface and thus the End.DT4 sets the VRF (associated with the
> + * corresponding routing table) as the *receiving* interface.
> + * In other words, the End.DT4 processes a packet as if it has been received
> + * directly by the VRF (and not by one of its slave devices, if any).
> + * In this way, the VRF interface is used for routing the IPv4 packet in
> + * according to the routing table configured by the End.DT4 instance.
> + *
> + * This design allows you to get some interesting features like:
> + *  1) the statistics on rx packets;
> + *  2) the possibility to install a packet sniffer on the receiving interface
> + *     (the VRF one) for looking at the incoming packets;
> + *  3) the possibility to leverage the netfilter prerouting hook for the inner
> + *     IPv4 packet.
> + *
> + * This function returns:
> + *  - the sk_buff* when the VRF rcv handler has processed the packet correctly;
> + *  - NULL when the skb is consumed by the VRF rcv handler;
> + *  - a pointer which encodes a negative error number in case of error.
> + *    Note that in this case, the function takes care of freeing the skb.
> + */
> +static struct sk_buff *end_dt4_vrf_rcv(struct sk_buff *skb,
> +				       struct net_device *dev)
> +{
> +	/* based on l3mdev_ip_rcv; we are only interested in the master */
> +	if (unlikely(!netif_is_l3_master(dev) && !netif_has_l3_rx_handler(dev)))
> +		goto drop;
> +
> +	if (unlikely(!dev->l3mdev_ops->l3mdev_l3_rcv))
> +		goto drop;
> +
> +	/* the decap packet (IPv4) does not come with any mac header info.
> +	 * We must unset the mac header to allow the VRF device to rebuild it,
> +	 * just in case there is a sniffer attached on the device.
> +	 */
> +	skb_unset_mac_header(skb);
> +
> +	skb = dev->l3mdev_ops->l3mdev_l3_rcv(dev, skb, AF_INET);
> +	if (!skb)
> +		/* the skb buffer was consumed by the handler */
> +		return NULL;
> +
> +	/* when a packet is received by a VRF or by one of its slaves, the
> +	 * master device reference is set into the skb.
> +	 */
> +	if (unlikely(skb->dev != dev || skb->skb_iif != dev->ifindex))
> +		goto drop;
> +
> +	return skb;
> +
> +drop:
> +	kfree_skb(skb);
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +static struct net_device *end_dt4_get_vrf_rcu(struct sk_buff *skb,
> +					      struct seg6_end_dt4_info *info)
> +{
> +	int vrf_ifindex = info->vrf_ifindex;
> +	struct net *net = info->net;
> +
> +	if (unlikely(vrf_ifindex < 0))
> +		goto error;
> +
> +	if (unlikely(!net_eq(dev_net(skb->dev), net)))
> +		goto error;
> +
> +	return dev_get_by_index_rcu(net, vrf_ifindex);
> +
> +error:
> +	return NULL;
> +}
> +
> +static int input_action_end_dt4(struct sk_buff *skb,
> +				struct seg6_local_lwt *slwt)
> +{
> +	struct net_device *vrf;
> +	struct iphdr *iph;
> +	int err;
> +
> +	if (!decap_and_validate(skb, IPPROTO_IPIP))
> +		goto drop;
> +
> +	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
> +		goto drop;
> +
> +	vrf = end_dt4_get_vrf_rcu(skb, &slwt->dt4_info);
> +	if (unlikely(!vrf))
> +		goto drop;
> +
> +	skb->protocol = htons(ETH_P_IP);
> +
> +	skb_dst_drop(skb);
> +
> +	skb_set_transport_header(skb, sizeof(struct iphdr));
> +
> +	skb = end_dt4_vrf_rcv(skb, vrf);
> +	if (!skb)
> +		/* packet has been processed and consumed by the VRF */
> +		return 0;
> +
> +	if (IS_ERR(skb)) {
> +		err = PTR_ERR(skb);
> +		return err;

return PTR_ERR(skb)

> +	}
> +
> +	iph = ip_hdr(skb);
> +
> +	err = ip_route_input(skb, iph->daddr, iph->saddr, 0, skb->dev);
> +	if (err)
> +		goto drop;
> +
> +	return dst_input(skb);
> +
> +drop:
> +	kfree_skb(skb);
> +	return -EINVAL;
> +}
> +
> +#else
> +

new line not needed

> +static int seg6_end_dt4_build(struct seg6_local_lwt *slwt, const void *cfg,
> +			      struct netlink_ext_ack *extack)
> +{
> +	NL_SET_ERR_MSG(extack, "Operation is not supported");

This extack message probably could be more helpful. As it stands it's
basically 

> +
> +	return -EOPNOTSUPP;
> +}
> +
> +static int input_action_end_dt4(struct sk_buff *skb,
> +				struct seg6_local_lwt *slwt)

Maybe just ifdef out the part of the action table instead of creating
those stubs?

> +{
> +	kfree_skb(skb);
> +	return -EOPNOTSUPP;
> +}
> +
> +#endif
> +
>  static int input_action_end_dt6(struct sk_buff *skb,
>  				struct seg6_local_lwt *slwt)
>  {
> @@ -601,6 +798,14 @@ static struct seg6_action_desc seg6_action_table[] = {

BTW any idea why the action table is not marked as const?

Would you mind sending a patch to fix that?

>  		.attrs		= (1 << SEG6_LOCAL_NH4),
>  		.input		= input_action_end_dx4,
>  	},
> +	{
> +		.action		= SEG6_LOCAL_ACTION_END_DT4,
> +		.attrs		= (1 << SEG6_LOCAL_TABLE),
> +		.input		= input_action_end_dt4,
> +		.slwt_ops	= {
> +					.build_state = seg6_end_dt4_build,
> +				  },
> +	},
>  	{
>  		.action		= SEG6_LOCAL_ACTION_END_DT6,
>  		.attrs		= (1 << SEG6_LOCAL_TABLE),
Andrea Mayer Nov. 13, 2020, 1:06 a.m. UTC | #2
Hi Jakub,
many thanks for your review. Please see my responses inline:

On Tue, 10 Nov 2020 14:56:55 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> On Sat,  7 Nov 2020 16:31:37 +0100 Andrea Mayer wrote:
> > We introduce two callbacks used for customizing the creation/destruction of
> > a SRv6 behavior. Such callbacks are defined in the new struct
> > seg6_local_lwtunnel_ops and hereafter we provide a brief description of
> > them:
> >
> >  - build_state(...): used for calling the custom constructor of the
> >    behavior during its initialization phase and after all the attributes
> >    have been parsed successfully;
> >
> >  - destroy_state(...): used for calling the custom destructor of the
> >    behavior before it is completely destroyed.
> >
> > Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
>
> Looks good, minor nits.
>
> > diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
> > index 63a82e2fdea9..4b0f155d641d 100644
> > --- a/net/ipv6/seg6_local.c
> > +++ b/net/ipv6/seg6_local.c
> > @@ -33,11 +33,23 @@
> >  
> >  struct seg6_local_lwt;
> >  
> > +typedef int (*slwt_build_state_t)(struct seg6_local_lwt *slwt, const void *cfg,
> > +				  struct netlink_ext_ack *extack);
> > +typedef void (*slwt_destroy_state_t)(struct seg6_local_lwt *slwt);
>
> Let's avoid the typedefs. Instead of taking a pointer to the op take a
> pointer to the ops struct in seg6_local_lwtunnel_build_state() etc.
>

Ok, I will do it this way in v3.

> > +/* callbacks used for customizing the creation and destruction of a behavior */
> > +struct seg6_local_lwtunnel_ops {
> > +	slwt_build_state_t build_state;
> > +	slwt_destroy_state_t destroy_state;
> > +};
> > +
> >  struct seg6_action_desc {
> >  	int action;
> >  	unsigned long attrs;
> >  	int (*input)(struct sk_buff *skb, struct seg6_local_lwt *slwt);
> >  	int static_headroom;
> > +
> > +	struct seg6_local_lwtunnel_ops slwt_ops;
> >  };
> >  
> >  struct bpf_lwt_prog {
> > @@ -1015,6 +1027,45 @@ static void destroy_attrs(struct seg6_local_lwt *slwt)
> >  	__destroy_attrs(attrs, 0, SEG6_LOCAL_MAX + 1, slwt);
> >  }
> >  
> > +/* call the custom constructor of the behavior during its initialization phase
> > + * and after that all its attributes have been parsed successfully.
> > + */
> > +static int
> > +seg6_local_lwtunnel_build_state(struct seg6_local_lwt *slwt, const void *cfg,
> > +				struct netlink_ext_ack *extack)
> > +{
> > +	slwt_build_state_t build_func;
> > +	struct seg6_action_desc *desc;
> > +	int err = 0;
> > +
> > +	desc = slwt->desc;
> > +	if (!desc)
> > +		return -EINVAL;
>
> This is impossible, right?
>

Yes, it is. I will remove this check in v3.

> > +
> > +	build_func = desc->slwt_ops.build_state;
> > +	if (build_func)
> > +		err = build_func(slwt, cfg, extack);
> > +
> > +	return err;
>
> no need for err, just use return directly.
>
> 	if (!ops->build_state)
> 		return 0;
> 	return ops->build_state(...);
>

Ok, I will do it in this way in v3.
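
As a rough illustration of the shape suggested above (not the actual v3 code), the dispatch could look like the following userspace sketch. The struct layouts are trimmed stand-ins for the kernel ones, and demo_build() is a hypothetical constructor that exists only to exercise the call:

```c
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the shape matters here. */
struct netlink_ext_ack { const char *msg; };
struct seg6_local_lwt;

/* Callbacks are declared inline in the ops struct, so no typedefs are needed. */
struct seg6_local_lwtunnel_ops {
	int (*build_state)(struct seg6_local_lwt *slwt, const void *cfg,
			   struct netlink_ext_ack *extack);
	void (*destroy_state)(struct seg6_local_lwt *slwt);
};

struct seg6_action_desc {
	int action;
	struct seg6_local_lwtunnel_ops slwt_ops;
};

struct seg6_local_lwt {
	int table;
	const struct seg6_action_desc *desc;
};

/* Take a pointer to the ops struct and return directly, without 'err'. */
static int seg6_local_lwtunnel_build_state(struct seg6_local_lwt *slwt,
					   const void *cfg,
					   struct netlink_ext_ack *extack)
{
	const struct seg6_local_lwtunnel_ops *ops = &slwt->desc->slwt_ops;

	if (!ops->build_state)
		return 0;

	return ops->build_state(slwt, cfg, extack);
}

/* Hypothetical constructor: accepts only a positive table id. */
static int demo_build(struct seg6_local_lwt *slwt, const void *cfg,
		      struct netlink_ext_ack *extack)
{
	(void)cfg;
	(void)extack;
	return slwt->table > 0 ? 0 : -EINVAL;
}
```

A behavior without a constructor simply leaves .build_state unset, and the helper reports success for it.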

> > +}
> > +
> > +/* call the custom destructor of the behavior which is invoked before the
> > + * tunnel is going to be destroyed.
> > + */
> > +static void seg6_local_lwtunnel_destroy_state(struct seg6_local_lwt *slwt)
> > +{
> > +	slwt_destroy_state_t destroy_func;
> > +	struct seg6_action_desc *desc;
> > +
> > +	desc = slwt->desc;
> > +	if (!desc)
> > +		return;
> > +
> > +	destroy_func = desc->slwt_ops.destroy_state;
> > +	if (destroy_func)
> > +		destroy_func(slwt);
> > +}
> > +
> >  static int parse_nla_action(struct nlattr **attrs, struct seg6_local_lwt *slwt)
> >  {
> >  	struct seg6_action_param *param;
> > @@ -1090,8 +1141,16 @@ static int seg6_local_build_state(struct net *net, struct nlattr *nla,
> >  
> >  	err = parse_nla_action(tb, slwt);
> >  	if (err < 0)
> > +		/* In case of error, the parse_nla_action() takes care of
> > +		 * releasing resources which have been acquired during the
> > +		 * processing of attributes.
> > +		 */
>
> that's the normal behavior for a kernel function, comment is
> unnecessary IMO
>

Yes, and this is the way it should be. Before this patch, however,
parse_nla_action() did not always release all the acquired resources in case
of error. From this patchset onward, parse_nla_action() behaves as expected.
Therefore, I will remove the comment in v3.

> >  		goto out_free;
> >  
> > +	err = seg6_local_lwtunnel_build_state(slwt, cfg, extack);
> > +	if (err < 0)
> > +		goto free_attrs;
>
> The function is called destroy_attrs, call the label out_destroy_attrs,
> or err_destroy_attrs.
>

Fine, I will stick with the out_destroy_attrs to be consistent and uniform with
the out_free label in v3.
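
For reference, the resulting unwind order can be sketched in plain userspace C as below. This is only an assumption about how the labels would stack, not the v3 code: demo_build_state(), struct demo_state and the fail_at knob are hypothetical, and calloc()/free() stand in for the kernel allocators.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical state: one allocation for the tunnel, one for its attributes. */
struct demo_state {
	int *attrs;
};

static void destroy_attrs(struct demo_state *st)
{
	free(st->attrs);
	st->attrs = NULL;
}

/* fail_at: 0 = succeed, 1 = attribute parsing fails, 2 = constructor fails */
static int demo_build_state(int fail_at, struct demo_state **out)
{
	struct demo_state *newts;
	int err;

	newts = calloc(1, sizeof(*newts));
	if (!newts)
		return -ENOMEM;

	/* Stand-in for parse_nla_action(): on error it leaves nothing behind. */
	if (fail_at == 1) {
		err = -EINVAL;
		goto out_free;
	}
	newts->attrs = calloc(4, sizeof(*newts->attrs));
	if (!newts->attrs) {
		err = -ENOMEM;
		goto out_free;
	}

	/* Stand-in for the behavior constructor (build_state callback). */
	if (fail_at == 2) {
		err = -ENODEV;
		goto out_destroy_attrs;
	}

	*out = newts;
	return 0;

out_destroy_attrs:
	destroy_attrs(newts);
out_free:
	free(newts);
	return err;
}
```

Each label undoes exactly one earlier step and falls through to the next, so a failure in the constructor releases the attributes first and the tunnel state second.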

> >  	newts->type = LWTUNNEL_ENCAP_SEG6_LOCAL;
> >  	newts->flags = LWTUNNEL_STATE_INPUT_REDIRECT;
> >  	newts->headroom = slwt->headroom;
> > @@ -1100,6 +1159,9 @@ static int seg6_local_build_state(struct net *net, struct nlattr *nla,
> >  
> >  	return 0;
> >  
> > +free_attrs:
> > +	destroy_attrs(slwt);
> > +
>
> no need for empty lines on error paths
>

Ok.

> >  out_free:
> >  	kfree(newts);
> >  	return err;
> > @@ -1109,6 +1171,8 @@ static void seg6_local_destroy_state(struct lwtunnel_state *lwt)
> >  {
> >  	struct seg6_local_lwt *slwt = seg6_local_lwtunnel(lwt);
> >  
> > +	seg6_local_lwtunnel_destroy_state(slwt);
> > +
> >  	destroy_attrs(slwt);
> >  
> >  	return;
>

Thank you,
Andrea
Andrea Mayer Nov. 13, 2020, 1:28 a.m. UTC | #3
Hi Jakub,
many thanks for your review. Please see my responses inline:

On Tue, 10 Nov 2020 15:12:55 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> On Sat,  7 Nov 2020 16:31:38 +0100 Andrea Mayer wrote:
> > SRv6 End.DT4 is defined in the SRv6 Network Programming [1].
> >
> > The SRv6 End.DT4 is used to implement IPv4 L3VPN use-cases in
> > multi-tenants environments. It decapsulates the received packets and it
> > performs IPv4 routing lookup in the routing table of the tenant.
> >
> > The SRv6 End.DT4 Linux implementation leverages a VRF device in order to
> > force the routing lookup into the associated routing table.
>
> How does the behavior of DT4 compare to DT6?
>

The implementation of SRv6 End.DT4 differs from the implementation of SRv6
End.DT6 due to the different *route input* lookup functions. For IPv6 it is
possible to force the routing lookup into a specific routing table through the
ip6_pol_route() function (as is done in seg6_lookup_any_nexthop()).

Conversely, for IPv4 we cannot force the lookup into a specific table with
the functions that are currently exposed by the kernel.

> The implementation looks quite different.
>

Long story short:
A long time ago, we discussed here on the mailing list how best to implement the
SRv6 DT4. After some time, we identified with the help of David Ahern the VRF as
the key infrastructure on which to build the SRv6 End.DT4. Indeed, the use of
VRF allows us not to touch in any way the core components of the kernel (i.e.:
the ipv4 routing system) and to exploit an already existing infrastructure.

I would say that also the SRv6 End.DT6 should leverage the VRF as we did for
SRv6 End.DT4. We can also try to change End.DT6 implementation, if needed.

> > To make the End.DT4 work properly, it must be guaranteed that the routing
> > table used for routing lookup operations is bound to one and only one
> > VRF during the tunnel creation. Such constraint has to be enforced by
> > enabling the VRF strict_mode sysctl parameter, i.e:
> >  $ sysctl -wq net.vrf.strict_mode=1.
> >
> > At JANOG44, LINE corporation presented their multi-tenant DC architecture
> > using SRv6 [2]. In the slides, they reported that the Linux kernel is
> > missing the support of SRv6 End.DT4 behavior.
> >
> > The iproute2 counterpart required for configuring the SRv6 End.DT4
> > behavior is already implemented along with the other supported SRv6
> > behaviors [3].
> >
> > [1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming
> > [2] https://speakerdeck.com/line_developers/line-data-center-networking-with-srv6
> > [3] https://patchwork.ozlabs.org/patch/799837/
> >
> > Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
> > ---
> >  net/ipv6/seg6_local.c | 205 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 205 insertions(+)
> >
> > diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
> > index 4b0f155d641d..a41074acd43e 100644
> > --- a/net/ipv6/seg6_local.c
> > +++ b/net/ipv6/seg6_local.c
> > @@ -57,6 +57,14 @@ struct bpf_lwt_prog {
> >  	char *name;
> >  };
> >  
> > +struct seg6_end_dt4_info {
> > +	struct net *net;
> > +	/* VRF device associated to the routing table used by the SRv6 End.DT4
> > +	 * behavior for routing IPv4 packets.
> > +	 */
> > +	int vrf_ifindex;
> > +};
> > +
> >  struct seg6_local_lwt {
> >  	int action;
> >  	struct ipv6_sr_hdr *srh;
> > @@ -66,6 +74,7 @@ struct seg6_local_lwt {
> >  	int iif;
> >  	int oif;
> >  	struct bpf_lwt_prog bpf;
> > +	struct seg6_end_dt4_info dt4_info;
> >  
> >  	int headroom;
> >  	struct seg6_action_desc *desc;
> > @@ -413,6 +422,194 @@ static int input_action_end_dx4(struct sk_buff *skb,
> >  	return -EINVAL;
> >  }
> >  
> > +#ifdef CONFIG_NET_L3_MASTER_DEV
> > +
>
> no need for this empty line.
>

Ok.

> > +static struct net *fib6_config_get_net(const struct fib6_config *fib6_cfg)
> > +{
> > +	const struct nl_info *nli = &fib6_cfg->fc_nlinfo;
> > +
> > +	return nli->nl_net;
> > +}
> > +
> > +static int seg6_end_dt4_build(struct seg6_local_lwt *slwt, const void *cfg,
> > +			      struct netlink_ext_ack *extack)
> > +{
> > +	struct seg6_end_dt4_info *info = &slwt->dt4_info;
> > +	int vrf_ifindex;
> > +	struct net *net;
> > +
> > +	net = fib6_config_get_net(cfg);
> > +
> > +	vrf_ifindex = l3mdev_ifindex_lookup_by_table_id(L3MDEV_TYPE_VRF, net,
> > +							slwt->table);
> > +	if (vrf_ifindex < 0) {
> > +		if (vrf_ifindex == -EPERM) {
> > +			NL_SET_ERR_MSG(extack,
> > +				       "Strict mode for VRF is disabled");
> > +		} else if (vrf_ifindex == -ENODEV) {
> > +			NL_SET_ERR_MSG(extack, "No such device");
>
> That's what -ENODEV already says.
>

Yes, sorry for this very trivial message. I will improve it in v3.
 
> > +		} else {
> > +			NL_SET_ERR_MSG(extack, "Unknown error");
>
> Useless error.
>

Ok, I will remove it and keep only the pr_debug message in v3.

> > +			pr_debug("seg6local: SRv6 End.DT4 creation error=%d\n",
> > +				 vrf_ifindex);
> > +		}
> > +
> > +		return vrf_ifindex;
> > +	}
> > +
> > +	info->net = net;
> > +	info->vrf_ifindex = vrf_ifindex;
> > +
> > +	return 0;
> > +}
> > +
> > +/* The SRv6 End.DT4 behavior extracts the inner (IPv4) packet and routes the
> > + * IPv4 packet by looking at the configured routing table.
> > + *
> > + * In the SRv6 End.DT4 use case, we can receive traffic (IPv6+Segment Routing
> > + * Header packets) from several interfaces and the IPv6 destination address (DA)
> > + * is used for retrieving the specific instance of the End.DT4 behavior that
> > + * should process the packets.
> > + *
> > + * However, the inner IPv4 packet is not really bound to any receiving
> > + * interface and thus the End.DT4 sets the VRF (associated with the
> > + * corresponding routing table) as the *receiving* interface.
> > + * In other words, the End.DT4 processes a packet as if it has been received
> > + * directly by the VRF (and not by one of its slave devices, if any).
> > + * In this way, the VRF interface is used for routing the IPv4 packet in
> > + * according to the routing table configured by the End.DT4 instance.
> > + *
> > + * This design allows you to get some interesting features like:
> > + *  1) the statistics on rx packets;
> > + *  2) the possibility to install a packet sniffer on the receiving interface
> > + *     (the VRF one) for looking at the incoming packets;
> > + *  3) the possibility to leverage the netfilter prerouting hook for the inner
> > + *     IPv4 packet.
> > + *
> > + * This function returns:
> > + *  - the sk_buff* when the VRF rcv handler has processed the packet correctly;
> > + *  - NULL when the skb is consumed by the VRF rcv handler;
> > + *  - a pointer which encodes a negative error number in case of error.
> > + *    Note that in this case, the function takes care of freeing the skb.
> > + */
> > +static struct sk_buff *end_dt4_vrf_rcv(struct sk_buff *skb,
> > +				       struct net_device *dev)
> > +{
> > +	/* based on l3mdev_ip_rcv; we are only interested in the master */
> > +	if (unlikely(!netif_is_l3_master(dev) && !netif_has_l3_rx_handler(dev)))
> > +		goto drop;
> > +
> > +	if (unlikely(!dev->l3mdev_ops->l3mdev_l3_rcv))
> > +		goto drop;
> > +
> > +	/* the decap packet (IPv4) does not come with any mac header info.
> > +	 * We must unset the mac header to allow the VRF device to rebuild it,
> > +	 * just in case there is a sniffer attached on the device.
> > +	 */
> > +	skb_unset_mac_header(skb);
> > +
> > +	skb = dev->l3mdev_ops->l3mdev_l3_rcv(dev, skb, AF_INET);
> > +	if (!skb)
> > +		/* the skb buffer was consumed by the handler */
> > +		return NULL;
> > +
> > +	/* when a packet is received by a VRF or by one of its slaves, the
> > +	 * master device reference is set into the skb.
> > +	 */
> > +	if (unlikely(skb->dev != dev || skb->skb_iif != dev->ifindex))
> > +		goto drop;
> > +
> > +	return skb;
> > +
> > +drop:
> > +	kfree_skb(skb);
> > +	return ERR_PTR(-EINVAL);
> > +}
> > +
> > +static struct net_device *end_dt4_get_vrf_rcu(struct sk_buff *skb,
> > +					      struct seg6_end_dt4_info *info)
> > +{
> > +	int vrf_ifindex = info->vrf_ifindex;
> > +	struct net *net = info->net;
> > +
> > +	if (unlikely(vrf_ifindex < 0))
> > +		goto error;
> > +
> > +	if (unlikely(!net_eq(dev_net(skb->dev), net)))
> > +		goto error;
> > +
> > +	return dev_get_by_index_rcu(net, vrf_ifindex);
> > +
> > +error:
> > +	return NULL;
> > +}
> > +
> > +static int input_action_end_dt4(struct sk_buff *skb,
> > +				struct seg6_local_lwt *slwt)
> > +{
> > +	struct net_device *vrf;
> > +	struct iphdr *iph;
> > +	int err;
> > +
> > +	if (!decap_and_validate(skb, IPPROTO_IPIP))
> > +		goto drop;
> > +
> > +	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
> > +		goto drop;
> > +
> > +	vrf = end_dt4_get_vrf_rcu(skb, &slwt->dt4_info);
> > +	if (unlikely(!vrf))
> > +		goto drop;
> > +
> > +	skb->protocol = htons(ETH_P_IP);
> > +
> > +	skb_dst_drop(skb);
> > +
> > +	skb_set_transport_header(skb, sizeof(struct iphdr));
> > +
> > +	skb = end_dt4_vrf_rcv(skb, vrf);
> > +	if (!skb)
> > +		/* packet has been processed and consumed by the VRF */
> > +		return 0;
> > +
> > +	if (IS_ERR(skb)) {
> > +		err = PTR_ERR(skb);
> > +		return err;
>
> return PTR_ERR(skb)
>

I will fix it in v3.
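
To make the three possible outcomes of end_dt4_vrf_rcv() concrete, here is a small userspace sketch of the same pattern. The ERR_PTR()/PTR_ERR()/IS_ERR() helpers are re-created here in the style of the kernel's include/linux/err.h, while struct pkt, demo_rcv() and demo_input() are hypothetical stand-ins, not code from the patch:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace re-creations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical packet: verdict 0 = pass on, 1 = consumed, 2 = error. */
struct pkt {
	int verdict;
};

static struct pkt *demo_rcv(struct pkt *p)
{
	if (p->verdict == 1)
		return NULL;
	if (p->verdict == 2)
		return ERR_PTR(-EINVAL);
	return p;
}

static int demo_input(struct pkt *p)
{
	p = demo_rcv(p);
	if (!p)
		return 0;	/* consumed by the handler */

	if (IS_ERR(p))
		return (int)PTR_ERR(p);	/* returned directly, no 'err' needed */

	return 1;	/* stand-in for continuing to the routing lookup */
}
```

The NULL check has to come before IS_ERR() only for readability; NULL is never in the error range, so the two tests do not overlap.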

> > +	}
> > +
> > +	iph = ip_hdr(skb);
> > +
> > +	err = ip_route_input(skb, iph->daddr, iph->saddr, 0, skb->dev);
> > +	if (err)
> > +		goto drop;
> > +
> > +	return dst_input(skb);
> > +
> > +drop:
> > +	kfree_skb(skb);
> > +	return -EINVAL;
> > +}
> > +
> > +#else
> > +
>
> new line not needed
>

Ok.

> > +static int seg6_end_dt4_build(struct seg6_local_lwt *slwt, const void *cfg,
> > +			      struct netlink_ext_ack *extack)
> > +{
> > +	NL_SET_ERR_MSG(extack, "Operation is not supported");
>
> This extack message probably could be more helpful. As it stands it's
> basically
>

Please see my reply just below.

> > +
> > +	return -EOPNOTSUPP;
> > +}
> > +
> > +static int input_action_end_dt4(struct sk_buff *skb,
> > +				struct seg6_local_lwt *slwt)
>
> Maybe just ifdef out the part of the action table instead of creating
> those stubs?
>

This is a very interesting point and I like your idea. We can eliminate the two
stubs while keeping the "unsupported operation" semantics in this way:

static struct seg6_action_desc seg6_action_table[] = {
	[...]
	{
		.action = SEG6_LOCAL_ACTION_END_DT4,
		.attrs = (1 << SEG6_LOCAL_TABLE),
#ifdef CONFIG_NET_L3_MASTER_DEV
		.input = input_action_end_dt4,
		.slwt_ops = {
			.build_state = seg6_end_dt4_build,
		},
#endif
	},
	[...]
};

When CONFIG_NET_L3_MASTER_DEV is not defined, the behavior cannot be
instantiated because the "input" callback is left NULL. This forces
parse_nla_action() to fail, returning -EOPNOTSUPP to the user
(which is exactly what we want to achieve).

Note that surrounding the entire DT4 action table entry with #ifdef/#endif does
not allow us to distinguish whether the DT4 was really implemented or it was
not supported due to the way in which the CONFIG_NET_L3_MASTER_DEV was set.
In both cases, when the user tries to instantiate a new DT4 behavior, the
kernel replies back with the -EINVAL error.
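
The difference between the two cases can be shown with a small userspace model. Everything prefixed with demo_ or DEMO_ is a hypothetical stand-in (including the action numbers and the DEMO_L3_MASTER_DEV switch that plays the role of CONFIG_NET_L3_MASTER_DEV); only the two error codes follow the description above:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical action ids, chosen only for this sketch. */
enum {
	DEMO_ACTION_END_DX4 = 1,
	DEMO_ACTION_END_DT4 = 2,
	DEMO_ACTION_UNKNOWN = 99,
};

struct demo_desc {
	int action;
	int (*input)(void);
};

static int demo_input_dx4(void)
{
	return 0;
}

#ifdef DEMO_L3_MASTER_DEV
static int demo_input_dt4(void)
{
	return 0;
}
#endif

static const struct demo_desc demo_action_table[] = {
	{
		.action = DEMO_ACTION_END_DX4,
		.input  = demo_input_dx4,
	},
	{
		/* The entry always exists; only its callback is optional. */
		.action = DEMO_ACTION_END_DT4,
#ifdef DEMO_L3_MASTER_DEV
		.input  = demo_input_dt4,
#endif
	},
};

static const struct demo_desc *demo_get_action_desc(int action)
{
	size_t n = sizeof(demo_action_table) / sizeof(demo_action_table[0]);
	size_t i;

	for (i = 0; i < n; i++)
		if (demo_action_table[i].action == action)
			return &demo_action_table[i];

	return NULL;
}

/* Unknown action: -EINVAL. Known action without callback: -EOPNOTSUPP. */
static int demo_parse_action(int action)
{
	const struct demo_desc *desc = demo_get_action_desc(action);

	if (!desc)
		return -EINVAL;

	if (!desc->input)
		return -EOPNOTSUPP;

	return 0;
}
```

Built without DEMO_L3_MASTER_DEV, the DT4 entry is still found in the table, so the caller gets -EOPNOTSUPP rather than the -EINVAL it would get for an action the table does not know at all.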

> > +{
> > +	kfree_skb(skb);
> > +	return -EOPNOTSUPP;
> > +}
> > +
> > +#endif
> > +
> >  static int input_action_end_dt6(struct sk_buff *skb,
> >  				struct seg6_local_lwt *slwt)
> >  {
> > @@ -601,6 +798,14 @@ static struct seg6_action_desc seg6_action_table[] = {
>
> BTW any idea why the action table is not marked as const?
>

Frankly speaking, I have no idea. I have been working on the seg6 infrastructure
for some time now, and I have never seen a single value changed in
seg6_action_table[] after its initialization, nor any need to update it.

> Would you mind sending a patch to fix that?
>

Yes, I will send a fix for this issue adding the 'const' keyword.

> >  		.attrs		= (1 << SEG6_LOCAL_NH4),
> >  		.input		= input_action_end_dx4,
> >  	},
> > +	{
> > +		.action		= SEG6_LOCAL_ACTION_END_DT4,
> > +		.attrs		= (1 << SEG6_LOCAL_TABLE),
> > +		.input		= input_action_end_dt4,
> > +		.slwt_ops	= {
> > +					.build_state = seg6_end_dt4_build,
> > +				  },
> > +	},
> >  	{
> >  		.action		= SEG6_LOCAL_ACTION_END_DT6,
> >  		.attrs		= (1 << SEG6_LOCAL_TABLE),
>

Thank you,
Andrea
David Ahern Nov. 13, 2020, 1:49 a.m. UTC | #4
On 11/12/20 6:28 PM, Andrea Mayer wrote:
> The implementation of SRv6 End.DT4 differs from the the implementation of SRv6
> End.DT6 due to the different *route input* lookup functions. For IPv6 is it
> possible to force the routing lookup specifying a routing table through the
> ip6_pol_route() function (as it is done in the seg6_lookup_any_nexthop()).

It is unfortunate that the IPv6 variant got in without the VRF piece.
David Ahern Nov. 13, 2020, 5:04 p.m. UTC | #5
On 11/13/20 10:02 AM, Stefano Salsano wrote:
> On 2020-11-13 17:55, Jakub Kicinski wrote:
>> On Thu, 12 Nov 2020 18:49:17 -0700 David Ahern wrote:
>>> On 11/12/20 6:28 PM, Andrea Mayer wrote:
>>>> The implementation of SRv6 End.DT4 differs from the the
>>>> implementation of SRv6
>>>> End.DT6 due to the different *route input* lookup functions. For
>>>> IPv6 is it
>>>> possible to force the routing lookup specifying a routing table
>>>> through the
>>>> ip6_pol_route() function (as it is done in the
>>>> seg6_lookup_any_nexthop()).
>>>
>>> It is unfortunate that the IPv6 variant got in without the VRF piece.
>>
>> Should we make it a requirement for this series to also extend the v6
>> version to support the preferred VRF-based operation? Given VRF is
>> better and we require v4 features to be implemented for v6?
>
> I think it is better to separate the two aspects... adding a missing
> feature in IPv4 datapath should not depend on improving the quality of
> the implementation of the IPv6 datapath :-)
>
> I think that Andrea is willing to work on improving the IPv6
> implementation, but this should be considered after this patchset...
>

agreed. The v6 variant has existed for a while. The v4 version is
independent.
Jakub Kicinski Nov. 13, 2020, 7:40 p.m. UTC | #6
On Fri, 13 Nov 2020 10:04:44 -0700 David Ahern wrote:
> On 11/13/20 10:02 AM, Stefano Salsano wrote:
> > On 2020-11-13 17:55, Jakub Kicinski wrote:
> >> On Thu, 12 Nov 2020 18:49:17 -0700 David Ahern wrote:
> >>> On 11/12/20 6:28 PM, Andrea Mayer wrote:
> >>>> The implementation of SRv6 End.DT4 differs from the the
> >>>> implementation of SRv6
> >>>> End.DT6 due to the different *route input* lookup functions. For
> >>>> IPv6 is it
> >>>> possible to force the routing lookup specifying a routing table
> >>>> through the
> >>>> ip6_pol_route() function (as it is done in the
> >>>> seg6_lookup_any_nexthop()).
> >>>
> >>> It is unfortunate that the IPv6 variant got in without the VRF piece.
> >>
> >> Should we make it a requirement for this series to also extend the v6
> >> version to support the preferred VRF-based operation? Given VRF is
> >> better and we require v4 features to be implemented for v6?
> >
> > I think it is better to separate the two aspects... adding a missing
> > feature in IPv4 datapath should not depend on improving the quality of
> > the implementation of the IPv6 datapath :-)
> >
> > I think that Andrea is willing to work on improving the IPv6
> > implementation, but this should be considered after this patchset...
>
> agreed. The v6 variant has existed for a while. The v4 version is
> independent.

Okay, I'm not sure what's the right call so I asked DaveM.

TBH I wasn't expecting this reaction, we're talking about a 200 LoC
patch which would probably be 90% reused for v6...
Jakub Kicinski Nov. 13, 2020, 9:40 p.m. UTC | #7
On Fri, 13 Nov 2020 11:40:36 -0800 Jakub Kicinski wrote:
> > agreed. The v6 variant has existed for a while. The v4 version is
> > independent.
>
> Okay, I'm not sure what's the right call so I asked DaveM.

DaveM raised a concern that unless we implement v6 now we can't be sure
the interface we create for v4 is going to fit there.

So Andrea unless it's a major hurdle, could you take a stab at the v6
version with VRFs as part of this series?
Andrea Mayer Nov. 13, 2020, 11 p.m. UTC | #8
Hi Jakub,

On Fri, 13 Nov 2020 13:40:10 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> On Fri, 13 Nov 2020 11:40:36 -0800 Jakub Kicinski wrote:
> > > agreed. The v6 variant has existed for a while. The v4 version is
> > > independent.
> >
> > Okay, I'm not sure what's the right call so I asked DaveM.
>
> DaveM raised a concern that unless we implement v6 now we can't be sure
> the interface we create for v4 is going to fit there.
>
> So Andrea unless it's a major hurdle, could you take a stab at the v6
> version with VRFs as part of this series?

I can tackle the v6 version, but how do we deal with the compatibility issue
raised by Stefano in his message?

If it is OK to implement a uAPI that breaks the existing scripts, it is
relatively easy to replicate the VRF-based approach in v6 as well.

Waiting for your advice!

Thanks,
Andrea