[bpf-next,v6,0/3] Add TC-BPF API


Message ID

20210504005023.1240974-1-memxor@gmail.com

Headers

From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
To: bpf@vger.kernel.org
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>,
	Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Toke Høiland-Jørgensen <toke@redhat.com>,
	Shaun Crampton <shaun@tigera.io>,
	netdev@vger.kernel.org
Subject: [PATCH bpf-next v6 0/3] Add TC-BPF API
Date: Tue,  4 May 2021 06:20:20 +0530
Message-Id: <20210504005023.1240974-1-memxor@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

Add TC-BPF API

Message

Kumar Kartikeya Dwivedi May 4, 2021, 12:50 a.m. UTC

This is the sixth version of the TC-BPF series.

It adds a simple API that uses netlink to attach the tc filter and its bpf
classifier program. Currently, a user needs to shell out to the tc command line
to be able to create filters and attach SCHED_CLS programs as classifiers. With
the help of this API, it will be possible to use libbpf for doing all parts of
bpf program setup and attach.

Changelog contains details of patchset evolution.

In an effort to keep discussion focused, this series doesn't have the high level
TC-BPF API. It was clear that there is a need for a bpf_link API in the kernel,
hence that will be submitted as a separate patchset based on this.

The individual commit messages contain more details, and also a brief summary of
the API.

Changelog:
----------
v5 -> v6
v5: https://lore.kernel.org/bpf/20210428162553.719588-1-memxor@gmail.com

 * Address all comments from Andrii.
 * Reorganize selftest to make the logical separation between each test's setup
   and cleanup clearer to the reader. Also add a common way to test
   different combinations of opts.
 * Clean up the commit message a bit.
 * Fix instances of ret < 0 && ret == -ENOENT pattern everywhere.
 * Use C89 declaration syntax.
 * Drop PRIu32.
 * Move flags to bpf_tc_opts and bpf_tc_hook.
 * Other misc comments.

v4 -> v5
v4: https://lore.kernel.org/bpf/20210423150600.498490-1-memxor@gmail.com

 * Added bpf_tc_hook to represent the attach location of a filter.
 * Removed the bpf_tc_ctx context object, refactored code to not assume shared
   open socket across operations on the same ctx.
 * Add a helper libbpf_nl_send_recv that wraps socket creation, sending and
   receiving the netlink message.
 * Extended netlink code to cut short message processing using BPF_NL_DONE. This
   is used in a few places to return early to the user and discard remaining
   data.
 * Rewrite and expand the selftests, now that the API is looking more solid.
 * Documented the API assumptions and behaviour in the commit that adds it,
   along with a few basic usage examples.
 * Dropped documentation from libbpf.h.
 * Relax some restrictions on bpf_tc_query to make it more useful (e.g. to
   detect if any filters exist).
 * Incorporate other minor suggestions from previous review (Andrii and Daniel).
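
The early-return idea behind BPF_NL_DONE in the list above can be sketched
outside of netlink. The snippet below is only an illustration, not the code
from the patch: the verdict names, struct batch, process_batches() and
find_first() are hypothetical, and plain int arrays stand in for received
netlink buffers. It assumes that "CONT" means keep going, "NEXT" means skip to
the next buffer, and "DONE" means stop and discard whatever is left.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the BPF_NL_CONT/NEXT/DONE verdicts. */
enum nl_verdict {
	NL_CONT,	/* keep processing messages in the current buffer */
	NL_NEXT,	/* skip the rest of this buffer, move to the next one */
	NL_DONE,	/* stop now, the remaining data is discarded */
};

/* Callback invoked per message; a negative return is an error code. */
typedef int (*msg_cb)(int msg, void *cookie);

/* One "received buffer", modelled as an array of ints. */
struct batch {
	const int *msgs;
	size_t n;
};

/* Feeds every message to cb until cb asks to stop or fails.
 * *seen is set to the number of messages that were handed to cb.
 */
static int process_batches(const struct batch *batches, size_t nbatches,
			   msg_cb cb, void *cookie, size_t *seen)
{
	size_t i, j;

	*seen = 0;
	for (i = 0; i < nbatches; i++) {
		for (j = 0; j < batches[i].n; j++) {
			int ret = cb(batches[i].msgs[j], cookie);

			(*seen)++;
			if (ret < 0)
				return ret;	/* propagate the error */
			if (ret == NL_DONE)
				return 0;	/* caller has what it needs */
			if (ret == NL_NEXT)
				break;		/* drop the rest of this buffer */
		}
	}
	return 0;
}

/* Example callback: stop at the first message equal to *cookie. */
static int find_first(int msg, void *cookie)
{
	const int *want = cookie;

	return msg == *want ? NL_DONE : NL_CONT;
}
```

With two buffers {1, 2, 3} and {4, 5, 6}, searching for 2 touches only two
messages, while searching for a value that is absent walks all six.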

v3 -> v4
v3: https://lore.kernel.org/bpf/20210420193740.124285-1-memxor@gmail.com

 * Added a concept of bpf_tc_ctx context structure representing the attach point.
   The qdisc setup and delete is tied to this object's lifetime if it succeeds
   in creating the clsact qdisc when the attach point is BPF_TC_INGRESS or
   BPF_TC_EGRESS. Qdisc is only deleted when there are no filters attached to
   it.
 * Refactored all API functions to take ctx.
 * Removed bpf_tc_info, bpf_tc_attach_id, instead reused bpf_tc_opts for filling
   in attributes in various API functions (including query).
 * Explicitly documented the expectation of each function regarding the opts
   fields set. Added some small notes for the defaults chosen by the API.
 * Rename bpf_tc_get_info to bpf_tc_query
 * Keep the netlink socket open in the context structure to save on open/close
   cycles for each operation.
 * Miscellaneous adjustments due to keeping the socket open.
 * Rewrote the tests, and also added tests for testing all preconditions of the
   TC-BPF API.
 * We now use bpf skeleton in examples and tests.

v2 -> v3
v2: https://lore.kernel.org/bpf/20210419121811.117400-1-memxor@gmail.com

 * bpf_tc_cls_* -> bpf_tc_* rename
 * bpf_tc_attach_id now only consists of handle and priority, the two variables
   that user may or may not set.
 * bpf_tc_replace has been dropped, instead a replace bool is introduced in
   bpf_tc_opts for the same purpose.
 * bpf_tc_get_info now takes attach_id for filling in filter details during
   lookup instead of requiring user to do so. This also allows us to remove the
   fd parameter, as no matching is needed as long as we have all attributes
   necessary to identify a specific filter.
 * A little bit of code simplification taking into account the change above.
 * priority and protocol are now __u16 members in user facing API structs to
   reflect actual size.
 * Patch updating pkt_cls.h header has been removed, as it is unused now.
 * protocol and chain_index options have been dropped in bpf_tc_opts,
   protocol is always set to ETH_P_ALL, while chain_index is set as 0 by
   default in the kernel. This also means removal of chain_index from
   bpf_tc_attach_id, as it is unconditionally always 0.
 * bpf_tc_cls_change has been dropped
 * selftest now uses ASSERT_* macros

v1 -> v2
v1: https://lore.kernel.org/bpf/20210325120020.236504-1-memxor@gmail.com

 * netlink helpers have been renamed to object_action style.
 * attach_id now only contains attributes that are not explicitly set. Only
   the bare minimum info is kept in it.
 * protocol is now an optional and always set to ETH_P_ALL.
 * direct-action mode is always set.
 * skip_sw and skip_hw options have also been removed.
 * bpf_tc_cls_info struct now also returns the bpf program tag and id, as
   available in the netlink response. This came up as a requirement during
   discussion with people wanting to use this functionality.
 * support for attaching SCHED_ACT programs has been dropped, as it isn't
   useful without any support for binding loaded actions to a classifier.
 * the distinction between dev and block API has been dropped, there is now
   a single set of functions and user has to pass the special ifindex value
   to indicate operation on a shared filter block on their own.
 * The high level API returning a bpf_link is gone. This was already non-
   functional for pinning and typical ownership semantics. Instead, a separate
   patchset will be sent adding a bpf_link API for attaching SCHED_CLS progs to
   the kernel, and its corresponding libbpf API.
 * The clsact qdisc is now set up automatically in a best-effort fashion whenever
   user passes in the clsact ingress or egress parent id. This is done with
   exclusive mode, such that if an ingress or clsact qdisc is already set up,
   we skip the setup and move on with filter creation.
 * Other minor changes that came up during the course of discussion and rework.

Kumar Kartikeya Dwivedi (3):
  libbpf: add netlink helpers
  libbpf: add low level TC-BPF API
  libbpf: add selftests for TC-BPF API

 tools/lib/bpf/libbpf.h                        |  42 ++
 tools/lib/bpf/libbpf.map                      |   5 +
 tools/lib/bpf/netlink.c                       | 586 ++++++++++++++++--
 tools/lib/bpf/nlattr.h                        |  48 ++
 .../testing/selftests/bpf/prog_tests/tc_bpf.c | 544 ++++++++++++++++
 .../testing/selftests/bpf/progs/test_tc_bpf.c |  12 +
 6 files changed, 1173 insertions(+), 64 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_bpf.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_bpf.c

Comments

Daniel Borkmann May 5, 2021, 9:42 p.m. UTC | #1

On 5/4/21 2:50 AM, Kumar Kartikeya Dwivedi wrote:
> This adds functions that wrap the netlink API used for adding,

> manipulating, and removing traffic control filters.

> 

> An API summary:


Looks better, few minor comments below:

> A bpf_tc_hook represents a location where a TC-BPF filter can be

> attached. This means that creating a hook leads to creation of the

> backing qdisc, while destruction either removes all filters attached to

> a hook, or destroys qdisc if requested explicitly (as discussed below).

> 

> The TC-BPF API functions operate on this bpf_tc_hook to attach, replace,

> query, and detach tc filters.

> 

> All functions return 0 on success, and a negative error code on failure.

> 

> bpf_tc_hook_create - Create a hook

> Parameters:

> 	@hook - Cannot be NULL, ifindex > 0, attach_point must be set to

> 		proper enum constant. Note that parent must be unset when

> 		attach_point is one of BPF_TC_INGRESS or BPF_TC_EGRESS. Note

> 		that as an exception BPF_TC_INGRESS|BPF_TC_EGRESS is also a

> 		valid value for attach_point.

> 

> 		Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.

> 

> 		hook's flags member can be BPF_TC_F_REPLACE, which

> 		creates qdisc in non-exclusive mode (i.e. an existing

> 		qdisc will be replaced instead of this function failing

> 		with -EEXIST).


Why supporting BPF_TC_F_REPLACE here? It's not changing any qdisc parameters
given clsact doesn't have any, no? Iow, what effect are you expecting on this
with BPF_TC_F_REPLACE & why supporting it? I'd probably just require flags to
be 0 here, and if hook exists return sth like -EEXIST.

> bpf_tc_hook_destroy - Destroy the hook

> Parameters:

>          @hook - Cannot be NULL. The behaviour depends on value of

> 		attach_point.

> 

> 		If BPF_TC_INGRESS, all filters attached to the ingress

> 		hook will be detached.

> 		If BPF_TC_EGRESS, all filters attached to the egress hook

> 		will be detached.

> 		If BPF_TC_INGRESS|BPF_TC_EGRESS, the clsact qdisc will be

> 		deleted, also detaching all filters.

> 

> 		As before, parent must be unset for these attach_points,

> 		and set for BPF_TC_CUSTOM. flags must also be unset.

> 

> 		It is advised that if the qdisc is operated on by many programs,

> 		then the program at least check that there are no other existing

> 		filters before deleting the clsact qdisc. An example is shown

> 		below:

> 

> 		DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = if_nametoindex("lo"),

> 				    .attach_point = BPF_TC_INGRESS);

> 		/* set opts as NULL, as we're not really interested in

> 		 * getting any info for a particular filter, but just

> 	 	 * detecting its presence.

> 		 */

> 		r = bpf_tc_query(&hook, NULL);

> 		if (r == -ENOENT) {

> 			/* no filters */

> 			hook.attach_point = BPF_TC_INGRESS|BPF_TC_EGRESS;

> 			return bpf_tc_hook_destroy(&hook);

> 		} else {

> 			/* failed or r == 0, the latter means filters do exist */

> 			return r;

> 		}

> 

> 		Note that there is a small race between checking for no

> 		filters and deleting the qdisc. This is currently unavoidable.

> 

> 		Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.

> 

> bpf_tc_attach - Attach a filter to a hook

> Parameters:

> 	@hook - Cannot be NULL. Represents the hook the filter will be

> 		attached to. Requirements for ifindex and attach_point are

> 		same as described in bpf_tc_hook_create, but BPF_TC_CUSTOM

> 		is also supported.  In that case, parent must be set to the

> 		handle where the filter will be attached (using TC_H_MAKE).

> 		flags member must be unset.

> 

> 		E.g. To set parent to 1:16 like in tc command line,

> 		     the equivalent would be TC_H_MAKE(1 << 16, 16)


Small nit: I wonder whether from libbpf side we should just support a more
user friendly TC_H_MAKE, so you'd have: BPF_TC_CUSTOM + BPF_TC_PARENT(1, 16).

> 	@opts - Cannot be NULL.

> 

> 		The following opts are optional:

> 			handle - The handle of the filter

> 			priority - The priority of the filter

> 				   Must be >= 0 and <= UINT16_MAX


It should probably be mentioned that if they are not specified, then they
are auto-allocated from kernel.

> 		The following opts must be set:

> 			prog_fd - The fd of the loaded SCHED_CLS prog

> 		The following opts must be unset:

> 			prog_id - The ID of the BPF prog

> 		The following opts are optional:

> 			flags - Currently only BPF_TC_F_REPLACE is

> 				allowed. It allows replacing an existing

> 				filter instead of failing with -EEXIST.

> 

> 		The following opts will be filled by bpf_tc_attach on a

> 		successful attach operation if they are unset:

> 			handle - The handle of the attached filter

> 			priority - The priority of the attached filter

> 			prog_id - The ID of the attached SCHED_CLS prog

> 

> 		This way, the user can know what the auto allocated

> 		values for optional opts like handle and priority are

> 		for the newly attached filter, if they were unset.

> 

> 		Note that some other attributes are set to some default

> 		values listed below (this holds for all bpf_tc_* APIs):

> 			protocol - ETH_P_ALL

> 			mode - direct action

> 			chain index - 0

> 			class ID - 0 (this can be set by writing to the

> 			skb->tc_classid field from the BPF program)

> 

> bpf_tc_detach

> Parameters:

> 	@hook: Cannot be NULL. Represents the hook the filter will be

> 		detached from. Requirements are same as described above

> 		in bpf_tc_attach.

> 

> 	@opts:	Cannot be NULL.

> 

> 		The following opts must be set:

> 			handle

> 			priority

> 		The following opts must be unset:

> 			prog_fd

> 			prog_id

> 			flags

> 

> bpf_tc_query

> Parameters:

> 	@hook: Cannot be NULL. Represents the hook where the filter

> 	       lookup will be performed. Requirements are same as described

> 	       above in bpf_tc_attach.

> 

> 	@opts: Can be NULL.


Shouldn't it be: Cannot be NULL?

> 	       The following opts are optional:

> 			handle

> 			priority

> 			prog_fd

> 			prog_id


What is the use case to set prog_fd here?

> 	       The following opts must be unset:

> 			flags

> 

> 	       However, only one of prog_fd and prog_id must be

> 	       set. Setting both leads to an error. Setting none is

> 	       allowed.

> 

> 	       The following fields will be filled by bpf_tc_query on a

> 	       successful lookup if they are unset:

> 			handle

> 			priority

> 			prog_id

> 

> 	       Based on the specified optional parameters, the matching

> 	       data for the first matching filter is filled in and 0 is

> 	       returned. When setting prog_fd, the prog_id will be

> 	       matched against prog_id of the loaded SCHED_CLS prog

> 	       represented by prog_fd.

> 

> 	       To uniquely identify a filter, e.g. to detect its presence,

> 	       it is recommended to set both handle and priority fields.


What if prog_id is not unique, but part of multiple instances? Do we need
to support this case?

Why not just bpf_tc_query() with non-NULL hook and non-NULL opts where
handle and priority is required to be set, and rest must be 0?

> Some usage examples (using bpf skeleton infrastructure):

> 

> BPF program (test_tc_bpf.c):

> 

> 	#include <linux/bpf.h>

> 	#include <bpf/bpf_helpers.h>

> 

> 	SEC("classifier")

> 	int cls(struct __sk_buff *skb)

> 	{

> 		return 0;

> 	}

> 

> Userspace loader:

> 

> 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, 0);

> 	struct test_tc_bpf *skel = NULL;

> 	int fd, r;

> 

> 	skel = test_tc_bpf__open_and_load();

> 	if (!skel)

> 		return -ENOMEM;

> 

> 	fd = bpf_program__fd(skel->progs.cls);

> 

> 	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex =

> 			    if_nametoindex("lo"), .attach_point =

> 			    BPF_TC_INGRESS);

> 	/* Create clsact qdisc */

> 	r = bpf_tc_hook_create(&hook);

> 	if (r < 0)

> 		goto end;

> 

> 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = fd);


Given we had DECLARE_LIBBPF_OPTS earlier, can't we just set:
opts.prog_fd = fd here?

> 	r = bpf_tc_attach(&hook, &opts);

> 	if (r < 0)

> 		goto end;

> 	/* Print the auto allocated handle and priority */

> 	printf("Handle=%u", opts.handle);

> 	printf("Priority=%u", opts.priority);

> 

> 	opts.prog_fd = opts.prog_id = 0;

> 	bpf_tc_detach(&hook, &opts);


Here we detach ...

> end:

> 	test_tc_bpf__destroy(skel);

> 

> This is equivalent to doing the following using tc command line:

>    # tc qdisc add dev lo clsact

>    # tc filter add dev lo ingress bpf obj foo.o sec classifier da


... so this is not equivalent to your tc cmdline description.

> Another example replacing a filter (extending prior example):

> 

> 	/* We can also choose both (or one), let's try replacing an

> 	 * existing filter.

> 	 */

> 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, replace_opts, .handle =

> 			    opts.handle, .priority = opts.priority,

> 			    .prog_fd = fd);

> 	r = bpf_tc_attach(&hook, &replace_opts);

> 	if (r == -EEXIST) {

> 		/* Expected, now use BPF_TC_F_REPLACE to replace it */

> 		replace_opts.flags = BPF_TC_F_REPLACE;

> 		return bpf_tc_attach(&hook, &replace_opts);

> 	} else if (r < 0) {

> 		return r;

> 	}

> 	/* There must be no existing filter with these

> 	 * attributes, so cleanup and return an error.

> 	 */

> 	replace_opts.flags = replace_opts.prog_fd = replace_opts.prog_id = 0;

> 	bpf_tc_detach(&hook, &replace_opts);

> 	return -1;

> 

> To obtain info of a particular filter:

> 

> 	/* Find info for filter with handle 1 and priority 50 */

> 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, info_opts, .handle = 1,

> 			    .priority = 50);

> 	r = bpf_tc_query(&hook, &info_opts);

> 	if (r == -ENOENT)

> 		printf("Filter not found");

> 	else if (r < 0)

> 		return r;

> 	printf("Prog ID: %u", info_opts.prog_id);

> 	return 0;

> 

> We can also match using prog_id to find the same filter:

> 

> 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, info_opts2, .prog_id =

> 			    info_opts.prog_id);

> 	r = bpf_tc_query(&hook, &info_opts2);

> 	if (r == -ENOENT)

> 		printf("Filter not found");

> 	else if (r < 0)

> 		return r;

> 	/* If we know there's only one filter for this loaded prog,

> 	 * it is safe to assert that the handle and priority are

> 	 * as expected.

> 	 */

> 	assert(info_opts2.handle == 1);

> 	assert(info_opts2.priority == 50);


What if a given prog_id is attached to multiple instances?

> 	return 0;

> 

> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>

> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

> ---

>   tools/lib/bpf/libbpf.h   |  42 ++++

>   tools/lib/bpf/libbpf.map |   5 +

>   tools/lib/bpf/netlink.c  | 473 ++++++++++++++++++++++++++++++++++++++-

>   3 files changed, 519 insertions(+), 1 deletion(-)

> 

> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h

> index bec4e6a6e31d..09d1a4fb10f9 100644

> --- a/tools/lib/bpf/libbpf.h

> +++ b/tools/lib/bpf/libbpf.h

> @@ -775,6 +775,48 @@ LIBBPF_API int bpf_linker__add_file(struct bpf_linker *linker, const char *filen

>   LIBBPF_API int bpf_linker__finalize(struct bpf_linker *linker);

>   LIBBPF_API void bpf_linker__free(struct bpf_linker *linker);

>   

> +enum bpf_tc_attach_point {

> +	BPF_TC_INGRESS = 1 << 0,

> +	BPF_TC_EGRESS  = 1 << 1,

> +	BPF_TC_CUSTOM  = 1 << 2,

> +};

> +

> +enum bpf_tc_flags {

> +	BPF_TC_F_REPLACE = 1 << 0,

> +};

> +

> +struct bpf_tc_hook {

> +	size_t sz;

> +	int ifindex;

> +	int flags;


nit: __u32 flags; (or rather dropping as discussed)

> +	enum bpf_tc_attach_point attach_point;

> +	__u32 parent;

> +	size_t :0;

> +};

> +

> +#define bpf_tc_hook__last_field parent

> +

> +struct bpf_tc_opts {

> +	size_t sz;

> +	int prog_fd;

> +	int flags;


nit: __u32 flags;

> +	__u32 prog_id;

> +	__u32 handle;

> +	__u32 priority;

> +	size_t :0;

> +};

> +

> +#define bpf_tc_opts__last_field priority

> +

> +LIBBPF_API int bpf_tc_hook_create(struct bpf_tc_hook *hook);

> +LIBBPF_API int bpf_tc_hook_destroy(struct bpf_tc_hook *hook);

> +LIBBPF_API int bpf_tc_attach(const struct bpf_tc_hook *hook,

> +			     struct bpf_tc_opts *opts);

> +LIBBPF_API int bpf_tc_detach(const struct bpf_tc_hook *hook,

> +			     const struct bpf_tc_opts *opts);

> +LIBBPF_API int bpf_tc_query(const struct bpf_tc_hook *hook,

> +			    struct bpf_tc_opts *opts);

> +

>   #ifdef __cplusplus

>   } /* extern "C" */

>   #endif

> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map

> index b9b29baf1df8..6c96729050dc 100644

> --- a/tools/lib/bpf/libbpf.map

> +++ b/tools/lib/bpf/libbpf.map

> @@ -361,4 +361,9 @@ LIBBPF_0.4.0 {

>   		bpf_linker__new;

>   		bpf_map__inner_map;

>   		bpf_object__set_kversion;

> +		bpf_tc_attach;

> +		bpf_tc_detach;

> +		bpf_tc_hook_create;

> +		bpf_tc_hook_destroy;

> +		bpf_tc_query;

>   } LIBBPF_0.3.0;

> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c

> index 8a01d9eed5f9..95c87f87a178 100644

> --- a/tools/lib/bpf/netlink.c

> +++ b/tools/lib/bpf/netlink.c

> @@ -4,7 +4,10 @@

>   #include <stdlib.h>

>   #include <memory.h>

>   #include <unistd.h>

> +#include <arpa/inet.h>

>   #include <linux/bpf.h>

> +#include <linux/if_ether.h>

> +#include <linux/pkt_cls.h>

>   #include <linux/rtnetlink.h>

>   #include <sys/socket.h>

>   #include <errno.h>

> @@ -73,6 +76,12 @@ static int libbpf_netlink_open(__u32 *nl_pid)

>   	return ret;

>   }

>   

> +enum {

> +	BPF_NL_CONT,

> +	BPF_NL_NEXT,

> +	BPF_NL_DONE,


nit: I don't think we need BPF_ prefix here given it's not specific to BPF.

> +};

> +

>   static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,

>   			    __dump_nlmsg_t _fn, libbpf_dump_nlmsg_t fn,

>   			    void *cookie)

> @@ -84,6 +93,7 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,

>   	int len, ret;

>   

>   	while (multipart) {

> +start:

>   		multipart = false;

>   		len = recv(sock, buf, sizeof(buf), 0);

>   		if (len < 0) {

[...]

Kumar Kartikeya Dwivedi May 6, 2021, 2:37 a.m. UTC | #2

On Thu, May 06, 2021 at 03:12:01AM IST, Daniel Borkmann wrote:
> On 5/4/21 2:50 AM, Kumar Kartikeya Dwivedi wrote:

> > This adds functions that wrap the netlink API used for adding,

> > manipulating, and removing traffic control filters.

> >

> > An API summary:

>

> Looks better, few minor comments below:

>

> > A bpf_tc_hook represents a location where a TC-BPF filter can be

> > attached. This means that creating a hook leads to creation of the

> > backing qdisc, while destruction either removes all filters attached to

> > a hook, or destroys qdisc if requested explicitly (as discussed below).

> >

> > The TC-BPF API functions operate on this bpf_tc_hook to attach, replace,

> > query, and detach tc filters.

> >

> > All functions return 0 on success, and a negative error code on failure.

> >

> > bpf_tc_hook_create - Create a hook

> > Parameters:

> > 	@hook - Cannot be NULL, ifindex > 0, attach_point must be set to

> > 		proper enum constant. Note that parent must be unset when

> > 		attach_point is one of BPF_TC_INGRESS or BPF_TC_EGRESS. Note

> > 		that as an exception BPF_TC_INGRESS|BPF_TC_EGRESS is also a

> > 		valid value for attach_point.

> >

> > 		Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.

> >

> > 		hook's flags member can be BPF_TC_F_REPLACE, which

> > 		creates qdisc in non-exclusive mode (i.e. an existing

> > 		qdisc will be replaced instead of this function failing

> > 		with -EEXIST).

>

> Why supporting BPF_TC_F_REPLACE here? It's not changing any qdisc parameters

> given clsact doesn't have any, no? Iow, what effect are you expecting on this

> with BPF_TC_F_REPLACE & why supporting it? I'd probably just require flags to

> be 0 here, and if hook exists return sth like -EEXIST.

>


Ok, will change.

> > bpf_tc_hook_destroy - Destroy the hook

> > Parameters:

> >          @hook - Cannot be NULL. The behaviour depends on value of

> > 		attach_point.

> >

> > 		If BPF_TC_INGRESS, all filters attached to the ingress

> > 		hook will be detached.

> > 		If BPF_TC_EGRESS, all filters attached to the egress hook

> > 		will be detached.

> > 		If BPF_TC_INGRESS|BPF_TC_EGRESS, the clsact qdisc will be

> > 		deleted, also detaching all filters.

> >

> > 		As before, parent must be unset for these attach_points,

> > 		and set for BPF_TC_CUSTOM. flags must also be unset.

> >

> > 		It is advised that if the qdisc is operated on by many programs,

> > 		then the program at least check that there are no other existing

> > 		filters before deleting the clsact qdisc. An example is shown

> > 		below:

> >

> > 		DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = if_nametoindex("lo"),

> > 				    .attach_point = BPF_TC_INGRESS);

> > 		/* set opts as NULL, as we're not really interested in

> > 		 * getting any info for a particular filter, but just

> > 	 	 * detecting its presence.

> > 		 */

> > 		r = bpf_tc_query(&hook, NULL);

> > 		if (r == -ENOENT) {

> > 			/* no filters */

> > 			hook.attach_point = BPF_TC_INGRESS|BPF_TC_EGRESS;

> > 			return bpf_tc_hook_destroy(&hook);

> > 		} else {

> > 			/* failed or r == 0, the latter means filters do exist */

> > 			return r;

> > 		}

> >

> > 		Note that there is a small race between checking for no

> > 		filters and deleting the qdisc. This is currently unavoidable.

> >

> > 		Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.

> >

> > bpf_tc_attach - Attach a filter to a hook

> > Parameters:

> > 	@hook - Cannot be NULL. Represents the hook the filter will be

> > 		attached to. Requirements for ifindex and attach_point are

> > 		same as described in bpf_tc_hook_create, but BPF_TC_CUSTOM

> > 		is also supported.  In that case, parent must be set to the

> > 		handle where the filter will be attached (using TC_H_MAKE).

> > 		flags member must be unset.

> >

> > 		E.g. To set parent to 1:16 like in tc command line,

> > 		     the equivalent would be TC_H_MAKE(1 << 16, 16)

>

> Small nit: I wonder whether from libbpf side we should just support a more

> user friendly TC_H_MAKE, so you'd have: BPF_TC_CUSTOM + BPF_TC_PARENT(1, 16).

>


Something like this was there in v1. I'll add this macro again (I guess the most surprising part of
TC_H_MAKE is that it won't shift the major number).
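
To make the surprise concrete, here is a small sketch. The TC_H_* definitions
mirror the ones in the kernel UAPI header linux/pkt_sched.h and are repeated
only so the snippet builds without kernel headers. EXAMPLE_TC_PARENT is a
hypothetical name for the kind of helper being discussed, not the macro that
will land in libbpf.

```c
#include <stdint.h>

/* Same definitions as in <linux/pkt_sched.h>, repeated for a standalone build. */
#define TC_H_MAJ_MASK 0xFFFF0000U
#define TC_H_MIN_MASK 0x0000FFFFU
#define TC_H_MAKE(maj, min) (((maj) & TC_H_MAJ_MASK) | ((min) & TC_H_MIN_MASK))

/* Hypothetical friendlier wrapper: shifts the major number for the caller,
 * so (1, 16) means the same as TC_H_MAKE(1 << 16, 16).
 */
#define EXAMPLE_TC_PARENT(major, minor) \
	TC_H_MAKE((uint32_t)(major) << 16, (uint32_t)(minor))

/* Function form, handy when the values are only known at run time. */
static inline uint32_t example_tc_parent(uint32_t major, uint32_t minor)
{
	return EXAMPLE_TC_PARENT(major, minor);
}
```

Passing an unshifted major, as in TC_H_MAKE(1, 16), silently masks the major
away and yields 0x00000010, whereas TC_H_MAKE(1 << 16, 16) and
EXAMPLE_TC_PARENT(1, 16) both give 0x00010010.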

> > 	@opts - Cannot be NULL.

> >

> > 		The following opts are optional:

> > 			handle - The handle of the filter

> > 			priority - The priority of the filter

> > 				   Must be >= 0 and <= UINT16_MAX

>

> It should probably be mentioned that if they are not specified, then they

> are auto-allocated from kernel.


Right, I'll add a small note.

>

> > 		The following opts must be set:

> > 			prog_fd - The fd of the loaded SCHED_CLS prog

> > 		The following opts must be unset:

> > 			prog_id - The ID of the BPF prog

> > 		The following opts are optional:

> > 			flags - Currently only BPF_TC_F_REPLACE is

> > 				allowed. It allows replacing an existing

> > 				filter instead of failing with -EEXIST.

> >

> > 		The following opts will be filled by bpf_tc_attach on a

> > 		successful attach operation if they are unset:

> > 			handle - The handle of the attached filter

> > 			priority - The priority of the attached filter

> > 			prog_id - The ID of the attached SCHED_CLS prog

> >

> > 		This way, the user can know what the auto allocated

> > 		values for optional opts like handle and priority are

> > 		for the newly attached filter, if they were unset.

> >

> > 		Note that some other attributes are set to some default

> > 		values listed below (this holds for all bpf_tc_* APIs):

> > 			protocol - ETH_P_ALL

> > 			mode - direct action

> > 			chain index - 0

> > 			class ID - 0 (this can be set by writing to the

> > 			skb->tc_classid field from the BPF program)

> >

> > bpf_tc_detach

> > Parameters:

> > 	@hook: Cannot be NULL. Represents the hook the filter will be

> > 		detached from. Requirements are same as described above

> > 		in bpf_tc_attach.

> >

> > 	@opts:	Cannot be NULL.

> >

> > 		The following opts must be set:

> > 			handle

> > 			priority

> > 		The following opts must be unset:

> > 			prog_fd

> > 			prog_id

> > 			flags

> >

> > bpf_tc_query

> > Parameters:

> > 	@hook: Cannot be NULL. Represents the hook where the filter

> > 	       lookup will be performed. Requirements are same as described

> > 	       above in bpf_tc_attach.

> >

> > 	@opts: Can be NULL.

>

> Shouldn't it be: Cannot be NULL?

>


This allows you to check the existence of a filter. If set to NULL we skip writing anything to opts,
but we still return -ENOENT or 0 depending on whether at least one filter exists (based on the
default attributes that we choose). This is used in multiple places in the test to determine
whether no filters exist.

> > 	       The following opts are optional:

> > 			handle

> > 			priority

> > 			prog_fd

> > 			prog_id

>

> What is the use case to set prog_fd here?

>


It allows you to search with the prog_id of the program represented by fd. It's just a convenience
thing, we end up doing a call to get the prog_id for you, and since the parameter is already there,
it seemed ok to support this.

> > 	       The following opts must be unset:

> > 			flags

> >

> > 	       However, only one of prog_fd and prog_id must be

> > 	       set. Setting both leads to an error. Setting none is

> > 	       allowed.

> >

> > 	       The following fields will be filled by bpf_tc_query on a

> > 	       successful lookup if they are unset:

> > 			handle

> > 			priority

> > 			prog_id

> >

> > 	       Based on the specified optional parameters, the matching

> > 	       data for the first matching filter is filled in and 0 is

> > 	       returned. When setting prog_fd, the prog_id will be

> > 	       matched against prog_id of the loaded SCHED_CLS prog

> > 	       represented by prog_fd.

> >

> > 	       To uniquely identify a filter, e.g. to detect its presence,

> > 	       it is recommended to set both handle and priority fields.

>

> What if prog_id is not unique, but part of multiple instances? Do we need

> to support this case?


We return the first filter that matches on the prog_id. I think it is worthwhile to support this:
as long as the kernel's sequence of returning filters is stable (which it is), we keep returning the
same filter's handle/priority. So you can essentially pop filters attached to a hook one by one by
passing in unset opts and getting its details (or by setting one of the parameters to make the
lookup domain smaller).

In simple words, setting one of the parameters that will be filled leads to only returning an entry
that matches them. This is similar to what tc filter show's dump allows you to do.
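
The matching rule described here can be modelled in a few lines. This is a
toy model over an in-memory array, not the netlink-based implementation from
the patch: struct example_filter and example_query() are hypothetical, and a
field value of 0 is assumed to mean "unset, match anything".

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of an attached filter. */
struct example_filter {
	uint32_t handle;
	uint32_t priority;
	uint32_t prog_id;
};

/* Returns 0 and fills *opts with the first filter that matches every
 * field set in opts (0 means unset). With opts == NULL it only reports
 * whether any filter exists. Returns -ENOENT when nothing matches.
 */
static int example_query(const struct example_filter *filters, size_t n,
			 struct example_filter *opts)
{
	size_t i;

	for (i = 0; i < n; i++) {
		const struct example_filter *f = &filters[i];

		if (opts) {
			if (opts->handle && opts->handle != f->handle)
				continue;
			if (opts->priority && opts->priority != f->priority)
				continue;
			if (opts->prog_id && opts->prog_id != f->prog_id)
				continue;
			*opts = *f;	/* fill in the unset fields */
		}
		return 0;
	}
	return -ENOENT;
}
```

If two filters share prog_id 7, a query with only prog_id set returns the
first of them every time, as long as the iteration order is stable. Adding
priority (or handle) narrows the lookup to the second one.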

>

> Why not just bpf_tc_query() with non-NULL hook and non-NULL opts where

> handle and priority is required to be set, and rest must be 0?

>


There is also a use case for us where we need to query the existing filter on a hook without knowing
its handle/priority. Shaun also mentioned something similar, where they then go on to check the tag
they get from the returned prog_id to determine what to do next.

> > Some usage examples (using bpf skeleton infrastructure):

> >

> > BPF program (test_tc_bpf.c):

> >

> > 	#include <linux/bpf.h>

> > 	#include <bpf/bpf_helpers.h>

> >

> > 	SEC("classifier")

> > 	int cls(struct __sk_buff *skb)

> > 	{

> > 		return 0;

> > 	}

> >

> > Userspace loader:

> >

> > 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, 0);

> > 	struct test_tc_bpf *skel = NULL;

> > 	int fd, r;

> >

> > 	skel = test_tc_bpf__open_and_load();

> > 	if (!skel)

> > 		return -ENOMEM;

> >

> > 	fd = bpf_program__fd(skel->progs.cls);

> >

> > 	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex =

> > 			    if_nametoindex("lo"), .attach_point =

> > 			    BPF_TC_INGRESS);

> > 	/* Create clsact qdisc */

> > 	r = bpf_tc_hook_create(&hook);

> > 	if (r < 0)

> > 		goto end;

> >

> > 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = fd);

>

> Given we had DECLARE_LIBBPF_OPTS earlier, can't we just set:

> opts.prog_fd = fd here?


Right, will fix.

>

> > 	r = bpf_tc_attach(&hook, &opts);

> > 	if (r < 0)

> > 		goto end;

> > 	/* Print the auto allocated handle and priority */

> > 	printf("Handle=%u", opts.handle);

> > 	printf("Priority=%u", opts.priority);

> >

> > 	opts.prog_fd = opts.prog_id = 0;

> > 	bpf_tc_detach(&hook, &opts);

>

> Here we detach ...

>

> > end:

> > 	test_tc_bpf__destroy(skel);

> >

> > This is equivalent to doing the following using tc command line:

> >    # tc qdisc add dev lo clsact

> >    # tc filter add dev lo ingress bpf obj foo.o sec classifier da

>

> ... so this is not equivalent to your tc cmdline description.

>


I'll add a tc filter del.

> > Another example replacing a filter (extending prior example):

> >

> > 	/* We can also choose both (or one), let's try replacing an

> > 	 * existing filter.

> > 	 */

> > 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, replace_opts, .handle =

> > 			    opts.handle, .priority = opts.priority,

> > 			    .prog_fd = fd);

> > 	r = bpf_tc_attach(&hook, &replace_opts);

> > 	if (r == -EEXIST) {

> > 		/* Expected, now use BPF_TC_F_REPLACE to replace it */

> > 		replace_opts.flags = BPF_TC_F_REPLACE;

> > 		return bpf_tc_attach(&hook, &replace_opts);

> > 	} else if (r < 0) {

> > 		return r;

> > 	}

> > 	/* There must be no existing filter with these

> > 	 * attributes, so cleanup and return an error.

> > 	 */

> > 	replace_opts.flags = replace_opts.prog_fd = replace_opts.prog_id = 0;

> > 	bpf_tc_detach(&hook, &replace_opts);

> > 	return -1;

> >

> > To obtain info of a particular filter:

> >

> > 	/* Find info for filter with handle 1 and priority 50 */

> > 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, info_opts, .handle = 1,

> > 			    .priority = 50);

> > 	r = bpf_tc_query(&hook, &info_opts);

> > 	if (r == -ENOENT)

> > 		printf("Filter not found");

> > 	else if (r < 0)

> > 		return r;

> > 	printf("Prog ID: %u", info_opts.prog_id);

> > 	return 0;

> >

> > We can also match using prog_id to find the same filter:

> >

> > 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, info_opts2, .prog_id =

> > 			    info_opts.prog_id);

> > 	r = bpf_tc_query(&hook, &info_opts2);

> > 	if (r == -ENOENT)

> > 		printf("Filter not found");

> > 	else if (r < 0)

> > 		return r;

> > 	/* If we know there's only one filter for this loaded prog,

> > 	 * it is safe to assert that the handle and priority are

> > 	 * as expected.

> > 	 */

> > 	assert(info_opts2.handle == 1);

> > 	assert(info_opts2.priority == 50);

>

> What if a given prog_id is attached to multiple instances?

>


The first match is returned.

> > 	return 0;

> >

> > Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>

> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

> > ---

> >   tools/lib/bpf/libbpf.h   |  42 ++++

> >   tools/lib/bpf/libbpf.map |   5 +

> >   tools/lib/bpf/netlink.c  | 473 ++++++++++++++++++++++++++++++++++++++-

> >   3 files changed, 519 insertions(+), 1 deletion(-)

> >

> > diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h

> > index bec4e6a6e31d..09d1a4fb10f9 100644

> > --- a/tools/lib/bpf/libbpf.h

> > +++ b/tools/lib/bpf/libbpf.h

> > @@ -775,6 +775,48 @@ LIBBPF_API int bpf_linker__add_file(struct bpf_linker *linker, const char *filen

> >   LIBBPF_API int bpf_linker__finalize(struct bpf_linker *linker);

> >   LIBBPF_API void bpf_linker__free(struct bpf_linker *linker);

> > +enum bpf_tc_attach_point {

> > +	BPF_TC_INGRESS = 1 << 0,

> > +	BPF_TC_EGRESS  = 1 << 1,

> > +	BPF_TC_CUSTOM  = 1 << 2,

> > +};

> > +

> > +enum bpf_tc_flags {

> > +	BPF_TC_F_REPLACE = 1 << 0,

> > +};

> > +

> > +struct bpf_tc_hook {

> > +	size_t sz;

> > +	int ifindex;

> > +	int flags;

>

> nit: __u32 flags; (or rather dropping as discussed)

>


I'll drop it for now.

> > +	enum bpf_tc_attach_point attach_point;

> > +	__u32 parent;

> > +	size_t :0;

> > +};

> > +

> > +#define bpf_tc_hook__last_field parent

> > +

> > +struct bpf_tc_opts {

> > +	size_t sz;

> > +	int prog_fd;

> > +	int flags;

>

> nit: __u32 flags;

>


Ok.

> > +	__u32 prog_id;

> > +	__u32 handle;

> > +	__u32 priority;

> > +	size_t :0;

> > +};

> > +

> > +#define bpf_tc_opts__last_field priority

> > +

> > +LIBBPF_API int bpf_tc_hook_create(struct bpf_tc_hook *hook);

> > +LIBBPF_API int bpf_tc_hook_destroy(struct bpf_tc_hook *hook);

> > +LIBBPF_API int bpf_tc_attach(const struct bpf_tc_hook *hook,

> > +			     struct bpf_tc_opts *opts);

> > +LIBBPF_API int bpf_tc_detach(const struct bpf_tc_hook *hook,

> > +			     const struct bpf_tc_opts *opts);

> > +LIBBPF_API int bpf_tc_query(const struct bpf_tc_hook *hook,

> > +			    struct bpf_tc_opts *opts);

> > +

> >   #ifdef __cplusplus

> >   } /* extern "C" */

> >   #endif

> > diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map

> > index b9b29baf1df8..6c96729050dc 100644

> > --- a/tools/lib/bpf/libbpf.map

> > +++ b/tools/lib/bpf/libbpf.map

> > @@ -361,4 +361,9 @@ LIBBPF_0.4.0 {

> >   		bpf_linker__new;

> >   		bpf_map__inner_map;

> >   		bpf_object__set_kversion;

> > +		bpf_tc_attach;

> > +		bpf_tc_detach;

> > +		bpf_tc_hook_create;

> > +		bpf_tc_hook_destroy;

> > +		bpf_tc_query;

> >   } LIBBPF_0.3.0;

> > diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c

> > index 8a01d9eed5f9..95c87f87a178 100644

> > --- a/tools/lib/bpf/netlink.c

> > +++ b/tools/lib/bpf/netlink.c

> > @@ -4,7 +4,10 @@

> >   #include <stdlib.h>

> >   #include <memory.h>

> >   #include <unistd.h>

> > +#include <arpa/inet.h>

> >   #include <linux/bpf.h>

> > +#include <linux/if_ether.h>

> > +#include <linux/pkt_cls.h>

> >   #include <linux/rtnetlink.h>

> >   #include <sys/socket.h>

> >   #include <errno.h>

> > @@ -73,6 +76,12 @@ static int libbpf_netlink_open(__u32 *nl_pid)

> >   	return ret;

> >   }

> > +enum {

> > +	BPF_NL_CONT,

> > +	BPF_NL_NEXT,

> > +	BPF_NL_DONE,

>

> nit: I don't think we need BPF_ prefix here given it's not specific to BPF.

>


Ok, will rename.

> > +};

> > +

> >   static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,

> >   			    __dump_nlmsg_t _fn, libbpf_dump_nlmsg_t fn,

> >   			    void *cookie)

> > @@ -84,6 +93,7 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,

> >   	int len, ret;

> >   	while (multipart) {

> > +start:

> >   		multipart = false;

> >   		len = recv(sock, buf, sizeof(buf), 0);

> >   		if (len < 0) {

> [...]


--
Kartikeya

Daniel Borkmann May 6, 2021, 9:57 p.m. UTC | #3

On 5/6/21 4:37 AM, Kumar Kartikeya Dwivedi wrote:
> On Thu, May 06, 2021 at 03:12:01AM IST, Daniel Borkmann wrote:

>> On 5/4/21 2:50 AM, Kumar Kartikeya Dwivedi wrote:

>>> This adds functions that wrap the netlink API used for adding,

>>> manipulating, and removing traffic control filters.

>>>

>>> An API summary:

>>

>> Looks better, few minor comments below:

>>

>>> A bpf_tc_hook represents a location where a TC-BPF filter can be

>>> attached. This means that creating a hook leads to creation of the

>>> backing qdisc, while destruction either removes all filters attached to

>>> a hook, or destroys qdisc if requested explicitly (as discussed below).

>>>

>>> The TC-BPF API functions operate on this bpf_tc_hook to attach, replace,

>>> query, and detach tc filters.

>>>

>>> All functions return 0 on success, and a negative error code on failure.

>>>

>>> bpf_tc_hook_create - Create a hook

>>> Parameters:

>>> 	@hook - Cannot be NULL, ifindex > 0, attach_point must be set to

>>> 		proper enum constant. Note that parent must be unset when

>>> 		attach_point is one of BPF_TC_INGRESS or BPF_TC_EGRESS. Note

>>> 		that as an exception BPF_TC_INGRESS|BPF_TC_EGRESS is also a

>>> 		valid value for attach_point.

>>>

>>> 		Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.

>>>

>>> 		hook's flags member can be BPF_TC_F_REPLACE, which

>>> 		creates qdisc in non-exclusive mode (i.e. an existing

>>> 		qdisc will be replaced instead of this function failing

>>> 		with -EEXIST).

>>

>> Why supporting BPF_TC_F_REPLACE here? It's not changing any qdisc parameters

>> given clsact doesn't have any, no? Iow, what effect are you expecting on this

>> with BPF_TC_F_REPLACE & why supporting it? I'd probably just require flags to

>> be 0 here, and if hook exists return sth like -EEXIST.

> 

> Ok, will change.

> 

>>> bpf_tc_hook_destroy - Destroy the hook

>>> Parameters:

>>>           @hook - Cannot be NULL. The behaviour depends on value of

>>> 		attach_point.

>>>

>>> 		If BPF_TC_INGRESS, all filters attached to the ingress

>>> 		hook will be detached.

>>> 		If BPF_TC_EGRESS, all filters attached to the egress hook

>>> 		will be detached.

>>> 		If BPF_TC_INGRESS|BPF_TC_EGRESS, the clsact qdisc will be

>>> 		deleted, also detaching all filters.

>>>

>>> 		As before, parent must be unset for these attach_points,

>>> 		and set for BPF_TC_CUSTOM. flags must also be unset.

>>>

>>> 		It is advised that if the qdisc is operated on by many programs,

>>> 		then the program at least check that there are no other existing

>>> 		filters before deleting the clsact qdisc. An example is shown

>>> 		below:

>>>

>>> 		DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = if_nametoindex("lo"),

>>> 				    .attach_point = BPF_TC_INGRESS);

>>> 		/* set opts as NULL, as we're not really interested in

>>> 		 * getting any info for a particular filter, but just

>>> 	 	 * detecting its presence.

>>> 		 */

>>> 		r = bpf_tc_query(&hook, NULL);

>>> 		if (r == -ENOENT) {

>>> 			/* no filters */

>>> 			hook.attach_point = BPF_TC_INGRESS|BPF_TC_EGRESS;

>>> 			return bpf_tc_hook_destroy(&hook);

>>> 		} else {

>>> 			/* failed or r == 0, the latter means filters do exist */

>>> 			return r;

>>> 		}

>>>

>>> 		Note that there is a small race between checking for no

>>> 		filters and deleting the qdisc. This is currently unavoidable.

>>>

>>> 		Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.

>>>

>>> bpf_tc_attach - Attach a filter to a hook

>>> Parameters:

>>> 	@hook - Cannot be NULL. Represents the hook the filter will be

>>> 		attached to. Requirements for ifindex and attach_point are

>>> 		same as described in bpf_tc_hook_create, but BPF_TC_CUSTOM

>>> 		is also supported.  In that case, parent must be set to the

>>> 		handle where the filter will be attached (using TC_H_MAKE).

>>> 		flags member must be unset.

>>>

>>> 		E.g. To set parent to 1:16 like in tc command line,

>>> 		     the equivalent would be TC_H_MAKE(1 << 16, 16)

>>

>> Small nit: I wonder whether from libbpf side we should just support a more

>> user friendly TC_H_MAKE, so you'd have: BPF_TC_CUSTOM + BPF_TC_PARENT(1, 16).

> 

> Something like this was there in v1. I'll add this macro again (I guess the most surprising part of

> TC_H_MAKE is that it won't shift the major number).


Agree, weird one. :)
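
To make the surprise concrete, here is a minimal sketch. The EX_* names are placeholders and not a
proposed libbpf API; the two masks mirror TC_H_MAJ_MASK and TC_H_MIN_MASK from <linux/pkt_sched.h>,
and EX_H_MAKE has the same shape as TC_H_MAKE.

```c
#include <stdint.h>

/* Local copies of the major/minor masks, kept here for illustration. */
#define EX_H_MAJ_MASK 0xFFFF0000U
#define EX_H_MIN_MASK 0x0000FFFFU

/*
 * Same shape as TC_H_MAKE: the major number is only masked, never shifted,
 * so the caller must already have placed it in the upper 16 bits.
 */
#define EX_H_MAKE(maj, min) (((maj) & EX_H_MAJ_MASK) | ((min) & EX_H_MIN_MASK))

/* Hypothetical friendlier wrapper that shifts the major for the caller. */
#define EX_TC_PARENT(maj, min) \
	EX_H_MAKE((uint32_t)(maj) << 16, (uint32_t)(min))
```

With these definitions EX_TC_PARENT(1, 16) equals EX_H_MAKE(1 << 16, 16), while EX_H_MAKE(1, 16)
silently drops the major number and yields just the minor.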

[...]
>>> bpf_tc_detach

>>> Parameters:

>>> 	@hook: Cannot be NULL. Represents the hook the filter will be

>>> 		detached from. Requirements are same as described above

>>> 		in bpf_tc_attach.

>>>

>>> 	@opts:	Cannot be NULL.

>>>

>>> 		The following opts must be set:

>>> 			handle

>>> 			priority

>>> 		The following opts must be unset:

>>> 			prog_fd

>>> 			prog_id

>>> 			flags

>>>

>>> bpf_tc_query

>>> Parameters:

>>> 	@hook: Cannot be NULL. Represents the hook where the filter

>>> 	       lookup will be performed. Requirements are same as described

>>> 	       above in bpf_tc_attach.

>>>

>>> 	@opts: Can be NULL.

>>

>> Shouldn't it be: Cannot be NULL?

> 

> This allows you to check the existence of a filter. If set to NULL we skip writing anything to opts,


You mean in this case s/filter/hook/, right?

> but we still return -ENOENT or 0 depending on whether at least one filter exists (based on the

> default attributes that we choose). This is used in multiple places in the test, to determine

> whether no filters exist.


In other words, it's same as bpf_tc_hook_create() which would return -EEXIST just that
we do /not/ create the hook if it does not exist, right?

>>> 	       The following opts are optional:

>>> 			handle

>>> 			priority

>>> 			prog_fd

>>> 			prog_id

>>

>> What is the use case to set prog_fd here?

> 

> It allows you to search with the prog_id of the program represented by fd. It's just a convenience

> thing, we end up doing a call to get the prog_id for you, and since the parameter is already there,

> it seemed ok to support this.


I would drop that part and have prog_fd forced to 0, given libbpf already has other means to
retrieve the prog_id from an fd, and if that is inconvenient, then let's add a simple/generic libbpf API.

>>> 	       The following opts must be unset:

>>> 			flags

>>>

>>> 	       However, only one of prog_fd and prog_id must be

>>> 	       set. Setting both leads to an error. Setting none is

>>> 	       allowed.

>>>

>>> 	       The following fields will be filled by bpf_tc_query on a

>>> 	       successful lookup if they are unset:

>>> 			handle

>>> 			priority

>>> 			prog_id

>>>

>>> 	       Based on the specified optional parameters, the matching

>>> 	       data for the first matching filter is filled in and 0 is

>>> 	       returned. When setting prog_fd, the prog_id will be

>>> 	       matched against prog_id of the loaded SCHED_CLS prog

>>> 	       represented by prog_fd.

>>>

>>> 	       To uniquely identify a filter, e.g. to detect its presence,

>>> 	       it is recommended to set both handle and priority fields.

>>

>> What if prog_id is not unique, but part of multiple instances? Do we need

>> to support this case?

> 

> We return the first filter that matches on the prog_id. I think it is worthwhile to support this, as

> long as the kernel's sequence of returning filters is stable (which it is), we keep returning the

> same filter's handle/priority, so you can essentially pop filters attached to a hook one by one by

> passing in unset opts and getting its details (or setting one of the parameters and making the

> lookup domain smaller).

> 

> In simple words, setting one of the parameters that will be filled leads to only returning an entry

> that matches them. This is similar to what tc filter show's dump allows you to do.


I think this is rather weird/hacky/unintuitive. If we need such an API, then let's add a
proper one which returns all handle/priority combinations that match for a given prog_id
for the provided hook, but I don't think this needs to be in the initial set; could be done
as follow-up. (*)

>> Why not just bpf_tc_query() with non-NULL hook and non-NULL opts where

>> handle and priority is required to be set, and rest must be 0?

> 

> There is also a use case for us where we need to query the existing filter on a hook without knowing

> its handle/priority. Shaun also mentioned something similar, where they then go on to check the tag

> they get from the returned prog_id to determine what to do next.


See (*).

>>> Some usage examples (using bpf skeleton infrastructure):

>>>

>>> BPF program (test_tc_bpf.c):

>>>

>>> 	#include <linux/bpf.h>

>>> 	#include <bpf/bpf_helpers.h>

>>>

>>> 	SEC("classifier")

>>> 	int cls(struct __sk_buff *skb)

>>> 	{

>>> 		return 0;

>>> 	}

>>>

>>> Userspace loader:

>>>

>>> 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, 0);

>>> 	struct test_tc_bpf *skel = NULL;

>>> 	int fd, r;

>>>

>>> 	skel = test_tc_bpf__open_and_load();

>>> 	if (!skel)

>>> 		return -ENOMEM;

>>>

>>> 	fd = bpf_program__fd(skel->progs.cls);

>>>

>>> 	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex =

>>> 			    if_nametoindex("lo"), .attach_point =

>>> 			    BPF_TC_INGRESS);

>>> 	/* Create clsact qdisc */

>>> 	r = bpf_tc_hook_create(&hook);

>>> 	if (r < 0)

>>> 		goto end;

>>>

>>> 	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = fd);

>>

>> Given we had DECLARE_LIBBPF_OPTS earlier, can't we just set:

>> opts.prog_fd = fd here?

> 

> Right, will fix.

> 

>>

>>> 	r = bpf_tc_attach(&hook, &opts);

>>> 	if (r < 0)

>>> 		goto end;

>>> 	/* Print the auto allocated handle and priority */

>>> 	printf("Handle=%u", opts.handle);

>>> 	printf("Priority=%u", opts.priority);

>>>

>>> 	opts.prog_fd = opts.prog_id = 0;

>>> 	bpf_tc_detach(&hook, &opts);

>>


Thanks,
Daniel

Kumar Kartikeya Dwivedi May 7, 2021, 2:11 a.m. UTC | #4

On Fri, May 07, 2021 at 03:27:10AM IST, Daniel Borkmann wrote:
> On 5/6/21 4:37 AM, Kumar Kartikeya Dwivedi wrote:

> > On Thu, May 06, 2021 at 03:12:01AM IST, Daniel Borkmann wrote:

> > > On 5/4/21 2:50 AM, Kumar Kartikeya Dwivedi wrote:

> > > > This adds functions that wrap the netlink API used for adding,

> > > > manipulating, and removing traffic control filters.

> > > >

> > > > An API summary:

> > >

> > > Looks better, few minor comments below:

> > >

> > > > A bpf_tc_hook represents a location where a TC-BPF filter can be

> > > > attached. This means that creating a hook leads to creation of the

> > > > backing qdisc, while destruction either removes all filters attached to

> > > > a hook, or destroys qdisc if requested explicitly (as discussed below).

> > > >

> > > > The TC-BPF API functions operate on this bpf_tc_hook to attach, replace,

> > > > query, and detach tc filters.

> > > >

> > > > All functions return 0 on success, and a negative error code on failure.

> > > >

> > > > bpf_tc_hook_create - Create a hook

> > > > Parameters:

> > > > 	@hook - Cannot be NULL, ifindex > 0, attach_point must be set to

> > > > 		proper enum constant. Note that parent must be unset when

> > > > 		attach_point is one of BPF_TC_INGRESS or BPF_TC_EGRESS. Note

> > > > 		that as an exception BPF_TC_INGRESS|BPF_TC_EGRESS is also a

> > > > 		valid value for attach_point.

> > > >

> > > > 		Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.

> > > >

> > > > 		hook's flags member can be BPF_TC_F_REPLACE, which

> > > > 		creates qdisc in non-exclusive mode (i.e. an existing

> > > > 		qdisc will be replaced instead of this function failing

> > > > 		with -EEXIST).

> > >

> > > Why supporting BPF_TC_F_REPLACE here? It's not changing any qdisc parameters

> > > given clsact doesn't have any, no? Iow, what effect are you expecting on this

> > > with BPF_TC_F_REPLACE & why supporting it? I'd probably just require flags to

> > > be 0 here, and if hook exists return sth like -EEXIST.

> >

> > Ok, will change.

> >

> > > > bpf_tc_hook_destroy - Destroy the hook

> > > > Parameters:

> > > >           @hook - Cannot be NULL. The behaviour depends on value of

> > > > 		attach_point.

> > > >

> > > > 		If BPF_TC_INGRESS, all filters attached to the ingress

> > > > 		hook will be detached.

> > > > 		If BPF_TC_EGRESS, all filters attached to the egress hook

> > > > 		will be detached.

> > > > 		If BPF_TC_INGRESS|BPF_TC_EGRESS, the clsact qdisc will be

> > > > 		deleted, also detaching all filters.

> > > >

> > > > 		As before, parent must be unset for these attach_points,

> > > > 		and set for BPF_TC_CUSTOM. flags must also be unset.

> > > >

> > > > 		It is advised that if the qdisc is operated on by many programs,

> > > > 		then the program at least check that there are no other existing

> > > > 		filters before deleting the clsact qdisc. An example is shown

> > > > 		below:

> > > >

> > > > 		DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = if_nametoindex("lo"),

> > > > 				    .attach_point = BPF_TC_INGRESS);

> > > > 		/* set opts as NULL, as we're not really interested in

> > > > 		 * getting any info for a particular filter, but just

> > > > 	 	 * detecting its presence.

> > > > 		 */

> > > > 		r = bpf_tc_query(&hook, NULL);

> > > > 		if (r == -ENOENT) {

> > > > 			/* no filters */

> > > > 			hook.attach_point = BPF_TC_INGRESS|BPF_TC_EGRESS;

> > > > 			return bpf_tc_hook_destroy(&hook);

> > > > 		} else {

> > > > 			/* failed or r == 0, the latter means filters do exist */

> > > > 			return r;

> > > > 		}

> > > >

> > > > 		Note that there is a small race between checking for no

> > > > 		filters and deleting the qdisc. This is currently unavoidable.

> > > >

> > > > 		Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.

> > > >

> > > > bpf_tc_attach - Attach a filter to a hook

> > > > Parameters:

> > > > 	@hook - Cannot be NULL. Represents the hook the filter will be

> > > > 		attached to. Requirements for ifindex and attach_point are

> > > > 		same as described in bpf_tc_hook_create, but BPF_TC_CUSTOM

> > > > 		is also supported.  In that case, parent must be set to the

> > > > 		handle where the filter will be attached (using TC_H_MAKE).

> > > > 		flags member must be unset.

> > > >

> > > > 		E.g. To set parent to 1:16 like in tc command line,

> > > > 		     the equivalent would be TC_H_MAKE(1 << 16, 16)

> > >

> > > Small nit: I wonder whether from libbpf side we should just support a more

> > > user friendly TC_H_MAKE, so you'd have: BPF_TC_CUSTOM + BPF_TC_PARENT(1, 16).

> >

> > Something like this was there in v1. I'll add this macro again (I guess the most surprising part of

> > TC_H_MAKE is that it won't shift the major number).

>

> Agree, weird one. :)

>

> [...]

> > > > bpf_tc_detach

> > > > Parameters:

> > > > 	@hook: Cannot be NULL. Represents the hook the filter will be

> > > > 		detached from. Requirements are same as described above

> > > > 		in bpf_tc_attach.

> > > >

> > > > 	@opts:	Cannot be NULL.

> > > >

> > > > 		The following opts must be set:

> > > > 			handle

> > > > 			priority

> > > > 		The following opts must be unset:

> > > > 			prog_fd

> > > > 			prog_id

> > > > 			flags

> > > >

> > > > bpf_tc_query

> > > > Parameters:

> > > > 	@hook: Cannot be NULL. Represents the hook where the filter

> > > > 	       lookup will be performed. Requirements are same as described

> > > > 	       above in bpf_tc_attach.

> > > >

> > > > 	@opts: Can be NULL.

> > >

> > > Shouldn't it be: Cannot be NULL?

> >

> > This allows you to check the existence of a filter. If set to NULL we skip writing anything to opts,

>

> You mean in this case s/filter/hook/, right?

>


Hm? I do mean filter. Since there is nothing to fill, we stop reading further and return early
(but set info->processed), indicating that at least one filter is attached.

It also allows you to implement the if (zero_filters()) del_qdisc(); logic on your own.

> > but we still return -ENOENT or 0 depending on whether at least one filter exists (based on the

> > default attributes that we choose). This is used in multiple places in the test, to determine

> > whether no filters exist.

>

> In other words, it's same as bpf_tc_hook_create() which would return -EEXIST just that

> we do /not/ create the hook if it does not exist, right?

>


It really has nothing to do with bpf_tc_hook, that is only for obtaining ifindex/parent. With opts
as NULL you can just determine if there is any filter that is attached to the hook or not. In case
you do pass in opts but leave everything unset, we then return the first match.

> > > > 	       The following opts are optional:

> > > > 			handle

> > > > 			priority

> > > > 			prog_fd

> > > > 			prog_id

> > >

> > > What is the use case to set prog_fd here?

> >

> > It allows you to search with the prog_id of the program represented by fd. It's just a convenience

> > thing, we end up doing a call to get the prog_id for you, and since the parameter is already there,

> > it seemed ok to support this.

>

> I would drop that part and have prog_fd forced to 0, given libbpf already has other means to

> retrieve the prog_id from an fd, and if that is inconvenient, then let's add a simple/generic libbpf API.

>


Ok, will drop.

> > > > 	       The following opts must be unset:

> > > > 			flags

> > > >

> > > > 	       However, only one of prog_fd and prog_id must be

> > > > 	       set. Setting both leads to an error. Setting none is

> > > > 	       allowed.

> > > >

> > > > 	       The following fields will be filled by bpf_tc_query on a

> > > > 	       successful lookup if they are unset:

> > > > 			handle

> > > > 			priority

> > > > 			prog_id

> > > >

> > > > 	       Based on the specified optional parameters, the matching

> > > > 	       data for the first matching filter is filled in and 0 is

> > > > 	       returned. When setting prog_fd, the prog_id will be

> > > > 	       matched against prog_id of the loaded SCHED_CLS prog

> > > > 	       represented by prog_fd.

> > > >

> > > > 	       To uniquely identify a filter, e.g. to detect its presence,

> > > > 	       it is recommended to set both handle and priority fields.

> > >

> > > What if prog_id is not unique, but part of multiple instances? Do we need

> > > to support this case?

> >

> > We return the first filter that matches on the prog_id. I think it is worthwhile to support this, as

> > long as the kernel's sequence of returning filters is stable (which it is), we keep returning the

> > same filter's handle/priority, so you can essentially pop filters attached to a hook one by one by

> > passing in unset opts and getting its details (or setting one of the parameters and making the

> > lookup domain smaller).

> >

> > In simple words, setting one of the parameters that will be filled leads to only returning an entry

> > that matches them. This is similar to what tc filter show's dump allows you to do.

>

> I think this is rather weird/hacky/unintuitive. If we need such an API, then let's add a

> proper one which returns all handle/priority combinations that match for a given prog_id

> for the provided hook, but I don't think this needs to be in the initial set; could be done

> as follow-up. (*)

>


When initially adding this, Toke and I did discuss the possibility of a query API that returns all
matches instead of the first one, but there were a few questions around how this would be returned.

One of the ways would be to require the caller to provide a buffer, and then provide some way to
iterate over this. The latter part is easy, but it is hard to predict how big the buffer should be
from the calling end. If it is smaller than needed, we will have to leave out entries and indicate
that in some way, and we also cannot return them in a subsequent call as that would break atomicity
(and we would have to know where to seek forward to in the netlink reply the next time).
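
As a sketch of this first option only (the ex_* names and the in-memory filter array are
hypothetical and stand in for the netlink dump), the call could fill as many entries as fit and
return the total number of matches, much like snprintf reports the length it wanted. The caller
detects truncation by comparing the return value with the capacity, but a retry with a larger
buffer is a second, separate dump, which is exactly the atomicity problem described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; neither struct exists in libbpf. */
struct ex_filter {
	uint32_t handle;
	uint32_t priority;
	uint32_t prog_id;
};

struct ex_match {
	uint32_t handle;
	uint32_t priority;
};

/*
 * Store up to cap matches for prog_id in out[] and return how many filters
 * matched in total. A return value larger than cap means entries were
 * left out. out may be NULL when cap is 0, to only count the matches.
 */
static size_t ex_query_all(const struct ex_filter *filters, size_t n,
			   uint32_t prog_id, struct ex_match *out, size_t cap)
{
	size_t found = 0, i;

	for (i = 0; i < n; i++) {
		if (filters[i].prog_id != prog_id)
			continue;
		if (found < cap) {
			out[found].handle = filters[i].handle;
			out[found].priority = filters[i].priority;
		}
		found++;
	}
	return found;
}
```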

The other way is making bpf_tc_query return an allocated buffer. This is better, as we can keep
doing realloc to grow it as needed, but again it seems like there must be a cap on the maximum size
to avoid unbounded growth (as there can be potentially many, many matching filters, especially on
use of NLM_F_DUMP).

A completely different idea would be to take a callback pointer from the user and invoke it for each
matching entry, allowing the user to do whatever they want with it (and have a void *userdata
parameter which we pass in). This sounds nice and has many benefits, but is potentially slower as
far as iteration is concerned (which might be a valid concern depending on how this interface is
used in the future).
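
A minimal sketch of what the callback variant might look like, under the same caveat: the ex_*
names, the callback signature, and the in-memory filter array are assumptions for illustration
and not a proposed libbpf interface. A non-zero return from the callback stops the iteration
early and is passed back to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one attached filter; not a libbpf type. */
struct ex_filter {
	uint32_t handle;
	uint32_t priority;
	uint32_t prog_id;
};

/* Invoked once per matching filter; return non-zero to stop iterating. */
typedef int (*ex_match_cb)(uint32_t handle, uint32_t priority, void *userdata);

/*
 * Call cb for every filter that uses prog_id, in order. Returns 0 after a
 * full walk, or the first non-zero value returned by cb.
 */
static int ex_query_each(const struct ex_filter *filters, size_t n,
			 uint32_t prog_id, ex_match_cb cb, void *userdata)
{
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		if (filters[i].prog_id != prog_id)
			continue;
		ret = cb(filters[i].handle, filters[i].priority, userdata);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example callback: count the matches through the userdata pointer. */
static int ex_count_cb(uint32_t handle, uint32_t priority, void *userdata)
{
	(void)handle;
	(void)priority;
	(*(size_t *)userdata)++;
	return 0;
}
```

No buffer sizing is needed on either side here; the cost is one indirect call per matching
entry.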

It would be nice to have some more input on how this should be done before I get to writing the
code.

> [...]


--
Kartikeya