mbox series

[v2,bpf-next,00/16] bpf: syscall program, FD array, loader program, light skeleton.

Message ID 20210423002646.35043-1-alexei.starovoitov@gmail.com
Headers show
Series bpf: syscall program, FD array, loader program, light skeleton. | expand

Message

Alexei Starovoitov April 23, 2021, 12:26 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

v1->v2: Addressed comments from Al, Yonghong and Andrii.
- documented sys_close fdget/fdput requirement and non-recursion check.
- reduced internal api leaks between libbpf and bpftool.
  Now bpf_object__gen_loader() is the only new libbf api with minimal fields.
- fixed light skeleton __destroy() method to munmap and close maps and progs.
- refactored bpf_btf_find_by_name_kind to return btf_id | (btf_obj_fd << 32).
- refactored use of bpf_btf_find_by_name_kind from loader prog.
- moved auto-gen like code into skel_internal.h that is used by *.lskel.h
  It has minimal static inline bpf_load_and_run() method used by lskel.
- added lksel.h example in patch 15.
- replaced union bpf_map_prog_desc with struct bpf_map_desc and struct bpf_prog_desc.
- removed mark_feat_supported and added a patch to pass 'obj' into kernel_supports.
- added proper tracking of temporary FDs in loader prog and their cleanup via bpf_sys_close.
- rename gen_trace.c into gen_loader.c to better align the naming throughout.
- expanded number of available helpers in new prog type.
- added support for raw_tp attaching in lskel.
  lskel supports tracing and raw_tp progs now.
  It correctly loads all networking prog types too, but __attach() method is tbd.
- converted progs/test_ksyms_module.c to lskel.
- minor feedback fixes all over.

One thing that was not addressed from feedback is the name of new program type.
Currently it's still:
BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */

The concern raised was that it sounds like a program that should be attached
to a syscall. Like BPF_PROG_TYPE_KPROBE is used to process kprobes.
I've considered and rejected:
BPF_PROG_TYPE_USER - too generic
BPF_PROG_TYPE_USERCTX - ambiguous with uprobes
BPF_PROG_TYPE_LOADER - ok-ish, but imo TYPE_SYSCALL is cleaner.
Other suggestions?

The description of V1 set is still valid:
----
This is a first step towards signed bpf programs and the third approach of that kind.
The first approach was to bring libbpf into the kernel as a user-mode-driver.
The second approach was to invent a new file format and let kernel execute
that format as a sequence of syscalls that create maps and load programs.
This third approach is using new type of bpf program instead of inventing file format.
1st and 2nd approaches had too many downsides comparing to this 3rd and were discarded
after months of work.

To make it work the following new concepts are introduced:
1. syscall bpf program type
A kind of bpf program that can do sys_bpf and sys_close syscalls.
It can only execute in user context.

2. FD array or FD index.
Traditionally BPF instructions are patched with FDs.
What it means that maps has to be created first and then instructions modified
which breaks signature verification if the program is signed.
Instead of patching each instruction with FD patch it with an index into array of FDs.
That makes the program signature stable if it uses maps.

3. loader program that is generated as "strace of libbpf".
When libbpf is loading bpf_file.o it does a bunch of sys_bpf() syscalls to
load BTF, create maps, populate maps and finally load programs.
Instead of actually doing the syscalls generate a trace of what libbpf
would have done and represent it as the "loader program".
The "loader program" consists of single map and single bpf program that
does those syscalls.
Executing such "loader program" via bpf_prog_test_run() command will
replay the sequence of syscalls that libbpf would have done which will result
the same maps created and programs loaded as specified in the elf file.
The "loader program" removes libelf and majority of libbpf dependency from
program loading process.

4. light skeleton
Instead of embedding the whole elf file into skeleton and using libbpf
to parse it later generate a loader program and embed it into "light skeleton".
Such skeleton can load the same set of elf files, but it doesn't need
libbpf and libelf to do that. It only needs few sys_bpf wrappers.

Future steps:
- support CO-RE in the kernel. This patch set is already too big,
so that critical feature is left for the next step.
- generate light skeleton in golang to allow such users use BTF and
all other features provided by libbpf
- generate light skeleton for kernel, so that bpf programs can be embeded
in the kernel module. The UMD usage in bpf_preload will be replaced with
such skeleton, so bpf_preload would become a standard kernel module
without user space dependency.
- finally do the signing of the loader program.

The patches are work in progress with few rough edges.

Alexei Starovoitov (16):
  bpf: Introduce bpf_sys_bpf() helper and program type.
  bpf: Introduce bpfptr_t user/kernel pointer.
  bpf: Prepare bpf syscall to be used from kernel and user space.
  libbpf: Support for syscall program type
  selftests/bpf: Test for syscall program type
  bpf: Make btf_load command to be bpfptr_t compatible.
  selftests/bpf: Test for btf_load command.
  bpf: Introduce fd_idx
  libbpf: Support for fd_idx
  bpf: Add bpf_btf_find_by_name_kind() helper.
  bpf: Add bpf_sys_close() helper.
  libbpf: Change the order of data and text relocations.
  libbpf: Add bpf_object pointer to kernel_supports().
  libbpf: Generate loader program out of BPF ELF file.
  bpftool: Use syscall/loader program in "prog load" and "gen skeleton"
    command.
  selftests/bpf: Convert few tests to light skeleton.

 include/linux/bpf.h                           |  19 +-
 include/linux/bpf_types.h                     |   2 +
 include/linux/bpf_verifier.h                  |   1 +
 include/linux/bpfptr.h                        |  81 +++
 include/linux/btf.h                           |   2 +-
 include/uapi/linux/bpf.h                      |  39 +-
 kernel/bpf/bpf_iter.c                         |  13 +-
 kernel/bpf/btf.c                              |  76 ++-
 kernel/bpf/syscall.c                          | 195 ++++--
 kernel/bpf/verifier.c                         |  81 ++-
 net/bpf/test_run.c                            |  45 +-
 tools/bpf/bpftool/Makefile                    |   2 +-
 tools/bpf/bpftool/gen.c                       | 313 ++++++++-
 tools/bpf/bpftool/main.c                      |   7 +-
 tools/bpf/bpftool/main.h                      |   1 +
 tools/bpf/bpftool/prog.c                      |  80 +++
 tools/bpf/bpftool/xlated_dumper.c             |   3 +
 tools/include/uapi/linux/bpf.h                |  39 +-
 tools/lib/bpf/Build                           |   2 +-
 tools/lib/bpf/bpf.c                           |   1 +
 tools/lib/bpf/bpf_gen_internal.h              |  40 ++
 tools/lib/bpf/gen_loader.c                    | 615 ++++++++++++++++++
 tools/lib/bpf/libbpf.c                        | 399 +++++++++---
 tools/lib/bpf/libbpf.h                        |  12 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/lib/bpf/libbpf_internal.h               |   3 +
 tools/lib/bpf/skel_internal.h                 | 105 +++
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |  16 +-
 .../selftests/bpf/prog_tests/fentry_fexit.c   |   6 +-
 .../selftests/bpf/prog_tests/fentry_test.c    |   4 +-
 .../selftests/bpf/prog_tests/fexit_sleep.c    |   6 +-
 .../selftests/bpf/prog_tests/fexit_test.c     |   4 +-
 .../selftests/bpf/prog_tests/kfunc_call.c     |   6 +-
 .../selftests/bpf/prog_tests/ksyms_module.c   |   2 +-
 .../selftests/bpf/prog_tests/syscall.c        |  53 ++
 tools/testing/selftests/bpf/progs/syscall.c   | 121 ++++
 .../selftests/bpf/progs/test_subprogs.c       |  13 +
 38 files changed, 2198 insertions(+), 211 deletions(-)
 create mode 100644 include/linux/bpfptr.h
 create mode 100644 tools/lib/bpf/bpf_gen_internal.h
 create mode 100644 tools/lib/bpf/gen_loader.c
 create mode 100644 tools/lib/bpf/skel_internal.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/syscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/syscall.c

Comments

Yonghong Song April 23, 2021, 6:15 p.m. UTC | #1
On 4/22/21 5:26 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Add placeholders for bpf_sys_bpf() helper and new program type.
> 
> v1->v2:
> - check that expected_attach_type is zero
> - allow more helper functions to be used in this program type, since they will
>    only execute from user context via bpf_prog_test_run.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   include/linux/bpf.h            | 10 +++++++
>   include/linux/bpf_types.h      |  2 ++
>   include/uapi/linux/bpf.h       |  8 +++++
>   kernel/bpf/syscall.c           | 54 ++++++++++++++++++++++++++++++++++
>   net/bpf/test_run.c             | 43 +++++++++++++++++++++++++++
>   tools/include/uapi/linux/bpf.h |  8 +++++
>   6 files changed, 125 insertions(+)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f8a45f109e96..aed30bbffb54 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1824,6 +1824,9 @@ static inline bool bpf_map_is_dev_bound(struct bpf_map *map)
>   
>   struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr);
>   void bpf_map_offload_map_free(struct bpf_map *map);
> +int bpf_prog_test_run_syscall(struct bpf_prog *prog,
> +			      const union bpf_attr *kattr,
> +			      union bpf_attr __user *uattr);
>   #else
>   static inline int bpf_prog_offload_init(struct bpf_prog *prog,
>   					union bpf_attr *attr)
> @@ -1849,6 +1852,13 @@ static inline struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
>   static inline void bpf_map_offload_map_free(struct bpf_map *map)
>   {
>   }
> +
> +static inline int bpf_prog_test_run_syscall(struct bpf_prog *prog,
> +					    const union bpf_attr *kattr,
> +					    union bpf_attr __user *uattr)
> +{
> +	return -ENOTSUPP;
> +}
>   #endif /* CONFIG_NET && CONFIG_BPF_SYSCALL */
>   
>   #if defined(CONFIG_INET) && defined(CONFIG_BPF_SYSCALL)
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index f883f01a5061..a9db1eae6796 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -77,6 +77,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm,
>   	       void *, void *)
>   #endif /* CONFIG_BPF_LSM */
>   #endif
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SYSCALL, bpf_syscall,
> +	      void *, void *)
>   
>   BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
>   BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index ec6d85a81744..c92648f38144 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -937,6 +937,7 @@ enum bpf_prog_type {
>   	BPF_PROG_TYPE_EXT,
>   	BPF_PROG_TYPE_LSM,
>   	BPF_PROG_TYPE_SK_LOOKUP,
> +	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
>   };
>   
>   enum bpf_attach_type {
> @@ -4735,6 +4736,12 @@ union bpf_attr {
>    *		be zero-terminated except when **str_size** is 0.
>    *
>    *		Or **-EBUSY** if the per-CPU memory copy buffer is busy.
> + *
> + * long bpf_sys_bpf(u32 cmd, void *attr, u32 attr_size)
> + * 	Description
> + * 		Execute bpf syscall with given arguments.
> + * 	Return
> + * 		A syscall result.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -4903,6 +4910,7 @@ union bpf_attr {
>   	FN(check_mtu),			\
>   	FN(for_each_map_elem),		\
>   	FN(snprintf),			\
> +	FN(sys_bpf),			\
>   	/* */
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index fd495190115e..8636876f3e6b 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2014,6 +2014,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
>   		if (expected_attach_type == BPF_SK_LOOKUP)
>   			return 0;
>   		return -EINVAL;
> +	case BPF_PROG_TYPE_SYSCALL:
>   	case BPF_PROG_TYPE_EXT:
>   		if (expected_attach_type)
>   			return -EINVAL;
> @@ -4497,3 +4498,56 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>   
>   	return err;
>   }
> +
> +static bool syscall_prog_is_valid_access(int off, int size,
> +					 enum bpf_access_type type,
> +					 const struct bpf_prog *prog,
> +					 struct bpf_insn_access_aux *info)
> +{
> +	if (off < 0 || off >= U16_MAX)
> +		return false;

Is this enough? If I understand correctly, the new program type
allows any arbitrary context data from user as long as its size
meets the following constraints:
    if (ctx_size_in < prog->aux->max_ctx_offset ||
  	    ctx_size_in > U16_MAX)
		return -EINVAL;

So if user provides a ctx with size say 40 and inside the program looks
it is still able to read/write to say offset 400.
Should we be a little more restrictive on this?

> +	if (off % size != 0)
> +		return false;
> +	return true;
> +}
> +
> +BPF_CALL_3(bpf_sys_bpf, int, cmd, void *, attr, u32, attr_size)
> +{
> +	return -EINVAL;
> +}
> +
> +const struct bpf_func_proto bpf_sys_bpf_proto = {
> +	.func		= bpf_sys_bpf,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_ANYTHING,
> +	.arg2_type	= ARG_PTR_TO_MEM,
> +	.arg3_type	= ARG_CONST_SIZE,
> +};
> +
> +const struct bpf_func_proto * __weak
> +tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +
> +	return bpf_base_func_proto(func_id);
> +}
> +
> +static const struct bpf_func_proto *
> +syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	switch (func_id) {
> +	case BPF_FUNC_sys_bpf:
> +		return &bpf_sys_bpf_proto;
> +	default:
> +		return tracing_prog_func_proto(func_id, prog);
> +	}
> +}
> +
> +const struct bpf_verifier_ops bpf_syscall_verifier_ops = {
> +	.get_func_proto  = syscall_prog_func_proto,
> +	.is_valid_access = syscall_prog_is_valid_access,
> +};
> +
> +const struct bpf_prog_ops bpf_syscall_prog_ops = {
> +	.test_run = bpf_prog_test_run_syscall,
> +};
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index a5d72c48fb66..1783ea77b95c 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -918,3 +918,46 @@ int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kat
>   	kfree(user_ctx);
>   	return ret;
>   }
> +
> +int bpf_prog_test_run_syscall(struct bpf_prog *prog,
> +			      const union bpf_attr *kattr,
> +			      union bpf_attr __user *uattr)
> +{
> +	void __user *ctx_in = u64_to_user_ptr(kattr->test.ctx_in);
> +	__u32 ctx_size_in = kattr->test.ctx_size_in;
> +	void *ctx = NULL;
> +	u32 retval;
> +	int err = 0;
> +
> +	/* doesn't support data_in/out, ctx_out, duration, or repeat or flags */
> +	if (kattr->test.data_in || kattr->test.data_out ||
> +	    kattr->test.ctx_out || kattr->test.duration ||
> +	    kattr->test.repeat || kattr->test.flags)
> +		return -EINVAL;
> +
> +	if (ctx_size_in < prog->aux->max_ctx_offset ||
> +	    ctx_size_in > U16_MAX)
> +		return -EINVAL;
> +
> +	if (ctx_size_in) {
> +		ctx = kzalloc(ctx_size_in, GFP_USER);
> +		if (!ctx)
> +			return -ENOMEM;
> +		if (copy_from_user(ctx, ctx_in, ctx_size_in)) {
> +			err = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	retval = bpf_prog_run_pin_on_cpu(prog, ctx);
> +
> +	if (copy_to_user(&uattr->test.retval, &retval, sizeof(u32)))
> +		err = -EFAULT;
> +	if (ctx_size_in)
> +		if (copy_to_user(ctx_in, ctx, ctx_size_in)) {
> +			err = -EFAULT;
> +			goto out;
> +		}
> +out:
> +	kfree(ctx);
> +	return err;
> +}
[...]
Yonghong Song April 23, 2021, 9:36 p.m. UTC | #2
On 4/22/21 5:26 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> v1->v2: Addressed comments from Al, Yonghong and Andrii.
> - documented sys_close fdget/fdput requirement and non-recursion check.
> - reduced internal api leaks between libbpf and bpftool.
>    Now bpf_object__gen_loader() is the only new libbf api with minimal fields.
> - fixed light skeleton __destroy() method to munmap and close maps and progs.
> - refactored bpf_btf_find_by_name_kind to return btf_id | (btf_obj_fd << 32).
> - refactored use of bpf_btf_find_by_name_kind from loader prog.
> - moved auto-gen like code into skel_internal.h that is used by *.lskel.h
>    It has minimal static inline bpf_load_and_run() method used by lskel.
> - added lksel.h example in patch 15.
> - replaced union bpf_map_prog_desc with struct bpf_map_desc and struct bpf_prog_desc.
> - removed mark_feat_supported and added a patch to pass 'obj' into kernel_supports.
> - added proper tracking of temporary FDs in loader prog and their cleanup via bpf_sys_close.
> - rename gen_trace.c into gen_loader.c to better align the naming throughout.
> - expanded number of available helpers in new prog type.
> - added support for raw_tp attaching in lskel.
>    lskel supports tracing and raw_tp progs now.
>    It correctly loads all networking prog types too, but __attach() method is tbd.
> - converted progs/test_ksyms_module.c to lskel.
> - minor feedback fixes all over.
> 
> One thing that was not addressed from feedback is the name of new program type.
> Currently it's still:
> BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */

Do you have plan for other non-bpf syscalls? Maybe use the name
BPF_PROG_TYPE_BPF_SYSCALL? It will be really clear this is
the program type you can execute bpf syscalls.

> 
> The concern raised was that it sounds like a program that should be attached
> to a syscall. Like BPF_PROG_TYPE_KPROBE is used to process kprobes.
> I've considered and rejected:
> BPF_PROG_TYPE_USER - too generic
> BPF_PROG_TYPE_USERCTX - ambiguous with uprobes

USERCTX probably not a good choice. People can write a program without
context and put the ctx into a map and use it.

> BPF_PROG_TYPE_LOADER - ok-ish, but imo TYPE_SYSCALL is cleaner.

User can write a program to do more than loading although I am not sure
how useful it is compared to implementation in user space.

> Other suggestions?
> 
> The description of V1 set is still valid:
> ----
[...]
Alexei Starovoitov April 23, 2021, 11:16 p.m. UTC | #3
On Fri, Apr 23, 2021 at 02:36:43PM -0700, Yonghong Song wrote:
> > 
> > One thing that was not addressed from feedback is the name of new program type.
> > Currently it's still:
> > BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> 
> Do you have plan for other non-bpf syscalls? Maybe use the name
> BPF_PROG_TYPE_BPF_SYSCALL? It will be really clear this is
> the program type you can execute bpf syscalls.

In this patch set it's already doing sys_bpf and sys_close syscalls :)

> > 
> > The concern raised was that it sounds like a program that should be attached
> > to a syscall. Like BPF_PROG_TYPE_KPROBE is used to process kprobes.
> > I've considered and rejected:
> > BPF_PROG_TYPE_USER - too generic
> > BPF_PROG_TYPE_USERCTX - ambiguous with uprobes
> 
> USERCTX probably not a good choice. People can write a program without
> context and put the ctx into a map and use it.
> 
> > BPF_PROG_TYPE_LOADER - ok-ish, but imo TYPE_SYSCALL is cleaner.
> 
> User can write a program to do more than loading although I am not sure
> how useful it is compared to implementation in user space.

Exactly.
Just BPF_PROG_TYPE_SYSCALL alone can be used as more generic equivalent
to sys_close_range syscalls.
If somebody needs to close a sparse set of FDs or get fd_to_be_closed
from a map they can craft a bpf prog that would do that.
Or if somebody wants to do a batched map processing...
instead of doing sys_bpf() with BPF_MAP_UPDATE_BATCH they can craft
a bpf prog.
Plenty of use cases beyond LOADER.

This patch set only allows BPF_PROG_TYPE_SYSCALL to be executed
via prog_test_run, but I think it's safe to execute it upon entry
to pretty much any syscall.
So _SYSCALL suffix fits as both "a program that can execute syscalls"
and as "a program that attaches to syscalls".
The later is not implemented yet, but would fit right in.
Andrii Nakryiko April 26, 2021, 4:51 p.m. UTC | #4
On Thu, Apr 22, 2021 at 5:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> From: Alexei Starovoitov <ast@kernel.org>

>

> Add placeholders for bpf_sys_bpf() helper and new program type.

>

> v1->v2:

> - check that expected_attach_type is zero

> - allow more helper functions to be used in this program type, since they will

>   only execute from user context via bpf_prog_test_run.

>

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> ---


LGTM, see minor comments below.

Acked-by: Andrii Nakryiko <andrii@kernel.org>


>  include/linux/bpf.h            | 10 +++++++

>  include/linux/bpf_types.h      |  2 ++

>  include/uapi/linux/bpf.h       |  8 +++++

>  kernel/bpf/syscall.c           | 54 ++++++++++++++++++++++++++++++++++

>  net/bpf/test_run.c             | 43 +++++++++++++++++++++++++++

>  tools/include/uapi/linux/bpf.h |  8 +++++

>  6 files changed, 125 insertions(+)

>


[...]

> +

> +const struct bpf_func_proto * __weak

> +tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)

> +{

> +


extra empty line

> +       return bpf_base_func_proto(func_id);

> +}

> +

> +static const struct bpf_func_proto *

> +syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)

> +{

> +       switch (func_id) {

> +       case BPF_FUNC_sys_bpf:

> +               return &bpf_sys_bpf_proto;

> +       default:

> +               return tracing_prog_func_proto(func_id, prog);

> +       }

> +}

> +


[...]

> +       if (ctx_size_in) {

> +               ctx = kzalloc(ctx_size_in, GFP_USER);

> +               if (!ctx)

> +                       return -ENOMEM;

> +               if (copy_from_user(ctx, ctx_in, ctx_size_in)) {

> +                       err = -EFAULT;

> +                       goto out;

> +               }

> +       }

> +       retval = bpf_prog_run_pin_on_cpu(prog, ctx);

> +

> +       if (copy_to_user(&uattr->test.retval, &retval, sizeof(u32)))

> +               err = -EFAULT;


is there a point in trying to do another copy_to_user if this fails?
I.e., why not goto out here?

> +       if (ctx_size_in)

> +               if (copy_to_user(ctx_in, ctx, ctx_size_in)) {

> +                       err = -EFAULT;

> +                       goto out;

> +               }

> +out:

> +       kfree(ctx);

> +       return err;

> +}


[...]
Andrii Nakryiko April 26, 2021, 5:02 p.m. UTC | #5
On Thu, Apr 22, 2021 at 5:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> From: Alexei Starovoitov <ast@kernel.org>

>

> bpf_prog_type_syscall is a program that creates a bpf map,

> updates it, and loads another bpf program using bpf_sys_bpf() helper.

>

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> ---

>  tools/testing/selftests/bpf/Makefile          |  1 +

>  .../selftests/bpf/prog_tests/syscall.c        | 53 ++++++++++++++

>  tools/testing/selftests/bpf/progs/syscall.c   | 73 +++++++++++++++++++

>  3 files changed, 127 insertions(+)

>  create mode 100644 tools/testing/selftests/bpf/prog_tests/syscall.c

>  create mode 100644 tools/testing/selftests/bpf/progs/syscall.c

>

> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile

> index c5bcdb3d4b12..9fdfdbc61857 100644

> --- a/tools/testing/selftests/bpf/Makefile

> +++ b/tools/testing/selftests/bpf/Makefile

> @@ -278,6 +278,7 @@ MENDIAN=$(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian)

>  CLANG_SYS_INCLUDES = $(call get_sys_includes,$(CLANG))

>  BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) $(MENDIAN)                  \

>              -I$(INCLUDE_DIR) -I$(CURDIR) -I$(APIDIR)                   \

> +            -I$(TOOLSINCDIR) \


is this for filter.h? also, please align \ with the previous line


>              -I$(abspath $(OUTPUT)/../usr/include)

>

>  CLANG_CFLAGS = $(CLANG_SYS_INCLUDES) \

> diff --git a/tools/testing/selftests/bpf/prog_tests/syscall.c b/tools/testing/selftests/bpf/prog_tests/syscall.c

> new file mode 100644

> index 000000000000..e550e36bb5da

> --- /dev/null

> +++ b/tools/testing/selftests/bpf/prog_tests/syscall.c

> @@ -0,0 +1,53 @@

> +// SPDX-License-Identifier: GPL-2.0

> +/* Copyright (c) 2021 Facebook */

> +#include <test_progs.h>

> +#include "syscall.skel.h"

> +

> +struct args {

> +       __u64 log_buf;

> +       __u32 log_size;

> +       int max_entries;

> +       int map_fd;

> +       int prog_fd;

> +};

> +

> +void test_syscall(void)

> +{

> +       static char verifier_log[8192];

> +       struct args ctx = {

> +               .max_entries = 1024,

> +               .log_buf = (uintptr_t) verifier_log,

> +               .log_size = sizeof(verifier_log),

> +       };

> +       struct bpf_prog_test_run_attr tattr = {

> +               .ctx_in = &ctx,

> +               .ctx_size_in = sizeof(ctx),

> +       };

> +       struct syscall *skel = NULL;

> +       __u64 key = 12, value = 0;

> +       __u32 duration = 0;

> +       int err;

> +

> +       skel = syscall__open_and_load();

> +       if (CHECK(!skel, "skel_load", "syscall skeleton failed\n"))

> +               goto cleanup;

> +

> +       tattr.prog_fd = bpf_program__fd(skel->progs.bpf_prog);

> +       err = bpf_prog_test_run_xattr(&tattr);

> +       if (CHECK(err || tattr.retval != 1, "test_run sys_bpf",

> +                 "err %d errno %d retval %d duration %d\n",

> +                 err, errno, tattr.retval, tattr.duration))

> +               goto cleanup;

> +

> +       CHECK(ctx.map_fd <= 0, "map_fd", "fd = %d\n", ctx.map_fd);

> +       CHECK(ctx.prog_fd <= 0, "prog_fd", "fd = %d\n", ctx.prog_fd);


please use ASSERT_xxx() macros everywhere. I've just added
ASSERT_GT(), so once that patch set lands you should have all the
variants you need.

> +       CHECK(memcmp(verifier_log, "processed", sizeof("processed") - 1) != 0,

> +             "verifier_log", "%s\n", verifier_log);

> +

> +       err = bpf_map_lookup_elem(ctx.map_fd, &key, &value);

> +       CHECK(err, "map_lookup", "map_lookup failed\n");

> +       CHECK(value != 34, "invalid_value",

> +             "got value %llu expected %u\n", value, 34);

> +cleanup:

> +       syscall__destroy(skel);

> +}

> diff --git a/tools/testing/selftests/bpf/progs/syscall.c b/tools/testing/selftests/bpf/progs/syscall.c

> new file mode 100644

> index 000000000000..01476f88e45f

> --- /dev/null

> +++ b/tools/testing/selftests/bpf/progs/syscall.c

> @@ -0,0 +1,73 @@

> +// SPDX-License-Identifier: GPL-2.0

> +/* Copyright (c) 2021 Facebook */

> +#include <linux/stddef.h>

> +#include <linux/bpf.h>

> +#include <bpf/bpf_helpers.h>

> +#include <bpf/bpf_tracing.h>

> +#include <../../tools/include/linux/filter.h>


with TOOLSINCDIR shouldn't this be just <linux/fiter.h>?

> +

> +volatile const int workaround = 1;


not needed anymore?

> +

> +char _license[] SEC("license") = "GPL";

> +

> +struct args {

> +       __u64 log_buf;

> +       __u32 log_size;

> +       int max_entries;

> +       int map_fd;

> +       int prog_fd;

> +};

> +


[...]
Andrii Nakryiko April 26, 2021, 5:29 p.m. UTC | #6
On Thu, Apr 22, 2021 at 5:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> From: Alexei Starovoitov <ast@kernel.org>

>

> In order to be able to generate loader program in the later

> patches change the order of data and text relocations.

> Also improve the test to include data relos.

>

> If the kernel supports "FD array" the map_fd relocations can be processed

> before text relos since generated loader program won't need to manually

> patch ld_imm64 insns with map_fd.

> But ksym and kfunc relocations can only be processed after all calls

> are relocated, since loader program will consist of a sequence

> of calls to bpf_btf_find_by_name_kind() followed by patching of btf_id

> and btf_obj_fd into corresponding ld_imm64 insns. The locations of those

> ld_imm64 insns are specified in relocations.

> Hence process all data relocations (maps, ksym, kfunc) together after call relos.

>

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> ---

>  tools/lib/bpf/libbpf.c                        | 86 +++++++++++++++----

>  .../selftests/bpf/progs/test_subprogs.c       | 13 +++

>  2 files changed, 80 insertions(+), 19 deletions(-)

>

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c

> index 17cfc5b66111..c73a85b97ca5 100644

> --- a/tools/lib/bpf/libbpf.c

> +++ b/tools/lib/bpf/libbpf.c

> @@ -6379,11 +6379,15 @@ bpf_object__relocate_data(struct bpf_object *obj, struct bpf_program *prog)

>                         insn[0].imm = ext->ksym.kernel_btf_id;

>                         break;

>                 case RELO_SUBPROG_ADDR:

> -                       insn[0].src_reg = BPF_PSEUDO_FUNC;

> -                       /* will be handled as a follow up pass */

> +                       if (insn[0].src_reg != BPF_PSEUDO_FUNC) {

> +                               pr_warn("prog '%s': relo #%d: bad insn\n",

> +                                       prog->name, i);

> +                               return -EINVAL;

> +                       }


given SUBPROG_ADDR is now handled similarly to RELO_CALL in a
different place, I'd probably drop this error check and just combine
RELO_SUBPROG_ADDR and RELO_CALL cases with just a /* handled already
*/ comment.

> +                       /* handled already */

>                         break;

>                 case RELO_CALL:

> -                       /* will be handled as a follow up pass */

> +                       /* handled already */

>                         break;

>                 default:

>                         pr_warn("prog '%s': relo #%d: bad relo type %d\n",

> @@ -6552,6 +6556,31 @@ static struct reloc_desc *find_prog_insn_relo(const struct bpf_program *prog, si

>                        sizeof(*prog->reloc_desc), cmp_relo_by_insn_idx);

>  }

>

> +static int append_subprog_relos(struct bpf_program *main_prog, struct bpf_program *subprog)

> +{

> +       int new_cnt = main_prog->nr_reloc + subprog->nr_reloc;

> +       struct reloc_desc *relos;

> +       size_t off = subprog->sub_insn_off;

> +       int i;

> +

> +       if (main_prog == subprog)

> +               return 0;

> +       relos = libbpf_reallocarray(main_prog->reloc_desc, new_cnt, sizeof(*relos));

> +       if (!relos)

> +               return -ENOMEM;

> +       memcpy(relos + main_prog->nr_reloc, subprog->reloc_desc,

> +              sizeof(*relos) * subprog->nr_reloc);

> +

> +       for (i = main_prog->nr_reloc; i < new_cnt; i++)

> +               relos[i].insn_idx += off;


nit: off is used only here, so there is little point in having it as a
separate var, inline?

> +       /* After insn_idx adjustment the 'relos' array is still sorted

> +        * by insn_idx and doesn't break bsearch.

> +        */

> +       main_prog->reloc_desc = relos;

> +       main_prog->nr_reloc = new_cnt;

> +       return 0;

> +}

> +

>  static int

>  bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,

>                        struct bpf_program *prog)

> @@ -6560,18 +6589,32 @@ bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,

>         struct bpf_program *subprog;

>         struct bpf_insn *insns, *insn;

>         struct reloc_desc *relo;

> -       int err;

> +       int err, i;

>

>         err = reloc_prog_func_and_line_info(obj, main_prog, prog);

>         if (err)

>                 return err;

>

> +       for (i = 0; i < prog->nr_reloc; i++) {

> +               relo = &prog->reloc_desc[i];

> +               insn = &main_prog->insns[prog->sub_insn_off + relo->insn_idx];

> +

> +               if (relo->type == RELO_SUBPROG_ADDR)

> +                       /* mark the insn, so it becomes insn_is_pseudo_func() */

> +                       insn[0].src_reg = BPF_PSEUDO_FUNC;

> +       }

> +


This will do the same work over and over each time we append a subprog
to main_prog. This should logically follow append_subprog_relos(), but
you wanted to do it for main_prog with the same code, right?

How about instead doing this before we start appending subprogs to
main_progs? I.e., do it explicitly in bpf_object__relocate() before
you start code relocation loop.

>         for (insn_idx = 0; insn_idx < prog->sec_insn_cnt; insn_idx++) {

>                 insn = &main_prog->insns[prog->sub_insn_off + insn_idx];

>                 if (!insn_is_subprog_call(insn) && !insn_is_pseudo_func(insn))

>                         continue;

>

>                 relo = find_prog_insn_relo(prog, insn_idx);

> +               if (relo && relo->type == RELO_EXTERN_FUNC)

> +                       /* kfunc relocations will be handled later

> +                        * in bpf_object__relocate_data()

> +                        */

> +                       continue;

>                 if (relo && relo->type != RELO_CALL && relo->type != RELO_SUBPROG_ADDR) {

>                         pr_warn("prog '%s': unexpected relo for insn #%zu, type %d\n",

>                                 prog->name, insn_idx, relo->type);


[...]
Andrii Nakryiko April 26, 2021, 5:30 p.m. UTC | #7
On Thu, Apr 22, 2021 at 5:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> From: Alexei Starovoitov <ast@kernel.org>

>

> Add a pointer to 'struct bpf_object' to kernel_supports() helper.

> It will be used in the next patch.

> No functional changes.

>

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> ---


LGTM.

Acked-by: Andrii Nakryiko <andrii@kernel.org>


>  tools/lib/bpf/libbpf.c | 52 +++++++++++++++++++++---------------------

>  1 file changed, 26 insertions(+), 26 deletions(-)

>


[...]
Andrii Nakryiko April 26, 2021, 10:22 p.m. UTC | #8
On Thu, Apr 22, 2021 at 5:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> From: Alexei Starovoitov <ast@kernel.org>

>

> The BPF program loading process performed by libbpf is quite complex

> and consists of the following steps:

> "open" phase:

> - parse elf file and remember relocations, sections

> - collect externs and ksyms including their btf_ids in prog's BTF

> - patch BTF datasec (since llvm couldn't do it)

> - init maps (old style map_def, BTF based, global data map, kconfig map)

> - collect relocations against progs and maps

> "load" phase:

> - probe kernel features

> - load vmlinux BTF

> - resolve externs (kconfig and ksym)

> - load program BTF

> - init struct_ops

> - create maps

> - apply CO-RE relocations

> - patch ld_imm64 insns with src_reg=PSEUDO_MAP, PSEUDO_MAP_VALUE, PSEUDO_BTF_ID

> - reposition subprograms and adjust call insns

> - sanitize and load progs

>

> During this process libbpf does sys_bpf() calls to load BTF, create maps,

> populate maps and finally load programs.

> Instead of actually doing the syscalls generate a trace of what libbpf

> would have done and represent it as the "loader program".

> The "loader program" consists of single map with:

> - union bpf_attr(s)

> - BTF bytes

> - map value bytes

> - insns bytes

> and single bpf program that passes bpf_attr(s) and data into bpf_sys_bpf() helper.

> Executing such "loader program" via bpf_prog_test_run() command will

> replay the sequence of syscalls that libbpf would have done which will result

> the same maps created and programs loaded as specified in the elf file.

> The "loader program" removes libelf and majority of libbpf dependency from

> program loading process.

>

> kconfig, typeless ksym, struct_ops and CO-RE are not supported yet.

>

> The order of relocate_data and relocate_calls had to change, so that

> bpf_gen__prog_load() can see all relocations for a given program with

> correct insn_idx-es.

>

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> ---

>  tools/lib/bpf/Build              |   2 +-

>  tools/lib/bpf/bpf_gen_internal.h |  40 ++

>  tools/lib/bpf/gen_loader.c       | 615 +++++++++++++++++++++++++++++++

>  tools/lib/bpf/libbpf.c           | 204 ++++++++--

>  tools/lib/bpf/libbpf.h           |  12 +

>  tools/lib/bpf/libbpf.map         |   1 +

>  tools/lib/bpf/libbpf_internal.h  |   2 +

>  tools/lib/bpf/skel_internal.h    | 105 ++++++

>  8 files changed, 948 insertions(+), 33 deletions(-)

>  create mode 100644 tools/lib/bpf/bpf_gen_internal.h

>  create mode 100644 tools/lib/bpf/gen_loader.c

>  create mode 100644 tools/lib/bpf/skel_internal.h

>

> diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build

> index 9b057cc7650a..430f6874fa41 100644

> --- a/tools/lib/bpf/Build

> +++ b/tools/lib/bpf/Build

> @@ -1,3 +1,3 @@

>  libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o str_error.o \

>             netlink.o bpf_prog_linfo.o libbpf_probes.o xsk.o hashmap.o \

> -           btf_dump.o ringbuf.o strset.o linker.o

> +           btf_dump.o ringbuf.o strset.o linker.o gen_loader.o

> diff --git a/tools/lib/bpf/bpf_gen_internal.h b/tools/lib/bpf/bpf_gen_internal.h

> new file mode 100644

> index 000000000000..dc3e2cbf9ce3

> --- /dev/null

> +++ b/tools/lib/bpf/bpf_gen_internal.h

> @@ -0,0 +1,40 @@

> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */

> +/* Copyright (c) 2021 Facebook */

> +#ifndef __BPF_GEN_INTERNAL_H

> +#define __BPF_GEN_INTERNAL_H

> +

> +struct relo_desc {


there is very similarly named reloc_desc struct in libbpf.c, can you
rename it to something like gen_btf_relo_desc?

> +       const char *name;

> +       int kind;

> +       int insn_idx;

> +};

> +


[...]

> +

> +static int bpf_gen__realloc_insn_buf(struct bpf_gen *gen, __u32 size)

> +{

> +       size_t off = gen->insn_cur - gen->insn_start;

> +

> +       if (gen->error)

> +               return gen->error;

> +       if (size > INT32_MAX || off + size > INT32_MAX) {

> +               gen->error = -ERANGE;

> +               return -ERANGE;

> +       }

> +       gen->insn_start = realloc(gen->insn_start, off + size);


leaking memory here: gen->insn_start will be NULL on failure

> +       if (!gen->insn_start) {

> +               gen->error = -ENOMEM;

> +               return -ENOMEM;

> +       }

> +       gen->insn_cur = gen->insn_start + off;

> +       return 0;

> +}

> +

> +static int bpf_gen__realloc_data_buf(struct bpf_gen *gen, __u32 size)

> +{

> +       size_t off = gen->data_cur - gen->data_start;

> +

> +       if (gen->error)

> +               return gen->error;

> +       if (size > INT32_MAX || off + size > INT32_MAX) {

> +               gen->error = -ERANGE;

> +               return -ERANGE;

> +       }

> +       gen->data_start = realloc(gen->data_start, off + size);


same as above

> +       if (!gen->data_start) {

> +               gen->error = -ENOMEM;

> +               return -ENOMEM;

> +       }

> +       gen->data_cur = gen->data_start + off;

> +       return 0;

> +}

> +


[...]

> +

> +static void bpf_gen__emit_sys_bpf(struct bpf_gen *gen, int cmd, int attr, int attr_size)

> +{

> +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_1, cmd));

> +       bpf_gen__emit2(gen, BPF_LD_IMM64_RAW_FULL(BPF_REG_2, BPF_PSEUDO_MAP_IDX_VALUE, 0, 0, 0, attr));


is attr an offset into a blob? if yes, attr_off? or attr_base_off,
anything with _off

> +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_3, attr_size));

> +       bpf_gen__emit(gen, BPF_EMIT_CALL(BPF_FUNC_sys_bpf));

> +       /* remember the result in R7 */

> +       bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_7, BPF_REG_0));

> +}

> +

> +static void bpf_gen__emit_check_err(struct bpf_gen *gen)

> +{

> +       bpf_gen__emit(gen, BPF_JMP_IMM(BPF_JSGE, BPF_REG_7, 0, 2));

> +       bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_0, BPF_REG_7));

> +       bpf_gen__emit(gen, BPF_EXIT_INSN());

> +}

> +

> +static void __bpf_gen__debug(struct bpf_gen *gen, int reg1, int reg2, const char *fmt, va_list args)


Can you please leave a comment on what reg1 and reg2 is, it's not very
clear and the code clearly assumes that it can't be reg[1-4]. It's
probably those special R7 and R9 (or -1, of course), but having a
short comment makes sense to not jump around trying to figure out
possible inputs.

Oh, reading further, it can also be R0.

> +{

> +       char buf[1024];

> +       int addr, len, ret;

> +

> +       if (!gen->log_level)

> +               return;

> +       ret = vsnprintf(buf, sizeof(buf), fmt, args);

> +       if (ret < 1024 - 7 && reg1 >= 0 && reg2 < 0)

> +               /* The special case to accommodate common bpf_gen__debug_ret():

> +                * to avoid specifying BPF_REG_7 and adding " r=%%d" to prints explicitly.

> +                */

> +               strcat(buf, " r=%d");

> +       len = strlen(buf) + 1;

> +       addr = bpf_gen__add_data(gen, buf, len);


nit: offset, not address, right?

> +

> +       bpf_gen__emit2(gen, BPF_LD_IMM64_RAW_FULL(BPF_REG_1, BPF_PSEUDO_MAP_IDX_VALUE, 0, 0, 0, addr));

> +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_2, len));

> +       if (reg1 >= 0)

> +               bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_3, reg1));

> +       if (reg2 >= 0)

> +               bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_4, reg2));

> +       bpf_gen__emit(gen, BPF_EMIT_CALL(BPF_FUNC_trace_printk));

> +}

> +


[...]

> +int bpf_gen__finish(struct bpf_gen *gen)

> +{

> +       int i;

> +

> +       bpf_gen__emit_sys_close_stack(gen, stack_off(btf_fd));

> +       for (i = 0; i < gen->nr_progs; i++)

> +               bpf_gen__move_stack2ctx(gen,

> +                                       sizeof(struct bpf_loader_ctx) +

> +                                       sizeof(struct bpf_map_desc) * gen->nr_maps +

> +                                       sizeof(struct bpf_prog_desc) * i +

> +                                       offsetof(struct bpf_prog_desc, prog_fd), 4,

> +                                       stack_off(prog_fd[i]));

> +       for (i = 0; i < gen->nr_maps; i++)

> +               bpf_gen__move_stack2ctx(gen,

> +                                       sizeof(struct bpf_loader_ctx) +

> +                                       sizeof(struct bpf_map_desc) * i +

> +                                       offsetof(struct bpf_map_desc, map_fd), 4,

> +                                       stack_off(map_fd[i]));

> +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_0, 0));

> +       bpf_gen__emit(gen, BPF_EXIT_INSN());

> +       pr_debug("bpf_gen__finish %d\n", gen->error);


maybe prefix all those pr_debug()s with "gen: " to distinguish them
from the rest of libbpf logging?

> +       if (!gen->error) {

> +               struct gen_loader_opts *opts = gen->opts;

> +

> +               opts->insns = gen->insn_start;

> +               opts->insns_sz = gen->insn_cur - gen->insn_start;

> +               opts->data = gen->data_start;

> +               opts->data_sz = gen->data_cur - gen->data_start;

> +       }

> +       return gen->error;

> +}

> +

> +void bpf_gen__free(struct bpf_gen *gen)

> +{

> +       if (!gen)

> +               return;

> +       free(gen->data_start);

> +       free(gen->insn_start);

> +       gen->data_start = NULL;

> +       gen->insn_start = NULL;


what's the point of NULL'ing them out if you don't clear gen->data_cur
and gen->insn_cur?

also should it free(gen) itself?

> +}

> +

> +void bpf_gen__load_btf(struct bpf_gen *gen, const void *btf_raw_data, __u32 btf_raw_size)

> +{

> +       union bpf_attr attr = {};


here and below: memset(0)?

> +       int attr_size = offsetofend(union bpf_attr, btf_log_level);

> +       int btf_data, btf_load_attr;

> +

> +       pr_debug("btf_load: size %d\n", btf_raw_size);

> +       btf_data = bpf_gen__add_data(gen, btf_raw_data, btf_raw_size);

> +


[...]

> +       map_create_attr = bpf_gen__add_data(gen, &attr, attr_size);

> +       if (attr.btf_value_type_id)

> +               /* populate union bpf_attr with btf_fd saved in the stack earlier */

> +               bpf_gen__move_stack2blob(gen, map_create_attr + offsetof(union bpf_attr, btf_fd), 4,

> +                                        stack_off(btf_fd));

> +       switch (attr.map_type) {

> +       case BPF_MAP_TYPE_ARRAY_OF_MAPS:

> +       case BPF_MAP_TYPE_HASH_OF_MAPS:

> +               bpf_gen__move_stack2blob(gen, map_create_attr + offsetof(union bpf_attr, inner_map_fd),

> +                                        4, stack_off(inner_map_fd));

> +               close_inner_map_fd = true;

> +               break;

> +       default:;


default:
    break;

> +       }

> +       /* emit MAP_CREATE command */

> +       bpf_gen__emit_sys_bpf(gen, BPF_MAP_CREATE, map_create_attr, attr_size);

> +       bpf_gen__debug_ret(gen, "map_create %s idx %d type %d value_size %d",

> +                          attr.map_name, map_idx, map_attr->map_type, attr.value_size);

> +       bpf_gen__emit_check_err(gen);


what will happen on error with inner_map_fd and all the other fds
created by now?

> +       /* remember map_fd in the stack, if successful */

> +       if (map_idx < 0) {

> +               /* This bpf_gen__map_create() function is called with map_idx >= 0 for all maps

> +                * that libbpf loading logic tracks.

> +                * It's called with -1 to create an inner map.

> +                */

> +               bpf_gen__emit(gen, BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_7, stack_off(inner_map_fd)));

> +       } else {

> +               if (map_idx != gen->nr_maps) {


why would that happen? defensive programming? and even then `if () {}
else if () {} else {}` structure is more appropriate

> +                       gen->error = -EDOM; /* internal bug */

> +                       return;

> +               }

> +               bpf_gen__emit(gen, BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_7, stack_off(map_fd[map_idx])));

> +               gen->nr_maps++;

> +       }

> +       if (close_inner_map_fd)

> +               bpf_gen__emit_sys_close_stack(gen, stack_off(inner_map_fd));

> +}

> +


[...]

> +static void bpf_gen__cleanup_relos(struct bpf_gen *gen, int insns)

> +{

> +       int i, insn;

> +

> +       for (i = 0; i < gen->relo_cnt; i++) {

> +               if (gen->relos[i].kind != BTF_KIND_VAR)

> +                       continue;

> +               /* close fd recorded in insn[insn_idx + 1].imm */

> +               insn = insns + sizeof(struct bpf_insn) * (gen->relos[i].insn_idx + 1)

> +                       + offsetof(struct bpf_insn, imm);

> +               bpf_gen__emit_sys_close_blob(gen, insn);


wouldn't this close the same FD used across multiple "relos" multiple times?

> +       }

> +       if (gen->relo_cnt) {

> +               free(gen->relos);

> +               gen->relo_cnt = 0;

> +               gen->relos = NULL;

> +       }

> +}

> +


[...]

> +       struct bpf_gen *gen_loader;

> +

>         /*

>          * Information when doing elf related work. Only valid if fd

>          * is valid.

> @@ -2651,7 +2654,15 @@ static int bpf_object__sanitize_and_load_btf(struct bpf_object *obj)

>                 bpf_object__sanitize_btf(obj, kern_btf);

>         }

>

> -       err = btf__load(kern_btf);

> +       if (obj->gen_loader) {

> +               __u32 raw_size = 0;

> +               const void *raw_data = btf__get_raw_data(kern_btf, &raw_size);


this can return NULL on ENOMEM

> +

> +               bpf_gen__load_btf(obj->gen_loader, raw_data, raw_size);

> +               btf__set_fd(kern_btf, 0);


why setting fd to 0 (stdin)? does gen depend on this somewhere? The
problem is that it will eventually be closed on btf__free(), which
will close stdin, causing a big surprise. What will happen if you
leave it at -1?


> +       } else {

> +               err = btf__load(kern_btf);

> +       }

>         if (sanitize) {

>                 if (!err) {

>                         /* move fd to libbpf's BTF */

> @@ -4262,6 +4273,12 @@ static bool kernel_supports(const struct bpf_object *obj, enum kern_feature_id f

>         struct kern_feature_desc *feat = &feature_probes[feat_id];

>         int ret;

>

> +       if (obj->gen_loader)

> +               /* To generate loader program assume the latest kernel

> +                * to avoid doing extra prog_load, map_create syscalls.

> +                */

> +               return true;

> +

>         if (READ_ONCE(feat->res) == FEAT_UNKNOWN) {

>                 ret = feat->probe();

>                 if (ret > 0) {

> @@ -4344,6 +4361,13 @@ bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map)

>         char *cp, errmsg[STRERR_BUFSIZE];

>         int err, zero = 0;

>

> +       if (obj->gen_loader) {

> +               bpf_gen__map_update_elem(obj->gen_loader, map - obj->maps,


it would be great for bpf_gen__map_update_elem to reflect that it's
not a generic map_update_elem() call, rather special internal map
update (just use bpf_gen__populate_internal_map?) Whether to freeze or
not could be just a flag to the same call, they always go together.

> +                                        map->mmaped, map->def.value_size);

> +               if (map_type == LIBBPF_MAP_RODATA || map_type == LIBBPF_MAP_KCONFIG)

> +                       bpf_gen__map_freeze(obj->gen_loader, map - obj->maps);

> +               return 0;

> +       }

>         err = bpf_map_update_elem(map->fd, &zero, map->mmaped, 0);

>         if (err) {

>                 err = -errno;

> @@ -4369,7 +4393,7 @@ bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map)

>

>  static void bpf_map__destroy(struct bpf_map *map);

>

> -static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map)

> +static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, bool is_inner)

>  {

>         struct bpf_create_map_attr create_attr;

>         struct bpf_map_def *def = &map->def;

> @@ -4415,9 +4439,9 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map)

>

>         if (bpf_map_type__is_map_in_map(def->type)) {

>                 if (map->inner_map) {

> -                       int err;

> +                       int err = 0;


no need to initialize to zero, you are assigning it right below

>

> -                       err = bpf_object__create_map(obj, map->inner_map);

> +                       err = bpf_object__create_map(obj, map->inner_map, true);

>                         if (err) {

>                                 pr_warn("map '%s': failed to create inner map: %d\n",

>                                         map->name, err);


[...]

> @@ -4469,7 +4498,12 @@ static int init_map_slots(struct bpf_map *map)

>

>                 targ_map = map->init_slots[i];

>                 fd = bpf_map__fd(targ_map);

> -               err = bpf_map_update_elem(map->fd, &i, &fd, 0);

> +               if (obj->gen_loader) {

> +                       printf("// TODO map_update_elem: idx %ld key %d value==map_idx %ld\n",

> +                              map - obj->maps, i, targ_map - obj->maps);


return error for now?

> +               } else {

> +                       err = bpf_map_update_elem(map->fd, &i, &fd, 0);

> +               }

>                 if (err) {

>                         err = -errno;

>                         pr_warn("map '%s': failed to initialize slot [%d] to map '%s' fd=%d: %d\n",


[...]

> @@ -6082,6 +6119,11 @@ static int bpf_core_apply_relo(struct bpf_program *prog,

>         if (str_is_empty(spec_str))

>                 return -EINVAL;

>

> +       if (prog->obj->gen_loader) {

> +               printf("// TODO core_relo: prog %ld insn[%d] %s %s kind %d\n",

> +                      prog - prog->obj->programs, relo->insn_off / 8,

> +                      local_name, spec_str, relo->kind);


same, return error? Drop printf, maybe leave pr_debug()?

> +       }

>         err = bpf_core_parse_spec(local_btf, local_id, spec_str, relo->kind, &local_spec);

>         if (err) {

>                 pr_warn("prog '%s': relo #%d: parsing [%d] %s %s + %s failed: %d\n",

> @@ -6821,6 +6863,19 @@ bpf_object__relocate_calls(struct bpf_object *obj, struct bpf_program *prog)

>

>         return 0;

>  }


empty line here

> +static void

> +bpf_object__free_relocs(struct bpf_object *obj)

> +{

> +       struct bpf_program *prog;

> +       int i;

> +

> +       /* free up relocation descriptors */

> +       for (i = 0; i < obj->nr_programs; i++) {

> +               prog = &obj->programs[i];

> +               zfree(&prog->reloc_desc);

> +               prog->nr_reloc = 0;

> +       }

> +}

>


[...]

> +static int bpf_program__record_externs(struct bpf_program *prog)

> +{

> +       struct bpf_object *obj = prog->obj;

> +       int i;

> +

> +       for (i = 0; i < prog->nr_reloc; i++) {

> +               struct reloc_desc *relo = &prog->reloc_desc[i];

> +               struct extern_desc *ext = &obj->externs[relo->sym_off];

> +

> +               switch (relo->type) {

> +               case RELO_EXTERN_VAR:

> +                       if (ext->type != EXT_KSYM)

> +                               continue;

> +                       if (!ext->ksym.type_id) /* typeless ksym */

> +                               continue;


this shouldn't be silently ignored, if it's not supported, it should
return error

> +                       bpf_gen__record_extern(obj->gen_loader, ext->name, BTF_KIND_VAR,

> +                                              relo->insn_idx);

> +                       break;

> +               case RELO_EXTERN_FUNC:

> +                       bpf_gen__record_extern(obj->gen_loader, ext->name, BTF_KIND_FUNC,

> +                                              relo->insn_idx);

> +                       break;

> +               default:

> +                       continue;

> +               }

> +       }

> +       return 0;

> +}

> +


[...]

> @@ -7868,6 +7970,9 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr)

>         err = err ? : bpf_object__relocate(obj, attr->target_btf_path);

>         err = err ? : bpf_object__load_progs(obj, attr->log_level);

>

> +       if (obj->gen_loader && !err)

> +               err = bpf_gen__finish(obj->gen_loader);

> +

>         /* clean up module BTFs */

>         for (i = 0; i < obj->btf_module_cnt; i++) {

>                 close(obj->btf_modules[i].fd);

> @@ -8493,6 +8598,7 @@ void bpf_object__close(struct bpf_object *obj)

>         if (obj->clear_priv)

>                 obj->clear_priv(obj, obj->priv);


bpf_object__close() will close all those FD=0 in maps/progs, that's not good

>

> +       bpf_gen__free(obj->gen_loader);

>         bpf_object__elf_finish(obj);

>         bpf_object__unload(obj);

>         btf__free(obj->btf);


[...]

> @@ -9387,7 +9521,13 @@ static int libbpf_find_attach_btf_id(struct bpf_program *prog, int *btf_obj_fd,

>         }

>

>         /* kernel/module BTF ID */

> -       err = find_kernel_btf_id(prog->obj, attach_name, attach_type, btf_obj_fd, btf_type_id);

> +       if (prog->obj->gen_loader) {

> +               bpf_gen__record_attach_target(prog->obj->gen_loader, attach_name, attach_type);

> +               *btf_obj_fd = 0;


this will leak kernel module BTF FDs

> +               *btf_type_id = 1;

> +       } else {

> +               err = find_kernel_btf_id(prog->obj, attach_name, attach_type, btf_obj_fd, btf_type_id);

> +       }

>         if (err) {

>                 pr_warn("failed to find kernel BTF type ID of '%s': %d\n", attach_name, err);

>                 return err;


[...]

> +out:

> +       close(map_fd);

> +       close(prog_fd);


this does close(-1), check >= 0


> +       return err;

> +}

> +

> +#endif

> --

> 2.30.2

>
Alexei Starovoitov April 27, 2021, 2:43 a.m. UTC | #9
On Mon, Apr 26, 2021 at 10:02:59AM -0700, Andrii Nakryiko wrote:
> > +/* Copyright (c) 2021 Facebook */

> > +#include <linux/stddef.h>

> > +#include <linux/bpf.h>

> > +#include <bpf/bpf_helpers.h>

> > +#include <bpf/bpf_tracing.h>

> > +#include <../../tools/include/linux/filter.h>

> 

> with TOOLSINCDIR shouldn't this be just <linux/fiter.h>?


sadly no. There is uapi/linux/filter.h that gets included first.
And changing the order of -Is brings the whole set of other issues.
I couldn't come up with anything better unfortunately.
Alexei Starovoitov April 27, 2021, 3 a.m. UTC | #10
On Mon, Apr 26, 2021 at 10:29:09AM -0700, Andrii Nakryiko wrote:
> On Thu, Apr 22, 2021 at 5:27 PM Alexei Starovoitov

> <alexei.starovoitov@gmail.com> wrote:

> >

> > From: Alexei Starovoitov <ast@kernel.org>

> >

> > In order to be able to generate loader program in the later

> > patches change the order of data and text relocations.

> > Also improve the test to include data relos.

> >

> > If the kernel supports "FD array" the map_fd relocations can be processed

> > before text relos since generated loader program won't need to manually

> > patch ld_imm64 insns with map_fd.

> > But ksym and kfunc relocations can only be processed after all calls

> > are relocated, since loader program will consist of a sequence

> > of calls to bpf_btf_find_by_name_kind() followed by patching of btf_id

> > and btf_obj_fd into corresponding ld_imm64 insns. The locations of those

> > ld_imm64 insns are specified in relocations.

> > Hence process all data relocations (maps, ksym, kfunc) together after call relos.

> >

> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> > ---

> >  tools/lib/bpf/libbpf.c                        | 86 +++++++++++++++----

> >  .../selftests/bpf/progs/test_subprogs.c       | 13 +++

> >  2 files changed, 80 insertions(+), 19 deletions(-)

> >

> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c

> > index 17cfc5b66111..c73a85b97ca5 100644

> > --- a/tools/lib/bpf/libbpf.c

> > +++ b/tools/lib/bpf/libbpf.c

> > @@ -6379,11 +6379,15 @@ bpf_object__relocate_data(struct bpf_object *obj, struct bpf_program *prog)

> >                         insn[0].imm = ext->ksym.kernel_btf_id;

> >                         break;

> >                 case RELO_SUBPROG_ADDR:

> > -                       insn[0].src_reg = BPF_PSEUDO_FUNC;

> > -                       /* will be handled as a follow up pass */

> > +                       if (insn[0].src_reg != BPF_PSEUDO_FUNC) {

> > +                               pr_warn("prog '%s': relo #%d: bad insn\n",

> > +                                       prog->name, i);

> > +                               return -EINVAL;

> > +                       }

> 

> given SUBPROG_ADDR is now handled similarly to RELO_CALL in a

> different place, I'd probably drop this error check and just combine

> RELO_SUBPROG_ADDR and RELO_CALL cases with just a /* handled already

> */ comment.


I prefer to keep them separate. I've hit this pr_warn couple times
while messing with relos and it saved my time.
I bet it will save time to the next developer too.

> > +                       /* handled already */

> >                         break;

> >                 case RELO_CALL:

> > -                       /* will be handled as a follow up pass */

> > +                       /* handled already */

> >                         break;

> >                 default:

> >                         pr_warn("prog '%s': relo #%d: bad relo type %d\n",

> > @@ -6552,6 +6556,31 @@ static struct reloc_desc *find_prog_insn_relo(const struct bpf_program *prog, si

> >                        sizeof(*prog->reloc_desc), cmp_relo_by_insn_idx);

> >  }

> >

> > +static int append_subprog_relos(struct bpf_program *main_prog, struct bpf_program *subprog)

> > +{

> > +       int new_cnt = main_prog->nr_reloc + subprog->nr_reloc;

> > +       struct reloc_desc *relos;

> > +       size_t off = subprog->sub_insn_off;

> > +       int i;

> > +

> > +       if (main_prog == subprog)

> > +               return 0;

> > +       relos = libbpf_reallocarray(main_prog->reloc_desc, new_cnt, sizeof(*relos));

> > +       if (!relos)

> > +               return -ENOMEM;

> > +       memcpy(relos + main_prog->nr_reloc, subprog->reloc_desc,

> > +              sizeof(*relos) * subprog->nr_reloc);

> > +

> > +       for (i = main_prog->nr_reloc; i < new_cnt; i++)

> > +               relos[i].insn_idx += off;

> 

> nit: off is used only here, so there is little point in having it as a

> separate var, inline?


sure.

> > +       /* After insn_idx adjustment the 'relos' array is still sorted

> > +        * by insn_idx and doesn't break bsearch.

> > +        */

> > +       main_prog->reloc_desc = relos;

> > +       main_prog->nr_reloc = new_cnt;

> > +       return 0;

> > +}

> > +

> >  static int

> >  bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,

> >                        struct bpf_program *prog)

> > @@ -6560,18 +6589,32 @@ bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,

> >         struct bpf_program *subprog;

> >         struct bpf_insn *insns, *insn;

> >         struct reloc_desc *relo;

> > -       int err;

> > +       int err, i;

> >

> >         err = reloc_prog_func_and_line_info(obj, main_prog, prog);

> >         if (err)

> >                 return err;

> >

> > +       for (i = 0; i < prog->nr_reloc; i++) {

> > +               relo = &prog->reloc_desc[i];

> > +               insn = &main_prog->insns[prog->sub_insn_off + relo->insn_idx];

> > +

> > +               if (relo->type == RELO_SUBPROG_ADDR)

> > +                       /* mark the insn, so it becomes insn_is_pseudo_func() */

> > +                       insn[0].src_reg = BPF_PSEUDO_FUNC;

> > +       }

> > +

> 

> This will do the same work over and over each time we append a subprog

> to main_prog. This should logically follow append_subprog_relos(), but

> you wanted to do it for main_prog with the same code, right?


It cannot follow append_subprog_relos.
It has to be done before the loop below.
Otherwise !insn_is_pseudo_func() won't catch it and all ld_imm64 insns
will be considered which will make the loop below more complex and slower.
The find_prog_insn_relo() will be called a lot more times.
!relo condition would be treated different ld_imm64 vs call insn, etc.

> How about instead doing this before we start appending subprogs to

> main_progs? I.e., do it explicitly in bpf_object__relocate() before

> you start code relocation loop.


Not sure I follow.
Do another loop:
 for (i = 0; i < obj->nr_programs; i++)
    for (i = 0; i < prog->nr_reloc; i++)
      if (relo->type == RELO_SUBPROG_ADDR)
      ?
That's an option too.
I can do that if you prefer.
It felt cleaner to do this mark here right before the loop below that needs it.

> >         for (insn_idx = 0; insn_idx < prog->sec_insn_cnt; insn_idx++) {

> >                 insn = &main_prog->insns[prog->sub_insn_off + insn_idx];

> >                 if (!insn_is_subprog_call(insn) && !insn_is_pseudo_func(insn))

> >                         continue;

> >

> >                 relo = find_prog_insn_relo(prog, insn_idx);

> > +               if (relo && relo->type == RELO_EXTERN_FUNC)

> > +                       /* kfunc relocations will be handled later

> > +                        * in bpf_object__relocate_data()

> > +                        */

> > +                       continue;

> >                 if (relo && relo->type != RELO_CALL && relo->type != RELO_SUBPROG_ADDR) {

> >                         pr_warn("prog '%s': unexpected relo for insn #%zu, type %d\n",

> >                                 prog->name, insn_idx, relo->type);

> 

> [...]


--
Alexei Starovoitov April 27, 2021, 3:25 a.m. UTC | #11
On Mon, Apr 26, 2021 at 03:22:36PM -0700, Andrii Nakryiko wrote:
> On Thu, Apr 22, 2021 at 5:27 PM Alexei Starovoitov

> <alexei.starovoitov@gmail.com> wrote:

> >

> > From: Alexei Starovoitov <ast@kernel.org>

> >

> > The BPF program loading process performed by libbpf is quite complex

> > and consists of the following steps:

> > "open" phase:

> > - parse elf file and remember relocations, sections

> > - collect externs and ksyms including their btf_ids in prog's BTF

> > - patch BTF datasec (since llvm couldn't do it)

> > - init maps (old style map_def, BTF based, global data map, kconfig map)

> > - collect relocations against progs and maps

> > "load" phase:

> > - probe kernel features

> > - load vmlinux BTF

> > - resolve externs (kconfig and ksym)

> > - load program BTF

> > - init struct_ops

> > - create maps

> > - apply CO-RE relocations

> > - patch ld_imm64 insns with src_reg=PSEUDO_MAP, PSEUDO_MAP_VALUE, PSEUDO_BTF_ID

> > - reposition subprograms and adjust call insns

> > - sanitize and load progs

> >

> > During this process libbpf does sys_bpf() calls to load BTF, create maps,

> > populate maps and finally load programs.

> > Instead of actually doing the syscalls generate a trace of what libbpf

> > would have done and represent it as the "loader program".

> > The "loader program" consists of single map with:

> > - union bpf_attr(s)

> > - BTF bytes

> > - map value bytes

> > - insns bytes

> > and single bpf program that passes bpf_attr(s) and data into bpf_sys_bpf() helper.

> > Executing such "loader program" via bpf_prog_test_run() command will

> > replay the sequence of syscalls that libbpf would have done which will result

> > the same maps created and programs loaded as specified in the elf file.

> > The "loader program" removes libelf and majority of libbpf dependency from

> > program loading process.

> >

> > kconfig, typeless ksym, struct_ops and CO-RE are not supported yet.

> >

> > The order of relocate_data and relocate_calls had to change, so that

> > bpf_gen__prog_load() can see all relocations for a given program with

> > correct insn_idx-es.

> >

> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> > ---

> >  tools/lib/bpf/Build              |   2 +-

> >  tools/lib/bpf/bpf_gen_internal.h |  40 ++

> >  tools/lib/bpf/gen_loader.c       | 615 +++++++++++++++++++++++++++++++

> >  tools/lib/bpf/libbpf.c           | 204 ++++++++--

> >  tools/lib/bpf/libbpf.h           |  12 +

> >  tools/lib/bpf/libbpf.map         |   1 +

> >  tools/lib/bpf/libbpf_internal.h  |   2 +

> >  tools/lib/bpf/skel_internal.h    | 105 ++++++

> >  8 files changed, 948 insertions(+), 33 deletions(-)

> >  create mode 100644 tools/lib/bpf/bpf_gen_internal.h

> >  create mode 100644 tools/lib/bpf/gen_loader.c

> >  create mode 100644 tools/lib/bpf/skel_internal.h

> >

> > diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build

> > index 9b057cc7650a..430f6874fa41 100644

> > --- a/tools/lib/bpf/Build

> > +++ b/tools/lib/bpf/Build

> > @@ -1,3 +1,3 @@

> >  libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o str_error.o \

> >             netlink.o bpf_prog_linfo.o libbpf_probes.o xsk.o hashmap.o \

> > -           btf_dump.o ringbuf.o strset.o linker.o

> > +           btf_dump.o ringbuf.o strset.o linker.o gen_loader.o

> > diff --git a/tools/lib/bpf/bpf_gen_internal.h b/tools/lib/bpf/bpf_gen_internal.h

> > new file mode 100644

> > index 000000000000..dc3e2cbf9ce3

> > --- /dev/null

> > +++ b/tools/lib/bpf/bpf_gen_internal.h

> > @@ -0,0 +1,40 @@

> > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */

> > +/* Copyright (c) 2021 Facebook */

> > +#ifndef __BPF_GEN_INTERNAL_H

> > +#define __BPF_GEN_INTERNAL_H

> > +

> > +struct relo_desc {

> 

> there is very similarly named reloc_desc struct in libbpf.c, can you

> rename it to something like gen_btf_relo_desc?


sure.

> > +       const char *name;

> > +       int kind;

> > +       int insn_idx;

> > +};

> > +

> 

> [...]

> 

> > +

> > +static int bpf_gen__realloc_insn_buf(struct bpf_gen *gen, __u32 size)

> > +{

> > +       size_t off = gen->insn_cur - gen->insn_start;

> > +

> > +       if (gen->error)

> > +               return gen->error;

> > +       if (size > INT32_MAX || off + size > INT32_MAX) {

> > +               gen->error = -ERANGE;

> > +               return -ERANGE;

> > +       }

> > +       gen->insn_start = realloc(gen->insn_start, off + size);

> 

> leaking memory here: gen->insn_start will be NULL on failure


ohh. good catch.

> > +       if (!gen->insn_start) {

> > +               gen->error = -ENOMEM;

> > +               return -ENOMEM;

> > +       }

> > +       gen->insn_cur = gen->insn_start + off;

> > +       return 0;

> > +}

> > +

> > +static int bpf_gen__realloc_data_buf(struct bpf_gen *gen, __u32 size)

> > +{

> > +       size_t off = gen->data_cur - gen->data_start;

> > +

> > +       if (gen->error)

> > +               return gen->error;

> > +       if (size > INT32_MAX || off + size > INT32_MAX) {

> > +               gen->error = -ERANGE;

> > +               return -ERANGE;

> > +       }

> > +       gen->data_start = realloc(gen->data_start, off + size);

> 

> same as above

> 

> > +       if (!gen->data_start) {

> > +               gen->error = -ENOMEM;

> > +               return -ENOMEM;

> > +       }

> > +       gen->data_cur = gen->data_start + off;

> > +       return 0;

> > +}

> > +

> 

> [...]

> 

> > +

> > +static void bpf_gen__emit_sys_bpf(struct bpf_gen *gen, int cmd, int attr, int attr_size)

> > +{

> > +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_1, cmd));

> > +       bpf_gen__emit2(gen, BPF_LD_IMM64_RAW_FULL(BPF_REG_2, BPF_PSEUDO_MAP_IDX_VALUE, 0, 0, 0, attr));

> 

> is attr an offset into a blob? if yes, attr_off? or attr_base_off,

> anything with _off


yes. it's an offset into a blob, but I don't use _off anywhere
otherwise all variables through out would have to have _off which is too verbose.

> > +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_3, attr_size));

> > +       bpf_gen__emit(gen, BPF_EMIT_CALL(BPF_FUNC_sys_bpf));

> > +       /* remember the result in R7 */

> > +       bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_7, BPF_REG_0));

> > +}

> > +

> > +static void bpf_gen__emit_check_err(struct bpf_gen *gen)

> > +{

> > +       bpf_gen__emit(gen, BPF_JMP_IMM(BPF_JSGE, BPF_REG_7, 0, 2));

> > +       bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_0, BPF_REG_7));

> > +       bpf_gen__emit(gen, BPF_EXIT_INSN());

> > +}

> > +

> > +static void __bpf_gen__debug(struct bpf_gen *gen, int reg1, int reg2, const char *fmt, va_list args)

> 

> Can you please leave a comment on what reg1 and reg2 is, it's not very

> clear and the code clearly assumes that it can't be reg[1-4]. It's

> probably those special R7 and R9 (or -1, of course), but having a

> short comment makes sense to not jump around trying to figure out

> possible inputs.

> 

> Oh, reading further, it can also be R0.


good point. will add a comment.

> > +{

> > +       char buf[1024];

> > +       int addr, len, ret;

> > +

> > +       if (!gen->log_level)

> > +               return;

> > +       ret = vsnprintf(buf, sizeof(buf), fmt, args);

> > +       if (ret < 1024 - 7 && reg1 >= 0 && reg2 < 0)

> > +               /* The special case to accommodate common bpf_gen__debug_ret():

> > +                * to avoid specifying BPF_REG_7 and adding " r=%%d" to prints explicitly.

> > +                */

> > +               strcat(buf, " r=%d");

> > +       len = strlen(buf) + 1;

> > +       addr = bpf_gen__add_data(gen, buf, len);

> 

> nit: offset, not address, right?


it's actually an address.
From pov of the program and insn below:

> > +       bpf_gen__emit2(gen, BPF_LD_IMM64_RAW_FULL(BPF_REG_1, BPF_PSEUDO_MAP_IDX_VALUE, 0, 0, 0, addr));


I guess it's kinda both. It's an offset within global data, but no one calls
int var;
&var -> taking an offset here? no. It's taking an address of the glob var.
Our implementation of global data is via single map value element.
So global vars have offsets too.
imo addr is more accurate here and through out this file.

> > +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_2, len));

> > +       if (reg1 >= 0)

> > +               bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_3, reg1));

> > +       if (reg2 >= 0)

> > +               bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_4, reg2));

> > +       bpf_gen__emit(gen, BPF_EMIT_CALL(BPF_FUNC_trace_printk));

> > +}

> > +

> 

> [...]

> 

> > +int bpf_gen__finish(struct bpf_gen *gen)

> > +{

> > +       int i;

> > +

> > +       bpf_gen__emit_sys_close_stack(gen, stack_off(btf_fd));

> > +       for (i = 0; i < gen->nr_progs; i++)

> > +               bpf_gen__move_stack2ctx(gen,

> > +                                       sizeof(struct bpf_loader_ctx) +

> > +                                       sizeof(struct bpf_map_desc) * gen->nr_maps +

> > +                                       sizeof(struct bpf_prog_desc) * i +

> > +                                       offsetof(struct bpf_prog_desc, prog_fd), 4,

> > +                                       stack_off(prog_fd[i]));

> > +       for (i = 0; i < gen->nr_maps; i++)

> > +               bpf_gen__move_stack2ctx(gen,

> > +                                       sizeof(struct bpf_loader_ctx) +

> > +                                       sizeof(struct bpf_map_desc) * i +

> > +                                       offsetof(struct bpf_map_desc, map_fd), 4,

> > +                                       stack_off(map_fd[i]));

> > +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_0, 0));

> > +       bpf_gen__emit(gen, BPF_EXIT_INSN());

> > +       pr_debug("bpf_gen__finish %d\n", gen->error);

> 

> maybe prefix all those pr_debug()s with "gen: " to distinguish them

> from the rest of libbpf logging?


sure.

> > +       if (!gen->error) {

> > +               struct gen_loader_opts *opts = gen->opts;

> > +

> > +               opts->insns = gen->insn_start;

> > +               opts->insns_sz = gen->insn_cur - gen->insn_start;

> > +               opts->data = gen->data_start;

> > +               opts->data_sz = gen->data_cur - gen->data_start;

> > +       }

> > +       return gen->error;

> > +}

> > +

> > +void bpf_gen__free(struct bpf_gen *gen)

> > +{

> > +       if (!gen)

> > +               return;

> > +       free(gen->data_start);

> > +       free(gen->insn_start);

> > +       gen->data_start = NULL;

> > +       gen->insn_start = NULL;

> 

> what's the point of NULL'ing them out if you don't clear gen->data_cur

> and gen->insn_cur?


To spot the bugs quicker if there are issues.

> also should it free(gen) itself?


ohh. bpf_object__close() should probably do it to stay symmetrical with calloc.

> > +}

> > +

> > +void bpf_gen__load_btf(struct bpf_gen *gen, const void *btf_raw_data, __u32 btf_raw_size)

> > +{

> > +       union bpf_attr attr = {};

> 

> here and below: memset(0)?


that's unnecessary. there is no backward/forward compat issue here.
and bpf_attr doesn't have gaps inside.

> > +       int attr_size = offsetofend(union bpf_attr, btf_log_level);

> > +       int btf_data, btf_load_attr;

> > +

> > +       pr_debug("btf_load: size %d\n", btf_raw_size);

> > +       btf_data = bpf_gen__add_data(gen, btf_raw_data, btf_raw_size);

> > +

> 

> [...]

> 

> > +       map_create_attr = bpf_gen__add_data(gen, &attr, attr_size);

> > +       if (attr.btf_value_type_id)

> > +               /* populate union bpf_attr with btf_fd saved in the stack earlier */

> > +               bpf_gen__move_stack2blob(gen, map_create_attr + offsetof(union bpf_attr, btf_fd), 4,

> > +                                        stack_off(btf_fd));

> > +       switch (attr.map_type) {

> > +       case BPF_MAP_TYPE_ARRAY_OF_MAPS:

> > +       case BPF_MAP_TYPE_HASH_OF_MAPS:

> > +               bpf_gen__move_stack2blob(gen, map_create_attr + offsetof(union bpf_attr, inner_map_fd),

> > +                                        4, stack_off(inner_map_fd));

> > +               close_inner_map_fd = true;

> > +               break;

> > +       default:;

> 

> default:

>     break;


why?

> > +       }

> > +       /* emit MAP_CREATE command */

> > +       bpf_gen__emit_sys_bpf(gen, BPF_MAP_CREATE, map_create_attr, attr_size);

> > +       bpf_gen__debug_ret(gen, "map_create %s idx %d type %d value_size %d",

> > +                          attr.map_name, map_idx, map_attr->map_type, attr.value_size);

> > +       bpf_gen__emit_check_err(gen);

> 

> what will happen on error with inner_map_fd and all the other fds

> created by now?


that's a todo item. I'll add a comment that error path is far from perfect.

> > +       /* remember map_fd in the stack, if successful */

> > +       if (map_idx < 0) {

> > +               /* This bpf_gen__map_create() function is called with map_idx >= 0 for all maps

> > +                * that libbpf loading logic tracks.

> > +                * It's called with -1 to create an inner map.

> > +                */

> > +               bpf_gen__emit(gen, BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_7, stack_off(inner_map_fd)));

> > +       } else {

> > +               if (map_idx != gen->nr_maps) {

> 

> why would that happen? defensive programming? and even then `if () {}

> else if () {} else {}` structure is more appropriate


sure. will use that style.

> > +                       gen->error = -EDOM; /* internal bug */

> > +                       return;

> > +               }

> > +               bpf_gen__emit(gen, BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_7, stack_off(map_fd[map_idx])));

> > +               gen->nr_maps++;

> > +       }

> > +       if (close_inner_map_fd)

> > +               bpf_gen__emit_sys_close_stack(gen, stack_off(inner_map_fd));

> > +}

> > +

> 

> [...]

> 

> > +static void bpf_gen__cleanup_relos(struct bpf_gen *gen, int insns)

> > +{

> > +       int i, insn;

> > +

> > +       for (i = 0; i < gen->relo_cnt; i++) {

> > +               if (gen->relos[i].kind != BTF_KIND_VAR)

> > +                       continue;

> > +               /* close fd recorded in insn[insn_idx + 1].imm */

> > +               insn = insns + sizeof(struct bpf_insn) * (gen->relos[i].insn_idx + 1)

> > +                       + offsetof(struct bpf_insn, imm);

> > +               bpf_gen__emit_sys_close_blob(gen, insn);

> 

> wouldn't this close the same FD used across multiple "relos" multiple times?


no. since every relo has its own fd.
Right now the loader gen is simple and doesn't do preprocessing of
all relos for all progs to avoid complicating the code.
It's good enough for now.

> > +       }

> > +       if (gen->relo_cnt) {

> > +               free(gen->relos);

> > +               gen->relo_cnt = 0;

> > +               gen->relos = NULL;

> > +       }

> > +}

> > +

> 

> [...]

> 

> > +       struct bpf_gen *gen_loader;

> > +

> >         /*

> >          * Information when doing elf related work. Only valid if fd

> >          * is valid.

> > @@ -2651,7 +2654,15 @@ static int bpf_object__sanitize_and_load_btf(struct bpf_object *obj)

> >                 bpf_object__sanitize_btf(obj, kern_btf);

> >         }

> >

> > -       err = btf__load(kern_btf);

> > +       if (obj->gen_loader) {

> > +               __u32 raw_size = 0;

> > +               const void *raw_data = btf__get_raw_data(kern_btf, &raw_size);

> 

> this can return NULL on ENOMEM


good point.

> > +

> > +               bpf_gen__load_btf(obj->gen_loader, raw_data, raw_size);

> > +               btf__set_fd(kern_btf, 0);

> 

> why setting fd to 0 (stdin)? does gen depend on this somewhere? The

> problem is that it will eventually be closed on btf__free(), which

> will close stdin, causing a big surprise. What will happen if you

> leave it at -1?


unfortunately there are various piece of the code that check <0
means not init.
I can try to fix them all, but it felt unncessary complex vs screwing up stdin.
It's bpftool's stdin, but you're right that it's not pretty.
I can probably use special very large FD number and check for it.
Still cleaner than fixing all checks.

> 

> > +       } else {

> > +               err = btf__load(kern_btf);

> > +       }

> >         if (sanitize) {

> >                 if (!err) {

> >                         /* move fd to libbpf's BTF */

> > @@ -4262,6 +4273,12 @@ static bool kernel_supports(const struct bpf_object *obj, enum kern_feature_id f

> >         struct kern_feature_desc *feat = &feature_probes[feat_id];

> >         int ret;

> >

> > +       if (obj->gen_loader)

> > +               /* To generate loader program assume the latest kernel

> > +                * to avoid doing extra prog_load, map_create syscalls.

> > +                */

> > +               return true;

> > +

> >         if (READ_ONCE(feat->res) == FEAT_UNKNOWN) {

> >                 ret = feat->probe();

> >                 if (ret > 0) {

> > @@ -4344,6 +4361,13 @@ bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map)

> >         char *cp, errmsg[STRERR_BUFSIZE];

> >         int err, zero = 0;

> >

> > +       if (obj->gen_loader) {

> > +               bpf_gen__map_update_elem(obj->gen_loader, map - obj->maps,

> 

> it would be great for bpf_gen__map_update_elem to reflect that it's

> not a generic map_update_elem() call, rather special internal map

> update (just use bpf_gen__populate_internal_map?) Whether to freeze or

> not could be just a flag to the same call, they always go together.


It's actually generic map_update_elem. I haven't used it in inner map init yet.
Still on my todo list.

> > +                                        map->mmaped, map->def.value_size);

> > +               if (map_type == LIBBPF_MAP_RODATA || map_type == LIBBPF_MAP_KCONFIG)

> > +                       bpf_gen__map_freeze(obj->gen_loader, map - obj->maps);

> > +               return 0;

> > +       }

> >         err = bpf_map_update_elem(map->fd, &zero, map->mmaped, 0);

> >         if (err) {

> >                 err = -errno;

> > @@ -4369,7 +4393,7 @@ bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map)

> >

> >  static void bpf_map__destroy(struct bpf_map *map);

> >

> > -static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map)

> > +static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, bool is_inner)

> >  {

> >         struct bpf_create_map_attr create_attr;

> >         struct bpf_map_def *def = &map->def;

> > @@ -4415,9 +4439,9 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map)

> >

> >         if (bpf_map_type__is_map_in_map(def->type)) {

> >                 if (map->inner_map) {

> > -                       int err;

> > +                       int err = 0;

> 

> no need to initialize to zero, you are assigning it right below

> 

> >

> > -                       err = bpf_object__create_map(obj, map->inner_map);

> > +                       err = bpf_object__create_map(obj, map->inner_map, true);

> >                         if (err) {

> >                                 pr_warn("map '%s': failed to create inner map: %d\n",

> >                                         map->name, err);

> 

> [...]

> 

> > @@ -4469,7 +4498,12 @@ static int init_map_slots(struct bpf_map *map)

> >

> >                 targ_map = map->init_slots[i];

> >                 fd = bpf_map__fd(targ_map);

> > -               err = bpf_map_update_elem(map->fd, &i, &fd, 0);

> > +               if (obj->gen_loader) {

> > +                       printf("// TODO map_update_elem: idx %ld key %d value==map_idx %ld\n",

> > +                              map - obj->maps, i, targ_map - obj->maps);

> 

> return error for now?

> 

> > +               } else {

> > +                       err = bpf_map_update_elem(map->fd, &i, &fd, 0);

> > +               }

> >                 if (err) {

> >                         err = -errno;

> >                         pr_warn("map '%s': failed to initialize slot [%d] to map '%s' fd=%d: %d\n",

> 

> [...]

> 

> > @@ -6082,6 +6119,11 @@ static int bpf_core_apply_relo(struct bpf_program *prog,

> >         if (str_is_empty(spec_str))

> >                 return -EINVAL;

> >

> > +       if (prog->obj->gen_loader) {

> > +               printf("// TODO core_relo: prog %ld insn[%d] %s %s kind %d\n",

> > +                      prog - prog->obj->programs, relo->insn_off / 8,

> > +                      local_name, spec_str, relo->kind);

> 

> same, return error? Drop printf, maybe leave pr_debug()?


sure. pr_debug with error sounds fine.

> > +       }

> >         err = bpf_core_parse_spec(local_btf, local_id, spec_str, relo->kind, &local_spec);

> >         if (err) {

> >                 pr_warn("prog '%s': relo #%d: parsing [%d] %s %s + %s failed: %d\n",

> > @@ -6821,6 +6863,19 @@ bpf_object__relocate_calls(struct bpf_object *obj, struct bpf_program *prog)

> >

> >         return 0;

> >  }

> 

> empty line here

> 

> > +static void

> > +bpf_object__free_relocs(struct bpf_object *obj)

> > +{

> > +       struct bpf_program *prog;

> > +       int i;

> > +

> > +       /* free up relocation descriptors */

> > +       for (i = 0; i < obj->nr_programs; i++) {

> > +               prog = &obj->programs[i];

> > +               zfree(&prog->reloc_desc);

> > +               prog->nr_reloc = 0;

> > +       }

> > +}

> >

> 

> [...]

> 

> > +static int bpf_program__record_externs(struct bpf_program *prog)

> > +{

> > +       struct bpf_object *obj = prog->obj;

> > +       int i;

> > +

> > +       for (i = 0; i < prog->nr_reloc; i++) {

> > +               struct reloc_desc *relo = &prog->reloc_desc[i];

> > +               struct extern_desc *ext = &obj->externs[relo->sym_off];

> > +

> > +               switch (relo->type) {

> > +               case RELO_EXTERN_VAR:

> > +                       if (ext->type != EXT_KSYM)

> > +                               continue;

> > +                       if (!ext->ksym.type_id) /* typeless ksym */

> > +                               continue;

> 

> this shouldn't be silently ignored, if it's not supported, it should

> return error


agree

> > +                       bpf_gen__record_extern(obj->gen_loader, ext->name, BTF_KIND_VAR,

> > +                                              relo->insn_idx);

> > +                       break;

> > +               case RELO_EXTERN_FUNC:

> > +                       bpf_gen__record_extern(obj->gen_loader, ext->name, BTF_KIND_FUNC,

> > +                                              relo->insn_idx);

> > +                       break;

> > +               default:

> > +                       continue;

> > +               }

> > +       }

> > +       return 0;

> > +}

> > +

> 

> [...]

> 

> > @@ -7868,6 +7970,9 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr)

> >         err = err ? : bpf_object__relocate(obj, attr->target_btf_path);

> >         err = err ? : bpf_object__load_progs(obj, attr->log_level);

> >

> > +       if (obj->gen_loader && !err)

> > +               err = bpf_gen__finish(obj->gen_loader);

> > +

> >         /* clean up module BTFs */

> >         for (i = 0; i < obj->btf_module_cnt; i++) {

> >                 close(obj->btf_modules[i].fd);

> > @@ -8493,6 +8598,7 @@ void bpf_object__close(struct bpf_object *obj)

> >         if (obj->clear_priv)

> >                 obj->clear_priv(obj, obj->priv);

> 

> bpf_object__close() will close all those FD=0 in maps/progs, that's not good

> 

> >

> > +       bpf_gen__free(obj->gen_loader);

> >         bpf_object__elf_finish(obj);

> >         bpf_object__unload(obj);

> >         btf__free(obj->btf);

> 

> [...]

> 

> > @@ -9387,7 +9521,13 @@ static int libbpf_find_attach_btf_id(struct bpf_program *prog, int *btf_obj_fd,

> >         }

> >

> >         /* kernel/module BTF ID */

> > -       err = find_kernel_btf_id(prog->obj, attach_name, attach_type, btf_obj_fd, btf_type_id);

> > +       if (prog->obj->gen_loader) {

> > +               bpf_gen__record_attach_target(prog->obj->gen_loader, attach_name, attach_type);

> > +               *btf_obj_fd = 0;

> 

> this will leak kernel module BTF FDs


I don't follow.
When gen_loader is happening the find_kernel_btf_id() is not called and modules BTFs are not loaded.

> > +               *btf_type_id = 1;

> > +       } else {

> > +               err = find_kernel_btf_id(prog->obj, attach_name, attach_type, btf_obj_fd, btf_type_id);

> > +       }

> >         if (err) {

> >                 pr_warn("failed to find kernel BTF type ID of '%s': %d\n", attach_name, err);

> >                 return err;

> 

> [...]

> 

> > +out:

> > +       close(map_fd);

> > +       close(prog_fd);

> 

> this does close(-1), check >= 0


Right. I felt there is no risk in doing that. I guess extra check is fine too.
Andrii Nakryiko April 27, 2021, 4:28 p.m. UTC | #12
On Mon, Apr 26, 2021 at 7:43 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> On Mon, Apr 26, 2021 at 10:02:59AM -0700, Andrii Nakryiko wrote:

> > > +/* Copyright (c) 2021 Facebook */

> > > +#include <linux/stddef.h>

> > > +#include <linux/bpf.h>

> > > +#include <bpf/bpf_helpers.h>

> > > +#include <bpf/bpf_tracing.h>

> > > +#include <../../tools/include/linux/filter.h>

> >

> > with TOOLSINCDIR shouldn't this be just <linux/fiter.h>?

>

> sadly no. There is uapi/linux/filter.h that gets included first.

> And changing the order of -Is brings the whole set of other issues.

> I couldn't come up with anything better unfortunately.


Then let's at least drop TOOLSINCDIR for now, if it's not really used?
Andrii Nakryiko April 27, 2021, 4:47 p.m. UTC | #13
On Mon, Apr 26, 2021 at 8:00 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> On Mon, Apr 26, 2021 at 10:29:09AM -0700, Andrii Nakryiko wrote:

> > On Thu, Apr 22, 2021 at 5:27 PM Alexei Starovoitov

> > <alexei.starovoitov@gmail.com> wrote:

> > >

> > > From: Alexei Starovoitov <ast@kernel.org>

> > >

> > > In order to be able to generate loader program in the later

> > > patches change the order of data and text relocations.

> > > Also improve the test to include data relos.

> > >

> > > If the kernel supports "FD array" the map_fd relocations can be processed

> > > before text relos since generated loader program won't need to manually

> > > patch ld_imm64 insns with map_fd.

> > > But ksym and kfunc relocations can only be processed after all calls

> > > are relocated, since loader program will consist of a sequence

> > > of calls to bpf_btf_find_by_name_kind() followed by patching of btf_id

> > > and btf_obj_fd into corresponding ld_imm64 insns. The locations of those

> > > ld_imm64 insns are specified in relocations.

> > > Hence process all data relocations (maps, ksym, kfunc) together after call relos.

> > >

> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> > > ---

> > >  tools/lib/bpf/libbpf.c                        | 86 +++++++++++++++----

> > >  .../selftests/bpf/progs/test_subprogs.c       | 13 +++

> > >  2 files changed, 80 insertions(+), 19 deletions(-)

> > >

> > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c

> > > index 17cfc5b66111..c73a85b97ca5 100644

> > > --- a/tools/lib/bpf/libbpf.c

> > > +++ b/tools/lib/bpf/libbpf.c

> > > @@ -6379,11 +6379,15 @@ bpf_object__relocate_data(struct bpf_object *obj, struct bpf_program *prog)

> > >                         insn[0].imm = ext->ksym.kernel_btf_id;

> > >                         break;

> > >                 case RELO_SUBPROG_ADDR:

> > > -                       insn[0].src_reg = BPF_PSEUDO_FUNC;

> > > -                       /* will be handled as a follow up pass */

> > > +                       if (insn[0].src_reg != BPF_PSEUDO_FUNC) {

> > > +                               pr_warn("prog '%s': relo #%d: bad insn\n",

> > > +                                       prog->name, i);

> > > +                               return -EINVAL;

> > > +                       }

> >

> > given SUBPROG_ADDR is now handled similarly to RELO_CALL in a

> > different place, I'd probably drop this error check and just combine

> > RELO_SUBPROG_ADDR and RELO_CALL cases with just a /* handled already

> > */ comment.

>

> I prefer to keep them separate. I've hit this pr_warn couple times

> while messing with relos and it saved my time.

> I bet it will save time to the next developer too.


hmm.. ok, not critical to me

>

> > > +                       /* handled already */

> > >                         break;

> > >                 case RELO_CALL:

> > > -                       /* will be handled as a follow up pass */

> > > +                       /* handled already */

> > >                         break;

> > >                 default:

> > >                         pr_warn("prog '%s': relo #%d: bad relo type %d\n",

> > > @@ -6552,6 +6556,31 @@ static struct reloc_desc *find_prog_insn_relo(const struct bpf_program *prog, si

> > >                        sizeof(*prog->reloc_desc), cmp_relo_by_insn_idx);

> > >  }

> > >

> > > +static int append_subprog_relos(struct bpf_program *main_prog, struct bpf_program *subprog)

> > > +{

> > > +       int new_cnt = main_prog->nr_reloc + subprog->nr_reloc;

> > > +       struct reloc_desc *relos;

> > > +       size_t off = subprog->sub_insn_off;

> > > +       int i;

> > > +

> > > +       if (main_prog == subprog)

> > > +               return 0;

> > > +       relos = libbpf_reallocarray(main_prog->reloc_desc, new_cnt, sizeof(*relos));

> > > +       if (!relos)

> > > +               return -ENOMEM;

> > > +       memcpy(relos + main_prog->nr_reloc, subprog->reloc_desc,

> > > +              sizeof(*relos) * subprog->nr_reloc);

> > > +

> > > +       for (i = main_prog->nr_reloc; i < new_cnt; i++)

> > > +               relos[i].insn_idx += off;

> >

> > nit: off is used only here, so there is little point in having it as a

> > separate var, inline?

>

> sure.

>

> > > +       /* After insn_idx adjustment the 'relos' array is still sorted

> > > +        * by insn_idx and doesn't break bsearch.

> > > +        */

> > > +       main_prog->reloc_desc = relos;

> > > +       main_prog->nr_reloc = new_cnt;

> > > +       return 0;

> > > +}

> > > +

> > >  static int

> > >  bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,

> > >                        struct bpf_program *prog)

> > > @@ -6560,18 +6589,32 @@ bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,

> > >         struct bpf_program *subprog;

> > >         struct bpf_insn *insns, *insn;

> > >         struct reloc_desc *relo;

> > > -       int err;

> > > +       int err, i;

> > >

> > >         err = reloc_prog_func_and_line_info(obj, main_prog, prog);

> > >         if (err)

> > >                 return err;

> > >

> > > +       for (i = 0; i < prog->nr_reloc; i++) {

> > > +               relo = &prog->reloc_desc[i];

> > > +               insn = &main_prog->insns[prog->sub_insn_off + relo->insn_idx];

> > > +

> > > +               if (relo->type == RELO_SUBPROG_ADDR)

> > > +                       /* mark the insn, so it becomes insn_is_pseudo_func() */

> > > +                       insn[0].src_reg = BPF_PSEUDO_FUNC;

> > > +       }

> > > +

> >

> > This will do the same work over and over each time we append a subprog

> > to main_prog. This should logically follow append_subprog_relos(), but

> > you wanted to do it for main_prog with the same code, right?

>

> It cannot follow append_subprog_relos.

> It has to be done before the loop below.

> Otherwise !insn_is_pseudo_func() won't catch it and all ld_imm64 insns

> will be considered which will make the loop below more complex and slower.

> The find_prog_insn_relo() will be called a lot more times.

> !relo condition would be treated different ld_imm64 vs call insn, etc.


if you process main_prog->insns first all the calls to subprogs would
be updated, then it recursively would do this right before a new
subprog is added (initial call is bpf_object__reloc_code(obj, prog,
prog) where prog is entry-point program). But either way I'm not
suggesting doing this and splitting this logic into two places.

>

> > How about instead doing this before we start appending subprogs to

> > main_progs? I.e., do it explicitly in bpf_object__relocate() before

> > you start code relocation loop.

>

> Not sure I follow.

> Do another loop:

>  for (i = 0; i < obj->nr_programs; i++)

>     for (i = 0; i < prog->nr_reloc; i++)

>       if (relo->type == RELO_SUBPROG_ADDR)

>       ?

> That's an option too.

> I can do that if you prefer.

> It felt cleaner to do this mark here right before the loop below that needs it.


Yes, I'm proposing to do another loop in bpf_object__relocate() before
we start adding subprogs to main_progs. The reason is that
bpf_object__reloc_code() is called recursively many times for the same
main_prog, so doing that here is O(N^2) in the number of total
instructions in main_prog. It processes the same (already processed)
instructions many times unnecessarily. It's wasteful and unclean.

>

> > >         for (insn_idx = 0; insn_idx < prog->sec_insn_cnt; insn_idx++) {

> > >                 insn = &main_prog->insns[prog->sub_insn_off + insn_idx];

> > >                 if (!insn_is_subprog_call(insn) && !insn_is_pseudo_func(insn))

> > >                         continue;

> > >

> > >                 relo = find_prog_insn_relo(prog, insn_idx);

> > > +               if (relo && relo->type == RELO_EXTERN_FUNC)

> > > +                       /* kfunc relocations will be handled later

> > > +                        * in bpf_object__relocate_data()

> > > +                        */

> > > +                       continue;

> > >                 if (relo && relo->type != RELO_CALL && relo->type != RELO_SUBPROG_ADDR) {

> > >                         pr_warn("prog '%s': unexpected relo for insn #%zu, type %d\n",

> > >                                 prog->name, insn_idx, relo->type);

> >

> > [...]

>

> --
Andrii Nakryiko April 27, 2021, 5:34 p.m. UTC | #14
On Mon, Apr 26, 2021 at 8:25 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> On Mon, Apr 26, 2021 at 03:22:36PM -0700, Andrii Nakryiko wrote:

> > On Thu, Apr 22, 2021 at 5:27 PM Alexei Starovoitov

> > <alexei.starovoitov@gmail.com> wrote:

> > >

> > > From: Alexei Starovoitov <ast@kernel.org>

> > >

> > > The BPF program loading process performed by libbpf is quite complex

> > > and consists of the following steps:

> > > "open" phase:

> > > - parse elf file and remember relocations, sections

> > > - collect externs and ksyms including their btf_ids in prog's BTF

> > > - patch BTF datasec (since llvm couldn't do it)

> > > - init maps (old style map_def, BTF based, global data map, kconfig map)

> > > - collect relocations against progs and maps

> > > "load" phase:

> > > - probe kernel features

> > > - load vmlinux BTF

> > > - resolve externs (kconfig and ksym)

> > > - load program BTF

> > > - init struct_ops

> > > - create maps

> > > - apply CO-RE relocations

> > > - patch ld_imm64 insns with src_reg=PSEUDO_MAP, PSEUDO_MAP_VALUE, PSEUDO_BTF_ID

> > > - reposition subprograms and adjust call insns

> > > - sanitize and load progs

> > >

> > > During this process libbpf does sys_bpf() calls to load BTF, create maps,

> > > populate maps and finally load programs.

> > > Instead of actually doing the syscalls generate a trace of what libbpf

> > > would have done and represent it as the "loader program".

> > > The "loader program" consists of single map with:

> > > - union bpf_attr(s)

> > > - BTF bytes

> > > - map value bytes

> > > - insns bytes

> > > and single bpf program that passes bpf_attr(s) and data into bpf_sys_bpf() helper.

> > > Executing such "loader program" via bpf_prog_test_run() command will

> > > replay the sequence of syscalls that libbpf would have done which will result

> > > the same maps created and programs loaded as specified in the elf file.

> > > The "loader program" removes libelf and majority of libbpf dependency from

> > > program loading process.

> > >

> > > kconfig, typeless ksym, struct_ops and CO-RE are not supported yet.

> > >

> > > The order of relocate_data and relocate_calls had to change, so that

> > > bpf_gen__prog_load() can see all relocations for a given program with

> > > correct insn_idx-es.

> > >

> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> > > ---

> > >  tools/lib/bpf/Build              |   2 +-

> > >  tools/lib/bpf/bpf_gen_internal.h |  40 ++

> > >  tools/lib/bpf/gen_loader.c       | 615 +++++++++++++++++++++++++++++++

> > >  tools/lib/bpf/libbpf.c           | 204 ++++++++--

> > >  tools/lib/bpf/libbpf.h           |  12 +

> > >  tools/lib/bpf/libbpf.map         |   1 +

> > >  tools/lib/bpf/libbpf_internal.h  |   2 +

> > >  tools/lib/bpf/skel_internal.h    | 105 ++++++

> > >  8 files changed, 948 insertions(+), 33 deletions(-)

> > >  create mode 100644 tools/lib/bpf/bpf_gen_internal.h

> > >  create mode 100644 tools/lib/bpf/gen_loader.c

> > >  create mode 100644 tools/lib/bpf/skel_internal.h

> > >

> > > diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build

> > > index 9b057cc7650a..430f6874fa41 100644

> > > --- a/tools/lib/bpf/Build

> > > +++ b/tools/lib/bpf/Build

> > > @@ -1,3 +1,3 @@

> > >  libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o str_error.o \

> > >             netlink.o bpf_prog_linfo.o libbpf_probes.o xsk.o hashmap.o \

> > > -           btf_dump.o ringbuf.o strset.o linker.o

> > > +           btf_dump.o ringbuf.o strset.o linker.o gen_loader.o

> > > diff --git a/tools/lib/bpf/bpf_gen_internal.h b/tools/lib/bpf/bpf_gen_internal.h

> > > new file mode 100644

> > > index 000000000000..dc3e2cbf9ce3

> > > --- /dev/null

> > > +++ b/tools/lib/bpf/bpf_gen_internal.h

> > > @@ -0,0 +1,40 @@

> > > +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */

> > > +/* Copyright (c) 2021 Facebook */

> > > +#ifndef __BPF_GEN_INTERNAL_H

> > > +#define __BPF_GEN_INTERNAL_H

> > > +

> > > +struct relo_desc {

> >

> > there is very similarly named reloc_desc struct in libbpf.c, can you

> > rename it to something like gen_btf_relo_desc?

>

> sure.

>

> > > +       const char *name;

> > > +       int kind;

> > > +       int insn_idx;

> > > +};

> > > +

> >

> > [...]

> >

> > > +

> > > +static int bpf_gen__realloc_insn_buf(struct bpf_gen *gen, __u32 size)

> > > +{

> > > +       size_t off = gen->insn_cur - gen->insn_start;

> > > +

> > > +       if (gen->error)

> > > +               return gen->error;

> > > +       if (size > INT32_MAX || off + size > INT32_MAX) {

> > > +               gen->error = -ERANGE;

> > > +               return -ERANGE;

> > > +       }

> > > +       gen->insn_start = realloc(gen->insn_start, off + size);

> >

> > leaking memory here: gen->insn_start will be NULL on failure

>

> ohh. good catch.

>

> > > +       if (!gen->insn_start) {

> > > +               gen->error = -ENOMEM;

> > > +               return -ENOMEM;

> > > +       }

> > > +       gen->insn_cur = gen->insn_start + off;

> > > +       return 0;

> > > +}

> > > +

> > > +static int bpf_gen__realloc_data_buf(struct bpf_gen *gen, __u32 size)

> > > +{

> > > +       size_t off = gen->data_cur - gen->data_start;

> > > +

> > > +       if (gen->error)

> > > +               return gen->error;

> > > +       if (size > INT32_MAX || off + size > INT32_MAX) {

> > > +               gen->error = -ERANGE;

> > > +               return -ERANGE;

> > > +       }

> > > +       gen->data_start = realloc(gen->data_start, off + size);

> >

> > same as above

> >

> > > +       if (!gen->data_start) {

> > > +               gen->error = -ENOMEM;

> > > +               return -ENOMEM;

> > > +       }

> > > +       gen->data_cur = gen->data_start + off;

> > > +       return 0;

> > > +}

> > > +

> >

> > [...]

> >

> > > +

> > > +static void bpf_gen__emit_sys_bpf(struct bpf_gen *gen, int cmd, int attr, int attr_size)

> > > +{

> > > +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_1, cmd));

> > > +       bpf_gen__emit2(gen, BPF_LD_IMM64_RAW_FULL(BPF_REG_2, BPF_PSEUDO_MAP_IDX_VALUE, 0, 0, 0, attr));

> >

> > is attr an offset into a blob? if yes, attr_off? or attr_base_off,

> > anything with _off

>

> yes. it's an offset into a blob, but I don't use _off anywhere

> otherwise all variables through out would have to have _off which is too verbose.


After reading bpf_gen__emit_rel_store() which uses "off" I assumed
offset terminology will be used everywhere, but I got used to that by
the end I finished reading, so I don't care anymore.

>

> > > +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_3, attr_size));

> > > +       bpf_gen__emit(gen, BPF_EMIT_CALL(BPF_FUNC_sys_bpf));

> > > +       /* remember the result in R7 */

> > > +       bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_7, BPF_REG_0));

> > > +}

> > > +

> > > +static void bpf_gen__emit_check_err(struct bpf_gen *gen)

> > > +{

> > > +       bpf_gen__emit(gen, BPF_JMP_IMM(BPF_JSGE, BPF_REG_7, 0, 2));

> > > +       bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_0, BPF_REG_7));

> > > +       bpf_gen__emit(gen, BPF_EXIT_INSN());

> > > +}

> > > +

> > > +static void __bpf_gen__debug(struct bpf_gen *gen, int reg1, int reg2, const char *fmt, va_list args)

> >

> > Can you please leave a comment on what reg1 and reg2 is, it's not very

> > clear and the code clearly assumes that it can't be reg[1-4]. It's

> > probably those special R7 and R9 (or -1, of course), but having a

> > short comment makes sense to not jump around trying to figure out

> > possible inputs.

> >

> > Oh, reading further, it can also be R0.

>

> good point. will add a comment.

>

> > > +{

> > > +       char buf[1024];

> > > +       int addr, len, ret;

> > > +

> > > +       if (!gen->log_level)

> > > +               return;

> > > +       ret = vsnprintf(buf, sizeof(buf), fmt, args);

> > > +       if (ret < 1024 - 7 && reg1 >= 0 && reg2 < 0)

> > > +               /* The special case to accommodate common bpf_gen__debug_ret():

> > > +                * to avoid specifying BPF_REG_7 and adding " r=%%d" to prints explicitly.

> > > +                */

> > > +               strcat(buf, " r=%d");

> > > +       len = strlen(buf) + 1;

> > > +       addr = bpf_gen__add_data(gen, buf, len);

> >

> > nit: offset, not address, right?

>

> it's actually an address.

> From pov of the program and insn below:

>

> > > +       bpf_gen__emit2(gen, BPF_LD_IMM64_RAW_FULL(BPF_REG_1, BPF_PSEUDO_MAP_IDX_VALUE, 0, 0, 0, addr));

>

> I guess it's kinda both. It's an offset within global data, but no one calls

> int var;

> &var -> taking an offset here? no. It's taking an address of the glob var.

> Our implementation of global data is via single map value element.

> So global vars have offsets too.

> imo addr is more accurate here and through out this file.


alright

>

> > > +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_2, len));

> > > +       if (reg1 >= 0)

> > > +               bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_3, reg1));

> > > +       if (reg2 >= 0)

> > > +               bpf_gen__emit(gen, BPF_MOV64_REG(BPF_REG_4, reg2));

> > > +       bpf_gen__emit(gen, BPF_EMIT_CALL(BPF_FUNC_trace_printk));

> > > +}

> > > +

> >

> > [...]

> >

> > > +int bpf_gen__finish(struct bpf_gen *gen)

> > > +{

> > > +       int i;

> > > +

> > > +       bpf_gen__emit_sys_close_stack(gen, stack_off(btf_fd));

> > > +       for (i = 0; i < gen->nr_progs; i++)

> > > +               bpf_gen__move_stack2ctx(gen,

> > > +                                       sizeof(struct bpf_loader_ctx) +

> > > +                                       sizeof(struct bpf_map_desc) * gen->nr_maps +

> > > +                                       sizeof(struct bpf_prog_desc) * i +

> > > +                                       offsetof(struct bpf_prog_desc, prog_fd), 4,

> > > +                                       stack_off(prog_fd[i]));

> > > +       for (i = 0; i < gen->nr_maps; i++)

> > > +               bpf_gen__move_stack2ctx(gen,

> > > +                                       sizeof(struct bpf_loader_ctx) +

> > > +                                       sizeof(struct bpf_map_desc) * i +

> > > +                                       offsetof(struct bpf_map_desc, map_fd), 4,

> > > +                                       stack_off(map_fd[i]));

> > > +       bpf_gen__emit(gen, BPF_MOV64_IMM(BPF_REG_0, 0));

> > > +       bpf_gen__emit(gen, BPF_EXIT_INSN());

> > > +       pr_debug("bpf_gen__finish %d\n", gen->error);

> >

> > maybe prefix all those pr_debug()s with "gen: " to distinguish them

> > from the rest of libbpf logging?

>

> sure.

>

> > > +       if (!gen->error) {

> > > +               struct gen_loader_opts *opts = gen->opts;

> > > +

> > > +               opts->insns = gen->insn_start;

> > > +               opts->insns_sz = gen->insn_cur - gen->insn_start;

> > > +               opts->data = gen->data_start;

> > > +               opts->data_sz = gen->data_cur - gen->data_start;

> > > +       }

> > > +       return gen->error;

> > > +}

> > > +

> > > +void bpf_gen__free(struct bpf_gen *gen)

> > > +{

> > > +       if (!gen)

> > > +               return;

> > > +       free(gen->data_start);

> > > +       free(gen->insn_start);

> > > +       gen->data_start = NULL;

> > > +       gen->insn_start = NULL;

> >

> > what's the point of NULL'ing them out if you don't clear gen->data_cur

> > and gen->insn_cur?

>

> To spot the bugs quicker if there are issues.

>

> > also should it free(gen) itself?

>

> ohh. bpf_object__close() should probably do it to stay symmetrical with calloc.


in libbpf usually xxx__free() frees the object itself, so doing
free(gen) here would be consistent with that, IMO

>

> > > +}

> > > +

> > > +void bpf_gen__load_btf(struct bpf_gen *gen, const void *btf_raw_data, __u32 btf_raw_size)

> > > +{

> > > +       union bpf_attr attr = {};

> >

> > here and below: memset(0)?

>

> that's unnecessary. there is no backward/forward compat issue here.

> and bpf_attr doesn't have gaps inside.


systemd definitely had a problem with non-zero padding with such usage
of bpf_attr recently, but I don't remember which command specifically.
Is there any downside to making sure that this will keep working for
later bpf_attr changes regardless of whether there are gaps or not?

>

> > > +       int attr_size = offsetofend(union bpf_attr, btf_log_level);

> > > +       int btf_data, btf_load_attr;

> > > +

> > > +       pr_debug("btf_load: size %d\n", btf_raw_size);

> > > +       btf_data = bpf_gen__add_data(gen, btf_raw_data, btf_raw_size);

> > > +

> >

> > [...]

> >

> > > +       map_create_attr = bpf_gen__add_data(gen, &attr, attr_size);

> > > +       if (attr.btf_value_type_id)

> > > +               /* populate union bpf_attr with btf_fd saved in the stack earlier */

> > > +               bpf_gen__move_stack2blob(gen, map_create_attr + offsetof(union bpf_attr, btf_fd), 4,

> > > +                                        stack_off(btf_fd));

> > > +       switch (attr.map_type) {

> > > +       case BPF_MAP_TYPE_ARRAY_OF_MAPS:

> > > +       case BPF_MAP_TYPE_HASH_OF_MAPS:

> > > +               bpf_gen__move_stack2blob(gen, map_create_attr + offsetof(union bpf_attr, inner_map_fd),

> > > +                                        4, stack_off(inner_map_fd));

> > > +               close_inner_map_fd = true;

> > > +               break;

> > > +       default:;

> >

> > default:

> >     break;

>

> why?


because it's consistent with all the code in libbpf:

$ rg --multiline 'default:\s*\n\s*break;' tools/lib/bpf

returns quite a few cases, while

$ rg --multiline 'default:\s*;' tools/lib/bpf

returns none

>

> > > +       }

> > > +       /* emit MAP_CREATE command */

> > > +       bpf_gen__emit_sys_bpf(gen, BPF_MAP_CREATE, map_create_attr, attr_size);

> > > +       bpf_gen__debug_ret(gen, "map_create %s idx %d type %d value_size %d",

> > > +                          attr.map_name, map_idx, map_attr->map_type, attr.value_size);

> > > +       bpf_gen__emit_check_err(gen);

> >

> > what will happen on error with inner_map_fd and all the other fds

> > created by now?

>

> that's a todo item. I'll add a comment that error path is far from perfect.


ok

>

> > > +       /* remember map_fd in the stack, if successful */

> > > +       if (map_idx < 0) {

> > > +               /* This bpf_gen__map_create() function is called with map_idx >= 0 for all maps

> > > +                * that libbpf loading logic tracks.

> > > +                * It's called with -1 to create an inner map.

> > > +                */

> > > +               bpf_gen__emit(gen, BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_7, stack_off(inner_map_fd)));

> > > +       } else {

> > > +               if (map_idx != gen->nr_maps) {

> >

> > why would that happen? defensive programming? and even then `if () {}

> > else if () {} else {}` structure is more appropriate

>

> sure. will use that style.

>

> > > +                       gen->error = -EDOM; /* internal bug */

> > > +                       return;

> > > +               }

> > > +               bpf_gen__emit(gen, BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_7, stack_off(map_fd[map_idx])));

> > > +               gen->nr_maps++;

> > > +       }

> > > +       if (close_inner_map_fd)

> > > +               bpf_gen__emit_sys_close_stack(gen, stack_off(inner_map_fd));

> > > +}

> > > +

> >

> > [...]

> >

> > > +static void bpf_gen__cleanup_relos(struct bpf_gen *gen, int insns)

> > > +{

> > > +       int i, insn;

> > > +

> > > +       for (i = 0; i < gen->relo_cnt; i++) {

> > > +               if (gen->relos[i].kind != BTF_KIND_VAR)

> > > +                       continue;

> > > +               /* close fd recorded in insn[insn_idx + 1].imm */

> > > +               insn = insns + sizeof(struct bpf_insn) * (gen->relos[i].insn_idx + 1)

> > > +                       + offsetof(struct bpf_insn, imm);

> > > +               bpf_gen__emit_sys_close_blob(gen, insn);

> >

> > wouldn't this close the same FD used across multiple "relos" multiple times?

>

> no. since every relo has its own fd.

> Right now the loader gen is simple and doesn't do preprocessing of

> all relos for all progs to avoid complicating the code.

> It's good enough for now.


I keep forgetting about this split of "this will happen much later in
kernel" vs "this is what libbpf is normally doing". This is going to
be fun to support.

>

> > > +       }

> > > +       if (gen->relo_cnt) {

> > > +               free(gen->relos);

> > > +               gen->relo_cnt = 0;

> > > +               gen->relos = NULL;

> > > +       }

> > > +}

> > > +

> >

> > [...]

> >

> > > +       struct bpf_gen *gen_loader;

> > > +

> > >         /*

> > >          * Information when doing elf related work. Only valid if fd

> > >          * is valid.

> > > @@ -2651,7 +2654,15 @@ static int bpf_object__sanitize_and_load_btf(struct bpf_object *obj)

> > >                 bpf_object__sanitize_btf(obj, kern_btf);

> > >         }

> > >

> > > -       err = btf__load(kern_btf);

> > > +       if (obj->gen_loader) {

> > > +               __u32 raw_size = 0;

> > > +               const void *raw_data = btf__get_raw_data(kern_btf, &raw_size);

> >

> > this can return NULL on ENOMEM

>

> good point.

>

> > > +

> > > +               bpf_gen__load_btf(obj->gen_loader, raw_data, raw_size);

> > > +               btf__set_fd(kern_btf, 0);

> >

> > why setting fd to 0 (stdin)? does gen depend on this somewhere? The

> > problem is that it will eventually be closed on btf__free(), which

> > will close stdin, causing a big surprise. What will happen if you

> > leave it at -1?

>

> unfortunately there are various piece of the code that check <0

> means not init.

> I can try to fix them all, but it felt unncessary complex vs screwing up stdin.

> It's bpftool's stdin, but you're right that it's not pretty.

> I can probably use special very large FD number and check for it.

> Still cleaner than fixing all checks.


Maybe after generation go over each prog/map/BTF and reset their FDs
to -1 if gen_loader is used? Or I guess we can just sprinkle

if (!obj->gen_loader)
    close(fd);

in the right places. Not great from code readability, but at least
won't have spurious close()s.

>

> >

> > > +       } else {

> > > +               err = btf__load(kern_btf);

> > > +       }

> > >         if (sanitize) {

> > >                 if (!err) {

> > >                         /* move fd to libbpf's BTF */

> > > @@ -4262,6 +4273,12 @@ static bool kernel_supports(const struct bpf_object *obj, enum kern_feature_id f

> > >         struct kern_feature_desc *feat = &feature_probes[feat_id];

> > >         int ret;

> > >

> > > +       if (obj->gen_loader)

> > > +               /* To generate loader program assume the latest kernel

> > > +                * to avoid doing extra prog_load, map_create syscalls.

> > > +                */

> > > +               return true;

> > > +

> > >         if (READ_ONCE(feat->res) == FEAT_UNKNOWN) {

> > >                 ret = feat->probe();

> > >                 if (ret > 0) {

> > > @@ -4344,6 +4361,13 @@ bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map)

> > >         char *cp, errmsg[STRERR_BUFSIZE];

> > >         int err, zero = 0;

> > >

> > > +       if (obj->gen_loader) {

> > > +               bpf_gen__map_update_elem(obj->gen_loader, map - obj->maps,

> >

> > it would be great for bpf_gen__map_update_elem to reflect that it's

> > not a generic map_update_elem() call, rather special internal map

> > update (just use bpf_gen__populate_internal_map?) Whether to freeze or

> > not could be just a flag to the same call, they always go together.

>

> It's actually generic map_update_elem. I haven't used it in inner map init yet.

> Still on my todo list.


Right now it assumes the key is zero and sizeof(int), I was wondering
if you are going to capture generic key in the future. Ok.

>

> > > +                                        map->mmaped, map->def.value_size);

> > > +               if (map_type == LIBBPF_MAP_RODATA || map_type == LIBBPF_MAP_KCONFIG)

> > > +                       bpf_gen__map_freeze(obj->gen_loader, map - obj->maps);

> > > +               return 0;

> > > +       }

> > >         err = bpf_map_update_elem(map->fd, &zero, map->mmaped, 0);

> > >         if (err) {

> > >                 err = -errno;


[...]

> > > @@ -9387,7 +9521,13 @@ static int libbpf_find_attach_btf_id(struct bpf_program *prog, int *btf_obj_fd,

> > >         }

> > >

> > >         /* kernel/module BTF ID */

> > > -       err = find_kernel_btf_id(prog->obj, attach_name, attach_type, btf_obj_fd, btf_type_id);

> > > +       if (prog->obj->gen_loader) {

> > > +               bpf_gen__record_attach_target(prog->obj->gen_loader, attach_name, attach_type);

> > > +               *btf_obj_fd = 0;

> >

> > this will leak kernel module BTF FDs

>

> I don't follow.

> When gen_loader is happening the find_kernel_btf_id() is not called and modules BTFs are not loaded.


Yeah, my bad again, I misread diff as if find_kernel_btf_id() was
still called before. Keep forgetting about this "in the kernel, in the
future" semantics. Never mind.

>

> > > +               *btf_type_id = 1;

> > > +       } else {

> > > +               err = find_kernel_btf_id(prog->obj, attach_name, attach_type, btf_obj_fd, btf_type_id);

> > > +       }

> > >         if (err) {

> > >                 pr_warn("failed to find kernel BTF type ID of '%s': %d\n", attach_name, err);

> > >                 return err;

> >

> > [...]

> >

> > > +out:

> > > +       close(map_fd);

> > > +       close(prog_fd);

> >

> > this does close(-1), check >= 0

>

> Right. I felt there is no risk in doing that. I guess extra check is fine too.


We try to not do this even in selftests, so good to not have spurious close()s.
John Fastabend April 27, 2021, 6:45 p.m. UTC | #15
Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>

> 

> Add placeholders for bpf_sys_bpf() helper and new program type.

> 

> v1->v2:

> - check that expected_attach_type is zero

> - allow more helper functions to be used in this program type, since they will

>   only execute from user context via bpf_prog_test_run.

> 

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

> ---


Acked-by: John Fastabend <john.fastabend@gmail.com>


> +int bpf_prog_test_run_syscall(struct bpf_prog *prog,

> +			      const union bpf_attr *kattr,

> +			      union bpf_attr __user *uattr)

> +{

> +	void __user *ctx_in = u64_to_user_ptr(kattr->test.ctx_in);

> +	__u32 ctx_size_in = kattr->test.ctx_size_in;

> +	void *ctx = NULL;

> +	u32 retval;

> +	int err = 0;

> +

> +	/* doesn't support data_in/out, ctx_out, duration, or repeat or flags */

> +	if (kattr->test.data_in || kattr->test.data_out ||

> +	    kattr->test.ctx_out || kattr->test.duration ||

> +	    kattr->test.repeat || kattr->test.flags)

> +		return -EINVAL;

> +

> +	if (ctx_size_in < prog->aux->max_ctx_offset ||

> +	    ctx_size_in > U16_MAX)

> +		return -EINVAL;

> +

> +	if (ctx_size_in) {

> +		ctx = kzalloc(ctx_size_in, GFP_USER);

> +		if (!ctx)

> +			return -ENOMEM;

> +		if (copy_from_user(ctx, ctx_in, ctx_size_in)) {

> +			err = -EFAULT;

> +			goto out;

> +		}

> +	}

> +	retval = bpf_prog_run_pin_on_cpu(prog, ctx);

> +

> +	if (copy_to_user(&uattr->test.retval, &retval, sizeof(u32)))

> +		err = -EFAULT;

> +	if (ctx_size_in)

> +		if (copy_to_user(ctx_in, ctx, ctx_size_in)) {

> +			err = -EFAULT;

> +			goto out;

> +		}


stupid nit, the last goto there is not needed.

> +out:

> +	kfree(ctx);

> +	return err;

> +}
Alexei Starovoitov April 28, 2021, 1:42 a.m. UTC | #16
On Tue, Apr 27, 2021 at 10:34:50AM -0700, Andrii Nakryiko wrote:
> > > > +void bpf_gen__load_btf(struct bpf_gen *gen, const void *btf_raw_data, __u32 btf_raw_size)

> > > > +{

> > > > +       union bpf_attr attr = {};

> > >

> > > here and below: memset(0)?

> >

> > that's unnecessary. there is no backward/forward compat issue here.

> > and bpf_attr doesn't have gaps inside.

> 

> systemd definitely had a problem with non-zero padding with such usage

> of bpf_attr recently, but I don't remember which command specifically.

> Is there any downside to making sure that this will keep working for

> later bpf_attr changes regardless of whether there are gaps or not?


I'd like to avoid cargo culting memset where it's not needed,
but thinking more about it...
I'll switch to memset(, cmd_specific_attr_size) instead.
I wanted to do this optimization forever in the rest of libbpf.
That would be a starting place.
Eventually when bpf.c will migrate into bpf.h I will convert it to
memset(, attr_size) too.
The bpf_attr is too big to do full zeroing.
Loader gen already taking advantage of that with partial bpf_attr for everything.

> > unfortunately there are various piece of the code that check <0

> > means not init.

> > I can try to fix them all, but it felt unncessary complex vs screwing up stdin.

> > It's bpftool's stdin, but you're right that it's not pretty.

> > I can probably use special very large FD number and check for it.

> > Still cleaner than fixing all checks.

> 

> Maybe after generation go over each prog/map/BTF and reset their FDs

> to -1 if gen_loader is used? Or I guess we can just sprinkle

> 

> if (!obj->gen_loader)

>     close(fd);

> 

> in the right places. Not great from code readability, but at least

> won't have spurious close()s.


Interesting ideas. I haven't thought about reseting back to -1.
Will do and address the rest of comments too.