diff mbox series

[bpf,v2] bpf: zero-fill re-used per-cpu map element

Message ID 20201027221324.27894-1-david.verbeiren@tessares.net
State New
Headers show
Series [bpf,v2] bpf: zero-fill re-used per-cpu map element | expand

Commit Message

David Verbeiren Oct. 27, 2020, 10:13 p.m. UTC
Zero-fill element values for all other cpus than current, just as
when not using prealloc. This is the only way the bpf program can
ensure known initial values for all cpus ('onallcpus' cannot be
set when coming from the bpf program).

The scenario is: bpf program inserts some elements in a per-cpu
map, then deletes some (or userspace does). When later adding
new elements using bpf_map_update_elem(), the bpf program can
only set the value of the new elements for the current cpu.
When prealloc is enabled, previously deleted elements are re-used.
Without the fix, values for other cpus remain whatever they were
when the re-used entry was previously freed.

Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David Verbeiren <david.verbeiren@tessares.net>
---

Notes:
    v2:
      - Moved memset() to separate pcpu_init_value() function,
        which replaces pcpu_copy_value() but delegates to it
        for the cases where no memset() is needed (Andrii).
      - This function now also avoids doing the memset() for
        the current cpu for which the value must be set
        anyhow (Andrii).
      - Same pcpu_init_value() used for per-cpu LRU map
        (Andrii).
    
      Note that I could not test the per-cpu LRU other than
      by running the bpf selftests. lru_map and maps tests
      passed but for the rest of the test suite, I don't
      think I know how to spot problems...
    
      Question: Is it ok to use raw_smp_processor_id() in
      these contexts? bpf prog context should be fine, I think.
      Is it also ok in the syscall context?

 kernel/bpf/hashtab.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

Comments

Andrii Nakryiko Oct. 27, 2020, 10:55 p.m. UTC | #1
On Tue, Oct 27, 2020 at 3:15 PM David Verbeiren
<david.verbeiren@tessares.net> wrote:
>
> Zero-fill element values for all other cpus than current, just as
> when not using prealloc. This is the only way the bpf program can
> ensure known initial values for all cpus ('onallcpus' cannot be
> set when coming from the bpf program).
>
> The scenario is: bpf program inserts some elements in a per-cpu
> map, then deletes some (or userspace does). When later adding
> new elements using bpf_map_update_elem(), the bpf program can
> only set the value of the new elements for the current cpu.
> When prealloc is enabled, previously deleted elements are re-used.
> Without the fix, values for other cpus remain whatever they were
> when the re-used entry was previously freed.
>
> Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
> Signed-off-by: David Verbeiren <david.verbeiren@tessares.net>
> ---
>

Looks good, but would be good to have a unit test (see below). Maybe
in a follow up.

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> Notes:
>     v2:
>       - Moved memset() to separate pcpu_init_value() function,
>         which replaces pcpu_copy_value() but delegates to it
>         for the cases where no memset() is needed (Andrii).
>       - This function now also avoids doing the memset() for
>         the current cpu for which the value must be set
>         anyhow (Andrii).
>       - Same pcpu_init_value() used for per-cpu LRU map
>         (Andrii).
>
>       Note that I could not test the per-cpu LRU other than
>       by running the bpf selftests. lru_map and maps tests
>       passed but for the rest of the test suite, I don't
>       think I know how to spot problems...

It would be good to write a new selftest specifically for this. You
can create a single-element pre-allocated per-CPU hashmap. From
user-space, initialize it to non-zeros on all CPUs. Then delete that
key (it will get put on the free list). Then trigger BPF program, do
an update (that should take an element from the freelist), doesn't
matter which value you set (could be zero). Then from user-space get
all per-CPU values for than new key. It should all be zeroes with your
fix and non-zero without it.

It sounds more complicated than it would look like in practice :)

>
>       Question: Is it ok to use raw_smp_processor_id() in
>       these contexts? bpf prog context should be fine, I think.
>       Is it also ok in the syscall context?
David Verbeiren Oct. 29, 2020, 2:44 p.m. UTC | #2
On Tue, Oct 27, 2020 at 11:55 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> Looks good, but would be good to have a unit test (see below). Maybe
> in a follow up.

Here is the associated new selftest, implementing the sequence you
proposed (thanks for that!), and also one for LRU case:
https://lore.kernel.org/bpf/20201029111730.6881-1-david.verbeiren@tessares.net/

I hope I did it in the "right" framework of the day.
Andrii Nakryiko Oct. 29, 2020, 10:40 p.m. UTC | #3
On Thu, Oct 29, 2020 at 7:44 AM David Verbeiren
<david.verbeiren@tessares.net> wrote:
>

> On Tue, Oct 27, 2020 at 11:55 PM Andrii Nakryiko

> <andrii.nakryiko@gmail.com> wrote:

> > Looks good, but would be good to have a unit test (see below). Maybe

> > in a follow up.

>

> Here is the associated new selftest, implementing the sequence you

> proposed (thanks for that!), and also one for LRU case:

> https://lore.kernel.org/bpf/20201029111730.6881-1-david.verbeiren@tessares.net/

>

> I hope I did it in the "right" framework of the day.


You did it in a very laborious and hard way, which will be harder to
maintain, unfortunately. But it should be trivial to simplify it with
skeleton. It's probably better to combine the selftest patch with the
kernel fix in this patch and send it as v2?
Andrii Nakryiko Nov. 3, 2020, 6:19 p.m. UTC | #4
On Tue, Nov 3, 2020 at 7:49 AM David Verbeiren
<david.verbeiren@tessares.net> wrote:
>
> Zero-fill element values for all other cpus than current, just as
> when not using prealloc. This is the only way the bpf program can
> ensure known initial values for all cpus ('onallcpus' cannot be
> set when coming from the bpf program).
>
> The scenario is: bpf program inserts some elements in a per-cpu
> map, then deletes some (or userspace does). When later adding
> new elements using bpf_map_update_elem(), the bpf program can
> only set the value of the new elements for the current cpu.
> When prealloc is enabled, previously deleted elements are re-used.
> Without the fix, values for other cpus remain whatever they were
> when the re-used entry was previously freed.
>
> A selftest is added to validate correct operation in above
> scenario as well as in case of LRU per-cpu map element re-use.
>
> Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
> Signed-off-by: David Verbeiren <david.verbeiren@tessares.net>
> ---
>

Tests look really nice, thanks! I'm worried about still racy once
check, see suggestions below. Otherwise looks great!

> Notes:
>     v3:
>       - Added selftest that was initially provided as separate
>         patch, and reworked to
>         * use skeleton (Andrii, Song Liu)
>         * skip test if <=1 CPU (Song Liu)
>
>     v2:
>       - Moved memset() to separate pcpu_init_value() function,
>         which replaces pcpu_copy_value() but delegates to it
>         for the cases where no memset() is needed (Andrii).
>       - This function now also avoids doing the memset() for
>         the current cpu for which the value must be set
>         anyhow (Andrii).
>       - Same pcpu_init_value() used for per-cpu LRU map
>         (Andrii).
>
>  kernel/bpf/hashtab.c                          |  30 ++-
>  .../selftests/bpf/prog_tests/map_init.c       | 213 ++++++++++++++++++
>  .../selftests/bpf/progs/test_map_init.c       |  34 +++
>  3 files changed, 275 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/map_init.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_map_init.c
>

[...]

> diff --git a/tools/testing/selftests/bpf/prog_tests/map_init.c b/tools/testing/selftests/bpf/prog_tests/map_init.c
> new file mode 100644
> index 000000000000..386d9439bad9
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/map_init.c
> @@ -0,0 +1,213 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +// Copyright (c) 2020 Tessares SA <http://www.tessares.net>
> +

nit: see below, /* */

> +#include <test_progs.h>
> +#include "test_map_init.skel.h"
> +

[...]

> diff --git a/tools/testing/selftests/bpf/progs/test_map_init.c b/tools/testing/selftests/bpf/progs/test_map_init.c
> new file mode 100644
> index 000000000000..280a45e366d6
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_map_init.c
> @@ -0,0 +1,34 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2020 Tessares SA <http://www.tessares.net>

nit: I think copyright line has to be in /* */ comment block

> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +
> +__u64 inKey = 0;
> +__u64 inValue = 0;
> +__u32 once = 0;
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
> +       __uint(max_entries, 2);
> +       __type(key, __u64);
> +       __type(value, __u64);
> +} hashmap1 SEC(".maps");
> +
> +
> +SEC("raw_tp/sys_enter")
> +int sys_enter(const void *ctx)
> +{
> +       /* Just do it once so the value is only updated for a single CPU.
> +        * Indeed, this tracepoint will quickly be hit from different CPUs.
> +        */
> +       if (!once) {
> +               __sync_fetch_and_add(&once, 1);

This is quite racy, actually, especially for the generic sys_enter
tracepoint. The way I did this before (see progs/trigger_bench.c) was
through doing a "tp/syscalls/sys_enter_getpgid" tracepoint program and
checking for thread id. Or you can use bpf_test_run, probably, with a
different type of BPF program. I just find bpf_test_run() too
inconvenient with all the extra setup, so I usually stick to
tracepoints.

> +
> +               bpf_map_update_elem(&hashmap1, &inKey, &inValue, BPF_NOEXIST);
> +       }
> +
> +       return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.29.0
>
diff mbox series

Patch

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 1815e97d4c9c..1fccba6e88c4 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -821,6 +821,32 @@  static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
 	}
 }
 
+static void pcpu_init_value(struct bpf_htab *htab, void __percpu *pptr,
+			    void *value, bool onallcpus)
+{
+	/* When using prealloc and not setting the initial value on all cpus,
+	 * zero-fill element values for other cpus (just as what happens when
+	 * not using prealloc). Otherwise, bpf program has no way to ensure
+	 * known initial values for cpus other than current one
+	 * (onallcpus=false always when coming from bpf prog).
+	 */
+	if (htab_is_prealloc(htab) && !onallcpus) {
+		u32 size = round_up(htab->map.value_size, 8);
+		int current_cpu = raw_smp_processor_id();
+		int cpu;
+
+		for_each_possible_cpu(cpu) {
+			if (cpu == current_cpu)
+				bpf_long_memcpy(per_cpu_ptr(pptr, cpu), value,
+						size);
+			else
+				memset(per_cpu_ptr(pptr, cpu), 0, size);
+		}
+	} else {
+		pcpu_copy_value(htab, pptr, value, onallcpus);
+	}
+}
+
 static bool fd_htab_map_needs_adjust(const struct bpf_htab *htab)
 {
 	return htab->map.map_type == BPF_MAP_TYPE_HASH_OF_MAPS &&
@@ -891,7 +917,7 @@  static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			}
 		}
 
-		pcpu_copy_value(htab, pptr, value, onallcpus);
+		pcpu_init_value(htab, pptr, value, onallcpus);
 
 		if (!prealloc)
 			htab_elem_set_ptr(l_new, key_size, pptr);
@@ -1183,7 +1209,7 @@  static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 		pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
 				value, onallcpus);
 	} else {
-		pcpu_copy_value(htab, htab_elem_get_ptr(l_new, key_size),
+		pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size),
 				value, onallcpus);
 		hlist_nulls_add_head_rcu(&l_new->hash_node, head);
 		l_new = NULL;