Message ID | 20240705145009.32340-1-puranjay@kernel.org |
---|---|
State | Accepted |
Commit | 90dc946059b7d346f077b870a8d8aaf03b4d0772 |
Headers | show |
Series | [bpf] selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep | expand |
Daniel Borkmann <daniel@iogearbox.net> writes: > On 7/5/24 4:50 PM, Puranjay Mohan wrote: >> fexit_sleep test runs successfully now on the CI so remove it from the >> deny list. > > Do you happen to know which commit fixed it? If yes, might be nice to have it > documented in the commit message. Actually, I never saw this test failing on my local setup and yesterday I tried running it on the CI where it passed as well. So, I assumed that this would be fixed by some commit. I am not sure which exact commit might have fixed this. Manu, Martin When this was added to the deny list was this failing every time and did you have some reproducer for this. If there is a reproducer, I can try fixing it but when ran normally this test never fails for me. Thanks, Puranjay
On 7/8/24 5:26 PM, KP Singh wrote: > On Mon, Jul 8, 2024 at 5:00 PM Puranjay Mohan <puranjay@kernel.org> wrote: >> >> Daniel Borkmann <daniel@iogearbox.net> writes: >> >>> On 7/5/24 4:50 PM, Puranjay Mohan wrote: >>>> fexit_sleep test runs successfully now on the CI so remove it from the >>>> deny list. >>> >>> Do you happen to know which commit fixed it? If yes, might be nice to have it >>> documented in the commit message. >> >> Actually, I never saw this test failing on my local setup and yesterday >> I tried running it on the CI where it passed as well. So, I assumed that >> this would be fixed by some commit. I am not sure which exact commit >> might have fixed this. >> >> Manu, Martin >> >> When this was added to the deny list was this failing every time and did >> you have some reproducer for this. If there is a reproducer, I can try >> fixing it but when ran normally this test never fails for me. > > I think this never worked until > https://lore.kernel.org/lkml/20230405180250.2046566-1-revest@chromium.org/ > was merged, FTrace direct calls was blocking tracing programs on ARM, > since then it has always worked. Awesome, thanks! I'll add this to the commit desc then when applying.
Daniel Borkmann <daniel@iogearbox.net> writes: > On 7/8/24 5:26 PM, KP Singh wrote: >> On Mon, Jul 8, 2024 at 5:00 PM Puranjay Mohan <puranjay@kernel.org> wrote: >>> >>> Daniel Borkmann <daniel@iogearbox.net> writes: >>> >>>> On 7/5/24 4:50 PM, Puranjay Mohan wrote: >>>>> fexit_sleep test runs successfully now on the CI so remove it from the >>>>> deny list. >>>> >>>> Do you happen to know which commit fixed it? If yes, might be nice to have it >>>> documented in the commit message. >>> >>> Actually, I never saw this test failing on my local setup and yesterday >>> I tried running it on the CI where it passed as well. So, I assumed that >>> this would be fixed by some commit. I am not sure which exact commit >>> might have fixed this. >>> >>> Manu, Martin >>> >>> When this was added to the deny list was this failing every time and did >>> you have some reproducer for this. If there is a reproducer, I can try >>> fixing it but when ran normally this test never fails for me. >> >> I think this never worked until >> https://lore.kernel.org/lkml/20230405180250.2046566-1-revest@chromium.org/ >> was merged, FTrace direct calls was blocking tracing programs on ARM, >> since then it has always worked. > > Awesome, thanks! I'll add this to the commit desc then when applying. The commit that added this to the deny list said: 31f4f810d533 ("selftests/bpf: Add fexit_sleep to DENYLIST.aarch64") ``` It is reported that the fexit_sleep never returns in aarch64. The remaining tests cannot start. ``` So, if the lack of Ftrace direct calls would be the reason then the failure would be due to fexit programs not being supported on arm64. But this says that the selftest never returns therefore is not related to ftrace direct call support but another bug? Thanks, Puranjay
Hello: This patch was applied to bpf/bpf-next.git (master) by Daniel Borkmann <daniel@iogearbox.net>: On Fri, 5 Jul 2024 14:50:09 +0000 you wrote: > fexit_sleep test runs successfully now on the CI so remove it from the > deny list. > > Signed-off-by: Puranjay Mohan <puranjay@kernel.org> > --- > tools/testing/selftests/bpf/DENYLIST.aarch64 | 1 - > 1 file changed, 1 deletion(-) Here is the summary with links: - [bpf] selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep https://git.kernel.org/bpf/bpf-next/c/90dc946059b7 You are awesome, thank you!
On 7/8/24 6:42 PM, KP Singh wrote: > On Mon, Jul 8, 2024 at 6:09 PM Daniel Borkmann <daniel@iogearbox.net> wrote: >> On 7/8/24 5:35 PM, Puranjay Mohan wrote: >>> Daniel Borkmann <daniel@iogearbox.net> writes: >>>> On 7/8/24 5:26 PM, KP Singh wrote: >>>>> On Mon, Jul 8, 2024 at 5:00 PM Puranjay Mohan <puranjay@kernel.org> wrote: >>>>>> Daniel Borkmann <daniel@iogearbox.net> writes: >>>>>>> On 7/5/24 4:50 PM, Puranjay Mohan wrote: >>>>>>>> fexit_sleep test runs successfully now on the CI so remove it from the >>>>>>>> deny list. >>>>>>> >>>>>>> Do you happen to know which commit fixed it? If yes, might be nice to have it >>>>>>> documented in the commit message. >>>>>> >>>>>> Actually, I never saw this test failing on my local setup and yesterday >>>>>> I tried running it on the CI where it passed as well. So, I assumed that >>>>>> this would be fixed by some commit. I am not sure which exact commit >>>>>> might have fixed this. >>>>>> >>>>>> Manu, Martin >>>>>> >>>>>> When this was added to the deny list was this failing every time and did >>>>>> you have some reproducer for this. If there is a reproducer, I can try >>>>>> fixing it but when ran normally this test never fails for me. >>>>> >>>>> I think this never worked until >>>>> https://lore.kernel.org/lkml/20230405180250.2046566-1-revest@chromium.org/ >>>>> was merged, FTrace direct calls was blocking tracing programs on ARM, >>>>> since then it has always worked. >>>> >>>> Awesome, thanks! I'll add this to the commit desc then when applying. >>> >>> The commit that added this to the deny list said: >>> 31f4f810d533 ("selftests/bpf: Add fexit_sleep to DENYLIST.aarch64") >>> >>> ``` >>> It is reported that the fexit_sleep never returns in aarch64. >>> The remaining tests cannot start. >>> ``` > > It may also have something to do with sleepable programs. But I think > it's generally in the category of "BPF tracing was catching up with > ARM", it has now. Hm, the latest run actually hangs in fexit_sleep (which is the test right after fexit_bpf2bpf). So looks like this was too early. It seems some CI runs pass on arm64 but others fail: https://github.com/kernel-patches/bpf/actions/runs/9859826851/job/27224868398 (fail) https://github.com/kernel-patches/bpf/actions/runs/9859837213/job/27224955045 (pass) Puranjay, do you have a chance to look into this again? >>> So, if the lack of Ftrace direct calls would be the reason then the >>> failure would be due to fexit programs not being supported on arm64. >>> >>> But this says that the selftest never returns therefore is not related >>> to ftrace direct call support but another bug? >> >> Fwiw, at least it is passing in the BPF CI now. >> >> https://github.com/kernel-patches/bpf/actions/runs/9841781347/job/27169610006
Hi Manu, > > I was able to confirm the fix using the artifacts from https://github.com/kernel-patches/bpf/actions/runs/9905842936 > Thanks > Thanks for testing the fix. This bug has been resolved now but the test still hangs sometimes. Unfortunately, I am not able to reproduce this hang using vmtest. Can you extract some logs from the CI somehow?? If it is hanging in the kernel there should be some soft lockup or RCU lockup related messages. I was talking about this with Kumar and we think that this test is hanging in the userspace in the following loop: while (READ_ONCE(fexit_skel->bss->fentry_cnt) != 2); Could it be that fentry_cnt is > 2 somehow before we reach this?? This is only a random guess though. Thanks, Puranjay
Hi Daniel, Manu I was able to reproduce this issue on KVM and found the root cause for this hang! The other issue that we fixed is unrelated to this hang and doesn't occur on self hosted github runners as they use 48-bit VAs. The userspace test code has: #define STACK_SIZE (1024 * 1024) static char child_stack[STACK_SIZE]; cpid = clone(do_sleep, child_stack + STACK_SIZE, CLONE_FILES | SIGCHLD, fexit_skel); arm64 requires the stack pointer to be 16 byte aligned otherwise SPAlignmentFault occurs, this appears as Bus error in the userspace. The stack provided to the clone system call is not guaranteed to be aligned properly in this selftest. The test hangs on the following line: while (READ_ONCE(fexit_skel->bss->fentry_cnt) != 2); Because the child process is killed due to SPAlignmentFault, the fentry_cnt remains at 0! Reading the man page of clone system call, the correct way to allocate stack for this call is using mmap like this: stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0); This fixes the issue, I will send a patch to use this and once again remove this test from DENYLIST and I hope this time it fixes it for good. > It looks like there is still an issue left. A recent CI run on bpf-next is > still hitting the same on arm64: > > Base: > > https://github.com/kernel-patches/bpf/commits/series/870746%3D%3Ebpf-next/ > > CI: > > https://github.com/kernel-patches/bpf/actions/runs/9905842936/job/27366435436 > > [...] > #89/11 fexit_bpf2bpf/func_replace_global_func:OK > #89/12 fexit_bpf2bpf/fentry_to_cgroup_bpf:OK > #89/13 fexit_bpf2bpf/func_replace_progmap:OK > #89 fexit_bpf2bpf:OK > Error: The operation was canceled. Thanks, Puranjay
On Mon, Jul 15, 2024 at 9:32 AM Puranjay Mohan <puranjay@kernel.org> wrote: > > > Hi Daniel, Manu > I was able to reproduce this issue on KVM and found the root cause for > this hang! The other issue that we fixed is unrelated to this hang and > doesn't occur on self hosted github runners as they use 48-bit VAs. > > The userspace test code has: > > #define STACK_SIZE (1024 * 1024) > static char child_stack[STACK_SIZE]; > > cpid = clone(do_sleep, child_stack + STACK_SIZE, CLONE_FILES | SIGCHLD, fexit_skel); > > arm64 requires the stack pointer to be 16 byte aligned otherwise > SPAlignmentFault occurs, this appears as Bus error in the userspace. > > The stack provided to the clone system call is not guaranteed to be > aligned properly in this selftest. > > The test hangs on the following line: > while (READ_ONCE(fexit_skel->bss->fentry_cnt) != 2); > > Because the child process is killed due to SPAlignmentFault, the > fentry_cnt remains at 0! > > Reading the man page of clone system call, the correct way to allocate > stack for this call is using mmap like this: > > stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0); > > This fixes the issue, I will send a patch to use this and once again > remove this test from DENYLIST and I hope this time it fixes it for good. Wow. Great find. Good to know. prog_tests/ns_current_pid_tgid.c has the same issue probably.
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > On Mon, Jul 15, 2024 at 9:32 AM Puranjay Mohan <puranjay@kernel.org> wrote: >> >> >> Hi Daniel, Manu >> I was able to reproduce this issue on KVM and found the root cause for >> this hang! The other issue that we fixed is unrelated to this hang and >> doesn't occur on self hosted github runners as they use 48-bit VAs. >> >> The userspace test code has: >> >> #define STACK_SIZE (1024 * 1024) >> static char child_stack[STACK_SIZE]; >> >> cpid = clone(do_sleep, child_stack + STACK_SIZE, CLONE_FILES | SIGCHLD, fexit_skel); >> >> arm64 requires the stack pointer to be 16 byte aligned otherwise >> SPAlignmentFault occurs, this appears as Bus error in the userspace. >> >> The stack provided to the clone system call is not guaranteed to be >> aligned properly in this selftest. >> >> The test hangs on the following line: >> while (READ_ONCE(fexit_skel->bss->fentry_cnt) != 2); >> >> Because the child process is killed due to SPAlignmentFault, the >> fentry_cnt remains at 0! >> >> Reading the man page of clone system call, the correct way to allocate >> stack for this call is using mmap like this: >> >> stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0); >> >> This fixes the issue, I will send a patch to use this and once again >> remove this test from DENYLIST and I hope this time it fixes it for good. > > Wow. Great find. Good to know. > prog_tests/ns_current_pid_tgid.c has the same issue probably. Yes, I checked that test as well using gdb and fortunately it gets a 16 byte aligned stack pointer, but this is just luck, so I will send a patch to fix that test as well. Thanks, Puranjay
diff --git a/tools/testing/selftests/bpf/DENYLIST.aarch64 b/tools/testing/selftests/bpf/DENYLIST.aarch64 index e865451e90d2..2bf981c80180 100644 --- a/tools/testing/selftests/bpf/DENYLIST.aarch64 +++ b/tools/testing/selftests/bpf/DENYLIST.aarch64 @@ -1,6 +1,5 @@ bpf_cookie/multi_kprobe_attach_api # kprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3 bpf_cookie/multi_kprobe_link_api # kprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3 -fexit_sleep # The test never returns. The remaining tests cannot start. kprobe_multi_bench_attach # needs CONFIG_FPROBE kprobe_multi_test # needs CONFIG_FPROBE module_attach # prog 'kprobe_multi': failed to auto-attach: -95
fexit_sleep test runs successfully now on the CI so remove it from the deny list. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> --- tools/testing/selftests/bpf/DENYLIST.aarch64 | 1 - 1 file changed, 1 deletion(-)