Message ID | 20201211081903.17857-1-glin@suse.com |
---|---|
State | New |
Headers | show |
Series | bpf,x64: pad NOPs to make images converge more easily | expand |
On Fri, Dec 11, 2020 at 1:13 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > >> + } > >> emit_jmp: > >> if (is_imm8(jmp_offset)) { > >> + if (jmp_padding) > >> + cnt += emit_nops(&prog, INSN_SZ_DIFF - 2); Could you describe all possible numbers of bytes in padding? Is it 0, 2, 4 ? Would be good to add warn_on_once to make sure the number of nops is expected. > >> struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) > >> { > >> struct bpf_binary_header *header = NULL; > >> @@ -1981,6 +1997,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) > >> struct jit_context ctx = {}; > >> bool tmp_blinded = false; > >> bool extra_pass = false; > >> + bool padding = prog->padded; > > > > can this ever be true on assignment? I.e., can the program be jitted twice? > > Yes, progs can be passed into the JIT twice, see also jit_subprogs(). In one of > the earlier patches it would still potentially change the image size a second > time which would break subprogs aka bpf2bpf calls. Right. I think memorized padded flag shouldn't be in sticky bits of the prog structure. It's only needed between the last pass and extra pass for bpf2bpf calls. I think it would be cleaner to keep it in struct x64_jit_data *jit_data. As others have said the selftests are must have. Especially for bpf2bpf calls where one subprog is padded. Thanks!
On Fri, Dec 11, 2020 at 12:58:17PM -0800, Andrii Nakryiko wrote: > On Fri, Dec 11, 2020 at 8:51 AM Gary Lin <glin@suse.com> wrote: > > > > The x64 bpf jit expects bpf images converge within the given passes, but > > it could fail to do so with some corner cases. For example: > > > > l0: ldh [4] > > l1: jeq #0x537d, l2, l40 > > l2: ld [0] > > l3: jeq #0xfa163e0d, l4, l40 > > l4: ldh [12] > > l5: ldx #0xe > > l6: jeq #0x86dd, l41, l7 > > l8: ld [x+16] > > l9: ja 41 > > > > [... repeated ja 41 ] > > > > l40: ja 41 > > l41: ret #0 > > l42: ld #len > > l43: ret a > > > > This bpf program contains 32 "ja 41" instructions which are effectively > > NOPs and designed to be replaced with valid code dynamically. Ideally, > > bpf jit should optimize those "ja 41" instructions out when translating > > the bpf instructions into x86_64 machine code. However, do_jit() can > > only remove one "ja 41" for offset==0 on each pass, so it requires at > > least 32 runs to eliminate those JMPs and exceeds the current limit of > > passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON > > is set even though it's legit as a classic socket filter. > > > > To make the image more likely converge within 20 passes, this commit > > pads some instructions with NOPs in the last 5 passes: > > > > 1. conditional jumps > > A possible size variance comes from the adoption of imm8 JMP. If the > > offset is imm8, we calculate the size difference of this BPF instruction > > between the previous pass and the current pass and fill the gap with NOPs. > > To avoid the recalculation of jump offset, those NOPs are inserted before > > the JMP code, so we have to subtract the 2 bytes of imm8 JMP when > > calculating the NOP number. > > > > 2. BPF_JA > > There are two conditions for BPF_JA. > > a.) nop jumps > > If this instruction is not optimized out in the previous pass, > > instead of removing it, we insert the equivalent size of NOPs. > > b.) label jumps > > Similar to condition jumps, we prepend NOPs right before the JMP > > code. > > > > To make the code concise, emit_nops() is modified to use the signed len and > > return the number of inserted NOPs. > > > > To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog' > > so that bpf_int_jit_compile() could know if the program is padded or not. > > > > Signed-off-by: Gary Lin <glin@suse.com> > > --- > > arch/x86/net/bpf_jit_comp.c | 68 ++++++++++++++++++++++++------------- > > include/linux/filter.h | 1 + > > 2 files changed, 45 insertions(+), 24 deletions(-) > > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c > > index 796506dcfc42..30b81c8539b3 100644 > > --- a/arch/x86/net/bpf_jit_comp.c > > +++ b/arch/x86/net/bpf_jit_comp.c > > @@ -789,8 +789,31 @@ static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt, > > } > > } > > > > +static int emit_nops(u8 **pprog, int len) > > +{ > > + u8 *prog = *pprog; > > + int i, noplen, cnt = 0; > > + > > + while (len > 0) { > > + noplen = len; > > + > > + if (noplen > ASM_NOP_MAX) > > + noplen = ASM_NOP_MAX; > > + > > + for (i = 0; i < noplen; i++) > > + EMIT1(ideal_nops[noplen][i]); > > + len -= noplen; > > + } > > + > > + *pprog = prog; > > + > > + return cnt; > > Isn't cnt always zero? I guess it was supposed to be `cnt = len` at > the beginning? > [Skip this one since Daniel answered in another mail.] > But then it begs the question how this patch was actually tested given > emit_nops() is returning wrong answers? Changes like this should > definitely come with tests. > Before submitting this patch, I lowered PADDING_PASSES to 2 and ran tools/testing/selftests/bpf/test_progs. We surely need some official test cases for this patch. Will do it in v2. > > +} > > + > > +#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp))) > > + > > static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, > > - int oldproglen, struct jit_context *ctx) > > + int oldproglen, struct jit_context *ctx, bool jmp_padding) > > { > > bool tail_call_reachable = bpf_prog->aux->tail_call_reachable; > > struct bpf_insn *insn = bpf_prog->insnsi; > > @@ -1409,6 +1432,8 @@ xadd: if (is_imm8(insn->off)) > > } > > jmp_offset = addrs[i + insn->off] - addrs[i]; > > if (is_imm8(jmp_offset)) { > > + if (jmp_padding) > > + cnt += emit_nops(&prog, INSN_SZ_DIFF - 2); > > EMIT2(jmp_cond, jmp_offset); > > } else if (is_simm32(jmp_offset)) { > > EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset); > > @@ -1431,11 +1456,19 @@ xadd: if (is_imm8(insn->off)) > > else > > jmp_offset = addrs[i + insn->off] - addrs[i]; > > > > - if (!jmp_offset) > > - /* Optimize out nop jumps */ > > + if (!jmp_offset) { > > + /* > > + * If jmp_padding is enabled, the extra nops will > > + * be inserted. Otherwise, optimize out nop jumps. > > + */ > > + if (jmp_padding) > > + cnt += emit_nops(&prog, INSN_SZ_DIFF); > > break; > > + } > > emit_jmp: > > if (is_imm8(jmp_offset)) { > > + if (jmp_padding) > > + cnt += emit_nops(&prog, INSN_SZ_DIFF - 2); > > EMIT2(0xEB, jmp_offset); > > } else if (is_simm32(jmp_offset)) { > > EMIT1_off32(0xE9, jmp_offset); > > @@ -1578,26 +1611,6 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, > > return 0; > > } > > > > -static void emit_nops(u8 **pprog, unsigned int len) > > -{ > > - unsigned int i, noplen; > > - u8 *prog = *pprog; > > - int cnt = 0; > > - > > - while (len > 0) { > > - noplen = len; > > - > > - if (noplen > ASM_NOP_MAX) > > - noplen = ASM_NOP_MAX; > > - > > - for (i = 0; i < noplen; i++) > > - EMIT1(ideal_nops[noplen][i]); > > - len -= noplen; > > - } > > - > > - *pprog = prog; > > -} > > - > > static void emit_align(u8 **pprog, u32 align) > > { > > u8 *target, *prog = *pprog; > > @@ -1972,6 +1985,9 @@ struct x64_jit_data { > > struct jit_context ctx; > > }; > > > > +#define MAX_PASSES 20 > > +#define PADDING_PASSES (MAX_PASSES - 5) > > + > > struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) > > { > > struct bpf_binary_header *header = NULL; > > @@ -1981,6 +1997,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) > > struct jit_context ctx = {}; > > bool tmp_blinded = false; > > bool extra_pass = false; > > + bool padding = prog->padded; > > can this ever be true on assignment? I.e., can the program be jitted twice? > Yes, bpf-to-bpf runs bpf_int_jit_compile twice and it expects to run only one pass in the second time. > > u8 *image = NULL; > > int *addrs; > > int pass; > > @@ -2043,7 +2060,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) > > * pass to emit the final image. > > */ > > for (pass = 0; pass < 20 || image; pass++) { > > - proglen = do_jit(prog, addrs, image, oldproglen, &ctx); > > + if (!padding && pass >= PADDING_PASSES) > > + padding = true; > > Just, unconditionally: > > padding = pass >= PADDING_PASSES; > But this would turn 'padding' from 'true' to 'false' when invoking bpf_int_jit_compile() the second time for bpf-to-bpf, so we still need to do it conditionally. Gary Lin > > + proglen = do_jit(prog, addrs, image, oldproglen, &ctx, padding); > > if (proglen <= 0) { > > out_image: > > image = NULL; > > @@ -2101,6 +2120,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) > > prog->bpf_func = (void *)image; > > prog->jited = 1; > > prog->jited_len = proglen; > > + prog->padded = padding; > > } else { > > prog = orig_prog; > > } > > diff --git a/include/linux/filter.h b/include/linux/filter.h > > index 1b62397bd124..cb7ce2b3737a 100644 > > --- a/include/linux/filter.h > > +++ b/include/linux/filter.h > > @@ -531,6 +531,7 @@ struct bpf_prog { > > dst_needed:1, /* Do we need dst entry? */ > > blinded:1, /* Was blinded */ > > is_func:1, /* program is a bpf function */ > > + padded:1, /* jitted image was padded */ > > kprobe_override:1, /* Do we override a kprobe? */ > > has_callchain_buf:1, /* callchain buffer allocated? */ > > enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */ > > -- > > 2.29.2 > > >
On Fri, Dec 11, 2020 at 09:05:05PM +0100, Daniel Borkmann wrote: > On 12/11/20 9:19 AM, Gary Lin wrote: > > The x64 bpf jit expects bpf images converge within the given passes, but > > it could fail to do so with some corner cases. For example: > > > > l0: ldh [4] > > l1: jeq #0x537d, l2, l40 > > l2: ld [0] > > l3: jeq #0xfa163e0d, l4, l40 > > l4: ldh [12] > > l5: ldx #0xe > > l6: jeq #0x86dd, l41, l7 > > l8: ld [x+16] > > l9: ja 41 > > > > [... repeated ja 41 ] > > > > l40: ja 41 > > l41: ret #0 > > l42: ld #len > > l43: ret a > > > > This bpf program contains 32 "ja 41" instructions which are effectively > > NOPs and designed to be replaced with valid code dynamically. Ideally, > > bpf jit should optimize those "ja 41" instructions out when translating > > the bpf instructions into x86_64 machine code. However, do_jit() can > > only remove one "ja 41" for offset==0 on each pass, so it requires at > > least 32 runs to eliminate those JMPs and exceeds the current limit of > > passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON > > is set even though it's legit as a classic socket filter. > > > > To make the image more likely converge within 20 passes, this commit > > pads some instructions with NOPs in the last 5 passes: > > > > 1. conditional jumps > > A possible size variance comes from the adoption of imm8 JMP. If the > > offset is imm8, we calculate the size difference of this BPF instruction > > between the previous pass and the current pass and fill the gap with NOPs. > > To avoid the recalculation of jump offset, those NOPs are inserted before > > the JMP code, so we have to subtract the 2 bytes of imm8 JMP when > > calculating the NOP number. > > > > 2. BPF_JA > > There are two conditions for BPF_JA. > > a.) nop jumps > > If this instruction is not optimized out in the previous pass, > > instead of removing it, we insert the equivalent size of NOPs. > > b.) label jumps > > Similar to condition jumps, we prepend NOPs right before the JMP > > code. > > > > To make the code concise, emit_nops() is modified to use the signed len and > > return the number of inserted NOPs. > > > > To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog' > > so that bpf_int_jit_compile() could know if the program is padded or not. > > Please also add multiple hand-crafted test cases e.g. for bpf-to-bpf calls into > test_verifier (which is part of bpf kselftests) that would exercise this corner > case in x86 jit where we would start to nop pad so that there is proper coverage, > too. > The corner case I had in the commit description is likely being rejected by the verifier because most of those "ja 41" are unreachable instructions. Is there any known test case that needs more than 15 passes in x86 jit? Gary Lin
On Fri, Dec 11, 2020 at 06:24:47PM -0800, Alexei Starovoitov wrote: > On Fri, Dec 11, 2020 at 1:13 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > > >> + } > > >> emit_jmp: > > >> if (is_imm8(jmp_offset)) { > > >> + if (jmp_padding) > > >> + cnt += emit_nops(&prog, INSN_SZ_DIFF - 2); > > Could you describe all possible numbers of bytes in padding? > Is it 0, 2, 4 ? > Would be good to add warn_on_once to make sure the number > of nops is expected. > For the conditional jumps, it could be 0 or 4. As for nop jumps, it may be 0, 2, or 5. For the pure jumps, 0 or 3. Will add the warning in the next version. > > >> struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) > > >> { > > >> struct bpf_binary_header *header = NULL; > > >> @@ -1981,6 +1997,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) > > >> struct jit_context ctx = {}; > > >> bool tmp_blinded = false; > > >> bool extra_pass = false; > > >> + bool padding = prog->padded; > > > > > > can this ever be true on assignment? I.e., can the program be jitted twice? > > > > Yes, progs can be passed into the JIT twice, see also jit_subprogs(). In one of > > the earlier patches it would still potentially change the image size a second > > time which would break subprogs aka bpf2bpf calls. > > Right. I think memorized padded flag shouldn't be in sticky bits > of the prog structure. > It's only needed between the last pass and extra pass for bpf2bpf calls. > I think it would be cleaner to keep it in struct x64_jit_data *jit_data. > Okay, jit_data is surely a better place for the flag. > As others have said the selftests are must have. > Especially for bpf2bpf calls where one subprog is padded. > Will try to craft some test cases for this patch in v2. Gary Lin
On Mon, Dec 14, 2020 at 11:56:22AM +0800, Gary Lin wrote: > On Fri, Dec 11, 2020 at 09:05:05PM +0100, Daniel Borkmann wrote: > > On 12/11/20 9:19 AM, Gary Lin wrote: > > > The x64 bpf jit expects bpf images converge within the given passes, but > > > it could fail to do so with some corner cases. For example: > > > > > > l0: ldh [4] > > > l1: jeq #0x537d, l2, l40 > > > l2: ld [0] > > > l3: jeq #0xfa163e0d, l4, l40 > > > l4: ldh [12] > > > l5: ldx #0xe > > > l6: jeq #0x86dd, l41, l7 > > > l8: ld [x+16] > > > l9: ja 41 > > > > > > [... repeated ja 41 ] > > > > > > l40: ja 41 > > > l41: ret #0 > > > l42: ld #len > > > l43: ret a > > > > > > This bpf program contains 32 "ja 41" instructions which are effectively > > > NOPs and designed to be replaced with valid code dynamically. Ideally, > > > bpf jit should optimize those "ja 41" instructions out when translating > > > the bpf instructions into x86_64 machine code. However, do_jit() can > > > only remove one "ja 41" for offset==0 on each pass, so it requires at > > > least 32 runs to eliminate those JMPs and exceeds the current limit of > > > passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON > > > is set even though it's legit as a classic socket filter. > > > > > > To make the image more likely converge within 20 passes, this commit > > > pads some instructions with NOPs in the last 5 passes: > > > > > > 1. conditional jumps > > > A possible size variance comes from the adoption of imm8 JMP. If the > > > offset is imm8, we calculate the size difference of this BPF instruction > > > between the previous pass and the current pass and fill the gap with NOPs. > > > To avoid the recalculation of jump offset, those NOPs are inserted before > > > the JMP code, so we have to subtract the 2 bytes of imm8 JMP when > > > calculating the NOP number. > > > > > > 2. BPF_JA > > > There are two conditions for BPF_JA. > > > a.) nop jumps > > > If this instruction is not optimized out in the previous pass, > > > instead of removing it, we insert the equivalent size of NOPs. > > > b.) label jumps > > > Similar to condition jumps, we prepend NOPs right before the JMP > > > code. > > > > > > To make the code concise, emit_nops() is modified to use the signed len and > > > return the number of inserted NOPs. > > > > > > To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog' > > > so that bpf_int_jit_compile() could know if the program is padded or not. > > > > Please also add multiple hand-crafted test cases e.g. for bpf-to-bpf calls into > > test_verifier (which is part of bpf kselftests) that would exercise this corner > > case in x86 jit where we would start to nop pad so that there is proper coverage, > > too. > > > The corner case I had in the commit description is likely being rejected by > the verifier because most of those "ja 41" are unreachable instructions. > Is there any known test case that needs more than 15 passes in x86 jit? > Just an idea. Besides the mentioned corner case, how about making PADDING_PASSES dynamically configurable (sysfs?) and reusing the existing test cases? So that we can have a script to set PADDING_PASSES from 1 to 20 and run the bpf selftests separately. This guarantees that the padding strategy will be applied at least in a certain PADDING_PASSES settings. Gary Lin
On 12/14/20 9:15 AM, Gary Lin wrote: > On Mon, Dec 14, 2020 at 11:56:22AM +0800, Gary Lin wrote: >> On Fri, Dec 11, 2020 at 09:05:05PM +0100, Daniel Borkmann wrote: >>> On 12/11/20 9:19 AM, Gary Lin wrote: >>>> The x64 bpf jit expects bpf images converge within the given passes, but >>>> it could fail to do so with some corner cases. For example: >>>> >>>> l0: ldh [4] >>>> l1: jeq #0x537d, l2, l40 >>>> l2: ld [0] >>>> l3: jeq #0xfa163e0d, l4, l40 >>>> l4: ldh [12] >>>> l5: ldx #0xe >>>> l6: jeq #0x86dd, l41, l7 >>>> l8: ld [x+16] >>>> l9: ja 41 >>>> >>>> [... repeated ja 41 ] >>>> >>>> l40: ja 41 >>>> l41: ret #0 >>>> l42: ld #len >>>> l43: ret a >>>> >>>> This bpf program contains 32 "ja 41" instructions which are effectively >>>> NOPs and designed to be replaced with valid code dynamically. Ideally, >>>> bpf jit should optimize those "ja 41" instructions out when translating >>>> the bpf instructions into x86_64 machine code. However, do_jit() can >>>> only remove one "ja 41" for offset==0 on each pass, so it requires at >>>> least 32 runs to eliminate those JMPs and exceeds the current limit of >>>> passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON >>>> is set even though it's legit as a classic socket filter. >>>> >>>> To make the image more likely converge within 20 passes, this commit >>>> pads some instructions with NOPs in the last 5 passes: >>>> >>>> 1. conditional jumps >>>> A possible size variance comes from the adoption of imm8 JMP. If the >>>> offset is imm8, we calculate the size difference of this BPF instruction >>>> between the previous pass and the current pass and fill the gap with NOPs. >>>> To avoid the recalculation of jump offset, those NOPs are inserted before >>>> the JMP code, so we have to subtract the 2 bytes of imm8 JMP when >>>> calculating the NOP number. >>>> >>>> 2. BPF_JA >>>> There are two conditions for BPF_JA. >>>> a.) nop jumps >>>> If this instruction is not optimized out in the previous pass, >>>> instead of removing it, we insert the equivalent size of NOPs. >>>> b.) label jumps >>>> Similar to condition jumps, we prepend NOPs right before the JMP >>>> code. >>>> >>>> To make the code concise, emit_nops() is modified to use the signed len and >>>> return the number of inserted NOPs. >>>> >>>> To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog' >>>> so that bpf_int_jit_compile() could know if the program is padded or not. >>> >>> Please also add multiple hand-crafted test cases e.g. for bpf-to-bpf calls into >>> test_verifier (which is part of bpf kselftests) that would exercise this corner >>> case in x86 jit where we would start to nop pad so that there is proper coverage, >>> too. >>> >> The corner case I had in the commit description is likely being rejected by >> the verifier because most of those "ja 41" are unreachable instructions. >> Is there any known test case that needs more than 15 passes in x86 jit? >> > Just an idea. Besides the mentioned corner case, how about making > PADDING_PASSES dynamically configurable (sysfs?) and reusing the existing > test cases? So that we can have a script to set PADDING_PASSES from 1 to 20 > and run the bpf selftests separately. This guarantees that the padding > strategy will be applied at least in a certain PADDING_PASSES settings. I think exposing such implementation detail to users is not that great as they normally should not need to worry about these things (plus it's also rarely hit in practice when developing against llvm). On top of all that, such knob would have no meaning in case of other JITs since most other non-x86 ones have a fixed number of passes. I think it's probably useful for local testing of the fix, but less suitable for exposing as sysctl 'uapi' upstream. Re crafting a test case for bpf-2-bpf calls, you could orientate on bpf_fill_maxinsns10() in lib/test_bpf.c which is also triggering a high number of passes, port it over to test_verifier from selftests and experiment from there to integrate calls. Thanks, Daniel
On Mon, Dec 14, 2020 at 04:31:44PM +0100, Daniel Borkmann wrote: > On 12/14/20 9:15 AM, Gary Lin wrote: > > On Mon, Dec 14, 2020 at 11:56:22AM +0800, Gary Lin wrote: > > > On Fri, Dec 11, 2020 at 09:05:05PM +0100, Daniel Borkmann wrote: > > > > On 12/11/20 9:19 AM, Gary Lin wrote: > > > > > The x64 bpf jit expects bpf images converge within the given passes, but > > > > > it could fail to do so with some corner cases. For example: > > > > > > > > > > l0: ldh [4] > > > > > l1: jeq #0x537d, l2, l40 > > > > > l2: ld [0] > > > > > l3: jeq #0xfa163e0d, l4, l40 > > > > > l4: ldh [12] > > > > > l5: ldx #0xe > > > > > l6: jeq #0x86dd, l41, l7 > > > > > l8: ld [x+16] > > > > > l9: ja 41 > > > > > > > > > > [... repeated ja 41 ] > > > > > > > > > > l40: ja 41 > > > > > l41: ret #0 > > > > > l42: ld #len > > > > > l43: ret a > > > > > > > > > > This bpf program contains 32 "ja 41" instructions which are effectively > > > > > NOPs and designed to be replaced with valid code dynamically. Ideally, > > > > > bpf jit should optimize those "ja 41" instructions out when translating > > > > > the bpf instructions into x86_64 machine code. However, do_jit() can > > > > > only remove one "ja 41" for offset==0 on each pass, so it requires at > > > > > least 32 runs to eliminate those JMPs and exceeds the current limit of > > > > > passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON > > > > > is set even though it's legit as a classic socket filter. > > > > > > > > > > To make the image more likely converge within 20 passes, this commit > > > > > pads some instructions with NOPs in the last 5 passes: > > > > > > > > > > 1. conditional jumps > > > > > A possible size variance comes from the adoption of imm8 JMP. If the > > > > > offset is imm8, we calculate the size difference of this BPF instruction > > > > > between the previous pass and the current pass and fill the gap with NOPs. > > > > > To avoid the recalculation of jump offset, those NOPs are inserted before > > > > > the JMP code, so we have to subtract the 2 bytes of imm8 JMP when > > > > > calculating the NOP number. > > > > > > > > > > 2. BPF_JA > > > > > There are two conditions for BPF_JA. > > > > > a.) nop jumps > > > > > If this instruction is not optimized out in the previous pass, > > > > > instead of removing it, we insert the equivalent size of NOPs. > > > > > b.) label jumps > > > > > Similar to condition jumps, we prepend NOPs right before the JMP > > > > > code. > > > > > > > > > > To make the code concise, emit_nops() is modified to use the signed len and > > > > > return the number of inserted NOPs. > > > > > > > > > > To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog' > > > > > so that bpf_int_jit_compile() could know if the program is padded or not. > > > > > > > > Please also add multiple hand-crafted test cases e.g. for bpf-to-bpf calls into > > > > test_verifier (which is part of bpf kselftests) that would exercise this corner > > > > case in x86 jit where we would start to nop pad so that there is proper coverage, > > > > too. > > > > > > > The corner case I had in the commit description is likely being rejected by > > > the verifier because most of those "ja 41" are unreachable instructions. > > > Is there any known test case that needs more than 15 passes in x86 jit? > > > > > Just an idea. Besides the mentioned corner case, how about making > > PADDING_PASSES dynamically configurable (sysfs?) and reusing the existing > > test cases? So that we can have a script to set PADDING_PASSES from 1 to 20 > > and run the bpf selftests separately. This guarantees that the padding > > strategy will be applied at least in a certain PADDING_PASSES settings. > > I think exposing such implementation detail to users is not that great as they > normally should not need to worry about these things (plus it's also rarely hit > in practice when developing against llvm). On top of all that, such knob would > have no meaning in case of other JITs since most other non-x86 ones have a fixed > number of passes. I think it's probably useful for local testing of the fix, but > less suitable for exposing as sysctl 'uapi' upstream. Re crafting a test case for > bpf-2-bpf calls, you could orientate on bpf_fill_maxinsns10() in lib/test_bpf.c > which is also triggering a high number of passes, port it over to test_verifier > from selftests and experiment from there to integrate calls. > Thanks for the hint. Will try bpf_fill_maxinsns10(). Gary Lin
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 796506dcfc42..30b81c8539b3 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -789,8 +789,31 @@ static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt, } } +static int emit_nops(u8 **pprog, int len) +{ + u8 *prog = *pprog; + int i, noplen, cnt = 0; + + while (len > 0) { + noplen = len; + + if (noplen > ASM_NOP_MAX) + noplen = ASM_NOP_MAX; + + for (i = 0; i < noplen; i++) + EMIT1(ideal_nops[noplen][i]); + len -= noplen; + } + + *pprog = prog; + + return cnt; +} + +#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp))) + static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, - int oldproglen, struct jit_context *ctx) + int oldproglen, struct jit_context *ctx, bool jmp_padding) { bool tail_call_reachable = bpf_prog->aux->tail_call_reachable; struct bpf_insn *insn = bpf_prog->insnsi; @@ -1409,6 +1432,8 @@ xadd: if (is_imm8(insn->off)) } jmp_offset = addrs[i + insn->off] - addrs[i]; if (is_imm8(jmp_offset)) { + if (jmp_padding) + cnt += emit_nops(&prog, INSN_SZ_DIFF - 2); EMIT2(jmp_cond, jmp_offset); } else if (is_simm32(jmp_offset)) { EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset); @@ -1431,11 +1456,19 @@ xadd: if (is_imm8(insn->off)) else jmp_offset = addrs[i + insn->off] - addrs[i]; - if (!jmp_offset) - /* Optimize out nop jumps */ + if (!jmp_offset) { + /* + * If jmp_padding is enabled, the extra nops will + * be inserted. Otherwise, optimize out nop jumps. + */ + if (jmp_padding) + cnt += emit_nops(&prog, INSN_SZ_DIFF); break; + } emit_jmp: if (is_imm8(jmp_offset)) { + if (jmp_padding) + cnt += emit_nops(&prog, INSN_SZ_DIFF - 2); EMIT2(0xEB, jmp_offset); } else if (is_simm32(jmp_offset)) { EMIT1_off32(0xE9, jmp_offset); @@ -1578,26 +1611,6 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, return 0; } -static void emit_nops(u8 **pprog, unsigned int len) -{ - unsigned int i, noplen; - u8 *prog = *pprog; - int cnt = 0; - - while (len > 0) { - noplen = len; - - if (noplen > ASM_NOP_MAX) - noplen = ASM_NOP_MAX; - - for (i = 0; i < noplen; i++) - EMIT1(ideal_nops[noplen][i]); - len -= noplen; - } - - *pprog = prog; -} - static void emit_align(u8 **pprog, u32 align) { u8 *target, *prog = *pprog; @@ -1972,6 +1985,9 @@ struct x64_jit_data { struct jit_context ctx; }; +#define MAX_PASSES 20 +#define PADDING_PASSES (MAX_PASSES - 5) + struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) { struct bpf_binary_header *header = NULL; @@ -1981,6 +1997,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) struct jit_context ctx = {}; bool tmp_blinded = false; bool extra_pass = false; + bool padding = prog->padded; u8 *image = NULL; int *addrs; int pass; @@ -2043,7 +2060,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) * pass to emit the final image. */ for (pass = 0; pass < 20 || image; pass++) { - proglen = do_jit(prog, addrs, image, oldproglen, &ctx); + if (!padding && pass >= PADDING_PASSES) + padding = true; + proglen = do_jit(prog, addrs, image, oldproglen, &ctx, padding); if (proglen <= 0) { out_image: image = NULL; @@ -2101,6 +2120,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) prog->bpf_func = (void *)image; prog->jited = 1; prog->jited_len = proglen; + prog->padded = padding; } else { prog = orig_prog; } diff --git a/include/linux/filter.h b/include/linux/filter.h index 1b62397bd124..cb7ce2b3737a 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -531,6 +531,7 @@ struct bpf_prog { dst_needed:1, /* Do we need dst entry? */ blinded:1, /* Was blinded */ is_func:1, /* program is a bpf function */ + padded:1, /* jitted image was padded */ kprobe_override:1, /* Do we override a kprobe? */ has_callchain_buf:1, /* callchain buffer allocated? */ enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
The x64 bpf jit expects bpf images converge within the given passes, but it could fail to do so with some corner cases. For example: l0: ldh [4] l1: jeq #0x537d, l2, l40 l2: ld [0] l3: jeq #0xfa163e0d, l4, l40 l4: ldh [12] l5: ldx #0xe l6: jeq #0x86dd, l41, l7 l8: ld [x+16] l9: ja 41 [... repeated ja 41 ] l40: ja 41 l41: ret #0 l42: ld #len l43: ret a This bpf program contains 32 "ja 41" instructions which are effectively NOPs and designed to be replaced with valid code dynamically. Ideally, bpf jit should optimize those "ja 41" instructions out when translating the bpf instructions into x86_64 machine code. However, do_jit() can only remove one "ja 41" for offset==0 on each pass, so it requires at least 32 runs to eliminate those JMPs and exceeds the current limit of passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON is set even though it's legit as a classic socket filter. To make the image more likely converge within 20 passes, this commit pads some instructions with NOPs in the last 5 passes: 1. conditional jumps A possible size variance comes from the adoption of imm8 JMP. If the offset is imm8, we calculate the size difference of this BPF instruction between the previous pass and the current pass and fill the gap with NOPs. To avoid the recalculation of jump offset, those NOPs are inserted before the JMP code, so we have to subtract the 2 bytes of imm8 JMP when calculating the NOP number. 2. BPF_JA There are two conditions for BPF_JA. a.) nop jumps If this instruction is not optimized out in the previous pass, instead of removing it, we insert the equivalent size of NOPs. b.) label jumps Similar to condition jumps, we prepend NOPs right before the JMP code. To make the code concise, emit_nops() is modified to use the signed len and return the number of inserted NOPs. To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog' so that bpf_int_jit_compile() could know if the program is padded or not. Signed-off-by: Gary Lin <glin@suse.com> --- arch/x86/net/bpf_jit_comp.c | 68 ++++++++++++++++++++++++------------- include/linux/filter.h | 1 + 2 files changed, 45 insertions(+), 24 deletions(-)