[net,v8,0/3] fix packet stuck problem for lockless qdisc

Message ID	1620959218-17250-1-git-send-email-linyunsheng@huawei.com
Headers	show Return-Path: <netdev-owner@kernel.org> From: Yunsheng Lin <linyunsheng@huawei.com> To: <davem@davemloft.net>, <kuba@kernel.org> CC: <olteanv@gmail.com>, <ast@kernel.org>, <daniel@iogearbox.net>, <andriin@fb.com>, <edumazet@google.com>, <weiwan@google.com>, <cong.wang@bytedance.com>, <ap420073@gmail.com>, <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <linuxarm@openeuler.org>, <mkl@pengutronix.de>, <linux-can@vger.kernel.org>, <jhs@mojatatu.com>, <xiyou.wangcong@gmail.com>, <jiri@resnulli.us>, <andrii@kernel.org>, <kafai@fb.com>, <songliubraving@fb.com>, <yhs@fb.com>, <john.fastabend@gmail.com>, <kpsingh@kernel.org>, <bpf@vger.kernel.org>, <jonas.bonn@netrounds.com>, <pabeni@redhat.com>, <mzhivich@akamai.com>, <johunt@akamai.com>, <albcamus@gmail.com>, <kehuan.feng@gmail.com>, <a.fatoum@pengutronix.de>, <atenart@kernel.org>, <alexander.duyck@gmail.com>, <hdanton@sina.com>, <jgross@suse.com>, <JKosina@suse.com>, <mkubecek@suse.cz>, <bjorn@kernel.org>, <alobakin@pm.me> Subject: [PATCH net v8 0/3] fix packet stuck problem for lockless qdisc Date: Fri, 14 May 2021 10:26:55 +0800 Message-ID: <1620959218-17250-1-git-send-email-linyunsheng@huawei.com> MIME-Version: 1.0 Content-Type: text/plain Precedence: bulk
Series	fix packet stuck problem for lockless qdisc \| expand [net,v8,0/3] fix packet stuck problem for lockless qdisc [net,v8,1/3] net: sched: fix packet stuck problem for lockless qdisc [net,v8,2/3] net: sched: fix tx action rescheduling issue during deactivation [net,v8,3/3] net: sched: fix tx action reschedule issue with stopped queue

Message ID

1620959218-17250-1-git-send-email-linyunsheng@huawei.com

Headers

From: Yunsheng Lin <linyunsheng@huawei.com>
To: <davem@davemloft.net>, <kuba@kernel.org>
CC: <olteanv@gmail.com>, <ast@kernel.org>, <daniel@iogearbox.net>,
	<andriin@fb.com>, <edumazet@google.com>, <weiwan@google.com>,
	<cong.wang@bytedance.com>, <ap420073@gmail.com>,
	<netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linuxarm@openeuler.org>, <mkl@pengutronix.de>,
	<linux-can@vger.kernel.org>, <jhs@mojatatu.com>,
	<xiyou.wangcong@gmail.com>, <jiri@resnulli.us>,
	<andrii@kernel.org>, <kafai@fb.com>, <songliubraving@fb.com>,
	<yhs@fb.com>, <john.fastabend@gmail.com>, <kpsingh@kernel.org>,
	<bpf@vger.kernel.org>, <jonas.bonn@netrounds.com>,
	<pabeni@redhat.com>, <mzhivich@akamai.com>, <johunt@akamai.com>,
	<albcamus@gmail.com>, <kehuan.feng@gmail.com>,
	<a.fatoum@pengutronix.de>, <atenart@kernel.org>,
	<alexander.duyck@gmail.com>, <hdanton@sina.com>, <jgross@suse.com>,
	<JKosina@suse.com>, <mkubecek@suse.cz>, <bjorn@kernel.org>,
	<alobakin@pm.me>
Subject: [PATCH net v8 0/3] fix packet stuck problem for lockless qdisc
Date: Fri, 14 May 2021 10:26:55 +0800
Message-ID: <1620959218-17250-1-git-send-email-linyunsheng@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain
Precedence: bulk

Series

fix packet stuck problem for lockless qdisc | expand

Message

Yunsheng Lin May 14, 2021, 2:26 a.m. UTC

This patchset fixes the packet stuck problem mentioned in [1].

Patch 1: Add STATE_MISSED flag to fix packet stuck problem.
Patch 2: Fix a tx_action rescheduling problem after STATE_MISSED
         flag is added in patch 1.
Patch 3: Fix the significantly higher CPU consumption problem when
         multiple threads are competing on a saturated outgoing
         device.

V8: Change function name in patch 3 as suggested by Jakub, adjust
    commit log in patch 2, and add Acked-by from Jakub.
V7: Fix netif_tx_wake_queue() data race noted by Jakub.
V6: Some performance optimization in patch 1 suggested by Jakub
    and drop NET_XMIT_DROP checking in patch 3.
V5: add patch 3 to fix the problem reported by Michal Kubecek.
V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED and add patch 2.

[1]. https://lkml.org/lkml/2019/10/9/42

Yunsheng Lin (3):
  net: sched: fix packet stuck problem for lockless qdisc
  net: sched: fix tx action rescheduling issue during deactivation
  net: sched: fix tx action reschedule issue with stopped queue

 include/net/pkt_sched.h   |  7 +------
 include/net/sch_generic.h | 35 ++++++++++++++++++++++++++++++++-
 net/core/dev.c            | 29 ++++++++++++++++++++++-----
 net/sched/sch_generic.c   | 50 +++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 107 insertions(+), 14 deletions(-)

Comments

Cong Wang May 14, 2021, 11:36 p.m. UTC | #1

On Thu, May 13, 2021 at 7:27 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>  struct qdisc_size_table {
> @@ -159,8 +160,33 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
>         if (qdisc->flags & TCQ_F_NOLOCK) {
> +               if (spin_trylock(&qdisc->seqlock))
> +                       goto nolock_empty;
> +
> +               /* If the MISSED flag is set, it means other thread has
> +                * set the MISSED flag before second spin_trylock(), so
> +                * we can return false here to avoid multi cpus doing
> +                * the set_bit() and second spin_trylock() concurrently.
> +                */
> +               if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> +                       return false;
> +
> +               /* Set the MISSED flag before the second spin_trylock(),
> +                * if the second spin_trylock() return false, it means
> +                * other cpu holding the lock will do dequeuing for us
> +                * or it will see the MISSED flag set after releasing
> +                * lock and reschedule the net_tx_action() to do the
> +                * dequeuing.
> +                */
> +               set_bit(__QDISC_STATE_MISSED, &qdisc->state);
> +
> +               /* Retry again in case other CPU may not see the new flag
> +                * after it releases the lock at the end of qdisc_run_end().
> +                */
>                 if (!spin_trylock(&qdisc->seqlock))
>                         return false;
> +
> +nolock_empty:
>                 WRITE_ONCE(qdisc->empty, false);
>         } else if (qdisc_is_running(qdisc)) {
>                 return false;
> @@ -176,8 +202,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
>         write_seqcount_end(&qdisc->running);
> -       if (qdisc->flags & TCQ_F_NOLOCK)
> +       if (qdisc->flags & TCQ_F_NOLOCK) {
>                 spin_unlock(&qdisc->seqlock);
> +
> +               if (unlikely(test_bit(__QDISC_STATE_MISSED,
> +                                     &qdisc->state))) {
> +                       clear_bit(__QDISC_STATE_MISSED, &qdisc->state);


We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
is not.


> +                       __netif_schedule(qdisc);
> +               }
> +       }
>  }
>
>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 44991ea..795d986 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  {
>         struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>         struct sk_buff *skb = NULL;
> +       bool need_retry = true;
>         int band;
>
> +retry:
>         for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>                 struct skb_array *q = band2list(priv, band);
>
> @@ -652,6 +654,23 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>         }
>         if (likely(skb)) {
>                 qdisc_update_stats_at_dequeue(qdisc, skb);
> +       } else if (need_retry &&
> +                  test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
> +               /* Delay clearing the STATE_MISSED here to reduce
> +                * the overhead of the second spin_trylock() in
> +                * qdisc_run_begin() and __netif_schedule() calling
> +                * in qdisc_run_end().
> +                */
> +               clear_bit(__QDISC_STATE_MISSED, &qdisc->state);

Ditto.

> +
> +               /* Make sure dequeuing happens after clearing
> +                * STATE_MISSED.
> +                */
> +               smp_mb__after_atomic();
> +
> +               need_retry = false;
> +
> +               goto retry;

Two concurrent pfifo_fast_dequeue() would possibly retry it at the
same time when they test __QDISC_STATE_MISSED at the same
time and get true. Is this a problem?

Also, any reason why you want pfifo_fast to handle a generic
Qdisc flag? IOW, why not handle this logic in, for example,
qdisc_restart()?

Thanks.

Jakub Kicinski May 15, 2021, 12:17 a.m. UTC | #2

On Fri, 14 May 2021 16:57:29 -0700 Cong Wang wrote:
> On Fri, May 14, 2021 at 4:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:  
>  [...]  
> > >
> > > We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()
> > > is not.  
> >
> > It doesn't have to be atomic, right? I asked to split the test because
> > test_and_clear is a locked op on x86, test by itself is not.  
> 
> It depends on whether you expect the code under the true condition
> to run once or multiple times, something like:
> 
> if (test_bit()) {
>   clear_bit();
>   // this code may run multiple times
> }
> 
> With the atomic test_and_clear_bit(), it only runs once:
> 
> if (test_and_clear_bit()) {
>   // this code runs once
> }
> 
> This is why __netif_schedule() uses test_and_set_bit() instead of
> test_bit()+set_bit().

Thanks, makes sense, so hopefully the MISSED-was-set case is not common
and we can depend on __netif_schedule() to DTRT, avoiding the atomic op
in the common case.

Yunsheng Lin May 15, 2021, 2:25 a.m. UTC | #3

On 2021/5/15 8:17, Jakub Kicinski wrote:
> On Fri, 14 May 2021 16:57:29 -0700 Cong Wang wrote:

>> On Fri, May 14, 2021 at 4:39 PM Jakub Kicinski <kuba@kernel.org> wrote:

>>>

>>> On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:  

>>  [...]  

>>>>

>>>> We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()

>>>> is not.  

>>>

>>> It doesn't have to be atomic, right? I asked to split the test because

>>> test_and_clear is a locked op on x86, test by itself is not.  

>>

>> It depends on whether you expect the code under the true condition

>> to run once or multiple times, something like:

>>

>> if (test_bit()) {

>>   clear_bit();

>>   // this code may run multiple times

>> }

>>

>> With the atomic test_and_clear_bit(), it only runs once:

>>

>> if (test_and_clear_bit()) {

>>   // this code runs once

>> }


I am not sure if the above really matter when the test and clear
does not need to be atomic.

In order for the above to happens, the MISSED has to set between
test and clear, right?

>>

>> This is why __netif_schedule() uses test_and_set_bit() instead of

>> test_bit()+set_bit().


I think test_and_set_bit() is needed in __netif_schedule() mainly
because STATE_SCHED is also used to indicate if the qdisc is in
sd->output_queue, so it has to be atomic.

> 

> Thanks, makes sense, so hopefully the MISSED-was-set case is not common

> and we can depend on __netif_schedule() to DTRT, avoiding the atomic op

> in the common case.

> 

> .

>

Cong Wang May 18, 2021, 12:49 a.m. UTC | #4

On Fri, May 14, 2021 at 7:25 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>

> On 2021/5/15 8:17, Jakub Kicinski wrote:

> > On Fri, 14 May 2021 16:57:29 -0700 Cong Wang wrote:

> >> On Fri, May 14, 2021 at 4:39 PM Jakub Kicinski <kuba@kernel.org> wrote:

> >>>

> >>> On Fri, 14 May 2021 16:36:16 -0700 Cong Wang wrote:

> >>  [...]

> >>>>

> >>>> We have test_and_clear_bit() which is atomic, test_bit()+clear_bit()

> >>>> is not.

> >>>

> >>> It doesn't have to be atomic, right? I asked to split the test because

> >>> test_and_clear is a locked op on x86, test by itself is not.

> >>

> >> It depends on whether you expect the code under the true condition

> >> to run once or multiple times, something like:

> >>

> >> if (test_bit()) {

> >>   clear_bit();

> >>   // this code may run multiple times

> >> }

> >>

> >> With the atomic test_and_clear_bit(), it only runs once:

> >>

> >> if (test_and_clear_bit()) {

> >>   // this code runs once

> >> }

>

> I am not sure if the above really matter when the test and clear

> does not need to be atomic.

>

> In order for the above to happens, the MISSED has to set between

> test and clear, right?


Nope, see the following:

// MISSED bit is already set
CPU0                            CPU1
if (test_bit(MISSED) ( //true
                                if (test_bit(MISSED)) { // also true
        clear_bit(MISSED);
        do_something();
                                        clear_bit(MISSED);
                                        do_something();
                                }
}

Now do_something() is called twice instead of once. This may or may
not be a problem, hence I asked this question.

>

> >>

> >> This is why __netif_schedule() uses test_and_set_bit() instead of

> >> test_bit()+set_bit().

>

> I think test_and_set_bit() is needed in __netif_schedule() mainly

> because STATE_SCHED is also used to indicate if the qdisc is in

> sd->output_queue, so it has to be atomic.


If you replace the "do_something()" above with __netif_reschedule(),
this is exactly my point. An entry can not be inserted twice into a
list, hence it should never be called twice like above.

Thanks.