
scsi: core: move scsi_host_busy() out of host lock for waking up EH handler

Message ID 20240112070000.4161982-1-ming.lei@redhat.com
State New
Series scsi: core: move scsi_host_busy() out of host lock for waking up EH handler

Commit Message

Ming Lei Jan. 12, 2024, 7 a.m. UTC
Inside scsi_eh_wakeup(), scsi_host_busy() is called & checked with the host lock
every time to decide if the error handler kthread needs to be woken up.

This can be too heavy during recovery, for example:

- N hardware queues
- queue depth is M for each hardware queue
- each scsi_host_busy() call iterates over (N * M) tags/requests

If recovery is triggered while all requests are in-flight, each
scsi_eh_wakeup() is strictly serialized; by the time scsi_eh_wakeup() is
called for the last in-flight request, scsi_host_busy() has already run
(N * M - 1) times, and the tags have been iterated over (N * M - 1) * (N * M)
times.

If both N and M are big enough, a hard lockup can be triggered on acquiring
the host lock, as observed on mpi3mr (128 hw queues, queue depth 8169).
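
For a concrete sense of scale, a small standalone calculation with those
numbers (illustrative only; N, M and the formula come from the analysis
above):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long n = 128;		/* hw queues (mpi3mr) */
		unsigned long long m = 8169;		/* queue depth per hw queue */
		unsigned long long tags = n * m;	/* tags per scsi_host_busy() walk */

		/* the walk runs N * M - 1 times before the last wakeup,
		 * each pass touching all N * M tags */
		printf("tags per walk:        %llu\n", tags);
		printf("total tag iterations: %llu\n", (tags - 1) * tags);
		return 0;
	}

That is about 1.05 million tags per walk and roughly 1.1e12 tag iterations
in total, all serialized behind a single spinlock.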

Fix the issue by calling scsi_host_busy() outside the host lock; the host
lock is not needed for getting the busy count because the host lock never
covers it.

Cc: Ewan Milne <emilne@redhat.com>
Fixes: 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
BTW, another way would be to wake up the EH handler unconditionally, and it
should work too.
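
A minimal sketch of that alternative (hypothetical, not part of this patch;
it assumes a spurious wake_up_process() on an already-running EH thread is a
cheap no-op):

	/* hypothetical: skip the busy/failed comparison entirely */
	spin_lock_irqsave(shost->host_lock, flags);
	shost->host_eh_scheduled++;
	wake_up_process(shost->ehandler);	/* spurious wakeup is harmless */
	spin_unlock_irqrestore(shost->host_lock, flags);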

 drivers/scsi/scsi_error.c | 11 +++++++----
 drivers/scsi/scsi_lib.c   |  4 +++-
 drivers/scsi/scsi_priv.h  |  2 +-
 3 files changed, 11 insertions(+), 6 deletions(-)

Comments

Ming Lei Jan. 12, 2024, 12:42 p.m. UTC | #1
On Fri, Jan 12, 2024 at 12:12:57PM +0100, Hannes Reinecke wrote:
> On 1/12/24 08:00, Ming Lei wrote:
> > Inside scsi_eh_wakeup(), scsi_host_busy() is called & checked with the host lock
> > every time to decide if the error handler kthread needs to be woken up.
> >
> > [...]
> >
> Can you share details for the hard lockup?
> I do agree that scsi_host_busy() is an expensive operation, so it
> might not be ideal to call it under a spin lock.
> But I wonder where the lockup comes in here.
> Care to explain?

Recovery happens when there are N * M in-flight requests; then
scsi_dec_host_busy() can be called for each in-flight request/scmnd from irq
context.

The host lock serializes every scsi_eh_wakeup() call.

Given that each hardware queue has its own irq handler, there can be one
request whose scsi_dec_host_busy() call spins on the host lock until the
lock has been released by scsi_dec_host_busy() for all requests from all
the other hardware queues.

The spin time can be long enough to trigger a hard lockup if N and M are
big enough, and the total wait time can be:

	(N - 1) * M * time_taken_in_scsi_host_busy().

Meanwhile the same story plays out in scsi_eh_inc_host_failed(), which is
called from softirq context, so the host lock spin can get even worse.

It was observed on mpi3mr with 128 (N) hw queues and a queue depth of 8169 (M).
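
In other words, the patch changes the completion-path shape roughly like
this (a condensed sketch; the full patch is at the bottom of this page):

	/* before: the O(N * M) tag walk runs with the host lock held */
	spin_lock_irqsave(shost->host_lock, flags);
	if (scsi_host_busy(shost) == shost->host_failed)	/* expensive */
		wake_up_process(shost->ehandler);
	spin_unlock_irqrestore(shost->host_lock, flags);

	/* after: the walk runs lock-free; only compare + wakeup stay locked */
	busy = scsi_host_busy(shost);
	spin_lock_irqsave(shost->host_lock, flags);
	if (busy == shost->host_failed)
		wake_up_process(shost->ehandler);
	spin_unlock_irqrestore(shost->host_lock, flags);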

> 
> And if it leads to a lockup, aren't other instances calling scsi_host_busy()
> under a spinlock affected, as well?

It is only possible when scsi_host_busy() is called in a per-command path.


Thanks,
Ming
Ewan Milne Jan. 12, 2024, 7:34 p.m. UTC | #2
On Fri, Jan 12, 2024 at 7:43 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> [...]
>

I can't see why this wouldn't work, or cause a problem with a lost wakeup,
but the cost of iterating to obtain the host_busy value is still being paid,
just outside the host_lock.  If this has triggered a hard lockup, should
we revisit the algorithm, e.g. are we still delaying EH wakeup for a noticeable
amount of time?  O(n^2) algorithms in the kernel don't seem like the best idea.

In any case...
Reviewed-by: Ewan D. Milne <emilne@redhat.com>

-Ewan
Ming Lei Jan. 13, 2024, 1:59 a.m. UTC | #3
On Fri, Jan 12, 2024 at 02:34:52PM -0500, Ewan Milne wrote:
>
> [...]
>
> I can't see why this wouldn't work, or cause a problem with a lost wakeup,
> but the cost of iterating to obtain the host_busy value is still being paid,
> just outside the host_lock.  If this has triggered a hard lockup, should
> we revisit the algorithm, e.g. are we still delaying EH wakeup for a noticeable
> amount of time?

SCSI EH is designed not to start handling until all in-flight commands have
failed, so it waits for all requests to fail first.

> O(n^2) algorithms in the kernel don't seem like the best idea.

It is actually O(n) because each hardware queue handles requests
in parallel.

It degrades to O(n^2), or O(n * m), only because of the shared host lock.

A single scsi_host_busy() call, or even N of them, won't take too long
without the host lock; what actually matters is the per-host lock spin time,
which can accumulate and become too large.
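
A rough standalone estimate of that accumulated spin time (the per-walk
cost t_us below is an assumed figure, not a measurement from the report):

	#include <stdio.h>

	int main(void)
	{
		double n = 128, m = 8169;	/* mpi3mr: hw queues, queue depth */
		double t_us = 100.0;		/* ASSUMED cost of one scsi_host_busy() walk */

		/* (N - 1) * M * time_taken_in_scsi_host_busy(), from above */
		double walks = (n - 1) * m;
		printf("serialized walks: %.0f\n", walks);
		printf("accumulated spin: %.1f s\n", walks * t_us / 1e6);
		return 0;
	}

Even with that modest assumed per-walk cost, the lock can be contended for
on the order of 100 seconds, far past the hard-lockup watchdog threshold.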

> 
> In any case...
> Reviewed-by: Ewan D. Milne <emilne@redhat.com>

Thanks for the review!
Sathya Prakash Veerichetty Jan. 23, 2024, 7:04 a.m. UTC | #4
On Fri, Jan 12, 2024 at 6:59 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> [...]
>
Reviewed-by: Sathya Prakash Veerichetty <sathya.prakash@broadcom.com>
Tested-by: Sathya Prakash Veerichetty <sathya.prakash@broadcom.com>
Bart Van Assche Jan. 23, 2024, 3:23 p.m. UTC | #5
On 1/11/24 23:00, Ming Lei wrote:
> @@ -87,8 +87,10 @@ void scsi_schedule_eh(struct Scsi_Host *shost)
>   
>   	if (scsi_host_set_state(shost, SHOST_RECOVERY) == 0 ||
>   	    scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY) == 0) {
> +		unsigned int busy = scsi_host_busy(shost);
> +
>   		shost->host_eh_scheduled++;
> -		scsi_eh_wakeup(shost);
> +		scsi_eh_wakeup(shost, busy);
>   	}

No new variable is needed here. If this patch is reposted, please change
the above into the following:

+		scsi_eh_wakeup(shost, scsi_host_busy(shost));

With or without that additional change:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Martin K. Petersen Jan. 24, 2024, 3 a.m. UTC | #6
On Fri, 12 Jan 2024 15:00:00 +0800, Ming Lei wrote:

> Inside scsi_eh_wakeup(), scsi_host_busy() is called & checked with the host lock
> every time to decide if the error handler kthread needs to be woken up.
> 
> This can be too heavy during recovery, for example:
> 
> - N hardware queues
> - queue depth is M for each hardware queue
> - each scsi_host_busy() call iterates over (N * M) tags/requests
> 
> [...]

Applied to 6.8/scsi-fixes, thanks!

[1/1] scsi: core: move scsi_host_busy() out of host lock for waking up EH handler
      https://git.kernel.org/mkp/scsi/c/4373534a9850
Ming Lei Feb. 3, 2024, 2:31 a.m. UTC | #7
On Wed, Jan 24, 2024 at 11:01 AM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
> On Fri, 12 Jan 2024 15:00:00 +0800, Ming Lei wrote:
>
> > [...]
>
> Applied to 6.8/scsi-fixes, thanks!
>
> [1/1] scsi: core: move scsi_host_busy() out of host lock for waking up EH handler
>       https://git.kernel.org/mkp/scsi/c/4373534a9850

Hi Martin,

When I started to backport this commit, I found it was merged incorrectly:
the point is that scsi_host_busy() needs to be moved out of the host lock.

I will send one new patch to fix it.

Thanks,
Ming

Patch

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index c67cdcdc3ba8..6743eb88e0a5 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -61,11 +61,11 @@  static int scsi_eh_try_stu(struct scsi_cmnd *scmd);
 static enum scsi_disposition scsi_try_to_abort_cmd(const struct scsi_host_template *,
 						   struct scsi_cmnd *);
 
-void scsi_eh_wakeup(struct Scsi_Host *shost)
+void scsi_eh_wakeup(struct Scsi_Host *shost, unsigned int busy)
 {
 	lockdep_assert_held(shost->host_lock);
 
-	if (scsi_host_busy(shost) == shost->host_failed) {
+	if (busy == shost->host_failed) {
 		trace_scsi_eh_wakeup(shost);
 		wake_up_process(shost->ehandler);
 		SCSI_LOG_ERROR_RECOVERY(5, shost_printk(KERN_INFO, shost,
@@ -87,8 +87,10 @@  void scsi_schedule_eh(struct Scsi_Host *shost)
 
 	if (scsi_host_set_state(shost, SHOST_RECOVERY) == 0 ||
 	    scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY) == 0) {
+		unsigned int busy = scsi_host_busy(shost);
+
 		shost->host_eh_scheduled++;
-		scsi_eh_wakeup(shost);
+		scsi_eh_wakeup(shost, busy);
 	}
 
 	spin_unlock_irqrestore(shost->host_lock, flags);
@@ -283,10 +285,11 @@  static void scsi_eh_inc_host_failed(struct rcu_head *head)
 	struct scsi_cmnd *scmd = container_of(head, typeof(*scmd), rcu);
 	struct Scsi_Host *shost = scmd->device->host;
 	unsigned long flags;
+	unsigned int busy = scsi_host_busy(shost);
 
 	spin_lock_irqsave(shost->host_lock, flags);
 	shost->host_failed++;
-	scsi_eh_wakeup(shost);
+	scsi_eh_wakeup(shost, busy);
 	spin_unlock_irqrestore(shost->host_lock, flags);
 }
 
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index cf3864f72093..df5ac03d5d6c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -278,9 +278,11 @@  static void scsi_dec_host_busy(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
 	rcu_read_lock();
 	__clear_bit(SCMD_STATE_INFLIGHT, &cmd->state);
 	if (unlikely(scsi_host_in_recovery(shost))) {
+		unsigned int busy = scsi_host_busy(shost);
+
 		spin_lock_irqsave(shost->host_lock, flags);
 		if (shost->host_failed || shost->host_eh_scheduled)
-			scsi_eh_wakeup(shost);
+			scsi_eh_wakeup(shost, busy);
 		spin_unlock_irqrestore(shost->host_lock, flags);
 	}
 	rcu_read_unlock();
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 3f0dfb97db6b..1fbfe1b52c9f 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -92,7 +92,7 @@  extern void scmd_eh_abort_handler(struct work_struct *work);
 extern enum blk_eh_timer_return scsi_timeout(struct request *req);
 extern int scsi_error_handler(void *host);
 extern enum scsi_disposition scsi_decide_disposition(struct scsi_cmnd *cmd);
-extern void scsi_eh_wakeup(struct Scsi_Host *shost);
+extern void scsi_eh_wakeup(struct Scsi_Host *shost, unsigned int busy);
 extern void scsi_eh_scmd_add(struct scsi_cmnd *);
 void scsi_eh_ready_devs(struct Scsi_Host *shost,
 			struct list_head *work_q,