mbox series

[0/8] blk-mq: fix wrong queue mapping for kdump kernel

Message ID 20230712125455.1986455-1-ming.lei@redhat.com
Headers show
Series blk-mq: fix wrong queue mapping for kdump kernel | expand

Message

Ming Lei July 12, 2023, 12:54 p.m. UTC
Hi,

On arm and ppc64, 'maxcpus=1' is required for kdump kernel,
see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
still returns all CPUs because 'maxcpus=1' just bring up one single
cpu core during booting.

blk-mq sees single queue in kdump kernel, and in driver's viewpoint
there are still multiple queues, this inconsistency causes driver to apply
wrong queue mapping for handling IO, and IO timeout is triggered.

This issue is only triggered on managed irq in case of multiple hw
queues. Some drivers takes online cpus into account for nr_hw_queues,
and don't have such issue, such as nvme rdma/tcp.

Meantime, single queue makes much less resource utilization, and reduce
risk of kernel failure.


Thanks,
Ming

Ming Lei (8):
  blk-mq: add blk_mq_max_nr_hw_queues()
  nvme-pci: use blk_mq_max_nr_hw_queues() to calculate io queues
  scsi: lpfc: use blk_mq_max_nr_hw_queues() to calculate io vectors
  scsi: hisi: take blk_mq_max_nr_hw_queues() into account for
    calculating io vectors
  scsi: mpi3mr: take blk_mq_max_nr_hw_queues() into account for
    calculating io vectors
  scsi: megaraid: take blk_mq_max_nr_hw_queues() into account for
    calculating io vectors
  scsi: mpt3sas: take blk_mq_max_nr_hw_queues() into account for
    calculating io vectors
  scsi: pm8001: take blk_mq_max_nr_hw_queues() into account for
    calculating io vectors

 block/blk-mq.c                            | 9 +++++++++
 drivers/nvme/host/pci.c                   | 2 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    | 3 +++
 drivers/scsi/lpfc/lpfc_init.c             | 1 +
 drivers/scsi/megaraid/megaraid_sas_base.c | 6 +++++-
 drivers/scsi/mpi3mr/mpi3mr_fw.c           | 3 +++
 drivers/scsi/mpt3sas/mpt3sas_base.c       | 4 ++--
 drivers/scsi/pm8001/pm8001_init.c         | 4 +++-
 include/linux/blk-mq.h                    | 1 +
 9 files changed, 28 insertions(+), 5 deletions(-)

Comments

Christoph Hellwig July 12, 2023, 1:19 p.m. UTC | #1
On Wed, Jul 12, 2023 at 09:16:11PM +0800, Ming Lei wrote:
> The problem is that blk_mq_alloc_tag_set() forces to set nr_hw_queues
> as 1 for kdump kernel, that is why blk_mq_max_nr_hw_queues() has to
> return 1 for kdump kernel.

Well, let's fix that first and work from there.  Same argument against
that deep magic applies there as well.

> Thomas, can we disable managed irq for kdump kernel and switch to
> non-managed irq? Then we can avoid driver's change. I'd suggest
> this way if it is possible.

Why the heck would we?
Ming Lei July 12, 2023, 1:31 p.m. UTC | #2
On Wed, Jul 12, 2023 at 03:19:25PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 12, 2023 at 09:16:11PM +0800, Ming Lei wrote:
> > The problem is that blk_mq_alloc_tag_set() forces to set nr_hw_queues
> > as 1 for kdump kernel, that is why blk_mq_max_nr_hw_queues() has to
> > return 1 for kdump kernel.
> 
> Well, let's fix that first and work from there.  Same argument against
> that deep magic applies there as well.

In short, driver needs to figure out nr_hw_queues first by hardware info,
then pass it to blk_mq_alloc_tag_set(), but blk_mq_alloc_tag_set() changes it,
so inconsistency is caused.

The only solution in this way is to tell driver the max supported
number from the beginning, that is what this patchset is doing.

> 
> > Thomas, can we disable managed irq for kdump kernel and switch to
> > non-managed irq? Then we can avoid driver's change. I'd suggest
> > this way if it is possible.
> 
> Why the heck would we?

IMO irq kernel doesn't make sense in kdump kernel, which is very
resource limited and has to be reliable.

PCI_IRQ_AFFINITY can be just one hint, pci_alloc_irq_vectors_affinity()
still allocates affinity in managed way, then queue mapping can work
just fine, and the only difference is that genirq handles this irqs
as non-manged wrt. migration.

This way should solve queue mapping issue, but driver still allocates
lots of queues, which take resource useless. So looks we still have to
fix drivers.


Thanks, 
Ming