Message ID | 20250108092520.1325324-4-hch@lst.de |
---|---|
State | New |
Series | [01/10] block: fix docs for freezing of queue limits updates |
On 1/8/25 6:25 PM, Christoph Hellwig wrote:
> When __blk_mq_update_nr_hw_queues changes the number of tag sets, it
> might have to disable poll queues.  Currently it does so by adjusting
> the BLK_FEAT_POLL, which is a bit against the intent of features that
> describe hardware / driver capabilities, but more importantly causes
> nasty lock order problems with the broadly held freeze when updating the
> number of hardware queues and the limits lock.  Fix this by leaving
> BLK_FEAT_POLL alone, and instead check for the number of poll queues in
> the bio submission and poll handlers.  While this adds extra work to the
> fast path, the variables are in cache lines used by these operations
> anyway, so it should be cheap enough.
>
> Fixes: 8023e144f9d6 ("block: move the poll flag to queue_limits")
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks OK to me.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>

On Wed, Jan 08, 2025 at 10:25:00AM +0100, Christoph Hellwig wrote:
> When __blk_mq_update_nr_hw_queues changes the number of tag sets, it
> might have to disable poll queues.  Currently it does so by adjusting
> the BLK_FEAT_POLL, which is a bit against the intent of features that
> describe hardware / driver capabilities, but more importantly causes
> nasty lock order problems with the broadly held freeze when updating the
> number of hardware queues and the limits lock.  Fix this by leaving
> BLK_FEAT_POLL alone, and instead check for the number of poll queues in
> the bio submission and poll handlers.  While this adds extra work to the
> fast path, the variables are in cache lines used by these operations
> anyway, so it should be cheap enough.
>
> Fixes: 8023e144f9d6 ("block: move the poll flag to queue_limits")
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---

...

>  /**
>   * submit_bio_noacct - re-submit a bio to the block device layer for I/O
>   * @bio:  The bio describing the location in memory and on the device.
> @@ -805,8 +817,7 @@ void submit_bio_noacct(struct bio *bio)
>  		}
>  	}
>
> -	if (!(q->limits.features & BLK_FEAT_POLL) &&
> -			(bio->bi_opf & REQ_POLLED)) {
> +	if ((bio->bi_opf & REQ_POLLED) && !bdev_can_poll(bdev)) {

submit_bio_noacct() is called without grabbing .q_usage_counter,
so tagset may be freed now, then use-after-free on q->tag_set?

Thanks,
Ming

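[Editorial sketch] The concern above is that q->tag_set is dereferenced without a queue usage reference, so the tag set could already be gone. The following is a minimal userspace model of that idea, not kernel code: every `model_*` name is hypothetical, and a plain C11 atomic counter stands in for the kernel's `.q_usage_counter`, which is a percpu_ref and behaves differently in detail. It only illustrates "take a reference before touching tag_set, and let teardown proceed only when no reference is held".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for blk_mq_tag_set and request_queue. */
struct model_tag_set {
	unsigned int nr_poll_queues;
};

struct model_queue {
	atomic_int usage;		/* >= 0: live, value = users; -1: torn down */
	struct model_tag_set *tag_set;	/* only valid while a reference is held */
};

/* Try to take a usage reference; fails once the queue has been torn down. */
static bool model_queue_enter(struct model_queue *q)
{
	int cur = atomic_load(&q->usage);

	while (cur >= 0) {
		/* On failure, cur is reloaded and the liveness check repeats. */
		if (atomic_compare_exchange_weak(&q->usage, &cur, cur + 1))
			return true;
	}
	return false;
}

/* Drop a reference taken by model_queue_enter(). */
static void model_queue_exit(struct model_queue *q)
{
	atomic_fetch_sub(&q->usage, 1);
}

/* Teardown only succeeds with zero users; afterwards tag_set is gone. */
static bool model_queue_teardown(struct model_queue *q)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(&q->usage, &expected, -1))
		return false;
	q->tag_set = NULL;
	return true;
}

/* Poll check that dereferences tag_set only while holding a reference. */
static bool model_can_poll_guarded(struct model_queue *q)
{
	bool ret;

	if (!model_queue_enter(q))
		return false;
	ret = q->tag_set->nr_poll_queues != 0;
	model_queue_exit(q);
	return ret;
}
```

In this model an unguarded check would read q->tag_set after model_queue_teardown() has cleared it, which corresponds to the use-after-free being asked about.
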
On 1/9/25 00:27, Christoph Hellwig wrote:
> On Wed, Jan 08, 2025 at 06:31:15PM +0800, Ming Lei wrote:
>>> -	if (!(q->limits.features & BLK_FEAT_POLL) &&
>>> -			(bio->bi_opf & REQ_POLLED)) {
>>> +	if ((bio->bi_opf & REQ_POLLED) && !bdev_can_poll(bdev)) {
>>
>> submit_bio_noacct() is called without grabbing .q_usage_counter,
>> so tagset may be freed now, then use-after-free on q->tag_set?
>
> Indeed.  That also means the previous check wasn't reliable either.
> I think we can simple move the check into
> blk_mq_submit_bio/__submit_bio which means we'll do a bunch more
> checks before we eventually fail, but otherwise it'll work the
> same.

Given that the request queue is the same for all tag sets, I do not think we
need to have the queue_limits_start_update()/commit_update() within the tag set
loop in __blk_mq_update_nr_hw_queues(). So something like this should be enough
for an initial fix, no ?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8ac19d4ae3c0..ac71e9cee25b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4986,6 +4986,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 							int nr_hw_queues)
 {
 	struct request_queue *q;
+	struct queue_limits lim;
 	LIST_HEAD(head);
 	int prev_nr_hw_queues = set->nr_hw_queues;
 	int i;
@@ -4999,8 +5000,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
 		return;
 
+	lim = queue_limits_start_update(q);
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_freeze_queue(q);
+
 	/*
 	 * Switch IO scheduler to 'none', cleaning up the data associated
 	 * with the previous scheduler. We will switch back once we are done
@@ -5036,13 +5039,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 			set->nr_hw_queues = prev_nr_hw_queues;
 			goto fallback;
 		}
-		lim = queue_limits_start_update(q);
 		if (blk_mq_can_poll(set))
 			lim.features |= BLK_FEAT_POLL;
 		else
 			lim.features &= ~BLK_FEAT_POLL;
-		if (queue_limits_commit_update(q, &lim) < 0)
-			pr_warn("updating the poll flag failed\n");
 		blk_mq_map_swqueue(q);
 	}
 
@@ -5059,6 +5059,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_unfreeze_queue(q);
 
+	if (queue_limits_commit_update(q, &lim) < 0)
+		pr_warn("updating the poll flag failed\n");
+
 	/* Free the excess tags when nr_hw_queues shrink. */
 	for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++)
 		__blk_mq_free_map_and_rqs(set, i);

With that, no modification of the hot path to check the poll feature should be
needed. And I also fail to see why we need to do the queue freeze for all tag
sets. Once should be enough as well...

On Thu, Jan 09, 2025 at 09:05:49AM +0900, Damien Le Moal wrote:
> On 1/9/25 00:27, Christoph Hellwig wrote:
> > On Wed, Jan 08, 2025 at 06:31:15PM +0800, Ming Lei wrote:
> >>> -	if (!(q->limits.features & BLK_FEAT_POLL) &&
> >>> -			(bio->bi_opf & REQ_POLLED)) {
> >>> +	if ((bio->bi_opf & REQ_POLLED) && !bdev_can_poll(bdev)) {
> >>
> >> submit_bio_noacct() is called without grabbing .q_usage_counter,
> >> so tagset may be freed now, then use-after-free on q->tag_set?
> >
> > Indeed.  That also means the previous check wasn't reliable either.
> > I think we can simple move the check into
> > blk_mq_submit_bio/__submit_bio which means we'll do a bunch more
> > checks before we eventually fail, but otherwise it'll work the
> > same.
>
> Given that the request queue is the same for all tag sets, I do not think we

No, it isn't the same.

> need to have the queue_limits_start_update()/commit_update() within the tag set
> loop in __blk_mq_update_nr_hw_queues(). So something like this should be enough
> for an initial fix, no ?
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8ac19d4ae3c0..ac71e9cee25b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4986,6 +4986,7 @@ static void __blk_mq_update_nr_hw_queues(struct
> blk_mq_tag_set *set,
>  							int nr_hw_queues)
>  {
>  	struct request_queue *q;
> +	struct queue_limits lim;
>  	LIST_HEAD(head);
>  	int prev_nr_hw_queues = set->nr_hw_queues;
>  	int i;
> @@ -4999,8 +5000,10 @@ static void __blk_mq_update_nr_hw_queues(struct
> blk_mq_tag_set *set,
>  	if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
>  		return;
>
> +	lim = queue_limits_start_update(q);
>  	list_for_each_entry(q, &set->tag_list, tag_set_list)
>  		blk_mq_freeze_queue(q);

It could be worse: the limits_lock is connected with lots of other
subsystems' locks (debugfs, sysfs dir, ...), so it may introduce new
deadlock risk.

Thanks,
Ming

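[Editorial sketch] To make the lock-window point concrete, here is a small userspace model of the start/commit pattern. It assumes, as in the kernel helpers at the time of this thread, that queue_limits_start_update() takes the limits lock and queue_limits_commit_update() drops it. All `model_*` names and the flag-based "lock" are hypothetical simplifications with no real concurrency; the model only records whether the limits lock was ever held while the queue was frozen.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FEAT_POLL (1u << 0)	/* arbitrary bit, not the kernel value */

struct model_limits {
	unsigned int features;
};

struct model_lq {
	struct model_limits limits;
	bool limits_locked;		/* stands in for q->limits_lock */
	bool frozen;			/* stands in for a frozen queue */
	bool locked_while_frozen;	/* set if both were ever held at once */
};

/* Record any moment where the lock and the freeze overlap. */
static void model_note_overlap(struct model_lq *q)
{
	if (q->limits_locked && q->frozen)
		q->locked_while_frozen = true;
}

/* Take the limits "lock" and hand back a private copy to edit. */
static struct model_limits model_limits_start_update(struct model_lq *q)
{
	assert(!q->limits_locked);
	q->limits_locked = true;
	model_note_overlap(q);
	return q->limits;
}

/* Publish the edited copy and drop the limits "lock". */
static int model_limits_commit_update(struct model_lq *q,
				      const struct model_limits *lim)
{
	assert(q->limits_locked);
	q->limits = *lim;
	q->limits_locked = false;
	return 0;
}

static void model_freeze(struct model_lq *q)
{
	q->frozen = true;
	model_note_overlap(q);
}

static void model_unfreeze(struct model_lq *q)
{
	q->frozen = false;
}
```

Calling model_limits_start_update() before model_freeze() and committing only after model_unfreeze(), as in the quoted proposal, leaves locked_while_frozen set: the limits lock spans the entire frozen section, which is the overlap the reply warns about.
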
diff --git a/block/blk-core.c b/block/blk-core.c
index 666efe8fa202..4fb495d25c85 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -753,6 +753,18 @@ static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
 	return BLK_STS_OK;
 }
 
+inline bool bdev_can_poll(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (!(q->limits.features & BLK_FEAT_POLL))
+		return false;
+
+	if (queue_is_mq(q))
+		return q->tag_set->map[HCTX_TYPE_POLL].nr_queues;
+	return true;
+}
+
 /**
  * submit_bio_noacct - re-submit a bio to the block device layer for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -805,8 +817,7 @@ void submit_bio_noacct(struct bio *bio)
 		}
 	}
 
-	if (!(q->limits.features & BLK_FEAT_POLL) &&
-			(bio->bi_opf & REQ_POLLED)) {
+	if ((bio->bi_opf & REQ_POLLED) && !bdev_can_poll(bdev)) {
 		bio_clear_polled(bio);
 		goto not_supported;
 	}
@@ -935,7 +946,7 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags)
 		return 0;
 
 	q = bdev_get_queue(bdev);
-	if (cookie == BLK_QC_T_NONE || !(q->limits.features & BLK_FEAT_POLL))
+	if (cookie == BLK_QC_T_NONE || !bdev_can_poll(bdev))
 		return 0;
 
 	blk_flush_plug(current->plug, false);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2e6132f778fd..f795d81b6b38 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4320,12 +4320,6 @@ void blk_mq_release(struct request_queue *q)
 	blk_mq_sysfs_deinit(q);
 }
 
-static bool blk_mq_can_poll(struct blk_mq_tag_set *set)
-{
-	return set->nr_maps > HCTX_TYPE_POLL &&
-		set->map[HCTX_TYPE_POLL].nr_queues;
-}
-
 struct request_queue *blk_mq_alloc_queue(struct blk_mq_tag_set *set,
 		struct queue_limits *lim, void *queuedata)
 {
@@ -4336,7 +4330,7 @@ struct request_queue *blk_mq_alloc_queue(struct blk_mq_tag_set *set,
 	if (!lim)
 		lim = &default_lim;
 	lim->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT;
-	if (blk_mq_can_poll(set))
+	if (set->nr_maps > HCTX_TYPE_POLL)
 		lim->features |= BLK_FEAT_POLL;
 
 	q = blk_alloc_queue(lim, set->numa_node);
@@ -5024,8 +5018,6 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 fallback:
 	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		struct queue_limits lim;
-
 		blk_mq_realloc_hw_ctxs(set, q);
 
 		if (q->nr_hw_queues != set->nr_hw_queues) {
@@ -5039,13 +5031,6 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 			set->nr_hw_queues = prev_nr_hw_queues;
 			goto fallback;
 		}
-		lim = queue_limits_start_update(q);
-		if (blk_mq_can_poll(set))
-			lim.features |= BLK_FEAT_POLL;
-		else
-			lim.features &= ~BLK_FEAT_POLL;
-		if (queue_limits_commit_update(q, &lim) < 0)
-			pr_warn("updating the poll flag failed\n");
 		blk_mq_map_swqueue(q);
 	}
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 767598e719ab..54488af6c001 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -245,10 +245,14 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page)	\
 		!!(disk->queue->limits.features & _feature));	\
 }
 
-QUEUE_SYSFS_FEATURE_SHOW(poll, BLK_FEAT_POLL);
 QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
 QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
 
+static ssize_t queue_poll_show(struct gendisk *disk, char *page)
+{
+	return sysfs_emit(page, "%u\n", bdev_can_poll(disk->part0));
+}
+
 static ssize_t queue_zoned_show(struct gendisk *disk, char *page)
 {
 	if (blk_queue_is_zoned(disk->queue))
diff --git a/block/blk.h b/block/blk.h
index 4904b86d5fec..c8fdbb22d483 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -589,6 +589,7 @@ int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
 long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
 int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
 long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
+bool bdev_can_poll(struct block_device *bdev);
 
 extern const struct address_space_operations def_blk_aops;
 

When __blk_mq_update_nr_hw_queues changes the number of tag sets, it
might have to disable poll queues.  Currently it does so by adjusting
the BLK_FEAT_POLL, which is a bit against the intent of features that
describe hardware / driver capabilities, but more importantly causes
nasty lock order problems with the broadly held freeze when updating the
number of hardware queues and the limits lock.  Fix this by leaving
BLK_FEAT_POLL alone, and instead check for the number of poll queues in
the bio submission and poll handlers.  While this adds extra work to the
fast path, the variables are in cache lines used by these operations
anyway, so it should be cheap enough.

Fixes: 8023e144f9d6 ("block: move the poll flag to queue_limits")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c  | 17 ++++++++++++++---
 block/blk-mq.c    | 17 +----------------
 block/blk-sysfs.c |  6 +++++-
 block/blk.h       |  1 +
 4 files changed, 21 insertions(+), 20 deletions(-)
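
[Editorial sketch] The approach described in the commit message can be modelled in userspace as follows. This is a simplified illustration, not the kernel code: the `model_*` types and functions are hypothetical, the BLK_FEAT_POLL bit value is arbitrary, and a NULL tag_set stands in for a bio-based (non-mq) queue. The point it shows is that the feature flag is set once as a static capability, while the effective answer is derived from the current number of poll queues, so resizing never has to touch the limits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same ordering as the kernel's hctx types; values matter only locally. */
enum { HCTX_TYPE_DEFAULT, HCTX_TYPE_READ, HCTX_TYPE_POLL, HCTX_MAX_TYPES };

#define BLK_FEAT_POLL (1u << 0)	/* arbitrary bit in this model */

struct model_queue_map {
	unsigned int nr_queues;
};

struct model_tag_set {
	unsigned int nr_maps;
	struct model_queue_map map[HCTX_MAX_TYPES];
};

struct model_queue {
	unsigned int features;		/* static capability bits */
	struct model_tag_set *tag_set;	/* NULL models a bio-based queue */
};

/* Mirrors the shape of bdev_can_poll() in the patch. */
static bool model_can_poll(const struct model_queue *q)
{
	if (!(q->features & BLK_FEAT_POLL))
		return false;
	if (q->tag_set)
		return q->tag_set->map[HCTX_TYPE_POLL].nr_queues != 0;
	return true;
}

/* Resizing changes only the map; the feature bit is left alone. */
static void model_set_poll_queues(struct model_tag_set *set, unsigned int n)
{
	set->map[HCTX_TYPE_POLL].nr_queues = n;
}

/* Small self-check showing the flag staying put across resizes. */
static void model_selfcheck(void)
{
	struct model_tag_set set = { .nr_maps = HCTX_MAX_TYPES };
	struct model_queue q = { .features = BLK_FEAT_POLL, .tag_set = &set };

	assert(!model_can_poll(&q));		/* capable, but no poll queues yet */
	model_set_poll_queues(&set, 2);
	assert(model_can_poll(&q));
	model_set_poll_queues(&set, 0);
	assert(!model_can_poll(&q));
	assert(q.features & BLK_FEAT_POLL);	/* capability bit never changed */
}
```

The model leaves out the lifetime question raised in the replies: it assumes tag_set stays valid for as long as the check runs.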