Message ID: 20240325044452.3125418-1-dlemoal@kernel.org
Series: Zone write plugging
On 3/24/24 21:44, Damien Le Moal wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8aeb8e96f1a7..9e6e2a9a147c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -820,11 +820,11 @@ static void blk_complete_request(struct request *req)
>  		/* Completion has already been traced */
>  		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
>
> -		if (req_op(req) == REQ_OP_ZONE_APPEND)
> -			bio->bi_iter.bi_sector = req->__sector;
> -
> -		if (!is_flush)
> +		if (!is_flush) {
> +			blk_zone_update_request_bio(req, bio);
>  			bio_endio(bio);
> +		}

The above change includes a behavior change. It seems wrong to me not
to call blk_zone_update_request_bio() for REQ_OP_ZONE_APPEND requests
if RQF_FLUSH_SEQ has been set.

Thanks,

Bart.
On 3/24/24 21:44, Damien Le Moal wrote:
> +/*
> + * Per-zone write plug.
> + */
> +struct blk_zone_wplug {
> +	struct hlist_node	node;
> +	struct list_head	err;
> +	atomic_t		ref;
> +	spinlock_t		lock;
> +	unsigned int		flags;
> +	unsigned int		zone_no;
> +	unsigned int		wp_offset;
> +	struct bio_list		bio_list;
> +	struct work_struct	bio_work;
> +};

Please document what 'lock' protects. Please also document the unit of
wp_offset.

Since there is an atomic reference count in this data structure, why is
the flag BLK_ZONE_WPLUG_FREEING required? Can that flag be replaced by
checking whether or not 'ref' is zero?

> -void disk_free_zone_bitmaps(struct gendisk *disk)
> +static bool disk_insert_zone_wplug(struct gendisk *disk,
> +				   struct blk_zone_wplug *zwplug)
> +{
> +	struct blk_zone_wplug *zwplg;
> +	unsigned long flags;
> +	unsigned int idx =
> +		hash_32(zwplug->zone_no, disk->zone_wplugs_hash_bits);
> +
> +	/*
> +	 * Add the new zone write plug to the hash table, but carefully as we
> +	 * are racing with other submission context, so we may already have a
> +	 * zone write plug for the same zone.
> +	 */
> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> +	hlist_for_each_entry_rcu(zwplg, &disk->zone_wplugs_hash[idx], node) {
> +		if (zwplg->zone_no == zwplug->zone_no) {
> +			spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> +			return false;
> +		}
> +	}
> +	hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> +
> +	return true;
> +}

Since this function inserts an element into disk->zone_wplugs_hash[],
can it happen that another thread removes that element from the hash
list before this function returns?

> +static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
> +{
> +	if (atomic_dec_and_test(&zwplug->ref)) {
> +		WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list));
> +		WARN_ON_ONCE(!list_empty(&zwplug->err));
> +
> +		kmem_cache_free(blk_zone_wplugs_cachep, zwplug);
> +	}
> +}

Calling kfree() or any of its variants for elements on an RCU-list
usually is not safe. If this is really safe in this case, a comment
should explain why.

> +static struct blk_zone_wplug *disk_get_zone_wplug_locked(struct gendisk *disk,
> +					sector_t sector, gfp_t gfp_mask,
> +					unsigned long *flags)

What does the "_locked" suffix in the function name mean? Please add a
comment that explains this.

> +static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
> +					    struct blk_zone_wplug *zwplug)

What does the "_unaligned" suffix in this function name mean? Please add
a comment that explains this.

Thanks,

Bart.
On 3/24/24 21:44, Damien Le Moal wrote:
> With the block layer generic plugging of write operations for zoned
> block devices, mq-deadline, or any other scheduler, can only ever
> see at most one write operation per zone at any time. There is thus no
> sequentiality requirements for these writes and thus no need to tightly
> control the dispatching of write requests using zone write locking.
>
> Remove all the code that implement this control in the mq-deadline
> scheduler and remove advertizing support for the
> ELEVATOR_F_ZBD_SEQ_WRITE elevator feature.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
On 3/26/24 06:53, Bart Van Assche wrote:
> On 3/24/24 21:44, Damien Le Moal wrote:
>> +	/*
>> +	 * Remember the capacity of the first sequential zone and check
>> +	 * if it is constant for all zones.
>> +	 */
>> +	if (!args->zone_capacity)
>> +		args->zone_capacity = zone->capacity;
>> +	if (zone->capacity != args->zone_capacity) {
>> +		pr_warn("%s: Invalid variable zone capacity\n",
>> +			disk->disk_name);
>> +		return -ENODEV;
>> +	}
>
> The above code won't refuse devices for which the first few zones have
> capacity zero. Shouldn't these be rejected?

The device driver is supposed to ensure that zone capacity is always set to
the device reported value or to the zone size for devices that do not have a
zone capacity smaller than the zone size. But sure, I can add a check.
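[For illustration, such a zero-capacity check might look roughly like the
fragment below, placed just before the constant-capacity comparison quoted
above. This is only a sketch of the idea, not the check Damien ended up
posting:

	/* Sketch only: reject zones that report a zero capacity. */
	if (!zone->capacity) {
		pr_warn("%s: Invalid zero zone capacity\n",
			disk->disk_name);
		return -ENODEV;
	}
]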
On 3/26/24 04:52, Bart Van Assche wrote:
> On 3/24/24 21:44, Damien Le Moal wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 8aeb8e96f1a7..9e6e2a9a147c 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -820,11 +820,11 @@ static void blk_complete_request(struct request *req)
>>  		/* Completion has already been traced */
>>  		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
>>
>> -		if (req_op(req) == REQ_OP_ZONE_APPEND)
>> -			bio->bi_iter.bi_sector = req->__sector;
>> -
>> -		if (!is_flush)
>> +		if (!is_flush) {
>> +			blk_zone_update_request_bio(req, bio);
>>  			bio_endio(bio);
>> +		}
>
> The above change includes a behavior change. It seems wrong to me not
> to call blk_zone_update_request_bio() for REQ_OP_ZONE_APPEND requests if
> RQF_FLUSH_SEQ has been set.

REQ_OP_ZONE_APPEND + RQF_FLUSH_SEQ is not something supported, and this patch
series is not changing that. The reason is that the flush machinery is not
zone-append aware and will break if such request is issued for a device that
does not support fua. We probably should check for this, but that is not
something for this series to do and should be a separate fix.
On 3/26/24 06:53, Bart Van Assche wrote:
> On 3/24/24 21:44, Damien Le Moal wrote:
>> +/*
>> + * Per-zone write plug.
>> + */
>> +struct blk_zone_wplug {
>> +	struct hlist_node	node;
>> +	struct list_head	err;
>> +	atomic_t		ref;
>> +	spinlock_t		lock;
>> +	unsigned int		flags;
>> +	unsigned int		zone_no;
>> +	unsigned int		wp_offset;
>> +	struct bio_list		bio_list;
>> +	struct work_struct	bio_work;
>> +};
>
> Please document what 'lock' protects. Please also document the unit of
> wp_offset.
>
> Since there is an atomic reference count in this data structure, why is
> the flag BLK_ZONE_WPLUG_FREEING required? Can that flag be replaced by
> checking whether or not 'ref' is zero?

Nope, we cannot. The reason is that BIO issuing and zone reset/finish can be
concurrently processed and we need to be ready for a user doing really stupid
things like resetting or finishing a zone while BIOs for that zone are being
issued. When zone reset/finish is processed, the plug is removed from the hash
table, but disk_get_zone_wplug_locked() may still get a reference to it because
we do not have the plug locked yet. Hence the flag, to prevent reusing the plug
for the reset/finished zone that was already removed from the hash table. This
is mentioned with a comment in disk_get_zone_wplug_locked():

		/*
		 * Check that a BIO completion or a zone reset or finish
		 * operation has not already flagged the zone write plug for
		 * freeing and dropped its reference count. In such case, we
		 * need to get a new plug so start over from the beginning.
		 */

The reference count dropping to 0 will then be the trigger for actually freeing
the plug, after all in-flight or plugged BIOs are completed (most likely
failed).

>> -void disk_free_zone_bitmaps(struct gendisk *disk)
>> +static bool disk_insert_zone_wplug(struct gendisk *disk,
>> +				   struct blk_zone_wplug *zwplug)
>> +{
>> +	struct blk_zone_wplug *zwplg;
>> +	unsigned long flags;
>> +	unsigned int idx =
>> +		hash_32(zwplug->zone_no, disk->zone_wplugs_hash_bits);
>> +
>> +	/*
>> +	 * Add the new zone write plug to the hash table, but carefully as we
>> +	 * are racing with other submission context, so we may already have a
>> +	 * zone write plug for the same zone.
>> +	 */
>> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>> +	hlist_for_each_entry_rcu(zwplg, &disk->zone_wplugs_hash[idx], node) {
>> +		if (zwplg->zone_no == zwplug->zone_no) {
>> +			spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>> +			return false;
>> +		}
>> +	}
>> +	hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
>> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>> +
>> +	return true;
>> +}
>
> Since this function inserts an element into disk->zone_wplugs_hash[],
> can it happen that another thread removes that element from the hash
> list before this function returns?

No, that cannot happen. Both insertion and deletion of plugs in the hash table
are serialized with disk->zone_wplugs_lock. See disk_remove_zone_wplug().
On Tue, Mar 26, 2024 at 08:23:19AM +0900, Damien Le Moal wrote:
> REQ_OP_ZONE_APPEND + RQF_FLUSH_SEQ is not something supported, and this patch
> series is not changing that. The reason is that the flush machinery is not
> zone-append aware and will break if such request is issued for a device that
> does not support fua. We probably should check for this, but that is not
> something for this series to do and should be a separate fix.

Btw, I don't think we're even catching this right now. Would be great
to have a submission path check for it.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
On Tue, Mar 26, 2024 at 12:12:03PM +0900, Damien Le Moal wrote:
> Nope, we cannot. The reason is that BIO issuing and zone reset/finish can be
> concurrently processed and we need to be ready for a user doing really stupid
> things like resetting or finishing a zone while BIOs for that zone are being
> issued. When zone reset/finish is processed, the plug is removed from the hash
> table, but disk_get_zone_wplug_locked() may still get a reference to it because
> we do not have the plug locked yet. Hence the flag, to prevent reusing the plug
> for the reset/finished zone that was already removed from the hash table. This
> is mentioned with a comment in disk_get_zone_wplug_locked():
>
> 		/*
> 		 * Check that a BIO completion or a zone reset or finish
> 		 * operation has not already flagged the zone write plug for
> 		 * freeing and dropped its reference count. In such case, we
> 		 * need to get a new plug so start over from the beginning.
> 		 */
>
> The reference count dropping to 0 will then be the trigger for actually freeing
> the plug, after all in-flight or plugged BIOs are completed (most likely
> failed).

Maybe the comment should be expanded even more and moved to the definition of
the freeing flag?
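[For illustration, Christoph's suggestion could look roughly like the sketch
below, with the explanation carried by the flag definition itself. The wording
paraphrases Damien's explanation in this thread; the bit position is
illustrative only and not taken from the posted series:

	/*
	 * BLK_ZONE_WPLUG_FREEING (illustrative bit value): the zone write
	 * plug was removed from the hash table (e.g. by a zone reset or
	 * finish operation, or by a BIO completion) and its reference
	 * dropped. A lookup that still finds a stale reference must not
	 * reuse the plug and should restart to get a fresh one; the plug
	 * is actually freed once its reference count reaches zero.
	 */
	#define BLK_ZONE_WPLUG_FREEING	(1U << 1)
]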
On Mon, Mar 25, 2024 at 01:44:33PM +0900, Damien Le Moal wrote:
> For a zoned block device that has no limit on the number of open zones
> and no limit on the number of active zones, the zone write plug mempool
> size defaults to 128 zone write plugs. For such case, set the device
> max_open_zones queue limit to this value to indicate to the user the
> potential performance penalty that may happen when writing
> simultaneously to more zones than the mempool size.

zone_hw_limits is a horrible name for max(lim->max_active_zones,
lim->max_open_zones). But why do we even need it? I'd rather make sure
that we always have valid max_open/active limits, doing something like:

	if (!max_open && !max_active)
		max_open = max_active = DEFAULT_CAP;
	else if (!max_open || WARN_ON_ONCE(max_open > max_active))
		max_open = max_active;
	else if (!max_active)
		max_active = max_open;

and do away with the extra limit. Maybe DEFAULT_CAP should be tunable as
well?
On 3/26/24 15:37, Christoph Hellwig wrote:
> On Tue, Mar 26, 2024 at 08:23:19AM +0900, Damien Le Moal wrote:
>> REQ_OP_ZONE_APPEND + RQF_FLUSH_SEQ is not something supported, and this patch
>> series is not changing that. The reason is that the flush machinery is not
>> zone-append aware and will break if such request is issued for a device that
>> does not support fua. We probably should check for this, but that is not
>> something for this series to do and should be a separate fix.
>
> Btw, I don't think we're even catching this right now. Would be great
> to have a submission path check for it.

Yep, we are not checking this, but we should. Will send a separate patch for
this.
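[For illustration, such a submission path check might take roughly the shape
below, failing zone append BIOs that carry flush/FUA semantics on a device
without native FUA support, since those would end up in the flush machinery.
The exact condition and placement are assumptions; this is only a sketch of
the idea discussed here, not the separate patch Damien refers to:

	/*
	 * Sketch only: refuse zone append BIOs that would have to go
	 * through the flush machinery, which is not zone-append aware.
	 */
	if (bio_op(bio) == REQ_OP_ZONE_APPEND &&
	    (bio->bi_opf & (REQ_PREFLUSH | REQ_FUA)) &&
	    !bdev_fua(bio->bi_bdev)) {
		bio_io_error(bio);
		return;
	}
]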
On 3/25/24 20:12, Damien Le Moal wrote:
> On 3/26/24 06:53, Bart Van Assche wrote:
>> On 3/24/24 21:44, Damien Le Moal wrote:
>>> -void disk_free_zone_bitmaps(struct gendisk *disk)
>>> +static bool disk_insert_zone_wplug(struct gendisk *disk,
>>> +				    struct blk_zone_wplug *zwplug)
>>> +{
>>> +	struct blk_zone_wplug *zwplg;
>>> +	unsigned long flags;
>>> +	unsigned int idx =
>>> +		hash_32(zwplug->zone_no, disk->zone_wplugs_hash_bits);
>>> +
>>> +	/*
>>> +	 * Add the new zone write plug to the hash table, but carefully as we
>>> +	 * are racing with other submission context, so we may already have a
>>> +	 * zone write plug for the same zone.
>>> +	 */
>>> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>>> +	hlist_for_each_entry_rcu(zwplg, &disk->zone_wplugs_hash[idx], node) {
>>> +		if (zwplg->zone_no == zwplug->zone_no) {
>>> +			spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>>> +			return false;
>>> +		}
>>> +	}
>>> +	hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
>>> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>>> +
>>> +	return true;
>>> +}
>>
>> Since this function inserts an element into disk->zone_wplugs_hash[],
>> can it happen that another thread removes that element from the hash
>> list before this function returns?
>
> No, that cannot happen. Both insertion and deletion of plugs in the hash table
> are serialized with disk->zone_wplugs_lock. See disk_remove_zone_wplug().

I think that documenting locking assumptions with lockdep_assert_held() would
make this patch easier to review.

Thanks,

Bart.
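[For illustration, Bart's suggestion usually takes the form below: any helper
that relies on its caller holding zone_wplugs_lock carries an explicit
annotation, so lockdep flags misuse at runtime and reviewers can see the
assumption. The helper name is hypothetical and not taken from the posted
series; the hash walk mirrors the code quoted above:

	/* Hypothetical helper that assumes zone_wplugs_lock is held. */
	static struct blk_zone_wplug *
	disk_find_zone_wplug(struct gendisk *disk, unsigned int zone_no)
	{
		struct blk_zone_wplug *zwplug;
		unsigned int idx =
			hash_32(zone_no, disk->zone_wplugs_hash_bits);

		/* Document the locking assumption instead of leaving it implicit. */
		lockdep_assert_held(&disk->zone_wplugs_lock);

		hlist_for_each_entry(zwplug, &disk->zone_wplugs_hash[idx], node)
			if (zwplug->zone_no == zone_no)
				return zwplug;
		return NULL;
	}
]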