Message ID | 20230920135439.929695-1-dlemoal@kernel.org |
---|---|
Headers | show |
Series | Fix libata suspend/resume handling and code cleanup | expand |
On Wed, Sep 20, 2023 at 10:54:17PM +0900, Damien Le Moal wrote: > The function ata_port_request_pm() checks the port flag > ATA_PFLAG_PM_PENDING and calls ata_port_wait_eh() if this flag is set to > ensure that power management operations for a port are not scheduled > simultaneously. However, this flag check is done without holding the > port lock. > > Fix this by taking the port lock on entry to the function and checking > the flag under this lock. The lock is released and re-taken if > ata_port_wait_eh() needs to be called. The two WARN_ON() macros checking > that the ATA_PFLAG_PM_PENDING flag was cleared are removed as the first > call is racy and the second one done without holding the port lock. > > Fixes: 5ef41082912b ("ata: add ata port system PM callbacks") > Cc: stable@vger.kernel.org > Signed-off-by: Damien Le Moal <dlemoal@kernel.org> > Reviewed-by: Hannes Reinecke <hare@suse.de> > Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com> > --- > drivers/ata/libata-core.c | 18 +++++++++--------- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c > index 0072e0f9ad39..732f3d0b4fd9 100644 > --- a/drivers/ata/libata-core.c > +++ b/drivers/ata/libata-core.c > @@ -5037,17 +5037,19 @@ static void ata_port_request_pm(struct ata_port *ap, pm_message_t mesg, > struct ata_link *link; > unsigned long flags; > > - /* Previous resume operation might still be in > - * progress. Wait for PM_PENDING to clear. > + spin_lock_irqsave(ap->lock, flags); > + > + /* > + * A previous PM operation might still be in progress. Wait for > + * ATA_PFLAG_PM_PENDING to clear. > */ > if (ap->pflags & ATA_PFLAG_PM_PENDING) { > + spin_unlock_irqrestore(ap->lock, flags); > ata_port_wait_eh(ap); > - WARN_ON(ap->pflags & ATA_PFLAG_PM_PENDING); > + spin_lock_irqsave(ap->lock, flags); > } > > - /* request PM ops to EH */ > - spin_lock_irqsave(ap->lock, flags); > - > + /* Request PM operation to EH */ > ap->pm_mesg = mesg; > ap->pflags |= ATA_PFLAG_PM_PENDING; > ata_for_each_link(link, ap, HOST_FIRST) { > @@ -5059,10 +5061,8 @@ static void ata_port_request_pm(struct ata_port *ap, pm_message_t mesg, > > spin_unlock_irqrestore(ap->lock, flags); > > - if (!async) { > + if (!async) > ata_port_wait_eh(ap); > - WARN_ON(ap->pflags & ATA_PFLAG_PM_PENDING); > - } > } > > /* > -- > 2.41.0 > Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
Hi Damien, On Wed, Sep 20, 2023 at 3:54 PM Damien Le Moal <dlemoal@kernel.org> wrote: > The first 9 patches of this series fix several issues with suspend/resume > power management operations in scsi and libata. The most significant > changes introduced are in patch 4 and 5, where the manage_start_stop > flag of scsi devices is split into the manage_system_start_stop and > manage_runtime_start_stop flags to allow keeping scsi runtime power > operations for spining up/down ATA devices but have libata do its own > system suspend/resume device power state management using EH. > > The remaining patches are code cleanup that do not introduce any > significant functional change. > > This series was tested on qemu and on various PCs and servers. I am > CC-ing people who recently reported issues with suspend/resume. > Additional testing would be much appreciated. > > Changes from v3: > * Corrected pathc 1 (typo in commit message and WARN_ON() removal) > * Changed path 3 as suggested by Niklas (moved definition of > ->slave_alloc) > * Rebased on rc2 > * Added review tags Thanks for the update! I gave this a try on Renesas Salvator-XS with R-Car H3 ES2.0 and a SATA hard drive: - The drive is spun up during system resume, - Accessing the drive after the system was resumed is instantaneous. Gr{oetje,eeting}s, Geert
On Thu, Sep 21, 2023 at 11:21 AM Geert Uytterhoeven <geert@linux-m68k.org> wrote: > On Wed, Sep 20, 2023 at 3:54 PM Damien Le Moal <dlemoal@kernel.org> wrote: > > The first 9 patches of this series fix several issues with suspend/resume > > power management operations in scsi and libata. The most significant > > changes introduced are in patch 4 and 5, where the manage_start_stop > > flag of scsi devices is split into the manage_system_start_stop and > > manage_runtime_start_stop flags to allow keeping scsi runtime power > > operations for spining up/down ATA devices but have libata do its own > > system suspend/resume device power state management using EH. > > > > The remaining patches are code cleanup that do not introduce any > > significant functional change. > > > > This series was tested on qemu and on various PCs and servers. I am > > CC-ing people who recently reported issues with suspend/resume. > > Additional testing would be much appreciated. > > > > Changes from v3: > > * Corrected pathc 1 (typo in commit message and WARN_ON() removal) > > * Changed path 3 as suggested by Niklas (moved definition of > > ->slave_alloc) > > * Rebased on rc2 > > * Added review tags > > Thanks for the update! > > I gave this a try on Renesas Salvator-XS with R-Car H3 ES2.0 and > a SATA hard drive: > - The drive is spun up during system resume, > - Accessing the drive after the system was resumed is instantaneous. Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Gr{oetje,eeting}s, Geert
On 20/09/2023 14:54, Damien Le Moal wrote: > +int ata_scsi_dev_alloc(struct scsi_device *sdev, struct ata_port *ap) nit: why not static? I could not see it used elsewhere. Indeed, I am not sure why it is not inlined in its only caller, ata_scsi_slave_alloc(). Thanks, John > +{ > + struct device_link *link; > + > + ata_scsi_sdev_config(sdev); > + > + /* > + * Create a link from the ata_port device to the scsi device to ensure > + * that PM does suspend/resume in the correct order: the scsi device is > + * consumer (child) and the ata port the supplier (parent). > + */ > + link = device_link_add(&sdev->sdev_gendev, &ap->tdev, > + DL_FLAG_STATELESS | > + DL_FLAG_PM_RUNTIME | DL_FLAG_RPM_ACTIVE); > + if (!link) { > + ata_port_err(ap, "Failed to create link to scsi device %s\n", > + dev_name(&sdev->sdev_gendev)); > + return -ENODEV; > + } > + > + return 0; > +} > + > +/** > + * ata_scsi_slave_alloc - Early setup of SCSI device > + * @sdev: SCSI device to examine > + * > + * This is called from scsi_alloc_sdev() when the scsi device > + * associated with an ATA device is scanned on a port. > + * > + * LOCKING: > + * Defined by SCSI layer. We don't really care. > + */ > + > +int ata_scsi_slave_alloc(struct scsi_device *sdev) > +{ > + return ata_scsi_dev_alloc(sdev, ata_shost_to_port(sdev->host)); > +} > +EXPORT_SYMBOL_GPL(ata_scsi_slave_alloc);
On 2023/09/21 2:22, Geert Uytterhoeven wrote: > On Thu, Sep 21, 2023 at 11:21 AM Geert Uytterhoeven > <geert@linux-m68k.org> wrote: >> On Wed, Sep 20, 2023 at 3:54 PM Damien Le Moal <dlemoal@kernel.org> wrote: >>> The first 9 patches of this series fix several issues with suspend/resume >>> power management operations in scsi and libata. The most significant >>> changes introduced are in patch 4 and 5, where the manage_start_stop >>> flag of scsi devices is split into the manage_system_start_stop and >>> manage_runtime_start_stop flags to allow keeping scsi runtime power >>> operations for spining up/down ATA devices but have libata do its own >>> system suspend/resume device power state management using EH. >>> >>> The remaining patches are code cleanup that do not introduce any >>> significant functional change. >>> >>> This series was tested on qemu and on various PCs and servers. I am >>> CC-ing people who recently reported issues with suspend/resume. >>> Additional testing would be much appreciated. >>> >>> Changes from v3: >>> * Corrected pathc 1 (typo in commit message and WARN_ON() removal) >>> * Changed path 3 as suggested by Niklas (moved definition of >>> ->slave_alloc) >>> * Rebased on rc2 >>> * Added review tags >> >> Thanks for the update! >> >> I gave this a try on Renesas Salvator-XS with R-Car H3 ES2.0 and >> a SATA hard drive: >> - The drive is spun up during system resume, >> - Accessing the drive after the system was resumed is instantaneous. > > Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Thanks Geert ! > > Gr{oetje,eeting}s, > > Geert >
On 2023/09/21 0:32, Sergey Shtylyov wrote: > On 9/20/23 4:54 PM, Damien Le Moal wrote: > >> The scsi disk driver does not resume disks that have been runtime >> suspended by the user. To be consistent with this behavior, do the same >> for ata ports and skip the PM request in ata_port_pm_resume() if the >> port was already runtime suspended. With this change, it is no longer >> necessary to for the PM state of the port to ACTIVE as the PM core code > ^^^^^^ > Can't parse this. :-) s/for/force Will fix that. > >> will take care of that when handling runtime resume. >> >> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> >> Reviewed-by: Hannes Reinecke <hare@suse.de> >> Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com> > [...] > > MBR, Sergey
On 2023/09/21 14:36, Bart Van Assche wrote: > On 9/20/23 06:54, Damien Le Moal wrote: >> If an error occurs when resuming a host adapter before the devices >> attached to the adapter are resumed, the adapter low level driver may >> remove the scsi host, resulting in a call to sd_remove() for the >> disks of the host. This in turn results in a call to sd_shutdown() which >> will issue a synchronize cache command and a start stop unit command to >> spindown the disk. sd_shutdown() issues the commands only if the device >> is not already suspended but does not check the power state for >> system-wide suspend/resume. That is, the commands may be issued with the >> device in a suspended state, which causes PM resume to hang, forcing a >> reset of the machine to recover. >> >> Fix this by not calling sd_shutdown() in sd_remove() if the device >> is not running. > > Hi Damien, > > I'd like to look into an alternative fix (after this patch series went > in) but I couldn't identify the call chain in the ATA resume code that > results in removal of the SCSI host. Can you please show me the call > chain that results in SCSI host removal if resuming fails? See the pm80xx driver for which I recently fixed a resume issue. That is how I found this problem with device removal: resuming the pm800xx HBA was failing and the driver then called scsi_remove_host() to drop the ports and that led to trying to removed sd devices that were still suspended. > > Thanks, > > Bart. >