[0/2] scsi: ufs: critical health condition

Message ID	20250203152735.825010-1-avri.altman@wdc.com

Headers

Received: from esa1.hgst.iphmx.com (esa1.hgst.iphmx.com [68.232.141.245])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5804E209F50;
	Mon, 3 Feb 2025 15:31:38 +0000 (UTC)
IronPort-SDR: 67a0d370_gRTJO95q/w26+yOXvrEWOSLqvd4v45r2Wq+GC/vuj8sP7M+
 oBm7Xac0CO6P10KlwolKRAXZVN0KQjwGoj81PSA==
WDCIronportException: Internal
From: Avri Altman <avri.altman@wdc.com>
To: "Martin K . Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Guenter Roeck <linux@roeck-us.net>,
	Bart Van Assche <bvanassche@acm.org>,
	Avri Altman <avri.altman@wdc.com>
Subject: [PATCH 0/2] scsi: ufs: critical health condition
Date: Mon,  3 Feb 2025 17:27:33 +0200
Message-Id: <20250203152735.825010-1-avri.altman@wdc.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series	scsi: ufs: critical health condition
  [0/2] scsi: ufs: critical health condition
  [1/2] scsi: ufs: hwmon: Prepare for more hwmon notifications
  [2/2] scsi: ufs: Add support for critical health notification

Message

Avri Altman Feb. 3, 2025, 3:27 p.m. UTC

Hi Martin,
The UFS 4.1 standard, released on January 8, 2025, adds several new
features. Among them is a new exception event, HEALTH_CRITICAL, which
notifies the host of a device's critical health condition. This
notification implies that the device is approaching the end of its
lifetime, based on the number of program/erase cycles performed.

We use the hwmon (hardware monitoring) subsystem to propagate this
information via the chip alarm channel.

Please consider this for the next merge window.

Thanks,
Avri

Avri Altman (2):
  scsi: ufs: hwmon: Prepare for more hwmon notifications
  scsi: ufs: Add support for critical health notification

 drivers/ufs/core/Kconfig       |  2 +-
 drivers/ufs/core/ufs-hwmon.c   | 12 ++++++++----
 drivers/ufs/core/ufshcd-priv.h |  8 ++++----
 drivers/ufs/core/ufshcd.c      | 31 ++++++++++++++++++++++++++-----
 include/ufs/ufs.h              |  1 +
 5 files changed, 40 insertions(+), 14 deletions(-)

Comments

Guenter Roeck Feb. 3, 2025, 4:36 p.m. UTC | #1

On 2/3/25 07:27, Avri Altman wrote:
> The UFS 4.1 standard, released on January 8, 2025, introduces several
> new features, including a new exception event: HEALTH_CRITICAL. This
> event notifies the host of a device's critical health condition,
> indicating that the device is approaching the end of its lifetime based
> on the number of program/erase cycles performed.
> 
> We utilize the hwmon (hardware monitoring) subsystem to propagate this
> information via the chip alarm channel.
> 

That is outside the scope of the hardware monitoring subsystem. The
"alarms" attribute is deprecated and must not be used in new drivers,
and it isn't actually implemented by this code.

I can't control what is submitted into the ufs code, but from a hardware
monitoring perspective this is a NACK.

Guenter

Guenter Roeck Feb. 3, 2025, 5:44 p.m. UTC | #2

On 2/3/25 09:25, Avri Altman wrote:
>> On 2/3/25 07:27, Avri Altman wrote:
>>> The UFS 4.1 standard, released on January 8, 2025, introduces several
>>> new features, including a new exception event: HEALTH_CRITICAL. This
>>> event notifies the host of a device's critical health condition,
>>> indicating that the device is approaching the end of its lifetime
>>> based on the number of program/erase cycles performed.
>>>
>>> We utilize the hwmon (hardware monitoring) subsystem to propagate this
>>> information via the chip alarm channel.
>>>
>>
>> That is outside the scope of the hardware monitoring subsystem, the
>> "alarms" attribute is deprecated and must not be used in new drivers, and it
>> isn't actually implemented by this code.
> OK.  Thanks for letting me know.
> Do you see any other path I can take within hwmon
> to let the upper stack / HAL know that the UFS device is reaching its EOL?
> Or should I look elsewhere?
> 

Again, this is not a hardware monitoring attribute. Normally I'd assume
that information like this is reported, for example, via smartctl or
whatever similar mechanism is available for ufs devices.

Just to give an example: smartctl reports for one of the nvme drives
in my system:

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    10,835,485 [5.54 TB]
Data Units Written:                 4,931,062 [2.52 TB]
Host Read Commands:                 149,936,032
Host Write Commands:                36,799,659
Controller Busy Time:               318
Power Cycles:                       12
Power On Hours:                     326
Unsafe Shutdowns:                   4
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               41 Celsius

Per your logic, all of that could be declared to be "hardware monitoring".
That simply doesn't make sense. All that information is reported by smartctl,
and it can and should be monitored using smartd or a similar tool. There is
no need to invent a new mechanism to do the same. If smartmontools don't
support ufs, such support should be added there, and not be pressed into
some unrelated kernel subsystem.

Thanks,
Guenter
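
For illustration only, the userspace monitoring approach described above can be sketched as a small script that reads smartctl-style "Key: value" text and flags wear-related conditions. This is a minimal sketch, not part of the patch series or of smartmontools: the function names parse_health and health_warnings and the used_limit policy value are hypothetical, and the sample text is a subset of the NVMe output quoted in the comment above.

```python
# Sketch: evaluate device wear from smartctl-style health output in userspace.
# All names and the 90% policy limit are assumptions made for this example.

SAMPLE = """\
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
"""


def parse_health(text):
    """Collect 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def _percent(value):
    """Convert a string such as '10%' to the integer 10."""
    return int(value.rstrip("%"))


def health_warnings(fields, used_limit=90):
    """Return a list of human-readable warnings; an empty list means healthy."""
    warnings = []
    # Any non-zero bit in the critical warning field deserves attention.
    if int(fields.get("Critical Warning", "0x00"), 16) != 0:
        warnings.append("critical warning flag set")
    # Spare capacity has dropped below the device-reported threshold.
    if _percent(fields["Available Spare"]) < _percent(fields["Available Spare Threshold"]):
        warnings.append("available spare below threshold")
    # Hypothetical policy: warn once the reported wear reaches used_limit percent.
    if _percent(fields["Percentage Used"]) >= used_limit:
        warnings.append("device approaching end of life")
    return warnings


if __name__ == "__main__":
    for message in health_warnings(parse_health(SAMPLE)):
        print(message)
```

With the sample values above, no warning is produced. A daemon such as smartd performs this kind of periodic check already, which is the point of the comment: the data belongs in the storage health tooling rather than in hwmon.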