[RFC] devlink: health: add remediation type

Message ID	20210306024220.251721-1-kuba@kernel.org
State	New
Headers	show Return-Path: <SRS0=vvDg=IE=vger.kernel.org=netdev-owner@kernel.org> From: Jakub Kicinski <kuba@kernel.org> To: netdev@vger.kernel.org Cc: jiri@resnulli.us, saeedm@nvidia.com, andrew.gospodarek@broadcom.com, jacob.e.keller@intel.com, guglielmo.morandin@broadcom.com, eugenem@fb.com, eranbe@mellanox.com, Jakub Kicinski <kuba@kernel.org> Subject: [RFC] devlink: health: add remediation type Date: Fri, 5 Mar 2021 18:42:20 -0800 Message-Id: <20210306024220.251721-1-kuba@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[RFC] devlink: health: add remediation type \| expand [RFC] devlink: health: add remediation type

Jakub Kicinski March 6, 2021, 2:42 a.m. UTC

Currently devlink health does not give user any clear information
of what kind of remediation ->recover callback will perform. This
makes it difficult to understand the impact of enabling auto-
-remediation, and the severity of the error itself.

To allow users to make more informed decision, as well as stretch
the applicability of devlink health beyond what can be fixed by
resetting things - add a new remediation type attribute.

Note that we only allow one remediation type per reporter, this
is intentional. devlink health is not built for mixing issues
of different severity into one reporter since it only maintains
one dump, of the first event and a single error counter.
Nudging vendors towards categorizing issues beyond coarse
groups is an added bonus.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/devlink.h        |  2 ++
 include/uapi/linux/devlink.h | 30 ++++++++++++++++++++++++++++++
 net/core/devlink.c           |  7 +++++--
 3 files changed, 37 insertions(+), 2 deletions(-)

Jakub Kicinski March 8, 2021, 5:16 p.m. UTC | #1

On Sun, 7 Mar 2021 17:59:58 +0200 Eran Ben Elisha wrote:
> On 3/6/2021 4:42 AM, Jakub Kicinski wrote:

> > Currently devlink health does not give user any clear information

> > of what kind of remediation ->recover callback will perform. This

> > makes it difficult to understand the impact of enabling auto-

> > -remediation, and the severity of the error itself.

> > 

> > To allow users to make more informed decision, as well as stretch

> > the applicability of devlink health beyond what can be fixed by

> > resetting things - add a new remediation type attribute.

> > 

> > Note that we only allow one remediation type per reporter, this

> > is intentional. devlink health is not built for mixing issues

> > of different severity into one reporter since it only maintains

> > one dump, of the first event and a single error counter.

> > Nudging vendors towards categorizing issues beyond coarse

> > groups is an added bonus.

> > 

> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>

> > ---

> >   include/net/devlink.h        |  2 ++

> >   include/uapi/linux/devlink.h | 30 ++++++++++++++++++++++++++++++

> >   net/core/devlink.c           |  7 +++++--

> >   3 files changed, 37 insertions(+), 2 deletions(-)

> > 

> > diff --git a/include/net/devlink.h b/include/net/devlink.h

> > index 853420db5d32..70b5fd6a8c0d 100644

> > --- a/include/net/devlink.h

> > +++ b/include/net/devlink.h

> > @@ -664,6 +664,7 @@ enum devlink_health_reporter_state {

> >   /**

> >    * struct devlink_health_reporter_ops - Reporter operations

> >    * @name: reporter name

> > + * remedy: severity of the remediation required

> >    * @recover: callback to recover from reported error

> >    *           if priv_ctx is NULL, run a full recover

> >    * @dump: callback to dump an object

> > @@ -674,6 +675,7 @@ enum devlink_health_reporter_state {

> >   

> >   struct devlink_health_reporter_ops {

> >   	char *name;

> > +	enum devlink_health_reporter_remedy remedy;

> >   	int (*recover)(struct devlink_health_reporter *reporter,

> >   		       void *priv_ctx, struct netlink_ext_ack *extack);

> >   	int (*dump)(struct devlink_health_reporter *reporter,

> > diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h

> > index f6008b2fa60f..bd490c5536b1 100644

> > --- a/include/uapi/linux/devlink.h

> > +++ b/include/uapi/linux/devlink.h

> > @@ -534,6 +534,9 @@ enum devlink_attr {

> >   	DEVLINK_ATTR_RELOAD_ACTION_STATS,       /* nested */

> >   

> >   	DEVLINK_ATTR_PORT_PCI_SF_NUMBER,	/* u32 */

> > +

> > +	DEVLINK_ATTR_HEALTH_REPORTER_REMEDY,	/* u32 */

> > +

> >   	/* add new attributes above here, update the policy in devlink.c */

> >   

> >   	__DEVLINK_ATTR_MAX,

> > @@ -608,4 +611,31 @@ enum devlink_port_fn_opstate {

> >   	DEVLINK_PORT_FN_OPSTATE_ATTACHED,

> >   };

> >   

> > +/**

> > + * enum devlink_health_reporter_remedy - severity of remediation procedure

> > + * @DLH_REMEDY_NONE: transient error, no remediation required  

> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local 

> un-harmful fix attempt.

> (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover())


Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall
detected, resumed by re-trigerring processing, without reset?

> > + * @DLH_REMEDY_COMP_RESET: associated device component (e.g. device queue)

> > + *			will be reset

> > + * @DLH_REMEDY_RESET: full device reset, will result in temporary unavailability

> > + *			of the device, device configuration should not be lost

> > + * @DLH_REMEDY_REINIT: device will be reinitialized and configuration lost

> > + * @DLH_REMEDY_POWER_CYCLE: device requires a power cycle to recover

> > + * @DLH_REMEDY_REIMAGE: device needs to be reflashed

> > + * @DLH_REMEDY_BAD_PART: indication of failing hardware, device needs to be

> > + *			replaced

> > + *

> > + * Used in %DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, categorizes the health reporter

> > + * by the severity of the required remediation, and indicates the remediation

> > + * type to the user if it can't be applied automatically (e.g. "reimage").

> > + */  

> The assumption here is that a reporter's recovery function has one 

> remedy. But it can have few remedies and escalate between them. Did you 

> consider a bitmask?


Yes, I tried to explain in the commit message. If we wanted to support
escalating remediations we'd also need separate counters etc. I think
having a health reporter per remediation should actually work fairly
well.

The major case where escalations would be immediately useful would be
cases where normal remediation failed therefore we need a hard reset.
But in those cases I think having the health reporter in a failed state
should be a sufficient indication to automation to take a machine out
of service and power cycle it.

> > +enum devlink_health_reporter_remedy {

> > +	DLH_REMEDY_NONE = 1,

> > +	DLH_REMEDY_COMP_RESET,

> > +	DLH_REMEDY_RESET,

> > +	DLH_REMEDY_REINIT,

> > +	DLH_REMEDY_POWER_CYCLE,

> > +	DLH_REMEDY_REIMAGE,  

> In general, I don't expect a reported to perform POWER_CYCLE or REIMAGE 

> as part of the recovery.


Right, these are extending the use of health reporters beyond what can
be remediated automatically.

> > +	DLH_REMEDY_BAD_PART,  

> BAD_PART probably indicates that the reporter (or any command line 

> execution) cannot recover the issue.

> As the suggested remedy is static per reporter's recover method, it 

> doesn't make sense for one to set a recover method that by design cannot 

> recover successfully.

> 

> Maybe we should extend devlink_health_reporter_state with POWER_CYCLE, 

> REIMAGE and BAD_PART? To indicate the user that for a successful 

> recovery, it should run a non-devlink-health operation?


Hm, export and extend devlink_health_reporter_state? I like that idea.

> > diff --git a/net/core/devlink.c b/net/core/devlink.c

> > index 737b61c2976e..73eb665070b9 100644

> > --- a/net/core/devlink.c

> > +++ b/net/core/devlink.c

> > @@ -6095,7 +6095,8 @@ __devlink_health_reporter_create(struct devlink *devlink,

> >   {

> >   	struct devlink_health_reporter *reporter;

> >   

> > -	if (WARN_ON(graceful_period && !ops->recover))

> > +	if (WARN_ON(graceful_period && !ops->recover) ||

> > +	    WARN_ON(!ops->remedy))  

> Here you fail every reported that doesn't have remedy set, not only the 

> ones with recovery callback


Right, DLH_REMEDY_NONE doesn't require a callback. Some health
issues can be remedied by the device e.g. ECC errors or something
along those lines. Obviously non-RFC patch will have to come with
driver changes.

> >   		return ERR_PTR(-EINVAL);

> >   

> >   	reporter = kzalloc(sizeof(*reporter), GFP_KERNEL);

> > @@ -6263,7 +6264,9 @@ devlink_nl_health_reporter_fill(struct sk_buff *msg,

> >   	if (!reporter_attr)

> >   		goto genlmsg_cancel;

> >   	if (nla_put_string(msg, DEVLINK_ATTR_HEALTH_REPORTER_NAME,

> > -			   reporter->ops->name))

> > +			   reporter->ops->name) ||

> > +	    nla_put_u32(msg, DEVLINK_ATTR_HEALTH_REPORTER_REMEDY,

> > +			reporter->ops->remedy))  

> Why not new if clause like all other attributes later.


Eh.

> >   		goto reporter_nest_cancel;

> >   	if (nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_STATE,

> >   		       reporter->health_state))

> >

Jakub Kicinski March 8, 2021, 5:59 p.m. UTC | #2

On Mon, 8 Mar 2021 09:16:00 -0800 Jakub Kicinski wrote:
> > > +	DLH_REMEDY_BAD_PART,    

> > BAD_PART probably indicates that the reporter (or any command line 

> > execution) cannot recover the issue.

> > As the suggested remedy is static per reporter's recover method, it 

> > doesn't make sense for one to set a recover method that by design cannot 

> > recover successfully.

> > 

> > Maybe we should extend devlink_health_reporter_state with POWER_CYCLE, 

> > REIMAGE and BAD_PART? To indicate the user that for a successful 

> > recovery, it should run a non-devlink-health operation?  

> 

> Hm, export and extend devlink_health_reporter_state? I like that idea.

Trying to type it up it looks less pretty than expected.

Let's looks at some examples.

A queue reporter, say "rx", resets the queue dropping all outstanding
buffers. As previously mentioned when the normal remediation fails user
is expected to power cycle the machine or maybe swap the card. The
device itself does not have a crystal ball.

A management FW reporter "fw", has a auto recovery of FW reset
(REMEDY_RESET). On failure -> power cycle.

An "io" reporter (PCI link had to be trained down) can only return 
a hardware failure (we should probably have a HW failure other than
BAD_PART for this).

Flash reporters - the device will know if the flash had a bad block 
or the entire part is bad, so probably can have 2 reporters for this.

Most of the reporters would only report one "action" that can be
performed to fix them. The cartesian product of ->recovery types vs
manual recovery does not seem necessary. And drivers would get bloated
with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for
_all_ cases with ->recovery. Because what else would the fix be if
software-initiated reset didn't work?

Eran Ben Elisha March 9, 2021, 2:06 p.m. UTC | #3

On 3/8/2021 7:59 PM, Jakub Kicinski wrote:
> On Mon, 8 Mar 2021 09:16:00 -0800 Jakub Kicinski wrote:

>>>> +	DLH_REMEDY_BAD_PART,

>>> BAD_PART probably indicates that the reporter (or any command line

>>> execution) cannot recover the issue.

>>> As the suggested remedy is static per reporter's recover method, it

>>> doesn't make sense for one to set a recover method that by design cannot

>>> recover successfully.

>>>

>>> Maybe we should extend devlink_health_reporter_state with POWER_CYCLE,

>>> REIMAGE and BAD_PART? To indicate the user that for a successful

>>> recovery, it should run a non-devlink-health operation?

>>

>> Hm, export and extend devlink_health_reporter_state? I like that idea.

> 

> Trying to type it up it looks less pretty than expected.

> 

> Let's looks at some examples.

> 

> A queue reporter, say "rx", resets the queue dropping all outstanding

> buffers. As previously mentioned when the normal remediation fails user

> is expected to power cycle the machine or maybe swap the card. The

> device itself does not have a crystal ball.


Not sure, reopen the queue, or reinit the driver might also be good in 
case of issue in the SW/HW queue context for example. But I agree that 
RX reporter can't tell from its perspective what further escalation is 
needed in case its local defined operations failed.

> 

> A management FW reporter "fw", has a auto recovery of FW reset

> (REMEDY_RESET). On failure -> power cycle.

> 

> An "io" reporter (PCI link had to be trained down) can only return

> a hardware failure (we should probably have a HW failure other than

> BAD_PART for this).

> 

> Flash reporters - the device will know if the flash had a bad block

> or the entire part is bad, so probably can have 2 reporters for this.

> 

> Most of the reporters would only report one "action" that can be

> performed to fix them. The cartesian product of ->recovery types vs

> manual recovery does not seem necessary. And drivers would get bloated

> with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for

> _all_ cases with ->recovery. Because what else would the fix be if

> software-initiated reset didn't work?

> 


OK, I see your point.

If I got you right, this is the conclusions so far:
1. Each reporter with recover callback will have to supply a remedy 
definition.
2. We shouldn't have POWER_CYCLE, REIMAGE and BAD_PART as a remedy, 
because these are not valid reporter recover flows in any case.
3. If a reporter will fail to recover, its status shall remain as error, 
and it is out of the reporter's scope to advise the administrator on 
further actions.

Eran Ben Elisha March 9, 2021, 2:18 p.m. UTC | #4

On 3/8/2021 7:16 PM, Jakub Kicinski wrote:
> On Sun, 7 Mar 2021 17:59:58 +0200 Eran Ben Elisha wrote:

>> On 3/6/2021 4:42 AM, Jakub Kicinski wrote:

>>> Currently devlink health does not give user any clear information

>>> of what kind of remediation ->recover callback will perform. This

>>> makes it difficult to understand the impact of enabling auto-

>>> -remediation, and the severity of the error itself.

>>>

>>> To allow users to make more informed decision, as well as stretch

>>> the applicability of devlink health beyond what can be fixed by

>>> resetting things - add a new remediation type attribute.

>>>

>>> Note that we only allow one remediation type per reporter, this

>>> is intentional. devlink health is not built for mixing issues

>>> of different severity into one reporter since it only maintains

>>> one dump, of the first event and a single error counter.

>>> Nudging vendors towards categorizing issues beyond coarse

>>> groups is an added bonus.

>>>

>>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

>>> ---

>>>    include/net/devlink.h        |  2 ++

>>>    include/uapi/linux/devlink.h | 30 ++++++++++++++++++++++++++++++

>>>    net/core/devlink.c           |  7 +++++--

>>>    3 files changed, 37 insertions(+), 2 deletions(-)

>>>

>>> diff --git a/include/net/devlink.h b/include/net/devlink.h

>>> index 853420db5d32..70b5fd6a8c0d 100644

>>> --- a/include/net/devlink.h

>>> +++ b/include/net/devlink.h

>>> @@ -664,6 +664,7 @@ enum devlink_health_reporter_state {

>>>    /**

>>>     * struct devlink_health_reporter_ops - Reporter operations

>>>     * @name: reporter name

>>> + * remedy: severity of the remediation required

>>>     * @recover: callback to recover from reported error

>>>     *           if priv_ctx is NULL, run a full recover

>>>     * @dump: callback to dump an object

>>> @@ -674,6 +675,7 @@ enum devlink_health_reporter_state {

>>>    

>>>    struct devlink_health_reporter_ops {

>>>    	char *name;

>>> +	enum devlink_health_reporter_remedy remedy;

>>>    	int (*recover)(struct devlink_health_reporter *reporter,

>>>    		       void *priv_ctx, struct netlink_ext_ack *extack);

>>>    	int (*dump)(struct devlink_health_reporter *reporter,

>>> diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h

>>> index f6008b2fa60f..bd490c5536b1 100644

>>> --- a/include/uapi/linux/devlink.h

>>> +++ b/include/uapi/linux/devlink.h

>>> @@ -534,6 +534,9 @@ enum devlink_attr {

>>>    	DEVLINK_ATTR_RELOAD_ACTION_STATS,       /* nested */

>>>    

>>>    	DEVLINK_ATTR_PORT_PCI_SF_NUMBER,	/* u32 */

>>> +

>>> +	DEVLINK_ATTR_HEALTH_REPORTER_REMEDY,	/* u32 */

>>> +

>>>    	/* add new attributes above here, update the policy in devlink.c */

>>>    

>>>    	__DEVLINK_ATTR_MAX,

>>> @@ -608,4 +611,31 @@ enum devlink_port_fn_opstate {

>>>    	DEVLINK_PORT_FN_OPSTATE_ATTACHED,

>>>    };

>>>    

>>> +/**

>>> + * enum devlink_health_reporter_remedy - severity of remediation procedure

>>> + * @DLH_REMEDY_NONE: transient error, no remediation required

>> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local

>> un-harmful fix attempt.

>> (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover())

> 

> Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall

> detected, resumed by re-trigerring processing, without reset?


Sounds good.

> 

>>> + * @DLH_REMEDY_COMP_RESET: associated device component (e.g. device queue)

>>> + *			will be reset

>>> + * @DLH_REMEDY_RESET: full device reset, will result in temporary unavailability

>>> + *			of the device, device configuration should not be lost

>>> + * @DLH_REMEDY_REINIT: device will be reinitialized and configuration lost

>>> + * @DLH_REMEDY_POWER_CYCLE: device requires a power cycle to recover

>>> + * @DLH_REMEDY_REIMAGE: device needs to be reflashed

>>> + * @DLH_REMEDY_BAD_PART: indication of failing hardware, device needs to be

>>> + *			replaced

>>> + *

>>> + * Used in %DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, categorizes the health reporter

>>> + * by the severity of the required remediation, and indicates the remediation

>>> + * type to the user if it can't be applied automatically (e.g. "reimage").

>>> + */

>> The assumption here is that a reporter's recovery function has one

>> remedy. But it can have few remedies and escalate between them. Did you

>> consider a bitmask?

> 

> Yes, I tried to explain in the commit message. If we wanted to support

> escalating remediations we'd also need separate counters etc. I think

> having a health reporter per remediation should actually work fairly

> well.


That would require reporter's recovery procedure failure to trigger 
health flow for other reporter.
So we can find ourselves with 2 RX reporters, sharing the same diagnose 
and dump callbacks, and each has other recovery flow.
Seems a bit counterintuitive.

Maybe, per reporter, exposing a counter per each supported remedy is not 
that bad?

> 

> The major case where escalations would be immediately useful would be

> cases where normal remediation failed therefore we need a hard reset.

> But in those cases I think having the health reporter in a failed state

> should be a sufficient indication to automation to take a machine out

> of service and power cycle it.

> 

>>> +enum devlink_health_reporter_remedy {

>>> +	DLH_REMEDY_NONE = 1,

>>> +	DLH_REMEDY_COMP_RESET,

>>> +	DLH_REMEDY_RESET,

>>> +	DLH_REMEDY_REINIT,

>>> +	DLH_REMEDY_POWER_CYCLE,

>>> +	DLH_REMEDY_REIMAGE,

>> In general, I don't expect a reported to perform POWER_CYCLE or REIMAGE

>> as part of the recovery.

> 

> Right, these are extending the use of health reporters beyond what can

> be remediated automatically.

> 

>>> +	DLH_REMEDY_BAD_PART,

>> BAD_PART probably indicates that the reporter (or any command line

>> execution) cannot recover the issue.

>> As the suggested remedy is static per reporter's recover method, it

>> doesn't make sense for one to set a recover method that by design cannot

>> recover successfully.

>>

>> Maybe we should extend devlink_health_reporter_state with POWER_CYCLE,

>> REIMAGE and BAD_PART? To indicate the user that for a successful

>> recovery, it should run a non-devlink-health operation?

> 

> Hm, export and extend devlink_health_reporter_state? I like that idea.

> 

>>> diff --git a/net/core/devlink.c b/net/core/devlink.c

>>> index 737b61c2976e..73eb665070b9 100644

>>> --- a/net/core/devlink.c

>>> +++ b/net/core/devlink.c

>>> @@ -6095,7 +6095,8 @@ __devlink_health_reporter_create(struct devlink *devlink,

>>>    {

>>>    	struct devlink_health_reporter *reporter;

>>>    

>>> -	if (WARN_ON(graceful_period && !ops->recover))

>>> +	if (WARN_ON(graceful_period && !ops->recover) ||

>>> +	    WARN_ON(!ops->remedy))

>> Here you fail every reported that doesn't have remedy set, not only the

>> ones with recovery callback

> 

> Right, DLH_REMEDY_NONE doesn't require a callback. Some health

> issues can be remedied by the device e.g. ECC errors or something

> along those lines. Obviously non-RFC patch will have to come with

> driver changes.

> 

>>>    		return ERR_PTR(-EINVAL);

>>>    

>>>    	reporter = kzalloc(sizeof(*reporter), GFP_KERNEL);

>>> @@ -6263,7 +6264,9 @@ devlink_nl_health_reporter_fill(struct sk_buff *msg,

>>>    	if (!reporter_attr)

>>>    		goto genlmsg_cancel;

>>>    	if (nla_put_string(msg, DEVLINK_ATTR_HEALTH_REPORTER_NAME,

>>> -			   reporter->ops->name))

>>> +			   reporter->ops->name) ||

>>> +	    nla_put_u32(msg, DEVLINK_ATTR_HEALTH_REPORTER_REMEDY,

>>> +			reporter->ops->remedy))

>> Why not new if clause like all other attributes later.

> 

> Eh.

> 

Currently, each nla_put_* is under its own if clause.

>>>    		goto reporter_nest_cancel;

>>>    	if (nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_STATE,

>>>    		       reporter->health_state))

>>>    

>

Jakub Kicinski March 9, 2021, 10:52 p.m. UTC | #5

On Tue, 9 Mar 2021 16:18:58 +0200 Eran Ben Elisha wrote:
> >> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local

> >> un-harmful fix attempt.

> >> (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover())  

> > 

> > Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall

> > detected, resumed by re-trigerring processing, without reset?  

> 

> Sounds good.


FWIW I ended up calling it:

+ * @DLH_REMEDY_KICK: device stalled, processing will be re-triggered

> >> The assumption here is that a reporter's recovery function has one

> >> remedy. But it can have few remedies and escalate between them. Did you

> >> consider a bitmask?  

> > 

> > Yes, I tried to explain in the commit message. If we wanted to support

> > escalating remediations we'd also need separate counters etc. I think

> > having a health reporter per remediation should actually work fairly

> > well.  

> 

> That would require reporter's recovery procedure failure to trigger 

> health flow for other reporter.

> So we can find ourselves with 2 RX reporters, sharing the same diagnose 

> and dump callbacks, and each has other recovery flow.

> Seems a bit counterintuitive.


Let's talk about particular cases. Otherwise it's too easy to
misunderstand each other. I can't think of any practical case
where escalation makes sense.

> Maybe, per reporter, exposing a counter per each supported remedy is not 

> that bad?


It's a large change to the uAPI, and it makes vendors more likely 
to lump different problems under a single reporter (although I take
your point that it may cause over-splitting, but if we have to choose
between the two my preference is "too granular").

Jakub Kicinski March 9, 2021, 10:52 p.m. UTC | #6

On Tue, 9 Mar 2021 16:06:49 +0200 Eran Ben Elisha wrote:
> On 3/8/2021 7:59 PM, Jakub Kicinski wrote:

> >> Hm, export and extend devlink_health_reporter_state? I like that idea.  

> > 

> > Trying to type it up it looks less pretty than expected.

> > 

> > Let's looks at some examples.

> > 

> > A queue reporter, say "rx", resets the queue dropping all outstanding

> > buffers. As previously mentioned when the normal remediation fails user

> > is expected to power cycle the machine or maybe swap the card. The

> > device itself does not have a crystal ball.  

> 

> Not sure, reopen the queue, or reinit the driver might also be good in 

> case of issue in the SW/HW queue context for example. But I agree that 

> RX reporter can't tell from its perspective what further escalation is 

> needed in case its local defined operations failed.


Right, the point being if normal remediation fails collect a full
system dump and do the big hammer remediation (power cycle or reinit 
if user wants to try that).

> > A management FW reporter "fw", has a auto recovery of FW reset

> > (REMEDY_RESET). On failure -> power cycle.

> > 

> > An "io" reporter (PCI link had to be trained down) can only return

> > a hardware failure (we should probably have a HW failure other than

> > BAD_PART for this).

> > 

> > Flash reporters - the device will know if the flash had a bad block

> > or the entire part is bad, so probably can have 2 reporters for this.

> > 

> > Most of the reporters would only report one "action" that can be

> > performed to fix them. The cartesian product of ->recovery types vs

> > manual recovery does not seem necessary. And drivers would get bloated

> > with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for

> > _all_ cases with ->recovery. Because what else would the fix be if

> > software-initiated reset didn't work?

> 

> OK, I see your point.

> 

> If I got you right, this is the conclusions so far:

> 1. Each reporter with recover callback will have to supply a remedy 

> definition.

> 2. We shouldn't have POWER_CYCLE, REIMAGE and BAD_PART as a remedy, 

> because these are not valid reporter recover flows in any case.

> 3. If a reporter will fail to recover, its status shall remain as error, 

> and it is out of the reporter's scope to advise the administrator on 

> further actions.


I was actually intending to go back to the original proposal, mostly 
as is (plus he KICK).

Indeed the intent is that if local remediation fails or is unavailable
and reporter is in failed state - power cycle or other manual
intervention is needed. So we can drop the POWER_CYCLE remedy and leave
it implicit.

But how are you suggesting we handle BAD_PART and REIMAGE? Still
extending the health status or a separate mechanism than dl-health?

Jacob Keller March 9, 2021, 11:44 p.m. UTC | #7

On 3/9/2021 2:52 PM, Jakub Kicinski wrote:
> On Tue, 9 Mar 2021 16:18:58 +0200 Eran Ben Elisha wrote:

>>>> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local

>>>> un-harmful fix attempt.

>>>> (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover())  

>>>

>>> Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall

>>> detected, resumed by re-trigerring processing, without reset?  

>>

>> Sounds good.

> 

> FWIW I ended up calling it:

> 

> + * @DLH_REMEDY_KICK: device stalled, processing will be re-triggered

> 

>>>> The assumption here is that a reporter's recovery function has one

>>>> remedy. But it can have few remedies and escalate between them. Did you

>>>> consider a bitmask?  

>>>

>>> Yes, I tried to explain in the commit message. If we wanted to support

>>> escalating remediations we'd also need separate counters etc. I think

>>> having a health reporter per remediation should actually work fairly

>>> well.  

>>

>> That would require reporter's recovery procedure failure to trigger 

>> health flow for other reporter.

>> So we can find ourselves with 2 RX reporters, sharing the same diagnose 

>> and dump callbacks, and each has other recovery flow.

>> Seems a bit counterintuitive.

> 

> Let's talk about particular cases. Otherwise it's too easy to

> misunderstand each other. I can't think of any practical case

> where escalation makes sense.

> 

>> Maybe, per reporter, exposing a counter per each supported remedy is not 

>> that bad?

> 

> It's a large change to the uAPI, and it makes vendors more likely 

> to lump different problems under a single reporter (although I take

> your point that it may cause over-splitting, but if we have to choose

> between the two my preference is "too granular").

> 

I also prefer keeping it more granular and forcing only a single
"remedy" per reporter. If that remedy fails, I liked the thought of
possibly having some way to indicate possible "hammer" remedies as some
sort of extended status.

i.e. our reporter can try one known to be effective remedy
automatically, and then if it fails it could somehow report an extended
status that indicates "we still failed to recover, and we think the
problem might be fixed with RELOAD/REBOOT/REIMAGE"

But I would keep those 3 larger remedies that require user intervention
out of the set of regular remedies, and more as some other way to
indicate they might help?

I really don't think escalation makes a lot of sense because it's far
more complicated and as an administrator I am not sure I want a remedy
which could have larger impacts like resetting the device if that could
cause other issues...

Thanks,
Jake

[RFC] devlink: health: add remediation type

Commit Message

Comments

Patch