diff mbox series

bus: mhi: host: Address conflict between power_up and syserr

Message ID 20250306173226.857335-1-jeff.hugo@oss.qualcomm.com
State New
Headers show
Series bus: mhi: host: Address conflict between power_up and syserr | expand

Commit Message

Jeff Hugo March 6, 2025, 5:32 p.m. UTC
From: Jeffrey Hugo <quic_jhugo@quicinc.com>

mhi_async_power_up() enables IRQs, at which point we can receive a syserr
notification from the device.  The syserr notification queues a work item
that cannot execute until the pm_mutex is released.

If we receive a syserr notification at the right time during
mhi_async_power_up(), we will fail to initialize the device.

The syserr work item will be pending.  If mhi_async_power_up() detects the
syserr, it will handle it.  If the device is in PBL, then the PBL state
transition event will be queued, resulting in a work item after the
pending syserr work item.  Once mhi_async_power_up() releases the pm_mutex
the syserr work item can run.  It will blindly attempt to reset the MHI
state machine, which is the recovery action for syserr.  PBL/SBL are not
interrupt driven and will ignore the MHI Reset unless syserr is actively
advertised.  This will cause the syserr work item to timeout waiting for
Reset to be cleared, and will leave the host state in syserr processing.
The PBL transition work item will then run, and immediately fail because
syserr processing is not a valid state for PBL transition.

This leaves the device uninitialized.

This issue has a fairly unique signature in the kernel log:

[  909.803598] mhi mhi3: Requested to power ON
[  909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device.  Attempting to recover
[  909.803945] mhi mhi3: Power on setup success
[  911.808444] mhi mhi3: Device failed to exit MHI Reset state
[  911.808448] mhi mhi3: Device MHI is not in valid state

We cannot remove the syserr handling from mhi_async_power_up() because the
device may be in the syserr state, but we missed the notification as the
irq was fired before irqs were enabled.  We also can't queue the syserr
work item from mhi_async_power_up() if syserr is detected because that may
result in a duplicate work item, and cause the same issue since the
duplicate item will blindly issue MHI Reset even if syserr is no longer
active.

Instead, add a check in the syserr work item to make sure that the device
is in the syserr state if the device is in the PBL or SBL EEs.

Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
---
 drivers/bus/mhi/host/pm.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

Comments

Troy Hanson March 6, 2025, 6:47 p.m. UTC | #1
On 3/6/25 12:32 PM, Jeff Hugo wrote:
> From: Jeffrey Hugo <quic_jhugo@quicinc.com>
> 
> mhi_async_power_up() enables IRQs, at which point we can receive a syserr
> notification from the device.  The syserr notification queues a work item
> that cannot execute until the pm_mutex is released.
> 
> If we receive a syserr notification at the right time during
> mhi_async_power_up(), we will fail to initialize the device.
> 
> The syserr work item will be pending.  If mhi_async_power_up() detects the
> syserr, it will handle it.  If the device is in PBL, then the PBL state
> transition event will be queued, resulting in a work item after the
> pending syserr work item.  Once mhi_async_power_up() releases the pm_mutex
> the syserr work item can run.  It will blindly attempt to reset the MHI
> state machine, which is the recovery action for syserr.  PBL/SBL are not
> interrupt driven and will ignore the MHI Reset unless syserr is actively
> advertised.  This will cause the syserr work item to timeout waiting for
> Reset to be cleared, and will leave the host state in syserr processing.
> The PBL transition work item will then run, and immediately fail because
> syserr processing is not a valid state for PBL transition.
> 
> This leaves the device uninitialized.
> 
> This issue has a fairly unique signature in the kernel log:
> 
> [  909.803598] mhi mhi3: Requested to power ON
> [  909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device.  Attempting to recover
> [  909.803945] mhi mhi3: Power on setup success
> [  911.808444] mhi mhi3: Device failed to exit MHI Reset state
> [  911.808448] mhi mhi3: Device MHI is not in valid state
> 
> We cannot remove the syserr handling from mhi_async_power_up() because the
> device may be in the syserr state, but we missed the notification as the
> irq was fired before irqs were enabled.  We also can't queue the syserr
> work item from mhi_async_power_up() if syserr is detected because that may
> result in a duplicate work item, and cause the same issue since the
> duplicate item will blindly issue MHI Reset even if syserr is no longer
> active.
> 
> Instead, add a check in the syserr work item to make sure that the device
> is in the syserr state if the device is in the PBL or SBL EEs.
> 
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>

Reviewed-by: Troy Hanson <quic_thanson@quicinc.com>
Krishna Chaitanya Chundru March 13, 2025, 4:54 a.m. UTC | #2
On 3/6/2025 11:02 PM, Jeff Hugo wrote:
> From: Jeffrey Hugo <quic_jhugo@quicinc.com>
> 
> mhi_async_power_up() enables IRQs, at which point we can receive a syserr
> notification from the device.  The syserr notification queues a work item
> that cannot execute until the pm_mutex is released.
> 
> If we receive a syserr notification at the right time during
> mhi_async_power_up(), we will fail to initialize the device.
> 
> The syserr work item will be pending.  If mhi_async_power_up() detects the
> syserr, it will handle it.  If the device is in PBL, then the PBL state
> transition event will be queued, resulting in a work item after the
> pending syserr work item.  Once mhi_async_power_up() releases the pm_mutex
> the syserr work item can run.  It will blindly attempt to reset the MHI
> state machine, which is the recovery action for syserr.  PBL/SBL are not
> interrupt driven and will ignore the MHI Reset unless syserr is actively
> advertised.  This will cause the syserr work item to timeout waiting for
> Reset to be cleared, and will leave the host state in syserr processing.
> The PBL transition work item will then run, and immediately fail because
> syserr processing is not a valid state for PBL transition.
> 
> This leaves the device uninitialized.
> 
> This issue has a fairly unique signature in the kernel log:
> 
> [  909.803598] mhi mhi3: Requested to power ON
> [  909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device.  Attempting to recover
> [  909.803945] mhi mhi3: Power on setup success
> [  911.808444] mhi mhi3: Device failed to exit MHI Reset state
> [  911.808448] mhi mhi3: Device MHI is not in valid state
> 
> We cannot remove the syserr handling from mhi_async_power_up() because the
> device may be in the syserr state, but we missed the notification as the
> irq was fired before irqs were enabled.  We also can't queue the syserr
> work item from mhi_async_power_up() if syserr is detected because that may
> result in a duplicate work item, and cause the same issue since the
> duplicate item will blindly issue MHI Reset even if syserr is no longer
> active.
> 
> Instead, add a check in the syserr work item to make sure that the device
> is in the syserr state if the device is in the PBL or SBL EEs.
> 
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
> ---
>   drivers/bus/mhi/host/pm.c | 18 +++++++++++++++++-
>   1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c
> index 11c0e751f223..3dff0f932726 100644
> --- a/drivers/bus/mhi/host/pm.c
> +++ b/drivers/bus/mhi/host/pm.c
> @@ -602,6 +602,7 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
>   	struct mhi_cmd *mhi_cmd;
>   	struct mhi_event_ctxt *er_ctxt;
>   	struct device *dev = &mhi_cntrl->mhi_dev->dev;
> +	bool reset_device = false;
>   	int ret, i;
>   
>   	dev_dbg(dev, "Transitioning from PM state: %s to: %s\n",
> @@ -630,8 +631,23 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
>   	/* Wake up threads waiting for state transition */
>   	wake_up_all(&mhi_cntrl->state_event);
>   
> -	/* Trigger MHI RESET so that the device will not access host memory */
> +	/*
> +	 * Trigger MHI RESET so that the device will not access host memory.
> +	 * If the device is in PBL or SBL, it will only respond to RESET if
> +	 * the device is in SYSERR state.  SYSERR might already be cleared
> +	 * at this point.
> +	 */
>   	if (MHI_REG_ACCESS_VALID(prev_state)) {
> +		enum mhi_state cur_statemachine_state = mhi_get_mhi_state(mhi_cntrl);
> +		enum mhi_ee_type cur_ee = mhi_get_exec_env(mhi_cntrl);
> +
> +		if (cur_statemachine_state == MHI_STATE_SYS_ERR)
> +			reset_device = true;
> +		else if (cur_ee != MHI_EE_PBL && cur_ee != MHI_EE_SBL)
why do we need this check? Above check will be applicable when EE is in
MHI_EE_AMSS also. we can reset mhi only when current state is in SYS_ERR
irrespective of the current ee.

- Krishna Chaitanya.
> +			reset_device = true;
> +	}
> +
> +	if (reset_device) {
>   		u32 in_reset = -1;
>   		unsigned long timeout = msecs_to_jiffies(mhi_cntrl->timeout_ms);
>
Manivannan Sadhasivam March 14, 2025, 5:46 a.m. UTC | #3
On Thu, Mar 06, 2025 at 10:32:26AM -0700, Jeff Hugo wrote:
> From: Jeffrey Hugo <quic_jhugo@quicinc.com>
> 
> mhi_async_power_up() enables IRQs, at which point we can receive a syserr
> notification from the device.  The syserr notification queues a work item
> that cannot execute until the pm_mutex is released.
> 
> If we receive a syserr notification at the right time during
> mhi_async_power_up(), we will fail to initialize the device.
> 
> The syserr work item will be pending.  If mhi_async_power_up() detects the
> syserr, it will handle it.  If the device is in PBL, then the PBL state
> transition event will be queued, resulting in a work item after the
> pending syserr work item.  Once mhi_async_power_up() releases the pm_mutex
> the syserr work item can run.  It will blindly attempt to reset the MHI
> state machine, which is the recovery action for syserr.  PBL/SBL are not
> interrupt driven and will ignore the MHI Reset unless syserr is actively
> advertised.  This will cause the syserr work item to timeout waiting for
> Reset to be cleared, and will leave the host state in syserr processing.
> The PBL transition work item will then run, and immediately fail because
> syserr processing is not a valid state for PBL transition.
> 
> This leaves the device uninitialized.
> 
> This issue has a fairly unique signature in the kernel log:
> 
> [  909.803598] mhi mhi3: Requested to power ON
> [  909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device.  Attempting to recover
> [  909.803945] mhi mhi3: Power on setup success
> [  911.808444] mhi mhi3: Device failed to exit MHI Reset state
> [  911.808448] mhi mhi3: Device MHI is not in valid state
> 
> We cannot remove the syserr handling from mhi_async_power_up() because the
> device may be in the syserr state, but we missed the notification as the
> irq was fired before irqs were enabled.  We also can't queue the syserr
> work item from mhi_async_power_up() if syserr is detected because that may
> result in a duplicate work item, and cause the same issue since the
> duplicate item will blindly issue MHI Reset even if syserr is no longer
> active.
> 
> Instead, add a check in the syserr work item to make sure that the device
> is in the syserr state if the device is in the PBL or SBL EEs.
> 

Don't we need a Fixes tag?

> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
> ---
>  drivers/bus/mhi/host/pm.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c
> index 11c0e751f223..3dff0f932726 100644
> --- a/drivers/bus/mhi/host/pm.c
> +++ b/drivers/bus/mhi/host/pm.c
> @@ -602,6 +602,7 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
>  	struct mhi_cmd *mhi_cmd;
>  	struct mhi_event_ctxt *er_ctxt;
>  	struct device *dev = &mhi_cntrl->mhi_dev->dev;
> +	bool reset_device = false;
>  	int ret, i;
>  
>  	dev_dbg(dev, "Transitioning from PM state: %s to: %s\n",
> @@ -630,8 +631,23 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
>  	/* Wake up threads waiting for state transition */
>  	wake_up_all(&mhi_cntrl->state_event);
>  
> -	/* Trigger MHI RESET so that the device will not access host memory */
> +	/*
> +	 * Trigger MHI RESET so that the device will not access host memory.

Move this comment before 'if (reset_device)'.

> +	 * If the device is in PBL or SBL, it will only respond to RESET if
> +	 * the device is in SYSERR state.  SYSERR might already be cleared
> +	 * at this point.
> +	 */
>  	if (MHI_REG_ACCESS_VALID(prev_state)) {
> +		enum mhi_state cur_statemachine_state = mhi_get_mhi_state(mhi_cntrl);

s/cur_statemachine_state/cur_state

- Mani
Jeff Hugo March 14, 2025, 5:26 p.m. UTC | #4
On 3/12/2025 10:54 PM, Krishna Chaitanya Chundru wrote:
> 
> 
> On 3/6/2025 11:02 PM, Jeff Hugo wrote:
>> From: Jeffrey Hugo <quic_jhugo@quicinc.com>
>>
>> mhi_async_power_up() enables IRQs, at which point we can receive a syserr
>> notification from the device.  The syserr notification queues a work item
>> that cannot execute until the pm_mutex is released.
>>
>> If we receive a syserr notification at the right time during
>> mhi_async_power_up(), we will fail to initialize the device.
>>
>> The syserr work item will be pending.  If mhi_async_power_up() detects 
>> the
>> syserr, it will handle it.  If the device is in PBL, then the PBL state
>> transition event will be queued, resulting in a work item after the
>> pending syserr work item.  Once mhi_async_power_up() releases the 
>> pm_mutex
>> the syserr work item can run.  It will blindly attempt to reset the MHI
>> state machine, which is the recovery action for syserr.  PBL/SBL are not
>> interrupt driven and will ignore the MHI Reset unless syserr is actively
>> advertised.  This will cause the syserr work item to timeout waiting for
>> Reset to be cleared, and will leave the host state in syserr processing.
>> The PBL transition work item will then run, and immediately fail because
>> syserr processing is not a valid state for PBL transition.
>>
>> This leaves the device uninitialized.
>>
>> This issue has a fairly unique signature in the kernel log:
>>
>> [  909.803598] mhi mhi3: Requested to power ON
>> [  909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error 
>> received from device.  Attempting to recover
>> [  909.803945] mhi mhi3: Power on setup success
>> [  911.808444] mhi mhi3: Device failed to exit MHI Reset state
>> [  911.808448] mhi mhi3: Device MHI is not in valid state
>>
>> We cannot remove the syserr handling from mhi_async_power_up() because 
>> the
>> device may be in the syserr state, but we missed the notification as the
>> irq was fired before irqs were enabled.  We also can't queue the syserr
>> work item from mhi_async_power_up() if syserr is detected because that 
>> may
>> result in a duplicate work item, and cause the same issue since the
>> duplicate item will blindly issue MHI Reset even if syserr is no longer
>> active.
>>
>> Instead, add a check in the syserr work item to make sure that the device
>> is in the syserr state if the device is in the PBL or SBL EEs.
>>
>> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
>> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
>> ---
>>   drivers/bus/mhi/host/pm.c | 18 +++++++++++++++++-
>>   1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c
>> index 11c0e751f223..3dff0f932726 100644
>> --- a/drivers/bus/mhi/host/pm.c
>> +++ b/drivers/bus/mhi/host/pm.c
>> @@ -602,6 +602,7 @@ static void mhi_pm_sys_error_transition(struct 
>> mhi_controller *mhi_cntrl)
>>       struct mhi_cmd *mhi_cmd;
>>       struct mhi_event_ctxt *er_ctxt;
>>       struct device *dev = &mhi_cntrl->mhi_dev->dev;
>> +    bool reset_device = false;
>>       int ret, i;
>>       dev_dbg(dev, "Transitioning from PM state: %s to: %s\n",
>> @@ -630,8 +631,23 @@ static void mhi_pm_sys_error_transition(struct 
>> mhi_controller *mhi_cntrl)
>>       /* Wake up threads waiting for state transition */
>>       wake_up_all(&mhi_cntrl->state_event);
>> -    /* Trigger MHI RESET so that the device will not access host 
>> memory */
>> +    /*
>> +     * Trigger MHI RESET so that the device will not access host memory.
>> +     * If the device is in PBL or SBL, it will only respond to RESET if
>> +     * the device is in SYSERR state.  SYSERR might already be cleared
>> +     * at this point.
>> +     */
>>       if (MHI_REG_ACCESS_VALID(prev_state)) {
>> +        enum mhi_state cur_statemachine_state = 
>> mhi_get_mhi_state(mhi_cntrl);
>> +        enum mhi_ee_type cur_ee = mhi_get_exec_env(mhi_cntrl);
>> +
>> +        if (cur_statemachine_state == MHI_STATE_SYS_ERR)
>> +            reset_device = true;
>> +        else if (cur_ee != MHI_EE_PBL && cur_ee != MHI_EE_SBL)
> why do we need this check? Above check will be applicable when EE is in
> MHI_EE_AMSS also. we can reset mhi only when current state is in SYS_ERR
> irrespective of the current ee.

No, it is valid for the host can reset MHI at any time - not just in 
response to syserr.  Figure 3-1 (the state machine diagram) says that 
the RESET state can be entered from any other state when Host sets the 
RESET bit.

AMSS EE is not really affected here.  If AMSS sets syserr, or we are 
here because the host needs to do a reset, then we will issue RESET=1 
which is fine. (if syserr set or EE is not PBL/SBL)

However, as the comment states, PBL and SBL won't respond to RESET=1 
unless they are expecting it.  This will cause issues as Host will issue 
RESET, expect PBL/SBL to clear it, but they won't, and Host will stall 
thinking PBL/SBL are malfunctioning.  This is a side effect of using 
syserr to signal a device fatal error when PBL doesn't otherwise 
actually support MHI, and the Host offloading the syserr processing.

-Jeff
Jeff Hugo March 14, 2025, 5:47 p.m. UTC | #5
On 3/13/2025 11:46 PM, Manivannan Sadhasivam wrote:
> On Thu, Mar 06, 2025 at 10:32:26AM -0700, Jeff Hugo wrote:
>> From: Jeffrey Hugo <quic_jhugo@quicinc.com>
>>
>> mhi_async_power_up() enables IRQs, at which point we can receive a syserr
>> notification from the device.  The syserr notification queues a work item
>> that cannot execute until the pm_mutex is released.
>>
>> If we receive a syserr notification at the right time during
>> mhi_async_power_up(), we will fail to initialize the device.
>>
>> The syserr work item will be pending.  If mhi_async_power_up() detects the
>> syserr, it will handle it.  If the device is in PBL, then the PBL state
>> transition event will be queued, resulting in a work item after the
>> pending syserr work item.  Once mhi_async_power_up() releases the pm_mutex
>> the syserr work item can run.  It will blindly attempt to reset the MHI
>> state machine, which is the recovery action for syserr.  PBL/SBL are not
>> interrupt driven and will ignore the MHI Reset unless syserr is actively
>> advertised.  This will cause the syserr work item to timeout waiting for
>> Reset to be cleared, and will leave the host state in syserr processing.
>> The PBL transition work item will then run, and immediately fail because
>> syserr processing is not a valid state for PBL transition.
>>
>> This leaves the device uninitialized.
>>
>> This issue has a fairly unique signature in the kernel log:
>>
>> [  909.803598] mhi mhi3: Requested to power ON
>> [  909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device.  Attempting to recover
>> [  909.803945] mhi mhi3: Power on setup success
>> [  911.808444] mhi mhi3: Device failed to exit MHI Reset state
>> [  911.808448] mhi mhi3: Device MHI is not in valid state
>>
>> We cannot remove the syserr handling from mhi_async_power_up() because the
>> device may be in the syserr state, but we missed the notification as the
>> irq was fired before irqs were enabled.  We also can't queue the syserr
>> work item from mhi_async_power_up() if syserr is detected because that may
>> result in a duplicate work item, and cause the same issue since the
>> duplicate item will blindly issue MHI Reset even if syserr is no longer
>> active.
>>
>> Instead, add a check in the syserr work item to make sure that the device
>> is in the syserr state if the device is in the PBL or SBL EEs.
>>
> 
> Don't we need a Fixes tag?

I don't recall seeing documentation saying that Fixes tags are 
mandatory.  Yes, I agree, they are helpful and should exist.

I am finding it difficult to point to a single commit that I can say 
introduced this issue. I believe we started seeing it with "bus: mhi: 
host: Add MHI_PM_SYS_ERR_FAIL state", but I don't think that commit 
actually introduced this issue. It seems like a coincidence that the 
issue was first observed with that commit.  I suspect that this issue 
has been a problem since the introduction of MHI, but I am not confident 
since the relevant code paths have radically changed since then.

Given I don't feel confident in identifying a commit, I felt it was 
perhaps better to not list one at all.

Do you have any suggestions?

> 
>> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
>> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
>> ---
>>   drivers/bus/mhi/host/pm.c | 18 +++++++++++++++++-
>>   1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c
>> index 11c0e751f223..3dff0f932726 100644
>> --- a/drivers/bus/mhi/host/pm.c
>> +++ b/drivers/bus/mhi/host/pm.c
>> @@ -602,6 +602,7 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
>>   	struct mhi_cmd *mhi_cmd;
>>   	struct mhi_event_ctxt *er_ctxt;
>>   	struct device *dev = &mhi_cntrl->mhi_dev->dev;
>> +	bool reset_device = false;
>>   	int ret, i;
>>   
>>   	dev_dbg(dev, "Transitioning from PM state: %s to: %s\n",
>> @@ -630,8 +631,23 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
>>   	/* Wake up threads waiting for state transition */
>>   	wake_up_all(&mhi_cntrl->state_event);
>>   
>> -	/* Trigger MHI RESET so that the device will not access host memory */
>> +	/*
>> +	 * Trigger MHI RESET so that the device will not access host memory.
> 
> Move this comment before 'if (reset_device)'.

I'll move it, but seems a bit weird to have the explanation for the 
logic of the conditionals (particularly the ones added by this patch) 
after them in the code.

>> +	 * If the device is in PBL or SBL, it will only respond to RESET if
>> +	 * the device is in SYSERR state.  SYSERR might already be cleared
>> +	 * at this point.
>> +	 */
>>   	if (MHI_REG_ACCESS_VALID(prev_state)) {
>> +		enum mhi_state cur_statemachine_state = mhi_get_mhi_state(mhi_cntrl);
> 
> s/cur_statemachine_state/cur_state

We already have a cur_state in the function, that is generally used for 
Host state and here we are reading the Device state. I felt that 
avoiding mixing the meanings was perhaps in the interest of the reader, 
but reusing cur_state appears possible.
Manivannan Sadhasivam March 18, 2025, 10:37 a.m. UTC | #6
On Fri, Mar 14, 2025 at 11:47:43AM -0600, Jeff Hugo wrote:
> On 3/13/2025 11:46 PM, Manivannan Sadhasivam wrote:
> > On Thu, Mar 06, 2025 at 10:32:26AM -0700, Jeff Hugo wrote:
> > > From: Jeffrey Hugo <quic_jhugo@quicinc.com>
> > > 
> > > mhi_async_power_up() enables IRQs, at which point we can receive a syserr
> > > notification from the device.  The syserr notification queues a work item
> > > that cannot execute until the pm_mutex is released.
> > > 
> > > If we receive a syserr notification at the right time during
> > > mhi_async_power_up(), we will fail to initialize the device.
> > > 
> > > The syserr work item will be pending.  If mhi_async_power_up() detects the
> > > syserr, it will handle it.  If the device is in PBL, then the PBL state
> > > transition event will be queued, resulting in a work item after the
> > > pending syserr work item.  Once mhi_async_power_up() releases the pm_mutex
> > > the syserr work item can run.  It will blindly attempt to reset the MHI
> > > state machine, which is the recovery action for syserr.  PBL/SBL are not
> > > interrupt driven and will ignore the MHI Reset unless syserr is actively
> > > advertised.  This will cause the syserr work item to timeout waiting for
> > > Reset to be cleared, and will leave the host state in syserr processing.
> > > The PBL transition work item will then run, and immediately fail because
> > > syserr processing is not a valid state for PBL transition.
> > > 
> > > This leaves the device uninitialized.
> > > 
> > > This issue has a fairly unique signature in the kernel log:
> > > 
> > > [  909.803598] mhi mhi3: Requested to power ON
> > > [  909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device.  Attempting to recover
> > > [  909.803945] mhi mhi3: Power on setup success
> > > [  911.808444] mhi mhi3: Device failed to exit MHI Reset state
> > > [  911.808448] mhi mhi3: Device MHI is not in valid state
> > > 
> > > We cannot remove the syserr handling from mhi_async_power_up() because the
> > > device may be in the syserr state, but we missed the notification as the
> > > irq was fired before irqs were enabled.  We also can't queue the syserr
> > > work item from mhi_async_power_up() if syserr is detected because that may
> > > result in a duplicate work item, and cause the same issue since the
> > > duplicate item will blindly issue MHI Reset even if syserr is no longer
> > > active.
> > > 
> > > Instead, add a check in the syserr work item to make sure that the device
> > > is in the syserr state if the device is in the PBL or SBL EEs.
> > > 
> > 
> > Don't we need a Fixes tag?
> 
> I don't recall seeing documentation saying that Fixes tags are mandatory.

It is the standard practice to add the Fixes tag if the bug was introduced by a
specific commit. But...

> Yes, I agree, they are helpful and should exist.
> 
> I am finding it difficult to point to a single commit that I can say
> introduced this issue. I believe we started seeing it with "bus: mhi: host:
> Add MHI_PM_SYS_ERR_FAIL state", but I don't think that commit actually
> introduced this issue. It seems like a coincidence that the issue was first
> observed with that commit.  I suspect that this issue has been a problem
> since the introduction of MHI, but I am not confident since the relevant
> code paths have radically changed since then.
> 
> Given I don't feel confident in identifying a commit, I felt it was perhaps
> better to not list one at all.
> 
> Do you have any suggestions?
> 

Fine with me. In that case, it would have been helpful if mentioned in the
changelog area.

> > 
> > > Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> > > Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
> > > ---
> > >   drivers/bus/mhi/host/pm.c | 18 +++++++++++++++++-
> > >   1 file changed, 17 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c
> > > index 11c0e751f223..3dff0f932726 100644
> > > --- a/drivers/bus/mhi/host/pm.c
> > > +++ b/drivers/bus/mhi/host/pm.c
> > > @@ -602,6 +602,7 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
> > >   	struct mhi_cmd *mhi_cmd;
> > >   	struct mhi_event_ctxt *er_ctxt;
> > >   	struct device *dev = &mhi_cntrl->mhi_dev->dev;
> > > +	bool reset_device = false;
> > >   	int ret, i;
> > >   	dev_dbg(dev, "Transitioning from PM state: %s to: %s\n",
> > > @@ -630,8 +631,23 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
> > >   	/* Wake up threads waiting for state transition */
> > >   	wake_up_all(&mhi_cntrl->state_event);
> > > -	/* Trigger MHI RESET so that the device will not access host memory */
> > > +	/*
> > > +	 * Trigger MHI RESET so that the device will not access host memory.
> > 
> > Move this comment before 'if (reset_device)'.
> 
> I'll move it, but seems a bit weird to have the explanation for the logic of
> the conditionals (particularly the ones added by this patch) after them in
> the code.
> 

I thought that the comment is better placed where we really reset the device.

> > > +	 * If the device is in PBL or SBL, it will only respond to RESET if
> > > +	 * the device is in SYSERR state.  SYSERR might already be cleared
> > > +	 * at this point.
> > > +	 */
> > >   	if (MHI_REG_ACCESS_VALID(prev_state)) {
> > > +		enum mhi_state cur_statemachine_state = mhi_get_mhi_state(mhi_cntrl);
> > 
> > s/cur_statemachine_state/cur_state
> 
> We already have a cur_state in the function, that is generally used for Host
> state and here we are reading the Device state. I felt that avoiding mixing
> the meanings was perhaps in the interest of the reader, but reusing
> cur_state appears possible.
> 

It should be fine IMO.

- Mani
Jeff Hugo March 21, 2025, 4:24 p.m. UTC | #7
On 3/18/2025 4:37 AM, Manivannan Sadhasivam wrote:
> On Fri, Mar 14, 2025 at 11:47:43AM -0600, Jeff Hugo wrote:
>> On 3/13/2025 11:46 PM, Manivannan Sadhasivam wrote:
>>> On Thu, Mar 06, 2025 at 10:32:26AM -0700, Jeff Hugo wrote:
>>>> From: Jeffrey Hugo <quic_jhugo@quicinc.com>
>>>>
>>>> mhi_async_power_up() enables IRQs, at which point we can receive a syserr
>>>> notification from the device.  The syserr notification queues a work item
>>>> that cannot execute until the pm_mutex is released.
>>>>
>>>> If we receive a syserr notification at the right time during
>>>> mhi_async_power_up(), we will fail to initialize the device.
>>>>
>>>> The syserr work item will be pending.  If mhi_async_power_up() detects the
>>>> syserr, it will handle it.  If the device is in PBL, then the PBL state
>>>> transition event will be queued, resulting in a work item after the
>>>> pending syserr work item.  Once mhi_async_power_up() releases the pm_mutex
>>>> the syserr work item can run.  It will blindly attempt to reset the MHI
>>>> state machine, which is the recovery action for syserr.  PBL/SBL are not
>>>> interrupt driven and will ignore the MHI Reset unless syserr is actively
>>>> advertised.  This will cause the syserr work item to timeout waiting for
>>>> Reset to be cleared, and will leave the host state in syserr processing.
>>>> The PBL transition work item will then run, and immediately fail because
>>>> syserr processing is not a valid state for PBL transition.
>>>>
>>>> This leaves the device uninitialized.
>>>>
>>>> This issue has a fairly unique signature in the kernel log:
>>>>
>>>> [  909.803598] mhi mhi3: Requested to power ON
>>>> [  909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device.  Attempting to recover
>>>> [  909.803945] mhi mhi3: Power on setup success
>>>> [  911.808444] mhi mhi3: Device failed to exit MHI Reset state
>>>> [  911.808448] mhi mhi3: Device MHI is not in valid state
>>>>
>>>> We cannot remove the syserr handling from mhi_async_power_up() because the
>>>> device may be in the syserr state, but we missed the notification as the
>>>> irq was fired before irqs were enabled.  We also can't queue the syserr
>>>> work item from mhi_async_power_up() if syserr is detected because that may
>>>> result in a duplicate work item, and cause the same issue since the
>>>> duplicate item will blindly issue MHI Reset even if syserr is no longer
>>>> active.
>>>>
>>>> Instead, add a check in the syserr work item to make sure that the device
>>>> is in the syserr state if the device is in the PBL or SBL EEs.
>>>>
>>>
>>> Don't we need a Fixes tag?
>>
>> I don't recall seeing documentation saying that Fixes tags are mandatory.
> 
> It is the standard practice to add the Fixes tag if the bug was introduced by a
> specific commit. But...
> 
>> Yes, I agree, they are helpful and should exist.
>>
>> I am finding it difficult to point to a single commit that I can say
>> introduced this issue. I believe we started seeing it with "bus: mhi: host:
>> Add MHI_PM_SYS_ERR_FAIL state", but I don't think that commit actually
>> introduced this issue. It seems like a coincidence that the issue was first
>> observed with that commit.  I suspect that this issue has been a problem
>> since the introduction of MHI, but I am not confident since the relevant
>> code paths have radically changed since then.
>>
>> Given I don't feel confident in identifying a commit, I felt it was perhaps
>> better to not list one at all.
>>
>> Do you have any suggestions?
>>
> 
> Fine with me. In that case, it would have been helpful if mentioned in the
> changelog area.

Ah, that is a good idea. I'll add something with v2.
diff mbox series

Patch

diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c
index 11c0e751f223..3dff0f932726 100644
--- a/drivers/bus/mhi/host/pm.c
+++ b/drivers/bus/mhi/host/pm.c
@@ -602,6 +602,7 @@  static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
 	struct mhi_cmd *mhi_cmd;
 	struct mhi_event_ctxt *er_ctxt;
 	struct device *dev = &mhi_cntrl->mhi_dev->dev;
+	bool reset_device = false;
 	int ret, i;
 
 	dev_dbg(dev, "Transitioning from PM state: %s to: %s\n",
@@ -630,8 +631,23 @@  static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
 	/* Wake up threads waiting for state transition */
 	wake_up_all(&mhi_cntrl->state_event);
 
-	/* Trigger MHI RESET so that the device will not access host memory */
+	/*
+	 * Trigger MHI RESET so that the device will not access host memory.
+	 * If the device is in PBL or SBL, it will only respond to RESET if
+	 * the device is in SYSERR state.  SYSERR might already be cleared
+	 * at this point.
+	 */
 	if (MHI_REG_ACCESS_VALID(prev_state)) {
+		enum mhi_state cur_statemachine_state = mhi_get_mhi_state(mhi_cntrl);
+		enum mhi_ee_type cur_ee = mhi_get_exec_env(mhi_cntrl);
+
+		if (cur_statemachine_state == MHI_STATE_SYS_ERR)
+			reset_device = true;
+		else if (cur_ee != MHI_EE_PBL && cur_ee != MHI_EE_SBL)
+			reset_device = true;
+	}
+
+	if (reset_device) {
 		u32 in_reset = -1;
 		unsigned long timeout = msecs_to_jiffies(mhi_cntrl->timeout_ms);