mbox series

[v2,0/3] iommu/arm-smmu: adreno-smmu page fault handling

Message ID 20201124191600.2051751-1-jcrouse@codeaurora.org
Headers show
Series iommu/arm-smmu: adreno-smmu page fault handling | expand

Message

Jordan Crouse Nov. 24, 2020, 7:15 p.m. UTC
This is a stack to add an Adreno GPU specific handler for pagefaults. The first
patch starts by wiring up report_iommu_fault for arm-smmu. The next patch adds
a adreno-smmu-priv function hook to capture a handful of important debugging
registers such as TTBR0, CONTEXTIDR, FSYNR0 and others. This is used by the
third patch to print more detailed information on page fault such as the TTBR0
for the pagetable that caused the fault and the source of the fault as
determined by a combination of the FSYNR1 register and an internal GPU
register.

This code provides a solid base that we can expand on later for even more
extensive GPU side page fault debugging capabilities.

v2: Fix comment wording and function pointer check per Rob Clark

Jordan Crouse (3):
  iommu/arm-smmu: Add support for driver IOMMU fault handlers
  drm/msm: Add an adreno-smmu-priv callback to get pagefault info
  drm/msm: Improve the a6xx page fault handler

 drivers/gpu/drm/msm/adreno/a5xx_gpu.c      |  4 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c      | 76 +++++++++++++++++++++-
 drivers/gpu/drm/msm/msm_iommu.c            | 11 +++-
 drivers/gpu/drm/msm/msm_mmu.h              |  4 +-
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 19 ++++++
 drivers/iommu/arm/arm-smmu/arm-smmu.c      | 16 ++++-
 drivers/iommu/arm/arm-smmu/arm-smmu.h      |  2 +
 include/linux/adreno-smmu-priv.h           | 31 ++++++++-
 8 files changed, 151 insertions(+), 12 deletions(-)

Comments

Will Deacon Jan. 22, 2021, 12:41 p.m. UTC | #1
On Tue, Nov 24, 2020 at 12:15:58PM -0700, Jordan Crouse wrote:
> Call report_iommu_fault() to allow upper-level drivers to register their

> own fault handlers.

> 

> Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>

> ---

> 

>  drivers/iommu/arm/arm-smmu/arm-smmu.c | 16 +++++++++++++---

>  1 file changed, 13 insertions(+), 3 deletions(-)

> 

> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c

> index 0f28a8614da3..7fd18bbda8f5 100644

> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c

> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c

> @@ -427,6 +427,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

>  	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

>  	struct arm_smmu_device *smmu = smmu_domain->smmu;

>  	int idx = smmu_domain->cfg.cbndx;

> +	int ret;

>  

>  	fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);

>  	if (!(fsr & ARM_SMMU_FSR_FAULT))

> @@ -436,11 +437,20 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

>  	iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);

>  	cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));

>  

> -	dev_err_ratelimited(smmu->dev,

> -	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

> +	ret = report_iommu_fault(domain, dev, iova,

> +		fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);

> +

> +	if (ret == -ENOSYS)

> +		dev_err_ratelimited(smmu->dev,

> +		"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

>  			    fsr, iova, fsynr, cbfrsynra, idx);

>  

> -	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);

> +	/*

> +	 * If the iommu fault returns an error (except -ENOSYS) then assume that

> +	 * they will handle resuming on their own

> +	 */

> +	if (!ret || ret == -ENOSYS)

> +		arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);


Hmm, I don't grok this part. If the fault handler returned an error and
we don't clear the FSR, won't we just re-take the irq immediately? I think
it would be better to do this unconditionally, and print the "Unhandled
context fault" message for any non-zero value of ret.

Will
Robin Murphy Jan. 22, 2021, 12:53 p.m. UTC | #2
On 2021-01-22 12:41, Will Deacon wrote:
> On Tue, Nov 24, 2020 at 12:15:58PM -0700, Jordan Crouse wrote:

>> Call report_iommu_fault() to allow upper-level drivers to register their

>> own fault handlers.

>>

>> Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>

>> ---

>>

>>   drivers/iommu/arm/arm-smmu/arm-smmu.c | 16 +++++++++++++---

>>   1 file changed, 13 insertions(+), 3 deletions(-)

>>

>> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c

>> index 0f28a8614da3..7fd18bbda8f5 100644

>> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c

>> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c

>> @@ -427,6 +427,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

>>   	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

>>   	struct arm_smmu_device *smmu = smmu_domain->smmu;

>>   	int idx = smmu_domain->cfg.cbndx;

>> +	int ret;

>>   

>>   	fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);

>>   	if (!(fsr & ARM_SMMU_FSR_FAULT))

>> @@ -436,11 +437,20 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

>>   	iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);

>>   	cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));

>>   

>> -	dev_err_ratelimited(smmu->dev,

>> -	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

>> +	ret = report_iommu_fault(domain, dev, iova,

>> +		fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);

>> +

>> +	if (ret == -ENOSYS)

>> +		dev_err_ratelimited(smmu->dev,

>> +		"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

>>   			    fsr, iova, fsynr, cbfrsynra, idx);

>>   

>> -	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);

>> +	/*

>> +	 * If the iommu fault returns an error (except -ENOSYS) then assume that

>> +	 * they will handle resuming on their own

>> +	 */

>> +	if (!ret || ret == -ENOSYS)

>> +		arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);

> 

> Hmm, I don't grok this part. If the fault handler returned an error and

> we don't clear the FSR, won't we just re-take the irq immediately?


If we don't touch the FSR at all, yes. Even if we clear the fault 
indicator bits, the interrupt *might* remain asserted until a stalled 
transaction is actually resolved - that's that lovely IMP-DEF corner.

Robin.

> I think

> it would be better to do this unconditionally, and print the "Unhandled

> context fault" message for any non-zero value of ret.

> 

> Will

>
Jordan Crouse Jan. 25, 2021, 9:51 p.m. UTC | #3
On Fri, Jan 22, 2021 at 12:53:17PM +0000, Robin Murphy wrote:
> On 2021-01-22 12:41, Will Deacon wrote:

> >On Tue, Nov 24, 2020 at 12:15:58PM -0700, Jordan Crouse wrote:

> >>Call report_iommu_fault() to allow upper-level drivers to register their

> >>own fault handlers.

> >>

> >>Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>

> >>---

> >>

> >>  drivers/iommu/arm/arm-smmu/arm-smmu.c | 16 +++++++++++++---

> >>  1 file changed, 13 insertions(+), 3 deletions(-)

> >>

> >>diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c

> >>index 0f28a8614da3..7fd18bbda8f5 100644

> >>--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c

> >>+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c

> >>@@ -427,6 +427,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

> >>  	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

> >>  	struct arm_smmu_device *smmu = smmu_domain->smmu;

> >>  	int idx = smmu_domain->cfg.cbndx;

> >>+	int ret;

> >>  	fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);

> >>  	if (!(fsr & ARM_SMMU_FSR_FAULT))

> >>@@ -436,11 +437,20 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

> >>  	iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);

> >>  	cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));

> >>-	dev_err_ratelimited(smmu->dev,

> >>-	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

> >>+	ret = report_iommu_fault(domain, dev, iova,

> >>+		fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);

> >>+

> >>+	if (ret == -ENOSYS)

> >>+		dev_err_ratelimited(smmu->dev,

> >>+		"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

> >>  			    fsr, iova, fsynr, cbfrsynra, idx);

> >>-	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);

> >>+	/*

> >>+	 * If the iommu fault returns an error (except -ENOSYS) then assume that

> >>+	 * they will handle resuming on their own

> >>+	 */

> >>+	if (!ret || ret == -ENOSYS)

> >>+		arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);

> >

> >Hmm, I don't grok this part. If the fault handler returned an error and

> >we don't clear the FSR, won't we just re-take the irq immediately?

> 

> If we don't touch the FSR at all, yes. Even if we clear the fault indicator

> bits, the interrupt *might* remain asserted until a stalled transaction is

> actually resolved - that's that lovely IMP-DEF corner.

>

> Robin.

> 


This is for stall-on-fault. The idea is that if the developer chooses to do so
we would stall the GPU after a fault long enough to take a picture of it with
devcoredump and then release the FSR. Since we can't take the devcoredump from
the interrupt handler we schedule it in a worker and then return an error
to let the main handler know that we'll come back around clear the FSR later
when we are done.

It is assumed that we'll have to turn off interrupts in our handler to allow
this to work. Its all very implementation specific, but then again we're
assuming that if you want to do this then you know what you are doing.

In that spirit the error that skips the FSR should probably be something
specific instead of "all errors" - that way a well meaning handler that returns
a -EINVAL doesn't accidentally break itself.

Jordan

> >I think

> >it would be better to do this unconditionally, and print the "Unhandled

> >context fault" message for any non-zero value of ret.


> >

> >Will

> >


-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Robin Murphy Jan. 26, 2021, 11:40 a.m. UTC | #4
On 2021-01-25 21:51, Jordan Crouse wrote:
> On Fri, Jan 22, 2021 at 12:53:17PM +0000, Robin Murphy wrote:

>> On 2021-01-22 12:41, Will Deacon wrote:

>>> On Tue, Nov 24, 2020 at 12:15:58PM -0700, Jordan Crouse wrote:

>>>> Call report_iommu_fault() to allow upper-level drivers to register their

>>>> own fault handlers.

>>>>

>>>> Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>

>>>> ---

>>>>

>>>>   drivers/iommu/arm/arm-smmu/arm-smmu.c | 16 +++++++++++++---

>>>>   1 file changed, 13 insertions(+), 3 deletions(-)

>>>>

>>>> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c

>>>> index 0f28a8614da3..7fd18bbda8f5 100644

>>>> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c

>>>> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c

>>>> @@ -427,6 +427,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

>>>>   	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

>>>>   	struct arm_smmu_device *smmu = smmu_domain->smmu;

>>>>   	int idx = smmu_domain->cfg.cbndx;

>>>> +	int ret;

>>>>   	fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);

>>>>   	if (!(fsr & ARM_SMMU_FSR_FAULT))

>>>> @@ -436,11 +437,20 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

>>>>   	iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);

>>>>   	cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));

>>>> -	dev_err_ratelimited(smmu->dev,

>>>> -	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

>>>> +	ret = report_iommu_fault(domain, dev, iova,

>>>> +		fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);

>>>> +

>>>> +	if (ret == -ENOSYS)

>>>> +		dev_err_ratelimited(smmu->dev,

>>>> +		"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

>>>>   			    fsr, iova, fsynr, cbfrsynra, idx);

>>>> -	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);

>>>> +	/*

>>>> +	 * If the iommu fault returns an error (except -ENOSYS) then assume that

>>>> +	 * they will handle resuming on their own

>>>> +	 */

>>>> +	if (!ret || ret == -ENOSYS)

>>>> +		arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);

>>>

>>> Hmm, I don't grok this part. If the fault handler returned an error and

>>> we don't clear the FSR, won't we just re-take the irq immediately?

>>

>> If we don't touch the FSR at all, yes. Even if we clear the fault indicator

>> bits, the interrupt *might* remain asserted until a stalled transaction is

>> actually resolved - that's that lovely IMP-DEF corner.

>>

>> Robin.

>>

> 

> This is for stall-on-fault. The idea is that if the developer chooses to do so

> we would stall the GPU after a fault long enough to take a picture of it with

> devcoredump and then release the FSR. Since we can't take the devcoredump from

> the interrupt handler we schedule it in a worker and then return an error

> to let the main handler know that we'll come back around clear the FSR later

> when we are done.


Sure, but clearing FSR is not writing to RESUME to resolve the stalled 
transaction(s). You can already snarf the FSR contents from your 
report_iommu_fault() handler if you want to, so either way I don't see 
what's gained by not clearing it as expected at the point where we've 
handled the *interrupt*, even if it will take longer to decide what to 
do with the underlying *fault* that it signalled. I'm particularly not 
keen on having unusual behaviour in the core interrupt handling which 
callers may unwittingly trigger, for the sake of one 
very-very-driver-specific flow having a slightly richer debugging 
experience.

For actually *handling* faults, I thought we were going to need to hook 
up the new IOPF fault queue stuff anyway?

Robin.

> It is assumed that we'll have to turn off interrupts in our handler to allow

> this to work. Its all very implementation specific, but then again we're

> assuming that if you want to do this then you know what you are doing.

> 

> In that spirit the error that skips the FSR should probably be something

> specific instead of "all errors" - that way a well meaning handler that returns

> a -EINVAL doesn't accidentally break itself.

> 

> Jordan

> 

>>> I think

>>> it would be better to do this unconditionally, and print the "Unhandled

>>> context fault" message for any non-zero value of ret.

> 

>>>

>>> Will

>>>

>
Rob Clark Jan. 26, 2021, 4:05 p.m. UTC | #5
On Tue, Jan 26, 2021 at 3:41 AM Robin Murphy <robin.murphy@arm.com> wrote:
>

> On 2021-01-25 21:51, Jordan Crouse wrote:

> > On Fri, Jan 22, 2021 at 12:53:17PM +0000, Robin Murphy wrote:

> >> On 2021-01-22 12:41, Will Deacon wrote:

> >>> On Tue, Nov 24, 2020 at 12:15:58PM -0700, Jordan Crouse wrote:

> >>>> Call report_iommu_fault() to allow upper-level drivers to register their

> >>>> own fault handlers.

> >>>>

> >>>> Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>

> >>>> ---

> >>>>

> >>>>   drivers/iommu/arm/arm-smmu/arm-smmu.c | 16 +++++++++++++---

> >>>>   1 file changed, 13 insertions(+), 3 deletions(-)

> >>>>

> >>>> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c

> >>>> index 0f28a8614da3..7fd18bbda8f5 100644

> >>>> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c

> >>>> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c

> >>>> @@ -427,6 +427,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

> >>>>    struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

> >>>>    struct arm_smmu_device *smmu = smmu_domain->smmu;

> >>>>    int idx = smmu_domain->cfg.cbndx;

> >>>> +  int ret;

> >>>>    fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);

> >>>>    if (!(fsr & ARM_SMMU_FSR_FAULT))

> >>>> @@ -436,11 +437,20 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

> >>>>    iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);

> >>>>    cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));

> >>>> -  dev_err_ratelimited(smmu->dev,

> >>>> -  "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

> >>>> +  ret = report_iommu_fault(domain, dev, iova,

> >>>> +          fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);

> >>>> +

> >>>> +  if (ret == -ENOSYS)

> >>>> +          dev_err_ratelimited(smmu->dev,

> >>>> +          "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

> >>>>                        fsr, iova, fsynr, cbfrsynra, idx);

> >>>> -  arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);

> >>>> +  /*

> >>>> +   * If the iommu fault returns an error (except -ENOSYS) then assume that

> >>>> +   * they will handle resuming on their own

> >>>> +   */

> >>>> +  if (!ret || ret == -ENOSYS)

> >>>> +          arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);

> >>>

> >>> Hmm, I don't grok this part. If the fault handler returned an error and

> >>> we don't clear the FSR, won't we just re-take the irq immediately?

> >>

> >> If we don't touch the FSR at all, yes. Even if we clear the fault indicator

> >> bits, the interrupt *might* remain asserted until a stalled transaction is

> >> actually resolved - that's that lovely IMP-DEF corner.

> >>

> >> Robin.

> >>

> >

> > This is for stall-on-fault. The idea is that if the developer chooses to do so

> > we would stall the GPU after a fault long enough to take a picture of it with

> > devcoredump and then release the FSR. Since we can't take the devcoredump from

> > the interrupt handler we schedule it in a worker and then return an error

> > to let the main handler know that we'll come back around clear the FSR later

> > when we are done.

>

> Sure, but clearing FSR is not writing to RESUME to resolve the stalled

> transaction(s). You can already snarf the FSR contents from your

> report_iommu_fault() handler if you want to, so either way I don't see

> what's gained by not clearing it as expected at the point where we've

> handled the *interrupt*, even if it will take longer to decide what to

> do with the underlying *fault* that it signalled. I'm particularly not

> keen on having unusual behaviour in the core interrupt handling which

> callers may unwittingly trigger, for the sake of one

> very-very-driver-specific flow having a slightly richer debugging

> experience.


Tbf, "slightly" is an understatement.. it is a big enough improvement
that I've hacked up deferred resume several times to debug various
issues. ;-)

(Which is always a bit of a PITA because of things moving around in
arm-smmu as well as the drm side of things.)

But from my recollection, we can clear FSR immediately, all we need to
do is defer writing ARM_SMMU_CB_RESUME

BR,
-R

>

> For actually *handling* faults, I thought we were going to need to hook

> up the new IOPF fault queue stuff anyway?

>

> Robin.

>

> > It is assumed that we'll have to turn off interrupts in our handler to allow

> > this to work. Its all very implementation specific, but then again we're

> > assuming that if you want to do this then you know what you are doing.

> >

> > In that spirit the error that skips the FSR should probably be something

> > specific instead of "all errors" - that way a well meaning handler that returns

> > a -EINVAL doesn't accidentally break itself.

> >

> > Jordan

> >

> >>> I think

> >>> it would be better to do this unconditionally, and print the "Unhandled

> >>> context fault" message for any non-zero value of ret.

> >

> >>>

> >>> Will

> >>>

> >

> _______________________________________________

> iommu mailing list

> iommu@lists.linux-foundation.org

> https://lists.linuxfoundation.org/mailman/listinfo/iommu
Robin Murphy Jan. 26, 2021, 4:31 p.m. UTC | #6
On 2021-01-26 16:05, Rob Clark wrote:
> On Tue, Jan 26, 2021 at 3:41 AM Robin Murphy <robin.murphy@arm.com> wrote:

>>

>> On 2021-01-25 21:51, Jordan Crouse wrote:

>>> On Fri, Jan 22, 2021 at 12:53:17PM +0000, Robin Murphy wrote:

>>>> On 2021-01-22 12:41, Will Deacon wrote:

>>>>> On Tue, Nov 24, 2020 at 12:15:58PM -0700, Jordan Crouse wrote:

>>>>>> Call report_iommu_fault() to allow upper-level drivers to register their

>>>>>> own fault handlers.

>>>>>>

>>>>>> Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>

>>>>>> ---

>>>>>>

>>>>>>    drivers/iommu/arm/arm-smmu/arm-smmu.c | 16 +++++++++++++---

>>>>>>    1 file changed, 13 insertions(+), 3 deletions(-)

>>>>>>

>>>>>> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c

>>>>>> index 0f28a8614da3..7fd18bbda8f5 100644

>>>>>> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c

>>>>>> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c

>>>>>> @@ -427,6 +427,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

>>>>>>     struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

>>>>>>     struct arm_smmu_device *smmu = smmu_domain->smmu;

>>>>>>     int idx = smmu_domain->cfg.cbndx;

>>>>>> +  int ret;

>>>>>>     fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);

>>>>>>     if (!(fsr & ARM_SMMU_FSR_FAULT))

>>>>>> @@ -436,11 +437,20 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)

>>>>>>     iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);

>>>>>>     cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));

>>>>>> -  dev_err_ratelimited(smmu->dev,

>>>>>> -  "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

>>>>>> +  ret = report_iommu_fault(domain, dev, iova,

>>>>>> +          fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);

>>>>>> +

>>>>>> +  if (ret == -ENOSYS)

>>>>>> +          dev_err_ratelimited(smmu->dev,

>>>>>> +          "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",

>>>>>>                         fsr, iova, fsynr, cbfrsynra, idx);

>>>>>> -  arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);

>>>>>> +  /*

>>>>>> +   * If the iommu fault returns an error (except -ENOSYS) then assume that

>>>>>> +   * they will handle resuming on their own

>>>>>> +   */

>>>>>> +  if (!ret || ret == -ENOSYS)

>>>>>> +          arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);

>>>>>

>>>>> Hmm, I don't grok this part. If the fault handler returned an error and

>>>>> we don't clear the FSR, won't we just re-take the irq immediately?

>>>>

>>>> If we don't touch the FSR at all, yes. Even if we clear the fault indicator

>>>> bits, the interrupt *might* remain asserted until a stalled transaction is

>>>> actually resolved - that's that lovely IMP-DEF corner.

>>>>

>>>> Robin.

>>>>

>>>

>>> This is for stall-on-fault. The idea is that if the developer chooses to do so

>>> we would stall the GPU after a fault long enough to take a picture of it with

>>> devcoredump and then release the FSR. Since we can't take the devcoredump from

>>> the interrupt handler we schedule it in a worker and then return an error

>>> to let the main handler know that we'll come back around clear the FSR later

>>> when we are done.

>>

>> Sure, but clearing FSR is not writing to RESUME to resolve the stalled

>> transaction(s). You can already snarf the FSR contents from your

>> report_iommu_fault() handler if you want to, so either way I don't see

>> what's gained by not clearing it as expected at the point where we've

>> handled the *interrupt*, even if it will take longer to decide what to

>> do with the underlying *fault* that it signalled. I'm particularly not

>> keen on having unusual behaviour in the core interrupt handling which

>> callers may unwittingly trigger, for the sake of one

>> very-very-driver-specific flow having a slightly richer debugging

>> experience.

> 

> Tbf, "slightly" is an understatement.. it is a big enough improvement

> that I've hacked up deferred resume several times to debug various

> issues. ;-)


Oh, fear not, I fully appreciate that keeping the GPU stalled on a 
faulting transaction is a game-changer in itself ("almost like a real 
MMU!"). That comment was only aimed at whatever the perceived benefits 
are of deliberately not trying to clear the SMMU interrupt (even if it 
*would* stay clear). I have no issue with calling report_iommu_fault(), 
I'm just wary of doing anything weird with the result.

> (Which is always a bit of a PITA because of things moving around in

> arm-smmu as well as the drm side of things.)

> 

> But from my recollection, we can clear FSR immediately, all we need to

> do is defer writing ARM_SMMU_CB_RESUME


Phew! Thanks for the reassurance :)

Robin.

> 

> BR,

> -R

> 

>>

>> For actually *handling* faults, I thought we were going to need to hook

>> up the new IOPF fault queue stuff anyway?

>>

>> Robin.

>>

>>> It is assumed that we'll have to turn off interrupts in our handler to allow

>>> this to work. Its all very implementation specific, but then again we're

>>> assuming that if you want to do this then you know what you are doing.

>>>

>>> In that spirit the error that skips the FSR should probably be something

>>> specific instead of "all errors" - that way a well meaning handler that returns

>>> a -EINVAL doesn't accidentally break itself.

>>>

>>> Jordan

>>>

>>>>> I think

>>>>> it would be better to do this unconditionally, and print the "Unhandled

>>>>> context fault" message for any non-zero value of ret.

>>>

>>>>>

>>>>> Will

>>>>>

>>>

>> _______________________________________________

>> iommu mailing list

>> iommu@lists.linux-foundation.org

>> https://lists.linuxfoundation.org/mailman/listinfo/iommu