Message ID | 20250122-msm-gpu-fault-fixes-next-v3-0-0afa00158521@gmail.com |
---|---|
Headers | show |
Series | iommu/arm-smmu, drm/msm: Fixes for stall-on-fault | expand |
On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote: > @@ -125,12 +125,25 @@ static void qcom_adreno_smmu_resume_translation(const void *cookie, bool termina > struct arm_smmu_domain *smmu_domain = (void *)cookie; > struct arm_smmu_cfg *cfg = &smmu_domain->cfg; > struct arm_smmu_device *smmu = smmu_domain->smmu; > - u32 reg = 0; > + u32 reg = 0, sctlr; > + unsigned long flags; > > if (terminate) > reg |= ARM_SMMU_RESUME_TERMINATE; > > + spin_lock_irqsave(&smmu_domain->cb_lock, flags); > + > arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg); > + At this point further transaction can be processed but SCTLR.CFIE is cleared so subequent context fault will not generate interrupt till SCTLR.CFIE is set. > + /* > + * Re-enable interrupts after they were disabled by > + * arm_smmu_context_fault(). > + */ > + sctlr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR); > + sctlr |= ARM_SMMU_SCTLR_CFIE; > + arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR, sctlr); > + > + spin_unlock_irqrestore(&smmu_domain->cb_lock, flags); > } > > static void qcom_adreno_smmu_set_prr_bit(const void *cookie, bool set) > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c > index 79afc92e1d8b984dd35c469a3f283ad0c78f3d26..ca1ff59015a63912f0f9c5256452b2b2efa928f1 100644 > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c > @@ -463,13 +463,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev) > if (!(cfi.fsr & ARM_SMMU_CB_FSR_FAULT)) > return IRQ_NONE; > > + /* > + * On some implementations FSR.SS asserts a context fault > + * interrupt. We do not want this behavior, because resolving the > + * original context fault typically requires operations that cannot be > + * performed in IRQ context but leaving the stall unacknowledged will > + * immediately lead to another spurious interrupt as FSR.SS is still > + * set. Work around this by disabling interrupts for this context bank. > + * It's expected that interrupts are re-enabled after resuming the > + * translation. > + * > + * We have to do this before report_iommu_fault() so that we don't > + * leave interrupts disabled in case the downstream user decides the > + * fault can be resolved inside its fault handler. > + * > + * There is a possible race if there are multiple context banks sharing > + * the same interrupt and both signal an interrupt in between writing > + * RESUME and SCTLR. We could disable interrupts here before we > + * re-enable them in the resume handler, leaving interrupts enabled. > + * Lock the write to serialize it with the resume handler. > + */ > + if (cfi.fsr & ARM_SMMU_CB_FSR_SS) { > + u32 val; > + > + spin_lock(&smmu_domain->cb_lock); > + val = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_SCTLR); > + val &= ~ARM_SMMU_SCTLR_CFIE; > + arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, val); > + spin_unlock(&smmu_domain->cb_lock); > + } > + > + /* > + * The SMMUv2 architecture specification says that if stall-on-fault is > + * enabled the correct sequence is to write to SMMU_CBn_FSR to clear > + * the fault and then write to SMMU_CBn_RESUME. Clear the interrupt > + * first before running the user's fault handler to make sure we follow > + * this sequence. It should be ok if there is another fault in the > + * meantime because we have already read the fault info. > + */ The context would remain stalled till we write to CBn_RESUME. Which is done in qcom_adreno_smmu_resume_translation(). For a stalled context further transactions are not processed and we shouldn't see further faults and or fault inerrupts. Do you observe faults with stalled context? > + arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr); > + > ret = report_iommu_fault(&smmu_domain->domain, NULL, cfi.iova, > cfi.fsynr & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ); > > if (ret == -ENOSYS && __ratelimit(&rs)) > arm_smmu_print_context_fault_info(smmu, idx, &cfi); > > - arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr); > return IRQ_HANDLED; > } > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h > index 2dbf3243b5ad2db01e17fb26c26c838942a491be..789c64ff3eb9944c8af37426e005241a8288da20 100644 > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h > @@ -216,7 +216,6 @@ enum arm_smmu_cbar_type { > ARM_SMMU_CB_FSR_TLBLKF) > > #define ARM_SMMU_CB_FSR_FAULT (ARM_SMMU_CB_FSR_MULTI | \ > - ARM_SMMU_CB_FSR_SS | \ Given writing to FSR.SS doesn't clear this bit but write to CBn_RESUME does, this seems right. This but can be taken as separate patch. > ARM_SMMU_CB_FSR_UUT | \ > ARM_SMMU_CB_FSR_EF | \ > ARM_SMMU_CB_FSR_PF | \ > > -- > 2.47.1 >
On 2025-01-23 11:10 am, Prakash Gupta wrote: > On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote: > >> @@ -125,12 +125,25 @@ static void qcom_adreno_smmu_resume_translation(const void *cookie, bool termina >> struct arm_smmu_domain *smmu_domain = (void *)cookie; >> struct arm_smmu_cfg *cfg = &smmu_domain->cfg; >> struct arm_smmu_device *smmu = smmu_domain->smmu; >> - u32 reg = 0; >> + u32 reg = 0, sctlr; >> + unsigned long flags; >> >> if (terminate) >> reg |= ARM_SMMU_RESUME_TERMINATE; >> >> + spin_lock_irqsave(&smmu_domain->cb_lock, flags); >> + >> arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg); >> + > At this point further transaction can be processed but SCTLR.CFIE is > cleared so subequent context fault will not generate interrupt till > SCTLR.CFIE is set. > >> + /* >> + * Re-enable interrupts after they were disabled by >> + * arm_smmu_context_fault(). >> + */ >> + sctlr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR); >> + sctlr |= ARM_SMMU_SCTLR_CFIE; >> + arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR, sctlr); >> + >> + spin_unlock_irqrestore(&smmu_domain->cb_lock, flags); >> } >> >> static void qcom_adreno_smmu_set_prr_bit(const void *cookie, bool set) >> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c >> index 79afc92e1d8b984dd35c469a3f283ad0c78f3d26..ca1ff59015a63912f0f9c5256452b2b2efa928f1 100644 >> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c >> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c >> @@ -463,13 +463,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev) >> if (!(cfi.fsr & ARM_SMMU_CB_FSR_FAULT)) >> return IRQ_NONE; >> >> + /* >> + * On some implementations FSR.SS asserts a context fault >> + * interrupt. We do not want this behavior, because resolving the >> + * original context fault typically requires operations that cannot be >> + * performed in IRQ context but leaving the stall unacknowledged will >> + * immediately lead to another spurious interrupt as FSR.SS is still >> + * set. Work around this by disabling interrupts for this context bank. >> + * It's expected that interrupts are re-enabled after resuming the >> + * translation. >> + * >> + * We have to do this before report_iommu_fault() so that we don't >> + * leave interrupts disabled in case the downstream user decides the >> + * fault can be resolved inside its fault handler. >> + * >> + * There is a possible race if there are multiple context banks sharing >> + * the same interrupt and both signal an interrupt in between writing >> + * RESUME and SCTLR. We could disable interrupts here before we >> + * re-enable them in the resume handler, leaving interrupts enabled. >> + * Lock the write to serialize it with the resume handler. >> + */ >> + if (cfi.fsr & ARM_SMMU_CB_FSR_SS) { >> + u32 val; >> + >> + spin_lock(&smmu_domain->cb_lock); >> + val = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_SCTLR); >> + val &= ~ARM_SMMU_SCTLR_CFIE; >> + arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, val); >> + spin_unlock(&smmu_domain->cb_lock); >> + } >> + >> + /* >> + * The SMMUv2 architecture specification says that if stall-on-fault is >> + * enabled the correct sequence is to write to SMMU_CBn_FSR to clear >> + * the fault and then write to SMMU_CBn_RESUME. Clear the interrupt >> + * first before running the user's fault handler to make sure we follow >> + * this sequence. It should be ok if there is another fault in the >> + * meantime because we have already read the fault info. >> + */ > The context would remain stalled till we write to CBn_RESUME. Which is done > in qcom_adreno_smmu_resume_translation(). For a stalled context further > transactions are not processed and we shouldn't see further faults and > or fault inerrupts. Do you observe faults with stalled context? This aspect isn't exclusive to stalled contexts though - even for "normal" terminated faults, clearing the FSR as soon as we've sampled all the associated fault registers is no bad thing, since if a second fault does occur while we're still reporting the first, we're then more likely to get a full syndrome rather than just the FSR.MULTI bit. >> + arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr); >> + >> ret = report_iommu_fault(&smmu_domain->domain, NULL, cfi.iova, >> cfi.fsynr & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ); >> >> if (ret == -ENOSYS && __ratelimit(&rs)) >> arm_smmu_print_context_fault_info(smmu, idx, &cfi); >> >> - arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr); >> return IRQ_HANDLED; >> } >> >> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h >> index 2dbf3243b5ad2db01e17fb26c26c838942a491be..789c64ff3eb9944c8af37426e005241a8288da20 100644 >> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h >> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h >> @@ -216,7 +216,6 @@ enum arm_smmu_cbar_type { >> ARM_SMMU_CB_FSR_TLBLKF) >> >> #define ARM_SMMU_CB_FSR_FAULT (ARM_SMMU_CB_FSR_MULTI | \ >> - ARM_SMMU_CB_FSR_SS | \ > Given writing to FSR.SS doesn't clear this bit but write to CBn_RESUME > does, this seems right. This but can be taken as separate patch. This change on its own isn't really useful - all that would achieve is that instead of constantly re-reporting the FSR.SS "fault", the interrupt goes unhandled and the IRQ core ends up disabling it permanently. If anything that's arguably worse, since the storm of context fault reports does at least give a fairly clear indication of what's gone wrong, rather than having to deduce the cause of an "irq n: nobody cared" message entirely by code inspection. Thanks, Robin. > >> ARM_SMMU_CB_FSR_UUT | \ >> ARM_SMMU_CB_FSR_EF | \ >> ARM_SMMU_CB_FSR_PF | \ >> >> -- >> 2.47.1 >>
On Thu, Jan 23, 2025 at 6:10 AM Prakash Gupta <quic_guptap@quicinc.com> wrote: > > On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote: > > > @@ -125,12 +125,25 @@ static void qcom_adreno_smmu_resume_translation(const void *cookie, bool termina > > struct arm_smmu_domain *smmu_domain = (void *)cookie; > > struct arm_smmu_cfg *cfg = &smmu_domain->cfg; > > struct arm_smmu_device *smmu = smmu_domain->smmu; > > - u32 reg = 0; > > + u32 reg = 0, sctlr; > > + unsigned long flags; > > > > if (terminate) > > reg |= ARM_SMMU_RESUME_TERMINATE; > > > > + spin_lock_irqsave(&smmu_domain->cb_lock, flags); > > + > > arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg); > > + > At this point further transaction can be processed but SCTLR.CFIE is > cleared so subequent context fault will not generate interrupt till > SCTLR.CFIE is set. If you're asking why the spin lock is there, it's because this isn't true if there's another context bank, they share an interrupt line, and it happens to fault around the same time. I haven't checked if that's actually the case for Adreno, but in case this gets used by other drivers and moved into common code I want it to be as robust as possible. This is explained in the comment added to arm_smmu_context_fault(). Also the next commit toggles CFCFG and we want to serialize against that. > > > + /* > > + * Re-enable interrupts after they were disabled by > > + * arm_smmu_context_fault(). > > + */ > > + sctlr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR); > > + sctlr |= ARM_SMMU_SCTLR_CFIE; > > + arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR, sctlr); > > + > > + spin_unlock_irqrestore(&smmu_domain->cb_lock, flags); > > } > > > > static void qcom_adreno_smmu_set_prr_bit(const void *cookie, bool set) > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c > > index 79afc92e1d8b984dd35c469a3f283ad0c78f3d26..ca1ff59015a63912f0f9c5256452b2b2efa928f1 100644 > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c > > @@ -463,13 +463,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev) > > if (!(cfi.fsr & ARM_SMMU_CB_FSR_FAULT)) > > return IRQ_NONE; > > > > + /* > > + * On some implementations FSR.SS asserts a context fault > > + * interrupt. We do not want this behavior, because resolving the > > + * original context fault typically requires operations that cannot be > > + * performed in IRQ context but leaving the stall unacknowledged will > > + * immediately lead to another spurious interrupt as FSR.SS is still > > + * set. Work around this by disabling interrupts for this context bank. > > + * It's expected that interrupts are re-enabled after resuming the > > + * translation. > > + * > > + * We have to do this before report_iommu_fault() so that we don't > > + * leave interrupts disabled in case the downstream user decides the > > + * fault can be resolved inside its fault handler. > > + * > > + * There is a possible race if there are multiple context banks sharing > > + * the same interrupt and both signal an interrupt in between writing > > + * RESUME and SCTLR. We could disable interrupts here before we > > + * re-enable them in the resume handler, leaving interrupts enabled. > > + * Lock the write to serialize it with the resume handler. > > + */ > > + if (cfi.fsr & ARM_SMMU_CB_FSR_SS) { > > + u32 val; > > + > > + spin_lock(&smmu_domain->cb_lock); > > + val = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_SCTLR); > > + val &= ~ARM_SMMU_SCTLR_CFIE; > > + arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, val); > > + spin_unlock(&smmu_domain->cb_lock); > > + } > > + > > + /* > > + * The SMMUv2 architecture specification says that if stall-on-fault is > > + * enabled the correct sequence is to write to SMMU_CBn_FSR to clear > > + * the fault and then write to SMMU_CBn_RESUME. Clear the interrupt > > + * first before running the user's fault handler to make sure we follow > > + * this sequence. It should be ok if there is another fault in the > > + * meantime because we have already read the fault info. > > + */ > The context would remain stalled till we write to CBn_RESUME. Which is done > in qcom_adreno_smmu_resume_translation(). For a stalled context further > transactions are not processed and we shouldn't see further faults and > or fault inerrupts. Do you observe faults with stalled context? Yes. I've observed that on MMU-500 writing RESUME before the interrupt has been cleared doesn't clear SS. This happened with v2 in the case where there was already a devcoredump and drm/msm called qcom_adreno_smmu_resume_translation() immediately from its fault handler, and we'd get a storm of unhandled interrupts until it was disabled. Given that the architecture spec says we're supposed to clear the interrupt first this may have been an attempt to "help" developers. > > > + arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr); > > + > > ret = report_iommu_fault(&smmu_domain->domain, NULL, cfi.iova, > > cfi.fsynr & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ); > > > > if (ret == -ENOSYS && __ratelimit(&rs)) > > arm_smmu_print_context_fault_info(smmu, idx, &cfi); > > > > - arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr); > > return IRQ_HANDLED; > > } > > > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h > > index 2dbf3243b5ad2db01e17fb26c26c838942a491be..789c64ff3eb9944c8af37426e005241a8288da20 100644 > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h > > @@ -216,7 +216,6 @@ enum arm_smmu_cbar_type { > > ARM_SMMU_CB_FSR_TLBLKF) > > > > #define ARM_SMMU_CB_FSR_FAULT (ARM_SMMU_CB_FSR_MULTI | \ > > - ARM_SMMU_CB_FSR_SS | \ > Given writing to FSR.SS doesn't clear this bit but write to CBn_RESUME > does, this seems right. This but can be taken as separate patch. > > > ARM_SMMU_CB_FSR_UUT | \ > > ARM_SMMU_CB_FSR_EF | \ > > ARM_SMMU_CB_FSR_PF | \ > > > > -- > > 2.47.1 > >
On Thu, Jan 23, 2025 at 12:35 PM Prakash Gupta <quic_guptap@quicinc.com> wrote: > > On Thu, Jan 23, 2025 at 11:51:27AM +0000, Robin Murphy wrote: > > On 2025-01-23 11:10 am, Prakash Gupta wrote: > > > On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote: > > > > + /* > > > > + * The SMMUv2 architecture specification says that if stall-on-fault is > > > > + * enabled the correct sequence is to write to SMMU_CBn_FSR to clear > > > > + * the fault and then write to SMMU_CBn_RESUME. Clear the interrupt > > > > + * first before running the user's fault handler to make sure we follow > > > > + * this sequence. It should be ok if there is another fault in the > > > > + * meantime because we have already read the fault info. > > > > + */ > > > The context would remain stalled till we write to CBn_RESUME. Which is done > > > in qcom_adreno_smmu_resume_translation(). For a stalled context further > > > transactions are not processed and we shouldn't see further faults and > > > or fault inerrupts. Do you observe faults with stalled context? > > > > This aspect isn't exclusive to stalled contexts though - even for "normal" > > terminated faults, clearing the FSR as soon as we've sampled all the > > associated fault registers is no bad thing, since if a second fault does > > occur while we're still reporting the first, we're then more likely to get a > > full syndrome rather than just the FSR.MULTI bit. > > > ARM SMMUv2 spec recommends, in case of reported fault sw should first > correct the condition which casued the fault, I would interpret this as > reporting fault to client using callback, and then write CBn_FSR and > CBn_RESUME in this order. Even in case of reported fault where context is > not stalled, the first step, IMO, I see no reason why should be any > different. I agree that delaying fault clearance can result in FSR.MULTI > being set, but clearning fault before prevent clients to use SCTLR.HUPCF > on subsequent transactions while they take any debug action. The client > should be reported fault in the same state it occured. Please refer > qcom_smmu_context_fault() for this sequence. It's not possible to implement it the way the spec describes because correcting the condition which caused the fault cannot always be done in the client's callback. Sometimes it has to be deferred to a handler not in IRQ context. However we must clear FSR (except for the SS bit) before leaving the fault handler because while the transaction is pending we want to be able to distinguish between a subsequent fault in another context bank (only FSR.SS will be set and we can ignore and return IRQ_NONE) and this context bank if they share an interrupt line. So, moving this up generally doesn't hurt and fixes the case where the client does resume the transaction inside its handler. I don't think there's really another way to implement this. And I have no idea why you think this prevents clients from using HUPCF, we already use HUPCF in drm/msm and it works fine with this series. > > > > > + arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr); > > > > + > > > > ret = report_iommu_fault(&smmu_domain->domain, NULL, cfi.iova, > > > > cfi.fsynr & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ); > > > > if (ret == -ENOSYS && __ratelimit(&rs)) > > > > arm_smmu_print_context_fault_info(smmu, idx, &cfi); > > > > - arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr); > > > > return IRQ_HANDLED; > > > > } > > > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h > > > > index 2dbf3243b5ad2db01e17fb26c26c838942a491be..789c64ff3eb9944c8af37426e005241a8288da20 100644 > > > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h > > > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h > > > > @@ -216,7 +216,6 @@ enum arm_smmu_cbar_type { > > > > ARM_SMMU_CB_FSR_TLBLKF) > > > > #define ARM_SMMU_CB_FSR_FAULT (ARM_SMMU_CB_FSR_MULTI | \ > > > > - ARM_SMMU_CB_FSR_SS | \ > > > Given writing to FSR.SS doesn't clear this bit but write to CBn_RESUME > > > does, this seems right. This but can be taken as separate patch. > > > > This change on its own isn't really useful - all that would achieve is that > > instead of constantly re-reporting the FSR.SS "fault", the interrupt goes > > unhandled and the IRQ core ends up disabling it permanently. If anything > > that's arguably worse, since the storm of context fault reports does at > > least give a fairly clear indication of what's gone wrong, rather than > > having to deduce the cause of an "irq n: nobody cared" message entirely by > > code inspection. > > > Does spec allow or do we see reported fault with just FSR.SS bit. Yes, the spec allows it and we do see it in practice. After this patch we may still see it in the case where multiple context banks share an interrupt and the other bank faults, but the correct action is to ignore it because we've disabled CFIE for this bank already. This is why this hunk has to be in this patch. > If answer > is no then Keeping FSR_SS would be misleading. Here ARM_SMMU_CB_FSR_FAULT > is used to clear fault bits or check valid faults. Also validity of this > is not based on rest of the change. > > Thanks, > Prakash > > > > > > > > ARM_SMMU_CB_FSR_UUT | \ > > > > ARM_SMMU_CB_FSR_EF | \ > > > > ARM_SMMU_CB_FSR_PF | \ > > > > > > > > -- > > > > 2.47.1 > > > > > >
On Thu, Jan 23, 2025 at 2:26 PM Prakash Gupta <quic_guptap@quicinc.com> wrote: > > On Thu, Jan 23, 2025 at 09:00:17AM -0500, Connor Abbott wrote: > > On Thu, Jan 23, 2025 at 6:10 AM Prakash Gupta <quic_guptap@quicinc.com> wrote: > > > > > > On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote: > > > > > > > + /* > > > > + * On some implementations FSR.SS asserts a context fault > > > > + * interrupt. We do not want this behavior, because resolving the > > > > + * original context fault typically requires operations that cannot be > > > > + * performed in IRQ context but leaving the stall unacknowledged will > > > > + * immediately lead to another spurious interrupt as FSR.SS is still > > > > + * set. Work around this by disabling interrupts for this context bank. > > > > + * It's expected that interrupts are re-enabled after resuming the > > > > + * translation. > > > > + * > > > > + * We have to do this before report_iommu_fault() so that we don't > > > > + * leave interrupts disabled in case the downstream user decides the > > > > + * fault can be resolved inside its fault handler. > > > > + * > > > > + * There is a possible race if there are multiple context banks sharing > > > > + * the same interrupt and both signal an interrupt in between writing > > > > + * RESUME and SCTLR. We could disable interrupts here before we > Not sure if multiple context bank with shared interrupt supported for > arm-smmu driver, but even if does than context fault handler they would > operate on their respective context register space, so I don't see the race > at context register update. Let's say CB1 enables stall-on-fault. The sequence is something like this: - CB0 faults, context fault handler for CB0 runs first - resume handler writes RESUME for CB1 - CB1 faults on some other pending transaction - context fault handler for CB1 runs due to the fault from CB0 on shared interrupt line, discovers there is an additional fault because we just wrote RESUME - context fault handler for CB1 writes SCTLR disabling CFIE - resume handler writes SCTLR enabling CFIE At the end CFIE is incorrectly enabled while the second CB1 fault is pending and we get an interrupt storm. Realistically this is only going to happen if the resume handler gets interrupted in between the two register writes, otherwise it will probably win the race and write SCTLR before CB1 can run its context fault handler. But technically we need the spinlock. > > > > > + * re-enable them in the resume handler, leaving interrupts enabled. > > > > + * Lock the write to serialize it with the resume handler. > > > > + */ > > > > + if (cfi.fsr & ARM_SMMU_CB_FSR_SS) { > > > > + u32 val; > > > > + > > > > + spin_lock(&smmu_domain->cb_lock); > > > > + val = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_SCTLR); > > > > + val &= ~ARM_SMMU_SCTLR_CFIE; > > > > + arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, val); > > > > + spin_unlock(&smmu_domain->cb_lock); > > > > + } > > > > + > > > > + /* > > > > + * The SMMUv2 architecture specification says that if stall-on-fault is > > > > + * enabled the correct sequence is to write to SMMU_CBn_FSR to clear > > > > + * the fault and then write to SMMU_CBn_RESUME. Clear the interrupt > > > > + * first before running the user's fault handler to make sure we follow > > > > + * this sequence. It should be ok if there is another fault in the > > > > + * meantime because we have already read the fault info. > > > > + */ > qcom_adreno_smmu_get_fault_info() reads the fault info as part of client > fault hanlder. So it would not be ok to clear FSR before reporting the > fault to client. > > > > The context would remain stalled till we write to CBn_RESUME. Which is done > > > in qcom_adreno_smmu_resume_translation(). For a stalled context further > > > transactions are not processed and we shouldn't see further faults and > > > or fault inerrupts. Do you observe faults with stalled context? > > > > Yes. I've observed that on MMU-500 writing RESUME before the interrupt > > has been cleared doesn't clear SS. This happened with v2 in the case > > where there was already a devcoredump and drm/msm called > > qcom_adreno_smmu_resume_translation() immediately from its fault > > handler, and we'd get a storm of unhandled interrupts until it was > > disabled. Given that the architecture spec says we're supposed to > > clear the interrupt first this may have been an attempt to "help" > > developers. > > > > Just to clarify, present sequence is: > 1. context fault with stall enable. FSR.SS set. > 2. Report fault to client > 3. resume/terminate stalled transaction > 4. clear FSR > > At what point when you try #2->#3->#4 or #4->#2->#3 sequence, is FSR.SS > cleared and interrupt storm is observed. With #2->#3->#4 FSR.SS is *not* cleared and there is a subsequent interrupt storm with only FSR.SS asserted. With #4->#2->#3 there is no interrupt storm. From this we conclude that MMU-500 does not clear FSR.SS unless #4 happens before #3. > The way CFIE disable is helping > with current patch indicates write FSR is unstalling context and subsequent > transactions are faulted. No, it does not indicate that. The interrupt storm happens even when there is exactly one context fault, and when the interrupt storm happens *only* FSR.SS is asserted. I've verified this with debug prints. Once more, MMU-500 will assert an interrupt when only FSR.SS is asserted. This has nothing to do with subsequent transactions. > Do you stop getting interrupt storm after write > RESUME. Yes, as long as the write to RESUME happens after all other bits in FSR are cleared. > If you can mention your SCTLR configuration and FSR state in test > sequence, it would be clearer. SCTLR has both HUPCF and CFCFG enabled. > > An aside, If reducing delay between FSR and RESUME write helps then both > can be done as part of qcom_adreno_smmu_resume_translation(). This will > follow spec recommendation and also make sure fault registers are not > cleared before reporting fault to client. > > Thanks, > Prakash
drm/msm uses the stall-on-fault model to record the GPU state on the first GPU page fault to help debugging. On systems where the GPU is paired with a MMU-500, there were two problems: 1. The MMU-500 doesn't de-assert its interrupt line until the fault is resumed, which led to a storm of interrupts until the fault handler was called. If we got unlucky and the fault handler was on the same CPU as the interrupt, there was a deadlock. 2. The GPU is capable of generating page faults much faster than we can resume them. GMU (GPU Management Unit) shares the same context bank as the GPU, so if there was a sudden spurt of page faults it would be effectively starved and would trigger a watchdog reset, made even worse because the GPU cannot be reset while there's a pending transaction leaving the GPU permanently wedged. Patch 1 fixes the first problem and is independent of the rest of the series. Patch 3 fixes the second problem and is dependent on patch 2, so there will have to be some cross-tree coordination. I've rebased this series on the latest linux-next to avoid rebase troubles. Signed-off-by: Connor Abbott <cwabbott0@gmail.com> --- Changes in v3: - Acknowledge the fault before resuming the transaction in patch 1. - Add suggested extra context to commit messages. - Link to v2: https://lore.kernel.org/r/20250120-msm-gpu-fault-fixes-next-v2-0-d636c4027042@gmail.com Changes in v2: - Remove unnecessary _irqsave when locking in IRQ handler (Robin) - Reuse existing spinlock for CFIE manipulation (Robin) - Lock CFCFG manipulation against concurrent CFIE manipulation - Don't use timer to re-enable stall-on-fault. (Rob) - Use more descriptive name for the function that re-enables stall-on-fault if the cooldown period has ended. (Rob) - Link to v1: https://lore.kernel.org/r/20250117-msm-gpu-fault-fixes-next-v1-0-bc9b332b5d0b@gmail.com --- Connor Abbott (3): iommu/arm-smmu: Fix spurious interrupts with stall-on-fault iommu/arm-smmu-qcom: Make set_stall work when the device is on drm/msm: Temporarily disable stall-on-fault after a page fault drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 2 ++ drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 +++ drivers/gpu/drm/msm/adreno/adreno_gpu.c | 42 +++++++++++++++++++++++++++- drivers/gpu/drm/msm/adreno/adreno_gpu.h | 24 ++++++++++++++++ drivers/gpu/drm/msm/msm_iommu.c | 9 ++++++ drivers/gpu/drm/msm/msm_mmu.h | 1 + drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 45 +++++++++++++++++++++++++++--- drivers/iommu/arm/arm-smmu/arm-smmu.c | 41 ++++++++++++++++++++++++++- drivers/iommu/arm/arm-smmu/arm-smmu.h | 1 - 9 files changed, 162 insertions(+), 7 deletions(-) --- base-commit: 0907e7fb35756464aa34c35d6abb02998418164b change-id: 20250117-msm-gpu-fault-fixes-next-96e3098023e1 Best regards,