Message ID | 20231221093209.984929-1-Kun.Song@windriver.com |
---|---|
State | New |
Headers | show |
Series | [v5.10.y] crypto: caam/jr - Fix possible caam_jr crash | expand |
Hi Kun Have you seen this issue in later kernel versions > 5.10 ? Regards Gaurav Jain > -----Original Message----- > From: Kun Song <Kun.Song@windriver.com> > Sent: Thursday, December 21, 2023 3:02 PM > To: Horia Geanta <horia.geanta@nxp.com>; Aymen Sghaier > <aymen.sghaier@nxp.com>; herbert@gondor.apana.org.au; > davem@davemloft.net > Cc: linux-crypto@vger.kernel.org; filip.pudak@windriver.com; > heng.guo@windriver.com; kun.song@windriver.com > Subject: [PATCH v5.10.y] crypto: caam/jr - Fix possible caam_jr crash > > Test environment: > Linux kernel version: 5.10.y > Architecture: ARM Cortex-A > Processor: NXP Layerscape LS1028 > > Crash in reboot tests: > Reproducibility: 1% > > If a job ring is still allocated, Once caam_jr_remove() returned, jrpriv will be > freed and the registers will get unmapped.Then caam_jr_interrupt will get error > irqstate value. > So such a job ring will probably crash.Crash info is below: > -------------------------------------- > RBS Sys: Restart ordered by epghd(0x1) > RBS Sys: RESTARTING > caam_jr 8030000.jr: Device is busy > caam_jr 8020000.jr: Device is busy > caam_jr 8010000.jr: Device is busy > arm-smmu 5000000.iommu: disabling translation caam_jr 8010000.jr: job ring > error: irqstate: 00000103 ------------[ cut here ]------------ kernel BUG at > drivers/crypto/caam/jr.c:288! > Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Hardware name: freescale > ls1028a/ls1028a, BIOS 2019.10+fsl+g3d542a3d22 > pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--) pc : > caam_jr_interrupt+0x128/0x130 lr : caam_jr_interrupt+0x128/0x130 sp : > ffff80001144be50 > x29: ffff80001144be50 x28: ffff800010f61008 > x27: ffff800011228000 x26: ffff800010f61008 > x25: ffff000027904800 x24: 0000000000000072 > x23: ffff8000113ba140 x22: 0000000000000001 > x21: ffff800011433000 x20: ffff000027904e80 > x19: 0000000000000103 x18: 0000000000000030 > x17: 0000000000000000 x16: 0000000000000000 > x15: ffffffffffffffff x14: ffff8000113ebcb8 > x13: 0000000000000008 x12: fffffffffffcac8f > x11: ffff00000038bb00 x10: ffff8000112a1e90 > x9 : ffff8000100a99c0 x8 : ffff800011249e90 > x7 : ffff8000112a1e90 x6 : 0000000000000000 > x5 : 0000000000000000 x4 : 0000000000000000 > x3 : 0000000000000000 x2 : 0000000000000000 > x1 : 0000000000000000 x0 : ffff0000279ac600 Call trace: > caam_jr_interrupt+0x128/0x130 > __handle_irq_event_percpu+0x84/0x2b0 > handle_irq_event+0x6c/0xfc > handle_fasteoi_irq+0xc8/0x230 > __handle_domain_irq+0xb8/0x130 > gic_handle_irq+0x90/0x158 > el1_irq+0xcc/0x180 > _raw_spin_lock_irq+0x0/0x90 > caam_rng_read_one.constprop.0+0x248/0x370 > caam_read+0x8c/0xb0 > hwrng_fillfn+0xfc/0x1cc > kthread+0x14c/0x160 > ret_from_fork+0x10/0x30 > Code: 2a1303e2 d00029a1 910ee021 940b2b1d (d4210000) ---[ end trace > f04d90f3ad0da5f4 ]--- Kernel panic - not syncing: Oops - BUG: Fatal exception in > interrupt Kernel Offset: disabled CPU features: 0x28040022,21002008 Memory > Limit: none > -------------------------------------- > > Disabling interrupts is to ensure that the device removal operation is not > interrupted. > > Signed-off-by: Kun Song <Kun.Song@windriver.com> > Reviewed-by: Hen Guo <Heng.Guo@windriver.com> > --- > drivers/crypto/caam/jr.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c index > 6f669966ba2c..d191e8caa1ad 100644 > --- a/drivers/crypto/caam/jr.c > +++ b/drivers/crypto/caam/jr.c > @@ -135,6 +135,10 @@ static int caam_jr_remove(struct platform_device > *pdev) > jrdev = &pdev->dev; > jrpriv = dev_get_drvdata(jrdev); > > + /* Disabling interrupts is ensure that the device removal operation > + * is not interrupted by interrupts. > + */ > + devm_free_irq(jrdev, jrpriv->irq, jrdev); > if (jrpriv->hwrng) > caam_rng_exit(jrdev->parent); > > -- > 2.26.1 >
Hello SK I am submitting and replying patches using gaurav.jain@nxp.com In Later kernel versions we have provided fixes related to job ring flush and there are other changes as well. It would be better to add these changes on top of 5.10 tree as we also run multiple tests and not observed this issue. Regards Gaurav Jain > -----Original Message----- > From: Kun Song <Kun.Song@windriver.com> > Sent: Wednesday, January 10, 2024 7:09 AM > To: Gaurav Jain (OSS) <gaurav.jain@oss.nxp.com> > Cc: Kun.Song@windriver.com; Varun Sethi <V.Sethi@nxp.com>; Aymen Sghaier > <aymen.sghaier@nxp.com>; davem@davemloft.net; filip.pudak@windriver.com; > heng.guo@windriver.com; herbert@gondor.apana.org.au; Horia Geanta > <horia.geanta@nxp.com>; linux-crypto@vger.kernel.org; Meenakshi Aggarwal > <meenakshi.aggarwal@nxp.com> > Subject: [REMINDER] Re: [PATCH v5.10.y] crypto: caam/jr - Fix possible caam_jr > crash > > Hello Gaurav, > > I hope you receive this email. I'm following up on a patch I submitted a few > weeks ago. There doesn't seem to be any response yet and I want to make sure > it gets pushed forward. > > I know you're busy and thank you for taking the time to focus on this.If you > have any concerns or feedback please let me know and I'll be happy to address it. > > Best regards, > SK
Hello SK I am submitting and replying patches using gaurav.jain@nxp.com In Later kernel versions we have provided fixes related to job ring flush and there are other changes as well. It would be better to add these changes on top of 5.10 tree as we also run multiple tests and not observed this issue. Regards Gaurav Jain > -----Original Message----- > From: Kun Song <Kun.Song@windriver.com> > Sent: Wednesday, January 10, 2024 7:09 AM > To: Gaurav Jain (OSS) <gaurav.jain@oss.nxp.com> > Cc: Kun.Song@windriver.com; Varun Sethi <V.Sethi@nxp.com>; Aymen Sghaier > <aymen.sghaier@nxp.com>; davem@davemloft.net; filip.pudak@windriver.com; > heng.guo@windriver.com; herbert@gondor.apana.org.au; Horia Geanta > <horia.geanta@nxp.com>; linux-crypto@vger.kernel.org; Meenakshi Aggarwal > <meenakshi.aggarwal@nxp.com> > Subject: [REMINDER] Re: [PATCH v5.10.y] crypto: caam/jr - Fix possible caam_jr > crash > > Hello Gaurav, > > I hope you receive this email. I'm following up on a patch I submitted a few > weeks ago. There doesn't seem to be any response yet and I want to make sure > it gets pushed forward. > > I know you're busy and thank you for taking the time to focus on this.If you > have any concerns or feedback please let me know and I'll be happy to address it. > > Best regards, > SK
On 12/21/2023 11:32 AM, Kun Song wrote: > Test environment: > Linux kernel version: 5.10.y > Architecture: ARM Cortex-A > Processor: NXP Layerscape LS1028 > > Crash in reboot tests: > Reproducibility: 1% > Replying here to comment on the log. I've added all the people from the latest reply, i.e.: https://lore.kernel.org/linux-crypto/AM0PR04MB6004AECDD044F1E6BF3B6732E7682@AM0PR04MB6004.eurprd04.prod.outlook.com > If a job ring is still allocated, Once caam_jr_remove() returned, > jrpriv will be freed and the registers will get unmapped.Then In this case, most likely the root cause is different (see below). > caam_jr_interrupt will get error irqstate value. > So such a job ring will probably crash.Crash info is below: > -------------------------------------- > RBS Sys: Restart ordered by epghd(0x1) > RBS Sys: RESTARTING This looks like a system restart. > caam_jr 8030000.jr: Device is busy > caam_jr 8020000.jr: Device is busy > caam_jr 8010000.jr: Device is busy For some reason, there are still tfms accounted for on all these caam job rings. Maybe the system restart is not handled correctly at some point, hence some resource leaks (unallocated tfms). As already discussed, exiting early from caam_jr .remove() callback will leave the HW unquiesced (e.g. job rings not flushed), interrupts still active. > arm-smmu 5000000.iommu: disabling translation From here onward caam memory accesses won't be translated. > caam_jr 8010000.jr: job ring error: irqstate: 00000103 The error code means caam was not able to write the status of the job it just finished in the output job ring (which is allocated in memory). Most likely this happened due to iommu no longer translating the access. > > Disabling interrupts is to ensure that the device removal > operation is not interrupted. > > Signed-off-by: Kun Song <Kun.Song@windriver.com> > Reviewed-by: Hen Guo <Heng.Guo@windriver.com> > --- > drivers/crypto/caam/jr.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c > index 6f669966ba2c..d191e8caa1ad 100644 > --- a/drivers/crypto/caam/jr.c > +++ b/drivers/crypto/caam/jr.c > @@ -135,6 +135,10 @@ static int caam_jr_remove(struct platform_device *pdev) > jrdev = &pdev->dev; > jrpriv = dev_get_drvdata(jrdev); > > + /* Disabling interrupts is ensure that the device removal operation > + * is not interrupted by interrupts. > + */ > + devm_free_irq(jrdev, jrpriv->irq, jrdev); > if (jrpriv->hwrng) > caam_rng_exit(jrdev->parent); > As pointed out by previous discussions, this is not enough. Crashes could still occur due to crypto API users calling into caam driver, which would be in an inconsistent state. Whether .remove() being called is triggered by a system restart or a manual device unbinding [1] is irrelevant, the driver mustn't crash. I think Herbert's suggestion [2] on how to deal with HW devices going away makes sense. Not sure if the changes (thinking of crypto API part) could go into LTS kernels. If not, we'll have to stick to caam driver-only changes (but IIUC other crypto drivers are having the same issue with HW devices going away [3]). Thanks, Horia [1] https://lore.kernel.org/linux-crypto/VI1PR04MB7023A7EC91599A537CB6A487EEB30@VI1PR04MB7023.eurprd04.prod.outlook.com/ [2] https://lore.kernel.org/linux-crypto/20190919134512.GA29320@gondor.apana.org.au/ [3] https://lore.kernel.org/linux-crypto/ZAqwTqw3lR+dnImO@gondor.apana.org.au/
diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c index 6f669966ba2c..d191e8caa1ad 100644 --- a/drivers/crypto/caam/jr.c +++ b/drivers/crypto/caam/jr.c @@ -135,6 +135,10 @@ static int caam_jr_remove(struct platform_device *pdev) jrdev = &pdev->dev; jrpriv = dev_get_drvdata(jrdev); + /* Disabling interrupts is ensure that the device removal operation + * is not interrupted by interrupts. + */ + devm_free_irq(jrdev, jrpriv->irq, jrdev); if (jrpriv->hwrng) caam_rng_exit(jrdev->parent);