Message ID | 1612482643-11796-1-git-send-email-LinoSanfilippo@gmx.de |
---|---|
Headers | show |
Series | TPM fixes | expand |
On 2/4/21 6:50 PM, Lino Sanfilippo wrote: > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > The following sequence of operations results in a refcount warning: > > 1. Open device /dev/tpmrm > 2. Remove module tpm_tis_spi > 3. Write a TPM command to the file descriptor opened at step 1. > > ------------[ cut here ]------------ > WARNING: CPU: 3 PID: 1161 at lib/refcount.c:25 kobject_get+0xa0/0xa4 > refcount_t: addition on 0; use-after-free. > Modules linked in: tpm_tis_spi tpm_tis_core tpm mdio_bcm_unimac brcmfmac > sha256_generic libsha256 sha256_arm hci_uart btbcm bluetooth cfg80211 vc4 > brcmutil ecdh_generic ecc snd_soc_core crc32_arm_ce libaes > raspberrypi_hwmon ac97_bus snd_pcm_dmaengine bcm2711_thermal snd_pcm > snd_timer genet snd phy_generic soundcore [last unloaded: spi_bcm2835] > CPU: 3 PID: 1161 Comm: hold_open Not tainted 5.10.0ls-main-dirty #2 > Hardware name: BCM2711 > [<c0410c3c>] (unwind_backtrace) from [<c040b580>] (show_stack+0x10/0x14) > [<c040b580>] (show_stack) from [<c1092174>] (dump_stack+0xc4/0xd8) > [<c1092174>] (dump_stack) from [<c0445a30>] (__warn+0x104/0x108) > [<c0445a30>] (__warn) from [<c0445aa8>] (warn_slowpath_fmt+0x74/0xb8) > [<c0445aa8>] (warn_slowpath_fmt) from [<c08435d0>] (kobject_get+0xa0/0xa4) > [<c08435d0>] (kobject_get) from [<bf0a715c>] (tpm_try_get_ops+0x14/0x54 [tpm]) > [<bf0a715c>] (tpm_try_get_ops [tpm]) from [<bf0a7d6c>] (tpm_common_write+0x38/0x60 [tpm]) > [<bf0a7d6c>] (tpm_common_write [tpm]) from [<c05a7ac0>] (vfs_write+0xc4/0x3c0) > [<c05a7ac0>] (vfs_write) from [<c05a7ee4>] (ksys_write+0x58/0xcc) > [<c05a7ee4>] (ksys_write) from [<c04001a0>] (ret_fast_syscall+0x0/0x4c) > Exception stack(0xc226bfa8 to 0xc226bff0) > bfa0: 00000000 000105b4 00000003 beafe664 00000014 00000000 > bfc0: 00000000 000105b4 000103f8 00000004 00000000 00000000 b6f9c000 beafe684 > bfe0: 0000006c beafe648 0001056c b6eb6944 > ---[ end trace d4b8409def9b8b1f ]--- > > The reason for this warning is the attempt to get the chip->dev reference > in tpm_common_write() although the reference counter is already zero. > > Since commit 8979b02aaf1d ("tpm: Fix reference count to main device") the > extra reference used to prevent a premature zero counter is never taken, > because the required TPM_CHIP_FLAG_TPM2 flag is never set. > > Fix this by removing the flag condition. > > Commit fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>") > already introduced function tpm_devs_release() to release the extra > reference but did not implement the required put on chip->devs that results > in the call of this function. > > Fix this also by installing an action handler that puts chip->devs as soon > as the chip is unregistered. > > Fixes: fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>") > Fixes: 8979b02aaf1d ("tpm: Fix reference count to main device") > Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> Tested-by: Stefan Berger <stefanb@linux.ibm.com> Steps: modprobe tpm_vtpm_proxy swtpm chardev --vtpm-proxy --tpm2 --tpmstate dir=./ & exec 100<>/dev/tpmrm1 kill -9 <swtpm pid> rmmod tpm_vtpm_proxy exec 100>&- # fails before, works after --> great job! :-)
On 2/4/21 7:46 PM, Stefan Berger wrote: > On 2/4/21 6:50 PM, Lino Sanfilippo wrote: >> From: Lino Sanfilippo <l.sanfilippo@kunbus.com> >> >> The following sequence of operations results in a refcount warning: >> >> 1. Open device /dev/tpmrm >> 2. Remove module tpm_tis_spi >> 3. Write a TPM command to the file descriptor opened at step 1. >> >> ------------[ cut here ]------------ >> WARNING: CPU: 3 PID: 1161 at lib/refcount.c:25 kobject_get+0xa0/0xa4 >> refcount_t: addition on 0; use-after-free. >> Modules linked in: tpm_tis_spi tpm_tis_core tpm mdio_bcm_unimac brcmfmac >> sha256_generic libsha256 sha256_arm hci_uart btbcm bluetooth cfg80211 >> vc4 >> brcmutil ecdh_generic ecc snd_soc_core crc32_arm_ce libaes >> raspberrypi_hwmon ac97_bus snd_pcm_dmaengine bcm2711_thermal snd_pcm >> snd_timer genet snd phy_generic soundcore [last unloaded: spi_bcm2835] >> CPU: 3 PID: 1161 Comm: hold_open Not tainted 5.10.0ls-main-dirty #2 >> Hardware name: BCM2711 >> [<c0410c3c>] (unwind_backtrace) from [<c040b580>] (show_stack+0x10/0x14) >> [<c040b580>] (show_stack) from [<c1092174>] (dump_stack+0xc4/0xd8) >> [<c1092174>] (dump_stack) from [<c0445a30>] (__warn+0x104/0x108) >> [<c0445a30>] (__warn) from [<c0445aa8>] (warn_slowpath_fmt+0x74/0xb8) >> [<c0445aa8>] (warn_slowpath_fmt) from [<c08435d0>] >> (kobject_get+0xa0/0xa4) >> [<c08435d0>] (kobject_get) from [<bf0a715c>] >> (tpm_try_get_ops+0x14/0x54 [tpm]) >> [<bf0a715c>] (tpm_try_get_ops [tpm]) from [<bf0a7d6c>] >> (tpm_common_write+0x38/0x60 [tpm]) >> [<bf0a7d6c>] (tpm_common_write [tpm]) from [<c05a7ac0>] >> (vfs_write+0xc4/0x3c0) >> [<c05a7ac0>] (vfs_write) from [<c05a7ee4>] (ksys_write+0x58/0xcc) >> [<c05a7ee4>] (ksys_write) from [<c04001a0>] (ret_fast_syscall+0x0/0x4c) >> Exception stack(0xc226bfa8 to 0xc226bff0) >> bfa0: 00000000 000105b4 00000003 beafe664 00000014 >> 00000000 >> bfc0: 00000000 000105b4 000103f8 00000004 00000000 00000000 b6f9c000 >> beafe684 >> bfe0: 0000006c beafe648 0001056c b6eb6944 >> ---[ end trace d4b8409def9b8b1f ]--- >> >> The reason for this warning is the attempt to get the chip->dev >> reference >> in tpm_common_write() although the reference counter is already zero. >> >> Since commit 8979b02aaf1d ("tpm: Fix reference count to main device") >> the >> extra reference used to prevent a premature zero counter is never taken, >> because the required TPM_CHIP_FLAG_TPM2 flag is never set. >> >> Fix this by removing the flag condition. >> >> Commit fdc915f7f719 ("tpm: expose spaces via a device link >> /dev/tpmrm<n>") >> already introduced function tpm_devs_release() to release the extra >> reference but did not implement the required put on chip->devs that >> results >> in the call of this function. >> >> Fix this also by installing an action handler that puts chip->devs as >> soon >> as the chip is unregistered. >> >> Fixes: fdc915f7f719 ("tpm: expose spaces via a device link >> /dev/tpmrm<n>") >> Fixes: 8979b02aaf1d ("tpm: Fix reference count to main device") >> Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > Tested-by: Stefan Berger <stefanb@linux.ibm.com> > > Steps: > > modprobe tpm_vtpm_proxy > > swtpm chardev --vtpm-proxy --tpm2 --tpmstate dir=./ & > > exec 100<>/dev/tpmrm1 > > kill -9 <swtpm pid> > > rmmod tpm_vtpm_proxy > > exec 100>&- # fails before, works after --> great job! :-) > > To clarify: When I tested this I had *both* patches applied. Without the patches I got the null pointer exception in tpm2_del_space(). The 2nd patch alone solves that issue when using the steps above. [ 525.647443] [c000000005d3bba0] [c000000000e81d78] mutex_lock+0x28/0x90 (unreliable) [ 525.647539] [c000000005d3bbd0] [c0080000001f5da0] tpm2_del_space+0x48/0x130 [tpm] [ 525.647635] [c000000005d3bc20] [c0080000001f56b8] tpmrm_release+0x40/0x70 [tpm] [ 525.647746] [c000000005d3bc50] [c0000000004bf718] __fput+0xb8/0x340 [ 525.647842] [c000000005d3bca0] [c00000000017def4] task_work_run+0xe4/0x150 [ 525.647930] [c000000005d3bcf0] [c00000000001feb4] do_notify_resume+0x484/0x4f0 [ 525.648023] [c000000005d3bdb0] [c000000000033a64] syscall_exit_prepare+0x1d4/0x330 [ 525.648115] [c000000005d3be20] [c00000000000d96c] system_call_common+0xfc/0x27c
On Thu, 2021-02-04 at 20:44 -0500, Stefan Berger wrote: > To clarify: When I tested this I had *both* patches applied. Without > the patches I got the null pointer exception in tpm2_del_space(). The > 2nd patch alone solves that issue when using the steps above. Yes, I can't confirm the bug either. I only have lpc tis devices, so it could be something to do with spi, but when I do python3 in one shell >>> fd = open("/dev/tpmrm0", "wb") do rmmod tpm_tis in another shell >>> buf = bytearray(20) >>> fd.write(buf) 20 so I don't see the oops you see on write. However >>> fd.close() And it oopses here in tpm2_del_space James
On Fri, Feb 05, 2021 at 12:50:42AM +0100, Lino Sanfilippo wrote: > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > The following sequence of operations results in a refcount warning: > > 1. Open device /dev/tpmrm > 2. Remove module tpm_tis_spi > 3. Write a TPM command to the file descriptor opened at step 1. > > ------------[ cut here ]------------ > WARNING: CPU: 3 PID: 1161 at lib/refcount.c:25 kobject_get+0xa0/0xa4 > refcount_t: addition on 0; use-after-free. > Modules linked in: tpm_tis_spi tpm_tis_core tpm mdio_bcm_unimac brcmfmac > sha256_generic libsha256 sha256_arm hci_uart btbcm bluetooth cfg80211 vc4 > brcmutil ecdh_generic ecc snd_soc_core crc32_arm_ce libaes > raspberrypi_hwmon ac97_bus snd_pcm_dmaengine bcm2711_thermal snd_pcm > snd_timer genet snd phy_generic soundcore [last unloaded: spi_bcm2835] > CPU: 3 PID: 1161 Comm: hold_open Not tainted 5.10.0ls-main-dirty #2 > Hardware name: BCM2711 > [<c0410c3c>] (unwind_backtrace) from [<c040b580>] (show_stack+0x10/0x14) > [<c040b580>] (show_stack) from [<c1092174>] (dump_stack+0xc4/0xd8) > [<c1092174>] (dump_stack) from [<c0445a30>] (__warn+0x104/0x108) > [<c0445a30>] (__warn) from [<c0445aa8>] (warn_slowpath_fmt+0x74/0xb8) > [<c0445aa8>] (warn_slowpath_fmt) from [<c08435d0>] (kobject_get+0xa0/0xa4) > [<c08435d0>] (kobject_get) from [<bf0a715c>] (tpm_try_get_ops+0x14/0x54 [tpm]) > [<bf0a715c>] (tpm_try_get_ops [tpm]) from [<bf0a7d6c>] (tpm_common_write+0x38/0x60 [tpm]) > [<bf0a7d6c>] (tpm_common_write [tpm]) from [<c05a7ac0>] (vfs_write+0xc4/0x3c0) > [<c05a7ac0>] (vfs_write) from [<c05a7ee4>] (ksys_write+0x58/0xcc) > [<c05a7ee4>] (ksys_write) from [<c04001a0>] (ret_fast_syscall+0x0/0x4c) > Exception stack(0xc226bfa8 to 0xc226bff0) > bfa0: 00000000 000105b4 00000003 beafe664 00000014 00000000 > bfc0: 00000000 000105b4 000103f8 00000004 00000000 00000000 b6f9c000 beafe684 > bfe0: 0000006c beafe648 0001056c b6eb6944 > ---[ end trace d4b8409def9b8b1f ]--- > > The reason for this warning is the attempt to get the chip->dev reference > in tpm_common_write() although the reference counter is already zero. > > Since commit 8979b02aaf1d ("tpm: Fix reference count to main device") the > extra reference used to prevent a premature zero counter is never taken, > because the required TPM_CHIP_FLAG_TPM2 flag is never set. > > Fix this by removing the flag condition. > > Commit fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>") > already introduced function tpm_devs_release() to release the extra > reference but did not implement the required put on chip->devs that results > in the call of this function. > > Fix this also by installing an action handler that puts chip->devs as soon > as the chip is unregistered. > > Fixes: fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>") > Fixes: 8979b02aaf1d ("tpm: Fix reference count to main device") > Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> > --- > drivers/char/tpm/tpm-chip.c | 18 +++++++++++++++--- > drivers/char/tpm/tpm_ftpm_tee.c | 2 ++ > drivers/char/tpm/tpm_vtpm_proxy.c | 1 + > 3 files changed, 18 insertions(+), 3 deletions(-) <formletter> This is not the correct way to submit patches for inclusion in the stable kernel tree. Please read: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html for how to do this properly. </formletter>
Hi Stefan, On 05.02.21 01:46, Stefan Berger wrote: > On 2/4/21 6:50 PM, Lino Sanfilippo wrote: >> Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > Tested-by: Stefan Berger <stefanb@linux.ibm.com> > > Steps: > > modprobe tpm_vtpm_proxy > > swtpm chardev --vtpm-proxy --tpm2 --tpmstate dir=./ & > > exec 100<>/dev/tpmrm1 > > kill -9 <swtpm pid> > > rmmod tpm_vtpm_proxy > > exec 100>&- # fails before, works after --> great job! :-) > > Great, thank you very much for testing! Best Regards, Lino
Hi, On 05.02.21 03:01, James Bottomley wrote: > On Thu, 2021-02-04 at 20:44 -0500, Stefan Berger wrote: >> To clarify: When I tested this I had *both* patches applied. Without >> the patches I got the null pointer exception in tpm2_del_space(). The >> 2nd patch alone solves that issue when using the steps above. > > > Yes, I can't confirm the bug either. I only have lpc tis devices, so > it could be something to do with spi, but when I do > > python3 in one shell > >>>> fd = open("/dev/tpmrm0", "wb") > > do rmmod tpm_tis in another shell > >>>> buf = bytearray(20) >>>> fd.write(buf) > 20 The issue is in the TPM chip driver code, so AFAIU it should not matter whether its SPI or something else. Maybe check again, that the file is still open when tpm_tis is removed and the write actually comes after the rmmod? Also note that there are some sanity checks in tpm_common_write() that the written data has to pass to get to the point where tpm_try_get_ops() is called, which is the call that eventually triggers the bug. > > so I don't see the oops you see on write. However > >>>> fd.close() > > And it oopses here in tpm2_del_space > > James > > Regards, Lino
On Fri, Feb 05, 2021 at 12:50:42AM +0100, Lino Sanfilippo wrote: > From: Lino Sanfilippo <l.sanfilippo@kunbus.com> > > The following sequence of operations results in a refcount warning: > > 1. Open device /dev/tpmrm > 2. Remove module tpm_tis_spi > 3. Write a TPM command to the file descriptor opened at step 1. > > WARNING: CPU: 3 PID: 1161 at lib/refcount.c:25 kobject_get+0xa0/0xa4 > refcount_t: addition on 0; use-after-free. > Modules linked in: tpm_tis_spi tpm_tis_core tpm mdio_bcm_unimac brcmfmac > sha256_generic libsha256 sha256_arm hci_uart btbcm bluetooth cfg80211 vc4 > brcmutil ecdh_generic ecc snd_soc_core crc32_arm_ce libaes > raspberrypi_hwmon ac97_bus snd_pcm_dmaengine bcm2711_thermal snd_pcm > snd_timer genet snd phy_generic soundcore [last unloaded: spi_bcm2835] > CPU: 3 PID: 1161 Comm: hold_open Not tainted 5.10.0ls-main-dirty #2 > Hardware name: BCM2711 > [<c0410c3c>] (unwind_backtrace) from [<c040b580>] (show_stack+0x10/0x14) > [<c040b580>] (show_stack) from [<c1092174>] (dump_stack+0xc4/0xd8) > [<c1092174>] (dump_stack) from [<c0445a30>] (__warn+0x104/0x108) > [<c0445a30>] (__warn) from [<c0445aa8>] (warn_slowpath_fmt+0x74/0xb8) > [<c0445aa8>] (warn_slowpath_fmt) from [<c08435d0>] (kobject_get+0xa0/0xa4) > [<c08435d0>] (kobject_get) from [<bf0a715c>] (tpm_try_get_ops+0x14/0x54 [tpm]) > [<bf0a715c>] (tpm_try_get_ops [tpm]) from [<bf0a7d6c>] (tpm_common_write+0x38/0x60 [tpm]) > [<bf0a7d6c>] (tpm_common_write [tpm]) from [<c05a7ac0>] (vfs_write+0xc4/0x3c0) > [<c05a7ac0>] (vfs_write) from [<c05a7ee4>] (ksys_write+0x58/0xcc) > [<c05a7ee4>] (ksys_write) from [<c04001a0>] (ret_fast_syscall+0x0/0x4c) > Exception stack(0xc226bfa8 to 0xc226bff0) > bfa0: 00000000 000105b4 00000003 beafe664 00000014 00000000 > bfc0: 00000000 000105b4 000103f8 00000004 00000000 00000000 b6f9c000 beafe684 > bfe0: 0000006c beafe648 0001056c b6eb6944 > > The reason for this warning is the attempt to get the chip->dev reference > in tpm_common_write() although the reference counter is already zero. > Since commit 8979b02aaf1d ("tpm: Fix reference count to main device") the > extra reference used to prevent a premature zero counter is never taken, > because the required TPM_CHIP_FLAG_TPM2 flag is never set. > > Fix this by removing the flag condition. > > Commit fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>") > already introduced function tpm_devs_release() to release the extra > reference but did not implement the required put on chip->devs that results > in the call of this function. Seems wonky, the devs is just supposed to be a side thing, nothing should be using it as a primary reference count for a tpm. The bug here is only that tpm_common_open() did not get a kref on the chip before putting it in priv and linking it to the fd. See the comment before tpm_try_get_ops() indicating the caller must already have taken care to ensure the chip is valid. This should be all you need to fix the oops: diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c index 1784530b8387bb..1b738dca7fffb5 100644 --- a/drivers/char/tpm/tpm-dev-common.c +++ b/drivers/char/tpm/tpm-dev-common.c @@ -105,6 +105,7 @@ static void tpm_timeout_work(struct work_struct *work) void tpm_common_open(struct file *file, struct tpm_chip *chip, struct file_priv *priv, struct tpm_space *space) { + get_device(&priv->chip.dev); priv->chip = chip; priv->space = space; priv->response_read = true; @@ -261,6 +262,7 @@ void tpm_common_release(struct file *file, struct file_priv *priv) flush_work(&priv->timeout_work); file->private_data = NULL; priv->response_length = 0; + put_device(&chip->dev); } int __init tpm_dev_common_init(void) > Fix this also by installing an action handler that puts chip->devs as soon > as the chip is unregistered. > > Fixes: fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>") > Fixes: 8979b02aaf1d ("tpm: Fix reference count to main device") > Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> > drivers/char/tpm/tpm-chip.c | 18 +++++++++++++++--- > drivers/char/tpm/tpm_ftpm_tee.c | 2 ++ > drivers/char/tpm/tpm_vtpm_proxy.c | 1 + > 3 files changed, 18 insertions(+), 3 deletions(-) > > diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c > index ddaeceb..3ace199 100644 > +++ b/drivers/char/tpm/tpm-chip.c > @@ -360,8 +360,7 @@ struct tpm_chip *tpm_chip_alloc(struct device *pdev, > * while cdevs is in use. The corresponding put > * is in the tpm_devs_release (TPM2 only) > */ > - if (chip->flags & TPM_CHIP_FLAG_TPM2) > - get_device(&chip->dev); > + get_device(&chip->dev); > > if (chip->dev_num == 0) > chip->dev.devt = MKDEV(MISC_MAJOR, TPM_MINOR); > @@ -422,8 +421,21 @@ struct tpm_chip *tpmm_chip_alloc(struct device *pdev, > rc = devm_add_action_or_reset(pdev, > (void (*)(void *)) put_device, > &chip->dev); > - if (rc) > + if (rc) { > + put_device(&chip->devs); > return ERR_PTR(rc); This isn't right read what 'or_reset' does Jason
On 2/4/21 9:01 PM, James Bottomley wrote: > On Thu, 2021-02-04 at 20:44 -0500, Stefan Berger wrote: >> To clarify: When I tested this I had *both* patches applied. Without >> the patches I got the null pointer exception in tpm2_del_space(). The >> 2nd patch alone solves that issue when using the steps above. > > Yes, I can't confirm the bug either. I only have lpc tis devices, so > it could be something to do with spi, but when I do I can confirm this bug: insmod /usr/lib/modules/5.10.0+/extra/tpm.ko ; insmod /usr/lib/modules/5.10.0+/extra/tpm_vtpm_proxy.ko swtpm chardev --vtpm-proxy --tpm2 --tpmstate dir=./ & exec 100<>/dev/tpmrm0 kill -9 <swtpm pid> rmmod tpm_vtpm_proxy echo -en '\x80\x01\x00\x00\x00\x0c\x00\x00\x01\x44\x00\x00' >&100 [ 167.289390] [c000000015d6fb60] [c0000000007d3ac0] refcount_warn_saturate+0x210/0x230 (unreliable) [ 167.290392] [c000000015d6fbc0] [c000000000831328] kobject_put+0x1b8/0x2e0 [ 167.291398] [c000000015d6fc50] [c000000000955548] put_device+0x28/0x40 [ 167.292409] [c000000015d6fc70] [c0080000008609a8] tpm_try_get_ops+0xb0/0x100 [tpm] [ 167.293417] [c000000015d6fcb0] [c008000000861864] tpm_common_write+0x15c/0x250 [tpm] [ 167.294429] [c000000015d6fd20] [c0000000004be190] vfs_write+0xf0/0x380 [ 167.295437] [c000000015d6fd70] [c0000000004be6c8] ksys_write+0x78/0x130 [ 167.296450] [c000000015d6fdc0] [c00000000003377c] system_call_exception+0x15c/0x270 [ 167.297461] [c000000015d6fe20] [c00000000000d960] system_call_common+0xf0/0x27c With this patch applied this error here is gone. Just have make sure to replace tpm.ko and tpm_vtpm_proxy.ko, not just the latter. So my Tested-By is good for both patches. Stefan
Hi, On 05.02.21 14:05, Jason Gunthorpe wrote: >> >> Commit fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>") >> already introduced function tpm_devs_release() to release the extra >> reference but did not implement the required put on chip->devs that results >> in the call of this function. > > Seems wonky, the devs is just supposed to be a side thing, nothing > should be using it as a primary reference count for a tpm. > > The bug here is only that tpm_common_open() did not get a kref on the > chip before putting it in priv and linking it to the fd. See the > comment before tpm_try_get_ops() indicating the caller must already > have taken care to ensure the chip is valid. > > This should be all you need to fix the oops: > > diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c > index 1784530b8387bb..1b738dca7fffb5 100644 > --- a/drivers/char/tpm/tpm-dev-common.c > +++ b/drivers/char/tpm/tpm-dev-common.c > @@ -105,6 +105,7 @@ static void tpm_timeout_work(struct work_struct *work) > void tpm_common_open(struct file *file, struct tpm_chip *chip, > struct file_priv *priv, struct tpm_space *space) > { > + get_device(&priv->chip.dev); > priv->chip = chip; > priv->space = space; > priv->response_read = true; This is racy, isnt it? The time between we open the file and we want to grab the reference in common_open() the chip can already be unregistered and freed. As a matter of fact this solution was the first thing that came into my mind, too, until I noticed the possible race condition. I can only guess that this was what James had in mind when he chose to take the extra reference to chip->dev in tpm_chip_alloc() instead of common_open(). >> diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c >> index ddaeceb..3ace199 100644 >> +++ b/drivers/char/tpm/tpm-chip.c >> @@ -360,8 +360,7 @@ struct tpm_chip *tpm_chip_alloc(struct device *pdev, >> * while cdevs is in use. The corresponding put >> * is in the tpm_devs_release (TPM2 only) >> */ >> - if (chip->flags & TPM_CHIP_FLAG_TPM2) >> - get_device(&chip->dev); >> + get_device(&chip->dev); >> >> if (chip->dev_num == 0) >> chip->dev.devt = MKDEV(MISC_MAJOR, TPM_MINOR); >> @@ -422,8 +421,21 @@ struct tpm_chip *tpmm_chip_alloc(struct device *pdev, >> rc = devm_add_action_or_reset(pdev, >> (void (*)(void *)) put_device, >> &chip->dev); >> - if (rc) >> + if (rc) { >> + put_device(&chip->devs); >> return ERR_PTR(rc); > > This isn't right read what 'or_reset' does > In case of failure installing the action handler devm_add_action_or_reset() puts chip->dev for us. But we also have put chip->devs since we have retrieved a reference to both chip->dev and chip->devs. Or do I miss something here? > Jason > Regards, Lino
On Fri, Feb 05, 2021 at 03:55:09PM +0100, Lino Sanfilippo wrote: > Hi, > > On 05.02.21 14:05, Jason Gunthorpe wrote: > > >> > >> Commit fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>") > >> already introduced function tpm_devs_release() to release the extra > >> reference but did not implement the required put on chip->devs that results > >> in the call of this function. > > > > Seems wonky, the devs is just supposed to be a side thing, nothing > > should be using it as a primary reference count for a tpm. > > > > The bug here is only that tpm_common_open() did not get a kref on the > > chip before putting it in priv and linking it to the fd. See the > > comment before tpm_try_get_ops() indicating the caller must already > > have taken care to ensure the chip is valid. > > > > This should be all you need to fix the oops: > > > > diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c > > index 1784530b8387bb..1b738dca7fffb5 100644 > > +++ b/drivers/char/tpm/tpm-dev-common.c > > @@ -105,6 +105,7 @@ static void tpm_timeout_work(struct work_struct *work) > > void tpm_common_open(struct file *file, struct tpm_chip *chip, > > struct file_priv *priv, struct tpm_space *space) > > { > > + get_device(&priv->chip.dev); > > priv->chip = chip; > > priv->space = space; > > priv->response_read = true; > > This is racy, isnt it? The time between we open the file and we want to grab the > reference in common_open() the chip can already be unregistered and freed. No, the cdev layer holds the refcount on the device while open is being called. Jason
On 05.02.21 16:15, Jason Gunthorpe wrote: > > No, the cdev layer holds the refcount on the device while open is > being called. > > Jason > Yes, but the reference that is responsible for the chip deallocation is chip->dev which is linked to chip->cdev and represents /dev/tpm, not /dev/tpmrm. You are right, we dont have the issue with /dev/tpm for the reason you mentioned. But /dev/tpmrm is represented by chip->cdevs and keeping this ref held by the cdev layer wont protect us from the chip being freed (which is the reason why we need the chip->dev reference in the first place). And yes, the naming dev/devs/cdev/cdevs is quite confusing :( Regards, Lino
On Fri, Feb 05, 2021 at 04:50:13PM +0100, Lino Sanfilippo wrote: > > On 05.02.21 16:15, Jason Gunthorpe wrote: > > > > No, the cdev layer holds the refcount on the device while open is > > being called. > > > Yes, but the reference that is responsible for the chip deallocation is chip->dev > which is linked to chip->cdev and represents /dev/tpm, not /dev/tpmrm. > You are right, we dont have the issue with /dev/tpm for the reason you mentioned. > But /dev/tpmrm is represented by chip->cdevs and keeping this ref held by the cdev > layer wont protect us from the chip being freed (which is the reason why we need > the chip->dev reference in the first place). No, they are all chained together because they are all in the same struct: struct tpm_chip { struct device dev; struct device devs; struct cdev cdev; struct cdev cdevs; dev holds the refcount on memory, when it goes 0 the whole thing is kfreed. The rule is dev's refcount can't go to zero while any other refcount is != 0. For instance devs holds a get on dev that is put back only when devs goes to 0: static void tpm_devs_release(struct device *dev) { struct tpm_chip *chip = container_of(dev, struct tpm_chip, devs); /* release the master device reference */ put_device(&chip->dev); } Both cdev elements do something similar inside the cdev layer. The net result is during any open() the tpm_chip is guarenteed to have a positive refcount. Jason
On 05.02.21 at 16:58, Jason Gunthorpe wrote: eference in the first place). > > No, they are all chained together because they are all in the same > struct: > > struct tpm_chip { > struct device dev; > struct device devs; > struct cdev cdev; > struct cdev cdevs; > > dev holds the refcount on memory, when it goes 0 the whole thing is > kfreed. > > The rule is dev's refcount can't go to zero while any other refcount > is != 0. > > For instance devs holds a get on dev that is put back only when devs > goes to 0: > > static void tpm_devs_release(struct device *dev) > { > struct tpm_chip *chip = container_of(dev, struct tpm_chip, devs); > > /* release the master device reference */ > put_device(&chip->dev); > } > > Both cdev elements do something similar inside the cdev layer. Well this chaining is exactly what does not work nowadays and what the patch is supposed to fix: currently we dont ever take the extra ref (not even in TPM 2 case, note that TPM_CHIP_FLAG_TMP2 is never set), so - if (chip->flags & TPM_CHIP_FLAG_TPM2) - get_device(&chip->dev); + get_device(&chip->dev); and tpm_devs_release() is never called, since there is nothing that ever puts devs, so + rc = devm_add_action_or_reset(pdev, + (void (*)(void *)) put_device, + &chip->devs); The race with only get_device()/putdevice() in tpm_common_open()/tpm_common_release() is: 1. tpm chip is allocated with dev refcount = 1, devs refcount = 1 2. /dev/tpmrm is opened but before we get the ref to dev in tpm_common() another thread rmmmods the chip driver: 3. the chip is unregistered, dev is put with refcount = 0 and the whole chip struct is freed 3. Now open() proceeds, tries to grab the extra ref chip->dev from a chip that has already been deallocated and the system crashes. As I already wrote, that approach was my first thought, too, but since the result crashed due to the race condition, I chose the approach in patch 1. Regards, Lino > The net result is during any open() the tpm_chip is guarenteed to have > a positive refcount. >
On Fri, Feb 05, 2021 at 10:50:02PM +0100, Lino Sanfilippo wrote: > On 05.02.21 at 16:58, Jason Gunthorpe wrote: > eference in the first place). > > > > No, they are all chained together because they are all in the same > > struct: > > > > struct tpm_chip { > > struct device dev; > > struct device devs; > > struct cdev cdev; > > struct cdev cdevs; > > > > dev holds the refcount on memory, when it goes 0 the whole thing is > > kfreed. > > > > The rule is dev's refcount can't go to zero while any other refcount > > is != 0. > > > > For instance devs holds a get on dev that is put back only when devs > > goes to 0: > > > > static void tpm_devs_release(struct device *dev) > > { > > struct tpm_chip *chip = container_of(dev, struct tpm_chip, devs); > > > > /* release the master device reference */ > > put_device(&chip->dev); > > } > > > > Both cdev elements do something similar inside the cdev layer. > > Well this chaining is exactly what does not work nowadays and what the patch is supposed > to fix: currently we dont ever take the extra ref (not even in TPM 2 case, note that > TPM_CHIP_FLAG_TMP2 is never set), so > > - if (chip->flags & TPM_CHIP_FLAG_TPM2) > - get_device(&chip->dev); > + get_device(&chip->dev); Oh, hah, yes that is busted up. The patch sketch I sent to James is the right way to handle it, feel free to take it up > and tpm_devs_release() is never called, since there is nothing that ever puts devs, so Yes, that is a pre-existing memory leak > The race with only get_device()/putdevice() in tpm_common_open()/tpm_common_release() is: The refcount handling is busted up and not working the way it is designed, when that is fixed there is no race. Jason