mbox series

[v3,0/2] TPM fixes

Message ID 1612482643-11796-1-git-send-email-LinoSanfilippo@gmx.de
Headers show
Series TPM fixes | expand

Message

Lino Sanfilippo Feb. 4, 2021, 11:50 p.m. UTC
Changes in v3
- drop the patch that introduces the new function tpm_chip_free()
- rework the commit messages for the patches (style, typos, etc.)
- add fixes tag to patch 2
- add James Bottomley to cc list
- add stable mailing list to cc list

Changes in v2:
- drop the patch that erroneously cleaned up after failed installation of
  an action handler in tpmm_chip_alloc() (pointed out by Jarkko Sakkinen)
- make the commit message for patch 1 more detailed
- add fixes tags and kernel logs

Lino Sanfilippo (2):
  tpm: fix reference counting for struct tpm_chip
  tpm: in tpm2_del_space check if ops pointer is still valid

 drivers/char/tpm/tpm-chip.c       | 18 +++++++++++++++---
 drivers/char/tpm/tpm2-space.c     | 15 ++++++++++-----
 drivers/char/tpm/tpm_ftpm_tee.c   |  2 ++
 drivers/char/tpm/tpm_vtpm_proxy.c |  1 +
 4 files changed, 28 insertions(+), 8 deletions(-)

Comments

Stefan Berger Feb. 5, 2021, 12:46 a.m. UTC | #1
On 2/4/21 6:50 PM, Lino Sanfilippo wrote:
> From: Lino Sanfilippo <l.sanfilippo@kunbus.com>

>

> The following sequence of operations results in a refcount warning:

>

> 1. Open device /dev/tpmrm

> 2. Remove module tpm_tis_spi

> 3. Write a TPM command to the file descriptor opened at step 1.

>

> ------------[ cut here ]------------

> WARNING: CPU: 3 PID: 1161 at lib/refcount.c:25 kobject_get+0xa0/0xa4

> refcount_t: addition on 0; use-after-free.

> Modules linked in: tpm_tis_spi tpm_tis_core tpm mdio_bcm_unimac brcmfmac

> sha256_generic libsha256 sha256_arm hci_uart btbcm bluetooth cfg80211 vc4

> brcmutil ecdh_generic ecc snd_soc_core crc32_arm_ce libaes

> raspberrypi_hwmon ac97_bus snd_pcm_dmaengine bcm2711_thermal snd_pcm

> snd_timer genet snd phy_generic soundcore [last unloaded: spi_bcm2835]

> CPU: 3 PID: 1161 Comm: hold_open Not tainted 5.10.0ls-main-dirty #2

> Hardware name: BCM2711

> [<c0410c3c>] (unwind_backtrace) from [<c040b580>] (show_stack+0x10/0x14)

> [<c040b580>] (show_stack) from [<c1092174>] (dump_stack+0xc4/0xd8)

> [<c1092174>] (dump_stack) from [<c0445a30>] (__warn+0x104/0x108)

> [<c0445a30>] (__warn) from [<c0445aa8>] (warn_slowpath_fmt+0x74/0xb8)

> [<c0445aa8>] (warn_slowpath_fmt) from [<c08435d0>] (kobject_get+0xa0/0xa4)

> [<c08435d0>] (kobject_get) from [<bf0a715c>] (tpm_try_get_ops+0x14/0x54 [tpm])

> [<bf0a715c>] (tpm_try_get_ops [tpm]) from [<bf0a7d6c>] (tpm_common_write+0x38/0x60 [tpm])

> [<bf0a7d6c>] (tpm_common_write [tpm]) from [<c05a7ac0>] (vfs_write+0xc4/0x3c0)

> [<c05a7ac0>] (vfs_write) from [<c05a7ee4>] (ksys_write+0x58/0xcc)

> [<c05a7ee4>] (ksys_write) from [<c04001a0>] (ret_fast_syscall+0x0/0x4c)

> Exception stack(0xc226bfa8 to 0xc226bff0)

> bfa0:                   00000000 000105b4 00000003 beafe664 00000014 00000000

> bfc0: 00000000 000105b4 000103f8 00000004 00000000 00000000 b6f9c000 beafe684

> bfe0: 0000006c beafe648 0001056c b6eb6944

> ---[ end trace d4b8409def9b8b1f ]---

>

> The reason for this warning is the attempt to get the chip->dev reference

> in tpm_common_write() although the reference counter is already zero.

>

> Since commit 8979b02aaf1d ("tpm: Fix reference count to main device") the

> extra reference used to prevent a premature zero counter is never taken,

> because the required TPM_CHIP_FLAG_TPM2 flag is never set.

>

> Fix this by removing the flag condition.

>

> Commit fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>")

> already introduced function tpm_devs_release() to release the extra

> reference but did not implement the required put on chip->devs that results

> in the call of this function.

>

> Fix this also by installing an action handler that puts chip->devs as soon

> as the chip is unregistered.

>

> Fixes: fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>")

> Fixes: 8979b02aaf1d ("tpm: Fix reference count to main device")

> Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com>


Tested-by: Stefan Berger <stefanb@linux.ibm.com>


Steps:

modprobe tpm_vtpm_proxy

swtpm chardev --vtpm-proxy --tpm2 --tpmstate dir=./ &

exec 100<>/dev/tpmrm1

kill -9 <swtpm pid>

rmmod tpm_vtpm_proxy

exec 100>&-   # fails before, works after   --> great job! :-)
Stefan Berger Feb. 5, 2021, 1:44 a.m. UTC | #2
On 2/4/21 7:46 PM, Stefan Berger wrote:
> On 2/4/21 6:50 PM, Lino Sanfilippo wrote:

>> From: Lino Sanfilippo <l.sanfilippo@kunbus.com>

>>

>> The following sequence of operations results in a refcount warning:

>>

>> 1. Open device /dev/tpmrm

>> 2. Remove module tpm_tis_spi

>> 3. Write a TPM command to the file descriptor opened at step 1.

>>

>> ------------[ cut here ]------------

>> WARNING: CPU: 3 PID: 1161 at lib/refcount.c:25 kobject_get+0xa0/0xa4

>> refcount_t: addition on 0; use-after-free.

>> Modules linked in: tpm_tis_spi tpm_tis_core tpm mdio_bcm_unimac brcmfmac

>> sha256_generic libsha256 sha256_arm hci_uart btbcm bluetooth cfg80211 

>> vc4

>> brcmutil ecdh_generic ecc snd_soc_core crc32_arm_ce libaes

>> raspberrypi_hwmon ac97_bus snd_pcm_dmaengine bcm2711_thermal snd_pcm

>> snd_timer genet snd phy_generic soundcore [last unloaded: spi_bcm2835]

>> CPU: 3 PID: 1161 Comm: hold_open Not tainted 5.10.0ls-main-dirty #2

>> Hardware name: BCM2711

>> [<c0410c3c>] (unwind_backtrace) from [<c040b580>] (show_stack+0x10/0x14)

>> [<c040b580>] (show_stack) from [<c1092174>] (dump_stack+0xc4/0xd8)

>> [<c1092174>] (dump_stack) from [<c0445a30>] (__warn+0x104/0x108)

>> [<c0445a30>] (__warn) from [<c0445aa8>] (warn_slowpath_fmt+0x74/0xb8)

>> [<c0445aa8>] (warn_slowpath_fmt) from [<c08435d0>] 

>> (kobject_get+0xa0/0xa4)

>> [<c08435d0>] (kobject_get) from [<bf0a715c>] 

>> (tpm_try_get_ops+0x14/0x54 [tpm])

>> [<bf0a715c>] (tpm_try_get_ops [tpm]) from [<bf0a7d6c>] 

>> (tpm_common_write+0x38/0x60 [tpm])

>> [<bf0a7d6c>] (tpm_common_write [tpm]) from [<c05a7ac0>] 

>> (vfs_write+0xc4/0x3c0)

>> [<c05a7ac0>] (vfs_write) from [<c05a7ee4>] (ksys_write+0x58/0xcc)

>> [<c05a7ee4>] (ksys_write) from [<c04001a0>] (ret_fast_syscall+0x0/0x4c)

>> Exception stack(0xc226bfa8 to 0xc226bff0)

>> bfa0:                   00000000 000105b4 00000003 beafe664 00000014 

>> 00000000

>> bfc0: 00000000 000105b4 000103f8 00000004 00000000 00000000 b6f9c000 

>> beafe684

>> bfe0: 0000006c beafe648 0001056c b6eb6944

>> ---[ end trace d4b8409def9b8b1f ]---

>>

>> The reason for this warning is the attempt to get the chip->dev 

>> reference

>> in tpm_common_write() although the reference counter is already zero.

>>

>> Since commit 8979b02aaf1d ("tpm: Fix reference count to main device") 

>> the

>> extra reference used to prevent a premature zero counter is never taken,

>> because the required TPM_CHIP_FLAG_TPM2 flag is never set.

>>

>> Fix this by removing the flag condition.

>>

>> Commit fdc915f7f719 ("tpm: expose spaces via a device link 

>> /dev/tpmrm<n>")

>> already introduced function tpm_devs_release() to release the extra

>> reference but did not implement the required put on chip->devs that 

>> results

>> in the call of this function.

>>

>> Fix this also by installing an action handler that puts chip->devs as 

>> soon

>> as the chip is unregistered.

>>

>> Fixes: fdc915f7f719 ("tpm: expose spaces via a device link 

>> /dev/tpmrm<n>")

>> Fixes: 8979b02aaf1d ("tpm: Fix reference count to main device")

>> Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com>

>

> Tested-by: Stefan Berger <stefanb@linux.ibm.com>

>

> Steps:

>

> modprobe tpm_vtpm_proxy

>

> swtpm chardev --vtpm-proxy --tpm2 --tpmstate dir=./ &

>

> exec 100<>/dev/tpmrm1

>

> kill -9 <swtpm pid>

>

> rmmod tpm_vtpm_proxy

>

> exec 100>&-   # fails before, works after   --> great job! :-)

>

>

To clarify: When I tested this I had *both* patches applied. Without the 
patches I got the null pointer exception in tpm2_del_space(). The 2nd 
patch alone solves that issue when using the steps above.

[  525.647443] [c000000005d3bba0] [c000000000e81d78] 
mutex_lock+0x28/0x90 (unreliable)
[  525.647539] [c000000005d3bbd0] [c0080000001f5da0] 
tpm2_del_space+0x48/0x130 [tpm]
[  525.647635] [c000000005d3bc20] [c0080000001f56b8] 
tpmrm_release+0x40/0x70 [tpm]
[  525.647746] [c000000005d3bc50] [c0000000004bf718] __fput+0xb8/0x340
[  525.647842] [c000000005d3bca0] [c00000000017def4] 
task_work_run+0xe4/0x150
[  525.647930] [c000000005d3bcf0] [c00000000001feb4] 
do_notify_resume+0x484/0x4f0
[  525.648023] [c000000005d3bdb0] [c000000000033a64] 
syscall_exit_prepare+0x1d4/0x330
[  525.648115] [c000000005d3be20] [c00000000000d96c] 
system_call_common+0xfc/0x27c
James Bottomley Feb. 5, 2021, 2:01 a.m. UTC | #3
On Thu, 2021-02-04 at 20:44 -0500, Stefan Berger wrote:
> To clarify: When I tested this I had *both* patches applied. Without

> the patches I got the null pointer exception in tpm2_del_space(). The

> 2nd patch alone solves that issue when using the steps above.



Yes, I can't confirm the bug either.  I only have lpc tis devices, so
it could be something to do with spi, but when I do

python3 in one shell

>>> fd = open("/dev/tpmrm0", "wb")


do rmmod tpm_tis in another shell

>>> buf = bytearray(20)

>>> fd.write(buf)

20

so I don't see the oops you see on write.  However

>>> fd.close()


And it oopses here in tpm2_del_space

James
Greg KH Feb. 5, 2021, 6:50 a.m. UTC | #4
On Fri, Feb 05, 2021 at 12:50:42AM +0100, Lino Sanfilippo wrote:
> From: Lino Sanfilippo <l.sanfilippo@kunbus.com>

> 

> The following sequence of operations results in a refcount warning:

> 

> 1. Open device /dev/tpmrm

> 2. Remove module tpm_tis_spi

> 3. Write a TPM command to the file descriptor opened at step 1.

> 

> ------------[ cut here ]------------

> WARNING: CPU: 3 PID: 1161 at lib/refcount.c:25 kobject_get+0xa0/0xa4

> refcount_t: addition on 0; use-after-free.

> Modules linked in: tpm_tis_spi tpm_tis_core tpm mdio_bcm_unimac brcmfmac

> sha256_generic libsha256 sha256_arm hci_uart btbcm bluetooth cfg80211 vc4

> brcmutil ecdh_generic ecc snd_soc_core crc32_arm_ce libaes

> raspberrypi_hwmon ac97_bus snd_pcm_dmaengine bcm2711_thermal snd_pcm

> snd_timer genet snd phy_generic soundcore [last unloaded: spi_bcm2835]

> CPU: 3 PID: 1161 Comm: hold_open Not tainted 5.10.0ls-main-dirty #2

> Hardware name: BCM2711

> [<c0410c3c>] (unwind_backtrace) from [<c040b580>] (show_stack+0x10/0x14)

> [<c040b580>] (show_stack) from [<c1092174>] (dump_stack+0xc4/0xd8)

> [<c1092174>] (dump_stack) from [<c0445a30>] (__warn+0x104/0x108)

> [<c0445a30>] (__warn) from [<c0445aa8>] (warn_slowpath_fmt+0x74/0xb8)

> [<c0445aa8>] (warn_slowpath_fmt) from [<c08435d0>] (kobject_get+0xa0/0xa4)

> [<c08435d0>] (kobject_get) from [<bf0a715c>] (tpm_try_get_ops+0x14/0x54 [tpm])

> [<bf0a715c>] (tpm_try_get_ops [tpm]) from [<bf0a7d6c>] (tpm_common_write+0x38/0x60 [tpm])

> [<bf0a7d6c>] (tpm_common_write [tpm]) from [<c05a7ac0>] (vfs_write+0xc4/0x3c0)

> [<c05a7ac0>] (vfs_write) from [<c05a7ee4>] (ksys_write+0x58/0xcc)

> [<c05a7ee4>] (ksys_write) from [<c04001a0>] (ret_fast_syscall+0x0/0x4c)

> Exception stack(0xc226bfa8 to 0xc226bff0)

> bfa0:                   00000000 000105b4 00000003 beafe664 00000014 00000000

> bfc0: 00000000 000105b4 000103f8 00000004 00000000 00000000 b6f9c000 beafe684

> bfe0: 0000006c beafe648 0001056c b6eb6944

> ---[ end trace d4b8409def9b8b1f ]---

> 

> The reason for this warning is the attempt to get the chip->dev reference

> in tpm_common_write() although the reference counter is already zero.

> 

> Since commit 8979b02aaf1d ("tpm: Fix reference count to main device") the

> extra reference used to prevent a premature zero counter is never taken,

> because the required TPM_CHIP_FLAG_TPM2 flag is never set.

> 

> Fix this by removing the flag condition.

> 

> Commit fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>")

> already introduced function tpm_devs_release() to release the extra

> reference but did not implement the required put on chip->devs that results

> in the call of this function.

> 

> Fix this also by installing an action handler that puts chip->devs as soon

> as the chip is unregistered.

> 

> Fixes: fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>")

> Fixes: 8979b02aaf1d ("tpm: Fix reference count to main device")

> Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com>

> ---

>  drivers/char/tpm/tpm-chip.c       | 18 +++++++++++++++---

>  drivers/char/tpm/tpm_ftpm_tee.c   |  2 ++

>  drivers/char/tpm/tpm_vtpm_proxy.c |  1 +

>  3 files changed, 18 insertions(+), 3 deletions(-)


<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>
Lino Sanfilippo Feb. 5, 2021, 10:34 a.m. UTC | #5
Hi Stefan,

On 05.02.21 01:46, Stefan Berger wrote:
> On 2/4/21 6:50 PM, Lino Sanfilippo wrote:


>> Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com>

> 

> Tested-by: Stefan Berger <stefanb@linux.ibm.com>

> 

> Steps:

> 

> modprobe tpm_vtpm_proxy

> 

> swtpm chardev --vtpm-proxy --tpm2 --tpmstate dir=./ &

> 

> exec 100<>/dev/tpmrm1

> 

> kill -9 <swtpm pid>

> 

> rmmod tpm_vtpm_proxy

> 

> exec 100>&-   # fails before, works after   --> great job! :-)

> 

> 

Great, thank you very much for testing!

Best Regards,
Lino
Lino Sanfilippo Feb. 5, 2021, 10:52 a.m. UTC | #6
Hi,

On 05.02.21 03:01, James Bottomley wrote:
> On Thu, 2021-02-04 at 20:44 -0500, Stefan Berger wrote:

>> To clarify: When I tested this I had *both* patches applied. Without

>> the patches I got the null pointer exception in tpm2_del_space(). The

>> 2nd patch alone solves that issue when using the steps above.

> 

> 

> Yes, I can't confirm the bug either.  I only have lpc tis devices, so

> it could be something to do with spi, but when I do

>



> python3 in one shell

> 

>>>> fd = open("/dev/tpmrm0", "wb")

> 

> do rmmod tpm_tis in another shell

> 

>>>> buf = bytearray(20)

>>>> fd.write(buf)

> 20



The issue is in the TPM chip driver code, so AFAIU it should not matter whether its
SPI or something else. Maybe check again, that the file is still open when 
tpm_tis is removed and the write actually comes after the rmmod?
Also note that there are some sanity checks in tpm_common_write() that the written
data has to pass to get to the point where tpm_try_get_ops() is called, which 
is the call that eventually triggers the bug.

> 

> so I don't see the oops you see on write.  However

> 

>>>> fd.close()

> 

> And it oopses here in tpm2_del_space

> 

> James

> 

> 


Regards,
Lino
Jason Gunthorpe Feb. 5, 2021, 1:05 p.m. UTC | #7
On Fri, Feb 05, 2021 at 12:50:42AM +0100, Lino Sanfilippo wrote:
> From: Lino Sanfilippo <l.sanfilippo@kunbus.com>

> 

> The following sequence of operations results in a refcount warning:

> 

> 1. Open device /dev/tpmrm

> 2. Remove module tpm_tis_spi

> 3. Write a TPM command to the file descriptor opened at step 1.

> 

> WARNING: CPU: 3 PID: 1161 at lib/refcount.c:25 kobject_get+0xa0/0xa4

> refcount_t: addition on 0; use-after-free.

> Modules linked in: tpm_tis_spi tpm_tis_core tpm mdio_bcm_unimac brcmfmac

> sha256_generic libsha256 sha256_arm hci_uart btbcm bluetooth cfg80211 vc4

> brcmutil ecdh_generic ecc snd_soc_core crc32_arm_ce libaes

> raspberrypi_hwmon ac97_bus snd_pcm_dmaengine bcm2711_thermal snd_pcm

> snd_timer genet snd phy_generic soundcore [last unloaded: spi_bcm2835]

> CPU: 3 PID: 1161 Comm: hold_open Not tainted 5.10.0ls-main-dirty #2

> Hardware name: BCM2711

> [<c0410c3c>] (unwind_backtrace) from [<c040b580>] (show_stack+0x10/0x14)

> [<c040b580>] (show_stack) from [<c1092174>] (dump_stack+0xc4/0xd8)

> [<c1092174>] (dump_stack) from [<c0445a30>] (__warn+0x104/0x108)

> [<c0445a30>] (__warn) from [<c0445aa8>] (warn_slowpath_fmt+0x74/0xb8)

> [<c0445aa8>] (warn_slowpath_fmt) from [<c08435d0>] (kobject_get+0xa0/0xa4)

> [<c08435d0>] (kobject_get) from [<bf0a715c>] (tpm_try_get_ops+0x14/0x54 [tpm])

> [<bf0a715c>] (tpm_try_get_ops [tpm]) from [<bf0a7d6c>] (tpm_common_write+0x38/0x60 [tpm])

> [<bf0a7d6c>] (tpm_common_write [tpm]) from [<c05a7ac0>] (vfs_write+0xc4/0x3c0)

> [<c05a7ac0>] (vfs_write) from [<c05a7ee4>] (ksys_write+0x58/0xcc)

> [<c05a7ee4>] (ksys_write) from [<c04001a0>] (ret_fast_syscall+0x0/0x4c)

> Exception stack(0xc226bfa8 to 0xc226bff0)

> bfa0:                   00000000 000105b4 00000003 beafe664 00000014 00000000

> bfc0: 00000000 000105b4 000103f8 00000004 00000000 00000000 b6f9c000 beafe684

> bfe0: 0000006c beafe648 0001056c b6eb6944

> 

> The reason for this warning is the attempt to get the chip->dev reference

> in tpm_common_write() although the reference counter is already zero.



> Since commit 8979b02aaf1d ("tpm: Fix reference count to main device") the

> extra reference used to prevent a premature zero counter is never taken,

> because the required TPM_CHIP_FLAG_TPM2 flag is never set.

> 

> Fix this by removing the flag condition.

> 

> Commit fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>")

> already introduced function tpm_devs_release() to release the extra

> reference but did not implement the required put on chip->devs that results

> in the call of this function.


Seems wonky, the devs is just supposed to be a side thing, nothing
should be using it as a primary reference count for a tpm.

The bug here is only that tpm_common_open() did not get a kref on the
chip before putting it in priv and linking it to the fd. See the
comment before tpm_try_get_ops() indicating the caller must already
have taken care to ensure the chip is valid.

This should be all you need to fix the oops:

diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c
index 1784530b8387bb..1b738dca7fffb5 100644
--- a/drivers/char/tpm/tpm-dev-common.c
+++ b/drivers/char/tpm/tpm-dev-common.c
@@ -105,6 +105,7 @@ static void tpm_timeout_work(struct work_struct *work)
 void tpm_common_open(struct file *file, struct tpm_chip *chip,
                     struct file_priv *priv, struct tpm_space *space)
 {
+       get_device(&priv->chip.dev);
        priv->chip = chip;
        priv->space = space;
        priv->response_read = true;
@@ -261,6 +262,7 @@ void tpm_common_release(struct file *file, struct file_priv *priv)
        flush_work(&priv->timeout_work);
        file->private_data = NULL;
        priv->response_length = 0;
+       put_device(&chip->dev);
 }
 
 int __init tpm_dev_common_init(void)

> Fix this also by installing an action handler that puts chip->devs as soon

> as the chip is unregistered.

> 

> Fixes: fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>")

> Fixes: 8979b02aaf1d ("tpm: Fix reference count to main device")

> Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com>

>  drivers/char/tpm/tpm-chip.c       | 18 +++++++++++++++---

>  drivers/char/tpm/tpm_ftpm_tee.c   |  2 ++

>  drivers/char/tpm/tpm_vtpm_proxy.c |  1 +

>  3 files changed, 18 insertions(+), 3 deletions(-)

> 

> diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c

> index ddaeceb..3ace199 100644

> +++ b/drivers/char/tpm/tpm-chip.c

> @@ -360,8 +360,7 @@ struct tpm_chip *tpm_chip_alloc(struct device *pdev,

>  	 * while cdevs is in use.  The corresponding put

>  	 * is in the tpm_devs_release (TPM2 only)

>  	 */

> -	if (chip->flags & TPM_CHIP_FLAG_TPM2)

> -		get_device(&chip->dev);

> +	get_device(&chip->dev);

>  

>  	if (chip->dev_num == 0)

>  		chip->dev.devt = MKDEV(MISC_MAJOR, TPM_MINOR);

> @@ -422,8 +421,21 @@ struct tpm_chip *tpmm_chip_alloc(struct device *pdev,

>  	rc = devm_add_action_or_reset(pdev,

>  				      (void (*)(void *)) put_device,

>  				      &chip->dev);

> -	if (rc)

> +	if (rc) {

> +		put_device(&chip->devs);

>  		return ERR_PTR(rc);


This isn't right read what 'or_reset' does

Jason
Stefan Berger Feb. 5, 2021, 1:29 p.m. UTC | #8
On 2/4/21 9:01 PM, James Bottomley wrote:
> On Thu, 2021-02-04 at 20:44 -0500, Stefan Berger wrote:

>> To clarify: When I tested this I had *both* patches applied. Without

>> the patches I got the null pointer exception in tpm2_del_space(). The

>> 2nd patch alone solves that issue when using the steps above.

>

> Yes, I can't confirm the bug either.  I only have lpc tis devices, so

> it could be something to do with spi, but when I do



I can confirm this bug:

insmod /usr/lib/modules/5.10.0+/extra/tpm.ko ; insmod 
/usr/lib/modules/5.10.0+/extra/tpm_vtpm_proxy.ko

swtpm chardev --vtpm-proxy --tpm2 --tpmstate dir=./ &

exec 100<>/dev/tpmrm0

kill -9 <swtpm pid>

rmmod tpm_vtpm_proxy

echo -en '\x80\x01\x00\x00\x00\x0c\x00\x00\x01\x44\x00\x00' >&100


[  167.289390] [c000000015d6fb60] [c0000000007d3ac0] 
refcount_warn_saturate+0x210/0x230 (unreliable)
[  167.290392] [c000000015d6fbc0] [c000000000831328] kobject_put+0x1b8/0x2e0
[  167.291398] [c000000015d6fc50] [c000000000955548] put_device+0x28/0x40
[  167.292409] [c000000015d6fc70] [c0080000008609a8] 
tpm_try_get_ops+0xb0/0x100 [tpm]
[  167.293417] [c000000015d6fcb0] [c008000000861864] 
tpm_common_write+0x15c/0x250 [tpm]
[  167.294429] [c000000015d6fd20] [c0000000004be190] vfs_write+0xf0/0x380
[  167.295437] [c000000015d6fd70] [c0000000004be6c8] ksys_write+0x78/0x130
[  167.296450] [c000000015d6fdc0] [c00000000003377c] 
system_call_exception+0x15c/0x270
[  167.297461] [c000000015d6fe20] [c00000000000d960] 
system_call_common+0xf0/0x27c

With this patch applied this error here is gone. Just have make sure to 
replace tpm.ko and tpm_vtpm_proxy.ko, not just the latter.

So my Tested-By is good for both patches.


    Stefan
Lino Sanfilippo Feb. 5, 2021, 2:55 p.m. UTC | #9
Hi,

On 05.02.21 14:05, Jason Gunthorpe wrote:

>>

>> Commit fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>")

>> already introduced function tpm_devs_release() to release the extra

>> reference but did not implement the required put on chip->devs that results

>> in the call of this function.

> 

> Seems wonky, the devs is just supposed to be a side thing, nothing

> should be using it as a primary reference count for a tpm.

> 

> The bug here is only that tpm_common_open() did not get a kref on the

> chip before putting it in priv and linking it to the fd. See the

> comment before tpm_try_get_ops() indicating the caller must already

> have taken care to ensure the chip is valid.

> 

> This should be all you need to fix the oops:

> 

> diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c

> index 1784530b8387bb..1b738dca7fffb5 100644

> --- a/drivers/char/tpm/tpm-dev-common.c

> +++ b/drivers/char/tpm/tpm-dev-common.c

> @@ -105,6 +105,7 @@ static void tpm_timeout_work(struct work_struct *work)

>  void tpm_common_open(struct file *file, struct tpm_chip *chip,

>                      struct file_priv *priv, struct tpm_space *space)

>  {

> +       get_device(&priv->chip.dev);

>         priv->chip = chip;

>         priv->space = space;

>         priv->response_read = true;


This is racy, isnt it? The time between we open the file and we want to grab the
reference in common_open() the chip can already be unregistered and freed.

As a matter of fact this solution was the first thing that came into my mind, too,
until I noticed the possible race condition. I can only guess that this was what
James had in mind when he chose to take the extra reference to chip->dev in
tpm_chip_alloc() instead of common_open(). 


>> diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c

>> index ddaeceb..3ace199 100644

>> +++ b/drivers/char/tpm/tpm-chip.c

>> @@ -360,8 +360,7 @@ struct tpm_chip *tpm_chip_alloc(struct device *pdev,

>>  	 * while cdevs is in use.  The corresponding put

>>  	 * is in the tpm_devs_release (TPM2 only)

>>  	 */

>> -	if (chip->flags & TPM_CHIP_FLAG_TPM2)

>> -		get_device(&chip->dev);

>> +	get_device(&chip->dev);

>>  

>>  	if (chip->dev_num == 0)

>>  		chip->dev.devt = MKDEV(MISC_MAJOR, TPM_MINOR);

>> @@ -422,8 +421,21 @@ struct tpm_chip *tpmm_chip_alloc(struct device *pdev,

>>  	rc = devm_add_action_or_reset(pdev,

>>  				      (void (*)(void *)) put_device,

>>  				      &chip->dev);

>> -	if (rc)

>> +	if (rc) {

>> +		put_device(&chip->devs);

>>  		return ERR_PTR(rc);

> 

> This isn't right read what 'or_reset' does

> 

 
In case of failure installing the action handler devm_add_action_or_reset() puts
chip->dev for us. But we also have put chip->devs since we have retrieved a 
reference to both chip->dev and chip->devs. Or do I miss something here?

> Jason

> 


Regards,
Lino
Jason Gunthorpe Feb. 5, 2021, 3:15 p.m. UTC | #10
On Fri, Feb 05, 2021 at 03:55:09PM +0100, Lino Sanfilippo wrote:
> Hi,

> 

> On 05.02.21 14:05, Jason Gunthorpe wrote:

> 

> >>

> >> Commit fdc915f7f719 ("tpm: expose spaces via a device link /dev/tpmrm<n>")

> >> already introduced function tpm_devs_release() to release the extra

> >> reference but did not implement the required put on chip->devs that results

> >> in the call of this function.

> > 

> > Seems wonky, the devs is just supposed to be a side thing, nothing

> > should be using it as a primary reference count for a tpm.

> > 

> > The bug here is only that tpm_common_open() did not get a kref on the

> > chip before putting it in priv and linking it to the fd. See the

> > comment before tpm_try_get_ops() indicating the caller must already

> > have taken care to ensure the chip is valid.

> > 

> > This should be all you need to fix the oops:

> > 

> > diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c

> > index 1784530b8387bb..1b738dca7fffb5 100644

> > +++ b/drivers/char/tpm/tpm-dev-common.c

> > @@ -105,6 +105,7 @@ static void tpm_timeout_work(struct work_struct *work)

> >  void tpm_common_open(struct file *file, struct tpm_chip *chip,

> >                      struct file_priv *priv, struct tpm_space *space)

> >  {

> > +       get_device(&priv->chip.dev);

> >         priv->chip = chip;

> >         priv->space = space;

> >         priv->response_read = true;

> 

> This is racy, isnt it? The time between we open the file and we want to grab the

> reference in common_open() the chip can already be unregistered and freed.


No, the cdev layer holds the refcount on the device while open is
being called.

Jason
Lino Sanfilippo Feb. 5, 2021, 3:50 p.m. UTC | #11
On 05.02.21 16:15, Jason Gunthorpe wrote:
> 

> No, the cdev layer holds the refcount on the device while open is

> being called.

> 

> Jason

> 


Yes, but the reference that is responsible for the chip deallocation is chip->dev
which is linked to chip->cdev and represents /dev/tpm, not /dev/tpmrm.
You are right, we dont have the issue with /dev/tpm for the reason you mentioned.
But /dev/tpmrm is represented by chip->cdevs and keeping this ref held by the cdev 
layer wont protect us from the chip being freed (which is the reason why we need
the chip->dev reference in the first place).

And yes, the naming dev/devs/cdev/cdevs is quite confusing  :(

Regards,
Lino
Jason Gunthorpe Feb. 5, 2021, 3:58 p.m. UTC | #12
On Fri, Feb 05, 2021 at 04:50:13PM +0100, Lino Sanfilippo wrote:
> 

> On 05.02.21 16:15, Jason Gunthorpe wrote:

> > 

> > No, the cdev layer holds the refcount on the device while open is

> > being called.

> > 

> Yes, but the reference that is responsible for the chip deallocation is chip->dev

> which is linked to chip->cdev and represents /dev/tpm, not /dev/tpmrm.

> You are right, we dont have the issue with /dev/tpm for the reason you mentioned.

> But /dev/tpmrm is represented by chip->cdevs and keeping this ref held by the cdev

> layer wont protect us from the chip being freed (which is the reason why we need

> the chip->dev reference in the first place).


No, they are all chained together because they are all in the same
struct:

struct tpm_chip {
	struct device dev;
	struct device devs;
	struct cdev cdev;
	struct cdev cdevs;

dev holds the refcount on memory, when it goes 0 the whole thing is
kfreed.

The rule is dev's refcount can't go to zero while any other refcount
is != 0.

For instance devs holds a get on dev that is put back only when devs
goes to 0:

static void tpm_devs_release(struct device *dev)
{
	struct tpm_chip *chip = container_of(dev, struct tpm_chip, devs);

	/* release the master device reference */
	put_device(&chip->dev);
}

Both cdev elements do something similar inside the cdev layer.

The net result is during any open() the tpm_chip is guarenteed to have
a positive refcount.

Jason
Lino Sanfilippo Feb. 5, 2021, 9:50 p.m. UTC | #13
On 05.02.21 at 16:58, Jason Gunthorpe wrote:
eference in the first place).
>

> No, they are all chained together because they are all in the same

> struct:

>

> struct tpm_chip {

> 	struct device dev;

> 	struct device devs;

> 	struct cdev cdev;

> 	struct cdev cdevs;

>

> dev holds the refcount on memory, when it goes 0 the whole thing is

> kfreed.

>

> The rule is dev's refcount can't go to zero while any other refcount

> is != 0.

>

> For instance devs holds a get on dev that is put back only when devs

> goes to 0:

>

> static void tpm_devs_release(struct device *dev)

> {

> 	struct tpm_chip *chip = container_of(dev, struct tpm_chip, devs);

>

> 	/* release the master device reference */

> 	put_device(&chip->dev);

> }

>

> Both cdev elements do something similar inside the cdev layer.


Well this chaining is exactly what does not work nowadays and what the patch is supposed
to fix: currently we dont ever take the extra ref (not even in TPM 2 case, note that
TPM_CHIP_FLAG_TMP2 is never set), so

-	if (chip->flags & TPM_CHIP_FLAG_TPM2)
-		get_device(&chip->dev);
+	get_device(&chip->dev);


and tpm_devs_release() is never called, since there is nothing that ever puts devs, so


+	rc = devm_add_action_or_reset(pdev,
+				      (void (*)(void *)) put_device,
+				      &chip->devs);


The race with only get_device()/putdevice() in tpm_common_open()/tpm_common_release() is:

1. tpm chip is allocated with dev refcount = 1, devs refcount = 1
2. /dev/tpmrm is opened but before we get the ref to dev in tpm_common() another thread
rmmmods the chip driver:
3. the chip is unregistered, dev is put with refcount = 0 and the whole chip struct is freed
3. Now open() proceeds, tries to grab the extra ref chip->dev from a chip that has already
been deallocated and the system crashes.

As I already wrote, that approach was my first thought, too, but since the result crashed due to the
race condition, I chose the approach in patch 1.

Regards,
Lino

> The net result is during any open() the tpm_chip is guarenteed to have

> a positive refcount.

>
Jason Gunthorpe Feb. 6, 2021, 12:39 a.m. UTC | #14
On Fri, Feb 05, 2021 at 10:50:02PM +0100, Lino Sanfilippo wrote:
> On 05.02.21 at 16:58, Jason Gunthorpe wrote:

> eference in the first place).

> >

> > No, they are all chained together because they are all in the same

> > struct:

> >

> > struct tpm_chip {

> > 	struct device dev;

> > 	struct device devs;

> > 	struct cdev cdev;

> > 	struct cdev cdevs;

> >

> > dev holds the refcount on memory, when it goes 0 the whole thing is

> > kfreed.

> >

> > The rule is dev's refcount can't go to zero while any other refcount

> > is != 0.

> >

> > For instance devs holds a get on dev that is put back only when devs

> > goes to 0:

> >

> > static void tpm_devs_release(struct device *dev)

> > {

> > 	struct tpm_chip *chip = container_of(dev, struct tpm_chip, devs);

> >

> > 	/* release the master device reference */

> > 	put_device(&chip->dev);

> > }

> >

> > Both cdev elements do something similar inside the cdev layer.

> 

> Well this chaining is exactly what does not work nowadays and what the patch is supposed

> to fix: currently we dont ever take the extra ref (not even in TPM 2 case, note that

> TPM_CHIP_FLAG_TMP2 is never set), so

> 

> -	if (chip->flags & TPM_CHIP_FLAG_TPM2)

> -		get_device(&chip->dev);

> +	get_device(&chip->dev);


Oh, hah, yes that is busted up. The patch sketch I sent to James is
the right way to handle it, feel free to take it up

> and tpm_devs_release() is never called, since there is nothing that ever puts devs, so


Yes, that is a pre-existing memory leak

> The race with only get_device()/putdevice() in tpm_common_open()/tpm_common_release() is:


The refcount handling is busted up and not working the way it is
designed, when that is fixed there is no race.

Jason