Message ID | 20250307-thermal-sysfs-race-v1-1-8a3d4d4ac9c4@intel.com |
---|---|
State | New |
Series | thermal: core: Delay exposing sysfs interface |
On Sat, Mar 8, 2025 at 2:02 AM Lucas De Marchi <lucas.demarchi@intel.com> wrote:
>
> There's a race between initializing the governor and userspace accessing
> the sysfs interface. From time to time the Intel graphics CI shows this
> signature:
>
> <1>[] #PF: error_code(0x0000) - not-present page
> <6>[] PGD 0 P4D 0
> <4>[] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> <4>[] CPU: 3 UID: 0 PID: 562 Comm: thermald Not tainted 6.14.0-rc4-CI_DRM_16208-g7e37396f86d8+ #1
> <4>[] Hardware name: Intel Corporation Twin Lake Client Platform/AlderLake-N LP5 RVP, BIOS TWLNFWI1.R00.5222.A01.2405290634 05/29/2024
> <4>[] RIP: 0010:policy_show+0x1a/0x40
>
> thermald tries to read the policy file between the sysfs files being
> created and the governor set by thermal_set_governor(), which causes the
> NULL pointer dereference.
>
> Similarly to the hwmon interface, delay exposing the sysfs files to when
> the governor is already set.
>
> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13655
> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
> ---
> The race window is not that big. I could reproduce it and confirm
> the fix by doing this:
>
> 1) Add a udelay() in thermal_zone_device_register_with_trips
> 2) A busy loop cat'ing the file
>
>	$ while [ 1 ]; do
>		cat /sys/devices/virtual/thermal/thermal_zone0/policy > /dev/null 2>&1
>	  done
> 3) rebind processor_thermal_device_pci
> ---
>  drivers/thermal/thermal_core.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 2328ac0d8561b..f96ca27109288 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -1589,26 +1589,26 @@ thermal_zone_device_register_with_trips(const char *type,
>
>  	tz->state = TZ_STATE_FLAG_INIT;
>
> +	result = dev_set_name(&tz->device, "thermal_zone%d", tz->id);
> +	if (result)
> +		goto remove_id;
> +
> +	thermal_zone_device_init(tz);
> +
> +	result = thermal_zone_init_governor(tz);
> +	if (result)
> +		goto remove_id;
> +
>  	/* sys I/F */
>  	/* Add nodes that are always present via .groups */
>  	result = thermal_zone_create_device_groups(tz);
>  	if (result)
>  		goto remove_id;
>
> -	result = dev_set_name(&tz->device, "thermal_zone%d", tz->id);
> -	if (result) {
> -		thermal_zone_destroy_device_groups(tz);
> -		goto remove_id;
> -	}
> -	thermal_zone_device_init(tz);
>  	result = device_register(&tz->device);
>  	if (result)
>  		goto release_device;
>
> -	result = thermal_zone_init_governor(tz);
> -	if (result)
> -		goto unregister;
> -
>  	if (!tz->tzp || !tz->tzp->no_hwmon) {
>  		result = thermal_add_hwmon_sysfs(tz);
>  		if (result)
>
> ---

Applied as 6.15 material, thanks!

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 2328ac0d8561b..f96ca27109288 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1589,26 +1589,26 @@ thermal_zone_device_register_with_trips(const char *type,
 
 	tz->state = TZ_STATE_FLAG_INIT;
 
+	result = dev_set_name(&tz->device, "thermal_zone%d", tz->id);
+	if (result)
+		goto remove_id;
+
+	thermal_zone_device_init(tz);
+
+	result = thermal_zone_init_governor(tz);
+	if (result)
+		goto remove_id;
+
 	/* sys I/F */
 	/* Add nodes that are always present via .groups */
 	result = thermal_zone_create_device_groups(tz);
 	if (result)
 		goto remove_id;
 
-	result = dev_set_name(&tz->device, "thermal_zone%d", tz->id);
-	if (result) {
-		thermal_zone_destroy_device_groups(tz);
-		goto remove_id;
-	}
-	thermal_zone_device_init(tz);
 	result = device_register(&tz->device);
 	if (result)
 		goto release_device;
 
-	result = thermal_zone_init_governor(tz);
-	if (result)
-		goto unregister;
-
 	if (!tz->tzp || !tz->tzp->no_hwmon) {
 		result = thermal_add_hwmon_sysfs(tz);
 		if (result)

There's a race between initializing the governor and userspace accessing
the sysfs interface. From time to time the Intel graphics CI shows this
signature:

<1>[] #PF: error_code(0x0000) - not-present page
<6>[] PGD 0 P4D 0
<4>[] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>[] CPU: 3 UID: 0 PID: 562 Comm: thermald Not tainted 6.14.0-rc4-CI_DRM_16208-g7e37396f86d8+ #1
<4>[] Hardware name: Intel Corporation Twin Lake Client Platform/AlderLake-N LP5 RVP, BIOS TWLNFWI1.R00.5222.A01.2405290634 05/29/2024
<4>[] RIP: 0010:policy_show+0x1a/0x40

thermald tries to read the policy file between the sysfs files being
created and the governor set by thermal_set_governor(), which causes the
NULL pointer dereference.

Similarly to the hwmon interface, delay exposing the sysfs files to when
the governor is already set.

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13655
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
The race window is not that big. I could reproduce it and confirm
the fix by doing this:

1) Add a udelay() in thermal_zone_device_register_with_trips
2) A busy loop cat'ing the file

	$ while [ 1 ]; do
		cat /sys/devices/virtual/thermal/thermal_zone0/policy > /dev/null 2>&1
	  done
3) rebind processor_thermal_device_pci
---
 drivers/thermal/thermal_core.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

---
base-commit: 8aed61b8334e00f4fe5de9f2df1cd183dc328a9d
change-id: 20250307-thermal-sysfs-race-808f6f8376f4

Best regards,