[0/3] Fix Navi3x boot and hotplug problems

Message ID 20230926225955.386553-1-mario.limonciello@amd.com

Message

Mario Limonciello Sept. 26, 2023, 10:59 p.m. UTC
On some OEM systems, multiple Navi3x dGPUs are triggering RAS errors
and BACO errors.

These errors come from elements of the OEM system that weren't part of
the original test environment.  This series addresses those problems.

NOTE: Although this series touches two subsystems, I would prefer to
take this all through DRM because there is a workaround in linux-next
that I would like to be reverted at the same time as picking up the first
two patches.

Mario Limonciello (3):
  drm/amd: Fix detection of _PR3 on the PCIe root port
  power: supply: Don't count 'unknown' scope power supplies
  Revert "drm/amd/pm: workaround for the wrong ac power detection on smu
    13.0.0"

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c           | 2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c       | 3 ++-
 drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c | 1 +
 drivers/power/supply/power_supply_core.c             | 2 +-
 4 files changed, 5 insertions(+), 3 deletions(-)

Comments

Sebastian Reichel Sept. 30, 2023, 8:18 p.m. UTC | #1
Hi,

On Tue, Sep 26, 2023 at 05:59:54PM -0500, Mario Limonciello wrote:
> On some systems AMD Navi3x dGPU triggers RAS errors on startup; but
> only if the amdgpu kernel module is not part of the initramfs.
> This is because the hardware is not properly programmed for the
> AC/DC state of the system when it is loaded later in boot.

I don't understand the last sentence. As far as I can see
i2c_dw_pci_probe() either does not register UCSI at all or
registers it with the dGPU properties (and thus scope) set.

> The AC/DC state of the system is incorrect specifically when UCSI power
> supplies have been initialized.  These power supplies are marked as
> POWER_SUPPLY_SCOPE_UNKNOWN scope. As they're 'offline' the power
> supply count is increased but the resultant return value of
> power_supply_is_system_supplied() is 0.
> 
> To fix this look explicitly for `POWER_SUPPLY_SCOPE_SYSTEM` power
> supplies before incrementing the count. If no system power supply
> is found then the system is assumed to be on AC.
> 
> Cc: stable@vger.kernel.org
> Tested-by: David Perry <David.Perry@amd.com>
> Fixes: 95339f40a8b6 ("power: supply: Fix logic checking if system is running from battery")
> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> ---

This effectively fully disables supply detection for UCSI, because
it is never set to POWER_SUPPLY_SCOPE_SYSTEM. Please fix the amdgpu
init part instead.

-- Sebastian

>  drivers/power/supply/power_supply_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/power/supply/power_supply_core.c b/drivers/power/supply/power_supply_core.c
> index d325e6dbc770..3de6e6d00815 100644
> --- a/drivers/power/supply/power_supply_core.c
> +++ b/drivers/power/supply/power_supply_core.c
> @@ -349,7 +349,7 @@ static int __power_supply_is_system_supplied(struct device *dev, void *data)
>  	unsigned int *count = data;
>  
>  	if (!psy->desc->get_property(psy, POWER_SUPPLY_PROP_SCOPE, &ret))
> -		if (ret.intval == POWER_SUPPLY_SCOPE_DEVICE)
> +		if (ret.intval != POWER_SUPPLY_SCOPE_SYSTEM)
>  			return 0;
>  
>  	(*count)++;
> -- 
> 2.34.1
>
Mario Limonciello Oct. 2, 2023, midnight UTC | #2
On 9/30/2023 15:18, Sebastian Reichel wrote:
> Hi,
> 
> On Tue, Sep 26, 2023 at 05:59:54PM -0500, Mario Limonciello wrote:
>> On some systems AMD Navi3x dGPU triggers RAS errors on startup; but
>> only if the amdgpu kernel module is not part of the initramfs.
>> This is because the hardware is not properly programmed for the
>> AC/DC state of the system when it is loaded later in boot.
> 
> I don't understand the last sentence. As far as I can see
> i2c_dw_pci_probe() either does not register UCSI at all or
> registers it with the dGPU properties (and thus scope) set.

I'll explain it better below.

> 
>> The AC/DC state of the system is incorrect specifically when UCSI power
>> supplies have been initialized.  These power supplies are marked as
>> POWER_SUPPLY_SCOPE_UNKNOWN scope. As they're 'offline' the power
>> supply count is increased but the resultant return value of
>> power_supply_is_system_supplied() is 0.
>>
>> To fix this look explicitly for `POWER_SUPPLY_SCOPE_SYSTEM` power
>> supplies before incrementing the count. If no system power supply
>> is found then the system is assumed to be on AC.
>>
>> Cc: stable@vger.kernel.org
>> Tested-by: David Perry <David.Perry@amd.com>
>> Fixes: 95339f40a8b6 ("power: supply: Fix logic checking if system is running from battery")
>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>> ---
> 
> This effectively fully disables supply detection for UCSI, because
> it is never set to POWER_SUPPLY_SCOPE_SYSTEM. Please fix the amdgpu
> init part instead.

I don't think my commit message did a good job conveying why this is a 
core bug.  Let me try to add more detail.

This is an OEM system that has 3 USB type C ports.  It's an Intel 
system, but this doesn't matter for the issue.
* When ucsi_acpi is not loaded there are no power supplies in the system
and power_supply_is_system_supplied() reports AC.
* When ucsi_acpi is loaded 3 power supplies will be registered.
power_supply_is_system_supplied() reports as DC.

Now when you add in a Navi3x AMD dGPU to the system the power supplies 
don't change.  This particular dGPU model doesn't contain a USB-C port, 
so there is no UCSI power supply registered.

When amdgpu loads, it checks during device initialization whether the
system is powered by AC or DC.  Here is the relevant code:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c?h=linux-6.5.y#n3834

On the OEM system if amdgpu loads before the ucsi_acpi driver (such as 
in the initramfs) then the right value is returned for 
power_supply_is_system_supplied() - AC.

If amdgpu is loaded after the ucsi_acpi driver, the wrong value is 
returned for power_supply_is_system_supplied() - DC.

This value is very important to set up the dGPU properly.  If the wrong 
value is returned, the wrong value will be notified to the hardware and 
the hardware will not behave properly.  On the OEM system this is a 
"black screen" at bootup along with RAS errors emitted by the dGPU.

Without changing the malfunctioning kernel or initramfs binaries, I can
add modprobe.blacklist=ucsi_acpi to the kernel command line to avoid
registering those 3 power supplies, and the system then behaves properly.

So I think it's inappropriate for "UNKNOWN" scope power supplies to be 
registered and treated as system supplies, at least as it pertains to 
power_supply_is_system_supplied().
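
The load-order dependence described above can be modelled in userspace.
What follows is a minimal sketch, not kernel code: the struct, enum, and
function names are invented for the model, and it assumes the UCSI
supplies are non-battery, 'offline', and of unknown scope, and that the
core treats "nothing counted" as AC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's scope and type enums (model only). */
enum model_scope { SCOPE_UNKNOWN, SCOPE_SYSTEM, SCOPE_DEVICE };
enum model_type { TYPE_BATTERY, TYPE_USB };

struct model_supply {
	enum model_type type;
	enum model_scope scope;
	bool online;
};

/*
 * Model of the existing check: every supply that is not device-scoped is
 * counted, and the first online non-battery supply means "AC".
 * Returns 1 for AC, 0 for DC.
 */
int model_is_system_supplied(const struct model_supply *s, size_t n)
{
	unsigned int count = 0;

	assert(s != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (s[i].scope == SCOPE_DEVICE)
			continue;
		count++;
		if (s[i].type != TYPE_BATTERY && s[i].online)
			return 1;
	}

	/* Nothing counted at all: assume a desktop on mains power. */
	return count == 0 ? 1 : 0;
}

/* amdgpu probed before ucsi_acpi: no supplies are registered yet. */
int model_amdgpu_first(void)
{
	return model_is_system_supplied(NULL, 0);
}

/* amdgpu probed after ucsi_acpi: three offline, unknown-scope supplies. */
int model_ucsi_first(void)
{
	const struct model_supply ucsi[3] = {
		{ TYPE_USB, SCOPE_UNKNOWN, false },
		{ TYPE_USB, SCOPE_UNKNOWN, false },
		{ TYPE_USB, SCOPE_UNKNOWN, false },
	};

	return model_is_system_supplied(ucsi, 3);
}
```

In this model, model_amdgpu_first() reports AC while model_ucsi_first()
reports DC, even though the machine's actual power source is the same in
both cases.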

Sebastian Reichel Oct. 4, 2023, 11:10 p.m. UTC | #3
Hi,

On Sun, Oct 01, 2023 at 07:00:11PM -0500, Mario Limonciello wrote:
> Let me try to add more detail.
> 
> This is an OEM system that has 3 USB type C ports.  It's an Intel system,
> but this doesn't matter for the issue.
> * when ucsi_acpi is not loaded there are no power supplies in the system and
> it reports power_supply_is_system_supplied() as AC.
> * When ucsi_acpi is loaded 3 power supplies will be registered.
> power_supply_is_system_supplied() reports as DC.
> 
> Now when you add in a Navi3x AMD dGPU to the system the power supplies don't
> change.  This particular dGPU model doesn't contain a USB-C port, so there
> is no UCSI power supply registered.
> 
> As amdgpu is loaded it looks at device initialization whether the system is
> powered by AC or DC.  Here is how it looks:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c?h=linux-6.5.y#n3834
> 
> On the OEM system if amdgpu loads before the ucsi_acpi driver (such as in
> the initramfs) then the right value is returned for
> power_supply_is_system_supplied() - AC.
> 
> If amdgpu is loaded after the ucsi_acpi driver, the wrong value is returned
> for power_supply_is_system_supplied() - DC.
> 
> This value is very important to set up the dGPU properly.  If the wrong
> value is returned, the wrong value will be notified to the hardware and the
> hardware will not behave properly.  On the OEM system this is a "black
> screen" at bootup along with RAS errors emitted by the dGPU.
> 
> With no changes to a malfunctioning kernel or initramfs binaries I can add
> modprobe.blacklist=ucsi_acpi to the kernel command line to avoid registering
> those 3 power supplies and the system behaves properly.
> 
> So I think it's inappropriate for "UNKNOWN" scope power supplies to be
> registered and treated as system supplies, at least as it pertains to
> power_supply_is_system_supplied().

So the main issue is that ucsi_acpi registers a bunch of
power-supply chargers with unknown scope on a desktop system,
and that results in the system being assumed to be supplied from
battery.

The problem with your change is that many of the charger drivers
don't set a scope at all (and thus report unknown scope). Those
obviously should not be skipped. Probably most of these drivers
could be changed to properly set the scope, but it needs to be
checked on a case-by-case basis. With your current patch they would
regress in the opposite direction of your use-case.
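
The regression risk can be illustrated with a small userspace model.
This is only a sketch under stated assumptions: the names are invented,
the "laptop" has a battery and a mains charger whose drivers both leave
the scope unset, and a result of 1 means AC while 0 means DC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's scope and type enums (model only). */
enum model_scope { SCOPE_UNKNOWN, SCOPE_SYSTEM, SCOPE_DEVICE };
enum model_type { TYPE_BATTERY, TYPE_MAINS };

struct model_supply {
	enum model_type type;
	enum model_scope scope;
	bool online;
};

/*
 * require_system_scope == false: existing rule, skip only device scope.
 * require_system_scope == true:  rule from the posted patch, skip
 *                                everything that is not system scope.
 */
int model_is_system_supplied(const struct model_supply *s, size_t n,
			     bool require_system_scope)
{
	unsigned int count = 0;

	assert(s != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		bool skip = require_system_scope ?
			    s[i].scope != SCOPE_SYSTEM :
			    s[i].scope == SCOPE_DEVICE;

		if (skip)
			continue;
		count++;
		if (s[i].type != TYPE_BATTERY && s[i].online)
			return 1;
	}

	/* Nothing counted at all: assume mains power. */
	return count == 0 ? 1 : 0;
}

/* Hypothetical laptop running from battery, charger unplugged. */
int model_laptop_on_battery(bool require_system_scope)
{
	const struct model_supply laptop[2] = {
		{ TYPE_BATTERY, SCOPE_UNKNOWN, false },
		{ TYPE_MAINS,   SCOPE_UNKNOWN, false },
	};

	return model_is_system_supplied(laptop, 2, require_system_scope);
}
```

With the existing rule the model reports DC for this laptop; with the
stricter rule both supplies are skipped, nothing is counted, and the
model reports AC although the machine is running from battery.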

Ideally ucsi is changed to properly describe the scope, but I
suppose this information is not available in ACPI?

Assuming that the above is not easily solvable, my idea would be to
only count POWER_SUPPLY_TYPE_BATTERY devices that do not have
POWER_SUPPLY_SCOPE_DEVICE scope, and to exit early if there are none.
Basically change __power_supply_is_system_supplied(), so that it
looks like this:

...
	if (!psy->desc->get_property(psy, POWER_SUPPLY_PROP_SCOPE, &ret))
		if (ret.intval == POWER_SUPPLY_SCOPE_DEVICE)
			return 0;

	if (psy->desc->type == POWER_SUPPLY_TYPE_BATTERY)
		(*count)++;
	else
		if (!psy->desc->get_property(psy, POWER_SUPPLY_PROP_ONLINE,
					     &ret))
			return ret.intval;
...

That should work in both cases.
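
As a sanity check of "both cases", here is a minimal userspace sketch of
the proposed rule. It is a model, not the kernel implementation: all
names are invented, the elided parts of the snippet are assumed to keep
the "no batteries counted means AC" fallback, and every supply is
assumed to report unknown scope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's scope and type enums (model only). */
enum model_scope { SCOPE_UNKNOWN, SCOPE_SYSTEM, SCOPE_DEVICE };
enum model_type { TYPE_BATTERY, TYPE_MAINS, TYPE_USB };

struct model_supply {
	enum model_type type;
	enum model_scope scope;
	bool online;
};

/*
 * Proposed rule: only batteries are counted; any online non-battery
 * supply still means "AC". Returns 1 for AC, 0 for DC.
 */
int model_proposed_is_system_supplied(const struct model_supply *s, size_t n)
{
	unsigned int batteries = 0;

	assert(s != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (s[i].scope == SCOPE_DEVICE)
			continue;
		if (s[i].type == TYPE_BATTERY)
			batteries++;
		else if (s[i].online)
			return 1;
	}

	/* No battery found: assume a desktop on mains power. */
	return batteries == 0 ? 1 : 0;
}

/* Desktop with three offline UCSI supplies and no battery. */
int model_desktop_with_ucsi(void)
{
	const struct model_supply ucsi[3] = {
		{ TYPE_USB, SCOPE_UNKNOWN, false },
		{ TYPE_USB, SCOPE_UNKNOWN, false },
		{ TYPE_USB, SCOPE_UNKNOWN, false },
	};

	return model_proposed_is_system_supplied(ucsi, 3);
}

/* Hypothetical laptop whose drivers do not set a scope. */
int model_laptop(bool charger_online)
{
	const struct model_supply laptop[2] = {
		{ TYPE_BATTERY, SCOPE_UNKNOWN, false },
		{ TYPE_MAINS,   SCOPE_UNKNOWN, charger_online },
	};

	return model_proposed_is_system_supplied(laptop, 2);
}
```

Under these assumptions the desktop with UCSI ports is reported as AC,
and the laptop is reported as AC or DC depending on whether its charger
is online.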

-- Sebastian

Mario Limonciello Oct. 5, 2023, 7:51 p.m. UTC | #4
On 10/4/2023 18:10, Sebastian Reichel wrote:
> Hi,
> 
> On Sun, Oct 01, 2023 at 07:00:11PM -0500, Mario Limonciello wrote:
>> Let me try to add more detail.
>>
>> This is an OEM system that has 3 USB type C ports.  It's an Intel system,
>> but this doesn't matter for the issue.
>> * when ucsi_acpi is not loaded there are no power supplies in the system and
>> it reports power_supply_is_system_supplied() as AC.
>> * When ucsi_acpi is loaded 3 power supplies will be registered.
>> power_supply_is_system_supplied() reports as DC.
>>
>> Now when you add in a Navi3x AMD dGPU to the system the power supplies don't
>> change.  This particular dGPU model doesn't contain a USB-C port, so there
>> is no UCSI power supply registered.
>>
>> As amdgpu is loaded it looks at device initialization whether the system is
>> powered by AC or DC.  Here is how it looks:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c?h=linux-6.5.y#n3834
>>
>> On the OEM system if amdgpu loads before the ucsi_acpi driver (such as in
>> the initramfs) then the right value is returned for
>> power_supply_is_system_supplied() - AC.
>>
>> If amdgpu is loaded after the ucsi_acpi driver, the wrong value is returned
>> for power_supply_is_system_supplied() - DC.
>>
>> This value is very important to set up the dGPU properly.  If the wrong
>> value is returned, the wrong value will be notified to the hardware and the
>> hardware will not behave properly.  On the OEM system this is a "black
>> screen" at bootup along with RAS errors emitted by the dGPU.
>>
>> With no changes to a malfunctioning kernel or initramfs binaries I can add
>> modprobe.blacklist=ucsi_acpi to the kernel command line to avoid registering
>> those 3 power supplies and the system behaves properly.
>>
>> So I think it's inappropriate for "UNKNOWN" scope power supplies to be
>> registered and treated as system supplies, at least as it pertains to
>> power_supply_is_system_supplied().
> 
> So the main issue is that ucsi_acpi registers a bunch of
> power-supply chargers with unknown scope on a desktop system,
> and that results in the system being assumed to be supplied from
> battery.
> 
> The problem with your change is that many of the charger drivers
> don't set a scope at all (and thus report unknown scope). Those
> obviously should not be skipped. Probably most of these drivers
> could be changed to properly set the scope, but it needs to be
> checked on a case-by-case basis. With your current patch they would
> regress in the opposite direction of your use-case.
> 
> Ideally ucsi is changed to properly describe the scope, but I
> suppose this information is not available in ACPI?
> 
> Assuming that the above is not easily solvable, my idea would be to
> only count POWER_SUPPLY_TYPE_BATTERY devices that do not have
> POWER_SUPPLY_SCOPE_DEVICE scope, and to exit early if there are none.
> Basically change __power_supply_is_system_supplied(), so that it
> looks like this:
> 
> ...
> 	if (!psy->desc->get_property(psy, POWER_SUPPLY_PROP_SCOPE, &ret))
> 		if (ret.intval == POWER_SUPPLY_SCOPE_DEVICE)
> 			return 0;
> 
> 	if (psy->desc->type == POWER_SUPPLY_TYPE_BATTERY)
> 		(*count)++;
> 	else
> 		if (!psy->desc->get_property(psy, POWER_SUPPLY_PROP_ONLINE,
> 					     &ret))
> 			return ret.intval;
> ...
> 
> That should work in both cases.
> 

I tested both your suggestion and modifying the UCSI driver to set
the scope.  Both worked.

I've sent out v2, which sets the scope in the UCSI driver.  If for some
reason that ends up not working out, we can fall back to your generic
suggestion.

https://lore.kernel.org/linux-usb/20231005175230.232764-1-mario.limonciello@amd.com/T/#m9543f1f2c3767c0e88135c2e3f15ced65cfdf004
