
[v2,3/7] drm/msm: Fix cx collapse issue during recovery

Message ID 20220709112837.v2.3.I4ac27a0b34ea796ce0f938bb509e257516bc6f57@changeid
State New
Series Improve GPU Recovery

Commit Message

Akhil P Oommen July 9, 2022, 5:59 a.m. UTC
There is some hardware logic under the CX domain. For a successful
recovery, we should ensure that the cx headswitch collapses so that all
the stale states are cleared out. This is especially true for the a6xx
family, where we have a GMU co-processor.

Currently, cx doesn't collapse due to a devlink between gpu and its
smmu. So the *struct gpu device* needs to be runtime suspended to ensure
that the iommu driver removes its vote on cx gdsc.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
 drivers/gpu/drm/msm/msm_gpu.c         |  2 --
 2 files changed, 14 insertions(+), 4 deletions(-)

Comments

Doug Anderson July 11, 2022, 11:22 p.m. UTC | #1
Hi,

On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> There are some hardware logic under CX domain. For a successful
> recovery, we should ensure cx headswitch collapses to ensure all the
> stale states are cleard out. This is especially true to for a6xx family
> where we can GMU co-processor.
>
> Currently, cx doesn't collapse due to a devlink between gpu and its
> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
> that the iommu driver removes its vote on cx gdsc.
>
> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> ---
>
> (no changes since v1)
>
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>  drivers/gpu/drm/msm/msm_gpu.c         |  2 --
>  2 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 4d50110..7ed347c 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>          */
>         gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>
> -       gpu->funcs->pm_suspend(gpu);
> -       gpu->funcs->pm_resume(gpu);
> +       /*
> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
> +        * First drop the usage count from all active submits
> +        */
> +       for (i = gpu->active_submits; i > 0; i--)
> +               pm_runtime_put(&gpu->pdev->dev);
> +
> +       /* And the final one from recover worker */
> +       pm_runtime_put_sync(&gpu->pdev->dev);
> +
> +       for (i = gpu->active_submits; i > 0; i--)
> +               pm_runtime_get(&gpu->pdev->dev);
> +
> +       pm_runtime_get_sync(&gpu->pdev->dev);

In response to v1, Rob suggested pm_runtime_force_suspend/resume().
Those seem like they would work to me, too. Why not use them?
Akhil P Oommen July 12, 2022, 5:04 a.m. UTC | #2
On 7/12/2022 4:52 AM, Doug Anderson wrote:
> Hi,
>
> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>> There are some hardware logic under CX domain. For a successful
>> recovery, we should ensure cx headswitch collapses to ensure all the
>> stale states are cleard out. This is especially true to for a6xx family
>> where we can GMU co-processor.
>>
>> Currently, cx doesn't collapse due to a devlink between gpu and its
>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
>> that the iommu driver removes its vote on cx gdsc.
>>
>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>> ---
>>
>> (no changes since v1)
>>
>>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>>   drivers/gpu/drm/msm/msm_gpu.c         |  2 --
>>   2 files changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> index 4d50110..7ed347c 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>           */
>>          gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>
>> -       gpu->funcs->pm_suspend(gpu);
>> -       gpu->funcs->pm_resume(gpu);
>> +       /*
>> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
>> +        * First drop the usage count from all active submits
>> +        */
>> +       for (i = gpu->active_submits; i > 0; i--)
>> +               pm_runtime_put(&gpu->pdev->dev);
>> +
>> +       /* And the final one from recover worker */
>> +       pm_runtime_put_sync(&gpu->pdev->dev);
>> +
>> +       for (i = gpu->active_submits; i > 0; i--)
>> +               pm_runtime_get(&gpu->pdev->dev);
>> +
>> +       pm_runtime_get_sync(&gpu->pdev->dev);
> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
> Those seem like they would work to me, too. Why not use them?
Quoting my previous response which I seem to have sent only to Freedreno 
list:

"I believe it is supposed to be used only during system sleep state 
transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to 
fail by disabling RPM here."

-Akhil
Rob Clark July 12, 2022, 4:44 p.m. UTC | #3
On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
<quic_akhilpo@quicinc.com> wrote:
>
> On 7/12/2022 4:52 AM, Doug Anderson wrote:
> > Hi,
> >
> > On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> >> There are some hardware logic under CX domain. For a successful
> >> recovery, we should ensure cx headswitch collapses to ensure all the
> >> stale states are cleard out. This is especially true to for a6xx family
> >> where we can GMU co-processor.
> >>
> >> Currently, cx doesn't collapse due to a devlink between gpu and its
> >> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
> >> that the iommu driver removes its vote on cx gdsc.
> >>
> >> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> >> ---
> >>
> >> (no changes since v1)
> >>
> >>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
> >>   drivers/gpu/drm/msm/msm_gpu.c         |  2 --
> >>   2 files changed, 14 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> index 4d50110..7ed347c 100644
> >> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >>           */
> >>          gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
> >>
> >> -       gpu->funcs->pm_suspend(gpu);
> >> -       gpu->funcs->pm_resume(gpu);
> >> +       /*
> >> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
> >> +        * First drop the usage count from all active submits
> >> +        */
> >> +       for (i = gpu->active_submits; i > 0; i--)
> >> +               pm_runtime_put(&gpu->pdev->dev);
> >> +
> >> +       /* And the final one from recover worker */
> >> +       pm_runtime_put_sync(&gpu->pdev->dev);
> >> +
> >> +       for (i = gpu->active_submits; i > 0; i--)
> >> +               pm_runtime_get(&gpu->pdev->dev);
> >> +
> >> +       pm_runtime_get_sync(&gpu->pdev->dev);
> > In response to v1, Rob suggested pm_runtime_force_suspend/resume().
> > Those seem like they would work to me, too. Why not use them?
> Quoting my previous response which I seem to have sent only to Freedreno
> list:
>
> "I believe it is supposed to be used only during system sleep state
> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
> fail by disabling RPM here."

The comment about not wanting other runpm calls to fail is valid, but
that is also solvable, i.e. by holding a lock around runpm calls.
I think we need to do that anyway, otherwise looping over
gpu->active_submits is racy.

I think pm_runtime_force_suspend/resume() is the least-bad option, or
at least I'm not seeing any obvious alternative that is better.

BR,
-R
Akhil P Oommen July 12, 2022, 7:15 p.m. UTC | #4
On 7/12/2022 10:14 PM, Rob Clark wrote:
> On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
> <quic_akhilpo@quicinc.com> wrote:
>> On 7/12/2022 4:52 AM, Doug Anderson wrote:
>>> Hi,
>>>
>>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>>>> There are some hardware logic under CX domain. For a successful
>>>> recovery, we should ensure cx headswitch collapses to ensure all the
>>>> stale states are cleard out. This is especially true to for a6xx family
>>>> where we can GMU co-processor.
>>>>
>>>> Currently, cx doesn't collapse due to a devlink between gpu and its
>>>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
>>>> that the iommu driver removes its vote on cx gdsc.
>>>>
>>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>>>> ---
>>>>
>>>> (no changes since v1)
>>>>
>>>>    drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>>>>    drivers/gpu/drm/msm/msm_gpu.c         |  2 --
>>>>    2 files changed, 14 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> index 4d50110..7ed347c 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>>>            */
>>>>           gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>>>
>>>> -       gpu->funcs->pm_suspend(gpu);
>>>> -       gpu->funcs->pm_resume(gpu);
>>>> +       /*
>>>> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
>>>> +        * First drop the usage count from all active submits
>>>> +        */
>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>> +               pm_runtime_put(&gpu->pdev->dev);
>>>> +
>>>> +       /* And the final one from recover worker */
>>>> +       pm_runtime_put_sync(&gpu->pdev->dev);
>>>> +
>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>> +               pm_runtime_get(&gpu->pdev->dev);
>>>> +
>>>> +       pm_runtime_get_sync(&gpu->pdev->dev);
>>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
>>> Those seem like they would work to me, too. Why not use them?
>> Quoting my previous response which I seem to have sent only to Freedreno
>> list:
>>
>> "I believe it is supposed to be used only during system sleep state
>> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
>> fail by disabling RPM here."
> The comment about not wanting other runpm calls to fail is valid.. but
> that is also solveable, ie. by holding a lock around runpm calls.
> Which I think we need to do anyways, otherwise looping over
> gpu->active_submits is racey..
>
> I think pm_runtime_force_suspend/resume() is the least-bad option.. or
> at least I'm not seeing any obvious alternative that is better
>
> BR,
> -R
We are holding gpu->lock here, which will block further submissions from
the scheduler. Will active_submits still race?

It is possible that another thread has successfully completed
pm_runtime_get() and, while it accesses the hardware, we pull the plug on
the regulator/clock here. That would result in an obvious device crash.
So I can think of 2 solutions:

1. Wrap *every* pm_runtime_get/put with a mutex. Something like:
             mutex_lock();
             pm_runtime_get();
             /* ... access hardware here ... */
             pm_runtime_put();
             mutex_unlock();

2. Drop the runtime votes from every submit in the recover worker and
wait/poll for the regulator to collapse, in case there are transient
votes on the regulator from other threads/subsystems.

Option (2) seems simpler to me.  What do you think?

-Akhil.
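
For readers following along, option 1 above can be sketched as a small
userspace model. This is only an illustration under stated assumptions:
struct model_dev, model_hw_access() and model_power_cycle() are
hypothetical names, a pthread mutex stands in for the kernel mutex, and
a plain counter stands in for the runtime PM usage count. It is not the
driver's code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the GPU device; not a kernel type. */
struct model_dev {
	pthread_mutex_t lock;	/* plays the role of the mutex in option 1 */
	int usage_count;	/* models the runtime PM usage count */
	bool powered;		/* models "regulator/clock is on" */
	int hw_accesses;	/* accesses performed while powered */
	int power_cycles;	/* forced collapses done by recovery */
};

static void model_dev_init(struct model_dev *d)
{
	pthread_mutex_init(&d->lock, NULL);
	d->usage_count = 0;
	d->powered = false;
	d->hw_accesses = 0;
	d->power_cycles = 0;
}

/* Option 1 shape: get, touch the hardware, put, all under one lock. */
static void model_hw_access(struct model_dev *d)
{
	pthread_mutex_lock(&d->lock);
	if (d->usage_count++ == 0)
		d->powered = true;	/* first reference powers the device up */
	assert(d->powered);		/* never touch the hardware while it is off */
	d->hw_accesses++;
	if (--d->usage_count == 0)
		d->powered = false;	/* last reference lets it power down */
	pthread_mutex_unlock(&d->lock);
}

/* Recovery takes the same lock, so no access can overlap the collapse. */
static void model_power_cycle(struct model_dev *d)
{
	pthread_mutex_lock(&d->lock);
	d->powered = false;			/* forced collapse */
	d->power_cycles++;
	d->powered = (d->usage_count > 0);	/* power back up for remaining holders */
	pthread_mutex_unlock(&d->lock);
}
```

The model only protects callers that take the lock. Anything that
votes on the same power domain without going through it is outside
what this sketch can show.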
Rob Clark July 20, 2022, 6:06 p.m. UTC | #5
On Tue, Jul 12, 2022 at 12:15 PM Akhil P Oommen
<quic_akhilpo@quicinc.com> wrote:
>
> On 7/12/2022 10:14 PM, Rob Clark wrote:
> > On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
> > <quic_akhilpo@quicinc.com> wrote:
> >> On 7/12/2022 4:52 AM, Doug Anderson wrote:
> >>> Hi,
> >>>
> >>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> >>>> There are some hardware logic under CX domain. For a successful
> >>>> recovery, we should ensure cx headswitch collapses to ensure all the
> >>>> stale states are cleard out. This is especially true to for a6xx family
> >>>> where we can GMU co-processor.
> >>>>
> >>>> Currently, cx doesn't collapse due to a devlink between gpu and its
> >>>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
> >>>> that the iommu driver removes its vote on cx gdsc.
> >>>>
> >>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> >>>> ---
> >>>>
> >>>> (no changes since v1)
> >>>>
> >>>>    drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
> >>>>    drivers/gpu/drm/msm/msm_gpu.c         |  2 --
> >>>>    2 files changed, 14 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> index 4d50110..7ed347c 100644
> >>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >>>>            */
> >>>>           gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
> >>>>
> >>>> -       gpu->funcs->pm_suspend(gpu);
> >>>> -       gpu->funcs->pm_resume(gpu);
> >>>> +       /*
> >>>> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
> >>>> +        * First drop the usage count from all active submits
> >>>> +        */
> >>>> +       for (i = gpu->active_submits; i > 0; i--)
> >>>> +               pm_runtime_put(&gpu->pdev->dev);
> >>>> +
> >>>> +       /* And the final one from recover worker */
> >>>> +       pm_runtime_put_sync(&gpu->pdev->dev);
> >>>> +
> >>>> +       for (i = gpu->active_submits; i > 0; i--)
> >>>> +               pm_runtime_get(&gpu->pdev->dev);
> >>>> +
> >>>> +       pm_runtime_get_sync(&gpu->pdev->dev);
> >>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
> >>> Those seem like they would work to me, too. Why not use them?
> >> Quoting my previous response which I seem to have sent only to Freedreno
> >> list:
> >>
> >> "I believe it is supposed to be used only during system sleep state
> >> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
> >> fail by disabling RPM here."
> > The comment about not wanting other runpm calls to fail is valid.. but
> > that is also solveable, ie. by holding a lock around runpm calls.
> > Which I think we need to do anyways, otherwise looping over
> > gpu->active_submits is racey..
> >
> > I think pm_runtime_force_suspend/resume() is the least-bad option.. or
> > at least I'm not seeing any obvious alternative that is better
> >
> > BR,
> > -R
> We are holding gpu->lock here which will block further submissions from
> scheduler. Will active_submits still race?
>
> It is possible that there is another thread which successfully completed
> pm_runtime_get() and while it access the hardware, we pulled the plug on
> regulator/clock here. That will result in obvious device crash. So I can
> think of 2 solutions:
>
> 1. wrap *every* pm_runtime_get/put with a mutex. Something like:
>              mutex_lock();
>              pm_runtime_get();
>              < ... access hardware here >>
>              pm_runtime_put();
>              mutex_unlock();
>
> 2. Drop runtime votes from every submit in recover worker and wait/poll
> for regulator to collapse in case there are transient votes on
> regulator  from other threads/subsystems.
>
> Option (2) seems simpler to me.  What do you think?
>

But I think without #1 you could still be racing w/ some other path
that touches the hw, like devfreq, right?  They could be holding a
runpm ref, so even if you loop over active_submits decrementing the
runpm ref, it still doesn't drop to zero.

BR,
-R
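
The point about an extra runpm reference can be illustrated with a
minimal userspace model of the usage count. Everything here is an
assumption made for illustration: struct rpm_model and the rpm_model_*
helpers are hypothetical and only mimic the counting, not the real
runtime PM core. The loop mirrors the shape of the one in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a runtime PM usage count; not a kernel type. */
struct rpm_model {
	int usage_count;
};

static void rpm_model_get(struct rpm_model *m)
{
	m->usage_count++;
}

/* Returns true when this put dropped the last reference. */
static bool rpm_model_put(struct rpm_model *m)
{
	assert(m->usage_count > 0);
	return --m->usage_count == 0;
}

/*
 * Drop one reference per active submit plus the recover worker's own,
 * then take them all back. Returns true only if the count reached zero
 * in between, i.e. the device could actually have suspended.
 */
static bool model_drop_and_retake(struct rpm_model *m, int active_submits)
{
	bool suspended;
	int i;

	for (i = active_submits; i > 0; i--)
		(void)rpm_model_put(m);

	/* And the final one from the recover worker */
	suspended = rpm_model_put(m);

	for (i = active_submits; i > 0; i--)
		rpm_model_get(m);
	rpm_model_get(m);

	return suspended;
}
```

With two active submits and the worker's reference (count 3), the
model reports a suspend. With one more holder, such as a devfreq path
(count 4), the count bottoms out at 1 and no suspend happens.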
Akhil P Oommen July 20, 2022, 8:38 p.m. UTC | #6
On 7/20/2022 11:36 PM, Rob Clark wrote:
> On Tue, Jul 12, 2022 at 12:15 PM Akhil P Oommen
> <quic_akhilpo@quicinc.com> wrote:
>> On 7/12/2022 10:14 PM, Rob Clark wrote:
>>> On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
>>> <quic_akhilpo@quicinc.com> wrote:
>>>> On 7/12/2022 4:52 AM, Doug Anderson wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>>>>>> There are some hardware logic under CX domain. For a successful
>>>>>> recovery, we should ensure cx headswitch collapses to ensure all the
>>>>>> stale states are cleard out. This is especially true to for a6xx family
>>>>>> where we can GMU co-processor.
>>>>>>
>>>>>> Currently, cx doesn't collapse due to a devlink between gpu and its
>>>>>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
>>>>>> that the iommu driver removes its vote on cx gdsc.
>>>>>>
>>>>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>>>>>> ---
>>>>>>
>>>>>> (no changes since v1)
>>>>>>
>>>>>>     drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>>>>>>     drivers/gpu/drm/msm/msm_gpu.c         |  2 --
>>>>>>     2 files changed, 14 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> index 4d50110..7ed347c 100644
>>>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>>>>>             */
>>>>>>            gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>>>>>
>>>>>> -       gpu->funcs->pm_suspend(gpu);
>>>>>> -       gpu->funcs->pm_resume(gpu);
>>>>>> +       /*
>>>>>> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
>>>>>> +        * First drop the usage count from all active submits
>>>>>> +        */
>>>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>>>> +               pm_runtime_put(&gpu->pdev->dev);
>>>>>> +
>>>>>> +       /* And the final one from recover worker */
>>>>>> +       pm_runtime_put_sync(&gpu->pdev->dev);
>>>>>> +
>>>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>>>> +               pm_runtime_get(&gpu->pdev->dev);
>>>>>> +
>>>>>> +       pm_runtime_get_sync(&gpu->pdev->dev);
>>>>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
>>>>> Those seem like they would work to me, too. Why not use them?
>>>> Quoting my previous response which I seem to have sent only to Freedreno
>>>> list:
>>>>
>>>> "I believe it is supposed to be used only during system sleep state
>>>> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
>>>> fail by disabling RPM here."
>>> The comment about not wanting other runpm calls to fail is valid.. but
>>> that is also solveable, ie. by holding a lock around runpm calls.
>>> Which I think we need to do anyways, otherwise looping over
>>> gpu->active_submits is racey..
>>>
>>> I think pm_runtime_force_suspend/resume() is the least-bad option.. or
>>> at least I'm not seeing any obvious alternative that is better
>>>
>>> BR,
>>> -R
>> We are holding gpu->lock here which will block further submissions from
>> scheduler. Will active_submits still race?
>>
>> It is possible that there is another thread which successfully completed
>> pm_runtime_get() and while it access the hardware, we pulled the plug on
>> regulator/clock here. That will result in obvious device crash. So I can
>> think of 2 solutions:
>>
>> 1. wrap *every* pm_runtime_get/put with a mutex. Something like:
>>               mutex_lock();
>>               pm_runtime_get();
>>               < ... access hardware here >>
>>               pm_runtime_put();
>>               mutex_unlock();
>>
>> 2. Drop runtime votes from every submit in recover worker and wait/poll
>> for regulator to collapse in case there are transient votes on
>> regulator  from other threads/subsystems.
>>
>> Option (2) seems simpler to me.  What do you think?
>>
> But I think without #1 you could still be racing w/ some other path
> that touches the hw, like devfreq, right.  They could be holding a
> runpm ref, so even if you loop over active_submits decrementing the
> runpm ref, it still doesn't drop to zero
>
> BR,
> -R
Yes, you are right. There could be some transient votes from other
threads/drivers/subsystems. This is the reason we need to poll for cx
gdsc collapse in the next patch. Even with #1, it is difficult to
coordinate with the smmu driver and close to impossible with tz/hyp.

-Akhil.
Akhil P Oommen July 22, 2022, 5:25 p.m. UTC | #7
On 7/21/2022 2:08 AM, Akhil P Oommen wrote:
> On 7/20/2022 11:36 PM, Rob Clark wrote:
>> On Tue, Jul 12, 2022 at 12:15 PM Akhil P Oommen
>> <quic_akhilpo@quicinc.com> wrote:
>>> On 7/12/2022 10:14 PM, Rob Clark wrote:
>>>> On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
>>>> <quic_akhilpo@quicinc.com> wrote:
>>>>> On 7/12/2022 4:52 AM, Doug Anderson wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen 
>>>>>> <quic_akhilpo@quicinc.com> wrote:
>>>>>>> There are some hardware logic under CX domain. For a successful
>>>>>>> recovery, we should ensure cx headswitch collapses to ensure all 
>>>>>>> the
>>>>>>> stale states are cleard out. This is especially true to for a6xx 
>>>>>>> family
>>>>>>> where we can GMU co-processor.
>>>>>>>
>>>>>>> Currently, cx doesn't collapse due to a devlink between gpu and its
>>>>>>> smmu. So the *struct gpu device* needs to be runtime suspended 
>>>>>>> to ensure
>>>>>>> that the iommu driver removes its vote on cx gdsc.
>>>>>>>
>>>>>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>>>>>>> ---
>>>>>>>
>>>>>>> (no changes since v1)
>>>>>>>
>>>>>>>     drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>>>>>>>     drivers/gpu/drm/msm/msm_gpu.c         |  2 --
>>>>>>>     2 files changed, 14 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
>>>>>>> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>>> index 4d50110..7ed347c 100644
>>>>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu 
>>>>>>> *gpu)
>>>>>>>             */
>>>>>>>            gmu_write(&a6xx_gpu->gmu, 
>>>>>>> REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>>>>>>
>>>>>>> -       gpu->funcs->pm_suspend(gpu);
>>>>>>> -       gpu->funcs->pm_resume(gpu);
>>>>>>> +       /*
>>>>>>> +        * Now drop all the pm_runtime usage count to allow cx 
>>>>>>> gdsc to collapse.
>>>>>>> +        * First drop the usage count from all active submits
>>>>>>> +        */
>>>>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>>>>> + pm_runtime_put(&gpu->pdev->dev);
>>>>>>> +
>>>>>>> +       /* And the final one from recover worker */
>>>>>>> + pm_runtime_put_sync(&gpu->pdev->dev);
>>>>>>> +
>>>>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>>>>> + pm_runtime_get(&gpu->pdev->dev);
>>>>>>> +
>>>>>>> + pm_runtime_get_sync(&gpu->pdev->dev);
>>>>>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
>>>>>> Those seem like they would work to me, too. Why not use them?
>>>>> Quoting my previous response which I seem to have sent only to 
>>>>> Freedreno
>>>>> list:
>>>>>
>>>>> "I believe it is supposed to be used only during system sleep state
>>>>> transitions. Btw, we don't want pm_runtime_get() calls from 
>>>>> elsewhere to
>>>>> fail by disabling RPM here."
>>>> The comment about not wanting other runpm calls to fail is valid.. but
>>>> that is also solveable, ie. by holding a lock around runpm calls.
>>>> Which I think we need to do anyways, otherwise looping over
>>>> gpu->active_submits is racey..
>>>>
>>>> I think pm_runtime_force_suspend/resume() is the least-bad option.. or
>>>> at least I'm not seeing any obvious alternative that is better
>>>>
>>>> BR,
>>>> -R
>>> We are holding gpu->lock here which will block further submissions from
>>> scheduler. Will active_submits still race?
>>>
>>> It is possible that there is another thread which successfully 
>>> completed
>>> pm_runtime_get() and while it access the hardware, we pulled the 
>>> plug on
>>> regulator/clock here. That will result in obvious device crash. So I 
>>> can
>>> think of 2 solutions:
>>>
>>> 1. wrap *every* pm_runtime_get/put with a mutex. Something like:
>>>               mutex_lock();
>>>               pm_runtime_get();
>>>               < ... access hardware here >>
>>>               pm_runtime_put();
>>>               mutex_unlock();
>>>
>>> 2. Drop runtime votes from every submit in recover worker and wait/poll
>>> for regulator to collapse in case there are transient votes on
>>> regulator  from other threads/subsystems.
>>>
>>> Option (2) seems simpler to me.  What do you think?
>>>
>> But I think without #1 you could still be racing w/ some other path
>> that touches the hw, like devfreq, right.  They could be holding a
>> runpm ref, so even if you loop over active_submits decrementing the
>> runpm ref, it still doesn't drop to zero
>>
>> BR,
>> -R
> Yes, you are right. There could be some transient votes from other 
> threads/drivers/subsystem. This is the reason we need to poll for cx 
> gdsc collapse in the next patch.Even with #1, it is difficult to 
> coordinate with smmu driver and close to impossible with tz/hyp.
>
> -Akhil.

Rob,

Summarizing my responses:
1. We cannot blindly force the cx headswitch off because that would
impact other gpu driver threads/smmu driver/tz/hyp etc. which access cx
domain registers at the same time.
2. We need to drop all our rpm votes on 'gpu device' instead of a single
vote on 'gmu device' because of [1]. Otherwise, the smmu driver's vote on
the cx headswitch will block its collapse forever.

This is the high-level sequence implemented in the current version of the
series:
1. Drop all rpm votes on 'gpu device', which will indirectly let the smmu
driver drop its vote on cx HS.
2. To take care of transient votes from other threads/hyp etc., poll the
cx gdsc hw register to ensure that it has collapsed. (We might be able
to move this to the gpucc driver depending on the consensus on the other
patch.)

[1] https://lkml.org/lkml/2018/8/30/590

-Akhil.
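
Step 2 of the sequence above, polling until the collapse is observed,
can be sketched as a bounded poll loop. This is a userspace model
under stated assumptions: struct fake_gdsc, fake_gdsc_is_off() and
poll_for_collapse() are hypothetical names, and the "register" simply
reports "still on" while modelled transient votes remain. The real
patch reads a hardware register and would use a time-based timeout
rather than a poll count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the cx gdsc status; not a kernel type. */
struct fake_gdsc {
	int transient_votes;	/* votes still held by other threads/hyp */
	int reads;		/* how many times the status was polled */
};

/*
 * One poll of the modelled status register. Each poll lets one
 * transient vote go away; the gdsc reads as off once none remain.
 */
static bool fake_gdsc_is_off(struct fake_gdsc *g)
{
	g->reads++;
	if (g->transient_votes > 0) {
		g->transient_votes--;
		return false;
	}
	return true;
}

/* Returns 0 once the collapse is seen, -1 if max_polls is exhausted. */
static int poll_for_collapse(struct fake_gdsc *g, int max_polls)
{
	int i;

	for (i = 0; i < max_polls; i++) {
		if (fake_gdsc_is_off(g))
			return 0;
	}

	return -1;	/* gave up: the domain never collapsed in time */
}
```

The bound matters: if some vote never goes away, recovery has to give
up and report the failure instead of spinning forever.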

Patch

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 4d50110..7ed347c 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1278,8 +1278,20 @@  static void a6xx_recover(struct msm_gpu *gpu)
 	 */
 	gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
 
-	gpu->funcs->pm_suspend(gpu);
-	gpu->funcs->pm_resume(gpu);
+	/*
+	 * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
+	 * First drop the usage count from all active submits
+	 */
+	for (i = gpu->active_submits; i > 0; i--)
+		pm_runtime_put(&gpu->pdev->dev);
+
+	/* And the final one from recover worker */
+	pm_runtime_put_sync(&gpu->pdev->dev);
+
+	for (i = gpu->active_submits; i > 0; i--)
+		pm_runtime_get(&gpu->pdev->dev);
+
+	pm_runtime_get_sync(&gpu->pdev->dev);
 
 	msm_gpu_hw_init(gpu);
 }
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 18c1544..aa6f34f 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -422,9 +422,7 @@  static void recover_worker(struct kthread_work *work)
 		/* retire completed submits, plus the one that hung: */
 		retire_submits(gpu);
 
-		pm_runtime_get_sync(&gpu->pdev->dev);
 		gpu->funcs->recover(gpu);
-		pm_runtime_put_sync(&gpu->pdev->dev);
 
 		/*
 		 * Replay all remaining submits starting with highest priority