diff mbox series

[v3,5/6] drm/sched: Use parent fence instead of finished

Message ID 20220909170845.4946-6-Arvind.Yadav@amd.com
State Accepted
Commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
Headers show
Series dma-buf: Check status of enable-signaling bit on debug | expand

Commit Message

Arvind Yadav Sept. 9, 2022, 5:08 p.m. UTC
Using the parent fence instead of the finished fence
to get the job status. This change is to avoid GPU
scheduler timeout error which can cause GPU reset.

Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
---

changes in v1,v2 - Enable signaling for finished fence in sche_main()
is removed

---
 drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Andrey Grodzovsky Sept. 9, 2022, 8:55 p.m. UTC | #1
Got it.

Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Andrey

On 2022-09-09 16:30, Yadav, Arvind wrote:
>
> On 9/9/2022 11:02 PM, Andrey Grodzovsky wrote:
>> What exactly is the scenario which this patch fixes in more detail 
>> please  ?
>>
> GPU reset issue started after adding [PATCH 6/6].
>
> Root cause -> In drm_sched_get_cleanup_job(), We use the finished 
> fence status bit to check the job status dma_fence_is_signaled(). If a 
> job is signaled (DMA_FENCE_FLAG_SIGNALED_BIT is set), then we cancel 
> the reset worker thread.
>
> After applying [patch 6] now we are checking enable signaling in 
> dma_fence_is_signaled() by checking DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT 
> bit. but signaling is not enabled for the finished fence. As a result, 
> dma_fence_is_signaled() always returns false, and 
> drm_sched_get_cleanup_job() will not cancel the reset worker thread, 
> resulting in the GPU reset.
>
> To Fix the above issue  Christian suggested that we can use 
> parent(hardware) fence instead of finished fence because signaling 
> enabled by the calling of dma_fence_add_callback() for parent fence. 
> As a result, dma_fence_is_signaled() will return the correct fence 
> status and reset worker thread can be cancelled in 
> drm_sched_get_cleanup_job().
>
> ~arvind
>
>> Andrey
>>
>> On 2022-09-09 13:08, Arvind Yadav wrote:
>>> Using the parent fence instead of the finished fence
>>> to get the job status. This change is to avoid GPU
>>> scheduler timeout error which can cause GPU reset.
>>>
>>> Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
>>> ---
>>>
>>> changes in v1,v2 - Enable signaling for finished fence in sche_main()
>>> is removed
>>>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index e0ab14e0fb6b..2ac28ad11432 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct 
>>> drm_gpu_scheduler *sched)
>>>       job = list_first_entry_or_null(&sched->pending_list,
>>>                          struct drm_sched_job, list);
>>>   -    if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>>> +    if (job && dma_fence_is_signaled(job->s_fence->parent)) {
>>>           /* remove job from pending_list */
>>>           list_del_init(&job->list);
>>>   @@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct 
>>> drm_gpu_scheduler *sched)
>>>             if (next) {
>>>               next->s_fence->scheduled.timestamp =
>>> -                job->s_fence->finished.timestamp;
>>> +                job->s_fence->parent->timestamp;
>>>               /* start TO timer for next job */
>>>               drm_sched_start_timeout(sched);
>>>           }
diff mbox series

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index e0ab14e0fb6b..2ac28ad11432 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -829,7 +829,7 @@  drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 	job = list_first_entry_or_null(&sched->pending_list,
 				       struct drm_sched_job, list);
 
-	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
+	if (job && dma_fence_is_signaled(job->s_fence->parent)) {
 		/* remove job from pending_list */
 		list_del_init(&job->list);
 
@@ -841,7 +841,7 @@  drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 
 		if (next) {
 			next->s_fence->scheduled.timestamp =
-				job->s_fence->finished.timestamp;
+				job->s_fence->parent->timestamp;
 			/* start TO timer for next job */
 			drm_sched_start_timeout(sched);
 		}