Message ID | 20220909170845.4946-6-Arvind.Yadav@amd.com |
---|---|
State | Accepted |
Commit | e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86 |
Headers | show |
Series | dma-buf: Check status of enable-signaling bit on debug | expand |
Got it. Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Andrey On 2022-09-09 16:30, Yadav, Arvind wrote: > > On 9/9/2022 11:02 PM, Andrey Grodzovsky wrote: >> What exactly is the scenario which this patch fixes in more detail >> please ? >> > GPU reset issue started after adding [PATCH 6/6]. > > Root cause -> In drm_sched_get_cleanup_job(), We use the finished > fence status bit to check the job status dma_fence_is_signaled(). If a > job is signaled (DMA_FENCE_FLAG_SIGNALED_BIT is set), then we cancel > the reset worker thread. > > After applying [patch 6] now we are checking enable signaling in > dma_fence_is_signaled() by checking DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT > bit. but signaling is not enabled for the finished fence. As a result, > dma_fence_is_signaled() always returns false, and > drm_sched_get_cleanup_job() will not cancel the reset worker thread, > resulting in the GPU reset. > > To Fix the above issue Christian suggested that we can use > parent(hardware) fence instead of finished fence because signaling > enabled by the calling of dma_fence_add_callback() for parent fence. > As a result, dma_fence_is_signaled() will return the correct fence > status and reset worker thread can be cancelled in > drm_sched_get_cleanup_job(). > > ~arvind > >> Andrey >> >> On 2022-09-09 13:08, Arvind Yadav wrote: >>> Using the parent fence instead of the finished fence >>> to get the job status. This change is to avoid GPU >>> scheduler timeout error which can cause GPU reset. >>> >>> Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com> >>> --- >>> >>> changes in v1,v2 - Enable signaling for finished fence in sche_main() >>> is removed >>> >>> --- >>> drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c >>> b/drivers/gpu/drm/scheduler/sched_main.c >>> index e0ab14e0fb6b..2ac28ad11432 100644 >>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>> @@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct >>> drm_gpu_scheduler *sched) >>> job = list_first_entry_or_null(&sched->pending_list, >>> struct drm_sched_job, list); >>> - if (job && dma_fence_is_signaled(&job->s_fence->finished)) { >>> + if (job && dma_fence_is_signaled(job->s_fence->parent)) { >>> /* remove job from pending_list */ >>> list_del_init(&job->list); >>> @@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct >>> drm_gpu_scheduler *sched) >>> if (next) { >>> next->s_fence->scheduled.timestamp = >>> - job->s_fence->finished.timestamp; >>> + job->s_fence->parent->timestamp; >>> /* start TO timer for next job */ >>> drm_sched_start_timeout(sched); >>> }
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index e0ab14e0fb6b..2ac28ad11432 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) job = list_first_entry_or_null(&sched->pending_list, struct drm_sched_job, list); - if (job && dma_fence_is_signaled(&job->s_fence->finished)) { + if (job && dma_fence_is_signaled(job->s_fence->parent)) { /* remove job from pending_list */ list_del_init(&job->list); @@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) if (next) { next->s_fence->scheduled.timestamp = - job->s_fence->finished.timestamp; + job->s_fence->parent->timestamp; /* start TO timer for next job */ drm_sched_start_timeout(sched); }
Using the parent fence instead of the finished fence to get the job status. This change is to avoid GPU scheduler timeout error which can cause GPU reset. Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com> --- changes in v1,v2 - Enable signaling for finished fence in sche_main() is removed --- drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)