diff mbox series

[RESEND,v2] media: mtk-jpeg: Fix use after free bug due to uncanceled work

Message ID 20230707092414.866760-1-zyytlz.wz@163.com
State Accepted
Commit c677d7ae83141d390d1253abebafa49c962afb52
Headers show
Series [RESEND,v2] media: mtk-jpeg: Fix use after free bug due to uncanceled work | expand

Commit Message

Zheng Wang July 7, 2023, 9:24 a.m. UTC
In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
and mtk_jpeg_enc_device_run may be called to start the
work.
If we remove the module which will call mtk_jpeg_remove
to make cleanup, there may be a unfinished work. The
possible sequence is as follows, which will cause a
typical UAF bug.

Fix it by canceling the work before cleanup in the mtk_jpeg_remove

CPU0                  CPU1

                    |mtk_jpeg_job_timeout_work
mtk_jpeg_remove     |
  v4l2_m2m_release  |
    kfree(m2m_dev); |
                    |
                    | v4l2_m2m_get_curr_priv
                    |   m2m_dev->curr_ctx //use
Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
---
- v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
---
 drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Zheng Hacker July 15, 2023, 4:08 p.m. UTC | #1
Hi,

This issue has not been resolved for a long time. Is there anyone who can help?

Best regards,
Zheng

Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道:
>
>
>
> On 07/07/2023 11:24, Zheng Wang wrote:
> > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> > and mtk_jpeg_enc_device_run may be called to start the
> > work.
> > If we remove the module which will call mtk_jpeg_remove
> > to make cleanup, there may be a unfinished work. The
> > possible sequence is as follows, which will cause a
> > typical UAF bug.
> >
> > Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> >
> > CPU0                  CPU1
> >
> >                      |mtk_jpeg_job_timeout_work
> > mtk_jpeg_remove     |
> >    v4l2_m2m_release  |
> >      kfree(m2m_dev); |
> >                      |
> >                      | v4l2_m2m_get_curr_priv
> >                      |   m2m_dev->curr_ctx //use
>
> Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
>
> --
> Regards,
> Alexandre
Zheng Hacker July 18, 2023, 3:07 a.m. UTC | #2
Friendly ping

Zheng Hacker <hackerzheng666@gmail.com> 于2023年7月16日周日 00:08写道:
>
> Hi,
>
> This issue has not been resolved for a long time. Is there anyone who can help?
>
> Best regards,
> Zheng
>
> Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道:
> >
> >
> >
> > On 07/07/2023 11:24, Zheng Wang wrote:
> > > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> > > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> > > and mtk_jpeg_enc_device_run may be called to start the
> > > work.
> > > If we remove the module which will call mtk_jpeg_remove
> > > to make cleanup, there may be a unfinished work. The
> > > possible sequence is as follows, which will cause a
> > > typical UAF bug.
> > >
> > > Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> > >
> > > CPU0                  CPU1
> > >
> > >                      |mtk_jpeg_job_timeout_work
> > > mtk_jpeg_remove     |
> > >    v4l2_m2m_release  |
> > >      kfree(m2m_dev); |
> > >                      |
> > >                      | v4l2_m2m_get_curr_priv
> > >                      |   m2m_dev->curr_ctx //use
> >
> > Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
> >
> > --
> > Regards,
> > Alexandre
Alexandre Mergnat July 19, 2023, 10:17 a.m. UTC | #3
On 18/07/2023 05:07, Zheng Hacker wrote:
> Friendly ping
> 
> Zheng Hacker <hackerzheng666@gmail.com> 于2023年7月16日周日 00:08写道:
>>
>> Hi,
>>
>> This issue has not been resolved for a long time. Is there anyone who can help?
>>
>> Best regards,
>> Zheng
>>
>> Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道:
>>>
>>>
>>>
>>> On 07/07/2023 11:24, Zheng Wang wrote:
>>>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
>>>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
>>>> and mtk_jpeg_enc_device_run may be called to start the
>>>> work.
>>>> If we remove the module which will call mtk_jpeg_remove
>>>> to make cleanup, there may be a unfinished work. The
>>>> possible sequence is as follows, which will cause a
>>>> typical UAF bug.
>>>>
>>>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
>>>>
>>>> CPU0                  CPU1
>>>>
>>>>                       |mtk_jpeg_job_timeout_work
>>>> mtk_jpeg_remove     |
>>>>     v4l2_m2m_release  |
>>>>       kfree(m2m_dev); |
>>>>                       |
>>>>                       | v4l2_m2m_get_curr_priv
>>>>                       |   m2m_dev->curr_ctx //use
>>>
>>> Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
>>>
>>> --
>>> Regards,
>>> Alexandre

Hi Zheng,

If you asking me to merge patch, sorry but I can't, I'm just a reviewer. 
I invite you to ping the maintainers directly:

Bin Liu <bin.liu@mediatek.com> (supporter:MEDIATEK JPEG DRIVER)
Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT 
INFRASTRUCTURE (V4L/DVB))
Matthias Brugger <matthias.bgg@gmail.com> (maintainer:ARM/Mediatek SoC 
support)

Otherwise, I misunderstood what you asking me. If so, can you rephrase 
your question please?
Zheng Hacker July 20, 2023, 3:17 a.m. UTC | #4
Thanks dor your kind reply. I'll try to connect them later.

Best regards,
Zheng

Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月19日周三 18:17写道:
>
>
>
> On 18/07/2023 05:07, Zheng Hacker wrote:
> > Friendly ping
> >
> > Zheng Hacker <hackerzheng666@gmail.com> 于2023年7月16日周日 00:08写道:
> >>
> >> Hi,
> >>
> >> This issue has not been resolved for a long time. Is there anyone who can help?
> >>
> >> Best regards,
> >> Zheng
> >>
> >> Alexandre Mergnat <amergnat@baylibre.com> 于2023年7月7日周五 22:11写道:
> >>>
> >>>
> >>>
> >>> On 07/07/2023 11:24, Zheng Wang wrote:
> >>>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> >>>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> >>>> and mtk_jpeg_enc_device_run may be called to start the
> >>>> work.
> >>>> If we remove the module which will call mtk_jpeg_remove
> >>>> to make cleanup, there may be a unfinished work. The
> >>>> possible sequence is as follows, which will cause a
> >>>> typical UAF bug.
> >>>>
> >>>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> >>>>
> >>>> CPU0                  CPU1
> >>>>
> >>>>                       |mtk_jpeg_job_timeout_work
> >>>> mtk_jpeg_remove     |
> >>>>     v4l2_m2m_release  |
> >>>>       kfree(m2m_dev); |
> >>>>                       |
> >>>>                       | v4l2_m2m_get_curr_priv
> >>>>                       |   m2m_dev->curr_ctx //use
> >>>
> >>> Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
> >>>
> >>> --
> >>> Regards,
> >>> Alexandre
>
> Hi Zheng,
>
> If you asking me to merge patch, sorry but I can't, I'm just a reviewer.
> I invite you to ping the maintainers directly:
>
> Bin Liu <bin.liu@mediatek.com> (supporter:MEDIATEK JPEG DRIVER)
> Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT
> INFRASTRUCTURE (V4L/DVB))
> Matthias Brugger <matthias.bgg@gmail.com> (maintainer:ARM/Mediatek SoC
> support)
>
> Otherwise, I misunderstood what you asking me. If so, can you rephrase
> your question please?
>
> --
> Regards,
> Alexandre
Chen-Yu Tsai July 20, 2023, 3:40 a.m. UTC | #5
On Fri, Jul 7, 2023 at 5:25 PM Zheng Wang <zyytlz.wz@163.com> wrote:
>
> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> and mtk_jpeg_enc_device_run may be called to start the
> work.
> If we remove the module which will call mtk_jpeg_remove
> to make cleanup, there may be a unfinished work. The
> possible sequence is as follows, which will cause a
> typical UAF bug.
>
> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
>
> CPU0                  CPU1
>
>                     |mtk_jpeg_job_timeout_work
> mtk_jpeg_remove     |
>   v4l2_m2m_release  |
>     kfree(m2m_dev); |
>                     |
>                     | v4l2_m2m_get_curr_priv
>                     |   m2m_dev->curr_ctx //use
> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>

Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Zheng Hacker July 25, 2023, 2:39 a.m. UTC | #6
Hello,

Is there any follow-up about this issue?

Thanks,
Zheng

AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
于2023年7月20日周四 15:45写道:
>
> Il 07/07/23 11:24, Zheng Wang ha scritto:
> > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> > and mtk_jpeg_enc_device_run may be called to start the
> > work.
> > If we remove the module which will call mtk_jpeg_remove
> > to make cleanup, there may be a unfinished work. The
> > possible sequence is as follows, which will cause a
> > typical UAF bug.
> >
> > Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> >
> > CPU0                  CPU1
> >
> >                      |mtk_jpeg_job_timeout_work
> > mtk_jpeg_remove     |
> >    v4l2_m2m_release  |
> >      kfree(m2m_dev); |
> >                      |
> >                      | v4l2_m2m_get_curr_priv
> >                      |   m2m_dev->curr_ctx //use
> > Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> > Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
>
> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>
>
Dmitry Osipenko Aug. 22, 2023, 6:51 p.m. UTC | #7
Hello Zheng,

On 7/7/23 12:24, Zheng Wang wrote:
> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> and mtk_jpeg_enc_device_run may be called to start the
> work.
> If we remove the module which will call mtk_jpeg_remove
> to make cleanup, there may be a unfinished work. The
> possible sequence is as follows, which will cause a
> typical UAF bug.
> 
> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> 
> CPU0                  CPU1
> 
>                     |mtk_jpeg_job_timeout_work
> mtk_jpeg_remove     |
>   v4l2_m2m_release  |
>     kfree(m2m_dev); |
>                     |
>                     | v4l2_m2m_get_curr_priv
>                     |   m2m_dev->curr_ctx //use
> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> ---
> - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
> ---
>  drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> index 0051f372a66c..6069ecf420b0 100644
> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev)
>  {
>  	struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev);
>  
> +	cancel_delayed_work_sync(&jpeg->job_timeout_work);
>  	pm_runtime_disable(&pdev->dev);
>  	video_unregister_device(jpeg->vdev);
>  	v4l2_m2m_release(jpeg->m2m_dev);

AFAICS, there is a fundamental problem here. The job_timeout_work uses
v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded,
all the v4l contexts must be closed and released. Hence the
v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when
work is executed before cancel_delayed_work_sync().

At the time when mtk_jpeg_remove() is invoked, there shall be no
job_timeout_work running in background because all jobs should be
completed before context is released. If you'll look at
v4l2_m2m_cancel_job(), you can see that it waits for the task completion
before closing context.

You shouldn't be able to remove driver module while it has active/opened
v4l contexts. If you can do that, then this is yours bug that needs to
be fixed.

In addition to this all, the job_timeout_work is initialized only for
the single-core JPEG device. I'd expect this patch should crash
multi-core JPEG devices.
Zheng Hacker Aug. 24, 2023, 8:20 a.m. UTC | #8
Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道:

>
> Hello Zheng,
>
> On 7/7/23 12:24, Zheng Wang wrote:
> > In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
> > mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
> > and mtk_jpeg_enc_device_run may be called to start the
> > work.
> > If we remove the module which will call mtk_jpeg_remove
> > to make cleanup, there may be a unfinished work. The
> > possible sequence is as follows, which will cause a
> > typical UAF bug.
> >
> > Fix it by canceling the work before cleanup in the mtk_jpeg_remove
> >
> > CPU0                  CPU1
> >
> >                     |mtk_jpeg_job_timeout_work
> > mtk_jpeg_remove     |
> >   v4l2_m2m_release  |
> >     kfree(m2m_dev); |
> >                     |
> >                     | v4l2_m2m_get_curr_priv
> >                     |   m2m_dev->curr_ctx //use
> > Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
> > Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> > ---
> > - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
> > ---
> >  drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > index 0051f372a66c..6069ecf420b0 100644
> > --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
> > @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev)
> >  {
> >       struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev);
> >
> > +     cancel_delayed_work_sync(&jpeg->job_timeout_work);
> >       pm_runtime_disable(&pdev->dev);
> >       video_unregister_device(jpeg->vdev);
> >       v4l2_m2m_release(jpeg->m2m_dev);
>
> AFAICS, there is a fundamental problem here. The job_timeout_work uses
> v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded,
> all the v4l contexts must be closed and released. Hence the
> v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when
> work is executed before cancel_delayed_work_sync().
>

Hi Dmitry,

Thanks for your reply. I think you're right. As m2m_dev is freed in
v4l2_m2m_release,
the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null
pointer dereference
bug. I am sure that context is closed when we invoke mtk_jpeg_remove.
But I'm not sure if
context is released when mtk_jpegdec_timeout_work running.

> At the time when mtk_jpeg_remove() is invoked, there shall be no
> job_timeout_work running in background because all jobs should be
> completed before context is released. If you'll look at
> v4l2_m2m_cancel_job(), you can see that it waits for the task completion
> before closing context.

Yes, so I think the better way is to put the cancel_delayed_work_sync
invoking into
v4l2_m2m_ctx_release function?

>
> You shouldn't be able to remove driver module while it has active/opened
> v4l contexts. If you can do that, then this is yours bug that needs to
> be fixed.
>
> In addition to this all, the job_timeout_work is initialized only for
> the single-core JPEG device. I'd expect this patch should crash
> multi-core JPEG devices.
>

I think that's true. As I'm not familiar with the code here. Could you
please give me some advice about the patch?

Best regards,
Zheng

> --
> Best regards,
> Dmitry
>
Dmitry Osipenko Aug. 28, 2023, 2:04 a.m. UTC | #9
On 8/24/23 11:20, Zheng Hacker wrote:
> Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年8月23日周三 02:51写道:
> 
>>
>> Hello Zheng,
>>
>> On 7/7/23 12:24, Zheng Wang wrote:
>>> In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
>>> mtk_jpeg_job_timeout_work. Then mtk_jpeg_dec_device_run
>>> and mtk_jpeg_enc_device_run may be called to start the
>>> work.
>>> If we remove the module which will call mtk_jpeg_remove
>>> to make cleanup, there may be a unfinished work. The
>>> possible sequence is as follows, which will cause a
>>> typical UAF bug.
>>>
>>> Fix it by canceling the work before cleanup in the mtk_jpeg_remove
>>>
>>> CPU0                  CPU1
>>>
>>>                     |mtk_jpeg_job_timeout_work
>>> mtk_jpeg_remove     |
>>>   v4l2_m2m_release  |
>>>     kfree(m2m_dev); |
>>>                     |
>>>                     | v4l2_m2m_get_curr_priv
>>>                     |   m2m_dev->curr_ctx //use
>>> Fixes: b2f0d2724ba4 ("[media] vcodec: mediatek: Add Mediatek JPEG Decoder Driver")
>>> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
>>> ---
>>> - v2: use cancel_delayed_work_sync instead of cancel_delayed_work suggested by Kyrie.
>>> ---
>>>  drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
>>> index 0051f372a66c..6069ecf420b0 100644
>>> --- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
>>> +++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
>>> @@ -1816,6 +1816,7 @@ static void mtk_jpeg_remove(struct platform_device *pdev)
>>>  {
>>>       struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev);
>>>
>>> +     cancel_delayed_work_sync(&jpeg->job_timeout_work);
>>>       pm_runtime_disable(&pdev->dev);
>>>       video_unregister_device(jpeg->vdev);
>>>       v4l2_m2m_release(jpeg->m2m_dev);
>>
>> AFAICS, there is a fundamental problem here. The job_timeout_work uses
>> v4l2_m2m_get_curr_priv() and at the time when driver module is unloaded,
>> all the v4l contexts must be closed and released. Hence the
>> v4l2_m2m_get_curr_priv() shall return NULL and crash the kernel when
>> work is executed before cancel_delayed_work_sync().
>>
> 
> Hi Dmitry,
> 
> Thanks for your reply. I think you're right. As m2m_dev is freed in
> v4l2_m2m_release,
> the invoking in v4l2_m2m_get_curr_priv might cause either UAF or null
> pointer dereference
> bug. I am sure that context is closed when we invoke mtk_jpeg_remove.
> But I'm not sure if
> context is released when mtk_jpegdec_timeout_work running.
> 
>> At the time when mtk_jpeg_remove() is invoked, there shall be no
>> job_timeout_work running in background because all jobs should be
>> completed before context is released. If you'll look at
>> v4l2_m2m_cancel_job(), you can see that it waits for the task completion
>> before closing context.
> 
> Yes, so I think the better way is to put the cancel_delayed_work_sync
> invoking into
> v4l2_m2m_ctx_release function?

The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
completion or for the interrupt fire. Apparently it doesn't work in
yours case. You'll need to debug why v4l job or job_timeout_work is
running after v4l2_m2m_ctx_release(), it shouldn't happen.

The interrupt handler cancels job_timeout_work, you shouldn't need to
flush the work.

Technically, interrupt handler may race with job_timeout_work, but the
timeout is set to 1 second and in practice should be difficult to
trigger the race. The interrupt handler needs to be threaded, it should
use cancel_delayed_work_sync() and check the return value of this function.

>>
>> You shouldn't be able to remove driver module while it has active/opened
>> v4l contexts. If you can do that, then this is yours bug that needs to
>> be fixed.
>>
>> In addition to this all, the job_timeout_work is initialized only for
>> the single-core JPEG device. I'd expect this patch should crash
>> multi-core JPEG devices.
>>
> 
> I think that's true. As I'm not familiar with the code here. Could you
> please give me some advice about the patch?

We'll need to understand why v4l2_m2m_ctx_release() doesn't work as
expected before thinking about the patch.
Dmitry Osipenko Sept. 19, 2023, 6:24 p.m. UTC | #10
On 8/31/23 11:18, Zheng Hacker wrote:
>> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
>> completion or for the interrupt fire. Apparently it doesn't work in
>> yours case. You'll need to debug why v4l job or job_timeout_work is
>> running after v4l2_m2m_ctx_release(), it shouldn't happen.
>>
> Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
> the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
> to trigger that.
> 
> However, this is not the only path to call v4l2_m2m_job_finish. Here
> is a invoking chain:
> v4l_streamon
>   ->v4l2_m2m_ioctl_streamon
>     ->v4l2_m2m_streamon
>       ->v4l2_m2m_try_schedule
>         ->v4l2_m2m_try_run
>           ->mtk_jpeg_dec_device_run
>             ->schedule_delayed_work(&jpeg->job_timeout_work...
>             ->error path goto dec_end
>             ->v4l2_m2m_job_finish
> 
> In some specific situation, it starts the worker and also calls
> v4l2_m2m_job_finish, which might
> make v4l2_m2m_cancel_job continues.

Then the error path should cancel the job_timeout_work, or better job
needs to be run after the dec/enc has been started and not before.

Looking further at the code, I'm confused by this hunk:

	mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base);
	v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx);

The job should be marked as finished when h/w has finished processing
the job and not right after the job has been started. So the job is
always completed and mtk_jpeg_job_timeout_work() doesn't work as
expected, am I missing something?
Dmitry Osipenko Sept. 19, 2023, 6:25 p.m. UTC | #11
On 9/19/23 21:24, Dmitry Osipenko wrote:
> On 8/31/23 11:18, Zheng Hacker wrote:
>>> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
>>> completion or for the interrupt fire. Apparently it doesn't work in
>>> yours case. You'll need to debug why v4l job or job_timeout_work is
>>> running after v4l2_m2m_ctx_release(), it shouldn't happen.
>>>
>> Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
>> the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
>> to trigger that.
>>
>> However, this is not the only path to call v4l2_m2m_job_finish. Here
>> is a invoking chain:
>> v4l_streamon
>>   ->v4l2_m2m_ioctl_streamon
>>     ->v4l2_m2m_streamon
>>       ->v4l2_m2m_try_schedule
>>         ->v4l2_m2m_try_run
>>           ->mtk_jpeg_dec_device_run
>>             ->schedule_delayed_work(&jpeg->job_timeout_work...
>>             ->error path goto dec_end
>>             ->v4l2_m2m_job_finish
>>
>> In some specific situation, it starts the worker and also calls
>> v4l2_m2m_job_finish, which might
>> make v4l2_m2m_cancel_job continues.
> 
> Then the error path should cancel the job_timeout_work, or better job

s/job/timeout work/

> needs to be run after the dec/enc has been started and not before.
> 
> Looking further at the code, I'm confused by this hunk:
> 
> 	mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base);
> 	v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx);
> 
> The job should be marked as finished when h/w has finished processing
> the job and not right after the job has been started. So the job is
> always completed and mtk_jpeg_job_timeout_work() doesn't work as
> expected, am I missing something?
>
Zheng Hacker Oct. 8, 2023, 9:13 a.m. UTC | #12
Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年9月20日周三 02:24写道:
>
> On 8/31/23 11:18, Zheng Hacker wrote:
> >> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
> >> completion or for the interrupt fire. Apparently it doesn't work in
> >> yours case. You'll need to debug why v4l job or job_timeout_work is
> >> running after v4l2_m2m_ctx_release(), it shouldn't happen.
> >>
> > Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
> > the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
> > to trigger that.
> >
> > However, this is not the only path to call v4l2_m2m_job_finish. Here
> > is a invoking chain:
> > v4l_streamon
> >   ->v4l2_m2m_ioctl_streamon
> >     ->v4l2_m2m_streamon
> >       ->v4l2_m2m_try_schedule
> >         ->v4l2_m2m_try_run
> >           ->mtk_jpeg_dec_device_run
> >             ->schedule_delayed_work(&jpeg->job_timeout_work...
> >             ->error path goto dec_end
> >             ->v4l2_m2m_job_finish
> >
> > In some specific situation, it starts the worker and also calls
> > v4l2_m2m_job_finish, which might
> > make v4l2_m2m_cancel_job continues.
>
> Then the error path should cancel the job_timeout_work, or better job
> needs to be run after the dec/enc has been started and not before.
>

Hi,

Sorry for my late reply for I just went on a long vacation.

Get it. I'll write another patch and change the summary to the lack of
canceling job in error path.

> Looking further at the code, I'm confused by this hunk:
>
>         mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base);
>         v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx);
>
> The job should be marked as finished when h/w has finished processing
> the job and not right after the job has been started. So the job is
> always completed and mtk_jpeg_job_timeout_work() doesn't work as
> expected, am I missing something?

After reading the code I still don't know. I didn't see any function
like mtk_jpeg_dec_end. The same thing
happens on mtk_jpeg_enc_start. I think I'd better fix the first
problem and wait for someone familiar with
the second part.

Best regards,
Zheng

>
> --
> Best regards,
> Dmitry
>
Dmitry Osipenko Oct. 19, 2023, 7:56 p.m. UTC | #13
On 10/8/23 12:13, Zheng Hacker wrote:
> Dmitry Osipenko <dmitry.osipenko@collabora.com> 于2023年9月20日周三 02:24写道:
>>
>> On 8/31/23 11:18, Zheng Hacker wrote:
>>>> The v4l2_m2m_ctx_release() already should wait for the job_timeout_work
>>>> completion or for the interrupt fire. Apparently it doesn't work in
>>>> yours case. You'll need to debug why v4l job or job_timeout_work is
>>>> running after v4l2_m2m_ctx_release(), it shouldn't happen.
>>>>
>>> Yes, v4l2_m2m_cancel_job waits for m2m_ctx->job_flags to be  ~TRANS_RUNNING,
>>> the mtk_jpeg_job_timeout_work will finally invoke v4l2_m2m_job_finish
>>> to trigger that.
>>>
>>> However, this is not the only path to call v4l2_m2m_job_finish. Here
>>> is a invoking chain:
>>> v4l_streamon
>>>   ->v4l2_m2m_ioctl_streamon
>>>     ->v4l2_m2m_streamon
>>>       ->v4l2_m2m_try_schedule
>>>         ->v4l2_m2m_try_run
>>>           ->mtk_jpeg_dec_device_run
>>>             ->schedule_delayed_work(&jpeg->job_timeout_work...
>>>             ->error path goto dec_end
>>>             ->v4l2_m2m_job_finish
>>>
>>> In some specific situation, it starts the worker and also calls
>>> v4l2_m2m_job_finish, which might
>>> make v4l2_m2m_cancel_job continues.
>>
>> Then the error path should cancel the job_timeout_work, or better job
>> needs to be run after the dec/enc has been started and not before.
>>
> 
> Hi,
> 
> Sorry for my late reply for I just went on a long vacation.
> 
> Get it. I'll write another patch and change the summary to the lack of
> canceling job in error path.
> 
>> Looking further at the code, I'm confused by this hunk:
>>
>>         mtk_jpeg_dec_start(comp_jpeg[hw_id]->reg_base);
>>         v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx);
>>
>> The job should be marked as finished when h/w has finished processing
>> the job and not right after the job has been started. So the job is
>> always completed and mtk_jpeg_job_timeout_work() doesn't work as
>> expected, am I missing something?
> 
> After reading the code I still don't know. I didn't see any function
> like mtk_jpeg_dec_end. The same thing
> happens on mtk_jpeg_enc_start. I think I'd better fix the first
> problem and wait for someone familiar with
> the second part.

I missed that the code mentioned above is related to the multi-core hw version, while you care about single-core. Nevertheless, the multi-core device_run() looks incorrect,

So, the error code paths need to be corrected. Please try to revert yours fix and test this change: 

diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
index 0051f372a66c..fd3b0587fcad 100644
--- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
+++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
@@ -1254,9 +1254,6 @@ static void mtk_jpegdec_worker(struct work_struct *work)
 	v4l2_m2m_src_buf_remove(ctx->fh.m2m_ctx);
 	v4l2_m2m_dst_buf_remove(ctx->fh.m2m_ctx);
 
-	schedule_delayed_work(&comp_jpeg[hw_id]->job_timeout_work,
-			      msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
-
 	mtk_jpeg_set_dec_src(ctx, &src_buf->vb2_buf, &bs);
 	if (mtk_jpeg_set_dec_dst(ctx,
 				 &jpeg_src_buf->dec_param,
@@ -1266,6 +1263,9 @@ static void mtk_jpegdec_worker(struct work_struct *work)
 		goto setdst_end;
 	}
 
+	schedule_delayed_work(&comp_jpeg[hw_id]->job_timeout_work,
+			      msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
+
 	spin_lock_irqsave(&comp_jpeg[hw_id]->hw_lock, flags);
 	ctx->total_frame_num++;
 	mtk_jpeg_dec_reset(comp_jpeg[hw_id]->reg_base);
@@ -1330,13 +1330,13 @@ static void mtk_jpeg_dec_device_run(void *priv)
 	if (ret < 0)
 		goto dec_end;
 
-	schedule_delayed_work(&jpeg->job_timeout_work,
-			      msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
-
 	mtk_jpeg_set_dec_src(ctx, &src_buf->vb2_buf, &bs);
 	if (mtk_jpeg_set_dec_dst(ctx, &jpeg_src_buf->dec_param, &dst_buf->vb2_buf, &fb))
 		goto dec_end;
 
+	schedule_delayed_work(&jpeg->job_timeout_work,
+			      msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));
+
 	spin_lock_irqsave(&jpeg->hw_lock, flags);
 	mtk_jpeg_dec_reset(jpeg->reg_base);
 	mtk_jpeg_dec_set_config(jpeg->reg_base,
diff mbox series

Patch

diff --git a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
index 0051f372a66c..6069ecf420b0 100644
--- a/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
+++ b/drivers/media/platform/mediatek/jpeg/mtk_jpeg_core.c
@@ -1816,6 +1816,7 @@  static void mtk_jpeg_remove(struct platform_device *pdev)
 {
 	struct mtk_jpeg_dev *jpeg = platform_get_drvdata(pdev);
 
+	cancel_delayed_work_sync(&jpeg->job_timeout_work);
 	pm_runtime_disable(&pdev->dev);
 	video_unregister_device(jpeg->vdev);
 	v4l2_m2m_release(jpeg->m2m_dev);