[v1,0/3] cedrus: Various bug fixes

Message ID	20220818203308.439043-1-nicolas.dufresne@collabora.com
Headers	Return-Path: <linux-media-owner@kernel.org>
	sender: nicolas) by madras.collabora.co.uk (Postfix) with ESMTPSA id 0BCE06601B46; Thu, 18 Aug 2022 21:33:32 +0100 (BST)
	From: Nicolas Dufresne <nicolas.dufresne@collabora.com>
	To: linux-media@vger.kernel.org
	Cc: kernel@collabora.com, Nicolas Dufresne <nicolas.dufresne@collabora.com>
	Subject: [PATCH v1 0/3] cedrus: Various bug fixes
	Date: Thu, 18 Aug 2022 16:33:05 -0400
	Message-Id: <20220818203308.439043-1-nicolas.dufresne@collabora.com>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Precedence: bulk
Series	cedrus: Various bug fixes
	[v1,0/3] cedrus: Various bug fixes
	[v1,1/3] media: cedrus: Fix watchdog race condition
	[v1,2/3] media: cedrus: Set the platform driver data earlier
	[v1,3/3] media: cedrus: Fix endless loop in cedrus_h265_skip_bits()

Nicolas Dufresne Aug. 18, 2022, 8:33 p.m. UTC

This patchset addresses different bugs in the cedrus driver. The first
patch addresses a possible NULL dereference in the probe function, the
second a possible infinite loop when passed invalid offsets, and the
last one fixes a race condition in the watchdog implementation.

As of writing this cover letter, the fluster score remains unaffected
at 132/147 with GStreamer.

Dmitry Osipenko (2):
  media: cedrus: Set the platform driver data earlier
  media: cedrus: Fix endless loop in cedrus_h265_skip_bits()

Nicolas Dufresne (1):
  media: cedrus: Fix watchdog race condition

 drivers/staging/media/sunxi/cedrus/cedrus.c      | 4 ++--
 drivers/staging/media/sunxi/cedrus/cedrus_dec.c  | 4 ++--
 drivers/staging/media/sunxi/cedrus/cedrus_h265.c | 6 ++++--
 3 files changed, 8 insertions(+), 6 deletions(-)

Jernej Škrabec Aug. 19, 2022, 4:16 a.m. UTC | #1

On Thursday, 18 August 2022 at 22:33:08 CEST, Nicolas Dufresne wrote:
> From: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> 
> The busy status bit may never de-assert if number of programmed skip
> bits is incorrect, resulting in a kernel hang because the bit is polled
> endlessly in the code. Fix it by adding timeout for the bit-polling.
> This problem is reproducible by setting the data_bit_offset field of
> the HEVC slice params to a wrong value by userspace.
> 
> Cc: stable@vger.kernel.org
> Reported-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>

A Fixes tag would be nice.

> ---
>  drivers/staging/media/sunxi/cedrus/cedrus_h265.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h265.c b/drivers/staging/media/sunxi/cedrus/cedrus_h265.c
> index f703c585d91c5..f0bc118021b0a 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus_h265.c
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h265.c
> @@ -227,6 +227,7 @@ static void cedrus_h265_pred_weight_write(struct cedrus_dev *dev,
>  static void cedrus_h265_skip_bits(struct cedrus_dev *dev, int num)
>  {
>  	int count = 0;
> +	u32 reg;
>  
>  	while (count < num) {
>  		int tmp = min(num - count, 32);
> @@ -234,8 +235,9 @@ static void cedrus_h265_skip_bits(struct cedrus_dev *dev, int num)
>  		cedrus_write(dev, VE_DEC_H265_TRIGGER,
>  			     VE_DEC_H265_TRIGGER_FLUSH_BITS |
>  			     VE_DEC_H265_TRIGGER_TYPE_N_BITS(tmp));
> -		while (cedrus_read(dev, VE_DEC_H265_STATUS) & VE_DEC_H265_STATUS_VLD_BUSY)
> -			udelay(1);
> +
> +		if (cedrus_wait_for(dev, VE_DEC_H265_STATUS, VE_DEC_H265_STATUS_VLD_BUSY))
> +			dev_err_ratelimited(dev->dev, "timed out waiting to skip bits\n");

Reporting the issue is nice, but it would be better to propagate the error,
since there is no way to properly decode this slice if the above code block fails.

Best regards,
Jernej

> 
>  		count += tmp;
>  	}
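
The fix under review replaces an unbounded busy-wait with a bounded one. A minimal user-space sketch of that pattern is shown here. The register model, every name prefixed with sketch_, and the budget of 1000 polls are assumptions made for illustration; none of them come from the cedrus driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_BUSY_BIT  (1u << 0)  /* hypothetical busy flag */
#define SKETCH_MAX_POLLS 1000       /* assumed poll budget, not from the driver */

/*
 * Hypothetical register model: reports busy for `busy_polls` reads.
 * A negative value models hardware whose busy bit never de-asserts.
 */
struct sketch_reg {
	int busy_polls;
};

static uint32_t sketch_read(struct sketch_reg *reg)
{
	if (reg->busy_polls == 0)
		return 0;
	if (reg->busy_polls > 0)
		reg->busy_polls--;
	return SKETCH_BUSY_BIT;
}

/* Poll until the busy bit clears, giving up once the budget is spent. */
static int sketch_wait_idle(struct sketch_reg *reg)
{
	assert(reg);

	for (int i = 0; i < SKETCH_MAX_POLLS; i++) {
		if (!(sketch_read(reg) & SKETCH_BUSY_BIT))
			return 0;
	}

	return -ETIMEDOUT;
}
```

With `busy_polls` set to a small positive number the wait returns 0; with a negative value it returns -ETIMEDOUT instead of spinning forever. In kernel code the same idea is commonly expressed with the readl_poll_timeout() family of helpers from linux/iopoll.h, which bound the wait by elapsed time rather than by iteration count.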

Nicolas Dufresne Aug. 19, 2022, 3:39 p.m. UTC | #2

On Friday, 19 August 2022 at 06:16 +0200, Jernej Škrabec wrote:
> On Thursday, 18 August 2022 at 22:33:08 CEST, Nicolas Dufresne wrote:
> > From: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> > 
> > The busy status bit may never de-assert if number of programmed skip
> > bits is incorrect, resulting in a kernel hang because the bit is polled
> > endlessly in the code. Fix it by adding timeout for the bit-polling.
> > This problem is reproducible by setting the data_bit_offset field of
> > the HEVC slice params to a wrong value by userspace.
> > 
> > Cc: stable@vger.kernel.org
> > Reported-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
> > Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> > Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
> 
> Fixes tag would be nice.
> 
> > ---
> >  drivers/staging/media/sunxi/cedrus/cedrus_h265.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h265.c b/drivers/staging/media/sunxi/cedrus/cedrus_h265.c
> > index f703c585d91c5..f0bc118021b0a 100644
> > --- a/drivers/staging/media/sunxi/cedrus/cedrus_h265.c
> > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h265.c
> > @@ -227,6 +227,7 @@ static void cedrus_h265_pred_weight_write(struct cedrus_dev *dev,
> >  static void cedrus_h265_skip_bits(struct cedrus_dev *dev, int num)
> >  {
> >  	int count = 0;
> > +	u32 reg;
> >  
> >  	while (count < num) {
> >  		int tmp = min(num - count, 32);
> > @@ -234,8 +235,9 @@ static void cedrus_h265_skip_bits(struct cedrus_dev *dev, int num)
> >  		cedrus_write(dev, VE_DEC_H265_TRIGGER,
> >  			     VE_DEC_H265_TRIGGER_FLUSH_BITS |
> >  			     VE_DEC_H265_TRIGGER_TYPE_N_BITS(tmp));
> > -		while (cedrus_read(dev, VE_DEC_H265_STATUS) & VE_DEC_H265_STATUS_VLD_BUSY)
> > -			udelay(1);
> > +
> > +		if (cedrus_wait_for(dev, VE_DEC_H265_STATUS, VE_DEC_H265_STATUS_VLD_BUSY))
> > +			dev_err_ratelimited(dev->dev, "timed out waiting to skip bits\n");
> 
> Reporting issue is nice, but better would be to propagate error, since there 
> is no way to properly decode this slice if above code block fails.

This mimics what was already there; mind if we do that later? The propagation is
going to be a lot more intrusive.

> 
> Best regards,
> Jernej
> 
> > 
> >  		count += tmp;
> >  	}
> 
> 
> 
>

Jernej Škrabec Aug. 20, 2022, 8:25 a.m. UTC | #3

On Friday, 19 August 2022 at 17:37:20 CEST, Nicolas Dufresne wrote:
> On Friday, 19 August 2022 at 06:17 +0200, Jernej Škrabec wrote:
> > On Thursday, 18 August 2022 at 22:33:07 CEST, Nicolas Dufresne wrote:
> > > From: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> > > 
> > > The cedrus_hw_resume() crashes with NULL dereference on driver probe if
> > > runtime PM is disabled because it uses platform data that hasn't been
> > > set up yet. Fix this by setting the platform data earlier during probe.
> > 
> > Does it even work without PM? Maybe it would be better if Cedrus would
> > select PM in Kconfig.
> 
> I cannot comment on this myself, but it does not seem to invalidate
> Dmitry's fix.

If the NULL pointer dereference happens only when PM is disabled, then it does.
I always have PM enabled and have never experienced the above issue.

Best regards,
Jernej

> 
> > Best regards,
> > Jernej
> > 
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> > > Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
> > > ---
> > > 
> > >  drivers/staging/media/sunxi/cedrus/cedrus.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.c b/drivers/staging/media/sunxi/cedrus/cedrus.c
> > > index 960a0130cd620..55c54dfdc585c 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus.c
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus.c
> > > @@ -448,6 +448,8 @@ static int cedrus_probe(struct platform_device *pdev)
> > >  	if (!dev)
> > >  		return -ENOMEM;
> > >  
> > > +	platform_set_drvdata(pdev, dev);
> > > +
> > >  	dev->vfd = cedrus_video_device;
> > >  	dev->dev = &pdev->dev;
> > >  	dev->pdev = pdev;
> > > @@ -521,8 +523,6 @@ static int cedrus_probe(struct platform_device *pdev)
> > >  		goto err_m2m_mc;
> > >  	}
> > >  
> > > -	platform_set_drvdata(pdev, dev);
> > > -
> > >  	return 0;
> > >  
> > >  err_m2m_mc:

Dmitry Osipenko Aug. 21, 2022, 8:40 p.m. UTC | #4

On 8/20/22 11:25, Jernej Škrabec wrote:
> On Friday, 19 August 2022 at 17:37:20 CEST, Nicolas Dufresne wrote:
>> On Friday, 19 August 2022 at 06:17 +0200, Jernej Škrabec wrote:
>>> On Thursday, 18 August 2022 at 22:33:07 CEST, Nicolas Dufresne wrote:
>>>> From: Dmitry Osipenko <dmitry.osipenko@collabora.com>
>>>>
>>>> The cedrus_hw_resume() crashes with NULL dereference on driver probe if
>>>> runtime PM is disabled because it uses platform data that hasn't been
>>>> set up yet. Fix this by setting the platform data earlier during probe.
>>>
>>> Does it even work without PM? Maybe it would be better if Cedrus would
>>> select PM in Kconfig.
>>
>> I cannot comment on this myself, but it does not seem to invalidate
>> Dmitry's fix.
> 
> If the NULL pointer dereference happens only when PM is disabled, then it does.
> I always have PM enabled and have never experienced the above issue.

Originally this fix was needed when I changed cedrus_hw_probe() to do
the rpm-resume while I was debugging the hang issue; I then also noticed
that the !PM case should be broken. It's good common practice for all
drivers to set the drv data early to avoid such problems, hence it
won't hurt to make this change anyway.

In practice nobody disables PM other than for debugging purposes, and !PM
handling is often broken in drivers. I assume that it might be even
better to enable PM for all Allwinner SoCs and remove !PM handling from
all the affected drivers, like it was done for Tegra some time ago.
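
The ordering problem described above can be modelled outside the kernel. The sketch here is a hypothetical user-space analogue: sketch_pdev, sketch_state, sketch_resume() and sketch_probe() are invented names, and the resume callback returns -EINVAL where a real driver would dereference the NULL pointer and crash.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for a platform device and its driver state. */
struct sketch_pdev {
	void *drvdata;
};

struct sketch_state {
	int powered;
};

static void sketch_set_drvdata(struct sketch_pdev *pdev, void *data)
{
	pdev->drvdata = data;
}

/* Like many driver callbacks, resume finds its state through drvdata. */
static int sketch_resume(struct sketch_pdev *pdev)
{
	struct sketch_state *st = pdev->drvdata;

	if (!st)
		return -EINVAL; /* a real driver would crash on NULL here */

	st->powered = 1;
	return 0;
}

/* `set_early` selects between the fixed ordering (1) and the old one (0). */
static int sketch_probe(struct sketch_pdev *pdev, struct sketch_state *st,
			int set_early)
{
	int ret;

	assert(pdev && st);

	if (set_early)
		sketch_set_drvdata(pdev, st);

	/* A callback that needs drvdata runs while probe is still in progress. */
	ret = sketch_resume(pdev);
	if (ret)
		return ret;

	if (!set_early)
		sketch_set_drvdata(pdev, st);

	return 0;
}
```

Calling sketch_probe() with `set_early` equal to 1 succeeds and powers the device; with 0 the callback sees no state and probe fails, which is the situation the patch avoids by moving platform_set_drvdata() to the top of cedrus_probe().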

Jernej Škrabec Aug. 25, 2022, 9:04 p.m. UTC | #5

On Sunday, 21 August 2022 at 22:40:21 CEST, Dmitry Osipenko wrote:
> On 8/20/22 11:25, Jernej Škrabec wrote:
> > On Friday, 19 August 2022 at 17:37:20 CEST, Nicolas Dufresne wrote:
> >> On Friday, 19 August 2022 at 06:17 +0200, Jernej Škrabec wrote:
> >>> On Thursday, 18 August 2022 at 22:33:07 CEST, Nicolas Dufresne wrote:
> >>>> From: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> >>>> 
> >>>> The cedrus_hw_resume() crashes with NULL dereference on driver probe if
> >>>> runtime PM is disabled because it uses platform data that hasn't been
> >>>> set up yet. Fix this by setting the platform data earlier during probe.
> >>> 
> >>> Does it even work without PM? Maybe it would be better if Cedrus would
> >>> select PM in Kconfig.
> >> 
> >> I cannot comment on this myself, but it does not seem to invalidate
> >> Dmitry's fix.
> > 
> > If the NULL pointer dereference happens only when PM is disabled, then it
> > does. I always have PM enabled and have never experienced the above issue.
> 
> Originally this fix was needed when I changed cedrus_hw_probe() to do
> the rpm-resume while I was debugging the hang issue; I then also noticed
> that the !PM case should be broken. It's good common practice for all
> drivers to set the drv data early to avoid such problems, hence it
> won't hurt to make this change anyway.

Ok, as others pointed out, this is still a good thing, so:

Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com>


> 
> In practice nobody disables PM other than for debugging purposes, and !PM
> handling is often broken in drivers. I assume that it might be even
> better to enable PM for all Allwinner SoCs and remove !PM handling from
> all the affected drivers, like it was done for Tegra some time ago.
> 

Maybe in the future :)

Best regards,
Jernej

> --
> Best regards,
> Dmitry

Jernej Škrabec Aug. 25, 2022, 9:13 p.m. UTC | #6

On Friday, 19 August 2022 at 17:39:25 CEST, Nicolas Dufresne wrote:
> On Friday, 19 August 2022 at 06:16 +0200, Jernej Škrabec wrote:
> > On Thursday, 18 August 2022 at 22:33:08 CEST, Nicolas Dufresne wrote:
> > > From: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> > > 
> > > The busy status bit may never de-assert if number of programmed skip
> > > bits is incorrect, resulting in a kernel hang because the bit is polled
> > > endlessly in the code. Fix it by adding timeout for the bit-polling.
> > > This problem is reproducible by setting the data_bit_offset field of
> > > the HEVC slice params to a wrong value by userspace.
> > > 
> > > Cc: stable@vger.kernel.org
> > > Reported-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
> > > Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> > > Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
> > 
> > Fixes tag would be nice.
> > 
> > > ---
> > > 
> > >  drivers/staging/media/sunxi/cedrus/cedrus_h265.c | 6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h265.c b/drivers/staging/media/sunxi/cedrus/cedrus_h265.c
> > > index f703c585d91c5..f0bc118021b0a 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus_h265.c
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h265.c
> > > @@ -227,6 +227,7 @@ static void cedrus_h265_pred_weight_write(struct cedrus_dev *dev,
> > >  static void cedrus_h265_skip_bits(struct cedrus_dev *dev, int num)
> > >  {
> > >  	int count = 0;
> > > +	u32 reg;
> > >  
> > >  	while (count < num) {
> > >  		int tmp = min(num - count, 32);
> > > @@ -234,8 +235,9 @@ static void cedrus_h265_skip_bits(struct cedrus_dev *dev, int num)
> > >  		cedrus_write(dev, VE_DEC_H265_TRIGGER,
> > >  			     VE_DEC_H265_TRIGGER_FLUSH_BITS |
> > >  			     VE_DEC_H265_TRIGGER_TYPE_N_BITS(tmp));
> > > -		while (cedrus_read(dev, VE_DEC_H265_STATUS) & VE_DEC_H265_STATUS_VLD_BUSY)
> > > -			udelay(1);
> > > +
> > > +		if (cedrus_wait_for(dev, VE_DEC_H265_STATUS, VE_DEC_H265_STATUS_VLD_BUSY))
> > > +			dev_err_ratelimited(dev->dev, "timed out waiting to skip bits\n");
> > 
> > Reporting the issue is nice, but it would be better to propagate the error,
> > since there is no way to properly decode this slice if the above code block fails.
> 
> This mimics what was already there; mind if we do that later? The
> propagation is going to be a lot more intrusive.

Since backporting fixes before 6.0 isn't a priority, viability for backporting
isn't that important. You would only need to return 0 or -ETIMEDOUT and check
for the error in only one location. That doesn't sound very intrusive to me.

Best regards,
Jernej

> 
> > Best regards,
> > Jernej
> > 
> > >  		count += tmp;
> > >  	
> > >  	}
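
The suggestion in this reply (return 0 or -ETIMEDOUT from the skip helper and check it in a single caller) might look roughly like the following user-space sketch. The hardware model, the sketch_ names, the sketch_setup_slice() caller and the poll budget are hypothetical and are not taken from the cedrus sources.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_MAX_POLLS 1000 /* assumed poll budget, not from the driver */

/*
 * Hypothetical hardware model: after each trigger the unit stays busy for
 * `busy_polls` status reads; a negative value means it never goes idle.
 */
struct sketch_hw {
	int busy_polls;
	int remaining;
};

static void sketch_trigger(struct sketch_hw *hw)
{
	hw->remaining = hw->busy_polls;
}

static int sketch_busy(struct sketch_hw *hw)
{
	if (hw->remaining == 0)
		return 0;
	if (hw->remaining > 0)
		hw->remaining--;
	return 1;
}

/* Skip `num` bits in chunks of at most 32, reporting a timeout to the caller. */
static int sketch_skip_bits(struct sketch_hw *hw, int num)
{
	int count = 0;

	assert(num >= 0);

	while (count < num) {
		int tmp = (num - count < 32) ? num - count : 32;
		int polls = 0;

		sketch_trigger(hw);
		while (sketch_busy(hw)) {
			if (++polls >= SKETCH_MAX_POLLS)
				return -ETIMEDOUT;
		}

		count += tmp;
	}

	return 0;
}

/* Hypothetical caller: the one location where the error is checked. */
static int sketch_setup_slice(struct sketch_hw *hw, int bit_offset)
{
	int ret = sketch_skip_bits(hw, bit_offset);

	if (ret)
		return ret; /* the slice cannot be decoded, so give up early */

	/* ... the remaining slice programming would follow here ... */
	return 0;
}
```

The helper's return type changes from void to int, and the caller needs only one added check, which is the extent of the change the reply argues for.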
