[v9,13/13] media: ti: Add CSI2RX support for J721E

Message ID	20230811-upstream_csi-v9-13-8943f7a68a81@ti.com
State	New
Headers	show Return-Path: <linux-media-owner@vger.kernel.org> From: Jai Luthra <j-luthra@ti.com> To: Mauro Carvalho Chehab <mchehab@kernel.org>, Rob Herring <robh+dt@kernel.org>, Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org>, Conor Dooley <conor+dt@kernel.org>, Sakari Ailus <sakari.ailus@linux.intel.com>, Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> CC: <linux-media@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <devicetree@vger.kernel.org>, <linux-arm-kernel@lists.infradead.org>, Laurent Pinchart <laurent.pinchart@ideasonboard.com>, Mauro Carvalho Chehab <mchehab+samsung@kernel.org>, Maxime Ripard <mripard@kernel.org>, <niklas.soderlund+renesas@ragnatech.se>, Benoit Parrot <bparrot@ti.com>, Vaishnav Achath <vaishnav.a@ti.com>, Vignesh Raghavendra <vigneshr@ti.com>, <nm@ti.com>, <devarsht@ti.com>, <j-luthra@ti.com>, <a-bhatia1@ti.com>, Martyn Welch <martyn.welch@collabora.com>, Julien Massot <julien.massot@collabora.com> Subject: [PATCH v9 13/13] media: ti: Add CSI2RX support for J721E Date: Fri, 11 Aug 2023 16:17:35 +0530 Message-ID: <20230811-upstream_csi-v9-13-8943f7a68a81@ti.com> In-Reply-To: <20230811-upstream_csi-v9-0-8943f7a68a81@ti.com> References: <20230811-upstream_csi-v9-0-8943f7a68a81@ti.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit Precedence: bulk
Series	CSI2RX support on J721E and AM62 \| expand [v9,00/13] CSI2RX support on J721E and AM62 [v9,01/13] media: dt-bindings: Make sure items in data-lanes are unique [v9,02/13] media: dt-bindings: cadence-csi2rx: Add TI compatible string [v9,03/13] media: cadence: csi2rx: Unregister v4l2 async notifier [v9,04/13] media: cadence: csi2rx: Cleanup media entity properly [v9,05/13] media: cadence: csi2rx: Add get_fmt and set_fmt pad ops [v9,06/13] media: cadence: csi2rx: Configure DPHY using link freq [v9,07/13] media: cadence: csi2rx: Soft reset the streams before starting capture [v9,08/13] media: cadence: csi2rx: Set the STOP bit when stopping a stream [v9,09/13] media: cadence: csi2rx: Fix stream data configuration [v9,10/13] media: cadence: csi2rx: Populate subdev devnode [v9,11/13] media: cadence: csi2rx: Add link validation [v9,12/13] media: dt-bindings: Add TI J721E CSI2RX [v9,13/13] media: ti: Add CSI2RX support for J721E

Jai Luthra Aug. 11, 2023, 10:47 a.m. UTC

From: Pratyush Yadav <p.yadav@ti.com>

TI's J721E uses the Cadence CSI2RX and DPHY peripherals to facilitate
capture over a CSI-2 bus.

The Cadence CSI2RX IP acts as a bridge between the TI specific parts and
the CSI-2 protocol parts. TI then has a wrapper on top of this bridge
called the SHIM layer. It takes in data from stream 0, repacks it, and
sends it to memory over PSI-L DMA.

This driver acts as the "front end" to V4L2 client applications. It
implements the required ioctls and buffer operations, passes the
necessary calls on to the bridge, programs the SHIM layer, and performs
DMA via the dmaengine API to finally return the data to a buffer
supplied by the application.

Signed-off-by: Pratyush Yadav <p.yadav@ti.com>
Co-authored-by: Vaishnav Achath <vaishnav.a@ti.com>
Signed-off-by: Vaishnav Achath <vaishnav.a@ti.com>
Tested-by: Vaishnav Achath <vaishnav.a@ti.com>
Co-authored-by: Jai Luthra <j-luthra@ti.com>
Signed-off-by: Jai Luthra <j-luthra@ti.com>
---
Changes since v8:
    - Allocate drain buffer at start of stream instead of doing it in the
      middle, and document why it is needed in comments
    - Call subdev's get_fmt directly for link_validation()
    - Cleanup height/width clamping and rounding code, document it in comments
    - Return and check errors from setup_shim()
    - s/subdev/source for cadence csi2rx's v4l2_subdev
    - s/ti_csi2rx_init_subdev/ti_csi2rx_notifier_register
    - Change copyright year/author list

 MAINTAINERS                                        |    7 +
 drivers/media/platform/ti/Kconfig                  |   12 +
 drivers/media/platform/ti/Makefile                 |    1 +
 drivers/media/platform/ti/j721e-csi2rx/Makefile    |    2 +
 .../media/platform/ti/j721e-csi2rx/j721e-csi2rx.c  | 1150 ++++++++++++++++++++
 5 files changed, 1172 insertions(+)

Tomi Valkeinen Aug. 15, 2023, 1 p.m. UTC | #1

On 11/08/2023 13:47, Jai Luthra wrote:
> From: Pratyush Yadav <p.yadav@ti.com>
> 
> TI's J721E uses the Cadence CSI2RX and DPHY peripherals to facilitate
> capture over a CSI-2 bus.
> 
> The Cadence CSI2RX IP acts as a bridge between the TI specific parts and
> the CSI-2 protocol parts. TI then has a wrapper on top of this bridge
> called the SHIM layer. It takes in data from stream 0, repacks it, and
> sends it to memory over PSI-L DMA.
> 
> This driver acts as the "front end" to V4L2 client applications. It
> implements the required ioctls and buffer operations, passes the
> necessary calls on to the bridge, programs the SHIM layer, and performs
> DMA via the dmaengine API to finally return the data to a buffer
> supplied by the application.
> 
> Signed-off-by: Pratyush Yadav <p.yadav@ti.com>
> Co-authored-by: Vaishnav Achath <vaishnav.a@ti.com>
> Signed-off-by: Vaishnav Achath <vaishnav.a@ti.com>
> Tested-by: Vaishnav Achath <vaishnav.a@ti.com>
> Co-authored-by: Jai Luthra <j-luthra@ti.com>
> Signed-off-by: Jai Luthra <j-luthra@ti.com>
> ---
> Changes since v8:
>      - Allocate drain buffer at start of stream instead of doing it in the
>        middle, and document why it is needed in comments
>      - Call subdev's get_fmt directly for link_validation()
>      - Cleanup height/width clamping and rounding code, document it in comments
>      - Return and check errors from setup_shim()
>      - s/subdev/source for cadence csi2rx's v4l2_subdev
>      - s/ti_csi2rx_init_subdev/ti_csi2rx_notifier_register
>      - Change copyright year/author list
> 
>   MAINTAINERS                                        |    7 +
>   drivers/media/platform/ti/Kconfig                  |   12 +
>   drivers/media/platform/ti/Makefile                 |    1 +
>   drivers/media/platform/ti/j721e-csi2rx/Makefile    |    2 +
>   .../media/platform/ti/j721e-csi2rx/j721e-csi2rx.c  | 1150 ++++++++++++++++++++
>   5 files changed, 1172 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 02a3192195af..959147d6d936 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -21455,6 +21455,13 @@ F:	Documentation/devicetree/bindings/media/i2c/ti,ds90*
>   F:	drivers/media/i2c/ds90*
>   F:	include/media/i2c/ds90*
>   
> +TI J721E CSI2RX DRIVER
> +M:	Jai Luthra <j-luthra@ti.com>
> +L:	linux-media@vger.kernel.org
> +S:	Maintained
> +F:	Documentation/devicetree/bindings/media/ti,j721e-csi2rx.yaml
> +F:	drivers/media/platform/ti/j721e-csi2rx/
> +
>   TI KEYSTONE MULTICORE NAVIGATOR DRIVERS
>   M:	Nishanth Menon <nm@ti.com>
>   M:	Santosh Shilimkar <ssantosh@kernel.org>
> diff --git a/drivers/media/platform/ti/Kconfig b/drivers/media/platform/ti/Kconfig
> index e1ab56c3be1f..42c908f6e1ae 100644
> --- a/drivers/media/platform/ti/Kconfig
> +++ b/drivers/media/platform/ti/Kconfig
> @@ -63,6 +63,18 @@ config VIDEO_TI_VPE_DEBUG
>   	help
>   	  Enable debug messages on VPE driver.
>   
> +config VIDEO_TI_J721E_CSI2RX
> +	tristate "TI J721E CSI2RX wrapper layer driver"
> +	depends on VIDEO_DEV && VIDEO_V4L2_SUBDEV_API
> +	depends on MEDIA_SUPPORT && MEDIA_CONTROLLER
> +	depends on PHY_CADENCE_DPHY_RX && VIDEO_CADENCE_CSI2RX
> +	depends on ARCH_K3 || COMPILE_TEST
> +	select VIDEOBUF2_DMA_CONTIG
> +	select V4L2_FWNODE
> +	help
> +	  Support for TI CSI2RX wrapper layer. This just enables the wrapper driver.
> +	  The Cadence CSI2RX bridge driver needs to be enabled separately.
> +
>   source "drivers/media/platform/ti/am437x/Kconfig"
>   source "drivers/media/platform/ti/davinci/Kconfig"
>   source "drivers/media/platform/ti/omap/Kconfig"
> diff --git a/drivers/media/platform/ti/Makefile b/drivers/media/platform/ti/Makefile
> index 98c5fe5c40d6..8a2f74c9380e 100644
> --- a/drivers/media/platform/ti/Makefile
> +++ b/drivers/media/platform/ti/Makefile
> @@ -3,5 +3,6 @@ obj-y += am437x/
>   obj-y += cal/
>   obj-y += vpe/
>   obj-y += davinci/
> +obj-y += j721e-csi2rx/
>   obj-y += omap/
>   obj-y += omap3isp/
> diff --git a/drivers/media/platform/ti/j721e-csi2rx/Makefile b/drivers/media/platform/ti/j721e-csi2rx/Makefile
> new file mode 100644
> index 000000000000..377afc1d6280
> --- /dev/null
> +++ b/drivers/media/platform/ti/j721e-csi2rx/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0
> +obj-$(CONFIG_VIDEO_TI_J721E_CSI2RX) += j721e-csi2rx.o
> diff --git a/drivers/media/platform/ti/j721e-csi2rx/j721e-csi2rx.c b/drivers/media/platform/ti/j721e-csi2rx/j721e-csi2rx.c
> new file mode 100644
> index 000000000000..301d947f6098
> --- /dev/null
> +++ b/drivers/media/platform/ti/j721e-csi2rx/j721e-csi2rx.c
> @@ -0,0 +1,1150 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * TI CSI2RX Shim Wrapper Driver
> + *
> + * Copyright (C) 2023 Texas Instruments Incorporated - https://www.ti.com/
> + *
> + * Author: Pratyush Yadav <p.yadav@ti.com>
> + * Author: Jai Luthra <j-luthra@ti.com>
> + */
> +
> +#include <linux/bitfield.h>
> +#include <linux/dmaengine.h>
> +#include <linux/module.h>
> +#include <linux/of_platform.h>
> +#include <linux/platform_device.h>
> +
> +#include <media/mipi-csi2.h>
> +#include <media/v4l2-device.h>
> +#include <media/v4l2-ioctl.h>
> +#include <media/v4l2-mc.h>
> +#include <media/videobuf2-dma-contig.h>
> +
> +#define TI_CSI2RX_MODULE_NAME		"j721e-csi2rx"
> +
> +#define SHIM_CNTL			0x10
> +#define SHIM_CNTL_PIX_RST		BIT(0)
> +
> +#define SHIM_DMACNTX			0x20
> +#define SHIM_DMACNTX_EN			BIT(31)
> +#define SHIM_DMACNTX_YUV422		GENMASK(27, 26)
> +#define SHIM_DMACNTX_SIZE		GENMASK(21, 20)
> +#define SHIM_DMACNTX_FMT		GENMASK(5, 0)
> +#define SHIM_DMACNTX_UYVY		0
> +#define SHIM_DMACNTX_VYUY		1
> +#define SHIM_DMACNTX_YUYV		2
> +#define SHIM_DMACNTX_YVYU		3
> +#define SHIM_DMACNTX_SIZE_8		0
> +#define SHIM_DMACNTX_SIZE_16		1
> +#define SHIM_DMACNTX_SIZE_32		2
> +
> +#define SHIM_PSI_CFG0			0x24
> +#define SHIM_PSI_CFG0_SRC_TAG		GENMASK(15, 0)
> +#define SHIM_PSI_CFG0_DST_TAG		GENMASK(31, 16)
> +
> +#define PSIL_WORD_SIZE_BYTES		16
> +/*
> + * There are no hard limits on the width or height. The DMA engine can handle
> + * all sizes. The max width and height are arbitrary numbers for this driver.
> + * Use 16K * 16K as the arbitrary limit. It is large enough that it is unlikely
> + * the limit will be hit in practice.
> + */
> +#define MAX_WIDTH_BYTES			SZ_16K
> +#define MAX_HEIGHT_LINES		SZ_16K
> +
> +#define DRAIN_TIMEOUT_MS		50
> +
> +struct ti_csi2rx_fmt {
> +	u32				fourcc;	/* Four character code. */
> +	u32				code;	/* Mbus code. */
> +	u32				csi_dt;	/* CSI Data type. */
> +	u8				bpp;	/* Bits per pixel. */
> +	u8				size;	/* Data size shift when unpacking. */
> +};
> +
> +struct ti_csi2rx_buffer {
> +	/* Common v4l2 buffer. Must be first. */
> +	struct vb2_v4l2_buffer		vb;
> +	struct list_head		list;
> +	struct ti_csi2rx_dev		*csi;
> +};
> +
> +enum ti_csi2rx_dma_state {
> +	TI_CSI2RX_DMA_STOPPED,	/* Streaming not started yet. */
> +	TI_CSI2RX_DMA_IDLE,	/* Streaming but no pending DMA operation. */
> +	TI_CSI2RX_DMA_ACTIVE,	/* Streaming and pending DMA operation. */
> +};
> +
> +struct ti_csi2rx_dma {
> +	/* Protects all fields in this struct. */
> +	spinlock_t			lock;
> +	struct dma_chan			*chan;
> +	/* Buffers queued to the driver, waiting to be processed by DMA. */
> +	struct list_head		queue;
> +	enum ti_csi2rx_dma_state	state;
> +	/*
> +	 * Queue of buffers submitted to DMA engine.
> +	 */
> +	struct list_head		submitted;
> +	/* Buffer to drain stale data from PSI-L endpoint */
> +	struct {
> +		void			*vaddr;
> +		dma_addr_t		paddr;
> +		size_t			len;
> +	} drain;
> +};
> +
> +struct ti_csi2rx_dev {
> +	struct device			*dev;
> +	void __iomem			*shim;
> +	struct v4l2_device		v4l2_dev;
> +	struct video_device		vdev;
> +	struct media_device		mdev;
> +	struct media_pipeline		pipe;
> +	struct media_pad		pad;
> +	struct v4l2_async_notifier	notifier;
> +	struct v4l2_subdev		*source;
> +	struct vb2_queue		vidq;
> +	struct mutex			mutex; /* To serialize ioctls. */
> +	struct v4l2_format		v_fmt;
> +	struct ti_csi2rx_dma		dma;
> +	u32				sequence;
> +};
> +
> +static const struct ti_csi2rx_fmt formats[] = {
> +	{
> +		.fourcc			= V4L2_PIX_FMT_YUYV,
> +		.code			= MEDIA_BUS_FMT_YUYV8_1X16,
> +		.csi_dt			= MIPI_CSI2_DT_YUV422_8B,
> +		.bpp			= 16,
> +		.size			= SHIM_DMACNTX_SIZE_8,
> +	}, {
> +		.fourcc			= V4L2_PIX_FMT_UYVY,
> +		.code			= MEDIA_BUS_FMT_UYVY8_1X16,
> +		.csi_dt			= MIPI_CSI2_DT_YUV422_8B,
> +		.bpp			= 16,
> +		.size			= SHIM_DMACNTX_SIZE_8,
> +	}, {
> +		.fourcc			= V4L2_PIX_FMT_YVYU,
> +		.code			= MEDIA_BUS_FMT_YVYU8_1X16,
> +		.csi_dt			= MIPI_CSI2_DT_YUV422_8B,
> +		.bpp			= 16,
> +		.size			= SHIM_DMACNTX_SIZE_8,
> +	}, {
> +		.fourcc			= V4L2_PIX_FMT_VYUY,
> +		.code			= MEDIA_BUS_FMT_VYUY8_1X16,
> +		.csi_dt			= MIPI_CSI2_DT_YUV422_8B,
> +		.bpp			= 16,
> +		.size			= SHIM_DMACNTX_SIZE_8,
> +	}, {
> +		.fourcc			= V4L2_PIX_FMT_SBGGR8,
> +		.code			= MEDIA_BUS_FMT_SBGGR8_1X8,
> +		.csi_dt			= MIPI_CSI2_DT_RAW8,
> +		.bpp			= 8,
> +		.size			= SHIM_DMACNTX_SIZE_8,
> +	}, {
> +		.fourcc			= V4L2_PIX_FMT_SGBRG8,
> +		.code			= MEDIA_BUS_FMT_SGBRG8_1X8,
> +		.csi_dt			= MIPI_CSI2_DT_RAW8,
> +		.bpp			= 8,
> +		.size			= SHIM_DMACNTX_SIZE_8,
> +	}, {
> +		.fourcc			= V4L2_PIX_FMT_SGRBG8,
> +		.code			= MEDIA_BUS_FMT_SGRBG8_1X8,
> +		.csi_dt			= MIPI_CSI2_DT_RAW8,
> +		.bpp			= 8,
> +		.size			= SHIM_DMACNTX_SIZE_8,
> +	}, {
> +		.fourcc			= V4L2_PIX_FMT_SRGGB8,
> +		.code			= MEDIA_BUS_FMT_SRGGB8_1X8,
> +		.csi_dt			= MIPI_CSI2_DT_RAW8,
> +		.bpp			= 8,
> +		.size			= SHIM_DMACNTX_SIZE_8,
> +	}, {
> +		.fourcc			= V4L2_PIX_FMT_SBGGR10,
> +		.code			= MEDIA_BUS_FMT_SBGGR10_1X10,
> +		.csi_dt			= MIPI_CSI2_DT_RAW10,
> +		.bpp			= 16,
> +		.size			= SHIM_DMACNTX_SIZE_16,
> +	}, {
> +		.fourcc			= V4L2_PIX_FMT_SGBRG10,
> +		.code			= MEDIA_BUS_FMT_SGBRG10_1X10,
> +		.csi_dt			= MIPI_CSI2_DT_RAW10,
> +		.bpp			= 16,
> +		.size			= SHIM_DMACNTX_SIZE_16,
> +	}, {
> +		.fourcc			= V4L2_PIX_FMT_SGRBG10,
> +		.code			= MEDIA_BUS_FMT_SGRBG10_1X10,
> +		.csi_dt			= MIPI_CSI2_DT_RAW10,
> +		.bpp			= 16,
> +		.size			= SHIM_DMACNTX_SIZE_16,
> +	}, {
> +		.fourcc			= V4L2_PIX_FMT_SRGGB10,
> +		.code			= MEDIA_BUS_FMT_SRGGB10_1X10,
> +		.csi_dt			= MIPI_CSI2_DT_RAW10,
> +		.bpp			= 16,
> +		.size			= SHIM_DMACNTX_SIZE_16,
> +	},
> +
> +	/* More formats can be supported but they are not listed for now. */
> +};
> +
> +static const unsigned int num_formats = ARRAY_SIZE(formats);
> +
> +/* Forward declaration needed by ti_csi2rx_dma_callback. */
> +static int ti_csi2rx_start_dma(struct ti_csi2rx_dev *csi,
> +			       struct ti_csi2rx_buffer *buf);
> +
> +static const struct ti_csi2rx_fmt *find_format_by_pix(u32 pixelformat)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < num_formats; i++) {
> +		if (formats[i].fourcc == pixelformat)
> +			return &formats[i];
> +	}
> +
> +	return NULL;
> +}
> +
> +static const struct ti_csi2rx_fmt *find_format_by_code(u32 code)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < num_formats; i++) {
> +		if (formats[i].code == code)
> +			return &formats[i];
> +	}
> +
> +	return NULL;
> +}
> +
> +static void ti_csi2rx_fill_fmt(const struct ti_csi2rx_fmt *csi_fmt,
> +			       struct v4l2_format *v4l2_fmt)
> +{
> +	struct v4l2_pix_format *pix = &v4l2_fmt->fmt.pix;
> +	unsigned int pixels_in_word;
> +	u8 bpp = ALIGN(csi_fmt->bpp, 8);
> +
> +	pixels_in_word = PSIL_WORD_SIZE_BYTES * 8 / bpp;
> +
> +	/* Clamp width and height to sensible maximums (16K x 16K) */
> +	pix->width = clamp_t(unsigned int, pix->width,
> +			     pixels_in_word,
> +			     MAX_WIDTH_BYTES * 8 / bpp);
> +	pix->height = clamp_t(unsigned int, pix->height, 1, MAX_HEIGHT_LINES);
> +
> +	/* Width should be a multiple of transfer word-size */
> +	pix->width = rounddown(pix->width, pixels_in_word);
> +
> +	v4l2_fmt->type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
> +	pix->pixelformat = csi_fmt->fourcc;
> +	pix->colorspace = V4L2_COLORSPACE_SRGB;
> +	pix->bytesperline = pix->width * (bpp / 8);
> +	pix->sizeimage = pix->bytesperline * pix->height;
> +}
> +
> +static int ti_csi2rx_querycap(struct file *file, void *priv,
> +			      struct v4l2_capability *cap)
> +{
> +	strscpy(cap->driver, TI_CSI2RX_MODULE_NAME, sizeof(cap->driver));
> +	strscpy(cap->card, TI_CSI2RX_MODULE_NAME, sizeof(cap->card));
> +
> +	return 0;
> +}
> +
> +static int ti_csi2rx_enum_fmt_vid_cap(struct file *file, void *priv,
> +				      struct v4l2_fmtdesc *f)
> +{
> +	const struct ti_csi2rx_fmt *fmt = NULL;
> +
> +	if (f->mbus_code) {
> +		/* 1-to-1 mapping between bus formats and pixel formats */
> +		if (f->index > 0)
> +			return -EINVAL;
> +
> +		fmt = find_format_by_code(f->mbus_code);
> +	} else {
> +		if (f->index >= num_formats)
> +			return -EINVAL;
> +
> +		fmt = &formats[f->index];
> +	}
> +
> +	if (!fmt)
> +		return -EINVAL;
> +
> +	f->pixelformat = fmt->fourcc;
> +	memset(f->reserved, 0, sizeof(f->reserved));
> +	f->type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
> +
> +	return 0;
> +}
> +
> +static int ti_csi2rx_g_fmt_vid_cap(struct file *file, void *prov,
> +				   struct v4l2_format *f)
> +{
> +	struct ti_csi2rx_dev *csi = video_drvdata(file);
> +
> +	*f = csi->v_fmt;
> +
> +	return 0;
> +}
> +
> +static int ti_csi2rx_try_fmt_vid_cap(struct file *file, void *priv,
> +				     struct v4l2_format *f)
> +{
> +	const struct ti_csi2rx_fmt *fmt;
> +
> +	/*
> +	 * Default to the first format if the requested pixel format code isn't
> +	 * supported.
> +	 */
> +	fmt = find_format_by_pix(f->fmt.pix.pixelformat);
> +	if (!fmt)
> +		fmt = &formats[0];
> +
> +	/* Interlaced formats are not supported. */
> +	f->fmt.pix.field = V4L2_FIELD_NONE;
> +
> +	ti_csi2rx_fill_fmt(fmt, f);
> +
> +	return 0;
> +}
> +
> +static int ti_csi2rx_s_fmt_vid_cap(struct file *file, void *priv,
> +				   struct v4l2_format *f)
> +{
> +	struct ti_csi2rx_dev *csi = video_drvdata(file);
> +	struct vb2_queue *q = &csi->vidq;
> +	int ret;
> +
> +	if (vb2_is_busy(q))
> +		return -EBUSY;
> +
> +	ret = ti_csi2rx_try_fmt_vid_cap(file, priv, f);
> +	if (ret < 0)
> +		return ret;
> +
> +	csi->v_fmt = *f;
> +
> +	return 0;
> +}
> +
> +static int ti_csi2rx_enum_framesizes(struct file *file, void *fh,
> +				     struct v4l2_frmsizeenum *fsize)
> +{
> +	const struct ti_csi2rx_fmt *fmt;
> +	unsigned int pixels_in_word;
> +	u8 bpp;
> +
> +	fmt = find_format_by_pix(fsize->pixel_format);
> +	if (!fmt || fsize->index != 0)
> +		return -EINVAL;
> +
> +	bpp = ALIGN(fmt->bpp, 8);
> +
> +	/*
> +	 * Number of pixels in one PSI-L word. The transfer happens in multiples
> +	 * of PSI-L word sizes.
> +	 */
> +	pixels_in_word = PSIL_WORD_SIZE_BYTES * 8 / bpp;
> +
> +	fsize->type = V4L2_FRMSIZE_TYPE_STEPWISE;
> +	fsize->stepwise.min_width = pixels_in_word;
> +	fsize->stepwise.max_width = rounddown(MAX_WIDTH_BYTES * 8 / bpp,
> +					      pixels_in_word);
> +	fsize->stepwise.step_width = pixels_in_word;
> +	fsize->stepwise.min_height = 1;
> +	fsize->stepwise.max_height = MAX_HEIGHT_LINES;
> +	fsize->stepwise.step_height = 1;
> +
> +	return 0;
> +}
> +
> +static const struct v4l2_ioctl_ops csi_ioctl_ops = {
> +	.vidioc_querycap      = ti_csi2rx_querycap,
> +	.vidioc_enum_fmt_vid_cap = ti_csi2rx_enum_fmt_vid_cap,
> +	.vidioc_try_fmt_vid_cap = ti_csi2rx_try_fmt_vid_cap,
> +	.vidioc_g_fmt_vid_cap = ti_csi2rx_g_fmt_vid_cap,
> +	.vidioc_s_fmt_vid_cap = ti_csi2rx_s_fmt_vid_cap,
> +	.vidioc_enum_framesizes = ti_csi2rx_enum_framesizes,
> +	.vidioc_reqbufs       = vb2_ioctl_reqbufs,
> +	.vidioc_create_bufs   = vb2_ioctl_create_bufs,
> +	.vidioc_prepare_buf   = vb2_ioctl_prepare_buf,
> +	.vidioc_querybuf      = vb2_ioctl_querybuf,
> +	.vidioc_qbuf          = vb2_ioctl_qbuf,
> +	.vidioc_dqbuf         = vb2_ioctl_dqbuf,
> +	.vidioc_expbuf        = vb2_ioctl_expbuf,
> +	.vidioc_streamon      = vb2_ioctl_streamon,
> +	.vidioc_streamoff     = vb2_ioctl_streamoff,
> +};
> +
> +static const struct v4l2_file_operations csi_fops = {
> +	.owner = THIS_MODULE,
> +	.open = v4l2_fh_open,
> +	.release = vb2_fop_release,
> +	.read = vb2_fop_read,
> +	.poll = vb2_fop_poll,
> +	.unlocked_ioctl = video_ioctl2,
> +	.mmap = vb2_fop_mmap,
> +};
> +
> +static int csi_async_notifier_bound(struct v4l2_async_notifier *notifier,
> +				    struct v4l2_subdev *subdev,
> +				    struct v4l2_async_connection *asc)
> +{
> +	struct ti_csi2rx_dev *csi = dev_get_drvdata(notifier->v4l2_dev->dev);
> +
> +	csi->source = subdev;
> +
> +	return 0;
> +}
> +
> +static int csi_async_notifier_complete(struct v4l2_async_notifier *notifier)
> +{
> +	struct ti_csi2rx_dev *csi = dev_get_drvdata(notifier->v4l2_dev->dev);
> +	struct video_device *vdev = &csi->vdev;
> +	int ret;
> +
> +	ret = video_register_device(vdev, VFL_TYPE_VIDEO, -1);
> +	if (ret)
> +		return ret;
> +
> +	ret = v4l2_create_fwnode_links_to_pad(csi->source, &csi->pad,
> +					      MEDIA_LNK_FL_IMMUTABLE | MEDIA_LNK_FL_ENABLED);
> +
> +	if (ret) {
> +		video_unregister_device(vdev);
> +		return ret;
> +	}
> +
> +	return v4l2_device_register_subdev_nodes(&csi->v4l2_dev);
> +}
> +
> +static const struct v4l2_async_notifier_operations csi_async_notifier_ops = {
> +	.bound = csi_async_notifier_bound,
> +	.complete = csi_async_notifier_complete,
> +};
> +
> +static int ti_csi2rx_notifier_register(struct ti_csi2rx_dev *csi)
> +{
> +	struct fwnode_handle *fwnode;
> +	struct v4l2_async_connection *asc;
> +	struct device_node *node;
> +	int ret;
> +
> +	node = of_get_child_by_name(csi->dev->of_node, "csi-bridge");
> +	if (!node)
> +		return -EINVAL;
> +
> +	fwnode = of_fwnode_handle(node);
> +	if (!fwnode) {
> +		of_node_put(node);
> +		return -EINVAL;
> +	}
> +
> +	v4l2_async_nf_init(&csi->notifier, &csi->v4l2_dev);
> +	csi->notifier.ops = &csi_async_notifier_ops;
> +
> +	asc = v4l2_async_nf_add_fwnode(&csi->notifier, fwnode,
> +				       struct v4l2_async_connection);
> +	of_node_put(node);
> +	if (IS_ERR(asc)) {
> +		v4l2_async_nf_cleanup(&csi->notifier);
> +		return PTR_ERR(asc);
> +	}
> +
> +	ret = v4l2_async_nf_register(&csi->notifier);
> +	if (ret) {
> +		v4l2_async_nf_cleanup(&csi->notifier);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int ti_csi2rx_setup_shim(struct ti_csi2rx_dev *csi)
> +{
> +	const struct ti_csi2rx_fmt *fmt;
> +	unsigned int reg;
> +
> +	fmt = find_format_by_pix(csi->v_fmt.fmt.pix.pixelformat);
> +	if (!fmt) {
> +		dev_err(csi->dev, "Pixelformat 0x%x is not supported\n",
> +			csi->v_fmt.fmt.pix.pixelformat);
> +		return -EINVAL;
> +	}
> +
> +	/* De-assert the pixel interface reset. */
> +	reg = SHIM_CNTL_PIX_RST;
> +	writel(reg, csi->shim + SHIM_CNTL);
> +
> +	reg = SHIM_DMACNTX_EN;
> +	reg |= FIELD_PREP(SHIM_DMACNTX_FMT, fmt->csi_dt);
> +
> +	/*
> +	 * Using the values from the documentation gives incorrect ordering for
> +	 * the luma and chroma components. In practice, the "reverse" format
> +	 * gives the correct image. So for example, if the image is in UYVY, the
> +	 * reverse would be YVYU.
> +	 */
> +	switch (fmt->fourcc) {
> +	case V4L2_PIX_FMT_UYVY:
> +		reg |= FIELD_PREP(SHIM_DMACNTX_YUV422,
> +					SHIM_DMACNTX_YVYU);
> +		break;
> +	case V4L2_PIX_FMT_VYUY:
> +		reg |= FIELD_PREP(SHIM_DMACNTX_YUV422,
> +					SHIM_DMACNTX_YUYV);
> +		break;
> +	case V4L2_PIX_FMT_YUYV:
> +		reg |= FIELD_PREP(SHIM_DMACNTX_YUV422,
> +					SHIM_DMACNTX_VYUY);
> +		break;
> +	case V4L2_PIX_FMT_YVYU:
> +		reg |= FIELD_PREP(SHIM_DMACNTX_YUV422,
> +					SHIM_DMACNTX_UYVY);
> +		break;
> +	default:
> +		/* Ignore if not YUV 4:2:2 */
> +		break;
> +	}
> +
> +	reg |= FIELD_PREP(SHIM_DMACNTX_SIZE, fmt->size);
> +
> +	writel(reg, csi->shim + SHIM_DMACNTX);
> +
> +	reg = FIELD_PREP(SHIM_PSI_CFG0_SRC_TAG, 0) |
> +	      FIELD_PREP(SHIM_PSI_CFG0_DST_TAG, 0);
> +	writel(reg, csi->shim + SHIM_PSI_CFG0);
> +
> +	return 0;
> +}
> +
> +static void ti_csi2rx_drain_callback(void *param)
> +{
> +	struct completion *drain_complete = param;
> +
> +	complete(drain_complete);
> +}
> +
> +/** Drain the stale data left at the PSI-L endpoint.
> + *
> + * This might happen if no buffers are queued in time but source is still
> + * streaming. Or rarely it may happen while stopping the stream. To prevent

I understand the first one, but when does this happen when stopping the 
stream?

> + * that stale data corrupting the subsequent transactions, it is required to
> + * issue DMA requests to drain it out.
> + */
> +static int ti_csi2rx_drain_dma(struct ti_csi2rx_dev *csi)
> +{
> +	struct dma_async_tx_descriptor *desc;
> +	struct completion drain_complete;
> +	dma_cookie_t cookie;
> +	int ret;
> +
> +	init_completion(&drain_complete);
> +
> +	desc = dmaengine_prep_slave_single(csi->dma.chan, csi->dma.drain.paddr,
> +					   csi->dma.drain.len, DMA_DEV_TO_MEM,
> +					   DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
> +	if (!desc) {
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	desc->callback = ti_csi2rx_drain_callback;
> +	desc->callback_param = &drain_complete;
> +
> +	cookie = dmaengine_submit(desc);
> +	ret = dma_submit_error(cookie);
> +	if (ret)
> +		goto out;
> +
> +	dma_async_issue_pending(csi->dma.chan);
> +
> +	if (!wait_for_completion_timeout(&drain_complete,
> +					 msecs_to_jiffies(DRAIN_TIMEOUT_MS))) {
> +		dmaengine_terminate_sync(csi->dma.chan);
> +		ret = -ETIMEDOUT;
> +		goto out;
> +	}
> +out:
> +	return ret;
> +}
> +
> +static void ti_csi2rx_dma_callback(void *param)
> +{
> +	struct ti_csi2rx_buffer *buf = param;
> +	struct ti_csi2rx_dev *csi = buf->csi;
> +	struct ti_csi2rx_dma *dma = &csi->dma;
> +	unsigned long flags;
> +
> +	/*
> +	 * TODO: Derive the sequence number from the CSI2RX frame number
> +	 * hardware monitor registers.
> +	 */
> +	buf->vb.vb2_buf.timestamp = ktime_get_ns();
> +	buf->vb.sequence = csi->sequence++;
> +
> +	spin_lock_irqsave(&dma->lock, flags);
> +
> +	WARN_ON(!list_is_first(&buf->list, &dma->submitted));
> +	vb2_buffer_done(&buf->vb.vb2_buf, VB2_BUF_STATE_DONE);
> +	list_del(&buf->list);
> +
> +	/* If there are more buffers to process then start their transfer. */
> +	while (!list_empty(&dma->queue)) {
> +		buf = list_entry(dma->queue.next, struct ti_csi2rx_buffer, list);
> +
> +		if (ti_csi2rx_start_dma(csi, buf)) {
> +			dev_err(csi->dev, "Failed to queue the next buffer for DMA\n");
> +			vb2_buffer_done(&buf->vb.vb2_buf, VB2_BUF_STATE_ERROR);
> +		} else {
> +			list_move_tail(&buf->list, &dma->submitted);
> +		}
> +	}
> +
> +	if (list_empty(&dma->submitted))
> +		dma->state = TI_CSI2RX_DMA_IDLE;
> +
> +	spin_unlock_irqrestore(&dma->lock, flags);
> +}
> +
> +static int ti_csi2rx_start_dma(struct ti_csi2rx_dev *csi,
> +			       struct ti_csi2rx_buffer *buf)
> +{
> +	unsigned long addr;
> +	struct dma_async_tx_descriptor *desc;
> +	size_t len = csi->v_fmt.fmt.pix.sizeimage;
> +	dma_cookie_t cookie;
> +	int ret = 0;
> +
> +	addr = vb2_dma_contig_plane_dma_addr(&buf->vb.vb2_buf, 0);
> +	desc = dmaengine_prep_slave_single(csi->dma.chan, addr, len,
> +					   DMA_DEV_TO_MEM,
> +					   DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
> +	if (!desc)
> +		return -EIO;
> +
> +	desc->callback = ti_csi2rx_dma_callback;
> +	desc->callback_param = buf;
> +
> +	cookie = dmaengine_submit(desc);
> +	ret = dma_submit_error(cookie);
> +	if (ret)
> +		return ret;
> +
> +	dma_async_issue_pending(csi->dma.chan);
> +
> +	return 0;
> +}
> +
> +static void ti_csi2rx_cleanup_buffers(struct ti_csi2rx_dev *csi,
> +				      enum vb2_buffer_state buf_state)
> +{
> +	struct ti_csi2rx_dma *dma = &csi->dma;
> +	struct ti_csi2rx_buffer *buf, *tmp;
> +	enum ti_csi2rx_dma_state state;
> +	unsigned long flags;
> +	int ret;
> +
> +	spin_lock_irqsave(&dma->lock, flags);
> +	state = csi->dma.state;
> +	dma->state = TI_CSI2RX_DMA_STOPPED;
> +	spin_unlock_irqrestore(&dma->lock, flags);
> +
> +	if (state != TI_CSI2RX_DMA_STOPPED) {
> +		/*
> +		 * Normal DMA termination sometimes does not clean up pending
> +		 * data on the endpoint.

When is "sometimes"? It's good to be more exact.

> +		 */
> +		ret = ti_csi2rx_drain_dma(csi);
> +		if (ret)
> +			dev_dbg(csi->dev,
> +				"Failed to drain DMA. Next frame might be bogus\n");
> +	}
> +	ret = dmaengine_terminate_sync(csi->dma.chan);
> +	if (ret)
> +		dev_err(csi->dev, "Failed to stop DMA: %d\n", ret);
> +
> +	dma_free_coherent(csi->dev, dma->drain.len,
> +			  dma->drain.vaddr, dma->drain.paddr);
> +	dma->drain.vaddr = NULL;
> +
> +	spin_lock_irqsave(&dma->lock, flags);
> +	list_for_each_entry_safe(buf, tmp, &csi->dma.queue, list) {
> +		list_del(&buf->list);
> +		vb2_buffer_done(&buf->vb.vb2_buf, buf_state);
> +	}
> +	list_for_each_entry_safe(buf, tmp, &csi->dma.submitted, list) {
> +		list_del(&buf->list);
> +		vb2_buffer_done(&buf->vb.vb2_buf, buf_state);
> +	}
> +	spin_unlock_irqrestore(&dma->lock, flags);
> +}
> +
> +static int ti_csi2rx_queue_setup(struct vb2_queue *q, unsigned int *nbuffers,
> +				 unsigned int *nplanes, unsigned int sizes[],
> +				 struct device *alloc_devs[])
> +{
> +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(q);
> +	unsigned int size = csi->v_fmt.fmt.pix.sizeimage;
> +
> +	if (*nplanes) {
> +		if (sizes[0] < size)
> +			return -EINVAL;
> +		size = sizes[0];
> +	}
> +
> +	*nplanes = 1;
> +	sizes[0] = size;
> +
> +	return 0;
> +}
> +
> +static int ti_csi2rx_buffer_prepare(struct vb2_buffer *vb)
> +{
> +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vb->vb2_queue);
> +	unsigned long size = csi->v_fmt.fmt.pix.sizeimage;
> +
> +	if (vb2_plane_size(vb, 0) < size) {
> +		dev_err(csi->dev, "Data will not fit into plane\n");
> +		return -EINVAL;
> +	}
> +
> +	vb2_set_plane_payload(vb, 0, size);
> +	return 0;
> +}
> +
> +static void ti_csi2rx_buffer_queue(struct vb2_buffer *vb)
> +{
> +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vb->vb2_queue);
> +	struct ti_csi2rx_buffer *buf;
> +	struct ti_csi2rx_dma *dma = &csi->dma;
> +	bool restart_dma = false;
> +	unsigned long flags = 0;
> +	int ret;
> +
> +	buf = container_of(vb, struct ti_csi2rx_buffer, vb.vb2_buf);
> +	buf->csi = csi;
> +
> +	spin_lock_irqsave(&dma->lock, flags);
> +	/*
> +	 * Usually the DMA callback takes care of queueing the pending buffers.
> +	 * But if DMA has stalled due to lack of buffers, restart it now.
> +	 */
> +	if (dma->state == TI_CSI2RX_DMA_IDLE) {
> +		/*
> +		 * Do not restart DMA with the lock held because
> +		 * ti_csi2rx_drain_dma() might block for completion.
> +		 * There won't be a race on queueing DMA anyway since the
> +		 * callback is not being fired.
> +		 */
> +		restart_dma = true;
> +		dma->state = TI_CSI2RX_DMA_ACTIVE;
> +	} else {
> +		list_add_tail(&buf->list, &dma->queue);
> +	}
> +	spin_unlock_irqrestore(&dma->lock, flags);
> +
> +	if (restart_dma) {
> +		/*
> +		 * Once frames start dropping, some data gets stuck in the DMA
> +		 * pipeline somewhere. So the first DMA transfer after frame
> +		 * drops gives a partial frame. This is obviously not useful to
> +		 * the application and will only confuse it. Issue a DMA
> +		 * transaction to drain that up.
> +		 */
> +		ret = ti_csi2rx_drain_dma(csi);
> +		if (ret)
> +			dev_warn(csi->dev,
> +				 "Failed to drain DMA. Next frame might be bogus\n");
> +
> +		ret = ti_csi2rx_start_dma(csi, buf);
> +		if (ret) {
> +			dev_err(csi->dev, "Failed to start DMA: %d\n", ret);
> +			spin_lock_irqsave(&dma->lock, flags);
> +			vb2_buffer_done(&buf->vb.vb2_buf, VB2_BUF_STATE_ERROR);
> +			dma->state = TI_CSI2RX_DMA_IDLE;
> +			spin_unlock_irqrestore(&dma->lock, flags);
> +		} else {
> +			spin_lock_irqsave(&dma->lock, flags);
> +			list_add_tail(&buf->list, &dma->submitted);
> +			spin_unlock_irqrestore(&dma->lock, flags);
> +		}
> +	}
> +}
> +
> +static int ti_csi2rx_start_streaming(struct vb2_queue *vq, unsigned int count)
> +{
> +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vq);
> +	struct ti_csi2rx_dma *dma = &csi->dma;
> +	struct ti_csi2rx_buffer *buf;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	spin_lock_irqsave(&dma->lock, flags);
> +	if (list_empty(&dma->queue))
> +		ret = -EIO;
> +	spin_unlock_irqrestore(&dma->lock, flags);
> +	if (ret)
> +		return ret;
> +
> +	dma->drain.len = csi->v_fmt.fmt.pix.sizeimage;
> +	dma->drain.vaddr = dma_alloc_coherent(csi->dev, dma->drain.len,
> +					      &dma->drain.paddr, GFP_KERNEL);
> +	if (!dma->drain.vaddr)
> +		return -ENOMEM;

This is still allocating a large buffer every time streaming is started 
(and with streams support, a separate buffer for each stream?).

Did you check if the TI DMA can do writes to a constant address? That 
would be the best option, as then the whole buffer allocation problem 
goes away.

Alternatively, can you flush the buffers with multiple one line 
transfers? The flushing shouldn't be performance critical, so even if 
that's slower than a normal full-frame DMA, it shouldn't matter much. 
And if that can be done, a single probe time line-buffer allocation 
should do the trick.

Other than this drain buffer topic, I think this looks fine. So, I'm 
going to give Rb, but I do encourage you to look more into optimizing 
this drain buffer.

Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>

  Tomi

Jai Luthra Aug. 18, 2023, 10:25 a.m. UTC | #2

Hi Tomi,

Thanks for the review.

On Aug 15, 2023 at 16:00:51 +0300, Tomi Valkeinen wrote:
> On 11/08/2023 13:47, Jai Luthra wrote:
> > From: Pratyush Yadav <p.yadav@ti.com>
> > 

...

> > +
> > +static void ti_csi2rx_drain_callback(void *param)
> > +{
> > +	struct completion *drain_complete = param;
> > +
> > +	complete(drain_complete);
> > +}
> > +
> > +/** Drain the stale data left at the PSI-L endpoint.
> > + *
> > + * This might happen if no buffers are queued in time but source is still
> > + * streaming. Or rarely it may happen while stopping the stream. To prevent
> 
> I understand the first one, but when does this happen when stopping the
> stream?
> 

When multi-stream support is enabled the module-level pixel reset for 
cannot be done when stopping a single stream, in which case some 
in-flight data is left at the PSI-L endpoint despite enforcing the 
DMACNTX reset before. The same was true till v7 of this series as well, 
due to the module-level pixel reset being done in the wrong order 
(before stopping stream on the source).

Not sure if this will happen for single-stream usecases now (since v8)

I will fix this and other comments when I post subsequent patches for 
multi-stream.

> > + * that stale data corrupting the subsequent transactions, it is required to
> > + * issue DMA requests to drain it out.
> > + */
> > +static int ti_csi2rx_drain_dma(struct ti_csi2rx_dev *csi)
> > +{
> > +	struct dma_async_tx_descriptor *desc;
> > +	struct completion drain_complete;
> > +	dma_cookie_t cookie;
> > +	int ret;
> > +
> > +	init_completion(&drain_complete);
> > +
> > +	desc = dmaengine_prep_slave_single(csi->dma.chan, csi->dma.drain.paddr,
> > +					   csi->dma.drain.len, DMA_DEV_TO_MEM,
> > +					   DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
> > +	if (!desc) {
> > +		ret = -EIO;
> > +		goto out;
> > +	}
> > +
> > +	desc->callback = ti_csi2rx_drain_callback;
> > +	desc->callback_param = &drain_complete;
> > +
> > +	cookie = dmaengine_submit(desc);
> > +	ret = dma_submit_error(cookie);
> > +	if (ret)
> > +		goto out;
> > +
> > +	dma_async_issue_pending(csi->dma.chan);
> > +
> > +	if (!wait_for_completion_timeout(&drain_complete,
> > +					 msecs_to_jiffies(DRAIN_TIMEOUT_MS))) {
> > +		dmaengine_terminate_sync(csi->dma.chan);
> > +		ret = -ETIMEDOUT;
> > +		goto out;
> > +	}
> > +out:
> > +	return ret;
> > +}
> > +
> > +static void ti_csi2rx_dma_callback(void *param)
> > +{
> > +	struct ti_csi2rx_buffer *buf = param;
> > +	struct ti_csi2rx_dev *csi = buf->csi;
> > +	struct ti_csi2rx_dma *dma = &csi->dma;
> > +	unsigned long flags;
> > +
> > +	/*
> > +	 * TODO: Derive the sequence number from the CSI2RX frame number
> > +	 * hardware monitor registers.
> > +	 */
> > +	buf->vb.vb2_buf.timestamp = ktime_get_ns();
> > +	buf->vb.sequence = csi->sequence++;
> > +
> > +	spin_lock_irqsave(&dma->lock, flags);
> > +
> > +	WARN_ON(!list_is_first(&buf->list, &dma->submitted));
> > +	vb2_buffer_done(&buf->vb.vb2_buf, VB2_BUF_STATE_DONE);
> > +	list_del(&buf->list);
> > +
> > +	/* If there are more buffers to process then start their transfer. */
> > +	while (!list_empty(&dma->queue)) {
> > +		buf = list_entry(dma->queue.next, struct ti_csi2rx_buffer, list);
> > +
> > +		if (ti_csi2rx_start_dma(csi, buf)) {
> > +			dev_err(csi->dev, "Failed to queue the next buffer for DMA\n");
> > +			vb2_buffer_done(&buf->vb.vb2_buf, VB2_BUF_STATE_ERROR);
> > +		} else {
> > +			list_move_tail(&buf->list, &dma->submitted);
> > +		}
> > +	}
> > +
> > +	if (list_empty(&dma->submitted))
> > +		dma->state = TI_CSI2RX_DMA_IDLE;
> > +
> > +	spin_unlock_irqrestore(&dma->lock, flags);
> > +}
> > +
> > +static int ti_csi2rx_start_dma(struct ti_csi2rx_dev *csi,
> > +			       struct ti_csi2rx_buffer *buf)
> > +{
> > +	unsigned long addr;
> > +	struct dma_async_tx_descriptor *desc;
> > +	size_t len = csi->v_fmt.fmt.pix.sizeimage;
> > +	dma_cookie_t cookie;
> > +	int ret = 0;
> > +
> > +	addr = vb2_dma_contig_plane_dma_addr(&buf->vb.vb2_buf, 0);
> > +	desc = dmaengine_prep_slave_single(csi->dma.chan, addr, len,
> > +					   DMA_DEV_TO_MEM,
> > +					   DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
> > +	if (!desc)
> > +		return -EIO;
> > +
> > +	desc->callback = ti_csi2rx_dma_callback;
> > +	desc->callback_param = buf;
> > +
> > +	cookie = dmaengine_submit(desc);
> > +	ret = dma_submit_error(cookie);
> > +	if (ret)
> > +		return ret;
> > +
> > +	dma_async_issue_pending(csi->dma.chan);
> > +
> > +	return 0;
> > +}
> > +
> > +static void ti_csi2rx_cleanup_buffers(struct ti_csi2rx_dev *csi,
> > +				      enum vb2_buffer_state buf_state)
> > +{
> > +	struct ti_csi2rx_dma *dma = &csi->dma;
> > +	struct ti_csi2rx_buffer *buf, *tmp;
> > +	enum ti_csi2rx_dma_state state;
> > +	unsigned long flags;
> > +	int ret;
> > +
> > +	spin_lock_irqsave(&dma->lock, flags);
> > +	state = csi->dma.state;
> > +	dma->state = TI_CSI2RX_DMA_STOPPED;
> > +	spin_unlock_irqrestore(&dma->lock, flags);
> > +
> > +	if (state != TI_CSI2RX_DMA_STOPPED) {
> > +		/*
> > +		 * Normal DMA termination sometimes does not clean up pending
> > +		 * data on the endpoint.
> 
> When is "sometimes"? It's good to be more exact.
> 
> > +		 */
> > +		ret = ti_csi2rx_drain_dma(csi);
> > +		if (ret)
> > +			dev_dbg(csi->dev,
> > +				"Failed to drain DMA. Next frame might be bogus\n");
> > +	}
> > +	ret = dmaengine_terminate_sync(csi->dma.chan);
> > +	if (ret)
> > +		dev_err(csi->dev, "Failed to stop DMA: %d\n", ret);
> > +
> > +	dma_free_coherent(csi->dev, dma->drain.len,
> > +			  dma->drain.vaddr, dma->drain.paddr);
> > +	dma->drain.vaddr = NULL;
> > +
> > +	spin_lock_irqsave(&dma->lock, flags);
> > +	list_for_each_entry_safe(buf, tmp, &csi->dma.queue, list) {
> > +		list_del(&buf->list);
> > +		vb2_buffer_done(&buf->vb.vb2_buf, buf_state);
> > +	}
> > +	list_for_each_entry_safe(buf, tmp, &csi->dma.submitted, list) {
> > +		list_del(&buf->list);
> > +		vb2_buffer_done(&buf->vb.vb2_buf, buf_state);
> > +	}
> > +	spin_unlock_irqrestore(&dma->lock, flags);
> > +}
> > +
> > +static int ti_csi2rx_queue_setup(struct vb2_queue *q, unsigned int *nbuffers,
> > +				 unsigned int *nplanes, unsigned int sizes[],
> > +				 struct device *alloc_devs[])
> > +{
> > +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(q);
> > +	unsigned int size = csi->v_fmt.fmt.pix.sizeimage;
> > +
> > +	if (*nplanes) {
> > +		if (sizes[0] < size)
> > +			return -EINVAL;
> > +		size = sizes[0];
> > +	}
> > +
> > +	*nplanes = 1;
> > +	sizes[0] = size;
> > +
> > +	return 0;
> > +}
> > +
> > +static int ti_csi2rx_buffer_prepare(struct vb2_buffer *vb)
> > +{
> > +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vb->vb2_queue);
> > +	unsigned long size = csi->v_fmt.fmt.pix.sizeimage;
> > +
> > +	if (vb2_plane_size(vb, 0) < size) {
> > +		dev_err(csi->dev, "Data will not fit into plane\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	vb2_set_plane_payload(vb, 0, size);
> > +	return 0;
> > +}
> > +
> > +static void ti_csi2rx_buffer_queue(struct vb2_buffer *vb)
> > +{
> > +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vb->vb2_queue);
> > +	struct ti_csi2rx_buffer *buf;
> > +	struct ti_csi2rx_dma *dma = &csi->dma;
> > +	bool restart_dma = false;
> > +	unsigned long flags = 0;
> > +	int ret;
> > +
> > +	buf = container_of(vb, struct ti_csi2rx_buffer, vb.vb2_buf);
> > +	buf->csi = csi;
> > +
> > +	spin_lock_irqsave(&dma->lock, flags);
> > +	/*
> > +	 * Usually the DMA callback takes care of queueing the pending buffers.
> > +	 * But if DMA has stalled due to lack of buffers, restart it now.
> > +	 */
> > +	if (dma->state == TI_CSI2RX_DMA_IDLE) {
> > +		/*
> > +		 * Do not restart DMA with the lock held because
> > +		 * ti_csi2rx_drain_dma() might block for completion.
> > +		 * There won't be a race on queueing DMA anyway since the
> > +		 * callback is not being fired.
> > +		 */
> > +		restart_dma = true;
> > +		dma->state = TI_CSI2RX_DMA_ACTIVE;
> > +	} else {
> > +		list_add_tail(&buf->list, &dma->queue);
> > +	}
> > +	spin_unlock_irqrestore(&dma->lock, flags);
> > +
> > +	if (restart_dma) {
> > +		/*
> > +		 * Once frames start dropping, some data gets stuck in the DMA
> > +		 * pipeline somewhere. So the first DMA transfer after frame
> > +		 * drops gives a partial frame. This is obviously not useful to
> > +		 * the application and will only confuse it. Issue a DMA
> > +		 * transaction to drain that up.
> > +		 */
> > +		ret = ti_csi2rx_drain_dma(csi);
> > +		if (ret)
> > +			dev_warn(csi->dev,
> > +				 "Failed to drain DMA. Next frame might be bogus\n");
> > +
> > +		ret = ti_csi2rx_start_dma(csi, buf);
> > +		if (ret) {
> > +			dev_err(csi->dev, "Failed to start DMA: %d\n", ret);
> > +			spin_lock_irqsave(&dma->lock, flags);
> > +			vb2_buffer_done(&buf->vb.vb2_buf, VB2_BUF_STATE_ERROR);
> > +			dma->state = TI_CSI2RX_DMA_IDLE;
> > +			spin_unlock_irqrestore(&dma->lock, flags);
> > +		} else {
> > +			spin_lock_irqsave(&dma->lock, flags);
> > +			list_add_tail(&buf->list, &dma->submitted);
> > +			spin_unlock_irqrestore(&dma->lock, flags);
> > +		}
> > +	}
> > +}
> > +
> > +static int ti_csi2rx_start_streaming(struct vb2_queue *vq, unsigned int count)
> > +{
> > +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vq);
> > +	struct ti_csi2rx_dma *dma = &csi->dma;
> > +	struct ti_csi2rx_buffer *buf;
> > +	unsigned long flags;
> > +	int ret = 0;
> > +
> > +	spin_lock_irqsave(&dma->lock, flags);
> > +	if (list_empty(&dma->queue))
> > +		ret = -EIO;
> > +	spin_unlock_irqrestore(&dma->lock, flags);
> > +	if (ret)
> > +		return ret;
> > +
> > +	dma->drain.len = csi->v_fmt.fmt.pix.sizeimage;
> > +	dma->drain.vaddr = dma_alloc_coherent(csi->dev, dma->drain.len,
> > +					      &dma->drain.paddr, GFP_KERNEL);
> > +	if (!dma->drain.vaddr)
> > +		return -ENOMEM;
> 
> This is still allocating a large buffer every time streaming is started (and
> with streams support, a separate buffer for each stream?).
> 
> Did you check if the TI DMA can do writes to a constant address? That would
> be the best option, as then the whole buffer allocation problem goes away.
> 

I checked with Vignesh, the hardware can support a scenario where we 
flush out all the data without allocating a buffer, but I couldn't find 
a way to signal that via the current dmaengine framework APIs. Will look 
into it further as it will be important for multi-stream support.

> Alternatively, can you flush the buffers with multiple one line transfers?
> The flushing shouldn't be performance critical, so even if that's slower
> than a normal full-frame DMA, it shouldn't matter much. And if that can be
> done, a single probe time line-buffer allocation should do the trick.

There will be considerable overhead if we queue many DMA transactions 
(in the order of 1000s or even 100s), which might not be okay for the 
scenarios where we have to drain mid-stream. Will have to run some 
experiments to see if that is worth it.

But one optimization we can for sure do is re-use a single drain buffer 
for all the streams. We will need to ensure to re-allocate the buffer 
for the "largest" framesize supported across the different streams at 
stream-on time.

My guess is the endpoint is not buffering a full-frame's worth of data, 
I will also check if we can upper bound that size to something feasible.

> 
> Other than this drain buffer topic, I think this looks fine. So, I'm going
> to give Rb, but I do encourage you to look more into optimizing this drain
> buffer.

Thank you!

> 
> Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
> 
>  Tomi
>

Laurent Pinchart Aug. 29, 2023, 3:55 p.m. UTC | #3

Hi Jai,

(CC'ing Vinod, the maintainer of the DMA engine subsystem, for a
question below)

On Fri, Aug 18, 2023 at 03:55:06PM +0530, Jai Luthra wrote:
> On Aug 15, 2023 at 16:00:51 +0300, Tomi Valkeinen wrote:
> > On 11/08/2023 13:47, Jai Luthra wrote:
> > > From: Pratyush Yadav <p.yadav@ti.com>

[snip]

> > > +static int ti_csi2rx_start_streaming(struct vb2_queue *vq, unsigned int count)
> > > +{
> > > +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vq);
> > > +	struct ti_csi2rx_dma *dma = &csi->dma;
> > > +	struct ti_csi2rx_buffer *buf;
> > > +	unsigned long flags;
> > > +	int ret = 0;
> > > +
> > > +	spin_lock_irqsave(&dma->lock, flags);
> > > +	if (list_empty(&dma->queue))
> > > +		ret = -EIO;
> > > +	spin_unlock_irqrestore(&dma->lock, flags);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	dma->drain.len = csi->v_fmt.fmt.pix.sizeimage;
> > > +	dma->drain.vaddr = dma_alloc_coherent(csi->dev, dma->drain.len,
> > > +					      &dma->drain.paddr, GFP_KERNEL);
> > > +	if (!dma->drain.vaddr)
> > > +		return -ENOMEM;
> > 
> > This is still allocating a large buffer every time streaming is started (and
> > with streams support, a separate buffer for each stream?).
> > 
> > Did you check if the TI DMA can do writes to a constant address? That would
> > be the best option, as then the whole buffer allocation problem goes away.
> 
> I checked with Vignesh, the hardware can support a scenario where we 
> flush out all the data without allocating a buffer, but I couldn't find 
> a way to signal that via the current dmaengine framework APIs. Will look 
> into it further as it will be important for multi-stream support.

That would be the best option. It's not immediately apparent to me if
the DMA engine API supports such a use case.
dmaengine_prep_interleaved_dma() gives you finer grain control on the
source and destination increments, but I haven't seen a way to instruct
the DMA engine to direct writes to /dev/null (so to speak). Vinod, is
this something that is supported, or could be supported ?

> > Alternatively, can you flush the buffers with multiple one line transfers?
> > The flushing shouldn't be performance critical, so even if that's slower
> > than a normal full-frame DMA, it shouldn't matter much. And if that can be
> > done, a single probe time line-buffer allocation should do the trick.
> 
> There will be considerable overhead if we queue many DMA transactions 
> (in the order of 1000s or even 100s), which might not be okay for the 
> scenarios where we have to drain mid-stream. Will have to run some 
> experiments to see if that is worth it.
> 
> But one optimization we can for sure do is re-use a single drain buffer 
> for all the streams. We will need to ensure to re-allocate the buffer 
> for the "largest" framesize supported across the different streams at 
> stream-on time.

If you implement .device_prep_interleaved_dma() in the DMA engine driver
you could write to a single line buffer, assuming that the hardware would
support so in a generic way.

> My guess is the endpoint is not buffering a full-frame's worth of data, 
> I will also check if we can upper bound that size to something feasible.
> 
> > Other than this drain buffer topic, I think this looks fine. So, I'm going
> > to give Rb, but I do encourage you to look more into optimizing this drain
> > buffer.
> 
> Thank you!
> 
> > Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>

Laurent Pinchart Oct. 4, 2023, 8:03 p.m. UTC | #4

On Wed, Oct 04, 2023 at 07:21:00PM +0530, Vinod Koul wrote:
> On 29-08-23, 18:55, Laurent Pinchart wrote:
> > Hi Jai,
> > 
> > (CC'ing Vinod, the maintainer of the DMA engine subsystem, for a
> > question below)
> 
> Sorry this got lost

No worries.

> > On Fri, Aug 18, 2023 at 03:55:06PM +0530, Jai Luthra wrote:
> > > On Aug 15, 2023 at 16:00:51 +0300, Tomi Valkeinen wrote:
> > > > On 11/08/2023 13:47, Jai Luthra wrote:
> > > > > From: Pratyush Yadav <p.yadav@ti.com>
> > 
> > [snip]
> > 
> > > > > +static int ti_csi2rx_start_streaming(struct vb2_queue *vq, unsigned int count)
> > > > > +{
> > > > > +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vq);
> > > > > +	struct ti_csi2rx_dma *dma = &csi->dma;
> > > > > +	struct ti_csi2rx_buffer *buf;
> > > > > +	unsigned long flags;
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	spin_lock_irqsave(&dma->lock, flags);
> > > > > +	if (list_empty(&dma->queue))
> > > > > +		ret = -EIO;
> > > > > +	spin_unlock_irqrestore(&dma->lock, flags);
> > > > > +	if (ret)
> > > > > +		return ret;
> > > > > +
> > > > > +	dma->drain.len = csi->v_fmt.fmt.pix.sizeimage;
> > > > > +	dma->drain.vaddr = dma_alloc_coherent(csi->dev, dma->drain.len,
> > > > > +					      &dma->drain.paddr, GFP_KERNEL);
> > > > > +	if (!dma->drain.vaddr)
> > > > > +		return -ENOMEM;
> > > > 
> > > > This is still allocating a large buffer every time streaming is started (and
> > > > with streams support, a separate buffer for each stream?).
> > > > 
> > > > Did you check if the TI DMA can do writes to a constant address? That would
> > > > be the best option, as then the whole buffer allocation problem goes away.
> > > 
> > > I checked with Vignesh, the hardware can support a scenario where we 
> > > flush out all the data without allocating a buffer, but I couldn't find 
> > > a way to signal that via the current dmaengine framework APIs. Will look 
> > > into it further as it will be important for multi-stream support.
> > 
> > That would be the best option. It's not immediately apparent to me if
> > the DMA engine API supports such a use case.
> > dmaengine_prep_interleaved_dma() gives you finer grain control on the
> > source and destination increments, but I haven't seen a way to instruct
> > the DMA engine to direct writes to /dev/null (so to speak). Vinod, is
> > this something that is supported, or could be supported ?
> 
> Write to a dummy buffer could have the same behaviour, no?

Yes, but if the DMA engine can write to /dev/null, that avoids
allocating a dummy buffer, which is nicer. For video use cases, dummy
buffers are often large.

> > > > Alternatively, can you flush the buffers with multiple one line transfers?
> > > > The flushing shouldn't be performance critical, so even if that's slower
> > > > than a normal full-frame DMA, it shouldn't matter much. And if that can be
> > > > done, a single probe time line-buffer allocation should do the trick.
> > > 
> > > There will be considerable overhead if we queue many DMA transactions 
> > > (in the order of 1000s or even 100s), which might not be okay for the 
> > > scenarios where we have to drain mid-stream. Will have to run some 
> > > experiments to see if that is worth it.
> > > 
> > > But one optimization we can for sure do is re-use a single drain buffer 
> > > for all the streams. We will need to ensure to re-allocate the buffer 
> > > for the "largest" framesize supported across the different streams at 
> > > stream-on time.
> > 
> > If you implement .device_prep_interleaved_dma() in the DMA engine driver
> > you could write to a single line buffer, assuming that the hardware would
> > support so in a generic way.
> > 
> > > My guess is the endpoint is not buffering a full-frame's worth of data, 
> > > I will also check if we can upper bound that size to something feasible.
> > > 
> > > > Other than this drain buffer topic, I think this looks fine. So, I'm going
> > > > to give Rb, but I do encourage you to look more into optimizing this drain
> > > > buffer.
> > > 
> > > Thank you!
> > > 
> > > > Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>

Vinod Koul Oct. 5, 2023, 4:10 a.m. UTC | #5

On 04-10-23, 23:03, Laurent Pinchart wrote:
> On Wed, Oct 04, 2023 at 07:21:00PM +0530, Vinod Koul wrote:
> > On 29-08-23, 18:55, Laurent Pinchart wrote:
> > > Hi Jai,
> > > 
> > > (CC'ing Vinod, the maintainer of the DMA engine subsystem, for a
> > > question below)
> > 
> > Sorry this got lost
> 
> No worries.
> 
> > > On Fri, Aug 18, 2023 at 03:55:06PM +0530, Jai Luthra wrote:
> > > > On Aug 15, 2023 at 16:00:51 +0300, Tomi Valkeinen wrote:
> > > > > On 11/08/2023 13:47, Jai Luthra wrote:
> > > > > > From: Pratyush Yadav <p.yadav@ti.com>
> > > 
> > > [snip]
> > > 
> > > > > > +static int ti_csi2rx_start_streaming(struct vb2_queue *vq, unsigned int count)
> > > > > > +{
> > > > > > +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vq);
> > > > > > +	struct ti_csi2rx_dma *dma = &csi->dma;
> > > > > > +	struct ti_csi2rx_buffer *buf;
> > > > > > +	unsigned long flags;
> > > > > > +	int ret = 0;
> > > > > > +
> > > > > > +	spin_lock_irqsave(&dma->lock, flags);
> > > > > > +	if (list_empty(&dma->queue))
> > > > > > +		ret = -EIO;
> > > > > > +	spin_unlock_irqrestore(&dma->lock, flags);
> > > > > > +	if (ret)
> > > > > > +		return ret;
> > > > > > +
> > > > > > +	dma->drain.len = csi->v_fmt.fmt.pix.sizeimage;
> > > > > > +	dma->drain.vaddr = dma_alloc_coherent(csi->dev, dma->drain.len,
> > > > > > +					      &dma->drain.paddr, GFP_KERNEL);
> > > > > > +	if (!dma->drain.vaddr)
> > > > > > +		return -ENOMEM;
> > > > > 
> > > > > This is still allocating a large buffer every time streaming is started (and
> > > > > with streams support, a separate buffer for each stream?).
> > > > > 
> > > > > Did you check if the TI DMA can do writes to a constant address? That would
> > > > > be the best option, as then the whole buffer allocation problem goes away.
> > > > 
> > > > I checked with Vignesh, the hardware can support a scenario where we 
> > > > flush out all the data without allocating a buffer, but I couldn't find 
> > > > a way to signal that via the current dmaengine framework APIs. Will look 
> > > > into it further as it will be important for multi-stream support.
> > > 
> > > That would be the best option. It's not immediately apparent to me if
> > > the DMA engine API supports such a use case.
> > > dmaengine_prep_interleaved_dma() gives you finer grain control on the
> > > source and destination increments, but I haven't seen a way to instruct
> > > the DMA engine to direct writes to /dev/null (so to speak). Vinod, is
> > > this something that is supported, or could be supported ?
> > 
> > Write to a dummy buffer could have the same behaviour, no?
> 
> Yes, but if the DMA engine can write to /dev/null, that avoids
> allocating a dummy buffer, which is nicer. For video use cases, dummy
> buffers are often large.

hmmm maybe I haven't comprehended it full, would you mind explaining the
details on how such a potential interleaved transfer would look like so
that we can model it or change apis to model this

Jai Luthra Oct. 5, 2023, 8:34 a.m. UTC | #6

Hi Laurent,

Thanks for the review, and apologies for being late on getting back on 
this.

On Aug 29, 2023 at 19:44:57 +0300, Laurent Pinchart wrote:
> Hi Jai,
> 
> Thank you for the patch.
> 
> On Fri, Aug 11, 2023 at 04:17:35PM +0530, Jai Luthra wrote:
> > From: Pratyush Yadav <p.yadav@ti.com>
> > 
> > TI's J721E uses the Cadence CSI2RX and DPHY peripherals to facilitate
> > capture over a CSI-2 bus.
> > 
> > The Cadence CSI2RX IP acts as a bridge between the TI specific parts and
> > the CSI-2 protocol parts. TI then has a wrapper on top of this bridge
> > called the SHIM layer. It takes in data from stream 0, repacks it, and
> > sends it to memory over PSI-L DMA.
> > 
> > This driver acts as the "front end" to V4L2 client applications. It
> > implements the required ioctls and buffer operations, passes the
> > necessary calls on to the bridge, programs the SHIM layer, and performs
> > DMA via the dmaengine API to finally return the data to a buffer
> > supplied by the application.
> > 
> > Signed-off-by: Pratyush Yadav <p.yadav@ti.com>
> > Co-authored-by: Vaishnav Achath <vaishnav.a@ti.com>
> > Signed-off-by: Vaishnav Achath <vaishnav.a@ti.com>
> > Tested-by: Vaishnav Achath <vaishnav.a@ti.com>
> > Co-authored-by: Jai Luthra <j-luthra@ti.com>
> > Signed-off-by: Jai Luthra <j-luthra@ti.com>
> > ---
> > Changes since v8:
> >     - Allocate drain buffer at start of stream instead of doing it in the
> >       middle, and document why it is needed in comments
> >     - Call subdev's get_fmt directly for link_validation()
> >     - Cleanup height/width clamping and rounding code, document it in comments
> >     - Return and check errors from setup_shim()
> >     - s/subdev/source for cadence csi2rx's v4l2_subdev
> >     - s/ti_csi2rx_init_subdev/ti_csi2rx_notifier_register
> >     - Change copyright year/author list
> > 
> >  MAINTAINERS                                        |    7 +
> >  drivers/media/platform/ti/Kconfig                  |   12 +
> >  drivers/media/platform/ti/Makefile                 |    1 +
> >  drivers/media/platform/ti/j721e-csi2rx/Makefile    |    2 +
> >  .../media/platform/ti/j721e-csi2rx/j721e-csi2rx.c  | 1150 ++++++++++++++++++++
> >  5 files changed, 1172 insertions(+)
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 02a3192195af..959147d6d936 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -21455,6 +21455,13 @@ F:	Documentation/devicetree/bindings/media/i2c/ti,ds90*
> >  F:	drivers/media/i2c/ds90*
> >  F:	include/media/i2c/ds90*
> >  
> > +TI J721E CSI2RX DRIVER
> > +M:	Jai Luthra <j-luthra@ti.com>
> > +L:	linux-media@vger.kernel.org
> > +S:	Maintained
> > +F:	Documentation/devicetree/bindings/media/ti,j721e-csi2rx.yaml
> > +F:	drivers/media/platform/ti/j721e-csi2rx/
> > +
> >  TI KEYSTONE MULTICORE NAVIGATOR DRIVERS
> >  M:	Nishanth Menon <nm@ti.com>
> >  M:	Santosh Shilimkar <ssantosh@kernel.org>
> > diff --git a/drivers/media/platform/ti/Kconfig b/drivers/media/platform/ti/Kconfig
> > index e1ab56c3be1f..42c908f6e1ae 100644
> > --- a/drivers/media/platform/ti/Kconfig
> > +++ b/drivers/media/platform/ti/Kconfig
> > @@ -63,6 +63,18 @@ config VIDEO_TI_VPE_DEBUG
> >  	help
> >  	  Enable debug messages on VPE driver.
> >  
> > +config VIDEO_TI_J721E_CSI2RX
> > +	tristate "TI J721E CSI2RX wrapper layer driver"
> > +	depends on VIDEO_DEV && VIDEO_V4L2_SUBDEV_API
> > +	depends on MEDIA_SUPPORT && MEDIA_CONTROLLER
> > +	depends on PHY_CADENCE_DPHY_RX && VIDEO_CADENCE_CSI2RX
> 
> Is there a compile-time dependency on these, or just runtime ? If it's
> just at runtime, it would be nice to either drop the dependency here, or
> add a (...) || COMPILE_TEST
> 

There isn't a compile-time dependency as such, but yes this IP only 
works with Cadence CSI+DPHY so I will switch it to (...) || 
COMPILE_TEST.

> > +	depends on ARCH_K3 || COMPILE_TEST
> > +	select VIDEOBUF2_DMA_CONTIG
> > +	select V4L2_FWNODE
> > +	help
> > +	  Support for TI CSI2RX wrapper layer. This just enables the wrapper driver.
> > +	  The Cadence CSI2RX bridge driver needs to be enabled separately.
> > +
> >  source "drivers/media/platform/ti/am437x/Kconfig"
> >  source "drivers/media/platform/ti/davinci/Kconfig"
> >  source "drivers/media/platform/ti/omap/Kconfig"
> > diff --git a/drivers/media/platform/ti/Makefile b/drivers/media/platform/ti/Makefile
> > index 98c5fe5c40d6..8a2f74c9380e 100644
> > --- a/drivers/media/platform/ti/Makefile
> > +++ b/drivers/media/platform/ti/Makefile
> > @@ -3,5 +3,6 @@ obj-y += am437x/
> >  obj-y += cal/
> >  obj-y += vpe/
> >  obj-y += davinci/
> > +obj-y += j721e-csi2rx/
> >  obj-y += omap/
> >  obj-y += omap3isp/
> > diff --git a/drivers/media/platform/ti/j721e-csi2rx/Makefile b/drivers/media/platform/ti/j721e-csi2rx/Makefile
> > new file mode 100644
> > index 000000000000..377afc1d6280
> > --- /dev/null
> > +++ b/drivers/media/platform/ti/j721e-csi2rx/Makefile
> > @@ -0,0 +1,2 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +obj-$(CONFIG_VIDEO_TI_J721E_CSI2RX) += j721e-csi2rx.o
> > diff --git a/drivers/media/platform/ti/j721e-csi2rx/j721e-csi2rx.c b/drivers/media/platform/ti/j721e-csi2rx/j721e-csi2rx.c
> > new file mode 100644
> > index 000000000000..301d947f6098
> > --- /dev/null
> > +++ b/drivers/media/platform/ti/j721e-csi2rx/j721e-csi2rx.c
> > @@ -0,0 +1,1150 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * TI CSI2RX Shim Wrapper Driver
> > + *
> > + * Copyright (C) 2023 Texas Instruments Incorporated - https://www.ti.com/
> > + *
> > + * Author: Pratyush Yadav <p.yadav@ti.com>
> > + * Author: Jai Luthra <j-luthra@ti.com>
> > + */
> > +
> > +#include <linux/bitfield.h>
> > +#include <linux/dmaengine.h>
> > +#include <linux/module.h>
> > +#include <linux/of_platform.h>
> > +#include <linux/platform_device.h>
> > +
> > +#include <media/mipi-csi2.h>
> > +#include <media/v4l2-device.h>
> > +#include <media/v4l2-ioctl.h>
> > +#include <media/v4l2-mc.h>
> > +#include <media/videobuf2-dma-contig.h>
> > +
> > +#define TI_CSI2RX_MODULE_NAME		"j721e-csi2rx"
> > +
> > +#define SHIM_CNTL			0x10
> > +#define SHIM_CNTL_PIX_RST		BIT(0)
> > +
> > +#define SHIM_DMACNTX			0x20
> > +#define SHIM_DMACNTX_EN			BIT(31)
> > +#define SHIM_DMACNTX_YUV422		GENMASK(27, 26)
> > +#define SHIM_DMACNTX_SIZE		GENMASK(21, 20)
> > +#define SHIM_DMACNTX_FMT		GENMASK(5, 0)
> > +#define SHIM_DMACNTX_UYVY		0
> > +#define SHIM_DMACNTX_VYUY		1
> > +#define SHIM_DMACNTX_YUYV		2
> > +#define SHIM_DMACNTX_YVYU		3
> > +#define SHIM_DMACNTX_SIZE_8		0
> > +#define SHIM_DMACNTX_SIZE_16		1
> > +#define SHIM_DMACNTX_SIZE_32		2
> > +
> > +#define SHIM_PSI_CFG0			0x24
> > +#define SHIM_PSI_CFG0_SRC_TAG		GENMASK(15, 0)
> > +#define SHIM_PSI_CFG0_DST_TAG		GENMASK(31, 16)
> > +
> > +#define PSIL_WORD_SIZE_BYTES		16
> > +/*
> > + * There are no hard limits on the width or height. The DMA engine can handle
> > + * all sizes. The max width and height are arbitrary numbers for this driver.
> > + * Use 16K * 16K as the arbitrary limit. It is large enough that it is unlikely
> > + * the limit will be hit in practice.
> > + */
> > +#define MAX_WIDTH_BYTES			SZ_16K
> > +#define MAX_HEIGHT_LINES		SZ_16K
> > +
> > +#define DRAIN_TIMEOUT_MS		50
> > +
> > +struct ti_csi2rx_fmt {
> > +	u32				fourcc;	/* Four character code. */
> > +	u32				code;	/* Mbus code. */
> > +	u32				csi_dt;	/* CSI Data type. */
> > +	u8				bpp;	/* Bits per pixel. */
> > +	u8				size;	/* Data size shift when unpacking. */
> > +};
> > +
> > +struct ti_csi2rx_buffer {
> > +	/* Common v4l2 buffer. Must be first. */
> > +	struct vb2_v4l2_buffer		vb;
> > +	struct list_head		list;
> > +	struct ti_csi2rx_dev		*csi;
> > +};
> > +
> > +enum ti_csi2rx_dma_state {
> > +	TI_CSI2RX_DMA_STOPPED,	/* Streaming not started yet. */
> > +	TI_CSI2RX_DMA_IDLE,	/* Streaming but no pending DMA operation. */
> > +	TI_CSI2RX_DMA_ACTIVE,	/* Streaming and pending DMA operation. */
> > +};
> > +
> > +struct ti_csi2rx_dma {
> > +	/* Protects all fields in this struct. */
> > +	spinlock_t			lock;
> > +	struct dma_chan			*chan;
> > +	/* Buffers queued to the driver, waiting to be processed by DMA. */
> > +	struct list_head		queue;
> > +	enum ti_csi2rx_dma_state	state;
> > +	/*
> > +	 * Queue of buffers submitted to DMA engine.
> > +	 */
> > +	struct list_head		submitted;
> > +	/* Buffer to drain stale data from PSI-L endpoint */
> > +	struct {
> > +		void			*vaddr;
> > +		dma_addr_t		paddr;
> > +		size_t			len;
> > +	} drain;
> > +};
> > +
> > +struct ti_csi2rx_dev {
> > +	struct device			*dev;
> > +	void __iomem			*shim;
> > +	struct v4l2_device		v4l2_dev;
> > +	struct video_device		vdev;
> > +	struct media_device		mdev;
> > +	struct media_pipeline		pipe;
> > +	struct media_pad		pad;
> > +	struct v4l2_async_notifier	notifier;
> > +	struct v4l2_subdev		*source;
> > +	struct vb2_queue		vidq;
> > +	struct mutex			mutex; /* To serialize ioctls. */
> > +	struct v4l2_format		v_fmt;
> > +	struct ti_csi2rx_dma		dma;
> > +	u32				sequence;
> > +};
> > +
> > +static const struct ti_csi2rx_fmt formats[] = {
> 
> It would be nice to prefix local symbols to avoid namespace clashes,
> even if they're static. ti_csi2rx_formats could be a good name. Same
> below where applicable, and possibly above for some macro names.
> 

Will fix.

> > +	{
> > +		.fourcc			= V4L2_PIX_FMT_YUYV,
> > +		.code			= MEDIA_BUS_FMT_YUYV8_1X16,
> > +		.csi_dt			= MIPI_CSI2_DT_YUV422_8B,
> > +		.bpp			= 16,
> > +		.size			= SHIM_DMACNTX_SIZE_8,
> > +	}, {
> > +		.fourcc			= V4L2_PIX_FMT_UYVY,
> > +		.code			= MEDIA_BUS_FMT_UYVY8_1X16,
> > +		.csi_dt			= MIPI_CSI2_DT_YUV422_8B,
> > +		.bpp			= 16,
> > +		.size			= SHIM_DMACNTX_SIZE_8,
> > +	}, {
> > +		.fourcc			= V4L2_PIX_FMT_YVYU,
> > +		.code			= MEDIA_BUS_FMT_YVYU8_1X16,
> > +		.csi_dt			= MIPI_CSI2_DT_YUV422_8B,
> > +		.bpp			= 16,
> > +		.size			= SHIM_DMACNTX_SIZE_8,
> > +	}, {
> > +		.fourcc			= V4L2_PIX_FMT_VYUY,
> > +		.code			= MEDIA_BUS_FMT_VYUY8_1X16,
> > +		.csi_dt			= MIPI_CSI2_DT_YUV422_8B,
> > +		.bpp			= 16,
> > +		.size			= SHIM_DMACNTX_SIZE_8,
> > +	}, {
> > +		.fourcc			= V4L2_PIX_FMT_SBGGR8,
> > +		.code			= MEDIA_BUS_FMT_SBGGR8_1X8,
> > +		.csi_dt			= MIPI_CSI2_DT_RAW8,
> > +		.bpp			= 8,
> > +		.size			= SHIM_DMACNTX_SIZE_8,
> > +	}, {
> > +		.fourcc			= V4L2_PIX_FMT_SGBRG8,
> > +		.code			= MEDIA_BUS_FMT_SGBRG8_1X8,
> > +		.csi_dt			= MIPI_CSI2_DT_RAW8,
> > +		.bpp			= 8,
> > +		.size			= SHIM_DMACNTX_SIZE_8,
> > +	}, {
> > +		.fourcc			= V4L2_PIX_FMT_SGRBG8,
> > +		.code			= MEDIA_BUS_FMT_SGRBG8_1X8,
> > +		.csi_dt			= MIPI_CSI2_DT_RAW8,
> > +		.bpp			= 8,
> > +		.size			= SHIM_DMACNTX_SIZE_8,
> > +	}, {
> > +		.fourcc			= V4L2_PIX_FMT_SRGGB8,
> > +		.code			= MEDIA_BUS_FMT_SRGGB8_1X8,
> > +		.csi_dt			= MIPI_CSI2_DT_RAW8,
> > +		.bpp			= 8,
> > +		.size			= SHIM_DMACNTX_SIZE_8,
> > +	}, {
> > +		.fourcc			= V4L2_PIX_FMT_SBGGR10,
> > +		.code			= MEDIA_BUS_FMT_SBGGR10_1X10,
> > +		.csi_dt			= MIPI_CSI2_DT_RAW10,
> > +		.bpp			= 16,
> > +		.size			= SHIM_DMACNTX_SIZE_16,
> > +	}, {
> > +		.fourcc			= V4L2_PIX_FMT_SGBRG10,
> > +		.code			= MEDIA_BUS_FMT_SGBRG10_1X10,
> > +		.csi_dt			= MIPI_CSI2_DT_RAW10,
> > +		.bpp			= 16,
> > +		.size			= SHIM_DMACNTX_SIZE_16,
> > +	}, {
> > +		.fourcc			= V4L2_PIX_FMT_SGRBG10,
> > +		.code			= MEDIA_BUS_FMT_SGRBG10_1X10,
> > +		.csi_dt			= MIPI_CSI2_DT_RAW10,
> > +		.bpp			= 16,
> > +		.size			= SHIM_DMACNTX_SIZE_16,
> > +	}, {
> > +		.fourcc			= V4L2_PIX_FMT_SRGGB10,
> > +		.code			= MEDIA_BUS_FMT_SRGGB10_1X10,
> > +		.csi_dt			= MIPI_CSI2_DT_RAW10,
> > +		.bpp			= 16,
> > +		.size			= SHIM_DMACNTX_SIZE_16,
> > +	},
> > +
> > +	/* More formats can be supported but they are not listed for now. */
> > +};
> > +
> > +static const unsigned int num_formats = ARRAY_SIZE(formats);
> 
> I would use ARRAY_SIZE(formats) below and drop num_formats, as I don't
> think it improves readability, but I don't insist.
> 

Will fix.

> > +
> > +/* Forward declaration needed by ti_csi2rx_dma_callback. */
> > +static int ti_csi2rx_start_dma(struct ti_csi2rx_dev *csi,
> > +			       struct ti_csi2rx_buffer *buf);
> > +
> > +static const struct ti_csi2rx_fmt *find_format_by_pix(u32 pixelformat)
> 
> Maybe "_by_fourcc" ? That's nitpicking though.
> 

Will fix.

> > +{
> > +	unsigned int i;
> > +
> > +	for (i = 0; i < num_formats; i++) {
> > +		if (formats[i].fourcc == pixelformat)
> > +			return &formats[i];
> > +	}
> > +
> > +	return NULL;
> > +}
> > +
> > +static const struct ti_csi2rx_fmt *find_format_by_code(u32 code)
> > +{
> > +	unsigned int i;
> > +
> > +	for (i = 0; i < num_formats; i++) {
> > +		if (formats[i].code == code)
> > +			return &formats[i];
> > +	}
> > +
> > +	return NULL;
> > +}
> > +
> > +static void ti_csi2rx_fill_fmt(const struct ti_csi2rx_fmt *csi_fmt,
> > +			       struct v4l2_format *v4l2_fmt)
> > +{
> > +	struct v4l2_pix_format *pix = &v4l2_fmt->fmt.pix;
> > +	unsigned int pixels_in_word;
> > +	u8 bpp = ALIGN(csi_fmt->bpp, 8);
> 
> All bpp values are multiple of 8, is ALIGN() needed ?
> 

Hmm you're right it's not really needed, I think I decided to keep it 
from Pratyush's series just to be safe. Will remove it in v10.

> > +
> > +	pixels_in_word = PSIL_WORD_SIZE_BYTES * 8 / bpp;
> > +
> > +	/* Clamp width and height to sensible maximums (16K x 16K) */
> > +	pix->width = clamp_t(unsigned int, pix->width,
> > +			     pixels_in_word,
> > +			     MAX_WIDTH_BYTES * 8 / bpp);
> > +	pix->height = clamp_t(unsigned int, pix->height, 1, MAX_HEIGHT_LINES);
> > +
> > +	/* Width should be a multiple of transfer word-size */
> > +	pix->width = rounddown(pix->width, pixels_in_word);
> > +
> > +	v4l2_fmt->type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
> > +	pix->pixelformat = csi_fmt->fourcc;
> > +	pix->colorspace = V4L2_COLORSPACE_SRGB;
> 
> You should fill the other colorspace-related fields.
> 

Good catch, I think it would be better to use the colorspace and related 
fields from the mbus format of the source subdev as this IP doesn't 
really change it. Will fix in v10.

> > +	pix->bytesperline = pix->width * (bpp / 8);
> > +	pix->sizeimage = pix->bytesperline * pix->height;
> > +}
> > +
> > +static int ti_csi2rx_querycap(struct file *file, void *priv,
> > +			      struct v4l2_capability *cap)
> > +{
> > +	strscpy(cap->driver, TI_CSI2RX_MODULE_NAME, sizeof(cap->driver));
> > +	strscpy(cap->card, TI_CSI2RX_MODULE_NAME, sizeof(cap->card));
> > +
> > +	return 0;
> > +}
> > +
> > +static int ti_csi2rx_enum_fmt_vid_cap(struct file *file, void *priv,
> > +				      struct v4l2_fmtdesc *f)
> > +{
> > +	const struct ti_csi2rx_fmt *fmt = NULL;
> > +
> > +	if (f->mbus_code) {
> > +		/* 1-to-1 mapping between bus formats and pixel formats */
> > +		if (f->index > 0)
> > +			return -EINVAL;
> > +
> > +		fmt = find_format_by_code(f->mbus_code);
> > +	} else {
> > +		if (f->index >= num_formats)
> > +			return -EINVAL;
> > +
> > +		fmt = &formats[f->index];
> > +	}
> > +
> > +	if (!fmt)
> > +		return -EINVAL;
> > +
> > +	f->pixelformat = fmt->fourcc;
> > +	memset(f->reserved, 0, sizeof(f->reserved));
> > +	f->type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
> > +
> > +	return 0;
> > +}
> > +
> > +static int ti_csi2rx_g_fmt_vid_cap(struct file *file, void *prov,
> > +				   struct v4l2_format *f)
> > +{
> > +	struct ti_csi2rx_dev *csi = video_drvdata(file);
> > +
> > +	*f = csi->v_fmt;
> > +
> > +	return 0;
> > +}
> > +
> > +static int ti_csi2rx_try_fmt_vid_cap(struct file *file, void *priv,
> > +				     struct v4l2_format *f)
> > +{
> > +	const struct ti_csi2rx_fmt *fmt;
> > +
> > +	/*
> > +	 * Default to the first format if the requested pixel format code isn't
> > +	 * supported.
> > +	 */
> > +	fmt = find_format_by_pix(f->fmt.pix.pixelformat);
> > +	if (!fmt)
> > +		fmt = &formats[0];
> > +
> > +	/* Interlaced formats are not supported. */
> > +	f->fmt.pix.field = V4L2_FIELD_NONE;
> > +
> > +	ti_csi2rx_fill_fmt(fmt, f);
> > +
> > +	return 0;
> > +}
> > +
> > +static int ti_csi2rx_s_fmt_vid_cap(struct file *file, void *priv,
> > +				   struct v4l2_format *f)
> > +{
> > +	struct ti_csi2rx_dev *csi = video_drvdata(file);
> > +	struct vb2_queue *q = &csi->vidq;
> > +	int ret;
> > +
> > +	if (vb2_is_busy(q))
> > +		return -EBUSY;
> > +
> > +	ret = ti_csi2rx_try_fmt_vid_cap(file, priv, f);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	csi->v_fmt = *f;
> > +
> > +	return 0;
> > +}
> > +
> > +static int ti_csi2rx_enum_framesizes(struct file *file, void *fh,
> > +				     struct v4l2_frmsizeenum *fsize)
> > +{
> > +	const struct ti_csi2rx_fmt *fmt;
> > +	unsigned int pixels_in_word;
> > +	u8 bpp;
> > +
> > +	fmt = find_format_by_pix(fsize->pixel_format);
> > +	if (!fmt || fsize->index != 0)
> > +		return -EINVAL;
> > +
> > +	bpp = ALIGN(fmt->bpp, 8);
> > +
> > +	/*
> > +	 * Number of pixels in one PSI-L word. The transfer happens in multiples
> > +	 * of PSI-L word sizes.
> > +	 */
> > +	pixels_in_word = PSIL_WORD_SIZE_BYTES * 8 / bpp;
> > +
> > +	fsize->type = V4L2_FRMSIZE_TYPE_STEPWISE;
> > +	fsize->stepwise.min_width = pixels_in_word;
> > +	fsize->stepwise.max_width = rounddown(MAX_WIDTH_BYTES * 8 / bpp,
> > +					      pixels_in_word);
> > +	fsize->stepwise.step_width = pixels_in_word;
> > +	fsize->stepwise.min_height = 1;
> > +	fsize->stepwise.max_height = MAX_HEIGHT_LINES;
> > +	fsize->stepwise.step_height = 1;
> > +
> > +	return 0;
> > +}
> > +
> > +static const struct v4l2_ioctl_ops csi_ioctl_ops = {
> > +	.vidioc_querycap      = ti_csi2rx_querycap,
> > +	.vidioc_enum_fmt_vid_cap = ti_csi2rx_enum_fmt_vid_cap,
> > +	.vidioc_try_fmt_vid_cap = ti_csi2rx_try_fmt_vid_cap,
> > +	.vidioc_g_fmt_vid_cap = ti_csi2rx_g_fmt_vid_cap,
> > +	.vidioc_s_fmt_vid_cap = ti_csi2rx_s_fmt_vid_cap,
> > +	.vidioc_enum_framesizes = ti_csi2rx_enum_framesizes,
> > +	.vidioc_reqbufs       = vb2_ioctl_reqbufs,
> > +	.vidioc_create_bufs   = vb2_ioctl_create_bufs,
> > +	.vidioc_prepare_buf   = vb2_ioctl_prepare_buf,
> > +	.vidioc_querybuf      = vb2_ioctl_querybuf,
> > +	.vidioc_qbuf          = vb2_ioctl_qbuf,
> > +	.vidioc_dqbuf         = vb2_ioctl_dqbuf,
> > +	.vidioc_expbuf        = vb2_ioctl_expbuf,
> > +	.vidioc_streamon      = vb2_ioctl_streamon,
> > +	.vidioc_streamoff     = vb2_ioctl_streamoff,
> > +};
> > +
> > +static const struct v4l2_file_operations csi_fops = {
> > +	.owner = THIS_MODULE,
> > +	.open = v4l2_fh_open,
> > +	.release = vb2_fop_release,
> > +	.read = vb2_fop_read,
> > +	.poll = vb2_fop_poll,
> > +	.unlocked_ioctl = video_ioctl2,
> > +	.mmap = vb2_fop_mmap,
> > +};
> > +
> > +static int csi_async_notifier_bound(struct v4l2_async_notifier *notifier,
> > +				    struct v4l2_subdev *subdev,
> > +				    struct v4l2_async_connection *asc)
> > +{
> > +	struct ti_csi2rx_dev *csi = dev_get_drvdata(notifier->v4l2_dev->dev);
> > +
> > +	csi->source = subdev;
> > +
> > +	return 0;
> > +}
> > +
> > +static int csi_async_notifier_complete(struct v4l2_async_notifier *notifier)
> > +{
> > +	struct ti_csi2rx_dev *csi = dev_get_drvdata(notifier->v4l2_dev->dev);
> > +	struct video_device *vdev = &csi->vdev;
> > +	int ret;
> > +
> > +	ret = video_register_device(vdev, VFL_TYPE_VIDEO, -1);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = v4l2_create_fwnode_links_to_pad(csi->source, &csi->pad,
> > +					      MEDIA_LNK_FL_IMMUTABLE | MEDIA_LNK_FL_ENABLED);
> > +
> > +	if (ret) {
> > +		video_unregister_device(vdev);
> > +		return ret;
> > +	}
> > +
> > +	return v4l2_device_register_subdev_nodes(&csi->v4l2_dev);
> 
> You should call video_unregister_device() if this fails.

Will fix.

> 
> I'm tempted, however, to register the video device at probe time, not in
> this function.
>

My guess is this was done here to prevent creating a video node if 
sensor (source subdev) failed to probe. With multistream support we will 
have multiple video nodes (upto 16) where it seems cleaner to wait for 
the source. Let me know if you had some other reason in mind.

> > +}
> > +
> > +static const struct v4l2_async_notifier_operations csi_async_notifier_ops = {
> > +	.bound = csi_async_notifier_bound,
> > +	.complete = csi_async_notifier_complete,
> > +};
> > +
> > +static int ti_csi2rx_notifier_register(struct ti_csi2rx_dev *csi)
> > +{
> > +	struct fwnode_handle *fwnode;
> > +	struct v4l2_async_connection *asc;
> > +	struct device_node *node;
> > +	int ret;
> > +
> > +	node = of_get_child_by_name(csi->dev->of_node, "csi-bridge");
> > +	if (!node)
> > +		return -EINVAL;
> > +
> > +	fwnode = of_fwnode_handle(node);
> > +	if (!fwnode) {
> > +		of_node_put(node);
> > +		return -EINVAL;
> > +	}
> > +
> > +	v4l2_async_nf_init(&csi->notifier, &csi->v4l2_dev);
> > +	csi->notifier.ops = &csi_async_notifier_ops;
> > +
> > +	asc = v4l2_async_nf_add_fwnode(&csi->notifier, fwnode,
> > +				       struct v4l2_async_connection);
> > +	of_node_put(node);
> > +	if (IS_ERR(asc)) {
> > +		v4l2_async_nf_cleanup(&csi->notifier);
> > +		return PTR_ERR(asc);
> > +	}
> > +
> > +	ret = v4l2_async_nf_register(&csi->notifier);
> > +	if (ret) {
> > +		v4l2_async_nf_cleanup(&csi->notifier);
> > +		return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int ti_csi2rx_setup_shim(struct ti_csi2rx_dev *csi)
> > +{
> > +	const struct ti_csi2rx_fmt *fmt;
> > +	unsigned int reg;
> > +
> > +	fmt = find_format_by_pix(csi->v_fmt.fmt.pix.pixelformat);
> > +	if (!fmt) {
> > +		dev_err(csi->dev, "Pixelformat 0x%x is not supported\n",
> 
> Use %p4cc to print a fourcc. You need to pass the pixel format by
> address to dev_err() then, not by value.
> 
> Can this happen though, given that the set format handler should never
> allow setting a format not supported by the driver ? I think I'd drop
> the error check. The function can then become a void function.
> 

Makes sense, will drop this.

> > +			csi->v_fmt.fmt.pix.pixelformat);
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* De-assert the pixel interface reset. */
> > +	reg = SHIM_CNTL_PIX_RST;
> > +	writel(reg, csi->shim + SHIM_CNTL);
> > +
> > +	reg = SHIM_DMACNTX_EN;
> > +	reg |= FIELD_PREP(SHIM_DMACNTX_FMT, fmt->csi_dt);
> > +
> > +	/*
> > +	 * Using the values from the documentation gives incorrect ordering for
> > +	 * the luma and chroma components. In practice, the "reverse" format
> > +	 * gives the correct image. So for example, if the image is in UYVY, the
> > +	 * reverse would be YVYU.
> > +	 */
> > +	switch (fmt->fourcc) {
> > +	case V4L2_PIX_FMT_UYVY:
> > +		reg |= FIELD_PREP(SHIM_DMACNTX_YUV422,
> > +					SHIM_DMACNTX_YVYU);
> > +		break;
> > +	case V4L2_PIX_FMT_VYUY:
> > +		reg |= FIELD_PREP(SHIM_DMACNTX_YUV422,
> > +					SHIM_DMACNTX_YUYV);
> > +		break;
> > +	case V4L2_PIX_FMT_YUYV:
> > +		reg |= FIELD_PREP(SHIM_DMACNTX_YUV422,
> > +					SHIM_DMACNTX_VYUY);
> > +		break;
> > +	case V4L2_PIX_FMT_YVYU:
> > +		reg |= FIELD_PREP(SHIM_DMACNTX_YUV422,
> > +					SHIM_DMACNTX_UYVY);
> > +		break;
> > +	default:
> > +		/* Ignore if not YUV 4:2:2 */
> > +		break;
> > +	}
> > +
> > +	reg |= FIELD_PREP(SHIM_DMACNTX_SIZE, fmt->size);
> > +
> > +	writel(reg, csi->shim + SHIM_DMACNTX);
> > +
> > +	reg = FIELD_PREP(SHIM_PSI_CFG0_SRC_TAG, 0) |
> > +	      FIELD_PREP(SHIM_PSI_CFG0_DST_TAG, 0);
> > +	writel(reg, csi->shim + SHIM_PSI_CFG0);
> > +
> > +	return 0;
> > +}
> > +
> > +static void ti_csi2rx_drain_callback(void *param)
> > +{
> > +	struct completion *drain_complete = param;
> > +
> > +	complete(drain_complete);
> > +}
> > +
> > +/** Drain the stale data left at the PSI-L endpoint.
> 
> This isn't kerneldoc, so
> 
> /*
>  * Drain the stale data left at the PSI-L endpoint.
> 

Will fix.

> > + *
> > + * This might happen if no buffers are queued in time but source is still
> > + * streaming. Or rarely it may happen while stopping the stream. To prevent
> > + * that stale data corrupting the subsequent transactions, it is required to
> > + * issue DMA requests to drain it out.
> > + */
> > +static int ti_csi2rx_drain_dma(struct ti_csi2rx_dev *csi)
> > +{
> > +	struct dma_async_tx_descriptor *desc;
> > +	struct completion drain_complete;
> > +	dma_cookie_t cookie;
> > +	int ret;
> > +
> > +	init_completion(&drain_complete);
> > +
> > +	desc = dmaengine_prep_slave_single(csi->dma.chan, csi->dma.drain.paddr,
> > +					   csi->dma.drain.len, DMA_DEV_TO_MEM,
> > +					   DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
> > +	if (!desc) {
> > +		ret = -EIO;
> > +		goto out;
> > +	}
> > +
> > +	desc->callback = ti_csi2rx_drain_callback;
> > +	desc->callback_param = &drain_complete;
> > +
> > +	cookie = dmaengine_submit(desc);
> > +	ret = dma_submit_error(cookie);
> > +	if (ret)
> > +		goto out;
> > +
> > +	dma_async_issue_pending(csi->dma.chan);
> > +
> > +	if (!wait_for_completion_timeout(&drain_complete,
> > +					 msecs_to_jiffies(DRAIN_TIMEOUT_MS))) {
> > +		dmaengine_terminate_sync(csi->dma.chan);
> > +		ret = -ETIMEDOUT;
> > +		goto out;
> > +	}
> > +out:
> > +	return ret;
> > +}
> > +
> > +static void ti_csi2rx_dma_callback(void *param)
> > +{
> > +	struct ti_csi2rx_buffer *buf = param;
> > +	struct ti_csi2rx_dev *csi = buf->csi;
> > +	struct ti_csi2rx_dma *dma = &csi->dma;
> > +	unsigned long flags;
> > +
> > +	/*
> > +	 * TODO: Derive the sequence number from the CSI2RX frame number
> > +	 * hardware monitor registers.
> > +	 */
> > +	buf->vb.vb2_buf.timestamp = ktime_get_ns();
> > +	buf->vb.sequence = csi->sequence++;
> > +
> > +	spin_lock_irqsave(&dma->lock, flags);
> > +
> > +	WARN_ON(!list_is_first(&buf->list, &dma->submitted));
> > +	vb2_buffer_done(&buf->vb.vb2_buf, VB2_BUF_STATE_DONE);
> > +	list_del(&buf->list);
> > +
> > +	/* If there are more buffers to process then start their transfer. */
> > +	while (!list_empty(&dma->queue)) {
> > +		buf = list_entry(dma->queue.next, struct ti_csi2rx_buffer, list);
> > +
> > +		if (ti_csi2rx_start_dma(csi, buf)) {
> > +			dev_err(csi->dev, "Failed to queue the next buffer for DMA\n");
> > +			vb2_buffer_done(&buf->vb.vb2_buf, VB2_BUF_STATE_ERROR);
> > +		} else {
> > +			list_move_tail(&buf->list, &dma->submitted);
> > +		}
> > +	}
> > +
> > +	if (list_empty(&dma->submitted))
> > +		dma->state = TI_CSI2RX_DMA_IDLE;
> > +
> > +	spin_unlock_irqrestore(&dma->lock, flags);
> > +}
> > +
> > +static int ti_csi2rx_start_dma(struct ti_csi2rx_dev *csi,
> > +			       struct ti_csi2rx_buffer *buf)
> > +{
> > +	unsigned long addr;
> > +	struct dma_async_tx_descriptor *desc;
> > +	size_t len = csi->v_fmt.fmt.pix.sizeimage;
> > +	dma_cookie_t cookie;
> > +	int ret = 0;
> > +
> > +	addr = vb2_dma_contig_plane_dma_addr(&buf->vb.vb2_buf, 0);
> > +	desc = dmaengine_prep_slave_single(csi->dma.chan, addr, len,
> > +					   DMA_DEV_TO_MEM,
> > +					   DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
> > +	if (!desc)
> > +		return -EIO;
> > +
> > +	desc->callback = ti_csi2rx_dma_callback;
> > +	desc->callback_param = buf;
> > +
> > +	cookie = dmaengine_submit(desc);
> > +	ret = dma_submit_error(cookie);
> > +	if (ret)
> > +		return ret;
> > +
> > +	dma_async_issue_pending(csi->dma.chan);
> > +
> > +	return 0;
> > +}
> > +
> > +static void ti_csi2rx_cleanup_buffers(struct ti_csi2rx_dev *csi,
> > +				      enum vb2_buffer_state buf_state)
> > +{
> > +	struct ti_csi2rx_dma *dma = &csi->dma;
> > +	struct ti_csi2rx_buffer *buf, *tmp;
> > +	enum ti_csi2rx_dma_state state;
> > +	unsigned long flags;
> > +	int ret;
> > +
> > +	spin_lock_irqsave(&dma->lock, flags);
> > +	state = csi->dma.state;
> > +	dma->state = TI_CSI2RX_DMA_STOPPED;
> > +	spin_unlock_irqrestore(&dma->lock, flags);
> > +
> > +	if (state != TI_CSI2RX_DMA_STOPPED) {
> > +		/*
> > +		 * Normal DMA termination sometimes does not clean up pending
> > +		 * data on the endpoint.
> > +		 */
> > +		ret = ti_csi2rx_drain_dma(csi);
> > +		if (ret)
> > +			dev_dbg(csi->dev,
> > +				"Failed to drain DMA. Next frame might be bogus\n");
> 
> A dev_warn() may be more appropriate, this seems quite important.
> 

I think this was intentional. The calls to DMA engine for draining 
always "timeout" even when the drain is successful, because the amount 
of data drained is non-deterministic (and we use framesize to be safe, 
which is much more than what FIFOs would be storing). Keeping this as 
dev_warn() leads to spurious dmesg logs at the end of every stream 
close.

Maybe we should use dev_dbg for the timeout case, and dev_err for any 
other error thrown by DMA engine. Will fix that in v10.

Hopefully this will be simplified once we have an API in place to signal 
drain to "/dev/null".

> > +	}
> 
> A blank line would be nice here.
> 

Will fix.

> > +	ret = dmaengine_terminate_sync(csi->dma.chan);
> > +	if (ret)
> > +		dev_err(csi->dev, "Failed to stop DMA: %d\n", ret);
> 
> When called from ti_csi2rx_start_streaming() there's already a
> dmaengine_terminate_sync(), and there's also a call to the same function
> in ti_csi2rx_drain_dma() called above. Could we avoid calling the
> function multiple times ? I think stopping the DMA engine should be
> moved to a separate function, as it doesn't fit with the
> ti_csi2rx_cleanup_buffers() name.
> 

Oops missed that, will fix.

> > +
> > +	dma_free_coherent(csi->dev, dma->drain.len,
> > +			  dma->drain.vaddr, dma->drain.paddr);
> > +	dma->drain.vaddr = NULL;
> > +
> > +	spin_lock_irqsave(&dma->lock, flags);
> > +	list_for_each_entry_safe(buf, tmp, &csi->dma.queue, list) {
> > +		list_del(&buf->list);
> > +		vb2_buffer_done(&buf->vb.vb2_buf, buf_state);
> > +	}
> > +	list_for_each_entry_safe(buf, tmp, &csi->dma.submitted, list) {
> > +		list_del(&buf->list);
> > +		vb2_buffer_done(&buf->vb.vb2_buf, buf_state);
> > +	}
> > +	spin_unlock_irqrestore(&dma->lock, flags);
> > +}
> > +
> > +static int ti_csi2rx_queue_setup(struct vb2_queue *q, unsigned int *nbuffers,
> > +				 unsigned int *nplanes, unsigned int sizes[],
> > +				 struct device *alloc_devs[])
> > +{
> > +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(q);
> > +	unsigned int size = csi->v_fmt.fmt.pix.sizeimage;
> > +
> > +	if (*nplanes) {
> > +		if (sizes[0] < size)
> > +			return -EINVAL;
> > +		size = sizes[0];
> > +	}
> > +
> > +	*nplanes = 1;
> > +	sizes[0] = size;
> > +
> > +	return 0;
> > +}
> > +
> > +static int ti_csi2rx_buffer_prepare(struct vb2_buffer *vb)
> > +{
> > +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vb->vb2_queue);
> > +	unsigned long size = csi->v_fmt.fmt.pix.sizeimage;
> > +
> > +	if (vb2_plane_size(vb, 0) < size) {
> > +		dev_err(csi->dev, "Data will not fit into plane\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	vb2_set_plane_payload(vb, 0, size);
> > +	return 0;
> > +}
> > +
> > +static void ti_csi2rx_buffer_queue(struct vb2_buffer *vb)
> > +{
> > +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vb->vb2_queue);
> > +	struct ti_csi2rx_buffer *buf;
> > +	struct ti_csi2rx_dma *dma = &csi->dma;
> > +	bool restart_dma = false;
> > +	unsigned long flags = 0;
> > +	int ret;
> > +
> > +	buf = container_of(vb, struct ti_csi2rx_buffer, vb.vb2_buf);
> > +	buf->csi = csi;
> > +
> > +	spin_lock_irqsave(&dma->lock, flags);
> > +	/*
> > +	 * Usually the DMA callback takes care of queueing the pending buffers.
> > +	 * But if DMA has stalled due to lack of buffers, restart it now.
> > +	 */
> > +	if (dma->state == TI_CSI2RX_DMA_IDLE) {
> > +		/*
> > +		 * Do not restart DMA with the lock held because
> > +		 * ti_csi2rx_drain_dma() might block for completion.
> > +		 * There won't be a race on queueing DMA anyway since the
> > +		 * callback is not being fired.
> > +		 */
> > +		restart_dma = true;
> > +		dma->state = TI_CSI2RX_DMA_ACTIVE;
> > +	} else {
> > +		list_add_tail(&buf->list, &dma->queue);
> > +	}
> > +	spin_unlock_irqrestore(&dma->lock, flags);
> > +
> > +	if (restart_dma) {
> > +		/*
> > +		 * Once frames start dropping, some data gets stuck in the DMA
> > +		 * pipeline somewhere. So the first DMA transfer after frame
> > +		 * drops gives a partial frame. This is obviously not useful to
> > +		 * the application and will only confuse it. Issue a DMA
> > +		 * transaction to drain that up.
> > +		 */
> 
> Another option would be to return the frame to userspace with the error
> flag set. That would give an earlier indication to applications that
> something went wrong. Up to you.
> 

Oh I see, that does sound cleaner to me - I will try that out with 
different applications. If it works, would it be okay if I handle it as 
a separate series? The mechanism in this series is already well-tested, 
even if a bit ugly.

> > +		ret = ti_csi2rx_drain_dma(csi);
> > +		if (ret)
> > +			dev_warn(csi->dev,
> > +				 "Failed to drain DMA. Next frame might be bogus\n");
> > +
> > +		ret = ti_csi2rx_start_dma(csi, buf);
> > +		if (ret) {
> > +			dev_err(csi->dev, "Failed to start DMA: %d\n", ret);
> > +			spin_lock_irqsave(&dma->lock, flags);
> > +			vb2_buffer_done(&buf->vb.vb2_buf, VB2_BUF_STATE_ERROR);
> > +			dma->state = TI_CSI2RX_DMA_IDLE;
> > +			spin_unlock_irqrestore(&dma->lock, flags);
> > +		} else {
> > +			spin_lock_irqsave(&dma->lock, flags);
> > +			list_add_tail(&buf->list, &dma->submitted);
> > +			spin_unlock_irqrestore(&dma->lock, flags);
> > +		}
> > +	}

[snip]

> 
> -- Regards,
> 
> Laurent Pinchart

Jai Luthra Oct. 6, 2023, 10:26 a.m. UTC | #7

Hi Laurent, Vignesh, Vinod,

I have some good news, there is an upper bound on the amount of data 
stored in the FIFOs (~32KB), so we don't need to allocate a buffer of 
the full frame size.

On Oct 04, 2023 at 23:03:12 +0300, Laurent Pinchart wrote:
> On Wed, Oct 04, 2023 at 07:21:00PM +0530, Vinod Koul wrote:
> > On 29-08-23, 18:55, Laurent Pinchart wrote:
> > > Hi Jai,
> > > 
> > > (CC'ing Vinod, the maintainer of the DMA engine subsystem, for a
> > > question below)
> > 
> > Sorry this got lost
> 
> No worries.
> 
> > > On Fri, Aug 18, 2023 at 03:55:06PM +0530, Jai Luthra wrote:
> > > > On Aug 15, 2023 at 16:00:51 +0300, Tomi Valkeinen wrote:
> > > > > On 11/08/2023 13:47, Jai Luthra wrote:
> > > > > > From: Pratyush Yadav <p.yadav@ti.com>
> > > 
> > > [snip]
> > > 
> > > > > > +static int ti_csi2rx_start_streaming(struct vb2_queue *vq, unsigned int count)
> > > > > > +{
> > > > > > +	struct ti_csi2rx_dev *csi = vb2_get_drv_priv(vq);
> > > > > > +	struct ti_csi2rx_dma *dma = &csi->dma;
> > > > > > +	struct ti_csi2rx_buffer *buf;
> > > > > > +	unsigned long flags;
> > > > > > +	int ret = 0;
> > > > > > +
> > > > > > +	spin_lock_irqsave(&dma->lock, flags);
> > > > > > +	if (list_empty(&dma->queue))
> > > > > > +		ret = -EIO;
> > > > > > +	spin_unlock_irqrestore(&dma->lock, flags);
> > > > > > +	if (ret)
> > > > > > +		return ret;
> > > > > > +
> > > > > > +	dma->drain.len = csi->v_fmt.fmt.pix.sizeimage;
> > > > > > +	dma->drain.vaddr = dma_alloc_coherent(csi->dev, dma->drain.len,
> > > > > > +					      &dma->drain.paddr, GFP_KERNEL);
> > > > > > +	if (!dma->drain.vaddr)
> > > > > > +		return -ENOMEM;
> > > > > 
> > > > > This is still allocating a large buffer every time streaming is started (and
> > > > > with streams support, a separate buffer for each stream?).
> > > > > 
> > > > > Did you check if the TI DMA can do writes to a constant address? That would
> > > > > be the best option, as then the whole buffer allocation problem goes away.
> > > > 
> > > > I checked with Vignesh, the hardware can support a scenario where we 
> > > > flush out all the data without allocating a buffer, but I couldn't find 
> > > > a way to signal that via the current dmaengine framework APIs. Will look 
> > > > into it further as it will be important for multi-stream support.
> > > 
> > > That would be the best option. It's not immediately apparent to me if
> > > the DMA engine API supports such a use case.
> > > dmaengine_prep_interleaved_dma() gives you finer grain control on the
> > > source and destination increments, but I haven't seen a way to instruct
> > > the DMA engine to direct writes to /dev/null (so to speak). Vinod, is
> > > this something that is supported, or could be supported ?
> > 
> > Write to a dummy buffer could have the same behaviour, no?
> 
> Yes, but if the DMA engine can write to /dev/null, that avoids
> allocating a dummy buffer, which is nicer. For video use cases, dummy
> buffers are often large.
> 
> > > > > Alternatively, can you flush the buffers with multiple one line transfers?
> > > > > The flushing shouldn't be performance critical, so even if that's slower
> > > > > than a normal full-frame DMA, it shouldn't matter much. And if that can be
> > > > > done, a single probe time line-buffer allocation should do the trick.
> > > > 
> > > > There will be considerable overhead if we queue many DMA transactions 
> > > > (in the order of 1000s or even 100s), which might not be okay for the 
> > > > scenarios where we have to drain mid-stream. Will have to run some 
> > > > experiments to see if that is worth it.
> > > > 
> > > > But one optimization we can for sure do is re-use a single drain buffer 
> > > > for all the streams. We will need to ensure to re-allocate the buffer 
> > > > for the "largest" framesize supported across the different streams at 
> > > > stream-on time.
> > > 
> > > If you implement .device_prep_interleaved_dma() in the DMA engine driver
> > > you could write to a single line buffer, assuming that the hardware would
> > > support so in a generic way.
> > > 
> > > > My guess is the endpoint is not buffering a full-frame's worth of data, 
> > > > I will also check if we can upper bound that size to something feasible.

According to the spec the endpoint buffers a maximum of 2048 x (128-bit) 
samples, which comes out to be 32KiB.

I ran some experiments after disabling the drain and looking at the 
subsequent corrupt frames with stale data, and it was always in 
multiples of (< 20x) 128-bit samples.

Given we have an upper bound, I think a practical solution for now is to 
allocate a single re-usable 32KiB buffer at probe time (will send v10 
with this fix).

Although it would be ideal if we can do this without *any* buffers at 
all.

> > > > 
> > > > > Other than this drain buffer topic, I think this looks fine. So, I'm going
> > > > > to give Rb, but I do encourage you to look more into optimizing this drain
> > > > > buffer.
> > > > 
> > > > Thank you!
> > > > 
> > > > > Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
> 
> -- 
> Regards,
> 
> Laurent Pinchart

[v9,13/13] media: ti: Add CSI2RX support for J721E

Commit Message

Comments

Patch