mbox series

[v5,0/6] usb: gadget: functionfs: DMABUF import interface

Message ID 20240119141402.44262-1-paul@crapouillou.net
Headers show
Series usb: gadget: functionfs: DMABUF import interface | expand

Message

Paul Cercueil Jan. 19, 2024, 2:13 p.m. UTC
Hi,

This is the v5 of my patchset that adds a new DMABUF import interface to
FunctionFS.

Daniel / Sima suggested that I should cache the dma_buf_attachment while
the DMABUF is attached to the interface, instead of mapping/unmapping
the DMABUF for every transfer (also because unmapping is not possible in
the dma_fence's critical section). This meant having to add new
dma_buf_begin_access() / dma_buf_end_access() functions that the driver
can call to ensure cache coherency. These two functions are provided by
the new patch [1/6], and an implementation for udmabuf was added in
[2/6] - see the changelog below.

This patchset was successfully tested with CONFIG_LOCKDEP, no errors
were reported in dmesg while using the interface.

This interface is being used at Analog Devices, to transfer data from
high-speed transceivers to USB in a zero-copy fashion, using also the
DMABUF import interface to the IIO subsystem which is being upstreamed
in parallel [1]. The two are used by the Libiio software [2].

On a ZCU102 board with a FMComms3 daughter board, using the combination
of these two new interfaces yields a drastic improvement of the
throughput, from about 127 MiB/s using IIO's buffer read/write interface
+ read/write to the FunctionFS endpoints, to about 274 MiB/s when
passing around DMABUFs, for a lower CPU usage (0.85 load avg. before,
vs. 0.65 after).

Right now, *technically* there are no users of this interface, as
Analog Devices wants to wait until both interfaces are accepted upstream
to merge the DMABUF code in Libiio into the main branch, and Jonathan
wants to wait and see if this patchset is accepted to greenlight the
DMABUF interface in IIO as well. I think this isn't really a problem;
once everybody is happy with its part of the cake, we can merge them all
at once.

This is obviously for 5.9, and based on next-20240119.

Changelog:

- [1/6]: New patch
- [2/6]: New patch
- [5/6]:
  - Cache the dma_buf_attachment while the DMABUF is attached.
  - Use dma_buf_begin/end_access() to ensure that the DMABUF data will be
    coherent to the hardware.
  - Remove comment about cache-management and dma_buf_unmap_attachment(),
    since we now use dma_buf_begin/end_access().
  - Select DMA_SHARED_BUFFER in Kconfig entry
  - Add Christian's ACK

Cheers,
-Paul

[1] https://lore.kernel.org/linux-iio/219abc43b4fdd4a13b307ed2efaa0e6869e68e3f.camel@gmail.com/T/
[2] https://github.com/analogdevicesinc/libiio/tree/pcercuei/dev-new-dmabuf-api

Paul Cercueil (6):
  dma-buf: Add dma_buf_{begin,end}_access()
  dma-buf: udmabuf: Implement .{begin,end}_access
  usb: gadget: Support already-mapped DMA SGs
  usb: gadget: functionfs: Factorize wait-for-endpoint code
  usb: gadget: functionfs: Add DMABUF import interface
  Documentation: usb: Document FunctionFS DMABUF API

 Documentation/usb/functionfs.rst    |  36 ++
 drivers/dma-buf/dma-buf.c           |  66 ++++
 drivers/dma-buf/udmabuf.c           |  27 ++
 drivers/usb/gadget/Kconfig          |   1 +
 drivers/usb/gadget/function/f_fs.c  | 502 ++++++++++++++++++++++++++--
 drivers/usb/gadget/udc/core.c       |   7 +-
 include/linux/dma-buf.h             |  37 ++
 include/linux/usb/gadget.h          |   2 +
 include/uapi/linux/usb/functionfs.h |  41 +++
 9 files changed, 698 insertions(+), 21 deletions(-)

Comments

Christian König Jan. 22, 2024, 10:35 a.m. UTC | #1
Am 19.01.24 um 15:13 schrieb Paul Cercueil:
> These functions should be used by device drivers when they start and
> stop accessing the data of DMABUF. It allows DMABUF importers to cache
> the dma_buf_attachment while ensuring that the data they want to access
> is available for their device when the DMA transfers take place.

As Daniel already noted as well this is a complete no-go from the 
DMA-buf design point of view.

Regards,
Christian.

>
> Signed-off-by: Paul Cercueil <paul@crapouillou.net>
>
> ---
> v5: New patch
> ---
>   drivers/dma-buf/dma-buf.c | 66 +++++++++++++++++++++++++++++++++++++++
>   include/linux/dma-buf.h   | 37 ++++++++++++++++++++++
>   2 files changed, 103 insertions(+)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 8fe5aa67b167..a8bab6c18fcd 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -830,6 +830,8 @@ static struct sg_table * __map_dma_buf(struct dma_buf_attachment *attach,
>    *     - dma_buf_mmap()
>    *     - dma_buf_begin_cpu_access()
>    *     - dma_buf_end_cpu_access()
> + *     - dma_buf_begin_access()
> + *     - dma_buf_end_access()
>    *     - dma_buf_map_attachment_unlocked()
>    *     - dma_buf_unmap_attachment_unlocked()
>    *     - dma_buf_vmap_unlocked()
> @@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map)
>   }
>   EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
>   
> +/**
> + * @dma_buf_begin_access - Call before any hardware access from/to the DMABUF
> + * @attach:	[in]	attachment used for hardware access
> + * @sg_table:	[in]	scatterlist used for the DMA transfer
> + * @direction:  [in]    direction of DMA transfer
> + */
> +int dma_buf_begin_access(struct dma_buf_attachment *attach,
> +			 struct sg_table *sgt, enum dma_data_direction dir)
> +{
> +	struct dma_buf *dmabuf;
> +	bool cookie;
> +	int ret;
> +
> +	if (WARN_ON(!attach))
> +		return -EINVAL;
> +
> +	dmabuf = attach->dmabuf;
> +
> +	if (!dmabuf->ops->begin_access)
> +		return 0;
> +
> +	cookie = dma_fence_begin_signalling();
> +	ret = dmabuf->ops->begin_access(attach, sgt, dir);
> +	dma_fence_end_signalling(cookie);
> +
> +	if (WARN_ON_ONCE(ret))
> +		return ret;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
> +
> +/**
> + * @dma_buf_end_access - Call after any hardware access from/to the DMABUF
> + * @attach:	[in]	attachment used for hardware access
> + * @sg_table:	[in]	scatterlist used for the DMA transfer
> + * @direction:  [in]    direction of DMA transfer
> + */
> +int dma_buf_end_access(struct dma_buf_attachment *attach,
> +		       struct sg_table *sgt, enum dma_data_direction dir)
> +{
> +	struct dma_buf *dmabuf;
> +	bool cookie;
> +	int ret;
> +
> +	if (WARN_ON(!attach))
> +		return -EINVAL;
> +
> +	dmabuf = attach->dmabuf;
> +
> +	if (!dmabuf->ops->end_access)
> +		return 0;
> +
> +	cookie = dma_fence_begin_signalling();
> +	ret = dmabuf->ops->end_access(attach, sgt, dir);
> +	dma_fence_end_signalling(cookie);
> +
> +	if (WARN_ON_ONCE(ret))
> +		return ret;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
> +
>   #ifdef CONFIG_DEBUG_FS
>   static int dma_buf_debug_show(struct seq_file *s, void *unused)
>   {
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 8ff4add71f88..8ba612c7cc16 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -246,6 +246,38 @@ struct dma_buf_ops {
>   	 */
>   	int (*end_cpu_access)(struct dma_buf *, enum dma_data_direction);
>   
> +	/**
> +	 * @begin_access:
> +	 *
> +	 * This is called from dma_buf_begin_access() when a device driver
> +	 * wants to access the data of the DMABUF. The exporter can use this
> +	 * to flush/sync the caches if needed.
> +	 *
> +	 * This callback is optional.
> +	 *
> +	 * Returns:
> +	 *
> +	 * 0 on success or a negative error code on failure.
> +	 */
> +	int (*begin_access)(struct dma_buf_attachment *, struct sg_table *,
> +			    enum dma_data_direction);
> +
> +	/**
> +	 * @end_access:
> +	 *
> +	 * This is called from dma_buf_end_access() when a device driver is
> +	 * done accessing the data of the DMABUF. The exporter can use this
> +	 * to flush/sync the caches if needed.
> +	 *
> +	 * This callback is optional.
> +	 *
> +	 * Returns:
> +	 *
> +	 * 0 on success or a negative error code on failure.
> +	 */
> +	int (*end_access)(struct dma_buf_attachment *, struct sg_table *,
> +			  enum dma_data_direction);
> +
>   	/**
>   	 * @mmap:
>   	 *
> @@ -606,6 +638,11 @@ void dma_buf_detach(struct dma_buf *dmabuf,
>   int dma_buf_pin(struct dma_buf_attachment *attach);
>   void dma_buf_unpin(struct dma_buf_attachment *attach);
>   
> +int dma_buf_begin_access(struct dma_buf_attachment *attach,
> +			 struct sg_table *sgt, enum dma_data_direction dir);
> +int dma_buf_end_access(struct dma_buf_attachment *attach,
> +		       struct sg_table *sgt, enum dma_data_direction dir);
> +
>   struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info);
>   
>   int dma_buf_fd(struct dma_buf *dmabuf, int flags);
Paul Cercueil Jan. 22, 2024, 11:01 a.m. UTC | #2
Hi Christian,

Le lundi 22 janvier 2024 à 11:35 +0100, Christian König a écrit :
> Am 19.01.24 um 15:13 schrieb Paul Cercueil:
> > These functions should be used by device drivers when they start
> > and
> > stop accessing the data of DMABUF. It allows DMABUF importers to
> > cache
> > the dma_buf_attachment while ensuring that the data they want to
> > access
> > is available for their device when the DMA transfers take place.
> 
> As Daniel already noted as well this is a complete no-go from the 
> DMA-buf design point of view.

What do you mean "as Daniel already noted"? It was him who suggested
this.

> 
> Regards,
> Christian.

Cheers,
-Paul

> 
> > 
> > Signed-off-by: Paul Cercueil <paul@crapouillou.net>
> > 
> > ---
> > v5: New patch
> > ---
> >   drivers/dma-buf/dma-buf.c | 66
> > +++++++++++++++++++++++++++++++++++++++
> >   include/linux/dma-buf.h   | 37 ++++++++++++++++++++++
> >   2 files changed, 103 insertions(+)
> > 
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index 8fe5aa67b167..a8bab6c18fcd 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -830,6 +830,8 @@ static struct sg_table * __map_dma_buf(struct
> > dma_buf_attachment *attach,
> >    *     - dma_buf_mmap()
> >    *     - dma_buf_begin_cpu_access()
> >    *     - dma_buf_end_cpu_access()
> > + *     - dma_buf_begin_access()
> > + *     - dma_buf_end_access()
> >    *     - dma_buf_map_attachment_unlocked()
> >    *     - dma_buf_unmap_attachment_unlocked()
> >    *     - dma_buf_vmap_unlocked()
> > @@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct dma_buf
> > *dmabuf, struct iosys_map *map)
> >   }
> >   EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
> >   
> > +/**
> > + * @dma_buf_begin_access - Call before any hardware access from/to
> > the DMABUF
> > + * @attach:	[in]	attachment used for hardware access
> > + * @sg_table:	[in]	scatterlist used for the DMA transfer
> > + * @direction:  [in]    direction of DMA transfer
> > + */
> > +int dma_buf_begin_access(struct dma_buf_attachment *attach,
> > +			 struct sg_table *sgt, enum
> > dma_data_direction dir)
> > +{
> > +	struct dma_buf *dmabuf;
> > +	bool cookie;
> > +	int ret;
> > +
> > +	if (WARN_ON(!attach))
> > +		return -EINVAL;
> > +
> > +	dmabuf = attach->dmabuf;
> > +
> > +	if (!dmabuf->ops->begin_access)
> > +		return 0;
> > +
> > +	cookie = dma_fence_begin_signalling();
> > +	ret = dmabuf->ops->begin_access(attach, sgt, dir);
> > +	dma_fence_end_signalling(cookie);
> > +
> > +	if (WARN_ON_ONCE(ret))
> > +		return ret;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
> > +
> > +/**
> > + * @dma_buf_end_access - Call after any hardware access from/to
> > the DMABUF
> > + * @attach:	[in]	attachment used for hardware access
> > + * @sg_table:	[in]	scatterlist used for the DMA transfer
> > + * @direction:  [in]    direction of DMA transfer
> > + */
> > +int dma_buf_end_access(struct dma_buf_attachment *attach,
> > +		       struct sg_table *sgt, enum
> > dma_data_direction dir)
> > +{
> > +	struct dma_buf *dmabuf;
> > +	bool cookie;
> > +	int ret;
> > +
> > +	if (WARN_ON(!attach))
> > +		return -EINVAL;
> > +
> > +	dmabuf = attach->dmabuf;
> > +
> > +	if (!dmabuf->ops->end_access)
> > +		return 0;
> > +
> > +	cookie = dma_fence_begin_signalling();
> > +	ret = dmabuf->ops->end_access(attach, sgt, dir);
> > +	dma_fence_end_signalling(cookie);
> > +
> > +	if (WARN_ON_ONCE(ret))
> > +		return ret;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
> > +
> >   #ifdef CONFIG_DEBUG_FS
> >   static int dma_buf_debug_show(struct seq_file *s, void *unused)
> >   {
> > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> > index 8ff4add71f88..8ba612c7cc16 100644
> > --- a/include/linux/dma-buf.h
> > +++ b/include/linux/dma-buf.h
> > @@ -246,6 +246,38 @@ struct dma_buf_ops {
> >   	 */
> >   	int (*end_cpu_access)(struct dma_buf *, enum
> > dma_data_direction);
> >   
> > +	/**
> > +	 * @begin_access:
> > +	 *
> > +	 * This is called from dma_buf_begin_access() when a
> > device driver
> > +	 * wants to access the data of the DMABUF. The exporter
> > can use this
> > +	 * to flush/sync the caches if needed.
> > +	 *
> > +	 * This callback is optional.
> > +	 *
> > +	 * Returns:
> > +	 *
> > +	 * 0 on success or a negative error code on failure.
> > +	 */
> > +	int (*begin_access)(struct dma_buf_attachment *, struct
> > sg_table *,
> > +			    enum dma_data_direction);
> > +
> > +	/**
> > +	 * @end_access:
> > +	 *
> > +	 * This is called from dma_buf_end_access() when a device
> > driver is
> > +	 * done accessing the data of the DMABUF. The exporter can
> > use this
> > +	 * to flush/sync the caches if needed.
> > +	 *
> > +	 * This callback is optional.
> > +	 *
> > +	 * Returns:
> > +	 *
> > +	 * 0 on success or a negative error code on failure.
> > +	 */
> > +	int (*end_access)(struct dma_buf_attachment *, struct
> > sg_table *,
> > +			  enum dma_data_direction);
> > +
> >   	/**
> >   	 * @mmap:
> >   	 *
> > @@ -606,6 +638,11 @@ void dma_buf_detach(struct dma_buf *dmabuf,
> >   int dma_buf_pin(struct dma_buf_attachment *attach);
> >   void dma_buf_unpin(struct dma_buf_attachment *attach);
> >   
> > +int dma_buf_begin_access(struct dma_buf_attachment *attach,
> > +			 struct sg_table *sgt, enum
> > dma_data_direction dir);
> > +int dma_buf_end_access(struct dma_buf_attachment *attach,
> > +		       struct sg_table *sgt, enum
> > dma_data_direction dir);
> > +
> >   struct dma_buf *dma_buf_export(const struct dma_buf_export_info
> > *exp_info);
> >   
> >   int dma_buf_fd(struct dma_buf *dmabuf, int flags);
>
Paul Cercueil Jan. 23, 2024, 1:02 p.m. UTC | #3
Le mardi 23 janvier 2024 à 12:52 +0100, Christian König a écrit :
> Am 23.01.24 um 11:10 schrieb Paul Cercueil:
> > Hi Christian,
> > 
> > Le lundi 22 janvier 2024 à 14:41 +0100, Christian König a écrit :
> > > Am 22.01.24 um 12:01 schrieb Paul Cercueil:
> > > > Hi Christian,
> > > > 
> > > > Le lundi 22 janvier 2024 à 11:35 +0100, Christian König a
> > > > écrit :
> > > > > Am 19.01.24 um 15:13 schrieb Paul Cercueil:
> > > > > > These functions should be used by device drivers when they
> > > > > > start
> > > > > > and
> > > > > > stop accessing the data of DMABUF. It allows DMABUF
> > > > > > importers
> > > > > > to
> > > > > > cache
> > > > > > the dma_buf_attachment while ensuring that the data they
> > > > > > want
> > > > > > to
> > > > > > access
> > > > > > is available for their device when the DMA transfers take
> > > > > > place.
> > > > > As Daniel already noted as well this is a complete no-go from
> > > > > the
> > > > > DMA-buf design point of view.
> > > > What do you mean "as Daniel already noted"? It was him who
> > > > suggested
> > > > this.
> > > Sorry, I haven't fully catched up to the discussion then.
> > > 
> > > In general DMA-buf is build around the idea that the data can be
> > > accessed coherently by the involved devices.
> > > 
> > > Having a begin/end of access for devices was brought up multiple
> > > times
> > > but so far rejected for good reasons.
> > I would argue that if it was brought up multiple times, then there
> > are
> > also good reasons to support such a mechanism.
> > 
> > > That an exporter has to call extra functions to access his own
> > > buffers
> > > is a complete no-go for the design since this forces exporters
> > > into
> > > doing extra steps for allowing importers to access their data.
> > Then what about we add these dma_buf_{begin,end}_access(), with
> > only
> > implementations for "dumb" exporters e.g. udmabuf or the dmabuf
> > heaps?
> > And only importers (who cache the mapping and actually care about
> > non-
> > coherency) would have to call these.
> 
> No, the problem is still that you would have to change all importers
> to 
> mandatory use dma_buf_begin/end.
> 
> But going a step back caching the mapping is irrelevant for
> coherency. 
> Even if you don't cache the mapping you don't get coherency.

You actually do - at least with udmabuf, as in that case
dma_buf_map_attachment() / dma_buf_unmap_attachment() will handle cache
coherency when the SGs are mapped/unmapped.

The problem was then that dma_buf_unmap_attachment cannot be called
before the dma_fence is signaled, and calling it after is already too
late (because the fence would be signaled before the data is sync'd).

Daniel / Sima suggested then that I cache the mapping and add new
functions to ensure cache coherency, which is what these patches are
about.

> In other words exporters are not require to call sync_to_cpu or 
> sync_to_device when you create a mapping.
> 
> What exactly is your use case here? And why does coherency matters?

My use-case is, I create DMABUFs with udmabuf, that I attach to
USB/functionfs with the interface introduced by this patchset. I attach
them to IIO with a similar interface (being upstreamed in parallel),
and transfer data from USB to IIO and vice-versa in a zero-copy
fashion.

This works perfectly fine as long as the USB and IIO hardware are
coherent between themselves, which is the case on most of our boards.
However I do have a board (with a Xilinx Ultrascale SoC) where it is
not the case, and cache flushes/sync are needed. So I was trying to
rework these new interfaces to work on that system too.

If this really is a no-no, then I am fine with the assumption that
devices sharing a DMABUF must be coherent between themselves; but
that's something that should probably be enforced rather than assumed.

(and I *think* there is a way to force coherency in the Ultrascale's
interconnect - we're investigating it)

Cheers,
-Paul

> > At the very least, is there a way to check that "the data can be
> > accessed coherently by the involved devices"? So that my importer
> > can
> > EPERM if there is no coherency vs. a device that's already
> > attached.
> 
> Yeah, there is functionality for this in the DMA subsystem. I've once
> created prototype patches for enforcing the same coherency approach 
> between importer and exporter, but we never got around to upstream
> them.
> 
> 
> 
> > 
> > Cheers,
> > -Paul
> > 
> > > That in turn is pretty much un-testable unless you have every
> > > possible
> > > importer around while testing the exporter.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > > Regards,
> > > > > Christian.
> > > > Cheers,
> > > > -Paul
> > > > 
> > > > > > Signed-off-by: Paul Cercueil <paul@crapouillou.net>
> > > > > > 
> > > > > > ---
> > > > > > v5: New patch
> > > > > > ---
> > > > > >     drivers/dma-buf/dma-buf.c | 66
> > > > > > +++++++++++++++++++++++++++++++++++++++
> > > > > >     include/linux/dma-buf.h   | 37 ++++++++++++++++++++++
> > > > > >     2 files changed, 103 insertions(+)
> > > > > > 
> > > > > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-
> > > > > > buf/dma-
> > > > > > buf.c
> > > > > > index 8fe5aa67b167..a8bab6c18fcd 100644
> > > > > > --- a/drivers/dma-buf/dma-buf.c
> > > > > > +++ b/drivers/dma-buf/dma-buf.c
> > > > > > @@ -830,6 +830,8 @@ static struct sg_table *
> > > > > > __map_dma_buf(struct
> > > > > > dma_buf_attachment *attach,
> > > > > >      *     - dma_buf_mmap()
> > > > > >      *     - dma_buf_begin_cpu_access()
> > > > > >      *     - dma_buf_end_cpu_access()
> > > > > > + *     - dma_buf_begin_access()
> > > > > > + *     - dma_buf_end_access()
> > > > > >      *     - dma_buf_map_attachment_unlocked()
> > > > > >      *     - dma_buf_unmap_attachment_unlocked()
> > > > > >      *     - dma_buf_vmap_unlocked()
> > > > > > @@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct
> > > > > > dma_buf
> > > > > > *dmabuf, struct iosys_map *map)
> > > > > >     }
> > > > > >     EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
> > > > > >     
> > > > > > +/**
> > > > > > + * @dma_buf_begin_access - Call before any hardware access
> > > > > > from/to
> > > > > > the DMABUF
> > > > > > + * @attach:	[in]	attachment used for hardware
> > > > > > access
> > > > > > + * @sg_table:	[in]	scatterlist used for the DMA
> > > > > > transfer
> > > > > > + * @direction:  [in]    direction of DMA transfer
> > > > > > + */
> > > > > > +int dma_buf_begin_access(struct dma_buf_attachment
> > > > > > *attach,
> > > > > > +			 struct sg_table *sgt, enum
> > > > > > dma_data_direction dir)
> > > > > > +{
> > > > > > +	struct dma_buf *dmabuf;
> > > > > > +	bool cookie;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	if (WARN_ON(!attach))
> > > > > > +		return -EINVAL;
> > > > > > +
> > > > > > +	dmabuf = attach->dmabuf;
> > > > > > +
> > > > > > +	if (!dmabuf->ops->begin_access)
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	cookie = dma_fence_begin_signalling();
> > > > > > +	ret = dmabuf->ops->begin_access(attach, sgt, dir);
> > > > > > +	dma_fence_end_signalling(cookie);
> > > > > > +
> > > > > > +	if (WARN_ON_ONCE(ret))
> > > > > > +		return ret;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
> > > > > > +
> > > > > > +/**
> > > > > > + * @dma_buf_end_access - Call after any hardware access
> > > > > > from/to
> > > > > > the DMABUF
> > > > > > + * @attach:	[in]	attachment used for hardware
> > > > > > access
> > > > > > + * @sg_table:	[in]	scatterlist used for the DMA
> > > > > > transfer
> > > > > > + * @direction:  [in]    direction of DMA transfer
> > > > > > + */
> > > > > > +int dma_buf_end_access(struct dma_buf_attachment *attach,
> > > > > > +		       struct sg_table *sgt, enum
> > > > > > dma_data_direction dir)
> > > > > > +{
> > > > > > +	struct dma_buf *dmabuf;
> > > > > > +	bool cookie;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	if (WARN_ON(!attach))
> > > > > > +		return -EINVAL;
> > > > > > +
> > > > > > +	dmabuf = attach->dmabuf;
> > > > > > +
> > > > > > +	if (!dmabuf->ops->end_access)
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	cookie = dma_fence_begin_signalling();
> > > > > > +	ret = dmabuf->ops->end_access(attach, sgt, dir);
> > > > > > +	dma_fence_end_signalling(cookie);
> > > > > > +
> > > > > > +	if (WARN_ON_ONCE(ret))
> > > > > > +		return ret;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
> > > > > > +
> > > > > >     #ifdef CONFIG_DEBUG_FS
> > > > > >     static int dma_buf_debug_show(struct seq_file *s, void
> > > > > > *unused)
> > > > > >     {
> > > > > > diff --git a/include/linux/dma-buf.h b/include/linux/dma-
> > > > > > buf.h
> > > > > > index 8ff4add71f88..8ba612c7cc16 100644
> > > > > > --- a/include/linux/dma-buf.h
> > > > > > +++ b/include/linux/dma-buf.h
> > > > > > @@ -246,6 +246,38 @@ struct dma_buf_ops {
> > > > > >     	 */
> > > > > >     	int (*end_cpu_access)(struct dma_buf *, enum
> > > > > > dma_data_direction);
> > > > > >     
> > > > > > +	/**
> > > > > > +	 * @begin_access:
> > > > > > +	 *
> > > > > > +	 * This is called from dma_buf_begin_access() when
> > > > > > a
> > > > > > device driver
> > > > > > +	 * wants to access the data of the DMABUF. The
> > > > > > exporter
> > > > > > can use this
> > > > > > +	 * to flush/sync the caches if needed.
> > > > > > +	 *
> > > > > > +	 * This callback is optional.
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 *
> > > > > > +	 * 0 on success or a negative error code on
> > > > > > failure.
> > > > > > +	 */
> > > > > > +	int (*begin_access)(struct dma_buf_attachment *,
> > > > > > struct
> > > > > > sg_table *,
> > > > > > +			    enum dma_data_direction);
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @end_access:
> > > > > > +	 *
> > > > > > +	 * This is called from dma_buf_end_access() when a
> > > > > > device
> > > > > > driver is
> > > > > > +	 * done accessing the data of the DMABUF. The
> > > > > > exporter
> > > > > > can
> > > > > > use this
> > > > > > +	 * to flush/sync the caches if needed.
> > > > > > +	 *
> > > > > > +	 * This callback is optional.
> > > > > > +	 *
> > > > > > +	 * Returns:
> > > > > > +	 *
> > > > > > +	 * 0 on success or a negative error code on
> > > > > > failure.
> > > > > > +	 */
> > > > > > +	int (*end_access)(struct dma_buf_attachment *,
> > > > > > struct
> > > > > > sg_table *,
> > > > > > +			  enum dma_data_direction);
> > > > > > +
> > > > > >     	/**
> > > > > >     	 * @mmap:
> > > > > >     	 *
> > > > > > @@ -606,6 +638,11 @@ void dma_buf_detach(struct dma_buf
> > > > > > *dmabuf,
> > > > > >     int dma_buf_pin(struct dma_buf_attachment *attach);
> > > > > >     void dma_buf_unpin(struct dma_buf_attachment *attach);
> > > > > >     
> > > > > > +int dma_buf_begin_access(struct dma_buf_attachment
> > > > > > *attach,
> > > > > > +			 struct sg_table *sgt, enum
> > > > > > dma_data_direction dir);
> > > > > > +int dma_buf_end_access(struct dma_buf_attachment *attach,
> > > > > > +		       struct sg_table *sgt, enum
> > > > > > dma_data_direction dir);
> > > > > > +
> > > > > >     struct dma_buf *dma_buf_export(const struct
> > > > > > dma_buf_export_info
> > > > > > *exp_info);
> > > > > >     
> > > > > >     int dma_buf_fd(struct dma_buf *dmabuf, int flags);
>
Paul Cercueil Jan. 24, 2024, 3:52 p.m. UTC | #4
Hi Andrew,

Le mercredi 24 janvier 2024 à 09:38 -0600, Andrew Davis a écrit :
> On 1/24/24 4:58 AM, Paul Cercueil wrote:
> > Hi Christian,
> > 
> > Le mardi 23 janvier 2024 à 14:28 +0100, Christian König a écrit :
> > >   Am 23.01.24 um 14:02 schrieb Paul Cercueil:
> > >   
> > > > [SNIP]
> > > >   
> > > > >   
> > > > > >    
> > > > > > >   
> > > > > > > That an exporter has to call extra functions to access
> > > > > > > his
> > > > > > > own
> > > > > > > buffers
> > > > > > > is a complete no-go for the design since this forces
> > > > > > > exporters
> > > > > > > into
> > > > > > > doing extra steps for allowing importers to access their
> > > > > > > data.
> > > > > > >   
> > > > > >   
> > > > > > Then what about we add these dma_buf_{begin,end}_access(),
> > > > > > with
> > > > > > only
> > > > > > implementations for "dumb" exporters e.g. udmabuf or the
> > > > > > dmabuf
> > > > > > heaps?
> > > > > > And only importers (who cache the mapping and actually care
> > > > > > about
> > > > > > non-
> > > > > > coherency) would have to call these.
> > > > > >   
> > > > >   
> > > > > No, the problem is still that you would have to change all
> > > > > importers
> > > > > to
> > > > > mandatory use dma_buf_begin/end.
> > > > > 
> > > > > But going a step back caching the mapping is irrelevant for
> > > > > coherency.
> > > > > Even if you don't cache the mapping you don't get coherency.
> > > > >   
> > > >   
> > > > You actually do - at least with udmabuf, as in that case
> > > > dma_buf_map_attachment() / dma_buf_unmap_attachment() will
> > > > handle
> > > > cache
> > > > coherency when the SGs are mapped/unmapped.
> > > >   
> > >   
> > >   Well I just double checked the source in 6.7.1 and I can't see
> > > udmabuf doing anything for cache coherency in map/unmap.
> > >   
> > >   All it does is calling dma_map_sgtable() and
> > > dma_unmap_sgtable() to
> > > create and destroy the SG table and those are not supposed to
> > > sync
> > > anything to the CPU cache.
> > >   
> > >   In other words drivers usually use DMA_ATTR_SKIP_CPU_SYNC here,
> > > it's
> > > just that this is missing from udmabuf.
> > 
> > Ok.
> >   
> > > >   
> > > > The problem was then that dma_buf_unmap_attachment cannot be
> > > > called
> > > > before the dma_fence is signaled, and calling it after is
> > > > already
> > > > too
> > > > late (because the fence would be signaled before the data is
> > > > sync'd).
> > > >   
> > >   
> > >   Well what sync are you talking about? CPU sync? In DMA-buf that
> > > is
> > > handled differently.
> > >   
> > >   For importers it's mandatory that they can be coherent with the
> > > exporter. That usually means they can snoop the CPU cache if the
> > > exporter can snoop the CPU cache.
> > 
> > I seem to have such a system where one device can snoop the CPU
> > cache
> > and the other cannot. Therefore if I want to support it properly, I
> > do
> > need cache flush/sync. I don't actually try to access the data
> > using
> > the CPU (and when I do, I call the sync start/end ioctls).
> > 
> 
> If you don't access the data using the CPU, then how did the data
> end up in the CPU caches? If you have a device that can write-
> allocate
> into your CPU cache, but some other device in the system cannot snoop
> that data back out then that is just broken and those devices cannot
> reasonably share buffers..

I think that's what happens, yes.

> Now we do have systems where some hardware can snoop CPU(or L3)
> caches
> and others cannot, but they will not *allocate* into those caches
> (unless they also have the ability to sync them without CPU in the
> loop).
> 
> Your problem may be if you are still using udmabuf driver as your
> DMA-BUF exporter, which as said before is broken (and I just sent
> some
> patches with a few fixes just for you :)). For udmabuf, data starts
> in the CPU domain (in caches) and is only ever synced for the CPU,
> not for attached devices. So in this case the writing device might
> update those cache lines but a non-snooping reader would never see
> those updates.

I tried today with the system dma-heap, and the behaviour was the same.
Adding an implementation of .dma_buf_begin/end_access() to it made it
work there too.

> I'm not saying there isn't a need for these new {begin,end}_access()
> functions. I can think of a few interesting usecases, but as you
> say below that would be good to work out in a different series.

Yep, but it's a can of worms I'd rather not open if I can avoid it :)

> Andrew

Cheers,
-Paul

> 
> > 
> > >   For exporters you can implement the begin/end CPU access
> > > functions
> > > which allows you to implement something even if your exporting
> > > device
> > > can't snoop the CPU cache.
> > 
> > That only works if the importers call the begin_cpu_access() /
> > end_cpu_access(), which they don't.
> > 
> >   
> > > > Daniel / Sima suggested then that I cache the mapping and add
> > > > new
> > > > functions to ensure cache coherency, which is what these
> > > > patches
> > > > are
> > > > about.
> > > >   
> > >   
> > >   Yeah, I've now catched up on the latest mail. Sorry I haven't
> > > seen
> > > that before.
> > >   
> > >   
> > > >   
> > > > 
> > > >   
> > > > >   
> > > > > In other words exporters are not require to call sync_to_cpu
> > > > > or
> > > > > sync_to_device when you create a mapping.
> > > > > 
> > > > > What exactly is your use case here? And why does coherency
> > > > > matters?
> > > > >   
> > > >   
> > > > My use-case is, I create DMABUFs with udmabuf, that I attach to
> > > > USB/functionfs with the interface introduced by this patchset.
> > > > I
> > > > attach
> > > > them to IIO with a similar interface (being upstreamed in
> > > > parallel),
> > > > and transfer data from USB to IIO and vice-versa in a zero-copy
> > > > fashion.
> > > > 
> > > > This works perfectly fine as long as the USB and IIO hardware
> > > > are
> > > > coherent between themselves, which is the case on most of our
> > > > boards.
> > > > However I do have a board (with a Xilinx Ultrascale SoC) where
> > > > it
> > > > is
> > > > not the case, and cache flushes/sync are needed. So I was
> > > > trying to
> > > > rework these new interfaces to work on that system too.
> > > >   
> > >   
> > >   Yeah, that sounds strongly like one of the use cases we have
> > > rejected so far.
> > >   
> > >   
> > >   
> > > >   
> > > > If this really is a no-no, then I am fine with the assumption
> > > > that
> > > > devices sharing a DMABUF must be coherent between themselves;
> > > > but
> > > > that's something that should probably be enforced rather than
> > > > assumed.
> > > > 
> > > > (and I *think* there is a way to force coherency in the
> > > > Ultrascale's
> > > > interconnect - we're investigating it)
> > > >   
> > >   
> > >   What you can do is that instead of using udmabuf or dma-heaps
> > > is
> > > that the device which can't provide coherency act as exporters of
> > > the
> > > buffers.
> > >   
> > >   The exporter is allowed to call sync_for_cpu/sync_for_device on
> > > it's
> > > own buffers and also gets begin/end CPU access notfications. So
> > > you
> > > can then handle coherency between the exporter and the CPU.
> > 
> > But again that would only work if the importers would call
> > begin_cpu_access() / end_cpu_access(), which they don't, because
> > they
> > don't actually access the data using the CPU.
> > 
> > Unless you mean that the exporter can call
> > sync_for_cpu/sync_for_device
> > before/after every single DMA transfer so that the data appears
> > coherent to the importers, without them having to call
> > begin_cpu_access() / end_cpu_access().
> > 
> > In which case - this would still demultiply the complexity; my USB-
> > functionfs interface here (and IIO interface in the separate
> > patchset)
> > are not device-specific, so I'd rather keep them importers.
> >   
> > >   If you really don't have coherency between devices then that
> > > would
> > > be a really new use case and we would need much more agreement on
> > > how
> > > to do this.
> > 
> > [snip]
> > 
> > Agreed. Desiging a good generic solution would be better.
> > 
> > With that said...
> > 
> > Let's keep it out of this USB-functionfs interface for now. The
> > interface does work perfectly fine on platforms that don't have
> > coherency problems. The coherency issue in itself really is a
> > tangential issue.
> > 
> > So I will send a v6 where I don't try to force the cache coherency
> > -
> > and instead assume that the attached devices are coherent between
> > themselves.
> > 
> > But it would be even better to have a way to detect non-coherency
> > and
> > return an error on attach.
> > 
> > Cheers,
> > -Paul
Christian König Jan. 26, 2024, 4:42 p.m. UTC | #5
Am 25.01.24 um 19:01 schrieb Daniel Vetter:
> On Thu, Jan 25, 2024 at 04:00:16PM +0100, Christian König wrote:
>> Am 24.01.24 um 11:58 schrieb Paul Cercueil:
>>> [SNIP]
>>>>> The problem was then that dma_buf_unmap_attachment cannot be called
>>>>> before the dma_fence is signaled, and calling it after is already
>>>>> too
>>>>> late (because the fence would be signaled before the data is
>>>>> sync'd).
>>>>    Well what sync are you talking about? CPU sync? In DMA-buf that is
>>>> handled differently.
>>>>    For importers it's mandatory that they can be coherent with the
>>>> exporter. That usually means they can snoop the CPU cache if the
>>>> exporter can snoop the CPU cache.
>>> I seem to have such a system where one device can snoop the CPU cache
>>> and the other cannot. Therefore if I want to support it properly, I do
>>> need cache flush/sync. I don't actually try to access the data using
>>> the CPU (and when I do, I call the sync start/end ioctls).
>> Usually that isn't a problem as long as you don't access the data with the
>> CPU.
>>
>> [SNIP]
>>
>>>>> (and I *think* there is a way to force coherency in the
>>>>> Ultrascale's
>>>>> interconnect - we're investigating it)
>>>>    What you can do is that instead of using udmabuf or dma-heaps is
>>>> that the device which can't provide coherency act as exporters of the
>>>> buffers.
>>>>    The exporter is allowed to call sync_for_cpu/sync_for_device on it's
>>>> own buffers and also gets begin/end CPU access notfications. So you
>>>> can then handle coherency between the exporter and the CPU.
>>> But again that would only work if the importers would call
>>> begin_cpu_access() / end_cpu_access(), which they don't, because they
>>> don't actually access the data using the CPU.
>> Wow, that is a completely new use case then.
>>
>> Neither DMA-buf nor the DMA subsystem in Linux actually supports this as far
>> as I can see.
>>
>>> Unless you mean that the exporter can call sync_for_cpu/sync_for_device
>>> before/after every single DMA transfer so that the data appears
>>> coherent to the importers, without them having to call
>>> begin_cpu_access() / end_cpu_access().
>> Yeah, I mean the importers don't have to call begin_cpu_access() /
>> end_cpu_access() if they don't do CPU access :)
>>
>> What you can still do as exporter is to call sync_for_device() and
>> sync_for_cpu() before and after each operation on your non-coherent device.
>> Paired with the fence signaling that should still work fine then.
>>
>> But taking a step back, this use case is not something even the low level
>> DMA subsystem supports. That sync_for_cpu() does the right thing is
>> coincident and not proper engineering.
>>
>> What you need is a sync_device_to_device() which does the appropriate
>> actions depending on which devices are involved.
>>
>>> In which case - this would still demultiply the complexity; my USB-
>>> functionfs interface here (and IIO interface in the separate patchset)
>>> are not device-specific, so I'd rather keep them importers.
>>>>    If you really don't have coherency between devices then that would
>>>> be a really new use case and we would need much more agreement on how
>>>> to do this.
>>> [snip]
>>>
>>> Agreed. Desiging a good generic solution would be better.
>>>
>>> With that said...
>>>
>>> Let's keep it out of this USB-functionfs interface for now. The
>>> interface does work perfectly fine on platforms that don't have
>>> coherency problems. The coherency issue in itself really is a
>>> tangential issue.
>> Yeah, completely agree.
>>
>>> So I will send a v6 where I don't try to force the cache coherency -
>>> and instead assume that the attached devices are coherent between
>>> themselves.
>>>
>>> But it would be even better to have a way to detect non-coherency and
>>> return an error on attach.
>> Take a look into the DMA subsystem. I'm pretty sure we already have
>> something like this in there.
>>
>> If nothing else helps you could take a look if the coherent memory access
>> mask is non zero or something like that.
> Jumping in way late and apolgies to everyone since yes I indeed suggested
> this entire mess to Paul in some private thread.
>
> And worse, I think we need it, it's just that we got away without it thus
> far.
>
> So way back at the og dma-buf kick-off dma coherency was discussed, and a
> few things where noted:
> - the dma api only supports device<->cpu coherency
> - getting the full coherency model off the ground right away is probably
>    too hard, so we made the decision that where it matters, relevant
>    flushing needs to be done in dma_buf_map/unmap.
>
> If you look at the earliest patches for dma-buf we had pretty clear
> language that all dma-operations should be bracketed with map/unmap. Of
> course that didn't work out for drm at all, and we had to first get
> dma_resv_lock and dma_fence landed and then your dynamic exporter/importer
> support in just to get the buffer migration functionality working, which
> was only one of the things discussed that braketing everything with
> map/unmap was supposed to take care of.
>
> The other was coherency management. But looking through archives I think
> this was already agreed to be postponed for later in the original kick-off
> meeting and never further discussed on the mailing list.
>
> This worked for a fairly long time, because thus far dma-buf was used on
> fairly reaasonable architectures where all participating devices are
> coherent enough.
>
> We did have to add the cpu access flushing fairly quickly because there's
> a lot of SoC chips (including intel) where that was necessary, but even
> that was added later on, as an opt-in and without fixing every. See
> fc13020e086b ("dma-buf: add support for kernel cpu access").
>
> The ioctl to allow userspace to do flushing was added even later on, and
> there the entire yolo opt-in situation is even worse. c11e391da2a8
> ("dma-buf: Add ioctls to allow userspace to flush") was only in 2016, 5
> years after dma-buf landed.
>
> It looks like it's finally time to add the device side flushing functions
> we've talked about first over 12 years ago :-)
>
> The reason this pops up now is that unlike other dma-buf users on maybe
> somewhat more funky architectures, Paul's patches want to use dma_fence
> for synchronization of the dma operations. Which means you cannot call the
> full dma_buf_map/unmap dance because that takes dma_resv_lock, and
> absolute no-go in a dma_fence critical path.
>
> And yes in those 12 years the dma-api hasn't gained the device2device sync
> support we'd need, but neither has it gained the multiple devices <-> cpu
> sync support we'd strictly need for dma-buf. So yes this is all a terrible
> hodge-podge of hacks, but if we'd require theoretically perfect code we'd
> still have zero dma-buf support in upstream.
>
> This also includes how we landed these extensions, none of them in the
> past have landed with a "update all existing exporters/importers" rule. We
> talked about that every time, and rejected it every time for imo pretty
> good reasons - the perf impact tends to be way too harsh if you impose
> over-flushing on everyone, including the reasonable platforms. And we
> currently can't do less than overflushing with the current dma-api
> interfaces because we dont have the specific flush functions we'd need. So
> really this isn't doing a worse abuse of the dma-api than what we have.
> It's definitely a bit wasteful since the functions we use do in theory
> flush too much. But in practice on the these funky architectures they
> flush enough.
>
> There's also the very hard issue of actually trying to optimize flushes,
> because a dma operation might only access part of a buffer, and you might
> interleave read/write access by different devices in very innovative ways.
> So I'm firmly on the "make it work first, then fast" side of things.
>
> So dma-buf will continue to be a thing that's tested for specific combos,
> and then we'll patch them. It's a decade-plus tradition at this point.
>
> Which is all a very long winded way of saying that yes, I think we need
> this, and we needed this 12 years ago already if we'd have aimed for
> perfect.
>
> I have a bunch of detail comments on the patch itself, but I guess we
> first need to find consensus on whether it's a good idea in the first
> place.

Well I think we should have some solution, but I'm not sure if it should 
be part of DMA-buf.

Essentially a DMA-buf exports the buffers as he uses it and the importer 
(or the DMA-buf subsystem) is then the one who says ok I can use this or 
I can't use this or I need to call extra functions to use this or whatever.

It's not the job of the exporter to provide the coherency for the 
importer, cause otherwise we would have a lot of code in the exporter 
which can only be tested when you have the right importer around. And I 
strongly think that this is a no-go for having a reliable solution.

That's why I think the approach of having DMA-buf callbacks is most 
likely the wrong thing to do.

What should happen instead is that the DMA subsystem provides 
functionality which to devices which don't support coherency through 
it's connection to say I want to access this data, please make sure to 
flush the appropriate catches. But that's just a very very rough design 
idea.

This will become more with CXL at the horizon I think.

Regards,
Christian.

>
> Cheers, Sima
Christian König Jan. 30, 2024, 1:09 p.m. UTC | #6
Am 30.01.24 um 11:40 schrieb Daniel Vetter:
> On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:
>> Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
>>>   I would say we start with the DMA-API by getting away from sg_tables
>>> to something cleaner and state oriented.
>> FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset,
>> which is just a dead simple
>>
>> struct dma_vec {
>>    dma_addr_t addr;
>>    size_t len;
>> };
>>
>> (The rationale for introducing it in the IIO DMABUF patchset was that
>> the "scatterlist" wouldn't allow me to change the transfer size.)
>>
>> So I believe a new "sg_table"-like could just be an array of struct
>> dma_vec + flags.
> Yeah that's pretty much the proposal I've seen, split the sg table into
> input data (struct page + len) and output data (which is the dma_addr_t +
> len you have above).

I would extend that a bit and say we have an array with 
dma_addr+power_of_two_order and a header structure with lower bit offset 
and some DMA transaction flags.

But this is something which can be worked as an optimization later on. 
For a start this proposal here looks good to me as well.

> The part I don't expect to ever happen, because it hasn't the past 20 or
> so years, is that the dma-api will give us information about what is
> needed to keep the buffers coherency between various devices and the cpu.

Well maybe that's what we are doing wrong.

Instead of asking the dma-api about the necessary information we should 
give the API the opportunity to work for us.

In other words we don't need the information about buffer coherency what 
we need is that the API works for as and fulfills the requirements we have.

So the question is really what should we propose to change on the 
DMA-api side to get this working as expected?

Regards,
Christian.





> -Sima
Christian König Feb. 6, 2024, 1:28 p.m. UTC | #7
Am 31.01.24 um 10:07 schrieb Daniel Vetter:
> On Tue, Jan 30, 2024 at 02:09:45PM +0100, Christian König wrote:
>> Am 30.01.24 um 11:40 schrieb Daniel Vetter:
>>> On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:
>>>> Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
>>>>>    I would say we start with the DMA-API by getting away from sg_tables
>>>>> to something cleaner and state oriented.
>>>> FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset,
>>>> which is just a dead simple
>>>>
>>>> struct dma_vec {
>>>>     dma_addr_t addr;
>>>>     size_t len;
>>>> };
>>>>
>>>> (The rationale for introducing it in the IIO DMABUF patchset was that
>>>> the "scatterlist" wouldn't allow me to change the transfer size.)
>>>>
>>>> So I believe a new "sg_table"-like could just be an array of struct
>>>> dma_vec + flags.
>>> Yeah that's pretty much the proposal I've seen, split the sg table into
>>> input data (struct page + len) and output data (which is the dma_addr_t +
>>> len you have above).
>> I would extend that a bit and say we have an array with
>> dma_addr+power_of_two_order and a header structure with lower bit offset and
>> some DMA transaction flags.
>>
>> But this is something which can be worked as an optimization later on. For a
>> start this proposal here looks good to me as well.
>>
>>> The part I don't expect to ever happen, because it hasn't the past 20 or
>>> so years, is that the dma-api will give us information about what is
>>> needed to keep the buffers coherency between various devices and the cpu.
>> Well maybe that's what we are doing wrong.
>>
>> Instead of asking the dma-api about the necessary information we should give
>> the API the opportunity to work for us.
>>
>> In other words we don't need the information about buffer coherency what we
>> need is that the API works for as and fulfills the requirements we have.
>>
>> So the question is really what should we propose to change on the DMA-api
>> side to get this working as expected?
> So one thing I've been pondering, kinda picking up your point about CXL,
> is that we do make the coherency protocol more explicit by adding a
> coherency mode to dma_buf that the exporter sets. Some ideas for values
> this could have:
>
> - ATTOMIC_COHERENT: Fully cache coherent, including device/cpu atomis.
>    This would be for CXL. Non-CXL devices could still participate with the
>    old model using explicit devices flushes, but must at comply with
>    CPU_COHERENT.
>
>    There's also the power9-only nvlink that would fit here, but I guess
>    going forward CXL (and cache-coherent integrated gpu) would really be
>    the only users of this flag.
>
>    Peer2peer would have the same rules, otherwise doesn't really make
>    sense. Also we might want to forbib non-CXL imports for these buffers
>    maybe even? Not sure on that.
>
> - CPU_COHERENT: device transactions do snoop cpu devices caches, but
>    devices might do their own caching which isn't snooped by the cpu and
>    needs explicit device-side invalidate/flushing. This means pcie
>    importers are not allowed to use pcie no-snoop transactions, intel igpu
>    wouldn't be allowed to use MOCS that do the same, similar for arm
>    integrated devices.
>
>    Importers can skip all explicit cache management apis like
>    dma_buf_begin/end_cpu_access, or the newly proposed
>    dma_buf_begin/end_device_access here.
>
>    We'd need to figure out what exactly this means for peer2peer
>    transactions, I have no idea whether the no-snoop flag even does
>    anything for those.
>
>    We might also want to split up CPU_COHERENT into CPU_COHERENT_WB and
>    CPU_WOHERENT_WC, so that importers know whether cpu reads are going to
>    be crawling or not.
>
> - MEMORY_COHERENT: devices transactions do not snoop any caches, but
>    promise that all transactions are fully flushed to system memory. Any
>    devices transactions which do fill cpu caches must call the proposed
>    dma_buf_begin/end_device_access functions proposed here. Any cpu access
>    must be braketed by calls to dma_buf_begin/end_cpu_access.
>
>    If your device does fill cpu caches, then essentially you'd not be able
>    to import such buffers. Not sure whether we need to put the
>    responsibility of checking that onto importers or exporters. Ideally
>    core dma-buf.c code would check this.
>
>    Also maybe the cpu WC mapping mode would actually need to be a sub-mode
>    for MEMORY_COHERENT, because all cpu wc achieves is to avoid the need to
>    call dma_buf_begin/end_cpu_access, you would still need your devices to
>    be memory coherent. And if they're not, then you cannot use that
>    dma-buf.
>
>    Or maybe alternatively we need to guarantee that exporters which set
>    MEMORY_COHERENT implement dma_buf_begin/end_device_access to make things
>    work for these cpu-coherent but not memory-coherent devices. This
>    becomes very tricky with device/arch/bus specific details I think.
>
> - DMA_API_COHERENT: The memory is allocated or mapped by the dma-api, and
>    the exact coherency mode is not know. Importers _must_ braket all cpu
>    and device access with the respective dma_buf functions. This is
>    essentially the "we have no idea" default.
>
>    Note that exporters might export memory allocated with dma_map_alloc
>    with MEMORY_COHERENT or CPU_COHERENT if they know how the memory exactly
>    works. E.g. for most arm soc gpu/display drivers we can assume that the
>    dma-api gives us MEMORY_COHERENT or CPU_COHERENT_WC, and just use that.
>    Essentially this would make the current implicit assumptions explicit.
>
>    udmabuf would need to set this, definitely if Paul's patches to add the
>    explicit device flushes land.
>
> - DEFAULT_COHERENT: This would be the backwards compat legacy yolo
>    behvaior. I'm not sure whether we should alias that with
>    DMA_API_COHERENT or leave it as a special value to mark exporters which
>    haven't been updated for the much more explicit coherency handling yet.
>
>    The specification for this coherency mode would be a flat out "who
>    knows, just don't break existing use-cases with actual users".
>    Essentially the only reason we'd have this would be to make sure we can
>    avoid regressions of these existing use-cases, by keeping whatever
>    horrible heuristics we have in current exporters.
>
>    It would also allow us to convert exporters and importers on a case by
>    case basis.
>
> Note that all these coherency modes are defined in terms of bus-sepecific
> device access and in terms of dma_buf apis the importer must call or can
> skip. This way we'd avoid having to change the dma-api in a first step,
> and if this all works out properly we could then use the resulting dma-api
> as a baseline to propose dma-api extensions.

When I read this for the first time my initial impression was that the 
idea mostly looked good, but while thinking about it more and more I 
came to the conclusion that this would go into the wrong direction.

Maybe I'm repeating myself, but I think we first of all have to talk a 
bit about some aspects of coherency:

1. Intra device coherency. This means that intra devices caches are 
invalidated before beginning an operation and flushed before signaling 
that an operation finished.

2. Inter device and device to CPU coherency. This means that caches 
which sit in between devices and between devices and the CPU need to be 
invalidated and flushed appropriately when buffers are accessed by 
different parties.

Number 1 is device specific, part of the DMA-buf framework and handled 
by dma_fences. As far as I can see that part is actually quite well 
designed and I don't see any obvious need for change.

Number 2 is platform specific and I completely agree with the DMA-api 
folks that this doesn't belong into DMA-buf in the first place. That's 
why I think the begin_cpu_access()/end_cpu_access() callbacks are 
actually a bit misplaced. We still can use those in the exporter, but to 
make better buffer placement decisions, but should not invalidate any 
caches when they are called.

The flushing and invalidation for platform caches should really be in 
the DMA-buf framework and not the exporter.

So in my thinking the enumeration you outlined above should really go 
into struct device and explaining to everybody what the coherency 
properties of DMA operations of this device is.

> I think starting right out with designing dma-api extension is a few
> bridges too far. Both from a "how do we convince upstream" pov, but maybe
> even more from a "how do we figure out what we even need" pov.

Well I totally agree on the "how do we figure out what we even need", 
but I disagree a bit on that we don't know what DMA-api extension we need.

We don't have the full picture yet, but as I already outlined from the 
DMA-api pov we have two major things on our TODO list:

1. Somehow remove the struct pages from the DMA-buf *importer* API.

     My best suggestion at the moment for this is to split sg_tables 
into two data structures, one for the struct pages and one for the DMA 
addresses.

     Mangling the addresses to ensure that no importer messes with the 
struct pages was a good step, but it also creates problems when 
dma_sync_sg_for_cpu() dma_sync_for_device() are supposed to be called.

2. Add some dma_sync_sg_between_devices(A, B....).

     And on this I think we are on the same page that we are going to 
need this, but we are just not clear on who is going to use it.

Regards,
Christian.

>
>> Regards,
>> Christian.
>>
>>
>>
>>
>>
>>> -Sima
>> _______________________________________________
>> Linaro-mm-sig mailing list -- linaro-mm-sig@lists.linaro.org
>> To unsubscribe send an email to linaro-mm-sig-leave@lists.linaro.org