
media: videobuf2-dma-sg: limit the sg segment size

Message ID 20230828075420.2009568-1-anle.pan@nxp.com
State New

Commit Message

Anle Pan Aug. 28, 2023, 7:54 a.m. UTC
When allocating from pages, the size of each sg segment is unlimited,
defaulting to UINT_MAX. This will cause the DMA streaming mapping to
fail later with a "swiotlb buffer full" error: the default maximum
swiotlb mapping size is 128 slots x 2K = 256K, determined by
"IO_TLB_SEGSIZE". To fix the issue, limit the sg segment size according
to "dma_max_mapping_size" so that it matches the mapping limit.

Signed-off-by: Anle Pan <anle.pan@nxp.com>
---
 drivers/media/common/videobuf2/videobuf2-dma-sg.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Tomasz Figa Aug. 29, 2023, 10:03 a.m. UTC | #1
Hi Anle,

On Mon, Aug 28, 2023 at 8:57 AM Anle Pan <anle.pan@nxp.com> wrote:
>
> When allocating from pages, the size of the sg segment is unlimited and
> the default is UINT_MAX. This will cause the DMA stream mapping failed
> later with a "swiotlb buffer full" error.

Thanks for the patch. Good catch.

> The default maximum mapping
> size is 128 slots x 2K = 256K, determined by "IO_TLB_SEGSIZE".
> To fix the issue, limit the sg segment size according to
> "dma_max_mapping_size" to match the mapping limit.
>
> Signed-off-by: Anle Pan <anle.pan@nxp.com>
> ---
>  drivers/media/common/videobuf2/videobuf2-dma-sg.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> index fa69158a65b1..b608a7c5f240 100644
> --- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> +++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> @@ -105,6 +105,7 @@ static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
>         struct sg_table *sgt;
>         int ret;
>         int num_pages;
> +       size_t max_segment = 0;
>
>         if (WARN_ON(!dev) || WARN_ON(!size))
>                 return ERR_PTR(-EINVAL);
> @@ -134,8 +135,12 @@ static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
>         if (ret)
>                 goto fail_pages_alloc;
>
> -       ret = sg_alloc_table_from_pages(buf->dma_sgt, buf->pages,
> -                       buf->num_pages, 0, size, GFP_KERNEL);
> +       if (dev)
> +               max_segment = dma_max_mapping_size(dev);
> +       if (max_segment == 0)
> +               max_segment = UINT_MAX;
> +       ret = sg_alloc_table_from_pages_segment(buf->dma_sgt, buf->pages,
> +               buf->num_pages, 0, size, max_segment, GFP_KERNEL);

One thing that I'm not sure about here is that we use
sg_alloc_table_from_pages_segment(), but we actually don't pass the
max segment size (as returned by dma_get_max_seg_size()) to it.
I'm also not exactly sure what's the difference between "max mapping
size" and "max seg size".
+Robin Murphy +Christoph Hellwig I think we could benefit from your
expertise here.

Generally looking at videobuf2-dma-sg, I feel like we would benefit
from some kind of dma_alloc_table_from_pages() that simply takes the
struct dev pointer and does everything necessary.

Best regards,
Tomasz
Tomasz Figa Aug. 30, 2023, 3:59 a.m. UTC | #2
On Tue, Aug 29, 2023 at 8:14 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2023-08-29 11:03, Tomasz Figa wrote:
> > Hi Anle,
> >
> > On Mon, Aug 28, 2023 at 8:57 AM Anle Pan <anle.pan@nxp.com> wrote:
> >>
> >> When allocating from pages, the size of the sg segment is unlimited and
> >> the default is UINT_MAX. This will cause the DMA stream mapping failed
> >> later with a "swiotlb buffer full" error.
> >
> > Thanks for the patch. Good catch.
> >
> >> The default maximum mapping
> >> size is 128 slots x 2K = 256K, determined by "IO_TLB_SEGSIZE".
> >> To fix the issue, limit the sg segment size according to
> >> "dma_max_mapping_size" to match the mapping limit.
> >>
> >> Signed-off-by: Anle Pan <anle.pan@nxp.com>
> >> ---
> >>   drivers/media/common/videobuf2/videobuf2-dma-sg.c | 9 +++++++--
> >>   1 file changed, 7 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> >> index fa69158a65b1..b608a7c5f240 100644
> >> --- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> >> +++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> >> @@ -105,6 +105,7 @@ static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
> >>          struct sg_table *sgt;
> >>          int ret;
> >>          int num_pages;
> >> +       size_t max_segment = 0;
> >>
> >>          if (WARN_ON(!dev) || WARN_ON(!size))
> >>                  return ERR_PTR(-EINVAL);
> >> @@ -134,8 +135,12 @@ static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
> >>          if (ret)
> >>                  goto fail_pages_alloc;
> >>
> >> -       ret = sg_alloc_table_from_pages(buf->dma_sgt, buf->pages,
> >> -                       buf->num_pages, 0, size, GFP_KERNEL);
> >> +       if (dev)
>
> dev can't be NULL, see the context above.
>
> >> +               max_segment = dma_max_mapping_size(dev);
> >> +       if (max_segment == 0)
> >> +               max_segment = UINT_MAX;
> >> +       ret = sg_alloc_table_from_pages_segment(buf->dma_sgt, buf->pages,
> >> +               buf->num_pages, 0, size, max_segment, GFP_KERNEL);
> >
> > One thing that I'm not sure about here is that we use
> > sg_alloc_table_from_pages_segment(), but we actually don't pass the
> > max segment size (as returned by dma_get_max_seg_size()) to it.
> > I'm also not exactly sure what's the difference between "max mapping
> > size" and "max seg size".
> > +Robin Murphy +Christoph Hellwig I think we could benefit from your
> > expertise here.
>
> dma_get_max_seg_size() represents a capability of the device itself,
> namely the largest contiguous range it can be programmed to access in a
> single DMA descriptor/register/whatever. Conversely,
> dma_max_mapping_size() is a capability of the DMA API implementation, and
> represents the largest contiguous mapping it is guaranteed to be able to
> handle (each segment in the case of dma_map_sg(), or the whole thing for
> dma_map_page()). Most likely the thing you want here is
> min_not_zero(max_seg_size, max_mapping_size).
>

The extra complexity needed here convinces me even more that we need a helper...

> > Generally looking at videobuf2-dma-sg, I feel like we would benefit
> > from some kind of dma_alloc_table_from_pages() that simply takes the
> > struct dev pointer and does everything necessary.
>
> Possibly; this code already looks lifted from drm_prime_pages_to_sg(),
> and if it's needed here then presumably vb2_dma_sg_get_userptr() also
> needs it, at the very least.

That code probably predates drm_prime_pages_to_sg(), but that's
probably also the reason that it has its own issues...
I'm tempted to send a patch adding such a helper, but Christoph didn't
sound very keen on that idea. Hmm.

Best regards,
Tomasz
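
For reference, a minimal sketch of what folding Robin's
min_not_zero(max_seg_size, max_mapping_size) suggestion into the posted
change could look like. The helper name and placement are assumptions
for illustration only, not the submitted patch; the buf->dma_sgt,
buf->pages and buf->num_pages fields are taken from the code quoted
above.

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/minmax.h>
#include <linux/scatterlist.h>

/* Hypothetical helper: pick a segment limit that satisfies both the
 * device capability and the DMA API mapping limit. */
static int vb2_dma_sg_alloc_table(struct vb2_dma_sg_buf *buf,
				  struct device *dev, unsigned long size)
{
	size_t max_segment = min_not_zero((size_t)dma_get_max_seg_size(dev),
					  dma_max_mapping_size(dev));

	/* sg_alloc_table_from_pages_segment() takes an unsigned int limit. */
	if (!max_segment || max_segment > UINT_MAX)
		max_segment = UINT_MAX;

	return sg_alloc_table_from_pages_segment(buf->dma_sgt, buf->pages,
						 buf->num_pages, 0, size,
						 max_segment, GFP_KERNEL);
}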
Hui Fang Aug. 30, 2023, 8:50 a.m. UTC | #3
On Mon, Aug 28, 2023 at 8:57 AM Anle Pan <anle.pan@nxp.com> wrote:
>
> When allocating from pages, the size of the sg segment is unlimited and the
> default is UINT_MAX. This will cause the DMA stream mapping failed later
> with a "swiotlb buffer full" error. The default maximum mapping size is 128
> slots x 2K = 256K, determined by "IO_TLB_SEGSIZE".
> To fix the issue, limit the sg segment size according to
> "dma_max_mapping_size" to match the mapping limit.

I wonder if only NXP has met the "swiotlb buffer full" issue. In theory,
with the YUYV format, resolutions no greater than 320x240 (153600 bytes,
which is less than the 256K max mapping size) can't hit the issue.
But resolutions of 640x480 (614400 bytes) and above may have a chance to
trigger the issue.

BRs,
FangHui
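
For reference, the frame-size arithmetic above, assuming YUYV at 2 bytes
per pixel and the 256K swiotlb limit (a standalone userspace check, not
kernel code):

#include <stdio.h>

int main(void)
{
	const unsigned int swiotlb_limit = 128 * 2048;  /* IO_TLB_SEGSIZE x 2K = 256K */
	const unsigned int qvga_yuyv = 320 * 240 * 2;   /* 153600 bytes */
	const unsigned int vga_yuyv = 640 * 480 * 2;    /* 614400 bytes */

	/* QVGA fits in one swiotlb mapping, VGA does not. */
	printf("QVGA fits: %s\n", qvga_yuyv <= swiotlb_limit ? "yes" : "no");
	printf("VGA fits:  %s\n", vga_yuyv <= swiotlb_limit ? "yes" : "no");
	return 0;
}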
Jason Gunthorpe Aug. 30, 2023, 4:43 p.m. UTC | #4
On Wed, Aug 30, 2023 at 04:33:41PM +0200, Christoph Hellwig wrote:
> On Wed, Aug 30, 2023 at 12:47:57PM +0900, Tomasz Figa wrote:
> > Do we see anything replacing it widely anywhere on the short-middle
> > term horizon? I think we could possibly migrate vb2 to use that new
> > thing internally and just provide some compatibility X to scatterlist
> > conversion function for the drivers.
> 
> Jason said at LSF/MM that he had a prototype for a mapping API that
> takes a phys/len array as input and dma_addr/len as output, which really
> is the right thing to do, especially for dmabuf.

Yes, still a prototype. Given the change in direction, some of the
assumptions of the list design will need adjusting.

I felt there wasn't much justification to add a list without also
supporting P2P, and it was not looking very good to give the DMA
API proper P2P support without also connecting it to lists somehow.

Anyhow, I had drafted a basic list data structure and a starting
implementation that is structured in a way similar to xarray (e.g.
with fixed chunks, general purpose, etc.)

https://github.com/jgunthorpe/linux/commit/58d7e0578a09d9cd2360be515208bcd74ade5958

I would still call it an experiment as the auto-sizing entry approach
needs benchmarking to see what it costs.

> Jason, what's the status of your work?

After we talked I changed the order of the work: instead of starting
with the SG list side, I wanted to progress on the DMA API side and
then build any list infrastructure on that new API.

Your idea to use a new API that could allocate IOVA and then map phys
to IOVA as a DMA API entry point looked good to me, and it is a
perfect map for what RDMA's ODP stuff needs. So I changed direction to
rework ODP and introduce the DMA API parts as the first step.

I had to put it aside due to the volume of iommu/vfio stuff going on
right now. However, I think I will have someone who can work on the
ODP driver part this month.

Regardless, RDMA really needs some kind of generic lists to hold CPU
and physical address ranges, so I still see that as being part of the
ultimate solution.

Jason
Christoph Hellwig Sept. 1, 2023, 6:10 a.m. UTC | #5
On Thu, Aug 31, 2023 at 12:32:38PM -0300, Jason Gunthorpe wrote:
> The entry is variable sized, so it depends on what is stuffed in
> it. For a lot of common use cases, especially RDMA page lists, it will
> be able to use an 8 byte entry. This is pretty much the most space
> efficient it could be.

How do you get away with a 8 byte entry for addr+len?

> The primary alternative I see is a fixed 16 bytes/entry with a 64 bit
> address and ~60 bit length + ~4 bits of flags. This is closer to bio,
> simpler and faster, but makes the RDMA cases 2x bigger.

That's what I'd expect.

> With your direction I felt we could safely keep bio as it is and
> cheaply make a fast DMA mapper for it. Provide something like this as
> the 'kitchen sink' version for dmabuf/rdma/etc that are a little
> different.

So for the first version I see no need to change the bio_vec
representation as part of this project, but at the same time the
bio_vec representation causes problems for other reasons.  So I want
to change it anyway.
Jason Gunthorpe Sept. 1, 2023, 2:26 p.m. UTC | #6
On Fri, Sep 01, 2023 at 08:10:14AM +0200, Christoph Hellwig wrote:
> On Thu, Aug 31, 2023 at 12:32:38PM -0300, Jason Gunthorpe wrote:
> > The entry is variable sized, so it depends on what is stuffed in
> > it. For a lot of common use cases, especially RDMA page lists, it will
> > be able to use an 8 byte entry. This is pretty much the most space
> > efficient it could be.
> 
> How do you get away with a 8 byte entry for addr+len?

It's a compression scheme. The basic base/length/flags layout has a lot
of zero bits in common cases.

I was drafting:

 2 bits for 'encoding == 8 bytes'
 2 bits for flags
 28 bits for length
 32 bits for address >> 12

So if the range has zero bits in the right places then it fits in
8 bytes.

Otherwise the compressor will choose a 16 byte entry:

 2 bits for 'encoding == 16 bytes'
 2 bits for flags
 36 bits for length
 64 bits for address
 24 bits for offset

And a 24 byte entry with 36 bits of flags and no limitations.

So we can store anything, but common cases of page lists will use only
8 bytes/entry.

This is a classic compression trade-off: better space efficiency for
long-term storage, worse iteration efficiency.

> > With your direction I felt we could safely keep bio as it is and
> > cheaply make a fast DMA mapper for it. Provide something like this as
> > the 'kitchen sink' version for dmabuf/rdma/etc that are a little
> > different.
> 
> So for the first version I see no need to change the bio_vec
> representation as part of this project, 

Right

> but at the same time the bio_vec representation causes problems for
> other reasons.  So I want to change it anyway.

I don't feel competent in this area, so I'm not sure what this will be.

I was hoping to come back with some data and benchmarks so we could
consider the options. The appeal of a smaller long-term memory footprint
for the RDMA use case is interesting enough to look into.

Jason
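
To make the 8-byte encoding above concrete, here is a rough sketch of
how such an entry could be packed. The macro names, field placement and
helpers are purely illustrative assumptions, not taken from Jason's
prototype.

#include <linux/bitfield.h>
#include <linux/bits.h>
#include <linux/limits.h>
#include <linux/types.h>

/* 2-bit encoding tag, 2-bit flags, 28-bit length, 32 bits of addr >> 12. */
#define ENT8_ENC	GENMASK_ULL(63, 62)
#define ENT8_FLAGS	GENMASK_ULL(61, 60)
#define ENT8_LEN	GENMASK_ULL(59, 32)
#define ENT8_PFN	GENMASK_ULL(31, 0)

/* The compact form only works for 4K-aligned ranges whose shifted
 * address and length fit the narrow fields. */
static inline bool ent8_fits(u64 addr, u64 len, u32 flags)
{
	return !(addr & 0xfff) && (addr >> 12) <= U32_MAX &&
	       len < BIT_ULL(28) && flags < 4;
}

static inline u64 ent8_pack(u64 addr, u64 len, u32 flags)
{
	return FIELD_PREP(ENT8_ENC, 0) |	/* 0 == "8-byte entry" */
	       FIELD_PREP(ENT8_FLAGS, flags) |
	       FIELD_PREP(ENT8_LEN, len) |
	       FIELD_PREP(ENT8_PFN, addr >> 12);
}

Ranges that fail ent8_fits() would fall back to the 16- or 24-byte
forms Jason describes.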
Hui Fang Sept. 4, 2023, 7:10 a.m. UTC | #7
On Wed, Aug 30, 2023 at 6:20 PM Tomasz Figa <tfiga@chromium.org> wrote:
> On Wed, Aug 30, 2023 at 5:50 PM Hui Fang <hui.fang@nxp.com> wrote:
......
> > I wonder if only NXP met the "swiotlb buffer full" issue. In theory,
> > when format is YUYV, those resolutions no greater than 320x240 (153600
> > bytes, which less than the max mapping size 256K ) can't meet the issue.
> > But resolutions no less than 640x480 (307200 bytes), may have chances
> > to trigger the issue.
> 
> I think a combination of a device that can support scatter-gather, but requires
> swiotlb is kind of rare. Most drivers would require a single contiguous DMA
> address (and thus use videobuf2-dma-contig) and physical discontinuity would
> be handled by an IOMMU (transparently to vb2).
> 
> I guess one thing to ask you about your specific setup is whether it's expected
> that the swiotlb ends up being used at all?

Yes, the swiotlb ends up being used. We met the issue when bringing up
DeviceAsWebCam (a new Android 14 feature where the Android device works
as a USB camera).


BRs,
Fang Hui
Tomasz Figa Sept. 5, 2023, 3:43 a.m. UTC | #8
On Mon, Sep 4, 2023 at 4:10 PM Hui Fang <hui.fang@nxp.com> wrote:
>
> On Wed, Aug 30, 2023 at 6:20 PM Tomasz Figa <tfiga@chromium.org> wrote:
> > On Wed, Aug 30, 2023 at 5:50 PM Hui Fang <hui.fang@nxp.com> wrote:
> ......
> > > I wonder if only NXP met the "swiotlb buffer full" issue. In theory,
> > > when format is YUYV, those resolutions no greater than 320x240 (153600
> > > bytes, which less than the max mapping size 256K ) can't meet the issue.
> > > But resolutions no less than 640x480 (307200 bytes), may have chances
> > > to trigger the issue.
> >
> > I think a combination of a device that can support scatter-gather, but requires
> > swiotlb is kind of rare. Most drivers would require a single contiguous DMA
> > address (and thus use videobuf2-dma-contig) and physical discontinuity would
> > be handled by an IOMMU (transparently to vb2).
> >
> > I guess one thing to ask you about your specific setup is whether it's expected
> > that the swiotlb ends up being used at all?
>
> Yes, the swiotlb ends up being used. We met the issue when bring up
> DeviceAsWebCam (android-14 new feature, android device works as a usb camera).

I see. I guess the mapping is done by the USB gadget controller
driver? Could you point us to the exact driver that's used?

Just to clarify, swiotlb should only be needed in the very extreme
fallback case, because of the performance impact of the memory copy
back and forth. The right approach would depend on the DMA
capabilities of your device, though.

Best regards,
Tomasz
Hui Fang Sept. 6, 2023, 8:16 a.m. UTC | #9
On Wed, Sep 5, 2023 at 12:44 AM Tomasz Figa <tfiga@chromium.org> wrote:
> 
> I see. I guess the mapping is done by the USB gadget controller driver? Could
> you point us to the exact driver that's used?
> 
> Just to clarify, swiotlb should only be needed in the very extreme fallback case,
> because of the performance impact of the memory copy back and forth. The
> right approach would depend on the DMA capabilities of your device, though.


[  138.493943][ T2104] Call trace:
[  138.497090][ T2104]  vb2_dma_sg_alloc+0x2ec/0x2fc
[  138.501808][ T2104]  __vb2_queue_alloc+0x224/0x724
[  138.506608][ T2104]  vb2_core_reqbufs+0x374/0x528
[  138.511320][ T2104]  vb2_reqbufs+0xe0/0xf4
[  138.515428][ T2104]  uvcg_alloc_buffers+0x18/0x34
[  138.520159][ T2104]  uvc_v4l2_reqbufs+0x38/0x54
[  138.524703][ T2104]  v4l_reqbufs+0x68/0x80
[  138.528820][ T2104]  __video_do_ioctl+0x370/0x4dc
[  138.533535][ T2104]  video_usercopy+0x43c/0xb38
[  138.538076][ T2104]  video_ioctl2+0x18/0x28
[  138.542272][ T2104]  v4l2_ioctl+0x6c/0x84
[  138.546291][ T2104]  __arm64_sys_ioctl+0xa8/0xe4
[  138.550928][ T2104]  invoke_syscall+0x58/0x114
[  138.555389][ T2104]  el0_svc_common+0x88/0xfc
[  138.559755][ T2104]  do_el0_svc+0x2c/0xb8
[  138.563776][ T2104]  el0_svc+0x2c/0xa4
[  138.567544][ T2104]  el0t_64_sync_handler+0x68/0xb4
[  138.572434][ T2104]  el0t_64_sync+0x1a4/0x1a8
[  138.576803][ T2104] Code: 17ffffcb 928002b3 d4210000 17ffffc8 (d4210000) 
[  138.583598][ T2104] ---[ end trace 0000000000000000 ]---

Also, the code below should explain why vb2_dma_sg_alloc() is used.
We tested on i.MX 8MP, which uses the dwc3 controller.

In drivers/usb/dwc3/gadget.c:
dwc->gadget->sg_supported       = true;

In drivers/usb/gadget/function/uvc_queue.c
if (cdev->gadget->sg_supported) {
	queue->queue.mem_ops = &vb2_dma_sg_memops;
	queue->use_sg = 1;
} else {
	queue->queue.mem_ops = &vb2_vmalloc_memops;
}


BRs,
Fang Hui
Hans Verkuil Sept. 6, 2023, 8:52 a.m. UTC | #10
Hi all,

On 29/08/2023 13:14, Robin Murphy wrote:
> On 2023-08-29 11:03, Tomasz Figa wrote:
>> Hi Anle,
>>
>> On Mon, Aug 28, 2023 at 8:57 AM Anle Pan <anle.pan@nxp.com> wrote:
>>>
>>> When allocating from pages, the size of the sg segment is unlimited and
>>> the default is UINT_MAX. This will cause the DMA stream mapping failed
>>> later with a "swiotlb buffer full" error.
>>
>> Thanks for the patch. Good catch.
>>
>>> The default maximum mapping
>>> size is 128 slots x 2K = 256K, determined by "IO_TLB_SEGSIZE".
>>> To fix the issue, limit the sg segment size according to
>>> "dma_max_mapping_size" to match the mapping limit.
>>>
>>> Signed-off-by: Anle Pan <anle.pan@nxp.com>
>>> ---
>>>   drivers/media/common/videobuf2/videobuf2-dma-sg.c | 9 +++++++--
>>>   1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
>>> index fa69158a65b1..b608a7c5f240 100644
>>> --- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c
>>> +++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
>>> @@ -105,6 +105,7 @@ static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
>>>          struct sg_table *sgt;
>>>          int ret;
>>>          int num_pages;
>>> +       size_t max_segment = 0;
>>>
>>>          if (WARN_ON(!dev) || WARN_ON(!size))
>>>                  return ERR_PTR(-EINVAL);
>>> @@ -134,8 +135,12 @@ static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
>>>          if (ret)
>>>                  goto fail_pages_alloc;
>>>
>>> -       ret = sg_alloc_table_from_pages(buf->dma_sgt, buf->pages,
>>> -                       buf->num_pages, 0, size, GFP_KERNEL);
>>> +       if (dev)
> 
> dev can't be NULL, see the context above.
> 
>>> +               max_segment = dma_max_mapping_size(dev);
>>> +       if (max_segment == 0)
>>> +               max_segment = UINT_MAX;
>>> +       ret = sg_alloc_table_from_pages_segment(buf->dma_sgt, buf->pages,
>>> +               buf->num_pages, 0, size, max_segment, GFP_KERNEL);
>>
>> One thing that I'm not sure about here is that we use
>> sg_alloc_table_from_pages_segment(), but we actually don't pass the
>> max segment size (as returned by dma_get_max_seg_size()) to it.
>> I'm also not exactly sure what's the difference between "max mapping
>> size" and "max seg size".
>> +Robin Murphy +Christoph Hellwig I think we could benefit from your
>> expertise here.
> 
> dma_get_max_seg_size() represents a capability of the device itself, namely the largest contiguous range it can be programmed to access in a single DMA descriptor/register/whatever. Conversely,
> dma_max_mapping_size() is a capability of the DMA API implementation, and represents the largest contiguous mapping it is guaranteed to be able to handle (each segment in the case of dma_map_sg(), or
> the whole thing for dma_map_page()). Most likely the thing you want here is min_not_zero(max_seg_size, max_mapping_size).
> 
>> Generally looking at videobuf2-dma-sg, I feel like we would benefit
>> from some kind of dma_alloc_table_from_pages() that simply takes the
>> struct dev pointer and does everything necessary.
> 
> Possibly; this code already looks lifted from drm_prime_pages_to_sg(), and if it's needed here then presumably vb2_dma_sg_get_userptr() also needs it, at the very least.

I also see sg_alloc_table_from_pages being called in vb2_dc_get_userptr in videobuf2-dma-contig.c,
presumably that needs the same fix?
Tomasz Figa Sept. 6, 2023, 9:26 a.m. UTC | #11
On Wed, Sep 6, 2023 at 5:52 PM Hans Verkuil <hverkuil@xs4all.nl> wrote:
>
> Hi all,
>
> On 29/08/2023 13:14, Robin Murphy wrote:
> > On 2023-08-29 11:03, Tomasz Figa wrote:
> >> Hi Anle,
> >>
> >> On Mon, Aug 28, 2023 at 8:57 AM Anle Pan <anle.pan@nxp.com> wrote:
> >>>
> >>> When allocating from pages, the size of the sg segment is unlimited and
> >>> the default is UINT_MAX. This will cause the DMA stream mapping failed
> >>> later with a "swiotlb buffer full" error.
> >>
> >> Thanks for the patch. Good catch.
> >>
> >>> The default maximum mapping
> >>> size is 128 slots x 2K = 256K, determined by "IO_TLB_SEGSIZE".
> >>> To fix the issue, limit the sg segment size according to
> >>> "dma_max_mapping_size" to match the mapping limit.
> >>>
> >>> Signed-off-by: Anle Pan <anle.pan@nxp.com>
> >>> ---
> >>>   drivers/media/common/videobuf2/videobuf2-dma-sg.c | 9 +++++++--
> >>>   1 file changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> >>> index fa69158a65b1..b608a7c5f240 100644
> >>> --- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> >>> +++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> >>> @@ -105,6 +105,7 @@ static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
> >>>          struct sg_table *sgt;
> >>>          int ret;
> >>>          int num_pages;
> >>> +       size_t max_segment = 0;
> >>>
> >>>          if (WARN_ON(!dev) || WARN_ON(!size))
> >>>                  return ERR_PTR(-EINVAL);
> >>> @@ -134,8 +135,12 @@ static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
> >>>          if (ret)
> >>>                  goto fail_pages_alloc;
> >>>
> >>> -       ret = sg_alloc_table_from_pages(buf->dma_sgt, buf->pages,
> >>> -                       buf->num_pages, 0, size, GFP_KERNEL);
> >>> +       if (dev)
> >
> > dev can't be NULL, see the context above.
> >
> >>> +               max_segment = dma_max_mapping_size(dev);
> >>> +       if (max_segment == 0)
> >>> +               max_segment = UINT_MAX;
> >>> +       ret = sg_alloc_table_from_pages_segment(buf->dma_sgt, buf->pages,
> >>> +               buf->num_pages, 0, size, max_segment, GFP_KERNEL);
> >>
> >> One thing that I'm not sure about here is that we use
> >> sg_alloc_table_from_pages_segment(), but we actually don't pass the
> >> max segment size (as returned by dma_get_max_seg_size()) to it.
> >> I'm also not exactly sure what's the difference between "max mapping
> >> size" and "max seg size".
> >> +Robin Murphy +Christoph Hellwig I think we could benefit from your
> >> expertise here.
> >
> > dma_get_max_seg_size() represents a capability of the device itself, namely the largest contiguous range it can be programmed to access in a single DMA descriptor/register/whatever. Conversely,
> > dma_max_mapping_size() is a capability of the DMA API implementation, and represents the largest contiguous mapping it is guaranteed to be able to handle (each segment in the case of dma_map_sg(), or
> > the whole thing for dma_map_page()). Most likely the thing you want here is min_not_zero(max_seg_size, max_mapping_size).
> >
> >> Generally looking at videobuf2-dma-sg, I feel like we would benefit
> >> from some kind of dma_alloc_table_from_pages() that simply takes the
> >> struct dev pointer and does everything necessary.
> >
> > Possibly; this code already looks lifted from drm_prime_pages_to_sg(), and if it's needed here then presumably vb2_dma_sg_get_userptr() also needs it, at the very least.
>
> I also see sg_alloc_table_from_pages being called in vb2_dc_get_userptr in videobuf2-dma-contig.c,
> presumably that needs the same fix?
>
> From what I gather from this thread there are no improved helpers on the immediate
> horizon, so this issue has to be fixed in videobuf2 for now.

Agreed. (Although I suspect the real issue that NXP is having isn't
really this and this is just a side effect.)

>
> So this requires a v2 that fixes this also in vb2_dma_sg_get_userptr and vb2_dc_get_userptr,
> correct? If so, then it would be nice if Anle can post a v2 with those changes.

If we need to fix 3 different callers, could we at least add an
internal vb2 helper for this (e.g. vb2_asg_alloc_table_from_pages())?

>
> Note that when I grepped for sg_alloc_table_from_pages_segment users, I noticed that
> in most cases dma_max_mapping_size is used, but in one case it uses dma_get_max_seg_size
> (drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c). I have no idea what the difference is
> between the two.

FWIW, Robin explained that earlier in the thread. To be fully correct,
we need to consider both...

>
> One small change that would make sg_alloc_table_from_pages_segment() a bit easier
> to work with is if it would replace a max_segment value of 0 with UINT_MAX. Then
> you can just stick in dma_max_mapping_size(dev) as the argument.
>
> Alternatively, if we can be certain that dma_max_mapping_size(dev) never returns 0,
> then that 'if (max_segment == 0)' part can just be dropped.

That's also the reason I wanted a helper. Not sure if it's so easy to
change the calling convention now.

>
> I also wonder if any of the other (non-media) users of sg_alloc_table_from_pages
> would need to use sg_alloc_table_from_pages_segment as well. Is this really
> media specific? Or is media just the most likely subsystem to hit this issue due
> to the large amounts of memory it uses?

I don't think it's media-specific, but also, as I said earlier, it
should be really rare to hit it - swiotlb isn't normally expected on
reasonably capable systems with buffer allocation done correctly.

Best regards,
Tomasz
Hui Fang Sept. 11, 2023, 6:13 a.m. UTC | #12
On Wed, Sep 6, 2023 at 18:28 PM Tomasz Figa <tfiga@chromium.org> wrote:
> That all makes sense, but it still doesn't answer the real question on why
> swiotlb ends up being used. I think you may want to trace what happens in
> the DMA mapping ops implementation on your system causing it to use
> swiotlb.

After adding a log and deliberately feeding invalid data into the bounce
buffer, it's confirmed that swiotlb is actually used.

Got log as
"[  846.570271][  T138] software IO TLB: ==== swiotlb_bounce: DMA_TO_DEVICE,
 dst 000000004589fa38, src 00000000c6d7e8d8, srcPhy 5504139264, size 4096".

" srcPhy 5504139264" is larger than 4G (8mp has DRAM over 5G).
And "CONFIG_ZONE_DMA32=y" is set in the kernel config, so the static
swiotlb is used. Also, the host (Win10) side can't get a valid image.

Code as below.
diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
index 7f83a86e6810..de03704ce695 100644
--- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c
+++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
@@ -98,6 +98,7 @@ static int vb2_dma_sg_alloc_compacted(struct vb2_dma_sg_buf *buf,
        return 0;
 }
 
+bool g_v4l2 = false;
 static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
                              unsigned long size)
 {
@@ -144,6 +145,7 @@ static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
        if (ret)
                goto fail_table_alloc;
 
+       g_v4l2 = true;
        pr_info("==== vb2_dma_sg_alloc, call sg_alloc_table_from_pages_segment, size %d, max_segment %d\n",
                (int)size, (int)max_segment);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index dac01ace03a0..a2cda646a02f 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -523,6 +523,7 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
        return addr & dma_get_min_align_mask(dev) & (IO_TLB_SIZE - 1);
 }
 
+extern bool g_v4l2;
 /*
  * Bounce: copy the swiotlb buffer from or back to the original dma location
  */
@@ -591,8 +592,19 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
                }
        } else if (dir == DMA_TO_DEVICE) {
                memcpy(vaddr, phys_to_virt(orig_addr), size);
+               if (g_v4l2) {
+                       static unsigned char val;
+                       val++;
+                       memset(vaddr, val, size);
+
+                       pr_info("====xx %s: DMA_TO_DEVICE, dst %p, src %p, srcPhy %llu, size %zu\n",
+                               __func__, vaddr, phys_to_virt(orig_addr), orig_addr, size);
+               }
        } else {
                memcpy(phys_to_virt(orig_addr), vaddr, size);
        }
 }


BRs,
Fang Hui
Tomasz Figa Sept. 12, 2023, 2:22 a.m. UTC | #13
On Mon, Sep 11, 2023 at 3:13 PM Hui Fang <hui.fang@nxp.com> wrote:
>
> On Wed, Sep 6, 2023 at 18:28 PM Tomasz Figa <tfiga@chromium.org> wrote:
> > That all makes sense, but it still doesn't answer the real question on why
> > swiotlb ends up being used. I think you may want to trace what happens in
> > the DMA mapping ops implementation on your system causing it to use
> > swiotlb.
>
> Add log and feed invalid data to low buffer on purpose,
> it's confirmed that swiotlb is actually used.
>

Yes, that we already know. But why?

> Got log as
> "[  846.570271][  T138] software IO TLB: ==== swiotlb_bounce: DMA_TO_DEVICE,
>  dst 000000004589fa38, src 00000000c6d7e8d8, srcPhy 5504139264, size 4096".
>
> " srcPhy 5504139264" is larger than 4G (8mp has DRAM over 5G).
> And "CONFIG_ZONE_DMA32=y" in kernel config, so swiotlb static is used.
> Also, the host (win10) side can't get valid image.
>
> Code as below.
> diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> index 7f83a86e6810..de03704ce695 100644
> --- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> +++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> @@ -98,6 +98,7 @@ static int vb2_dma_sg_alloc_compacted(struct vb2_dma_sg_buf *buf,
>         return 0;
>  }
>
> +bool g_v4l2 = false;
>  static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
>                               unsigned long size)
>  {
> @@ -144,6 +145,7 @@ static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
>         if (ret)
>                 goto fail_table_alloc;
>
> +       g_v4l2 = true;
>         pr_info("==== vb2_dma_sg_alloc, call sg_alloc_table_from_pages_segment,
>                         size %d, max_segment %d\n", (int)size, (int)max_segment);
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index dac01ace03a0..a2cda646a02f 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -523,6 +523,7 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
>         return addr & dma_get_min_align_mask(dev) & (IO_TLB_SIZE - 1);
>  }
>
> +extern bool g_v4l2;
>  /*
>   * Bounce: copy the swiotlb buffer from or back to the original dma location
>   */
> @@ -591,8 +592,19 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
>                 }
>         } else if (dir == DMA_TO_DEVICE) {
>                 memcpy(vaddr, phys_to_virt(orig_addr), size);
> +               if (g_v4l2) {
> +                       static unsigned char val;
> +                       val++;
> +                       memset(vaddr, val, size);
> +
> +                       pr_info("====xx %s: DMA_TO_DEVICE, dst %p, src %p, srcPhy %llu, size %zu\n",
> +                               __func__, vaddr, phys_to_virt(orig_addr), orig_addr, size);
> +               }
>         } else {
>                 memcpy(phys_to_virt(orig_addr), vaddr, size);
>         }
>  }
>
>
> BRs,
> Fang Hui
>
Hui Fang Sept. 12, 2023, 7:43 a.m. UTC | #14
On Tue, Sep 12, 2023 at 4:11 PM Tomasz Figa <tfiga@chromium.org> wrote:
> Is your DMA device restricted only to the bottom-most 4 GB (32-bit DMA
> address)? If yes, would it make sense to also allocate from that area rather
> than bouncing the memory?

The DMA device uses 32-bit DMA addresses. From user space, we can't
control the v4l2 buffer address, so we may still need to change the
code of vb2_dma_sg_alloc().

BRs,
Fang Hui
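
One possible direction for Tomasz's suggestion, assuming the gadget
controller really is limited to 32-bit DMA addresses: request
ZONE_DMA32 pages up front via the queue's gfp_flags, which vb2-dma-sg
ORs into its page allocations, instead of relying on swiotlb bouncing.
An untested sketch against the uvc_queue.c snippet quoted earlier:

/* drivers/usb/gadget/function/uvc_queue.c (hypothetical change) */
if (cdev->gadget->sg_supported) {
	queue->queue.mem_ops = &vb2_dma_sg_memops;
	/* Allocate below 4 GiB so the controller can address the pages
	 * directly and swiotlb bouncing is not needed. */
	queue->queue.gfp_flags = GFP_DMA32;
	queue->use_sg = 1;
} else {
	queue->queue.mem_ops = &vb2_vmalloc_memops;
}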
Hui Fang Sept. 18, 2023, 2:28 a.m. UTC | #15
> On Wed, Sep 13, 2023 at 21:17 PM Fang Hui <hui.fang@nxp.com > wrote:
> > above the Signed-off-by line if you don't mind. Thanks.
> 
> Sure. Will verify on other i.MX boards, then push.

Ref https://lore.kernel.org/all/20230914145812.12851-1-hui.fang@nxp.com/

BRs,
Fang Hui

Patch

diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
index fa69158a65b1..b608a7c5f240 100644
--- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c
+++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
@@ -105,6 +105,7 @@  static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
 	struct sg_table *sgt;
 	int ret;
 	int num_pages;
+	size_t max_segment = 0;
 
 	if (WARN_ON(!dev) || WARN_ON(!size))
 		return ERR_PTR(-EINVAL);
@@ -134,8 +135,12 @@  static void *vb2_dma_sg_alloc(struct vb2_buffer *vb, struct device *dev,
 	if (ret)
 		goto fail_pages_alloc;
 
-	ret = sg_alloc_table_from_pages(buf->dma_sgt, buf->pages,
-			buf->num_pages, 0, size, GFP_KERNEL);
+	if (dev)
+		max_segment = dma_max_mapping_size(dev);
+	if (max_segment == 0)
+		max_segment = UINT_MAX;
+	ret = sg_alloc_table_from_pages_segment(buf->dma_sgt, buf->pages,
+		buf->num_pages, 0, size, max_segment, GFP_KERNEL);
 	if (ret)
 		goto fail_table_alloc;