
[0/5] udmabuf bug fix and some improvements

Message ID 20240801104512.4056860-1-link@vivo.com

Message

Huan Yang Aug. 1, 2024, 10:45 a.m. UTC
This patchset attempts to fix some errors in udmabuf and remove the
unpin_list structure.

Some of these fixes simply gather patches I uploaded before.

Patch1
===
Remove the page-fault-based mmap and map the memory directly.
The current udmabuf has already obtained and pinned the folios by the
time creation completes. This means that the physical memory has
already been acquired, rather than being accessed dynamically. The
current page fault method only saves some page table memory.

As a result, the page fault mechanism has lost its purpose as demand
paging. Because a page fault must trap into kernel mode to fill in the
mapping when the corresponding virtual address in the mmap region is
first accessed, every such user-mode access incurs a trap into kernel
mode.

Therefore, when creating a large udmabuf, this represents considerable
overhead.

This patch therefore removes the page fault method of mmap and instead
fills the mapping directly when mmap is called.

This is achieved by using the scatter-gather table to establish a
linear relationship over the pages. Calling remap_pfn_range does not
invalidate the previously set VMA flags.
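
For reference, a condensed sketch of the resulting mmap path (this
mirrors the diff quoted in the replies below; error handling trimmed):

	static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
	{
		struct udmabuf *ubuf = buf->priv;
		unsigned long addr = vma->vm_start;
		struct sg_page_iter piter;

		/* Pages are already pinned at creation time, so walk the
		 * sg table once and map every page eagerly instead of
		 * installing a fault handler. */
		for_each_sgtable_page(ubuf->sg, &piter, vma->vm_pgoff) {
			struct page *page = sg_page_iter_page(&piter);
			int ret = remap_pfn_range(vma, addr, page_to_pfn(page),
						  PAGE_SIZE, vma->vm_page_prot);

			if (ret)
				return ret;
			addr += PAGE_SIZE;
			if (addr >= vma->vm_end)
				break;
		}
		return 0;
	}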

Patch2
===
This is the same as the patch:
https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
It is simply gathered into this patchset.

Patch3
===
The current implementation of udmabuf's vmap has issues.

It does not correctly set each page of a folio into the pages array,
so when vmap is called, every entry is the head page of its folio.

This implementation is not the same as this patch:
https://lore.kernel.org/all/20240731090233.1343559-1-link@vivo.com/

This version reuses the sg table to map all pages into the vmalloc
area.

Patch4
===
Wrap the repeated calls to get_sg_table in a helper function.
udmabuf->sg is set with cmpxchg, which should prevent concurrent
update races (mmap does not take a lock). See the sketch below.
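
A minimal sketch of that lockless publish, assuming a lazily built
cached table (the helper name and unwind path are illustrative, not
the exact patch):

	static struct sg_table *udmabuf_get_cached_sg(struct udmabuf *ubuf,
						      struct dma_buf *buf)
	{
		struct sg_table *sg = READ_ONCE(ubuf->sg);

		if (sg)
			return sg;

		sg = get_sg_table(NULL, buf, 0);
		if (IS_ERR(sg))
			return sg;

		/* Publish only if we won the race; otherwise drop ours
		 * and reuse the table another thread installed first. */
		if (cmpxchg(&ubuf->sg, NULL, sg) != NULL) {
			sg_free_table(sg);
			kfree(sg);
			sg = ubuf->sg;
		}
		return sg;
	}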

Patch5
===
Attempt to remove unpin_list and other related data structures.

To adapt to folios, the unpin_list data structure was established to
unpin all folios and maintain the page mapping relationship.

However, this data structure requires 24 bytes for each page, list
traversal is slow, and maintaining the offsets array also consumes a
portion of memory.

This patch attempts to remove these data structures and modify the
semantics of some existing ones.

udmabuf:
  folios -> folio array that contains each folio only once (the
original contained duplicates).
  add item_offset -> indexed by create item, records the start offset
within each memfd.
  add item_size -> indexed by create item, records the size within
each memfd.
  add nr_folios -> number of entries in the folios array.

So, when building the sg table, iterate as follows:
  if the size crosses an item boundary (item_size), take care of that
item's start offset within its folio;
  for each folio, set each page into the sgl until the folio's size is
reached.

This patch also removes the per-item folio array created for each
create item; instead, a pointer into the ubuf->folios array is used,
sliding forward to fill each item's folios directly into the array.

After the modification, the various data structures in udmabuf have the
following corresponding relationships (sketched as a struct below):
  pagecount * PAGE_SIZE = sum(folio_size(folios[i])), i = 0 .. nr_folios - 1
  pagecount * PAGE_SIZE = sum(item_size[i]), i = 0 .. item_count - 1
(item_count itself is not recorded)
  item_offset[i] records each memfd's start offset if one was given,
else 0.
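
A hypothetical sketch of the reworked struct, with the invariants above
noted as comments (field names follow this description; the actual
patch layout may differ):

	struct udmabuf {
		pgoff_t pagecount;	/* total buffer size in pages */

		/* Deduplicated folio array:
		 * pagecount * PAGE_SIZE == sum(folio_size(folios[i])),
		 * i = 0 .. nr_folios - 1 */
		struct folio **folios;
		pgoff_t nr_folios;

		/* Per create-item bookkeeping, one entry per memfd range:
		 * pagecount * PAGE_SIZE == sum(item_size[i]) */
		loff_t *item_offset;	/* start offset in each memfd, else 0 */
		size_t *item_size;	/* bytes contributed by each memfd */

		struct sg_table *sg;	/* cached table, built lazily */
		struct miscdevice *device;
	};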

Huan Yang (5):
  udmabuf: cancel mmap page fault, direct map it
  udmabuf: change folios array from kmalloc to kvmalloc
  udmabuf: fix vmap_udmabuf error page set
  udmabuf: add get_sg_table helper function
  udmabuf: remove folio pin list

 drivers/dma-buf/udmabuf.c | 270 +++++++++++++++++++++-----------------
 1 file changed, 148 insertions(+), 122 deletions(-)


base-commit: cd19ac2f903276b820f5d0d89de0c896c27036ed

Comments

Christian König Aug. 1, 2024, 10:50 a.m. UTC | #1
On 01.08.24 at 12:45, Huan Yang wrote:
> The current udmabuf mmap uses a page fault mechanism to populate the vma.
>
> However, the current udmabuf has already obtained and pinned the folios by
> the time creation completes. This means that the physical memory has
> already been acquired, rather than being accessed dynamically. The
> current page fault method only saves some page table memory.
>
> As a result, the page fault mechanism has lost its purpose as demand
> paging. Because a page fault must trap into kernel mode to fill in the
> mapping when the corresponding virtual address in the mmap region is
> first accessed, every such user-mode access incurs a trap into kernel
> mode.
>
> Therefore, when creating a large udmabuf, this represents considerable
> overhead.
>
> This patch therefore removes the page fault method of mmap and instead
> fills the mapping directly when mmap is called.
>
> Signed-off-by: Huan Yang <link@vivo.com>
> ---
>   drivers/dma-buf/udmabuf.c | 70 ++++++++++++++++++++++-----------------
>   1 file changed, 39 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index 047c3cd2ceff..d69aeada7367 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -38,36 +38,39 @@ struct udmabuf_folio {
>   	struct list_head list;
>   };
>   
> -static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
> -{
> -	struct vm_area_struct *vma = vmf->vma;
> -	struct udmabuf *ubuf = vma->vm_private_data;
> -	pgoff_t pgoff = vmf->pgoff;
> -	unsigned long pfn;
> -
> -	if (pgoff >= ubuf->pagecount)
> -		return VM_FAULT_SIGBUS;
> -
> -	pfn = folio_pfn(ubuf->folios[pgoff]);
> -	pfn += ubuf->offsets[pgoff] >> PAGE_SHIFT;
> -
> -	return vmf_insert_pfn(vma, vmf->address, pfn);
> -}
> -
> -static const struct vm_operations_struct udmabuf_vm_ops = {
> -	.fault = udmabuf_vm_fault,
> -};
> +static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
> +				     enum dma_data_direction direction);
>   
>   static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
>   {
>   	struct udmabuf *ubuf = buf->priv;
> +	struct sg_table *table = ubuf->sg;
> +	unsigned long addr = vma->vm_start;
> +	struct sg_page_iter piter;
> +	int ret;
>   
>   	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
>   		return -EINVAL;
>   
> -	vma->vm_ops = &udmabuf_vm_ops;
> -	vma->vm_private_data = ubuf;
> -	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> +	if (!table) {
> +		table = get_sg_table(NULL, buf, 0);
> +		if (IS_ERR(table))
> +			return PTR_ERR(table);
> +		ubuf->sg = table;
> +	}
> +
> +	for_each_sgtable_page(table, &piter, vma->vm_pgoff) {

That might not work correctly. We intentionally remove the pages from 
the sgtable when it is shared between devices.

In addition to that, the sgtable is *not* a page container, but rather
a DMA address container. So this is also a rather bad idea from the
design side.

Regards,
Christian.

> +		struct page *page = sg_page_iter_page(&piter);
> +
> +		ret = remap_pfn_range(vma, addr, page_to_pfn(page), PAGE_SIZE,
> +				      vma->vm_page_prot);
> +		if (ret)
> +			return ret;
> +		addr += PAGE_SIZE;
> +		if (addr >= vma->vm_end)
> +			return 0;
> +	}
> +
>   	return 0;
>   }
>   
> @@ -126,6 +129,10 @@ static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
>   		sg_set_folio(sgl, ubuf->folios[i], PAGE_SIZE,
>   			     ubuf->offsets[i]);
>   
> +	// if dev is NULL, no need to sync.
> +	if (!dev)
> +		return sg;
> +
>   	ret = dma_map_sgtable(dev, sg, direction, 0);
>   	if (ret < 0)
>   		goto err_map;
> @@ -206,20 +213,21 @@ static int begin_cpu_udmabuf(struct dma_buf *buf,
>   {
>   	struct udmabuf *ubuf = buf->priv;
>   	struct device *dev = ubuf->device->this_device;
> -	int ret = 0;
> +	struct sg_table *sg;
>   
> -	if (!ubuf->sg) {
> -		ubuf->sg = get_sg_table(dev, buf, direction);
> -		if (IS_ERR(ubuf->sg)) {
> -			ret = PTR_ERR(ubuf->sg);
> -			ubuf->sg = NULL;
> -		}
> -	} else {
> +	if (ubuf->sg) {
>   		dma_sync_sg_for_cpu(dev, ubuf->sg->sgl, ubuf->sg->nents,
>   				    direction);
> +		return 0;
>   	}
>   
> -	return ret;
> +	sg = get_sg_table(dev, buf, direction);
> +	if (IS_ERR(sg))
> +		return PTR_ERR(sg);
> +
> +	ubuf->sg = sg;
> +
> +	return 0;
>   }
>   
>   static int end_cpu_udmabuf(struct dma_buf *buf,
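
Christian's objection points toward iterating the pinned folios
directly rather than walking the sg table. A hypothetical sketch of
such a folio-based mmap, reusing the fields the removed fault handler
already used (ubuf->folios and ubuf->offsets); this is not code from
the series:

	static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
	{
		struct udmabuf *ubuf = buf->priv;
		unsigned long addr = vma->vm_start;
		pgoff_t pg;

		if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
			return -EINVAL;

		/* Map from the pinned folio array itself; the sg table is
		 * left alone, so sharing paths that strip pages from it
		 * are unaffected. */
		for (pg = vma->vm_pgoff; pg < ubuf->pagecount; pg++) {
			unsigned long pfn = folio_pfn(ubuf->folios[pg]) +
					    (ubuf->offsets[pg] >> PAGE_SHIFT);
			int ret = remap_pfn_range(vma, addr, pfn, PAGE_SIZE,
						  vma->vm_page_prot);

			if (ret)
				return ret;
			addr += PAGE_SIZE;
			if (addr >= vma->vm_end)
				break;
		}
		return 0;
	}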
Christian König Aug. 1, 2024, 10:55 a.m. UTC | #2
On 01.08.24 at 12:45, Huan Yang wrote:
> Currently vmap_udmabuf fills the pages array folio by folio, but
> ubuf->folios only holds each folio's head page.
>
> That means we repeatedly mapped the folio head page into the vmalloc area.
>
> This patch fixes that by setting each folio's pages correctly, so the
> pages array contains the right pages, which are then mapped into the
> vmalloc area.
>
> Signed-off-by: Huan Yang <link@vivo.com>
> ---
>   drivers/dma-buf/udmabuf.c | 17 +++++++++++++----
>   1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index a915714c5dce..7ed532342d7f 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -77,18 +77,27 @@ static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
>   static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
>   {
>   	struct udmabuf *ubuf = buf->priv;
> -	struct page **pages;
> +	struct page **pages, **tmp;
> +	struct sg_table *sg = ubuf->sg;
> +	struct sg_page_iter piter;
>   	void *vaddr;
> -	pgoff_t pg;
>   
>   	dma_resv_assert_held(buf->resv);
>   
> +	if (!sg) {
> +		sg = get_sg_table(NULL, buf, 0);
> +		if (IS_ERR(sg))
> +			return PTR_ERR(sg);
> +		ubuf->sg = sg;
> +	}
> +
>   	pages = kvmalloc_array(ubuf->pagecount, sizeof(*pages), GFP_KERNEL);
>   	if (!pages)
>   		return -ENOMEM;
> +	tmp = pages;
>   
> -	for (pg = 0; pg < ubuf->pagecount; pg++)
> -		pages[pg] = &ubuf->folios[pg]->page;
> +	for_each_sgtable_page(sg, &piter, 0)
> +		*tmp++ = sg_page_iter_page(&piter);

Again, don't abuse the sg table for that!

Regards,
Christian.

>   
>   	vaddr = vm_map_ram(pages, ubuf->pagecount, -1);
>   	kvfree(pages);
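
The same concern applies to vmap. A hypothetical folio-based fix that
fills the pages array without going through the sg table, again
assuming the per-page ubuf->folios/offsets layout (this is a sketch,
not code from the series):

	static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
	{
		struct udmabuf *ubuf = buf->priv;
		struct page **pages;
		void *vaddr;
		pgoff_t pg;

		pages = kvmalloc_array(ubuf->pagecount, sizeof(*pages),
				       GFP_KERNEL);
		if (!pages)
			return -ENOMEM;

		/* Select the correct page within each folio instead of
		 * always taking the head page. */
		for (pg = 0; pg < ubuf->pagecount; pg++)
			pages[pg] = folio_page(ubuf->folios[pg],
					       ubuf->offsets[pg] >> PAGE_SHIFT);

		vaddr = vm_map_ram(pages, ubuf->pagecount, -1);
		kvfree(pages);
		if (!vaddr)
			return -EINVAL;

		iosys_map_set_vaddr(map, vaddr);
		return 0;
	}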
Kasireddy, Vivek Aug. 1, 2024, 6:32 p.m. UTC | #3
Hi Huan,

> This patchset attempts to fix some errors in udmabuf and remove the
> unpin_list structure.
> 
> Some of these fixes simply gather patches I uploaded before.
> 
> Patch1
> ===
> Remove the page-fault-based mmap and map the memory directly.
> The current udmabuf has already obtained and pinned the folios by the
> time creation completes. This means that the physical memory has
> already been acquired, rather than being accessed dynamically. The
> current page fault method only saves some page table memory.
> 
> As a result, the page fault mechanism has lost its purpose as demand
> paging. Because a page fault must trap into kernel mode to fill in the
> mapping when the corresponding virtual address in the mmap region is
> first accessed, every such user-mode access incurs a trap into kernel
> mode.
> 
> Therefore, when creating a large udmabuf, this represents considerable
> overhead.
Just want to mention that for the main use-case the udmabuf driver is
designed for (sharing a Qemu guest FB with the host for GPU DMA),
udmabufs are not created very frequently. And, I think providing CPU
access via mmap is just a backup, mainly intended for debugging purposes.

> 
> This patch therefore removes the page fault method of mmap and instead
> fills the mapping directly when mmap is called.
> 
> This is achieved by using the scatter-gather table to establish a
> linear relationship over the pages. Calling remap_pfn_range does not
> invalidate the previously set VMA flags.
> 
> Patch2
> ===
> This is the same as the patch:
> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
> It is simply gathered into this patchset.
> 
> Patch3
> ===
> The current implementation of udmabuf's vmap has issues.
> 
> It does not correctly set each page of a folio into the pages array,
> so when vmap is called, every entry is the head page of its folio.
> 
> This implementation is not the same as this patch:
> https://lore.kernel.org/all/20240731090233.1343559-1-link@vivo.com/
> 
> This version reuses the sg table to map all pages into the vmalloc area.
> 
> Patch4
> ===
> Wrap the repeated calls to get_sg_table in a helper function.
> udmabuf->sg is set with cmpxchg, which should prevent concurrent
> update races (mmap does not take a lock).
> 
> Patch5
> ===
> Attempt to remove unpin_list and other related data structures.
> 
> To adapt to folios, the unpin_list data structure was established to
> unpin all folios and maintain the page mapping relationship.
> 
> However, this data structure requires 24 bytes for each page, list
> traversal is slow, and maintaining the offsets array also consumes a
> portion of memory.
> 
> This patch attempts to remove these data structures and modify the
> semantics of some existing ones.
> 
> udmabuf:
>   folios -> folio array that contains each folio only once (the
> original contained duplicates).
>   add item_offset -> indexed by create item, records the start offset
> within each memfd.
>   add item_size -> indexed by create item, records the size within
> each memfd.
>   add nr_folios -> number of entries in the folios array.
I am not sure if these changes improve the readability. Instead, I think it makes
sense to add comments to the existing code.

> 
> So, when building the sg table, iterate as follows:
>   if the size crosses an item boundary (item_size), take care of that
> item's start offset within its folio;
>   for each folio, set each page into the sgl until the folio's size is
> reached.
> 
> This patch also removes the per-item folio array created for each
> create item; instead, a pointer into the ubuf->folios array is used,
> sliding forward to fill each item's folios directly into the array.
> 
> After the modification, the various data structures in udmabuf have the
> following corresponding relationships:
>   pagecount * PAGE_SIZE = sum(folio_size(folios[i])), i = 0 .. nr_folios - 1
>   pagecount * PAGE_SIZE = sum(item_size[i]), i = 0 .. item_count - 1
> (item_count itself is not recorded)
>   item_offset[i] records each memfd's start offset if one was given,
> else 0.
> 
> Huan Yang (5):
>   udmabuf: cancel mmap page fault, direct map it
>   udmabuf: change folios array from kmalloc to kvmalloc
>   udmabuf: fix vmap_udmabuf error page set
Do you have a test-case to test this patch?

>   udmabuf: add get_sg_table helper function
>   udmabuf: remove folio pin list
Please run the newly added udmabuf selftests to make sure that these
patches are not causing any regressions. And, we also need to make sure that
the main use-cases (Qemu with memfd + shmem and Qemu with memfd + hugetlb)
are working as expected given the invasive changes. 

I'll be able to test and provide more detailed feedback on all patches once I am back from
vacation late next week.

Thanks,
Vivek 

> 
>  drivers/dma-buf/udmabuf.c | 270 +++++++++++++++++++++-----------------
>  1 file changed, 148 insertions(+), 122 deletions(-)
> 
> 
> base-commit: cd19ac2f903276b820f5d0d89de0c896c27036ed
> --
> 2.45.2
Huan Yang Aug. 2, 2024, 1:16 a.m. UTC | #4
On 2024/8/2 at 2:32, Kasireddy, Vivek wrote:
>
> Hi Huan,
>
>> This patchset attempts to fix some errors in udmabuf and remove the
>> unpin_list structure.
>>
>> Some of these fixes simply gather patches I uploaded before.
>>
>> Patch1
>> ===
>> Remove the page-fault-based mmap and map the memory directly.
>> The current udmabuf has already obtained and pinned the folios by the
>> time creation completes. This means that the physical memory has
>> already been acquired, rather than being accessed dynamically. The
>> current page fault method only saves some page table memory.
>>
>> As a result, the page fault mechanism has lost its purpose as demand
>> paging. Because a page fault must trap into kernel mode to fill in the
>> mapping when the corresponding virtual address in the mmap region is
>> first accessed, every such user-mode access incurs a trap into kernel
>> mode.
>>
>> Therefore, when creating a large udmabuf, this represents considerable
>> overhead.
> Just want to mention that for the main use-case the udmabuf driver is
> designed for (sharing a Qemu guest FB with the host for GPU DMA),
> udmabufs are not created very frequently. And, I think providing CPU
> access via mmap is just a backup, mainly intended for debugging purposes.

I'm very glad to know this.

However, recently I have been researching the use of asynchronous and
direct I/O (DIO) when loading large model files with dma-buf, which can
improve performance and reduce power consumption. You can see the
patchset:

https://lore.kernel.org/all/20240730075755.10941-1-link@vivo.com/

In that discussion, the maintainer suggested that we should base our
work on udmabuf. I tested udmabuf and found that using asynchronous and
direct I/O (DIO) to read files performs similarly to my patchset.

So I turned to studying udmabuf, and once I am familiar with the code,
I will be able to encourage our partners to adopt it.

>
>> This patch therefore removes the page fault method of mmap and instead
>> fills the mapping directly when mmap is called.
>>
>> This is achieved by using the scatter-gather table to establish a
>> linear relationship over the pages. Calling remap_pfn_range does not
>> invalidate the previously set VMA flags.
>>
>> Patch2
>> ===
>> This is the same as the patch:
>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>> It is simply gathered into this patchset.
>>
>> Patch3
>> ===
>> The current implementation of udmabuf's vmap has issues.
>>
>> It does not correctly set each page of a folio into the pages array,
>> so when vmap is called, every entry is the head page of its folio.
>>
>> This implementation is not the same as this patch:
>> https://lore.kernel.org/all/20240731090233.1343559-1-link@vivo.com/
>>
>> This version reuses the sg table to map all pages into the vmalloc area.
>>
>> Patch4
>> ===
>> Wrap the repeated calls to get_sg_table in a helper function.
>> udmabuf->sg is set with cmpxchg, which should prevent concurrent
>> update races (mmap does not take a lock).
>>
>> Patch5
>> ===
>> Attempt to remove unpin_list and other related data structures.
>>
>> To adapt to folios, the unpin_list data structure was established to
>> unpin all folios and maintain the page mapping relationship.
>>
>> However, this data structure requires 24 bytes for each page, list
>> traversal is slow, and maintaining the offsets array also consumes a
>> portion of memory.
>>
>> This patch attempts to remove these data structures and modify the
>> semantics of some existing ones.
>>
>> udmabuf:
>>    folios -> folio array that contains each folio only once (the
>> original contained duplicates).
>>    add item_offset -> indexed by create item, records the start offset
>> within each memfd.
>>    add item_size -> indexed by create item, records the size within
>> each memfd.
>>    add nr_folios -> number of entries in the folios array.
> I am not sure if these changes improve the readability. Instead, I think it makes
> sense to add comments to the existing code.

This is not aimed at improving readability, but rather at saving memory
and improving performance, as unpin_list costs 24 bytes for each folio;
at that rate the overhead adds up quickly.

I previously provided a patch that established a kmem_cache to reduce
the memory waste (that patch forgot to unregister the cache on module
exit):

https://lore.kernel.org/all/20240731062642.1164140-1-link@vivo.com/

But after recent study, I believe that the unpin_list may not need to
be constructed at all; instead, operations can be based directly on the
folio array.


>
>> So, when building the sg table, iterate as follows:
>>    if the size crosses an item boundary (item_size), take care of that
>> item's start offset within its folio;
>>    for each folio, set each page into the sgl until the folio's size is
>> reached.
>>
>> This patch also removes the per-item folio array created for each
>> create item; instead, a pointer into the ubuf->folios array is used,
>> sliding forward to fill each item's folios directly into the array.
>>
>> After the modification, the various data structures in udmabuf have the
>> following corresponding relationships:
>>    pagecount * PAGE_SIZE = sum(folio_size(folios[i])), i = 0 .. nr_folios - 1
>>    pagecount * PAGE_SIZE = sum(item_size[i]), i = 0 .. item_count - 1
>> (item_count itself is not recorded)
>>    item_offset[i] records each memfd's start offset if one was given,
>> else 0.
>>
>> Huan Yang (5):
>>    udmabuf: cancel mmap page fault, direct map it
>>    udmabuf: change folios array from kmalloc to kvmalloc
>>    udmabuf: fix vmap_udmabuf error page set
> Do you have a test-case to test this patch?
>
>>    udmabuf: add get_sg_table helper function
>>    udmabuf: remove folio pin list
> Please run the newly added udmabuf selftests to make sure that these

Yes, you're right; when I release the next version, I will include the
selftest results. Thank you for pointing this out.

Christian König reminded me not to build page associations based on the
sg table, which I had not considered.

Therefore, the overall logic of the patch needs to be revised.

> patches are not causing any regressions. And, we also need to make sure that
> the main use-cases (Qemu with memfd + shmem and Qemu with memfd + hugetlb)
> are working as expected given the invasive changes.
>
> I'll be able to test and provide more detailed feedback on all patches once I am back from
> vacation late next week.

Wish you a pleasant holiday.
Thank you.
>
> Thanks,
> Vivek
>
>>   drivers/dma-buf/udmabuf.c | 270 +++++++++++++++++++++-----------------
>>   1 file changed, 148 insertions(+), 122 deletions(-)
>>
>>
>> base-commit: cd19ac2f903276b820f5d0d89de0c896c27036ed
>> --
>> 2.45.2
Michel Dänzer Aug. 2, 2024, 3:26 p.m. UTC | #5
On 2024-08-01 20:32, Kasireddy, Vivek wrote:
> Hi Huan,
> 
>> This patchset attempts to fix some errors in udmabuf and remove the
>> unpin_list structure.
>>
>> Some of these fixes simply gather patches I uploaded before.
>>
>> Patch1
>> ===
>> Remove the page-fault-based mmap and map the memory directly.
>> The current udmabuf has already obtained and pinned the folios by the
>> time creation completes. This means that the physical memory has
>> already been acquired, rather than being accessed dynamically. The
>> current page fault method only saves some page table memory.
>>
>> As a result, the page fault mechanism has lost its purpose as demand
>> paging. Because a page fault must trap into kernel mode to fill in the
>> mapping when the corresponding virtual address in the mmap region is
>> first accessed, every such user-mode access incurs a trap into kernel
>> mode.
>>
>> Therefore, when creating a large udmabuf, this represents considerable
>> overhead.
> Just want to mention that for the main use-case the udmabuf driver is
> designed for (sharing a Qemu guest FB with the host for GPU DMA),
> udmabufs are not created very frequently. And, I think providing CPU
> access via mmap is just a backup, mainly intended for debugging purposes.

FYI, Mesa now uses udmabuf for supporting dma-bufs with software rendering.