[v2,0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag

Message ID 20240730075755.10941-1-link@vivo.com


Huan Yang July 30, 2024, 7:57 a.m. UTC
Background
====
Some users need to load a file into a dma-buf. The current way is:
  1. allocate a dma-buf, get dma-buf fd
  2. mmap dma-buf fd into user vaddr
  3. read(file_fd, vaddr, fsz)
Because a dma-buf's user mapping cannot support direct I/O[1], the file
read must use buffered I/O.

This means that while reading the file into the dma-buf, page cache is
generated, and the content is first copied into the page cache before
being copied into the dma-buf.

This approach worked well for relatively small files, since the page
cache caches the file content and thus improves performance.

However, there are new challenges currently, especially as AI models are
becoming larger and need to be shared between DMA devices and the CPU
via dma-buf.

For example, our 7B model file is around 3.4GB. Using the approach
above would mean generating a total of 3.4GB of page cache (even if it
is later reclaimed) and copying 3.4GB of content between the page cache
and the dma-buf.

Given the limited system memory, gigabyte-scale files cannot stay
resident in memory indefinitely, so this page cache may not help much
with subsequent reads. In addition, the page cache consumes extra
system resources because of the extra CPU copies it requires.

Therefore, I think it is necessary for dma-buf to support direct I/O.

However, direct I/O file reads cannot be performed through the buffer
that user space mmaps from the dma-buf.[1]

Here are some discussions on implementing direct I/O using dma-buf:

mmap[1]
---
dma-buf has never supported direct I/O through a user-mapped vaddr.

udmabuf[2]
---
Currently, udmabuf can use the memfd method to read files into
dma-buf in direct I/O mode.

However, for large sizes, udmabuf must raise its size_limit (64MB by
default). Using udmabuf for files at the 3GB level is not a good fit:
it needs internal adjustments to handle this[3], or else creation
fails.

Still, it is a viable way to give dma-buf direct I/O support. However,
the file read must be initiated only after the memory allocation has
completed, and race conditions must be handled carefully.

sendfile/splice[4]
---
Another way to give dma-buf direct I/O support is to implement
splice_write/write_iter in the dma-buf file operations (fops) so that
sendfile can be used.
However, sendfile/splice is pipe-based. When reading a file with direct
I/O, the content must first be copied into the pipe buffer (64KB by
default), and then the dma-buf fops' splice_write is called to write it
into the dma-buf.
This serializes the transfer: each pipe-sized chunk of the file must be
read into the pipe buffer and written to the dma-buf before the next
chunk can be read, so I/O performance under direct I/O is weak.
Moreover, because of the pipe buffer, a CPU copy is still needed even
though direct I/O avoids generating extra page cache.

copy_file_range[5]
---
As for copy_file_range, it only supports copying files within the same
file system, so it is not practical here either.


So, currently, there is no particularly suitable solution on VFS to
allow dma-buf to support direct I/O for large file reads.

This patchset provides a way to complete the file read as part of
requesting the dma-buf.

Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
===
This patch reads the file content immediately after the dma-buf is
allocated, and returns the dma-buf file descriptor only once the file
has been fully read.

Since the dma-buf file descriptor has not yet been returned, no thread
other than the current one can access the buffer, so there are no race
conditions to worry about.

The dma-buf is mapped into the vmalloc area and the file read is
initiated from kernel space, supporting both buffered I/O and direct
I/O.

This patch adds the DMA_HEAP_ALLOC_AND_READ_FILE heap flag for user
space. When a user wants to allocate a dma-buf and read a file into it,
they pass this flag. Since the size of the read is fixed by the file,
there is no need to pass the 'len' parameter; instead, file_fd is
passed to tell the kernel which file to read.

The file's open flags determine the read mode. Note that if direct I/O
(O_DIRECT) is used to read the file, the file size must be page aligned
(with patches 2-5, this restriction is lifted).

Therefore, from the user's point of view, len and file_fd are mutually
exclusive, so they are combined in a union.

Once the user obtains the dma-buf fd, the dma-buf directly contains the
file content.

Patch 1 implements this.

Patches 2-5 provide an approach to improve performance.

The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch lets us read files
synchronously using direct I/O.

This saves CPU copies and avoids a degree of memory thrashing (page
cache generation and reclamation).

When dealing with large file sizes, the benefits of this approach become
particularly significant.

However, there are also ways to improve performance, not just save
system resources:

Because of the large file sizes involved (for example, a 7B AI model of
around 3.4GB), allocating the dma-buf memory takes a relatively long
time. Waiting for the allocation to complete before reading the file
adds to the overall time. The total time for dma-buf allocation plus
file read is:
   T(total) = T(alloc) + T(I/O)

However, we do not have to wait for the whole dma-buf allocation to
complete before initiating I/O. During the allocation process we
already hold a portion of the pages, so waiting for the remaining page
allocations to complete before starting file reads wastes the pages
that have already been allocated.

The allocation of pages is sequential, and the reading of the file is
also sequential, with the content and size corresponding to the file.
This means that the memory location for each page, which holds the
content of a specific position in the file, can be determined at the
time of allocation.

However, to fully leverage I/O performance, it is best to wait and
gather a certain number of pages before initiating batch processing.

The default gather size is 128MB. Each gathered batch becomes one
file-read work item: its pages are mapped into the vmalloc area to
obtain a contiguous virtual address, which is used as the buffer for
the corresponding range of the file. With direct I/O, the file content
is therefore written directly into the dma-buf's backing memory without
any additional copy (compare the pipe buffer above).

Compare the other ways of reading into a dma-buf: reading after mmap
requires mapping the dma-buf pages into the user address space, and
udmabuf's memfd needs the same. Even with sendfile support, the copy
still needs a buffer that must be set up. So mapping pages into the
vmalloc area incurs no additional performance overhead compared to the
other methods.[6]

Certainly, the administrator can also modify the gather size through patch5.

With async read, the time for system_heap buffer allocation plus file
reading becomes:

  T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))

Compared to the synchronous read:
  T(total) = T(alloc) + T(I/O)

Whichever of allocation and I/O takes longer hides the other: the time
difference is covered by the maximum of the two.

Therefore, the larger the size of the file that needs to be read, the
greater the corresponding benefits will be.

How to use
===
Consider the current pathway for loading model files into DMA-BUF:
  1. open dma-heap, get heap fd
  2. open file, get file_fd(can't use O_DIRECT)
  3. use file len to allocate dma-buf, get dma-buf fd
  4. mmap dma-buf fd, get vaddr
  5. read(file_fd, vaddr, file_size) into dma-buf pages
  6. share, attach, whatever you want

With DMA_HEAP_ALLOC_AND_READ_FILE, only a small change is needed:
  1. open dma-heap, get heap fd
  2. open file, get file_fd(buffer/direct)
  3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag, set file_fd
     instead of len. get dma-buf fd(contains file content)
  4. share, attach, whatever you want

So it is easy to test.

How to test
===
The performance comparison will be conducted for the following scenarios:
  1. normal
  2. udmabuf with [3] patch
  3. sendfile
  4. only patch 1
  5. patch1 - patch4.

normal:
  1. open dma-heap, get heap fd
  2. open file, get file_fd(can't use O_DIRECT)
  3. use file len to allocate dma-buf, get dma-buf fd
  4. mmap dma-buf fd, get vaddr
  5. read(file_fd, vaddr, file_size) into dma-buf pages
  6. share, attach, whatever you want

UDMA-BUF step:
  1. memfd_create
  2. open file(buffer/direct)
  3. udmabuf create
  4. mmap memfd
  5. read file into memfd vaddr

Sendfile steps (requires suitable splice_write/write_iter; used here
only for comparison):
  1. open dma-heap, get heap fd
  2. open file, get file_fd(buffer/direct)
  3. use file len to allocate dma-buf, get dma-buf fd
  4. sendfile file_fd to dma-buf fd
  5. share, attach, whatever you want

patch1/patch1-4:
  1. open dma-heap, get heap fd
  2. open file, get file_fd(buffer/direct)
  3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag, set file_fd
     instead of len. get dma-buf fd(contains file content)
  4. share, attach, whatever you want

You can create a file to test it and compare the performance gap
between the schemes. It is best to compare file sizes ranging from KB
to MB to GB.

The following test data will compare the performance differences between 512KB,
8MB, 1GB, and 3GB under various scenarios.

Performance Test
===
  12GB RAM phone
  UFS 4.0 (maximum speed 4GB/s)
  f2fs
  kernel 6.1 with patch [7] (otherwise kvec direct I/O reads are not
  supported)
  no memory pressure
  drop_caches used before each test

The average of 5 test results:
| scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
| ------------------- | ---------- | ---------- | ------------- | ------------- |
| normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
| udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
| sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
| patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
| sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
| patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
| udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
| patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |

So, based on the test results:

When the file is large, the full patchset has the highest performance:
compared to normal, it is about a 50% improvement. Patch 1 alone shows
about a 41% degradation compared to normal. A typical performance
breakdown for patch 1 is:
  1. alloc cost 188,802,693 ns
  2. vmap cost 42,491,385 ns
  3. file read cost 4,180,876,702 ns
Therefore, a single direct I/O read over a large file may not be the
optimal approach for performance.

The performance of direct I/O implemented by the sendfile method is the worst.

When the file size is small, the performance differences are not
significant, which is consistent with expectations.



Suggested use cases
===
  1. When large files must be read and system resources are scarce,
     especially when memory is limited (GB level). In this scenario,
     using direct I/O for the file read can even improve performance
     (may need patches 2-3).
  2. For embedded devices with limited RAM, direct I/O saves system
     resources and avoids unnecessary data copies. So even though
     performance is lower for small files, it can still be used
     effectively.
  3. If there is sufficient memory, pinning the model files' page cache
     in memory and placing the files in an EROFS file system for
     read-only access may be better. (EROFS does not support direct
     I/O.)


Changelog
===
 v1 [8]
 v1->v2:
   Use a heap flag for alloc-and-read instead of adding a new dma-buf
   ioctl command. [9]
   Split the patchset to ease review and testing:
     patch 1 implements alloc-and-read via the heap flag.
     patches 2-4 add async read.
     patch 5 makes the gather limit configurable.

Reference
===
[1] https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
[2] https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
[3] https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
[4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
[5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
[6] https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
[7] https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
[8] https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
[9] https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/

Huan Yang (5):
  dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
  dma-buf: heaps: Introduce async alloc read ops
  dma-buf: heaps: support alloc async read file
  dma-buf: heaps: system_heap alloc support async read
  dma-buf: heaps: configurable async read gather limit

 drivers/dma-buf/dma-heap.c          | 552 +++++++++++++++++++++++++++-
 drivers/dma-buf/heaps/system_heap.c |  70 +++-
 include/linux/dma-heap.h            |  53 ++-
 include/uapi/linux/dma-heap.h       |  11 +-
 4 files changed, 673 insertions(+), 13 deletions(-)


base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890

Comments

Huan Yang Aug. 1, 2024, 2:53 a.m. UTC | #1
On 2024/8/1 4:46, Daniel Vetter wrote:
> On Tue, Jul 30, 2024 at 08:04:04PM +0800, Huan Yang wrote:
>> On 2024/7/30 17:05, Huan Yang wrote:
>>> On 2024/7/30 16:56, Daniel Vetter wrote:
>>>>
>>>> On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
>>>>> UDMA-BUF step:
>>>>>     1. memfd_create
>>>>>     2. open file(buffer/direct)
>>>>>     3. udmabuf create
>>>>>     4. mmap memfd
>>>>>     5. read file into memfd vaddr
>>>> Yeah this is really slow and the worst way to do it. You absolutely want
>>>> to start _all_ the io before you start creating the dma-buf, ideally
>>>> with
>>>> everything running in parallel. But just starting the direct I/O with
>>>> async and then creating the umdabuf should be a lot faster and avoid
>>> That's great. Let me rephrase that, and please correct me if I'm wrong.
>>>
>>> UDMA-BUF step:
>>>    1. memfd_create
>>>    2. mmap memfd
>>>    3. open file(buffer/direct)
>>>    4. start thread to async read
>>>    3. udmabuf create
>>>
>>> With this, can improve
>> I just test with it. Step is:
>>
>> UDMA-BUF step:
>>    1. memfd_create
>>    2. mmap memfd
>>    3. open file(buffer/direct)
>>    4. start thread to async read
>>    5. udmabuf create
>>
>>    6. join wait
>>
>> Reading the 3GB file, all steps cost 1,527,103,431 ns. It's great.
> Ok that's almost the throughput of your patch set, which I think is close
> enough. The remaining difference is probably just the mmap overhead, not
> sure whether/how we can do direct i/o to an fd directly ... in principle
> it's possible for any file that uses the standard pagecache.

Yes. For mmap, IMO, we already get and pin all the folios, which means
all the PFNs are known when the udmabuf is created.

So I think faulting pages in on mmap access does not help save memory,
but does increase mmap access cost (it may save a little page-table
memory).

I want to offer a patchset that removes it, makes the code operate on
folios more naturally (and removes the unpin list), and contains some
fixes.

I'll send it once it tests out well.


As for doing direct I/O on the fd directly, maybe sendfile or
copy_file_range?

sendfile is pipe-buffer based, and its performance was low when I
tested it.

copy_file_range can't work because the files are not on the same file
system.

So I can't find another way to do it. Can someone give some
suggestions?

> -Sima
Daniel Vetter Aug. 5, 2024, 5:53 p.m. UTC | #2
On Thu, Aug 01, 2024 at 10:53:45AM +0800, Huan Yang wrote:
> 
> Yes. For mmap, IMO, we already get and pin all the folios, which means
> all the PFNs are known when the udmabuf is created.
> 
> So I think faulting pages in on mmap access does not help save memory,
> but does increase mmap access cost (it may save a little page-table
> memory).
> 
> I want to offer a patchset that removes it, makes the code operate on
> folios more naturally (and removes the unpin list), and contains some
> fixes.
> 
> I'll send it once it tests out well.
> 
> 
> As for doing direct I/O on the fd directly, maybe sendfile or
> copy_file_range?
> 
> sendfile is pipe-buffer based, and its performance was low when I
> tested it.
> 
> copy_file_range can't work because the files are not on the same file
> system.
> 
> So I can't find another way to do it. Can someone give some
> suggestions?

Yeah direct I/O to pagecache without an mmap might be too niche to be
supported. Maybe io_uring has something, but I guess as unlikely as
anything else.
-Sima