diff mbox series

[v10,01/17] iova: Export alloc_iova_fast() and free_iova_fast()

Message ID 20210729073503.187-2-xieyongji@bytedance.com
State Superseded
Headers show
Series [v10,01/17] iova: Export alloc_iova_fast() and free_iova_fast() | expand

Commit Message

Yongji Xie July 29, 2021, 7:34 a.m. UTC
Export alloc_iova_fast() and free_iova_fast() so that
some modules can use it to improve iova allocation efficiency.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 drivers/iommu/iova.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Jason Wang Aug. 3, 2021, 7:41 a.m. UTC | #1
在 2021/7/29 下午3:34, Xie Yongji 写道:
> Export alloc_iova_fast() and free_iova_fast() so that

> some modules can use it to improve iova allocation efficiency.



It's better to explain which alloc_iova() is not sufficient here.

Thanks


>

> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>

> ---

>   drivers/iommu/iova.c | 2 ++

>   1 file changed, 2 insertions(+)

>

> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c

> index b6cf5f16123b..3941ed6bb99b 100644

> --- a/drivers/iommu/iova.c

> +++ b/drivers/iommu/iova.c

> @@ -521,6 +521,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,

>   

>   	return new_iova->pfn_lo;

>   }

> +EXPORT_SYMBOL_GPL(alloc_iova_fast);

>   

>   /**

>    * free_iova_fast - free iova pfn range into rcache

> @@ -538,6 +539,7 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size)

>   

>   	free_iova(iovad, pfn);

>   }

> +EXPORT_SYMBOL_GPL(free_iova_fast);

>   

>   #define fq_ring_for_each(i, fq) \

>   	for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % IOVA_FQ_SIZE)
Jason Wang Aug. 3, 2021, 7:41 a.m. UTC | #2
在 2021/7/29 下午3:34, Xie Yongji 写道:
> Export alloc_iova_fast() and free_iova_fast() so that

> some modules can use it to improve iova allocation efficiency.



It's better to explain why alloc_iova() is not sufficient here.

Thanks


>

> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>

> ---

>   drivers/iommu/iova.c | 2 ++

>   1 file changed, 2 insertions(+)

>

> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c

> index b6cf5f16123b..3941ed6bb99b 100644

> --- a/drivers/iommu/iova.c

> +++ b/drivers/iommu/iova.c

> @@ -521,6 +521,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,

>   

>   	return new_iova->pfn_lo;

>   }

> +EXPORT_SYMBOL_GPL(alloc_iova_fast);

>   

>   /**

>    * free_iova_fast - free iova pfn range into rcache

> @@ -538,6 +539,7 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size)

>   

>   	free_iova(iovad, pfn);

>   }

> +EXPORT_SYMBOL_GPL(free_iova_fast);

>   

>   #define fq_ring_for_each(i, fq) \

>   	for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % IOVA_FQ_SIZE)
Yongji Xie Aug. 3, 2021, 8:54 a.m. UTC | #3
On Tue, Aug 3, 2021 at 3:41 PM Jason Wang <jasowang@redhat.com> wrote:
>

>

> 在 2021/7/29 下午3:34, Xie Yongji 写道:

> > Export alloc_iova_fast() and free_iova_fast() so that

> > some modules can use it to improve iova allocation efficiency.

>

>

> It's better to explain why alloc_iova() is not sufficient here.

>


Fine.

Thanks,
Yongji
Robin Murphy Aug. 3, 2021, 10:53 a.m. UTC | #4
On 2021-08-03 09:54, Yongji Xie wrote:
> On Tue, Aug 3, 2021 at 3:41 PM Jason Wang <jasowang@redhat.com> wrote:

>>

>>

>> 在 2021/7/29 下午3:34, Xie Yongji 写道:

>>> Export alloc_iova_fast() and free_iova_fast() so that

>>> some modules can use it to improve iova allocation efficiency.

>>

>>

>> It's better to explain why alloc_iova() is not sufficient here.

>>

> 

> Fine.


What I fail to understand from the later patches is what the IOVA domain 
actually represents. If the "device" is a userspace process then 
logically the "IOVA" would be the userspace address, so presumably 
somewhere you're having to translate between this arbitrary address 
space and actual usable addresses - if you're worried about efficiency 
surely it would be even better to not do that?

Presumably userspace doesn't have any concern about alignment and the 
things we have to worry about for the DMA API in general, so it's pretty 
much just allocating slots in a buffer, and there are far more effective 
ways to do that than a full-blown address space manager. If you're going 
to reuse any infrastructure I'd have expected it to be SWIOTLB rather 
than the IOVA allocator. Because, y'know, you're *literally implementing 
a software I/O TLB* ;)

Robin.
Yongji Xie Aug. 4, 2021, 5:02 a.m. UTC | #5
On Tue, Aug 3, 2021 at 6:54 PM Robin Murphy <robin.murphy@arm.com> wrote:
>

> On 2021-08-03 09:54, Yongji Xie wrote:

> > On Tue, Aug 3, 2021 at 3:41 PM Jason Wang <jasowang@redhat.com> wrote:

> >>

> >>

> >> 在 2021/7/29 下午3:34, Xie Yongji 写道:

> >>> Export alloc_iova_fast() and free_iova_fast() so that

> >>> some modules can use it to improve iova allocation efficiency.

> >>

> >>

> >> It's better to explain why alloc_iova() is not sufficient here.

> >>

> >

> > Fine.

>

> What I fail to understand from the later patches is what the IOVA domain

> actually represents. If the "device" is a userspace process then

> logically the "IOVA" would be the userspace address, so presumably

> somewhere you're having to translate between this arbitrary address

> space and actual usable addresses - if you're worried about efficiency

> surely it would be even better to not do that?

>


Yes, userspace daemon needs to translate the "IOVA" in a DMA
descriptor to the VA (from mmap(2)). But this actually doesn't affect
performance since it's an identical mapping in most cases.

> Presumably userspace doesn't have any concern about alignment and the

> things we have to worry about for the DMA API in general, so it's pretty

> much just allocating slots in a buffer, and there are far more effective

> ways to do that than a full-blown address space manager.


Considering iova allocation efficiency, I think the iova allocator is
better here. In most cases, we don't even need to hold a spin lock
during iova allocation.

> If you're going

> to reuse any infrastructure I'd have expected it to be SWIOTLB rather

> than the IOVA allocator. Because, y'know, you're *literally implementing

> a software I/O TLB* ;)

>


But actually what we can reuse in SWIOTLB is the IOVA allocator. And
the IOVA management in SWIOTLB is not what we want. For example,
SWIOTLB allocates and uses contiguous memory for bouncing, which is
not necessary in VDUSE case. And VDUSE needs coherent mapping which is
not supported by the SWIOTLB. Besides, the SWIOTLB works in singleton
mode (designed for platform IOMMU) , but VDUSE is based on on-chip
IOMMU (supports multiple instances). So I still prefer to reuse the
IOVA allocator to implement a MMU-based software IOTLB.

Thanks,
Yongji
Robin Murphy Aug. 4, 2021, 3:43 p.m. UTC | #6
On 2021-08-04 06:02, Yongji Xie wrote:
> On Tue, Aug 3, 2021 at 6:54 PM Robin Murphy <robin.murphy@arm.com> wrote:

>>

>> On 2021-08-03 09:54, Yongji Xie wrote:

>>> On Tue, Aug 3, 2021 at 3:41 PM Jason Wang <jasowang@redhat.com> wrote:

>>>>

>>>>

>>>> 在 2021/7/29 下午3:34, Xie Yongji 写道:

>>>>> Export alloc_iova_fast() and free_iova_fast() so that

>>>>> some modules can use it to improve iova allocation efficiency.

>>>>

>>>>

>>>> It's better to explain why alloc_iova() is not sufficient here.

>>>>

>>>

>>> Fine.

>>

>> What I fail to understand from the later patches is what the IOVA domain

>> actually represents. If the "device" is a userspace process then

>> logically the "IOVA" would be the userspace address, so presumably

>> somewhere you're having to translate between this arbitrary address

>> space and actual usable addresses - if you're worried about efficiency

>> surely it would be even better to not do that?

>>

> 

> Yes, userspace daemon needs to translate the "IOVA" in a DMA

> descriptor to the VA (from mmap(2)). But this actually doesn't affect

> performance since it's an identical mapping in most cases.


I'm not familiar with the vhost_iotlb stuff, but it looks suspiciously 
like you're walking yet another tree to make those translations. Even if 
the buffer can be mapped all at once with a fixed offset such that each 
DMA mapping call doesn't need a lookup for each individual "IOVA" - that 
might be what's happening already, but it's a bit hard to follow just 
reading the patches in my mail client - vhost_iotlb_add_range() doesn't 
look like it's super-cheap to call, and you're serialising on a lock for 
that.

My main point, though, is that if you've already got something else 
keeping track of the actual addresses, then the way you're using an 
iova_domain appears to be something you could do with a trivial bitmap 
allocator. That's why I don't buy the efficiency argument. The main 
design points of the IOVA allocator are to manage large address spaces 
while trying to maximise spatial locality to minimise the underlying 
pagetable usage, and allocating with a flexible limit to support 
multiple devices with different addressing capabilities in the same 
address space. If none of those aspects are relevant to the use-case - 
which AFAICS appears to be true here - then as a general-purpose 
resource allocator it's rubbish and has an unreasonably massive memory 
overhead and there are many, many better choices.

FWIW I've recently started thinking about moving all the caching stuff 
out of iova_domain and into the iommu-dma layer since it's now a giant 
waste of space for all the other current IOVA users.

>> Presumably userspace doesn't have any concern about alignment and the

>> things we have to worry about for the DMA API in general, so it's pretty

>> much just allocating slots in a buffer, and there are far more effective

>> ways to do that than a full-blown address space manager.

> 

> Considering iova allocation efficiency, I think the iova allocator is

> better here. In most cases, we don't even need to hold a spin lock

> during iova allocation.

> 

>> If you're going

>> to reuse any infrastructure I'd have expected it to be SWIOTLB rather

>> than the IOVA allocator. Because, y'know, you're *literally implementing

>> a software I/O TLB* ;)

>>

> 

> But actually what we can reuse in SWIOTLB is the IOVA allocator.


Huh? Those are completely unrelated and orthogonal things - SWIOTLB does 
not use an external allocator (see find_slots()). By SWIOTLB I mean 
specifically the library itself, not dma-direct or any of the other 
users built around it. The functionality for managing slots in a buffer 
and bouncing data in and out can absolutely be reused - that's why users 
like the Xen and iommu-dma code *are* reusing it instead of open-coding 
their own versions.

> And

> the IOVA management in SWIOTLB is not what we want. For example,

> SWIOTLB allocates and uses contiguous memory for bouncing, which is

> not necessary in VDUSE case.


alloc_iova() allocates a contiguous (in IOVA address) region of space. 
In vduse_domain_map_page() you use it to allocate a contiguous region of 
space from your bounce buffer. Can you clarify how that is fundamentally 
different from allocating a contiguous region of space from a bounce 
buffer? Nobody's saying the underlying implementation details of where 
the buffer itself comes from can't be tweaked.

> And VDUSE needs coherent mapping which is

> not supported by the SWIOTLB. Besides, the SWIOTLB works in singleton

> mode (designed for platform IOMMU) , but VDUSE is based on on-chip

> IOMMU (supports multiple instances).

That's not entirely true - the IOMMU bounce buffering scheme introduced 
in intel-iommu and now moved into the iommu-dma layer was already a step 
towards something conceptually similar. It does still rely on stealing 
the underlying pages from the global SWIOTLB pool at the moment, but the 
bouncing is effectively done in a per-IOMMU-domain context.

The next step is currently queued in linux-next, wherein we can now have 
individual per-device SWIOTLB pools. In fact at that point I think you 
might actually be able to do your thing without implementing any special 
DMA ops at all - you'd need to set up a pool for your "device" with 
force_bounce set, then when you mmap() that to userspace, set up 
dev->dma_range_map to describe an offset from the physical address of 
the buffer to the userspace address, and I think dma-direct would be 
tricked into doing the right thing. It's a bit wacky, but it could stand 
to save a hell of a lot of bother.

Finally, enhancing SWIOTLB to cope with virtually-mapped buffers that 
don't have to be physically contiguous is a future improvement which I 
think could benefit various use-cases - indeed it's possibly already on 
the table for IOMMU bounce pages - so would probably be welcome in general.

 > So I still prefer to reuse the

 > IOVA allocator to implement a MMU-based software IOTLB.


If you're dead set on open-coding all the bounce-buffering machinery, 
then I'd honestly recommend open-coding a more suitable buffer allocator 
as well ;)

Thanks,
Robin.
Yongji Xie Aug. 5, 2021, 12:34 p.m. UTC | #7
On Wed, Aug 4, 2021 at 11:43 PM Robin Murphy <robin.murphy@arm.com> wrote:
>

> On 2021-08-04 06:02, Yongji Xie wrote:

> > On Tue, Aug 3, 2021 at 6:54 PM Robin Murphy <robin.murphy@arm.com> wrote:

> >>

> >> On 2021-08-03 09:54, Yongji Xie wrote:

> >>> On Tue, Aug 3, 2021 at 3:41 PM Jason Wang <jasowang@redhat.com> wrote:

> >>>>

> >>>>

> >>>> 在 2021/7/29 下午3:34, Xie Yongji 写道:

> >>>>> Export alloc_iova_fast() and free_iova_fast() so that

> >>>>> some modules can use it to improve iova allocation efficiency.

> >>>>

> >>>>

> >>>> It's better to explain why alloc_iova() is not sufficient here.

> >>>>

> >>>

> >>> Fine.

> >>

> >> What I fail to understand from the later patches is what the IOVA domain

> >> actually represents. If the "device" is a userspace process then

> >> logically the "IOVA" would be the userspace address, so presumably

> >> somewhere you're having to translate between this arbitrary address

> >> space and actual usable addresses - if you're worried about efficiency

> >> surely it would be even better to not do that?

> >>

> >

> > Yes, userspace daemon needs to translate the "IOVA" in a DMA

> > descriptor to the VA (from mmap(2)). But this actually doesn't affect

> > performance since it's an identical mapping in most cases.

>

> I'm not familiar with the vhost_iotlb stuff, but it looks suspiciously

> like you're walking yet another tree to make those translations. Even if

> the buffer can be mapped all at once with a fixed offset such that each

> DMA mapping call doesn't need a lookup for each individual "IOVA" - that

> might be what's happening already, but it's a bit hard to follow just

> reading the patches in my mail client - vhost_iotlb_add_range() doesn't

> look like it's super-cheap to call, and you're serialising on a lock for

> that.

>


Yes, that's true. Since the software IOTLB is not used in the VM case,
we need a unified way (vhost_iotlb) to manage the IOVA mapping for
both VM and Container cases.

> My main point, though, is that if you've already got something else

> keeping track of the actual addresses, then the way you're using an

> iova_domain appears to be something you could do with a trivial bitmap

> allocator. That's why I don't buy the efficiency argument. The main

> design points of the IOVA allocator are to manage large address spaces

> while trying to maximise spatial locality to minimise the underlying

> pagetable usage, and allocating with a flexible limit to support

> multiple devices with different addressing capabilities in the same

> address space. If none of those aspects are relevant to the use-case -

> which AFAICS appears to be true here - then as a general-purpose

> resource allocator it's rubbish and has an unreasonably massive memory

> overhead and there are many, many better choices.

>


OK, I get your point. Actually we used the genpool allocator in the
early version. Maybe we can fall back to using it.

> FWIW I've recently started thinking about moving all the caching stuff

> out of iova_domain and into the iommu-dma layer since it's now a giant

> waste of space for all the other current IOVA users.

>

> >> Presumably userspace doesn't have any concern about alignment and the

> >> things we have to worry about for the DMA API in general, so it's pretty

> >> much just allocating slots in a buffer, and there are far more effective

> >> ways to do that than a full-blown address space manager.

> >

> > Considering iova allocation efficiency, I think the iova allocator is

> > better here. In most cases, we don't even need to hold a spin lock

> > during iova allocation.

> >

> >> If you're going

> >> to reuse any infrastructure I'd have expected it to be SWIOTLB rather

> >> than the IOVA allocator. Because, y'know, you're *literally implementing

> >> a software I/O TLB* ;)

> >>

> >

> > But actually what we can reuse in SWIOTLB is the IOVA allocator.

>

> Huh? Those are completely unrelated and orthogonal things - SWIOTLB does

> not use an external allocator (see find_slots()). By SWIOTLB I mean

> specifically the library itself, not dma-direct or any of the other

> users built around it. The functionality for managing slots in a buffer

> and bouncing data in and out can absolutely be reused - that's why users

> like the Xen and iommu-dma code *are* reusing it instead of open-coding

> their own versions.

>


I see. Actually the slots management in SWIOTLB is what I mean by IOVA
allocator.

> > And

> > the IOVA management in SWIOTLB is not what we want. For example,

> > SWIOTLB allocates and uses contiguous memory for bouncing, which is

> > not necessary in VDUSE case.

>

> alloc_iova() allocates a contiguous (in IOVA address) region of space.

> In vduse_domain_map_page() you use it to allocate a contiguous region of

> space from your bounce buffer. Can you clarify how that is fundamentally

> different from allocating a contiguous region of space from a bounce

> buffer? Nobody's saying the underlying implementation details of where

> the buffer itself comes from can't be tweaked.

>


I mean physically contiguous memory here. We can currently allocate
the bounce pages one by one rather than allocating a bunch of
physically contiguous memory at once which is not friendly to a
userspace device.

> > And VDUSE needs coherent mapping which is

> > not supported by the SWIOTLB. Besides, the SWIOTLB works in singleton

> > mode (designed for platform IOMMU) , but VDUSE is based on on-chip

> > IOMMU (supports multiple instances).

> That's not entirely true - the IOMMU bounce buffering scheme introduced

> in intel-iommu and now moved into the iommu-dma layer was already a step

> towards something conceptually similar. It does still rely on stealing

> the underlying pages from the global SWIOTLB pool at the moment, but the

> bouncing is effectively done in a per-IOMMU-domain context.

>

> The next step is currently queued in linux-next, wherein we can now have

> individual per-device SWIOTLB pools. In fact at that point I think you

> might actually be able to do your thing without implementing any special

> DMA ops at all - you'd need to set up a pool for your "device" with

> force_bounce set, then when you mmap() that to userspace, set up

> dev->dma_range_map to describe an offset from the physical address of

> the buffer to the userspace address, and I think dma-direct would be

> tricked into doing the right thing. It's a bit wacky, but it could stand

> to save a hell of a lot of bother.

>


Cool! I missed this work, sorry. But it looks like its current version
can't meet our needs (e.g. avoid using physically contiguous memory).
So I'd like to consider it as a follow-up optimization and use a
general IOVA allocator in this initial version. The IOVA allocator
would be still needed for coherent mapping
(vduse_domain_alloc_coherent() and vduse_domain_free_coherent()) after
we reuse the SWIOTLB.

> Finally, enhancing SWIOTLB to cope with virtually-mapped buffers that

> don't have to be physically contiguous is a future improvement which I

> think could benefit various use-cases - indeed it's possibly already on

> the table for IOMMU bounce pages - so would probably be welcome in general.

>


Yes, it's indeed needed by VDUSE. But I'm not sure if it would be
needed by other drivers. Looks like we need swiotlb_tbl_map_single()
to return a virtual address and introduce some way to let the caller
do some translation between VA to PA.

Thanks,
Yongji
Jason Wang Aug. 5, 2021, 1:31 p.m. UTC | #8
在 2021/8/5 下午8:34, Yongji Xie 写道:
>> My main point, though, is that if you've already got something else

>> keeping track of the actual addresses, then the way you're using an

>> iova_domain appears to be something you could do with a trivial bitmap

>> allocator. That's why I don't buy the efficiency argument. The main

>> design points of the IOVA allocator are to manage large address spaces

>> while trying to maximise spatial locality to minimise the underlying

>> pagetable usage, and allocating with a flexible limit to support

>> multiple devices with different addressing capabilities in the same

>> address space. If none of those aspects are relevant to the use-case -

>> which AFAICS appears to be true here - then as a general-purpose

>> resource allocator it's rubbish and has an unreasonably massive memory

>> overhead and there are many, many better choices.

>>

> OK, I get your point. Actually we used the genpool allocator in the

> early version. Maybe we can fall back to using it.



I think maybe you can share some perf numbers to see how much 
alloc_iova_fast() can help.

Thanks


>
Yongji Xie Aug. 9, 2021, 5:56 a.m. UTC | #9
On Thu, Aug 5, 2021 at 9:31 PM Jason Wang <jasowang@redhat.com> wrote:
>

>

> 在 2021/8/5 下午8:34, Yongji Xie 写道:

> >> My main point, though, is that if you've already got something else

> >> keeping track of the actual addresses, then the way you're using an

> >> iova_domain appears to be something you could do with a trivial bitmap

> >> allocator. That's why I don't buy the efficiency argument. The main

> >> design points of the IOVA allocator are to manage large address spaces

> >> while trying to maximise spatial locality to minimise the underlying

> >> pagetable usage, and allocating with a flexible limit to support

> >> multiple devices with different addressing capabilities in the same

> >> address space. If none of those aspects are relevant to the use-case -

> >> which AFAICS appears to be true here - then as a general-purpose

> >> resource allocator it's rubbish and has an unreasonably massive memory

> >> overhead and there are many, many better choices.

> >>

> > OK, I get your point. Actually we used the genpool allocator in the

> > early version. Maybe we can fall back to using it.

>

>

> I think maybe you can share some perf numbers to see how much

> alloc_iova_fast() can help.

>


I did some fio tests[1] with a ram-backend vduse block device[2].

Following are some performance data:

                            numjobs=1   numjobs=2    numjobs=4   numjobs=8
iova_alloc_fast    145k iops      265k iops      514k iops      758k iops

iova_alloc            137k iops     170k iops      128k iops      113k iops

gen_pool_alloc   143k iops      270k iops      458k iops      521k iops

The iova_alloc_fast() has the best performance since we always hit the
per-cpu cache. Regardless of the per-cpu cache, the genpool allocator
should be better than the iova allocator.

[1] fio jobfile:

[global]
rw=randread
direct=1
ioengine=libaio
iodepth=16
time_based=1
runtime=60s
group_reporting
bs=4k
filename=/dev/vda
[job]
numjobs=..

[2]  $ qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev
driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
\
      --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128

The qemu-storage-daemon can be builded based on the repo:
https://github.com/bytedance/qemu/tree/vduse-test.

Thanks,
Yongji
Jason Wang Aug. 10, 2021, 3:02 a.m. UTC | #10
在 2021/8/9 下午1:56, Yongji Xie 写道:
> On Thu, Aug 5, 2021 at 9:31 PM Jason Wang <jasowang@redhat.com> wrote:

>>

>> 在 2021/8/5 下午8:34, Yongji Xie 写道:

>>>> My main point, though, is that if you've already got something else

>>>> keeping track of the actual addresses, then the way you're using an

>>>> iova_domain appears to be something you could do with a trivial bitmap

>>>> allocator. That's why I don't buy the efficiency argument. The main

>>>> design points of the IOVA allocator are to manage large address spaces

>>>> while trying to maximise spatial locality to minimise the underlying

>>>> pagetable usage, and allocating with a flexible limit to support

>>>> multiple devices with different addressing capabilities in the same

>>>> address space. If none of those aspects are relevant to the use-case -

>>>> which AFAICS appears to be true here - then as a general-purpose

>>>> resource allocator it's rubbish and has an unreasonably massive memory

>>>> overhead and there are many, many better choices.

>>>>

>>> OK, I get your point. Actually we used the genpool allocator in the

>>> early version. Maybe we can fall back to using it.

>>

>> I think maybe you can share some perf numbers to see how much

>> alloc_iova_fast() can help.

>>

> I did some fio tests[1] with a ram-backend vduse block device[2].

>

> Following are some performance data:

>

>                              numjobs=1   numjobs=2    numjobs=4   numjobs=8

> iova_alloc_fast    145k iops      265k iops      514k iops      758k iops

>

> iova_alloc            137k iops     170k iops      128k iops      113k iops

>

> gen_pool_alloc   143k iops      270k iops      458k iops      521k iops

>

> The iova_alloc_fast() has the best performance since we always hit the

> per-cpu cache. Regardless of the per-cpu cache, the genpool allocator

> should be better than the iova allocator.



I think we see convincing numbers for using iova_alloc_fast() than the 
gen_poll_alloc() (45% improvement on job=8).

Thanks


>

> [1] fio jobfile:

>

> [global]

> rw=randread

> direct=1

> ioengine=libaio

> iodepth=16

> time_based=1

> runtime=60s

> group_reporting

> bs=4k

> filename=/dev/vda

> [job]

> numjobs=..

>

> [2]  $ qemu-storage-daemon \

>        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \

>        --monitor chardev=charmonitor \

>        --blockdev

> driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0

> \

>        --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128

>

> The qemu-storage-daemon can be builded based on the repo:

> https://github.com/bytedance/qemu/tree/vduse-test.

>

> Thanks,

> Yongji

>
Yongji Xie Aug. 10, 2021, 7:43 a.m. UTC | #11
On Tue, Aug 10, 2021 at 11:02 AM Jason Wang <jasowang@redhat.com> wrote:
>

>

> 在 2021/8/9 下午1:56, Yongji Xie 写道:

> > On Thu, Aug 5, 2021 at 9:31 PM Jason Wang <jasowang@redhat.com> wrote:

> >>

> >> 在 2021/8/5 下午8:34, Yongji Xie 写道:

> >>>> My main point, though, is that if you've already got something else

> >>>> keeping track of the actual addresses, then the way you're using an

> >>>> iova_domain appears to be something you could do with a trivial bitmap

> >>>> allocator. That's why I don't buy the efficiency argument. The main

> >>>> design points of the IOVA allocator are to manage large address spaces

> >>>> while trying to maximise spatial locality to minimise the underlying

> >>>> pagetable usage, and allocating with a flexible limit to support

> >>>> multiple devices with different addressing capabilities in the same

> >>>> address space. If none of those aspects are relevant to the use-case -

> >>>> which AFAICS appears to be true here - then as a general-purpose

> >>>> resource allocator it's rubbish and has an unreasonably massive memory

> >>>> overhead and there are many, many better choices.

> >>>>

> >>> OK, I get your point. Actually we used the genpool allocator in the

> >>> early version. Maybe we can fall back to using it.

> >>

> >> I think maybe you can share some perf numbers to see how much

> >> alloc_iova_fast() can help.

> >>

> > I did some fio tests[1] with a ram-backend vduse block device[2].

> >

> > Following are some performance data:

> >

> >                              numjobs=1   numjobs=2    numjobs=4   numjobs=8

> > iova_alloc_fast    145k iops      265k iops      514k iops      758k iops

> >

> > iova_alloc            137k iops     170k iops      128k iops      113k iops

> >

> > gen_pool_alloc   143k iops      270k iops      458k iops      521k iops

> >

> > The iova_alloc_fast() has the best performance since we always hit the

> > per-cpu cache. Regardless of the per-cpu cache, the genpool allocator

> > should be better than the iova allocator.

>

>

> I think we see convincing numbers for using iova_alloc_fast() than the

> gen_poll_alloc() (45% improvement on job=8).

>


Yes, so alloc_iova_fast() still seems to be the best choice based on
performance considerations.

Hi Robin, any comments?

Thanks,
Yongji
diff mbox series

Patch

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index b6cf5f16123b..3941ed6bb99b 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -521,6 +521,7 @@  alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
 
 	return new_iova->pfn_lo;
 }
+EXPORT_SYMBOL_GPL(alloc_iova_fast);
 
 /**
  * free_iova_fast - free iova pfn range into rcache
@@ -538,6 +539,7 @@  free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size)
 
 	free_iova(iovad, pfn);
 }
+EXPORT_SYMBOL_GPL(free_iova_fast);
 
 #define fq_ring_for_each(i, fq) \
 	for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % IOVA_FQ_SIZE)