
[net-next,v2,2/2] page_pool: optimize the cpu sync operation when DMA mapping

Message ID 1629442611-61547-3-git-send-email-linyunsheng@huawei.com
State Superseded

Commit Message

Yunsheng Lin Aug. 20, 2021, 6:56 a.m. UTC
If the DMA_ATTR_SKIP_CPU_SYNC is not set, cpu syncing is
also done in dma_map_page_attrs(), so set the attrs according
to pool->p.flags to avoid calling the cpu sync function again.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 net/core/page_pool.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

Comments

Ilias Apalodimas Aug. 20, 2021, 9:39 a.m. UTC | #1
On Fri, Aug 20, 2021 at 02:56:51PM +0800, Yunsheng Lin wrote:
> If the DMA_ATTR_SKIP_CPU_SYNC is not set, cpu syncing is
> also done in dma_map_page_attrs(), so set the attrs according
> to pool->p.flags to avoid calling the cpu sync function again.

Isn't DMA_ATTR_SKIP_CPU_SYNC checked within dma_map_page_attrs() anyway?

Regards
/Ilias
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  net/core/page_pool.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 1a69784..3df5554 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -191,8 +191,12 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>  
>  static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>  {
> +	unsigned long attrs = DMA_ATTR_SKIP_CPU_SYNC;
>  	dma_addr_t dma;
>  
> +	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> +		attrs = 0;
> +
>  	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>  	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
>  	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
> @@ -200,15 +204,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>  	 */
>  	dma = dma_map_page_attrs(pool->p.dev, page, 0,
>  				 (PAGE_SIZE << pool->p.order),
> -				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> +				 pool->p.dma_dir, attrs);
>  	if (dma_mapping_error(pool->p.dev, dma))
>  		return false;
>  
>  	page_pool_set_dma_addr(page, dma);
>  
> -	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> -		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> -
>  	return true;
>  }
>  
> -- 
> 2.7.4
>
Yunsheng Lin Aug. 23, 2021, 3:56 a.m. UTC | #2
On 2021/8/20 17:39, Ilias Apalodimas wrote:
> On Fri, Aug 20, 2021 at 02:56:51PM +0800, Yunsheng Lin wrote:
>> If the DMA_ATTR_SKIP_CPU_SYNC is not set, cpu syncing is
>> also done in dma_map_page_attrs(), so set the attrs according
>> to pool->p.flags to avoid calling the cpu sync function again.
> 
> Isn't DMA_ATTR_SKIP_CPU_SYNC checked within dma_map_page_attrs() anyway?

Yes, the checking in dma_map_page_attrs() should save us from
calling dma_sync_single_for_device() again if we set the attrs
according to "pool->p.flags & PP_FLAG_DMA_SYNC_DEV".

As dma_sync_single_for_device() is EXPORT_SYMBOL()'ed and should be
a no-op for a dma-coherent device, there may still be a function call
overhead for such a device; letting dma_map_page_attrs() handle the
sync seems to avoid that stack pushing/popping overhead:

https://elixir.bootlin.com/linux/latest/source/kernel/dma/direct.h#L104
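
For context, the check at that link boils down to roughly the sketch
below (a simplified rendering of the dma-direct map path, with the
swiotlb and addressability checks omitted, so not the exact kernel
source):

/* Simplified sketch of the dma-direct mapping path linked above: the
 * CPU sync only happens when the device is non-coherent AND the caller
 * did not pass DMA_ATTR_SKIP_CPU_SYNC, so an attrs value derived from
 * PP_FLAG_DMA_SYNC_DEV lets the map call do the sync inline.
 */
static inline dma_addr_t dma_direct_map_page(struct device *dev,
		struct page *page, unsigned long offset, size_t size,
		enum dma_data_direction dir, unsigned long attrs)
{
	phys_addr_t phys = page_to_phys(page) + offset;
	dma_addr_t dma_addr = phys_to_dma(dev, phys);

	if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
		arch_sync_dma_for_device(phys, size, dir);

	return dma_addr;
}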

The one thing I am not sure about is that the pool->p.offset
and pool->p.max_len are used to decide the sync range before this
patch, while the sync range is the same as the map range when doing
the sync in dma_map_page_attrs().

I assume the above is not an issue: we only sync more than we need,
and it won't hurt performance?
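
To spell out the two ranges being compared (pp_sync_ranges below is a
hypothetical helper, purely for illustration; the fields are the real
struct page_pool_params ones):

/* Hypothetical helper contrasting the two sync ranges discussed above:
 * before this patch the explicit sync covers pool->p.offset up to
 * offset + pool->p.max_len, while the map-time sync done inside
 * dma_map_page_attrs() covers the whole mapped area from offset 0.
 */
static void pp_sync_ranges(const struct page_pool *pool)
{
	unsigned int sync_off = pool->p.offset;
	size_t sync_len = pool->p.max_len;           /* explicit sync */
	size_t map_len = PAGE_SIZE << pool->p.order; /* map-time sync */

	pr_info("explicit sync: off=%u len=%zu, map-time sync: off=0 len=%zu\n",
		sync_off, sync_len, map_len);
}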

> 
> Regards
> /Ilias
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>>  net/core/page_pool.c | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 1a69784..3df5554 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -191,8 +191,12 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>>  
>>  static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>  {
>> +	unsigned long attrs = DMA_ATTR_SKIP_CPU_SYNC;
>>  	dma_addr_t dma;
>>  
>> +	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>> +		attrs = 0;
>> +
>>  	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>>  	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
>>  	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
>> @@ -200,15 +204,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>  	 */
>>  	dma = dma_map_page_attrs(pool->p.dev, page, 0,
>>  				 (PAGE_SIZE << pool->p.order),
>> -				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
>> +				 pool->p.dma_dir, attrs);
>>  	if (dma_mapping_error(pool->p.dev, dma))
>>  		return false;
>>  
>>  	page_pool_set_dma_addr(page, dma);
>>  
>> -	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>> -		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>> -
>>  	return true;
>>  }
>>  
>> -- 
>> 2.7.4
>>
> .
>
Ilias Apalodimas Aug. 23, 2021, 12:42 p.m. UTC | #3
On Mon, Aug 23, 2021 at 11:56:48AM +0800, Yunsheng Lin wrote:
> On 2021/8/20 17:39, Ilias Apalodimas wrote:
> > On Fri, Aug 20, 2021 at 02:56:51PM +0800, Yunsheng Lin wrote:
> >> If the DMA_ATTR_SKIP_CPU_SYNC is not set, cpu syncing is
> >> also done in dma_map_page_attrs(), so set the attrs according
> >> to pool->p.flags to avoid calling the cpu sync function again.
> > 
> > Isn't DMA_ATTR_SKIP_CPU_SYNC checked within dma_map_page_attrs() anyway?
> 
> Yes, the checking in dma_map_page_attrs() should save us from
> calling dma_sync_single_for_device() again if we set the attrs
> according to "pool->p.flags & PP_FLAG_DMA_SYNC_DEV".

But we aren't syncing anything right now when we allocate the pages,
since this is called with DMA_ATTR_SKIP_CPU_SYNC. We sync the allocated
range at the end of the function, if the pool was created with a
request to take care of the mappings for us.
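
That is, the current flow is (condensed from the pre-patch
page_pool_dma_map() shown in the diff):

	/* map with the CPU sync skipped... */
	dma = dma_map_page_attrs(pool->p.dev, page, 0,
				 (PAGE_SIZE << pool->p.order),
				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
	if (dma_mapping_error(pool->p.dev, dma))
		return false;

	page_pool_set_dma_addr(page, dma);

	/* ...then do an explicit, driver-sized sync only when the pool
	 * was asked to own the syncing
	 */
	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);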

> 
> As dma_sync_single_for_device() is EXPORT_SYMBOL()'ed and should be
> a no-op for a dma-coherent device, there may still be a function call
> overhead for such a device; letting dma_map_page_attrs() handle the
> sync seems to avoid that stack pushing/popping overhead:
> 
> https://elixir.bootlin.com/linux/latest/source/kernel/dma/direct.h#L104
> 
> The one thing I am not sure about is that the pool->p.offset
> and pool->p.max_len are used to decide the sync range before this
> patch, while the sync range is the same as the map range when doing
> the sync in dma_map_page_attrs().

I am not sure I am following here. We always sync the entire range as
well in the current code, as the sync function is called with max_len.

> 
> I assume the above is not an issue: we only sync more than we need,
> and it won't hurt performance?

We can sync more than we need, but if it's a non-coherent architecture,
there's a performance penalty. 

Regards
/Ilias
> 
> > 
> > Regards
> > /Ilias
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> ---
> >>  net/core/page_pool.c | 9 +++++----
> >>  1 file changed, 5 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> >> index 1a69784..3df5554 100644
> >> --- a/net/core/page_pool.c
> >> +++ b/net/core/page_pool.c
> >> @@ -191,8 +191,12 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> >>  
> >>  static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>  {
> >> +	unsigned long attrs = DMA_ATTR_SKIP_CPU_SYNC;
> >>  	dma_addr_t dma;
> >>  
> >> +	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >> +		attrs = 0;
> >> +
> >>  	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> >>  	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
> >>  	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
> >> @@ -200,15 +204,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>  	 */
> >>  	dma = dma_map_page_attrs(pool->p.dev, page, 0,
> >>  				 (PAGE_SIZE << pool->p.order),
> >> -				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> >> +				 pool->p.dma_dir, attrs);
> >>  	if (dma_mapping_error(pool->p.dev, dma))
> >>  		return false;
> >>  
> >>  	page_pool_set_dma_addr(page, dma);
> >>  
> >> -	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >> -		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> >> -
> >>  	return true;
> >>  }
> >>  
> >> -- 
> >> 2.7.4
> >>
> > .
> >
Yunsheng Lin Aug. 24, 2021, 7 a.m. UTC | #4
On 2021/8/23 20:42, Ilias Apalodimas wrote:
> On Mon, Aug 23, 2021 at 11:56:48AM +0800, Yunsheng Lin wrote:
>> On 2021/8/20 17:39, Ilias Apalodimas wrote:
>>> On Fri, Aug 20, 2021 at 02:56:51PM +0800, Yunsheng Lin wrote:

[..]
>>
>> https://elixir.bootlin.com/linux/latest/source/kernel/dma/direct.h#L104
>>
>> The one thing I am not sure about is that the pool->p.offset
>> and pool->p.max_len are used to decide the sync range before this
>> patch, while the sync range is the same as the map range when doing
>> the sync in dma_map_page_attrs().
> 
> I am not sure I am following here. We always sync the entire range as
> well in the current code, as the sync function is called with max_len.
> 
>>
>> I assume the above is not an issue: we only sync more than we need,
>> and it won't hurt performance?
> 
> We can sync more than we need, but if it's a non-coherent architecture,
> there's a performance penalty.

Since I do not have any performance data to prove whether there is a
performance penalty for a non-coherent architecture, I will drop it :)

> 
> Regards
> /Ilias
>>
Ilias Apalodimas Aug. 24, 2021, 9:04 a.m. UTC | #5
Hi Yunsheng,

+cc Lorenzo, who has done some tests on non-coherent platforms

On Tue, 24 Aug 2021 at 10:00, Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2021/8/23 20:42, Ilias Apalodimas wrote:
> > On Mon, Aug 23, 2021 at 11:56:48AM +0800, Yunsheng Lin wrote:
> >> On 2021/8/20 17:39, Ilias Apalodimas wrote:
> >>> On Fri, Aug 20, 2021 at 02:56:51PM +0800, Yunsheng Lin wrote:
>
> [..]
> >>
> >> https://elixir.bootlin.com/linux/latest/source/kernel/dma/direct.h#L104
> >>
> >> The one thing I am not sure about is that the pool->p.offset
> >> and pool->p.max_len are used to decide the sync range before this
> >> patch, while the sync range is the same as the map range when doing
> >> the sync in dma_map_page_attrs().
> >
> > I am not sure I am following here. We always sync the entire range as
> > well in the current code, as the sync function is called with max_len.
> >
> >>
> >> I assume the above is not an issue: we only sync more than we need,
> >> and it won't hurt performance?
> >
> > We can sync more than we need, but if it's a non-coherent architecture,
> > there's a performance penalty.
>
> Since I do not have any performance data to prove whether there is a
> performance penalty for a non-coherent architecture, I will drop it :)

I am pretty sure it does affect it.  Unless I am missing something, the
patch simply re-arranges calls to avoid the separate
page_pool_dma_sync_for_device() call, right?
However, since dma_map_page_attrs() won't do anything sync-related when
it's called with DMA_ATTR_SKIP_CPU_SYNC, I doubt the change will have
any measurable difference.  If there is one, we should pick it up.


Regards
/Ilias
>
> >
> > Regards
> > /Ilias
> >>

Patch

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 1a69784..3df5554 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -191,8 +191,12 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
 
 static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 {
+	unsigned long attrs = DMA_ATTR_SKIP_CPU_SYNC;
 	dma_addr_t dma;
 
+	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+		attrs = 0;
+
 	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
 	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
 	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
@@ -200,15 +204,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 	 */
 	dma = dma_map_page_attrs(pool->p.dev, page, 0,
 				 (PAGE_SIZE << pool->p.order),
-				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
+				 pool->p.dma_dir, attrs);
 	if (dma_mapping_error(pool->p.dev, dma))
 		return false;
 
 	page_pool_set_dma_addr(page, dma);
 
-	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
-		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
-
 	return true;
 }