[net-next,v2,3/3] skbuff: keep track of pp page when __skb_frag_ref() is called

Message ID 20210914121114.28559-4-linyunsheng@huawei.com
State New
Series some optimization for page pool

Commit Message

Yunsheng Lin Sept. 14, 2021, 12:11 p.m. UTC
As the skb->pp_recycle and page->pp_magic may not be enough
to track whether a frag page is from the page pool after
__skb_frag_ref() has been called, mostly because of a data
race, see: commit 2cc3aeb5eccc ("skbuff: Fix a potential race
while recycling page_pool packets").

There are also clone-and-expand-head cases that may lose
track of whether a frag page is from the page pool.

Not being able to keep track of pp pages may cause a problem
for the skb_split() case in tso_fragment() too:
Suppose a skb has 3 frag pages, all coming from a page pool,
and is split into skb1 and skb2:
skb1: first frag page + first half of second frag page
skb2: second half of second frag page + third frag page

How do we set the skb->pp_recycle of skb1 and skb2?
1. If we set both of them to 1, then we may have a race
   similar to the one in the above commit for the second
   frag page.
2. If we set only one of them to 1, then we may leak
   resources, as both the first frag page and the third frag
   page are indeed from a page pool.

So increment the frag count when __skb_frag_ref() is called,
and use only page->pp_magic to indicate whether a frag page
is from a page pool, avoiding the above data race.

For 32 bit systems with 64 bit dma, we preserve the original
behavior, as the frag count there is used to track how many
times __skb_frag_ref() has been called on a frag page.

We still use both skb->pp_recycle and page->pp_magic to
decide whether the head page of a skb is from a page pool.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/linux/skbuff.h  | 40 ++++++++++++++++++++++++++++++++++++----
 include/net/page_pool.h | 28 +++++++++++++++++++++++++++-
 net/core/page_pool.c    | 16 ++--------------
 3 files changed, 65 insertions(+), 19 deletions(-)

Comments

Alexander Duyck Sept. 15, 2021, 12:59 a.m. UTC | #1
On Tue, Sep 14, 2021 at 5:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> As the skb->pp_recycle and page->pp_magic may not be enough
> to track if a frag page is from page pool after the calling
> of __skb_frag_ref(), mostly because of a data race, see:
> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
> recycling page_pool packets").


I'm not sure how this comment actually applies. It is an issue that
was fixed. If anything my concern is that this change will introduce
new races instead of fixing any existing ones.

> There may be clone and expand head case that might lose the
> track if a frag page is from page pool or not.


Can you explain how? If there is such a case we should fix it instead
of trying to introduce new features to address it. This seems more
like a performance optimization rather than a fix.

> And not being able to keep track of pp page may cause problem
> for the skb_split() case in tso_fragment() too:
> Supposing a skb has 3 frag pages, all coming from a page pool,
> and is split to skb1 and skb2:
> skb1: first frag page + first half of second frag page
> skb2: second half of second frag page + third frag page
>
> How do we set the skb->pp_recycle of skb1 and skb2?
> 1. If we set both of them to 1, then we may have a similar
>    race as the above commit for second frag page.
> 2. If we set only one of them to 1, then we may have resource
>    leaking problem as both first frag page and third frag page
>    are indeed from page pool.


The question I would have is in the above cases how is skb->pp_recycle
being set on the second buffer? Is it already coming that way? If so,
maybe you should special case the __skb_frag_ref when you know you are
loading a recycling skb instead of just assuming you want to do it
automatically.

> So increment the frag count when __skb_frag_ref() is called,
> and only use page->pp_magic to indicate if a frag page is from
> page pool, to avoid the above data race.


This assumes the page is only going to be used by the network stack.
My concern is what happens when the page is pulled out of the skb and
used for example in storage?

> For 32 bit systems with 64 bit dma, we preserve the original
> behavior as frag count is used to track how many times a
> frag page is called with __skb_frag_ref().
>
> We still use both skb->pp_recycle and page->pp_magic to decide
> the head page for a skb is from page pool or not.
>

> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  include/linux/skbuff.h  | 40 ++++++++++++++++++++++++++++++++++++----
>  include/net/page_pool.h | 28 +++++++++++++++++++++++++++-
>  net/core/page_pool.c    | 16 ++--------------
>  3 files changed, 65 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 35eebc2310a5..4d975ab27078 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3073,7 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>   */
>  static inline void __skb_frag_ref(skb_frag_t *frag)
>  {
> -       get_page(skb_frag_page(frag));
> +       struct page *page = skb_frag_page(frag);
> +
> +#ifdef CONFIG_PAGE_POOL
> +       if (page_pool_is_pp_page(page)) {
> +               page_pool_atomic_inc_frag_count(page);
> +               return;
> +       }
> +#endif
> +
> +       get_page(page);
>  }
>
>  /**
> @@ -3088,6 +3097,22 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
>         __skb_frag_ref(&skb_shinfo(skb)->frags[f]);
>  }
>
> +/**
> + * skb_frag_is_pp_page - decide if a page is recyclable.
> + * @page: frag page
> + * @recycle: skb->pp_recycle
> + *
> + * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
> + * also used to decide if a page can be recycled to the page
> + * pool.
> + */
> +static inline bool skb_frag_is_pp_page(struct page *page,
> +                                      bool recycle)
> +{
> +       return page_pool_is_pp_page(page) ||
> +               (recycle && __page_pool_is_pp_page(page));
> +}
> +


The logic for this check is ugly. You are essentially calling
__page_pool_is_pp_page again if it fails the first check. It would
probably make more sense to rearrange things and just call
(!DMA_USE_PP_FRAG_COUNT || recycle)  && __page_pool_is_pp_page(). With
that the check of recycle could be dropped entirely if frag count is
valid to use, and in the non-fragcount case it reverts back to the
original check.
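
For illustration, a minimal sketch of that rearrangement (using the names
from this patch; the macro's exact spelling is questioned further below):

static inline bool skb_frag_is_pp_page(struct page *page, bool recycle)
{
	/* Sketch of the suggested check: when the frag count is usable,
	 * the page signature alone decides and 'recycle' is ignored;
	 * otherwise this reverts to the original recycle-gated check.
	 */
	return (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT || recycle) &&
		__page_pool_is_pp_page(page);
}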

>  /**
>   * __skb_frag_unref - release a reference on a paged fragment.
>   * @frag: the paged fragment
> @@ -3101,8 +3126,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
>         struct page *page = skb_frag_page(frag);
>
>  #ifdef CONFIG_PAGE_POOL
> -       if (recycle && page_pool_return_skb_page(page))
> +       if (skb_frag_is_pp_page(page, recycle)) {
> +               page_pool_return_skb_page(page);
>                 return;
> +       }
>  #endif
>         put_page(page);
>  }
> @@ -4720,9 +4747,14 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
>
>  static inline bool skb_pp_recycle(struct sk_buff *skb, void *data)
>  {
> -       if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
> +       struct page *page = virt_to_head_page(data);
> +
> +       if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle ||
> +           !__page_pool_is_pp_page(page))
>                 return false;
> -       return page_pool_return_skb_page(virt_to_head_page(data));
> +
> +       page_pool_return_skb_page(page);
> +       return true;
>  }
>
>  #endif /* __KERNEL__ */


As I recall the virt_to_head_page isn't necessarily a cheap call as it
can lead to quite a bit of pointer chasing and a few extra math steps
to do the virt to page conversion. I would be curious how much extra
overhead this is adding to the non-fragcount or non-recycling case.

> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 2ad0706566c5..eb103d86f453 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -164,7 +164,7 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
>         return pool->p.dma_dir;
>  }
>
> -bool page_pool_return_skb_page(struct page *page);
> +void page_pool_return_skb_page(struct page *page);
>
>  struct page_pool *page_pool_create(const struct page_pool_params *params);
>
> @@ -244,6 +244,32 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
>         atomic_long_set(&page->pp_frag_count, nr);
>  }
>
> +static inline void page_pool_atomic_inc_frag_count(struct page *page)
> +{
> +       atomic_long_inc(&page->pp_frag_count);
> +}
> +


Your function name is almost as long as the function itself. Maybe you
don't need it?

> +static inline bool __page_pool_is_pp_page(struct page *page)
> +{
> +       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> +        * in order to preserve any existing bits, such as bit 0 for the
> +        * head page of compound page and bit 1 for pfmemalloc page, so
> +        * mask those bits for freeing side when doing below checking,
> +        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> +        * to avoid recycling the pfmemalloc page.
> +        */
> +       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
> +}
> +
> +static inline bool page_pool_is_pp_page(struct page *page)
> +{
> +       /* For systems with the same dma addr as the bus addr, we can use
> +        * page->pp_magic to indicate a pp page uniquely.
> +        */
> +       return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
> +                       __page_pool_is_pp_page(page);
> +}
> +


We should really change the name of the #define. I keep reading it as
we are using the PP_FRAG_COUNT, not that it is already in use. Maybe
we should look at something like PP_FRAG_COUNT_VALID and just invert
the logic for it.

Also this function naming is really confusing. You don't have to have
the frag count to be a page pool page. Maybe this should be something
like page_pool_is_pp_frag_page.

>  static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>                                                           long nr)
>  {
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 09d7b8614ef5..3a419871d4bc 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -24,7 +24,7 @@
>  #define DEFER_TIME (msecs_to_jiffies(1000))
>  #define DEFER_WARN_INTERVAL (60 * HZ)
>
> -#define BIAS_MAX       LONG_MAX
> +#define BIAS_MAX       (LONG_MAX / 2)
>
>  static int page_pool_init(struct page_pool *pool,
>                           const struct page_pool_params *params)


I still think this would be better as a separate patch calling out the
fact that you are changing the value with the plan to support
incrementing it in the future.

> @@ -736,20 +736,10 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>  }
>  EXPORT_SYMBOL(page_pool_update_nid);
>
> -bool page_pool_return_skb_page(struct page *page)
> +void page_pool_return_skb_page(struct page *page)
>  {
>         struct page_pool *pp;
>
> -       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> -        * in order to preserve any existing bits, such as bit 0 for the
> -        * head page of compound page and bit 1 for pfmemalloc page, so
> -        * mask those bits for freeing side when doing below checking,
> -        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> -        * to avoid recycling the pfmemalloc page.
> -        */
> -       if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> -               return false;
> -
>         pp = page->pp;
>
>         /* Driver set this to memory recycling info. Reset it on recycle.
> @@ -758,7 +748,5 @@ bool page_pool_return_skb_page(struct page *page)
>          * 'flipped' fragment being in use or not.
>          */
>         page_pool_put_full_page(pp, page, false);
> -
> -       return true;
>  }
>  EXPORT_SYMBOL(page_pool_return_skb_page);
> --
> 2.33.0
>
Yunsheng Lin Sept. 15, 2021, 9:07 a.m. UTC | #2
On 2021/9/15 8:59, Alexander Duyck wrote:
> On Tue, Sep 14, 2021 at 5:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> As the skb->pp_recycle and page->pp_magic may not be enough
>> to track if a frag page is from page pool after the calling
>> of __skb_frag_ref(), mostly because of a data race, see:
>> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
>> recycling page_pool packets").
>
> I'm not sure how this comment actually applies. It is an issue that
> was fixed. If anything my concern is that this change will introduce
> new races instead of fixing any existing ones.


My initial thinking about adding the above comment is to emphasize
that we might clear cloned skb's pp_recycle when doing head expanding,
and page pool might need to give up on that page if that cloned skb is
the last one to be freed.

>
>> There may be clone and expand head case that might lose the
>> track if a frag page is from page pool or not.
>
> Can you explain how? If there is such a case we should fix it instead
> of trying to introduce new features to address it. This seems more
> like a performance optimization rather than a fix.


Yes, I consider it an optimization too, that's why I am targeting
net-next.

Even for the below skb_split() case in tso_fragment(), I am not sure
how an rx pp page can go through the tcp stack yet.

>
>> And not being able to keep track of pp page may cause problem
>> for the skb_split() case in tso_fragment() too:
>> Supposing a skb has 3 frag pages, all coming from a page pool,
>> and is split to skb1 and skb2:
>> skb1: first frag page + first half of second frag page
>> skb2: second half of second frag page + third frag page
>>
>> How do we set the skb->pp_recycle of skb1 and skb2?
>> 1. If we set both of them to 1, then we may have a similar
>>    race as the above commit for second frag page.
>> 2. If we set only one of them to 1, then we may have resource
>>    leaking problem as both first frag page and third frag page
>>    are indeed from page pool.
>
> The question I would have is in the above cases how is skb->pp_recycle
> being set on the second buffer? Is it already coming that way? If so,


In the skb_split() implementation, skb and skb1 are the inputs: skb has
pp_recycle set, while skb1 is newly allocated without pp_recycle set.
After skb_split(), some payload of skb has been split off to skb1, so
how to set the pp_recycle of skb1 seems tricky here.

> maybe you should special case the __skb_frag_ref when you know you are
> loading a recycling skb instead of just assuming you want to do it
> automatically.


I am not sure what "special case" and "automatically" mean here.
One way I can think of is to avoid doing the __skb_frag_ref() and
instead allocate a new frag for skb1 and memcpy the data when a frag is
shared between skb and skb1 during the split. But allocating a new frag
and memcpying seem much heavier than doing the __skb_frag_ref().

>
>> So increment the frag count when __skb_frag_ref() is called,
>> and only use page->pp_magic to indicate if a frag page is from
>> page pool, to avoid the above data race.
>
> This assumes the page is only going to be used by the network stack.
> My concern is what happens when the page is pulled out of the skb and
> used for example in storage?


In my understanding, the above case works as before: if the storage
side is holding a reference to that page, the page pool will give up on
that page by checking "page_ref_count(page) == 1" when the last user
from the network stack is done with the pp page (as tracked by
page->pp_frag_count).
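
As a rough sketch of that "give up" check (not the actual
__page_pool_put_page() code, which also handles dma sync and the
direct-recycle paths):

static bool pp_sketch_can_recycle(struct page *page)
{
	/* Sketch only: a page whose pp_frag_count dropped to zero is
	 * recycled only if the network stack held the last page
	 * reference; any extra reference (e.g. one taken by a storage
	 * user) makes the pool give the page up instead.
	 */
	return page_ref_count(page) == 1 && !page_is_pfmemalloc(page);
}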

>
>> For 32 bit systems with 64 bit dma, we preserve the original
>> behavior as frag count is used to track how many times a
>> frag page is called with __skb_frag_ref().
>>
>> We still use both skb->pp_recycle and page->pp_magic to decide
>> the head page for a skb is from page pool or not.
>>

>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>>  include/linux/skbuff.h  | 40 ++++++++++++++++++++++++++++++++++++----
>>  include/net/page_pool.h | 28 +++++++++++++++++++++++++++-
>>  net/core/page_pool.c    | 16 ++--------------
>>  3 files changed, 65 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 35eebc2310a5..4d975ab27078 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -3073,7 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>>   */
>>  static inline void __skb_frag_ref(skb_frag_t *frag)
>>  {
>> -       get_page(skb_frag_page(frag));
>> +       struct page *page = skb_frag_page(frag);
>> +
>> +#ifdef CONFIG_PAGE_POOL
>> +       if (page_pool_is_pp_page(page)) {
>> +               page_pool_atomic_inc_frag_count(page);
>> +               return;
>> +       }
>> +#endif
>> +
>> +       get_page(page);
>>  }
>>
>>  /**
>> @@ -3088,6 +3097,22 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
>>         __skb_frag_ref(&skb_shinfo(skb)->frags[f]);
>>  }
>>
>> +/**
>> + * skb_frag_is_pp_page - decide if a page is recyclable.
>> + * @page: frag page
>> + * @recycle: skb->pp_recycle
>> + *
>> + * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
>> + * also used to decide if a page can be recycled to the page
>> + * pool.
>> + */
>> +static inline bool skb_frag_is_pp_page(struct page *page,
>> +                                      bool recycle)
>> +{
>> +       return page_pool_is_pp_page(page) ||
>> +               (recycle && __page_pool_is_pp_page(page));
>> +}
>> +

>
> The logic for this check is ugly. You are essentially calling
> __page_pool_is_pp_page again if it fails the first check. It would
> probably make more sense to rearrange things and just call
> (!DMA_USE_PP_FRAG_COUNT || recycle) && __page_pool_is_pp_page(). With
> that the check of recycle could be dropped entirely if frag count is
> valid to use, and in the non-fragcount case it reverts back to the
> original check.


The reason I did not do that is that I kind of did not want to use
DMA_USE_PP_FRAG_COUNT outside of the page pool.
I can use DMA_USE_PP_FRAG_COUNT directly in skbuff.h if the above
is considered harmless:)

The 32 bit systems with 64 bit dma really seem to be a burden here; as
mentioned by Ilias [1], there seems to be no such system using page
pool, so we might as well consider disabling page pool for such systems?

Ilias, Jesper, what do you think?

1. http://lkml.iu.edu/hypermail/linux/kernel/2107.1/06321.html

>
>>  /**
>>   * __skb_frag_unref - release a reference on a paged fragment.
>>   * @frag: the paged fragment
>> @@ -3101,8 +3126,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
>>         struct page *page = skb_frag_page(frag);
>>
>>  #ifdef CONFIG_PAGE_POOL
>> -       if (recycle && page_pool_return_skb_page(page))
>> +       if (skb_frag_is_pp_page(page, recycle)) {
>> +               page_pool_return_skb_page(page);
>>                 return;
>> +       }
>>  #endif
>>         put_page(page);
>>  }
>> @@ -4720,9 +4747,14 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
>>
>>  static inline bool skb_pp_recycle(struct sk_buff *skb, void *data)
>>  {
>> -       if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
>> +       struct page *page = virt_to_head_page(data);
>> +
>> +       if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle ||
>> +           !__page_pool_is_pp_page(page))
>>                 return false;
>> -       return page_pool_return_skb_page(virt_to_head_page(data));
>> +
>> +       page_pool_return_skb_page(page);
>> +       return true;
>>  }
>>
>>  #endif /* __KERNEL__ */

>
> As I recall the virt_to_head_page isn't necessarily a cheap call as it
> can lead to quite a bit of pointer chasing and a few extra math steps
> to do the virt to page conversion. I would be curious how much extra
> overhead this is adding to the non-fragcount or non-recycling case.


The page pool can only deal with the head_page, as the
"page_ref_count(page) == 1" check in __page_pool_put_page() seems to
require the head_page, and skb_free_frag() in skb_free_head() seems to
operate on the head_page too, so I assume the overhead is minimal?

>
>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>> index 2ad0706566c5..eb103d86f453 100644
>> --- a/include/net/page_pool.h
>> +++ b/include/net/page_pool.h
>> @@ -164,7 +164,7 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
>>         return pool->p.dma_dir;
>>  }
>>
>> -bool page_pool_return_skb_page(struct page *page);
>> +void page_pool_return_skb_page(struct page *page);
>>
>>  struct page_pool *page_pool_create(const struct page_pool_params *params);
>>
>> @@ -244,6 +244,32 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
>>         atomic_long_set(&page->pp_frag_count, nr);
>>  }
>>
>> +static inline void page_pool_atomic_inc_frag_count(struct page *page)
>> +{
>> +       atomic_long_inc(&page->pp_frag_count);
>> +}
>> +

>
> Your function name is almost as long as the function itself. Maybe you
> don't need it?


It is about avoiding exposing pp_frag_count outside of the page pool.
It is not needed if the above is not an issue:)

>
>> +static inline bool __page_pool_is_pp_page(struct page *page)
>> +{
>> +       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>> +        * in order to preserve any existing bits, such as bit 0 for the
>> +        * head page of compound page and bit 1 for pfmemalloc page, so
>> +        * mask those bits for freeing side when doing below checking,
>> +        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>> +        * to avoid recycling the pfmemalloc page.
>> +        */
>> +       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
>> +}
>> +
>> +static inline bool page_pool_is_pp_page(struct page *page)
>> +{
>> +       /* For systems with the same dma addr as the bus addr, we can use
>> +        * page->pp_magic to indicate a pp page uniquely.
>> +        */
>> +       return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
>> +                       __page_pool_is_pp_page(page);
>> +}
>> +

>
> We should really change the name of the #define. I keep reading it as
> we are using the PP_FRAG_COUNT, not that it is already in use. Maybe
> we should look at something like PP_FRAG_COUNT_VALID and just invert
> the logic for it.


Yes, Jesper seems to have a similar confusion.
It seems better that we can remove that macro completely if the 32 bit
systems with 64 bit dma turn out to not exist at all?

>
> Also this function naming is really confusing. You don't have to have
> the frag count to be a page pool page. Maybe this should be something
> like page_pool_is_pp_frag_page.


I am not sure what the extra *frag* means here.

>
>>  static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>                                                           long nr)
>>  {
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 09d7b8614ef5..3a419871d4bc 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -24,7 +24,7 @@
>>  #define DEFER_TIME (msecs_to_jiffies(1000))
>>  #define DEFER_WARN_INTERVAL (60 * HZ)
>>
>> -#define BIAS_MAX       LONG_MAX
>> +#define BIAS_MAX       (LONG_MAX / 2)
>>
>>  static int page_pool_init(struct page_pool *pool,
>>                           const struct page_pool_params *params)
>
> I still think this would be better as a separate patch calling out the
> fact that you are changing the value with the plan to support
> incrementing it in the future.


Sure.

>
>> @@ -736,20 +736,10 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>>  }
>>  EXPORT_SYMBOL(page_pool_update_nid);
>>
>> -bool page_pool_return_skb_page(struct page *page)
>> +void page_pool_return_skb_page(struct page *page)
>>  {
>>         struct page_pool *pp;
>>
>> -       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>> -        * in order to preserve any existing bits, such as bit 0 for the
>> -        * head page of compound page and bit 1 for pfmemalloc page, so
>> -        * mask those bits for freeing side when doing below checking,
>> -        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>> -        * to avoid recycling the pfmemalloc page.
>> -        */
>> -       if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>> -               return false;
>> -
>>         pp = page->pp;
>>
>>         /* Driver set this to memory recycling info. Reset it on recycle.
>> @@ -758,7 +748,5 @@ bool page_pool_return_skb_page(struct page *page)
>>          * 'flipped' fragment being in use or not.
>>          */
>>         page_pool_put_full_page(pp, page, false);
>> -
>> -       return true;
>>  }
>>  EXPORT_SYMBOL(page_pool_return_skb_page);
>> --
>> 2.33.0
>>
Ilias Apalodimas Sept. 15, 2021, 12:56 p.m. UTC | #3
Hi Yunsheng,  Alexander,

On Wed, Sep 15, 2021 at 05:07:08PM +0800, Yunsheng Lin wrote:
> On 2021/9/15 8:59, Alexander Duyck wrote:
> > On Tue, Sep 14, 2021 at 5:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> As the skb->pp_recycle and page->pp_magic may not be enough
> >> to track if a frag page is from page pool after the calling
> >> of __skb_frag_ref(), mostly because of a data race, see:
> >> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
> >> recycling page_pool packets").
> >
> > I'm not sure how this comment actually applies. It is an issue that
> > was fixed. If anything my concern is that this change will introduce
> > new races instead of fixing any existing ones.
>
> My initial thinking about adding the above comment is to emphasize
> that we might clear cloned skb's pp_recycle when doing head expanding,
> and page pool might need to give up on that page if that cloned skb is
> the last one to be freed.

>
> >
> >> There may be clone and expand head case that might lose the
> >> track if a frag page is from page pool or not.
> >
> > Can you explain how? If there is such a case we should fix it instead
> > of trying to introduce new features to address it. This seems more
> > like a performance optimization rather than a fix.
>
> Yes, I consider it an optimization too, that's why I am targeting
> net-next.
>
> Even for the below skb_split() case in tso_fragment(), I am not sure
> how an rx pp page can go through the tcp stack yet.


I am a bit confused :).  We don't have that problem *now*, right?  This will
appear if we try to pull in your patches on using page pool and recycling
for Tx, where TSO and skb_split are used?

I'll be honest, when I came up with the recycling idea for page pool, I
never intended to support Tx.  I agree with Alexander here,  if people want
to use it on Tx and think there's value,  we might need to go back to the
drawing board and see what I've missed.  It's still early and there's a
handful of drivers using it,  so it will be less painful now.

The pp_recycle_bit was introduced to make the checking faster, instead of
getting stuff into cache and checking the page signature.  If that ends up
being counterproductive, we could just replace the entire logic with the
frag count and the page signature, couldn't we?  In that case we should be
very cautious and measure potential regression on the standard path.

But in general,  I'd be happier if we only had a simple logic in our
testing for the pages we have to recycle.  Debugging and understanding this
otherwise will end up being a mess.
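
As a concrete reference, the signature-only variant being described would
look roughly like this (a sketch under the assumption that the pp_recycle
bit is dropped entirely; not a measured proposal):

static inline bool skb_pp_recycle_sketch(struct sk_buff *skb, void *data)
{
	struct page *page = virt_to_head_page(data);

	/* Sketch only: recycling decided purely by the page signature,
	 * with no skb->pp_recycle gate in front of it.
	 */
	if (!IS_ENABLED(CONFIG_PAGE_POOL) ||
	    (page->pp_magic & ~0x3UL) != PP_SIGNATURE)
		return false;

	page_pool_return_skb_page(page);
	return true;
}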

>
> >
> >> And not being able to keep track of pp page may cause problem
> >> for the skb_split() case in tso_fragment() too:
> >> Supposing a skb has 3 frag pages, all coming from a page pool,
> >> and is split to skb1 and skb2:
> >> skb1: first frag page + first half of second frag page
> >> skb2: second half of second frag page + third frag page
> >>
> >> How do we set the skb->pp_recycle of skb1 and skb2?
> >> 1. If we set both of them to 1, then we may have a similar
> >>    race as the above commit for second frag page.
> >> 2. If we set only one of them to 1, then we may have resource
> >>    leaking problem as both first frag page and third frag page
> >>    are indeed from page pool.
> >
> > The question I would have is in the above cases how is skb->pp_recycle
> > being set on the second buffer? Is it already coming that way? If so,
>
> As the skb_split() implementation, it takes skb and skb1 for input,
> skb have pp_recycle set, and the skb1 is the newly allocated without
> pp_recycle set, after skb_split(), some payload of skb is split to
> the skb1, how to set the pp_recycle of skb1 seems tricky here.

>
> > maybe you should special case the __skb_frag_ref when you know you are
> > loading a recycling skb instead of just assuming you want to do it
> > automatically.
>
> I am not sure what "special case" and "automatically" mean here.


The 'special case' is an skb_split() where one or more of the frags end up
being split, right?

> One way I can think of is to avoid doing the __skb_frag_ref, and allocate
> a new frag for skb1 and do memcpying if there is a sharing frag between
> skb and skb1 when splitting. But it seems the allocating a new frag
> and memcpying seems much heavier than doing the __skb_frag_ref.


We could also accept we'd only recycle some of the frags.  If the frag is 'split',
we could just unmap it and let the networking stack free it eventually. 
But that  would introduce some kind of dependency between the core networking 
code and page_pool,  which we've tried to avoid so far,  but more importantly
it will slow down the entire path for drivers that don't use page_pool ...

>
> >
> >> So increment the frag count when __skb_frag_ref() is called,
> >> and only use page->pp_magic to indicate if a frag page is from
> >> page pool, to avoid the above data race.
> >
> > This assumes the page is only going to be used by the network stack.
> > My concern is what happens when the page is pulled out of the skb and
> > used for example in storage?
>
> As my understanding, the above case works as before, as if the storage
> is holding reference to that page, the page pool will give up on that
> page by checking "page_ref_count(page) == 1" when the last user from
> network stack has done with a pp page(by checking page->pp_frag_count).

>
> >
> >> For 32 bit systems with 64 bit dma, we preserve the original
> >> behavior as frag count is used to track how many times a
> >> frag page is called with __skb_frag_ref().
> >>
> >> We still use both skb->pp_recycle and page->pp_magic to decide
> >> the head page for a skb is from page pool or not.
> >>


[...]

> >>
> >> +/**
> >> + * skb_frag_is_pp_page - decide if a page is recyclable.
> >> + * @page: frag page
> >> + * @recycle: skb->pp_recycle
> >> + *
> >> + * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
> >> + * also used to decide if a page can be recycled to the page
> >> + * pool.
> >> + */
> >> +static inline bool skb_frag_is_pp_page(struct page *page,
> >> +                                      bool recycle)
> >> +{
> >> +       return page_pool_is_pp_page(page) ||
> >> +               (recycle && __page_pool_is_pp_page(page));
> >> +}
> >> +

> >
> > The logic for this check is ugly. You are essentially calling
> > __page_pool_is_pp_page again if it fails the first check. It would
> > probably make more sense to rearrange things and just call
> > (!DMA_USE_PP_FRAG_COUNT || recycle) && __page_pool_is_pp_page(). With
> > that the check of recycle could be dropped entirely if frag count is
> > valid to use, and in the non-fragcount case it reverts back to the
> > original check.
>
> The reason I did not do that is I kind of did not want to use the
> DMA_USE_PP_FRAG_COUNT outside of page pool.
> I can use DMA_USE_PP_FRAG_COUNT directly in skbuff.h if the above
> is considered harmless:)
>
> The 32 bit systems with 64 bit dma really seem to be a burden here, as
> mentioned by Ilias [1], there seems to be no such system using page
> pool, we might as well consider disabling page pool for such systems?
>
> Ilias, Jesper, what do you think?
>
> 1. http://lkml.iu.edu/hypermail/linux/kernel/2107.1/06321.html
>


Well I can't really disagree with myself too much :).  I still think we are
carrying a lot of code and complexity for systems that don't exist.

> >
> >>  /**
> >>   * __skb_frag_unref - release a reference on a paged fragment.
> >>   * @frag: the paged fragment
> >> @@ -3101,8 +3126,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
> >>         struct page *page = skb_frag_page(frag);
> >>
> >>  #ifdef CONFIG_PAGE_POOL
> >> -       if (recycle && page_pool_return_skb_page(page))
> >> +       if (skb_frag_is_pp_page(page, recycle)) {
> >> +               page_pool_return_skb_page(page);
> >>                 return;
> >> +       }
> >>  #endif
> >>         put_page(page);
> >>  }
> >> @@ -4720,9 +4747,14 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
> >>
> >>  static inline bool skb_pp_recycle(struct sk_buff *skb, void *data)
> >>  {
> >> -       if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
> >> +       struct page *page = virt_to_head_page(data);
> >> +
> >> +       if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle ||
> >> +           !__page_pool_is_pp_page(page))
> >>                 return false;
> >> -       return page_pool_return_skb_page(virt_to_head_page(data));
> >> +
> >> +       page_pool_return_skb_page(page);
> >> +       return true;
> >>  }
> >>
> >>  #endif /* __KERNEL__ */

> >
> > As I recall the virt_to_head_page isn't necessarily a cheap call as it
> > can lead to quite a bit of pointer chasing and a few extra math steps
> > to do the virt to page conversion. I would be curious how much extra
> > overhead this is adding to the non-fragcount or non-recycling case.
>
> As the page pool can only deal with head_page as the checking of
> "page_ref_count(page) == 1" in __page_pool_put_page() seems requiring
> head_page, and skb_free_frag() in skb_free_head() seems to operate on
> the head_page, so I assume the overhead is minimal?
>

> >
> >> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> >> index 2ad0706566c5..eb103d86f453 100644
> >> --- a/include/net/page_pool.h
> >> +++ b/include/net/page_pool.h
> >> @@ -164,7 +164,7 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
> >>         return pool->p.dma_dir;
> >>  }
> >>
> >> -bool page_pool_return_skb_page(struct page *page);
> >> +void page_pool_return_skb_page(struct page *page);
> >>
> >>  struct page_pool *page_pool_create(const struct page_pool_params *params);
> >>
> >> @@ -244,6 +244,32 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
> >>         atomic_long_set(&page->pp_frag_count, nr);
> >>  }
> >>
> >> +static inline void page_pool_atomic_inc_frag_count(struct page *page)
> >> +{
> >> +       atomic_long_inc(&page->pp_frag_count);
> >> +}
> >> +
> >
> > Your function name is almost as long as the function itself. Maybe you
> > don't need it?
>
> It is about avoiding exposing the pp_frag_count outside of page pool.
> It is not needed if the above is not an issue:)
>

> >
> >> +static inline bool __page_pool_is_pp_page(struct page *page)
> >> +{
> >> +       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> >> +        * in order to preserve any existing bits, such as bit 0 for the
> >> +        * head page of compound page and bit 1 for pfmemalloc page, so
> >> +        * mask those bits for freeing side when doing below checking,
> >> +        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> >> +        * to avoid recycling the pfmemalloc page.
> >> +        */
> >> +       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
> >> +}
> >> +
> >> +static inline bool page_pool_is_pp_page(struct page *page)
> >> +{
> >> +       /* For systems with the same dma addr as the bus addr, we can use
> >> +        * page->pp_magic to indicate a pp page uniquely.
> >> +        */
> >> +       return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
> >> +                       __page_pool_is_pp_page(page);
> >> +}
> >> +
> >
> > We should really change the name of the #define. I keep reading it as
> > we are using the PP_FRAG_COUNT, not that it is already in use. Maybe
> > we should look at something like PP_FRAG_COUNT_VALID and just invert
> > the logic for it.
>
> Yes, Jesper seems to have a similar confusion.

+1

> It seems better that we can remove that macro completely if the 32 bit
> systems with 64 bit dma turn out to not exist at all?
>

> >
> > Also this function naming is really confusing. You don't have to have
> > the frag count to be a page pool page. Maybe this should be something
> > like page_pool_is_pp_frag_page.
>


[...]

Regards
/Ilias
Jesper Dangaard Brouer Sept. 15, 2021, 3:42 p.m. UTC | #4
On 15/09/2021 14.56, Ilias Apalodimas wrote:
> Hi Yunsheng,  Alexander,
>
> On Wed, Sep 15, 2021 at 05:07:08PM +0800, Yunsheng Lin wrote:
>> On 2021/9/15 8:59, Alexander Duyck wrote:
>>> On Tue, Sep 14, 2021 at 5:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>
>>>> As the skb->pp_recycle and page->pp_magic may not be enough
>>>> to track if a frag page is from page pool after the calling
>>>> of __skb_frag_ref(), mostly because of a data race, see:
>>>> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
>>>> recycling page_pool packets").
>>>
>>> I'm not sure how this comment actually applies. It is an issue that
>>> was fixed. If anything my concern is that this change will introduce
>>> new races instead of fixing any existing ones.
>>
>> My initial thinking about adding the above comment is to emphasize
>> that we might clear cloned skb's pp_recycle when doing head expanding,
>> and page pool might need to give up on that page if that cloned skb is
>> the last one to be freed.
>>
>>>
>>>> There may be clone and expand head case that might lose the
>>>> track if a frag page is from page pool or not.
>>>
>>> Can you explain how? If there is such a case we should fix it instead
>>> of trying to introduce new features to address it. This seems more
>>> like a performance optimization rather than a fix.
>>
>> Yes, I consider it an optimization too, that's why I am targeting
>> net-next.
>>
>> Even for the below skb_split() case in tso_fragment(), I am not sure
>> how an rx pp page can go through the tcp stack yet.
>
> I am a bit confused :).  We don't have that problem *now*, right?  This will
> appear if we try to pull in your patches on using page pool and recycling
> for Tx, where TSO and skb_split are used?
>
> I'll be honest, when I came up with the recycling idea for page pool, I
> never intended to support Tx.  I agree with Alexander here,  if people want
> to use it on Tx and think there's value,  we might need to go back to the
> drawing board and see what I've missed.  It's still early and there's a
> handful of drivers using it,  so it will be less painful now.


I agree, page_pool is NOT designed or intended for TX support.
E.g. it doesn't make sense to allocate a page_pool instance per socket,
as the backing memory structures for page_pool are too much.
As the number of RX-queues is more limited, it was deemed okay that we use
a page_pool per RX-queue, which sacrifices some memory to gain speed.


> The pp_recycle_bit was introduced to make the checking faster, instead of
> getting stuff into cache and checking the page signature.  If that ends up
> being counterproductive, we could just replace the entire logic with the
> frag count and the page signature, couldn't we?  In that case we should be
> very cautious and measure potential regression on the standard path.

+1

> But in general,  I'd be happier if we only had a simple logic in our
> testing for the pages we have to recycle.  Debugging and understanding this
> otherwise will end up being a mess.



[...]
>>
>>>
>>>> For 32 bit systems with 64 bit dma, we preserve the original
>>>> behavior as frag count is used to track how many times a
>>>> frag page is called with __skb_frag_ref().
>>>>
>>>> We still use both skb->pp_recycle and page->pp_magic to decide
>>>> the head page for a skb is from page pool or not.
>>>>
>
> [...]
>

>>>>
>>>> +/**
>>>> + * skb_frag_is_pp_page - decide if a page is recyclable.
>>>> + * @page: frag page
>>>> + * @recycle: skb->pp_recycle
>>>> + *
>>>> + * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
>>>> + * also used to decide if a page can be recycled to the page
>>>> + * pool.
>>>> + */
>>>> +static inline bool skb_frag_is_pp_page(struct page *page,
>>>> +                                      bool recycle)
>>>> +{
>>>> +       return page_pool_is_pp_page(page) ||
>>>> +               (recycle && __page_pool_is_pp_page(page));
>>>> +}
>>>> +

>>>
>>> The logic for this check is ugly. You are essentially calling
>>> __page_pool_is_pp_page again if it fails the first check. It would
>>> probably make more sense to rearrange things and just call
>>> (!DMA_USE_PP_FRAG_COUNT || recycle) && __page_pool_is_pp_page(). With
>>> that the check of recycle could be dropped entirely if frag count is
>>> valid to use, and in the non-fragcount case it reverts back to the
>>> original check.
>>
>> The reason I did not do that is I kind of did not want to use the
>> DMA_USE_PP_FRAG_COUNT outside of page pool.
>> I can use DMA_USE_PP_FRAG_COUNT directly in skbuff.h if the above
>> is considered harmless:)
>>
>> The 32 bit systems with 64 bit dma really seem to be a burden here, as
>> mentioned by Ilias [1], there seems to be no such system using page
>> pool, we might as well consider disabling page pool for such systems?
>>
>> Ilias, Jesper, what do you think?
>>
>> 1. http://lkml.iu.edu/hypermail/linux/kernel/2107.1/06321.html
>>

>
> Well I can't really disagree with myself too much :).  I still think we are
> carrying a lot of code and complexity for systems that don't exist.


I would be fine with rejecting such systems at page_pool setup time.
Meaning that NIC drivers using page_pool for DMA-mapping, getting
compiled on 32-bit systems and needing 64-bit DMA-mappings, will have
their call to page_pool_create() fail (with something else than -EINVAL
please).
If drivers really want to work on such systems, they have to implement
their own DMA-mapping fallback tracking outside page_pool.  Meaning it
is only the DMA-mapping tracking part of page_pool they cannot use.
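
A minimal sketch of such a rejection, assuming it sits in page_pool_init()
(the -EOPNOTSUPP choice is illustrative, following the "something else than
-EINVAL" request above):

static int page_pool_check_dma_support(const struct page_pool *pool)
{
	/* Sketch only: refuse DMA-mapping support when pp_frag_count
	 * would be needed to store the upper 32 bits of the dma addr,
	 * i.e. on 32-bit systems with 64-bit DMA.
	 */
	if ((pool->p.flags & PP_FLAG_DMA_MAP) &&
	    PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
		return -EOPNOTSUPP;

	return 0;
}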

[...]
>>
>>>
>>>> +static inline bool __page_pool_is_pp_page(struct page *page)
>>>> +{
>>>> +       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>>> +        * in order to preserve any existing bits, such as bit 0 for the
>>>> +        * head page of compound page and bit 1 for pfmemalloc page, so
>>>> +        * mask those bits for freeing side when doing below checking,
>>>> +        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>>> +        * to avoid recycling the pfmemalloc page.
>>>> +        */
>>>> +       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
>>>> +}
>>>> +
>>>> +static inline bool page_pool_is_pp_page(struct page *page)
>>>> +{
>>>> +       /* For systems with the same dma addr as the bus addr, we can use
>>>> +        * page->pp_magic to indicate a pp page uniquely.
>>>> +        */
>>>> +       return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
>>>> +                       __page_pool_is_pp_page(page);
>>>> +}
>>>> +
>>>
>>> We should really change the name of the #define. I keep reading it as
>>> we are using the PP_FRAG_COUNT, not that it is already in use. Maybe
>>> we should look at something like PP_FRAG_COUNT_VALID and just invert
>>> the logic for it.
>>
>> Yes, Jesper seems to have a similar confusion.
>
> +1

+1


>> It seems better that we can remove that macro completely if the 32 bit
>> systems with 64 bit dma turn out to not exist at all?
>>
>>>
>>> Also this function naming is really confusing. You don't have to have
>>> the frag count to be a page pool page. Maybe this should be something
>>> like page_pool_is_pp_frag_page.
>>

>
> [...]
>
> Regards
> /Ilias


--Jesper
Yunsheng Lin Sept. 16, 2021, 2:05 a.m. UTC | #5
On 2021/9/15 23:42, Jesper Dangaard Brouer wrote:
>
> On 15/09/2021 14.56, Ilias Apalodimas wrote:
>> Hi Yunsheng,  Alexander,
>>
>> On Wed, Sep 15, 2021 at 05:07:08PM +0800, Yunsheng Lin wrote:
>>> On 2021/9/15 8:59, Alexander Duyck wrote:
>>>> On Tue, Sep 14, 2021 at 5:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>>
>>>>> As the skb->pp_recycle and page->pp_magic may not be enough
>>>>> to track if a frag page is from page pool after the calling
>>>>> of __skb_frag_ref(), mostly because of a data race, see:
>>>>> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
>>>>> recycling page_pool packets").
>>>>
>>>> I'm not sure how this comment actually applies. It is an issue that
>>>> was fixed. If anything my concern is that this change will introduce
>>>> new races instead of fixing any existing ones.
>>>
>>> My initial thinking about adding the above comment is to emphasize
>>> that we might clear cloned skb's pp_recycle when doing head expanding,
>>> and page pool might need to give up on that page if that cloned skb is
>>> the last one to be freed.
>>>
>>>>
>>>>> There may be clone and expand head case that might lose the
>>>>> track if a frag page is from page pool or not.
>>>>
>>>> Can you explain how? If there is such a case we should fix it instead
>>>> of trying to introduce new features to address it. This seems more
>>>> like a performance optimization rather than a fix.
>>>
>>> Yes, I consider it an optimization too, that's why I am targeting
>>> net-next.
>>>
>>> Even for the below skb_split() case in tso_fragment(), I am not sure
>>> how an rx pp page can go through the tcp stack yet.
>>
>> I am a bit confused :).  We don't have that problem *now*, right?  This will
>> appear if we try to pull in your patches on using page pool and recycling
>> for Tx, where TSO and skb_split are used?


In my understanding, the problem might exist even without tx recycling,
because a skb from the wire could theoretically be passed down to the
tcp stack and retransmitted back to the wire. As I am not able to set up
a configuration to verify and test it, and the handling seems tricky, I
am targeting the net-next branch instead of the net branch.

>>
>> I'll be honest, when I came up with the recycling idea for page pool, I
>> never intended to support Tx.  I agree with Alexander here,  if people want
>> to use it on Tx and think there's value,  we might need to go back to the
>> drawing board and see what I've missed.  It's still early and there's a
>> handful of drivers using it,  so it will be less painful now.


Yes, we also need to prototype it to see if there is something missing on the
drawing board and how much improvement we get from that:)

>
> I agree, page_pool is NOT designed or intended for TX support.
> E.g. it doesn't make sense to allocate a page_pool instance per socket, as the backing memory structures for page_pool are too much.
> As the number of RX-queues is more limited, it was deemed okay that we use a page_pool per RX-queue, which sacrifices some memory to gain speed.


As mentioned before, Tx recycling is based on a page_pool instance per
socket; it shares the page_pool instance with rx.

Anyway, based on feedback from edumazet and dsahern, I am still trying to
see if the page pool is meaningful for tx.

>
>
>> The pp_recycle_bit was introduced to make the checking faster, instead of
>> getting stuff into cache and checking the page signature.  If that ends up
>> being counterproductive, we could just replace the entire logic with the
>> frag count and the page signature, couldn't we?  In that case we should be
>> very cautious and measure potential regression on the standard path.
>
> +1


I am not sure "pp_recycle_bit was introduced to make the checking faster" is
valid. The size of "struct page" is only about 9 words (36/72 bytes), which
mostly fits in the same cache line, and both the standard path and the
recycle path touch the "struct page" anyway, so the overhead of checking the
signature seems minimal.

I agree that we need to be cautious and measure potential regression on the
standard path.

Another way is to use bit 0 of the frag->bv_page ptr to indicate whether a
frag page is from the page pool.
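
A rough sketch of what that pointer tagging could look like (hypothetical
helpers, not part of this patch; it relies on struct page pointers being at
least 4-byte aligned, so bit 0 is normally zero):

static inline void __skb_frag_set_pp_page(skb_frag_t *frag,
					  struct page *page, bool pp)
{
	/* Hypothetical: stash the "from page pool" flag in bit 0 */
	frag->bv_page = (struct page *)((unsigned long)page | pp);
}

static inline struct page *__skb_frag_page_ptr(const skb_frag_t *frag)
{
	/* Hypothetical: mask the tag bit off before using the pointer */
	return (struct page *)((unsigned long)frag->bv_page & ~1UL);
}

static inline bool __skb_frag_is_pp(const skb_frag_t *frag)
{
	return (unsigned long)frag->bv_page & 1UL;
}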

>
>> But in general,  I'd be happier if we only had a simple logic in our
>> testing for the pages we have to recycle.  Debugging and understanding this
>> otherwise will end up being a mess.
>
>
> [...]

>>>
>>>>
>>>>> For 32 bit systems with 64 bit dma, we preserve the original
>>>>> behavior as frag count is used to track how many times a
>>>>> frag page is called with __skb_frag_ref().
>>>>>
>>>>> We still use both skb->pp_recycle and page->pp_magic to decide
>>>>> the head page for a skb is from page pool or not.
>>>>>
>>
>> [...]
>>
>>>>>
>>>>> +/**
>>>>> + * skb_frag_is_pp_page - decide if a page is recyclable.
>>>>> + * @page: frag page
>>>>> + * @recycle: skb->pp_recycle
>>>>> + *
>>>>> + * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
>>>>> + * also used to decide if a page can be recycled to the page
>>>>> + * pool.
>>>>> + */
>>>>> +static inline bool skb_frag_is_pp_page(struct page *page,
>>>>> +                                      bool recycle)
>>>>> +{
>>>>> +       return page_pool_is_pp_page(page) ||
>>>>> +               (recycle && __page_pool_is_pp_page(page));
>>>>> +}
>>>>> +
>>>>
>>>> The logic for this check is ugly. You are essentially calling
>>>> __page_pool_is_pp_page again if it fails the first check. It would
>>>> probably make more sense to rearrange things and just call
>>>> (!DMA_USE_PP_FRAG_COUNT || recycle) && __page_pool_is_pp_page(). With
>>>> that the check of recycle could be dropped entirely if frag count is
>>>> valid to use, and in the non-fragcount case it reverts back to the
>>>> original check.
>>>
>>> The reason I did not do that is I kind of did not want to use the
>>> DMA_USE_PP_FRAG_COUNT outside of page pool.
>>> I can use DMA_USE_PP_FRAG_COUNT directly in skbuff.h if the above
>>> is considered harmless:)
>>>
>>> The 32 bit systems with 64 bit dma really seem to be a burden here, as
>>> mentioned by Ilias [1], there seems to be no such system using page
>>> pool, we might as well consider disabling page pool for such systems?
>>>
>>> Ilias, Jesper, what do you think?
>>>
>>> 1. http://lkml.iu.edu/hypermail/linux/kernel/2107.1/06321.html
>>>
>>
>> Well I can't really disagree with myself too much :).  I still think we are
>> carrying a lot of code and complexity for systems that don't exist.
>
> I would be fine with rejecting such systems at page_pool setup time. Meaning that NIC drivers using page_pool for DMA-mapping, getting compiled on 32-bit systems and needing 64-bit DMA-mappings, will have their call to page_pool_create() fail (with something else than -EINVAL please).
> If drivers really want to work on such systems, they have to implement their own DMA-mapping fallback tracking outside page_pool.  Meaning it is only the DMA-mapping tracking part of page_pool they cannot use.


Ok, will send out a patch to reject DMA-mapping support for such systems.

>
> [...]
>>>
>>>>
>>>>> +static inline bool __page_pool_is_pp_page(struct page *page)
>>>>> +{
>>>>> +       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>>>> +        * in order to preserve any existing bits, such as bit 0 for the
>>>>> +        * head page of compound page and bit 1 for pfmemalloc page, so
>>>>> +        * mask those bits for freeing side when doing below checking,
>>>>> +        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>>>> +        * to avoid recycling the pfmemalloc page.
>>>>> +        */
>>>>> +       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
>>>>> +}
>>>>> +
>>>>> +static inline bool page_pool_is_pp_page(struct page *page)
>>>>> +{
>>>>> +       /* For systems with the same dma addr as the bus addr, we can use
>>>>> +        * page->pp_magic to indicate a pp page uniquely.
>>>>> +        */
>>>>> +       return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
>>>>> +                       __page_pool_is_pp_page(page);
>>>>> +}
>>>>> +
>>>>
>>>> We should really change the name of the #define. I keep reading it as
>>>> we are using the PP_FRAG_COUNT, not that it is already in use. Maybe
>>>> we should look at something like PP_FRAG_COUNT_VALID and just invert
>>>> the logic for it.
>>>
>>> Yes, Jesper seems to have a similar confusion.
>>
>> +1
>
> +1
>
>>> It seems better that we can remove that macro completely if the 32 bit
>>> systems with 64 bit dma turn out to not exist at all?
>>>
>>>>
>>>> Also this function naming is really confusing. You don't have to have
>>>> the frag count to be a page pool page. Maybe this should be something
>>>> like page_pool_is_pp_frag_page.
>>>
>>
>> [...]
>>
>> Regards
>> /Ilias
>
> --Jesper
> _______________________________________________
> Linuxarm mailing list -- linuxarm@openeuler.org
> To unsubscribe send an email to linuxarm-leave@openeuler.org
Ilias Apalodimas Sept. 16, 2021, 8:44 a.m. UTC | #6
> >> appear if we try to pull in your patches on using page pool and recycling

[...]

> >> for Tx, where TSO and skb_split are used?
>
> In my understanding, the problem might exist even without tx recycling,
> because a skb from the wire could theoretically be passed down to the
> tcp stack and retransmitted back to the wire. As I am not able to set up
> a configuration to verify and test it, and the handling seems tricky, I
> am targeting the net-next branch instead of the net branch.
>
> >>
> >> I'll be honest, when I came up with the recycling idea for page pool, I
> >> never intended to support Tx.  I agree with Alexander here,  if people want
> >> to use it on Tx and think there's value,  we might need to go back to the
> >> drawing board and see what I've missed.  It's still early and there's a
> >> handful of drivers using it,  so it will be less painful now.
>
> Yes, we also need to prototype it to see if there is something missing on the
> drawing board and how much improvement we get from that:)
>
> >
> > I agree, page_pool is NOT designed or intended for TX support.
> > E.g. it doesn't make sense to allocate a page_pool instance per socket, as the backing memory structures for page_pool are too much.
> > As the number of RX-queues is more limited, it was deemed okay that we use a page_pool per RX-queue, which sacrifices some memory to gain speed.
>
> As mentioned before, Tx recycling is based on a page_pool instance per
> socket; it shares the page_pool instance with rx.
>
> Anyway, based on feedback from edumazet and dsahern, I am still trying to
> see if the page pool is meaningful for tx.
>
> >
> >
> >> The pp_recycle_bit was introduced to make the checking faster, instead of
> >> getting stuff into cache and checking the page signature.  If that ends up
> >> being counterproductive, we could just replace the entire logic with the
> >> frag count and the page signature, couldn't we?  In that case we should be
> >> very cautious and measure potential regression on the standard path.
> >
> > +1
>
> I am not sure "pp_recycle_bit was introduced to make the checking faster" is
> valid. The size of "struct page" is only about 9 words (36/72 bytes), which
> mostly fits in the same cache line, and both the standard path and the
> recycle path touch the "struct page" anyway, so the overhead of checking the
> signature seems minimal.
>
> I agree that we need to be cautious and measure potential regression on the
> standard path.


Well, pp_recycle is on the same cache line boundary as the head_frag bit we
need to decide on recycling. After that we start checking page signatures
etc., which means the default release path remains mostly unaffected.

I guess what you are saying here is that 'struct page' is going to be
accessed eventually by the default network path, so there won't be any
noticeable performance hit? What about the other use cases we have
for pp_recycle right now? __skb_frag_unref() in skb_shift() or
skb_try_coalesce() (the latter can probably be removed tbh).

> 
> Another way is to use the bit 0 of frag->bv_page ptr to indicate if a frag
> page is from page pool.


Instead of the 'struct page' signature?  And the pp_recycle bit will
continue to exist?

Right now the 'naive' explanation of the recycling decision is something like:

if (pp_recycle)                  <--- recycling bit is set
    (check page signature)       <--- signature matches page pool
        (check fragment refcnt)  <--- if frags are enabled and this is the last consumer
            recycle

If we can prove the performance is unaffected when we eliminate the first if,
then obviously we should remove it.  I'll try running that test here and see,
but keep in mind I am only testing on a 1Gb interface.  Any chance we can get
measurements on beefier hardware using hns3?
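
For reference, a minimal C sketch of that decision chain (condensed from
the skb_pp_recycle() / page_pool_return_skb_page() path; simplified and
abridged, not the exact upstream code):

/* Sketch of the three-step recycling decision described above. */
static bool skb_pp_recycle_sketch(struct sk_buff *skb, void *data)
{
	struct page *page;

	if (!skb->pp_recycle)				/* 1. recycling bit */
		return false;

	page = virt_to_head_page(data);
	if ((page->pp_magic & ~0x3UL) != PP_SIGNATURE)	/* 2. page signature */
		return false;

	/* 3. page_pool_return_skb_page() only recycles when this was the
	 * last consumer, i.e. the fragment count drops to zero.
	 */
	return page_pool_return_skb_page(page);
}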

> 
> > 
> >> But in general,  I'd be happier if we only had a simple logic in our
> >> testing for the pages we have to recycle.  Debugging and understanding this
> >> otherwise will end up being a mess.

[...]

Regards
/Ilias
Yunsheng Lin Sept. 16, 2021, 9:33 a.m. UTC | #7
On 2021/9/16 16:44, Ilias Apalodimas wrote:
>>>> appear if we try to pull in your patches on using page pool and recycling
> 
> [...]
> 
> I guess what you are saying here is that 'struct page' is going to be
> accessed eventually by the default network path, so there won't be any
> noticeable performance hit? What about the other use cases we have

Yes.

> for pp_recycle right now? __skb_frag_unref() in skb_shift() or
> skb_try_coalesce() (the latter can probably be removed tbh).


If we decide to go with an accurate indicator of a pp page, we just need
to make sure the network stack uses __skb_frag_unref() and __skb_frag_ref()
to put and get a page frag; the indicator checking then only needs to be
done in __skb_frag_unref() and __skb_frag_ref(), so skb_shift() and
skb_try_coalesce() should be fine too.

> 
>>
>> Another way is to use the bit 0 of frag->bv_page ptr to indicate if a frag
>> page is from page pool.
> 
> Instead of the 'struct page' signature?  And the pp_recycle bit will
> continue to exist?


The pp_recycle bit might only exist, or only be used, for the head page of the
skb. Bit 0 of the frag->bv_page ptr can be used to indicate a frag page
uniquely. Memcpying the shinfo or doing "*fragto = *fragfrom" automatically
passes the indicator to the new shinfo before __skb_frag_ref() is called, and
__skb_frag_ref() will increment the _refcount or pp_frag_count according to
bit 0 of frag->bv_page.

By the way, I also prototyped the above idea, and it seems to work well too.
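
A minimal sketch of that bit 0 encoding (the helper names below are made up
for illustration; the assumption is that 'struct page' pointers are at least
4-byte aligned, so bit 0 of frag->bv_page is otherwise always zero):

/* Hypothetical: bit 0 of frag->bv_page marks a page pool page. */
#define SKB_FRAG_PP_BIT	0x1UL

static inline void skb_frag_set_pp_page(skb_frag_t *frag, struct page *page)
{
	frag->bv_page = (struct page *)((unsigned long)page | SKB_FRAG_PP_BIT);
}

static inline bool skb_frag_is_pp_page(const skb_frag_t *frag)
{
	return (unsigned long)frag->bv_page & SKB_FRAG_PP_BIT;
}

static inline struct page *skb_frag_page_ptr(const skb_frag_t *frag)
{
	return (struct page *)((unsigned long)frag->bv_page & ~SKB_FRAG_PP_BIT);
}

Since "*fragto = *fragfrom" copies the pointer bit-for-bit, the indicator
propagates to the new frag automatically; accessors like skb_frag_page()
would then have to mask the bit off as skb_frag_page_ptr() does above.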

> 
> Right now the 'naive' explanation of the recycling decision is something like:
> 
> if (pp_recycle)                  <--- recycling bit is set
>     (check page signature)       <--- signature matches page pool
>         (check fragment refcnt)  <--- if frags are enabled and this is the last consumer
>             recycle
> 
> If we can prove the performance is unaffected when we eliminate the first if,
> then obviously we should remove it.  I'll try running that test here and see,
> but keep in mind I am only testing on a 1Gb interface.  Any chance we can get
> measurements on beefier hardware using hns3?


Sure, I will try it.
As this kind of performance overhead is small, any performance test case in mind?

> 
> [...]
> 
> Regards
> /Ilias
Ilias Apalodimas Sept. 16, 2021, 10:38 a.m. UTC | #8
On Thu, Sep 16, 2021 at 05:33:39PM +0800, Yunsheng Lin wrote:
> On 2021/9/16 16:44, Ilias Apalodimas wrote:

> >> appear if we try to pull in your patches on using page pool and recycling
> > 
> > [...]
> > 
> > I guess what you are saying here is that 'struct page' is going to be
> > accessed eventually by the default network path, so there won't be any
> > noticeable performance hit? What about the other use cases we have
> 
> Yes.


In that case you'd need to call virt_to_head_page() early though, get it
and then compare the signature.  I guess that's avoidable by using
frag->bv_page for the fragments?

> 
> [...]
> 
> The pp_recycle bit might only exist, or only be used, for the head page of the
> skb. Bit 0 of the frag->bv_page ptr can be used to indicate a frag page
> uniquely. Memcpying the shinfo or doing "*fragto = *fragfrom" automatically
> passes the indicator to the new shinfo before __skb_frag_ref() is called, and
> __skb_frag_ref() will increment the _refcount or pp_frag_count according to
> bit 0 of frag->bv_page.
> 
> By the way, I also prototyped the above idea, and it seems to work well too.


As long as no one else touches this, it's just another way of identifying a
page_pool allocated page.  But what are we gaining by that?  Not using
virt_to_head_page() as stated above?  In that case you still need to
keep pp_recycle around.

> > [...]
> > 
> > If we can prove the performance is unaffected when we eliminate the first if,
> > then obviously we should remove it.  I'll try running that test here and see,
> > but keep in mind I am only testing on a 1Gb interface.  Any chance we can get
> > measurements on beefier hardware using hns3?
> 
> Sure, I will try it.
> As this kind of performance overhead is small, any performance test case in mind?


'eliminate the first if' wasn't accurate.  I meant switch the first if and
check the struct page signature instead.  That would be the best solution
imho.  We effectively have a single rule to check if a packet comes from
page_pool or not.

You can start by sending a lot of packets and dropping those immediately.
That should put enough stress on the receive path and the allocators and it
should give us a rough idea. 

> 
> [...]
Regards
/Ilias
Yunsheng Lin Sept. 16, 2021, 11:04 a.m. UTC | #9
On 2021/9/16 18:38, Ilias Apalodimas wrote:
> On Thu, Sep 16, 2021 at 05:33:39PM +0800, Yunsheng Lin wrote:

>> On 2021/9/16 16:44, Ilias Apalodimas wrote:
>> [...]
> 
> In that case you'd need to call virt_to_head_page() early though, get it
> and then compare the signature.  I guess that's avoidable by using
> frag->bv_page for the fragments?


If a page of a skb frag is from page pool, it seems frag->bv_page always
points to the head page of a compound page, so calling virt_to_head_page()
does not seem necessary.

Bit 0 of frag->bv_page is a different kind of indicator for a pp page; it is
better that we do not confuse it with the page signature way. Using bit 0 may
give us a free word in 'struct page' if we manage to use skb->pp_recycle to
uniquely indicate the head page of a skb, meaning page->pp_magic could be
used for a future feature.


> 
> [...]
> 
> As long as no one else touches this, it's just another way of identifying a
> page_pool allocated page.  But what are we gaining by that?  Not using
> virt_to_head_page() as stated above?  In that case you still need to
> keep pp_recycle around.


No, we do not need pp_recycle, as long as we make sure __skb_frag_ref()
is called after memcpying the shinfo or doing "*fragto = *fragfrom".

> 
> [...]
> 
> 'eliminate the first if' wasn't accurate.  I meant switch the first if and
> check the struct page signature instead.  That would be the best solution
> imho.  We effectively have a single rule to check if a packet comes from
> page_pool or not.


I am not sure what "switch" means here; if the page signature can
indicate a pp page uniquely, the "if (pp_recycle)" checking can be removed.

> 
> You can start by sending a lot of packets and dropping those immediately.
> That should put enough stress on the receive path and the allocators and it
> should give us a rough idea.
> 
> [...]
Yunsheng Lin Sept. 16, 2021, 11:21 a.m. UTC | #10
On 2021/9/16 19:04, Yunsheng Lin wrote:
> On 2021/9/16 18:38, Ilias Apalodimas wrote:

>> On Thu, Sep 16, 2021 at 05:33:39PM +0800, Yunsheng Lin wrote:
>> [...]
>>
>> As long as no one else touches this, it's just another way of identifying a
>> page_pool allocated page.  But what are we gaining by that?  Not using
>> virt_to_head_page() as stated above?  In that case you still need to
>> keep pp_recycle around.
> 
> No, we do not need pp_recycle, as long as we make sure __skb_frag_ref()
> is called after memcpying the shinfo or doing "*fragto = *fragfrom".


Actually, it seems we do not even need to ensure __skb_frag_ref() is called
after memcpying the shinfo or doing "*fragto = *fragfrom".

Just make sure bit 0 of frag->bv_page is passed to the new frag->bv_page
(by memcpying the shinfo or doing "*fragto = *fragfrom"); __skb_frag_ref()
and __skb_frag_unref() will check bit 0 of frag->bv_page to update the
_refcount or pp_frag_count accordingly.
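
A sketch of how __skb_frag_ref()/__skb_frag_unref() could branch on that bit
(hypothetical: page_pool_frag_ref()/page_pool_frag_unref() are made-up
stand-ins for whatever helpers end up incrementing/decrementing
pp_frag_count; skb_frag_is_pp_page()/skb_frag_page_ptr() are the bit 0
helpers sketched earlier):

static inline void __skb_frag_ref(skb_frag_t *frag)
{
	struct page *page = skb_frag_page_ptr(frag);	/* masks off bit 0 */

	if (skb_frag_is_pp_page(frag))
		page_pool_frag_ref(page);	/* increment pp_frag_count */
	else
		get_page(page);			/* increment _refcount */
}

static inline void __skb_frag_unref(skb_frag_t *frag)
{
	struct page *page = skb_frag_page_ptr(frag);

	if (skb_frag_is_pp_page(frag))
		page_pool_frag_unref(page);	/* decrement, recycle on zero */
	else
		put_page(page);			/* decrement _refcount */
}

Note there is no 'bool recycle' argument here; the frag itself carries the
information.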

> 
> [...]
Ilias Apalodimas Sept. 16, 2021, 11:57 a.m. UTC | #11
On Thu, Sep 16, 2021 at 07:04:54PM +0800, Yunsheng Lin wrote:
> On 2021/9/16 18:38, Ilias Apalodimas wrote:

> > [...]
> > 
> > In that case you'd need to call virt_to_head_page() early though, get it
> > and then compare the signature.  I guess that's avoidable by using
> > frag->bv_page for the fragments?
> 
> If a page of a skb frag is from page pool, it seems frag->bv_page always
> points to the head page of a compound page, so calling virt_to_head_page()
> does not seem necessary.


I was mostly referring to the skb head here and how you would trigger the
recycling path.

I think we are talking about different things here.
One idea is to use the last bit of frag->bv_page to identify fragments
allocated from page_pool, which is done today with the signature.

The signature however exists in the head page, so my question was, can we
get rid of that without having a performance penalty?

IOW, in skb_free_head() can we replace:

if (skb_pp_recycle(skb, head))

with

if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)

and get rid of the 'bool recycle' argument in __skb_frag_unref()?
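
Roughly, a sketch of skb_free_head() with a signature-only check and no
skb->pp_recycle test would look like this (untested, for discussion only;
it assumes page_pool_return_skb_page() keeps doing the final fragment-count
handling):

static void skb_free_head(struct sk_buff *skb)
{
	unsigned char *head = skb->head;

	if (skb->head_frag) {
		struct page *page = virt_to_head_page(head);

		/* The signature alone decides; no pp_recycle hint needed. */
		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE &&
		    page_pool_return_skb_page(page))
			return;
		skb_free_frag(head);
	} else {
		kfree(head);
	}
}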

> [...]
> 
> No, we do not need pp_recycle, as long as we make sure __skb_frag_ref()
> is called after memcpying the shinfo or doing "*fragto = *fragfrom".


But we'll have to keep it for the skb head in this case.

Regards
/Ilias

> 
> [...]
Yunsheng Lin Sept. 17, 2021, 3:57 a.m. UTC | #12
On 2021/9/16 19:57, Ilias Apalodimas wrote:
> On Thu, Sep 16, 2021 at 07:04:54PM +0800, Yunsheng Lin wrote:
> [...]
> 
> I think we are talking about different things here.
> One idea is to use the last bit of frag->bv_page to identify fragments
> allocated from page_pool, which is done today with the signature.
> 
> The signature however exists in the head page, so my question was, can we
> get rid of that without having a performance penalty?


As both the skb frag and head cases eventually operate on the head page
of a compound page (if it is a compound page) in the normal case too, maybe
we can refactor the code to get the head page of the compound page before
the signature checking, without doing a second virt_to_head_page() or
compound_head() call?
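
Roughly, the refactor would look like the sketch below (page->pp is the
page_pool back-pointer used for recycling; the function name is made up and
pp_frag_count handling is elided for brevity):

static void skb_page_unref_sketch(struct page *page)
{
	/* Resolve the head page once... */
	struct page *head = compound_head(page);

	/* ...and reuse it for both the signature check and the release;
	 * put_page() would otherwise call compound_head() again itself.
	 */
	if ((head->pp_magic & ~0x3UL) == PP_SIGNATURE) {
		page_pool_put_full_page(head->pp, head, false);
		return;
	}

	put_page(head);
}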

> 
> IOW, in skb_free_head() can we replace:
> 
> if (skb_pp_recycle(skb, head))
> with
> if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
> and get rid of the 'bool recycle' argument in __skb_frag_unref()?


For the frag page of a skb, it seems ok to get rid of the 'bool recycle'
argument in __skb_frag_unref(), as __skb_frag_unref() and __skb_frag_ref()
are called symmetrically to put/get a page.

For the head page of a skb, we might need to make sure the head page
passed to __build_skb_around() meets the below condition:
do pp_frag_count incrementing instead of _refcount incrementing when
the head page is not newly allocated and it is from page pool.
It seems hard to audit that?


> [...]
> 
> But we'll have to keep it for the skb head in this case.


As above, I have not really looked into the skb head case:)

> 
> Regards
> /Ilias
> 
> [...]
Ilias Apalodimas Sept. 17, 2021, 6:38 a.m. UTC | #13
Hi Yunsheng,

[...]
> As both the skb frag and head cases eventually operate on the head page
> of a compound page (if it is a compound page) in the normal case too, maybe
> we can refactor the code to get the head page of the compound page before
> the signature checking, without doing a second virt_to_head_page() or
> compound_head() call?


Yea, that's doable, but my concern is different here.  If we do that, the
standard network stack, even for drivers that don't use page_pool, will
have to do a virt_to_head_page() -> check signature, to decide if it has to
try recycling the packet.  That's the performance part I am worried about,
since it happens for every packet.

> 
> > IOW, in skb_free_head() can we replace:
> > 
> > if (skb_pp_recycle(skb, head))
> > with
> > if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
> > and get rid of the 'bool recycle' argument in __skb_frag_unref()?
> 
> For the frag page of a skb, it seems ok to get rid of the 'bool recycle'
> argument in __skb_frag_unref(), as __skb_frag_unref() and __skb_frag_ref()
> are called symmetrically to put/get a page.
> 
> For the head page of a skb, we might need to make sure the head page
> passed to __build_skb_around() meets the below condition:
> do pp_frag_count incrementing instead of _refcount incrementing when
> the head page is not newly allocated and it is from page pool.
> It seems hard to audit that?


Yea, that seems a bit weird, at least to me, and I am not sure it's the only
place we'll have to go and do that.

> [...]
> 
> As above, I have not really looked into the skb head case:)


Let me take a step back here, because I think we drifted a bit.
The page signature was introduced in order to be able to identify skb
fragments. The problem was that you couldn't rely on the pp_recycle bit of
the skb head, since fragments could come from anywhere. So you use the
skb bit as a hint for skb frags, and you eventually decide using the page
signature.

So we've got 3 options (anything I've missed?):
- Try to remove the pp_recycle bit, since the page signature is enough for
  the skb head and fragments.  That in my opinion is the cleanest option,
  as long as we can prove there's no performance hit on the standard
  network path.

- Replace the page signature with frag->bv_page bit 0.  In that case we
  still have to keep the pp_recycle bit, but we do have an 'easier'
  indication that a skb frag comes from page_pool.  That's still pretty
  safe, since you now have unique identifiers for the skb and page
  fragments and you can be sure of their origin (page pool or not).
  What I am missing here is: what do we get out of this?  I think the
  advantage is not having to call virt_to_head_page() for frags?

- Keep all of them(?) and use frag->bv_page bit 0 similarly to the
  pp_recycle bit?  I don't see much value in this one; I am just keeping it
  here for completeness.

Thanks
/Ilias

> [...]
Yunsheng Lin Sept. 17, 2021, 9:17 a.m. UTC | #14
On 2021/9/17 14:38, Ilias Apalodimas wrote:
> Hi Yunsheng,

> 

> [...]

> Yea, that's doable, but my concern is different here.  If we do that, the
> standard network stack, even for drivers that don't use page_pool, will
> have to do a virt_to_head_page() -> check signature, to decide if it has to
> try recycling the packet.  That's the performance part I am worried about,
> since it happens for every packet.


Yes, there is theoretically a performance penalty for virt_to_head_page() or
compound_head(); I will do more testing if we decide to go with the signature
checking.
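
Concretely, the check being costed here would look something like the
following sketch (the helper name is invented; virt_to_head_page() already
wraps compound_head(), so the head page is resolved exactly once):

static inline bool data_page_is_pp(void *data)
{
	/* the per-packet cost under discussion is this head-page lookup */
	struct page *page = virt_to_head_page(data);

	/* mask bits 0/1, which may carry compound-head/pfmemalloc state */
	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
}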

> 

>>

>>>

>>> IOW in skb_free_head() can we replace:

>>>

>>> if (skb_pp_recycle(skb, head)) 

>>> with

>>> if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)

>>> and get rid of the 'bool recycle' argument in __skb_frag_unref()?

>>

>> For the frag page of a skb, it seems ok to get rid of the 'bool recycle'

>> argument in __skb_frag_unref(), as __skb_frag_unref() and __skb_frag_ref()

>> is symmetrically called to put/get a page.

>>

>> For the head page of a skb, we might need to make sure the head page

>> passed to __build_skb_around() meet below condition:

>> do pp_frag_count incrementing instead of _refcount incrementing when

>> the head page is not newly allocated and it is from page pool.

>> It seems hard to audit that?

> 

> Yea that seems a bit weird at least to me and I am not sure, it's the only

> place we'll have to go and do that.


Yes, that is why I avoid changing the behavior of the head page of a skb.
In other words, maybe we should not track whether the head page of a skb is
a pp page once the page's _refcount is incremented during its journey through
the network stack, and just treat it as a normal page?

> 

>>

>>

>>>

>>>>>> bit 0 of frag->bv_page is a different way of indicating a pp page,

>>>> it is better we do not confuse with the page signature way. Using

>>>> a bit 0 may give us a free word in 'struct page' if we manage to

>>>> use skb->pp_recycle to indicate a head page of the skb uniquely, meaning

>>>> page->pp_magic can be used for future feature.

>>>>

>>>>

>>>>>

>>>>>>

>>>>>>> for pp_recycle right now?  __skb_frag_unref() in skb_shift() or

>>>>>>> skb_try_coalesce() (the latter can probably be removed tbh).

>>>>>>

>>>>>> If we decide to go with accurate indicator of a pp page, we just need

>>>>>> to make sure network stack use __skb_frag_unref() and __skb_frag_ref()

>>>>>> to put and get a page frag, the indicator checking need only done in

>>>>>> __skb_frag_unref() and __skb_frag_ref(), so the skb_shift() and

>>>>>> skb_try_coalesce() should be fine too.

>>>>>>

>>>>>>>

>>>>>>>>

>>>>>>>> Another way is to use the bit 0 of frag->bv_page ptr to indicate if a frag

>>>>>>>> page is from page pool.

>>>>>>>

>>>>>>> Instead of the 'struct page' signature?  And the pp_recycle bit will

>>>>>>> continue to exist?  

>>>>>>

>>>>>> pp_recycle bit might only exist or is only used for the head page for the skb.

>>>>>> The bit 0 of frag->bv_page ptr can be used to indicate a frag page uniquely.

>>>>>> Doing a memcpying of shinfo or "*fragto = *fragfrom" automatically pass the

>>>>>> indicator to the new shinfo before doing a __skb_frag_ref(), and __skb_frag_ref()

>>>>>> will increment the _refcount or pp_frag_count according to the bit 0 of

>>>>>> frag->bv_page.

>>>>>>

>>>>>> By the way, I also prototype the above idea, and it seems to work well too.

>>>>>>

>>>>>

>>>>> As long as no one else touches this, it's just another way of identifying a

>>>>> page_pool allocated page.  But are we gaining by that?  Not using

>>>>> virt_to_head_page() as stated above? But in that case you still need to

>>>>> keep pp_recycle around. 

>>>>

>>>> No, we do not need the pp_recycle, as long as the we make sure __skb_frag_ref()

>>>> is called after memcpying the shinfo or doing "*fragto = *fragfrom".

>>>

>>> But we'll have to keep it for the skb head in this case.

>>

>> As above, I am not really look into skb head case:)

> 

> Let me take a step back here, because I think we drifted a bit. 

> The page signature was introduced in order to be able to identify skb

> fragments. The problem was that you couldn't rely on the pp_recycle bit of

> the skb head,  since fragments could come from anywhere.  So you use the

> skb bit as a hint for skb frags, and you eventually decide using the page

> signature.

> 

> So we got 3 options (Anything I've missed ?)

> - try to remove pp_recycle bit, since the page signature is enough for the

>   skb head and fragments.  That in my opinion is the cleanest option,  as

>   long as we can prove there's no performance hit on the standard network

>   path.

> 

> - Replace the page signature with frag->bv_page bit0.  In that case we

>   still have to keep the pp_recycle bit,  but we do have an 'easier'

>   indication that a skb frag comes from page_pool.  That's still pretty

>   safe, since you now have unique identifiers for the skb and page

>   fragments and you can be sure of their origin (page pool or not).

>   What I am missing here, is what do we get out of this?  I think the

>   advantage is not having to call virt_to_head_page() for frags ?


Not using the signature would free up a word of space in struct page for a
future feature?

> 

> - Keep all of them(?) and use frag->bv_page bit0 similarly to pp_recycle

>   bit?  I don't see much value on this one,  I am just keeping it here for

>   completeness.



For safety and performance reasons:
1. maybe we should move the pp_recycle bit from "struct sk_buff" to
   "struct skb_shared_info", and use it only to indicate whether the head page
   of a skb is from a page pool.

2. The frag->bv_page bit0 is used to indicate if the frag page of a skb is
   from a page pool, with __skb_frag_unref() and __skb_frag_ref() modified to
   keep track of it (a sketch follows below).

3. For safety or debugging reasons, keep the page signature for now, and put a
   page signature WARN_ON check in the page pool to catch any misbehaviour?

If no bug shows up later, maybe we can free the page signature space
for other usage?
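
A minimal sketch of point 2, assuming bit 0 of the stored pointer is free
(the macro and helper names are illustrative, not from any posted patch):

#define SKB_FRAG_PP_BIT	0x1UL

/* mark a frag as backed by a page pool page when it is attached */
static inline void skb_frag_mark_pp(skb_frag_t *frag, struct page *page)
{
	frag->bv_page = (struct page *)((unsigned long)page | SKB_FRAG_PP_BIT);
}

/* does this frag come from a page pool? */
static inline bool skb_frag_pp(const skb_frag_t *frag)
{
	return (unsigned long)frag->bv_page & SKB_FRAG_PP_BIT;
}

/* the plain page pointer, with the marker bit masked off */
static inline struct page *skb_frag_pp_page(const skb_frag_t *frag)
{
	return (struct page *)((unsigned long)frag->bv_page & ~SKB_FRAG_PP_BIT);
}

__skb_frag_ref()/__skb_frag_unref() would then branch on skb_frag_pp()
instead of skb->pp_recycle, and the signature from point 3 stays only as a
WARN_ON cross-check on the page pool side.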

> 

> Thanks

> /Ilias
Ilias Apalodimas Sept. 17, 2021, 3:01 p.m. UTC | #15
> >>>>> In that case you'd need to call virt_to_head_page() early though, get it

[...]
> >>>>> and then compare the signature.   I guess that's avoidable by using 

> >>>>> frag->bv_page for the fragments?

> >>>>

> >>>> If a page of a skb frag is from page pool, It seems frag->bv_page is

> >>>> always point to head_page of a compound page, so the calling of

> >>>> virt_to_head_page() does not seems necessary.

> >>>>

> >>>

> >>> I was mostly referring to the skb head here and how would you trigger the

> >>> recycling path. 

> >>>

> >>> I think we are talking about different things here.  

> >>> One idea is to use the last bit of frag->bv_page to identify fragments

> >>> allocated from page_pool, which is done today with the signature.

> >>>

> >>> The signature however exists in the head page so my question was, can we rid

> >>> of that without having a performance penalty?

> >>

> >> As both skb frag and head page is eventually operated on the head page

> >> of a compound page(if it is a compound page) for normal case too, maybe

> >> we can refactor the code to get the head page of a compound page before

> >> the signature checking without doing a second virt_to_head_page() or

> >> compound_head() call?

> > 

> > Yea that's doable, but my concern is different here.  If we do that the

> > standard network stack, even for drivers that don't use page_pool,  will

> > have to do a virt_to_head_page() -> check signature, to decide if it has to

> > try recycling the packet.  That's the performance part I am worried about,

> > since it happens for every packet. 

> 

> Yes, there is theoretically performance penalty for virt_to_head_page() or

> compound_head(), will do more test if we decide to go with the signature

> checking.


Can we check this somehow?  I can send a patch for this,  but my 
testing is limited to 1Gbit for the recycling.  I can find
25/100Gbit interfaces for the 'normal' path.

> 

> > 

> >>

> >>>

> >>> IOW in skb_free_head() can we replace:

> >>>

> >>> if (skb_pp_recycle(skb, head)) 

> >>> with

> >>> if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)

> >>> and get rid of the 'bool recycle' argument in __skb_frag_unref()?

> >>

> >> For the frag page of a skb, it seems ok to get rid of the 'bool recycle'

> >> argument in __skb_frag_unref(), as __skb_frag_unref() and __skb_frag_ref()

> >> is symmetrically called to put/get a page.

> >>

> >> For the head page of a skb, we might need to make sure the head page

> >> passed to __build_skb_around() meet below condition:

> >> do pp_frag_count incrementing instead of _refcount incrementing when

> >> the head page is not newly allocated and it is from page pool.

> >> It seems hard to audit that?

> > 

> > Yea that seems a bit weird at least to me and I am not sure, it's the only

> > place we'll have to go and do that.

> 

> Yes, That is why I avoid changing the behavior of a head page for a skb.

> In other word, maybe we should not track if head page for a skb is pp page

> or not when the page'_refcount is incremented during network stack journey,

> just treat it as normal page?

>  


I am not sure I understand this.

> > 

> >>

> >>

> >>>

> >>>>>> bit 0 of frag->bv_page is a different way of indicating a pp page,

> >>>> it is better we do not confuse with the page signature way. Using

> >>>> a bit 0 may give us a free word in 'struct page' if we manage to

> >>>> use skb->pp_recycle to indicate a head page of the skb uniquely, meaning

> >>>> page->pp_magic can be used for future feature.

> >>>>

> >>>>

> >>>>>

> >>>>>>

> >>>>>>> for pp_recycle right now?  __skb_frag_unref() in skb_shift() or

> >>>>>>> skb_try_coalesce() (the latter can probably be removed tbh).

> >>>>>>

> >>>>>> If we decide to go with accurate indicator of a pp page, we just need

> >>>>>> to make sure network stack use __skb_frag_unref() and __skb_frag_ref()

> >>>>>> to put and get a page frag, the indicator checking need only done in

> >>>>>> __skb_frag_unref() and __skb_frag_ref(), so the skb_shift() and

> >>>>>> skb_try_coalesce() should be fine too.

> >>>>>>

> >>>>>>>

> >>>>>>>>

> >>>>>>>> Another way is to use the bit 0 of frag->bv_page ptr to indicate if a frag

> >>>>>>>> page is from page pool.

> >>>>>>>

> >>>>>>> Instead of the 'struct page' signature?  And the pp_recycle bit will

> >>>>>>> continue to exist?  

> >>>>>>

> >>>>>> pp_recycle bit might only exist or is only used for the head page for the skb.

> >>>>>> The bit 0 of frag->bv_page ptr can be used to indicate a frag page uniquely.

> >>>>>> Doing a memcpying of shinfo or "*fragto = *fragfrom" automatically pass the

> >>>>>> indicator to the new shinfo before doing a __skb_frag_ref(), and __skb_frag_ref()

> >>>>>> will increment the _refcount or pp_frag_count according to the bit 0 of

> >>>>>> frag->bv_page.

> >>>>>>

> >>>>>> By the way, I also prototype the above idea, and it seems to work well too.

> >>>>>>

> >>>>>

> >>>>> As long as no one else touches this, it's just another way of identifying a

> >>>>> page_pool allocated page.  But are we gaining by that?  Not using

> >>>>> virt_to_head_page() as stated above? But in that case you still need to

> >>>>> keep pp_recycle around. 

> >>>>

> >>>> No, we do not need the pp_recycle, as long as the we make sure __skb_frag_ref()

> >>>> is called after memcpying the shinfo or doing "*fragto = *fragfrom".

> >>>

> >>> But we'll have to keep it for the skb head in this case.

> >>

> >> As above, I am not really look into skb head case:)

> > 

> > Let me take a step back here, because I think we drifted a bit. 

> > The page signature was introduced in order to be able to identify skb

> > fragments. The problem was that you couldn't rely on the pp_recycle bit of

> > the skb head,  since fragments could come from anywhere.  So you use the

> > skb bit as a hint for skb frags, and you eventually decide using the page

> > signature.

> > 

> > So we got 3 options (Anything I've missed ?)

> > - try to remove pp_recycle bit, since the page signature is enough for the

> >   skb head and fragments.  That in my opinion is the cleanest option,  as

> >   long as we can prove there's no performance hit on the standard network

> >   path.

> > 

> > - Replace the page signature with frag->bv_page bit0.  In that case we

> >   still have to keep the pp_recycle bit,  but we do have an 'easier'

> >   indication that a skb frag comes from page_pool.  That's still pretty

> >   safe, since you now have unique identifiers for the skb and page

> >   fragments and you can be sure of their origin (page pool or not).

> >   What I am missing here, is what do we get out of this?  I think the

> >   advantage is not having to call virt_to_head_page() for frags ?

> 

> Not using the signature will free a word space in struct page for future

> feature?


Yea that's another thing we gain,  but I am not sure how useful this is
going to turn out.

> 

> > 

> > - Keep all of them(?) and use frag->bv_page bit0 similarly to pp_recycle

> >   bit?  I don't see much value on this one,  I am just keeping it here for

> >   completeness.

> 

> 

> For safety and performance reasons:

> 1. maybe we should move the pp_recycle bit from "struct sk_buff" to

>    "struct skb_shared_info", and use it to only indicate if the head page of

>    a skb is from page pool.


What's the safety or performance we gain out of this?  The only performance
I can think of is the dirty cache line of the recycle bit we set to 0.
If we do move it to skb_shared_info we'll have to make sure it's on the
same cacheline as the ones we already change.
> 

> 2. The frag->bv_page bit0 is used to indicate if the frag page of a skb is

>    from page pool, and modify __skb_frag_unref() and __skb_frag_ref() to keep

>    track of it.

> 

> 3. For safety or debugging reasons, keep the page signature for now, and put a

>    page signature WARN_ON checking in page pool to catch any misbehaviour?

> 

> If there is not bug showing up later, maybe we can free the page signature space

> for other usage?


Yea that's essentially identical to (2) but we move the pp_recycle to the
skb_shared_info.  I'd really prefer getting rid of the pp_recycle entirely,
since it's the cleanest thing we can do in my head.  If we ever need an
extra 4/8 bytes in the future,  we can always go back and implement this.

Alexander/Jesper any additional thoughts?

Regards
/Ilias
> 

> > 

> > Thanks

> > /Ilias

>
Eric Dumazet Sept. 17, 2021, 5:15 p.m. UTC | #16
On Wed, Sep 15, 2021 at 7:05 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:

> As mentioned before, Tx recycling is based on a page_pool instance per socket.

> it shares the page_pool instance with rx.

>

> Anyway, based on feedback from edumazet and dsahern, I am still trying to

> see if the page pool is meaningful for tx.

>


It is not for generic linux TCP stack, but perhaps for benchmarks.

Unless you dedicate one TX/RX pair per TCP socket ?

Most high performance TCP flows are using zerocopy, I am really not
sure why we would need to 'optimize' the path that is wasting cpu
cycles doing user->kernel copies anyway, at the cost of insane
complexity.
Yunsheng Lin Sept. 18, 2021, 1:43 a.m. UTC | #17
On 2021/9/17 23:01, Ilias Apalodimas wrote:
>>>>>>> In that case you'd need to call virt_to_head_page() early though, get it

> [...]

>>>>>>> and then compare the signature.   I guess that's avoidable by using 

>>>>>>> frag->bv_page for the fragments?

>>>>>>

>>>>>> If a page of a skb frag is from page pool, It seems frag->bv_page is

>>>>>> always point to head_page of a compound page, so the calling of

>>>>>> virt_to_head_page() does not seems necessary.

>>>>>>

>>>>>

>>>>> I was mostly referring to the skb head here and how would you trigger the

>>>>> recycling path. 

>>>>>

>>>>> I think we are talking about different things here.  

>>>>> One idea is to use the last bit of frag->bv_page to identify fragments

>>>>> allocated from page_pool, which is done today with the signature.

>>>>>

>>>>> The signature however exists in the head page so my question was, can we rid

>>>>> of that without having a performance penalty?

>>>>

>>>> As both skb frag and head page is eventually operated on the head page

>>>> of a compound page(if it is a compound page) for normal case too, maybe

>>>> we can refactor the code to get the head page of a compound page before

>>>> the signature checking without doing a second virt_to_head_page() or

>>>> compound_head() call?

>>>

>>> Yea that's doable, but my concern is different here.  If we do that the

>>> standard network stack, even for drivers that don't use page_pool,  will

>>> have to do a virt_to_head_page() -> check signature, to decide if it has to

>>> try recycling the packet.  That's the performance part I am worried about,

>>> since it happens for every packet. 

>>

>> Yes, there is theoretically performance penalty for virt_to_head_page() or

>> compound_head(), will do more test if we decide to go with the signature

>> checking.

> 

> Can we check this somehow?  I can send a patch for this,  but my 

> testing is limited to 1Gbit for the recycling.  I can find

> 25/100Gbit interfaces for the 'normal' path.


I have done the signature checking for the frag page of a skb, and I am not
able to see a noticeable change between patched (with this patch) and
unpatched, for a small-packet drop test case (performance is about 34Mpps).

As the hns3 driver does not use the build_skb() API, I am not able to test
the signature checking penalty for the head page of a skb; any chance to do
that testing on your side?

> 

>>

>>>

>>>>

>>>>>

>>>>> IOW in skb_free_head() can we replace:

>>>>>

>>>>> if (skb_pp_recycle(skb, head)) 

>>>>> with

>>>>> if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)

>>>>> and get rid of the 'bool recycle' argument in __skb_frag_unref()?

>>>>

>>>> For the frag page of a skb, it seems ok to get rid of the 'bool recycle'

>>>> argument in __skb_frag_unref(), as __skb_frag_unref() and __skb_frag_ref()

>>>> is symmetrically called to put/get a page.

>>>>

>>>> For the head page of a skb, we might need to make sure the head page

>>>> passed to __build_skb_around() meet below condition:

>>>> do pp_frag_count incrementing instead of _refcount incrementing when

>>>> the head page is not newly allocated and it is from page pool.

>>>> It seems hard to audit that?

>>>

>>> Yea that seems a bit weird at least to me and I am not sure, it's the only

>>> place we'll have to go and do that.

>>

>> Yes, That is why I avoid changing the behavior of a head page for a skb.

>> In other word, maybe we should not track if head page for a skb is pp page

>> or not when the page'_refcount is incremented during network stack journey,

>> just treat it as normal page?

>>  

> 

> I am not sure I understand this.


I was saying we should only treat the head page of a skb as a pp page when it
is newly allocated from a page pool; if that page is reference-counted to build
another head page for another skb later, just treat it as a normal page.

> 

>>>

>>>>

>>>>

>>>>>

>>>>>> bit 0 of frag->bv_page is a different way of indicating a pp page,

>>>>>> it is better we do not confuse with the page signature way. Using

>>>>>> a bit 0 may give us a free word in 'struct page' if we manage to

>>>>>> use skb->pp_recycle to indicate a head page of the skb uniquely, meaning

>>>>>> page->pp_magic can be used for future feature.

>>>>>>

>>>>>>

>>>>>>>

>>>>>>>>

>>>>>>>>> for pp_recycle right now?  __skb_frag_unref() in skb_shift() or

>>>>>>>>> skb_try_coalesce() (the latter can probably be removed tbh).

>>>>>>>>

>>>>>>>> If we decide to go with accurate indicator of a pp page, we just need

>>>>>>>> to make sure network stack use __skb_frag_unref() and __skb_frag_ref()

>>>>>>>> to put and get a page frag, the indicator checking need only done in

>>>>>>>> __skb_frag_unref() and __skb_frag_ref(), so the skb_shift() and

>>>>>>>> skb_try_coalesce() should be fine too.

>>>>>>>>

>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>> Another way is to use the bit 0 of frag->bv_page ptr to indicate if a frag

>>>>>>>>>> page is from page pool.

>>>>>>>>>

>>>>>>>>> Instead of the 'struct page' signature?  And the pp_recycle bit will

>>>>>>>>> continue to exist?  

>>>>>>>>

>>>>>>>> pp_recycle bit might only exist or is only used for the head page for the skb.

>>>>>>>> The bit 0 of frag->bv_page ptr can be used to indicate a frag page uniquely.

>>>>>>>> Doing a memcpying of shinfo or "*fragto = *fragfrom" automatically pass the

>>>>>>>> indicator to the new shinfo before doing a __skb_frag_ref(), and __skb_frag_ref()

>>>>>>>> will increment the _refcount or pp_frag_count according to the bit 0 of

>>>>>>>> frag->bv_page.

>>>>>>>>

>>>>>>>> By the way, I also prototype the above idea, and it seems to work well too.

>>>>>>>>

>>>>>>>

>>>>>>> As long as no one else touches this, it's just another way of identifying a

>>>>>>> page_pool allocated page.  But are we gaining by that?  Not using

>>>>>>> virt_to_head_page() as stated above? But in that case you still need to

>>>>>>> keep pp_recycle around. 

>>>>>>

>>>>>> No, we do not need the pp_recycle, as long as the we make sure __skb_frag_ref()

>>>>>> is called after memcpying the shinfo or doing "*fragto = *fragfrom".

>>>>>

>>>>> But we'll have to keep it for the skb head in this case.

>>>>

>>>> As above, I am not really look into skb head case:)

>>>

>>> Let me take a step back here, because I think we drifted a bit. 

>>> The page signature was introduced in order to be able to identify skb

>>> fragments. The problem was that you couldn't rely on the pp_recycle bit of

>>> the skb head,  since fragments could come from anywhere.  So you use the

>>> skb bit as a hint for skb frags, and you eventually decide using the page

>>> signature.

>>>

>>> So we got 3 options (Anything I've missed ?)

>>> - try to remove pp_recycle bit, since the page signature is enough for the

>>>   skb head and fragments.  That in my opinion is the cleanest option,  as

>>>   long as we can prove there's no performance hit on the standard network

>>>   path.

>>>

>>> - Replace the page signature with frag->bv_page bit0.  In that case we

>>>   still have to keep the pp_recycle bit,  but we do have an 'easier'

>>>   indication that a skb frag comes from page_pool.  That's still pretty

>>>   safe, since you now have unique identifiers for the skb and page

>>>   fragments and you can be sure of their origin (page pool or not).

>>>   What I am missing here, is what do we get out of this?  I think the

>>>   advantage is not having to call virt_to_head_page() for frags ?

>>

>> Not using the signature will free a word space in struct page for future

>> feature?

> 

> Yea that's another thing we gain,  but I am not sure how useful how this is

> going to turn out.  

> 

>>

>>>

>>> - Keep all of them(?) and use frag->bv_page bit0 similarly to pp_recycle

>>>   bit?  I don't see much value on this one,  I am just keeping it here for

>>>   completeness.

>>

>>

>> For safety and performance reasons:

>> 1. maybe we should move the pp_recycle bit from "struct sk_buff" to

>>    "struct skb_shared_info", and use it to only indicate if the head page of

>>    a skb is from page pool.

> 

> What's the safety or performance we gain out of this?  The only performance


The safety is that we still have two ways to indicate a pp page:
the pp_recycle bit in "struct skb_shared_info" or frag->bv_page bit0 tells
if we want to treat a page as a pp page, and the page signature checking is
used to tell if we set those bits correctly.

> I can think of is the dirty cache line of the recycle bit we set to 0.

> If we do move it to skb_shared_info we'll have to make sure it's on the

> same cacheline as the ones we already change.


Yes, when we move the pp_recycle bit to skb_shared_info, that bit is only
set once, and we seem not to need to worry about skb cloning or expanding,
as the shinfo is part of the head page. A sketch follows.
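
A hypothetical stand-in for that layout (the real skb_shared_info has many
more fields; the bit name and its placement are assumptions here):

struct shinfo_sketch {
	unsigned char	flags;
	unsigned char	pp_recycle : 1;	/* head page came from a page pool */
	unsigned char	nr_frags;
};

/* set once when the head page is allocated from a pool; clones share the
 * head page and therefore the shinfo, so they all see the same bit
 */
static inline void shinfo_mark_pp(struct shinfo_sketch *shinfo)
{
	shinfo->pp_recycle = 1;
}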

>>

>> 2. The frag->bv_page bit0 is used to indicate if the frag page of a skb is

>>    from page pool, and modify __skb_frag_unref() and __skb_frag_ref() to keep

>>    track of it.

>>

>> 3. For safety or debugging reasons, keep the page signature for now, and put a

>>    page signature WARN_ON checking in page pool to catch any misbehaviour?

>>

>> If there is not bug showing up later, maybe we can free the page signature space

>> for other usage?

> 

> Yea that's essentially identical to (2) but we move the pp_recycle on the

> skb_shared_info.  I'd really prefer getting rid of the pp_recycle entirely,


When also removing the pp_recycle for the head page of a skb, it seems a
little risky, as we are not sure when a not-newly-allocated pp page is
passed to __build_skb_around() to build the head page?


> since it's the cleanest thing we can do in my head.  If we ever need an

> extra 4/8 bytes in the future,  we can always go back and implement this.

> 

> Alexander/Jesper any additional thoughts?

> 

> Regards

> /Ilias

>>

>>>

>>> Thanks

>>> /Ilias

>>

> .

>
Yunsheng Lin Sept. 18, 2021, 2:42 a.m. UTC | #18
On 2021/9/18 1:15, Eric Dumazet wrote:
> On Wed, Sep 15, 2021 at 7:05 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:

> 

>> As mentioned before, Tx recycling is based on a page_pool instance per socket.

>> it shares the page_pool instance with rx.

>>

>> Anyway, based on feedback from edumazet and dsahern, I am still trying to

>> see if the page pool is meaningful for tx.

>>

> 

> It is not for generic linux TCP stack, but perhaps for benchmarks.


I am not sure I understand what the above means; did you mean that
tx recycling only benefits benchmark tools, such as iperf/netperf,
but not real use cases?

> 

> Unless you dedicate one TX/RX pair per TCP socket ?


TX/RX pair as in a netdev queue pair, or as in a recycling pool?

As for the netdev queue pair, I am not dedicating one TX/RX
queue pair per TCP socket.

As for the TX/RX pair for the recycling pool, my initial thinking is that
each NAPI/socket context has a 'struct pp_alloc_cache', which provides a
last-in-first-out, lockless mini pool specific to that NAPI/socket context,
plus a central locked 'struct ptr_ring' pool, based on a queue, for all the
NAPI/socket mini pools; when a NAPI/socket context's mini pool is empty or
full, it can refill some pages from the central pool or flush some pages to
it (a rough sketch follows below).

I am not sure yet whether the locked central pool is needed, or whether the
'struct ptr_ring' of page pool is the right one to be the locked central
pool.
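
A rough sketch of that layout (all names are hypothetical;
PP_ALLOC_CACHE_SIZE and struct ptr_ring are the existing page pool
building blocks):

struct pp_ctx_cache {				/* one per NAPI/socket context */
	unsigned int count;
	struct page *pages[PP_ALLOC_CACHE_SIZE];	/* lockless LIFO */
};

struct pp_central_pool {			/* shared and locked */
	struct ptr_ring ring;
};

/* empty mini pool: refill from the central ring; NULL means the caller
 * falls back to the page allocator
 */
static struct page *pp_ctx_get(struct pp_ctx_cache *c,
			       struct pp_central_pool *p)
{
	if (!c->count)
		return ptr_ring_consume(&p->ring);

	return c->pages[--c->count];
}

/* full mini pool: flush one page to the central ring */
static void pp_ctx_put(struct pp_ctx_cache *c, struct pp_central_pool *p,
		       struct page *page)
{
	if (c->count == PP_ALLOC_CACHE_SIZE) {
		if (ptr_ring_produce(&p->ring, page))
			put_page(page);		/* ring full: release the page */
		return;
	}

	c->pages[c->count++] = page;
}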

> 

> Most high performance TCP flows are using zerocopy, I am really not

> sure why we would

> need to 'optimize' the path that is wasting cpu cycles doing

> user->kernel copies anyway,

> at the cost of insane complexity.


In my understanding, zerocopy is mostly about the big-packet and non-IOMMU
case.

As for complexity, I am not convinced yet that it is that complex, as it
mostly uses the existing infrastructure to support tx recycling.

The point is that most skbs are freed in the context of NAPI or a socket,
so it seems we may utilize that to do batch allocating/freeing of
skb/page_frag, or to reuse skb/page_frag/dma mappings, to avoid (IO/CPU)TLB
misses, cache misses, and the overhead of spinlocks and dma mapping (see
the bulk-return sketch below).
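
For the batching part, page_pool already exposes a bulk return API that such
a scheme could lean on; a sketch of a tx-completion path using it (the
surrounding function and variable names are assumed):

static void tx_clean_pages(struct page_pool *pool, struct page **done, int n)
{
	void *batch[64];
	int i, cnt = 0;

	for (i = 0; i < n; i++) {
		batch[cnt++] = done[i];

		/* one locked ptr_ring operation per batch, not per page */
		if (cnt == ARRAY_SIZE(batch)) {
			page_pool_put_page_bulk(pool, batch, cnt);
			cnt = 0;
		}
	}

	if (cnt)
		page_pool_put_page_bulk(pool, batch, cnt);
}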


> .

>
Ilias Apalodimas Sept. 18, 2021, 9:23 a.m. UTC | #19
[...]

> >>>>> I was mostly referring to the skb head here and how would you trigger the

> >>>>> recycling path. 

> >>>>>

> >>>>> I think we are talking about different things here.  

> >>>>> One idea is to use the last bit of frag->bv_page to identify fragments

> >>>>> allocated from page_pool, which is done today with the signature.

> >>>>>

> >>>>> The signature however exists in the head page so my question was, can we rid

> >>>>> of that without having a performance penalty?

> >>>>

> >>>> As both skb frag and head page is eventually operated on the head page

> >>>> of a compound page(if it is a compound page) for normal case too, maybe

> >>>> we can refactor the code to get the head page of a compound page before

> >>>> the signature checking without doing a second virt_to_head_page() or

> >>>> compound_head() call?

> >>>

> >>> Yea that's doable, but my concern is different here.  If we do that the

> >>> standard network stack, even for drivers that don't use page_pool,  will

> >>> have to do a virt_to_head_page() -> check signature, to decide if it has to

> >>> try recycling the packet.  That's the performance part I am worried about,

> >>> since it happens for every packet. 

> >>

> >> Yes, there is theoretically performance penalty for virt_to_head_page() or

> >> compound_head(), will do more test if we decide to go with the signature

> >> checking.

> > 

> > Can we check this somehow?  I can send a patch for this,  but my 

> > testing is limited to 1Gbit for the recycling.  I can find

> > 25/100Gbit interfaces for the 'normal' path.

> 

> I have done the signature checking for the frag page of a skb, and I am not

> able to see a noticeable change between patched (with this patch) and

> unpatched, for a small-packet drop test case (performance is about 34Mpps).

> 

> As the hns3 driver does not use the build_skb() API, I am not able to test

> the signature checking penalty for the head page of a skb; any chance to do

> that testing on your side?


Yea I think I'll see what I can do.  I wasn't expecting any regression on the
recycling path, since we do eventually call virt_to_head_page(),  but it's
always good to confirm.  Thanks for testing it.

> 

> > 

> >>

> >>>

> >>>>

> >>>>>

> >>>>> IOW in skb_free_head() can we replace:

> >>>>>

> >>>>> if (skb_pp_recycle(skb, head)) 

> >>>>> with

> >>>>> if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)

> >>>>> and get rid of the 'bool recycle' argument in __skb_frag_unref()?

> >>>>

> >>>> For the frag page of a skb, it seems ok to get rid of the 'bool recycle'

> >>>> argument in __skb_frag_unref(), as __skb_frag_unref() and __skb_frag_ref()

> >>>> is symmetrically called to put/get a page.

> >>>>

> >>>> For the head page of a skb, we might need to make sure the head page

> >>>> passed to __build_skb_around() meet below condition:

> >>>> do pp_frag_count incrementing instead of _refcount incrementing when

> >>>> the head page is not newly allocated and it is from page pool.

> >>>> It seems hard to audit that?

> >>>

> >>> Yea that seems a bit weird at least to me and I am not sure, it's the only

> >>> place we'll have to go and do that.

> >>

> >> Yes, That is why I avoid changing the behavior of a head page for a skb.

> >> In other word, maybe we should not track if head page for a skb is pp page

> >> or not when the page'_refcount is incremented during network stack journey,

> >> just treat it as normal page?

> >>  

> > 

> > I am not sure I understand this.

> 

> I was saying only treat the head page of a skb as pp page when it is newly

> allocated from page pool, if that page is reference-counted to build another

> head page for another skb later, just treat it as normal page.


But the problem here is that a cloned/expanded SKB could trigger a race
when freeing the fragments.  That's why we reset the pp_recycle bit if
there are still references to the frags.  What does 'normal' page mean here?
We'll have to at least unmap the dma part.
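
The guard being referred to, trimmed from skb_release_data() as of commit
2cc3aeb5eccc (frag_list and zerocopy handling omitted for brevity):

static void skb_release_data(struct sk_buff *skb)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);
	int i;

	/* a clone still holds a dataref: do not free or recycle the frags */
	if (skb->cloned &&
	    atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
			      &shinfo->dataref))
		goto exit;

	for (i = 0; i < shinfo->nr_frags; i++)
		__skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);

	skb_free_head(skb);
exit:
	/* make exactly one skb responsible for triggering the recycle path */
	skb->pp_recycle = 0;
}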

> 

> > 

> >>>

> >>>>

> >>>>

> >>>>>

> >>>>>> bit 0 of frag->bv_page is a different way of indicating a pp page,

> >>>>>> it is better we do not confuse with the page signature way. Using

> >>>>>> a bit 0 may give us a free word in 'struct page' if we manage to

> >>>>>> use skb->pp_recycle to indicate a head page of the skb uniquely, meaning

> >>>>>> page->pp_magic can be used for future feature.

> >>>>>>

> >>>>>>

> >>>>>>>

> >>>>>>>>

> >>>>>>>>> for pp_recycle right now?  __skb_frag_unref() in skb_shift() or

> >>>>>>>>> skb_try_coalesce() (the latter can probably be removed tbh).

> >>>>>>>>

> >>>>>>>> If we decide to go with accurate indicator of a pp page, we just need

> >>>>>>>> to make sure network stack use __skb_frag_unref() and __skb_frag_ref()

> >>>>>>>> to put and get a page frag, the indicator checking need only done in

> >>>>>>>> __skb_frag_unref() and __skb_frag_ref(), so the skb_shift() and

> >>>>>>>> skb_try_coalesce() should be fine too.

> >>>>>>>>

> >>>>>>>>>

> >>>>>>>>>>

> >>>>>>>>>> Another way is to use the bit 0 of frag->bv_page ptr to indicate if a frag

> >>>>>>>>>> page is from page pool.

> >>>>>>>>>

> >>>>>>>>> Instead of the 'struct page' signature?  And the pp_recycle bit will

> >>>>>>>>> continue to exist?  

> >>>>>>>>

> >>>>>>>> pp_recycle bit might only exist or is only used for the head page for the skb.

> >>>>>>>> The bit 0 of frag->bv_page ptr can be used to indicate a frag page uniquely.

> >>>>>>>> Doing a memcpying of shinfo or "*fragto = *fragfrom" automatically pass the

> >>>>>>>> indicator to the new shinfo before doing a __skb_frag_ref(), and __skb_frag_ref()

> >>>>>>>> will increment the _refcount or pp_frag_count according to the bit 0 of

> >>>>>>>> frag->bv_page.

> >>>>>>>>

> >>>>>>>> By the way, I also prototype the above idea, and it seems to work well too.

> >>>>>>>>

> >>>>>>>

> >>>>>>> As long as no one else touches this, it's just another way of identifying a

> >>>>>>> page_pool allocated page.  But are we gaining by that?  Not using

> >>>>>>> virt_to_head_page() as stated above? But in that case you still need to

> >>>>>>> keep pp_recycle around. 

> >>>>>>

> >>>>>> No, we do not need the pp_recycle, as long as the we make sure __skb_frag_ref()

> >>>>>> is called after memcpying the shinfo or doing "*fragto = *fragfrom".

> >>>>>

> >>>>> But we'll have to keep it for the skb head in this case.

> >>>>

> >>>> As above, I am not really look into skb head case:)

> >>>

> >>> Let me take a step back here, because I think we drifted a bit. 

> >>> The page signature was introduced in order to be able to identify skb

> >>> fragments. The problem was that you couldn't rely on the pp_recycle bit of

> >>> the skb head,  since fragments could come from anywhere.  So you use the

> >>> skb bit as a hint for skb frags, and you eventually decide using the page

> >>> signature.

> >>>

> >>> So we got 3 options (Anything I've missed ?)

> >>> - try to remove pp_recycle bit, since the page signature is enough for the

> >>>   skb head and fragments.  That in my opinion is the cleanest option,  as

> >>>   long as we can prove there's no performance hit on the standard network

> >>>   path.

> >>>

> >>> - Replace the page signature with frag->bv_page bit0.  In that case we

> >>>   still have to keep the pp_recycle bit,  but we do have an 'easier'

> >>>   indication that a skb frag comes from page_pool.  That's still pretty

> >>>   safe, since you now have unique identifiers for the skb and page

> >>>   fragments and you can be sure of their origin (page pool or not).

> >>>   What I am missing here, is what do we get out of this?  I think the

> >>>   advantage is not having to call virt_to_head_page() for frags ?

> >>

> >> Not using the signature will free a word space in struct page for future

> >> feature?

> > 

> > Yea that's another thing we gain,  but I am not sure how useful how this is

> > going to turn out.  

> > 

> >>

> >>>

> >>> - Keep all of them(?) and use frag->bv_page bit0 similarly to pp_recycle

> >>>   bit?  I don't see much value on this one,  I am just keeping it here for

> >>>   completeness.

> >>

> >>

> >> For safety and performance reasons:

> >> 1. maybe we should move the pp_recycle bit from "struct sk_buff" to

> >>    "struct skb_shared_info", and use it to only indicate if the head page of

> >>    a skb is from page pool.

> > 

> > What's the safety or performance we gain out of this?  The only performance

> 

> safety is that we still have two ways to indicate a pp page.

> the pp_recycle bit in  "struct skb_shared_info" or frag->bv_page bit0 tell

> if we want to treat a page as pp page, the page signature checking is used

> to tell if we set those bits correctly?

> 


Yea but in the long run we'll want the page signature.  So that's basically
(2) once we do that.

> > I can think of is the dirty cache line of the recycle bit we set to 0.

> > If we do move it to skb_shared_info we'll have to make sure it's on the

> > same cacheline as the ones we already change.

> 

> Yes, when we move the pp_recycle bit to skb_shared_info, that bit is only

> set once, and we seems to not need to worry about skb doing cloning or

> expanding as the it is part of head page(shinfo is part of head page).

> 

> >>

> >> 2. The frag->bv_page bit0 is used to indicate if the frag page of a skb is

> >>    from page pool, and modify __skb_frag_unref() and __skb_frag_ref() to keep

> >>    track of it.

> >>

> >> 3. For safety or debugging reasons, keep the page signature for now, and put a

> >>    page signature WARN_ON checking in page pool to catch any misbehaviour?

> >>

> >> If there is not bug showing up later, maybe we can free the page signature space

> >> for other usage?

> > 

> > Yea that's essentially identical to (2) but we move the pp_recycle on the

> > skb_shared_info.  I'd really prefer getting rid of the pp_recycle entirely,

> 

> When also removing the pp_recycle for head page of a skb, it seems a little

> risky as we are not sure when a not-newly-allocated pp page is called with

> __build_skb_around() to build to head page?


Removing the pp_recycle is only safe if we keep the page signature.  I was
suggesting we follow (1) first before starting to move things around.

> 

> > since it's the cleanest thing we can do in my head.  If we ever need an

> > extra 4/8 bytes in the future,  we can always go back and implement this.

> > 

> > Alexander/Jesper any additional thoughts?

> > 


Thanks
/Ilias
Yunsheng Lin Sept. 22, 2021, 3:38 a.m. UTC | #20
On 2021/9/18 17:23, Ilias Apalodimas wrote:
> [...]

> 


[...]

>>>>>>

>>>>>>>

>>>>>>> IOW in skb_free_head() can we replace:

>>>>>>>

>>>>>>> if (skb_pp_recycle(skb, head)) 

>>>>>>> with

>>>>>>> if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)

>>>>>>> and get rid of the 'bool recycle' argument in __skb_frag_unref()?

>>>>>>

>>>>>> For the frag page of a skb, it seems ok to get rid of the 'bool recycle'

>>>>>> argument in __skb_frag_unref(), as __skb_frag_unref() and __skb_frag_ref()

>>>>>> is symmetrically called to put/get a page.

>>>>>>

>>>>>> For the head page of a skb, we might need to make sure the head page

>>>>>> passed to __build_skb_around() meet below condition:

>>>>>> do pp_frag_count incrementing instead of _refcount incrementing when

>>>>>> the head page is not newly allocated and it is from page pool.

>>>>>> It seems hard to audit that?

>>>>>

>>>>> Yea that seems a bit weird at least to me and I am not sure, it's the only

>>>>> place we'll have to go and do that.

>>>>

>>>> Yes, That is why I avoid changing the behavior of a head page for a skb.

>>>> In other word, maybe we should not track if head page for a skb is pp page

>>>> or not when the page'_refcount is incremented during network stack journey,

>>>> just treat it as normal page?

>>>>  

>>>

>>> I am not sure I understand this.

>>

>> I was saying only treat the head page of a skb as pp page when it is newly

>> allocated from page pool, if that page is reference-counted to build another

>> head page for another skb later, just treat it as normal page.

> 

> But the problem here is that a cloned/expanded SKB could trigger a race

> when freeing the fragments.  That's why we reset the pp_recycle bit if

> there's still references to the frags.  What does 'normal' page means here?

> We'll have to at least unmap dma part.


'normal' page means a non-pp page here. Maybe forget the above.

I read the code related to head page handling for a skb; it seems the
NAPI_GRO_FREE_STOLEN_HEAD and skb_head_frag_to_page_desc() cases are just
fine as they are now, when the page signature is used to uniquely identify
a pp page for the head page of a skb?

> 

>>

>>>

>>>>>

>>>>>>

>>>>>>

>>>>>>>

>>>>>>>> bit 0 of frag->bv_page is a different way of indicating a pp page,

>>>>>>>> it is better we do not confuse with the page signature way. Using

>>>>>>>> a bit 0 may give us a free word in 'struct page' if we manage to

>>>>>>>> use skb->pp_recycle to indicate a head page of the skb uniquely, meaning

>>>>>>>> page->pp_magic can be used for future feature.

>>>>>>>>

>>>>>>>>

>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>>> for pp_recycle right now?  __skb_frag_unref() in skb_shift() or

>>>>>>>>>>> skb_try_coalesce() (the latter can probably be removed tbh).

>>>>>>>>>>

>>>>>>>>>> If we decide to go with accurate indicator of a pp page, we just need

>>>>>>>>>> to make sure network stack use __skb_frag_unref() and __skb_frag_ref()

>>>>>>>>>> to put and get a page frag, the indicator checking need only done in

>>>>>>>>>> __skb_frag_unref() and __skb_frag_ref(), so the skb_shift() and

>>>>>>>>>> skb_try_coalesce() should be fine too.

>>>>>>>>>>

>>>>>>>>>>>

>>>>>>>>>>>>

>>>>>>>>>>>> Another way is to use the bit 0 of frag->bv_page ptr to indicate if a frag

>>>>>>>>>>>> page is from page pool.

>>>>>>>>>>>

>>>>>>>>>>> Instead of the 'struct page' signature?  And the pp_recycle bit will

>>>>>>>>>>> continue to exist?  

>>>>>>>>>>

>>>>>>>>>> pp_recycle bit might only exist or is only used for the head page for the skb.

>>>>>>>>>> The bit 0 of frag->bv_page ptr can be used to indicate a frag page uniquely.

>>>>>>>>>> Doing a memcpying of shinfo or "*fragto = *fragfrom" automatically pass the

>>>>>>>>>> indicator to the new shinfo before doing a __skb_frag_ref(), and __skb_frag_ref()

>>>>>>>>>> will increment the _refcount or pp_frag_count according to the bit 0 of

>>>>>>>>>> frag->bv_page.

>>>>>>>>>>

>>>>>>>>>> By the way, I also prototype the above idea, and it seems to work well too.

>>>>>>>>>>

>>>>>>>>>

>>>>>>>>> As long as no one else touches this, it's just another way of identifying a

>>>>>>>>> page_pool allocated page.  But are we gaining by that?  Not using

>>>>>>>>> virt_to_head_page() as stated above? But in that case you still need to

>>>>>>>>> keep pp_recycle around. 

>>>>>>>>

>>>>>>>> No, we do not need the pp_recycle, as long as the we make sure __skb_frag_ref()

>>>>>>>> is called after memcpying the shinfo or doing "*fragto = *fragfrom".

>>>>>>>

>>>>>>> But we'll have to keep it for the skb head in this case.

>>>>>>

>>>>>> As above, I am not really look into skb head case:)

>>>>>

>>>>> Let me take a step back here, because I think we drifted a bit. 

>>>>> The page signature was introduced in order to be able to identify skb

>>>>> fragments. The problem was that you couldn't rely on the pp_recycle bit of

>>>>> the skb head,  since fragments could come from anywhere.  So you use the

>>>>> skb bit as a hint for skb frags, and you eventually decide using the page

>>>>> signature.

>>>>>

>>>>> So we got 3 options (Anything I've missed ?)

>>>>> - try to remove pp_recycle bit, since the page signature is enough for the

>>>>>   skb head and fragments.  That in my opinion is the cleanest option,  as

>>>>>   long as we can prove there's no performance hit on the standard network

>>>>>   path.

>>>>>

>>>>> - Replace the page signature with frag->bv_page bit0.  In that case we

>>>>>   still have to keep the pp_recycle bit,  but we do have an 'easier'

>>>>>   indication that a skb frag comes from page_pool.  That's still pretty

>>>>>   safe, since you now have unique identifiers for the skb and page

>>>>>   fragments and you can be sure of their origin (page pool or not).

>>>>>   What I am missing here, is what do we get out of this?  I think the

>>>>>   advantage is not having to call virt_to_head_page() for frags ?

>>>>

>>>> Not using the signature will free a word space in struct page for future

>>>> feature?

>>>

>>> Yea that's another thing we gain,  but I am not sure how useful how this is

>>> going to turn out.  

>>>

>>>>

>>>>>

>>>>> - Keep all of them(?) and use frag->bv_page bit0 similarly to pp_recycle

>>>>>   bit?  I don't see much value on this one,  I am just keeping it here for

>>>>>   completeness.

>>>>

>>>>

>>>> For safety and performance reasons:

>>>> 1. maybe we should move the pp_recycle bit from "struct sk_buff" to

>>>>    "struct skb_shared_info", and use it to only indicate if the head page of

>>>>    a skb is from page pool.

>>>

>>> What's the safety or performance we gain out of this?  The only performance

>>

>> safety is that we still have two ways to indicate a pp page.

>> the pp_recycle bit in  "struct skb_shared_info" or frag->bv_page bit0 tell

>> if we want to treat a page as pp page, the page signature checking is used

>> to tell if we if set those bits correctly?

>>

> 

> Yea but in the long run we'll want the page signature.  So that's basically

> (2) once we do that.

> 

>>> I can think of is the dirty cache line of the recycle bit we set to 0.

>>> If we do move it to skb_shared_info we'll have to make sure it's on the

>>> same cacheline as the ones we already change.

>>

>> Yes, when we move the pp_recycle bit to skb_shared_info, that bit is only

>> set once, and we seems to not need to worry about skb doing cloning or

>> expanding as the it is part of head page(shinfo is part of head page).

>>

>>>>

>>>> 2. The frag->bv_page bit0 is used to indicate if the frag page of a skb is

>>>>    from page pool, and modify __skb_frag_unref() and __skb_frag_ref() to keep

>>>>    track of it.

>>>>

>>>> 3. For safety or debugging reasons, keep the page signature for now, and put a

>>>>    page signature WARN_ON checking in page pool to catch any misbehaviour?

>>>>

>>>> If there is not bug showing up later, maybe we can free the page signature space

>>>> for other usage?

>>>

>>> Yea that's essentially identical to (2) but we move the pp_recycle on the

>>> skb_shared_info.  I'd really prefer getting rid of the pp_recycle entirely,

>>

>> When also removing the pp_recycle for head page of a skb, it seems a little

>> risky as we are not sure when a not-newly-allocated pp page is called with

>> __build_skb_around() to build to head page?

> 

> Removing the pp_recycle is only safe if we keep the page signature.  I was

> suggesting we follow (1) first before starting moving things around.


I suppose (1) means the following, right:

> - try to remove pp_recycle bit, since the page signature is enough for the

>   skb head and fragments.  That in my opinion is the cleanest option,  as

>   long as we can prove there's no performance hit on the standard network

>   path.


It seems doable if my above analysis of head page handling for a skb does not
miss anything.

> 

>>

>>> since it's the cleanest thing we can do in my head.  If we ever need an

>>> extra 4/8 bytes in the future,  we can always go back and implement this.

>>>

>>> Alexander/Jesper any additional thoughts?

>>>

> 

> Thanks

> /Ilias

> .

>
diff mbox series

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 35eebc2310a5..4d975ab27078 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3073,7 +3073,16 @@  static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
-	get_page(skb_frag_page(frag));
+	struct page *page = skb_frag_page(frag);
+
+#ifdef CONFIG_PAGE_POOL
+	if (page_pool_is_pp_page(page)) {
+		page_pool_atomic_inc_frag_count(page);
+		return;
+	}
+#endif
+
+	get_page(page);
 }
 
 /**
@@ -3088,6 +3097,22 @@  static inline void skb_frag_ref(struct sk_buff *skb, int f)
 	__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
 }
 
+/**
+ * skb_frag_is_pp_page - decide if a page is recyclable.
+ * @page: frag page
+ * @recycle: skb->pp_recycle
+ *
+ * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
+ * also used to decide if a page can be recycled to the page
+ * pool.
+ */
+static inline bool skb_frag_is_pp_page(struct page *page,
+				       bool recycle)
+{
+	return page_pool_is_pp_page(page) ||
+		(recycle && __page_pool_is_pp_page(page));
+}
+
 /**
  * __skb_frag_unref - release a reference on a paged fragment.
  * @frag: the paged fragment
@@ -3101,8 +3126,10 @@  static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
 	struct page *page = skb_frag_page(frag);
 
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && page_pool_return_skb_page(page))
+	if (skb_frag_is_pp_page(page, recycle)) {
+		page_pool_return_skb_page(page);
 		return;
+	}
 #endif
 	put_page(page);
 }
@@ -4720,9 +4747,14 @@  static inline void skb_mark_for_recycle(struct sk_buff *skb)
 
 static inline bool skb_pp_recycle(struct sk_buff *skb, void *data)
 {
-	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
+	struct page *page = virt_to_head_page(data);
+
+	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle ||
+	    !__page_pool_is_pp_page(page))
 		return false;
-	return page_pool_return_skb_page(virt_to_head_page(data));
+
+	page_pool_return_skb_page(page);
+	return true;
 }
 
 #endif	/* __KERNEL__ */
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2ad0706566c5..eb103d86f453 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -164,7 +164,7 @@  inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
 	return pool->p.dma_dir;
 }
 
-bool page_pool_return_skb_page(struct page *page);
+void page_pool_return_skb_page(struct page *page);
 
 struct page_pool *page_pool_create(const struct page_pool_params *params);
 
@@ -244,6 +244,32 @@  static inline void page_pool_set_frag_count(struct page *page, long nr)
 	atomic_long_set(&page->pp_frag_count, nr);
 }
 
+static inline void page_pool_atomic_inc_frag_count(struct page *page)
+{
+	atomic_long_inc(&page->pp_frag_count);
+}
+
+static inline bool __page_pool_is_pp_page(struct page *page)
+{
+	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+	 * in order to preserve any existing bits, such as bit 0 for the
+	 * head page of compound page and bit 1 for pfmemalloc page, so
+	 * mask those bits for freeing side when doing below checking,
+	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+	 * to avoid recycling the pfmemalloc page.
+	 */
+	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
+static inline bool page_pool_is_pp_page(struct page *page)
+{
+	/* For systems with the same dma addr as the bus addr, we can use
+	 * page->pp_magic to indicate a pp page uniquely.
+	 */
+	return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
+			__page_pool_is_pp_page(page);
+}
+
 static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
 							  long nr)
 {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 09d7b8614ef5..3a419871d4bc 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -24,7 +24,7 @@ 
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
-#define BIAS_MAX	LONG_MAX
+#define BIAS_MAX	(LONG_MAX / 2)
 
 static int page_pool_init(struct page_pool *pool,
 			  const struct page_pool_params *params)
@@ -736,20 +736,10 @@  void page_pool_update_nid(struct page_pool *pool, int new_nid)
 }
 EXPORT_SYMBOL(page_pool_update_nid);
 
-bool page_pool_return_skb_page(struct page *page)
+void page_pool_return_skb_page(struct page *page)
 {
 	struct page_pool *pp;
 
-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
-		return false;
-
 	pp = page->pp;
 
 	/* Driver set this to memory recycling info. Reset it on recycle.
@@ -758,7 +748,5 @@  bool page_pool_return_skb_page(struct page *page)
 	 * 'flipped' fragment being in use or not.
 	 */
 	page_pool_put_full_page(pp, page, false);
-
-	return true;
 }
 EXPORT_SYMBOL(page_pool_return_skb_page);
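
For context on how the result is meant to be used (a sketch, not part of the
posted diff): a driver still marks the skb head for recycling once, while
frag pages are now tracked through page->pp_magic and pp_frag_count alone:

/* rx path of a page-pool-backed driver; the local names are assumed */
skb = build_skb(page_address(page), truesize);
skb_mark_for_recycle(skb);	/* head page hint: skb->pp_recycle = 1 */

/* any later __skb_frag_ref() on this frag (clone, split, expand head)
 * bumps pp_frag_count instead of _refcount for page pool pages
 */
skb_add_rx_frag(skb, 0, frag_page, 0, frag_len, frag_truesize);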