Message ID: 20211217113049.23850-1-david@redhat.com
Series: mm: COW fixes part 1: fix the COW security issue for THP and hugetlb
> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <david@redhat.com> wrote:
>
> Sometimes it is required to have a seqcount implementation that uses
> a structure with a fixed and minimal size -- just a bare unsigned int --
> independent of the kernel configuration. This is especially valuable, when
> the raw_ variants of the seqlock function will be used and the additional
> lockdep part of the seqcount_t structure remains essentially unused.
>
> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
> the raw functions to have a basic seqlock.
>
> The target use case is embedding a raw_seqcount_t in the "struct page",
> where we really want a minimal size and cannot tolerate a sudden grow of
> the seqcount_t structure resulting in a significant "struct page"
> increase or even a layout change.
>
> Provide raw_read_seqcount_retry(), to make it easy to match to
> raw_read_seqcount_begin() in the code.
>
> Let's add a short documentation as well.
>
> Note: There might be other possible users for raw_seqcount_t where the
> lockdep part might be completely unused and just wastes memory --
> essentially any users that only use the raw_ function variants.

Is it possible to force some policy when raw_seqcount_t is used to
prevent its abuse? For instance not to allow to acquire other (certain?)
locks when it is held?

[ snip ]

> +/**
> + * raw_seqcount_init() - runtime initializer for raw_seqcount_t
> + * @s: Pointer to the raw_seqcount_t instance
> + */
> +# define raw_seqcount_init(s) __raw_seqcount_init(s)
> +
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>
>  # define SEQCOUNT_DEP_MAP_INIT(lockname) \
> @@ -111,11 +129,16 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
>  # define seqcount_lockdep_reader_access(x)
>  #endif
>
> +/**
> + * RAW_SEQCNT_ZERO() - static initializer for raw_seqcount_t
> + */
> +#define RAW_SEQCNT_ZERO() 0

I am not sure why RAW_SEQCNT_ZERO() should be a function-like macro.
Moreover, the documentation showed:

+  /* static */
+  static raw_seqcount_t foo_seqcount = RAW_SEQCNT_ZERO(foo_seqcount);
+

But RAW_SEQCNT_ZERO does not take an argument?
> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <david@redhat.com> wrote:
>
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance
> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
>
> This is a preparation for further changes.

It would be useful to add "no functional change intended" or something.

> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/util.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 741ba32a43ac..3239e75c148d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
>  {
>  	int ret;
>
> -	ret = atomic_read(&page->_mapcount) + 1;
> +	if (PageHuge(page))
> +		return compound_mapcount(page);

Before you return, perhaps you can add an assertion like:

	VM_BUG_ON(PageDoubleMap(page));

This would make the code clearer and would ease debugging in the
future (if support for double-map is expanded).
On 17.12.21 18:02, Nadav Amit wrote:
>> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> Sometimes it is required to have a seqcount implementation that uses
>> a structure with a fixed and minimal size -- just a bare unsigned int --
>> independent of the kernel configuration.

[ snip ]

> Is it possible to force some policy when raw_seqcount_t is used to
> prevent its abuse? For instance not to allow to acquire other (certain?)
> locks when it is held?

Good question ... in this series we won't be taking additional locks on
the reader or the writer side. Something like lockdep_forbid() /
lockdep_allow() to disallow any kind of locking. I haven't heard of
anything like that, maybe someone reading along has a clue?

The writer side might be easy to handle, but some seqcount operations
that don't do the full read()->retry() cycle are problematic
(->raw_read_seqcount).
> [ snip ]
>
>> +/**
>> + * raw_seqcount_init() - runtime initializer for raw_seqcount_t
>> + * @s: Pointer to the raw_seqcount_t instance
>> + */
>> +# define raw_seqcount_init(s) __raw_seqcount_init(s)
>> +
>>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>
>>  # define SEQCOUNT_DEP_MAP_INIT(lockname) \
>> @@ -111,11 +129,16 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
>>  # define seqcount_lockdep_reader_access(x)
>>  #endif
>>
>> +/**
>> + * RAW_SEQCNT_ZERO() - static initializer for raw_seqcount_t
>> + */
>> +#define RAW_SEQCNT_ZERO() 0
>
> I am not sure why RAW_SEQCNT_ZERO() should be a function-like macro.

I think I just went for consistency with SEQCNT_ZERO() -- but I agree,
that can just be simplified!

Thanks!
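For illustration, the simplified form being discussed could look like the
following hypothetical sketch (compiled in user space; these are not the
kernel's actual definitions, and `RAW_SEQCNT_ZERO` without parentheses is the
suggested fix, not merged code):

```c
/* Hypothetical sketch: the proposed raw_seqcount_t with the static
 * initializer simplified to a plain object-like macro. Unlike
 * SEQCNT_ZERO(), there is no lockdep map to tie to a variable name,
 * so a bare constant is enough and no argument is needed. */
typedef unsigned int raw_seqcount_t;

#define RAW_SEQCNT_ZERO 0

#define __raw_seqcount_init(s) do { *(s) = 0; } while (0)
#define raw_seqcount_init(s)   __raw_seqcount_init(s)
```

With that, the documented static initialization would simply read
`static raw_seqcount_t foo_seqcount = RAW_SEQCNT_ZERO;` and the mismatch
Nadav pointed out (an argument passed to an argument-less macro) disappears.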
On 17.12.21 18:16, Nadav Amit wrote:
>> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> This is a preparation for further changes.
>
> It would be useful to add "no functional change intended" or something.

Absolutely, same applies to other "simplification" patches.

[ snip ]

> Before you return, perhaps you can add an assertion like:
>
> 	VM_BUG_ON(PageDoubleMap(page));
>
> This would make the code clearer and would ease debugging in the
> future (if support for double-map is expanded).

I'd probably have to add this to a couple of places -- and I assume
anybody working on that has to grep the kernel for uses of PageDoubleMap
already.

Thanks!
On 17.12.21 18:29, David Hildenbrand wrote:
> On 17.12.21 18:02, Nadav Amit wrote:
>> Is it possible to force some policy when raw_seqcount_t is used to
>> prevent its abuse? For instance not to allow to acquire other (certain?)
>> locks when it is held?
>
> Good question ... in this series we won't be taking additional locks on
> the reader or the writer side. Something like lockdep_forbid() /
> lockdep_allow() to disallow any kind of locking. I haven't heard of
> anything like that, maybe someone reading along has a clue?
>
> The writer side might be easy to handle, but some seqcount operations
> that don't do the full read()->retry() cycle are problematic
> (->raw_read_seqcount).

Sorry, I forgot to mention an important point: the raw_seqcount_t
doesn't give you any additional "power" to abuse.

You can just use the ordinary seqcount_t with the raw_ functions. One
example is mm->write_protect_seq. So whatever we would want to "invent"
should also apply to the raw_ functions in general -- which might be
undesired or impossible (IIRC IRQ context).
> On Dec 17, 2021, at 9:49 AM, David Hildenbrand <david@redhat.com> wrote:

[ snip ]

>> The writer side might be easy to handle, but some seqcount operations
>> that don't do the full read()->retry() cycle are problematic
>> (->raw_read_seqcount).
>
> Sorry, I forgot to mention an important point: the raw_seqcount_t
> doesn't give you any additional "power" to abuse.
>
> You can just use the ordinary seqcount_t with the raw_ functions. One
> example is mm->write_protect_seq. So whatever we would want to "invent"
> should also apply to the raw_ functions in general -- which might be
> undesired or impossible (IIRC IRQ context).

Thanks for the clarification. I was unfamiliar with
raw_read_seqcount_begin() (and friends). Indeed, it is very rarely used.
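For readers following along, the raw_ read side boils down to the classic
begin/retry loop over an even/odd sequence. A rough user-space approximation
using C11 atomics (illustrative only -- the kernel primitives differ in their
barriers, lockdep hooks, and PREEMPT_RT behavior, and all `sketch_` names here
are made up):

```c
#include <stdatomic.h>

/* User-space sketch of the seqcount protocol that the raw_ variants
 * implement: the sequence is even when no writer is active and odd
 * while a write is in progress. */
typedef atomic_uint sketch_seqcount_t;

static unsigned int sketch_read_begin(sketch_seqcount_t *s)
{
    unsigned int seq;

    /* Wait until no writer is active (sequence is even). */
    while ((seq = atomic_load_explicit(s, memory_order_acquire)) & 1)
        ;
    return seq;
}

static int sketch_read_retry(sketch_seqcount_t *s, unsigned int start)
{
    /* The read section must be retried if the sequence changed. */
    return atomic_load_explicit(s, memory_order_acquire) != start;
}

static void sketch_write_begin(sketch_seqcount_t *s)
{
    atomic_fetch_add_explicit(s, 1, memory_order_acq_rel); /* odd: writer active */
}

static void sketch_write_end(sketch_seqcount_t *s)
{
    atomic_fetch_add_explicit(s, 1, memory_order_release); /* even again */
}

/* Typical reader pattern, mirroring raw_read_seqcount_begin()/
 * raw_read_seqcount_retry():
 *
 *     unsigned int seq;
 *     do {
 *         seq = sketch_read_begin(&s);
 *         copy = shared_data;
 *     } while (sketch_read_retry(&s, seq));
 */
```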
On 12/17/21 03:30, David Hildenbrand wrote:
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance

It is too early to say if hugetlb double mapping will use PageDoubleMap().
I do not think (hope) it will be necessary. So I think you can drop the
mention of it here.

> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
>
> This is a preparation for further changes.
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
On 17.12.21 19:06, Mike Kravetz wrote:
> It is too early to say if hugetlb double mapping will use PageDoubleMap().
> I do not think (hope) it will be necessary. So I think you can drop the
> mention of it here.

Desires have most certainly been expressed by a couple of parties -- to
PTE-map huge pages :)

Hopefully we'll find a way to avoid PageDoubleMap, I agree. Dropping the
comment!

[ snip ]

> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks!
On Fri, Dec 17, 2021 at 3:33 AM David Hildenbrand <david@redhat.com> wrote:
>
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance
> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
>
> This is a preparation for further changes.
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  mm/util.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 741ba32a43ac..3239e75c148d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
>  {
>  	int ret;
>
> -	ret = atomic_read(&page->_mapcount) + 1;
> +	if (PageHuge(page))
> +		return compound_mapcount(page);
>  	/*
>  	 * For file THP page->_mapcount contains total number of mapping
>  	 * of the page: no need to look into compound_mapcount.
>  	 */
> -	if (!PageAnon(page) && !PageHuge(page))
> -		return ret;
> +	if (!PageAnon(page))
> +		return atomic_read(&page->_mapcount) + 1;
> +
> +	ret = atomic_read(&page->_mapcount) + 1;
>  	page = compound_head(page);
> -	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
> +	ret += head_compound_mapcount(page);
>  	if (PageDoubleMap(page))
>  		ret--;
>  	return ret;
> --
> 2.31.1
>
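As a sanity check that the refactoring above preserves behavior, the before
and after logic can be modeled in user space. This is a sketch under stated
assumptions: the Page*() predicates and counters are stand-in struct fields
(not the kernel API), hugetlb pages are modeled with an unused per-page
`_mapcount` of -1, and `compound_mapcount` already includes the kernel's
"+ 1" on the raw counter:

```c
#include <stdbool.h>

/* User-space model of __page_mapcount() before and after the patch. */
struct page_model {
    bool huge;             /* PageHuge()                           */
    bool anon;             /* PageAnon()                           */
    bool double_map;       /* PageDoubleMap() on the head page     */
    int mapcount;          /* atomic_read(&page->_mapcount)        */
    int compound_mapcount; /* head_compound_mapcount() of the head */
};

static int mapcount_old(const struct page_model *p)
{
    int ret = p->mapcount + 1;

    if (!p->anon && !p->huge)  /* file THP: _mapcount is the total */
        return ret;
    ret += p->compound_mapcount;
    if (p->double_map)
        ret--;
    return ret;
}

static int mapcount_new(const struct page_model *p)
{
    int ret;

    if (p->huge)               /* hugetlb: compound mapcount only */
        return p->compound_mapcount;
    if (!p->anon)              /* file THP: _mapcount is the total */
        return p->mapcount + 1;

    ret = p->mapcount + 1;     /* anon THP */
    ret += p->compound_mapcount;
    if (p->double_map)
        ret--;
    return ret;
}
```

Running both variants over hugetlb, file-THP, and anon-THP cases yields
identical results, matching the "no functional change" intent.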
On Fri, Dec 17 2021 at 12:30, David Hildenbrand wrote:
> Sometimes it is required to have a seqcount implementation that uses
> a structure with a fixed and minimal size -- just a bare unsigned int --
> independent of the kernel configuration. This is especially valuable, when
> the raw_ variants of the seqlock function will be used and the additional
> lockdep part of the seqcount_t structure remains essentially unused.
>
> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
> the raw functions to have a basic seqlock.
>
> The target use case is embedding a raw_seqcount_t in the "struct page",
> where we really want a minimal size and cannot tolerate a sudden grow of
> the seqcount_t structure resulting in a significant "struct page"
> increase or even a layout change.

Cannot tolerate? Could you please provide a reason and not just a
statement?

> Provide raw_read_seqcount_retry(), to make it easy to match to
> raw_read_seqcount_begin() in the code.
>
> Let's add a short documentation as well.
>
> Note: There might be other possible users for raw_seqcount_t where the
> lockdep part might be completely unused and just wastes memory --
> essentially any users that only use the raw_ function variants.

Even when the reader side uses raw_seqcount_begin/retry(), the writer
side can still use the non-raw variant, which validates that the
associated lock is held on write.

Aside of that, your proposed extra raw sequence count needs extra care
vs. PREEMPT_RT, and this wants to be very clearly documented. Why?
The lock association has two purposes:

 1) Lockdep coverage, which unearthed bugs already

 2) PREEMPT_RT livelock prevention

Assume the following:

    spin_lock(wrlock);
    write_seqcount_begin(seq);

    ---> preemption by a high-priority reader

    seqcount_begin(seq); <-- live lock

The RT substitution does:

    seqcount_begin(seq)
      cnt = READ_ONCE(seq->sequence);

      if (cnt & 1) {
          lock(s->lock);
          unlock(s->lock);
      }

which prevents the deadlock because it makes the reader block on the
associated lock, which allows the preempted writer to make progress.

This applies to raw_seqcount_begin() as well.

I have no objections against the construct itself, but this has to be
properly documented vs. the restriction this imposes.

As you can see above, the writer side therefore has to ensure that it
cannot be preempted on PREEMPT_RT, which limits the possibilities of what
you can do inside a preemption (or interrupt) disabled section on RT
enabled kernels. See Documentation/locking/locktypes.rst for further
information.

Thanks,

        tglx
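A rough user-space rendering of that reader-side substitution, as a hedged
sketch: a pthread mutex stands in for the kernel's rtmutex-backed associated
lock, the `sketch_`/`rt_` names are invented, and the real RT implementation
differs in detail (priority inheritance, barriers, retry semantics):

```c
#include <pthread.h>
#include <stdatomic.h>

/* Sketch of the PREEMPT_RT reader-side behavior described above: if
 * the sequence is odd (writer active), the reader briefly acquires and
 * releases the lock associated with the seqcount, so that a preempted
 * writer gets to make progress instead of being spun on forever. */
struct sketch_seqcount_rt {
    atomic_uint sequence;
    pthread_mutex_t *lock;  /* the associated writer lock */
};

static unsigned int rt_sketch_begin(struct sketch_seqcount_rt *s)
{
    unsigned int cnt = atomic_load_explicit(&s->sequence,
                                            memory_order_acquire);
    if (cnt & 1) {
        /* Block on the writer's lock instead of spinning; on RT this
         * lets (and, via priority inheritance, helps) the preempted
         * writer finish its update. */
        pthread_mutex_lock(s->lock);
        pthread_mutex_unlock(s->lock);
        cnt = atomic_load_explicit(&s->sequence, memory_order_acquire);
    }
    return cnt;
}
```

This is also why the association matters: a bare raw_seqcount_t has no lock
to block on, so the writer must not be preemptible on RT kernels.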
On 17.12.21 22:28, Thomas Gleixner wrote:
> On Fri, Dec 17 2021 at 12:30, David Hildenbrand wrote:

[ snip ]

Hi Thomas, thanks for your feedback!

> Cannot tolerate? Could you please provide a reason and not just a
> statement?

Absolutely. "struct page" is supposed to have a minimal size with a
fixed layout. Embedding something inside such a structure can change the
fixed layout in a way that completely breaks any assumptions on the
location of values. Therefore, embedding a complex structure in it is
usually avoided -- and if we have to (spin_lock), we work around sudden
size increases.

There are ways around it: allocate the lock and only store the pointer
in the struct page. But that most certainly adds complexity, which is
why I want to avoid it for now.

I'll extend that answer and add it to the patch description.

> Even when the reader side uses raw_seqcount_begin/retry(), the writer
> side can still use the non-raw variant, which validates that the
> associated lock is held on write.

Yes, that's my understanding as well.

> Aside of that, your proposed extra raw sequence count needs extra care
> vs. PREEMPT_RT, and this wants to be very clearly documented. Why?
>
> The lock association has two purposes:
>
> 1) Lockdep coverage, which unearthed bugs already

Yes, that's a real shame to lose.

> 2) PREEMPT_RT livelock prevention

[ snip ]

> I have no objections against the construct itself, but this has to be
> properly documented vs. the restriction this imposes.

Absolutely, any input is highly appreciated. But to mention it again:
whatever you can do with raw_seqcount_t, you can do with seqcount_t, and
there are already users relying completely on the raw_ function variants
(see my other reply). So the documentation should most probably be
extended to cover the raw_ functions and seqcount_t in general.

> As you can see above, the writer side therefore has to ensure that it
> cannot be preempted on PREEMPT_RT, which limits the possibilities of what
> you can do inside a preemption (or interrupt) disabled section on RT
> enabled kernels. See Documentation/locking/locktypes.rst for further
> information.

It's going to be used for THP, which is currently incompatible with
PREEMPT_RT (disabled in the Kconfig). But preemption is also disabled
on the writer side because we're using bit_spin_lock(), which disables
preemption itself.

Certainly worth documenting!

Thanks for your input!
On Fri, Dec 17, 2021 at 12:30:41PM +0100, David Hildenbrand wrote:
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance
> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
>
> This is a preparation for further changes.
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

[ snip ]

> -	ret = atomic_read(&page->_mapcount) + 1;
> +	if (PageHuge(page))
> +		return compound_mapcount(page);

It would be nice to make PageHuge() inlinable first. It's a shame that we
need to do a function call for the PageHuge() check.

Otherwise, looks good:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>