diff mbox series

xfrm: policy: Restructure RCU-read locking in xfrm_sk_policy_lookup

Message ID 20210618141101.18168-1-varad.gautam@suse.com
State New
Headers show
Series xfrm: policy: Restructure RCU-read locking in xfrm_sk_policy_lookup | expand

Commit Message

Varad Gautam June 18, 2021, 2:11 p.m. UTC
Commit "xfrm: policy: Read seqcount outside of rcu-read side in
xfrm_policy_lookup_bytype" [Linked] resolved a locking bug in
xfrm_policy_lookup_bytype that causes an RCU reader-writer deadlock on
the mutex wrapped by xfrm_policy_hash_generation on PREEMPT_RT since
77cc278f7b20 ("xfrm: policy: Use sequence counters with associated
lock").

However, xfrm_sk_policy_lookup can still reach xfrm_policy_lookup_bytype
while holding rcu_read_lock(), as:
xfrm_sk_policy_lookup()
  rcu_read_lock()
  security_xfrm_policy_lookup()
    xfrm_policy_lookup()
      xfrm_policy_lookup_bytype()
        read_seqcount_begin()
          mutex_lock(&hash_resize_mutex)

This results in the same deadlock on hash_resize_mutex, where xfrm_hash_resize
is holding the mutex and sleeping inside synchronize_rcu, and
xfrm_policy_lookup_bytype blocks on the mutex inside the RCU read side
critical section.

To resolve this, shorten the RCU read side critical section within
xfrm_sk_policy_lookup by taking a reference on the associated xfrm_policy,
and avoid calling xfrm_policy_lookup_bytype with the rcu_read_lock() held.

Fixes: 77cc278f7b20 ("xfrm: policy: Use sequence counters with associated lock")
Link: https://lore.kernel.org/r/20210528160407.32127-1-varad.gautam@suse.com/
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Cc: stable@vger.kernel.org # v5.9
---
 net/xfrm/xfrm_policy.c | 65 +++++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 30 deletions(-)

Comments

Steffen Klassert June 21, 2021, 8:29 a.m. UTC | #1
On Fri, Jun 18, 2021 at 04:11:01PM +0200, Varad Gautam wrote:
> Commit "xfrm: policy: Read seqcount outside of rcu-read side in
> xfrm_policy_lookup_bytype" [Linked] resolved a locking bug in
> xfrm_policy_lookup_bytype that causes an RCU reader-writer deadlock on
> the mutex wrapped by xfrm_policy_hash_generation on PREEMPT_RT since
> 77cc278f7b20 ("xfrm: policy: Use sequence counters with associated
> lock").
> 
> However, xfrm_sk_policy_lookup can still reach xfrm_policy_lookup_bytype
> while holding rcu_read_lock(), as:
> xfrm_sk_policy_lookup()
>   rcu_read_lock()
>   security_xfrm_policy_lookup()
>     xfrm_policy_lookup()

Hm, I don't see that call chain. security_xfrm_policy_lookup() calls
a hook with the name xfrm_policy_lookup. The only LSM that has
registered a function to that hook is selinux. It registers
selinux_xfrm_policy_lookup() and I don't see how we can call
xfrm_policy_lookup() from there.

Did you actually trigger that bug?
Varad Gautam June 21, 2021, 9:11 a.m. UTC | #2
On 6/21/21 10:29 AM, Steffen Klassert wrote:
> On Fri, Jun 18, 2021 at 04:11:01PM +0200, Varad Gautam wrote:
>> Commit "xfrm: policy: Read seqcount outside of rcu-read side in
>> xfrm_policy_lookup_bytype" [Linked] resolved a locking bug in
>> xfrm_policy_lookup_bytype that causes an RCU reader-writer deadlock on
>> the mutex wrapped by xfrm_policy_hash_generation on PREEMPT_RT since
>> 77cc278f7b20 ("xfrm: policy: Use sequence counters with associated
>> lock").
>>
>> However, xfrm_sk_policy_lookup can still reach xfrm_policy_lookup_bytype
>> while holding rcu_read_lock(), as:
>> xfrm_sk_policy_lookup()
>>   rcu_read_lock()
>>   security_xfrm_policy_lookup()
>>     xfrm_policy_lookup()
> 
> Hm, I don't see that call chain. security_xfrm_policy_lookup() calls
> a hook with the name xfrm_policy_lookup. The only LSM that has
> registered a function to that hook is selinux. It registers
> selinux_xfrm_policy_lookup() and I don't see how we can call
> xfrm_policy_lookup() from there.
> 
> Did you actually trigger that bug?
> 

Right, I misread the call chain - security_xfrm_policy_lookup does not reach
xfrm_policy_lookup, making this patch unnecessary. The bug I have is:

T1, holding hash_resize_mutex and sleeping inside synchronize_rcu:

__schedule
schedule
schedule_timeout
wait_for_completion
__wait_rcu_gp
synchronize_rcu
xfrm_hash_resize

And T2 producing RCU-stalls since it blocked on the mutex:

__schedule
schedule
__rt_mutex_slowlock
rt_mutex_slowlock_locked
rt_mutex_slowlock
xfrm_policy_lookup_bytype.constprop.77
__xfrm_policy_check
udpv6_queue_rcv_one_skb
__udp6_lib_rcv
ip6_protocol_deliver_rcu
ip6_input_finish
ip6_input
ip6_mc_input
ipv6_rcv
__netif_receive_skb_one_core
process_backlog
net_rx_action
__softirqentry_text_start
__local_bh_enable_ip
ip6_finish_output2
ip6_output
ip6_send_skb
udp_v6_send_skb
udpv6_sendmsg
sock_sendmsg
____sys_sendmsg
___sys_sendmsg
__sys_sendmsg
do_syscall_64

So, despite the patch here [1], there is another way to reach
xfrm_policy_lookup_bytype within an RCU-read side - which on PREEMPT_RT will
deadlock with xfrm_hash_resize. Does softirq processing on RT happen within
rcu_read_lock/unlock - this would explain the stalls.

[1] https://lore.kernel.org/r/20210528160407.32127-1-varad.gautam@suse.com/

Regards,
Varad
Steffen Klassert June 21, 2021, 11:05 a.m. UTC | #3
On Mon, Jun 21, 2021 at 11:11:18AM +0200, Varad Gautam wrote:
> On 6/21/21 10:29 AM, Steffen Klassert wrote:
> > On Fri, Jun 18, 2021 at 04:11:01PM +0200, Varad Gautam wrote:
> >> Commit "xfrm: policy: Read seqcount outside of rcu-read side in
> >> xfrm_policy_lookup_bytype" [Linked] resolved a locking bug in
> >> xfrm_policy_lookup_bytype that causes an RCU reader-writer deadlock on
> >> the mutex wrapped by xfrm_policy_hash_generation on PREEMPT_RT since
> >> 77cc278f7b20 ("xfrm: policy: Use sequence counters with associated
> >> lock").
> >>
> >> However, xfrm_sk_policy_lookup can still reach xfrm_policy_lookup_bytype
> >> while holding rcu_read_lock(), as:
> >> xfrm_sk_policy_lookup()
> >>   rcu_read_lock()
> >>   security_xfrm_policy_lookup()
> >>     xfrm_policy_lookup()
> > 
> > Hm, I don't see that call chain. security_xfrm_policy_lookup() calls
> > a hook with the name xfrm_policy_lookup. The only LSM that has
> > registered a function to that hook is selinux. It registers
> > selinux_xfrm_policy_lookup() and I don't see how we can call
> > xfrm_policy_lookup() from there.
> > 
> > Did you actually trigger that bug?
> > 
> 
> Right, I misread the call chain - security_xfrm_policy_lookup does not reach
> xfrm_policy_lookup, making this patch unnecessary. The bug I have is:
> 
> T1, holding hash_resize_mutex and sleeping inside synchronize_rcu:
> 
> __schedule
> schedule
> schedule_timeout
> wait_for_completion
> __wait_rcu_gp
> synchronize_rcu
> xfrm_hash_resize
> 
> And T2 producing RCU-stalls since it blocked on the mutex:
> 
> __schedule
> schedule
> __rt_mutex_slowlock
> rt_mutex_slowlock_locked
> rt_mutex_slowlock
> xfrm_policy_lookup_bytype.constprop.77

Ugh, why does xfrm_policy_lookup_bytype use a mutex? This is called
in the receive path inside a sofirq.

The bug was introduced by: 

commit 77cc278f7b202e4f16f8596837219d02cb090b96
Author: Ahmed S. Darwish <a.darwish@linutronix.de>
Date:   Mon Jul 20 17:55:22 2020 +0200

    xfrm: policy: Use sequence counters with associated lock

    A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. If the serialization primitive is
    not disabling preemption implicitly, preemption has to be explicitly
    disabled before entering the sequence counter write side critical
    section.

    A plain seqcount_t does not contain the information of which lock must
    be held when entering a write side critical section.

    Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead,
    which allow to associate a lock with the sequence counter. This enables
    lockdep to verify that the lock used for writer serialization is held
    when the write side critical section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.

    Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20200720155530.1173732-17-a.darwish@linutronix.de

This uses a seqcount_mutex_t for xfrm_policy_hash_generation, that's
wrong.
Steffen Klassert June 22, 2021, 11:21 a.m. UTC | #4
On Mon, Jun 21, 2021 at 01:05:28PM +0200, Steffen Klassert wrote:
> On Mon, Jun 21, 2021 at 11:11:18AM +0200, Varad Gautam wrote:
> > 
> > Right, I misread the call chain - security_xfrm_policy_lookup does not reach
> > xfrm_policy_lookup, making this patch unnecessary. The bug I have is:
> > 
> > T1, holding hash_resize_mutex and sleeping inside synchronize_rcu:
> > 
> > __schedule
> > schedule
> > schedule_timeout
> > wait_for_completion
> > __wait_rcu_gp
> > synchronize_rcu
> > xfrm_hash_resize
> > 
> > And T2 producing RCU-stalls since it blocked on the mutex:
> > 
> > __schedule
> > schedule
> > __rt_mutex_slowlock
> > rt_mutex_slowlock_locked
> > rt_mutex_slowlock
> > xfrm_policy_lookup_bytype.constprop.77
> 
> Ugh, why does xfrm_policy_lookup_bytype use a mutex? This is called
> in the receive path inside a sofirq.
> 
> The bug was introduced by: 
> 
> commit 77cc278f7b202e4f16f8596837219d02cb090b96
> Author: Ahmed S. Darwish <a.darwish@linutronix.de>
> Date:   Mon Jul 20 17:55:22 2020 +0200
> 
>     xfrm: policy: Use sequence counters with associated lock
> 
>     A sequence counter write side critical section must be protected by some
>     form of locking to serialize writers. If the serialization primitive is
>     not disabling preemption implicitly, preemption has to be explicitly
>     disabled before entering the sequence counter write side critical
>     section.
> 
>     A plain seqcount_t does not contain the information of which lock must
>     be held when entering a write side critical section.
> 
>     Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead,
>     which allow to associate a lock with the sequence counter. This enables
>     lockdep to verify that the lock used for writer serialization is held
>     when the write side critical section is entered.
> 
>     If lockdep is disabled this lock association is compiled out and has
>     neither storage size nor runtime overhead.
> 
>     Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
>     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>     Link: https://lkml.kernel.org/r/20200720155530.1173732-17-a.darwish@linutronix.de
> 
> This uses a seqcount_mutex_t for xfrm_policy_hash_generation, that's
> wrong.

Varad, can you try to replace the seqcount_mutex_t for xfrm_policy_hash_generation
by a seqcount_spinlock_t? I'm not familiar with that seqcount changes,
but we should not end up with using a mutex in this codepath.
Frederic Weisbecker June 22, 2021, 11:51 a.m. UTC | #5
On Tue, Jun 22, 2021 at 01:21:59PM +0200, Steffen Klassert wrote:
> On Mon, Jun 21, 2021 at 01:05:28PM +0200, Steffen Klassert wrote:
> > On Mon, Jun 21, 2021 at 11:11:18AM +0200, Varad Gautam wrote:
> > > 
> > > Right, I misread the call chain - security_xfrm_policy_lookup does not reach
> > > xfrm_policy_lookup, making this patch unnecessary. The bug I have is:
> > > 
> > > T1, holding hash_resize_mutex and sleeping inside synchronize_rcu:
> > > 
> > > __schedule
> > > schedule
> > > schedule_timeout
> > > wait_for_completion
> > > __wait_rcu_gp
> > > synchronize_rcu
> > > xfrm_hash_resize
> > > 
> > > And T2 producing RCU-stalls since it blocked on the mutex:
> > > 
> > > __schedule
> > > schedule
> > > __rt_mutex_slowlock
> > > rt_mutex_slowlock_locked
> > > rt_mutex_slowlock
> > > xfrm_policy_lookup_bytype.constprop.77
> > 
> > Ugh, why does xfrm_policy_lookup_bytype use a mutex? This is called
> > in the receive path inside a sofirq.
> > 
> > The bug was introduced by: 
> > 
> > commit 77cc278f7b202e4f16f8596837219d02cb090b96
> > Author: Ahmed S. Darwish <a.darwish@linutronix.de>
> > Date:   Mon Jul 20 17:55:22 2020 +0200
> > 
> >     xfrm: policy: Use sequence counters with associated lock
> > 
> >     A sequence counter write side critical section must be protected by some
> >     form of locking to serialize writers. If the serialization primitive is
> >     not disabling preemption implicitly, preemption has to be explicitly
> >     disabled before entering the sequence counter write side critical
> >     section.
> > 
> >     A plain seqcount_t does not contain the information of which lock must
> >     be held when entering a write side critical section.
> > 
> >     Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead,
> >     which allow to associate a lock with the sequence counter. This enables
> >     lockdep to verify that the lock used for writer serialization is held
> >     when the write side critical section is entered.
> > 
> >     If lockdep is disabled this lock association is compiled out and has
> >     neither storage size nor runtime overhead.
> > 
> >     Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> >     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >     Link: https://lkml.kernel.org/r/20200720155530.1173732-17-a.darwish@linutronix.de
> > 
> > This uses a seqcount_mutex_t for xfrm_policy_hash_generation, that's
> > wrong.
> 
> Varad, can you try to replace the seqcount_mutex_t for xfrm_policy_hash_generation
> by a seqcount_spinlock_t? I'm not familiar with that seqcount changes,
> but we should not end up with using a mutex in this codepath.

Something like this? (beware, untested, also I don't know if the read side
should then disable bh, doesn't look necessary for PREEMPT_RT, but I may be
missing something...)

diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h
index e816b6a3ef2b..9b376b87bd54 100644
--- a/include/net/netns/xfrm.h
+++ b/include/net/netns/xfrm.h
@@ -74,6 +74,7 @@ struct netns_xfrm {
 #endif
 	spinlock_t		xfrm_state_lock;
 	seqcount_spinlock_t	xfrm_state_hash_generation;
+	seqcount_spinlock_t	xfrm_policy_hash_generation;
 
 	spinlock_t xfrm_policy_lock;
 	struct mutex xfrm_cfg_mutex;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index ce500f847b99..46a6d15b66d6 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -155,7 +155,6 @@ static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1]
 						__read_mostly;
 
 static struct kmem_cache *xfrm_dst_cache __ro_after_init;
-static __read_mostly seqcount_mutex_t xfrm_policy_hash_generation;
 
 static struct rhashtable xfrm_policy_inexact_table;
 static const struct rhashtable_params xfrm_pol_inexact_params;
@@ -585,7 +584,7 @@ static void xfrm_bydst_resize(struct net *net, int dir)
 		return;
 
 	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
-	write_seqcount_begin(&xfrm_policy_hash_generation);
+	write_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation);
 
 	odst = rcu_dereference_protected(net->xfrm.policy_bydst[dir].table,
 				lockdep_is_held(&net->xfrm.xfrm_policy_lock));
@@ -596,7 +595,7 @@ static void xfrm_bydst_resize(struct net *net, int dir)
 	rcu_assign_pointer(net->xfrm.policy_bydst[dir].table, ndst);
 	net->xfrm.policy_bydst[dir].hmask = nhashmask;
 
-	write_seqcount_end(&xfrm_policy_hash_generation);
+	write_seqcount_end(&net->xfrm.xfrm_policy_hash_generation);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 
 	synchronize_rcu();
@@ -1245,7 +1244,7 @@ static void xfrm_hash_rebuild(struct work_struct *work)
 	} while (read_seqretry(&net->xfrm.policy_hthresh.lock, seq));
 
 	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
-	write_seqcount_begin(&xfrm_policy_hash_generation);
+	write_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation);
 
 	/* make sure that we can insert the indirect policies again before
 	 * we start with destructive action.
@@ -1354,7 +1353,7 @@ static void xfrm_hash_rebuild(struct work_struct *work)
 
 out_unlock:
 	__xfrm_policy_inexact_flush(net);
-	write_seqcount_end(&xfrm_policy_hash_generation);
+	write_seqcount_end(&net->xfrm.xfrm_policy_hash_generation);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 
 	mutex_unlock(&hash_resize_mutex);
@@ -2095,9 +2094,9 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type,
 	rcu_read_lock();
  retry:
 	do {
-		sequence = read_seqcount_begin(&xfrm_policy_hash_generation);
+		sequence = read_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation);
 		chain = policy_hash_direct(net, daddr, saddr, family, dir);
-	} while (read_seqcount_retry(&xfrm_policy_hash_generation, sequence));
+	} while (read_seqcount_retry(&net->xfrm.xfrm_policy_hash_generation, sequence));
 
 	ret = NULL;
 	hlist_for_each_entry_rcu(pol, chain, bydst) {
@@ -2128,7 +2127,7 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type,
 	}
 
 skip_inexact:
-	if (read_seqcount_retry(&xfrm_policy_hash_generation, sequence))
+	if (read_seqcount_retry(&net->xfrm.xfrm_policy_hash_generation, sequence))
 		goto retry;
 
 	if (ret && !xfrm_pol_hold_rcu(ret))
@@ -4084,6 +4083,7 @@ static int __net_init xfrm_net_init(struct net *net)
 	/* Initialize the per-net locks here */
 	spin_lock_init(&net->xfrm.xfrm_state_lock);
 	spin_lock_init(&net->xfrm.xfrm_policy_lock);
+	seqcount_spinlock_init(&net->xfrm.xfrm_policy_hash_generation, &net->xfrm.xfrm_policy_lock);
 	mutex_init(&net->xfrm.xfrm_cfg_mutex);
 
 	rv = xfrm_statistics_init(net);
@@ -4128,7 +4128,6 @@ void __init xfrm_init(void)
 {
 	register_pernet_subsys(&xfrm_net_ops);
 	xfrm_dev_init();
-	seqcount_mutex_init(&xfrm_policy_hash_generation, &hash_resize_mutex);
 	xfrm_input_init();
 
 #ifdef CONFIG_XFRM_ESPINTCP
Steffen Klassert June 22, 2021, 12:33 p.m. UTC | #6
On Tue, Jun 22, 2021 at 01:51:24PM +0200, Frederic Weisbecker wrote:
> On Tue, Jun 22, 2021 at 01:21:59PM +0200, Steffen Klassert wrote:
> > On Mon, Jun 21, 2021 at 01:05:28PM +0200, Steffen Klassert wrote:
> > > On Mon, Jun 21, 2021 at 11:11:18AM +0200, Varad Gautam wrote:
> > > > 
> > > > Right, I misread the call chain - security_xfrm_policy_lookup does not reach
> > > > xfrm_policy_lookup, making this patch unnecessary. The bug I have is:
> > > > 
> > > > T1, holding hash_resize_mutex and sleeping inside synchronize_rcu:
> > > > 
> > > > __schedule
> > > > schedule
> > > > schedule_timeout
> > > > wait_for_completion
> > > > __wait_rcu_gp
> > > > synchronize_rcu
> > > > xfrm_hash_resize
> > > > 
> > > > And T2 producing RCU-stalls since it blocked on the mutex:
> > > > 
> > > > __schedule
> > > > schedule
> > > > __rt_mutex_slowlock
> > > > rt_mutex_slowlock_locked
> > > > rt_mutex_slowlock
> > > > xfrm_policy_lookup_bytype.constprop.77
> > > 
> > > Ugh, why does xfrm_policy_lookup_bytype use a mutex? This is called
> > > in the receive path inside a sofirq.
> > > 
> > > The bug was introduced by: 
> > > 
> > > commit 77cc278f7b202e4f16f8596837219d02cb090b96
> > > Author: Ahmed S. Darwish <a.darwish@linutronix.de>
> > > Date:   Mon Jul 20 17:55:22 2020 +0200
> > > 
> > >     xfrm: policy: Use sequence counters with associated lock
> > > 
> > >     A sequence counter write side critical section must be protected by some
> > >     form of locking to serialize writers. If the serialization primitive is
> > >     not disabling preemption implicitly, preemption has to be explicitly
> > >     disabled before entering the sequence counter write side critical
> > >     section.
> > > 
> > >     A plain seqcount_t does not contain the information of which lock must
> > >     be held when entering a write side critical section.
> > > 
> > >     Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead,
> > >     which allow to associate a lock with the sequence counter. This enables
> > >     lockdep to verify that the lock used for writer serialization is held
> > >     when the write side critical section is entered.
> > > 
> > >     If lockdep is disabled this lock association is compiled out and has
> > >     neither storage size nor runtime overhead.
> > > 
> > >     Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> > >     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > >     Link: https://lkml.kernel.org/r/20200720155530.1173732-17-a.darwish@linutronix.de
> > > 
> > > This uses a seqcount_mutex_t for xfrm_policy_hash_generation, that's
> > > wrong.
> > 
> > Varad, can you try to replace the seqcount_mutex_t for xfrm_policy_hash_generation
> > by a seqcount_spinlock_t? I'm not familiar with that seqcount changes,
> > but we should not end up with using a mutex in this codepath.
> 
> Something like this? (beware, untested, also I don't know if the read side
> should then disable bh, doesn't look necessary for PREEMPT_RT, but I may be
> missing something...)

Looking a bit deeper into this it seems that the problem is that
xfrm_policy_hash_generation and hash_resize_mutex do not protect
the same thing.

hash_resize_mutex protects user configuration against a worker thread
that rebalances the hash buckets. xfrm_policy_hash_generation protects
user configuration against the data path that runs in softirq.

Finally the following line from xfrm_init() relates these two:

seqcount_mutex_init(&xfrm_policy_hash_generation, &hash_resize_mutex);

That looks a bit odd. This line was also introduced with the above
mentioned patch.
Steffen Klassert June 23, 2021, 5:27 a.m. UTC | #7
On Tue, Jun 22, 2021 at 01:51:24PM +0200, Frederic Weisbecker wrote:
> On Tue, Jun 22, 2021 at 01:21:59PM +0200, Steffen Klassert wrote:
> > On Mon, Jun 21, 2021 at 01:05:28PM +0200, Steffen Klassert wrote:
> > > On Mon, Jun 21, 2021 at 11:11:18AM +0200, Varad Gautam wrote:
> > 
> > Varad, can you try to replace the seqcount_mutex_t for xfrm_policy_hash_generation
> > by a seqcount_spinlock_t? I'm not familiar with that seqcount changes,
> > but we should not end up with using a mutex in this codepath.
> 
> Something like this? (beware, untested, also I don't know if the read side
> should then disable bh, doesn't look necessary for PREEMPT_RT, but I may be
> missing something...)
> 
> diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h
> index e816b6a3ef2b..9b376b87bd54 100644
> --- a/include/net/netns/xfrm.h
> +++ b/include/net/netns/xfrm.h
> @@ -74,6 +74,7 @@ struct netns_xfrm {
>  #endif
>  	spinlock_t		xfrm_state_lock;
>  	seqcount_spinlock_t	xfrm_state_hash_generation;
> +	seqcount_spinlock_t	xfrm_policy_hash_generation;
>  
>  	spinlock_t xfrm_policy_lock;
>  	struct mutex xfrm_cfg_mutex;
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index ce500f847b99..46a6d15b66d6 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -155,7 +155,6 @@ static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1]
>  						__read_mostly;
>  
>  static struct kmem_cache *xfrm_dst_cache __ro_after_init;
> -static __read_mostly seqcount_mutex_t xfrm_policy_hash_generation;
>  
>  static struct rhashtable xfrm_policy_inexact_table;
>  static const struct rhashtable_params xfrm_pol_inexact_params;
> @@ -585,7 +584,7 @@ static void xfrm_bydst_resize(struct net *net, int dir)
>  		return;
>  
>  	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
> -	write_seqcount_begin(&xfrm_policy_hash_generation);
> +	write_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation);
>  
>  	odst = rcu_dereference_protected(net->xfrm.policy_bydst[dir].table,
>  				lockdep_is_held(&net->xfrm.xfrm_policy_lock));
> @@ -596,7 +595,7 @@ static void xfrm_bydst_resize(struct net *net, int dir)
>  	rcu_assign_pointer(net->xfrm.policy_bydst[dir].table, ndst);
>  	net->xfrm.policy_bydst[dir].hmask = nhashmask;
>  
> -	write_seqcount_end(&xfrm_policy_hash_generation);
> +	write_seqcount_end(&net->xfrm.xfrm_policy_hash_generation);
>  	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
>  
>  	synchronize_rcu();
> @@ -1245,7 +1244,7 @@ static void xfrm_hash_rebuild(struct work_struct *work)
>  	} while (read_seqretry(&net->xfrm.policy_hthresh.lock, seq));
>  
>  	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
> -	write_seqcount_begin(&xfrm_policy_hash_generation);
> +	write_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation);
>  
>  	/* make sure that we can insert the indirect policies again before
>  	 * we start with destructive action.
> @@ -1354,7 +1353,7 @@ static void xfrm_hash_rebuild(struct work_struct *work)
>  
>  out_unlock:
>  	__xfrm_policy_inexact_flush(net);
> -	write_seqcount_end(&xfrm_policy_hash_generation);
> +	write_seqcount_end(&net->xfrm.xfrm_policy_hash_generation);
>  	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
>  
>  	mutex_unlock(&hash_resize_mutex);
> @@ -2095,9 +2094,9 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type,
>  	rcu_read_lock();
>   retry:
>  	do {
> -		sequence = read_seqcount_begin(&xfrm_policy_hash_generation);
> +		sequence = read_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation);
>  		chain = policy_hash_direct(net, daddr, saddr, family, dir);
> -	} while (read_seqcount_retry(&xfrm_policy_hash_generation, sequence));
> +	} while (read_seqcount_retry(&net->xfrm.xfrm_policy_hash_generation, sequence));
>  
>  	ret = NULL;
>  	hlist_for_each_entry_rcu(pol, chain, bydst) {
> @@ -2128,7 +2127,7 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type,
>  	}
>  
>  skip_inexact:
> -	if (read_seqcount_retry(&xfrm_policy_hash_generation, sequence))
> +	if (read_seqcount_retry(&net->xfrm.xfrm_policy_hash_generation, sequence))
>  		goto retry;
>  
>  	if (ret && !xfrm_pol_hold_rcu(ret))
> @@ -4084,6 +4083,7 @@ static int __net_init xfrm_net_init(struct net *net)
>  	/* Initialize the per-net locks here */
>  	spin_lock_init(&net->xfrm.xfrm_state_lock);
>  	spin_lock_init(&net->xfrm.xfrm_policy_lock);
> +	seqcount_spinlock_init(&net->xfrm.xfrm_policy_hash_generation, &net->xfrm.xfrm_policy_lock);
>  	mutex_init(&net->xfrm.xfrm_cfg_mutex);
>  
>  	rv = xfrm_statistics_init(net);
> @@ -4128,7 +4128,6 @@ void __init xfrm_init(void)
>  {
>  	register_pernet_subsys(&xfrm_net_ops);
>  	xfrm_dev_init();
> -	seqcount_mutex_init(&xfrm_policy_hash_generation, &hash_resize_mutex);
>  	xfrm_input_init();
>  
>  #ifdef CONFIG_XFRM_ESPINTCP

Yes, looks like your patch should do it. The xfrm_policy_lock is the
write side protection for the seqcount here.

Thanks!
diff mbox series

Patch

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 0b7409f19ecb1..28e072007c72d 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2152,42 +2152,47 @@  static struct xfrm_policy *xfrm_sk_policy_lookup(const struct sock *sk, int dir,
 						 u16 family, u32 if_id)
 {
 	struct xfrm_policy *pol;
+	bool match;
+	int err = 0;
 
-	rcu_read_lock();
  again:
+	rcu_read_lock();
 	pol = rcu_dereference(sk->sk_policy[dir]);
-	if (pol != NULL) {
-		bool match;
-		int err = 0;
-
-		if (pol->family != family) {
-			pol = NULL;
-			goto out;
-		}
+	if (pol == NULL) {
+		rcu_read_unlock();
+		goto out;
+	}
 
-		match = xfrm_selector_match(&pol->selector, fl, family);
-		if (match) {
-			if ((sk->sk_mark & pol->mark.m) != pol->mark.v ||
-			    pol->if_id != if_id) {
-				pol = NULL;
-				goto out;
-			}
-			err = security_xfrm_policy_lookup(pol->security,
-						      fl->flowi_secid,
-						      dir);
-			if (!err) {
-				if (!xfrm_pol_hold_rcu(pol))
-					goto again;
-			} else if (err == -ESRCH) {
-				pol = NULL;
-			} else {
-				pol = ERR_PTR(err);
-			}
-		} else
-			pol = NULL;
+	/* Take a reference on this pol and call rcu_read_unlock(). */
+	if (!xfrm_pol_hold_rcu(pol)) {
+		rcu_read_unlock();
+		goto again;
 	}
-out:
 	rcu_read_unlock();
+
+	if (pol->family != family)
+		goto out_put;
+
+	match = xfrm_selector_match(&pol->selector, fl, family);
+	if (!match)
+		goto out_put;
+
+	if ((sk->sk_mark & pol->mark.m) != pol->mark.v ||
+	    pol->if_id != if_id)
+		goto out_put;
+
+	err = security_xfrm_policy_lookup(pol->security,
+					fl->flowi_secid,
+					dir);
+	if (!err) {
+		/* Safe to return, we have a ref on pol. */
+		goto out;
+	}
+
+ out_put:
+	xfrm_pol_put(pol);
+	pol = (err && err != -ESRCH) ? ERR_PTR(err) : NULL;
+ out:
 	return pol;
 }