Message ID | 20210618141101.18168-1-varad.gautam@suse.com |
---|---|
State | New |
Headers | show |
Series | xfrm: policy: Restructure RCU-read locking in xfrm_sk_policy_lookup | expand |
On Fri, Jun 18, 2021 at 04:11:01PM +0200, Varad Gautam wrote: > Commit "xfrm: policy: Read seqcount outside of rcu-read side in > xfrm_policy_lookup_bytype" [Linked] resolved a locking bug in > xfrm_policy_lookup_bytype that causes an RCU reader-writer deadlock on > the mutex wrapped by xfrm_policy_hash_generation on PREEMPT_RT since > 77cc278f7b20 ("xfrm: policy: Use sequence counters with associated > lock"). > > However, xfrm_sk_policy_lookup can still reach xfrm_policy_lookup_bytype > while holding rcu_read_lock(), as: > xfrm_sk_policy_lookup() > rcu_read_lock() > security_xfrm_policy_lookup() > xfrm_policy_lookup() Hm, I don't see that call chain. security_xfrm_policy_lookup() calls a hook with the name xfrm_policy_lookup. The only LSM that has registered a function to that hook is selinux. It registers selinux_xfrm_policy_lookup() and I don't see how we can call xfrm_policy_lookup() from there. Did you actually trigger that bug?
On 6/21/21 10:29 AM, Steffen Klassert wrote: > On Fri, Jun 18, 2021 at 04:11:01PM +0200, Varad Gautam wrote: >> Commit "xfrm: policy: Read seqcount outside of rcu-read side in >> xfrm_policy_lookup_bytype" [Linked] resolved a locking bug in >> xfrm_policy_lookup_bytype that causes an RCU reader-writer deadlock on >> the mutex wrapped by xfrm_policy_hash_generation on PREEMPT_RT since >> 77cc278f7b20 ("xfrm: policy: Use sequence counters with associated >> lock"). >> >> However, xfrm_sk_policy_lookup can still reach xfrm_policy_lookup_bytype >> while holding rcu_read_lock(), as: >> xfrm_sk_policy_lookup() >> rcu_read_lock() >> security_xfrm_policy_lookup() >> xfrm_policy_lookup() > > Hm, I don't see that call chain. security_xfrm_policy_lookup() calls > a hook with the name xfrm_policy_lookup. The only LSM that has > registered a function to that hook is selinux. It registers > selinux_xfrm_policy_lookup() and I don't see how we can call > xfrm_policy_lookup() from there. > > Did you actually trigger that bug? > Right, I misread the call chain - security_xfrm_policy_lookup does not reach xfrm_policy_lookup, making this patch unnecessary. The bug I have is: T1, holding hash_resize_mutex and sleeping inside synchronize_rcu: __schedule schedule schedule_timeout wait_for_completion __wait_rcu_gp synchronize_rcu xfrm_hash_resize And T2 producing RCU-stalls since it blocked on the mutex: __schedule schedule __rt_mutex_slowlock rt_mutex_slowlock_locked rt_mutex_slowlock xfrm_policy_lookup_bytype.constprop.77 __xfrm_policy_check udpv6_queue_rcv_one_skb __udp6_lib_rcv ip6_protocol_deliver_rcu ip6_input_finish ip6_input ip6_mc_input ipv6_rcv __netif_receive_skb_one_core process_backlog net_rx_action __softirqentry_text_start __local_bh_enable_ip ip6_finish_output2 ip6_output ip6_send_skb udp_v6_send_skb udpv6_sendmsg sock_sendmsg ____sys_sendmsg ___sys_sendmsg __sys_sendmsg do_syscall_64 So, despite the patch here [1], there is another way to reach xfrm_policy_lookup_bytype within an RCU-read side - which on PREEMPT_RT will deadlock with xfrm_hash_resize. Does softirq processing on RT happen within rcu_read_lock/unlock - this would explain the stalls. [1] https://lore.kernel.org/r/20210528160407.32127-1-varad.gautam@suse.com/ Regards, Varad
On Mon, Jun 21, 2021 at 11:11:18AM +0200, Varad Gautam wrote: > On 6/21/21 10:29 AM, Steffen Klassert wrote: > > On Fri, Jun 18, 2021 at 04:11:01PM +0200, Varad Gautam wrote: > >> Commit "xfrm: policy: Read seqcount outside of rcu-read side in > >> xfrm_policy_lookup_bytype" [Linked] resolved a locking bug in > >> xfrm_policy_lookup_bytype that causes an RCU reader-writer deadlock on > >> the mutex wrapped by xfrm_policy_hash_generation on PREEMPT_RT since > >> 77cc278f7b20 ("xfrm: policy: Use sequence counters with associated > >> lock"). > >> > >> However, xfrm_sk_policy_lookup can still reach xfrm_policy_lookup_bytype > >> while holding rcu_read_lock(), as: > >> xfrm_sk_policy_lookup() > >> rcu_read_lock() > >> security_xfrm_policy_lookup() > >> xfrm_policy_lookup() > > > > Hm, I don't see that call chain. security_xfrm_policy_lookup() calls > > a hook with the name xfrm_policy_lookup. The only LSM that has > > registered a function to that hook is selinux. It registers > > selinux_xfrm_policy_lookup() and I don't see how we can call > > xfrm_policy_lookup() from there. > > > > Did you actually trigger that bug? > > > > Right, I misread the call chain - security_xfrm_policy_lookup does not reach > xfrm_policy_lookup, making this patch unnecessary. The bug I have is: > > T1, holding hash_resize_mutex and sleeping inside synchronize_rcu: > > __schedule > schedule > schedule_timeout > wait_for_completion > __wait_rcu_gp > synchronize_rcu > xfrm_hash_resize > > And T2 producing RCU-stalls since it blocked on the mutex: > > __schedule > schedule > __rt_mutex_slowlock > rt_mutex_slowlock_locked > rt_mutex_slowlock > xfrm_policy_lookup_bytype.constprop.77 Ugh, why does xfrm_policy_lookup_bytype use a mutex? This is called in the receive path inside a sofirq. The bug was introduced by: commit 77cc278f7b202e4f16f8596837219d02cb090b96 Author: Ahmed S. Darwish <a.darwish@linutronix.de> Date: Mon Jul 20 17:55:22 2020 +0200 xfrm: policy: Use sequence counters with associated lock A sequence counter write side critical section must be protected by some form of locking to serialize writers. If the serialization primitive is not disabling preemption implicitly, preemption has to be explicitly disabled before entering the sequence counter write side critical section. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead, which allow to associate a lock with the sequence counter. This enables lockdep to verify that the lock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200720155530.1173732-17-a.darwish@linutronix.de This uses a seqcount_mutex_t for xfrm_policy_hash_generation, that's wrong.
On Mon, Jun 21, 2021 at 01:05:28PM +0200, Steffen Klassert wrote: > On Mon, Jun 21, 2021 at 11:11:18AM +0200, Varad Gautam wrote: > > > > Right, I misread the call chain - security_xfrm_policy_lookup does not reach > > xfrm_policy_lookup, making this patch unnecessary. The bug I have is: > > > > T1, holding hash_resize_mutex and sleeping inside synchronize_rcu: > > > > __schedule > > schedule > > schedule_timeout > > wait_for_completion > > __wait_rcu_gp > > synchronize_rcu > > xfrm_hash_resize > > > > And T2 producing RCU-stalls since it blocked on the mutex: > > > > __schedule > > schedule > > __rt_mutex_slowlock > > rt_mutex_slowlock_locked > > rt_mutex_slowlock > > xfrm_policy_lookup_bytype.constprop.77 > > Ugh, why does xfrm_policy_lookup_bytype use a mutex? This is called > in the receive path inside a sofirq. > > The bug was introduced by: > > commit 77cc278f7b202e4f16f8596837219d02cb090b96 > Author: Ahmed S. Darwish <a.darwish@linutronix.de> > Date: Mon Jul 20 17:55:22 2020 +0200 > > xfrm: policy: Use sequence counters with associated lock > > A sequence counter write side critical section must be protected by some > form of locking to serialize writers. If the serialization primitive is > not disabling preemption implicitly, preemption has to be explicitly > disabled before entering the sequence counter write side critical > section. > > A plain seqcount_t does not contain the information of which lock must > be held when entering a write side critical section. > > Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead, > which allow to associate a lock with the sequence counter. This enables > lockdep to verify that the lock used for writer serialization is held > when the write side critical section is entered. > > If lockdep is disabled this lock association is compiled out and has > neither storage size nor runtime overhead. > > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > Link: https://lkml.kernel.org/r/20200720155530.1173732-17-a.darwish@linutronix.de > > This uses a seqcount_mutex_t for xfrm_policy_hash_generation, that's > wrong. Varad, can you try to replace the seqcount_mutex_t for xfrm_policy_hash_generation by a seqcount_spinlock_t? I'm not familiar with that seqcount changes, but we should not end up with using a mutex in this codepath.
On Tue, Jun 22, 2021 at 01:21:59PM +0200, Steffen Klassert wrote: > On Mon, Jun 21, 2021 at 01:05:28PM +0200, Steffen Klassert wrote: > > On Mon, Jun 21, 2021 at 11:11:18AM +0200, Varad Gautam wrote: > > > > > > Right, I misread the call chain - security_xfrm_policy_lookup does not reach > > > xfrm_policy_lookup, making this patch unnecessary. The bug I have is: > > > > > > T1, holding hash_resize_mutex and sleeping inside synchronize_rcu: > > > > > > __schedule > > > schedule > > > schedule_timeout > > > wait_for_completion > > > __wait_rcu_gp > > > synchronize_rcu > > > xfrm_hash_resize > > > > > > And T2 producing RCU-stalls since it blocked on the mutex: > > > > > > __schedule > > > schedule > > > __rt_mutex_slowlock > > > rt_mutex_slowlock_locked > > > rt_mutex_slowlock > > > xfrm_policy_lookup_bytype.constprop.77 > > > > Ugh, why does xfrm_policy_lookup_bytype use a mutex? This is called > > in the receive path inside a sofirq. > > > > The bug was introduced by: > > > > commit 77cc278f7b202e4f16f8596837219d02cb090b96 > > Author: Ahmed S. Darwish <a.darwish@linutronix.de> > > Date: Mon Jul 20 17:55:22 2020 +0200 > > > > xfrm: policy: Use sequence counters with associated lock > > > > A sequence counter write side critical section must be protected by some > > form of locking to serialize writers. If the serialization primitive is > > not disabling preemption implicitly, preemption has to be explicitly > > disabled before entering the sequence counter write side critical > > section. > > > > A plain seqcount_t does not contain the information of which lock must > > be held when entering a write side critical section. > > > > Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead, > > which allow to associate a lock with the sequence counter. This enables > > lockdep to verify that the lock used for writer serialization is held > > when the write side critical section is entered. > > > > If lockdep is disabled this lock association is compiled out and has > > neither storage size nor runtime overhead. > > > > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > > Link: https://lkml.kernel.org/r/20200720155530.1173732-17-a.darwish@linutronix.de > > > > This uses a seqcount_mutex_t for xfrm_policy_hash_generation, that's > > wrong. > > Varad, can you try to replace the seqcount_mutex_t for xfrm_policy_hash_generation > by a seqcount_spinlock_t? I'm not familiar with that seqcount changes, > but we should not end up with using a mutex in this codepath. Something like this? (beware, untested, also I don't know if the read side should then disable bh, doesn't look necessary for PREEMPT_RT, but I may be missing something...) diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h index e816b6a3ef2b..9b376b87bd54 100644 --- a/include/net/netns/xfrm.h +++ b/include/net/netns/xfrm.h @@ -74,6 +74,7 @@ struct netns_xfrm { #endif spinlock_t xfrm_state_lock; seqcount_spinlock_t xfrm_state_hash_generation; + seqcount_spinlock_t xfrm_policy_hash_generation; spinlock_t xfrm_policy_lock; struct mutex xfrm_cfg_mutex; diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index ce500f847b99..46a6d15b66d6 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -155,7 +155,6 @@ static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1] __read_mostly; static struct kmem_cache *xfrm_dst_cache __ro_after_init; -static __read_mostly seqcount_mutex_t xfrm_policy_hash_generation; static struct rhashtable xfrm_policy_inexact_table; static const struct rhashtable_params xfrm_pol_inexact_params; @@ -585,7 +584,7 @@ static void xfrm_bydst_resize(struct net *net, int dir) return; spin_lock_bh(&net->xfrm.xfrm_policy_lock); - write_seqcount_begin(&xfrm_policy_hash_generation); + write_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation); odst = rcu_dereference_protected(net->xfrm.policy_bydst[dir].table, lockdep_is_held(&net->xfrm.xfrm_policy_lock)); @@ -596,7 +595,7 @@ static void xfrm_bydst_resize(struct net *net, int dir) rcu_assign_pointer(net->xfrm.policy_bydst[dir].table, ndst); net->xfrm.policy_bydst[dir].hmask = nhashmask; - write_seqcount_end(&xfrm_policy_hash_generation); + write_seqcount_end(&net->xfrm.xfrm_policy_hash_generation); spin_unlock_bh(&net->xfrm.xfrm_policy_lock); synchronize_rcu(); @@ -1245,7 +1244,7 @@ static void xfrm_hash_rebuild(struct work_struct *work) } while (read_seqretry(&net->xfrm.policy_hthresh.lock, seq)); spin_lock_bh(&net->xfrm.xfrm_policy_lock); - write_seqcount_begin(&xfrm_policy_hash_generation); + write_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation); /* make sure that we can insert the indirect policies again before * we start with destructive action. @@ -1354,7 +1353,7 @@ static void xfrm_hash_rebuild(struct work_struct *work) out_unlock: __xfrm_policy_inexact_flush(net); - write_seqcount_end(&xfrm_policy_hash_generation); + write_seqcount_end(&net->xfrm.xfrm_policy_hash_generation); spin_unlock_bh(&net->xfrm.xfrm_policy_lock); mutex_unlock(&hash_resize_mutex); @@ -2095,9 +2094,9 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type, rcu_read_lock(); retry: do { - sequence = read_seqcount_begin(&xfrm_policy_hash_generation); + sequence = read_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation); chain = policy_hash_direct(net, daddr, saddr, family, dir); - } while (read_seqcount_retry(&xfrm_policy_hash_generation, sequence)); + } while (read_seqcount_retry(&net->xfrm.xfrm_policy_hash_generation, sequence)); ret = NULL; hlist_for_each_entry_rcu(pol, chain, bydst) { @@ -2128,7 +2127,7 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type, } skip_inexact: - if (read_seqcount_retry(&xfrm_policy_hash_generation, sequence)) + if (read_seqcount_retry(&net->xfrm.xfrm_policy_hash_generation, sequence)) goto retry; if (ret && !xfrm_pol_hold_rcu(ret)) @@ -4084,6 +4083,7 @@ static int __net_init xfrm_net_init(struct net *net) /* Initialize the per-net locks here */ spin_lock_init(&net->xfrm.xfrm_state_lock); spin_lock_init(&net->xfrm.xfrm_policy_lock); + seqcount_spinlock_init(&net->xfrm.xfrm_policy_hash_generation, &net->xfrm.xfrm_policy_lock); mutex_init(&net->xfrm.xfrm_cfg_mutex); rv = xfrm_statistics_init(net); @@ -4128,7 +4128,6 @@ void __init xfrm_init(void) { register_pernet_subsys(&xfrm_net_ops); xfrm_dev_init(); - seqcount_mutex_init(&xfrm_policy_hash_generation, &hash_resize_mutex); xfrm_input_init(); #ifdef CONFIG_XFRM_ESPINTCP
On Tue, Jun 22, 2021 at 01:51:24PM +0200, Frederic Weisbecker wrote: > On Tue, Jun 22, 2021 at 01:21:59PM +0200, Steffen Klassert wrote: > > On Mon, Jun 21, 2021 at 01:05:28PM +0200, Steffen Klassert wrote: > > > On Mon, Jun 21, 2021 at 11:11:18AM +0200, Varad Gautam wrote: > > > > > > > > Right, I misread the call chain - security_xfrm_policy_lookup does not reach > > > > xfrm_policy_lookup, making this patch unnecessary. The bug I have is: > > > > > > > > T1, holding hash_resize_mutex and sleeping inside synchronize_rcu: > > > > > > > > __schedule > > > > schedule > > > > schedule_timeout > > > > wait_for_completion > > > > __wait_rcu_gp > > > > synchronize_rcu > > > > xfrm_hash_resize > > > > > > > > And T2 producing RCU-stalls since it blocked on the mutex: > > > > > > > > __schedule > > > > schedule > > > > __rt_mutex_slowlock > > > > rt_mutex_slowlock_locked > > > > rt_mutex_slowlock > > > > xfrm_policy_lookup_bytype.constprop.77 > > > > > > Ugh, why does xfrm_policy_lookup_bytype use a mutex? This is called > > > in the receive path inside a sofirq. > > > > > > The bug was introduced by: > > > > > > commit 77cc278f7b202e4f16f8596837219d02cb090b96 > > > Author: Ahmed S. Darwish <a.darwish@linutronix.de> > > > Date: Mon Jul 20 17:55:22 2020 +0200 > > > > > > xfrm: policy: Use sequence counters with associated lock > > > > > > A sequence counter write side critical section must be protected by some > > > form of locking to serialize writers. If the serialization primitive is > > > not disabling preemption implicitly, preemption has to be explicitly > > > disabled before entering the sequence counter write side critical > > > section. > > > > > > A plain seqcount_t does not contain the information of which lock must > > > be held when entering a write side critical section. > > > > > > Use the new seqcount_spinlock_t and seqcount_mutex_t data types instead, > > > which allow to associate a lock with the sequence counter. This enables > > > lockdep to verify that the lock used for writer serialization is held > > > when the write side critical section is entered. > > > > > > If lockdep is disabled this lock association is compiled out and has > > > neither storage size nor runtime overhead. > > > > > > Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > > > Link: https://lkml.kernel.org/r/20200720155530.1173732-17-a.darwish@linutronix.de > > > > > > This uses a seqcount_mutex_t for xfrm_policy_hash_generation, that's > > > wrong. > > > > Varad, can you try to replace the seqcount_mutex_t for xfrm_policy_hash_generation > > by a seqcount_spinlock_t? I'm not familiar with that seqcount changes, > > but we should not end up with using a mutex in this codepath. > > Something like this? (beware, untested, also I don't know if the read side > should then disable bh, doesn't look necessary for PREEMPT_RT, but I may be > missing something...) Looking a bit deeper into this it seems that the problem is that xfrm_policy_hash_generation and hash_resize_mutex do not protect the same thing. hash_resize_mutex protects user configuration against a worker thread that rebalances the hash buckets. xfrm_policy_hash_generation protects user configuration against the data path that runs in softirq. Finally the following line from xfrm_init() relates these two: seqcount_mutex_init(&xfrm_policy_hash_generation, &hash_resize_mutex); That looks a bit odd. This line was also introduced with the above mentioned patch.
On Tue, Jun 22, 2021 at 01:51:24PM +0200, Frederic Weisbecker wrote: > On Tue, Jun 22, 2021 at 01:21:59PM +0200, Steffen Klassert wrote: > > On Mon, Jun 21, 2021 at 01:05:28PM +0200, Steffen Klassert wrote: > > > On Mon, Jun 21, 2021 at 11:11:18AM +0200, Varad Gautam wrote: > > > > Varad, can you try to replace the seqcount_mutex_t for xfrm_policy_hash_generation > > by a seqcount_spinlock_t? I'm not familiar with that seqcount changes, > > but we should not end up with using a mutex in this codepath. > > Something like this? (beware, untested, also I don't know if the read side > should then disable bh, doesn't look necessary for PREEMPT_RT, but I may be > missing something...) > > diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h > index e816b6a3ef2b..9b376b87bd54 100644 > --- a/include/net/netns/xfrm.h > +++ b/include/net/netns/xfrm.h > @@ -74,6 +74,7 @@ struct netns_xfrm { > #endif > spinlock_t xfrm_state_lock; > seqcount_spinlock_t xfrm_state_hash_generation; > + seqcount_spinlock_t xfrm_policy_hash_generation; > > spinlock_t xfrm_policy_lock; > struct mutex xfrm_cfg_mutex; > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c > index ce500f847b99..46a6d15b66d6 100644 > --- a/net/xfrm/xfrm_policy.c > +++ b/net/xfrm/xfrm_policy.c > @@ -155,7 +155,6 @@ static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1] > __read_mostly; > > static struct kmem_cache *xfrm_dst_cache __ro_after_init; > -static __read_mostly seqcount_mutex_t xfrm_policy_hash_generation; > > static struct rhashtable xfrm_policy_inexact_table; > static const struct rhashtable_params xfrm_pol_inexact_params; > @@ -585,7 +584,7 @@ static void xfrm_bydst_resize(struct net *net, int dir) > return; > > spin_lock_bh(&net->xfrm.xfrm_policy_lock); > - write_seqcount_begin(&xfrm_policy_hash_generation); > + write_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation); > > odst = rcu_dereference_protected(net->xfrm.policy_bydst[dir].table, > lockdep_is_held(&net->xfrm.xfrm_policy_lock)); > @@ -596,7 +595,7 @@ static void xfrm_bydst_resize(struct net *net, int dir) > rcu_assign_pointer(net->xfrm.policy_bydst[dir].table, ndst); > net->xfrm.policy_bydst[dir].hmask = nhashmask; > > - write_seqcount_end(&xfrm_policy_hash_generation); > + write_seqcount_end(&net->xfrm.xfrm_policy_hash_generation); > spin_unlock_bh(&net->xfrm.xfrm_policy_lock); > > synchronize_rcu(); > @@ -1245,7 +1244,7 @@ static void xfrm_hash_rebuild(struct work_struct *work) > } while (read_seqretry(&net->xfrm.policy_hthresh.lock, seq)); > > spin_lock_bh(&net->xfrm.xfrm_policy_lock); > - write_seqcount_begin(&xfrm_policy_hash_generation); > + write_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation); > > /* make sure that we can insert the indirect policies again before > * we start with destructive action. > @@ -1354,7 +1353,7 @@ static void xfrm_hash_rebuild(struct work_struct *work) > > out_unlock: > __xfrm_policy_inexact_flush(net); > - write_seqcount_end(&xfrm_policy_hash_generation); > + write_seqcount_end(&net->xfrm.xfrm_policy_hash_generation); > spin_unlock_bh(&net->xfrm.xfrm_policy_lock); > > mutex_unlock(&hash_resize_mutex); > @@ -2095,9 +2094,9 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type, > rcu_read_lock(); > retry: > do { > - sequence = read_seqcount_begin(&xfrm_policy_hash_generation); > + sequence = read_seqcount_begin(&net->xfrm.xfrm_policy_hash_generation); > chain = policy_hash_direct(net, daddr, saddr, family, dir); > - } while (read_seqcount_retry(&xfrm_policy_hash_generation, sequence)); > + } while (read_seqcount_retry(&net->xfrm.xfrm_policy_hash_generation, sequence)); > > ret = NULL; > hlist_for_each_entry_rcu(pol, chain, bydst) { > @@ -2128,7 +2127,7 @@ static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type, > } > > skip_inexact: > - if (read_seqcount_retry(&xfrm_policy_hash_generation, sequence)) > + if (read_seqcount_retry(&net->xfrm.xfrm_policy_hash_generation, sequence)) > goto retry; > > if (ret && !xfrm_pol_hold_rcu(ret)) > @@ -4084,6 +4083,7 @@ static int __net_init xfrm_net_init(struct net *net) > /* Initialize the per-net locks here */ > spin_lock_init(&net->xfrm.xfrm_state_lock); > spin_lock_init(&net->xfrm.xfrm_policy_lock); > + seqcount_spinlock_init(&net->xfrm.xfrm_policy_hash_generation, &net->xfrm.xfrm_policy_lock); > mutex_init(&net->xfrm.xfrm_cfg_mutex); > > rv = xfrm_statistics_init(net); > @@ -4128,7 +4128,6 @@ void __init xfrm_init(void) > { > register_pernet_subsys(&xfrm_net_ops); > xfrm_dev_init(); > - seqcount_mutex_init(&xfrm_policy_hash_generation, &hash_resize_mutex); > xfrm_input_init(); > > #ifdef CONFIG_XFRM_ESPINTCP Yes, looks like your patch should do it. The xfrm_policy_lock is the write side protection for the seqcount here. Thanks!
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 0b7409f19ecb1..28e072007c72d 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -2152,42 +2152,47 @@ static struct xfrm_policy *xfrm_sk_policy_lookup(const struct sock *sk, int dir, u16 family, u32 if_id) { struct xfrm_policy *pol; + bool match; + int err = 0; - rcu_read_lock(); again: + rcu_read_lock(); pol = rcu_dereference(sk->sk_policy[dir]); - if (pol != NULL) { - bool match; - int err = 0; - - if (pol->family != family) { - pol = NULL; - goto out; - } + if (pol == NULL) { + rcu_read_unlock(); + goto out; + } - match = xfrm_selector_match(&pol->selector, fl, family); - if (match) { - if ((sk->sk_mark & pol->mark.m) != pol->mark.v || - pol->if_id != if_id) { - pol = NULL; - goto out; - } - err = security_xfrm_policy_lookup(pol->security, - fl->flowi_secid, - dir); - if (!err) { - if (!xfrm_pol_hold_rcu(pol)) - goto again; - } else if (err == -ESRCH) { - pol = NULL; - } else { - pol = ERR_PTR(err); - } - } else - pol = NULL; + /* Take a reference on this pol and call rcu_read_unlock(). */ + if (!xfrm_pol_hold_rcu(pol)) { + rcu_read_unlock(); + goto again; } -out: rcu_read_unlock(); + + if (pol->family != family) + goto out_put; + + match = xfrm_selector_match(&pol->selector, fl, family); + if (!match) + goto out_put; + + if ((sk->sk_mark & pol->mark.m) != pol->mark.v || + pol->if_id != if_id) + goto out_put; + + err = security_xfrm_policy_lookup(pol->security, + fl->flowi_secid, + dir); + if (!err) { + /* Safe to return, we have a ref on pol. */ + goto out; + } + + out_put: + xfrm_pol_put(pol); + pol = (err && err != -ESRCH) ? ERR_PTR(err) : NULL; + out: return pol; }
Commit "xfrm: policy: Read seqcount outside of rcu-read side in xfrm_policy_lookup_bytype" [Linked] resolved a locking bug in xfrm_policy_lookup_bytype that causes an RCU reader-writer deadlock on the mutex wrapped by xfrm_policy_hash_generation on PREEMPT_RT since 77cc278f7b20 ("xfrm: policy: Use sequence counters with associated lock"). However, xfrm_sk_policy_lookup can still reach xfrm_policy_lookup_bytype while holding rcu_read_lock(), as: xfrm_sk_policy_lookup() rcu_read_lock() security_xfrm_policy_lookup() xfrm_policy_lookup() xfrm_policy_lookup_bytype() read_seqcount_begin() mutex_lock(&hash_resize_mutex) This results in the same deadlock on hash_resize_mutex, where xfrm_hash_resize is holding the mutex and sleeping inside synchronize_rcu, and xfrm_policy_lookup_bytype blocks on the mutex inside the RCU read side critical section. To resolve this, shorten the RCU read side critical section within xfrm_sk_policy_lookup by taking a reference on the associated xfrm_policy, and avoid calling xfrm_policy_lookup_bytype with the rcu_read_lock() held. Fixes: 77cc278f7b20 ("xfrm: policy: Use sequence counters with associated lock") Link: https://lore.kernel.org/r/20210528160407.32127-1-varad.gautam@suse.com/ Signed-off-by: Varad Gautam <varad.gautam@suse.com> Cc: stable@vger.kernel.org # v5.9 --- net/xfrm/xfrm_policy.c | 65 +++++++++++++++++++++++------------------- 1 file changed, 35 insertions(+), 30 deletions(-)