Message ID | 8c0e0c486056b5185b58998f2cce62619ed3f05c.camel@gmx.de |
---|---|
State | New |
Series | [rfc/patch] mm/slub: restore/expand unfreeze_partials() local exclusion scope |
On Fri, 2021-07-09 at 07:21 +0200, Mike Galbraith wrote:
> Well, bug report in patch for actually.
^^^form
On Fri, Jul 09 2021 at 07:21, Mike Galbraith wrote:
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2497,7 +2497,9 @@ static void put_cpu_partial(struct kmem_
>  				 * partial array is full. Move the existing
>  				 * set to the per node partial list.
>  				 */
> +				local_lock(&s->cpu_slab->lock);
>  				unfreeze_partials(s);
> +				local_unlock(&s->cpu_slab->lock);
>  				oldpage = NULL;
>  				pobjects = 0;
>  				pages = 0;
> @@ -2579,7 +2581,9 @@ static void flush_cpu_slab(struct work_s
>  	if (c->page)
>  		flush_slab(s, c, true);
>
> +	local_lock(&s->cpu_slab->lock);
>  	unfreeze_partials(s);
> +	local_unlock(&s->cpu_slab->lock);
>  }
>
>  static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> @@ -2632,8 +2636,11 @@ static int slub_cpu_dead(unsigned int cp
>  	struct kmem_cache *s;
>
>  	mutex_lock(&slab_mutex);
> -	list_for_each_entry(s, &slab_caches, list)
> +	list_for_each_entry(s, &slab_caches, list) {
> +		local_lock(&s->cpu_slab->lock);

This one is odd. It locks the cpu_slab lock of the CPU which runs this
callback and then flushes the slab of the dead CPU.

Thanks,

        tglx
On Fri, 2021-07-09 at 21:28 +0200, Thomas Gleixner wrote:
> On Fri, Jul 09 2021 at 07:21, Mike Galbraith wrote:
> >  static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> > @@ -2632,8 +2636,11 @@ static int slub_cpu_dead(unsigned int cp
> >  	struct kmem_cache *s;
> >
> >  	mutex_lock(&slab_mutex);
> > -	list_for_each_entry(s, &slab_caches, list)
> > +	list_for_each_entry(s, &slab_caches, list) {
> > +		local_lock(&s->cpu_slab->lock);
>
> This one is odd. It locks the cpu_slab lock of the CPU which runs this
> callback and then flushes the slab of the dead CPU.

That spot I put back only because it used to exist, ergo I documented it
as an afterthought. Yeah, it clearly has nada to do with the explosions,
those were stopped by the other two as previously reported.

	-Mike
Greetings crickets,

Methinks the problem is the hole these patches opened only for RT.

static void put_cpu_partial(struct kmem_cache *s, struct page *page,
			    int drain)
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
	struct page *oldpage;
	int pages;
	int pobjects;

	slub_get_cpu_ptr(s->cpu_slab);
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That is an assertion that the stuff under it is preempt safe for RT and
ONLY RT. My box says the RT portion of that assertion is horse pookey.
What it does is let kmem_cache_free()/kfree() blast straight through
___slab_alloc() critical sections, swapping out ->partial underneath it.
Sprinkling percpu fingerprint powder about, I saw lots of
___slab_alloc() preempts put_cpu_partial().. and vice versa. I don't
think it's a coincidence that ___slab_alloc() and __unfreeze_partials()
both explode trying to access page->freelist. You've seen the former,
here's one of the latter.

crash> bt -sx
PID: 18761  TASK: ffff88812fff0000  CPU: 0  COMMAND: "hackbench"
 #0 [ffff88818f8ff980] machine_kexec+0x14f at ffffffff81051c8f
 #1 [ffff88818f8ff9c8] __crash_kexec+0xd2 at ffffffff8111ef72
 #2 [ffff88818f8ffa88] crash_kexec+0x30 at ffffffff8111fd10
 #3 [ffff88818f8ffa98] oops_end+0xd3 at ffffffff810267e3
 #4 [ffff88818f8ffab8] exc_general_protection+0x195 at ffffffff8179fdb5
 #5 [ffff88818f8ffb50] asm_exc_general_protection+0x1e at ffffffff81800a0e
    [exception RIP: __unfreeze_partials+156]
    RIP: ffffffff81248bac  RSP: ffff88818f8ffc00  RFLAGS: 00010202
    RAX: 0000000000200002  RBX: 0000000000200002  RCX: 000000017fffffff
    RDX: 00000001ffffffff  RSI: 0000000000000202  RDI: ffff888100040b80
    RBP: ffff88818f8ffca0   R8: ffff88818f9cba30   R9: 0000000000000001
    R10: ffff88818f8ffcc0  R11: 0000000000000000  R12: ffff888100043700
    R13: ffff888100040b80  R14: 00000001ffffffff  R15: ffffea00046c0808
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff88818f8ffcb8] put_cpu_partial+0x8e at ffffffff81248ece
 #7 [ffff88818f8ffcd8] consume_skb+0x2b at ffffffff815eddeb
 #8 [ffff88818f8ffce8] unix_stream_read_generic+0x788 at ffffffff8170b238
 #9 [ffff88818f8ffdc0] unix_stream_recvmsg+0x43 at ffffffff8170b433
#10 [ffff88818f8ffdf8] sock_read_iter+0x104 at ffffffff815dd554
#11 [ffff88818f8ffe68] new_sync_read+0x16a at ffffffff812674fa
#12 [ffff88818f8ffee8] vfs_read+0x1ae at ffffffff81269c8e
#13 [ffff88818f8fff00] ksys_read+0x40 at ffffffff8126a070
#14 [ffff88818f8fff38] do_syscall_64+0x37 at ffffffff8179f5e7
#15 [ffff88818f8fff50] entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8180007c
    RIP: 00007fbda4438f2e  RSP: 00007fff3bf9d798  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00007fff3bf9e7a0  RCX: 00007fbda4438f2e
    RDX: 0000000000001000  RSI: 00007fff3bf9d7a0  RDI: 0000000000000007
    RBP: 00007fff3bf9e800   R8: 00007fff3bf9e6e0   R9: 00007fbda4641010
    R10: 00007fbda4428028  R11: 0000000000000246  R12: 0000000000001000
crash> dis __unfreeze_partials+156
0xffffffff81248bac <__unfreeze_partials+156>:   lock cmpxchg16b 0x20(%r15)
crash> gdb list *__unfreeze_partials+156
0xffffffff81248bac is in __unfreeze_partials (mm/slub.c:475).
470 if (!disable_irqs) 471 lockdep_assert_irqs_disabled(); 472 #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ 473 defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) 474 if (s->flags & __CMPXCHG_DOUBLE) { 475 if (cmpxchg_double(&page->freelist, &page->counters, 476 freelist_old, counters_old, 477 freelist_new, counters_new)) 478 return true; 479 } else crash> kmem ffffea00046c0808 PAGE PHYSICAL MAPPING INDEX CNT FLAGS ffffea00104b3000 412cc0000 0 c 2 8000000000003000 reserved,private crash Regarding $subject, Lockdep thinks my hole plugging skills suck rocks (shrug, I've given it plentiful cause) but that's ok, I sometimes think the same of its bug reporting skills :) [ 2.459456] ============================================ [ 2.459456] WARNING: possible recursive locking detected [ 2.459457] 5.14.0.g79e92006-tip-rt #83 Tainted: G E [ 2.459458] -------------------------------------------- [ 2.459458] rcuc/7/56 is trying to acquire lock: [ 2.459459] ffff88840edf01c0 (&(&c->lock)->lock){+.+.}-{0:0}, at: kfree+0x280/0x670 [ 2.459466] but task is already holding lock: [ 2.459466] ffff88840edf4d60 (&(&c->lock)->lock){+.+.}-{0:0}, at: __slab_free+0x4d6/0x600 [ 2.459469] live kernel snooping.... crash> ps | grep UN crash> bt -sx 56 PID: 56 TASK: ffff888100c19a40 CPU: 7 COMMAND: "rcuc/7" #0 [ffff888100c63e40] __schedule+0x2eb at ffffffff818969fb #1 [ffff888100c63ec8] schedule+0x3b at ffffffff8189723b #2 [ffff888100c63ee0] smpboot_thread_fn+0x18c at ffffffff810a90dc #3 [ffff888100c63f18] kthread+0x1af at ffffffff810a27bf #4 [ffff888100c63f50] ret_from_fork+0x1f at ffffffff81001cbf crash> task_struct ffff888100c19a40 | grep __state __state = 1, crash> gdb list *__slab_free+0x4d6 0xffffffff812923c6 is in __slab_free (mm/slub.c:2568). 2563 /* 2564 * partial array is full. Move the existing 2565 * set to the per node partial list. 2566 */ 2567 local_lock(&s->cpu_slab->lock); 2568 unfreeze_partials(s); 2569 local_unlock(&s->cpu_slab->lock); 2570 oldpage = NULL; 2571 pobjects = 0; 2572 pages = 0; crash> gdb list *kfree+0x280 0xffffffff81292770 is in kfree (mm/slub.c:3442). 3437 * __slab_free() as that wouldn't use the cpu freelist at all. 3438 */ 3439 void **freelist; 3440 3441 local_lock(&s->cpu_slab->lock); 3442 c = this_cpu_ptr(s->cpu_slab); 3443 if (unlikely(page != c->page)) { 3444 local_unlock(&s->cpu_slab->lock); 3445 goto redo; 3446 } crash> Check, those are what's in the stacktrace below... but the allegedly deadlocked for real task is very much alive, as is the rest of box. 
other info that might help us debug this: [ 2.459470] Possible unsafe locking scenario: [ 2.459470] CPU0 [ 2.459470] ---- [ 2.459471] lock(&(&c->lock)->lock); [ 2.459471] lock(&(&c->lock)->lock); [ 2.459472] *** DEADLOCK *** [ 2.459472] May be due to missing lock nesting notation [ 2.459472] 6 locks held by rcuc/7/56: [ 2.459473] #0: ffff88840edd9820 ((softirq_ctrl.lock).lock){+.+.}-{2:2}, at: __local_bh_disable_ip+0xc3/0x190 [ 2.459479] #1: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5/0xd0 [ 2.459484] #2: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: __local_bh_disable_ip+0xa0/0x190 [ 2.459487] #3: ffffffff82382ac0 (rcu_callback){....}-{0:0}, at: rcu_do_batch+0x195/0x520 [ 2.459490] #4: ffff88840edf4d60 (&(&c->lock)->lock){+.+.}-{0:0}, at: __slab_free+0x4d6/0x600 [ 2.459493] #5: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5/0xd0 [ 2.459496] stack backtrace: [ 2.459497] CPU: 7 PID: 56 Comm: rcuc/7 Tainted: G E 5.14.0.g79e92006-tip-rt #83 [ 2.459498] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013 [ 2.459499] Call Trace: [ 2.459501] dump_stack_lvl+0x44/0x57 [ 2.459504] __lock_acquire+0xb7f/0x1ab0 [ 2.459508] lock_acquire+0x2a6/0x340 [ 2.459510] ? kfree+0x280/0x670 [ 2.459513] ? __free_slab+0xa4/0x1b0 [ 2.459515] rt_spin_lock+0x2a/0xd0 [ 2.459516] ? kfree+0x280/0x670 somewhere in the multiverse lies a dead rcuc/7 [ 2.459518] kfree+0x280/0x670 <== local_lock(&s->cpu_slab->lock); #2 [ 2.459521] __free_slab+0xa4/0x1b0 ==> kfree(page_objcgs(page)) [ 2.459523] __unfreeze_partials+0x1d8/0x330 ==> discard_slab(s, page); [ 2.459526] ? _raw_spin_unlock_irqrestore+0x30/0x80 [ 2.459530] ? __slab_free+0x4de/0x600 [ 2.459532] __slab_free+0x4de/0x600 <== local_lock(&s->cpu_slab->lock); #1 [ 2.459534] ? find_held_lock+0x2d/0x90 [ 2.459536] ? kmem_cache_free+0x276/0x630 [ 2.459539] ? rcu_do_batch+0x1c3/0x520 [ 2.459540] ? kmem_cache_free+0x364/0x630 [ 2.459542] kmem_cache_free+0x364/0x630 [ 2.459544] ? rcu_do_batch+0x195/0x520 [ 2.459546] rcu_do_batch+0x1c3/0x520 [ 2.459547] ? rcu_do_batch+0x195/0x520 [ 2.459550] ? rcu_cpu_kthread+0x7e/0x4b0 [ 2.459552] ? rcu_cpu_kthread+0x57/0x4b0 [ 2.459553] rcu_core+0x2c3/0x590 [ 2.459555] ? rcu_cpu_kthread+0x78/0x4b0 [ 2.459557] ? rcu_cpu_kthread+0x7e/0x4b0 [ 2.459558] ? rcu_cpu_kthread+0x57/0x4b0 [ 2.459560] rcu_cpu_kthread+0xc2/0x4b0 [ 2.459562] ? smpboot_thread_fn+0x23/0x300 [ 2.459565] smpboot_thread_fn+0x249/0x300 [ 2.459567] ? smpboot_register_percpu_thread+0xe0/0xe0 [ 2.459569] kthread+0x1af/0x1d0 [ 2.459571] ? set_kthread_struct+0x40/0x40 [ 2.459574] ret_from_fork+0x1f/0x30
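The interleaving described above can be modelled outside the kernel. The
following is a toy userspace analogue, purely illustrative and not SLUB code
(the names `partial`, `pusher` and `drainer` only mirror the kernel roles):
one thread publishes onto a shared list head with a CAS loop, standing in for
put_cpu_partial()'s this_cpu_cmpxchg(), while the other detaches the list with
a plain read-then-write, standing in for a critical section that relied on not
being preempted. Nodes pushed in the window between the read and the write are
lost, which is the class of corruption being chased here. Build with
`cc -pthread`; it typically reports a nonzero "lost" count.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct node { struct node *next; };

static _Atomic(struct node *) partial;	/* stands in for c->partial */
static atomic_long pushed, drained;

/* CAS-loop publisher, modelling the put_cpu_partial() side */
static void *pusher(void *arg)
{
	for (int i = 0; i < 100000; i++) {
		struct node *n = malloc(sizeof(*n));
		struct node *old = atomic_load(&partial);

		do
			n->next = old;
		while (!atomic_compare_exchange_weak(&partial, &old, n));
		atomic_fetch_add(&pushed, 1);
	}
	return NULL;
}

/* plain read-then-write detach, modelling the preemptible critical section */
static void *drainer(void *arg)
{
	for (int i = 0; i < 100000; i++) {
		struct node *head = atomic_load(&partial);	/* read ... */
		atomic_store(&partial, NULL);			/* ... then write */
		for (; head; head = head->next)
			atomic_fetch_add(&drained, 1);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, pusher, NULL);
	pthread_create(&b, NULL, drainer, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* count whatever is still on the list, then report lost pushes */
	for (struct node *n = atomic_load(&partial); n; n = n->next)
		atomic_fetch_add(&drained, 1);
	printf("pushed %ld drained %ld lost %ld\n",
	       (long)pushed, (long)drained, (long)(pushed - drained));
	return 0;
}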
On 7/15/21 6:34 PM, Mike Galbraith wrote: > Greetings crickets, > > Methinks he problem is the hole these patches opened only for RT. > > static void put_cpu_partial(struct kmem_cache *s, struct page *page, > int drain) > { > #ifdef CONFIG_SLUB_CPU_PARTIAL > struct page *oldpage; > int pages; > int pobjects; > > slub_get_cpu_ptr(s->cpu_slab); > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > That is an assertion that the stuff under it is preempt safe for RT and > ONLY RT. My box says the RT portion of that assertion is horse pookey. > What it does is let kmem_cache_free()/kfree() blast straight through > ___slab_alloc() critical sections, swapping out ->partial underneath > it. Sprinkling percpu fingerprint power about, I saw lots of > ___slab_alloc() preempts put_cpu_partial().. and vice versa. I don't > think it's a coincidence that ___slab_alloc() and __unfreeze_partials() > both explode trying to access page->freelist. You've seen the former, > here's one of the later. > > crash> bt -sx > PID: 18761 TASK: ffff88812fff0000 CPU: 0 COMMAND: "hackbench" > #0 [ffff88818f8ff980] machine_kexec+0x14f at ffffffff81051c8f > #1 [ffff88818f8ff9c8] __crash_kexec+0xd2 at ffffffff8111ef72 > #2 [ffff88818f8ffa88] crash_kexec+0x30 at ffffffff8111fd10 > #3 [ffff88818f8ffa98] oops_end+0xd3 at ffffffff810267e3 > #4 [ffff88818f8ffab8] exc_general_protection+0x195 at ffffffff8179fdb5 > #5 [ffff88818f8ffb50] asm_exc_general_protection+0x1e at ffffffff81800a0e > [exception RIP: __unfreeze_partials+156] Hm going back to this report, if that's a direct result of __slab_alloc preempting (interrupting) put_cpu_partial() then I think that's something that could happen even on !RT as all we do in the put_cpu_partial() there is disable preemption, and the allocation could come in irq handler. It would mean the whole idea of "mm, slub: detach percpu partial list in unfreeze_partials() using this_cpu_cmpxchg()" isn't safe. I'll have to ponder it. But we didn't see crashes on !RT yet. So could it be that it was still put_cpu_partial() preempting __slab_alloc() messing the partial list, but for some reason the put_cpu_partial() side crashed this time? 
> RIP: ffffffff81248bac RSP: ffff88818f8ffc00 RFLAGS: 00010202 > RAX: 0000000000200002 RBX: 0000000000200002 RCX: 000000017fffffff > RDX: 00000001ffffffff RSI: 0000000000000202 RDI: ffff888100040b80 > RBP: ffff88818f8ffca0 R8: ffff88818f9cba30 R9: 0000000000000001 > R10: ffff88818f8ffcc0 R11: 0000000000000000 R12: ffff888100043700 > R13: ffff888100040b80 R14: 00000001ffffffff R15: ffffea00046c0808 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #6 [ffff88818f8ffcb8] put_cpu_partial+0x8e at ffffffff81248ece > #7 [ffff88818f8ffcd8] consume_skb+0x2b at ffffffff815eddeb > #8 [ffff88818f8ffce8] unix_stream_read_generic+0x788 at ffffffff8170b238 > #9 [ffff88818f8ffdc0] unix_stream_recvmsg+0x43 at ffffffff8170b433 > #10 [ffff88818f8ffdf8] sock_read_iter+0x104 at ffffffff815dd554 > #11 [ffff88818f8ffe68] new_sync_read+0x16a at ffffffff812674fa > #12 [ffff88818f8ffee8] vfs_read+0x1ae at ffffffff81269c8e > #13 [ffff88818f8fff00] ksys_read+0x40 at ffffffff8126a070 > #14 [ffff88818f8fff38] do_syscall_64+0x37 at ffffffff8179f5e7 > #15 [ffff88818f8fff50] entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8180007c > RIP: 00007fbda4438f2e RSP: 00007fff3bf9d798 RFLAGS: 00000246 > RAX: ffffffffffffffda RBX: 00007fff3bf9e7a0 RCX: 00007fbda4438f2e > RDX: 0000000000001000 RSI: 00007fff3bf9d7a0 RDI: 0000000000000007 > RBP: 00007fff3bf9e800 R8: 00007fff3bf9e6e0 R9: 00007fbda4641010 > R10: 00007fbda4428028 R11: 0000000000000246 R12: 0000000000001000 > crash> dis __unfreeze_partials+156 > 0xffffffff81248bac <__unfreeze_partials+156>: lock cmpxchg16b 0x20(%r15) > crash> gdb list *__unfreeze_partials+156 > 0xffffffff81248bac is in __unfreeze_partials (mm/slub.c:475). > 470 if (!disable_irqs) > 471 lockdep_assert_irqs_disabled(); > 472 #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ > 473 defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) > 474 if (s->flags & __CMPXCHG_DOUBLE) { > 475 if (cmpxchg_double(&page->freelist, &page->counters, > 476 freelist_old, counters_old, > 477 freelist_new, counters_new)) > 478 return true; > 479 } else > crash> kmem ffffea00046c0808 > PAGE PHYSICAL MAPPING INDEX CNT FLAGS > ffffea00104b3000 412cc0000 0 c 2 8000000000003000 reserved,private > crash > > Regarding $subject, Lockdep thinks my hole plugging skills suck rocks > (shrug, I've given it plentiful cause) but that's ok, I sometimes think > the same of its bug reporting skills :) > > [ 2.459456] ============================================ > [ 2.459456] WARNING: possible recursive locking detected > [ 2.459457] 5.14.0.g79e92006-tip-rt #83 Tainted: G E > [ 2.459458] -------------------------------------------- > [ 2.459458] rcuc/7/56 is trying to acquire lock: > [ 2.459459] ffff88840edf01c0 (&(&c->lock)->lock){+.+.}-{0:0}, at: kfree+0x280/0x670 > [ 2.459466] > but task is already holding lock: > [ 2.459466] ffff88840edf4d60 (&(&c->lock)->lock){+.+.}-{0:0}, at: __slab_free+0x4d6/0x600 > [ 2.459469] > > live kernel snooping.... > > crash> ps | grep UN > crash> bt -sx 56 > PID: 56 TASK: ffff888100c19a40 CPU: 7 COMMAND: "rcuc/7" > #0 [ffff888100c63e40] __schedule+0x2eb at ffffffff818969fb > #1 [ffff888100c63ec8] schedule+0x3b at ffffffff8189723b > #2 [ffff888100c63ee0] smpboot_thread_fn+0x18c at ffffffff810a90dc > #3 [ffff888100c63f18] kthread+0x1af at ffffffff810a27bf > #4 [ffff888100c63f50] ret_from_fork+0x1f at ffffffff81001cbf > crash> task_struct ffff888100c19a40 | grep __state > __state = 1, > crash> gdb list *__slab_free+0x4d6 > 0xffffffff812923c6 is in __slab_free (mm/slub.c:2568). > 2563 /* > 2564 * partial array is full. 
Move the existing > 2565 * set to the per node partial list. > 2566 */ > 2567 local_lock(&s->cpu_slab->lock); > 2568 unfreeze_partials(s); > 2569 local_unlock(&s->cpu_slab->lock); > 2570 oldpage = NULL; > 2571 pobjects = 0; > 2572 pages = 0; > crash> gdb list *kfree+0x280 > 0xffffffff81292770 is in kfree (mm/slub.c:3442). > 3437 * __slab_free() as that wouldn't use the cpu freelist at all. > 3438 */ > 3439 void **freelist; > 3440 > 3441 local_lock(&s->cpu_slab->lock); > 3442 c = this_cpu_ptr(s->cpu_slab); > 3443 if (unlikely(page != c->page)) { > 3444 local_unlock(&s->cpu_slab->lock); > 3445 goto redo; > 3446 } > crash> > > Check, those are what's in the stacktrace below... but the allegedly > deadlocked for real task is very much alive, as is the rest of box. > > other info that might help us debug this: > [ 2.459470] Possible unsafe locking scenario: > > [ 2.459470] CPU0 > [ 2.459470] ---- > [ 2.459471] lock(&(&c->lock)->lock); > [ 2.459471] lock(&(&c->lock)->lock); > [ 2.459472] > *** DEADLOCK *** > > [ 2.459472] May be due to missing lock nesting notation > > [ 2.459472] 6 locks held by rcuc/7/56: > [ 2.459473] #0: ffff88840edd9820 ((softirq_ctrl.lock).lock){+.+.}-{2:2}, at: __local_bh_disable_ip+0xc3/0x190 > [ 2.459479] #1: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5/0xd0 > [ 2.459484] #2: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: __local_bh_disable_ip+0xa0/0x190 > [ 2.459487] #3: ffffffff82382ac0 (rcu_callback){....}-{0:0}, at: rcu_do_batch+0x195/0x520 > [ 2.459490] #4: ffff88840edf4d60 (&(&c->lock)->lock){+.+.}-{0:0}, at: __slab_free+0x4d6/0x600 > [ 2.459493] #5: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5/0xd0 > [ 2.459496] > stack backtrace: > [ 2.459497] CPU: 7 PID: 56 Comm: rcuc/7 Tainted: G E 5.14.0.g79e92006-tip-rt #83 > [ 2.459498] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013 > [ 2.459499] Call Trace: > [ 2.459501] dump_stack_lvl+0x44/0x57 > [ 2.459504] __lock_acquire+0xb7f/0x1ab0 > [ 2.459508] lock_acquire+0x2a6/0x340 > [ 2.459510] ? kfree+0x280/0x670 > [ 2.459513] ? __free_slab+0xa4/0x1b0 > [ 2.459515] rt_spin_lock+0x2a/0xd0 > [ 2.459516] ? kfree+0x280/0x670 somewhere in the multiverse lies a dead rcuc/7 > [ 2.459518] kfree+0x280/0x670 <== local_lock(&s->cpu_slab->lock); #2 > [ 2.459521] __free_slab+0xa4/0x1b0 ==> kfree(page_objcgs(page)) > [ 2.459523] __unfreeze_partials+0x1d8/0x330 ==> discard_slab(s, page); > [ 2.459526] ? _raw_spin_unlock_irqrestore+0x30/0x80 > [ 2.459530] ? __slab_free+0x4de/0x600 > [ 2.459532] __slab_free+0x4de/0x600 <== local_lock(&s->cpu_slab->lock); #1 > [ 2.459534] ? find_held_lock+0x2d/0x90 > [ 2.459536] ? kmem_cache_free+0x276/0x630 > [ 2.459539] ? rcu_do_batch+0x1c3/0x520 > [ 2.459540] ? kmem_cache_free+0x364/0x630 > [ 2.459542] kmem_cache_free+0x364/0x630 > [ 2.459544] ? rcu_do_batch+0x195/0x520 > [ 2.459546] rcu_do_batch+0x1c3/0x520 > [ 2.459547] ? rcu_do_batch+0x195/0x520 > [ 2.459550] ? rcu_cpu_kthread+0x7e/0x4b0 > [ 2.459552] ? rcu_cpu_kthread+0x57/0x4b0 > [ 2.459553] rcu_core+0x2c3/0x590 > [ 2.459555] ? rcu_cpu_kthread+0x78/0x4b0 > [ 2.459557] ? rcu_cpu_kthread+0x7e/0x4b0 > [ 2.459558] ? rcu_cpu_kthread+0x57/0x4b0 > [ 2.459560] rcu_cpu_kthread+0xc2/0x4b0 > [ 2.459562] ? smpboot_thread_fn+0x23/0x300 > [ 2.459565] smpboot_thread_fn+0x249/0x300 > [ 2.459567] ? smpboot_register_percpu_thread+0xe0/0xe0 > [ 2.459569] kthread+0x1af/0x1d0 > [ 2.459571] ? set_kthread_struct+0x40/0x40 > [ 2.459574] ret_from_fork+0x1f/0x30 >
On Tue, 2021-07-20 at 10:56 +0200, Vlastimil Babka wrote: > > > crash> bt -sx > > PID: 18761 TASK: ffff88812fff0000 CPU: 0 COMMAND: "hackbench" > > #0 [ffff88818f8ff980] machine_kexec+0x14f at ffffffff81051c8f > > #1 [ffff88818f8ff9c8] __crash_kexec+0xd2 at ffffffff8111ef72 > > #2 [ffff88818f8ffa88] crash_kexec+0x30 at ffffffff8111fd10 > > #3 [ffff88818f8ffa98] oops_end+0xd3 at ffffffff810267e3 > > #4 [ffff88818f8ffab8] exc_general_protection+0x195 at ffffffff8179fdb5 > > #5 [ffff88818f8ffb50] asm_exc_general_protection+0x1e at ffffffff81800a0e > > [exception RIP: __unfreeze_partials+156] > > Hm going back to this report... > So could it be that it was stillput_cpu_partial() preempting > __slab_alloc() messing the partial list, but for some reason the > put_cpu_partial() side crashed this time? Thinking this bug is toast, I emptied the trash bin, so no can peek. If need be, and nobody beats me to it (which ~guarantees nobody will beat me to it;), I can make the thing go boom again easily enough. -Mike
On Tue, 2021-07-20 at 13:26 +0200, Mike Galbraith wrote: > On Tue, 2021-07-20 at 10:56 +0200, Vlastimil Babka wrote: > > > crash> bt -sx > > > PID: 18761 TASK: ffff88812fff0000 CPU: 0 COMMAND: "hackbench" > > > #0 [ffff88818f8ff980] machine_kexec+0x14f at ffffffff81051c8f > > > #1 [ffff88818f8ff9c8] __crash_kexec+0xd2 at ffffffff8111ef72 > > > #2 [ffff88818f8ffa88] crash_kexec+0x30 at ffffffff8111fd10 > > > #3 [ffff88818f8ffa98] oops_end+0xd3 at ffffffff810267e3 > > > #4 [ffff88818f8ffab8] exc_general_protection+0x195 at > > > ffffffff8179fdb5 > > > #5 [ffff88818f8ffb50] asm_exc_general_protection+0x1e at > > > ffffffff81800a0e > > > [exception RIP: __unfreeze_partials+156] > > > > Hm going back to this report... > > So could it be that it was stillput_cpu_partial() preempting > > __slab_alloc() messing the partial list, but for some reason the > > put_cpu_partial() side crashed this time? > > Thinking this bug is toast, I emptied the trash bin, so no can peek. I made fireworks while waiting for bike riding time, boom #10 was finally the right flavor, but... crash> bt -sx PID: 32 TASK: ffff888100a56000 CPU: 3 COMMAND: "rcuc/3" #0 [ffff888100aa7a90] machine_kexec+0x14f at ffffffff81051c8f #1 [ffff888100aa7ad8] __crash_kexec+0xd2 at ffffffff81120612 #2 [ffff888100aa7b98] crash_kexec+0x30 at ffffffff811213b0 #3 [ffff888100aa7ba8] oops_end+0xd3 at ffffffff810267e3 #4 [ffff888100aa7bc8] exc_general_protection+0x195 at ffffffff817a2cc5 #5 [ffff888100aa7c60] asm_exc_general_protection+0x1e at ffffffff81800a0e [exception RIP: __unfreeze_partials+149] RIP: ffffffff8124a295 RSP: ffff888100aa7d10 RFLAGS: 00010202 RAX: 0000000000190016 RBX: 0000000000190016 RCX: 000000017fffffff RDX: 00000001ffffffff RSI: 0000000000000023 RDI: ffffffff81e58b10 RBP: ffff888100aa7da0 R8: 0000000000000000 R9: 0000000000190018 R10: ffff888100aa7db8 R11: 000000000002d9e4 R12: ffff888100190500 R13: ffff88810018c980 R14: 00000001ffffffff R15: ffffea0004571588 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #6 [ffff888100aa7db0] put_cpu_partial+0x8e at ffffffff8124a56e #7 [ffff888100aa7dd0] kmem_cache_free+0x3a8 at ffffffff8124d238 #8 [ffff888100aa7e08] rcu_do_batch+0x186 at ffffffff810eb246 #9 [ffff888100aa7e70] rcu_core+0x25f at ffffffff810eeb2f #10 [ffff888100aa7eb0] rcu_cpu_kthread+0x94 at ffffffff810eed24 #11 [ffff888100aa7ee0] smpboot_thread_fn+0x249 at ffffffff8109e559 #12 [ffff888100aa7f18] kthread+0x1ac at ffffffff810984dc #13 [ffff888100aa7f50] ret_from_fork+0x1f at ffffffff81001b1f crash> runq ... CPU 3 RUNQUEUE: ffff88840ece9980 CURRENT: PID: 32 TASK: ffff888100a56000 COMMAND: "rcuc/3" RT PRIO_ARRAY: ffff88840ece9bc0 [ 94] PID: 32 TASK: ffff888100a56000 COMMAND: "rcuc/3" CFS RB_ROOT: ffff88840ece9a40 [120] PID: 33 TASK: ffff888100a51000 COMMAND: "ksoftirqd/3" ... crash> bt -sx 33 PID: 33 TASK: ffff888100a51000 CPU: 3 COMMAND: "ksoftirqd/3" #0 [ffff888100aabdf0] __schedule+0x2d7 at ffffffff817ad3a7 #1 [ffff888100aabec8] schedule+0x3b at ffffffff817ae4eb #2 [ffff888100aabee0] smpboot_thread_fn+0x18c at ffffffff8109e49c #3 [ffff888100aabf18] kthread+0x1ac at ffffffff810984dc #4 [ffff888100aabf50] ret_from_fork+0x1f at ffffffff81001b1f crash>
On 7/21/21 6:56 AM, Mike Galbraith wrote: > On Tue, 2021-07-20 at 13:26 +0200, Mike Galbraith wrote: >> On Tue, 2021-07-20 at 10:56 +0200, Vlastimil Babka wrote: >> > > crash> bt -sx >> > > PID: 18761 TASK: ffff88812fff0000 CPU: 0 COMMAND: "hackbench" >> > > #0 [ffff88818f8ff980] machine_kexec+0x14f at ffffffff81051c8f >> > > #1 [ffff88818f8ff9c8] __crash_kexec+0xd2 at ffffffff8111ef72 >> > > #2 [ffff88818f8ffa88] crash_kexec+0x30 at ffffffff8111fd10 >> > > #3 [ffff88818f8ffa98] oops_end+0xd3 at ffffffff810267e3 >> > > #4 [ffff88818f8ffab8] exc_general_protection+0x195 at >> > > ffffffff8179fdb5 >> > > #5 [ffff88818f8ffb50] asm_exc_general_protection+0x1e at >> > > ffffffff81800a0e >> > > [exception RIP: __unfreeze_partials+156] >> > >> > Hm going back to this report... >> > So could it be that it was stillput_cpu_partial() preempting >> > __slab_alloc() messing the partial list, but for some reason the >> > put_cpu_partial() side crashed this time? >> >> Thinking this bug is toast, I emptied the trash bin, so no can peek. > > I made fireworks while waiting for bike riding time, boom #10 was > finally the right flavor, but... > > crash> bt -sx > PID: 32 TASK: ffff888100a56000 CPU: 3 COMMAND: "rcuc/3" > #0 [ffff888100aa7a90] machine_kexec+0x14f at ffffffff81051c8f > #1 [ffff888100aa7ad8] __crash_kexec+0xd2 at ffffffff81120612 > #2 [ffff888100aa7b98] crash_kexec+0x30 at ffffffff811213b0 > #3 [ffff888100aa7ba8] oops_end+0xd3 at ffffffff810267e3 > #4 [ffff888100aa7bc8] exc_general_protection+0x195 at ffffffff817a2cc5 > #5 [ffff888100aa7c60] asm_exc_general_protection+0x1e at ffffffff81800a0e > [exception RIP: __unfreeze_partials+149] > RIP: ffffffff8124a295 RSP: ffff888100aa7d10 RFLAGS: 00010202 > RAX: 0000000000190016 RBX: 0000000000190016 RCX: 000000017fffffff > RDX: 00000001ffffffff RSI: 0000000000000023 RDI: ffffffff81e58b10 > RBP: ffff888100aa7da0 R8: 0000000000000000 R9: 0000000000190018 > R10: ffff888100aa7db8 R11: 000000000002d9e4 R12: ffff888100190500 > R13: ffff88810018c980 R14: 00000001ffffffff R15: ffffea0004571588 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #6 [ffff888100aa7db0] put_cpu_partial+0x8e at ffffffff8124a56e > #7 [ffff888100aa7dd0] kmem_cache_free+0x3a8 at ffffffff8124d238 > #8 [ffff888100aa7e08] rcu_do_batch+0x186 at ffffffff810eb246 > #9 [ffff888100aa7e70] rcu_core+0x25f at ffffffff810eeb2f > #10 [ffff888100aa7eb0] rcu_cpu_kthread+0x94 at ffffffff810eed24 > #11 [ffff888100aa7ee0] smpboot_thread_fn+0x249 at ffffffff8109e559 > #12 [ffff888100aa7f18] kthread+0x1ac at ffffffff810984dc > #13 [ffff888100aa7f50] ret_from_fork+0x1f at ffffffff81001b1f > crash> runq > ... > CPU 3 RUNQUEUE: ffff88840ece9980 > CURRENT: PID: 32 TASK: ffff888100a56000 COMMAND: "rcuc/3" > RT PRIO_ARRAY: ffff88840ece9bc0 > [ 94] PID: 32 TASK: ffff888100a56000 COMMAND: "rcuc/3" > CFS RB_ROOT: ffff88840ece9a40 > [120] PID: 33 TASK: ffff888100a51000 COMMAND: "ksoftirqd/3" > ... > crash> bt -sx 33 > PID: 33 TASK: ffff888100a51000 CPU: 3 COMMAND: "ksoftirqd/3" > #0 [ffff888100aabdf0] __schedule+0x2d7 at ffffffff817ad3a7 > #1 [ffff888100aabec8] schedule+0x3b at ffffffff817ae4eb > #2 [ffff888100aabee0] smpboot_thread_fn+0x18c at ffffffff8109e49c > #3 [ffff888100aabf18] kthread+0x1ac at ffffffff810984dc > #4 [ffff888100aabf50] ret_from_fork+0x1f at ffffffff81001b1f > crash> So this doesn't look like our put_cpu_partial() preempted a __slab_alloc() on the same cpu, right? There might have been __slab_alloc() in irq handler preempting us, but we won't see that anymore. 
I don't immediately see the root cause, and this scenario should be possible
on !RT too, where we however didn't see these explosions.

BTW, did my ugly patch work? Thanks.
On Wed, 2021-07-21 at 10:44 +0200, Vlastimil Babka wrote:
>
> So this doesn't look like our put_cpu_partial() preempted a
> __slab_alloc() on the same cpu, right?

No, likely it was the one preempted by someone long gone, but we'll
never know without setting a trap.

> BTW did my ugly patch work?

Nope. I guess you missed my reporting it to have been a -ENOBOOT, and
that cutting it in half, ie snagging only __slab_free() does boot, and
seems to cure all of the RT fireworks.

(chainsaw noises...)

---
 mm/slub.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2551,6 +2551,8 @@ static void put_cpu_partial(struct kmem_
 	int pobjects;

 	slub_get_cpu_ptr(s->cpu_slab);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && drain)
+		local_lock(&s->cpu_slab->lock);
 	do {
 		pages = 0;
 		pobjects = 0;
@@ -2564,7 +2566,13 @@ static void put_cpu_partial(struct kmem_
 			 * partial array is full. Move the existing
 			 * set to the per node partial list.
 			 */
-			unfreeze_partials(s);
+			this_cpu_write(s->cpu_slab->partial, NULL);
+			if (IS_ENABLED(CONFIG_PREEMPT_RT) && drain)
+				local_unlock(&s->cpu_slab->lock);
+			__unfreeze_partials(s, oldpage);
+			if (IS_ENABLED(CONFIG_PREEMPT_RT) && drain)
+				local_lock(&s->cpu_slab->lock);
+
 			oldpage = NULL;
 			pobjects = 0;
 			pages = 0;
@@ -2581,6 +2589,8 @@ static void put_cpu_partial(struct kmem_

 	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
 								!= oldpage);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && drain)
+		local_unlock(&s->cpu_slab->lock);
 	slub_put_cpu_ptr(s->cpu_slab);
 #endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
On 7/21/21 11:33 AM, Mike Galbraith wrote: > On Wed, 2021-07-21 at 10:44 +0200, Vlastimil Babka wrote: >> >> So this doesn't look like our put_cpu_partial() preempted a >> __slab_alloc() on the same cpu, right? > > No, likely it was the one preempted by someone long gone, but we'll > never know without setting a trap. > >> BTW did my ugly patch work? > > Nope. I guess you missed my reporting it to have been a -ENOBOOT, and Indeed, I misunderstood it as you talking about your patch. > that cutting it in half, ie snagging only __slab_free() does boot, and > seems to cure all of the RT fireworks. OK, so depending on drain=1 makes this apply only to put_cpu_partial() called from __slab_free and not get_partial_node(). One notable difference is that in __slab_free we don't have n->list_lock locked and in get_partial_node() we do. I guess in case your list_lock is made raw again by another patch, that explains a local_lock can't nest under it. If not, then I would expect this to work (I don't think they ever nest in the opposite order, also lockdep should tell us instead of -ENOBOOT?), but might be missing something... I'd rather not nest those locks in any case. I just need to convince myself that the scenario the half-fix fixes is indeed the only one that's needed and we're not leaving there other races that are just harder to trigger... > (chainsaw noises...) > > --- > mm/slub.c | 12 +++++++++++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -2551,6 +2551,8 @@ static void put_cpu_partial(struct kmem_ > int pobjects; > > slub_get_cpu_ptr(s->cpu_slab); > + if (IS_ENABLED(CONFIG_PREEMPT_RT) && drain) > + local_lock(&s->cpu_slab->lock); > do { > pages = 0; > pobjects = 0; > @@ -2564,7 +2566,13 @@ static void put_cpu_partial(struct kmem_ > * partial array is full. Move the existing > * set to the per node partial list. > */ > - unfreeze_partials(s); > + this_cpu_write(s->cpu_slab->partial, NULL); > + if (IS_ENABLED(CONFIG_PREEMPT_RT) && drain) > + local_unlock(&s->cpu_slab->lock); > + __unfreeze_partials(s, oldpage); > + if (IS_ENABLED(CONFIG_PREEMPT_RT) && drain) > + local_lock(&s->cpu_slab->lock); > + > oldpage = NULL; > pobjects = 0; > pages = 0; > @@ -2581,6 +2589,8 @@ static void put_cpu_partial(struct kmem_ > > } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) > != oldpage); > + if (IS_ENABLED(CONFIG_PREEMPT_RT) && drain) > + local_unlock(&s->cpu_slab->lock); > slub_put_cpu_ptr(s->cpu_slab); > #endif /* CONFIG_SLUB_CPU_PARTIAL */ > } > >
On Sat, 2021-07-24 at 00:39 +0200, Vlastimil Babka wrote: > On 7/21/21 11:33 AM, Mike Galbraith wrote: > > On Wed, 2021-07-21 at 10:44 +0200, Vlastimil Babka wrote: > > > > > > So this doesn't look like our put_cpu_partial() preempted a > > > __slab_alloc() on the same cpu, right? > > > > No, likely it was the one preempted by someone long gone, but we'll > > never know without setting a trap. > > > > > BTW did my ugly patch work? > > > > Nope. I guess you missed my reporting it to have been a -ENOBOOT, and > > Indeed, I misunderstood it as you talking about your patch. > > > that cutting it in half, ie snagging only __slab_free() does boot, and > > seems to cure all of the RT fireworks. > > OK, so depending on drain=1 makes this apply only to put_cpu_partial() > called from __slab_free and not get_partial_node(). One notable > difference is that in __slab_free we don't have n->list_lock locked and > in get_partial_node() we do. I guess in case your list_lock is made raw > again by another patch, that explains a local_lock can't nest under it. > If not, then I would expect this to work (I don't think they ever nest > in the opposite order, also lockdep should tell us instead of > -ENOBOOT?), but might be missing something... RT used to convert list_lock to raw_spinlock_t, but no longer does. Whatever is going on, box does not emit a single sign of life with the full patch. > I'd rather not nest those locks in any case. I just need to convince > myself that the scenario the half-fix fixes is indeed the only one > that's needed and we're not leaving there other races that are just > harder to trigger... Yup. I can only state with confidence that the trouble I was able to easily reproduce was fixed up by serializing __slab_free(). Hopefully you'll find that's the only hole in need of plugging. -Mike
On Sat, 2021-07-24 at 00:39 +0200, Vlastimil Babka wrote: > > If not, then I would expect this to work (I don't think they ever nest > in the opposite order, also lockdep should tell us instead of > -ENOBOOT?), but might be missing something... Yeah, like #ifndef CONFIG_PREMPT_RT at the bottom of the loop that our useless damn eyeballs auto-correct instead of reporting :) -Mike
On 7/25/21 4:09 PM, Mike Galbraith wrote: > On Sat, 2021-07-24 at 00:39 +0200, Vlastimil Babka wrote: >> >> If not, then I would expect this to work (I don't think they ever nest >> in the opposite order, also lockdep should tell us instead of >> -ENOBOOT?), but might be missing something... > > Yeah, like #ifndef CONFIG_PREMPT_RT at the bottom of the loop that our > useless damn eyeballs auto-correct instead of reporting :) Well doh, good catch. Hope fixing that helps then? I've so far convinced myself that the full patch will be needed - in my theory the ->partial cmpxchg in put_cpu_partial() itself can race with assignment in ___slab_alloc() in a way that will not cause crashes, but will leak slab pages. > -Mike >
On Sun, 2021-07-25 at 16:16 +0200, Vlastimil Babka wrote: > On 7/25/21 4:09 PM, Mike Galbraith wrote: > > On Sat, 2021-07-24 at 00:39 +0200, Vlastimil Babka wrote: > > > > > > If not, then I would expect this to work (I don't think they ever nest > > > in the opposite order, also lockdep should tell us instead of > > > -ENOBOOT?), but might be missing something... > > > > Yeah, like #ifndef CONFIG_PREMPT_RT at the bottom of the loop that our > > useless damn eyeballs auto-correct instead of reporting :) > > Well doh, good catch. I never did see it. I got sick of saying "but but but", and did make mm/slub.i, which made it glow. > Hope fixing that helps then? Yeah, though RT should perhaps be pinned across release/re-acquire? Actually, local locks should rediscover the recursion handling skills they long had so such RT specific hole poking isn't necessary. There previously would have been no ifdef+typo there for eyeballs to miss and miss and miss. -Mike
On 7/25/21 5:02 PM, Mike Galbraith wrote: > On Sun, 2021-07-25 at 16:16 +0200, Vlastimil Babka wrote: >> On 7/25/21 4:09 PM, Mike Galbraith wrote: >>> On Sat, 2021-07-24 at 00:39 +0200, Vlastimil Babka wrote: >>>> >>>> If not, then I would expect this to work (I don't think they ever nest >>>> in the opposite order, also lockdep should tell us instead of >>>> -ENOBOOT?), but might be missing something... >>> >>> Yeah, like #ifndef CONFIG_PREMPT_RT at the bottom of the loop that our >>> useless damn eyeballs auto-correct instead of reporting :) >> >> Well doh, good catch. > > I never did see it. I got sick of saying "but but but", and did make > mm/slub.i, which made it glow. Glad you did. >> Hope fixing that helps then? > > Yeah, though RT should perhaps be pinned across release/re-acquire? Probably not necessary, this_cpu_cmpxchg() will effectively recognize being moved to a different CPU. Might also move __unfreeze_partials() out of the whole loop to avoid the relock. Yeah that should be better. > Actually, local locks should rediscover the recursion handling skills > they long had so such RT specific hole poking isn't necessary. There > previously would have been no ifdef+typo there for eyeballs to miss and > miss and miss. Hm, now I'm realizing that local_lock() on !RT is just preempt_disable(), i.e. equivalent to get_cpu_ptr(), so some of the ifdeffery could go away? > -Mike >
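For readers following along, a rough, paraphrased sketch of what the
local_lock primitives boil down to in this era (based on
include/linux/local_lock_internal.h around v5.14 plus the RT tree, with the
lockdep/debug hooks omitted; treat it as an approximation, not the exact
source):

#ifndef CONFIG_PREEMPT_RT
/* !RT: a local_lock is only a preemption / interrupt control point. */
#define __local_lock(lock)			preempt_disable()
#define __local_unlock(lock)			preempt_enable()
#define __local_lock_irqsave(lock, flags)	local_irq_save(flags)
#define __local_unlock_irqrestore(lock, flags)	local_irq_restore(flags)
#else
/*
 * RT: local_lock_t wraps a per-CPU spinlock_t (an rtmutex-based sleeping
 * lock), so it provides real mutual exclusion; note the _irqsave variant
 * does not actually disable interrupts here.
 */
#define __local_lock(lock)				\
	do {						\
		migrate_disable();			\
		spin_lock(this_cpu_ptr(lock));		\
	} while (0)
#define __local_lock_irqsave(lock, flags)		\
	do {						\
		typecheck(unsigned long, flags);	\
		flags = 0;				\
		__local_lock(lock);			\
	} while (0)
#endif

So on !RT a plain local_lock() really is just preempt_disable() plus lockdep
bookkeeping, which is what motivates dropping the ifdeffery; the catch, noted
in the follow-up mails, is that lockdep on !RT still objects when a
local_lock() section can be interrupted by a local_lock_irqsave() one.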
On 7/25/21 6:27 PM, Vlastimil Babka wrote:
>>> Hope fixing that helps then?
>>
>> Yeah, though RT should perhaps be pinned across release/re-acquire?
>
> Probably not necessary, this_cpu_cmpxchg() will effectively recognize
> being moved to a different CPU.
> Might also move __unfreeze_partials() out of the whole loop to avoid the
> relock. Yeah that should be better.
>
>> Actually, local locks should rediscover the recursion handling skills
>> they long had so such RT specific hole poking isn't necessary. There
>> previously would have been no ifdef+typo there for eyeballs to miss and
>> miss and miss.
>
> Hm, now I'm realizing that local_lock() on !RT is just
> preempt_disable(), i.e. equivalent to get_cpu_ptr(), so some of the
> ifdeffery could go away?

How much will this explode? Thanks.

----8<----
From 99808b198bdf867951131bb9d1ca1bd1cd12b8c4 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Fri, 23 Jul 2021 23:17:18 +0200
Subject: [PATCH] PREEMPT_RT+SLUB_CPU_PARTIAL fix attempt

---
 mm/slub.c | 34 +++++++++++++++++++++++++++++++---
 1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 581004a5aca9..d12a50b5ee6f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2437,13 +2437,19 @@
 static void unfreeze_partials(struct kmem_cache *s)
 {
 	struct page *partial_page;

+#ifndef CONFIG_PREEMPT_RT
 	do {
 		partial_page = this_cpu_read(s->cpu_slab->partial);

 	} while (partial_page &&
 		 this_cpu_cmpxchg(s->cpu_slab->partial, partial_page, NULL)
 				  != partial_page);
-
+#else
+	local_lock(&s->cpu_slab->lock);
+	partial_page = this_cpu_read(s->cpu_slab->partial);
+	this_cpu_write(s->cpu_slab->partial, NULL);
+	local_unlock(&s->cpu_slab->lock);
+#endif
 	if (partial_page)
 		__unfreeze_partials(s, partial_page);
 }
@@ -2479,10 +2485,15 @@
 static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct page *oldpage;
+	struct page *page_to_unfreeze = NULL;
 	int pages;
 	int pobjects;

-	slub_get_cpu_ptr(s->cpu_slab);
+	/*
+	 * On !RT we just want to disable preemption, on RT we need the lock
+	 * for real. This happens to match local_lock() semantics.
+	 */
+	local_lock(&s->cpu_slab->lock);
 	do {
 		pages = 0;
 		pobjects = 0;
@@ -2496,7 +2507,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 			 * partial array is full. Move the existing
 			 * set to the per node partial list.
 			 */
+#ifndef CONFIG_PREEMPT_RT
 			unfreeze_partials(s);
+#else
+			/*
+			 * Postpone unfreezing until we drop the local
+			 * lock to avoid relocking.
+			 */
+			page_to_unfreeze = oldpage;
+#endif
+
 			oldpage = NULL;
 			pobjects = 0;
 			pages = 0;
@@ -2511,9 +2531,17 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 		page->pobjects = pobjects;
 		page->next = oldpage;

+#ifndef CONFIG_PREEMPT_RT
 	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
 								!= oldpage);
-	slub_put_cpu_ptr(s->cpu_slab);
+#else
+	this_cpu_write(s->cpu_slab->partial, page);
+	} while (false);
+#endif
+
+	local_unlock(&s->cpu_slab->lock);
+	if (page_to_unfreeze)
+		__unfreeze_partials(s, page_to_unfreeze);
 #endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
--
2.32.0
On 7/25/21 9:12 PM, Vlastimil Babka wrote: > + /* > + * On !RT we just want to disable preemption, on RT we need the lock > + * for real. This happens to match local_lock() semantics. > + */ > + local_lock(&s->cpu_slab->lock); OK I realized (and tglx confirmed) that this will blow up on !RT + lockdep if interrupted by ___slab_alloc() that will do local_lock_irqsave(). So back to #ifdefs it is. But should work as-is for RT testing.
On Sun, 2021-07-25 at 21:34 +0200, Vlastimil Babka wrote: > On 7/25/21 9:12 PM, Vlastimil Babka wrote: > > + /* > > + * On !RT we just want to disable preemption, on RT we need > > the lock > > + * for real. This happens to match local_lock() semantics. > > + */ > > + local_lock(&s->cpu_slab->lock); > > OK I realized (and tglx confirmed) that this will blow up on !RT + > lockdep if interrupted by ___slab_alloc() that will do > local_lock_irqsave(). So back to #ifdefs it is. But should work as-is > for RT testing. I called it off after 3 1/2 hours. Bug blew box outta the water 10 times in about half of that, so methinks it's safe to call it dead. -Mike
On Sun, 2021-07-25 at 21:34 +0200, Vlastimil Babka wrote: > On 7/25/21 9:12 PM, Vlastimil Babka wrote: > > + /* > > + * On !RT we just want to disable preemption, on RT we need > > the lock > > + * for real. This happens to match local_lock() semantics. > > + */ > > + local_lock(&s->cpu_slab->lock); > > OK I realized (and tglx confirmed) that this will blow up on !RT + > lockdep if interrupted by ___slab_alloc() that will do > local_lock_irqsave(). So back to #ifdefs it is. But should work as-is > for RT testing. Speaking of that local_lock_irqsave(), and some unloved ifdefs.. Why not do something like the below? When I look at new_slab:, I see cpu_slab->partial assignment protected by IRQs being disabled, which implies to me it should probably be so protected everywhere. There used to be another slub_set_percpu_partial() call in unfreeze_partials(), which was indeed called with IRQs disabled, quite sane looking to an mm outsider looking in. The odd man out ->partial assignment was the preempt disabled put_cpu_partial() cmpxchg loop, which contained an IRQ disabled region to accommodate the aforementioned unfreeze_partials(). Is there real world benefit to the cmpxchg loops whacked below (ala monkey see monkey do) over everyone just taking the straight shot you laid down for RT? It's easier on the eye (mine anyway), and neither PREEMPT or PREEMPT_RT seem inclined to complain... tick... tock... --- mm/slub.c | 79 ++++++++++++++++++++++++++++++++------------------------------ 1 file changed, 41 insertions(+), 38 deletions(-) --- a/mm/slub.c +++ b/mm/slub.c @@ -2437,13 +2437,12 @@ static void __unfreeze_partials(struct k static void unfreeze_partials(struct kmem_cache *s) { struct page *partial_page; + unsigned long flags; - do { - partial_page = this_cpu_read(s->cpu_slab->partial); - - } while (partial_page && - this_cpu_cmpxchg(s->cpu_slab->partial, partial_page, NULL) - != partial_page); + local_lock_irqsave(&s->cpu_slab->lock, flags); + partial_page = this_cpu_read(s->cpu_slab->partial); + this_cpu_write(s->cpu_slab->partial, NULL); + local_unlock_irqrestore(&s->cpu_slab->lock, flags); if (partial_page) __unfreeze_partials(s, partial_page); @@ -2480,41 +2479,45 @@ static void put_cpu_partial(struct kmem_ { #ifdef CONFIG_SLUB_CPU_PARTIAL struct page *oldpage; - int pages; - int pobjects; - - slub_get_cpu_ptr(s->cpu_slab); - do { - pages = 0; - pobjects = 0; - oldpage = this_cpu_read(s->cpu_slab->partial); - - if (oldpage) { - pobjects = oldpage->pobjects; - pages = oldpage->pages; - if (drain && pobjects > slub_cpu_partial(s)) { - /* - * partial array is full. Move the existing - * set to the per node partial list. - */ - unfreeze_partials(s); - oldpage = NULL; - pobjects = 0; - pages = 0; - stat(s, CPU_PARTIAL_DRAIN); - } + struct page *page_to_unfreeze = NULL; + unsigned long flags; + int pages = 0, pobjects = 0; + + local_lock_irqsave(&s->cpu_slab->lock, flags); + + if (oldpage = this_cpu_read(s->cpu_slab->partial)) { + pobjects = oldpage->pobjects; + pages = oldpage->pages; + if (drain && pobjects > slub_cpu_partial(s)) { + /* + * partial array is full. Move the existing + * set to the per node partial list. + * + * Postpone unfreezing until we drop the local + * lock to avoid an RT unlock/relock requirement + * due to MEMCG __slab_free() recursion. 
+ */ + page_to_unfreeze = oldpage; + + oldpage = NULL; + pobjects = 0; + pages = 0; + stat(s, CPU_PARTIAL_DRAIN); } + } + + pages++; + pobjects += page->objects - page->inuse; + + page->pages = pages; + page->pobjects = pobjects; + page->next = oldpage; - pages++; - pobjects += page->objects - page->inuse; + this_cpu_write(s->cpu_slab->partial, page); + local_unlock_irqrestore(&s->cpu_slab->lock, flags); - page->pages = pages; - page->pobjects = pobjects; - page->next = oldpage; - - } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) - != oldpage); - slub_put_cpu_ptr(s->cpu_slab); + if (page_to_unfreeze) + __unfreeze_partials(s, page_to_unfreeze); #endif /* CONFIG_SLUB_CPU_PARTIAL */ }
On 7/26/21 7:00 PM, Mike Galbraith wrote: > On Sun, 2021-07-25 at 21:34 +0200, Vlastimil Babka wrote: >> On 7/25/21 9:12 PM, Vlastimil Babka wrote: >>> + /* >>> + * On !RT we just want to disable preemption, on RT we need >>> the lock >>> + * for real. This happens to match local_lock() semantics. >>> + */ >>> + local_lock(&s->cpu_slab->lock); >> >> OK I realized (and tglx confirmed) that this will blow up on !RT + >> lockdep if interrupted by ___slab_alloc() that will do >> local_lock_irqsave(). So back to #ifdefs it is. But should work as-is >> for RT testing. > > Speaking of that local_lock_irqsave(), and some unloved ifdefs.. > > Why not do something like the below? When I look at new_slab:, I see > cpu_slab->partial assignment protected by IRQs being disabled, which > implies to me it should probably be so protected everywhere. There > used to be another slub_set_percpu_partial() call in > unfreeze_partials(), which was indeed called with IRQs disabled, quite > sane looking to an mm outsider looking in. The odd man out ->partial > assignment was the preempt disabled put_cpu_partial() cmpxchg loop, > which contained an IRQ disabled region to accommodate the > aforementioned unfreeze_partials(). > > Is there real world benefit to the cmpxchg loops whacked below (ala > monkey see monkey do) over everyone just taking the straight shot you > laid down for RT? It's easier on the eye (mine anyway), and neither > PREEMPT or PREEMPT_RT seem inclined to complain... tick... tock... Yep, sounds like a good approach, thanks. Percpu partial is not *the* SLUB fast path, so it should be sufficient without the lockless cmpxchg tricks. Will incorporate in updated series. > > --- > mm/slub.c | 79 ++++++++++++++++++++++++++++++++------------------------------ > 1 file changed, 41 insertions(+), 38 deletions(-) > > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -2437,13 +2437,12 @@ static void __unfreeze_partials(struct k > static void unfreeze_partials(struct kmem_cache *s) > { > struct page *partial_page; > + unsigned long flags; > > - do { > - partial_page = this_cpu_read(s->cpu_slab->partial); > - > - } while (partial_page && > - this_cpu_cmpxchg(s->cpu_slab->partial, partial_page, NULL) > - != partial_page); > + local_lock_irqsave(&s->cpu_slab->lock, flags); > + partial_page = this_cpu_read(s->cpu_slab->partial); > + this_cpu_write(s->cpu_slab->partial, NULL); > + local_unlock_irqrestore(&s->cpu_slab->lock, flags); > > if (partial_page) > __unfreeze_partials(s, partial_page); > @@ -2480,41 +2479,45 @@ static void put_cpu_partial(struct kmem_ > { > #ifdef CONFIG_SLUB_CPU_PARTIAL > struct page *oldpage; > - int pages; > - int pobjects; > - > - slub_get_cpu_ptr(s->cpu_slab); > - do { > - pages = 0; > - pobjects = 0; > - oldpage = this_cpu_read(s->cpu_slab->partial); > - > - if (oldpage) { > - pobjects = oldpage->pobjects; > - pages = oldpage->pages; > - if (drain && pobjects > slub_cpu_partial(s)) { > - /* > - * partial array is full. Move the existing > - * set to the per node partial list. > - */ > - unfreeze_partials(s); > - oldpage = NULL; > - pobjects = 0; > - pages = 0; > - stat(s, CPU_PARTIAL_DRAIN); > - } > + struct page *page_to_unfreeze = NULL; > + unsigned long flags; > + int pages = 0, pobjects = 0; > + > + local_lock_irqsave(&s->cpu_slab->lock, flags); > + > + if (oldpage = this_cpu_read(s->cpu_slab->partial)) { > + pobjects = oldpage->pobjects; > + pages = oldpage->pages; > + if (drain && pobjects > slub_cpu_partial(s)) { > + /* > + * partial array is full. 
Move the existing > + * set to the per node partial list. > + * > + * Postpone unfreezing until we drop the local > + * lock to avoid an RT unlock/relock requirement > + * due to MEMCG __slab_free() recursion. > + */ > + page_to_unfreeze = oldpage; > + > + oldpage = NULL; > + pobjects = 0; > + pages = 0; > + stat(s, CPU_PARTIAL_DRAIN); > } > + } > + > + pages++; > + pobjects += page->objects - page->inuse; > + > + page->pages = pages; > + page->pobjects = pobjects; > + page->next = oldpage; > > - pages++; > - pobjects += page->objects - page->inuse; > + this_cpu_write(s->cpu_slab->partial, page); > + local_unlock_irqrestore(&s->cpu_slab->lock, flags); > > - page->pages = pages; > - page->pobjects = pobjects; > - page->next = oldpage; > - > - } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) > - != oldpage); > - slub_put_cpu_ptr(s->cpu_slab); > + if (page_to_unfreeze) > + __unfreeze_partials(s, page_to_unfreeze); > #endif /* CONFIG_SLUB_CPU_PARTIAL */ > } > > >
On Mon, 2021-07-26 at 23:26 +0200, Vlastimil Babka wrote: > On 7/26/21 7:00 PM, Mike Galbraith wrote: > > > > Why not do something like the below?... > > Yep, sounds like a good approach, thanks. Percpu partial is not *the* > SLUB fast path, so it should be sufficient without the lockless cmpxchg > tricks. Will incorporate in updated series. Great, my >= 5.13 trees will meanwhile wear it like so: From: Vlastimil Babka <vbabka@suse.cz> Date: Fri, 23 Jul 2021 23:17:18 +0200 mm, slub: Fix PREEMPT_RT plus SLUB_CPU_PARTIAL local exclusion See https://lkml.org/lkml/2021/7/25/185 Mike: Remove ifdefs, make all configs take the straight line path layed out for RT by Vlastimil in his prospective (now confirmed) fix. Signed-off-by: Mike Galbraith <efault@gmx.de> --- mm/slub.c | 79 ++++++++++++++++++++++++++++++++------------------------------ 1 file changed, 41 insertions(+), 38 deletions(-) --- a/mm/slub.c +++ b/mm/slub.c @@ -2437,13 +2437,12 @@ static void __unfreeze_partials(struct k static void unfreeze_partials(struct kmem_cache *s) { struct page *partial_page; + unsigned long flags; - do { - partial_page = this_cpu_read(s->cpu_slab->partial); - - } while (partial_page && - this_cpu_cmpxchg(s->cpu_slab->partial, partial_page, NULL) - != partial_page); + local_lock_irqsave(&s->cpu_slab->lock, flags); + partial_page = this_cpu_read(s->cpu_slab->partial); + this_cpu_write(s->cpu_slab->partial, NULL); + local_unlock_irqrestore(&s->cpu_slab->lock, flags); if (partial_page) __unfreeze_partials(s, partial_page); @@ -2480,41 +2479,45 @@ static void put_cpu_partial(struct kmem_ { #ifdef CONFIG_SLUB_CPU_PARTIAL struct page *oldpage; - int pages; - int pobjects; - - slub_get_cpu_ptr(s->cpu_slab); - do { - pages = 0; - pobjects = 0; - oldpage = this_cpu_read(s->cpu_slab->partial); - - if (oldpage) { - pobjects = oldpage->pobjects; - pages = oldpage->pages; - if (drain && pobjects > slub_cpu_partial(s)) { - /* - * partial array is full. Move the existing - * set to the per node partial list. - */ - unfreeze_partials(s); - oldpage = NULL; - pobjects = 0; - pages = 0; - stat(s, CPU_PARTIAL_DRAIN); - } + struct page *page_to_unfreeze = NULL; + unsigned long flags; + int pages = 0, pobjects = 0; + + local_lock_irqsave(&s->cpu_slab->lock, flags); + + if (oldpage = this_cpu_read(s->cpu_slab->partial)) { + pobjects = oldpage->pobjects; + pages = oldpage->pages; + if (drain && pobjects > slub_cpu_partial(s)) { + /* + * partial array is full. Move the existing + * set to the per node partial list. + * + * Postpone unfreezing until we drop the local + * lock to avoid an RT unlock/relock requirement + * due to MEMCG __slab_free() recursion. + */ + page_to_unfreeze = oldpage; + + oldpage = NULL; + pobjects = 0; + pages = 0; + stat(s, CPU_PARTIAL_DRAIN); } + } + + pages++; + pobjects += page->objects - page->inuse; + + page->pages = pages; + page->pobjects = pobjects; + page->next = oldpage; - pages++; - pobjects += page->objects - page->inuse; + this_cpu_write(s->cpu_slab->partial, page); + local_unlock_irqrestore(&s->cpu_slab->lock, flags); - page->pages = pages; - page->pobjects = pobjects; - page->next = oldpage; - - } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) - != oldpage); - slub_put_cpu_ptr(s->cpu_slab); + if (page_to_unfreeze) + __unfreeze_partials(s, page_to_unfreeze); #endif /* CONFIG_SLUB_CPU_PARTIAL */ }
On 7/27/21 6:09 AM, Mike Galbraith wrote: > On Mon, 2021-07-26 at 23:26 +0200, Vlastimil Babka wrote: >> On 7/26/21 7:00 PM, Mike Galbraith wrote: >>> >>> Why not do something like the below?... >> >> Yep, sounds like a good approach, thanks. Percpu partial is not *the* >> SLUB fast path, so it should be sufficient without the lockless cmpxchg >> tricks. Will incorporate in updated series. The updated series incorporating hopefully all fixes from Mike and bigeasy, and rebased to 5.14-rc3 (Thomas told me RT is moving to it), is here: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v3r0 I'll post it to linux-mm tomorrow.
On Wed, 2021-07-28 at 18:59 +0200, Vlastimil Babka wrote: > On 7/27/21 6:09 AM, Mike Galbraith wrote: > > On Mon, 2021-07-26 at 23:26 +0200, Vlastimil Babka wrote: > > > On 7/26/21 7:00 PM, Mike Galbraith wrote: > > > > > > > > Why not do something like the below?... > > > > > > Yep, sounds like a good approach, thanks. Percpu partial is not *the* > > > SLUB fast path, so it should be sufficient without the lockless cmpxchg > > > tricks. Will incorporate in updated series. > > The updated series incorporating hopefully all fixes from Mike and > bigeasy, and rebased to 5.14-rc3 (Thomas told me RT is moving to it), is > here: > > https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v3r0 I had to resurrect the hunk below to build with lockdep, but modulo dinky speedbump, the same RT testdrive that previously exploded was as entertainment free as such testing is supposed to be. --- mm/slub.c | 4 ++++ 1 file changed, 4 insertions(+) --- a/mm/slub.c +++ b/mm/slub.c @@ -2890,7 +2890,11 @@ static void *___slab_alloc(struct kmem_c load_freelist: +#ifdef CONFIG_PREEMPT_RT + lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock.lock)); +#else lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock)); +#endif /* * freelist is pointing to the list of objects to be used.
On 7/29/21 6:51 AM, Mike Galbraith wrote: > On Wed, 2021-07-28 at 18:59 +0200, Vlastimil Babka wrote: >> On 7/27/21 6:09 AM, Mike Galbraith wrote: >> > On Mon, 2021-07-26 at 23:26 +0200, Vlastimil Babka wrote: >> > > On 7/26/21 7:00 PM, Mike Galbraith wrote: >> > > > >> > > > Why not do something like the below?... >> > > >> > > Yep, sounds like a good approach, thanks. Percpu partial is not *the* >> > > SLUB fast path, so it should be sufficient without the lockless cmpxchg >> > > tricks. Will incorporate in updated series. >> >> The updated series incorporating hopefully all fixes from Mike and >> bigeasy, and rebased to 5.14-rc3 (Thomas told me RT is moving to it), is >> here: >> >> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v3r0 > > I had to resurrect the hunk below to build with lockdep, but modulo > dinky speedbump, the same RT testdrive that previously exploded was as > entertainment free as such testing is supposed to be. Ah forgot about that, I'll include it too. Thanks for testing! > --- > mm/slub.c | 4 ++++ > 1 file changed, 4 insertions(+) > > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -2890,7 +2890,11 @@ static void *___slab_alloc(struct kmem_c > > load_freelist: > > +#ifdef CONFIG_PREEMPT_RT > + lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock.lock)); > +#else > lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock)); > +#endif > > /* > * freelist is pointing to the list of objects to be used. > >
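The reason the assert above needs the extra `.lock` hop on RT: the lockdep
map sits in a different member depending on the config. A sketch, under the
assumption that local_lock_t in the RT tree of that era is defined roughly as
below (on !RT the dep_map lives directly in the local_lock_t, on RT it lives
inside the wrapped spinlock_t):

#ifndef CONFIG_PREEMPT_RT
typedef struct {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map	dep_map;	/* lockdep_assert_held(&...->lock) */
	struct task_struct	*owner;
#endif
} local_lock_t;
#else
typedef struct {
	/* dep_map is inside the spinlock_t: lockdep_assert_held(&...->lock.lock) */
	spinlock_t	lock;
} local_lock_t;
#endif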
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2497,7 +2497,9 @@ static void put_cpu_partial(struct kmem_
 				 * partial array is full. Move the existing
 				 * set to the per node partial list.
 				 */
+				local_lock(&s->cpu_slab->lock);
 				unfreeze_partials(s);
+				local_unlock(&s->cpu_slab->lock);
 				oldpage = NULL;
 				pobjects = 0;
 				pages = 0;
@@ -2579,7 +2581,9 @@ static void flush_cpu_slab(struct work_s
 	if (c->page)
 		flush_slab(s, c, true);

+	local_lock(&s->cpu_slab->lock);
 	unfreeze_partials(s);
+	local_unlock(&s->cpu_slab->lock);
 }

 static bool has_cpu_slab(int cpu, struct kmem_cache *s)
@@ -2632,8 +2636,11 @@ static int slub_cpu_dead(unsigned int cp
 	struct kmem_cache *s;

 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list)
+	list_for_each_entry(s, &slab_caches, list) {
+		local_lock(&s->cpu_slab->lock);
 		__flush_cpu_slab(s, cpu);
+		local_unlock(&s->cpu_slab->lock);
+	}
 	mutex_unlock(&slab_mutex);
 	return 0;
 }