Message ID: 1507296882-18721-1-git-send-email-will.deacon@arm.com
Series: Switch arm64 over to qrwlock
On Fri, Oct 06, 2017 at 02:34:37PM +0100, Will Deacon wrote:
> Hi all,
>
> This is version two of the patches I posted yesterday:
>
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
>
> I'd normally leave it longer before posting again, but Peter had a good
> suggestion to rework the layout of the lock word, so I wanted to post a
> version that follows that approach.
>
> I've updated my branch if you're after the full patch stack:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git qrwlock
>
> As before, all comments (particularly related to testing and performance)
> welcome!
>
> Cheers,
>
> Will

Hi Will,

I tested your patches with locktorture and found a measurable performance
regression. I also respun Jan Glauber's patch [1], and additionally tried
Jan's patch together with patch 5 from this series. The numbers differ a
lot from my previous measurements, but since then I have changed
workstations and now use qemu with support for parallel threads.

                  Spinlock     Read-RW lock   Write-RW lock
Vanilla:          129804626    12340895       14716138
This series:      113718002    10982159       13068934
Jan's patch:      117977108    11363462       13615449
Jan's patch + #5: 121483176    11696728       13618967

The bottom line of discussion [1] was that queued locks are more effective
when the SoC has many CPUs, and 4 is not many. My measurements were made
on a 4-CPU machine, and they seem to confirm that. Does it make sense to
make queued locks the default for many-CPU machines only?

There were 2 preparatory patches in the series:

  [PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock
and
  [PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h

The 1st patch is not needed anymore because Babu Moger submitted a similar
patch that is already in mainline: 9ab6055f95903 ("kernel/locking: Fix
compile error with qrwlock.c"). Could you revisit the second patch?

[1] https://lkml.org/lkml/2017/5/3/330

Yury
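[Editorial note: the size of the regression Yury reports is easier to judge as relative numbers. The short script below is illustrative only; the figures are copied from the table above, and it simply computes the percentage drop in locktorture acquisition counts versus vanilla for each configuration.]

```python
# Percentage drop vs vanilla, per lock type, for each tested configuration.
# All raw numbers are taken from Yury's locktorture table above.
vanilla = {"spin": 129_804_626, "read": 12_340_895, "write": 14_716_138}
configs = {
    "this series":      {"spin": 113_718_002, "read": 10_982_159, "write": 13_068_934},
    "Jan's patch":      {"spin": 117_977_108, "read": 11_363_462, "write": 13_615_449},
    "Jan's patch + #5": {"spin": 121_483_176, "read": 11_696_728, "write": 13_618_967},
}

def drop_pct(results):
    """Percentage drop relative to vanilla for each lock type."""
    return {k: round(100 * (vanilla[k] - v) / vanilla[k], 1)
            for k, v in results.items()}

for name, results in configs.items():
    # Roughly a 5-12% drop across the board on this 4-CPU qemu setup.
    print(name, drop_pct(results))
```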
On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> The bottomline of discussion [1] was that queued locks are more
> effective when SoC has many CPUs. And 4 is not many.

qspinlock, yes. qrwlock not, as it fully depends on arch_spinlock_t for
the queueing. qrwlock is just a generic rwlock_t implementation.
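[Editorial note: Peter's distinction can be made concrete with a sketch. The C11 program below is an illustration only, not the kernel's kernel/locking/qrwlock.c: the names, the lock-word layout, and the slowpaths are simplified assumptions, and an `atomic_flag` test-and-set lock stands in for `arch_spinlock_t`. It shows the key point that the rwlock word itself only packs a writer byte and a reader count, while contended lockers queue on the underlying spinlock — so fairness is inherited from whatever that spinlock provides.]

```c
#include <stdatomic.h>

/* Illustrative sketch of a queued rwlock (not the kernel's real code). */

#define QRW_WRITER 0x0ffU   /* low byte: a writer holds the lock     */
#define QRW_READER 0x100U   /* one reader, counted above that byte   */

struct qrw_sketch {
    atomic_uint cnts;       /* reader count | writer byte            */
    atomic_flag wait_lock;  /* stand-in for the queueing arch spinlock */
};

static void qrw_wait_lock(struct qrw_sketch *l)
{
    while (atomic_flag_test_and_set(&l->wait_lock))
        ;                   /* contended lockers serialise (queue) here */
}

static void qrw_wait_unlock(struct qrw_sketch *l)
{
    atomic_flag_clear(&l->wait_lock);
}

static void qrw_read_lock(struct qrw_sketch *l)
{
    unsigned int cnts = atomic_fetch_add(&l->cnts, QRW_READER);
    if (!(cnts & QRW_WRITER))
        return;             /* fast path: no writer present */
    /* Slow path: undo the increment and queue behind other waiters. */
    atomic_fetch_sub(&l->cnts, QRW_READER);
    qrw_wait_lock(l);
    atomic_fetch_add(&l->cnts, QRW_READER);
    while (atomic_load(&l->cnts) & QRW_WRITER)
        ;                   /* wait for the writer to drop the lock */
    qrw_wait_unlock(l);
}

static void qrw_read_unlock(struct qrw_sketch *l)
{
    atomic_fetch_sub(&l->cnts, QRW_READER);
}

static void qrw_write_lock(struct qrw_sketch *l)
{
    unsigned int idle = 0;
    if (atomic_compare_exchange_strong(&l->cnts, &idle, QRW_WRITER))
        return;             /* fast path: the lock word was idle */
    /* Slow path: queue, then wait until neither readers nor a writer remain. */
    qrw_wait_lock(l);
    do {
        idle = 0;
    } while (!atomic_compare_exchange_strong(&l->cnts, &idle, QRW_WRITER));
    qrw_wait_unlock(l);
}

static void qrw_write_unlock(struct qrw_sketch *l)
{
    atomic_fetch_sub(&l->cnts, QRW_WRITER);
}
```

Because every contended locker serialises on `wait_lock`, the rwlock inherits the fairness of the underlying spinlock; on arm64 that would be the ticket lock's FIFO ordering, which is exactly why Peter says the CPU-count argument applies to qspinlock rather than qrwlock.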
Hi Yury,

On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> On Fri, Oct 06, 2017 at 02:34:37PM +0100, Will Deacon wrote:
> > This is version two of the patches I posted yesterday:
> > [...]
>
> I tested your patches with locktorture and found measurable performance
> regression. I also respin the patch of Jan Glauber [1], and I also
> tried Jan's patch with patch 5 from this series. Numbers differ a lot
> from my previous measurements, but since that I changed working
> station and use qemu with the support of parallel threads.
>
>                 Spinlock    Read-RW lock  Write-RW lock
> Vanilla:        129804626   12340895      14716138
> This series:    113718002   10982159      13068934
> Jan patch:      117977108   11363462      13615449
> Jan patch + #5: 121483176   11696728      13618967
>
> The bottomline of discussion [1] was that queued locks are more
> effective when SoC has many CPUs. And 4 is not many. My measurement
> was made on the 4-CPU machine, and it seems it confirms that. Does
> it make sense to make queued locks default for many-CPU machines only?

Just to confirm, you're running this under qemu on an x86 host, using full
AArch64 system emulation? If so, I really don't think we should base the
merits of qrwlocks on arm64 around this type of configuration. Given that
you work for a silicon vendor, could you try running on real arm64
hardware instead, please?

My measurements on 6-core and 8-core systems look a lot better with
qrwlock than what we currently have in mainline, and they also fix a real
starvation issue reported by Jeremy [1].

I'd also add that lock fairness comes at a cost, so I'd expect a small
drop in total throughput for some workloads. I encourage you to try
passing different arguments to locktorture to see this in action. For
example, on an 8-core machine:

  # insmod ./locktorture.ko nwriters_stress=2 nreaders_stress=8 torture_type="rw_lock_irq" stat_interval=2

-rc3:

  Writes:  Total: 6612     Max/Min: 0/0  Fail: 0
  Reads :  Total: 1265230  Max/Min: 0/0  Fail: 0
  Writes:  Total: 6709     Max/Min: 0/0  Fail: 0
  Reads :  Total: 1916418  Max/Min: 0/0  Fail: 0
  Writes:  Total: 6725     Max/Min: 0/0  Fail: 0
  Reads :  Total: 5103727  Max/Min: 0/0  Fail: 0

Notice how the writers are really struggling here (you only have to tweak
a bit more and you get RCU stalls, lose interrupts, etc.).

With the qrwlock:

  Writes:  Total: 47962    Max/Min: 0/0  Fail: 0
  Reads :  Total: 277903   Max/Min: 0/0  Fail: 0
  Writes:  Total: 100151   Max/Min: 0/0  Fail: 0
  Reads :  Total: 525781   Max/Min: 0/0  Fail: 0
  Writes:  Total: 155284   Max/Min: 0/0  Fail: 0
  Reads :  Total: 767703   Max/Min: 0/0  Fail: 0

which is an awful lot better for maximum latency and fairness, despite the
much lower reader count.

> There were 2 preparing patches in the series:
> [PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock
> and
> [PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h
>
> 1st patch is not needed anymore because Babu Moger submitted similar patch that
> is already in mainline: 9ab6055f95903 ("kernel/locking: Fix compile error with
> qrwlock.c"). Could you revisit second patch?

Sorry, not sure what you're asking me to do here.

Will

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534299.html
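[Editorial note: the fairness difference in Will's two runs is easiest to see as reads-per-write ratios. The small script below is illustrative only; the (writes, reads) pairs are copied from the locktorture output above, and it computes how heavily readers dominate in each sampling interval.]

```python
# Reader/writer acquisition skew from Will's locktorture runs above
# (2 writers, 8 readers, rw_lock_irq). Lower skew means a fairer lock.
rc3 = [(6612, 1265230), (6709, 1916418), (6725, 5103727)]
qrw = [(47962, 277903), (100151, 525781), (155284, 767703)]

def skew(samples):
    # reads completed per write in each stat interval
    return [round(reads / writes, 1) for writes, reads in samples]

print(skew(rc3))  # readers outpace writers by hundreds to one
print(skew(qrw))  # with qrwlock the skew drops to single digits
```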
On Mon, Oct 09, 2017 at 08:52:43AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> > The bottomline of discussion [1] was that queued locks are more
> > effective when SoC has many CPUs. And 4 is not many.
>
> qspinlock, yes. qrwlock not, as it fully depends on arch_spinlock_t for
> the queueing. qrwlock is just a generic rwlock_t implementation.

Yup, and once I've knocked qrwlocks on the head I'll go take a look at
qspinlock. Either way, I'll keep our ticket implementation around because
(a) it's a tonne simpler, (b) I don't have data to suggest that it sucks,
and (c) there's been formal work to show that various parts of it are
correct. rwlock, on the other hand, has been shown to be broken, I know
that it sucks, and there's not been any formal work, so I'll be glad to
see the back of it!

Will
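[Editorial note: for reference, the ticket lock Will wants to keep can be sketched in a few lines. The C11 version below is an illustration only, not arm64's actual arch/arm64/include/asm/spinlock.h (which is written in assembly and uses WFE to wait); the names are made up.]

```c
#include <stdatomic.h>

/* Ticket spinlock sketch: each locker takes a ticket with one atomic
 * add and spins until the "now serving" counter reaches it, giving
 * strict FIFO fairness -- part of why the ticket design is simple
 * enough to have had pieces of it formally verified. */

struct ticket_lock {
    atomic_ushort next;   /* next ticket to hand out        */
    atomic_ushort owner;  /* ticket currently being served  */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned short me = atomic_fetch_add(&l->next, 1);
    while (atomic_load(&l->owner) != me)
        ;  /* spin; the kernel would wait on an event (WFE) instead */
}

static void ticket_lock_release(struct ticket_lock *l)
{
    atomic_fetch_add(&l->owner, 1);  /* hand off to the next ticket */
}
```

Since `owner` only ever advances by one per release, waiters are served strictly in ticket order, which is the fairness property the old arm64 rwlock lacked.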
On Mon, Oct 09, 2017 at 10:59:36AM +0100, Will Deacon wrote:
> Hi Yury,
>
> Just to confirm, you're running this under qemu on an x86 host, using full
> AArch64 system emulation? If so, I really don't think we should base the
> merits of qrwlocks on arm64 around this type of configuration. Given that
> you work for a silicon vendor, could you try running on real arm64
> hardware instead, please?

I don't have access to the hardware at the moment. I'll run the test when
I get it.

[...]

> > 1st patch is not needed anymore because Babu Moger submitted similar patch that
> > is already in mainline: 9ab6055f95903 ("kernel/locking: Fix compile error with
> > qrwlock.c"). Could you revisit second patch?
>
> Sorry, not sure what you're asking me to do here.

It removes an unneeded #include <linux/atomic.h> in
include/asm-generic/qspinlock_types.h. Could you or someone else take it
upstream?

> Will
>
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534299.html
On Mon, Oct 09, 2017 at 03:49:21PM +0300, Yury Norov wrote:
> On Mon, Oct 09, 2017 at 10:59:36AM +0100, Will Deacon wrote:
> > On Mon, Oct 09, 2017 at 12:30:52AM +0300, Yury Norov wrote:
> > > There were 2 preparing patches in the series:
> > > [PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock
> > > and
> > > [PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h
> > >
> > > 1st patch is not needed anymore because Babu Moger submitted similar patch that
> > > is already in mainline: 9ab6055f95903 ("kernel/locking: Fix compile error with
> > > qrwlock.c"). Could you revisit second patch?
> >
> > Sorry, not sure what you're asking me to do here.
>
> It removes unneeded #include <linux/atomic.h> in
> include/asm-generic/qspinlock_types.h. Could you or someone else take
> it upstream?

My patch implements qrwlocks, not qspinlocks, so it's a bit weird to take
this random patch in the same series. Given that Arnd acked it, I'd
suggest either sending it through him, or leaving it until I get round to
looking at qspinlock for arm64 (see my reply to Peter).

Will
On 10/06/2017 09:34 AM, Will Deacon wrote:
> Hi all,
>
> This is version two of the patches I posted yesterday:
>
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> [...]
>
> Will Deacon (5):
>   kernel/locking: Use struct qrwlock instead of struct __qrwlock
>   locking/atomic: Add atomic_cond_read_acquire
>   kernel/locking: Use atomic_cond_read_acquire when spinning in qrwlock
>   arm64: locking: Move rwlock implementation over to qrwlocks
>   kernel/locking: Prevent slowpath writers getting held up by fastpath
>
>  arch/arm64/Kconfig                      |  17 ++++
>  arch/arm64/include/asm/Kbuild           |   1 +
>  arch/arm64/include/asm/spinlock.h       | 164 +------------------------------
>  arch/arm64/include/asm/spinlock_types.h |   6 +-
>  include/asm-generic/atomic-long.h       |   3 +
>  include/asm-generic/qrwlock.h           |  20 +---
>  include/asm-generic/qrwlock_types.h     |  15 ++-
>  include/linux/atomic.h                  |   4 +
>  kernel/locking/qrwlock.c                |  83 +++------------
>  9 files changed, 58 insertions(+), 255 deletions(-)

I have done some performance tests of your patches on a 1-socket Cavium
CN8880 system with 32 cores. I used my locking stress test, which produced
the following results with 16 locking threads at various mixes of reader
and writer threads on 4.14-rc4 based kernels. The numbers are the
minimum/average/maximum locking operations done per locking thread in a
10-second period. A minimum number of 1 means there is at least 1 thread
that cannot acquire the lock during the test period.

                         w/o qrwlock patch               with qrwlock patch
                         -----------------               ------------------
16 readers    793,024/1,169,763/1,684,751    1,060,127/1,198,583/1,331,003

12 readers  1,162,760/1,641,714/2,162,939    1,685,334/2,099,088/2,338,461
 4 writers          1/        1/        1       25,540/  195,975/  392,232

 8 readers  2,135,670/2,391,612/2,737,564    2,985,686/3,359,048/3,870,423
 8 writers          1/   19,867/   88,173      119,078/  559,604/1,112,769

 4 readers  1,194,917/1,250,876/1,299,304    3,611,059/4,653,775/6,268,370
12 writers    176,156/1,088,513/2,594,534        7,664/  795,393/1,841,961

16 writers     35,007/1,094,608/1,954,457    1,618,915/1,633,077/1,645,637

It can be seen that qrwlock performed much better than the original rwlock
implementation.

Tested-by: Waiman Long <longman@redhat.com>

Cheers,
Longman
Hi,

On 10/06/2017 08:34 AM, Will Deacon wrote:
> Hi all,
>
> This is version two of the patches I posted yesterday:
>
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> [...]
> As before, all comments (particularly related to testing and performance)
> welcome!

I've been doing perf comparisons against the rwlock fairness patch I
posted last week and the baseline rwlock, on a single-socket ThunderX.

For most cases where the ratio of readers to writers is similar
(uncontended readers/writers, single writer/lots of readers, etc.) the
absolute number of lock acquisitions is very similar (within a few percent
one way or the other). In this regard both patches are light years ahead
of the current arm64 rwlock. The unfairness of the current rwlock allows
significantly higher lock acquisition counts (say 4x at 30 readers : 1
writer) at the expense of complete writer starvation (a ~43k:1 ratio at
30R:1W per locktorture). This is untenable.

The qrwlock does an excellent job of keeping the ratio of reader/writer
acquisitions proportional to the number of readers/writers until the total
number of lockers exceeds the number of cores, where the ratios start to
far exceed the reader/writer ratios (440:1 acquisitions @ 96R:1W). In
comparison, the other patch tends to favor the writers more, so at a ratio
of 48R:1W the readers are only grabbing the lock at a ratio of 15:1. This
flatter curve continues past the number of cores, with the readers having
a 48:1 advantage at 96R:1W.

That said, the total lock acquisition counts remain very similar (with
maybe a slight advantage to the non-queued patch with 1 writer and 12-30
readers), despite the writer acquiring the lock at a higher frequency.
OTOH, if the number of writers is closer to the number of readers
(24R:24W), then the writers have about a 1.5x bias over the readers,
independent of the number of readers/writers. This bias seems to be the
common multiplier given a reader/writer ratio with those patches, and it
doesn't account for possible single-thread starvation.

Of course, I've been running other tests as well, and the system seems to
be behaving as expected (likely better than with the rwlock patches under
stress). I will continue to test this on a couple of other platforms. In
the meantime:

Tested-by: Jeremy Linton <jeremy.linton@arm.com>
On 10/6/2017 9:34 AM, Will Deacon wrote:
> Hi all,
>
> This is version two of the patches I posted yesterday:
>
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/534666.html
> [...]
> As before, all comments (particularly related to testing and performance)
> welcome!

Applied on 4.14-rc4, I tested these patches with multiple combinations of
readers:writers. These patches help prevent writer starvation in every
combination that I tested. Without these patches, when the reader:writer
ratio is 2:1, it's trivial for me to see acquisition ratios of 250:1
(@ 2R:1W). After applying the qrwlock patches, I see the acquisition
ratios level out to around ~1.6:1 (@ 2R:1W), which is quite an
improvement.

Thanks Will!

Tested-by: Adam Wallis <awallis@codeaurora.org>

-- 
Adam Wallis
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.