Message ID | 1535125966-7666-9-git-send-email-will.deacon@arm.com |
---|---|
State | Superseded |
Series | Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 |
On Fri, 24 Aug 2018 16:52:43 +0100 Will Deacon <will.deacon@arm.com> wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Some architectures require different TLB invalidation instructions
> depending on whether it is only the last-level of page table being
> changed, or whether there are also changes to the intermediate
> (directory) entries higher up the tree.
>
> Add a new bit to the flags bitfield in struct mmu_gather so that the
> architecture code can operate accordingly if it's the intermediate
> levels being invalidated.
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

powerpc should be able to move right over to using this rather than
keeping the bit in need_flush_all.

powerpc may be able to use the unmap granule thing to improve its page
size dependent flushes, but it might prefer to go a different way and
track start-end for different page sizes. I wonder how much of that
stuff should go into generic code, and whether it should instead go
into a struct arch_mmu_gather.

Thanks,
Nick
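For illustration only, a minimal sketch of the shape the struct arch_mmu_gather idea above could take; arch_mmu_gather, CONFIG_ARCH_HAS_MMU_GATHER and the per-page-size range fields below are hypothetical names, not anything posted in this series.

	/*
	 * Hypothetical sketch of the "struct arch_mmu_gather" idea: the
	 * generic gather embeds an arch-defined struct, so powerpc could
	 * keep its page-size-dependent range tracking private to the arch
	 * instead of adding more generic bitfields.  All names are made up.
	 */
	struct arch_mmu_gather {
		unsigned long	hstart, hend;	/* pending 2M (PMD-size) range */
		unsigned long	gstart, gend;	/* pending 1G (PUD-size) range */
	};

	struct mmu_gather {
		struct mm_struct	*mm;
		unsigned long		start;
		unsigned long		end;
	#ifdef CONFIG_ARCH_HAS_MMU_GATHER	/* hypothetical Kconfig option */
		struct arch_mmu_gather	arch;	/* arch-private gather state */
	#endif
		/* ... remaining generic fields ... */
	};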
On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:
> powerpc may be able to use the unmap granule thing to improve
> its page size dependent flushes, but it might prefer to go
> a different way and track start-end for different page sizes.

I don't really see how tracking multiple ranges would help much with
THP. The ranges would end up being almost the same if there is a good
mix of page sizes.

But something like:

	void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
	{
		if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
			tlbie_pte(addr);
		if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
			tlbie_pmd(addr);
		if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
			tlbie_pud(addr);
	}

	void tlb_flush_range(struct mmu_gather *tlb)
	{
		unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
		unsigned long addr;

		for (addr = tlb->start; addr < tlb->end; addr += stride)
			tlb_flush_one(tlb, addr);

		ptesync();
	}

Should work, I think. You'll only issue multiple TLBIEs on the
boundaries, not every stride. And for hugetlb the above should be
optimal, since stride and tlb->cleared_* match up 1:1.
On Tue, Aug 28, 2018 at 03:46:38PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:
>
> > powerpc may be able to use the unmap granule thing to improve
> > its page size dependent flushes, but it might prefer to go
> > a different way and track start-end for different page sizes.
>
> I don't really see how tracking multiple ranges would help much with
> THP. The ranges would end up being almost the same if there is a good
> mix of page sizes.
>
> But something like:
>
> 	void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
> 	{
> 		if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
> 			tlbie_pte(addr);
> 		if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
> 			tlbie_pmd(addr);
> 		if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
> 			tlbie_pud(addr);
> 	}

Sorry, those all should (of course) be !(addr << ...).

> 	void tlb_flush_range(struct mmu_gather *tlb)
> 	{
> 		unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
> 		unsigned long addr;
>
> 		for (addr = tlb->start; addr < tlb->end; addr += stride)
> 			tlb_flush_one(tlb, addr);
>
> 		ptesync();
> 	}

And one could, like x86 has, add a threshold above which you just kill
the complete mm.
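On the corrected test itself: shifting addr left by (BITS_PER_LONG - SHIFT) discards everything except the low SHIFT bits, so !(addr << (BITS_PER_LONG - SHIFT)) is true exactly when addr is aligned to 1 << SHIFT. A standalone sketch of that check (the helper name is made up):

	#include <stdbool.h>

	#define BITS_PER_LONG	(8 * sizeof(unsigned long))

	/*
	 * True when addr is aligned to (1UL << shift): the left shift keeps
	 * only the low 'shift' bits, so the result is zero exactly when those
	 * bits are zero.  Equivalent to !(addr & ((1UL << shift) - 1)).
	 */
	static inline bool boundary_aligned(unsigned long addr, unsigned int shift)
	{
		return !(addr << (BITS_PER_LONG - shift));
	}

In the tlb_flush_range() loop above, addr is always at least page-aligned, so the pte-level tlbie fires on every stride, while the pmd- and pud-level ones fire only when addr crosses a 2M or 1G boundary.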
On Tue, 28 Aug 2018 15:46:38 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:
>
> > powerpc may be able to use the unmap granule thing to improve
> > its page size dependent flushes, but it might prefer to go
> > a different way and track start-end for different page sizes.
>
> I don't really see how tracking multiple ranges would help much with
> THP. The ranges would end up being almost the same if there is a good
> mix of page sizes.

That's assuming quite large unmaps. But a lot of the time they are
going to go to a full PID flush.

> But something like:
>
> 	void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
> 	{
> 		if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
> 			tlbie_pte(addr);
> 		if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
> 			tlbie_pmd(addr);
> 		if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
> 			tlbie_pud(addr);
> 	}
>
> 	void tlb_flush_range(struct mmu_gather *tlb)
> 	{
> 		unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
> 		unsigned long addr;
>
> 		for (addr = tlb->start; addr < tlb->end; addr += stride)
> 			tlb_flush_one(tlb, addr);
>
> 		ptesync();
> 	}
>
> Should work, I think. You'll only issue multiple TLBIEs on the
> boundaries, not every stride.

Yeah, we already do basically that today in the flush_tlb_range path,
just without the precise test for which page sizes:

	if (hflush) {
		hstart = (start + PMD_SIZE - 1) & PMD_MASK;
		hend = end & PMD_MASK;
		if (hstart == hend)
			hflush = false;
	}

	if (gflush) {
		gstart = (start + PUD_SIZE - 1) & PUD_MASK;
		gend = end & PUD_MASK;
		if (gstart == gend)
			gflush = false;
	}

	asm volatile("ptesync": : :"memory");
	if (local) {
		__tlbiel_va_range(start, end, pid, page_size, mmu_virtual_psize);
		if (hflush)
			__tlbiel_va_range(hstart, hend, pid, PMD_SIZE, MMU_PAGE_2M);
		if (gflush)
			__tlbiel_va_range(gstart, gend, pid, PUD_SIZE, MMU_PAGE_1G);
		asm volatile("ptesync": : :"memory");

Thing is, I think it's the smallish range cases you want to optimize
for. And for those we'll probably do something even smarter (like keep
a bitmap of pages to flush), because we really want to keep tlbies off
the bus, whereas that's less important for x86.

Still not really seeing a reason not to implement a struct
arch_mmu_gather. A little bit of data contained to the arch is nothing
compared with the multitude of hooks and divergence of code.

Thanks,
Nick
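A rough, hypothetical sketch of what that "precise test for which page sizes" could look like in the same path, gating the existing hflush/gflush decisions on the cleared_pmds/cleared_puds bits referred to elsewhere in this thread; none of this is from a posted patch, and the surrounding variables follow the radix code excerpt above.

	/*
	 * Hypothetical: only attempt the 2M/1G flushes when the gather
	 * actually cleared entries at that level, instead of inferring
	 * it from the rounded start/end alone.
	 */
	bool hflush = tlb->cleared_pmds;
	bool gflush = tlb->cleared_puds;

	if (hflush) {
		hstart = (start + PMD_SIZE - 1) & PMD_MASK;
		hend = end & PMD_MASK;
		if (hstart == hend)
			hflush = false;
	}
	if (gflush) {
		gstart = (start + PUD_SIZE - 1) & PUD_MASK;
		gend = end & PUD_MASK;
		if (gstart == gend)
			gflush = false;
	}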
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 859f897d3d53..a5caf90264e6 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -99,12 +99,22 @@ struct mmu_gather {
 #endif
 	unsigned long		start;
 	unsigned long		end;
-	/* we are in the middle of an operation to clear
-	 * a full mm and can make some optimizations */
-	unsigned int		fullmm : 1,
-	/* we have performed an operation which
-	 * requires a complete flush of the tlb */
-				need_flush_all : 1;
+	/*
+	 * we are in the middle of an operation to clear
+	 * a full mm and can make some optimizations
+	 */
+	unsigned int		fullmm : 1;
+
+	/*
+	 * we have performed an operation which
+	 * requires a complete flush of the tlb
+	 */
+	unsigned int		need_flush_all : 1;
+
+	/*
+	 * we have removed page directories
+	 */
+	unsigned int		freed_tables : 1;
 
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
@@ -139,6 +149,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 		tlb->start = TASK_SIZE;
 		tlb->end = 0;
 	}
+	tlb->freed_tables = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -280,6 +291,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
 #endif
@@ -287,7 +299,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
 #endif
@@ -297,6 +310,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
@@ -306,7 +320,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__p4d_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
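To make the intended use of the new bit concrete, here is a minimal, hypothetical sketch of how an architecture's tlb_flush() might consume it, assuming the architecture has separate last-level-only and all-level invalidation primitives; the invalidate_*() helpers below are made-up names, not APIs from this series.

	/*
	 * Hypothetical sketch: consuming tlb->freed_tables in an arch
	 * tlb_flush().  The invalidate_*() helpers are made-up stand-ins
	 * for arch-specific invalidation primitives.
	 */
	static inline void example_tlb_flush(struct mmu_gather *tlb)
	{
		if (tlb->fullmm || tlb->need_flush_all) {
			/* Whole address space: flush everything for this mm. */
			invalidate_all_for_mm(tlb->mm);
		} else if (tlb->freed_tables) {
			/* Page-table pages were freed: walk caches must go too. */
			invalidate_range_all_levels(tlb->mm, tlb->start, tlb->end);
		} else {
			/* Only leaf entries changed: last-level invalidate suffices. */
			invalidate_range_last_level(tlb->mm, tlb->start, tlb->end);
		}
	}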