Message ID: 20141029194738.GA29911@arm.com
State: New
On Wed, Oct 29, 2014 at 12:47 PM, Will Deacon <will.deacon@arm.com> wrote:
>
> I've cooked up something (see below), but it unfortunately makes a couple
> of minor changes to powerpc and microblaze to address redefinitions of
> some of the gather callbacks (tlb_{start,end}_vma, __tlb_remove_tlb_entry).
>
> On the plus side, it tidies up the force_flush case in zap_pte_range
> quite nicely (assuming I didn't screw it up again).

Yes, this looks fine to me. Looks like a good cleanup, and moves more
code to generic headers rather than having arch-specific stuff.

Ben, can you check that this is ok on powerpc? Who else should
double-check this due to having been involved in tlb flushing? But I
think this is good to go, modulo checking other architectures for
sanity.

                  Linus
On Wed, 2014-10-29 at 14:11 -0700, Linus Torvalds wrote:
> On Wed, Oct 29, 2014 at 12:47 PM, Will Deacon <will.deacon@arm.com> wrote:
> >
> > I've cooked up something (see below), but it unfortunately makes a couple
> > of minor changes to powerpc and microblaze to address redefinitions of
> > some of the gather callbacks (tlb_{start,end}_vma, __tlb_remove_tlb_entry).
> >
> > On the plus side, it tidies up the force_flush case in zap_pte_range
> > quite nicely (assuming I didn't screw it up again).
>
> Yes, this looks fine to me. Looks like a good cleanup, and moves more
> code to generic headers rather than having arch-specific stuff.
>
> Ben, can you check that this is ok on powerpc? Who else should
> double-check this due to having been involved in tlb flushing? But I
> think this is good to go, modulo checking other architectures for
> sanity.

TLB flushing is only me I think, I'll engage my brain after breakfast
and see if it is all good. The difficulty for us is that SW loaded TLB,
hash32 and hash64 are all very different and my brain's been good at
swapping a lot of that stuff out lately...

Cheers,
Ben.
On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> TLB flushing is only me I think, I'll engage my brain after breakfast
> and see if it is all good

Ping? Breakfast is either long over, or you're starting to look a bit
like Mr Creosote...

Anyway, Will, I assume this is not a correctness issue for you, just
an annoying performance issue. Right? Or is there actually some issue
with the actual range not being set to be sufficiently large?

Also, it strikes me that I *think* that you might be able to extend
your patch to remove the whole "need_flush" field, since as far as I
can tell, "tlb->need_flush" is now equivalent to "tlb->start <
tlb->end". Of course, as long as we still require that
"need_flush_all", that doesn't actually save us any space, so maybe
it's not worth changing.

                  Linus
On Sat, 2014-11-01 at 10:01 -0700, Linus Torvalds wrote:
> On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > TLB flushing is only me I think, I'll engage my brain after breakfast
> > and see if it is all good
>
> Ping? Breakfast is either long over, or you're starting to look a bit
> like Mr Creosote...

Argh... dropped that ball.

> Anyway, Will, I assume this is not a correctness issue for you, just
> an annoying performance issue. Right? Or is there actually some issue
> with the actual range not being set to be sufficiently large?

It should be fine for us in terms of correctness I think. We rely on the
lazy mmu bits for batching/flushing on hash64, we use
__tlb_remove_tlb_entry() for immediate flush on hash32 and the SW loaded
TLB cases are pretty dumb here and should be generally unaffected.

> Also, it strikes me that I *think* that you might be able to extend
> your patch to remove the whole "need_flush" field, since as far as I
> can tell, "tlb->need_flush" is now equivalent to "tlb->start <
> tlb->end". Of course, as long as we still require that
> "need_flush_all", that doesn't actually save us any space, so maybe
> it's not worth changing.

Cheers,
Ben.
On Sat, Nov 01, 2014 at 05:01:30PM +0000, Linus Torvalds wrote:
> On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > TLB flushing is only me I think, I'll engage my brain after breakfast
> > and see if it is all good
>
> Ping? Breakfast is either long over, or you're starting to look a bit
> like Mr Creosote...

Wafer thin patch?

> Anyway, Will, I assume this is not a correctness issue for you, just
> an annoying performance issue. Right? Or is there actually some issue
> with the actual range not being set to be sufficiently large?

Yeah, it's just a performance issue. For ranges over 1k pages, we end up
flushing the whole TLB.

> Also, it strikes me that I *think* that you might be able to extend
> your patch to remove the whole "need_flush" field, since as far as I
> can tell, "tlb->need_flush" is now equivalent to "tlb->start <
> tlb->end". Of course, as long as we still require that
> "need_flush_all", that doesn't actually save us any space, so maybe
> it's not worth changing.

We use `tlb->end > 0' in the arm64 backend, so I think you're right.
I'll take a look in the name of cleanup and this can wait until 3.19.

Will
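For context, the fallback Will is describing looks roughly like the sketch
below; MAX_TLB_RANGE and __flush_tlb_range() are assumed names for this
illustration, not quotes of the arm64 sources:

        /*
         * Sketch only: once the range to invalidate exceeds a threshold
         * (assumed here to be 1024 pages), flushing the whole TLB is
         * cheaper than invalidating it page by page.
         */
        #define MAX_TLB_RANGE   (1024UL << PAGE_SHIFT)

        static inline void flush_tlb_range(struct vm_area_struct *vma,
                                           unsigned long start, unsigned long end)
        {
                if ((end - start) > MAX_TLB_RANGE)
                        flush_tlb_mm(vma->vm_mm);               /* nuke everything */
                else
                        __flush_tlb_range(vma, start, end);     /* per-page invalidate loop */
        }

With the generic code previously computing an over-large (or negative) range,
the threshold path above is what turned a small unmap into a full TLB flush.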
On Mon, Nov 3, 2014 at 9:56 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> We use `tlb->end > 0' in the arm64 backend, so I think you're right. I'll
> take a look in the name of cleanup and this can wait until 3.19.

Ok, I'll just archive this thread for now, and expect a patch at some
future date. And as far as I'm concerned, it's ok if it just comes in
through the normal arm64 tree, with just a note in the pull request
reminding me about why it also touches some other architectures.

Thanks,

                  Linus
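A minimal sketch of the follow-up cleanup being agreed here, assuming the
generic range tracking from the patch at the bottom of this page;
tlb_needs_flush() is a hypothetical helper, not posted code:

        /*
         * Hypothetical: with start/end maintained generically, "is there
         * anything to flush?" falls out of the range itself rather than a
         * separate need_flush field (fullmm stays special because no range
         * is tracked for a full-mm teardown).
         */
        static inline bool tlb_needs_flush(struct mmu_gather *tlb)
        {
                return tlb->fullmm || tlb->start < tlb->end;
        }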
(catching up with email after holiday)

On Wed, Oct 29, 2014 at 07:47:39PM +0000, Will Deacon wrote:
> mmu_gather: move minimal range calculations into generic code
>
> On architectures with hardware broadcasting of TLB invalidation messages,
> it makes sense to reduce the range of the mmu_gather structure when
> unmapping page ranges based on the dirty address information passed to
> tlb_remove_tlb_entry.
>
> arm64 already does this by directly manipulating the start/end fields
> of the gather structure, but this confuses the generic code which
> does not expect these fields to change and can end up calculating
> invalid, negative ranges when forcing a flush in zap_pte_range.
>
> This patch moves the minimal range calculation out of the arm64 code
> and into the generic implementation, simplifying zap_pte_range in the
> process (which no longer needs to care about start/end, since they will
> point to the appropriate ranges already).
>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Nice to see this clean-up for arm64, however I have a question below.

> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 5672d7ea1fa0..340bc5c5ca2d 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -128,6 +128,46 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
>  		tlb_flush_mmu(tlb);
>  }
>
> +static inline void __tlb_adjust_range(struct mmu_gather *tlb,
> +				      unsigned long address)
> +{
> +	if (!tlb->fullmm) {
> +		tlb->start = min(tlb->start, address);
> +		tlb->end = max(tlb->end, address + PAGE_SIZE);
> +	}
> +}

Here __tlb_adjust_range() assumes end to be (start + PAGE_SIZE) always.

> @@ -152,12 +193,14 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
>  #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)		\
>  	do {							\
>  		tlb->need_flush = 1;				\
> +		__tlb_adjust_range(tlb, address);		\
>  		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);	\
>  	} while (0)

[...]

>  #define pmd_free_tlb(tlb, pmdp, address)			\
>  	do {							\
>  		tlb->need_flush = 1;				\
> +		__tlb_adjust_range(tlb, address);		\
>  		__pmd_free_tlb(tlb, pmdp, address);		\
>  	} while (0)

This would work on arm64 but is the PAGE_SIZE range enough for all
architectures even when we flush a huge page or a pmd/pud table entry?

The approach Peter Z took with his patches was to use
pmd_addr_end(addr, TASK_SIZE) and change __tlb_adjust_range() to take
start/end arguments:

https://lkml.org/lkml/2011/3/7/302
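For reference, a minimal sketch of the alternative Catalin points at: an
explicit start/end interface, so that a huge-page or page-table caller could
pass a wider region such as pmd_addr_end(addr, TASK_SIZE). This is
illustrative only and is not the interface the posted patch uses:

        /* Hypothetical variant: callers describe the region they covered. */
        static inline void __tlb_adjust_range(struct mmu_gather *tlb,
                                              unsigned long start,
                                              unsigned long end)
        {
                if (!tlb->fullmm) {
                        tlb->start = min(tlb->start, start);
                        tlb->end = max(tlb->end, end);
                }
        }

        /* A pte-level user would pass a single-page region ... */
        #define tlb_remove_tlb_entry(tlb, ptep, address)                        \
                do {                                                            \
                        tlb->need_flush = 1;                                    \
                        __tlb_adjust_range(tlb, address, (address) + PAGE_SIZE);\
                        __tlb_remove_tlb_entry(tlb, ptep, address);             \
                } while (0)

        /* ... while a table-level user could widen it, e.g.:
         *      __tlb_adjust_range(tlb, address, pmd_addr_end(address, TASK_SIZE));
         */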
On Tue, Nov 4, 2014 at 6:29 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> This would work on arm64 but is the PAGE_SIZE range enough for all
> architectures even when we flush a huge page or a pmd/pud table entry?

It pretty much had *better* be.

For things like page table caches (ie caching addresses "inside" the
page tables, like x86 does), for legacy reasons, flushing an individual
page had better flush the page table caches behind it. This is
definitely how x86 works, for example. And if you have an architected
non-legacy page table cache (which I'm not aware of anybody actually
doing), you're going to have some architecturally explicit flushing for
that, likely *separate* from a regular TLB entry flush, and thus you'd
need more than just some range expansion.

And the logic is very similar for things like hugepages. Either a
normal "TLB invalidate" instruction anywhere in the hugepage will
invalidate the whole hugepage, or you would have special instructions
or rules for invalidating hugepages and you'd need more than just some
range expansion.

So in neither case does it make sense to expand the range, afaik. And
it would hurt normal architectures. So if we ever find an architecture
that would want something that odd, I think it is up to that
architecture to do its own odd thing, not cause pain for others.

                  Linus
On Tue, Nov 04, 2014 at 04:08:27PM +0000, Linus Torvalds wrote:
> On Tue, Nov 4, 2014 at 6:29 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > This would work on arm64 but is the PAGE_SIZE range enough for all
> > architectures even when we flush a huge page or a pmd/pud table entry?
>
> It pretty much had *better* be.

Thanks for confirming.

> For things like page table caches (ie caching addresses "inside" the
> page tables, like x86 does), for legacy reasons, flushing an
> individual page had better flush the page table caches behind it. This
> is definitely how x86 works, for example. And if you have an
> architected non-legacy page table cache (which I'm not aware of
> anybody actually doing), you're going to have some architecturally
> explicit flushing for that, likely *separate* from a regular TLB entry
> flush, and thus you'd need more than just some range expansion.

On arm64 we have two types of TLB invalidation instructions, the
standard one which flushes a pte entry together with the corresponding
upper level page table cache and a "leaf" operation only for the pte.
We don't use the latter in Linux (yet) but in theory it's more
efficient.

Anyway, even without special "leaf" operations, it would be useful to
make the distinction between unmap_vmas() and free_pgtables() with
regards to the ranges tracked by mmu_gather. For the former, tlb_flush()
needs to flush the range in PAGE_SIZE increments (assuming a mix of
small and huge pages). For the latter, PMD_SIZE increments would be
enough.

With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick
but for x86 we can keep mmu_gather.need_flush only for pte clearing
and remove need_flush = 1 from p*_free_tlb() functions. The arch
specific tlb_flush() can take need_flush into account to change the
range flushing increment or even ignore the second tlb_flush()
triggered by tlb_finish_mmu() (after free_pgtables(), the ptes have
been flushed via tlb_end_vma()).
On Thu, Nov 6, 2014 at 5:57 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Anyway, even without special "leaf" operations, it would be useful to
> make the distinction between unmap_vmas() and free_pgtables() with
> regards to the ranges tracked by mmu_gather. For the former, tlb_flush()
> needs to flush the range in PAGE_SIZE increments (assuming a mix of
> small and huge pages). For the latter, PMD_SIZE increments would be
> enough.

Why would you *ever* care about the increments?

Quite frankly, I think even the PAGE_SIZE thing is just (a) stupid and
(b) misleading.

It might actually be better: instead of

        tlb->end = max(tlb->end, address + PAGE_SIZE);

it might as well be a simpler

        tlb->end = max(tlb->end, address+1)

(Even the "+1" is kind of unnecessary, but it's there to distinguish
the zero address from the initial condition).

The thing is, absolutely *no* architecture has a TLB flush that
involves giving the start and end of the page you want to flush. None.
Nada. Zilch. Trying to think of it as a "range of bytes I need to
flush" is wrong. And it's misleading, and it makes people think in
stupid ways like "I should change PAGE_SIZE to PMD_SIZE when I flush a
PMD". That's *wrong*.

Every single TLB flush is about a variation of "flush the TLB entry
that contains the mapping for this address". The whole "add page-size"
is pointless, because the *flushing* is not about giving a page-sized
range, it's about giving *one* address in that page. Using "address+1"
is actually equally descriptive of what the range is ("flush the tlb
entry that maps this *byte*"), but is less amenable to the above kind
of silly PMD_SIZE confusion.

Because the exact same thing that is true for a single page is true
for a page table operation too. There is just a *single* page table
directory entry cache, it's not like you have one page table directory
entry cache for each page.

So what matters for the non-leaf operations is not size. Not AT ALL.
It's a totally different operation, and you'd need not a different
size, but a different flag entirely - the same way we already have a
different flag for the "full_mm" case. And it's actually for exactly
the same reason as "full_mm": you do the flush itself differently,
it's not that the range is different. If it was just about the range,
then "full_mm" would just initialize the range to the whole VM. But it
*isn't* about the range. It's about the fact that a full-MM tear-down
can fundamentally do different operations, knowing that there are no
other threads using that VM any more.

So I really really object to playing games with PMD_SIZE or other
idiocies, because it fundamentally mis-states the whole problem.

If ARM64 wants to make the "leaf vs non-leaf" TLB distinction, then you
need to add a new flag, and you just set that flag when you tear down
a page table (as opposed to just a PTE). It doesn't affect the "range"
at all in any way as far as I can tell.

> With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick
> but for x86 we can keep mmu_gather.need_flush only for pte clearing
> and remove need_flush = 1 from p*_free_tlb() functions.

This is more confusion about what is going on.

I'd actually really really prefer to have the "need_flush = 1" for the
page table tear-down case even for x86.
Now, if you never removed any PTE at all, it is possible that it's not
actually needed because an x86 CPU isn't supposed to cache non-present
page table entries (so if you could tear down the page tables because
there were no entries, there should be no TLB entries, and there
*hopefully* should be no caches of mid-level page tables either that
need a TLB invalidate).

But in practice, I'd not take that chance. If you tear down a page
table, you should flush the TLB in that range (and note how I say *in*
that range - an invalidate anywhere in the range should be sufficient,
not "over the whole range"!), because quite frankly, from an
implementation standpoint, I really think it's the sane and safe thing
to do.

So I would suggest you think of the x86 invlpg instruction as your
"non-leaf invalidate". The same way you'd want to do non-leaf
invalidate whenever you tear down a page table, you'd do "invlpg" on
x86.

And no, we should *not* play games with "tlb->local.next". That just
sounds completely and utterly insane. That's a hack, it's unclear,
it's stupid, and it's connected to a totally irrelevant implementation
detail, namely that random RCU freeing.

Set a flag, for chrissake. Just say "when you free a pmd/pud/pgd, set
tlb->need_flush_inner to let the flusher know" (probably in *addition*
to "tlb->need_flush", just to maintain that rule). Make it explicit,
and make it obvious, and don't play games.

And don't make it be about range sizes. Because I really don't see how
the exact end/start could ever be relevant. TLB's aren't byte-based,
they are entry-based.

                  Linus
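A minimal sketch of the flag-based approach being suggested, built on the
generic macros from the patch below; "need_flush_inner" is the name proposed
above, and everything else here is an assumption rather than posted code:

        /* Hypothetical: mark that a page-table page (not just a pte) was freed. */
        #define pmd_free_tlb(tlb, pmdp, address)                        \
                do {                                                    \
                        tlb->need_flush = 1;                            \
                        tlb->need_flush_inner = 1;                      \
                        __tlb_adjust_range(tlb, address);               \
                        __pmd_free_tlb(tlb, pmdp, address);             \
                } while (0)

The arch tlb_flush() could then test tlb->need_flush_inner and issue a
non-leaf ("walk cache") invalidate over the accumulated range, falling back
to a cheaper leaf-only invalidate when only ptes were zapped.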
On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote:
> On Thu, Nov 6, 2014 at 5:57 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > Anyway, even without special "leaf" operations, it would be useful to
> > make the distinction between unmap_vmas() and free_pgtables() with
> > regards to the ranges tracked by mmu_gather. For the former, tlb_flush()
> > needs to flush the range in PAGE_SIZE increments (assuming a mix of
> > small and huge pages). For the latter, PMD_SIZE increments would be
> > enough.
>
> Why would you *ever* care about the increments?

Sorry, I wasn't clear enough about the "increments" part. I agreed with
not using end = start + PMD_SIZE/PAGE_SIZE from your previous email
already.

The flush_tlb_range() implementation on ARM(64) uses a loop that goes
over the given range in PAGE_SIZE increments. This is fine and even
optimal when we flush the PTEs. But it could be even faster when we go
over the same range again and only need to flush the page table cache
(PMD/PUD/PGD). A new flush_tlb_table_range() function could loop in
PMD_SIZE increments. That's an arm64-only implementation detail of the
TLB range flushing; I'm not suggesting the PAGE_SIZE/PMD_SIZE increments
when setting mmu_gather.end at all.

> Quite frankly, I think even the PAGE_SIZE thing is just (a) stupid and
> (b) misleading.
>
> It might actually be better: instead of
>
>         tlb->end = max(tlb->end, address + PAGE_SIZE);
>
> it might as well be a simpler
>
>         tlb->end = max(tlb->end, address+1)

I fully agree. One minor drawback is that the TLB invalidation
instructions on ARM work on pfn and end >> PAGE_SHIFT would make it
equal to start. It can be fixed in the arch code though.

> So what matters for the non-leaf operations is not size. Not AT ALL.
> It's a totally different operation, and you'd need not a different
> size, but a different flag entirely - the same way we already have a
> different flag for the "full_mm" case. And it's actually for exactly
> the same reason as "full_mm": you do the flush itself differently,
> it's not that the range is different. If it was just about the range,
> then "full_mm" would just initialize the range to the whole VM. But it
> *isn't* about the range. It's about the fact that a full-MM tear-down
> can fundamentally do different operations, knowing that there are no
> other threads using that VM any more.
>
> So I really really object to playing games with PMD_SIZE or other
> idiocies, because it fundamentally mis-states the whole problem.

That's not what I was suggesting (though I agree I wasn't clear). The
use of PMD_SIZE steps in a new flush_tlb_table_range() loop is entirely
an arch-specific optimisation. Only that the arch code doesn't currently
know which tlb_flush() it should use because need_flush is set for both
PTEs and page table tear down. We just need different flags here to be
able to optimise the arch code further.

> If ARM64 wants to make the "leaf vs non-leaf" TLB distinction, then you
> need to add a new flag, and you just set that flag when you tear down
> a page table (as opposed to just a PTE).

Indeed. Actually we could use need_flush only for PTEs and ignore it
for page table tear-down. With Will's patch, we can already check
tlb->end for what need_flush is currently doing and use need_flush in
an arch-specific way (and we could give a new name as well).

> > With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick
> > but for x86 we can keep mmu_gather.need_flush only for pte clearing
> > and remove need_flush = 1 from p*_free_tlb() functions.
>
> This is more confusion about what is going on.

Yes, and if we do this we may no longer understand the code in a few
weeks' time.

> I'd actually really really prefer to have the "need_flush = 1" for the
> page table tear-down case even for x86. Now, if you never removed any
> PTE at all, it is possible that it's not actually needed because an
> x86 CPU isn't supposed to cache non-present page table entries (so if
> you could tear down the page tables because there were no entries,
> there should be no TLB entries, and there *hopefully* should be no
> caches of mid-level page tables either that need a TLB invalidate).

On ARM, as long as an intermediate page table entry is valid, even
though the full translation (PTE) is not, the CPU can go and cache it.
What's worse (and we've hit it before) is that it may even end up
reading something that looks like a valid PTE (if the PTE page has been
freed before the TLB invalidation) and it will stick around as a full
translation. So we need to flush the page table cache before freeing
the page table pages.

> But in practice, I'd not take that chance. If you tear down a page
> table, you should flush the TLB in that range (and note how I say *in*
> that range - an invalidate anywhere in the range should be sufficient,
> not "over the whole range"!), because quite frankly, from an
> implementation standpoint, I really think it's the sane and safe thing
> to do.

A single TLB invalidation is indeed enough for a single page table page
removed. If we queue multiple page table pages for freeing, we
accumulate the range via pmd_free_tlb() etc. and we would eventually
need multiple TLB invalidations, at most one every PMD_SIZE (that's
when !fullmm).

> So I would suggest you think of the x86 invlpg instruction as your
> "non-leaf invalidate". The same way you'd want to do non-leaf
> invalidate whenever you tear down a page table, you'd do "invlpg" on
> x86.

I need to dig some more in the x86 code, I'm not familiar with it. We
could do the page table cache invalidation non-lazily every time
pmd_free_tlb() is called, though it's not optimal since we need a heavy
DSB barrier on ARM after each TLB invalidate.

> And no, we should *not* play games with "tlb->local.next". That just
> sounds completely and utterly insane. That's a hack, it's unclear,
> it's stupid, and it's connected to a totally irrelevant implementation
> detail, namely that random RCU freeing.
>
> Set a flag, for chrissake. Just say "when you free a pmd/pud/pgd, set
> tlb->need_flush_inner to let the flusher know" (probably in *addition*
> to "tlb->need_flush", just to maintain that rule). Make it explicit,
> and make it obvious, and don't play games.

I agree.
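A minimal sketch of what the PMD_SIZE-stride walk-cache flush mentioned above
might look like; flush_tlb_table_range() and __tlbi_walk_cache_page() are
hypothetical names, not existing arm64 kernel functions, and a real
implementation would use the architecture's TLBI instructions:

        /*
         * Hypothetical arm64-style helper: invalidate only the page-table
         * ("walk cache") entries for [start, end) by visiting each PMD
         * region once instead of once per page.
         */
        static inline void flush_tlb_table_range(struct vm_area_struct *vma,
                                                 unsigned long start,
                                                 unsigned long end)
        {
                unsigned long addr;

                for (addr = start & PMD_MASK; addr < end; addr += PMD_SIZE)
                        __tlbi_walk_cache_page(vma, addr);      /* placeholder */
        }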
On Thu, Nov 6, 2014 at 10:38 AM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote:
>
> Sorry, I wasn't clear enough about the "increments" part. I agreed with
> not using end = start + PMD_SIZE/PAGE_SIZE from your previous email
> already.

Ahh, I misunderstood. You're really just after the granularity of tlb flushes.

That's fine. That makes sense. In fact, how about adding "granularity"
to the mmu_gather structure, and then doing:

 - in __tlb_reset_range(), setting it to ~0ul

 - add "granularity" to __tlb_adjust_range(), and make it do something like

        if (!tlb->fullmm) {
                tlb->granularity = min(tlb->granularity, granularity);
                tlb->start = min(tlb->start, address);
                tlb->end = max(tlb->end, address+1);
        }

and then the TLB flush logic would basically do

        address = tlb->start;
        do {
                flush(address);
                if (address + tlb->granularity < address)
                        break;
                address = address + tlb->granularity;
        } while (address < tlb->end);

or something like that.

Now, if you unmap mixed ranges of large-pages and regular pages, you'd
still have that granularity of one page, but quite frankly, if you do
that, you probably deserve it. The common case is almost certainly
going to be just "unmap large pages" or "unmap normal pages".

And if it turns out that I'm completely wrong, and mixed granularities
are common, maybe there could be some hack in the "tlb->granularity"
calculations that just forces a TLB flush when the granularity changes.

Hmm?

                  Linus
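Pulling the pieces of that suggestion together, a minimal sketch on top of
the helpers introduced by the patch below; the granularity field and the
tlb_flush_range() loop are assumptions based on the pseudo-code above, not
posted code:

        /* Hypothetical extension of the generic helpers with a flush granule. */
        static inline void __tlb_reset_range(struct mmu_gather *tlb)
        {
                tlb->start = TASK_SIZE;
                tlb->end = 0;
                tlb->granularity = ~0ul;
        }

        static inline void __tlb_adjust_range(struct mmu_gather *tlb,
                                              unsigned long address,
                                              unsigned long granularity)
        {
                if (!tlb->fullmm) {
                        tlb->granularity = min(tlb->granularity, granularity);
                        tlb->start = min(tlb->start, address);
                        tlb->end = max(tlb->end, address + 1);
                }
        }

        /* The arch flush then steps through the range one granule at a time. */
        static inline void tlb_flush_range(struct mmu_gather *tlb)
        {
                unsigned long address = tlb->start;

                do {
                        flush(address);         /* per-entry invalidate, arch specific */
                        if (address + tlb->granularity < address)
                                break;          /* overflow guard */
                        address += tlb->granularity;
                } while (address < tlb->end);
        }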
On Thu, Nov 06, 2014 at 09:29:54PM +0000, Linus Torvalds wrote:
> On Thu, Nov 6, 2014 at 10:38 AM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote:
> >
> > Sorry, I wasn't clear enough about the "increments" part. I agreed with
> > not using end = start + PMD_SIZE/PAGE_SIZE from your previous email
> > already.
>
> Ahh, I misunderstood. You're really just after the granularity of tlb flushes.

Yes. The granularity would also help when tearing down page tables as
the granule would be PMD_SIZE.

> That's fine. That makes sense. In fact, how about adding "granularity"
> to the mmu_gather structure, and then doing:
>
>  - in __tlb_reset_range(), setting it to ~0ul
>
>  - add "granularity" to __tlb_adjust_range(), and make it do something like
>
>         if (!tlb->fullmm) {
>                 tlb->granularity = min(tlb->granularity, granularity);
>                 tlb->start = min(tlb->start, address);
>                 tlb->end = max(tlb->end, address+1);
>         }
>
> and then the TLB flush logic would basically do
>
>         address = tlb->start;
>         do {
>                 flush(address);
>                 if (address + tlb->granularity < address)
>                         break;
>                 address = address + tlb->granularity;
>         } while (address < tlb->end);
>
> or something like that.

Indeed. We'll come up with a patch after Will's clean-up.

> Now, if you unmap mixed ranges of large-pages and regular pages, you'd
> still have that granularity of one page, but quite frankly, if you do
> that, you probably deserve it. The common case is almost certainly
> going to be just "unmap large pages" or "unmap normal pages".

I think this could only happen with transparent huge pages that
replaced small pages in an anonymous mapping. I don't think munmap'ing
them happens very often.

Thanks.
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index a82c0c5c8b52..a9c9df0f60ff 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -19,10 +19,6 @@
 #ifndef __ASM_TLB_H
 #define __ASM_TLB_H
 
-#define __tlb_remove_pmd_tlb_entry __tlb_remove_pmd_tlb_entry
-
-#include <asm-generic/tlb.h>
-
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 
@@ -37,16 +33,8 @@ static inline void __tlb_remove_table(void *_table)
 #define tlb_remove_entry(tlb, entry)	tlb_remove_page(tlb, entry)
 #endif /* CONFIG_HAVE_RCU_TABLE_FREE */
 
-/*
- * There's three ways the TLB shootdown code is used:
- *  1. Unmapping a range of vmas.  See zap_page_range(), unmap_region().
- *     tlb->fullmm = 0, and tlb_start_vma/tlb_end_vma will be called.
- *  2. Unmapping all vmas.  See exit_mmap().
- *     tlb->fullmm = 1, and tlb_start_vma/tlb_end_vma will be called.
- *     Page tables will be freed.
- *  3. Unmapping argument pages.  See shift_arg_pages().
- *     tlb->fullmm = 0, but tlb_start_vma/tlb_end_vma will not be called.
- */
+#include <asm-generic/tlb.h>
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	if (tlb->fullmm) {
@@ -54,54 +42,13 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 	} else if (tlb->end > 0) {
 		struct vm_area_struct vma = { .vm_mm = tlb->mm, };
 		flush_tlb_range(&vma, tlb->start, tlb->end);
-		tlb->start = TASK_SIZE;
-		tlb->end = 0;
-	}
-}
-
-static inline void tlb_add_flush(struct mmu_gather *tlb, unsigned long addr)
-{
-	if (!tlb->fullmm) {
-		tlb->start = min(tlb->start, addr);
-		tlb->end = max(tlb->end, addr + PAGE_SIZE);
-	}
-}
-
-/*
- * Memorize the range for the TLB flush.
- */
-static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
-					  unsigned long addr)
-{
-	tlb_add_flush(tlb, addr);
-}
-
-/*
- * In the case of tlb vma handling, we can optimise these away in the
- * case where we're doing a full MM flush.  When we're doing a munmap,
- * the vmas are adjusted to only cover the region to be torn down.
- */
-static inline void tlb_start_vma(struct mmu_gather *tlb,
-				 struct vm_area_struct *vma)
-{
-	if (!tlb->fullmm) {
-		tlb->start = TASK_SIZE;
-		tlb->end = 0;
 	}
 }
 
-static inline void tlb_end_vma(struct mmu_gather *tlb,
-			       struct vm_area_struct *vma)
-{
-	if (!tlb->fullmm)
-		tlb_flush(tlb);
-}
-
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 				  unsigned long addr)
 {
 	pgtable_page_dtor(pte);
-	tlb_add_flush(tlb, addr);
 	tlb_remove_entry(tlb, pte);
 }
 
@@ -109,7 +56,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 				  unsigned long addr)
 {
-	tlb_add_flush(tlb, addr);
 	tlb_remove_entry(tlb, virt_to_page(pmdp));
 }
 #endif
@@ -118,15 +64,8 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
 				  unsigned long addr)
 {
-	tlb_add_flush(tlb, addr);
 	tlb_remove_entry(tlb, virt_to_page(pudp));
 }
 #endif
 
-static inline void __tlb_remove_pmd_tlb_entry(struct mmu_gather *tlb, pmd_t *pmdp,
-					      unsigned long address)
-{
-	tlb_add_flush(tlb, address);
-}
-
 #endif
diff --git a/arch/microblaze/include/asm/tlb.h b/arch/microblaze/include/asm/tlb.h
index 8aa97817cc8c..99b6ded54849 100644
--- a/arch/microblaze/include/asm/tlb.h
+++ b/arch/microblaze/include/asm/tlb.h
@@ -14,7 +14,6 @@
 #define tlb_flush(tlb)	flush_tlb_mm((tlb)->mm)
 
 #include <linux/pagemap.h>
-#include <asm-generic/tlb.h>
 
 #ifdef CONFIG_MMU
 #define tlb_start_vma(tlb, vma)		do { } while (0)
@@ -22,4 +21,6 @@
 #define __tlb_remove_tlb_entry(tlb, pte, address) do { } while (0)
 #endif
 
+#include <asm-generic/tlb.h>
+
 #endif /* _ASM_MICROBLAZE_TLB_H */
diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index e9a9f60e596d..fc3ee06eab87 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -3,7 +3,6 @@
 #ifdef __KERNEL__
 
 #include <linux/mm.h>
-#include <asm-generic/tlb.h>
 
 #ifdef CONFIG_PPC_BOOK3E
 extern void tlb_flush_pgtable(struct mmu_gather *tlb, unsigned long address);
@@ -14,6 +13,8 @@ static inline void tlb_flush_pgtable(struct mmu_gather *tlb,
 }
 #endif /* !CONFIG_PPC_BOOK3E */
 
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
 #ifdef CONFIG_PPC64
 #include <asm/pgalloc-64.h>
 #else
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index e2b428b0f7ba..20733fa518ae 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -27,6 +27,7 @@
 
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
+#define __tlb_remove_tlb_entry	__tlb_remove_tlb_entry
 
 extern void tlb_flush(struct mmu_gather *tlb);
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 5672d7ea1fa0..340bc5c5ca2d 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -128,6 +128,46 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 		tlb_flush_mmu(tlb);
 }
 
+static inline void __tlb_adjust_range(struct mmu_gather *tlb,
+				      unsigned long address)
+{
+	if (!tlb->fullmm) {
+		tlb->start = min(tlb->start, address);
+		tlb->end = max(tlb->end, address + PAGE_SIZE);
+	}
+}
+
+static inline void __tlb_reset_range(struct mmu_gather *tlb)
+{
+	tlb->start = TASK_SIZE;
+	tlb->end = 0;
+}
+
+/*
+ * In the case of tlb vma handling, we can optimise these away in the
+ * case where we're doing a full MM flush.  When we're doing a munmap,
+ * the vmas are adjusted to only cover the region to be torn down.
+ */
+#ifndef tlb_start_vma
+#define tlb_start_vma(tlb, vma) do { } while (0)
+#endif
+
+#define __tlb_end_vma(tlb, vma)					\
+	do {							\
+		if (!tlb->fullmm) {				\
+			tlb_flush(tlb);				\
+			__tlb_reset_range(tlb);			\
+		}						\
+	} while (0)
+
+#ifndef tlb_end_vma
+#define tlb_end_vma	__tlb_end_vma
+#endif
+
+#ifndef __tlb_remove_tlb_entry
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#endif
+
 /**
  * tlb_remove_tlb_entry - remember a pte unmapping for later tlb invalidation.
  *
@@ -138,6 +178,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define tlb_remove_tlb_entry(tlb, ptep, address)		\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
@@ -152,12 +193,14 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)		\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);	\
 	} while (0)
 
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
 
@@ -165,6 +208,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
@@ -172,6 +216,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
 
diff --git a/mm/memory.c b/mm/memory.c
index 3e503831e042..0bc940e41ec9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -220,8 +220,6 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 	/* Is it from 0 to ~0? */
 	tlb->fullmm	= !(start | (end+1));
 	tlb->need_flush_all = 0;
-	tlb->start	= start;
-	tlb->end	= end;
 	tlb->need_flush = 0;
 	tlb->local.next = NULL;
 	tlb->local.nr   = 0;
@@ -232,6 +230,8 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb->batch = NULL;
 #endif
+
+	__tlb_reset_range(tlb);
 }
 
 static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -241,6 +241,7 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
+	__tlb_reset_range(tlb);
 }
 
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
@@ -1186,20 +1187,8 @@ again:
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush) {
-		unsigned long old_end;
-
-		/*
-		 * Flush the TLB just for the previous segment,
-		 * then update the range to be the remaining
-		 * TLB range.
-		 */
-		old_end = tlb->end;
-		tlb->end = addr;
+	if (force_flush)
 		tlb_flush_mmu_tlbonly(tlb);
-		tlb->start = addr;
-		tlb->end = old_end;
-	}
 	pte_unmap_unlock(start_pte, ptl);
 
 	/*