Message ID | 20201002175303.390363-2-daniel.vetter@ffwll.ch |
---|---|
State | New |
Headers | show |
Series | [1/2] mm/frame-vec: Drop gup_flags from get_vaddr_frames() | expand |
On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote: > For $reasons I've stumbled over this code and I'm not sure the change > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector: > convert get_user_pages() --> pin_user_pages()") was entirely correct. > > This here is used for long term buffers (not just quick I/O) like > RDMA, and John notes this in his patch. But I thought the rule for > these is that they need to add FOLL_LONGTERM, which John's patch > didn't do. > > There is already a dax specific check (added in b7f0554a56f2 ("mm: > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems > like the prudent thing to do. > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: John Hubbard <jhubbard@nvidia.com> > Cc: Jérôme Glisse <jglisse@redhat.com> > Cc: Jan Kara <jack@suse.cz> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: linux-mm@kvack.org > Cc: linux-arm-kernel@lists.infradead.org > Cc: linux-samsung-soc@vger.kernel.org > Cc: linux-media@vger.kernel.org > Hi all, > > I stumbled over this and figured typing this patch can't hurt. Really > just to maybe learn a few things about how gup/pup is supposed to be > used (we have a bit of that in drivers/gpu), this here isn't really > ralated to anything I'm doing. FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO > I'm also wondering whether the explicit dax check should be removed, > since FOLL_LONGTERM should take care of that already. Yep! Confirms the above! This get_vaddr_frames() thing looks impossible to use properly. How on earth does a driver guarentee "If @start belongs to VM_IO | VM_PFNMAP vma, we don't touch page structures and the caller must make sure pfns aren't reused for anything else while he is using them." The only possible way to do that is if the driver restricts the VMAs to ones it owns and interacts with the vm_private data to refcount something. Since every driver does this wrong anything that uses this is creating terrifying security issues. IMHO this whole API should be deleted :( Jason
On Fri, Oct 2, 2020 at 8:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote: > > For $reasons I've stumbled over this code and I'm not sure the change > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector: > > convert get_user_pages() --> pin_user_pages()") was entirely correct. > > > > This here is used for long term buffers (not just quick I/O) like > > RDMA, and John notes this in his patch. But I thought the rule for > > these is that they need to add FOLL_LONGTERM, which John's patch > > didn't do. > > > > There is already a dax specific check (added in b7f0554a56f2 ("mm: > > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems > > like the prudent thing to do. > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: John Hubbard <jhubbard@nvidia.com> > > Cc: Jérôme Glisse <jglisse@redhat.com> > > Cc: Jan Kara <jack@suse.cz> > > Cc: Dan Williams <dan.j.williams@intel.com> > > Cc: linux-mm@kvack.org > > Cc: linux-arm-kernel@lists.infradead.org > > Cc: linux-samsung-soc@vger.kernel.org > > Cc: linux-media@vger.kernel.org > > Hi all, > > > > I stumbled over this and figured typing this patch can't hurt. Really > > just to maybe learn a few things about how gup/pup is supposed to be > > used (we have a bit of that in drivers/gpu), this here isn't really > > ralated to anything I'm doing. > > FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO Since you're here ... I've noticed that ib sets FOLL_FORCE when the ib verb access mode indicates possible writes. I'm not really clear on why FOLL_WRITE isn't enough any why you need to be able to write through a vma that's write protected currently. > > I'm also wondering whether the explicit dax check should be removed, > > since FOLL_LONGTERM should take care of that already. > > Yep! Confirms the above! > > This get_vaddr_frames() thing looks impossible to use properly. How on > earth does a driver guarentee > > "If @start belongs to VM_IO | VM_PFNMAP vma, we don't touch page > structures and the caller must make sure pfns aren't reused for > anything else while he is using them." > > The only possible way to do that is if the driver restricts the VMAs > to ones it owns and interacts with the vm_private data to refcount > something. > > Since every driver does this wrong anything that uses this is creating > terrifying security issues. > > IMHO this whole API should be deleted :( Yeah that part I just tried to conveniently ignore. I guess this dates back to a time when ioremaps where at best fixed, and there wasn't anything like a gpu driver dynamically managing vram around, resulting in random entirely unrelated things possibly being mapped to that set of pfns. The underlying follow_pfn is also used in other places within drivers/media, so this doesn't seem to be an accident, but actually intentional. I guess minimally we'd need a VM_PFNMAP flag for dynamically manged drivers like modern drm gpu drivers, to make sure follow_pfn doesn't follow these? -Daniel
On 10/2/20 10:53 AM, Daniel Vetter wrote: > For $reasons I've stumbled over this code and I'm not sure the change > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector: > convert get_user_pages() --> pin_user_pages()") was entirely correct. > > This here is used for long term buffers (not just quick I/O) like > RDMA, and John notes this in his patch. But I thought the rule for > these is that they need to add FOLL_LONGTERM, which John's patch > didn't do. Yep. The earlier gup --> pup conversion patches were intended to not have any noticeable behavior changes, and FOLL_LONGTERM, with it's special cases and such, added some risk that I wasn't ready to take on yet. Also, FOLL_LONGTERM rules are only *recently* getting firmed up. So there was some doubt at least in my mind, about which sites should have it. But now that we're here, I think it's really good that you've brought this up. It's definitely time to add FOLL_LONGTERM wherever it's missing. thanks,
On Sat, Oct 3, 2020 at 2:31 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Oct 02, 2020 at 08:16:48PM +0200, Daniel Vetter wrote: > > On Fri, Oct 2, 2020 at 8:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote: > > > > For $reasons I've stumbled over this code and I'm not sure the change > > > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector: > > > > convert get_user_pages() --> pin_user_pages()") was entirely correct. > > > > > > > > This here is used for long term buffers (not just quick I/O) like > > > > RDMA, and John notes this in his patch. But I thought the rule for > > > > these is that they need to add FOLL_LONGTERM, which John's patch > > > > didn't do. > > > > > > > > There is already a dax specific check (added in b7f0554a56f2 ("mm: > > > > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems > > > > like the prudent thing to do. > > > > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > Cc: John Hubbard <jhubbard@nvidia.com> > > > > Cc: Jérôme Glisse <jglisse@redhat.com> > > > > Cc: Jan Kara <jack@suse.cz> > > > > Cc: Dan Williams <dan.j.williams@intel.com> > > > > Cc: linux-mm@kvack.org > > > > Cc: linux-arm-kernel@lists.infradead.org > > > > Cc: linux-samsung-soc@vger.kernel.org > > > > Cc: linux-media@vger.kernel.org > > > > Hi all, > > > > > > > > I stumbled over this and figured typing this patch can't hurt. Really > > > > just to maybe learn a few things about how gup/pup is supposed to be > > > > used (we have a bit of that in drivers/gpu), this here isn't really > > > > ralated to anything I'm doing. > > > > > > FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO > > > > Since you're here ... I've noticed that ib sets FOLL_FORCE when the ib > > verb access mode indicates possible writes. I'm not really clear on > > why FOLL_WRITE isn't enough any why you need to be able to write > > through a vma that's write protected currently. > > Ah, FOLL_FORCE | FOLL_WRITE means *read* confusingly enough, and the > only reason you'd want this version for read is if you are doing > longterm stuff. I wrote about this recently: > > https://lore.kernel.org/linux-mm/20200928235739.GU9916@ziepe.ca/ > > > > Since every driver does this wrong anything that uses this is creating > > > terrifying security issues. > > > > > > IMHO this whole API should be deleted :( > > > > Yeah that part I just tried to conveniently ignore. I guess this dates > > back to a time when ioremaps where at best fixed, and there wasn't > > anything like a gpu driver dynamically managing vram around, resulting > > in random entirely unrelated things possibly being mapped to that set > > of pfns. > > No, it was always wrong. Prior to GPU like cases the lifetime of the > PTE was tied to the vma and when the vma becomes free the driver can > move the things in the PTEs to 'free'. Easy to trigger use-after-free > issues and devices like RDMA have security contexts attached to these > PTEs so it becomes a serious security bug to do something like this. > > > The underlying follow_pfn is also used in other places within > > drivers/media, so this doesn't seem to be an accident, but actually > > intentional. > > Looking closely, there are very few users, most *seem* pointless, but > maybe there is a crazy reason? > > The sequence > get_vaddr_frames(); > frame_vector_to_pages(); > sg_alloc_table_from_pages(); > > Should be written > pin_user_pages_fast(FOLL_LONGTERM); > sg_alloc_table_from_pages() > > There is some 'special' code in frame_vector_to_pages() that tries to > get a struct page for things from a VM_IO or VM_PFNMAP... > > Oh snap, that is *completely* broken! If the first VMA is IO|PFNMAP > then get_vaddr_frames() iterates over all VMAs in the range, of any > kind and extracts the PTEs then blindly references them! This means it > can be used to use after free normal RAM struct pages!! Gah! > > Wow. Okay. That has to go. > > So, I *think* we can assume there is no sane cases where > frame_vector_to_pages() succeeds but pin_user_pages() wasn't called. > > That means the users here: > - habanalabs: Hey Oded can you fix this up? Yes, no problem, I'll do it very soon. Thanks, Oded > > - gpu/exynos: Daniel can you get someone there to stop using it? > > - media/videobuf via vb2_dma_sg_get_userptr() > > Should all be switched to the standard pin_user_pages sequence > above. > > That leaves the only interesting places as vb2_dc_get_userptr() and > vb2_vmalloc_get_userptr() which both completely fail to follow the > REQUIRED behavior in the function's comment about checking PTEs. It > just DMA maps them. Badly broken. > > Guessing this hackery is for some embedded P2P DMA transfer? > > After he three places above should use pin_user_pages_fast(), then > this whole broken API should be moved into videobuf2-memops.c and a > big fat "THIS DOESN'T WORK" stuck on it. > > videobuf2 should probably use P2P DMA buf for this instead. > > Jason
On Sat, Oct 3, 2020 at 12:39 AM John Hubbard <jhubbard@nvidia.com> wrote: > > On 10/2/20 10:53 AM, Daniel Vetter wrote: > > For $reasons I've stumbled over this code and I'm not sure the change > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector: > > convert get_user_pages() --> pin_user_pages()") was entirely correct. > > > > This here is used for long term buffers (not just quick I/O) like > > RDMA, and John notes this in his patch. But I thought the rule for > > these is that they need to add FOLL_LONGTERM, which John's patch > > didn't do. > > Yep. The earlier gup --> pup conversion patches were intended to not > have any noticeable behavior changes, and FOLL_LONGTERM, with it's > special cases and such, added some risk that I wasn't ready to take > on yet. Also, FOLL_LONGTERM rules are only *recently* getting firmed > up. So there was some doubt at least in my mind, about which sites > should have it. > > But now that we're here, I think it's really good that you've brought > this up. It's definitely time to add FOLL_LONGTERM wherever it's missing. So should I keep this patch, or will it collide with a series you're working on? Also with the firmed up rules, correct that I can also drop the vma_is_fsdax check when the FOLL_LONGTERM flag is set? Thanks, Daniel > > thanks, > -- > John Hubbard > NVIDIA > > > > > There is already a dax specific check (added in b7f0554a56f2 ("mm: > > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems > > like the prudent thing to do. > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: John Hubbard <jhubbard@nvidia.com> > > Cc: Jérôme Glisse <jglisse@redhat.com> > > Cc: Jan Kara <jack@suse.cz> > > Cc: Dan Williams <dan.j.williams@intel.com> > > Cc: linux-mm@kvack.org > > Cc: linux-arm-kernel@lists.infradead.org > > Cc: linux-samsung-soc@vger.kernel.org > > Cc: linux-media@vger.kernel.org > > --- > > Hi all, > > > > I stumbled over this and figured typing this patch can't hurt. Really > > just to maybe learn a few things about how gup/pup is supposed to be > > used (we have a bit of that in drivers/gpu), this here isn't really > > ralated to anything I'm doing. > > > > I'm also wondering whether the explicit dax check should be removed, > > since FOLL_LONGTERM should take care of that already. > > -Daniel > > --- > > mm/frame_vector.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/mm/frame_vector.c b/mm/frame_vector.c > > index 5d34c9047e9c..3507e09cb3ff 100644 > > --- a/mm/frame_vector.c > > +++ b/mm/frame_vector.c > > @@ -35,7 +35,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, > > { > > struct mm_struct *mm = current->mm; > > struct vm_area_struct *vma; > > - unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE; > > + unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM; > > int ret = 0; > > int err; > > int locked; > > >
On 10/3/20 2:45 AM, Daniel Vetter wrote: > On Sat, Oct 3, 2020 at 12:39 AM John Hubbard <jhubbard@nvidia.com> wrote: >> >> On 10/2/20 10:53 AM, Daniel Vetter wrote: >>> For $reasons I've stumbled over this code and I'm not sure the change >>> to the new gup functions in 55a650c35fea ("mm/gup: frame_vector: >>> convert get_user_pages() --> pin_user_pages()") was entirely correct. >>> >>> This here is used for long term buffers (not just quick I/O) like >>> RDMA, and John notes this in his patch. But I thought the rule for >>> these is that they need to add FOLL_LONGTERM, which John's patch >>> didn't do. >> >> Yep. The earlier gup --> pup conversion patches were intended to not >> have any noticeable behavior changes, and FOLL_LONGTERM, with it's >> special cases and such, added some risk that I wasn't ready to take >> on yet. Also, FOLL_LONGTERM rules are only *recently* getting firmed >> up. So there was some doubt at least in my mind, about which sites >> should have it. >> >> But now that we're here, I think it's really good that you've brought >> this up. It's definitely time to add FOLL_LONGTERM wherever it's missing. > > So should I keep this patch, or will it collide with a series you're working on? It doesn't collide with anything on my end yet, because I've been slow to pick up on the need for changing callsites to add FOLL_LONGTERM. :) And it looks like that's actually a problem, because: > > Also with the firmed up rules, correct that I can also drop the > vma_is_fsdax check when the FOLL_LONGTERM flag is set? That's the right direction to go *in general*, but I see that the pin_user_pages code is still a bit stuck in the past. And this patch won't actually work, with or without that vma_is_fsdax() check. Because: get_vaddr_frames(FOLL_LONGTERM) pin_user_pages_locked() if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM)) return -EINVAL; So, again, pin_user_pages*() is at least partly behind the times here. I can jump in and start fixing it up, but it depends on what you and Oded and others are planning? Note: there is a particular combination of dax and locking that we have to still avoid, within gup.c. That's already covered, but needs to continue to be covered when we enable FOLL_LONGTERM in the remaining pin_user_pages*() calling paths. thanks, -- John Hubbard NVIDIA
On Sat, Oct 03, 2020 at 03:52:32PM -0700, John Hubbard wrote: > On 10/3/20 2:45 AM, Daniel Vetter wrote: > > On Sat, Oct 3, 2020 at 12:39 AM John Hubbard <jhubbard@nvidia.com> wrote: > > > > > > On 10/2/20 10:53 AM, Daniel Vetter wrote: > > > > For $reasons I've stumbled over this code and I'm not sure the change > > > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector: > > > > convert get_user_pages() --> pin_user_pages()") was entirely correct. > > > > > > > > This here is used for long term buffers (not just quick I/O) like > > > > RDMA, and John notes this in his patch. But I thought the rule for > > > > these is that they need to add FOLL_LONGTERM, which John's patch > > > > didn't do. > > > > > > Yep. The earlier gup --> pup conversion patches were intended to not > > > have any noticeable behavior changes, and FOLL_LONGTERM, with it's > > > special cases and such, added some risk that I wasn't ready to take > > > on yet. Also, FOLL_LONGTERM rules are only *recently* getting firmed > > > up. So there was some doubt at least in my mind, about which sites > > > should have it. > > > > > > But now that we're here, I think it's really good that you've brought > > > this up. It's definitely time to add FOLL_LONGTERM wherever it's missing. > > > > So should I keep this patch, or will it collide with a series you're working on? > > It doesn't collide with anything on my end yet, because I've been slow to > pick up on the need for changing callsites to add FOLL_LONGTERM. :) > > And it looks like that's actually a problem, because: > > > > > Also with the firmed up rules, correct that I can also drop the > > vma_is_fsdax check when the FOLL_LONGTERM flag is set? > > That's the right direction to go *in general*, but I see that the > pin_user_pages code is still a bit stuck in the past. And this patch > won't actually work, with or without that vma_is_fsdax() check. > Because: > > get_vaddr_frames(FOLL_LONGTERM) > pin_user_pages_locked() > if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM)) > return -EINVAL; There is no particular reason this code needs to have the mm sem at that point. It should call pin_user_pages_fast() and only if that fails get the mmap lock and extract the VMA to do broken hackery. Jason
On Sat, Oct 03, 2020 at 11:40:22AM +0200, Daniel Vetter wrote: > > That leaves the only interesting places as vb2_dc_get_userptr() and > > vb2_vmalloc_get_userptr() which both completely fail to follow the > > REQUIRED behavior in the function's comment about checking PTEs. It > > just DMA maps them. Badly broken. > > > > Guessing this hackery is for some embedded P2P DMA transfer? > > Yeah, see also the follow_pfn trickery in > videobuf_dma_contig_user_get(), I think this is fully intentional and > userspace abi we can't break :-/ We don't need to break uABI, it just needs to work properly in the kernel: vma = find_vma_intersection() dma_buf = dma_buf_get_from_vma(vma) sg = dma_buf_p2p_dma_map(dma_buf) [.. do dma ..] dma_buf_unmap(sg) dma_buf_put(dma_buf) It is as we discussed before, dma buf needs to be discoverable from a VMA, at least for users doing this kind of stuff. > Yup this should be done with dma_buf instead, and v4l has that. But > old uapi and all that. This is why I said we might need a new > VM_DYNAMIC_PFNMAP or so, to make follow_pfn not resolve this in the > case where the driver manages the underlying iomem range (or whatever > it is) dynamically and moves buffer objects around, like drm drivers > do. But I looked, and we've run out of vma->vm_flags :-( A VM flag doesn't help - we need to introduce some kind of lifetime, and that has to be derived from the VMA. It needs data not just a flag > The other problem is that I also have no real working clue about all > the VM_* flags and what they all mean, and whether drm drivers set the > right ones in all cases (they probably don't, but oh well). > Documentation for this stuff in headers is a bit thin at times. Yah, I don't really know either :\ The comment above vm_normal_page() is a bit helpful. Don't know what VM_IO/VM_PFNMAP mean in their 3 combinations There are very few places that set VM_PFNMAP without VM_IO.. Jason
On Sun, Oct 4, 2020 at 2:51 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Sat, Oct 03, 2020 at 11:40:22AM +0200, Daniel Vetter wrote: > > > > That leaves the only interesting places as vb2_dc_get_userptr() and > > > vb2_vmalloc_get_userptr() which both completely fail to follow the > > > REQUIRED behavior in the function's comment about checking PTEs. It > > > just DMA maps them. Badly broken. > > > > > > Guessing this hackery is for some embedded P2P DMA transfer? > > > > Yeah, see also the follow_pfn trickery in > > videobuf_dma_contig_user_get(), I think this is fully intentional and > > userspace abi we can't break :-/ > > We don't need to break uABI, it just needs to work properly in the > kernel: > > vma = find_vma_intersection() > dma_buf = dma_buf_get_from_vma(vma) > sg = dma_buf_p2p_dma_map(dma_buf) > [.. do dma ..] > dma_buf_unmap(sg) > dma_buf_put(dma_buf) > > It is as we discussed before, dma buf needs to be discoverable from a > VMA, at least for users doing this kind of stuff. I'm not a big fan of magic behaviour like this, there's more to dma-buf buffer sharing than just "how do I get at the backing storage". Thus far we've done everything rather explicitly. Plus with exynos and habanalabs converted there's only v4l left over, and that has a proper dma-buf import path already. > > Yup this should be done with dma_buf instead, and v4l has that. But > > old uapi and all that. This is why I said we might need a new > > VM_DYNAMIC_PFNMAP or so, to make follow_pfn not resolve this in the > > case where the driver manages the underlying iomem range (or whatever > > it is) dynamically and moves buffer objects around, like drm drivers > > do. But I looked, and we've run out of vma->vm_flags :-( > > A VM flag doesn't help - we need to introduce some kind of lifetime, > and that has to be derived from the VMA. It needs data not just a flag I don't want to make it work, I just want to make it fail. Rough idea I have in mind is to add a follow_pfn_longterm, for all callers which aren't either synchronized through mmap_sem or an mmu_notifier. If this really breaks anyone's use-case we can add a tainting kernel option which re-enables this (we've done something similar for phys_addr_t based buffer sharing in fbdev, entirely unfixable since the other driver has to just blindly trust that what userspace passes around is legit). This here isn't unfixable, but if v4l people want to keep it without a big "security hole here" sticker, they should do the work, not me :-) > > The other problem is that I also have no real working clue about all > > the VM_* flags and what they all mean, and whether drm drivers set the > > right ones in all cases (they probably don't, but oh well). > > Documentation for this stuff in headers is a bit thin at times. > > Yah, I don't really know either :\ > > The comment above vm_normal_page() is a bit helpful. Don't know what > VM_IO/VM_PFNMAP mean in their 3 combinations > > There are very few places that set VM_PFNMAP without VM_IO.. Best I could find is: - mk68 seems to outright reject pagefaults on VM_IO vma - some places set VM_IO together with VM_MIXEDMAP instead of VM_PFNMAP. There's some comments that this makes cow possible, but I guess that's for the old pfn remap vma (remap_file_pages, which is removed now). But really no clue. VM_IO | VM_MIXEDMAP kinda makes me wonder whether follow_pfn gets the page refcounting all right or horribly wrong in some cases ... -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Hi! On Fri 02-10-20 15:06:03, Jason Gunthorpe wrote: > This get_vaddr_frames() thing looks impossible to use properly. How on > earth does a driver guarentee > > "If @start belongs to VM_IO | VM_PFNMAP vma, we don't touch page > structures and the caller must make sure pfns aren't reused for > anything else while he is using them." > > The only possible way to do that is if the driver restricts the VMAs > to ones it owns and interacts with the vm_private data to refcount > something. > > Since every driver does this wrong anything that uses this is creating > terrifying security issues. > > IMHO this whole API should be deleted :( So I'm the one guilty for introducing this API. The API was created to factor out code in several (mostly V4L AFAIR) drivers thus reducing amount of drivers poking into MM internals and getting things wrong in various cases. It may well be that the API is still broken from "can driver ensure this" POV - I tried to keep things things as they were before in this regard as I have very little knowledge in how these drivers are supposed to work. Anyway, if you can make this go away, sure go ahead :) Honza
On Sun, Oct 04, 2020 at 06:09:29PM +0200, Daniel Vetter wrote: > On Sun, Oct 4, 2020 at 2:51 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Sat, Oct 03, 2020 at 11:40:22AM +0200, Daniel Vetter wrote: > > > > > > That leaves the only interesting places as vb2_dc_get_userptr() and > > > > vb2_vmalloc_get_userptr() which both completely fail to follow the > > > > REQUIRED behavior in the function's comment about checking PTEs. It > > > > just DMA maps them. Badly broken. > > > > > > > > Guessing this hackery is for some embedded P2P DMA transfer? > > > > > > Yeah, see also the follow_pfn trickery in > > > videobuf_dma_contig_user_get(), I think this is fully intentional and > > > userspace abi we can't break :-/ > > > > We don't need to break uABI, it just needs to work properly in the > > kernel: > > > > vma = find_vma_intersection() > > dma_buf = dma_buf_get_from_vma(vma) > > sg = dma_buf_p2p_dma_map(dma_buf) > > [.. do dma ..] > > dma_buf_unmap(sg) > > dma_buf_put(dma_buf) > > > > It is as we discussed before, dma buf needs to be discoverable from a > > VMA, at least for users doing this kind of stuff. > > I'm not a big fan of magic behaviour like this, there's more to > dma-buf buffer sharing than just "how do I get at the backing > storage". Thus far we've done everything rather explicitly. Plus with > exynos and habanalabs converted there's only v4l left over, and that > has a proper dma-buf import path already. Well, any VA approach like this has to access some backing refcount via the VMA. Not really any way to avoid something like that > > A VM flag doesn't help - we need to introduce some kind of lifetime, > > and that has to be derived from the VMA. It needs data not just a flag > > I don't want to make it work, I just want to make it fail. Rough idea > I have in mind is to add a follow_pfn_longterm, for all callers which > aren't either synchronized through mmap_sem or an mmu_notifier. follow_pfn() doesn't work outside the pagetable locks or mmu notifier protection. Can't be fixed. We only have a few users: arch/s390/pci/pci_mmio.c: ret = follow_pfn(vma, user_addr, pfn); drivers/media/v4l2-core/videobuf-dma-contig.c: ret = follow_pfn(vma, user_address, &this_pfn); drivers/vfio/vfio_iommu_type1.c: ret = follow_pfn(vma, vaddr, pfn); drivers/vfio/vfio_iommu_type1.c: ret = follow_pfn(vma, vaddr, pfn); mm/frame_vector.c: err = follow_pfn(vma, start, &nums[ret]); virt/kvm/kvm_main.c: r = follow_pfn(vma, addr, &pfn); virt/kvm/kvm_main.c: r = follow_pfn(vma, addr, &pfn); VFIO is broken like media, but I saw patches fixing the vfio cases using the VMA and a vfio specific refcount. media & frame_vector we are talking about here. kvm is some similar hack added for P2P DMA, see commit add6a0cd1c5ba51b201e1361b05a5df817083618. It might be protected by notifiers.. s390 looks broken too, needs to hold the page table locks. So, the answer really is that s390 and media need fixing, and this API should go away (or become kvm specific) > If this really breaks anyone's use-case we can add a tainting kernel > option which re-enables this (we've done something similar for > phys_addr_t based buffer sharing in fbdev, entirely unfixable since > the other driver has to just blindly trust that what userspace > passes around is legit). This here isn't unfixable, but if v4l > people want to keep it without a big "security hole here" sticker, > they should do the work, not me :-) This seems fairly reasonable.. So after frame_vec is purged and we have the one caller in media, move all this stuff to media and taint the kernel if it goes down the follow_pfn path Jason
On Sun, Oct 04, 2020 at 01:20:31PM +0200, Daniel Vetter wrote: > Yeah I think that works. I tried understanding gup.c code a bit more, > and it looks like FOLL_LONGTERM only works for the pup_fast variant > right now? All others seem to have this comment that it's somehow > incompatible with FAULT_FLAG_ALLOW_RETRY and daxfs. But grepping > around for that didn't show up anything, at least not nearby dax code. > For my understanding of all this, what's the hiccup there? IIRC it needs the VMA and various other flows can't return the vma list becuse they unlock the mmap_sem will running > For plans, I only started this for a bit of my own learning, but I > think I'll respin with the following changes: > - convert exynos and habanalabs to pin_user_pages_fast directly, > instead of going through this frame-vector detour +1 Jason
On Mon, Oct 5, 2020 at 7:28 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Sun, Oct 04, 2020 at 06:09:29PM +0200, Daniel Vetter wrote: > > On Sun, Oct 4, 2020 at 2:51 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Sat, Oct 03, 2020 at 11:40:22AM +0200, Daniel Vetter wrote: > > > > > > > > That leaves the only interesting places as vb2_dc_get_userptr() and > > > > > vb2_vmalloc_get_userptr() which both completely fail to follow the > > > > > REQUIRED behavior in the function's comment about checking PTEs. It > > > > > just DMA maps them. Badly broken. > > > > > > > > > > Guessing this hackery is for some embedded P2P DMA transfer? > > > > > > > > Yeah, see also the follow_pfn trickery in > > > > videobuf_dma_contig_user_get(), I think this is fully intentional and > > > > userspace abi we can't break :-/ > > > > > > We don't need to break uABI, it just needs to work properly in the > > > kernel: > > > > > > vma = find_vma_intersection() > > > dma_buf = dma_buf_get_from_vma(vma) > > > sg = dma_buf_p2p_dma_map(dma_buf) > > > [.. do dma ..] > > > dma_buf_unmap(sg) > > > dma_buf_put(dma_buf) > > > > > > It is as we discussed before, dma buf needs to be discoverable from a > > > VMA, at least for users doing this kind of stuff. > > > > I'm not a big fan of magic behaviour like this, there's more to > > dma-buf buffer sharing than just "how do I get at the backing > > storage". Thus far we've done everything rather explicitly. Plus with > > exynos and habanalabs converted there's only v4l left over, and that > > has a proper dma-buf import path already. > > Well, any VA approach like this has to access some backing refcount > via the VMA. Not really any way to avoid something like that > > > > A VM flag doesn't help - we need to introduce some kind of lifetime, > > > and that has to be derived from the VMA. It needs data not just a flag > > > > I don't want to make it work, I just want to make it fail. Rough idea > > I have in mind is to add a follow_pfn_longterm, for all callers which > > aren't either synchronized through mmap_sem or an mmu_notifier. > > follow_pfn() doesn't work outside the pagetable locks or mmu notifier > protection. Can't be fixed. > > We only have a few users: > > arch/s390/pci/pci_mmio.c: ret = follow_pfn(vma, user_addr, pfn); > drivers/media/v4l2-core/videobuf-dma-contig.c: ret = follow_pfn(vma, user_address, &this_pfn); > drivers/vfio/vfio_iommu_type1.c: ret = follow_pfn(vma, vaddr, pfn); > drivers/vfio/vfio_iommu_type1.c: ret = follow_pfn(vma, vaddr, pfn); > mm/frame_vector.c: err = follow_pfn(vma, start, &nums[ret]); > virt/kvm/kvm_main.c: r = follow_pfn(vma, addr, &pfn); > virt/kvm/kvm_main.c: r = follow_pfn(vma, addr, &pfn); > > VFIO is broken like media, but I saw patches fixing the vfio cases > using the VMA and a vfio specific refcount. > > media & frame_vector we are talking about here. > > kvm is some similar hack added for P2P DMA, see commit > add6a0cd1c5ba51b201e1361b05a5df817083618. It might be protected by notifiers.. Yeah my thinking is that kvm (and I think also vfio, also seems to have mmu notifier nearby) are ok because of the mmu notiifer. Assuming that one works correctly. > s390 looks broken too, needs to hold the page table locks. Hm yeah I guess that looks fairly reasonable to fix too. > So, the answer really is that s390 and media need fixing, and this API > should go away (or become kvm specific) I'm still not clear how you want fo fix this, since your vma->dma_buf idea is kinda a decade long plan and so just not going to happen: - v4l used this mostly (afaik the lore at least) for buffer sharing with v4l itself, and also a bit with fbdev. Neither even has any dma-buf exporter code as-is. - like I said, there's no central dma-buf instance, it was fairly intentionally create as an all-to-all abstraction. Which means you either have to roll out a vm_ops->gimme_the_dmabuf or, even more work, refactor all the dma-buf exporters to go through the same things - even where we have dma-buf, most mmaps of buffer objects aren't a dma-buf. Those are only set up when userspace explicitly asks for one, so we'd also need to change the mmap code of all drivers involved to make sure the dma-buf is always created when we do any kind of mmap. I don't see that as a realistic thing to ever happen, and meanwhile we can't leave the gap open for a few years. > > If this really breaks anyone's use-case we can add a tainting kernel > > option which re-enables this (we've done something similar for > > phys_addr_t based buffer sharing in fbdev, entirely unfixable since > > the other driver has to just blindly trust that what userspace > > passes around is legit). This here isn't unfixable, but if v4l > > people want to keep it without a big "security hole here" sticker, > > they should do the work, not me :-) > > This seems fairly reasonable.. > > So after frame_vec is purged and we have the one caller in media, move > all this stuff to media and taint the kernel if it goes down the > follow_pfn path Yeah I think moving frame_vec back to media sounds like a good idea, it should stop new users like habanalbas/exynos from popping up at least. It's follow_pfn that freaks me out more. -Daniel -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Mon, Oct 05, 2020 at 08:16:33PM +0200, Daniel Vetter wrote: > > kvm is some similar hack added for P2P DMA, see commit > > add6a0cd1c5ba51b201e1361b05a5df817083618. It might be protected by notifiers.. > > Yeah my thinking is that kvm (and I think also vfio, also seems to > have mmu notifier nearby) are ok because of the mmu notiifer. Assuming > that one works correctly. vfio doesn't have a notifier, Alex was looking to add a vfio private scheme in the vma->private_data: https://lore.kernel.org/kvm/159017449210.18853.15037950701494323009.stgit@gimli.home/ Guess it never happened. > > So, the answer really is that s390 and media need fixing, and this API > > should go away (or become kvm specific) > > I'm still not clear how you want fo fix this, since your vma->dma_buf > idea is kinda a decade long plan and so just not going to happen: Well, it doesn't mean we have to change every part of dma_buf to participate in this. Just the bits media cares about. Or maybe it is some higher level varient on top of dma_buf. Or don't use dma_buf for this, add a new object that just provides refcounts and P2P DMA connection for IO pfn ranges.. Jason
On Mon, Oct 5, 2020 at 8:37 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Mon, Oct 05, 2020 at 08:16:33PM +0200, Daniel Vetter wrote: > > > > kvm is some similar hack added for P2P DMA, see commit > > > add6a0cd1c5ba51b201e1361b05a5df817083618. It might be protected by notifiers.. > > > > Yeah my thinking is that kvm (and I think also vfio, also seems to > > have mmu notifier nearby) are ok because of the mmu notiifer. Assuming > > that one works correctly. > > vfio doesn't have a notifier, Alex was looking to add a vfio private > scheme in the vma->private_data: > > https://lore.kernel.org/kvm/159017449210.18853.15037950701494323009.stgit@gimli.home/ > > Guess it never happened. I was mislead by the mmu notifier in drivers/vfio/vfio.c. But looking closer, that's only used by some drivers, I guess to make sure their device pagetables are kept in sync with reality. And not to make sure the vfio pfn view is kept in sync with reality. This could get real nasty I think. > > > So, the answer really is that s390 and media need fixing, and this API > > > should go away (or become kvm specific) > > > > I'm still not clear how you want fo fix this, since your vma->dma_buf > > idea is kinda a decade long plan and so just not going to happen: > > Well, it doesn't mean we have to change every part of dma_buf to > participate in this. Just the bits media cares about. Or maybe it is > some higher level varient on top of dma_buf. > > Or don't use dma_buf for this, add a new object that just provides > refcounts and P2P DMA connection for IO pfn ranges.. So good news is, I dug some layers deeper in v4l, and there's only 2 users which do actually handle pfn and don't immediately convert to a pages array: - videbuf-dma-contig.c. Luckily videobuf 1 is deprecated since forever, so I think we might get away with either just breaking this, or at least tainting kernels and hiding it behind a nasty Kconfig. This only uses follow_pfn, which we need to keep anyway for vfio in the unsafe variant :-/ - videbuf2-vmalloc.c Digging through history this was added to support import of v4l buffers from drivers that needed contig memory. And way back before CMA, that meant carveout memory not backed by struct page *. That should now all have struct pages and be managed by CMA (since videbuf2-dma-contig.c just uses dma_alloc_coherent underneath), so I think we can just switch to pin_user_pages(FOLL_LONGTERM here too). iow I think I can outright delete the frame vector stuff. -Daniel
On Tue, Oct 06, 2020 at 12:43:31AM +0200, Daniel Vetter wrote: > > iow I think I can outright delete the frame vector stuff. > > Ok this doesn't work, because dma_mmap always uses a remap_pfn_range, > which is a VM_IO | VM_PFNMAP vma and so even if it's cma backed and > not a carveout, we can't get the pages. If CMA memory has struct pages it probably should be mmap'd with different flags, and the lifecycle of the CMA memory needs to respect the struct page refcount? > Plus trying to move the cma pages out of cma for FOLL_LONGTERM would > be kinda bad when they've been allocated as a contig block by > dma_alloc_coherent :-) Isn't holding a long term reference to a CMA page one of those really scary use-after-free security issues I've been talking about? I know nothing about CMA, so can't say too much, sorry Jason
On Wed, Oct 7, 2020 at 12:47 PM Marek Szyprowski <m.szyprowski@samsung.com> wrote: > > Hi Daniel, > > On 03.10.2020 11:40, Daniel Vetter wrote: > >> After he three places above should use pin_user_pages_fast(), then > >> this whole broken API should be moved into videobuf2-memops.c and a > >> big fat "THIS DOESN'T WORK" stuck on it. > >> > >> videobuf2 should probably use P2P DMA buf for this instead. > > Yup this should be done with dma_buf instead, and v4l has that. > > Yes, V4L2 has dma_buf support NOW. That days, using so called V4L2 > USERPTR method was the only way to achieve zero copy buffer sharing > between devices, so this is just a historical baggage. I've been > actively involved in implementing that. I've tried to make it secure as > much as possible assuming the limitation of that approach. With a few > assumptions it works fine. Buffers are refcounted both by the > vm_ops->open or by incrementing the refcount of the vm->file. This > basically works with any sane driver, which doesn't free the mmaped > buffer until the file is released. This is true for V4L2 and FBdev devices. I'm not seeing any of that vm->file refcounting going on, so not seeing anything that prevents the mmap area from being removed. Can you pls give me some pointers about which code you're thinking of? -Daniel > This API is considered as deprecated in V4L2 world, so I think > supporting this hack can be removed one day as nowadays userspace should > use dma buf. > > > ... > > Best regards > -- > Marek Szyprowski, PhD > Samsung R&D Institute Poland >
Hi Daniel, On 07.10.2020 14:01, Daniel Vetter wrote: > On Wed, Oct 7, 2020 at 12:47 PM Marek Szyprowski > <m.szyprowski@samsung.com> wrote: >> On 03.10.2020 11:40, Daniel Vetter wrote: >>>> After he three places above should use pin_user_pages_fast(), then >>>> this whole broken API should be moved into videobuf2-memops.c and a >>>> big fat "THIS DOESN'T WORK" stuck on it. >>>> >>>> videobuf2 should probably use P2P DMA buf for this instead. >>> Yup this should be done with dma_buf instead, and v4l has that. >> Yes, V4L2 has dma_buf support NOW. That days, using so called V4L2 >> USERPTR method was the only way to achieve zero copy buffer sharing >> between devices, so this is just a historical baggage. I've been >> actively involved in implementing that. I've tried to make it secure as >> much as possible assuming the limitation of that approach. With a few >> assumptions it works fine. Buffers are refcounted both by the >> vm_ops->open or by incrementing the refcount of the vm->file. This >> basically works with any sane driver, which doesn't free the mmaped >> buffer until the file is released. This is true for V4L2 and FBdev devices. > I'm not seeing any of that vm->file refcounting going on, so not > seeing anything that prevents the mmap area from being removed. Can > you pls give me some pointers about which code you're thinking of? Well, it was in vb2_get_vma() function, but now I see that it has been lost in fb639eb39154 and 6690c8c78c74 some time ago... Best regards -- Marek Szyprowski, PhD Samsung R&D Institute Poland
On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote: > Well, it was in vb2_get_vma() function, but now I see that it has been > lost in fb639eb39154 and 6690c8c78c74 some time ago... There is no guarentee that holding a get on the file says anthing about the VMA. This needed to check that the file was some special kind of file that promised the VMA layout and file lifetime are connected. Also, cloning a VMA outside the mm world is just really bad. That would screw up many assumptions the drivers make. If it is all obsolete I say we hide it behind a default n config symbol and taint the kernel if anything uses it. Add a big comment above the follow_pfn to warn others away from this code. Jason
On Wed, Oct 07, 2020 at 02:58:33PM +0200, Daniel Vetter wrote: > On Wed, Oct 7, 2020 at 2:48 PM Tomasz Figa <tfiga@chromium.org> wrote: > > > > On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote: > > > > Well, it was in vb2_get_vma() function, but now I see that it has been > > > > lost in fb639eb39154 and 6690c8c78c74 some time ago... > > > > > > There is no guarentee that holding a get on the file says anthing > > > about the VMA. This needed to check that the file was some special > > > kind of file that promised the VMA layout and file lifetime are > > > connected. > > > > > > Also, cloning a VMA outside the mm world is just really bad. That > > > would screw up many assumptions the drivers make. > > > > > > If it is all obsolete I say we hide it behind a default n config > > > symbol and taint the kernel if anything uses it. > > > > > > Add a big comment above the follow_pfn to warn others away from this > > > code. > > > > Sadly it's just verbally declared as deprecated and not formally noted > > anyway. There are a lot of userspace applications relying on user > > pointer support. > > userptr can stay, it's the userptr abuse for zerocpy buffer sharing > which doesn't work anymore. At least without major surgery (you'd need > an mmu notifier to zap mappings and recreate them, and that pretty > much breaks the v4l model of preallocating all buffers to make sure we > never underflow the buffer queue). And static mappings are not coming > back I think, we'll go ever more into the direction of dynamic > mappings and moving stuff around as needed. Right, and to be clear, the last time I saw a security flaw of this magnitude from a subsystem badly mis-designing itself, Linus's knee-jerk reaction was to propose to remove the whole subsystem. Please don't take status-quo as acceptable, V4L community has to work to resolve this, uABI breakage or not. The follow_pfn related code must be compiled out of normal distro kernel builds. Jason
On Sat, Oct 3, 2020 at 1:31 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Oct 02, 2020 at 08:16:48PM +0200, Daniel Vetter wrote: > > On Fri, Oct 2, 2020 at 8:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote: > > > > For $reasons I've stumbled over this code and I'm not sure the change > > > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector: > > > > convert get_user_pages() --> pin_user_pages()") was entirely correct. > > > > > > > > This here is used for long term buffers (not just quick I/O) like > > > > RDMA, and John notes this in his patch. But I thought the rule for > > > > these is that they need to add FOLL_LONGTERM, which John's patch > > > > didn't do. > > > > > > > > There is already a dax specific check (added in b7f0554a56f2 ("mm: > > > > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems > > > > like the prudent thing to do. > > > > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > Cc: John Hubbard <jhubbard@nvidia.com> > > > > Cc: Jérôme Glisse <jglisse@redhat.com> > > > > Cc: Jan Kara <jack@suse.cz> > > > > Cc: Dan Williams <dan.j.williams@intel.com> > > > > Cc: linux-mm@kvack.org > > > > Cc: linux-arm-kernel@lists.infradead.org > > > > Cc: linux-samsung-soc@vger.kernel.org > > > > Cc: linux-media@vger.kernel.org > > > > Hi all, > > > > > > > > I stumbled over this and figured typing this patch can't hurt. Really > > > > just to maybe learn a few things about how gup/pup is supposed to be > > > > used (we have a bit of that in drivers/gpu), this here isn't really > > > > ralated to anything I'm doing. > > > > > > FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO > > > > Since you're here ... I've noticed that ib sets FOLL_FORCE when the ib > > verb access mode indicates possible writes. I'm not really clear on > > why FOLL_WRITE isn't enough any why you need to be able to write > > through a vma that's write protected currently. > > Ah, FOLL_FORCE | FOLL_WRITE means *read* confusingly enough, and the > only reason you'd want this version for read is if you are doing > longterm stuff. I wrote about this recently: > > https://lore.kernel.org/linux-mm/20200928235739.GU9916@ziepe.ca/ > > > > Since every driver does this wrong anything that uses this is creating > > > terrifying security issues. > > > > > > IMHO this whole API should be deleted :( > > > > Yeah that part I just tried to conveniently ignore. I guess this dates > > back to a time when ioremaps where at best fixed, and there wasn't > > anything like a gpu driver dynamically managing vram around, resulting > > in random entirely unrelated things possibly being mapped to that set > > of pfns. > > No, it was always wrong. Prior to GPU like cases the lifetime of the > PTE was tied to the vma and when the vma becomes free the driver can > move the things in the PTEs to 'free'. Easy to trigger use-after-free > issues and devices like RDMA have security contexts attached to these > PTEs so it becomes a serious security bug to do something like this. > > > The underlying follow_pfn is also used in other places within > > drivers/media, so this doesn't seem to be an accident, but actually > > intentional. > > Looking closely, there are very few users, most *seem* pointless, but > maybe there is a crazy reason? > > The sequence > get_vaddr_frames(); > frame_vector_to_pages(); > sg_alloc_table_from_pages(); > > Should be written > pin_user_pages_fast(FOLL_LONGTERM); > sg_alloc_table_from_pages() > > There is some 'special' code in frame_vector_to_pages() that tries to > get a struct page for things from a VM_IO or VM_PFNMAP... > > Oh snap, that is *completely* broken! If the first VMA is IO|PFNMAP > then get_vaddr_frames() iterates over all VMAs in the range, of any > kind and extracts the PTEs then blindly references them! This means it > can be used to use after free normal RAM struct pages!! Gah! > > Wow. Okay. That has to go. > > So, I *think* we can assume there is no sane cases where > frame_vector_to_pages() succeeds but pin_user_pages() wasn't called. > > That means the users here: > - habanalabs: Hey Oded can you fix this up? > > - gpu/exynos: Daniel can you get someone there to stop using it? > > - media/videobuf via vb2_dma_sg_get_userptr() > > Should all be switched to the standard pin_user_pages sequence > above. > > That leaves the only interesting places as vb2_dc_get_userptr() and > vb2_vmalloc_get_userptr() which both completely fail to follow the > REQUIRED behavior in the function's comment about checking PTEs. It > just DMA maps them. Badly broken. Note that vb2_vmalloc is only used for in-kernel CPU usage, e.g. the contents being copied by the driver between vb2 buffers and some hardware FIFO or other dedicated buffers. The memory does not go to any hardware DMA. Could you elaborate on what "the REQUIRED behavior is"? I can see that both follow the get_vaddr_frames() -> frame_vector_to_pages() flow, as you mentioned. Perhaps the only change needed is switching to pin_user_pages after all? > > Guessing this hackery is for some embedded P2P DMA transfer? > That's not the case. There are two usage scenarios: 1) Regular user pointer support, i.e. using malloc()ed memory as a video input or output buffer. This is deemed to be deprecated, but still used quite widely by various applications. Quite common for video decoder input and video encoder output, because the other side of the pipeline is CPU, e.g. network stack. 2) Sharing carveout buffers (without struct pages attached) between two devices. I believe this predates DMA-buf and CMA (which enabled carveouts with struct pages) and I'm not sure how commonly it is used these days. > After he three places above should use pin_user_pages_fast(), then > this whole broken API should be moved into videobuf2-memops.c and a > big fat "THIS DOESN'T WORK" stuck on it. > > videobuf2 should probably use P2P DMA buf for this instead. See above. Best regards, Tomasz
On Wed, Oct 07, 2020 at 03:06:17PM +0200, Tomasz Figa wrote: > Note that vb2_vmalloc is only used for in-kernel CPU usage, e.g. the > contents being copied by the driver between vb2 buffers and some > hardware FIFO or other dedicated buffers. The memory does not go to > any hardware DMA. That is even worse, the CPU can't just blindly touch VM_IO pages, that isn't portable. > Could you elaborate on what "the REQUIRED behavior is"? I can see that > both follow the get_vaddr_frames() -> frame_vector_to_pages() flow, as > you mentioned. Perhaps the only change needed is switching to > pin_user_pages after all? It is the comment right on top of get_vaddr_frames(): if @start belongs to VM_IO | VM_PFNMAP vma, we don't touch page structures and the caller must make sure pfns aren't reused for anything else while he is using them. Which means excluding every kind of VMA that is not something this driver understands and then using special knowledge of the driver-specific VMA to assure the above. For instance if you could detect the VMA is from a carevout and do something special like hold the fget() while knowning that the struct file guarentees the carveout remains reserved - then you could use follow_pfn. But it would be faster and better to ask the carveout FD for the vaddr range. Jason
On Wed, Oct 7, 2020 at 3:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Wed, Oct 07, 2020 at 02:58:33PM +0200, Daniel Vetter wrote: > > On Wed, Oct 7, 2020 at 2:48 PM Tomasz Figa <tfiga@chromium.org> wrote: > > > > > > On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote: > > > > > Well, it was in vb2_get_vma() function, but now I see that it has been > > > > > lost in fb639eb39154 and 6690c8c78c74 some time ago... > > > > > > > > There is no guarentee that holding a get on the file says anthing > > > > about the VMA. This needed to check that the file was some special > > > > kind of file that promised the VMA layout and file lifetime are > > > > connected. > > > > > > > > Also, cloning a VMA outside the mm world is just really bad. That > > > > would screw up many assumptions the drivers make. > > > > > > > > If it is all obsolete I say we hide it behind a default n config > > > > symbol and taint the kernel if anything uses it. > > > > > > > > Add a big comment above the follow_pfn to warn others away from this > > > > code. > > > > > > Sadly it's just verbally declared as deprecated and not formally noted > > > anyway. There are a lot of userspace applications relying on user > > > pointer support. > > > > userptr can stay, it's the userptr abuse for zerocpy buffer sharing > > which doesn't work anymore. At least without major surgery (you'd need > > an mmu notifier to zap mappings and recreate them, and that pretty > > much breaks the v4l model of preallocating all buffers to make sure we > > never underflow the buffer queue). And static mappings are not coming > > back I think, we'll go ever more into the direction of dynamic > > mappings and moving stuff around as needed. > > Right, and to be clear, the last time I saw a security flaw of this > magnitude from a subsystem badly mis-designing itself, Linus's > knee-jerk reaction was to propose to remove the whole subsystem. > > Please don't take status-quo as acceptable, V4L community has to work > to resolve this, uABI breakage or not. The follow_pfn related code > must be compiled out of normal distro kernel builds. I think the userptr zero-copy hack should be able to go away indeed, given that we now have CMA that allows having carveouts backed by struct pages and having the memory represented as DMA-buf normally. How about the regular userptr use case, though? The existing code resolves the user pointer into pages by following the get_vaddr_frames() -> frame_vector_to_pages() -> sg_alloc_table_from_pages() / vm_map_ram() approach. get_vaddr_frames() seems to use pin_user_pages() behind the scenes if the vma is not an IO or a PFNMAP, falling back to follow_pfn() otherwise. Is your intention to drop get_vaddr_frames() or we could still keep using it and if vec->is_pfns is true: a) if CONFIG_VIDEO_LEGACY_PFN_USERPTR is set, taint the kernel b) otherwise just undo and fail? Best regards, Tomasz
On Wed, Oct 7, 2020 at 3:34 PM Tomasz Figa <tfiga@chromium.org> wrote: > > On Wed, Oct 7, 2020 at 3:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Wed, Oct 07, 2020 at 02:58:33PM +0200, Daniel Vetter wrote: > > > On Wed, Oct 7, 2020 at 2:48 PM Tomasz Figa <tfiga@chromium.org> wrote: > > > > > > > > On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > > > On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote: > > > > > > Well, it was in vb2_get_vma() function, but now I see that it has been > > > > > > lost in fb639eb39154 and 6690c8c78c74 some time ago... > > > > > > > > > > There is no guarentee that holding a get on the file says anthing > > > > > about the VMA. This needed to check that the file was some special > > > > > kind of file that promised the VMA layout and file lifetime are > > > > > connected. > > > > > > > > > > Also, cloning a VMA outside the mm world is just really bad. That > > > > > would screw up many assumptions the drivers make. > > > > > > > > > > If it is all obsolete I say we hide it behind a default n config > > > > > symbol and taint the kernel if anything uses it. > > > > > > > > > > Add a big comment above the follow_pfn to warn others away from this > > > > > code. > > > > > > > > Sadly it's just verbally declared as deprecated and not formally noted > > > > anyway. There are a lot of userspace applications relying on user > > > > pointer support. > > > > > > userptr can stay, it's the userptr abuse for zerocpy buffer sharing > > > which doesn't work anymore. At least without major surgery (you'd need > > > an mmu notifier to zap mappings and recreate them, and that pretty > > > much breaks the v4l model of preallocating all buffers to make sure we > > > never underflow the buffer queue). And static mappings are not coming > > > back I think, we'll go ever more into the direction of dynamic > > > mappings and moving stuff around as needed. > > > > Right, and to be clear, the last time I saw a security flaw of this > > magnitude from a subsystem badly mis-designing itself, Linus's > > knee-jerk reaction was to propose to remove the whole subsystem. > > > > Please don't take status-quo as acceptable, V4L community has to work > > to resolve this, uABI breakage or not. The follow_pfn related code > > must be compiled out of normal distro kernel builds. > > I think the userptr zero-copy hack should be able to go away indeed, > given that we now have CMA that allows having carveouts backed by > struct pages and having the memory represented as DMA-buf normally. Not sure whether there's a confusion here: dma-buf supports memory not backed by struct page. > How about the regular userptr use case, though? > > The existing code resolves the user pointer into pages by following > the get_vaddr_frames() -> frame_vector_to_pages() -> > sg_alloc_table_from_pages() / vm_map_ram() approach. > get_vaddr_frames() seems to use pin_user_pages() behind the scenes if > the vma is not an IO or a PFNMAP, falling back to follow_pfn() > otherwise. Yeah pin_user_pages is fine, it's just the VM_IO | VM_PFNMAP vma that don't work. > > Is your intention to drop get_vaddr_frames() or we could still keep > using it and if vec->is_pfns is true: > a) if CONFIG_VIDEO_LEGACY_PFN_USERPTR is set, taint the kernel > b) otherwise just undo and fail? I'm typing that patch series (plus a pile more) right now. -Daniel
On Wed, Oct 7, 2020 at 4:12 PM Tomasz Figa <tfiga@chromium.org> wrote: > > On Wed, Oct 7, 2020 at 4:09 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > > > > On Wed, Oct 7, 2020 at 3:34 PM Tomasz Figa <tfiga@chromium.org> wrote: > > > > > > On Wed, Oct 7, 2020 at 3:06 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > On Wed, Oct 07, 2020 at 02:58:33PM +0200, Daniel Vetter wrote: > > > > > On Wed, Oct 7, 2020 at 2:48 PM Tomasz Figa <tfiga@chromium.org> wrote: > > > > > > > > > > > > On Wed, Oct 7, 2020 at 2:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > > > > > > > On Wed, Oct 07, 2020 at 02:33:56PM +0200, Marek Szyprowski wrote: > > > > > > > > Well, it was in vb2_get_vma() function, but now I see that it has been > > > > > > > > lost in fb639eb39154 and 6690c8c78c74 some time ago... > > > > > > > > > > > > > > There is no guarentee that holding a get on the file says anthing > > > > > > > about the VMA. This needed to check that the file was some special > > > > > > > kind of file that promised the VMA layout and file lifetime are > > > > > > > connected. > > > > > > > > > > > > > > Also, cloning a VMA outside the mm world is just really bad. That > > > > > > > would screw up many assumptions the drivers make. > > > > > > > > > > > > > > If it is all obsolete I say we hide it behind a default n config > > > > > > > symbol and taint the kernel if anything uses it. > > > > > > > > > > > > > > Add a big comment above the follow_pfn to warn others away from this > > > > > > > code. > > > > > > > > > > > > Sadly it's just verbally declared as deprecated and not formally noted > > > > > > anyway. There are a lot of userspace applications relying on user > > > > > > pointer support. > > > > > > > > > > userptr can stay, it's the userptr abuse for zerocpy buffer sharing > > > > > which doesn't work anymore. At least without major surgery (you'd need > > > > > an mmu notifier to zap mappings and recreate them, and that pretty > > > > > much breaks the v4l model of preallocating all buffers to make sure we > > > > > never underflow the buffer queue). And static mappings are not coming > > > > > back I think, we'll go ever more into the direction of dynamic > > > > > mappings and moving stuff around as needed. > > > > > > > > Right, and to be clear, the last time I saw a security flaw of this > > > > magnitude from a subsystem badly mis-designing itself, Linus's > > > > knee-jerk reaction was to propose to remove the whole subsystem. > > > > > > > > Please don't take status-quo as acceptable, V4L community has to work > > > > to resolve this, uABI breakage or not. The follow_pfn related code > > > > must be compiled out of normal distro kernel builds. > > > > > > I think the userptr zero-copy hack should be able to go away indeed, > > > given that we now have CMA that allows having carveouts backed by > > > struct pages and having the memory represented as DMA-buf normally. > > > > Not sure whether there's a confusion here: dma-buf supports memory not > > backed by struct page. > > > > That's new to me. The whole API relies on sg_tables a lot, which in > turn rely on struct page pointers to describe the physical memory. You're not allowed to look at struct page pointers from the importer side, those might not be there. Which isn't the prettiest thing, but it works. And even if there's a struct page, you're still not allowed to look at it, since it's fully managed by the exporter under whatever rules that might need. So no touching it, ever. This is also not news, supporting this was in the design brief from the kickoff session 10+ years ago at some linaro connect thing (in Budapest iirc). And we have implementations doing that for almost as long merged in upstream. > > > How about the regular userptr use case, though? > > > > > > The existing code resolves the user pointer into pages by following > > > the get_vaddr_frames() -> frame_vector_to_pages() -> > > > sg_alloc_table_from_pages() / vm_map_ram() approach. > > > get_vaddr_frames() seems to use pin_user_pages() behind the scenes if > > > the vma is not an IO or a PFNMAP, falling back to follow_pfn() > > > otherwise. > > > > Yeah pin_user_pages is fine, it's just the VM_IO | VM_PFNMAP vma that > > don't work. > > Ack. > > > > > > > Is your intention to drop get_vaddr_frames() or we could still keep > > > using it and if vec->is_pfns is true: > > > a) if CONFIG_VIDEO_LEGACY_PFN_USERPTR is set, taint the kernel > > > b) otherwise just undo and fail? > > > > I'm typing that patch series (plus a pile more) right now. > > Cool, thanks! > > We also need to bring back the vma_open() that somehow disappeared > around 4.2, as Marek found. The vm_open isn't enough to stop the problems (it doesn't and cannot protect against unmap_mapping_range), I don't think keeping an incomplete solution around has much benefit. People who need this can disable CONFIG_STRICT_FOLLOW_PFN to keep things working, everyone else probably doesn't want these mm internals leaking into the media subsystem. -Daniel
On Wed, Oct 07, 2020 at 04:11:59PM +0200, Tomasz Figa wrote: > We also need to bring back the vma_open() that somehow disappeared > around 4.2, as Marek found. No Jason
diff --git a/mm/frame_vector.c b/mm/frame_vector.c index 5d34c9047e9c..3507e09cb3ff 100644 --- a/mm/frame_vector.c +++ b/mm/frame_vector.c @@ -35,7 +35,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; - unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE; + unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM; int ret = 0; int err; int locked;
For $reasons I've stumbled over this code and I'm not sure the change to the new gup functions in 55a650c35fea ("mm/gup: frame_vector: convert get_user_pages() --> pin_user_pages()") was entirely correct. This here is used for long term buffers (not just quick I/O) like RDMA, and John notes this in his patch. But I thought the rule for these is that they need to add FOLL_LONGTERM, which John's patch didn't do. There is already a dax specific check (added in b7f0554a56f2 ("mm: fail get_vaddr_frames() for filesystem-dax mappings")), so this seems like the prudent thing to do. Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org --- Hi all, I stumbled over this and figured typing this patch can't hurt. Really just to maybe learn a few things about how gup/pup is supposed to be used (we have a bit of that in drivers/gpu), this here isn't really ralated to anything I'm doing. I'm also wondering whether the explicit dax check should be removed, since FOLL_LONGTERM should take care of that already. -Daniel --- mm/frame_vector.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)