Message ID | 4-v2-ce66f632bd0d+484-iommu_map_gfp_jgg@nvidia.com |
---|---|
State | New |
Series | Let iommufd charge IOPTE allocations to the memory cgroup |

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, January 19, 2023 2:01 AM
>
> Change the sg_alloc_table_from_pages() allocation that was hardwired to
> GFP_KERNEL to use the gfp parameter like the other allocations in this
> function.
>
> Auditing says this is never called from an atomic context, so it is safe
> as is, but reads wrong.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

On 2023-01-18 18:00, Jason Gunthorpe wrote:
> Change the sg_alloc_table_from_pages() allocation that was hardwired to
> GFP_KERNEL to use the gfp parameter like the other allocations in this
> function.
>
> Auditing says this is never called from an atomic context, so it is safe
> as is, but reads wrong.

I think the point may have been that the sgtable metadata is a
logically-distinct allocation from the buffer pages themselves. Much like
the allocation of the pages array itself further down in
__iommu_dma_alloc_pages(). I see these days it wouldn't be catastrophic to
pass GFP_HIGHMEM into __get_free_page() via sg_kmalloc(), but still,
allocating implementation-internal metadata with all the same constraints
as a DMA buffer has just as much smell of wrong about it IMO.

I'd say the more confusing thing about this particular context is why we're
using iommu_map_sg_atomic() further down - that seems to have been an
oversight in 781ca2de89ba, since this particular path has never supported
being called in atomic context.

Overall I'm starting to wonder if it might not be better to stick a "use
GFP_KERNEL_ACCOUNT if you allocate" flag in the domain for any level of the
API internals to pick up as appropriate, rather than propagate per-call gfp
flags everywhere. As it stands we're still missing potential pagetable and
other domain-related allocations by drivers in .attach_dev and even (in
probably-shouldn't-really-happen cases) .unmap_pages...

Thanks,
Robin.

> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/dma-iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 8c2788633c1766..e4bf1bb159f7c7 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -822,7 +822,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
>  	if (!iova)
>  		goto out_free_pages;
>
> -	if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, GFP_KERNEL))
> +	if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, gfp))
>  		goto out_free_iova;
>
>  	if (!(ioprot & IOMMU_CACHE)) {

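The per-domain flag floated in the reply above could look roughly like the following userspace sketch. Every name in it (`demo_domain`, `account_allocs`, `demo_domain_gfp()`, the `DEMO_GFP_*` values) is hypothetical and only stands in for kernel definitions; nothing like this is part of the posted patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's gfp_t and GFP_* constants. */
typedef unsigned int gfp_t;

#define DEMO_GFP_KERNEL         0x1u
#define DEMO_GFP_ACCOUNT        0x2u
#define DEMO_GFP_KERNEL_ACCOUNT (DEMO_GFP_KERNEL | DEMO_GFP_ACCOUNT)

/* Hypothetical domain carrying a "charge my allocations" flag. */
struct demo_domain {
	bool account_allocs;
};

/*
 * Any layer that allocates on behalf of the domain asks the domain
 * which gfp to use instead of receiving it as a per-call argument.
 */
static gfp_t demo_domain_gfp(const struct demo_domain *domain)
{
	return domain->account_allocs ? DEMO_GFP_KERNEL_ACCOUNT
				      : DEMO_GFP_KERNEL;
}
```

With this shape, call sites such as a driver's attach path would not need a new gfp parameter, at the cost of the accounting decision no longer being visible at the call site.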
On Fri, Jan 20, 2023 at 07:28:19PM +0000, Robin Murphy wrote:
> On 2023-01-18 18:00, Jason Gunthorpe wrote:
> > Change the sg_alloc_table_from_pages() allocation that was hardwired to
> > GFP_KERNEL to use the gfp parameter like the other allocations in this
> > function.
> >
> > Auditing says this is never called from an atomic context, so it is safe
> > as is, but reads wrong.
>
> I think the point may have been that the sgtable metadata is a
> logically-distinct allocation from the buffer pages themselves. Much like
> the allocation of the pages array itself further down in
> __iommu_dma_alloc_pages().

That makes sense, and it is a good reason to mask off the allocation
policy flags from the gfp.

On the other hand it also makes sense to continue to pass in things
like NOWAIT|NOWARN to all the allocations. Even to the iommu driver.

So I'd prefer to change this to mask and make all the following calls
consistently use the input gfp.

> I'd say the more confusing thing about this particular context is why we're
> using iommu_map_sg_atomic() further down - that seems to have been an
> oversight in 781ca2de89ba, since this particular path has never supported
> being called in atomic context.

Huh. I had fixed that in v1, this patch was supposed to have that
hunk, that was the main point of making this patch actually..

> Overall I'm starting to wonder if it might not be better to stick a "use
> GFP_KERNEL_ACCOUNT if you allocate" flag in the domain for any level of the
> API internals to pick up as appropriate, rather than propagate per-call gfp
> flags everywhere.

We might get to something like that, but it requires more parts that
are not ready yet. Most likely this would take the form of some kind
of 'this is an iommufd created domain' indication. This happens
naturally as part of the nesting patches.

Right now I want to get people to start testing with this because the
charge from the IOPTEs is far and away the largest memory draw. Parts
like fixing the iommu drivers to actually use gfp are necessary to
make it work.

If we flip the two places using KERNEL_ACCOUNT to something else later
it doesn't really matter. I think the removal of the two _atomic
wrappers is still appropriate stand-alone.

> As it stands we're still missing potential pagetable and other
> domain-related allocations by drivers in .attach_dev and even (in

Yes, I plan to get to those when we add an alloc_domain_iommufd() or
whatever op. The driver will know the calling context and can set the
gfp flags for any allocations made under alloc_domain at that time.

Then we can go and figure out if there are other allocations and if
all or only some drivers need a flag - eg at attach time.

Though this is less worrying because you can only scale attach up to
num_pasids * num open vfios.

iommufd will let userspace create and populate an unlimited number of
iommu_domains, so everything linked to an unattached iommu_domain
should be charged.

> probably-shouldn't-really-happen cases) .unmap_pages...

Gah, unmap_pages isn't allowed to fail. There is no way to recover from
this. iommufd will spew a warn and then have a small race where
userspace can UAF kernel memory.

I'd call such a driver implementation broken. Why would you need to do
this?? :(

Thanks,
Jason

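The "mask off the allocation policy flags" idea from this reply can be sketched in plain userspace C as below. This is only an illustration under stated assumptions: the `DEMO_GFP_*` bit values and the `demo_metadata_gfp()` helper are hypothetical, and which bits count as buffer-only placement policy is a guess for the sake of the example, not something the thread specifies.

```c
/* Hypothetical stand-ins for kernel gfp bits; the values are made up. */
typedef unsigned int gfp_t;

#define DEMO_GFP_DMA     0x01u /* placement policy: DMA zone */
#define DEMO_GFP_DMA32   0x02u /* placement policy: 32-bit zone */
#define DEMO_GFP_HIGHMEM 0x04u /* placement policy: highmem allowed */
#define DEMO_GFP_NOWARN  0x08u /* behaviour: no warning on failure */
#define DEMO_GFP_NOWAIT  0x10u /* behaviour: do not sleep */
#define DEMO_GFP_KERNEL  0x20u /* behaviour: normal kernel allocation */

/* Assumption: these bits only make sense for the DMA buffer pages. */
#define DEMO_BUFFER_ONLY_MASK \
	(DEMO_GFP_DMA | DEMO_GFP_DMA32 | DEMO_GFP_HIGHMEM)

/*
 * gfp to use for internal metadata (sg table, page array, page tables):
 * drop the placement bits, keep behaviour bits such as NOWAIT|NOWARN.
 */
static gfp_t demo_metadata_gfp(gfp_t buffer_gfp)
{
	return buffer_gfp & ~DEMO_BUFFER_ONLY_MASK;
}
```

For example, a buffer request of `DEMO_GFP_KERNEL | DEMO_GFP_HIGHMEM | DEMO_GFP_NOWARN` would yield `DEMO_GFP_KERNEL | DEMO_GFP_NOWARN` for the metadata allocations that follow, so one masked value can be passed consistently to every later call.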
On Fri, Jan 20, 2023 at 07:28:19PM +0000, Robin Murphy wrote:

> Overall I'm starting to wonder if it might not be better to stick a "use
> GFP_KERNEL_ACCOUNT if you allocate" flag in the domain for any level of the
> API internals to pick up as appropriate, rather than propagate per-call gfp
> flags everywhere.

I was thinking about this some more, and I don't think hiding the
GFP_KERNEL_ACCOUNT in the iommu driver will be very maintainable.

The GFP_KERNEL_ACCOUNT is sensitive to current since that is where it
gets the cgroup from, if we start putting it in driver code directly
it becomes very hard to understand if the call chains are actually
originating from a syscall or not.

I'd prefer we try to keep things so that iommufd provides the
GFP_KERNEL_ACCOUNT on a call-by-call basis where it is clearer what
call chains originate from a system call vs not.

So, I think we will strive for adding a gfp flag to the future 'alloc
domain iommufd' and pass GFP_KERNEL_ACCOUNT there. Then we can see
what is left.

Jason

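The argument about `current` can be illustrated with a small userspace model. It is a sketch only: `demo_cgroup`, `demo_current_cgroup`, and all `demo_*` functions are hypothetical stand-ins, and the "charge" is a simple byte counter rather than real memory cgroup accounting.

```c
#include <stddef.h>

typedef unsigned int gfp_t;

#define DEMO_GFP_KERNEL         0x1u
#define DEMO_GFP_ACCOUNT        0x2u
#define DEMO_GFP_KERNEL_ACCOUNT (DEMO_GFP_KERNEL | DEMO_GFP_ACCOUNT)

/* Hypothetical cgroup that just counts charged bytes. */
struct demo_cgroup {
	size_t charged;
};

/* Stand-in for "current": the cgroup of whichever task is running. */
static struct demo_cgroup *demo_current_cgroup;

/* Model of a driver page-table allocation honouring the caller's gfp. */
static void demo_alloc_iopte(size_t bytes, gfp_t gfp)
{
	if ((gfp & DEMO_GFP_ACCOUNT) && demo_current_cgroup)
		demo_current_cgroup->charged += bytes;
}

/* Model of a map call that threads gfp down to the driver. */
static void demo_iommu_map(size_t table_bytes, gfp_t gfp)
{
	demo_alloc_iopte(table_bytes, gfp);
}

/* Syscall path: the caller's task is known, so accounting is requested. */
static void demo_iommufd_ioctl_map(struct demo_cgroup *caller, size_t bytes)
{
	demo_current_cgroup = caller;
	demo_iommu_map(bytes, DEMO_GFP_KERNEL_ACCOUNT);
}

/* Non-syscall path: whoever happens to be running must not be charged. */
static void demo_dma_api_map(struct demo_cgroup *running, size_t bytes)
{
	demo_current_cgroup = running;
	demo_iommu_map(bytes, DEMO_GFP_KERNEL);
}
```

In this model the decision to charge sits in the entry point that knows it runs on behalf of a system call, and the driver-level allocation simply obeys the gfp it was handed.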
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8c2788633c1766..e4bf1bb159f7c7 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -822,7 +822,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
 	if (!iova)
 		goto out_free_pages;
 
-	if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, GFP_KERNEL))
+	if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, gfp))
 		goto out_free_iova;
 
 	if (!(ioprot & IOMMU_CACHE)) {

Change the sg_alloc_table_from_pages() allocation that was hardwired to
GFP_KERNEL to use the gfp parameter like the other allocations in this
function.

Auditing says this is never called from an atomic context, so it is safe
as is, but reads wrong.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/dma-iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)