Message ID: 20231128204938.1453583-1-pasha.tatashin@soleen.com
Series: IOMMU memory observability
On Tue, Nov 28, 2023 at 4:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Nov 28, 2023 at 12:49 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > From: Pasha Tatashin <tatashin@google.com>
> >
> > The IOMMU subsystem may contain state that measures in gigabytes. The
> > majority of that state is IOMMU page tables. Yet, there is currently no
> > way to observe how much memory is actually used by the IOMMU subsystem.
> >
> > This patch series solves this problem by adding both observability of
> > all pages that are allocated by the IOMMU, and also accountability, so
> > admins can limit the amount via cgroups.
> >
> > The system-wide observability uses /proc/meminfo:
> > SecPageTables:    438176 kB
> >
> > Contains IOMMU and KVM memory.
> >
> > Per-node observability:
> > /sys/devices/system/node/nodeN/meminfo
> > Node N SecPageTables:   422204 kB
> >
> > Contains IOMMU and KVM memory in the given NUMA node.
> >
> > Per-node IOMMU-only observability:
> > /sys/devices/system/node/nodeN/vmstat
> > nr_iommu_pages 105555
> >
> > Contains the number of pages the IOMMU allocated in the given node.
>
> Does it make sense to have a KVM-only entry there as well?
>
> In that case, if SecPageTables in /proc/meminfo is found to be
> suspiciously high, it should be easy to tell which component is
> contributing most usage through vmstat. I understand that users can do
> the subtraction, but we wouldn't want userspace depending on that, in
> case a third class of "secondary" page tables emerges that we want to
> add to SecPageTables. The in-kernel implementation can do the
> subtraction for now if it makes sense though.

Hi Yosry,

Yes, another counter for KVM could be added. On the other hand, the
KVM-only portion can be computed by subtracting one counter from the
other, as there are only two types of secondary page tables, KVM and
IOMMU:

/sys/devices/system/node/node0/meminfo
Node 0 SecPageTables:   422204 kB

/sys/devices/system/node/nodeN/vmstat
nr_iommu_pages 105555

KVM only = SecPageTables - nr_iommu_pages * PAGE_SIZE / 1024

Pasha

> > [...]
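Expressed as a quick shell computation (a sketch, assuming a kernel with
this series applied so that nr_iommu_pages is present in the per-node
vmstat; paths follow the node0 example above):

# Sketch: KVM-only secondary page table memory on node 0, in kB.
secpt_kb=$(awk '/SecPageTables/ {print $4}' /sys/devices/system/node/node0/meminfo)
iommu_pages=$(awk '/^nr_iommu_pages/ {print $2}' /sys/devices/system/node/node0/vmstat)
page_kb=$(( $(getconf PAGE_SIZE) / 1024 ))
echo "KVM only: $(( secpt_kb - iommu_pages * page_kb )) kB"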
On Tue, Nov 28, 2023 at 5:34 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> > Convert iommu/dma-iommu.c to use the new page allocation functions
> > provided in iommu-pages.h.
>
> These have nothing to do with IOMMU pagetables, they are DMA buffers and
> they belong to whoever called the corresponding dma_alloc_* function.

Hi Robin,

This is true; however, we want to account and observe the pages
allocated by the IOMMU subsystem for DMA buffers, as they are essentially
unmovable locked pages. Should we separate IOMMU memory from KVM memory
altogether and add another field to /proc/meminfo, something like
"iommu -> iommu pagetable and dma memory", or do we want to export DMA
memory separately from IOMMU page tables?

Since I included DMA memory, I specifically removed mention of IOMMU
page tables in most places, and only report it as IOMMU memory. However,
since it is still bundled together with SecPageTables, it can be
confusing.

Pasha
On 2023-11-28 10:50 pm, Pasha Tatashin wrote:
> On Tue, Nov 28, 2023 at 5:34 PM Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
>>> Convert iommu/dma-iommu.c to use the new page allocation functions
>>> provided in iommu-pages.h.
>>
>> These have nothing to do with IOMMU pagetables, they are DMA buffers and
>> they belong to whoever called the corresponding dma_alloc_* function.
>
> Hi Robin,
>
> This is true; however, we want to account and observe the pages
> allocated by the IOMMU subsystem for DMA buffers, as they are essentially
> unmovable locked pages. Should we separate IOMMU memory from KVM
> memory altogether and add another field to /proc/meminfo, something
> like "iommu -> iommu pagetable and dma memory", or do we want to
> export DMA memory separately from IOMMU page tables?

These are not allocated by "the IOMMU subsystem", they are allocated by
the DMA API. Even if you want to claim that a driver pinning memory via
iommu_dma_ops is somehow different from the same driver pinning the same
amount of memory via dma-direct when iommu.passthrough=1, it's still
nonsense because you're failing to account the pages which iommu_dma_ops
gets from CMA, dma_common_alloc_pages(), dynamic SWIOTLB, the various
pools, and so on.

Thanks,
Robin.

> Since I included DMA memory, I specifically removed mention of IOMMU
> page tables in most places, and only report it as IOMMU memory.
> However, since it is still bundled together with SecPageTables, it can
> be confusing.
>
> Pasha
On Tue, Nov 28, 2023 at 5:53 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> > Convert iommu/fsl_pamu.c to use the new page allocation functions
> > provided in iommu-pages.h.
>
> Again, this is not a pagetable. This thing doesn't even *have* pagetables.
>
> Similar to patches #1 and #2 where you're lumping in configuration
> tables which belong to the IOMMU driver itself, as opposed to pagetables
> which effectively belong to an IOMMU domain's user. But then there are
> still drivers where you're *not* accounting similar configuration
> structures, so I really struggle to see how this metric is useful when
> it's so completely inconsistent in what it's counting :/

The whole IOMMU subsystem allocates a significant amount of kernel
locked memory that we want to at least observe. The new field in vmstat
does just that: it reports ALL buddy allocator memory that the IOMMU
allocates. However, for accounting purposes, I agree, we need to do
better, and separate at least iommu pagetables from the rest.

We can separate the metric into two:
  iommu pagetable only
  iommu everything

or into three:
  iommu pagetable only
  iommu dma
  iommu everything

What do you think?

Pasha

> Thanks,
> Robin.
>
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  drivers/iommu/fsl_pamu.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/fsl_pamu.c b/drivers/iommu/fsl_pamu.c
> > index f37d3b044131..7bfb49940f0c 100644
> > --- a/drivers/iommu/fsl_pamu.c
> > +++ b/drivers/iommu/fsl_pamu.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/platform_device.h>
> >
> >  #include <asm/mpc85xx.h>
> > +#include "iommu-pages.h"
> >
> >  /* define indexes for each operation mapping scenario */
> >  #define OMI_QMAN        0x00
> > @@ -828,7 +829,7 @@ static int fsl_pamu_probe(struct platform_device *pdev)
> >                 (PAGE_SIZE << get_order(OMT_SIZE));
> >         order = get_order(mem_size);
> >
> > -       p = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
> > +       p = __iommu_alloc_pages(GFP_KERNEL, order);
> >         if (!p) {
> >                 dev_err(dev, "unable to allocate PAACT/SPAACT/OMT block\n");
> >                 ret = -ENOMEM;
> > @@ -916,7 +917,7 @@ static int fsl_pamu_probe(struct platform_device *pdev)
> >         iounmap(guts_regs);
> >
> >         if (ppaact)
> > -               free_pages((unsigned long)ppaact, order);
> > +               iommu_free_pages(ppaact, order);
> >
> >         ppaact = NULL;
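The __iommu_alloc_pages()/iommu_free_pages() wrappers used in the diff
above come from patch 1 of the series; the idea is a thin layer over the
buddy allocator that bumps a per-node counter. A minimal sketch of that
idea, assuming the series' NR_IOMMU_PAGES node stat (names and details
are illustrative, not the exact code from iommu-pages.h):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmstat.h>

/* Illustrative only: allocate zeroed pages for IOMMU use and record
 * them in a (series-provided, hypothetical here) NR_IOMMU_PAGES
 * per-node counter, so vmstat can report them.
 */
static inline struct page *__iommu_alloc_pages(gfp_t gfp, int order)
{
	struct page *page = alloc_pages(gfp | __GFP_ZERO, order);

	if (!page)
		return NULL;
	mod_node_page_state(page_pgdat(page), NR_IOMMU_PAGES, 1 << order);
	return page;
}

/* Counterpart: undo the accounting before returning pages to the buddy
 * allocator.
 */
static inline void iommu_free_pages(void *virt, int order)
{
	if (!virt)
		return;
	mod_node_page_state(page_pgdat(virt_to_page(virt)), NR_IOMMU_PAGES,
			    -(1 << order));
	free_pages((unsigned long)virt, order);
}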
On Tue, Nov 28, 2023 at 2:32 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Tue, Nov 28, 2023 at 4:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Nov 28, 2023 at 12:49 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > [...]
> > >
> > > Per-node IOMMU-only observability:
> > > /sys/devices/system/node/nodeN/vmstat
> > > nr_iommu_pages 105555
> > >
> > > Contains the number of pages the IOMMU allocated in the given node.
> >
> > Does it make sense to have a KVM-only entry there as well?
> >
> > In that case, if SecPageTables in /proc/meminfo is found to be
> > suspiciously high, it should be easy to tell which component is
> > contributing most usage through vmstat. I understand that users can do
> > the subtraction, but we wouldn't want userspace depending on that, in
> > case a third class of "secondary" page tables emerges that we want to
> > add to SecPageTables. The in-kernel implementation can do the
> > subtraction for now if it makes sense though.
>
> Hi Yosry,
>
> Yes, another counter for KVM could be added. On the other hand, the
> KVM-only portion can be computed by subtracting one counter from the
> other, as there are only two types of secondary page tables, KVM and
> IOMMU:
>
> /sys/devices/system/node/node0/meminfo
> Node 0 SecPageTables:   422204 kB
>
> /sys/devices/system/node/nodeN/vmstat
> nr_iommu_pages 105555
>
> KVM only = SecPageTables - nr_iommu_pages * PAGE_SIZE / 1024

Right, but as I mention above, if userspace starts depending on this
equation, we won't be able to add any more classes of "secondary" page
tables to SecPageTables. I'd like to avoid that if possible. We can do
the subtraction in the kernel.
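An in-kernel subtraction could amount to deriving the KVM figure at
report time. A hypothetical sketch against fs/proc/meminfo.c (the
"KVMPageTables" entry and the NR_IOMMU_PAGES counter are illustrative,
not merged ABI):

/* Hypothetical addition inside meminfo_proc_show(): */
unsigned long secpt = global_node_page_state(NR_SECONDARY_PAGETABLE);
unsigned long iommu = global_node_page_state(NR_IOMMU_PAGES);

show_val_kb(m, "SecPageTables:  ", secpt);
/* KVM-only = all secondary page table pages minus the IOMMU's share. */
show_val_kb(m, "KVMPageTables:  ", secpt - iommu);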
On Tue, Nov 28, 2023 at 5:59 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2023-11-28 10:50 pm, Pasha Tatashin wrote:
> > This is true; however, we want to account and observe the pages
> > allocated by the IOMMU subsystem for DMA buffers, as they are
> > essentially unmovable locked pages. Should we separate IOMMU memory
> > from KVM memory altogether and add another field to /proc/meminfo,
> > something like "iommu -> iommu pagetable and dma memory", or do we
> > want to export DMA memory separately from IOMMU page tables?
>
> These are not allocated by "the IOMMU subsystem", they are allocated by
> the DMA API. Even if you want to claim that a driver pinning memory via
> iommu_dma_ops is somehow different from the same driver pinning the same
> amount of memory via dma-direct when iommu.passthrough=1, it's still
> nonsense because you're failing to account the pages which iommu_dma_ops
> gets from CMA, dma_common_alloc_pages(), dynamic SWIOTLB, the various
> pools, and so on.

I see: the IOMMU variants are used only for discontiguous allocations,
and the common ones are defined outside of drivers/iommu. Alright, I can
remove the changes for all non-page-table-related IOMMU allocations.

Pasha
On Tue, Nov 28, 2023 at 06:00:13PM -0500, Pasha Tatashin wrote:
> On Tue, Nov 28, 2023 at 5:53 PM Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > [...]
> >
> > Similar to patches #1 and #2 where you're lumping in configuration
> > tables which belong to the IOMMU driver itself, as opposed to pagetables
> > which effectively belong to an IOMMU domain's user. But then there are
> > still drivers where you're *not* accounting similar configuration
> > structures, so I really struggle to see how this metric is useful when
> > it's so completely inconsistent in what it's counting :/
>
> The whole IOMMU subsystem allocates a significant amount of kernel
> locked memory that we want to at least observe. The new field in vmstat
> does just that: it reports ALL buddy allocator memory that the IOMMU
> allocates. However, for accounting purposes, I agree, we need to do
> better, and separate at least iommu pagetables from the rest.
>
> We can separate the metric into two:
>   iommu pagetable only
>   iommu everything
>
> or into three:
>   iommu pagetable only
>   iommu dma
>   iommu everything
>
> What do you think?

I think I said this at LPC - if you want to have fine-grained accounting
of memory by owner you need to go talk to the cgroup people and come up
with something generic. Adding ever-finer open-coded category breakdowns
just for iommu doesn't make a lot of sense.

You can make some argument that the pagetable memory should be counted
because kvm counts its shadow memory, but I wouldn't go into further
detail than that with hand-coded counters.

Jason
On Tue, Nov 28, 2023 at 03:03:30PM -0800, Yosry Ahmed wrote:
> > Yes, another counter for KVM could be added. On the other hand, the
> > KVM-only portion can be computed by subtracting one counter from the
> > other, as there are only two types of secondary page tables, KVM and
> > IOMMU:
> >
> > /sys/devices/system/node/node0/meminfo
> > Node 0 SecPageTables:   422204 kB
> >
> > /sys/devices/system/node/nodeN/vmstat
> > nr_iommu_pages 105555
> >
> > KVM only = SecPageTables - nr_iommu_pages * PAGE_SIZE / 1024
>
> Right, but as I mention above, if userspace starts depending on this
> equation, we won't be able to add any more classes of "secondary" page
> tables to SecPageTables. I'd like to avoid that if possible. We can do
> the subtraction in the kernel.

What Sean had suggested was that SecPageTables was always intended to
account all the non-primary-mmu memory used by page tables. If this is
the case we shouldn't be trying to break it apart into finer counters.
These are big-picture counters, not detailed allocation-by-owner
counters.

Jason
On Tue, Nov 28, 2023 at 08:49:31PM +0000, Pasha Tatashin wrote:
> Convert iommu/iommufd/* files to use the new page allocation functions
> provided in iommu-pages.h.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  drivers/iommu/iommufd/iova_bitmap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

This is a short-term allocation; it should not be counted, which is why
it is already not using GFP_KERNEL_ACCOUNT.

Jason
On Tue, Nov 28, 2023 at 3:52 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Nov 28, 2023 at 03:03:30PM -0800, Yosry Ahmed wrote:
> >
> > [...]
> >
> > Right, but as I mention above, if userspace starts depending on this
> > equation, we won't be able to add any more classes of "secondary" page
> > tables to SecPageTables. I'd like to avoid that if possible. We can do
> > the subtraction in the kernel.
>
> What Sean had suggested was that SecPageTables was always intended to
> account all the non-primary-mmu memory used by page tables. If this is
> the case we shouldn't be trying to break it apart into finer counters.
> These are big-picture counters, not detailed allocation-by-owner
> counters.

Right, I agree with that, but if SecPageTables includes page tables
from multiple sources, and it is observed to be suspiciously high, the
logical next step is to try to find the culprit, right?
On Tue, Nov 28, 2023 at 04:25:03PM -0800, Yosry Ahmed wrote:
> > > Right, but as I mention above, if userspace starts depending on this
> > > equation, we won't be able to add any more classes of "secondary" page
> > > tables to SecPageTables. I'd like to avoid that if possible. We can do
> > > the subtraction in the kernel.
> >
> > What Sean had suggested was that SecPageTables was always intended to
> > account all the non-primary-mmu memory used by page tables. If this is
> > the case we shouldn't be trying to break it apart into finer counters.
> > These are big-picture counters, not detailed allocation-by-owner
> > counters.
>
> Right, I agree with that, but if SecPageTables includes page tables
> from multiple sources, and it is observed to be suspiciously high, the
> logical next step is to try to find the culprit, right?

You can make that case already: if it is high, wouldn't you want to
find the exact VMM process that was making it high?

It is a sign of fire, not a detailed debug tool.

Jason
On Tue, Nov 28, 2023 at 4:28 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Nov 28, 2023 at 04:25:03PM -0800, Yosry Ahmed wrote:
> >
> > [...]
> >
> > Right, I agree with that, but if SecPageTables includes page tables
> > from multiple sources, and it is observed to be suspiciously high, the
> > logical next step is to try to find the culprit, right?
>
> You can make that case already: if it is high, wouldn't you want to
> find the exact VMM process that was making it high?
>
> It is a sign of fire, not a detailed debug tool.

Fair enough. We can always add separate counters later if needed,
potentially under KVM stats to get more fine-grained details as you
mentioned.

I am only worried about users subtracting the iommu-only counter to
get a KVM counter. We should at least document that SecPageTables may
be expanded to include other sources later, to avoid that.
On Tue, Nov 28, 2023 at 04:30:27PM -0800, Yosry Ahmed wrote:
> On Tue, Nov 28, 2023 at 4:28 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > [...]
> >
> > It is a sign of fire, not a detailed debug tool.
>
> Fair enough. We can always add separate counters later if needed,
> potentially under KVM stats to get more fine-grained details as you
> mentioned.
>
> I am only worried about users subtracting the iommu-only counter to
> get a KVM counter. We should at least document that SecPageTables may
> be expanded to include other sources later, to avoid that.

Well, we just broke it already; anyone thinking it was only kvm
counters is going to be sad now :)

As I understand it, it was already described to be more general than
kvm, so probably nothing to do really.

Jason
On 28/11/2023 11:50 pm, Jason Gunthorpe wrote:
> On Tue, Nov 28, 2023 at 06:00:13PM -0500, Pasha Tatashin wrote:
>>
>> [...]
>>
>> We can separate the metric into two:
>>   iommu pagetable only
>>   iommu everything
>>
>> or into three:
>>   iommu pagetable only
>>   iommu dma
>>   iommu everything
>>
>> What do you think?
>
> I think I said this at LPC - if you want to have fine-grained accounting
> of memory by owner you need to go talk to the cgroup people and come up
> with something generic. Adding ever-finer open-coded category breakdowns
> just for iommu doesn't make a lot of sense.
>
> You can make some argument that the pagetable memory should be counted
> because kvm counts its shadow memory, but I wouldn't go into further
> detail than that with hand-coded counters.

Right, pagetable memory is interesting since it's something that any
random kernel user can indirectly allocate via iommu_domain_alloc() and
iommu_map(), and some of those users may even be doing so on behalf of
userspace. I have no objection to accounting and potentially applying
limits to *that*.

Beyond that, though, there is nothing special about "the IOMMU
subsystem". The amount of memory an IOMMU driver needs to allocate for
itself in order to function is not of interest beyond curiosity; it just
is what it is. Limiting it would only break the IOMMU, and if a user
thinks it's "too much", the only actionable thing that might help is to
physically remove devices from the system. Similar for DMA buffers: it
might be intriguing to account those, but it's not really an actionable
metric - in the overwhelming majority of cases you can't simply tell a
driver to allocate less than what it needs. And that is of course
assuming we were to account *all* DMA buffers, since whether they happen
to have an IOMMU translation or not is irrelevant (we'd have already
accounted the pagetables as pagetables if so).

I bet "the networking subsystem" also consumes significant memory on the
same kind of big systems where IOMMU pagetables would be of any concern.
I believe some of the "serious" NICs can easily run up hundreds of
megabytes, if not gigabytes, worth of queues, SKB pools, etc. - would
you propose accounting those too?

Thanks,
Robin.
> >> We can separate the metric into two:
> >>   iommu pagetable only
> >>   iommu everything
> >>
> >> or into three:
> >>   iommu pagetable only
> >>   iommu dma
> >>   iommu everything
> >>
> >> What do you think?
> >
> > I think I said this at LPC - if you want to have fine-grained accounting
> > of memory by owner you need to go talk to the cgroup people and come up
> > with something generic. Adding ever-finer open-coded category breakdowns
> > just for iommu doesn't make a lot of sense.
> >
> > You can make some argument that the pagetable memory should be counted
> > because kvm counts its shadow memory, but I wouldn't go into further
> > detail than that with hand-coded counters.
>
> Right, pagetable memory is interesting since it's something that any
> random kernel user can indirectly allocate via iommu_domain_alloc() and
> iommu_map(), and some of those users may even be doing so on behalf of
> userspace. I have no objection to accounting and potentially applying
> limits to *that*.

Yes, in the next version I will separate pagetable-only from the rest,
for the limits.

> Beyond that, though, there is nothing special about "the IOMMU
> subsystem". The amount of memory an IOMMU driver needs to allocate for
> itself in order to function is not of interest beyond curiosity; it just
> is what it is. Limiting it would only break the IOMMU, and if a user

I agree about the amount of memory the IOMMU allocates for itself, but
that should be small; if it is not, we have to at least show where the
memory is used.

> thinks it's "too much", the only actionable thing that might help is to
> physically remove devices from the system. Similar for DMA buffers: it
> might be intriguing to account those, but it's not really an actionable
> metric - in the overwhelming majority of cases you can't simply tell a
> driver to allocate less than what it needs. And that is of course
> assuming we were to account *all* DMA buffers, since whether they happen
> to have an IOMMU translation or not is irrelevant (we'd have already
> accounted the pagetables as pagetables if so).

DMA mappings should be observable (they do not have to be limited). At
the very least, it can help with explaining kernel memory overhead
anomalies on production systems.

> I bet "the networking subsystem" also consumes significant memory on the

It does, and GPU drivers also may consume a significant amount of memory.

> same kind of big systems where IOMMU pagetables would be of any concern.
> I believe some of the "serious" NICs can easily run up hundreds of
> megabytes, if not gigabytes, worth of queues, SKB pools, etc. - would
> you propose accounting those too?

Yes. Any kind of kernel memory that is proportional to the workload
should be accountable. Someone is using those resources compared to
the idling system, and that someone should be charged.

Pasha
On Wed, Nov 29, 2023 at 02:45:03PM -0500, Pasha Tatashin wrote:
> > same kind of big systems where IOMMU pagetables would be of any concern.
> > I believe some of the "serious" NICs can easily run up hundreds of
> > megabytes, if not gigabytes, worth of queues, SKB pools, etc. - would
> > you propose accounting those too?
>
> Yes. Any kind of kernel memory that is proportional to the workload
> should be accountable. Someone is using those resources compared to
> the idling system, and that someone should be charged.

There is a difference between charged and accounted.

You should be running around adding GFP_KERNEL_ACCOUNT, yes. I already
did a bunch of that work. Split that out from this series and send it
to the right maintainers.

Adding a counter for allocations and showing it in procfs is a very
different question. IMHO that should not be done at a micro level; the
threshold to add a new counter should be high.

There is definitely room for a generic debugging feature to break down
GFP_KERNEL_ACCOUNT by ownership somehow. Maybe it can already be done
with BPF. IDK.

Jason
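The charged/accounted distinction above comes down to a gfp flag at the
allocation site. A minimal illustration (hypothetical call site, not a
specific patch from this series):

	/*
	 * GFP_KERNEL_ACCOUNT = GFP_KERNEL | __GFP_ACCOUNT: the pages are
	 * charged to the allocating task's memory cgroup, so a runaway
	 * user hits its own memory.max instead of exhausting the host.
	 */
	int order = 0; /* example: a single page */
	struct page *p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, order);

	/*
	 * A plain vmstat/meminfo counter, by contrast, only *accounts*:
	 * it makes usage observable system-wide but charges no one.
	 */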
On Wed, Nov 29, 2023 at 3:03 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Nov 29, 2023 at 02:45:03PM -0500, Pasha Tatashin wrote:
> >
> > [...]
> >
> > Yes. Any kind of kernel memory that is proportional to the workload
> > should be accountable. Someone is using those resources compared to
> > the idling system, and that someone should be charged.
>
> There is a difference between charged and accounted.
>
> You should be running around adding GFP_KERNEL_ACCOUNT, yes. I already
> did a bunch of that work. Split that out from this series and send it
> to the right maintainers.

I will do that.

> Adding a counter for allocations and showing it in procfs is a very
> different question. IMHO that should not be done at a micro level; the
> threshold to add a new counter should be high.

I agree, /proc/meminfo should not include everything. However, overall
network consumption, including memory allocated by network drivers,
would be useful to have; maybe it should be exported by device drivers
and added to the protocol memory. We already have network protocol
memory consumption in procfs:

# awk '{printf "%-10s %s\n", $1, $4}' /proc/net/protocols | grep -v '\-1'
protocol   memory
UDPv6      22673
TCPv6      16961

> There is definitely room for a generic debugging feature to break down
> GFP_KERNEL_ACCOUNT by ownership somehow. Maybe it can already be done
> with BPF. IDK.
On Tue, Nov 28, 2023 at 6:52 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Nov 28, 2023 at 08:49:31PM +0000, Pasha Tatashin wrote:
> > Convert iommu/iommufd/* files to use the new page allocation functions
> > provided in iommu-pages.h.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  drivers/iommu/iommufd/iova_bitmap.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
>
> This is a short-term allocation; it should not be counted, which is why
> it is already not using GFP_KERNEL_ACCOUNT.

I made this change for completeness. I changed all calls to
get_free_page/alloc_page etc. under drivers/iommu to use the
iommu_alloc_* variants; this also helps future developers in this area
to use the right allocation functions.

The accounting is implemented using cheap per-cpu counters, so it should
not affect performance. I think it is OK to keep them here.
On Wed, Nov 29, 2023 at 04:59:43PM -0500, Pasha Tatashin wrote:
> On Tue, Nov 28, 2023 at 6:52 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > [...]
> >
> > This is a short-term allocation; it should not be counted, which is
> > why it is already not using GFP_KERNEL_ACCOUNT.
>
> I made this change for completeness. I changed all calls to
> get_free_page/alloc_page etc. under drivers/iommu to use the
> iommu_alloc_* variants; this also helps future developers in this area
> to use the right allocation functions.
>
> The accounting is implemented using cheap per-cpu counters, so it should
> not affect performance. I think it is OK to keep them here.

Except it is a misuse of an API that should only be used for page table
memory :(

Jason
From: Pasha Tatashin <tatashin@google.com>

The IOMMU subsystem may contain state that measures in gigabytes. The
majority of that state is IOMMU page tables. Yet, there is currently no
way to observe how much memory is actually used by the IOMMU subsystem.

This patch series solves this problem by adding both observability of
all pages that are allocated by the IOMMU, and also accountability, so
admins can limit the amount via cgroups.

The system-wide observability uses /proc/meminfo:
SecPageTables:    438176 kB

Contains IOMMU and KVM memory.

Per-node observability:
/sys/devices/system/node/nodeN/meminfo
Node N SecPageTables:   422204 kB

Contains IOMMU and KVM memory in the given NUMA node.

Per-node IOMMU-only observability:
/sys/devices/system/node/nodeN/vmstat
nr_iommu_pages 105555

Contains the number of pages the IOMMU allocated in the given node.

Accountability: using the sec_pagetables cgroup-v2 memory.stat entry.

With the change, iova_stress[1] stops as the limit is reached:

# ./iova_stress
iova space:  0T  free memory:  497G
iova space:  1T  free memory:  495G
iova space:  2T  free memory:  493G
iova space:  3T  free memory:  491G

This series incorporates suggestions that came from the discussion
at LPC [2].

[1] https://github.com/soleen/iova_stress
[2] https://lpc.events/event/17/contributions/1466

Pasha Tatashin (16):
  iommu/vt-d: add wrapper functions for page allocations
  iommu/amd: use page allocation function provided by iommu-pages.h
  iommu/io-pgtable-arm: use page allocation function provided by
    iommu-pages.h
  iommu/io-pgtable-dart: use page allocation function provided by
    iommu-pages.h
  iommu/io-pgtable-arm-v7s: use page allocation function provided by
    iommu-pages.h
  iommu/dma: use page allocation function provided by iommu-pages.h
  iommu/exynos: use page allocation function provided by iommu-pages.h
  iommu/fsl: use page allocation function provided by iommu-pages.h
  iommu/iommufd: use page allocation function provided by iommu-pages.h
  iommu/rockchip: use page allocation function provided by iommu-pages.h
  iommu/sun50i: use page allocation function provided by iommu-pages.h
  iommu/tegra-smmu: use page allocation function provided by
    iommu-pages.h
  iommu: observability of the IOMMU allocations
  iommu: account IOMMU allocated memory
  vhost-vdpa: account iommu allocations
  vfio: account iommu allocations

 Documentation/admin-guide/cgroup-v2.rst |   2 +-
 Documentation/filesystems/proc.rst      |   4 +-
 drivers/iommu/amd/amd_iommu.h           |   8 -
 drivers/iommu/amd/init.c                |  91 +++++-----
 drivers/iommu/amd/io_pgtable.c          |  13 +-
 drivers/iommu/amd/io_pgtable_v2.c       |  20 +-
 drivers/iommu/amd/iommu.c               |  13 +-
 drivers/iommu/dma-iommu.c               |   8 +-
 drivers/iommu/exynos-iommu.c            |  14 +-
 drivers/iommu/fsl_pamu.c                |   5 +-
 drivers/iommu/intel/dmar.c              |  10 +-
 drivers/iommu/intel/iommu.c             |  47 ++---
 drivers/iommu/intel/iommu.h             |   2 -
 drivers/iommu/intel/irq_remapping.c     |  10 +-
 drivers/iommu/intel/pasid.c             |  12 +-
 drivers/iommu/intel/svm.c               |   7 +-
 drivers/iommu/io-pgtable-arm-v7s.c      |   9 +-
 drivers/iommu/io-pgtable-arm.c          |   7 +-
 drivers/iommu/io-pgtable-dart.c         |  37 ++--
 drivers/iommu/iommu-pages.h             | 231 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iova_bitmap.c     |   6 +-
 drivers/iommu/rockchip-iommu.c          |  14 +-
 drivers/iommu/sun50i-iommu.c            |   7 +-
 drivers/iommu/tegra-smmu.c              |  18 +-
 drivers/vfio/vfio_iommu_type1.c         |   8 +-
 drivers/vhost/vdpa.c                    |   3 +-
 include/linux/mmzone.h                  |   5 +-
 mm/vmstat.c                             |   3 +
 28 files changed, 415 insertions(+), 199 deletions(-)
 create mode 100644 drivers/iommu/iommu-pages.h
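As a usage sketch of the accountability side the cover letter describes,
a cgroup-v2 flow could look like the following (the cgroup name and the
512M limit are illustrative; sec_pagetables in memory.stat is where the
charged pages appear):

# Put a shell into a fresh cgroup with a memory limit, then run the
# stress tool; its IOMMU page table allocations are charged to the group.
mkdir /sys/fs/cgroup/iova_test
echo 512M > /sys/fs/cgroup/iova_test/memory.max
echo $$ > /sys/fs/cgroup/iova_test/cgroup.procs
./iova_stress
grep sec_pagetables /sys/fs/cgroup/iova_test/memory.stat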