[RFC,v3,1/8] gpu: rfc: Proposal for a GPU cgroup controller

Message ID 20220309165222.2843651-2-tjmercier@google.com
State New
Series Proposal for a GPU cgroup controller

Commit Message

T.J. Mercier March 9, 2022, 4:52 p.m. UTC
From: Hridya Valsaraju <hridya@google.com>

This patch adds a proposal for a new GPU cgroup controller for
accounting/limiting GPU and GPU-related memory allocations.
The proposed controller is based on the DRM cgroup controller[1] and
follows the design of the RDMA cgroup controller.

The new cgroup controller would:
* Allow setting per-cgroup limits on the total size of buffers charged
  to it.
* Allow setting per-device limits on the total size of buffers
  allocated by device within a cgroup.
* Expose a per-device/allocator breakdown of the buffers charged to a
  cgroup.

The prototype in the following patches is only for memory accounting
using the GPU cgroup controller and does not implement limit setting.

[1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/

Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v3 changes
Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz.

Use more common dual author commit message format per John Stultz.
---
 Documentation/gpu/rfc/gpu-cgroup.rst | 183 +++++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst      |   4 +
 2 files changed, 187 insertions(+)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst

Comments

T.J. Mercier March 22, 2022, 3:41 p.m. UTC | #1
On Mon, Mar 21, 2022 at 10:37 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Wed, Mar 09, 2022 at 04:52:11PM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
> > +The new cgroup controller would:
> > +
> > +* Allow setting per-cgroup limits on the total size of buffers charged to it.
>
> What is the meaning of the total? (I have only a very naïve
> understanding of the device buffers.)

So "total" is used twice here in two different contexts.
The first one is the global "GPU" cgroup context. As in any buffer
that any exporter claims is a GPU buffer, regardless of where/how it
is allocated. So this refers to the sum of all gpu buffers of any
type/source. An exporter contributes to this total by registering a
corresponding gpucg_device and making charges against that device when
it exports.
The second one is in a per device context. This allows us to make a
distinction between different types of GPU memory based on who
exported the buffer. A single process can make use of several
different types of dma buffers (for example cached and uncached
versions of the same type of memory), and it would be useful to have
different limits for each. These are distinguished by the device name
string chosen when the gpucg_device is first registered.
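
To make this concrete, here is a rough sketch of how an exporter might
use the proposed API. It is not part of the patch: the heap names and
the my_heap_export() wrapper are hypothetical, and only the gpucg_*
calls come from the proposed interface.

static struct gpucg_device cached_dev;
static struct gpucg_device uncached_dev;

static int __init my_heap_init(void)
{
        /* Each name becomes a key in gpu.memory.current/max. */
        gpucg_register_device(&cached_dev, "myheap-cached");
        gpucg_register_device(&uncached_dev, "myheap-uncached");
        return 0;
}

/* Charge an exported buffer against the exporting process's cgroup. */
static int my_heap_export(struct gpucg *gpucg, bool cached, u64 size)
{
        struct gpucg_device *dev = cached ? &cached_dev : &uncached_dev;

        /* Fails if a limit in gpu.memory.max would be exceeded. */
        return gpucg_try_charge(gpucg, dev, size);
}

After a successful charge, gpu.memory.current in that cgroup would show
separate "myheap-cached" and "myheap-uncached" entries.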

>
> Is it like a) there's global pool of memory that is partitioned among
> individual devices or b) each device has its own specific type of memory
> and adding across two devices is adding apples and oranges or c) there
> can be various devices both of a) and b) type?

So I guess the most correct answer to this question is c.


>
> (Apologies for not replying to previous versions and possibly missing
> anything.)
>
> Thanks,
> Michal
>
Michal Koutný March 23, 2022, 10:40 a.m. UTC | #2
On Tue, Mar 22, 2022 at 08:41:55AM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
> So "total" is used twice here in two different contexts.
> The first one is the global "GPU" cgroup context. As in any buffer
> that any exporter claims is a GPU buffer, regardless of where/how it
> is allocated. So this refers to the sum of all gpu buffers of any
> type/source. An exporter contributes to this total by registering a
> corresponding gpucg_device and making charges against that device when
> it exports.
> The second one is in a per device context. This allows us to make a
> distinction between different types of GPU memory based on who
> exported the buffer. A single process can make use of several
> different types of dma buffers (for example cached and uncached
> versions of the same type of memory), and it would be useful to have
> different limits for each. These are distinguished by the device name
> string chosen when the gpucg_device is first registered.

So is this understanding correct?

(supposing gpu.memory.current had a "total" line analogous to gpu.memory.max)
	$ cat gpu.memory.current
	total T
	dev1  d1
	...
	devN  dn

T = Σ di + RAM_backed_buffers

and that some of the RAM_backed_buffers may also be accounted in
memory.current (case by case, depending on the allocator).

Thanks,
Michal

Patch

diff --git a/Documentation/gpu/rfc/gpu-cgroup.rst b/Documentation/gpu/rfc/gpu-cgroup.rst
new file mode 100644
index 000000000000..5b40d5518a5e
--- /dev/null
+++ b/Documentation/gpu/rfc/gpu-cgroup.rst
@@ -0,0 +1,183 @@ 
+===================================
+GPU cgroup controller
+===================================
+
+Goals
+=====
+This document outlines a plan to create a cgroup v2 controller subsystem for
+the per-cgroup accounting of device and system memory allocated by the GPU
+and related subsystems.
+
+The new cgroup controller would:
+
+* Allow setting per-cgroup limits on the total size of buffers charged to it.
+
+* Allow setting per-device limits on the total size of buffers allocated by a
+  device/allocator within a cgroup.
+
+* Expose a per-device/allocator breakdown of the buffers charged to a cgroup.
+
+Alternatives Considered
+=======================
+
+The following alternatives were considered:
+
+The memory cgroup controller
+____________________________
+
+1. As was noted in [1], memory accounting provided by the GPU cgroup
+controller is not a good fit for integration into memcg due to the
+differences in how accounting is performed. The GPU cgroup controller
+implements allocator attribution of GPU and GPU-related memory by
+charging each buffer to the cgroup of the process on behalf of which
+the memory was allocated. The buffer stays charged to the cgroup until
+it is freed regardless of whether the process retains any references
+to it. On the other hand, the memory cgroup controller offers a more
+fine-grained charging and uncharging behavior depending on the kind of
+page being accounted.
+
+2. Memcg performs accounting in units of pages. In the DMA-BUF buffer sharing model,
+a process takes a reference to the entire buffer (hence keeping it alive) even if
+it is only accessing parts of it. Therefore, per-page memory tracking for DMA-BUF
+memory accounting would only introduce additional overhead without any benefits.
+
+[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
+
+Userspace service to keep track of buffer allocations and releases
+__________________________________________________________________
+
+1. There is no way for a userspace service to intercept all allocations and releases.
+2. If the service gets killed or restarted, all accounting so far is lost.
+
+UAPI
+====
+When enabled, the new cgroup controller would create the following files in every cgroup.
+
+::
+
+        gpu.memory.current (R)
+        gpu.memory.max (R/W)
+
+gpu.memory.current is a read-only file and would contain per-device memory allocations
+in a key-value format where the key is a string representing the device name
+and the value is the size of memory charged to the device in the cgroup in bytes.
+
+For example:
+
+::
+
+        cat /sys/kernel/fs/cgroup1/gpu.memory.current
+        dev1 4194304
+        dev2 4194304
+
+The string key for each device is set by the device driver when the device registers
+with the GPU cgroup controller to participate in resource accounting (see section
+'Design and Implementation' for more details).
+
+gpu.memory.max is a read/write file. It would show the current limit on total
+memory usage for the cgroup, as well as the limits on total memory usage for
+each allocator/device.
+
+Setting a total limit for a cgroup can be done as follows:
+
+::
+
+        echo "total 41943040" > /sys/kernel/fs/cgroup1/gpu.memory.max
+
+Setting a total limit for a particular device/allocator can be done as follows:
+
+::
+
+        echo "dev1 4194304" > /sys/kernel/fs/cgroup1/gpu.memory.max
+
+In this example, 'dev1' is the string key set by the device driver during
+registration.
+
+Design and Implementation
+=========================
+
+The cgroup controller would closely follow the design of the RDMA cgroup controller
+subsystem, where each cgroup maintains a list of resource pools.
+Each resource pool is tied to a gpucg_device and contains a counter that tracks
+the device's current total memory usage and the maximum limit set for the device.
+
+The code block below is a preliminary sketch of how the core kernel data structures
+and APIs might look.
+
+.. code-block:: c
+
+        /**
+         * The GPU cgroup controller data structure.
+         */
+        struct gpucg {
+                struct cgroup_subsys_state css;
+
+                /* list of all resource pools that belong to this cgroup */
+                struct list_head rpools;
+        };
+
+        struct gpucg_device {
+                /*
+                 * list of resource pools in the various cgroups that the device is
+                 * part of.
+                 */
+                struct list_head rpools;
+
+                /* list of all devices registered for GPU cgroup accounting */
+                struct list_head dev_node;
+
+                /* name to be used as identifier for accounting and limit setting */
+                const char *name;
+        };
+
+        struct gpucg_resource_pool {
+                /* The device whose resource usage is tracked by this resource pool */
+                struct gpucg_device *device;
+
+                /* list of all resource pools for the cgroup */
+                struct list_head cg_node;
+
+                /*
+                 * list maintained by the gpucg_device to keep track of its
+                 * resource pools
+                 */
+                struct list_head dev_node;
+
+                /* tracks memory usage of the resource pool */
+                struct page_counter total;
+        };
+
+        /**
+         * gpucg_register_device - Registers a device for memory accounting using the
+         * GPU cgroup controller.
+         *
+         * @gpucg_dev: The device to register for memory accounting. Must
+         * remain valid after registration.
+         * @name: Pointer to a string literal denoting the name of the device.
+         */
+        void gpucg_register_device(struct gpucg_device *gpucg_dev, const char *name);
+
+        /**
+         * gpucg_try_charge - charge memory to the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to charge the memory to.
+         * @device: The device to charge the memory to.
+         * @usage: size of memory to charge in bytes.
+         *
+         * Return: 0 if the charge is successful, otherwise an error
+         * code.
+         */
+        int gpucg_try_charge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+        /**
+         * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to uncharge the memory from.
+         * @device: The device to uncharge the memory from.
+         * @usage: size of memory to uncharge in bytes.
+         */
+        void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+Future Work
+===========
+Additional GPU resources can be supported by adding new controller files.
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..0a9bcd94e95d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@  host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    gpu-cgroup.rst