mbox series

[RFC,00/18] Rust DRM subsystem abstractions (& preview AGX driver)

Message ID 20230307-rust-drm-v1-0-917ff5bc80a8@asahilina.net
Headers show
Series Rust DRM subsystem abstractions (& preview AGX driver) | expand

Message

Asahi Lina March 7, 2023, 2:25 p.m. UTC
Hi everyone!

This is my first take on the Rust abstractions for the DRM
subsystem. It includes the abstractions themselves, some minor
prerequisite changes to the C side, as well as the drm-asahi GPU driver
(for reference on how the abstractions are used, but not necessarily
intended to land together).

These patches apply on top of the tree at [1], which is based on
6.3-rc1 with a large number of Rust abstraction/support commits added on
top. Most of these are not prerequisites for the DRM abstractions
themselves, but rather only of the driver.

* #1-12 introduce the abstractions, module by module, with minor C
  changes before the dependent abstraction.
  * Patch 10 is a little addition to drm_sched that I ended up needing,
    but I can pull it out of the abstraction into its own patch if
    needed.
* #13-14 add a minor feature to drm/gem and its abstraction used
  by the driver.
* #15-16 introduce the (unstable) asahi UAPI. This is obviously not
  ready for merge yet, but comments are welcome!
* #17 adds a Rust helper macro to handle GPU core/firmware differences.
  This probably belongs in the driver at this point, but right now it
  has to live in rust/macros since there is no mechanism for per-driver
  proc macros.
* #18 adds the driver proper, in one big commit, for reference purposes.

I've been working since mid last year on an Apple AGX GPU driver for
Linux, using the (at the time) out-of-tree Rust support. As part of this
effort, I've been writing safe Rust abstractions for portions of the DRM
subsystem.

Now that Rust itself is upstream, I'd like to get all the abstractions
upstreamed so we can eventually get the driver upstreamed!

These abstractions have been used by the driver since our release in
December [2], in a simpler synchronous-submission form:

* drm::ioctl
* drm::device
* drm::drv
* drm::file
* drm::{gem, gem::shmem}
* drm::mm

This series adds these too, which are used by the explicit sync refactor
of the driver (the version in this series):

* drm::syncobj
* drm::sched
* dma_fence

The major dependencies for the DRM abstractions themselves are:

* [3] rust: error: Add missing wrappers to convert to/from kernel error codes
* [4] rust: Miscellaneous macro improvements
* [5] rust: Add a Sealed trait
* [6] rust: device: Add a minimal RawDevice trait
* [7] rust: Enable the new_uninit feature for kernel and driver crates
* [8] rust: ioctl: Add ioctl number manipulation functions
* [9] rust: sync: Arc: Any downcasting and assume_init()
*     rust: Add `container_of` and `offset_of` macros
*     kernel::sync::mutex and dependencies

Most of these (the ones with links) have already been submitted, and I
expect all of them to land for 6.4 (the mutex one will likely be last,
since there is some refactoring that will happen over the current state
to make it more ergonomic to use). The mutex dep is only necessary for
drm::mm and dma_fence, and transitively drm::syncobj and drm::sched.

Things work! We've had most of the abstractions in production edge
kernels with the driver, and the new explicit sync stuff has passed
quite a few torture tests (this is how we found the drm_sched issue,
patch 11).

The abstractions are intended to be safe (safety review very welcome!).
While writing them, I tried to avoid making any changes to the C side
unless absolutely necessary. I understand that it will probably make
sense to adjust the C side to make some things easier, but I wanted to
start from this as a baseline.

Known issues:

- The existing Rust integration does not currently allow building
  abstractions as modules, so the Rust abstractions are only available
  for DRM components that are built in. I added some extra Kconfig
  symbols to deal with this, so a driver built as a module can depende
  on having those built in. This should go away in the future (but may
  not be ready in time for submission... I understand this probably
  shouldn't be a blocker though?).

- DRM relies heavily on the "subclassing" pattern for driver objects,
  and this doesn't map well to Rust. I tried several approaches for
  various bits, so we can see how they work out. In particular, whether
  wrapper types should pretend to be smart pointers and Deref to their
  inner driver-specific types, and whether they should be marked as
  method receivers (Yuck, internal rustc implementation hacks! But
  Arc<T> already does the same thing and it makes usage in
  driver-implemented callbacks as `self` possible) are things I'd love
  to discuss ^^.

- Only what I need for my driver is implemented (plus a small amount of
  obvious extras where better API completeness makes sense). I think the
  general idea with Rust abstractions is that we add things as they
  become necessary.

- The plain GEM vs. GEM-shmem duality ended up with quite a hairy type
  hierarchy. I'd love to figure out how to make this simpler...

- drm::mm ends up requiring a built-in mutex in the abstraction, instead
  of delegating that to the user with the usual Rust mutability rules.
  This is because nodes can be dropped at any time, and those operations
  need to be synchronized. We could try to avoid forbidding those drops
  or mark the node type !Send, but that would make it a lot less
  ergonomic to use...

I'm looking for feedback on the abstractions of all kinds, so we can
move towards an upstreamable version. Optimistically, I'd love to get
this upstream for 6.5, and the driver for 6.6.

Please feel free to ask any questions about the Rust bits, since I know
a lot of this is new to many of the C folks!

This is a fairly complete driver for Apple AGX G13 and G14 series GPUs.

The driver today supports the Apple M1, M1 Pro, M1 Max, M1 Ultra, and M2
SoCs, across two firmware revisions each. It has an explicit sync UAPI
heavily inspired by the upcoming Intel Xe UAPI, designed with Vulkan
support in mind. On the Mesa side we currently have a Gallium driver
that is mostly already upstream (missing the UAPI bits mostly) and
passes the dEQP GLES2/EGL tests, with most of GLES3.0 passing in
downstream work-in-progress branches. This is a reverse engineered
community driver (we have no hardware documentation of any kind, other
than some hints from aspects shared with PowerVR).

While developing the driver, I tried to make use of Rust's safety and
lifetime features to provide not just CPU-side safety, but also
partial firmware-ABI safety. Thanks to this, it has turned out to be
a very stable driver even though GPU firmware crashes are fatal (no
restart capability, need to reboot!) and the FW/driver interface is a
huge mess of unsafe shared memory structures with complex pointer
chains. There are over 70 ABI types and 3000+ lines of firmware ABI type
definitions that vary between firmware builds and GPU cores...

In a simpler blocking-submission form, it has been shipping in Asahi
Linux edge kernels since December [2], with lots of users and zero (!)
reported oopses (and only a couple reports of GPU firmware crashes,
though that issue should now be fixed). It has survived OOM scenarios
(Rust makes error cleanup easy!), UAPI-level fuzzing, countless broken
Mesa builds, uptimes of 40+ days, and more.

The explicit sync refactor significantly increases performance (and
potential problems), but this version has survived a lot of torture
with dEQP/piglit tests and some manual corner case testing.

In other words, Rust works! ^^

There are some design notes on the driver and further links at [10].

[1] https://github.com/AsahiLinux/linux.git drm-rfc-base-20230307
[2] https://asahilinux.org/2022/12/gpu-drivers-now-in-asahi-linux/
[3] https://lore.kernel.org/rust-for-linux/20230224-rust-error-v1-0-f8f9a9a87303@asahilina.net/T/
[4] https://lore.kernel.org/rust-for-linux/20230224-rust-macros-v1-0-b39fae46e102@asahilina.net/T/
[5] https://lore.kernel.org/rust-for-linux/20230224-rust-iopt-rtkit-v1-0-49ced3391295@asahilina.net/T/#m515bad2cff7f5a46f55897e6b73c6c2f1fb2c638
[6] https://lore.kernel.org/rust-for-linux/20230224-rust-iopt-rtkit-v1-0-49ced3391295@asahilina.net/T/#m4c64e390c43b3ff1b8470fc8b37eaf87f6e12c94
[7] https://lore.kernel.org/rust-for-linux/CQV7ZNT6LMXI.1XG4YXSH8I7JK@vincent-arch/T/
[8] https://lore.kernel.org/rust-for-linux/61f734d6-1497-755f-3632-3f261b890846@asahilina.net/T/
[9] https://lore.kernel.org/rust-for-linux/20230224-rust-arc-v1-0-568eea613a41@asahilina.net/T/
[10] https://github.com/AsahiLinux/docs/wiki/SW:AGX-driver-notes

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
Asahi Lina (18):
      rust: drm: ioctl: Add DRM ioctl abstraction
      rust: drm: Add Device and Driver abstractions
      rust: drm: file: Add File abstraction
      rust: drm: gem: Add GEM object abstraction
      drm/gem-shmem: Export VM ops functions
      rust: drm: gem: shmem: Add DRM shmem helper abstraction
      rust: drm: mm: Add DRM MM Range Allocator abstraction
      rust: dma_fence: Add DMA Fence abstraction
      rust: drm: syncobj: Add DRM Sync Object abstraction
      drm/scheduler: Add can_run_job callback
      drm/scheduler: Clean up jobs when the scheduler is torn down
      rust: drm: sched: Add GPU scheduler abstraction
      drm/gem: Add a flag to control whether objects can be exported
      rust: drm: gem: Add set_exportable() method
      drm/asahi: Add the Asahi driver UAPI [DO NOT MERGE]
      rust: bindings: Bind the Asahi DRM UAPI
      rust: macros: Add versions macro
      drm/asahi: Add the Asahi driver for Apple AGX GPUs
 drivers/gpu/drm/Kconfig                |   19 +
 drivers/gpu/drm/Makefile               |    1 +
 drivers/gpu/drm/asahi/Kconfig          |   35 +
 drivers/gpu/drm/asahi/Makefile         |    3 +
 drivers/gpu/drm/asahi/alloc.rs         | 1046 ++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/asahi.rs         |   53 ++
 drivers/gpu/drm/asahi/buffer.rs        |  694 ++++++++++++++++++
 drivers/gpu/drm/asahi/channel.rs       |  542 ++++++++++++++
 drivers/gpu/drm/asahi/debug.rs         |  129 ++++
 drivers/gpu/drm/asahi/driver.rs        |  166 +++++
 drivers/gpu/drm/asahi/event.rs         |  229 ++++++
 drivers/gpu/drm/asahi/file.rs          |  718 ++++++++++++++++++
 drivers/gpu/drm/asahi/float.rs         |  381 ++++++++++
 drivers/gpu/drm/asahi/fw/buffer.rs     |  170 +++++
 drivers/gpu/drm/asahi/fw/channels.rs   |  385 ++++++++++
 drivers/gpu/drm/asahi/fw/compute.rs    |  107 +++
 drivers/gpu/drm/asahi/fw/event.rs      |  100 +++
 drivers/gpu/drm/asahi/fw/fragment.rs   |  276 +++++++
 drivers/gpu/drm/asahi/fw/initdata.rs   | 1264 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/fw/job.rs        |   56 ++
 drivers/gpu/drm/asahi/fw/microseq.rs   |  384 ++++++++++
 drivers/gpu/drm/asahi/fw/mod.rs        |   15 +
 drivers/gpu/drm/asahi/fw/types.rs      |  233 ++++++
 drivers/gpu/drm/asahi/fw/vertex.rs     |  177 +++++
 drivers/gpu/drm/asahi/fw/workqueue.rs  |  168 +++++
 drivers/gpu/drm/asahi/gem.rs           |  301 ++++++++
 drivers/gpu/drm/asahi/gpu.rs           | 1088 +++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/hw/mod.rs        |  522 +++++++++++++
 drivers/gpu/drm/asahi/hw/t600x.rs      |  140 ++++
 drivers/gpu/drm/asahi/hw/t8103.rs      |   80 ++
 drivers/gpu/drm/asahi/hw/t8112.rs      |   82 +++
 drivers/gpu/drm/asahi/initdata.rs      |  777 ++++++++++++++++++++
 drivers/gpu/drm/asahi/mem.rs           |  133 ++++
 drivers/gpu/drm/asahi/microseq.rs      |   61 ++
 drivers/gpu/drm/asahi/mmu.rs           | 1249 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/object.rs        |  704 ++++++++++++++++++
 drivers/gpu/drm/asahi/place.rs         |  343 +++++++++
 drivers/gpu/drm/asahi/queue/common.rs  |   52 ++
 drivers/gpu/drm/asahi/queue/compute.rs |  371 ++++++++++
 drivers/gpu/drm/asahi/queue/mod.rs     |  725 ++++++++++++++++++
 drivers/gpu/drm/asahi/queue/render.rs  | 1173 +++++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/regs.rs          |  387 ++++++++++
 drivers/gpu/drm/asahi/slotalloc.rs     |  292 ++++++++
 drivers/gpu/drm/asahi/util.rs          |   44 ++
 drivers/gpu/drm/asahi/workqueue.rs     |  880 ++++++++++++++++++++++
 drivers/gpu/drm/drm_gem.c              |    1 +
 drivers/gpu/drm/drm_gem_shmem_helper.c |    9 +-
 drivers/gpu/drm/drm_prime.c            |    5 +
 drivers/gpu/drm/scheduler/sched_main.c |   37 +-
 include/drm/drm_gem.h                  |    8 +
 include/drm/drm_gem_shmem_helper.h     |    3 +
 include/drm/gpu_scheduler.h            |    8 +
 include/uapi/drm/asahi_drm.h           |  556 ++++++++++++++
 rust/bindings/bindings_helper.h        |   14 +
 rust/helpers.c                         |  168 +++++
 rust/kernel/dma_fence.rs               |  532 ++++++++++++++
 rust/kernel/drm/device.rs              |   76 ++
 rust/kernel/drm/drv.rs                 |  342 +++++++++
 rust/kernel/drm/file.rs                |  113 +++
 rust/kernel/drm/gem/mod.rs             |  384 ++++++++++
 rust/kernel/drm/gem/shmem.rs           |  381 ++++++++++
 rust/kernel/drm/ioctl.rs               |  147 ++++
 rust/kernel/drm/mm.rs                  |  309 ++++++++
 rust/kernel/drm/mod.rs                 |   13 +
 rust/kernel/drm/sched.rs               |  358 +++++++++
 rust/kernel/drm/syncobj.rs             |   77 ++
 rust/kernel/lib.rs                     |    4 +
 rust/macros/lib.rs                     |    7 +
 rust/macros/versions.rs                |  267 +++++++
 69 files changed, 20569 insertions(+), 5 deletions(-)
---
base-commit: c9eb15274c9861026682a6b3e645891fccf88e07
change-id: 20230307-rust-drm-b5af3c2a9e55

Thank you,
~~ Lina

Comments

Karol Herbst March 7, 2023, 2:48 p.m. UTC | #1
On Tue, Mar 7, 2023 at 3:27 PM Asahi Lina <lina@asahilina.net> wrote:
>
> DRM drivers need to be able to declare which driver-specific ioctls they
> support. This abstraction adds the required types and a helper macro to
> generate the ioctl definition inside the DRM driver.
>
> Note that this macro is not usable until further bits of the
> abstraction are in place (but it will not fail to compile on its own, if
> not called).
>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>  drivers/gpu/drm/Kconfig         |   7 ++
>  rust/bindings/bindings_helper.h |   2 +
>  rust/kernel/drm/ioctl.rs        | 147 ++++++++++++++++++++++++++++++++++++++++
>  rust/kernel/drm/mod.rs          |   5 ++
>  rust/kernel/lib.rs              |   2 +
>  5 files changed, 163 insertions(+)
>
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index dc0f94f02a82..dab8f0f9aa96 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -27,6 +27,13 @@ menuconfig DRM
>           details.  You should also select and configure AGP
>           (/dev/agpgart) support if it is available for your platform.
>
> +# Rust abstractions cannot be built as modules currently, so force them as
> +# bool by using these intermediate symbols. In the future these could be
> +# tristate once abstractions themselves can be built as modules.
> +config RUST_DRM
> +       bool "Rust support for the DRM subsystem"
> +       depends on DRM=y
> +
>  config DRM_MIPI_DBI
>         tristate
>         depends on DRM
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index 91bb7906ca5a..2687bef1676f 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -6,6 +6,7 @@
>   * Sorted alphabetically.
>   */
>
> +#include <drm/drm_ioctl.h>
>  #include <linux/delay.h>
>  #include <linux/device.h>
>  #include <linux/dma-mapping.h>
> @@ -23,6 +24,7 @@
>  #include <linux/sysctl.h>
>  #include <linux/timekeeping.h>
>  #include <linux/xarray.h>
> +#include <uapi/drm/drm.h>
>

might make more sense to add this chunk to the patch actually needing it

>  /* `bindgen` gets confused at certain things. */
>  const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
> diff --git a/rust/kernel/drm/ioctl.rs b/rust/kernel/drm/ioctl.rs
> new file mode 100644
> index 000000000000..10304efbd5f1
> --- /dev/null
> +++ b/rust/kernel/drm/ioctl.rs
> @@ -0,0 +1,147 @@
> +// SPDX-License-Identifier: GPL-2.0 OR MIT
> +#![allow(non_snake_case)]
> +
> +//! DRM IOCTL definitions.
> +//!
> +//! C header: [`include/linux/drm/drm_ioctl.h`](../../../../include/linux/drm/drm_ioctl.h)
> +
> +use crate::ioctl;
> +
> +const BASE: u32 = bindings::DRM_IOCTL_BASE as u32;
> +
> +/// Construct a DRM ioctl number with no argument.
> +pub const fn IO(nr: u32) -> u32 {
> +    ioctl::_IO(BASE, nr)
> +}
> +
> +/// Construct a DRM ioctl number with a read-only argument.
> +pub const fn IOR<T>(nr: u32) -> u32 {
> +    ioctl::_IOR::<T>(BASE, nr)
> +}
> +
> +/// Construct a DRM ioctl number with a write-only argument.
> +pub const fn IOW<T>(nr: u32) -> u32 {
> +    ioctl::_IOW::<T>(BASE, nr)
> +}
> +
> +/// Construct a DRM ioctl number with a read-write argument.
> +pub const fn IOWR<T>(nr: u32) -> u32 {
> +    ioctl::_IOWR::<T>(BASE, nr)
> +}
> +
> +/// Descriptor type for DRM ioctls. Use the `declare_drm_ioctls!{}` macro to construct them.
> +pub type DrmIoctlDescriptor = bindings::drm_ioctl_desc;
> +
> +/// This is for ioctl which are used for rendering, and require that the file descriptor is either
> +/// for a render node, or if it’s a legacy/primary node, then it must be authenticated.
> +pub const AUTH: u32 = bindings::drm_ioctl_flags_DRM_AUTH;
> +
> +/// This must be set for any ioctl which can change the modeset or display state. Userspace must
> +/// call the ioctl through a primary node, while it is the active master.
> +///
> +/// Note that read-only modeset ioctl can also be called by unauthenticated clients, or when a
> +/// master is not the currently active one.
> +pub const MASTER: u32 = bindings::drm_ioctl_flags_DRM_MASTER;
> +
> +/// Anything that could potentially wreak a master file descriptor needs to have this flag set.
> +///
> +/// Current that’s only for the SETMASTER and DROPMASTER ioctl, which e.g. logind can call to force
> +/// a non-behaving master (display compositor) into compliance.
> +///
> +/// This is equivalent to callers with the SYSADMIN capability.
> +pub const ROOT_ONLY: u32 = bindings::drm_ioctl_flags_DRM_ROOT_ONLY;
> +
> +/// Whether drm_ioctl_desc.func should be called with the DRM BKL held or not. Enforced as the
> +/// default for all modern drivers, hence there should never be a need to set this flag.
> +///
> +/// Do not use anywhere else than for the VBLANK_WAIT IOCTL, which is the only legacy IOCTL which
> +/// needs this.
> +pub const UNLOCKED: u32 = bindings::drm_ioctl_flags_DRM_UNLOCKED;
> +
> +/// This is used for all ioctl needed for rendering only, for drivers which support render nodes.
> +/// This should be all new render drivers, and hence it should be always set for any ioctl with
> +/// `AUTH` set. Note though that read-only query ioctl might have this set, but have not set
> +/// DRM_AUTH because they do not require authentication.
> +pub const RENDER_ALLOW: u32 = bindings::drm_ioctl_flags_DRM_RENDER_ALLOW;
> +
> +/// Declare the DRM ioctls for a driver.
> +///
> +/// Each entry in the list should have the form:
> +///
> +/// `(ioctl_number, argument_type, flags, user_callback),`
> +///
> +/// `argument_type` is the type name within the `bindings` crate.
> +/// `user_callback` should have the following prototype:
> +///
> +/// ```
> +/// fn foo(device: &kernel::drm::device::Device<Self>,
> +///        data: &mut bindings::argument_type,
> +///        file: &kernel::drm::file::File<Self::File>,
> +/// )
> +/// ```
> +/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within.
> +///
> +/// # Examples
> +///
> +/// ```
> +/// kernel::declare_drm_ioctls! {
> +///     (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler),
> +/// }

I am wondering.. couldn't we make it a proc_macro and just tag all the
functions instead? Though I also see the point of having a central
list of all ioctls... Maybe we should have some higher level
discussions around on _how_ we want things to look like.

> +/// ```
> +///
> +#[macro_export]
> +macro_rules! declare_drm_ioctls {
> +    ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
> +        const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
> +            const _:() = {
> +                let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
> +                // Assert that all the IOCTLs are in the right order and there are no gaps,
> +                // and that the sizeof of the specified type is correct.
> +                $(
> +                    let cmd: u32 = $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd);
> +                    ::core::assert!(i == $crate::ioctl::_IOC_NR(cmd));
> +                    ::core::assert!(core::mem::size_of::<$crate::bindings::$struct>() == $crate::ioctl::_IOC_SIZE(cmd));

::core::mem::size_of

> +                    let i: u32 = i + 1;
> +                )*
> +            };
> +
> +            let ioctls = &[$(
> +                $crate::bindings::drm_ioctl_desc {
> +                    cmd: $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd) as u32,
> +                    func: {
> +                        #[allow(non_snake_case)]
> +                        unsafe extern "C" fn $cmd(
> +                                raw_dev: *mut $crate::bindings::drm_device,
> +                                raw_data: *mut ::core::ffi::c_void,
> +                                raw_file_priv: *mut $crate::bindings::drm_file,
> +                        ) -> core::ffi::c_int {

::core

> +                            // SAFETY: We never drop this, and the DRM core ensures the device lives
> +                            // while callbacks are being called.
> +                            //
> +                            // FIXME: Currently there is nothing enforcing that the types of the
> +                            // dev/file match the current driver these ioctls are being declared
> +                            // for, and it's not clear how to enforce this within the type system.
> +                            let dev = ::core::mem::ManuallyDrop::new(unsafe {
> +                                $crate::drm::device::Device::from_raw(raw_dev)
> +                            });
> +                            // SAFETY: This is just the ioctl argument, which hopefully has the right type
> +                            // (we've done our best checking the size).
> +                            let data = unsafe { &mut *(raw_data as *mut $crate::bindings::$struct) };
> +                            // SAFETY: This is just the DRM file structure
> +                            let file = unsafe { $crate::drm::file::File::from_raw(raw_file_priv) };
> +
> +                            match $func(&*dev, data, &file) {
> +                                Err(e) => e.to_kernel_errno(),
> +                                Ok(i) => i.try_into().unwrap_or(ERANGE.to_kernel_errno()),

need to specify the namespace on ERANGE, no?

> +                            }
> +                        }
> +                        Some($cmd)
> +                    },
> +                    flags: $flags,
> +                    name: $crate::c_str!(::core::stringify!($cmd)).as_char_ptr(),
> +                }
> +            ),*];
> +            ioctls
> +        };
> +    };
> +}
> diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
> new file mode 100644
> index 000000000000..9ec6d7cbcaf3
> --- /dev/null
> +++ b/rust/kernel/drm/mod.rs
> @@ -0,0 +1,5 @@
> +// SPDX-License-Identifier: GPL-2.0 OR MIT
> +
> +//! DRM subsystem abstractions.
> +
> +pub mod ioctl;
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index 7903490816bf..cb23d24c6718 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -37,6 +37,8 @@ mod build_assert;
>  pub mod delay;
>  pub mod device;
>  pub mod driver;
> +#[cfg(CONFIG_RUST_DRM)]
> +pub mod drm;
>  pub mod error;
>  pub mod io_buffer;
>  pub mod io_mem;
>
> --
> 2.35.1
>
Maíra Canal March 7, 2023, 3:32 p.m. UTC | #2
On 3/7/23 11:25, Asahi Lina wrote:
> DRM drivers need to be able to declare which driver-specific ioctls they
> support. This abstraction adds the required types and a helper macro to
> generate the ioctl definition inside the DRM driver.
> 
> Note that this macro is not usable until further bits of the
> abstraction are in place (but it will not fail to compile on its own, if
> not called).
> 
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>   drivers/gpu/drm/Kconfig         |   7 ++
>   rust/bindings/bindings_helper.h |   2 +
>   rust/kernel/drm/ioctl.rs        | 147 ++++++++++++++++++++++++++++++++++++++++
>   rust/kernel/drm/mod.rs          |   5 ++
>   rust/kernel/lib.rs              |   2 +
>   5 files changed, 163 insertions(+)
> 
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index dc0f94f02a82..dab8f0f9aa96 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -27,6 +27,13 @@ menuconfig DRM
>   	  details.  You should also select and configure AGP
>   	  (/dev/agpgart) support if it is available for your platform.
>   

[...]

> +
> +/// Declare the DRM ioctls for a driver.
> +///
> +/// Each entry in the list should have the form:
> +///
> +/// `(ioctl_number, argument_type, flags, user_callback),`
> +///
> +/// `argument_type` is the type name within the `bindings` crate.
> +/// `user_callback` should have the following prototype:
> +///
> +/// ```
> +/// fn foo(device: &kernel::drm::device::Device<Self>,
> +///        data: &mut bindings::argument_type,
> +///        file: &kernel::drm::file::File<Self::File>,
> +/// )
> +/// ```
> +/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within.
> +///
> +/// # Examples
> +///
> +/// ```
> +/// kernel::declare_drm_ioctls! {
> +///     (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler),
> +/// }
> +/// ```
> +///
> +#[macro_export]
> +macro_rules! declare_drm_ioctls {
> +    ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
> +        const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
> +            const _:() = {
> +                let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
> +                // Assert that all the IOCTLs are in the right order and there are no gaps,
> +                // and that the sizeof of the specified type is correct.

I believe that not necessarily the IOCTLs need to be in the right order and
with no gaps. For example, armada_drm.h has a gap in between 0x00 and
0x02 and exynos_drm.h also have gaps. Moreover, some drivers, like vgem and
virtgpu, start their IOCTLs with 0x01.

Best Regards,
- Maíra Canal

> +                $(
> +                    let cmd: u32 = $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd);
> +                    ::core::assert!(i == $crate::ioctl::_IOC_NR(cmd));
> +                    ::core::assert!(core::mem::size_of::<$crate::bindings::$struct>() == $crate::ioctl::_IOC_SIZE(cmd));
> +                    let i: u32 = i + 1;
> +                )*
> +            };
> +
> +            let ioctls = &[$(
> +                $crate::bindings::drm_ioctl_desc {
> +                    cmd: $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd) as u32,
> +                    func: {
> +                        #[allow(non_snake_case)]
> +                        unsafe extern "C" fn $cmd(
> +                                raw_dev: *mut $crate::bindings::drm_device,
> +                                raw_data: *mut ::core::ffi::c_void,
> +                                raw_file_priv: *mut $crate::bindings::drm_file,
> +                        ) -> core::ffi::c_int {
> +                            // SAFETY: We never drop this, and the DRM core ensures the device lives
> +                            // while callbacks are being called.
> +                            //
> +                            // FIXME: Currently there is nothing enforcing that the types of the
> +                            // dev/file match the current driver these ioctls are being declared
> +                            // for, and it's not clear how to enforce this within the type system.
> +                            let dev = ::core::mem::ManuallyDrop::new(unsafe {
> +                                $crate::drm::device::Device::from_raw(raw_dev)
> +                            });
> +                            // SAFETY: This is just the ioctl argument, which hopefully has the right type
> +                            // (we've done our best checking the size).
> +                            let data = unsafe { &mut *(raw_data as *mut $crate::bindings::$struct) };
> +                            // SAFETY: This is just the DRM file structure
> +                            let file = unsafe { $crate::drm::file::File::from_raw(raw_file_priv) };
> +
> +                            match $func(&*dev, data, &file) {
> +                                Err(e) => e.to_kernel_errno(),
> +                                Ok(i) => i.try_into().unwrap_or(ERANGE.to_kernel_errno()),
> +                            }
> +                        }
> +                        Some($cmd)
> +                    },
> +                    flags: $flags,
> +                    name: $crate::c_str!(::core::stringify!($cmd)).as_char_ptr(),
> +                }
> +            ),*];
> +            ioctls
> +        };
> +    };
> +}
> diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
> new file mode 100644
> index 000000000000..9ec6d7cbcaf3
> --- /dev/null
> +++ b/rust/kernel/drm/mod.rs
> @@ -0,0 +1,5 @@
> +// SPDX-License-Identifier: GPL-2.0 OR MIT
> +
> +//! DRM subsystem abstractions.
> +
> +pub mod ioctl;
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index 7903490816bf..cb23d24c6718 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -37,6 +37,8 @@ mod build_assert;
>   pub mod delay;
>   pub mod device;
>   pub mod driver;
> +#[cfg(CONFIG_RUST_DRM)]
> +pub mod drm;
>   pub mod error;
>   pub mod io_buffer;
>   pub mod io_mem;
>
Asahi Lina March 7, 2023, 4:17 p.m. UTC | #3
That was supposed to have Markdown-style section headings, but I forgot
that b4 considers a leading # as a comment... sorry for the abrupt topic
changes...

The intended headings are below.

On 07/03/2023 23.25, Asahi Lina wrote:
> Hi everyone!
> 
> This is my first take on the Rust abstractions for the DRM
> subsystem. It includes the abstractions themselves, some minor
> prerequisite changes to the C side, as well as the drm-asahi GPU driver
> (for reference on how the abstractions are used, but not necessarily
> intended to land together).
> 
> These patches apply on top of the tree at [1], which is based on
> 6.3-rc1 with a large number of Rust abstraction/support commits added on
> top. Most of these are not prerequisites for the DRM abstractions
> themselves, but rather only of the driver.
> 
> * #1-12 introduce the abstractions, module by module, with minor C
>   changes before the dependent abstraction.
>   * Patch 10 is a little addition to drm_sched that I ended up needing,
>     but I can pull it out of the abstraction into its own patch if
>     needed.
> * #13-14 add a minor feature to drm/gem and its abstraction used
>   by the driver.
> * #15-16 introduce the (unstable) asahi UAPI. This is obviously not
>   ready for merge yet, but comments are welcome!
> * #17 adds a Rust helper macro to handle GPU core/firmware differences.
>   This probably belongs in the driver at this point, but right now it
>   has to live in rust/macros since there is no mechanism for per-driver
>   proc macros.
> * #18 adds the driver proper, in one big commit, for reference purposes.

## Background

> I've been working since mid last year on an Apple AGX GPU driver for
> Linux, using the (at the time) out-of-tree Rust support. As part of this
> effort, I've been writing safe Rust abstractions for portions of the DRM
> subsystem.
> 
> Now that Rust itself is upstream, I'd like to get all the abstractions
> upstreamed so we can eventually get the driver upstreamed!
> 
> These abstractions have been used by the driver since our release in
> December [2], in a simpler synchronous-submission form:
> 
> * drm::ioctl
> * drm::device
> * drm::drv
> * drm::file
> * drm::{gem, gem::shmem}
> * drm::mm
> 
> This series adds these too, which are used by the explicit sync refactor
> of the driver (the version in this series):
> 
> * drm::syncobj
> * drm::sched
> * dma_fence
> 
> The major dependencies for the DRM abstractions themselves are:
> 
> * [3] rust: error: Add missing wrappers to convert to/from kernel error codes
> * [4] rust: Miscellaneous macro improvements
> * [5] rust: Add a Sealed trait
> * [6] rust: device: Add a minimal RawDevice trait
> * [7] rust: Enable the new_uninit feature for kernel and driver crates
> * [8] rust: ioctl: Add ioctl number manipulation functions
> * [9] rust: sync: Arc: Any downcasting and assume_init()
> *     rust: Add `container_of` and `offset_of` macros
> *     kernel::sync::mutex and dependencies
> 
> Most of these (the ones with links) have already been submitted, and I
> expect all of them to land for 6.4 (the mutex one will likely be last,
> since there is some refactoring that will happen over the current state
> to make it more ergonomic to use). The mutex dep is only necessary for
> drm::mm and dma_fence, and transitively drm::syncobj and drm::sched.

## State

> Things work! We've had most of the abstractions in production edge
> kernels with the driver, and the new explicit sync stuff has passed
> quite a few torture tests (this is how we found the drm_sched issue,
> patch 11).
> 
> The abstractions are intended to be safe (safety review very welcome!).
> While writing them, I tried to avoid making any changes to the C side
> unless absolutely necessary. I understand that it will probably make
> sense to adjust the C side to make some things easier, but I wanted to
> start from this as a baseline.
> 
> Known issues:
> 
> - The existing Rust integration does not currently allow building
>   abstractions as modules, so the Rust abstractions are only available
>   for DRM components that are built in. I added some extra Kconfig
>   symbols to deal with this, so a driver built as a module can depende
>   on having those built in. This should go away in the future (but may
>   not be ready in time for submission... I understand this probably
>   shouldn't be a blocker though?).
> 
> - DRM relies heavily on the "subclassing" pattern for driver objects,
>   and this doesn't map well to Rust. I tried several approaches for
>   various bits, so we can see how they work out. In particular, whether
>   wrapper types should pretend to be smart pointers and Deref to their
>   inner driver-specific types, and whether they should be marked as
>   method receivers (Yuck, internal rustc implementation hacks! But
>   Arc<T> already does the same thing and it makes usage in
>   driver-implemented callbacks as `self` possible) are things I'd love
>   to discuss ^^.
> 
> - Only what I need for my driver is implemented (plus a small amount of
>   obvious extras where better API completeness makes sense). I think the
>   general idea with Rust abstractions is that we add things as they
>   become necessary.
> 
> - The plain GEM vs. GEM-shmem duality ended up with quite a hairy type
>   hierarchy. I'd love to figure out how to make this simpler...
> 
> - drm::mm ends up requiring a built-in mutex in the abstraction, instead
>   of delegating that to the user with the usual Rust mutability rules.
>   This is because nodes can be dropped at any time, and those operations
>   need to be synchronized. We could try to avoid forbidding those drops
>   or mark the node type !Send, but that would make it a lot less
>   ergonomic to use...
> 
> I'm looking for feedback on the abstractions of all kinds, so we can
> move towards an upstreamable version. Optimistically, I'd love to get
> this upstream for 6.5, and the driver for 6.6.
> 
> Please feel free to ask any questions about the Rust bits, since I know
> a lot of this is new to many of the C folks!

## About the drm-asahi driver

> This is a fairly complete driver for Apple AGX G13 and G14 series GPUs.
> 
> The driver today supports the Apple M1, M1 Pro, M1 Max, M1 Ultra, and M2
> SoCs, across two firmware revisions each. It has an explicit sync UAPI
> heavily inspired by the upcoming Intel Xe UAPI, designed with Vulkan
> support in mind. On the Mesa side we currently have a Gallium driver
> that is mostly already upstream (missing the UAPI bits mostly) and
> passes the dEQP GLES2/EGL tests, with most of GLES3.0 passing in
> downstream work-in-progress branches. This is a reverse engineered
> community driver (we have no hardware documentation of any kind, other
> than some hints from aspects shared with PowerVR).
> 
> While developing the driver, I tried to make use of Rust's safety and
> lifetime features to provide not just CPU-side safety, but also
> partial firmware-ABI safety. Thanks to this, it has turned out to be
> a very stable driver even though GPU firmware crashes are fatal (no
> restart capability, need to reboot!) and the FW/driver interface is a
> huge mess of unsafe shared memory structures with complex pointer
> chains. There are over 70 ABI types and 3000+ lines of firmware ABI type
> definitions that vary between firmware builds and GPU cores...
> 
> In a simpler blocking-submission form, it has been shipping in Asahi
> Linux edge kernels since December [2], with lots of users and zero (!)
> reported oopses (and only a couple reports of GPU firmware crashes,
> though that issue should now be fixed). It has survived OOM scenarios
> (Rust makes error cleanup easy!), UAPI-level fuzzing, countless broken
> Mesa builds, uptimes of 40+ days, and more.
> 
> The explicit sync refactor significantly increases performance (and
> potential problems), but this version has survived a lot of torture
> with dEQP/piglit tests and some manual corner case testing.
> 
> In other words, Rust works! ^^
> 
> There are some design notes on the driver and further links at [10].

## Links

> [1] https://github.com/AsahiLinux/linux.git drm-rfc-base-20230307
> [2] https://asahilinux.org/2022/12/gpu-drivers-now-in-asahi-linux/
> [3] https://lore.kernel.org/rust-for-linux/20230224-rust-error-v1-0-f8f9a9a87303@asahilina.net/T/
> [4] https://lore.kernel.org/rust-for-linux/20230224-rust-macros-v1-0-b39fae46e102@asahilina.net/T/
> [5] https://lore.kernel.org/rust-for-linux/20230224-rust-iopt-rtkit-v1-0-49ced3391295@asahilina.net/T/#m515bad2cff7f5a46f55897e6b73c6c2f1fb2c638
> [6] https://lore.kernel.org/rust-for-linux/20230224-rust-iopt-rtkit-v1-0-49ced3391295@asahilina.net/T/#m4c64e390c43b3ff1b8470fc8b37eaf87f6e12c94
> [7] https://lore.kernel.org/rust-for-linux/CQV7ZNT6LMXI.1XG4YXSH8I7JK@vincent-arch/T/
> [8] https://lore.kernel.org/rust-for-linux/61f734d6-1497-755f-3632-3f261b890846@asahilina.net/T/
> [9] https://lore.kernel.org/rust-for-linux/20230224-rust-arc-v1-0-568eea613a41@asahilina.net/T/
> [10] https://github.com/AsahiLinux/docs/wiki/SW:AGX-driver-notes

~~ Lina
Björn Roy Baron March 7, 2023, 5:34 p.m. UTC | #4
------- Original Message -------
On Tuesday, March 7th, 2023 at 15:25, Asahi Lina <lina@asahilina.net> wrote:

> DRM drivers need to be able to declare which driver-specific ioctls they
> support. This abstraction adds the required types and a helper macro to
> generate the ioctl definition inside the DRM driver.
> 
> Note that this macro is not usable until further bits of the
> abstraction are in place (but it will not fail to compile on its own, if
> not called).
> 
> Signed-off-by: Asahi Lina lina@asahilina.net
> 
> ---
>  drivers/gpu/drm/Kconfig         |   7 ++
>  rust/bindings/bindings_helper.h |   2 +
>  rust/kernel/drm/ioctl.rs        | 147 ++++++++++++++++++++++++++++++++++++++++
>  rust/kernel/drm/mod.rs          |   5 ++
>  rust/kernel/lib.rs              |   2 +
>  5 files changed, 163 insertions(+)
> 
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index dc0f94f02a82..dab8f0f9aa96 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -27,6 +27,13 @@ menuconfig DRM
>  	  details.  You should also select and configure AGP
>  	  (/dev/agpgart) support if it is available for your platform.
> 
> +# Rust abstractions cannot be built as modules currently, so force them as
> +# bool by using these intermediate symbols. In the future these could be
> +# tristate once abstractions themselves can be built as modules.
> +config RUST_DRM
> +	bool "Rust support for the DRM subsystem"
> +	depends on DRM=y
> +
>  config DRM_MIPI_DBI
>  	tristate
>  	depends on DRM
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index 91bb7906ca5a..2687bef1676f 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -6,6 +6,7 @@
>   * Sorted alphabetically.
>   */
> 
> +#include <drm/drm_ioctl.h>
>  #include <linux/delay.h>
>  #include <linux/device.h>
>  #include <linux/dma-mapping.h>
> @@ -23,6 +24,7 @@
>  #include <linux/sysctl.h>
>  #include <linux/timekeeping.h>
>  #include <linux/xarray.h>
> +#include <uapi/drm/drm.h>
> 
>  /* `bindgen` gets confused at certain things. */
>  const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
> diff --git a/rust/kernel/drm/ioctl.rs b/rust/kernel/drm/ioctl.rs
> new file mode 100644
> index 000000000000..10304efbd5f1
> --- /dev/null
> +++ b/rust/kernel/drm/ioctl.rs
> @@ -0,0 +1,147 @@
> +// SPDX-License-Identifier: GPL-2.0 OR MIT
> +#![allow(non_snake_case)]
> +
> +//! DRM IOCTL definitions.
> +//!
> +//! C header: [`include/linux/drm/drm_ioctl.h`](../../../../include/linux/drm/drm_ioctl.h)
> +
> +use crate::ioctl;
> +
> +const BASE: u32 = bindings::DRM_IOCTL_BASE as u32;
> +
> +/// Construct a DRM ioctl number with no argument.
> +pub const fn IO(nr: u32) -> u32 {
> +    ioctl::_IO(BASE, nr)
> +}
> +
> +/// Construct a DRM ioctl number with a read-only argument.
> +pub const fn IOR<T>(nr: u32) -> u32 {
> +    ioctl::_IOR::<T>(BASE, nr)
> +}
> +
> +/// Construct a DRM ioctl number with a write-only argument.
> +pub const fn IOW<T>(nr: u32) -> u32 {
> +    ioctl::_IOW::<T>(BASE, nr)
> +}
> +
> +/// Construct a DRM ioctl number with a read-write argument.
> +pub const fn IOWR<T>(nr: u32) -> u32 {
> +    ioctl::_IOWR::<T>(BASE, nr)
> +}
> +
> +/// Descriptor type for DRM ioctls. Use the `declare_drm_ioctls!{}` macro to construct them.
> +pub type DrmIoctlDescriptor = bindings::drm_ioctl_desc;
> +
> +/// This is for ioctl which are used for rendering, and require that the file descriptor is either
> +/// for a render node, or if it’s a legacy/primary node, then it must be authenticated.
> +pub const AUTH: u32 = bindings::drm_ioctl_flags_DRM_AUTH;
> +
> +/// This must be set for any ioctl which can change the modeset or display state. Userspace must
> +/// call the ioctl through a primary node, while it is the active master.
> +///
> +/// Note that read-only modeset ioctl can also be called by unauthenticated clients, or when a
> +/// master is not the currently active one.
> +pub const MASTER: u32 = bindings::drm_ioctl_flags_DRM_MASTER;
> +
> +/// Anything that could potentially wreak a master file descriptor needs to have this flag set.
> +///
> +/// Current that’s only for the SETMASTER and DROPMASTER ioctl, which e.g. logind can call to force
> +/// a non-behaving master (display compositor) into compliance.
> +///
> +/// This is equivalent to callers with the SYSADMIN capability.
> +pub const ROOT_ONLY: u32 = bindings::drm_ioctl_flags_DRM_ROOT_ONLY;
> +
> +/// Whether drm_ioctl_desc.func should be called with the DRM BKL held or not. Enforced as the
> +/// default for all modern drivers, hence there should never be a need to set this flag.
> +///
> +/// Do not use anywhere else than for the VBLANK_WAIT IOCTL, which is the only legacy IOCTL which
> +/// needs this.
> +pub const UNLOCKED: u32 = bindings::drm_ioctl_flags_DRM_UNLOCKED;
> +
> +/// This is used for all ioctl needed for rendering only, for drivers which support render nodes.
> +/// This should be all new render drivers, and hence it should be always set for any ioctl with
> +/// `AUTH` set. Note though that read-only query ioctl might have this set, but have not set
> +/// DRM_AUTH because they do not require authentication.
> +pub const RENDER_ALLOW: u32 = bindings::drm_ioctl_flags_DRM_RENDER_ALLOW;
> +
> +/// Declare the DRM ioctls for a driver.
> +///
> +/// Each entry in the list should have the form:
> +///
> +/// `(ioctl_number, argument_type, flags, user_callback),`
> +///
> +/// `argument_type` is the type name within the `bindings` crate.
> +/// `user_callback` should have the following prototype:
> +///
> +/// ```
> +/// fn foo(device: &kernel::drm::device::Device<Self>,
> +///        data: &mut bindings::argument_type,
> +///        file: &kernel::drm::file::File<Self::File>,
> +/// )
> +/// ```
> +/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within.
> +///
> +/// # Examples
> +///
> +/// ```
> +/// kernel::declare_drm_ioctls! {
> +///     (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler),
> +/// }
> +/// ```
> +///
> +#[macro_export]
> +macro_rules! declare_drm_ioctls {
> +    ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
> +        const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
> +            const _:() = {
> +                let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
> +                // Assert that all the IOCTLs are in the right order and there are no gaps,
> +                // and that the sizeof of the specified type is correct.
> +                $(
> +                    let cmd: u32 = $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd);
> +                    ::core::assert!(i == $crate::ioctl::_IOC_NR(cmd));
> +                    ::core::assert!(core::mem::size_of::<$crate::bindings::$struct>() == $crate::ioctl::_IOC_SIZE(cmd));
> +                    let i: u32 = i + 1;
> +                )*
> +            };
> +
> +            let ioctls = &[$(
> +                $crate::bindings::drm_ioctl_desc {
> +                    cmd: $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd) as u32,
> +                    func: {
> +                        #[allow(non_snake_case)]
> +                        unsafe extern "C" fn $cmd(
> +                                raw_dev: *mut $crate::bindings::drm_device,
> +                                raw_data: *mut ::core::ffi::c_void,
> +                                raw_file_priv: *mut $crate::bindings::drm_file,
> +                        ) -> core::ffi::c_int {
> +                            // SAFETY: We never drop this, and the DRM core ensures the device lives
> +                            // while callbacks are being called.
> +                            //
> +                            // FIXME: Currently there is nothing enforcing that the types of the
> +                            // dev/file match the current driver these ioctls are being declared
> +                            // for, and it's not clear how to enforce this within the type system.
> +                            let dev = ::core::mem::ManuallyDrop::new(unsafe {
> +                                $crate::drm::device::Device::from_raw(raw_dev)
> +                            });
> +                            // SAFETY: This is just the ioctl argument, which hopefully has the right type
> +                            // (we've done our best checking the size).

In the rust tree there is the ReadableFromBytes [1] trait which indicates that it is safe to read arbitrary bytes into the type. Maybe you could add it as bound on the argument type when it lands in rust-next? This way you can't end up with for example a struct containing a bool with the byte value 2, which is UB.

https://rust-for-linux.github.io/docs/kernel/io_buffer/trait.ReadableFromBytes.html [1]

> +                            let data = unsafe { &mut *(raw_data as *mut $crate::bindings::$struct) };
> +                            // SAFETY: This is just the DRM file structure
> +                            let file = unsafe { $crate::drm::file::File::from_raw(raw_file_priv) };
> +
> +                            match $func(&*dev, data, &file) {
> +                                Err(e) => e.to_kernel_errno(),
> +                                Ok(i) => i.try_into().unwrap_or(ERANGE.to_kernel_errno()),
> +                            }
> +                        }
> +                        Some($cmd)
> +                    },
> +                    flags: $flags,
> +                    name: $crate::c_str!(::core::stringify!($cmd)).as_char_ptr(),
> +                }
> +            ),*];
> +            ioctls
> +        };
> +    };
> +}
> diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
> new file mode 100644
> index 000000000000..9ec6d7cbcaf3
> --- /dev/null
> +++ b/rust/kernel/drm/mod.rs
> @@ -0,0 +1,5 @@
> +// SPDX-License-Identifier: GPL-2.0 OR MIT
> +
> +//! DRM subsystem abstractions.
> +
> +pub mod ioctl;
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index 7903490816bf..cb23d24c6718 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -37,6 +37,8 @@ mod build_assert;
>  pub mod delay;
>  pub mod device;
>  pub mod driver;
> +#[cfg(CONFIG_RUST_DRM)]
> +pub mod drm;
>  pub mod error;
>  pub mod io_buffer;
>  pub mod io_mem;
> 
> --
> 2.35.1

Cheers,
Bjorn
Björn Roy Baron March 7, 2023, 6:19 p.m. UTC | #5
------- Original Message -------
On Tuesday, March 7th, 2023 at 15:25, Asahi Lina <lina@asahilina.net> wrote:

> Add the initial abstractions for DRM drivers and devices. These go
> together in one commit since they are fairly tightly coupled types.
> 
> A few things have been stubbed out, to be implemented as further bits of
> the DRM subsystem are introduced.
> 
> Signed-off-by: Asahi Lina lina@asahilina.net
> 
> ---
>  rust/bindings/bindings_helper.h |   3 +
>  rust/kernel/drm/device.rs       |  76 +++++++++
>  rust/kernel/drm/drv.rs          | 339 ++++++++++++++++++++++++++++++++++++++++
>  rust/kernel/drm/mod.rs          |   2 +
>  4 files changed, 420 insertions(+)
> 
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index 2687bef1676f..2a999138c4ae 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -6,10 +6,13 @@
>   * Sorted alphabetically.
>   */
> 
> +#include <drm/drm_device.h>
> +#include <drm/drm_drv.h>
>  #include <drm/drm_ioctl.h>
>  #include <linux/delay.h>
>  #include <linux/device.h>
>  #include <linux/dma-mapping.h>
> +#include <linux/fs.h>
>  #include <linux/ioctl.h>
>  #include <linux/io-pgtable.h>
>  #include <linux/ktime.h>
> diff --git a/rust/kernel/drm/device.rs b/rust/kernel/drm/device.rs
> new file mode 100644
> index 000000000000..6007f941137a
> --- /dev/null
> +++ b/rust/kernel/drm/device.rs
> @@ -0,0 +1,76 @@
> +// SPDX-License-Identifier: GPL-2.0 OR MIT
> +
> +//! DRM device.
> +//!
> +//! C header: [`include/linux/drm/drm_device.h`](../../../../include/linux/drm/drm_device.h)
> +
> +use crate::{bindings, device, drm, types::ForeignOwnable};
> +use core::marker::PhantomData;
> +
> +/// Represents a reference to a DRM device. The device is reference-counted and is guaranteed to
> +/// not be dropped while this object is alive.
> +pub struct Device<T: drm::drv::Driver> {
> +    // Type invariant: ptr must be a valid and initialized drm_device,
> +    // and this value must either own a reference to it or the caller
> +    // must ensure that it is never dropped if the reference is borrowed.
> +    pub(super) ptr: *mut bindings::drm_device,
> +    _p: PhantomData<T>,
> +}
> +
> +impl<T: drm::drv::Driver> Device<T> {
> +    // Not intended to be called externally, except via declare_drm_ioctls!()
> +    #[doc(hidden)]
> +    pub unsafe fn from_raw(raw: *mut bindings::drm_device) -> Device<T> {
> +        Device {
> +            ptr: raw,
> +            _p: PhantomData,
> +        }
> +    }
> +
> +    #[allow(dead_code)]
> +    pub(crate) fn raw(&self) -> *const bindings::drm_device {
> +        self.ptr
> +    }
> +
> +    pub(crate) fn raw_mut(&mut self) -> *mut bindings::drm_device {
> +        self.ptr
> +    }
> +
> +    /// Returns a borrowed reference to the user data associated with this Device.
> +    pub fn data(&self) -> <T::Data as ForeignOwnable>::Borrowed<'_> {
> +        unsafe { T::Data::borrow((*self.ptr).dev_private) }
> +    }
> +}
> +
> +impl<T: drm::drv::Driver> Drop for Device<T> {
> +    fn drop(&mut self) {
> +        // SAFETY: By the type invariants, we know that `self` owns a reference, so it is safe to
> +        // relinquish it now.
> +        unsafe { bindings::drm_dev_put(self.ptr) };
> +    }
> +}
> +
> +impl<T: drm::drv::Driver> Clone for Device<T> {
> +    fn clone(&self) -> Self {
> +        // SAFETY: We get a new reference and then create a new owning object from the raw pointer
> +        unsafe {
> +            bindings::drm_dev_get(self.ptr);
> +            Device::from_raw(self.ptr)
> +        }
> +    }
> +}
> +
> +// SAFETY: `Device` only holds a pointer to a C device, which is safe to be used from any thread.
> +unsafe impl<T: drm::drv::Driver> Send for Device<T> {}
> +
> +// SAFETY: `Device` only holds a pointer to a C device, references to which are safe to be used
> +// from any thread.
> +unsafe impl<T: drm::drv::Driver> Sync for Device<T> {}
> +
> +// Make drm::Device work for dev_info!() and friends
> +unsafe impl<T: drm::drv::Driver> device::RawDevice for Device<T> {
> +    fn raw_device(&self) -> *mut bindings::device {
> +        // SAFETY: ptr must be valid per the type invariant
> +        unsafe { (*self.ptr).dev }
> +    }
> +}
> diff --git a/rust/kernel/drm/drv.rs b/rust/kernel/drm/drv.rs
> new file mode 100644
> index 000000000000..29a465515dc9
> --- /dev/null
> +++ b/rust/kernel/drm/drv.rs
> @@ -0,0 +1,339 @@
> +// SPDX-License-Identifier: GPL-2.0 OR MIT
> +
> +//! DRM driver core.
> +//!
> +//! C header: [`include/linux/drm/drm_drv.h`](../../../../include/linux/drm/drm_drv.h)
> +
> +use crate::{
> +    bindings, device, drm,
> +    error::code::*,
> +    error::from_kernel_err_ptr,
> +    error::{Error, Result},
> +    prelude::*,
> +    private::Sealed,
> +    str::CStr,
> +    types::ForeignOwnable,
> +    ThisModule,
> +};
> +use core::{
> +    marker::{PhantomData, PhantomPinned},
> +    pin::Pin,
> +};
> +use macros::vtable;
> +
> +/// Driver use the GEM memory manager. This should be set for all modern drivers.
> +pub const FEAT_GEM: u32 = bindings::drm_driver_feature_DRIVER_GEM;
> +/// Driver supports mode setting interfaces (KMS).
> +pub const FEAT_MODESET: u32 = bindings::drm_driver_feature_DRIVER_MODESET;
> +/// Driver supports dedicated render nodes.
> +pub const FEAT_RENDER: u32 = bindings::drm_driver_feature_DRIVER_RENDER;
> +/// Driver supports the full atomic modesetting userspace API.
> +///
> +/// Drivers which only use atomic internally, but do not support the full userspace API (e.g. not
> +/// all properties converted to atomic, or multi-plane updates are not guaranteed to be tear-free)
> +/// should not set this flag.
> +pub const FEAT_ATOMIC: u32 = bindings::drm_driver_feature_DRIVER_ATOMIC;
> +/// Driver supports DRM sync objects for explicit synchronization of command submission.
> +pub const FEAT_SYNCOBJ: u32 = bindings::drm_driver_feature_DRIVER_SYNCOBJ;
> +/// Driver supports the timeline flavor of DRM sync objects for explicit synchronization of command
> +/// submission.
> +pub const FEAT_SYNCOBJ_TIMELINE: u32 = bindings::drm_driver_feature_DRIVER_SYNCOBJ_TIMELINE;
> +
> +/// Information data for a DRM Driver.
> +pub struct DriverInfo {
> +    /// Driver major version.
> +    pub major: i32,
> +    /// Driver minor version.
> +    pub minor: i32,
> +    /// Driver patchlevel version.
> +    pub patchlevel: i32,
> +    /// Driver name.
> +    pub name: &'static CStr,
> +    /// Driver description.
> +    pub desc: &'static CStr,
> +    /// Driver date.
> +    pub date: &'static CStr,
> +}
> +

Could you please add an Invariants section to the doc comments indicating what requirements these function pointers must satisfy?

> +/// Internal memory management operation set, normally created by memory managers (e.g. GEM).
> +///
> +/// See `kernel::drm::gem` and `kernel::drm::gem::shmem`.
> +pub struct AllocOps {
> +    pub(crate) gem_create_object: Option<
> +        unsafe extern "C" fn(
> +            dev: *mut bindings::drm_device,
> +            size: usize,
> +        ) -> *mut bindings::drm_gem_object,
> +    >,
> +    pub(crate) prime_handle_to_fd: Option<
> +        unsafe extern "C" fn(
> +            dev: *mut bindings::drm_device,
> +            file_priv: *mut bindings::drm_file,
> +            handle: u32,
> +            flags: u32,
> +            prime_fd: *mut core::ffi::c_int,
> +        ) -> core::ffi::c_int,
> +    >,
> +    pub(crate) prime_fd_to_handle: Option<
> +        unsafe extern "C" fn(
> +            dev: *mut bindings::drm_device,
> +            file_priv: *mut bindings::drm_file,
> +            prime_fd: core::ffi::c_int,
> +            handle: *mut u32,
> +        ) -> core::ffi::c_int,
> +    >,
> +    pub(crate) gem_prime_import: Option<
> +        unsafe extern "C" fn(
> +            dev: *mut bindings::drm_device,
> +            dma_buf: *mut bindings::dma_buf,
> +        ) -> *mut bindings::drm_gem_object,
> +    >,
> +    pub(crate) gem_prime_import_sg_table: Option<
> +        unsafe extern "C" fn(
> +            dev: *mut bindings::drm_device,
> +            attach: *mut bindings::dma_buf_attachment,
> +            sgt: *mut bindings::sg_table,
> +        ) -> *mut bindings::drm_gem_object,
> +    >,
> +    pub(crate) gem_prime_mmap: Option<
> +        unsafe extern "C" fn(
> +            obj: *mut bindings::drm_gem_object,
> +            vma: *mut bindings::vm_area_struct,
> +        ) -> core::ffi::c_int,
> +    >,
> +    pub(crate) dumb_create: Option<
> +        unsafe extern "C" fn(
> +            file_priv: *mut bindings::drm_file,
> +            dev: *mut bindings::drm_device,
> +            args: *mut bindings::drm_mode_create_dumb,
> +        ) -> core::ffi::c_int,
> +    >,
> +    pub(crate) dumb_map_offset: Option<
> +        unsafe extern "C" fn(
> +            file_priv: *mut bindings::drm_file,
> +            dev: *mut bindings::drm_device,
> +            handle: u32,
> +            offset: *mut u64,
> +        ) -> core::ffi::c_int,
> +    >,
> +    pub(crate) dumb_destroy: Option<
> +        unsafe extern "C" fn(
> +            file_priv: *mut bindings::drm_file,
> +            dev: *mut bindings::drm_device,
> +            handle: u32,
> +        ) -> core::ffi::c_int,
> +    >,
> +}
> +
> +/// Trait for memory manager implementations. Implemented internally.
> +pub trait AllocImpl: Sealed {
> +    /// The C callback operations for this memory manager.
> +    const ALLOC_OPS: AllocOps;
> +}
> +
> +/// A DRM driver implementation.
> +#[vtable]
> +pub trait Driver {
> +    /// Context data associated with the DRM driver
> +    ///
> +    /// Determines the type of the context data passed to each of the methods of the trait.
> +    type Data: ForeignOwnable + Sync + Send;
> +
> +    /// The type used to manage memory for this driver.
> +    ///
> +    /// Should be either `drm::gem::Object<T>` or `drm::gem::shmem::Object<T>`.
> +    type Object: AllocImpl;
> +
> +    /// Driver metadata
> +    const INFO: DriverInfo;
> +
> +    /// Feature flags
> +    const FEATURES: u32;
> +
> +    /// IOCTL list. See `kernel::drm::ioctl::declare_drm_ioctls!{}`.
> +    const IOCTLS: &'static [drm::ioctl::DrmIoctlDescriptor];
> +}
> +
> +/// A registration of a DRM device
> +///
> +/// # Invariants:
> +///
> +/// drm is always a valid pointer to an allocated drm_device
> +pub struct Registration<T: Driver> {
> +    drm: drm::device::Device<T>,
> +    registered: bool,
> +    fops: bindings::file_operations,
> +    vtable: Pin<Box<bindings::drm_driver>>,
> +    _p: PhantomData<T>,
> +    _pin: PhantomPinned,
> +}
> +
> +#[cfg(CONFIG_DRM_LEGACY)]
> +macro_rules! drm_legacy_fields {
> +    ( $($field:ident: $val:expr),* $(,)? ) => {
> +        bindings::drm_driver {
> +            $( $field: $val ),*,
> +            firstopen: None,
> +            preclose: None,
> +            dma_ioctl: None,
> +            dma_quiescent: None,
> +            context_dtor: None,
> +            irq_handler: None,
> +            irq_preinstall: None,
> +            irq_postinstall: None,
> +            irq_uninstall: None,
> +            get_vblank_counter: None,
> +            enable_vblank: None,
> +            disable_vblank: None,
> +            dev_priv_size: 0,
> +        }
> +    }
> +}
> +
> +#[cfg(not(CONFIG_DRM_LEGACY))]
> +macro_rules! drm_legacy_fields {
> +    ( $($field:ident: $val:expr),* $(,)? ) => {
> +        bindings::drm_driver {
> +            $( $field: $val ),*
> +        }
> +    }
> +}
> +
> +/// Registers a DRM device with the rest of the kernel.
> +///
> +/// It automatically picks up THIS_MODULE.
> +#[allow(clippy::crate_in_macro_def)]
> +#[macro_export]
> +macro_rules! drm_device_register {
> +    ($reg:expr, $data:expr, $flags:expr $(,)?) => {{
> +        $crate::drm::drv::Registration::register($reg, $data, $flags, &crate::THIS_MODULE)
> +    }};
> +}
> +
> +impl<T: Driver> Registration<T> {
> +    const VTABLE: bindings::drm_driver = drm_legacy_fields! {
> +        load: None,
> +        open: None, // TODO: File abstraction
> +        postclose: None, // TODO: File abstraction
> +        lastclose: None,
> +        unload: None,
> +        release: None,
> +        master_set: None,
> +        master_drop: None,
> +        debugfs_init: None,
> +        gem_create_object: T::Object::ALLOC_OPS.gem_create_object,
> +        prime_handle_to_fd: T::Object::ALLOC_OPS.prime_handle_to_fd,
> +        prime_fd_to_handle: T::Object::ALLOC_OPS.prime_fd_to_handle,
> +        gem_prime_import: T::Object::ALLOC_OPS.gem_prime_import,
> +        gem_prime_import_sg_table: T::Object::ALLOC_OPS.gem_prime_import_sg_table,
> +        gem_prime_mmap: T::Object::ALLOC_OPS.gem_prime_mmap,
> +        dumb_create: T::Object::ALLOC_OPS.dumb_create,
> +        dumb_map_offset: T::Object::ALLOC_OPS.dumb_map_offset,
> +        dumb_destroy: T::Object::ALLOC_OPS.dumb_destroy,
> +
> +        major: T::INFO.major,
> +        minor: T::INFO.minor,
> +        patchlevel: T::INFO.patchlevel,
> +        name: T::INFO.name.as_char_ptr() as *mut _,
> +        desc: T::INFO.desc.as_char_ptr() as *mut _,
> +        date: T::INFO.date.as_char_ptr() as *mut _,
> +
> +        driver_features: T::FEATURES,
> +        ioctls: T::IOCTLS.as_ptr(),
> +        num_ioctls: T::IOCTLS.len() as i32,
> +        fops: core::ptr::null_mut(),
> +    };
> +
> +    /// Creates a new [`Registration`] but does not register it yet.
> +    ///
> +    /// It is allowed to move.
> +    pub fn new(parent: &dyn device::RawDevice) -> Result<Self> {
> +        let vtable = Pin::new(Box::try_new(Self::VTABLE)?);
> +        let raw_drm = unsafe { bindings::drm_dev_alloc(&*vtable, parent.raw_device()) };
> +        let raw_drm = from_kernel_err_ptr(raw_drm)?;
> +
> +        // The reference count is one, and now we take ownership of that reference as a
> +        // drm::device::Device.
> +        let drm = unsafe { drm::device::Device::from_raw(raw_drm) };
> +
> +        Ok(Self {
> +            drm,
> +            registered: false,
> +            vtable,
> +            fops: Default::default(), // TODO: GEM abstraction
> +            _pin: PhantomPinned,
> +            _p: PhantomData,
> +        })
> +    }
> +
> +    /// Registers a DRM device with the rest of the kernel.
> +    ///
> +    /// Users are encouraged to use the [`drm_device_register!()`] macro because it automatically
> +    /// picks up the current module.
> +    pub fn register(
> +        self: Pin<&mut Self>,
> +        data: T::Data,
> +        flags: usize,
> +        module: &'static ThisModule,
> +    ) -> Result {
> +        if self.registered {
> +            // Already registered.
> +            return Err(EINVAL);
> +        }
> +
> +        // SAFETY: We never move out of `this`.
> +        let this = unsafe { self.get_unchecked_mut() };
> +        let data_pointer = <T::Data as ForeignOwnable>::into_foreign(data);
> +        // SAFETY: `drm` is valid per the type invariant
> +        unsafe {
> +            (*this.drm.raw_mut()).dev_private = data_pointer as *mut _;
> +        }
> +
> +        this.fops.owner = module.0;
> +        this.vtable.fops = &this.fops;
> +
> +        // SAFETY: The device is now initialized and ready to be registered.
> +        let ret = unsafe { bindings::drm_dev_register(this.drm.raw_mut(), flags as u64) };
> +        if ret < 0 {
> +            // SAFETY: `data_pointer` was returned by `into_foreign` above.
> +            unsafe { T::Data::from_foreign(data_pointer) };
> +            return Err(Error::from_kernel_errno(ret));
> +        }
> +
> +        this.registered = true;
> +        Ok(())
> +    }
> +
> +    /// Returns a reference to the `Device` instance for this registration.
> +    pub fn device(&self) -> &drm::device::Device<T> {
> +        &self.drm
> +    }
> +}
> +
> +// SAFETY: `Registration` doesn't offer any methods or access to fields when shared between threads
> +// or CPUs, so it is safe to share it.
> +unsafe impl<T: Driver> Sync for Registration<T> {}
> +
> +// SAFETY: Registration with and unregistration from the drm subsystem can happen from any thread.
> +// Additionally, `T::Data` (which is dropped during unregistration) is `Send`, so it is ok to move
> +// `Registration` to different threads.
> +#[allow(clippy::non_send_fields_in_send_ty)]
> +unsafe impl<T: Driver> Send for Registration<T> {}
> +
> +impl<T: Driver> Drop for Registration<T> {
> +    /// Removes the registration from the kernel if it has completed successfully before.
> +    fn drop(&mut self) {
> +        if self.registered {
> +            // Get a pointer to the data stored in device before destroying it.
> +            // SAFETY: `drm` is valid per the type invariant
> +            let data_pointer = unsafe { (*self.drm.raw_mut()).dev_private };
> +
> +            // SAFETY: Since `registered` is true, `self.drm` is both valid and registered.
> +            unsafe { bindings::drm_dev_unregister(self.drm.raw_mut()) };
> +
> +            // Free data as well.
> +            // SAFETY: `data_pointer` was returned by `into_foreign` during registration.
> +            unsafe { <T::Data as ForeignOwnable>::from_foreign(data_pointer) };
> +        }
> +    }
> +}
> diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
> index 9ec6d7cbcaf3..69376b3c6db9 100644
> --- a/rust/kernel/drm/mod.rs
> +++ b/rust/kernel/drm/mod.rs
> @@ -2,4 +2,6 @@
> 
>  //! DRM subsystem abstractions.
> 
> +pub mod device;
> +pub mod drv;
>  pub mod ioctl;
> 
> --
> 2.35.1

Cheers,
Bjorn
Christian König March 8, 2023, 8:46 a.m. UTC | #6
Am 07.03.23 um 15:25 schrieb Asahi Lina:
> Some hardware may require more complex resource utilization accounting
> than the simple job count supported by drm_sched internally. Add a
> can_run_job callback to allow drivers to implement more logic before
> deciding whether to run a GPU job.

Well complete NAK.

This is clearly going against the idea of having jobs only depend on 
fences and nothing else which is mandatory for correct memory management.

If the hw is busy with something you need to return the fence for this 
from the prepare_job callback so that the scheduler can be notified when 
the hw is available again.

Regards,
Christian.

>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
>   include/drm/gpu_scheduler.h            |  8 ++++++++
>   2 files changed, 18 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 4e6ad6e122bc..5c0add2c7546 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
>   		if (!entity)
>   			continue;
>   
> +		if (sched->ops->can_run_job) {
> +			sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> +			if (!sched_job) {
> +				complete_all(&entity->entity_idle);
> +				continue;
> +			}
> +			if (!sched->ops->can_run_job(sched_job))
> +				continue;
> +		}
> +
>   		sched_job = drm_sched_entity_pop_job(entity);
>   
>   		if (!sched_job) {
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 9db9e5e504ee..bd89ea9507b9 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
>   	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
>   					 struct drm_sched_entity *s_entity);
>   
> +	/**
> +	 * @can_run_job: Called before job execution to check whether the
> +	 * hardware is free enough to run the job.  This can be used to
> +	 * implement more complex hardware resource policies than the
> +	 * hw_submission limit.
> +	 */
> +	bool (*can_run_job)(struct drm_sched_job *sched_job);
> +
>   	/**
>            * @run_job: Called to execute the job once all of the dependencies
>            * have been resolved.  This may be called multiple times, if
>
Asahi Lina March 8, 2023, 9:41 a.m. UTC | #7
On 08/03/2023 17.46, Christian König wrote:
> Am 07.03.23 um 15:25 schrieb Asahi Lina:
>> Some hardware may require more complex resource utilization accounting
>> than the simple job count supported by drm_sched internally. Add a
>> can_run_job callback to allow drivers to implement more logic before
>> deciding whether to run a GPU job.
> 
> Well complete NAK.
> 
> This is clearly going against the idea of having jobs only depend on 
> fences and nothing else which is mandatory for correct memory management.
> 
> If the hw is busy with something you need to return the fence for this 
> from the prepare_job callback so that the scheduler can be notified when 
> the hw is available again.

I think you misunderstood the intent here... This isn't about job
dependencies, it's about in-flight resource limits.

drm_sched already has a hw_submission_limit that specifies the number of
submissions that can be in flight, but that doesn't work for us because
each job from drm_sched's point of view consists of multiple commands
split among 3 firmware queues. The firmware can only support up to 128
work commands in flight per queue (barriers don't count), otherwise it
overflows a fixed-size buffer.

So we need more complex accounting of how many underlying commands are
in flight per queue to determine whether it is safe to run a new job,
and that is what this callback accomplishes. This has to happen even
when individual jobs have no buffer/resource dependencies between them
(which is what the fences would express).

You can see the driver implementation of that callback in
drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then
calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that
does the actual available slot count checks.

The can_run_job logic is written to mirror the hw_submission_limit logic
(just a bit later in the sched main loop since we need to actually pick
a job to do the check), and just like for that case, completion of any
job in the same scheduler will cause another run of the main loop and
another check (which is exactly what we want here).

This case (potentially scheduling more than the FW job limit) is rare
but handling it is necessary, since otherwise the entire job
completion/tracking logic gets screwed up on the firmware end and queues
end up stuck (I've managed to trigger this before).

~~ Lina
Christian König March 8, 2023, 10 a.m. UTC | #8
Am 08.03.23 um 10:41 schrieb Asahi Lina:
> On 08/03/2023 17.46, Christian König wrote:
>> Am 07.03.23 um 15:25 schrieb Asahi Lina:
>>> Some hardware may require more complex resource utilization accounting
>>> than the simple job count supported by drm_sched internally. Add a
>>> can_run_job callback to allow drivers to implement more logic before
>>> deciding whether to run a GPU job.
>> Well complete NAK.
>>
>> This is clearly going against the idea of having jobs only depend on
>> fences and nothing else which is mandatory for correct memory management.
>>
>> If the hw is busy with something you need to return the fence for this
>> from the prepare_job callback so that the scheduler can be notified when
>> the hw is available again.
> I think you misunderstood the intent here... This isn't about job
> dependencies, it's about in-flight resource limits.
>
> drm_sched already has a hw_submission_limit that specifies the number of
> submissions that can be in flight, but that doesn't work for us because
> each job from drm_sched's point of view consists of multiple commands
> split among 3 firmware queues. The firmware can only support up to 128
> work commands in flight per queue (barriers don't count), otherwise it
> overflows a fixed-size buffer.
>
> So we need more complex accounting of how many underlying commands are
> in flight per queue to determine whether it is safe to run a new job,
> and that is what this callback accomplishes. This has to happen even
> when individual jobs have no buffer/resource dependencies between them
> (which is what the fences would express).

Yeah, I already assumed that you have something like this.

And to make it clear this is unfortunately a complete NAK to this 
approach! You can't do this!

The background is that core memory management requires that signaling a 
fence only depends on signaling other fences and hardware progress and 
nothing else. Otherwise you immediately run into problems because of 
circle dependencies or what we call infinite fences.

Jason Ekstrand gave a create presentation on that problem a few years 
ago on LPC. I strongly suggest you google that one up.

> You can see the driver implementation of that callback in
> drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then
> calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that
> does the actual available slot count checks.
>
> The can_run_job logic is written to mirror the hw_submission_limit logic
> (just a bit later in the sched main loop since we need to actually pick
> a job to do the check), and just like for that case, completion of any
> job in the same scheduler will cause another run of the main loop and
> another check (which is exactly what we want here).

Yeah and that hw_submission_limit is based on a fence signaling again.

When you have some firmware limitation that a job needs resources which 
are currently in use by other submissions then those other submissions 
have fences as well and you can return those in the prepare_job callback.

If those other submissions don't have fences, then you have a major 
design problem inside your driver and we need to get back to square one 
and talk about that dependency handling.

> This case (potentially scheduling more than the FW job limit) is rare
> but handling it is necessary, since otherwise the entire job
> completion/tracking logic gets screwed up on the firmware end and queues
> end up stuck (I've managed to trigger this before).

Actually that's a pretty normal use case. I've have rejected similar 
requirements like this before as well.

For an example how this can work see amdgpu_job_prepare_job(): 
https://elixir.bootlin.com/linux/v6.3-rc1/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L251

The gang submit gives and example of a global fence lock and the VMIDs 
are an example of a global shared firmware resource.

Regards,
Christian.

>
> ~~ Lina
Karol Herbst March 8, 2023, 12:39 p.m. UTC | #9
On Wed, Mar 8, 2023 at 9:46 AM Christian König <christian.koenig@amd.com> wrote:
>
> Am 07.03.23 um 15:25 schrieb Asahi Lina:
> > Some hardware may require more complex resource utilization accounting
> > than the simple job count supported by drm_sched internally. Add a
> > can_run_job callback to allow drivers to implement more logic before
> > deciding whether to run a GPU job.
>
> Well complete NAK.
>

There hasn't even been any kind of discussion yet you already come
around with a "Well complete NAK"

First, this can be seen as rude behavior and me being part of the drm
community I don't want to have to see this kind of thing.

Obviously, any kind of strong "technical" review point is a nak until
people settle with an agreement on what to land, there is no point in
pointing out a "NAK", especially if that's the first thing you say. If
you want to express your strong disagreement with the proposed
solution, then state what your pain points are directly.

If there is a long discussion and a maintainer feels it's going
nowhere and no conclusion will be reached it might be this kind of
"speaking with authority" point has to be made. But not as the starter
into a discussion. This is unnecessarily hostile towards the
contributor. And I wished we wouldn't have to see this kind of
behavior here.

Yes, some kernel maintainers do this a lot, but kernel maintainers
also have this kind of reputation and people don't want to have to
deal with this nonsense and decide to not contribute at all. So please
just drop this attitude.

> This is clearly going against the idea of having jobs only depend on
> fences and nothing else which is mandatory for correct memory management.
>

I'm sure it's all documented and there is a design document on how
things have to look like you can point out? Might help to get a better
understanding on how things should be.

> If the hw is busy with something you need to return the fence for this
> from the prepare_job callback so that the scheduler can be notified when
> the hw is available again.
>
> Regards,
> Christian.
>
> >
> > Signed-off-by: Asahi Lina <lina@asahilina.net>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> >   include/drm/gpu_scheduler.h            |  8 ++++++++
> >   2 files changed, 18 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 4e6ad6e122bc..5c0add2c7546 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
> >               if (!entity)
> >                       continue;
> >
> > +             if (sched->ops->can_run_job) {
> > +                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> > +                     if (!sched_job) {
> > +                             complete_all(&entity->entity_idle);
> > +                             continue;
> > +                     }
> > +                     if (!sched->ops->can_run_job(sched_job))
> > +                             continue;
> > +             }
> > +
> >               sched_job = drm_sched_entity_pop_job(entity);
> >
> >               if (!sched_job) {
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 9db9e5e504ee..bd89ea9507b9 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
> >       struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
> >                                        struct drm_sched_entity *s_entity);
> >
> > +     /**
> > +      * @can_run_job: Called before job execution to check whether the
> > +      * hardware is free enough to run the job.  This can be used to
> > +      * implement more complex hardware resource policies than the
> > +      * hw_submission limit.
> > +      */
> > +     bool (*can_run_job)(struct drm_sched_job *sched_job);
> > +
> >       /**
> >            * @run_job: Called to execute the job once all of the dependencies
> >            * have been resolved.  This may be called multiple times, if
> >
>
Maíra Canal March 8, 2023, 1:38 p.m. UTC | #10
On 3/7/23 11:25, Asahi Lina wrote:
> The DRM shmem helper includes common code useful for drivers which
> allocate GEM objects as anonymous shmem. Add a Rust abstraction for
> this. Drivers can choose the raw GEM implementation or the shmem layer,
> depending on their needs.
> 
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>   drivers/gpu/drm/Kconfig         |   5 +
>   rust/bindings/bindings_helper.h |   2 +
>   rust/helpers.c                  |  67 +++++++
>   rust/kernel/drm/gem/mod.rs      |   3 +
>   rust/kernel/drm/gem/shmem.rs    | 381 ++++++++++++++++++++++++++++++++++++++++
>   5 files changed, 458 insertions(+)
> 

[...]

> +unsafe extern "C" fn gem_create_object<T: DriverObject>(
> +    raw_dev: *mut bindings::drm_device,
> +    size: usize,
> +) -> *mut bindings::drm_gem_object {
> +    // SAFETY: GEM ensures the device lives as long as its objects live,
> +    // so we can conjure up a reference from thin air and never drop it.
> +    let dev = ManuallyDrop::new(unsafe { device::Device::from_raw(raw_dev) });
> +
> +    let inner = match T::new(&*dev, size) {
> +        Ok(v) => v,
> +        Err(e) => return e.to_ptr(),
> +    };
> +
> +    let p = unsafe {
> +        bindings::krealloc(
> +            core::ptr::null(),
> +            Object::<T>::SIZE,
> +            bindings::GFP_KERNEL | bindings::__GFP_ZERO,
> +        ) as *mut Object<T>
> +    };
> +
> +    if p.is_null() {
> +        return ENOMEM.to_ptr();
> +    }
> +
> +    // SAFETY: p is valid as long as the alloc succeeded
> +    unsafe {
> +        addr_of_mut!((*p).dev).write(dev);
> +        addr_of_mut!((*p).inner).write(inner);
> +    }
> +
> +    // SAFETY: drm_gem_shmem_object is safe to zero-init, and
> +    // the rest of Object has been initialized
> +    let new: &mut Object<T> = unsafe { &mut *(p as *mut _) };
> +
> +    new.obj.base.funcs = &Object::<T>::VTABLE;
> +    &mut new.obj.base
> +}

It would be nice to allow to set wc inside the gem_create_object callback,
as some drivers do it so, like v3d, vc4, panfrost, lima...

Best Regards,
- Maíra Canal

> +
> +unsafe extern "C" fn free_callback<T: DriverObject>(obj: *mut bindings::drm_gem_object) {
> +    // SAFETY: All of our objects are Object<T>.
> +    let p = crate::container_of!(obj, Object<T>, obj) as *mut Object<T>;
> +
> +    // SAFETY: p is never used after this
> +    unsafe {
> +        core::ptr::drop_in_place(&mut (*p).inner);
> +    }
> +
> +    // SAFETY: This pointer has to be valid, since p is valid
> +    unsafe {
> +        bindings::drm_gem_shmem_free(&mut (*p).obj);
> +    }
> +}
> +
> +impl<T: DriverObject> Object<T> {
> +    /// The size of this object's structure.
> +    const SIZE: usize = mem::size_of::<Self>();
> +
> +    /// `drm_gem_object_funcs` vtable suitable for GEM shmem objects.
> +    const VTABLE: bindings::drm_gem_object_funcs = bindings::drm_gem_object_funcs {
> +        free: Some(free_callback::<T>),
> +        open: Some(super::open_callback::<T, Object<T>>),
> +        close: Some(super::close_callback::<T, Object<T>>),
> +        print_info: Some(bindings::drm_gem_shmem_object_print_info),
> +        export: None,
> +        pin: Some(bindings::drm_gem_shmem_object_pin),
> +        unpin: Some(bindings::drm_gem_shmem_object_unpin),
> +        get_sg_table: Some(bindings::drm_gem_shmem_object_get_sg_table),
> +        vmap: Some(bindings::drm_gem_shmem_object_vmap),
> +        vunmap: Some(bindings::drm_gem_shmem_object_vunmap),
> +        mmap: Some(bindings::drm_gem_shmem_object_mmap),
> +        vm_ops: &SHMEM_VM_OPS,
> +    };
> +
> +    // SAFETY: Must only be used with DRM functions that are thread-safe
> +    unsafe fn mut_shmem(&self) -> *mut bindings::drm_gem_shmem_object {
> +        &self.obj as *const _ as *mut _
> +    }
> +
> +    /// Create a new shmem-backed DRM object of the given size.
> +    pub fn new(dev: &device::Device<T::Driver>, size: usize) -> Result<gem::UniqueObjectRef<Self>> {
> +        // SAFETY: This function can be called as long as the ALLOC_OPS are set properly
> +        // for this driver, and the gem_create_object is called.
> +        let p = unsafe { bindings::drm_gem_shmem_create(dev.raw() as *mut _, size) };
> +        let p = crate::container_of!(p, Object<T>, obj) as *mut _;
> +
> +        // SAFETY: The gem_create_object callback ensures this is a valid Object<T>,
> +        // so we can take a unique reference to it.
> +        let obj_ref = gem::UniqueObjectRef { ptr: p };
> +
> +        Ok(obj_ref)
> +    }
> +
> +    /// Returns the `Device` that owns this GEM object.
> +    pub fn dev(&self) -> &device::Device<T::Driver> {
> +        &self.dev
> +    }
> +
> +    /// Creates (if necessary) and returns a scatter-gather table of DMA pages for this object.
> +    ///
> +    /// This will pin the object in memory.
> +    pub fn sg_table(&self) -> Result<SGTable<T>> {
> +        // SAFETY: drm_gem_shmem_get_pages_sgt is thread-safe.
> +        let sgt = from_kernel_err_ptr(unsafe {
> +            bindings::drm_gem_shmem_get_pages_sgt(self.mut_shmem())
> +        })?;
> +
> +        Ok(SGTable {
> +            sgt,
> +            _owner: self.reference(),
> +        })
> +    }
> +
> +    /// Creates and returns a virtual kernel memory mapping for this object.
> +    pub fn vmap(&self) -> Result<VMap<T>> {
> +        let mut map: MaybeUninit<bindings::iosys_map> = MaybeUninit::uninit();
> +
> +        // SAFETY: drm_gem_shmem_vmap is thread-safe
> +        to_result(unsafe { bindings::drm_gem_shmem_vmap(self.mut_shmem(), map.as_mut_ptr()) })?;
> +
> +        // SAFETY: if drm_gem_shmem_vmap did not fail, map is initialized now
> +        let map = unsafe { map.assume_init() };
> +
> +        Ok(VMap {
> +            map,
> +            owner: self.reference(),
> +        })
> +    }
> +
> +    /// Set the write-combine flag for this object.
> +    ///
> +    /// Should be called before any mappings are made.
> +    pub fn set_wc(&mut self, map_wc: bool) {
> +        unsafe { (*self.mut_shmem()).map_wc = map_wc };
> +    }
> +}
> +
> +impl<T: DriverObject> Deref for Object<T> {
> +    type Target = T;
> +
> +    fn deref(&self) -> &Self::Target {
> +        &self.inner
> +    }
> +}
> +
> +impl<T: DriverObject> DerefMut for Object<T> {
> +    fn deref_mut(&mut self) -> &mut Self::Target {
> +        &mut self.inner
> +    }
> +}
> +
> +impl<T: DriverObject> crate::private::Sealed for Object<T> {}
> +
> +impl<T: DriverObject> gem::IntoGEMObject for Object<T> {
> +    type Driver = T::Driver;
> +
> +    fn gem_obj(&self) -> *mut bindings::drm_gem_object {
> +        &self.obj.base as *const _ as *mut _
> +    }
> +
> +    fn from_gem_obj(obj: *mut bindings::drm_gem_object) -> *mut Object<T> {
> +        crate::container_of!(obj, Object<T>, obj) as *mut Object<T>
> +    }
> +}
> +
> +impl<T: DriverObject> drv::AllocImpl for Object<T> {
> +    const ALLOC_OPS: drv::AllocOps = drv::AllocOps {
> +        gem_create_object: Some(gem_create_object::<T>),
> +        prime_handle_to_fd: Some(bindings::drm_gem_prime_handle_to_fd),
> +        prime_fd_to_handle: Some(bindings::drm_gem_prime_fd_to_handle),
> +        gem_prime_import: None,
> +        gem_prime_import_sg_table: Some(bindings::drm_gem_shmem_prime_import_sg_table),
> +        gem_prime_mmap: Some(bindings::drm_gem_prime_mmap),
> +        dumb_create: Some(bindings::drm_gem_shmem_dumb_create),
> +        dumb_map_offset: None,
> +        dumb_destroy: None,
> +    };
> +}
> +
> +/// A virtual mapping for a shmem-backed GEM object in kernel address space.
> +pub struct VMap<T: DriverObject> {
> +    map: bindings::iosys_map,
> +    owner: gem::ObjectRef<Object<T>>,
> +}
> +
> +impl<T: DriverObject> VMap<T> {
> +    /// Returns a const raw pointer to the start of the mapping.
> +    pub fn as_ptr(&self) -> *const core::ffi::c_void {
> +        // SAFETY: The shmem helpers always return non-iomem maps
> +        unsafe { self.map.__bindgen_anon_1.vaddr }
> +    }
> +
> +    /// Returns a mutable raw pointer to the start of the mapping.
> +    pub fn as_mut_ptr(&mut self) -> *mut core::ffi::c_void {
> +        // SAFETY: The shmem helpers always return non-iomem maps
> +        unsafe { self.map.__bindgen_anon_1.vaddr }
> +    }
> +
> +    /// Returns a byte slice view of the mapping.
> +    pub fn as_slice(&self) -> &[u8] {
> +        // SAFETY: The vmap maps valid memory up to the owner size
> +        unsafe { slice::from_raw_parts(self.as_ptr() as *const u8, self.owner.size()) }
> +    }
> +
> +    /// Returns mutable a byte slice view of the mapping.
> +    pub fn as_mut_slice(&mut self) -> &mut [u8] {
> +        // SAFETY: The vmap maps valid memory up to the owner size
> +        unsafe { slice::from_raw_parts_mut(self.as_mut_ptr() as *mut u8, self.owner.size()) }
> +    }
> +
> +    /// Borrows a reference to the object that owns this virtual mapping.
> +    pub fn owner(&self) -> &gem::ObjectRef<Object<T>> {
> +        &self.owner
> +    }
> +}
> +
> +impl<T: DriverObject> Drop for VMap<T> {
> +    fn drop(&mut self) {
> +        // SAFETY: This function is thread-safe
> +        unsafe {
> +            bindings::drm_gem_shmem_vunmap(self.owner.mut_shmem(), &mut self.map);
> +        }
> +    }
> +}
> +
> +/// SAFETY: `iosys_map` objects are safe to send across threads.
> +unsafe impl<T: DriverObject> Send for VMap<T> {}
> +unsafe impl<T: DriverObject> Sync for VMap<T> {}
> +
> +/// A single scatter-gather entry, representing a span of pages in the device's DMA address space.
> +///
> +/// For devices not behind a standalone IOMMU, this corresponds to physical addresses.
> +#[repr(transparent)]
> +pub struct SGEntry(bindings::scatterlist);
> +
> +impl SGEntry {
> +    /// Returns the starting DMA address of this span
> +    pub fn dma_address(&self) -> usize {
> +        (unsafe { bindings::sg_dma_address(&self.0) }) as usize
> +    }
> +
> +    /// Returns the length of this span in bytes
> +    pub fn dma_len(&self) -> usize {
> +        (unsafe { bindings::sg_dma_len(&self.0) }) as usize
> +    }
> +}
> +
> +/// A scatter-gather table of DMA address spans for a GEM shmem object.
> +///
> +/// # Invariants
> +/// `sgt` must be a valid pointer to the `sg_table`, which must correspond to the owned
> +/// object in `_owner` (which ensures it remains valid).
> +pub struct SGTable<T: DriverObject> {
> +    sgt: *const bindings::sg_table,
> +    _owner: gem::ObjectRef<Object<T>>,
> +}
> +
> +impl<T: DriverObject> SGTable<T> {
> +    /// Returns an iterator through the SGTable's entries
> +    pub fn iter(&'_ self) -> SGTableIter<'_> {
> +        SGTableIter {
> +            left: unsafe { (*self.sgt).nents } as usize,
> +            sg: unsafe { (*self.sgt).sgl },
> +            _p: PhantomData,
> +        }
> +    }
> +}
> +
> +impl<'a, T: DriverObject> IntoIterator for &'a SGTable<T> {
> +    type Item = &'a SGEntry;
> +    type IntoIter = SGTableIter<'a>;
> +
> +    fn into_iter(self) -> Self::IntoIter {
> +        self.iter()
> +    }
> +}
> +
> +/// SAFETY: `sg_table` objects are safe to send across threads.
> +unsafe impl<T: DriverObject> Send for SGTable<T> {}
> +unsafe impl<T: DriverObject> Sync for SGTable<T> {}
> +
> +/// An iterator through `SGTable` entries.
> +///
> +/// # Invariants
> +/// `sg` must be a valid pointer to the scatterlist, which must outlive our lifetime.
> +pub struct SGTableIter<'a> {
> +    sg: *mut bindings::scatterlist,
> +    left: usize,
> +    _p: PhantomData<&'a ()>,
> +}
> +
> +impl<'a> Iterator for SGTableIter<'a> {
> +    type Item = &'a SGEntry;
> +
> +    fn next(&mut self) -> Option<Self::Item> {
> +        if self.left == 0 {
> +            None
> +        } else {
> +            let sg = self.sg;
> +            self.sg = unsafe { bindings::sg_next(self.sg) };
> +            self.left -= 1;
> +            Some(unsafe { &(*(sg as *const SGEntry)) })
> +        }
> +    }
> +}
>
Christian König March 8, 2023, 1:47 p.m. UTC | #11
Am 08.03.23 um 13:39 schrieb Karol Herbst:
> On Wed, Mar 8, 2023 at 9:46 AM Christian König <christian.koenig@amd.com> wrote:
>> Am 07.03.23 um 15:25 schrieb Asahi Lina:
>>> Some hardware may require more complex resource utilization accounting
>>> than the simple job count supported by drm_sched internally. Add a
>>> can_run_job callback to allow drivers to implement more logic before
>>> deciding whether to run a GPU job.
>> Well complete NAK.
>>
> There hasn't even been any kind of discussion yet you already come
> around with a "Well complete NAK"
>
> First, this can be seen as rude behavior and me being part of the drm
> community I don't want to have to see this kind of thing.
>
> Obviously, any kind of strong "technical" review point is a nak until
> people settle with an agreement on what to land, there is no point in
> pointing out a "NAK", especially if that's the first thing you say. If
> you want to express your strong disagreement with the proposed
> solution, then state what your pain points are directly.
>
> If there is a long discussion and a maintainer feels it's going
> nowhere and no conclusion will be reached it might be this kind of
> "speaking with authority" point has to be made. But not as the starter
> into a discussion. This is unnecessarily hostile towards the
> contributor. And I wished we wouldn't have to see this kind of
> behavior here.
>
> Yes, some kernel maintainers do this a lot, but kernel maintainers
> also have this kind of reputation and people don't want to have to
> deal with this nonsense and decide to not contribute at all. So please
> just drop this attitude.

Yes, you are completely right with that, but getting this message to the 
recipient is intentional on my side.

I give completely NAKs when the author of a patch has missed such a 
fundamental technical connection that further discussion doesn't make sense.

It's not meant to be in any way rude or offending. I can put a smiley 
behind it if it somehow helps, but we still need a way to raise this big 
red stop sign.

>> This is clearly going against the idea of having jobs only depend on
>> fences and nothing else which is mandatory for correct memory management.
>>
> I'm sure it's all documented and there is a design document on how
> things have to look like you can point out? Might help to get a better
> understanding on how things should be.

Yeah, that's the problematic part. We have documented this very 
extensively: 
https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences

And both Jason and Daniel gave talks about the underlying problem and 
try to come up with patches to raise warnings when that happens, but 
people still keep coming up with the same idea over and over again.

It's just that the technical relationship between preventing jobs from 
running and with that preventing dma_fences from signaling and the core 
memory management with page faults and shrinkers waiting for those 
fences is absolutely not obvious.

We had at least 10 different teams from different companies falling into 
the same trap already and either the patches were rejected of hand or 
had to painfully reverted or mitigated later on.

Regards,
Christian.

>
>> If the hw is busy with something you need to return the fence for this
>> from the prepare_job callback so that the scheduler can be notified when
>> the hw is available again.
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
>>>    include/drm/gpu_scheduler.h            |  8 ++++++++
>>>    2 files changed, 18 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 4e6ad6e122bc..5c0add2c7546 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
>>>                if (!entity)
>>>                        continue;
>>>
>>> +             if (sched->ops->can_run_job) {
>>> +                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>>> +                     if (!sched_job) {
>>> +                             complete_all(&entity->entity_idle);
>>> +                             continue;
>>> +                     }
>>> +                     if (!sched->ops->can_run_job(sched_job))
>>> +                             continue;
>>> +             }
>>> +
>>>                sched_job = drm_sched_entity_pop_job(entity);
>>>
>>>                if (!sched_job) {
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 9db9e5e504ee..bd89ea9507b9 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
>>>        struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
>>>                                         struct drm_sched_entity *s_entity);
>>>
>>> +     /**
>>> +      * @can_run_job: Called before job execution to check whether the
>>> +      * hardware is free enough to run the job.  This can be used to
>>> +      * implement more complex hardware resource policies than the
>>> +      * hw_submission limit.
>>> +      */
>>> +     bool (*can_run_job)(struct drm_sched_job *sched_job);
>>> +
>>>        /**
>>>             * @run_job: Called to execute the job once all of the dependencies
>>>             * have been resolved.  This may be called multiple times, if
>>>
Karol Herbst March 8, 2023, 2:43 p.m. UTC | #12
On Wed, Mar 8, 2023 at 2:47 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 08.03.23 um 13:39 schrieb Karol Herbst:
> > On Wed, Mar 8, 2023 at 9:46 AM Christian König <christian.koenig@amd.com> wrote:
> >> Am 07.03.23 um 15:25 schrieb Asahi Lina:
> >>> Some hardware may require more complex resource utilization accounting
> >>> than the simple job count supported by drm_sched internally. Add a
> >>> can_run_job callback to allow drivers to implement more logic before
> >>> deciding whether to run a GPU job.
> >> Well complete NAK.
> >>
> > There hasn't even been any kind of discussion yet you already come
> > around with a "Well complete NAK"
> >
> > First, this can be seen as rude behavior and me being part of the drm
> > community I don't want to have to see this kind of thing.
> >
> > Obviously, any kind of strong "technical" review point is a nak until
> > people settle with an agreement on what to land, there is no point in
> > pointing out a "NAK", especially if that's the first thing you say. If
> > you want to express your strong disagreement with the proposed
> > solution, then state what your pain points are directly.
> >
> > If there is a long discussion and a maintainer feels it's going
> > nowhere and no conclusion will be reached it might be this kind of
> > "speaking with authority" point has to be made. But not as the starter
> > into a discussion. This is unnecessarily hostile towards the
> > contributor. And I wished we wouldn't have to see this kind of
> > behavior here.
> >
> > Yes, some kernel maintainers do this a lot, but kernel maintainers
> > also have this kind of reputation and people don't want to have to
> > deal with this nonsense and decide to not contribute at all. So please
> > just drop this attitude.
>
> Yes, you are completely right with that, but getting this message to the
> recipient is intentional on my side.
>
> I give completely NAKs when the author of a patch has missed such a
> fundamental technical connection that further discussion doesn't make sense.
>
> It's not meant to be in any way rude or offending. I can put a smiley
> behind it if it somehow helps, but we still need a way to raise this big
> red stop sign.
>

"further"? There was no discussion at all, you just started off like
that. If you think somebody misses that connection, you can point out
to documentation/videos whatever so the contributor can understand
what's wrong with an approach. You did that, so that's fine. It's just
starting off _any_ discussion with a "Well complete NAK" is terrible
style. I'd feel uncomfortable if that happened to me and I'm sure
there are enough people like that that we should be more reasonable
with our replies. Just.. don't.

We are all humans here and people react negatively to such things. And
if people do it on purpose it just makes it worse.

> >> This is clearly going against the idea of having jobs only depend on
> >> fences and nothing else which is mandatory for correct memory management.
> >>
> > I'm sure it's all documented and there is a design document on how
> > things have to look like you can point out? Might help to get a better
> > understanding on how things should be.
>
> Yeah, that's the problematic part. We have documented this very
> extensively:
> https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences
>
> And both Jason and Daniel gave talks about the underlying problem and

fyi:
s/Jason/Faith/g

> try to come up with patches to raise warnings when that happens, but
> people still keep coming up with the same idea over and over again.
>

Yes, and we'll have to tell them over and over again. Nothing wrong
with that. That's just part of maintaining such a big subsystem. And
that's definitely not a valid reason to phrase things like above.

> It's just that the technical relationship between preventing jobs from
> running and with that preventing dma_fences from signaling and the core
> memory management with page faults and shrinkers waiting for those
> fences is absolutely not obvious.
>
> We had at least 10 different teams from different companies falling into
> the same trap already and either the patches were rejected of hand or
> had to painfully reverted or mitigated later on.
>

Sure, but that's just part of the job. And pointing out fundamental
mistakes early on is important, but the situation won't get any better
by being like that. Yes, we'll have to repeat the same words over and
over again, and yes that might be annoying, but that's just how it is.

> Regards,
> Christian.
>
> >
> >> If the hw is busy with something you need to return the fence for this
> >> from the prepare_job callback so that the scheduler can be notified when
> >> the hw is available again.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> Signed-off-by: Asahi Lina <lina@asahilina.net>
> >>> ---
> >>>    drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> >>>    include/drm/gpu_scheduler.h            |  8 ++++++++
> >>>    2 files changed, 18 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>> index 4e6ad6e122bc..5c0add2c7546 100644
> >>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
> >>>                if (!entity)
> >>>                        continue;
> >>>
> >>> +             if (sched->ops->can_run_job) {
> >>> +                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> >>> +                     if (!sched_job) {
> >>> +                             complete_all(&entity->entity_idle);
> >>> +                             continue;
> >>> +                     }
> >>> +                     if (!sched->ops->can_run_job(sched_job))
> >>> +                             continue;
> >>> +             }
> >>> +
> >>>                sched_job = drm_sched_entity_pop_job(entity);
> >>>
> >>>                if (!sched_job) {
> >>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> >>> index 9db9e5e504ee..bd89ea9507b9 100644
> >>> --- a/include/drm/gpu_scheduler.h
> >>> +++ b/include/drm/gpu_scheduler.h
> >>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
> >>>        struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
> >>>                                         struct drm_sched_entity *s_entity);
> >>>
> >>> +     /**
> >>> +      * @can_run_job: Called before job execution to check whether the
> >>> +      * hardware is free enough to run the job.  This can be used to
> >>> +      * implement more complex hardware resource policies than the
> >>> +      * hw_submission limit.
> >>> +      */
> >>> +     bool (*can_run_job)(struct drm_sched_job *sched_job);
> >>> +
> >>>        /**
> >>>             * @run_job: Called to execute the job once all of the dependencies
> >>>             * have been resolved.  This may be called multiple times, if
> >>>
>
Asahi Lina March 8, 2023, 2:53 p.m. UTC | #13
On 08/03/2023 19.00, Christian König wrote:
> Am 08.03.23 um 10:41 schrieb Asahi Lina:
>> On 08/03/2023 17.46, Christian König wrote:
>>> Am 07.03.23 um 15:25 schrieb Asahi Lina:
>>>> Some hardware may require more complex resource utilization accounting
>>>> than the simple job count supported by drm_sched internally. Add a
>>>> can_run_job callback to allow drivers to implement more logic before
>>>> deciding whether to run a GPU job.
>>> Well complete NAK.
>>>
>>> This is clearly going against the idea of having jobs only depend on
>>> fences and nothing else which is mandatory for correct memory management.
>>>
>>> If the hw is busy with something you need to return the fence for this
>>> from the prepare_job callback so that the scheduler can be notified when
>>> the hw is available again.
>> I think you misunderstood the intent here... This isn't about job
>> dependencies, it's about in-flight resource limits.
>>
>> drm_sched already has a hw_submission_limit that specifies the number of
>> submissions that can be in flight, but that doesn't work for us because
>> each job from drm_sched's point of view consists of multiple commands
>> split among 3 firmware queues. The firmware can only support up to 128
>> work commands in flight per queue (barriers don't count), otherwise it
>> overflows a fixed-size buffer.
>>
>> So we need more complex accounting of how many underlying commands are
>> in flight per queue to determine whether it is safe to run a new job,
>> and that is what this callback accomplishes. This has to happen even
>> when individual jobs have no buffer/resource dependencies between them
>> (which is what the fences would express).
> 
> Yeah, I already assumed that you have something like this.
> 
> And to make it clear this is unfortunately a complete NAK to this 
> approach! You can't do this!

I think you still have some significant misconception about how this
driver works and uses drm_sched... I would appreciate it if you listen
and try to understand the design before giving hard NAKs... (this isn't
a Radeon)

> The background is that core memory management requires that signaling a 
> fence only depends on signaling other fences and hardware progress and 
> nothing else. Otherwise you immediately run into problems because of 
> circle dependencies or what we call infinite fences.

And hardware progress is exactly the only dependency here...

> Jason Ekstrand gave a create presentation on that problem a few years 
> ago on LPC. I strongly suggest you google that one up.

Faith Ekstrand (it looks like you mistyped that name...) is the person
who proposed that I should use drm_sched in this way (see below), we've
had a few private meetings about this design ^^

>> You can see the driver implementation of that callback in
>> drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then
>> calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that
>> does the actual available slot count checks.
>>
>> The can_run_job logic is written to mirror the hw_submission_limit logic
>> (just a bit later in the sched main loop since we need to actually pick
>> a job to do the check), and just like for that case, completion of any
>> job in the same scheduler will cause another run of the main loop and
>> another check (which is exactly what we want here).
> 
> Yeah and that hw_submission_limit is based on a fence signaling again.

I don't think so...? It's just an atomic that gets checked in
drm_sched_ready(). There are no extra fences involved (other than the
job completion fences that trigger another scheduler run). The idea is
that when the hardware queue makes forward progress you check against
the limit again and submit more jobs as needed. I'm doing the same exact
thing, I'm just using more complex logic for the notion of in-flight
queue limits!

> When you have some firmware limitation that a job needs resources which 
> are currently in use by other submissions then those other submissions 
> have fences as well and you can return those in the prepare_job callback.
> 
> If those other submissions don't have fences, then you have a major 
> design problem inside your driver and we need to get back to square one 
> and talk about that dependency handling.

I think we have a disconnect in our views of what is going on here...

This hardware has firmware-side scheduling with an arbitrary (as far as
I know) number of queues. There is one scheduler instance and one entity
per userspace queue (not global!). These queues process jobs in some
logical sequence, though at the firmware level they get split into up to
three queues each (and there is some parallelism allowed). The
limitation here is in the number of in-flight jobs per firmware queue,
not global.

There is no way for things to deadlock. If jobs have been submitted to
the firmware queue, that means their dependencies were signaled already.
Jobs have intra-job dependencies via driver barriers (which drm_sched
knows nothing about), but the submission code in the driver guarantees
that they are deadlock-free since you can only barrier on past commands,
which by definition submit first.

If a firmware queue is full, drm_sched blocks. Since it is full, that
means it will run those commands (since they have no outside
dependencies and they are already queued and ready to run by the
firmware), eventually space will be freed, and each time a job completes
drm_sched will do the can_run_job check again and decide whether to run
a new job.

Since the firmware queues contain commands which only have past-facing
barriers on other already submitted commands, by definition they will
become empty at some point as long as the firmware is making forward
progress. And therefore, by definition, can_run_job will eventually
return true at some point after a job completion fence is signaled (the
one for the last job submitted prior). There is a check in the driver to
ensure that we do not allow submissions which, by themselves, would
exceed the queued command limit (we actually just limit to 64 commands
overall right now, which is conservative but seems reasonable given the
128-per-firmware-queue limit).

I get the feeling that you are conflating pending jobs with submitted
jobs. This isn't about how many jobs you can have pending in drm_sched
before running them or anything like that. Of course, at that point,
arbitrary dependencies come into play and you can end up with deadlocks
on dependency fences. But that's not the case here. What can_run_job is
waiting on is guaranteed to make forward progress.

>> This case (potentially scheduling more than the FW job limit) is rare
>> but handling it is necessary, since otherwise the entire job
>> completion/tracking logic gets screwed up on the firmware end and queues
>> end up stuck (I've managed to trigger this before).
> 
> Actually that's a pretty normal use case. I've have rejected similar 
> requirements like this before as well.
> 
> For an example how this can work see amdgpu_job_prepare_job(): 
> https://elixir.bootlin.com/linux/v6.3-rc1/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L251
> 
> The gang submit gives and example of a global fence lock and the VMIDs 
> are an example of a global shared firmware resource.

But the resource can_run_job is checking on isn't globally shared! It's
specific to this scheduler instance, just like hw_submission_limit is,
so as long as the firmware behind the scheduler is making forward
progress, the resource will be guaranteed to be freed until another job
can run.

I actually know I have a different theoretical deadlock issue along
these lines in the driver because right now we grab actually global
resources (including a VMID) before job submission to drm_sched. This is
a known issue, and to fix it without reducing performance I need to
introduce some kind of "patching/fixup" system for firmware commands
(because we need to inject those identifiers in dozens of places, but we
don't want to construct those commands from scratch at job run time
because that introduces latency at the wrong time and makes error
handling/validation more complicated and error-prone), and that is
exactly what should happen in prepare_job, as you say. And yes, at that
point that should use fences to block when those resources are
exhausted. But that's a different discussion we should have when
reviewing the driver, it has nothing to do with the DRM abstractions nor
the can_run_job callback I'm adding here nor the firmware queue length
limit issue! (And also the global hardware devices are plentiful enough
that I would be very surprised if anyone ever deadlocks it in practice
even with the current code, so I honestly don't think that should be a
blocker for driver submission either, I can and will fix it later...)

~~ Lina
Christian König March 8, 2023, 3:02 p.m. UTC | #14
Am 08.03.23 um 15:43 schrieb Karol Herbst:
> [SNIP]
> "further"? There was no discussion at all,

Yeah, well that is exactly what I wanted to archive.

>   you just started off like
> that. If you think somebody misses that connection, you can point out
> to documentation/videos whatever so the contributor can understand
> what's wrong with an approach. You did that, so that's fine. It's just
> starting off _any_ discussion with a "Well complete NAK" is terrible
> style. I'd feel uncomfortable if that happened to me and I'm sure
> there are enough people like that that we should be more reasonable
> with our replies. Just.. don't.
>
> We are all humans here and people react negatively to such things. And
> if people do it on purpose it just makes it worse.

I completely see your point, I just don't know how to improve it.

I don't stop people like this because I want to make them uncomfortable 
but because I want to prevent further discussions on that topic.

In other words how can I make people notice that this is something 
fundamental while still being polite?

>>>> This is clearly going against the idea of having jobs only depend on
>>>> fences and nothing else which is mandatory for correct memory management.
>>>>
>>> I'm sure it's all documented and there is a design document on how
>>> things have to look like you can point out? Might help to get a better
>>> understanding on how things should be.
>> Yeah, that's the problematic part. We have documented this very
>> extensively:
>> https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences
>>
>> And both Jason and Daniel gave talks about the underlying problem and
> fyi:
> s/Jason/Faith/g

+1. I wasn't aware of that.

>> try to come up with patches to raise warnings when that happens, but
>> people still keep coming up with the same idea over and over again.
>>
> Yes, and we'll have to tell them over and over again. Nothing wrong
> with that. That's just part of maintaining such a big subsystem. And
> that's definitely not a valid reason to phrase things like above.
>
>> It's just that the technical relationship between preventing jobs from
>> running and with that preventing dma_fences from signaling and the core
>> memory management with page faults and shrinkers waiting for those
>> fences is absolutely not obvious.
>>
>> We had at least 10 different teams from different companies falling into
>> the same trap already and either the patches were rejected of hand or
>> had to painfully reverted or mitigated later on.
>>
> Sure, but that's just part of the job. And pointing out fundamental
> mistakes early on is important, but the situation won't get any better
> by being like that. Yes, we'll have to repeat the same words over and
> over again, and yes that might be annoying, but that's just how it is.

Well I have no problem explaining people why a solution doesn't work.

But what usually happens is that people don't realize that they need to 
back of from a design and completely start over.

Regards,
Christian.

>
>> Regards,
>> Christian.
>>
>>>> If the hw is busy with something you need to return the fence for this
>>>> from the prepare_job callback so that the scheduler can be notified when
>>>> the hw is available again.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>> ---
>>>>>     drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
>>>>>     include/drm/gpu_scheduler.h            |  8 ++++++++
>>>>>     2 files changed, 18 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 4e6ad6e122bc..5c0add2c7546 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
>>>>>                 if (!entity)
>>>>>                         continue;
>>>>>
>>>>> +             if (sched->ops->can_run_job) {
>>>>> +                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>>>>> +                     if (!sched_job) {
>>>>> +                             complete_all(&entity->entity_idle);
>>>>> +                             continue;
>>>>> +                     }
>>>>> +                     if (!sched->ops->can_run_job(sched_job))
>>>>> +                             continue;
>>>>> +             }
>>>>> +
>>>>>                 sched_job = drm_sched_entity_pop_job(entity);
>>>>>
>>>>>                 if (!sched_job) {
>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>> index 9db9e5e504ee..bd89ea9507b9 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
>>>>>         struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
>>>>>                                          struct drm_sched_entity *s_entity);
>>>>>
>>>>> +     /**
>>>>> +      * @can_run_job: Called before job execution to check whether the
>>>>> +      * hardware is free enough to run the job.  This can be used to
>>>>> +      * implement more complex hardware resource policies than the
>>>>> +      * hw_submission limit.
>>>>> +      */
>>>>> +     bool (*can_run_job)(struct drm_sched_job *sched_job);
>>>>> +
>>>>>         /**
>>>>>              * @run_job: Called to execute the job once all of the dependencies
>>>>>              * have been resolved.  This may be called multiple times, if
>>>>>
Karol Herbst March 8, 2023, 3:19 p.m. UTC | #15
On Wed, Mar 8, 2023 at 4:09 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 08.03.23 um 15:43 schrieb Karol Herbst:
> > [SNIP]
> > "further"? There was no discussion at all,
>
> Yeah, well that is exactly what I wanted to archive.
>
> >   you just started off like
> > that. If you think somebody misses that connection, you can point out
> > to documentation/videos whatever so the contributor can understand
> > what's wrong with an approach. You did that, so that's fine. It's just
> > starting off _any_ discussion with a "Well complete NAK" is terrible
> > style. I'd feel uncomfortable if that happened to me and I'm sure
> > there are enough people like that that we should be more reasonable
> > with our replies. Just.. don't.
> >
> > We are all humans here and people react negatively to such things. And
> > if people do it on purpose it just makes it worse.
>
> I completely see your point, I just don't know how to improve it.
>
> I don't stop people like this because I want to make them uncomfortable
> but because I want to prevent further discussions on that topic.
>
> In other words how can I make people notice that this is something
> fundamental while still being polite?
>

I think a little improvement over this would be to at least wait a few
replies before resorting to those strong statements. Just before it
becomes a risk in just wasting time.

> >>>> This is clearly going against the idea of having jobs only depend on
> >>>> fences and nothing else which is mandatory for correct memory management.
> >>>>
> >>> I'm sure it's all documented and there is a design document on how
> >>> things have to look like you can point out? Might help to get a better
> >>> understanding on how things should be.
> >> Yeah, that's the problematic part. We have documented this very
> >> extensively:
> >> https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences
> >>
> >> And both Jason and Daniel gave talks about the underlying problem and
> > fyi:
> > s/Jason/Faith/g
>
> +1. I wasn't aware of that.
>
> >> try to come up with patches to raise warnings when that happens, but
> >> people still keep coming up with the same idea over and over again.
> >>
> > Yes, and we'll have to tell them over and over again. Nothing wrong
> > with that. That's just part of maintaining such a big subsystem. And
> > that's definitely not a valid reason to phrase things like above.
> >
> >> It's just that the technical relationship between preventing jobs from
> >> running and with that preventing dma_fences from signaling and the core
> >> memory management with page faults and shrinkers waiting for those
> >> fences is absolutely not obvious.
> >>
> >> We had at least 10 different teams from different companies falling into
> >> the same trap already and either the patches were rejected of hand or
> >> had to painfully reverted or mitigated later on.
> >>
> > Sure, but that's just part of the job. And pointing out fundamental
> > mistakes early on is important, but the situation won't get any better
> > by being like that. Yes, we'll have to repeat the same words over and
> > over again, and yes that might be annoying, but that's just how it is.
>
> Well I have no problem explaining people why a solution doesn't work.
>
> But what usually happens is that people don't realize that they need to
> back of from a design and completely start over.
>
> Regards,
> Christian.
>
> >
> >> Regards,
> >> Christian.
> >>
> >>>> If the hw is busy with something you need to return the fence for this
> >>>> from the prepare_job callback so that the scheduler can be notified when
> >>>> the hw is available again.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
> >>>>> ---
> >>>>>     drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> >>>>>     include/drm/gpu_scheduler.h            |  8 ++++++++
> >>>>>     2 files changed, 18 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> index 4e6ad6e122bc..5c0add2c7546 100644
> >>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
> >>>>>                 if (!entity)
> >>>>>                         continue;
> >>>>>
> >>>>> +             if (sched->ops->can_run_job) {
> >>>>> +                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> >>>>> +                     if (!sched_job) {
> >>>>> +                             complete_all(&entity->entity_idle);
> >>>>> +                             continue;
> >>>>> +                     }
> >>>>> +                     if (!sched->ops->can_run_job(sched_job))
> >>>>> +                             continue;
> >>>>> +             }
> >>>>> +
> >>>>>                 sched_job = drm_sched_entity_pop_job(entity);
> >>>>>
> >>>>>                 if (!sched_job) {
> >>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> >>>>> index 9db9e5e504ee..bd89ea9507b9 100644
> >>>>> --- a/include/drm/gpu_scheduler.h
> >>>>> +++ b/include/drm/gpu_scheduler.h
> >>>>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
> >>>>>         struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
> >>>>>                                          struct drm_sched_entity *s_entity);
> >>>>>
> >>>>> +     /**
> >>>>> +      * @can_run_job: Called before job execution to check whether the
> >>>>> +      * hardware is free enough to run the job.  This can be used to
> >>>>> +      * implement more complex hardware resource policies than the
> >>>>> +      * hw_submission limit.
> >>>>> +      */
> >>>>> +     bool (*can_run_job)(struct drm_sched_job *sched_job);
> >>>>> +
> >>>>>         /**
> >>>>>              * @run_job: Called to execute the job once all of the dependencies
> >>>>>              * have been resolved.  This may be called multiple times, if
> >>>>>
>
Christian König March 8, 2023, 3:30 p.m. UTC | #16
Am 08.03.23 um 15:53 schrieb Asahi Lina:
> [SNIP]
>> The background is that core memory management requires that signaling a
>> fence only depends on signaling other fences and hardware progress and
>> nothing else. Otherwise you immediately run into problems because of
>> circle dependencies or what we call infinite fences.
> And hardware progress is exactly the only dependency here...

Well then you should have a fence for that hardware progress.

>> Jason Ekstrand gave a create presentation on that problem a few years
>> ago on LPC. I strongly suggest you google that one up.
> Faith Ekstrand (it looks like you mistyped that name...)

My fault I was really just mistyping that :)

>   is the person
> who proposed that I should use drm_sched in this way (see below), we've
> had a few private meetings about this design ^^
>
>>> You can see the driver implementation of that callback in
>>> drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then
>>> calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that
>>> does the actual available slot count checks.
>>>
>>> The can_run_job logic is written to mirror the hw_submission_limit logic
>>> (just a bit later in the sched main loop since we need to actually pick
>>> a job to do the check), and just like for that case, completion of any
>>> job in the same scheduler will cause another run of the main loop and
>>> another check (which is exactly what we want here).
>> Yeah and that hw_submission_limit is based on a fence signaling again.
> I don't think so...? It's just an atomic that gets checked in
> drm_sched_ready(). There are no extra fences involved (other than the
> job completion fences that trigger another scheduler run). The idea is
> that when the hardware queue makes forward progress you check against
> the limit again and submit more jobs as needed. I'm doing the same exact
> thing, I'm just using more complex logic for the notion of in-flight
> queue limits!

Then why can't you express that logic in a dependency fence?

>> When you have some firmware limitation that a job needs resources which
>> are currently in use by other submissions then those other submissions
>> have fences as well and you can return those in the prepare_job callback.
>>
>> If those other submissions don't have fences, then you have a major
>> design problem inside your driver and we need to get back to square one
>> and talk about that dependency handling.
> I think we have a disconnect in our views of what is going on here...
>
> This hardware has firmware-side scheduling with an arbitrary (as far as
> I know) number of queues. There is one scheduler instance and one entity
> per userspace queue (not global!). These queues process jobs in some
> logical sequence, though at the firmware level they get split into up to
> three queues each (and there is some parallelism allowed). The
> limitation here is in the number of in-flight jobs per firmware queue,
> not global.

So far I'm familiar with that design.

> There is no way for things to deadlock. If jobs have been submitted to
> the firmware queue, that means their dependencies were signaled already.
> Jobs have intra-job dependencies via driver barriers (which drm_sched
> knows nothing about), but the submission code in the driver guarantees
> that they are deadlock-free since you can only barrier on past commands,
> which by definition submit first.
>
> If a firmware queue is full, drm_sched blocks. Since it is full, that
> means it will run those commands (since they have no outside
> dependencies and they are already queued and ready to run by the
> firmware), eventually space will be freed, and each time a job completes
> drm_sched will do the can_run_job check again and decide whether to run
> a new job.
>
> Since the firmware queues contain commands which only have past-facing
> barriers on other already submitted commands, by definition they will
> become empty at some point as long as the firmware is making forward
> progress. And therefore, by definition, can_run_job will eventually
> return true at some point after a job completion fence is signaled (the
> one for the last job submitted prior). There is a check in the driver to
> ensure that we do not allow submissions which, by themselves, would
> exceed the queued command limit (we actually just limit to 64 commands
> overall right now, which is conservative but seems reasonable given the
> 128-per-firmware-queue limit).

Well then again why don't you give that fence out as dependency? Is it 
because the scheduler tries to optimize those away?

> I get the feeling that you are conflating pending jobs with submitted
> jobs. This isn't about how many jobs you can have pending in drm_sched
> before running them or anything like that. Of course, at that point,
> arbitrary dependencies come into play and you can end up with deadlocks
> on dependency fences. But that's not the case here. What can_run_job is
> waiting on is guaranteed to make forward progress.

I see that we have a disconnection here. As far as I can see you can use 
the can_run callback in only three ways:

1. To check for some userspace dependency (We don't need to discuss 
that, it's evil and we both know it).

2. You check for some hw resource availability. Similar to VMID on 
amdgpu hw.

     This is what I think you do here (but I might be wrong). But this 
would be extremely problematic because you can then live lock.
     E.g. queue A keeps submitting jobs which take only a few resources 
and by doing so delays submitting jobs from queue B indefinitely.

3. You have an intra queue dependency. E.g. you have jobs which take X 
amount of resources, you can submit only to a specific limit.
     But in this case you should be able to return fences from the same 
queue as dependency and won't need that callback.

     We would just need to adjust drm_sched_entity_add_dependency_cb() a 
bit because dependencies from the same queue are currently filtered out 
because it assumes a pipeline nature of submission (e.g. previous 
submissions are finished before new submissions start).

>>> This case (potentially scheduling more than the FW job limit) is rare
>>> but handling it is necessary, since otherwise the entire job
>>> completion/tracking logic gets screwed up on the firmware end and queues
>>> end up stuck (I've managed to trigger this before).
>> Actually that's a pretty normal use case. I've have rejected similar
>> requirements like this before as well.
>>
>> For an example how this can work see amdgpu_job_prepare_job():
>> https://elixir.bootlin.com/linux/v6.3-rc1/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L251
>>
>> The gang submit gives and example of a global fence lock and the VMIDs
>> are an example of a global shared firmware resource.
> But the resource can_run_job is checking on isn't globally shared! It's
> specific to this scheduler instance, just like hw_submission_limit is,
> so as long as the firmware behind the scheduler is making forward
> progress, the resource will be guaranteed to be freed until another job
> can run.

Well either it should be globally shared because it is a shared resource 
(similar to our VMID or gangs) or it is an intra queue limitation in 
which case you could just use the fences previously submitted on the 
queue as dependency.

> I actually know I have a different theoretical deadlock issue along
> these lines in the driver because right now we grab actually global
> resources (including a VMID) before job submission to drm_sched. This is
> a known issue, and to fix it without reducing performance I need to
> introduce some kind of "patching/fixup" system for firmware commands
> (because we need to inject those identifiers in dozens of places, but we
> don't want to construct those commands from scratch at job run time
> because that introduces latency at the wrong time and makes error
> handling/validation more complicated and error-prone), and that is
> exactly what should happen in prepare_job, as you say. And yes, at that
> point that should use fences to block when those resources are
> exhausted. But that's a different discussion we should have when
> reviewing the driver, it has nothing to do with the DRM abstractions nor
> the can_run_job callback I'm adding here nor the firmware queue length
> limit issue! (And also the global hardware devices are plentiful enough
> that I would be very surprised if anyone ever deadlocks it in practice
> even with the current code, so I honestly don't think that should be a
> blocker for driver submission either, I can and will fix it later...)

Well this is what I thought about those problems in amdgpu as well and 
it totally shipwrecked.

We still have memory allocations in the VMID code path which I'm still 
not sure how to remove.

Regards,
Christian.

>
> ~~ Lina
Asahi Lina March 8, 2023, 4:44 p.m. UTC | #17
On 09/03/2023 00.30, Christian König wrote:
> Am 08.03.23 um 15:53 schrieb Asahi Lina:
>> [SNIP]
>>> The background is that core memory management requires that signaling a
>>> fence only depends on signaling other fences and hardware progress and
>>> nothing else. Otherwise you immediately run into problems because of
>>> circle dependencies or what we call infinite fences.
>> And hardware progress is exactly the only dependency here...
> 
> Well then you should have a fence for that hardware progress.

I do, it's the prior job hardware completion fences that drm_sched
already knows about!

Yes, I could return those in the prepare callback, it just means I need
to start stashing fence references in the underlying firmware job queue
command objects so I can find out what is the oldest pending fence is,
and return it when a queue is full. As long as drm_sched doesn't mind if
I keep giving it fences (since multiple commands can have to complete
before there is space) or the occasional already signaled fence (because
this process is inherently racy), it should work fine.

If you think this is the better way, I'll do it that way and drop this
patch. It just seemed simpler to do it with another callback, since
drm_sched is already tracking those fences and doing a hardware queue
limit check anyway, and that way I can avoid tracking the fences down
into the hardware queue code... *

(But I still maintain what I'm trying to do here is entirely correct and
deadlock-free! If you prefer I use prepare_job and return prior job
fences from that instead, that's very different from NAKing the patch
saying it's broken...)

* If you're wondering how the fences get signaled at all then: callback
closures that capture a reference to the fence when firmware commands
are constructed and submitted. I know, I know, fancy Rust stuff... ^^
If you'd rather have me use the fences for the blocking, I'll probably
just drop the signaling bit from the closures so we don't need to keep
two redundant fence references in different places per command. I still
need the closures for command completion processing though, since I use
them to process statistics too...

>>> Jason Ekstrand gave a create presentation on that problem a few years
>>> ago on LPC. I strongly suggest you google that one up.
>> Faith Ekstrand (it looks like you mistyped that name...)
> 
> My fault I was really just mistyping that :)

It's all good ^^

> 
> I see that we have a disconnection here. As far as I can see you can use 
> the can_run callback in only three ways:
> 
> 1. To check for some userspace dependency (We don't need to discuss 
> that, it's evil and we both know it).
> 
> 2. You check for some hw resource availability. Similar to VMID on 
> amdgpu hw.
> 
>      This is what I think you do here (but I might be wrong).

It isn't... I agree, it would be problematic. It doesn't make any sense
to check for global resources this way, not just because you might
deadlock but also because there might be nothing to signal to the
scheduler that a resource was freed at all once it is!

> But this 
> would be extremely problematic because you can then live lock.
>      E.g. queue A keeps submitting jobs which take only a few resources 
> and by doing so delays submitting jobs from queue B indefinitely.

This particular issue aside, fairness in global resource allocation is a
conversation I'd love to have! Right now the driver doesn't try to
ensure that, a queue can easily monopolize certain hardware resources
(though one queue can only monopolize one of each, so you'd need
something like 63 queues with 63 distinct VMs all submitting
free-running jobs back to back in order to starve other queues of
resources forever). For starters, one thing I'm thinking of doing is
reserving certain subsets of hardware resources for queues with a given
priority, so you can at least guarantee forward progress of
higher-priority queues when faced with misbehaving lower-priority
queues. But if we want to guarantee proper fairness, I think I'll have
to start doing things like switching to a CPU-roundtrip submission model
when resources become scarce (to guarantee that queues actually release
the resources once in a while) and then figure out how to add fairness
to the allocation code...

But let's have that conversation when we talk about the driver (or maybe
on IRC or something?), right now I'm more interested in getting the
abstractions reviewed ^^

> 3. You have an intra queue dependency. E.g. you have jobs which take X 
> amount of resources, you can submit only to a specific limit.
>      But in this case you should be able to return fences from the same 
> queue as dependency and won't need that callback.

Yes, I can do this. I can just do the same check can_run_job() does and
if it fails, pick the oldest job in the full firmware queue and return
its fence (it just means I need to keep track of those fences there, as
I said above).

>      We would just need to adjust drm_sched_entity_add_dependency_cb() a 
> bit because dependencies from the same queue are currently filtered out 
> because it assumes a pipeline nature of submission (e.g. previous 
> submissions are finished before new submissions start).

Actually that should be fine, because I'd be returning the underlying
hardware completion fences (what the run() callback returns) which the
driver owns, and wouldn't be recognized as belonging to the sched.

>> I actually know I have a different theoretical deadlock issue along
>> these lines in the driver because right now we grab actually global
>> resources (including a VMID) before job submission to drm_sched. This is
>> a known issue, and to fix it without reducing performance I need to
>> introduce some kind of "patching/fixup" system for firmware commands
>> (because we need to inject those identifiers in dozens of places, but we
>> don't want to construct those commands from scratch at job run time
>> because that introduces latency at the wrong time and makes error
>> handling/validation more complicated and error-prone), and that is
>> exactly what should happen in prepare_job, as you say. And yes, at that
>> point that should use fences to block when those resources are
>> exhausted. But that's a different discussion we should have when
>> reviewing the driver, it has nothing to do with the DRM abstractions nor
>> the can_run_job callback I'm adding here nor the firmware queue length
>> limit issue! (And also the global hardware devices are plentiful enough
>> that I would be very surprised if anyone ever deadlocks it in practice
>> even with the current code, so I honestly don't think that should be a
>> blocker for driver submission either, I can and will fix it later...)
> 
> Well this is what I thought about those problems in amdgpu as well and 
> it totally shipwrecked.
> 
> We still have memory allocations in the VMID code path which I'm still 
> not sure how to remove.

We don't even have a shrinker yet, and I'm sure that's going to be a lot
of fun when we add it too... but yes, if we can't do any memory
allocations in some of these callbacks (is this documented anywhere?),
that's going to be interesting...

It's not all bad news though! All memory allocations are fallible in
kernel Rust (and therefore explicit, and also failures have to be
explicitly handled or propagated), so it's pretty easy to point out
where they are, and there are already discussions of higher-level
tooling to enforce rules like that (and things like wait contexts).
Also, Rust makes it a lot easier to refactor code in general and not be
scared that you're going to regress everything, so I'm not really
worried if I need to turn a chunk of the driver on its head to solve
some of these problems in the future ^^ (I already did that when I
switched it from the "demo" synchronous submission model to the proper
explicit sync + fences one.)

~~ Lina
Christian König March 8, 2023, 5:57 p.m. UTC | #18
Am 08.03.23 um 17:44 schrieb Asahi Lina:
> On 09/03/2023 00.30, Christian König wrote:
>> Am 08.03.23 um 15:53 schrieb Asahi Lina:
>>> [SNIP]
>>>> The background is that core memory management requires that signaling a
>>>> fence only depends on signaling other fences and hardware progress and
>>>> nothing else. Otherwise you immediately run into problems because of
>>>> circle dependencies or what we call infinite fences.
>>> And hardware progress is exactly the only dependency here...
>> Well then you should have a fence for that hardware progress.
> I do, it's the prior job hardware completion fences that drm_sched
> already knows about!
>
> Yes, I could return those in the prepare callback, it just means I need
> to start stashing fence references in the underlying firmware job queue
> command objects so I can find out what is the oldest pending fence is,
> and return it when a queue is full. As long as drm_sched doesn't mind if
> I keep giving it fences (since multiple commands can have to complete
> before there is space) or the occasional already signaled fence (because
> this process is inherently racy), it should work fine.

Well this handling is intentional and necessary, but see below for a 
more in deep explanation.

> If you think this is the better way, I'll do it that way and drop this
> patch. It just seemed simpler to do it with another callback, since
> drm_sched is already tracking those fences and doing a hardware queue
> limit check anyway, and that way I can avoid tracking the fences down
> into the hardware queue code... *

Well it's not the better way, it's the only way that works.

I have to admit that my bet on your intentions was wrong, but even that 
use case doesn't work correctly.

See when your callback returns false it is perfectly possible that all 
hw fences are signaled between returning that information and processing it.

The result would be that the scheduler goes to sleep and never wakes up 
again.

That's why we have that rule that all dependencies need to be expressed 
by those dma_fence objects, cause those are design with such races in mind.

> (But I still maintain what I'm trying to do here is entirely correct and
> deadlock-free! If you prefer I use prepare_job and return prior job
> fences from that instead, that's very different from NAKing the patch
> saying it's broken...)

As I said we exercised those ideas before and yes this approach here 
came up before as well and no it doesn't work.

> * If you're wondering how the fences get signaled at all then: callback
> closures that capture a reference to the fence when firmware commands
> are constructed and submitted. I know, I know, fancy Rust stuff... ^^
> If you'd rather have me use the fences for the blocking, I'll probably
> just drop the signaling bit from the closures so we don't need to keep
> two redundant fence references in different places per command. I still
> need the closures for command completion processing though, since I use
> them to process statistics too...
>
>> I see that we have a disconnection here. As far as I can see you can use
>> the can_run callback in only three ways:
>>
>> 1. To check for some userspace dependency (We don't need to discuss
>> that, it's evil and we both know it).
>>
>> 2. You check for some hw resource availability. Similar to VMID on
>> amdgpu hw.
>>
>>       This is what I think you do here (but I might be wrong).
> It isn't... I agree, it would be problematic. It doesn't make any sense
> to check for global resources this way, not just because you might
> deadlock but also because there might be nothing to signal to the
> scheduler that a resource was freed at all once it is!
>
>> But this
>> would be extremely problematic because you can then live lock.
>>       E.g. queue A keeps submitting jobs which take only a few resources
>> and by doing so delays submitting jobs from queue B indefinitely.
> This particular issue aside, fairness in global resource allocation is a
> conversation I'd love to have! Right now the driver doesn't try to
> ensure that, a queue can easily monopolize certain hardware resources
> (though one queue can only monopolize one of each, so you'd need
> something like 63 queues with 63 distinct VMs all submitting
> free-running jobs back to back in order to starve other queues of
> resources forever). For starters, one thing I'm thinking of doing is
> reserving certain subsets of hardware resources for queues with a given
> priority, so you can at least guarantee forward progress of
> higher-priority queues when faced with misbehaving lower-priority
> queues. But if we want to guarantee proper fairness, I think I'll have
> to start doing things like switching to a CPU-roundtrip submission model
> when resources become scarce (to guarantee that queues actually release
> the resources once in a while) and then figure out how to add fairness
> to the allocation code...
>
> But let's have that conversation when we talk about the driver (or maybe
> on IRC or something?), right now I'm more interested in getting the
> abstractions reviewed ^^

Well that stuff is highly problematic as well. The fairness aside you 
risk starvation which in turn breaks the guarantee of forward progress.

In this particular case you can catch this with a timeout for the hw 
operation, but you should consider blocking that from the sw side as well.

>> 3. You have an intra queue dependency. E.g. you have jobs which take X
>> amount of resources, you can submit only to a specific limit.
>>       But in this case you should be able to return fences from the same
>> queue as dependency and won't need that callback.
> Yes, I can do this. I can just do the same check can_run_job() does and
> if it fails, pick the oldest job in the full firmware queue and return
> its fence (it just means I need to keep track of those fences there, as
> I said above).
>
>>       We would just need to adjust drm_sched_entity_add_dependency_cb() a
>> bit because dependencies from the same queue are currently filtered out
>> because it assumes a pipeline nature of submission (e.g. previous
>> submissions are finished before new submissions start).
> Actually that should be fine, because I'd be returning the underlying
> hardware completion fences (what the run() callback returns) which the
> driver owns, and wouldn't be recognized as belonging to the sched.
>
>>> I actually know I have a different theoretical deadlock issue along
>>> these lines in the driver because right now we grab actually global
>>> resources (including a VMID) before job submission to drm_sched. This is
>>> a known issue, and to fix it without reducing performance I need to
>>> introduce some kind of "patching/fixup" system for firmware commands
>>> (because we need to inject those identifiers in dozens of places, but we
>>> don't want to construct those commands from scratch at job run time
>>> because that introduces latency at the wrong time and makes error
>>> handling/validation more complicated and error-prone), and that is
>>> exactly what should happen in prepare_job, as you say. And yes, at that
>>> point that should use fences to block when those resources are
>>> exhausted. But that's a different discussion we should have when
>>> reviewing the driver, it has nothing to do with the DRM abstractions nor
>>> the can_run_job callback I'm adding here nor the firmware queue length
>>> limit issue! (And also the global hardware devices are plentiful enough
>>> that I would be very surprised if anyone ever deadlocks it in practice
>>> even with the current code, so I honestly don't think that should be a
>>> blocker for driver submission either, I can and will fix it later...)
>> Well this is what I thought about those problems in amdgpu as well and
>> it totally shipwrecked.
>>
>> We still have memory allocations in the VMID code path which I'm still
>> not sure how to remove.
> We don't even have a shrinker yet, and I'm sure that's going to be a lot
> of fun when we add it too... but yes, if we can't do any memory
> allocations in some of these callbacks (is this documented anywhere?),
> that's going to be interesting...

Yes, that is all part of the dma_fence documentation. It's just 
absolutely not obvious what all this means.

> It's not all bad news though! All memory allocations are fallible in
> kernel Rust (and therefore explicit, and also failures have to be
> explicitly handled or propagated), so it's pretty easy to point out
> where they are, and there are already discussions of higher-level
> tooling to enforce rules like that (and things like wait contexts).
> Also, Rust makes it a lot easier to refactor code in general and not be
> scared that you're going to regress everything, so I'm not really
> worried if I need to turn a chunk of the driver on its head to solve
> some of these problems in the future ^^ (I already did that when I
> switched it from the "demo" synchronous submission model to the proper
> explicit sync + fences one.)

Yeah, well the problem isn't that you run into memory allocation failure.

The problem is rather something like this:
1. You try to allocate memory to signal your fence.
2. This memory allocation can't be fulfilled and goes to sleep to wait 
for reclaim.
3. On another CPU reclaim is running and through the general purpose 
shrinker, page fault or MMU notifier ends up wait for your dma_fence.

You don't even need to implement the shrinker for this to go boom 
extremely easy.

So everything involved with signaling the fence can allocate memory only 
with GFP_ATOMIC and only if you absolutely have to.

Christian.


>
> ~~ Lina
Asahi Lina March 8, 2023, 7:05 p.m. UTC | #19
On 09/03/2023 02.57, Christian König wrote:
> Am 08.03.23 um 17:44 schrieb Asahi Lina:
>> On 09/03/2023 00.30, Christian König wrote:
>>> Am 08.03.23 um 15:53 schrieb Asahi Lina:
>>>> [SNIP]
>>>>> The background is that core memory management requires that signaling a
>>>>> fence only depends on signaling other fences and hardware progress and
>>>>> nothing else. Otherwise you immediately run into problems because of
>>>>> circle dependencies or what we call infinite fences.
>>>> And hardware progress is exactly the only dependency here...
>>> Well then you should have a fence for that hardware progress.
>> I do, it's the prior job hardware completion fences that drm_sched
>> already knows about!
>>
>> Yes, I could return those in the prepare callback, it just means I need
>> to start stashing fence references in the underlying firmware job queue
>> command objects so I can find out what is the oldest pending fence is,
>> and return it when a queue is full. As long as drm_sched doesn't mind if
>> I keep giving it fences (since multiple commands can have to complete
>> before there is space) or the occasional already signaled fence (because
>> this process is inherently racy), it should work fine.
> 
> Well this handling is intentional and necessary, but see below for a 
> more in deep explanation.
> 
>> If you think this is the better way, I'll do it that way and drop this
>> patch. It just seemed simpler to do it with another callback, since
>> drm_sched is already tracking those fences and doing a hardware queue
>> limit check anyway, and that way I can avoid tracking the fences down
>> into the hardware queue code... *
> 
> Well it's not the better way, it's the only way that works.
> 
> I have to admit that my bet on your intentions was wrong, but even that 
> use case doesn't work correctly.
> 
> See when your callback returns false it is perfectly possible that all 
> hw fences are signaled between returning that information and processing it.
> 
> The result would be that the scheduler goes to sleep and never wakes up 
> again.

That can't happen, because it will just go into another iteration of the
drm_sched main loop since there is an entity available still.

Rather there is probably the opposite bug in this patch: the can_run_job
logic should be moved into the wait_event_interruptible() condition
check, otherwise I think it can end up busy-looping since the condition
itself can be true even when the can_run_job check blocks it.

But there is no risk of it going to sleep and never waking up because
job completions will wake up the waitqueue by definition, and that
happens after the driver-side queues are popped. If this problem could
happen, then the existing hw_submission_limit logic would be broken in
the same way. It is logically equivalent in how it works.

Basically, if properly done in wait_event_interruptible, it is exactly
the logic of that macro that prevents this race condition and makes
everything work at all. Without it, drm_sched would be completely broken.

> As I said we exercised those ideas before and yes this approach here 
> came up before as well and no it doesn't work.

It can never deadlock with this patch as it stands (though it could busy
loop), and if properly moved into the wait_event_interruptible(), it
would also never busy loop and work entirely as intended. The actual API
change is sound.

I don't know why you're trying so hard to convince everyone that this
approach is fundamentally broken... It might be a bad idea for other
reasons, it might encourage incorrect usage, it might not be the best
option, there are plenty of arguments you can make... but you just keep
trying to make an argument that it just can't work at all for some
reason. Why? I already said I'm happy dropping it in favor of the fences...

It's intended to mirror the hw_submission_limit logic. If you think this
is broken, then that's broken too. They are equivalent mechanisms.

>> This particular issue aside, fairness in global resource allocation is a
>> conversation I'd love to have! Right now the driver doesn't try to
>> ensure that, a queue can easily monopolize certain hardware resources
>> (though one queue can only monopolize one of each, so you'd need
>> something like 63 queues with 63 distinct VMs all submitting
>> free-running jobs back to back in order to starve other queues of
>> resources forever). For starters, one thing I'm thinking of doing is
>> reserving certain subsets of hardware resources for queues with a given
>> priority, so you can at least guarantee forward progress of
>> higher-priority queues when faced with misbehaving lower-priority
>> queues. But if we want to guarantee proper fairness, I think I'll have
>> to start doing things like switching to a CPU-roundtrip submission model
>> when resources become scarce (to guarantee that queues actually release
>> the resources once in a while) and then figure out how to add fairness
>> to the allocation code...
>>
>> But let's have that conversation when we talk about the driver (or maybe
>> on IRC or something?), right now I'm more interested in getting the
>> abstractions reviewed ^^
> 
> Well that stuff is highly problematic as well. The fairness aside you 
> risk starvation which in turn breaks the guarantee of forward progress.
> 
> In this particular case you can catch this with a timeout for the hw 
> operation, but you should consider blocking that from the sw side as well.

In the current state I actually think it's not really that problematic,
because the resources are acquired directly in the ioctl path. So that
can block if starved, but if that can cause overall forward progress to
stop because some fence doesn't get signaled, then so can just not doing
the ioctl in the first place, so there's not much point (userspace can
always misbehave with its fence usage...). By the time anything gets
submitted to drm_sched, the resources are already guaranteed to be
acquired, we never block in the run callback.

It needs to be fixed of course, but if the threat model is a malicious
GPU process, well, there are many other ways to DoS your system... and I
don't think it's very likely that 63+ queues (which usually means 63+
processes with OpenGL) will end up accidentally starving the GPU in a
tight loop at the same time. I'd love to hear about real-world scenarios
where this kind of thing has been a real problem and not just a
theoretical one though... maybe I'm missing something?

Basically my priorities with the driver are:

1. Make sure it never crashes
2. Make sure it works well for real users
3. Make it work smoothly for real users under reasonable load
(priorities, CPU scheduler interactions, etc.)
4. Make it handle accidental problems more gracefully (OOMs etc, I need
to look into private GEM BO accounting to processes so the OOM killer
has better data to work with)
5. Make it more robust against deliberate abuse/starvation (this should
matter more once we have some kind of paravirtualization solution...)

And right now we're somewhere between 2 and 3. So if there are cases
where this resource acquisition stuff can cause a problem for real
users, I'll want to fix it earlier. But if this is more theoretical than
anything (with the resource limits of AGX GPUs), I'd rather focus on
things like memory accounting and shrinker support first.

>> We don't even have a shrinker yet, and I'm sure that's going to be a lot
>> of fun when we add it too... but yes, if we can't do any memory
>> allocations in some of these callbacks (is this documented anywhere?),
>> that's going to be interesting...
> 
> Yes, that is all part of the dma_fence documentation. It's just 
> absolutely not obvious what all this means.

I mean is there any documentation on how this interacts with drm_sched?
Like, am I not allowed to allocate memory in prepare()? What about
run()? What about GPU interrupt work? (not a raw IRQ handler context, I
mean the execution path from GPU IRQ to drm_sched run() fences getting
signaled)

>> It's not all bad news though! All memory allocations are fallible in
>> kernel Rust (and therefore explicit, and also failures have to be
>> explicitly handled or propagated), so it's pretty easy to point out
>> where they are, and there are already discussions of higher-level
>> tooling to enforce rules like that (and things like wait contexts).
>> Also, Rust makes it a lot easier to refactor code in general and not be
>> scared that you're going to regress everything, so I'm not really
>> worried if I need to turn a chunk of the driver on its head to solve
>> some of these problems in the future ^^ (I already did that when I
>> switched it from the "demo" synchronous submission model to the proper
>> explicit sync + fences one.)
> 
> Yeah, well the problem isn't that you run into memory allocation failure.

What I mean is that the mandatory failure handling means it's relatively
easy to audit where memory allocations can actually happen.

> The problem is rather something like this:
> 1. You try to allocate memory to signal your fence.
> 2. This memory allocation can't be fulfilled and goes to sleep to wait 
> for reclaim.
> 3. On another CPU reclaim is running and through the general purpose 
> shrinker, page fault or MMU notifier ends up wait for your dma_fence.
> 
> You don't even need to implement the shrinker for this to go boom 
> extremely easy.

Hmm, can you actually get something waiting on a dma_fence like that
today with this driver? We don't have a shrinker, we don't have
synchronous page faults or MMU notifications for the GPU, and this is
explicit sync so all in/out fences cross over into userspace so surely
they can't be trusted anyway?

I'm definitely not familiar with the intricacies of DMA fences and how
they interact with everything else yet, but it's starting to sound like
either this isn't quite broken for our simple driver yet, or it must be
broken pretty much everywhere in some way...

> So everything involved with signaling the fence can allocate memory only 
> with GFP_ATOMIC and only if you absolutely have to.

I don't think we even have a good story for passing around gfp_flags in
Rust code so that will be interesting... though I need to actually audit
the code paths and see how many allocations we really do. I know I alloc
some vectors for holding completed commands and stuff like that, but I'm
pretty sure I can fix that one with some reworking, and I'm not sure how
many other random things there really are...? Obviously most allocations
happen at command creation time, on completion you mostly get a lot of
freeing, so maybe I can just eliminate all allocs and not worry about
GFP_ATOMIC.

~~ Lina
Christian König March 8, 2023, 7:12 p.m. UTC | #20
Am 08.03.23 um 20:05 schrieb Asahi Lina:
> [SNIP]
>> Well it's not the better way, it's the only way that works.
>>
>> I have to admit that my bet on your intentions was wrong, but even that
>> use case doesn't work correctly.
>>
>> See when your callback returns false it is perfectly possible that all
>> hw fences are signaled between returning that information and processing it.
>>
>> The result would be that the scheduler goes to sleep and never wakes up
>> again.
> That can't happen, because it will just go into another iteration of the
> drm_sched main loop since there is an entity available still.
>
> Rather there is probably the opposite bug in this patch: the can_run_job
> logic should be moved into the wait_event_interruptible() condition
> check, otherwise I think it can end up busy-looping since the condition
> itself can be true even when the can_run_job check blocks it.
>
> But there is no risk of it going to sleep and never waking up because
> job completions will wake up the waitqueue by definition, and that
> happens after the driver-side queues are popped. If this problem could
> happen, then the existing hw_submission_limit logic would be broken in
> the same way. It is logically equivalent in how it works.
>
> Basically, if properly done in wait_event_interruptible, it is exactly
> the logic of that macro that prevents this race condition and makes
> everything work at all. Without it, drm_sched would be completely broken.
>
>> As I said we exercised those ideas before and yes this approach here
>> came up before as well and no it doesn't work.
> It can never deadlock with this patch as it stands (though it could busy
> loop), and if properly moved into the wait_event_interruptible(), it
> would also never busy loop and work entirely as intended. The actual API
> change is sound.
>
> I don't know why you're trying so hard to convince everyone that this
> approach is fundamentally broken... It might be a bad idea for other
> reasons, it might encourage incorrect usage, it might not be the best
> option, there are plenty of arguments you can make... but you just keep
> trying to make an argument that it just can't work at all for some
> reason. Why? I already said I'm happy dropping it in favor of the fences...

Well because it is broken.

When you move the check into the wait_event_interruptible condition then 
who is going to call wait_event_interruptible when the condition changes?

As I said this idea came up before and was rejected multiple times.

Regards,
Christian.

>
> It's intended to mirror the hw_submission_limit logic. If you think this
> is broken, then that's broken too. They are equivalent mechanisms.
>
>>> This particular issue aside, fairness in global resource allocation is a
>>> conversation I'd love to have! Right now the driver doesn't try to
>>> ensure that, a queue can easily monopolize certain hardware resources
>>> (though one queue can only monopolize one of each, so you'd need
>>> something like 63 queues with 63 distinct VMs all submitting
>>> free-running jobs back to back in order to starve other queues of
>>> resources forever). For starters, one thing I'm thinking of doing is
>>> reserving certain subsets of hardware resources for queues with a given
>>> priority, so you can at least guarantee forward progress of
>>> higher-priority queues when faced with misbehaving lower-priority
>>> queues. But if we want to guarantee proper fairness, I think I'll have
>>> to start doing things like switching to a CPU-roundtrip submission model
>>> when resources become scarce (to guarantee that queues actually release
>>> the resources once in a while) and then figure out how to add fairness
>>> to the allocation code...
>>>
>>> But let's have that conversation when we talk about the driver (or maybe
>>> on IRC or something?), right now I'm more interested in getting the
>>> abstractions reviewed ^^
>> Well that stuff is highly problematic as well. The fairness aside you
>> risk starvation which in turn breaks the guarantee of forward progress.
>>
>> In this particular case you can catch this with a timeout for the hw
>> operation, but you should consider blocking that from the sw side as well.
> In the current state I actually think it's not really that problematic,
> because the resources are acquired directly in the ioctl path. So that
> can block if starved, but if that can cause overall forward progress to
> stop because some fence doesn't get signaled, then so can just not doing
> the ioctl in the first place, so there's not much point (userspace can
> always misbehave with its fence usage...). By the time anything gets
> submitted to drm_sched, the resources are already guaranteed to be
> acquired, we never block in the run callback.
>
> It needs to be fixed of course, but if the threat model is a malicious
> GPU process, well, there are many other ways to DoS your system... and I
> don't think it's very likely that 63+ queues (which usually means 63+
> processes with OpenGL) will end up accidentally starving the GPU in a
> tight loop at the same time. I'd love to hear about real-world scenarios
> where this kind of thing has been a real problem and not just a
> theoretical one though... maybe I'm missing something?
>
> Basically my priorities with the driver are:
>
> 1. Make sure it never crashes
> 2. Make sure it works well for real users
> 3. Make it work smoothly for real users under reasonable load
> (priorities, CPU scheduler interactions, etc.)
> 4. Make it handle accidental problems more gracefully (OOMs etc, I need
> to look into private GEM BO accounting to processes so the OOM killer
> has better data to work with)
> 5. Make it more robust against deliberate abuse/starvation (this should
> matter more once we have some kind of paravirtualization solution...)
>
> And right now we're somewhere between 2 and 3. So if there are cases
> where this resource acquisition stuff can cause a problem for real
> users, I'll want to fix it earlier. But if this is more theoretical than
> anything (with the resource limits of AGX GPUs), I'd rather focus on
> things like memory accounting and shrinker support first.
>
>>> We don't even have a shrinker yet, and I'm sure that's going to be a lot
>>> of fun when we add it too... but yes, if we can't do any memory
>>> allocations in some of these callbacks (is this documented anywhere?),
>>> that's going to be interesting...
>> Yes, that is all part of the dma_fence documentation. It's just
>> absolutely not obvious what all this means.
> I mean is there any documentation on how this interacts with drm_sched?
> Like, am I not allowed to allocate memory in prepare()? What about
> run()? What about GPU interrupt work? (not a raw IRQ handler context, I
> mean the execution path from GPU IRQ to drm_sched run() fences getting
> signaled)
>
>>> It's not all bad news though! All memory allocations are fallible in
>>> kernel Rust (and therefore explicit, and also failures have to be
>>> explicitly handled or propagated), so it's pretty easy to point out
>>> where they are, and there are already discussions of higher-level
>>> tooling to enforce rules like that (and things like wait contexts).
>>> Also, Rust makes it a lot easier to refactor code in general and not be
>>> scared that you're going to regress everything, so I'm not really
>>> worried if I need to turn a chunk of the driver on its head to solve
>>> some of these problems in the future ^^ (I already did that when I
>>> switched it from the "demo" synchronous submission model to the proper
>>> explicit sync + fences one.)
>> Yeah, well the problem isn't that you run into memory allocation failure.
> What I mean is that the mandatory failure handling means it's relatively
> easy to audit where memory allocations can actually happen.
>
>> The problem is rather something like this:
>> 1. You try to allocate memory to signal your fence.
>> 2. This memory allocation can't be fulfilled and goes to sleep to wait
>> for reclaim.
>> 3. On another CPU reclaim is running and through the general purpose
>> shrinker, page fault or MMU notifier ends up wait for your dma_fence.
>>
>> You don't even need to implement the shrinker for this to go boom
>> extremely easy.
> Hmm, can you actually get something waiting on a dma_fence like that
> today with this driver? We don't have a shrinker, we don't have
> synchronous page faults or MMU notifications for the GPU, and this is
> explicit sync so all in/out fences cross over into userspace so surely
> they can't be trusted anyway?
>
> I'm definitely not familiar with the intricacies of DMA fences and how
> they interact with everything else yet, but it's starting to sound like
> either this isn't quite broken for our simple driver yet, or it must be
> broken pretty much everywhere in some way...
>
>> So everything involved with signaling the fence can allocate memory only
>> with GFP_ATOMIC and only if you absolutely have to.
> I don't think we even have a good story for passing around gfp_flags in
> Rust code so that will be interesting... though I need to actually audit
> the code paths and see how many allocations we really do. I know I alloc
> some vectors for holding completed commands and stuff like that, but I'm
> pretty sure I can fix that one with some reworking, and I'm not sure how
> many other random things there really are...? Obviously most allocations
> happen at command creation time, on completion you mostly get a lot of
> freeing, so maybe I can just eliminate all allocs and not worry about
> GFP_ATOMIC.
>
> ~~ Lina
Asahi Lina March 8, 2023, 7:45 p.m. UTC | #21
On 09/03/2023 04.12, Christian König wrote:
> Am 08.03.23 um 20:05 schrieb Asahi Lina:
>> [SNIP]
>>> Well it's not the better way, it's the only way that works.
>>>
>>> I have to admit that my bet on your intentions was wrong, but even that
>>> use case doesn't work correctly.
>>>
>>> See when your callback returns false it is perfectly possible that all
>>> hw fences are signaled between returning that information and processing it.
>>>
>>> The result would be that the scheduler goes to sleep and never wakes up
>>> again.
>> That can't happen, because it will just go into another iteration of the
>> drm_sched main loop since there is an entity available still.
>>
>> Rather there is probably the opposite bug in this patch: the can_run_job
>> logic should be moved into the wait_event_interruptible() condition
>> check, otherwise I think it can end up busy-looping since the condition
>> itself can be true even when the can_run_job check blocks it.
>>
>> But there is no risk of it going to sleep and never waking up because
>> job completions will wake up the waitqueue by definition, and that
>> happens after the driver-side queues are popped. If this problem could
>> happen, then the existing hw_submission_limit logic would be broken in
>> the same way. It is logically equivalent in how it works.
>>
>> Basically, if properly done in wait_event_interruptible, it is exactly
>> the logic of that macro that prevents this race condition and makes
>> everything work at all. Without it, drm_sched would be completely broken.
>>
>>> As I said we exercised those ideas before and yes this approach here
>>> came up before as well and no it doesn't work.
>> It can never deadlock with this patch as it stands (though it could busy
>> loop), and if properly moved into the wait_event_interruptible(), it
>> would also never busy loop and work entirely as intended. The actual API
>> change is sound.
>>
>> I don't know why you're trying so hard to convince everyone that this
>> approach is fundamentally broken... It might be a bad idea for other
>> reasons, it might encourage incorrect usage, it might not be the best
>> option, there are plenty of arguments you can make... but you just keep
>> trying to make an argument that it just can't work at all for some
>> reason. Why? I already said I'm happy dropping it in favor of the fences...
> 
> Well because it is broken.
> 
> When you move the check into the wait_event_interruptible condition then 
> who is going to call wait_event_interruptible when the condition changes?

I think you mean wake_up_interruptible(). That would be
drm_sched_job_done(), on the fence callback when a job completes, which
as I keep saying is the same logic used for
hw_rq_count/hw_submission_limit tracking.

Please think about it for a second, it's really not that complicated to
see why it works:

- Driver pops off completed commands <-- can_run_job condition satisfied
- Driver signals fence
 - drm_sched_job_done_cb()
  - drm_sched_job_done()
   - atomic_dec(&sched->hw_rq_count); <-- hw_submission_limit satisfied
   - ...
   - wake_up_interruptible(&sched->wake_up_worker);
      ^- happens after both conditions are potentially satisfied

It really is completely equivalent to just making the hw_rq_count logic
customizable by the driver. The actual flow is the same. As long as the
driver guarantees it satisfies the can_run_job() condition before
signaling the completion fence that triggered that change, it works fine.

> As I said this idea came up before and was rejected multiple times.

Maybe it was a different idea, or maybe it was rejected for other
reasons, or maybe it was wrongly rejected for being broken when it isn't ^^

~~ Lina
Christian König March 8, 2023, 8:14 p.m. UTC | #22
Am 08.03.23 um 20:45 schrieb Asahi Lina:
> On 09/03/2023 04.12, Christian König wrote:
>> Am 08.03.23 um 20:05 schrieb Asahi Lina:
>>> [SNIP]
>>>> Well it's not the better way, it's the only way that works.
>>>>
>>>> I have to admit that my bet on your intentions was wrong, but even that
>>>> use case doesn't work correctly.
>>>>
>>>> See when your callback returns false it is perfectly possible that all
>>>> hw fences are signaled between returning that information and processing it.
>>>>
>>>> The result would be that the scheduler goes to sleep and never wakes up
>>>> again.
>>> That can't happen, because it will just go into another iteration of the
>>> drm_sched main loop since there is an entity available still.
>>>
>>> Rather there is probably the opposite bug in this patch: the can_run_job
>>> logic should be moved into the wait_event_interruptible() condition
>>> check, otherwise I think it can end up busy-looping since the condition
>>> itself can be true even when the can_run_job check blocks it.
>>>
>>> But there is no risk of it going to sleep and never waking up because
>>> job completions will wake up the waitqueue by definition, and that
>>> happens after the driver-side queues are popped. If this problem could
>>> happen, then the existing hw_submission_limit logic would be broken in
>>> the same way. It is logically equivalent in how it works.
>>>
>>> Basically, if properly done in wait_event_interruptible, it is exactly
>>> the logic of that macro that prevents this race condition and makes
>>> everything work at all. Without it, drm_sched would be completely broken.
>>>
>>>> As I said we exercised those ideas before and yes this approach here
>>>> came up before as well and no it doesn't work.
>>> It can never deadlock with this patch as it stands (though it could busy
>>> loop), and if properly moved into the wait_event_interruptible(), it
>>> would also never busy loop and work entirely as intended. The actual API
>>> change is sound.
>>>
>>> I don't know why you're trying so hard to convince everyone that this
>>> approach is fundamentally broken... It might be a bad idea for other
>>> reasons, it might encourage incorrect usage, it might not be the best
>>> option, there are plenty of arguments you can make... but you just keep
>>> trying to make an argument that it just can't work at all for some
>>> reason. Why? I already said I'm happy dropping it in favor of the fences...
>> Well because it is broken.
>>
>> When you move the check into the wait_event_interruptible condition then
>> who is going to call wait_event_interruptible when the condition changes?
> I think you mean wake_up_interruptible(). That would be
> drm_sched_job_done(), on the fence callback when a job completes, which
> as I keep saying is the same logic used for
> hw_rq_count/hw_submission_limit tracking.

As the documentation to wait_event says:

  * wake_up() has to be called after changing any variable that could
  * change the result of the wait condition.

So what you essentially try to do here is to skip that and say 
drm_sched_job_done() would call that anyway, but when you read any 
variable to determine that state then as far as I can see nothing is 
guarantying that order.

The only other possibility how you could use the callback correctly 
would be to call drm_fence_is_signaled() to query the state of your hw 
submission from the same fence which is then signaled. But then the 
question is once more why you don't give that fence directly to the 
scheduler?

> Please think about it for a second,

Yeah, I'm trying to really follow your intentions here. But that doesn't 
really makes sense.

Either you are trying to do something invalid or you are trying to 
circumvent the object model somehow and add a shortcut for the signaling 
API. Both would be more than fishy.

Regards,
Christian.

>   it's really not that complicated to
> see why it works:
>
> - Driver pops off completed commands <-- can_run_job condition satisfied
> - Driver signals fence
>   - drm_sched_job_done_cb()
>    - drm_sched_job_done()
>     - atomic_dec(&sched->hw_rq_count); <-- hw_submission_limit satisfied
>     - ...
>     - wake_up_interruptible(&sched->wake_up_worker);
>        ^- happens after both conditions are potentially satisfied
>
> It really is completely equivalent to just making the hw_rq_count logic
> customizable by the driver. The actual flow is the same. As long as the
> driver guarantees it satisfies the can_run_job() condition before
> signaling the completion fence that triggered that change, it works fine.
>
>> As I said this idea came up before and was rejected multiple times.
> Maybe it was a different idea, or maybe it was rejected for other
> reasons, or maybe it was wrongly rejected for being broken when it isn't ^^
>
> ~~ Lina
Asahi Lina March 9, 2023, 6:10 a.m. UTC | #23
On 08/03/2023 03.19, Björn Roy Baron wrote:
> ------- Original Message -------
> On Tuesday, March 7th, 2023 at 15:25, Asahi Lina <lina@asahilina.net> wrote:
> 
>> Add the initial abstractions for DRM drivers and devices. These go
>> together in one commit since they are fairly tightly coupled types.
>>
>> A few things have been stubbed out, to be implemented as further bits of
>> the DRM subsystem are introduced.
>>
>> Signed-off-by: Asahi Lina lina@asahilina.net
>>
>> ---
[...]

>> +/// Information data for a DRM Driver.
>> +pub struct DriverInfo {
>> +    /// Driver major version.
>> +    pub major: i32,
>> +    /// Driver minor version.
>> +    pub minor: i32,
>> +    /// Driver patchlevel version.
>> +    pub patchlevel: i32,
>> +    /// Driver name.
>> +    pub name: &'static CStr,
>> +    /// Driver description.
>> +    pub desc: &'static CStr,
>> +    /// Driver date.
>> +    pub date: &'static CStr,
>> +}
>> +
> 
> Could you please add an Invariants section to the doc comments indicating what requirements these function pointers must satisfy?

I can try (as much as I can divine from the C side anyway...). I guess
you want interface docs for each callback, so like what it must do and
what invariants each one must uphold?

Note that this is a kernel crate-only struct (the fields are not public)
so users can't create their own AllocOps variants anyway (plus AllocImpl
is sealed, on top of that), but I guess it makes sense to document for
internal kernel crate purposes. At some point it might make sense to
allow drivers to override these with proper Rust callbacks (and then the
wrappers need to ensure safety), but right now that's not implemented.

>> +/// Internal memory management operation set, normally created by memory managers (e.g. GEM).
>> +///
>> +/// See `kernel::drm::gem` and `kernel::drm::gem::shmem`.
>> +pub struct AllocOps {
>> +    pub(crate) gem_create_object: Option<
>> +        unsafe extern "C" fn(
>> +            dev: *mut bindings::drm_device,
>> +            size: usize,
>> +        ) -> *mut bindings::drm_gem_object,
>> +    >,
>> +    pub(crate) prime_handle_to_fd: Option<
>> +        unsafe extern "C" fn(
>> +            dev: *mut bindings::drm_device,
>> +            file_priv: *mut bindings::drm_file,
>> +            handle: u32,
>> +            flags: u32,
>> +            prime_fd: *mut core::ffi::c_int,
>> +        ) -> core::ffi::c_int,
>> +    >,
>> +    pub(crate) prime_fd_to_handle: Option<
>> +        unsafe extern "C" fn(
>> +            dev: *mut bindings::drm_device,
>> +            file_priv: *mut bindings::drm_file,
>> +            prime_fd: core::ffi::c_int,
>> +            handle: *mut u32,
>> +        ) -> core::ffi::c_int,
>> +    >,
>> +    pub(crate) gem_prime_import: Option<
>> +        unsafe extern "C" fn(
>> +            dev: *mut bindings::drm_device,
>> +            dma_buf: *mut bindings::dma_buf,
>> +        ) -> *mut bindings::drm_gem_object,
>> +    >,
>> +    pub(crate) gem_prime_import_sg_table: Option<
>> +        unsafe extern "C" fn(
>> +            dev: *mut bindings::drm_device,
>> +            attach: *mut bindings::dma_buf_attachment,
>> +            sgt: *mut bindings::sg_table,
>> +        ) -> *mut bindings::drm_gem_object,
>> +    >,
>> +    pub(crate) gem_prime_mmap: Option<
>> +        unsafe extern "C" fn(
>> +            obj: *mut bindings::drm_gem_object,
>> +            vma: *mut bindings::vm_area_struct,
>> +        ) -> core::ffi::c_int,
>> +    >,
>> +    pub(crate) dumb_create: Option<
>> +        unsafe extern "C" fn(
>> +            file_priv: *mut bindings::drm_file,
>> +            dev: *mut bindings::drm_device,
>> +            args: *mut bindings::drm_mode_create_dumb,
>> +        ) -> core::ffi::c_int,
>> +    >,
>> +    pub(crate) dumb_map_offset: Option<
>> +        unsafe extern "C" fn(
>> +            file_priv: *mut bindings::drm_file,
>> +            dev: *mut bindings::drm_device,
>> +            handle: u32,
>> +            offset: *mut u64,
>> +        ) -> core::ffi::c_int,
>> +    >,
>> +    pub(crate) dumb_destroy: Option<
>> +        unsafe extern "C" fn(
>> +            file_priv: *mut bindings::drm_file,
>> +            dev: *mut bindings::drm_device,
>> +            handle: u32,
>> +        ) -> core::ffi::c_int,
>> +    >,
>> +}
>> +

~~ Lina
Maíra Canal March 9, 2023, 12:09 p.m. UTC | #24
On 3/9/23 03:15, Dave Airlie wrote:
> On Thu, 9 Mar 2023 at 15:32, Asahi Lina <lina@asahilina.net> wrote:
>>
>> On 08/03/2023 00.32, Maíra Canal wrote:
>>> On 3/7/23 11:25, Asahi Lina wrote:
>>>> DRM drivers need to be able to declare which driver-specific ioctls they
>>>> support. This abstraction adds the required types and a helper macro to
>>>> generate the ioctl definition inside the DRM driver.
>>>>
>>>> Note that this macro is not usable until further bits of the
>>>> abstraction are in place (but it will not fail to compile on its own, if
>>>> not called).
>>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>    drivers/gpu/drm/Kconfig         |   7 ++
>>>>    rust/bindings/bindings_helper.h |   2 +
>>>>    rust/kernel/drm/ioctl.rs        | 147 ++++++++++++++++++++++++++++++++++++++++
>>>>    rust/kernel/drm/mod.rs          |   5 ++
>>>>    rust/kernel/lib.rs              |   2 +
>>>>    5 files changed, 163 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>>>> index dc0f94f02a82..dab8f0f9aa96 100644
>>>> --- a/drivers/gpu/drm/Kconfig
>>>> +++ b/drivers/gpu/drm/Kconfig
>>>> @@ -27,6 +27,13 @@ menuconfig DRM
>>>>         details.  You should also select and configure AGP
>>>>         (/dev/agpgart) support if it is available for your platform.
>>>>
>>>
>>> [...]
>>>
>>>> +
>>>> +/// Declare the DRM ioctls for a driver.
>>>> +///
>>>> +/// Each entry in the list should have the form:
>>>> +///
>>>> +/// `(ioctl_number, argument_type, flags, user_callback),`
>>>> +///
>>>> +/// `argument_type` is the type name within the `bindings` crate.
>>>> +/// `user_callback` should have the following prototype:
>>>> +///
>>>> +/// ```
>>>> +/// fn foo(device: &kernel::drm::device::Device<Self>,
>>>> +///        data: &mut bindings::argument_type,
>>>> +///        file: &kernel::drm::file::File<Self::File>,
>>>> +/// )
>>>> +/// ```
>>>> +/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within.
>>>> +///
>>>> +/// # Examples
>>>> +///
>>>> +/// ```
>>>> +/// kernel::declare_drm_ioctls! {
>>>> +///     (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler),
>>>> +/// }
>>>> +/// ```
>>>> +///
>>>> +#[macro_export]
>>>> +macro_rules! declare_drm_ioctls {
>>>> +    ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
>>>> +        const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
>>>> +            const _:() = {
>>>> +                let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
>>>> +                // Assert that all the IOCTLs are in the right order and there are no gaps,
>>>> +                // and that the sizeof of the specified type is correct.
>>>
>>> I believe that not necessarily the IOCTLs need to be in the right order and
>>> with no gaps. For example, armada_drm.h has a gap in between 0x00 and
>>> 0x02 and exynos_drm.h also have gaps. Moreover, some drivers, like vgem and
>>> virtgpu, start their IOCTLs with 0x01.
>>
>> Yeah, we talked about this a bit... do you have any ideas about how to
>> design this? I think it should be possible with a const function
>> initializing an array entry by entry, we just need a two-pass macro
>> (once to determine the max ioctl number, then again to actually output
>> the implementation).
>>
>> I'm not sure why drivers would have gaps in the ioctl numbers though...
>> my idea was that new drivers shouldn't need that as far as I can tell
>> (you can't remove APIs after the fact due to UAPI stability guarantees,
>> so as long as you don't have gaps to begin with...). But I guess if
>> we're reimplementing existing drivers in Rust we'll need this... though
>> maybe it makes sense to just say it's not supported and require
>> reimplementations that have holes to just explicitly add dummy ioctls
>> that return EINVAL? We could even provide such a dummy generic ioctl
>> handler on the abstraction side, so drivers just have to add it to the
>> list, or make the macro take a special token that is used for
>> placeholder ioctls that don't exist (which then creates the NULL
>> function pointer that the drm core interprets as invalid)...
> 
> I can think of two reason for gaps having appeared:
> 
> a) developers wanted to group new uapis at a nice base number.
> This is never essential it's just makes things easier to read, and
> allows slotting other ioctls into the gaps later.
> 
> b) parallel feature development ends up conflicting then one thread never lands.
> I've got two-three devs each adding a uAPI, we assign them 0x10, 0x11,
> 0x12 while they work, then 0x11 never lands because it was a bad idea.
> 
> However I think you should be fine enforcing a non-sparse space here
> unless we want to handle replacing current drivers, as long as it's
> hard to screw up so you know early.

I guess it would be nice to support old UAPIs for cases of reimplementations.
Currently, I'm working on a reimplementation of vgem and I ended up having to
create a dummy IOCTL to deal with the sparse number space. Although creating
dummy IOCTLs works, I don't believe it is a nice practice.

Moreover, I believe that if we keep developing new drivers with Rust, cases
(a) and (b) will end up happening, and maybe the Rust abstractions should
work like DRM and allow it to happen.

Best Regards,
- Maíra Canal

> 
> Dave.
Karol Herbst March 9, 2023, 8:39 p.m. UTC | #25
On Thu, Mar 9, 2023 at 9:24 PM Faith Ekstrand
<faith.ekstrand@collabora.com> wrote:
>
> On Thu, 2023-03-09 at 15:04 +0900, Asahi Lina wrote:
> > On 08/03/2023 02.34, Björn Roy Baron wrote:
> > > > +                            // SAFETY: This is just the ioctl
> > > > argument, which hopefully has the right type
> > > > +                            // (we've done our best checking the
> > > > size).
> > >
> > > In the rust tree there is the ReadableFromBytes [1] trait which
> > > indicates that it is safe to read arbitrary bytes into the type.
> > > Maybe you could add it as bound on the argument type when it lands
> > > in rust-next? This way you can't end up with for example a struct
> > > containing a bool with the byte value 2, which is UB.
> >
> > There's actually a much bigger story here, because that trait isn't
> > really very useful without a way to auto-derive it. I need the same
> > kind
> > of guarantee for all the GPU firmware structs...
> >
> > There's one using only declarative macros [1] and one using proc
> > macros
> > [2]. And then, since ioctl arguments are declared in C UAPI header
> > files, we need a way to be able to derive those traits for them...
> > which
> > I guess means bindgen changes?
>
> It'd be cool to be able to auto-verify that uAPI structs are all
> tightly packed and use the right subset of types.  Maybe not possible
> this iteration but it'd be cool to see in future.  I'd like to see it
> for C as well, ideally.
>
> ~Faith
>

I'm sure that with a macro you could verify that a struct definition
doesn't contain any gaps, just not sure on how one would enforce that.
Could add a trait which can only be implemented through a proc_macro?
Maybe we can have a proc_macro ensuring no gaps? Would be cool tech to
have indeed.
Daniel Vetter April 5, 2023, 2:44 p.m. UTC | #26
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
> +/// Look up a GEM object handle for a `File` and return an `ObjectRef` for it.
> +pub(crate) fn lookup_handle(file: &DrmFile, handle: u32) -> Result<ObjectRef> {
> +    Ok(ObjectRef::new(shmem::Object::lookup_handle(file, handle)?))
> +}

So maybe my expectations for rust typing is a bit too much, but I kinda
expected this to be fully generic:

- trait Driver (drm_driver) knows the driver's object type
- a generic create_handle function could ensure that for drm_file (which
  is always for a specific drm_device and hence Driver) can ensure at the
  type level that you only put the right objects into the drm_file
- a generic lookup_handle function on the drm_file knows the Driver trait
  and so can give you back the right type right away.

Why the wrapping, what do I miss?
-Daniel
Daniel Vetter April 6, 2023, 10:42 a.m. UTC | #27
On Thu, Apr 06, 2023 at 01:44:22PM +0900, Asahi Lina wrote:
> On 05/04/2023 23.37, Daniel Vetter wrote:
> > On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
> > > +/// A generic monotonically incrementing ID used to uniquely identify object instances within the
> > > +/// driver.
> > > +pub(crate) struct ID(AtomicU64);
> > > +
> > > +impl ID {
> > > +    /// Create a new ID counter with a given value.
> > > +    fn new(val: u64) -> ID {
> > > +        ID(AtomicU64::new(val))
> > > +    }
> > > +
> > > +    /// Fetch the next unique ID.
> > > +    pub(crate) fn next(&self) -> u64 {
> > > +        self.0.fetch_add(1, Ordering::Relaxed)
> > > +    }
> > > +}
> > 
> > Continuing the theme of me commenting on individual things, I stumbled
> > over this because I noticed that there's a lot of id based lookups where I
> > don't expect them, and started chasing.
> > 
> > - For ids use xarray, not atomic counters. Yes I know dma_fence timelines
> >    gets this wrong, this goes back to an innocent time where we didn't
> >    allocate more than one timeline per engine, and no one fixed it since
> >    then. Yes u64 should be big enough for everyone :-/
> > 
> > - Attaching ID spaces to drm_device is also not great. drm is full of
> >    these mistakes. Much better if their per drm_file and so private to each
> >    client.
> > 
> > - They shouldn't be used for anything else than uapi id -> kernel object
> >    lookup at the beginning of ioctl code, and nowhere else. At least from
> >    skimming it seems like these are used all over the driver codebase,
> >    which does freak me out. At least on the C side that's a clear indicator
> >    for a refcount/lockin/data structure model that's not thought out at
> >    all.
> > 
> > What's going on here, what do I miss?
> 
> These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use
> xarray and are per-File). Most of them are just for debugging, so that when
> I enable full debug spam I have some way to correlate different things that
> are happening together (this subset of interleaved log lines relate to the
> same submission). Basically just object names that are easier to read (and
> less of a security leak) than pointers and guaranteed not to repeat. You
> could get rid of most of them and it wouldn't affect the driver design, it
> just makes it very hard to see what's going on with debug logs ^^;

Hm generally we just print the kernel addresses with the right printk
modifiers. Those filter/hash addresses if you have the right paranoia
settings enabled. I guess throwing in a debug id doesn't hurt, but would
be good to make that a lot more clearer.

I haven't read the full driver yet because I'm still too much lost, that's
why I guess I missed the xarray stuff on the file. I'll try and go
understand that.

For the big topic below I need to think more.
-Daniel
 
> There are only two that are ever used for non-debugging purposes: the VM ID,
> and the File ID. Both are per-device global IDs attached to the VMs (not the
> UAPI VM objects, but rather the underlyng MMU address space managers they
> represent, including the kernel-internal ones) and to Files themselves. They
> are used for destroying GEM objects: since the objects are also
> device-global across multiple clients, I need a way to do things like "clean
> up all mappings for this File" or "clean up all mappings for this VM".
> There's an annoying circular reference between GEM objects and their
> mappings, which is why this is explicitly coded out in destroy paths instead
> of naturally happening via Drop semantics (without that cleanup code, the
> circular reference leaks it).
> 
> So e.g. when a File does a GEM close or explicitly asks for all mappings of
> an object to be removed, it goes out to the (possibly shared) GEM object and
> tells it to drop all mappings marked as owned by that unique File ID. When
> an explicit "unmap all in VM" op happens, it asks the GEM object to drop all
> mappings for that underlying VM ID. Similarly, when a UAPI VM object is
> dropped (in the Drop impl, so both explicitly and when the whole File/xarray
> is dropped and such), that does an explicit unmap of a special dummy object
> it owns which would otherwise leak since it is not tracked as a GEM object
> owned by that File and therefore not handled by GEM closing. And again along
> the same lines, the allocators in alloc.rs explicitly destroy the mappings
> for their backing GEM objects on Drop. All this is due to that annoying
> circular reference between VMs and GEM objects that I'm not sure how to fix.
> 
> Note that if I *don't* do this (or forget to do it somewhere) the
> consequence is just that we leak memory, and if you try to destroy the wrong
> IDs somehow the worst that can happen is you unmap things you shouldn't and
> fault the GPU (or, in the kernel or kernel-managed user VM cases,
> potentially the firmware). Rust safety guarantees still keep things from
> going entirely off the rails within the kernel, since everything that
> matters is reference counted (which is why these reference cycles are
> possible at all).
> 
> This all started when I was looking at the panfrost driver for reference. It
> does the same thing except it uses actual pointers to the owning entities
> instead of IDs, and pointer comparison (see panfrost_gem_close). Of course
> you could try do that in Rust too (literally storing and comparing raw
> pointers that aren't owned references), but then you're introducing a Pin<>
> requirement on those objects to make their addresses stable and it feels way
> more icky and error-prone than unique IDs (since addresses can be reused).
> panfrost only has a single mmu (what I call the raw VM) per File while I
> have an arbitrary number, which is why I end up with the extra
> distinction/complexity of both File and VM IDs, but the concept is the same.
> 
> Some of this is going to be refactored when I implement arbitrary VM range
> mapping/unmapping, which would be a good time to improve this... but is
> there something particularly wrong/broken about the way I'm doing it now
> that I missed? I figured unique u64 IDs would be a pretty safe way to
> identify entities and cleanup the mappings when needed.
> 
> ~~ Lina
> 
> _______________________________________________
> Linaro-mm-sig mailing list -- linaro-mm-sig@lists.linaro.org
> To unsubscribe send an email to linaro-mm-sig-leave@lists.linaro.org
Daniel Vetter April 6, 2023, 11:25 a.m. UTC | #28
On Thu, Apr 06, 2023 at 02:02:55PM +0900, Asahi Lina wrote:
> On 05/04/2023 23.44, Daniel Vetter wrote:
> > On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
> > > +/// Look up a GEM object handle for a `File` and return an `ObjectRef` for it.
> > > +pub(crate) fn lookup_handle(file: &DrmFile, handle: u32) -> Result<ObjectRef> {
> > > +    Ok(ObjectRef::new(shmem::Object::lookup_handle(file, handle)?))
> > > +}
> > 
> > So maybe my expectations for rust typing is a bit too much, but I kinda
> > expected this to be fully generic:
> > 
> > - trait Driver (drm_driver) knows the driver's object type
> > - a generic create_handle function could ensure that for drm_file (which
> >    is always for a specific drm_device and hence Driver) can ensure at the
> >    type level that you only put the right objects into the drm_file
> > - a generic lookup_handle function on the drm_file knows the Driver trait
> >    and so can give you back the right type right away.
> > 
> > Why the wrapping, what do I miss?
> 
> Sigh, so this is one of the many ways I'm trying to work around the "Rust
> doesn't do subclasses" problem (so we can figure out what the best one is
> ^^).
> 
> The generic shmem::Object::lookup_handle() call *is* fully generic and will
> get you back a driver-specific object. But since Rust doesn't do
> subclassing, what you get back isn't a driver-specific type T, but rather a
> (reference to a) shmem::Object<T>. T represents the inner driver-specific
> data/functionality (only), and the outer shmem::Object<T> includes the
> actual drm_gem_shmem_object plus a T. This is backwards from C, where you
> expect the opposite situation where T contains a shmem object, but that just
> doesn't work with Rust because there's no way to build a safe API around
> that model as far as I know.

Ah I think I just got confused. I did untangle (I think at least) the
Object<T> trick, I guess the only thing that confused me here is why this
is in the shmem module? Or is that the rust problem again?

I'd kinda have expected that we'd have a gem::Object<T> here that the
lookup_handle function returns. So for the shmem case I guess that would
then be gem::Object<shmem::Object<T>> for the driver type T with driver
specific stuff? I guess not very pretty ...

> Now the problem is from the higher layers I want object operations that
> interact with the shmem::Object<T> (that is, they call generic GEM functions
> on the object). Options so far:
> 
> 1. Add an outer wrapper and put that functionality there.
> 2. Just have the functions on T as helpers, so you need to call T::foo(obj)
> instead of obj.foo().
> 3. Use the undocumented method receiver trait thing to make shmem::Object<T>
> a valid `self` type, plus add auto-Deref to shmem::Object. Then obj.foo()
> works.
> 
> #1 is what I use here. #2 is how the driver-specific File ioctl callbacks
> are implemented, and also sched::Job<T>. #3 is used for fence callbacks
> (FenceObject<T>). None of them are great, and I'd love to hear what people
> think of the various options...
> 
> There are other unexplored options, like in this GEM case it could be
> covered with a driver-internal auxiliary trait impl'd on shmem::Object<T>
> buuut that doesn't work when you actually need callbacks on T itself to
> circle back to shmem::Object<T>, as is the case with File/Job/FenceObject.

Ok I think I'm completely lost here. But I also havent' looked at how this
is all really used in the driver, it's really just the shmem:: module in
the lookup_handle function which looked strange to me.
-Daniel
Daniel Vetter April 6, 2023, 11:26 a.m. UTC | #29
On Thu, Apr 06, 2023 at 02:09:21PM +0900, Asahi Lina wrote:
> Argh. This (and my other reply) was supposed to go to Daniel, but
> Thunderbird... just dropped that recipient? And then my silly brain saw all
> the Cc:s go to To: and figured it was some weird consolidation and so I
> moved everything to Cc: except the only name that started with "Da" and...
> yeah, that wasn't the same person.
> 
> Sorry for the confusion... I have no idea why Thunderbird hates Daniel...

Don't worry, I get cc'ed on so much stuff that whether I'm cc'ed or not
has zero impact on whether I'll read a mail or not. It just kinda
disappears into the big lable:cc bucket ...
-Daniel

> 
> On 06/04/2023 13.44, Asahi Lina wrote:
> > On 05/04/2023 23.37, Daniel Vetter wrote:
> > > On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
> > > > +/// A generic monotonically incrementing ID used to uniquely identify object instances within the
> > > > +/// driver.
> > > > +pub(crate) struct ID(AtomicU64);
> > > > +
> > > > +impl ID {
> > > > +    /// Create a new ID counter with a given value.
> > > > +    fn new(val: u64) -> ID {
> > > > +        ID(AtomicU64::new(val))
> > > > +    }
> > > > +
> > > > +    /// Fetch the next unique ID.
> > > > +    pub(crate) fn next(&self) -> u64 {
> > > > +        self.0.fetch_add(1, Ordering::Relaxed)
> > > > +    }
> > > > +}
> > > 
> > > Continuing the theme of me commenting on individual things, I stumbled
> > > over this because I noticed that there's a lot of id based lookups where I
> > > don't expect them, and started chasing.
> > > 
> > > - For ids use xarray, not atomic counters. Yes I know dma_fence timelines
> > >     gets this wrong, this goes back to an innocent time where we didn't
> > >     allocate more than one timeline per engine, and no one fixed it since
> > >     then. Yes u64 should be big enough for everyone :-/
> > > 
> > > - Attaching ID spaces to drm_device is also not great. drm is full of
> > >     these mistakes. Much better if their per drm_file and so private to each
> > >     client.
> > > 
> > > - They shouldn't be used for anything else than uapi id -> kernel object
> > >     lookup at the beginning of ioctl code, and nowhere else. At least from
> > >     skimming it seems like these are used all over the driver codebase,
> > >     which does freak me out. At least on the C side that's a clear indicator
> > >     for a refcount/lockin/data structure model that's not thought out at
> > >     all.
> > > 
> > > What's going on here, what do I miss?
> > 
> > These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use
> > xarray and are per-File). Most of them are just for debugging, so that
> > when I enable full debug spam I have some way to correlate different
> > things that are happening together (this subset of interleaved log lines
> > relate to the same submission). Basically just object names that are
> > easier to read (and less of a security leak) than pointers and
> > guaranteed not to repeat. You could get rid of most of them and it
> > wouldn't affect the driver design, it just makes it very hard to see
> > what's going on with debug logs ^^;
> > 
> > There are only two that are ever used for non-debugging purposes: the VM
> > ID, and the File ID. Both are per-device global IDs attached to the VMs
> > (not the UAPI VM objects, but rather the underlyng MMU address space
> > managers they represent, including the kernel-internal ones) and to
> > Files themselves. They are used for destroying GEM objects: since the
> > objects are also device-global across multiple clients, I need a way to
> > do things like "clean up all mappings for this File" or "clean up all
> > mappings for this VM". There's an annoying circular reference between
> > GEM objects and their mappings, which is why this is explicitly coded
> > out in destroy paths instead of naturally happening via Drop semantics
> > (without that cleanup code, the circular reference leaks it).
> > 
> > So e.g. when a File does a GEM close or explicitly asks for all mappings
> > of an object to be removed, it goes out to the (possibly shared) GEM
> > object and tells it to drop all mappings marked as owned by that unique
> > File ID. When an explicit "unmap all in VM" op happens, it asks the GEM
> > object to drop all mappings for that underlying VM ID. Similarly, when a
> > UAPI VM object is dropped (in the Drop impl, so both explicitly and when
> > the whole File/xarray is dropped and such), that does an explicit unmap
> > of a special dummy object it owns which would otherwise leak since it is
> > not tracked as a GEM object owned by that File and therefore not handled
> > by GEM closing. And again along the same lines, the allocators in
> > alloc.rs explicitly destroy the mappings for their backing GEM objects
> > on Drop. All this is due to that annoying circular reference between VMs
> > and GEM objects that I'm not sure how to fix.
> > 
> > Note that if I *don't* do this (or forget to do it somewhere) the
> > consequence is just that we leak memory, and if you try to destroy the
> > wrong IDs somehow the worst that can happen is you unmap things you
> > shouldn't and fault the GPU (or, in the kernel or kernel-managed user VM
> > cases, potentially the firmware). Rust safety guarantees still keep
> > things from going entirely off the rails within the kernel, since
> > everything that matters is reference counted (which is why these
> > reference cycles are possible at all).
> > 
> > This all started when I was looking at the panfrost driver for
> > reference. It does the same thing except it uses actual pointers to the
> > owning entities instead of IDs, and pointer comparison (see
> > panfrost_gem_close). Of course you could try do that in Rust too
> > (literally storing and comparing raw pointers that aren't owned
> > references), but then you're introducing a Pin<> requirement on those
> > objects to make their addresses stable and it feels way more icky and
> > error-prone than unique IDs (since addresses can be reused). panfrost
> > only has a single mmu (what I call the raw VM) per File while I have an
> > arbitrary number, which is why I end up with the extra
> > distinction/complexity of both File and VM IDs, but the concept is the same.
> > 
> > Some of this is going to be refactored when I implement arbitrary VM
> > range mapping/unmapping, which would be a good time to improve this...
> > but is there something particularly wrong/broken about the way I'm doing
> > it now that I missed? I figured unique u64 IDs would be a pretty safe
> > way to identify entities and cleanup the mappings when needed.
> > 
> > ~~ Lina
> > 
> 
> ~~ Lina
>
Daniel Vetter April 6, 2023, 11:55 a.m. UTC | #30
On Thu, Apr 06, 2023 at 01:44:22PM +0900, Asahi Lina wrote:
> On 05/04/2023 23.37, Daniel Vetter wrote:
> > On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
> > > +/// A generic monotonically incrementing ID used to uniquely identify object instances within the
> > > +/// driver.
> > > +pub(crate) struct ID(AtomicU64);
> > > +
> > > +impl ID {
> > > +    /// Create a new ID counter with a given value.
> > > +    fn new(val: u64) -> ID {
> > > +        ID(AtomicU64::new(val))
> > > +    }
> > > +
> > > +    /// Fetch the next unique ID.
> > > +    pub(crate) fn next(&self) -> u64 {
> > > +        self.0.fetch_add(1, Ordering::Relaxed)
> > > +    }
> > > +}
> > 
> > Continuing the theme of me commenting on individual things, I stumbled
> > over this because I noticed that there's a lot of id based lookups where I
> > don't expect them, and started chasing.
> > 
> > - For ids use xarray, not atomic counters. Yes I know dma_fence timelines
> >    gets this wrong, this goes back to an innocent time where we didn't
> >    allocate more than one timeline per engine, and no one fixed it since
> >    then. Yes u64 should be big enough for everyone :-/
> > 
> > - Attaching ID spaces to drm_device is also not great. drm is full of
> >    these mistakes. Much better if their per drm_file and so private to each
> >    client.
> > 
> > - They shouldn't be used for anything else than uapi id -> kernel object
> >    lookup at the beginning of ioctl code, and nowhere else. At least from
> >    skimming it seems like these are used all over the driver codebase,
> >    which does freak me out. At least on the C side that's a clear indicator
> >    for a refcount/lockin/data structure model that's not thought out at
> >    all.
> > 
> > What's going on here, what do I miss?
> 
> These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use
> xarray and are per-File). Most of them are just for debugging, so that when
> I enable full debug spam I have some way to correlate different things that
> are happening together (this subset of interleaved log lines relate to the
> same submission). Basically just object names that are easier to read (and
> less of a security leak) than pointers and guaranteed not to repeat. You
> could get rid of most of them and it wouldn't affect the driver design, it
> just makes it very hard to see what's going on with debug logs ^^;
> 
> There are only two that are ever used for non-debugging purposes: the VM ID,
> and the File ID. Both are per-device global IDs attached to the VMs (not the
> UAPI VM objects, but rather the underlyng MMU address space managers they
> represent, including the kernel-internal ones) and to Files themselves. They
> are used for destroying GEM objects: since the objects are also
> device-global across multiple clients, I need a way to do things like "clean
> up all mappings for this File" or "clean up all mappings for this VM".
> There's an annoying circular reference between GEM objects and their
> mappings, which is why this is explicitly coded out in destroy paths instead
> of naturally happening via Drop semantics (without that cleanup code, the
> circular reference leaks it).
> 
> So e.g. when a File does a GEM close or explicitly asks for all mappings of
> an object to be removed, it goes out to the (possibly shared) GEM object and
> tells it to drop all mappings marked as owned by that unique File ID. When
> an explicit "unmap all in VM" op happens, it asks the GEM object to drop all
> mappings for that underlying VM ID. Similarly, when a UAPI VM object is
> dropped (in the Drop impl, so both explicitly and when the whole File/xarray
> is dropped and such), that does an explicit unmap of a special dummy object
> it owns which would otherwise leak since it is not tracked as a GEM object
> owned by that File and therefore not handled by GEM closing. And again along
> the same lines, the allocators in alloc.rs explicitly destroy the mappings
> for their backing GEM objects on Drop. All this is due to that annoying
> circular reference between VMs and GEM objects that I'm not sure how to fix.
> 
> Note that if I *don't* do this (or forget to do it somewhere) the
> consequence is just that we leak memory, and if you try to destroy the wrong
> IDs somehow the worst that can happen is you unmap things you shouldn't and
> fault the GPU (or, in the kernel or kernel-managed user VM cases,
> potentially the firmware). Rust safety guarantees still keep things from
> going entirely off the rails within the kernel, since everything that
> matters is reference counted (which is why these reference cycles are
> possible at all).
> 
> This all started when I was looking at the panfrost driver for reference. It
> does the same thing except it uses actual pointers to the owning entities
> instead of IDs, and pointer comparison (see panfrost_gem_close). Of course
> you could try do that in Rust too (literally storing and comparing raw
> pointers that aren't owned references), but then you're introducing a Pin<>
> requirement on those objects to make their addresses stable and it feels way
> more icky and error-prone than unique IDs (since addresses can be reused).
> panfrost only has a single mmu (what I call the raw VM) per File while I
> have an arbitrary number, which is why I end up with the extra
> distinction/complexity of both File and VM IDs, but the concept is the same.
> 
> Some of this is going to be refactored when I implement arbitrary VM range
> mapping/unmapping, which would be a good time to improve this... but is
> there something particularly wrong/broken about the way I'm doing it now
> that I missed? I figured unique u64 IDs would be a pretty safe way to
> identify entities and cleanup the mappings when needed.

Ok, some attempt at going through the vm_id/file_id stuff. Extremely
high-level purely informed by having read too many drivers:

First on the drm_file/struct file/file_id. This is the uapi interface
object, and it's refcounted in the vfs, but that's entirely the vfs'
business and none of the driver (or even subsystem). Once userspace has
done the final close() the file is gone, there's no way to ever get
anything meaningfully out of it because userspace dropped it. So if the
driver has any kind of backpointer to that's a design bug, because in all
the place you might want to care (ioctl, fdinfo for schedu stats, any
other file_operations callback) the vfs ensures it stays alive during the
callback and you essentially have a borrowed reference.

I've seen a lot of drivers try to make clever backpointings to stuff
that's essentially tied to the drm_file, and I've not found a single case
that made sense. iow, file_id as a lookup thingie needs to go. In
principle it's the same argument I've made already for the syncobj rust
wrappers. For specific uses I guess I need some rust reading help, but
from your description it sounds like the vm_id is much more the core
piece.

So for that we have the gpu ctx -> vm -> gem_bos chain of reference. Now
on the C side if you have a modern driver that uses the
vm_bind/unbind/gpuva manager approach, the reference counts go in that
single direction only, anything else is essentially borrowed references
under protection of a mutex/lock or similar thing (for e.g. going from the
bo to the vm for eviction).

In addition to the above chain the xarray in the drm_file also holds
references to each of these. So far so good, in the drm_file ->postclose
callback you just walk the xarrays and drop all the references, and
everything gets cleaned up, at least in the C world.

Aside: I'm ignoring the entire sched/job/gpu-ctx side because that's a
separate can of worms and big other threads floating around already.

But if either due to the uabi being a bit more legacy, or Rust requiring
that the backpointers are reference-counted from the gem_bo->vma->vm and
can't follow borrow semantics (afaiui the usual linux list_head pattern of
walking the list under a lock giving you a borrowed reference for each
element doesn't work too well in rust?) then that's not a problem, you can
still all clean it out:

- The key bit is that your vm struct needs both a refcount like kref and
  a separate open count. Each gpu ctx and the xarray for vm objects in
  drm_file hold _both_ the kref and the open refcount (in rust the open
  refcount implies the Arc or things go sideways).

- the other key bit is that drm_file ->postclose does _not_ have simple
  Drop semantics, it's more explicit.

- in the drm_file lastclose you first walk all the gpu ctx. The simplest
  semantics is that close() synchronously tears down all leftover gpu ctx,
  i.e. you unload them from the gpu. Details are under a lot of discussion
  in the various scheduler threads, but essentially this should ensure
  that the gpu ctx destruction completely removes all references to the
  ctx. If instead you have the legacy problem of apps expecting that
  rendering continues even if they called exit() before it finishes, then
  it gets more messy. I have no idea whether that's still a problem for
  new drivers or can be avoided.

- Next up you do the same thing for the vm xarray (which drops both the
  kref an open refcounts).

- At this point there might still be a ton of vm objects around with
  elevated kref. Except not, because at this point the open refcount of
  each vm should have dropped to zero. When that happens the vm object
  itself is still alive, plus even better for rust, you are in the
  vm_close(vm) function call so you have a full borrowed reference to
  that. Which means you can walk the entire address space and unmap
  everything explicit. Which should get rid of any gem_bo->vma->vm
  backpointers you have lying around.

- At that point all your vm objects are gone too, because the kref managed
  backpointers are gone.

- You walk the xarray of gem_bo (well the drm subsystem does that for
  you), which cleans out the reamining references to gem_bo. Only the
  gem_bo which are shared with other process or have a dma_buf will
  survive, like they should.

No leak, no funky driver-internal vm_id based lookup, and with rust we
should even be able to guarantee you never mix up Arc<Vm> with OpenRef<Vm>
(or however that exactly works in rust types, I have not much real clue).

If you have any other functional needs for vm_id then I guess I need to go
through them, but they should be all fixable.
-Daniel
Asahi Lina April 6, 2023, 1:15 p.m. UTC | #31
On 06/04/2023 20.55, Daniel Vetter wrote:
> On Thu, Apr 06, 2023 at 01:44:22PM +0900, Asahi Lina wrote:
>> On 05/04/2023 23.37, Daniel Vetter wrote:
>>> On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
>>>> +/// A generic monotonically incrementing ID used to uniquely identify object instances within the
>>>> +/// driver.
>>>> +pub(crate) struct ID(AtomicU64);
>>>> +
>>>> +impl ID {
>>>> +    /// Create a new ID counter with a given value.
>>>> +    fn new(val: u64) -> ID {
>>>> +        ID(AtomicU64::new(val))
>>>> +    }
>>>> +
>>>> +    /// Fetch the next unique ID.
>>>> +    pub(crate) fn next(&self) -> u64 {
>>>> +        self.0.fetch_add(1, Ordering::Relaxed)
>>>> +    }
>>>> +}
>>>
>>> Continuing the theme of me commenting on individual things, I stumbled
>>> over this because I noticed that there's a lot of id based lookups where I
>>> don't expect them, and started chasing.
>>>
>>> - For ids use xarray, not atomic counters. Yes I know dma_fence timelines
>>>     gets this wrong, this goes back to an innocent time where we didn't
>>>     allocate more than one timeline per engine, and no one fixed it since
>>>     then. Yes u64 should be big enough for everyone :-/
>>>
>>> - Attaching ID spaces to drm_device is also not great. drm is full of
>>>     these mistakes. Much better if their per drm_file and so private to each
>>>     client.
>>>
>>> - They shouldn't be used for anything else than uapi id -> kernel object
>>>     lookup at the beginning of ioctl code, and nowhere else. At least from
>>>     skimming it seems like these are used all over the driver codebase,
>>>     which does freak me out. At least on the C side that's a clear indicator
>>>     for a refcount/lockin/data structure model that's not thought out at
>>>     all.
>>>
>>> What's going on here, what do I miss?
>>
>> These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use
>> xarray and are per-File). Most of them are just for debugging, so that when
>> I enable full debug spam I have some way to correlate different things that
>> are happening together (this subset of interleaved log lines relate to the
>> same submission). Basically just object names that are easier to read (and
>> less of a security leak) than pointers and guaranteed not to repeat. You
>> could get rid of most of them and it wouldn't affect the driver design, it
>> just makes it very hard to see what's going on with debug logs ^^;
>>
>> There are only two that are ever used for non-debugging purposes: the VM ID,
>> and the File ID. Both are per-device global IDs attached to the VMs (not the
>> UAPI VM objects, but rather the underlyng MMU address space managers they
>> represent, including the kernel-internal ones) and to Files themselves. They
>> are used for destroying GEM objects: since the objects are also
>> device-global across multiple clients, I need a way to do things like "clean
>> up all mappings for this File" or "clean up all mappings for this VM".
>> There's an annoying circular reference between GEM objects and their
>> mappings, which is why this is explicitly coded out in destroy paths instead
>> of naturally happening via Drop semantics (without that cleanup code, the
>> circular reference leaks it).
>>
>> So e.g. when a File does a GEM close or explicitly asks for all mappings of
>> an object to be removed, it goes out to the (possibly shared) GEM object and
>> tells it to drop all mappings marked as owned by that unique File ID. When
>> an explicit "unmap all in VM" op happens, it asks the GEM object to drop all
>> mappings for that underlying VM ID. Similarly, when a UAPI VM object is
>> dropped (in the Drop impl, so both explicitly and when the whole File/xarray
>> is dropped and such), that does an explicit unmap of a special dummy object
>> it owns which would otherwise leak since it is not tracked as a GEM object
>> owned by that File and therefore not handled by GEM closing. And again along
>> the same lines, the allocators in alloc.rs explicitly destroy the mappings
>> for their backing GEM objects on Drop. All this is due to that annoying
>> circular reference between VMs and GEM objects that I'm not sure how to fix.
>>
>> Note that if I *don't* do this (or forget to do it somewhere) the
>> consequence is just that we leak memory, and if you try to destroy the wrong
>> IDs somehow the worst that can happen is you unmap things you shouldn't and
>> fault the GPU (or, in the kernel or kernel-managed user VM cases,
>> potentially the firmware). Rust safety guarantees still keep things from
>> going entirely off the rails within the kernel, since everything that
>> matters is reference counted (which is why these reference cycles are
>> possible at all).
>>
>> This all started when I was looking at the panfrost driver for reference. It
>> does the same thing except it uses actual pointers to the owning entities
>> instead of IDs, and pointer comparison (see panfrost_gem_close). Of course
>> you could try do that in Rust too (literally storing and comparing raw
>> pointers that aren't owned references), but then you're introducing a Pin<>
>> requirement on those objects to make their addresses stable and it feels way
>> more icky and error-prone than unique IDs (since addresses can be reused).
>> panfrost only has a single mmu (what I call the raw VM) per File while I
>> have an arbitrary number, which is why I end up with the extra
>> distinction/complexity of both File and VM IDs, but the concept is the same.
>>
>> Some of this is going to be refactored when I implement arbitrary VM range
>> mapping/unmapping, which would be a good time to improve this... but is
>> there something particularly wrong/broken about the way I'm doing it now
>> that I missed? I figured unique u64 IDs would be a pretty safe way to
>> identify entities and cleanup the mappings when needed.
> 
> Ok, some attempt at going through the vm_id/file_id stuff. Extremely
> high-level purely informed by having read too many drivers:
> 
> First on the drm_file/struct file/file_id. This is the uapi interface
> object, and it's refcounted in the vfs, but that's entirely the vfs'
> business and none of the driver (or even subsystem). Once userspace has
> done the final close() the file is gone, there's no way to ever get
> anything meaningfully out of it because userspace dropped it. So if the
> driver has any kind of backpointer to that's a design bug, because in all
> the place you might want to care (ioctl, fdinfo for schedu stats, any
> other file_operations callback) the vfs ensures it stays alive during the
> callback and you essentially have a borrowed reference.

Right, there's none of that for the File, and it is not refcounted 
itself. Certainly there are no direct references, and as for the IDs: 
the IDs of relevant Files live in GEM objects that hold mappings owned 
by that file. As part of File close all the GEM objects get closed, 
which removes those mappings. So by the time the File goes away there 
should be no references to its ID anywhere (other than if I stashed some 
away for debugging, I forget whether I did in some child object).

If this process breaks for some reason (say, stray mappings remain 
indexed to a File ID that is gone), that means we leak the mappings, 
which leaks the GEM objects themselves and the VM they are mapped to. 
Not great but not fireworks either. As long as the DRM core properly 
calls the GEM close callback on everything before calling the File close 
callback though, that shouldn't happen.

> I've seen a lot of drivers try to make clever backpointings to stuff
> that's essentially tied to the drm_file, and I've not found a single case
> that made sense. iow, file_id as a lookup thingie needs to go. In
> principle it's the same argument I've made already for the syncobj rust
> wrappers. For specific uses I guess I need some rust reading help, but
> from your description it sounds like the vm_id is much more the core
> piece.

The file ID is simply how GEM mappings are identified as belonging to an 
active file within the mapping list of an object. GEM object close is 
literally the only place this ID is ever used for anything other than 
passing around:

/// Callback to drop all mappings for a GEM object owned by a given `File`
fn close(obj: &Object, file: &DrmFile) {
     mod_pr_debug!("DriverObject::close vm_id={:?} id={}\n", obj.vm_id, 
obj.id);
     obj.drop_file_mappings(file.inner().file_id());
}

I could also just iterate through the VM XArray for the File and drop 
mappings one VM at a time instead of doing all of them in one go, it's 
just slightly more cumbersome (though potentially less code because I 
could get rid of all the forwarding the file_id I do now).

On the other hand, once we implement arbitrary VM maps, I suspect this 
is going to go away anyway with the new design, so I'm not really very 
inclined to fix it until that happens... ^^

> So for that we have the gpu ctx -> vm -> gem_bos chain of reference. Now
> on the C side if you have a modern driver that uses the
> vm_bind/unbind/gpuva manager approach, the reference counts go in that
> single direction only, anything else is essentially borrowed references
> under protection of a mutex/lock or similar thing (for e.g. going from the
> bo to the vm for eviction).

Right, so that is what is going to change with the pending refactor. 
What I have right now is a design that used to be the old driver-managed 
VM design (and still retains part of that for kernel-managed objects) 
for the old synchronous demo UAPI, that I then shoehorned into the 
redesigned vm_bind UAPI by just not supporting the interesting cases 
(partial maps/unmaps/remaps, etc.). This is all temporary, it's just to 
get us by for now since OpenGL doesn't need it and there is no usable 
Vulkan driver that cares yet... I wanted to focus on the explicit sync 
and general sched/queuing part of the new UAPI before I got to the VM 
bind stuff, since I figured that would be more interesting (and pulls in 
all the new abstractions, plus major perf benefit). So the UAPI itself 
has vm_bind but only the "easy" subset of cases are supported by the 
driver (whole object maps/unmaps) and the refcounting is still backwards.

As I said this originally came from the Panfrost design that doesn't 
have vm_bind but instead keeps a list of mappings with pointer equality 
checks in BOs... so that's why ^^

Thanks for explaining the design approach though, it's roughly what I 
had in mind but it's good to hear I'm on the right track! I'd love to go 
into more detail about how to implement vm_bind if you have time though 
(maybe a meeting?). In particular things like using the mm allocator to 
keep track of mapping ranges and supporting splitting and all that.

> In addition to the above chain the xarray in the drm_file also holds
> references to each of these. So far so good, in the drm_file ->postclose
> callback you just walk the xarrays and drop all the references, and
> everything gets cleaned up, at least in the C world.

In the Rust world you just do nothing since the XArray abstraction knows 
how to drop all of its contained objects!

> But if either due to the uabi being a bit more legacy, or Rust requiring
> that the backpointers are reference-counted from the gem_bo->vma->vm and
> can't follow borrow semantics (afaiui the usual linux list_head pattern of
> walking the list under a lock giving you a borrowed reference for each
> element doesn't work too well in rust?) then that's not a problem, you can
> still all clean it out:
> 
> - The key bit is that your vm struct needs both a refcount like kref and
>    a separate open count. Each gpu ctx and the xarray for vm objects in
>    drm_file hold _both_ the kref and the open refcount (in rust the open
>    refcount implies the Arc or things go sideways).
> 
> - the other key bit is that drm_file ->postclose does _not_ have simple
>    Drop semantics, it's more explicit.
> 
> - in the drm_file lastclose you first walk all the gpu ctx. The simplest
>    semantics is that close() synchronously tears down all leftover gpu ctx,
>    i.e. you unload them from the gpu. Details are under a lot of discussion
>    in the various scheduler threads, but essentially this should ensure
>    that the gpu ctx destruction completely removes all references to the
>    ctx. If instead you have the legacy problem of apps expecting that
>    rendering continues even if they called exit() before it finishes, then
>    it gets more messy. I have no idea whether that's still a problem for
>    new drivers or can be avoided.
> 
> - Next up you do the same thing for the vm xarray (which drops both the
>    kref an open refcounts).
> 
> - At this point there might still be a ton of vm objects around with
>    elevated kref. Except not, because at this point the open refcount of
>    each vm should have dropped to zero. When that happens the vm object
>    itself is still alive, plus even better for rust, you are in the
>    vm_close(vm) function call so you have a full borrowed reference to
>    that. Which means you can walk the entire address space and unmap
>    everything explicit. Which should get rid of any gem_bo->vma->vm
>    backpointers you have lying around.
> 
> - At that point all your vm objects are gone too, because the kref managed
>    backpointers are gone.
> 
> - You walk the xarray of gem_bo (well the drm subsystem does that for
>    you), which cleans out the reamining references to gem_bo. Only the
>    gem_bo which are shared with other process or have a dma_buf will
>    survive, like they should.
> 
> No leak, no funky driver-internal vm_id based lookup, and with rust we
> should even be able to guarantee you never mix up Arc<Vm> with OpenRef<Vm>
> (or however that exactly works in rust types, I have not much real clue).

That would totally work, and actually I already use somewhat analogous 
mechanisms in other places like firmware queues!

If this all weren't getting turned on its head for the new VM management 
I'd implement it, but hopefully we can agree there's not much point 
right now... I'd rather focus on the DRM abstraction design and work on 
improving the driver in parallel right now, and then about one kernel 
cycle or so from now it should definitely be in a better place for 
review. Honestly, there are bigger design problems with the driver right 
now than these IDs (that I already know about)... so I want to focus 
more on the abstractions and their usage right now than the internal 
driver design which I *know* has problems ^^

Rust is really good at getting you to come up with a *safe* design as 
far as memory and ownership, but that doesn't mean it's perfectly clean 
code and more importantly it does nothing for deadlocks and allocating 
in the wrong paths and getting resource allocation semantics right etc 
etc. The GPU FW queue stuff is at the very least due for another major 
refactor/cleanup to defer resource allocation and actual queuing to job 
prepare/run time (right now there's some horrible hacks to do it upfront 
at submit because I don't have a mechanism to back-patch job structures 
with those resource IDs later at exec time, but I want to add that), and 
along the way I can also fix the using job fences to block on pending 
job count thing that Christian really wants me to do instead of the 
can_run_job thing, and then getting all this resource stuff truly right 
is also going to mean eventually using fences to handle blocking on 
resource exhaustion too (though maybe I can get away with implementing 
that a bit later)...

The driver works stupidly well for how quickly I wrote it, but it still 
has all these rough edges that definitely need fixing before it's 
something I could say I'm happy with... I'm sure if you start hammering 
it with evil workloads you will hit some of its current problems (like I 
did yesterday with the deadlocks on GpuContext inval). I also need to 
learn more about the subtleties of fence signaling and all that, 
especially once a shrinker comes into play...

~~ Lina
Asahi Lina April 6, 2023, 1:32 p.m. UTC | #32
On 06/04/2023 20.25, Daniel Vetter wrote:
> On Thu, Apr 06, 2023 at 02:02:55PM +0900, Asahi Lina wrote:
>> On 05/04/2023 23.44, Daniel Vetter wrote:
>>> On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
>>>> +/// Look up a GEM object handle for a `File` and return an `ObjectRef` for it.
>>>> +pub(crate) fn lookup_handle(file: &DrmFile, handle: u32) -> Result<ObjectRef> {
>>>> +    Ok(ObjectRef::new(shmem::Object::lookup_handle(file, handle)?))
>>>> +}
>>>
>>> So maybe my expectations for rust typing is a bit too much, but I kinda
>>> expected this to be fully generic:
>>>
>>> - trait Driver (drm_driver) knows the driver's object type
>>> - a generic create_handle function could ensure that for drm_file (which
>>>     is always for a specific drm_device and hence Driver) can ensure at the
>>>     type level that you only put the right objects into the drm_file
>>> - a generic lookup_handle function on the drm_file knows the Driver trait
>>>     and so can give you back the right type right away.
>>>
>>> Why the wrapping, what do I miss?
>>
>> Sigh, so this is one of the many ways I'm trying to work around the "Rust
>> doesn't do subclasses" problem (so we can figure out what the best one is
>> ^^).
>>
>> The generic shmem::Object::lookup_handle() call *is* fully generic and will
>> get you back a driver-specific object. But since Rust doesn't do
>> subclassing, what you get back isn't a driver-specific type T, but rather a
>> (reference to a) shmem::Object<T>. T represents the inner driver-specific
>> data/functionality (only), and the outer shmem::Object<T> includes the
>> actual drm_gem_shmem_object plus a T. This is backwards from C, where you
>> expect the opposite situation where T contains a shmem object, but that just
>> doesn't work with Rust because there's no way to build a safe API around
>> that model as far as I know.
> 
> Ah I think I just got confused. I did untangle (I think at least) the
> Object<T> trick, I guess the only thing that confused me here is why this
> is in the shmem module? Or is that the rust problem again?
> 
> I'd kinda have expected that we'd have a gem::Object<T> here that the
> lookup_handle function returns. So for the shmem case I guess that would
> then be gem::Object<shmem::Object<T>> for the driver type T with driver
> specific stuff? I guess not very pretty ...

Ahh, uh... Yeah, so shmem objects are allocated their own way (the shmem 
core expects to kfree them in drm_gem_shmem_free) and 
bindings::drm_gem_shmem_object already contains a 
bindings::drm_gem_object. Since the composition is already done in the C 
side, we can't just do it again in Rust cleanly. That's why I have this 
weird setup with both a common trait for common GEM functionality and 
separate actual types that both implement it.

Honestly the whole GEM codepath is untested other than the bits 
inherited by shmem. I'm not sure I'll be able to verify that this all 
makes sense until another Rust driver comes along that needs something 
other than shmem. I just felt I had to do *something* for GEM since the 
hierarchy is there and I needed shmem...

This whole gem stuff is IMO the messiest part of the abstractions 
though, so I'm happy to turn it on its head if it makes it better and 
someone has an idea of how to do that ^^

>> Now the problem is from the higher layers I want object operations that
>> interact with the shmem::Object<T> (that is, they call generic GEM functions
>> on the object). Options so far:
>>
>> 1. Add an outer wrapper and put that functionality there.
>> 2. Just have the functions on T as helpers, so you need to call T::foo(obj)
>> instead of obj.foo().
>> 3. Use the undocumented method receiver trait thing to make shmem::Object<T>
>> a valid `self` type, plus add auto-Deref to shmem::Object. Then obj.foo()
>> works.
>>
>> #1 is what I use here. #2 is how the driver-specific File ioctl callbacks
>> are implemented, and also sched::Job<T>. #3 is used for fence callbacks
>> (FenceObject<T>). None of them are great, and I'd love to hear what people
>> think of the various options...
>>
>> There are other unexplored options, like in this GEM case it could be
>> covered with a driver-internal auxiliary trait impl'd on shmem::Object<T>
>> buuut that doesn't work when you actually need callbacks on T itself to
>> circle back to shmem::Object<T>, as is the case with File/Job/FenceObject.
> 
> Ok I think I'm completely lost here. But I also havent' looked at how this
> is all really used in the driver, it's really just the shmem:: module in
> the lookup_handle function which looked strange to me.

Ah, sorry, I misunderstood what you were talking about in my previous 
email then. That's just a default trait function. It comes from common 
functionality in the gem module, but shmem::Object implements the trait 
so it ends up offering it too (lookup_handle() is not duplicated, it 
only lives in gem, shmem only has to implement going to/from the 
drm_gem_object pointer so the rest of the methods can use it). That's 
part of why the type/trait hierarchy is kind of messy here, it's so I 
can share functionality between both types even though they are 
pre-composed on the C side.

In the end the object types are specialized for any given driver, so 
you're always getting your own unique kind of object anyway. It's just 
that drivers based on shmem will go through it to reach the common code 
and work with a shmem::Object<T>, and drivers using raw gem will use 
gem::Object<T> instead.

~~ Lina
Daniel Vetter April 6, 2023, 1:48 p.m. UTC | #33
On Thu, Apr 06, 2023 at 10:15:56PM +0900, Asahi Lina wrote:
> On 06/04/2023 20.55, Daniel Vetter wrote:
> > On Thu, Apr 06, 2023 at 01:44:22PM +0900, Asahi Lina wrote:
> > > On 05/04/2023 23.37, Daniel Vetter wrote:
> > > > On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
> > > > > +/// A generic monotonically incrementing ID used to uniquely identify object instances within the
> > > > > +/// driver.
> > > > > +pub(crate) struct ID(AtomicU64);
> > > > > +
> > > > > +impl ID {
> > > > > +    /// Create a new ID counter with a given value.
> > > > > +    fn new(val: u64) -> ID {
> > > > > +        ID(AtomicU64::new(val))
> > > > > +    }
> > > > > +
> > > > > +    /// Fetch the next unique ID.
> > > > > +    pub(crate) fn next(&self) -> u64 {
> > > > > +        self.0.fetch_add(1, Ordering::Relaxed)
> > > > > +    }
> > > > > +}
> > > > 
> > > > Continuing the theme of me commenting on individual things, I stumbled
> > > > over this because I noticed that there's a lot of id based lookups where I
> > > > don't expect them, and started chasing.
> > > > 
> > > > - For ids use xarray, not atomic counters. Yes I know dma_fence timelines
> > > >     gets this wrong, this goes back to an innocent time where we didn't
> > > >     allocate more than one timeline per engine, and no one fixed it since
> > > >     then. Yes u64 should be big enough for everyone :-/
> > > > 
> > > > - Attaching ID spaces to drm_device is also not great. drm is full of
> > > >     these mistakes. Much better if their per drm_file and so private to each
> > > >     client.
> > > > 
> > > > - They shouldn't be used for anything else than uapi id -> kernel object
> > > >     lookup at the beginning of ioctl code, and nowhere else. At least from
> > > >     skimming it seems like these are used all over the driver codebase,
> > > >     which does freak me out. At least on the C side that's a clear indicator
> > > >     for a refcount/lockin/data structure model that's not thought out at
> > > >     all.
> > > > 
> > > > What's going on here, what do I miss?
> > > 
> > > These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use
> > > xarray and are per-File). Most of them are just for debugging, so that when
> > > I enable full debug spam I have some way to correlate different things that
> > > are happening together (this subset of interleaved log lines relate to the
> > > same submission). Basically just object names that are easier to read (and
> > > less of a security leak) than pointers and guaranteed not to repeat. You
> > > could get rid of most of them and it wouldn't affect the driver design, it
> > > just makes it very hard to see what's going on with debug logs ^^;
> > > 
> > > There are only two that are ever used for non-debugging purposes: the VM ID,
> > > and the File ID. Both are per-device global IDs attached to the VMs (not the
> > > UAPI VM objects, but rather the underlyng MMU address space managers they
> > > represent, including the kernel-internal ones) and to Files themselves. They
> > > are used for destroying GEM objects: since the objects are also
> > > device-global across multiple clients, I need a way to do things like "clean
> > > up all mappings for this File" or "clean up all mappings for this VM".
> > > There's an annoying circular reference between GEM objects and their
> > > mappings, which is why this is explicitly coded out in destroy paths instead
> > > of naturally happening via Drop semantics (without that cleanup code, the
> > > circular reference leaks it).
> > > 
> > > So e.g. when a File does a GEM close or explicitly asks for all mappings of
> > > an object to be removed, it goes out to the (possibly shared) GEM object and
> > > tells it to drop all mappings marked as owned by that unique File ID. When
> > > an explicit "unmap all in VM" op happens, it asks the GEM object to drop all
> > > mappings for that underlying VM ID. Similarly, when a UAPI VM object is
> > > dropped (in the Drop impl, so both explicitly and when the whole File/xarray
> > > is dropped and such), that does an explicit unmap of a special dummy object
> > > it owns which would otherwise leak since it is not tracked as a GEM object
> > > owned by that File and therefore not handled by GEM closing. And again along
> > > the same lines, the allocators in alloc.rs explicitly destroy the mappings
> > > for their backing GEM objects on Drop. All this is due to that annoying
> > > circular reference between VMs and GEM objects that I'm not sure how to fix.
> > > 
> > > Note that if I *don't* do this (or forget to do it somewhere) the
> > > consequence is just that we leak memory, and if you try to destroy the wrong
> > > IDs somehow the worst that can happen is you unmap things you shouldn't and
> > > fault the GPU (or, in the kernel or kernel-managed user VM cases,
> > > potentially the firmware). Rust safety guarantees still keep things from
> > > going entirely off the rails within the kernel, since everything that
> > > matters is reference counted (which is why these reference cycles are
> > > possible at all).
> > > 
> > > This all started when I was looking at the panfrost driver for reference. It
> > > does the same thing except it uses actual pointers to the owning entities
> > > instead of IDs, and pointer comparison (see panfrost_gem_close). Of course
> > > you could try do that in Rust too (literally storing and comparing raw
> > > pointers that aren't owned references), but then you're introducing a Pin<>
> > > requirement on those objects to make their addresses stable and it feels way
> > > more icky and error-prone than unique IDs (since addresses can be reused).
> > > panfrost only has a single mmu (what I call the raw VM) per File while I
> > > have an arbitrary number, which is why I end up with the extra
> > > distinction/complexity of both File and VM IDs, but the concept is the same.
> > > 
> > > Some of this is going to be refactored when I implement arbitrary VM range
> > > mapping/unmapping, which would be a good time to improve this... but is
> > > there something particularly wrong/broken about the way I'm doing it now
> > > that I missed? I figured unique u64 IDs would be a pretty safe way to
> > > identify entities and cleanup the mappings when needed.
> > 
> > Ok, some attempt at going through the vm_id/file_id stuff. Extremely
> > high-level purely informed by having read too many drivers:
> > 
> > First on the drm_file/struct file/file_id. This is the uapi interface
> > object, and it's refcounted in the vfs, but that's entirely the vfs'
> > business and none of the driver (or even subsystem). Once userspace has
> > done the final close() the file is gone, there's no way to ever get
> > anything meaningfully out of it because userspace dropped it. So if the
> > driver has any kind of backpointer to that's a design bug, because in all
> > the place you might want to care (ioctl, fdinfo for schedu stats, any
> > other file_operations callback) the vfs ensures it stays alive during the
> > callback and you essentially have a borrowed reference.
> 
> Right, there's none of that for the File, and it is not refcounted itself.
> Certainly there are no direct references, and as for the IDs: the IDs of
> relevant Files live in GEM objects that hold mappings owned by that file. As
> part of File close all the GEM objects get closed, which removes those
> mappings. So by the time the File goes away there should be no references to
> its ID anywhere (other than if I stashed some away for debugging, I forget
> whether I did in some child object).
> 
> If this process breaks for some reason (say, stray mappings remain indexed
> to a File ID that is gone), that means we leak the mappings, which leaks the
> GEM objects themselves and the VM they are mapped to. Not great but not
> fireworks either. As long as the DRM core properly calls the GEM close
> callback on everything before calling the File close callback though, that
> shouldn't happen.
> 
> > I've seen a lot of drivers try to make clever backpointings to stuff
> > that's essentially tied to the drm_file, and I've not found a single case
> > that made sense. iow, file_id as a lookup thingie needs to go. In
> > principle it's the same argument I've made already for the syncobj rust
> > wrappers. For specific uses I guess I need some rust reading help, but
> > from your description it sounds like the vm_id is much more the core
> > piece.
> 
> The file ID is simply how GEM mappings are identified as belonging to an
> active file within the mapping list of an object. GEM object close is
> literally the only place this ID is ever used for anything other than
> passing around:
> 
> /// Callback to drop all mappings for a GEM object owned by a given `File`
> fn close(obj: &Object, file: &DrmFile) {
>     mod_pr_debug!("DriverObject::close vm_id={:?} id={}\n", obj.vm_id,
> obj.id);
>     obj.drop_file_mappings(file.inner().file_id());
> }
> 
> I could also just iterate through the VM XArray for the File and drop
> mappings one VM at a time instead of doing all of them in one go, it's just
> slightly more cumbersome (though potentially less code because I could get
> rid of all the forwarding the file_id I do now).
> 
> On the other hand, once we implement arbitrary VM maps, I suspect this is
> going to go away anyway with the new design, so I'm not really very inclined
> to fix it until that happens... ^^

Yeah the driver-managed vm needs a bunch more reference loops and gets
awkward fast. the gpuva library might need to keep support for that, but I
really hope it's not needed.

> > So for that we have the gpu ctx -> vm -> gem_bos chain of reference. Now
> > on the C side if you have a modern driver that uses the
> > vm_bind/unbind/gpuva manager approach, the reference counts go in that
> > single direction only, anything else is essentially borrowed references
> > under protection of a mutex/lock or similar thing (for e.g. going from the
> > bo to the vm for eviction).
> 
> Right, so that is what is going to change with the pending refactor. What I
> have right now is a design that used to be the old driver-managed VM design
> (and still retains part of that for kernel-managed objects) for the old
> synchronous demo UAPI, that I then shoehorned into the redesigned vm_bind
> UAPI by just not supporting the interesting cases (partial
> maps/unmaps/remaps, etc.). This is all temporary, it's just to get us by for
> now since OpenGL doesn't need it and there is no usable Vulkan driver that
> cares yet... I wanted to focus on the explicit sync and general
> sched/queuing part of the new UAPI before I got to the VM bind stuff, since
> I figured that would be more interesting (and pulls in all the new
> abstractions, plus major perf benefit). So the UAPI itself has vm_bind but
> only the "easy" subset of cases are supported by the driver (whole object
> maps/unmaps) and the refcounting is still backwards.
> 
> As I said this originally came from the Panfrost design that doesn't have
> vm_bind but instead keeps a list of mappings with pointer equality checks in
> BOs... so that's why ^^
> 
> Thanks for explaining the design approach though, it's roughly what I had in
> mind but it's good to hear I'm on the right track! I'd love to go into more
> detail about how to implement vm_bind if you have time though (maybe a
> meeting?). In particular things like using the mm allocator to keep track of
> mapping ranges and supporting splitting and all that.

Yeah vm_bind sounds like a good topic to discuss. I don't think we'll get
all the pieces aligned to land that before asahi, but the driver internals
should at least match wrt semantics with that so that the refactoring
isn't total pain.

> > In addition to the above chain the xarray in the drm_file also holds
> > references to each of these. So far so good, in the drm_file ->postclose
> > callback you just walk the xarrays and drop all the references, and
> > everything gets cleaned up, at least in the C world.
> 
> In the Rust world you just do nothing since the XArray abstraction knows how
> to drop all of its contained objects!

Yeah xarray should work with Drop, but I guess you need a special
uapi/open-reference object that knows that it needs to perform additional
cleanup (like quiescent the gpu ctx or unamp everything for the vm).

> > But if either due to the uabi being a bit more legacy, or Rust requiring
> > that the backpointers are reference-counted from the gem_bo->vma->vm and
> > can't follow borrow semantics (afaiui the usual linux list_head pattern of
> > walking the list under a lock giving you a borrowed reference for each
> > element doesn't work too well in rust?) then that's not a problem, you can
> > still all clean it out:
> > 
> > - The key bit is that your vm struct needs both a refcount like kref and
> >    a separate open count. Each gpu ctx and the xarray for vm objects in
> >    drm_file hold _both_ the kref and the open refcount (in rust the open
> >    refcount implies the Arc or things go sideways).
> > 
> > - the other key bit is that drm_file ->postclose does _not_ have simple
> >    Drop semantics, it's more explicit.
> > 
> > - in the drm_file lastclose you first walk all the gpu ctx. The simplest
> >    semantics is that close() synchronously tears down all leftover gpu ctx,
> >    i.e. you unload them from the gpu. Details are under a lot of discussion
> >    in the various scheduler threads, but essentially this should ensure
> >    that the gpu ctx destruction completely removes all references to the
> >    ctx. If instead you have the legacy problem of apps expecting that
> >    rendering continues even if they called exit() before it finishes, then
> >    it gets more messy. I have no idea whether that's still a problem for
> >    new drivers or can be avoided.
> > 
> > - Next up you do the same thing for the vm xarray (which drops both the
> >    kref an open refcounts).
> > 
> > - At this point there might still be a ton of vm objects around with
> >    elevated kref. Except not, because at this point the open refcount of
> >    each vm should have dropped to zero. When that happens the vm object
> >    itself is still alive, plus even better for rust, you are in the
> >    vm_close(vm) function call so you have a full borrowed reference to
> >    that. Which means you can walk the entire address space and unmap
> >    everything explicit. Which should get rid of any gem_bo->vma->vm
> >    backpointers you have lying around.
> > 
> > - At that point all your vm objects are gone too, because the kref managed
> >    backpointers are gone.
> > 
> > - You walk the xarray of gem_bo (well the drm subsystem does that for
> >    you), which cleans out the reamining references to gem_bo. Only the
> >    gem_bo which are shared with other process or have a dma_buf will
> >    survive, like they should.
> > 
> > No leak, no funky driver-internal vm_id based lookup, and with rust we
> > should even be able to guarantee you never mix up Arc<Vm> with OpenRef<Vm>
> > (or however that exactly works in rust types, I have not much real clue).
> 
> That would totally work, and actually I already use somewhat analogous
> mechanisms in other places like firmware queues!
> 
> If this all weren't getting turned on its head for the new VM management I'd
> implement it, but hopefully we can agree there's not much point right now...
> I'd rather focus on the DRM abstraction design and work on improving the
> driver in parallel right now, and then about one kernel cycle or so from now
> it should definitely be in a better place for review. Honestly, there are
> bigger design problems with the driver right now than these IDs (that I
> already know about)... so I want to focus more on the abstractions and their
> usage right now than the internal driver design which I *know* has problems
> ^^

Yeah I think the only fundamental issue you have is that (if I get this
all right) you're trying to clean up mappings from the gem_bo, not from
the vm. The gem_bo (unlike the vm) is freely shareable (at least in
general), so tying anything else to the lifetime of a gem_bo in any way is
a design flaw.

This is similar to dma_fence that can end up absolutely everywhere, and
why drm/sched has this decoupling between hw_fence and drm_job fences with
wider visibility. i915-gem/i915-scheduler and a lot of the really old
drivers all get this wrong, and you end up with either terrible explicit
cleanup code that tries to go around looking for all the references that
it needs to drop. Or you just leak.

All these things need to be sorted out at design time so that they're
impossible.

> Rust is really good at getting you to come up with a *safe* design as far as
> memory and ownership, but that doesn't mean it's perfectly clean code and
> more importantly it does nothing for deadlocks and allocating in the wrong
> paths and getting resource allocation semantics right etc etc. The GPU FW
> queue stuff is at the very least due for another major refactor/cleanup to
> defer resource allocation and actual queuing to job prepare/run time (right
> now there's some horrible hacks to do it upfront at submit because I don't
> have a mechanism to back-patch job structures with those resource IDs later
> at exec time, but I want to add that), and along the way I can also fix the
> using job fences to block on pending job count thing that Christian really
> wants me to do instead of the can_run_job thing, and then getting all this
> resource stuff truly right is also going to mean eventually using fences to
> handle blocking on resource exhaustion too (though maybe I can get away with
> implementing that a bit later)...
> 
> The driver works stupidly well for how quickly I wrote it, but it still has
> all these rough edges that definitely need fixing before it's something I
> could say I'm happy with... I'm sure if you start hammering it with evil
> workloads you will hit some of its current problems (like I did yesterday
> with the deadlocks on GpuContext inval). I also need to learn more about the
> subtleties of fence signaling and all that, especially once a shrinker comes
> into play...

Yeah I think rust is impressive at creating working code. The real
challenge, and really where I see all the short term value at least, is in
clarifying the semantics. Because that'll help us to clarify the semantics
on the C side too, which gives immediate benefits for everyone. Not just
new drivers in rust.

But it's also the part that's really, really hard work.
-Daniel
Daniel Vetter April 6, 2023, 1:54 p.m. UTC | #34
On Thu, Apr 06, 2023 at 10:32:29PM +0900, Asahi Lina wrote:
> On 06/04/2023 20.25, Daniel Vetter wrote:
> > On Thu, Apr 06, 2023 at 02:02:55PM +0900, Asahi Lina wrote:
> > > On 05/04/2023 23.44, Daniel Vetter wrote:
> > > > On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
> > > > > +/// Look up a GEM object handle for a `File` and return an `ObjectRef` for it.
> > > > > +pub(crate) fn lookup_handle(file: &DrmFile, handle: u32) -> Result<ObjectRef> {
> > > > > +    Ok(ObjectRef::new(shmem::Object::lookup_handle(file, handle)?))
> > > > > +}
> > > > 
> > > > So maybe my expectations for rust typing is a bit too much, but I kinda
> > > > expected this to be fully generic:
> > > > 
> > > > - trait Driver (drm_driver) knows the driver's object type
> > > > - a generic create_handle function could ensure that for drm_file (which
> > > >     is always for a specific drm_device and hence Driver) can ensure at the
> > > >     type level that you only put the right objects into the drm_file
> > > > - a generic lookup_handle function on the drm_file knows the Driver trait
> > > >     and so can give you back the right type right away.
> > > > 
> > > > Why the wrapping, what do I miss?
> > > 
> > > Sigh, so this is one of the many ways I'm trying to work around the "Rust
> > > doesn't do subclasses" problem (so we can figure out what the best one is
> > > ^^).
> > > 
> > > The generic shmem::Object::lookup_handle() call *is* fully generic and will
> > > get you back a driver-specific object. But since Rust doesn't do
> > > subclassing, what you get back isn't a driver-specific type T, but rather a
> > > (reference to a) shmem::Object<T>. T represents the inner driver-specific
> > > data/functionality (only), and the outer shmem::Object<T> includes the
> > > actual drm_gem_shmem_object plus a T. This is backwards from C, where you
> > > expect the opposite situation where T contains a shmem object, but that just
> > > doesn't work with Rust because there's no way to build a safe API around
> > > that model as far as I know.
> > 
> > Ah I think I just got confused. I did untangle (I think at least) the
> > Object<T> trick, I guess the only thing that confused me here is why this
> > is in the shmem module? Or is that the rust problem again?
> > 
> > I'd kinda have expected that we'd have a gem::Object<T> here that the
> > lookup_handle function returns. So for the shmem case I guess that would
> > then be gem::Object<shmem::Object<T>> for the driver type T with driver
> > specific stuff? I guess not very pretty ...
> 
> Ahh, uh... Yeah, so shmem objects are allocated their own way (the shmem
> core expects to kfree them in drm_gem_shmem_free) and
> bindings::drm_gem_shmem_object already contains a bindings::drm_gem_object.
> Since the composition is already done in the C side, we can't just do it
> again in Rust cleanly. That's why I have this weird setup with both a common
> trait for common GEM functionality and separate actual types that both
> implement it.

Hm this is annoying. For a single driver it doesn't matter, but I do
expect that once we have more, and especially once we have more libraries
wrapped (ttm, gpuva, execbuf submit helpers, ...) then the common glue
really becomes the gem_bo for many of these things.

Could we have a GemObject trait which covers this? sole function is an
unsafe one that gives you the raw C pointer :-) It still means that every
gem memory manager library needs to impl that trait, but all the
manager-agnostic bits in the wrappers would be generic? trait would then
also have the right dependent type to ensure type safety in all this.

Maybe something to discuss in the next meeting with the rust folks.

> Honestly the whole GEM codepath is untested other than the bits inherited by
> shmem. I'm not sure I'll be able to verify that this all makes sense until
> another Rust driver comes along that needs something other than shmem. I
> just felt I had to do *something* for GEM since the hierarchy is there and I
> needed shmem...
> 
> This whole gem stuff is IMO the messiest part of the abstractions though, so
> I'm happy to turn it on its head if it makes it better and someone has an
> idea of how to do that ^^

Yeah I still haven't worked up enough courage to type up my gem
abstraction review :-/

> > > Now the problem is from the higher layers I want object operations that
> > > interact with the shmem::Object<T> (that is, they call generic GEM functions
> > > on the object). Options so far:
> > > 
> > > 1. Add an outer wrapper and put that functionality there.
> > > 2. Just have the functions on T as helpers, so you need to call T::foo(obj)
> > > instead of obj.foo().
> > > 3. Use the undocumented method receiver trait thing to make shmem::Object<T>
> > > a valid `self` type, plus add auto-Deref to shmem::Object. Then obj.foo()
> > > works.
> > > 
> > > #1 is what I use here. #2 is how the driver-specific File ioctl callbacks
> > > are implemented, and also sched::Job<T>. #3 is used for fence callbacks
> > > (FenceObject<T>). None of them are great, and I'd love to hear what people
> > > think of the various options...
> > > 
> > > There are other unexplored options, like in this GEM case it could be
> > > covered with a driver-internal auxiliary trait impl'd on shmem::Object<T>
> > > buuut that doesn't work when you actually need callbacks on T itself to
> > > circle back to shmem::Object<T>, as is the case with File/Job/FenceObject.
> > 
> > Ok I think I'm completely lost here. But I also havent' looked at how this
> > is all really used in the driver, it's really just the shmem:: module in
> > the lookup_handle function which looked strange to me.
> 
> Ah, sorry, I misunderstood what you were talking about in my previous email
> then. That's just a default trait function. It comes from common
> functionality in the gem module, but shmem::Object implements the trait so
> it ends up offering it too (lookup_handle() is not duplicated, it only lives
> in gem, shmem only has to implement going to/from the drm_gem_object pointer
> so the rest of the methods can use it). That's part of why the type/trait
> hierarchy is kind of messy here, it's so I can share functionality between
> both types even though they are pre-composed on the C side.

Ok, so it's all already what I expect and I'm just confused with rust
syntax.

> In the end the object types are specialized for any given driver, so you're
> always getting your own unique kind of object anyway. It's just that drivers
> based on shmem will go through it to reach the common code and work with a
> shmem::Object<T>, and drivers using raw gem will use gem::Object<T> instead.

Ok, sounds all good.
-Daniel
Asahi Lina April 6, 2023, 3:19 p.m. UTC | #35
On 06/04/2023 22.48, Daniel Vetter wrote:
> On Thu, Apr 06, 2023 at 10:15:56PM +0900, Asahi Lina wrote:
>> On 06/04/2023 20.55, Daniel Vetter wrote:
>>> On Thu, Apr 06, 2023 at 01:44:22PM +0900, Asahi Lina wrote:
>>>> On 05/04/2023 23.37, Daniel Vetter wrote:
>>>>> On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
>>>>>> +/// A generic monotonically incrementing ID used to uniquely identify object instances within the
>>>>>> +/// driver.
>>>>>> +pub(crate) struct ID(AtomicU64);
>>>>>> +
>>>>>> +impl ID {
>>>>>> +    /// Create a new ID counter with a given value.
>>>>>> +    fn new(val: u64) -> ID {
>>>>>> +        ID(AtomicU64::new(val))
>>>>>> +    }
>>>>>> +
>>>>>> +    /// Fetch the next unique ID.
>>>>>> +    pub(crate) fn next(&self) -> u64 {
>>>>>> +        self.0.fetch_add(1, Ordering::Relaxed)
>>>>>> +    }
>>>>>> +}
>>>>>
>>>>> Continuing the theme of me commenting on individual things, I stumbled
>>>>> over this because I noticed that there's a lot of id based lookups where I
>>>>> don't expect them, and started chasing.
>>>>>
>>>>> - For ids use xarray, not atomic counters. Yes I know dma_fence timelines
>>>>>      gets this wrong, this goes back to an innocent time where we didn't
>>>>>      allocate more than one timeline per engine, and no one fixed it since
>>>>>      then. Yes u64 should be big enough for everyone :-/
>>>>>
>>>>> - Attaching ID spaces to drm_device is also not great. drm is full of
>>>>>      these mistakes. Much better if their per drm_file and so private to each
>>>>>      client.
>>>>>
>>>>> - They shouldn't be used for anything else than uapi id -> kernel object
>>>>>      lookup at the beginning of ioctl code, and nowhere else. At least from
>>>>>      skimming it seems like these are used all over the driver codebase,
>>>>>      which does freak me out. At least on the C side that's a clear indicator
>>>>>      for a refcount/lockin/data structure model that's not thought out at
>>>>>      all.
>>>>>
>>>>> What's going on here, what do I miss?
>>>>
>>>> These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use
>>>> xarray and are per-File). Most of them are just for debugging, so that when
>>>> I enable full debug spam I have some way to correlate different things that
>>>> are happening together (this subset of interleaved log lines relate to the
>>>> same submission). Basically just object names that are easier to read (and
>>>> less of a security leak) than pointers and guaranteed not to repeat. You
>>>> could get rid of most of them and it wouldn't affect the driver design, it
>>>> just makes it very hard to see what's going on with debug logs ^^;
>>>>
>>>> There are only two that are ever used for non-debugging purposes: the VM ID,
>>>> and the File ID. Both are per-device global IDs attached to the VMs (not the
>>>> UAPI VM objects, but rather the underlyng MMU address space managers they
>>>> represent, including the kernel-internal ones) and to Files themselves. They
>>>> are used for destroying GEM objects: since the objects are also
>>>> device-global across multiple clients, I need a way to do things like "clean
>>>> up all mappings for this File" or "clean up all mappings for this VM".
>>>> There's an annoying circular reference between GEM objects and their
>>>> mappings, which is why this is explicitly coded out in destroy paths instead
>>>> of naturally happening via Drop semantics (without that cleanup code, the
>>>> circular reference leaks it).
>>>>
>>>> So e.g. when a File does a GEM close or explicitly asks for all mappings of
>>>> an object to be removed, it goes out to the (possibly shared) GEM object and
>>>> tells it to drop all mappings marked as owned by that unique File ID. When
>>>> an explicit "unmap all in VM" op happens, it asks the GEM object to drop all
>>>> mappings for that underlying VM ID. Similarly, when a UAPI VM object is
>>>> dropped (in the Drop impl, so both explicitly and when the whole File/xarray
>>>> is dropped and such), that does an explicit unmap of a special dummy object
>>>> it owns which would otherwise leak since it is not tracked as a GEM object
>>>> owned by that File and therefore not handled by GEM closing. And again along
>>>> the same lines, the allocators in alloc.rs explicitly destroy the mappings
>>>> for their backing GEM objects on Drop. All this is due to that annoying
>>>> circular reference between VMs and GEM objects that I'm not sure how to fix.
>>>>
>>>> Note that if I *don't* do this (or forget to do it somewhere) the
>>>> consequence is just that we leak memory, and if you try to destroy the wrong
>>>> IDs somehow the worst that can happen is you unmap things you shouldn't and
>>>> fault the GPU (or, in the kernel or kernel-managed user VM cases,
>>>> potentially the firmware). Rust safety guarantees still keep things from
>>>> going entirely off the rails within the kernel, since everything that
>>>> matters is reference counted (which is why these reference cycles are
>>>> possible at all).
>>>>
>>>> This all started when I was looking at the panfrost driver for reference. It
>>>> does the same thing except it uses actual pointers to the owning entities
>>>> instead of IDs, and pointer comparison (see panfrost_gem_close). Of course
>>>> you could try do that in Rust too (literally storing and comparing raw
>>>> pointers that aren't owned references), but then you're introducing a Pin<>
>>>> requirement on those objects to make their addresses stable and it feels way
>>>> more icky and error-prone than unique IDs (since addresses can be reused).
>>>> panfrost only has a single mmu (what I call the raw VM) per File while I
>>>> have an arbitrary number, which is why I end up with the extra
>>>> distinction/complexity of both File and VM IDs, but the concept is the same.
>>>>
>>>> Some of this is going to be refactored when I implement arbitrary VM range
>>>> mapping/unmapping, which would be a good time to improve this... but is
>>>> there something particularly wrong/broken about the way I'm doing it now
>>>> that I missed? I figured unique u64 IDs would be a pretty safe way to
>>>> identify entities and cleanup the mappings when needed.
>>>
>>> Ok, some attempt at going through the vm_id/file_id stuff. Extremely
>>> high-level purely informed by having read too many drivers:
>>>
>>> First on the drm_file/struct file/file_id. This is the uapi interface
>>> object, and it's refcounted in the vfs, but that's entirely the vfs'
>>> business and none of the driver (or even subsystem). Once userspace has
>>> done the final close() the file is gone, there's no way to ever get
>>> anything meaningfully out of it because userspace dropped it. So if the
>>> driver has any kind of backpointer to that's a design bug, because in all
>>> the place you might want to care (ioctl, fdinfo for schedu stats, any
>>> other file_operations callback) the vfs ensures it stays alive during the
>>> callback and you essentially have a borrowed reference.
>>
>> Right, there's none of that for the File, and it is not refcounted itself.
>> Certainly there are no direct references, and as for the IDs: the IDs of
>> relevant Files live in GEM objects that hold mappings owned by that file. As
>> part of File close all the GEM objects get closed, which removes those
>> mappings. So by the time the File goes away there should be no references to
>> its ID anywhere (other than if I stashed some away for debugging, I forget
>> whether I did in some child object).
>>
>> If this process breaks for some reason (say, stray mappings remain indexed
>> to a File ID that is gone), that means we leak the mappings, which leaks the
>> GEM objects themselves and the VM they are mapped to. Not great but not
>> fireworks either. As long as the DRM core properly calls the GEM close
>> callback on everything before calling the File close callback though, that
>> shouldn't happen.
>>
>>> I've seen a lot of drivers try to make clever backpointings to stuff
>>> that's essentially tied to the drm_file, and I've not found a single case
>>> that made sense. iow, file_id as a lookup thingie needs to go. In
>>> principle it's the same argument I've made already for the syncobj rust
>>> wrappers. For specific uses I guess I need some rust reading help, but
>>> from your description it sounds like the vm_id is much more the core
>>> piece.
>>
>> The file ID is simply how GEM mappings are identified as belonging to an
>> active file within the mapping list of an object. GEM object close is
>> literally the only place this ID is ever used for anything other than
>> passing around:
>>
>> /// Callback to drop all mappings for a GEM object owned by a given `File`
>> fn close(obj: &Object, file: &DrmFile) {
>>      mod_pr_debug!("DriverObject::close vm_id={:?} id={}\n", obj.vm_id,
>> obj.id);
>>      obj.drop_file_mappings(file.inner().file_id());
>> }
>>
>> I could also just iterate through the VM XArray for the File and drop
>> mappings one VM at a time instead of doing all of them in one go, it's just
>> slightly more cumbersome (though potentially less code because I could get
>> rid of all the forwarding the file_id I do now).
>>
>> On the other hand, once we implement arbitrary VM maps, I suspect this is
>> going to go away anyway with the new design, so I'm not really very inclined
>> to fix it until that happens... ^^
> 
> Yeah the driver-managed vm needs a bunch more reference loops and gets
> awkward fast. the gpuva library might need to keep support for that, but I
> really hope it's not needed.
> 
>>> So for that we have the gpu ctx -> vm -> gem_bos chain of reference. Now
>>> on the C side if you have a modern driver that uses the
>>> vm_bind/unbind/gpuva manager approach, the reference counts go in that
>>> single direction only, anything else is essentially borrowed references
>>> under protection of a mutex/lock or similar thing (for e.g. going from the
>>> bo to the vm for eviction).
>>
>> Right, so that is what is going to change with the pending refactor. What I
>> have right now is a design that used to be the old driver-managed VM design
>> (and still retains part of that for kernel-managed objects) for the old
>> synchronous demo UAPI, that I then shoehorned into the redesigned vm_bind
>> UAPI by just not supporting the interesting cases (partial
>> maps/unmaps/remaps, etc.). This is all temporary, it's just to get us by for
>> now since OpenGL doesn't need it and there is no usable Vulkan driver that
>> cares yet... I wanted to focus on the explicit sync and general
>> sched/queuing part of the new UAPI before I got to the VM bind stuff, since
>> I figured that would be more interesting (and pulls in all the new
>> abstractions, plus major perf benefit). So the UAPI itself has vm_bind but
>> only the "easy" subset of cases are supported by the driver (whole object
>> maps/unmaps) and the refcounting is still backwards.
>>
>> As I said this originally came from the Panfrost design that doesn't have
>> vm_bind but instead keeps a list of mappings with pointer equality checks in
>> BOs... so that's why ^^
>>
>> Thanks for explaining the design approach though, it's roughly what I had in
>> mind but it's good to hear I'm on the right track! I'd love to go into more
>> detail about how to implement vm_bind if you have time though (maybe a
>> meeting?). In particular things like using the mm allocator to keep track of
>> mapping ranges and supporting splitting and all that.
> 
> Yeah vm_bind sounds like a good topic to discuss. I don't think we'll get
> all the pieces aligned to land that before asahi, but the driver internals
> should at least match wrt semantics with that so that the refactoring
> isn't total pain.
> 
>>> In addition to the above chain the xarray in the drm_file also holds
>>> references to each of these. So far so good, in the drm_file ->postclose
>>> callback you just walk the xarrays and drop all the references, and
>>> everything gets cleaned up, at least in the C world.
>>
>> In the Rust world you just do nothing since the XArray abstraction knows how
>> to drop all of its contained objects!
> 
> Yeah xarray should work with Drop, but I guess you need a special
> uapi/open-reference object that knows that it needs to perform additional
> cleanup (like quiescent the gpu ctx or unamp everything for the vm).

Yeah, I already have that for VMs. Since I have a layer between UAPI VM 
objects and the underlying MMU VM objects, the UAPI VM object Drop impl 
can take care of explicitly unmapping whatever it needs to, or however 
that ends up working out with the new design. I prefer that to explicit 
cleanup code since it means you can't forget to do it.

Rust is pretty nice for throwing around tiny objects, 1:1 wrappers, or 
even zero-sized types that just do one thing + Drop in order to make 
some semantic ergonomic to use. That's how the XArray reservation stuff 
works: you get back a trivial object that just references the queue (yay 
lifetimes, no refcounting here) and holds the reservation open, and then 
you either fill it (which consumes the reservation guard) or drop it 
(which cleans up the reservation). There's lots of that kind of pattern 
in kernel Rust and I think we should use it often, it just makes things 
a lot less error-prone (ScopeGuard is another nice one!)

>>> But if either due to the uabi being a bit more legacy, or Rust requiring
>>> that the backpointers are reference-counted from the gem_bo->vma->vm and
>>> can't follow borrow semantics (afaiui the usual linux list_head pattern of
>>> walking the list under a lock giving you a borrowed reference for each
>>> element doesn't work too well in rust?) then that's not a problem, you can
>>> still all clean it out:
>>>
>>> - The key bit is that your vm struct needs both a refcount like kref and
>>>     a separate open count. Each gpu ctx and the xarray for vm objects in
>>>     drm_file hold _both_ the kref and the open refcount (in rust the open
>>>     refcount implies the Arc or things go sideways).
>>>
>>> - the other key bit is that drm_file ->postclose does _not_ have simple
>>>     Drop semantics, it's more explicit.
>>>
>>> - in the drm_file lastclose you first walk all the gpu ctx. The simplest
>>>     semantics is that close() synchronously tears down all leftover gpu ctx,
>>>     i.e. you unload them from the gpu. Details are under a lot of discussion
>>>     in the various scheduler threads, but essentially this should ensure
>>>     that the gpu ctx destruction completely removes all references to the
>>>     ctx. If instead you have the legacy problem of apps expecting that
>>>     rendering continues even if they called exit() before it finishes, then
>>>     it gets more messy. I have no idea whether that's still a problem for
>>>     new drivers or can be avoided.
>>>
>>> - Next up you do the same thing for the vm xarray (which drops both the
>>>     kref an open refcounts).
>>>
>>> - At this point there might still be a ton of vm objects around with
>>>     elevated kref. Except not, because at this point the open refcount of
>>>     each vm should have dropped to zero. When that happens the vm object
>>>     itself is still alive, plus even better for rust, you are in the
>>>     vm_close(vm) function call so you have a full borrowed reference to
>>>     that. Which means you can walk the entire address space and unmap
>>>     everything explicit. Which should get rid of any gem_bo->vma->vm
>>>     backpointers you have lying around.
>>>
>>> - At that point all your vm objects are gone too, because the kref managed
>>>     backpointers are gone.
>>>
>>> - You walk the xarray of gem_bo (well the drm subsystem does that for
>>>     you), which cleans out the reamining references to gem_bo. Only the
>>>     gem_bo which are shared with other process or have a dma_buf will
>>>     survive, like they should.
>>>
>>> No leak, no funky driver-internal vm_id based lookup, and with rust we
>>> should even be able to guarantee you never mix up Arc<Vm> with OpenRef<Vm>
>>> (or however that exactly works in rust types, I have not much real clue).
>>
>> That would totally work, and actually I already use somewhat analogous
>> mechanisms in other places like firmware queues!
>>
>> If this all weren't getting turned on its head for the new VM management I'd
>> implement it, but hopefully we can agree there's not much point right now...
>> I'd rather focus on the DRM abstraction design and work on improving the
>> driver in parallel right now, and then about one kernel cycle or so from now
>> it should definitely be in a better place for review. Honestly, there are
>> bigger design problems with the driver right now than these IDs (that I
>> already know about)... so I want to focus more on the abstractions and their
>> usage right now than the internal driver design which I *know* has problems
>> ^^
> 
> Yeah I think the only fundamental issue you have is that (if I get this
> all right) you're trying to clean up mappings from the gem_bo, not from
> the vm. The gem_bo (unlike the vm) is freely shareable (at least in
> general), so tying anything else to the lifetime of a gem_bo in any way is
> a design flaw.

Yeah, it wasn't nice from the start. Actually the first bit of code I 
wrote is the MMU code, and originally it was even literally C code based 
on the panfrost MMU code as-is... I quickly realized that the C wasn't 
going to be that useful when I started diving into the GEM abstractions, 
so it got rewritten in Rust early on...

So right now it works (and I have no reason to believe it has actual 
leak bugs lurking today) but it's not a nice design and it's going to 
get a major refactor/redesign once I switch to proper vm_bind tracking.

> This is similar to dma_fence that can end up absolutely everywhere, and
> why drm/sched has this decoupling between hw_fence and drm_job fences with
> wider visibility. i915-gem/i915-scheduler and a lot of the really old
> drivers all get this wrong, and you end up with either terrible explicit
> cleanup code that tries to go around looking for all the references that
> it needs to drop. Or you just leak.

I think for fences my general approach is going to be to just try to 
keep to what I'm doing now and minimize the references fences hold, and 
treat them as a signaling mechanism that ideally doesn't have to hold a 
reference to anything other than the module. After all, the real king of 
what needs to be alive is the firmware, and its mechanisms don't map 
well to fences directly, so I need to do bespoke resource management 
there anyway (and then just plug it into fences so it can feed into 
drm_sched and the rest of the world). I don't know if that makes sense, 
but it feels like it does? I still need to spend a bunch of time 
thinking about this though...

> All these things need to be sorted out at design time so that they're
> impossible.

That's the other nice thing about Rust, it makes refactoring a lot 
faster too! The compiler is really good at annoying you and refusing to 
compile things until you've fixed all the really dumb mistakes you 
introduced, and then there's a pretty good chance it'll run and the 
remaining bugs will be really obvious after that. As much as you learn 
to hate the compiler, it's so much better than trying to debug things at 
runtime... ^^

I'm not sure what your opinion is on this, but personally if you/others 
were okay with it I wouldn't be too worried about hypothetically merging 
the driver in the state it's in today, with the expectation to hack 
major parts of it to bits and pieces over the next few months. I've done 
it a few times already... it usually doesn't take more than a day or two 
to make some major refactor to a component and get it back up and 
running. (I do expect to do a bunch of that cleanup over the next few 
months before it's even possible to merge anyway, just a hypothetical).

>> Rust is really good at getting you to come up with a *safe* design as far as
>> memory and ownership, but that doesn't mean it's perfectly clean code and
>> more importantly it does nothing for deadlocks and allocating in the wrong
>> paths and getting resource allocation semantics right etc etc. The GPU FW
>> queue stuff is at the very least due for another major refactor/cleanup to
>> defer resource allocation and actual queuing to job prepare/run time (right
>> now there's some horrible hacks to do it upfront at submit because I don't
>> have a mechanism to back-patch job structures with those resource IDs later
>> at exec time, but I want to add that), and along the way I can also fix the
>> using job fences to block on pending job count thing that Christian really
>> wants me to do instead of the can_run_job thing, and then getting all this
>> resource stuff truly right is also going to mean eventually using fences to
>> handle blocking on resource exhaustion too (though maybe I can get away with
>> implementing that a bit later)...
>>
>> The driver works stupidly well for how quickly I wrote it, but it still has
>> all these rough edges that definitely need fixing before it's something I
>> could say I'm happy with... I'm sure if you start hammering it with evil
>> workloads you will hit some of its current problems (like I did yesterday
>> with the deadlocks on GpuContext inval). I also need to learn more about the
>> subtleties of fence signaling and all that, especially once a shrinker comes
>> into play...
> 
> Yeah I think rust is impressive at creating working code. The real
> challenge, and really where I see all the short term value at least, is in
> clarifying the semantics. Because that'll help us to clarify the semantics
> on the C side too, which gives immediate benefits for everyone. Not just
> new drivers in rust.
> 
> But it's also the part that's really, really hard work.

Yup!

~~ Lina