mbox series

[v2,0/9] Preemption support for A7XX

Message ID 20240830-preemption-a750-t-v2-0-86aeead2cd80@gmail.com
Headers show
Series Preemption support for A7XX | expand

Message

Antonino Maniscalco Aug. 30, 2024, 3:32 p.m. UTC
This series implements preemption for A7XX targets, which allows the GPU to
switch to an higher priority ring when work is pushed to it, reducing latency
for high priority submissions.

This series enables L1 preemption with skip_save_restore which requires
the following userspace patches to function:

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30544

A flag is added to `msm_submitqueue_create` to only allow submissions
from compatible userspace to be preempted, therefore maintaining
compatibility.

Some commits from this series are based on a previous series to enable
preemption on A6XX targets:

https://lkml.kernel.org/1520489185-21828-1-git-send-email-smasetty@codeaurora.org

Signed-off-by: Antonino Maniscalco <antomani103@gmail.com>
---
Changes in v2:
- Added preept_record_size for X185 in PATCH 3/7
- Added patches to reset perf counters
- Dropped unused defines
- Dropped unused variable (fixes warning)
- Only enable preemption on a750
- Reject MSM_SUBMITQUEUE_ALLOW_PREEMPT for unsupported targets
- Added Akhil's Reviewed-By tags to patches 1/9,2/9,3/9
- Added Neil's Tested-By tags
- Added explanation for UAPI changes in commit message
- Link to v1: https://lore.kernel.org/r/20240815-preemption-a750-t-v1-0-7bda26c34037@gmail.com

---
Antonino Maniscalco (9):
      drm/msm: Fix bv_fence being used as bv_rptr
      drm/msm: Add submitqueue setup and close
      drm/msm: Add a `preempt_record_size` field
      drm/msm/A6xx: Implement preemption for A7XX targets
      drm/msm/A6xx: Sync relevant adreno_pm4.xml changes
      drm/msm/A6xx: Use posamble to reset counters on preemption
      drm/msm/A6xx: Add traces for preemption
      drm/msm/A6XX: Add a flag to allow preemption to submitqueue_create
      drm/msm/A6xx: Enable preemption for A750

 drivers/gpu/drm/msm/Makefile                       |   1 +
 drivers/gpu/drm/msm/adreno/a6xx_catalog.c          |   4 +
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c              | 353 +++++++++++++++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.h              | 174 ++++++++
 drivers/gpu/drm/msm/adreno/a6xx_preempt.c          | 462 +++++++++++++++++++++
 drivers/gpu/drm/msm/adreno/adreno_gpu.h            |   8 +-
 drivers/gpu/drm/msm/msm_gpu.h                      |   7 +
 drivers/gpu/drm/msm/msm_gpu_trace.h                |  28 ++
 drivers/gpu/drm/msm/msm_ringbuffer.h               |   8 +
 drivers/gpu/drm/msm/msm_submitqueue.c              |  13 +
 .../gpu/drm/msm/registers/adreno/adreno_pm4.xml    |  39 +-
 include/uapi/drm/msm_drm.h                         |   5 +-
 12 files changed, 1062 insertions(+), 40 deletions(-)
---
base-commit: 7c626ce4bae1ac14f60076d00eafe71af30450ba
change-id: 20240815-preemption-a750-t-fcee9a844b39

Best regards,

Comments

Rob Clark Aug. 30, 2024, 6:23 p.m. UTC | #1
On Fri, Aug 30, 2024 at 8:33 AM Antonino Maniscalco
<antomani103@gmail.com> wrote:
>
> Initialize with 4 rings to enable preemption.
>
> For now only on A750 as other targets require testing.
>
> Signed-off-by: Antonino Maniscalco <antomani103@gmail.com>
> Tested-by: Neil Armstrong <neil.armstrong@linaro.org> # on SM8650-QRD
> ---
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index a2853309288b..ea518209c03d 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -2609,7 +2609,9 @@ struct msm_gpu *a6xx_gpu_init(struct drm_device *dev)
>                 return ERR_PTR(ret);
>         }
>
> -       if (is_a7xx)
> +       if (adreno_is_a750(adreno_gpu))
> +               ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs_a7xx, 4);

perhaps it would be useful (to enable others to more easily test), to
allow this to be overridden with a modparam.. something like

       if ((enable_preemption == 1) || (enable_preemption == -1 &&
(config->info->quirks & ADRENO_QUIRK_PREEMPTION))

That would allow overriding enable_preemption to 1 or 0 on cmdline to
force enable/disable preemption.

That plus some instructions about how to test preemption (ie. what
deqp tests to run, or similar) would make it easier to "crowd source"
the testing (assuming you don't have every a7xx device there is)

BR,
-R

> +       else if (is_a7xx)
>                 ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs_a7xx, 1);
>         else if (adreno_has_gmu_wrapper(adreno_gpu))
>                 ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs_gmuwrapper, 1);
>
> --
> 2.46.0
>
Connor Abbott Aug. 30, 2024, 6:54 p.m. UTC | #2
On Fri, Aug 30, 2024 at 7:08 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Aug 30, 2024 at 8:33 AM Antonino Maniscalco
> <antomani103@gmail.com> wrote:
> >
> > This patch implements preemption feature for A6xx targets, this allows
> > the GPU to switch to a higher priority ringbuffer if one is ready. A6XX
> > hardware as such supports multiple levels of preemption granularities,
> > ranging from coarse grained(ringbuffer level) to a more fine grained
> > such as draw-call level or a bin boundary level preemption. This patch
> > enables the basic preemption level, with more fine grained preemption
> > support to follow.
> >
> > Signed-off-by: Sharat Masetty <smasetty@codeaurora.org>
> > Signed-off-by: Antonino Maniscalco <antomani103@gmail.com>
> > Tested-by: Neil Armstrong <neil.armstrong@linaro.org> # on SM8650-QRD
> > ---
> >  drivers/gpu/drm/msm/Makefile              |   1 +
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c     | 323 +++++++++++++++++++++-
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.h     | 168 ++++++++++++
> >  drivers/gpu/drm/msm/adreno/a6xx_preempt.c | 431 ++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/msm/msm_ringbuffer.h      |   7 +
> >  5 files changed, 921 insertions(+), 9 deletions(-)
> >
>
> [snip]
>
> > +
> > +int a6xx_preempt_submitqueue_setup(struct msm_gpu *gpu,
> > +               struct msm_gpu_submitqueue *queue)
> > +{
> > +       void *ptr;
> > +
> > +       /*
> > +        * Create a per submitqueue buffer for the CP to save and restore user
> > +        * specific information such as the VPC streamout data.
> > +        */
> > +       ptr = msm_gem_kernel_new(gpu->dev, A6XX_PREEMPT_USER_RECORD_SIZE,
> > +                       MSM_BO_WC, gpu->aspace, &queue->bo, &queue->bo_iova);
>
> Can this be MSM_BO_MAP_PRIV?  Otherwise it is visible (and writeable)
> by other proceess's userspace generated cmdstream.
>
> And a similar question for the scratch_bo..  I'd have to give some
> thought to what sort of mischief could be had, but generall kernel
> mappings that are not MAP_PRIV are a thing to be careful about.
>

It seems like the idea behind this is that it's supposed to be
per-context. kgsl allocates it as part of the context, as part of the
userspace address space, and then in order to know which user record
to use when preempting, before each submit (although really it only
needs to be done when setting the pagetable) it does a CP_MEM_WRITE of
the user record address to a scratch buffer holding an array of the
current user record for each ring. Then when preempting it reads the
address for the next ring from the scratch buffer and sets it. I think
we need to do that dance too.

Connor

> BR,
> -R
Connor Abbott Aug. 30, 2024, 7:09 p.m. UTC | #3
On Fri, Aug 30, 2024 at 8:00 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Aug 30, 2024 at 11:54 AM Connor Abbott <cwabbott0@gmail.com> wrote:
> >
> > On Fri, Aug 30, 2024 at 7:08 PM Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > On Fri, Aug 30, 2024 at 8:33 AM Antonino Maniscalco
> > > <antomani103@gmail.com> wrote:
> > > >
> > > > This patch implements preemption feature for A6xx targets, this allows
> > > > the GPU to switch to a higher priority ringbuffer if one is ready. A6XX
> > > > hardware as such supports multiple levels of preemption granularities,
> > > > ranging from coarse grained(ringbuffer level) to a more fine grained
> > > > such as draw-call level or a bin boundary level preemption. This patch
> > > > enables the basic preemption level, with more fine grained preemption
> > > > support to follow.
> > > >
> > > > Signed-off-by: Sharat Masetty <smasetty@codeaurora.org>
> > > > Signed-off-by: Antonino Maniscalco <antomani103@gmail.com>
> > > > Tested-by: Neil Armstrong <neil.armstrong@linaro.org> # on SM8650-QRD
> > > > ---
> > > >  drivers/gpu/drm/msm/Makefile              |   1 +
> > > >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c     | 323 +++++++++++++++++++++-
> > > >  drivers/gpu/drm/msm/adreno/a6xx_gpu.h     | 168 ++++++++++++
> > > >  drivers/gpu/drm/msm/adreno/a6xx_preempt.c | 431 ++++++++++++++++++++++++++++++
> > > >  drivers/gpu/drm/msm/msm_ringbuffer.h      |   7 +
> > > >  5 files changed, 921 insertions(+), 9 deletions(-)
> > > >
> > >
> > > [snip]
> > >
> > > > +
> > > > +int a6xx_preempt_submitqueue_setup(struct msm_gpu *gpu,
> > > > +               struct msm_gpu_submitqueue *queue)
> > > > +{
> > > > +       void *ptr;
> > > > +
> > > > +       /*
> > > > +        * Create a per submitqueue buffer for the CP to save and restore user
> > > > +        * specific information such as the VPC streamout data.
> > > > +        */
> > > > +       ptr = msm_gem_kernel_new(gpu->dev, A6XX_PREEMPT_USER_RECORD_SIZE,
> > > > +                       MSM_BO_WC, gpu->aspace, &queue->bo, &queue->bo_iova);
> > >
> > > Can this be MSM_BO_MAP_PRIV?  Otherwise it is visible (and writeable)
> > > by other proceess's userspace generated cmdstream.
> > >
> > > And a similar question for the scratch_bo..  I'd have to give some
> > > thought to what sort of mischief could be had, but generall kernel
> > > mappings that are not MAP_PRIV are a thing to be careful about.
> > >
> >
> > It seems like the idea behind this is that it's supposed to be
> > per-context. kgsl allocates it as part of the context, as part of the
> > userspace address space, and then in order to know which user record
> > to use when preempting, before each submit (although really it only
> > needs to be done when setting the pagetable) it does a CP_MEM_WRITE of
> > the user record address to a scratch buffer holding an array of the
> > current user record for each ring. Then when preempting it reads the
> > address for the next ring from the scratch buffer and sets it. I think
> > we need to do that dance too.
>
> Moving it into userspace's address space (vm) would be better.
>
> I assume the preempt record is where state is saved/restored?  So
> would need to be in kernel aspace/vm?  Or is the fw changing ttbr0
> after saving state but before restoring?
>
> BR,
> -R

The preempt record is split into a number of pieces, each with their
own address. One of those pieces is the SMMU record with ttbr0 and
other SMMU things. Another piece is the "private" context record with
sensitive things like RB address/rptr/wptr, although actually the bulk
of the registers are saved here. Then the user or "non-private" record
is its own piece, which is presumably saved before switching ttbr0 and
restored after the SMMU record is restored and ttbr0 is switched.

Connor

>
> > Connor
> >
> > > BR,
> > > -R
Rob Clark Aug. 30, 2024, 8:25 p.m. UTC | #4
On Fri, Aug 30, 2024 at 8:33 AM Antonino Maniscalco
<antomani103@gmail.com> wrote:
>
> This patch implements preemption feature for A6xx targets, this allows
> the GPU to switch to a higher priority ringbuffer if one is ready. A6XX
> hardware as such supports multiple levels of preemption granularities,
> ranging from coarse grained(ringbuffer level) to a more fine grained
> such as draw-call level or a bin boundary level preemption. This patch
> enables the basic preemption level, with more fine grained preemption
> support to follow.
>
> Signed-off-by: Sharat Masetty <smasetty@codeaurora.org>
> Signed-off-by: Antonino Maniscalco <antomani103@gmail.com>
> Tested-by: Neil Armstrong <neil.armstrong@linaro.org> # on SM8650-QRD
> ---
>  drivers/gpu/drm/msm/Makefile              |   1 +
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c     | 323 +++++++++++++++++++++-
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.h     | 168 ++++++++++++
>  drivers/gpu/drm/msm/adreno/a6xx_preempt.c | 431 ++++++++++++++++++++++++++++++
>  drivers/gpu/drm/msm/msm_ringbuffer.h      |   7 +
>  5 files changed, 921 insertions(+), 9 deletions(-)
>

[snip]

> @@ -784,6 +1062,16 @@ static int a6xx_ucode_load(struct msm_gpu *gpu)
>                 msm_gem_object_set_name(a6xx_gpu->shadow_bo, "shadow");
>         }
>
> +       a6xx_gpu->pwrup_reglist_ptr = msm_gem_kernel_new(gpu->dev, PAGE_SIZE,
> +                                                        MSM_BO_WC  | MSM_BO_MAP_PRIV,
> +                                                        gpu->aspace, &a6xx_gpu->pwrup_reglist_bo,
> +                                                        &a6xx_gpu->pwrup_reglist_iova);

I guess this could also be MSM_BO_GPU_READONLY?

BR,
-R
Antonino Maniscalco Aug. 31, 2024, 2:26 p.m. UTC | #5
On 8/30/24 10:25 PM, Rob Clark wrote:
> On Fri, Aug 30, 2024 at 8:33 AM Antonino Maniscalco
> <antomani103@gmail.com> wrote:
>>
>> This patch implements preemption feature for A6xx targets, this allows
>> the GPU to switch to a higher priority ringbuffer if one is ready. A6XX
>> hardware as such supports multiple levels of preemption granularities,
>> ranging from coarse grained(ringbuffer level) to a more fine grained
>> such as draw-call level or a bin boundary level preemption. This patch
>> enables the basic preemption level, with more fine grained preemption
>> support to follow.
>>
>> Signed-off-by: Sharat Masetty <smasetty@codeaurora.org>
>> Signed-off-by: Antonino Maniscalco <antomani103@gmail.com>
>> Tested-by: Neil Armstrong <neil.armstrong@linaro.org> # on SM8650-QRD
>> ---
>>   drivers/gpu/drm/msm/Makefile              |   1 +
>>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c     | 323 +++++++++++++++++++++-
>>   drivers/gpu/drm/msm/adreno/a6xx_gpu.h     | 168 ++++++++++++
>>   drivers/gpu/drm/msm/adreno/a6xx_preempt.c | 431 ++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/msm/msm_ringbuffer.h      |   7 +
>>   5 files changed, 921 insertions(+), 9 deletions(-)
>>
> 
> [snip]
> 
>> @@ -784,6 +1062,16 @@ static int a6xx_ucode_load(struct msm_gpu *gpu)
>>                  msm_gem_object_set_name(a6xx_gpu->shadow_bo, "shadow");
>>          }
>>
>> +       a6xx_gpu->pwrup_reglist_ptr = msm_gem_kernel_new(gpu->dev, PAGE_SIZE,
>> +                                                        MSM_BO_WC  | MSM_BO_MAP_PRIV,
>> +                                                        gpu->aspace, &a6xx_gpu->pwrup_reglist_bo,
>> +                                                        &a6xx_gpu->pwrup_reglist_iova);
> 
> I guess this could also be MSM_BO_GPU_READONLY?
> 
> BR,
> -R

Besides containing the the actual reglist this buffer also contains the 
`cpu_gpu_lock` structure which is written by the SQE so adding the 
`MSM_BO_GPU_READONLY` flag would cause it to fault.

Best regards,