diff mbox series

[RFC] drm/msm: Add UABI to request perfcntr usage

Message ID 20241205165419.54080-1-robdclark@gmail.com
State New
Headers show
Series [RFC] drm/msm: Add UABI to request perfcntr usage | expand

Commit Message

Rob Clark Dec. 5, 2024, 4:54 p.m. UTC
From: Rob Clark <robdclark@chromium.org>

Performance counter usage falls into two categories:

1. Local usage, where the counter configuration, start, and end read
   happen within (locally to) a single SUBMIT.  In this case, there is
   no dependency on counter configuration or values between submits, and
   in fact counters are normally cleared on context switches, making it
   impossible to rely on cross-submit state.

2. Global usage, where a single privilaged daemon/process is sampling
   counter values across all processes for profiling.

The two categories are mutually exclusive.  While you can have many
processes making local counter usage, you cannot combine global and
local usage without the two stepping on each others feet (by changing
counter configuration).

For global counter usage, there is already a SYSPROF param (since global
counter usage requires disabling counter clearing on context switch).
This patch adds a REQ_CNTRS param to request local counter usage.  If
one or more processes has requested counter usage, then a SYSPROF
request will fail with -EBUSY.  And if SYSPROF is active, then REQ_CNTRS
will fail with -EBUSY, maintaining the mutual exclusivity.

This is purely an advisory interface to help coordinate userspace.
There is no real means of enforcement, but the worst that can happen if
userspace ignores a REQ_CNTRS failure is that you'll get nonsense
profiling data.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
kgsl takes a different approach, which involves a lot more UABI for
assigning counters to different processes.  But I think by taking
advantage of the fact that mesa (freedreno+turnip) reconfigure the
counters they need in each SUBMIT, for their respective gl/vk perf-
counter extensions, we can take this simpler approach.

 drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
 drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
 drivers/gpu/drm/msm/msm_gpu.c           |  1 +
 drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
 drivers/gpu/drm/msm/msm_submitqueue.c   | 52 ++++++++++++++++++++++++-
 include/uapi/drm/msm_drm.h              |  1 +
 6 files changed, 85 insertions(+), 5 deletions(-)

Comments

Akhil P Oommen Dec. 12, 2024, 3:58 p.m. UTC | #1
On 12/5/2024 10:24 PM, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> Performance counter usage falls into two categories:
> 
> 1. Local usage, where the counter configuration, start, and end read
>    happen within (locally to) a single SUBMIT.  In this case, there is
>    no dependency on counter configuration or values between submits, and
>    in fact counters are normally cleared on context switches, making it
>    impossible to rely on cross-submit state.
> 
> 2. Global usage, where a single privilaged daemon/process is sampling
>    counter values across all processes for profiling.
> 
> The two categories are mutually exclusive.  While you can have many
> processes making local counter usage, you cannot combine global and
> local usage without the two stepping on each others feet (by changing
> counter configuration).
> 
> For global counter usage, there is already a SYSPROF param (since global
> counter usage requires disabling counter clearing on context switch).
> This patch adds a REQ_CNTRS param to request local counter usage.  If
> one or more processes has requested counter usage, then a SYSPROF
> request will fail with -EBUSY.  And if SYSPROF is active, then REQ_CNTRS
> will fail with -EBUSY, maintaining the mutual exclusivity.
> 
> This is purely an advisory interface to help coordinate userspace.
> There is no real means of enforcement, but the worst that can happen if
> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
> profiling data.
> 
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
> kgsl takes a different approach, which involves a lot more UABI for
> assigning counters to different processes.  But I think by taking
> advantage of the fact that mesa (freedreno+turnip) reconfigure the
> counters they need in each SUBMIT, for their respective gl/vk perf-
> counter extensions, we can take this simpler approach.

KGSL's approach is preemption and ifpc safe (also whatever HW changes
that will come up in future generations). How will we ensure that here?

I have plans to bring up IFPC support in near future. Also, I brought up
this point during preemption series. But from the responses, I felt that
profiling was not considered a serious usecase. Still I wonder how the
perfcounter extensions work accurately with preemption.

-Akhil

> 
>  drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
>  drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
>  drivers/gpu/drm/msm/msm_gpu.c           |  1 +
>  drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
>  drivers/gpu/drm/msm/msm_submitqueue.c   | 52 ++++++++++++++++++++++++-
>  include/uapi/drm/msm_drm.h              |  1 +
>  6 files changed, 85 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> index 31bbf2c83de4..f688e37059b8 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
>  		if (!capable(CAP_SYS_ADMIN))
>  			return UERR(EPERM, drm, "invalid permissions");
>  		return msm_file_private_set_sysprof(ctx, gpu, value);
> +	case MSM_PARAM_REQ_CNTRS:
> +		return msm_file_private_request_counters(ctx, gpu, value);
>  	default:
>  		return UERR(EINVAL, drm, "%s: invalid param: %u", gpu->name, param);
>  	}
> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> index 6416d2cb4efc..bf8314ff4a25 100644
> --- a/drivers/gpu/drm/msm/msm_drv.c
> +++ b/drivers/gpu/drm/msm/msm_drv.c
> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device *dev, struct drm_file *file)
>  	 * It is not possible to set sysprof param to non-zero if gpu
>  	 * is not initialized:
>  	 */
> -	if (priv->gpu)
> +	if (ctx->sysprof)
>  		msm_file_private_set_sysprof(ctx, priv->gpu, 0);
>  
> +	if (ctx->counters_requested)
> +		msm_file_private_request_counters(ctx, priv->gpu, 0);
> +
>  	context_close(ctx);
>  }
>  
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 82f204f3bb8f..013b59ca3bb1 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>  	gpu->nr_rings = nr_rings;
>  
>  	refcount_set(&gpu->sysprof_active, 1);
> +	refcount_set(&gpu->local_counters_active, 1);
>  
>  	return 0;
>  
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index e25009150579..83c61e523b1b 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -195,12 +195,28 @@ struct msm_gpu {
>  	int nr_rings;
>  
>  	/**
> -	 * sysprof_active:
> +	 * @sysprof_active:
>  	 *
> -	 * The count of contexts that have enabled system profiling.
> +	 * The count of contexts that have enabled system profiling plus one.
> +	 *
> +	 * Note: refcount_t does not like 0->1 transitions.. we want to keep
> +	 * the under/overflow checks that refcount_t provides, but allow
> +	 * multiple on/off transitions so we track the logical value plus one.)
>  	 */
>  	refcount_t sysprof_active;
>  
> +	/**
> +	 * @local_counters_active:
> +	 *
> +	 * The count of contexts that have requested local (intra-submit)
> +	 * performance counter usage plus one.
> +	 *
> +	 * Note: refcount_t does not like 0->1 transitions.. we want to keep
> +	 * the under/overflow checks that refcount_t provides, but allow
> +	 * multiple on/off transitions so we track the logical value plus one.)
> +	 */
> +	refcount_t local_counters_active;
> +
>  	/**
>  	 * lock:
>  	 *
> @@ -383,6 +399,13 @@ struct msm_file_private {
>  	 */
>  	int sysprof;
>  
> +	/**
> +	 * @counters_requested:
> +	 *
> +	 * Has the context requested local perfcntr usage.
> +	 */
> +	bool counters_requested;
> +
>  	/**
>  	 * comm: Overridden task comm, see MSM_PARAM_COMM
>  	 *
> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
>  
>  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>  				 struct msm_gpu *gpu, int sysprof);
> +int msm_file_private_request_counters(struct msm_file_private *ctx,
> +				      struct msm_gpu *gpu, int reqcntrs);
>  void __msm_file_private_destroy(struct kref *kref);
>  
>  static inline void msm_file_private_put(struct msm_file_private *ctx)
> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
> index 7fed1de63b5d..1e1e21e6f7ae 100644
> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> @@ -10,6 +10,15 @@
>  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>  				 struct msm_gpu *gpu, int sysprof)
>  {
> +	int ret = 0;
> +
> +	mutex_lock(&gpu->lock);
> +
> +	if (sysprof && (refcount_read(&gpu->local_counters_active) > 1)) {
> +		ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
> +		goto out_unlock;
> +	}
> +
>  	/*
>  	 * Since pm_runtime and sysprof_active are both refcounts, we
>  	 * call apply the new value first, and then unwind the previous
> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>  
>  	switch (sysprof) {
>  	default:
> -		return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
> +		ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
> +		goto out_unlock;
>  	case 2:
>  		pm_runtime_get_sync(&gpu->pdev->dev);
>  		fallthrough;
> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>  
>  	ctx->sysprof = sysprof;
>  
> -	return 0;
> +out_unlock:
> +	mutex_unlock(&gpu->lock);
> +
> +	return ret;
> +}
> +
> +int msm_file_private_request_counters(struct msm_file_private *ctx,
> +				      struct msm_gpu *gpu, int reqctrs)
> +{
> +	int ret = 0;
> +
> +	mutex_lock(&gpu->lock);
> +
> +	if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
> +		ret = UERR(EBUSY, gpu->dev, "System profiling active");
> +		goto out_unlock;
> +	}
> +
> +	if (reqctrs) {
> +		if (ctx->counters_requested) {
> +			ret = UERR(EINVAL, gpu->dev, "Already requested");
> +			goto out_unlock;
> +		}
> +
> +		ctx->counters_requested = true;
> +		refcount_inc(&gpu->local_counters_active);
> +	} else {
> +		if (!ctx->counters_requested) {
> +			ret = UERR(EINVAL, gpu->dev, "Not requested");
> +			goto out_unlock;
> +		}
> +		refcount_dec(&gpu->local_counters_active);
> +		ctx->counters_requested = false;
> +	}
> +
> +out_unlock:
> +	mutex_unlock(&gpu->lock);
> +
> +	return ret;
>  }
>  
>  void __msm_file_private_destroy(struct kref *kref)
> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
> index 2342cb90857e..ae7fb355e4a1 100644
> --- a/include/uapi/drm/msm_drm.h
> +++ b/include/uapi/drm/msm_drm.h
> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
>  #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
>  #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
>  #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-submit) perfcntr usage  */
>  
>  /* For backwards compat.  The original support for preemption was based on
>   * a single ring per priority level so # of priority levels equals the #
Antonino Maniscalco Dec. 12, 2024, 4:14 p.m. UTC | #2
On 12/12/24 4:58 PM, Akhil P Oommen wrote:
> On 12/5/2024 10:24 PM, Rob Clark wrote:
>> From: Rob Clark <robdclark@chromium.org>
>>
>> Performance counter usage falls into two categories:
>>
>> 1. Local usage, where the counter configuration, start, and end read
>>     happen within (locally to) a single SUBMIT.  In this case, there is
>>     no dependency on counter configuration or values between submits, and
>>     in fact counters are normally cleared on context switches, making it
>>     impossible to rely on cross-submit state.
>>
>> 2. Global usage, where a single privilaged daemon/process is sampling
>>     counter values across all processes for profiling.
>>
>> The two categories are mutually exclusive.  While you can have many
>> processes making local counter usage, you cannot combine global and
>> local usage without the two stepping on each others feet (by changing
>> counter configuration).
>>
>> For global counter usage, there is already a SYSPROF param (since global
>> counter usage requires disabling counter clearing on context switch).
>> This patch adds a REQ_CNTRS param to request local counter usage.  If
>> one or more processes has requested counter usage, then a SYSPROF
>> request will fail with -EBUSY.  And if SYSPROF is active, then REQ_CNTRS
>> will fail with -EBUSY, maintaining the mutual exclusivity.
>>
>> This is purely an advisory interface to help coordinate userspace.
>> There is no real means of enforcement, but the worst that can happen if
>> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
>> profiling data.
>>
>> Signed-off-by: Rob Clark <robdclark@chromium.org>
>> ---
>> kgsl takes a different approach, which involves a lot more UABI for
>> assigning counters to different processes.  But I think by taking
>> advantage of the fact that mesa (freedreno+turnip) reconfigure the
>> counters they need in each SUBMIT, for their respective gl/vk perf-
>> counter extensions, we can take this simpler approach.
> 
> KGSL's approach is preemption and ifpc safe (also whatever HW changes
> that will come up in future generations). How will we ensure that here?
> 
> I have plans to bring up IFPC support in near future. Also, I brought up
> this point during preemption series. But from the responses, I felt that
> profiling was not considered a serious usecase. Still I wonder how the
> perfcounter extensions work accurately with preemption.

So back then I implemented the postamble IB to clear perf counters and 
that gets disabled when sysprof (so global usage) is happening. The 
kernel is oblivious to "Local isage" of profiling but in that case 
really what we want to do is disable preemption which in my 
understanding can be done from userspace with a PKT. In my understanding 
this had us covered for all usecases.

So what would you expect instead we should do kernel side to make 
profiling preemption safe?

> 
> -Akhil
> 
>>
>>   drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
>>   drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
>>   drivers/gpu/drm/msm/msm_gpu.c           |  1 +
>>   drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
>>   drivers/gpu/drm/msm/msm_submitqueue.c   | 52 ++++++++++++++++++++++++-
>>   include/uapi/drm/msm_drm.h              |  1 +
>>   6 files changed, 85 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> index 31bbf2c83de4..f688e37059b8 100644
>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
>>   		if (!capable(CAP_SYS_ADMIN))
>>   			return UERR(EPERM, drm, "invalid permissions");
>>   		return msm_file_private_set_sysprof(ctx, gpu, value);
>> +	case MSM_PARAM_REQ_CNTRS:
>> +		return msm_file_private_request_counters(ctx, gpu, value);
>>   	default:
>>   		return UERR(EINVAL, drm, "%s: invalid param: %u", gpu->name, param);
>>   	}
>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
>> index 6416d2cb4efc..bf8314ff4a25 100644
>> --- a/drivers/gpu/drm/msm/msm_drv.c
>> +++ b/drivers/gpu/drm/msm/msm_drv.c
>> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device *dev, struct drm_file *file)
>>   	 * It is not possible to set sysprof param to non-zero if gpu
>>   	 * is not initialized:
>>   	 */
>> -	if (priv->gpu)
>> +	if (ctx->sysprof)
>>   		msm_file_private_set_sysprof(ctx, priv->gpu, 0);
>>   
>> +	if (ctx->counters_requested)
>> +		msm_file_private_request_counters(ctx, priv->gpu, 0);
>> +
>>   	context_close(ctx);
>>   }
>>   
>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
>> index 82f204f3bb8f..013b59ca3bb1 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>>   	gpu->nr_rings = nr_rings;
>>   
>>   	refcount_set(&gpu->sysprof_active, 1);
>> +	refcount_set(&gpu->local_counters_active, 1);
>>   
>>   	return 0;
>>   
>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>> index e25009150579..83c61e523b1b 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>> @@ -195,12 +195,28 @@ struct msm_gpu {
>>   	int nr_rings;
>>   
>>   	/**
>> -	 * sysprof_active:
>> +	 * @sysprof_active:
>>   	 *
>> -	 * The count of contexts that have enabled system profiling.
>> +	 * The count of contexts that have enabled system profiling plus one.
>> +	 *
>> +	 * Note: refcount_t does not like 0->1 transitions.. we want to keep
>> +	 * the under/overflow checks that refcount_t provides, but allow
>> +	 * multiple on/off transitions so we track the logical value plus one.)
>>   	 */
>>   	refcount_t sysprof_active;
>>   
>> +	/**
>> +	 * @local_counters_active:
>> +	 *
>> +	 * The count of contexts that have requested local (intra-submit)
>> +	 * performance counter usage plus one.
>> +	 *
>> +	 * Note: refcount_t does not like 0->1 transitions.. we want to keep
>> +	 * the under/overflow checks that refcount_t provides, but allow
>> +	 * multiple on/off transitions so we track the logical value plus one.)
>> +	 */
>> +	refcount_t local_counters_active;
>> +
>>   	/**
>>   	 * lock:
>>   	 *
>> @@ -383,6 +399,13 @@ struct msm_file_private {
>>   	 */
>>   	int sysprof;
>>   
>> +	/**
>> +	 * @counters_requested:
>> +	 *
>> +	 * Has the context requested local perfcntr usage.
>> +	 */
>> +	bool counters_requested;
>> +
>>   	/**
>>   	 * comm: Overridden task comm, see MSM_PARAM_COMM
>>   	 *
>> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
>>   
>>   int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>   				 struct msm_gpu *gpu, int sysprof);
>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>> +				      struct msm_gpu *gpu, int reqcntrs);
>>   void __msm_file_private_destroy(struct kref *kref);
>>   
>>   static inline void msm_file_private_put(struct msm_file_private *ctx)
>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
>> index 7fed1de63b5d..1e1e21e6f7ae 100644
>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
>> @@ -10,6 +10,15 @@
>>   int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>   				 struct msm_gpu *gpu, int sysprof)
>>   {
>> +	int ret = 0;
>> +
>> +	mutex_lock(&gpu->lock);
>> +
>> +	if (sysprof && (refcount_read(&gpu->local_counters_active) > 1)) {
>> +		ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
>> +		goto out_unlock;
>> +	}
>> +
>>   	/*
>>   	 * Since pm_runtime and sysprof_active are both refcounts, we
>>   	 * call apply the new value first, and then unwind the previous
>> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>   
>>   	switch (sysprof) {
>>   	default:
>> -		return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>> +		ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>> +		goto out_unlock;
>>   	case 2:
>>   		pm_runtime_get_sync(&gpu->pdev->dev);
>>   		fallthrough;
>> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>   
>>   	ctx->sysprof = sysprof;
>>   
>> -	return 0;
>> +out_unlock:
>> +	mutex_unlock(&gpu->lock);
>> +
>> +	return ret;
>> +}
>> +
>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>> +				      struct msm_gpu *gpu, int reqctrs)
>> +{
>> +	int ret = 0;
>> +
>> +	mutex_lock(&gpu->lock);
>> +
>> +	if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
>> +		ret = UERR(EBUSY, gpu->dev, "System profiling active");
>> +		goto out_unlock;
>> +	}
>> +
>> +	if (reqctrs) {
>> +		if (ctx->counters_requested) {
>> +			ret = UERR(EINVAL, gpu->dev, "Already requested");
>> +			goto out_unlock;
>> +		}
>> +
>> +		ctx->counters_requested = true;
>> +		refcount_inc(&gpu->local_counters_active);
>> +	} else {
>> +		if (!ctx->counters_requested) {
>> +			ret = UERR(EINVAL, gpu->dev, "Not requested");
>> +			goto out_unlock;
>> +		}
>> +		refcount_dec(&gpu->local_counters_active);
>> +		ctx->counters_requested = false;
>> +	}
>> +
>> +out_unlock:
>> +	mutex_unlock(&gpu->lock);
>> +
>> +	return ret;
>>   }
>>   
>>   void __msm_file_private_destroy(struct kref *kref)
>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
>> index 2342cb90857e..ae7fb355e4a1 100644
>> --- a/include/uapi/drm/msm_drm.h
>> +++ b/include/uapi/drm/msm_drm.h
>> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
>>   #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
>>   #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
>>   #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
>> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-submit) perfcntr usage  */
>>   
>>   /* For backwards compat.  The original support for preemption was based on
>>    * a single ring per priority level so # of priority levels equals the #
> 


Best regards,
Rob Clark Dec. 12, 2024, 5:08 p.m. UTC | #3
On Thu, Dec 12, 2024 at 7:59 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 12/5/2024 10:24 PM, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > Performance counter usage falls into two categories:
> >
> > 1. Local usage, where the counter configuration, start, and end read
> >    happen within (locally to) a single SUBMIT.  In this case, there is
> >    no dependency on counter configuration or values between submits, and
> >    in fact counters are normally cleared on context switches, making it
> >    impossible to rely on cross-submit state.
> >
> > 2. Global usage, where a single privilaged daemon/process is sampling
> >    counter values across all processes for profiling.
> >
> > The two categories are mutually exclusive.  While you can have many
> > processes making local counter usage, you cannot combine global and
> > local usage without the two stepping on each others feet (by changing
> > counter configuration).
> >
> > For global counter usage, there is already a SYSPROF param (since global
> > counter usage requires disabling counter clearing on context switch).
> > This patch adds a REQ_CNTRS param to request local counter usage.  If
> > one or more processes has requested counter usage, then a SYSPROF
> > request will fail with -EBUSY.  And if SYSPROF is active, then REQ_CNTRS
> > will fail with -EBUSY, maintaining the mutual exclusivity.
> >
> > This is purely an advisory interface to help coordinate userspace.
> > There is no real means of enforcement, but the worst that can happen if
> > userspace ignores a REQ_CNTRS failure is that you'll get nonsense
> > profiling data.
> >
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> > kgsl takes a different approach, which involves a lot more UABI for
> > assigning counters to different processes.  But I think by taking
> > advantage of the fact that mesa (freedreno+turnip) reconfigure the
> > counters they need in each SUBMIT, for their respective gl/vk perf-
> > counter extensions, we can take this simpler approach.
>
> KGSL's approach is preemption and ifpc safe (also whatever HW changes
> that will come up in future generations). How will we ensure that here?
>
> I have plans to bring up IFPC support in near future. Also, I brought up
> this point during preemption series. But from the responses, I felt that
> profiling was not considered a serious usecase. Still I wonder how the
> perfcounter extensions work accurately with preemption.

Re: IFPC, I think initially we have to inhibit IFPC when SYSPROF is active

Longer term, I think we want to just save and restore all of the SEL
regs as well as the counters themselves on preemption.  AFAIU
currently only the counters themselves are saved/restored.  But there
is only one 32b SEL reg for each 64b counter, so I'm not sure that you
save that many cycles by not just saving/restoring the SEL regs as
well.  (And of course with REQ_CNTRS the kernel knows which processes
need counter save/restore and which do not, so you are only taking the
extra context switch overhead if a process is actually using the
perfcntrs.)

Alternatively, I think we could just declare this as a userspace
problem, and solve it with CP_SET_AMBLE PREAMBLE/POSTAMBLE?

Just for background, rendernode UABI is exposed to all processes that
can use the GPU, ie. basically everything.  Which makes it an
attractive attack surface.  This is why I prefer minimalism when it
comes to UABI, and not adding new ioctls and complexity in the kernel
when it is not essential ;-)

BR,
-R

> -Akhil
>
> >
> >  drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
> >  drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
> >  drivers/gpu/drm/msm/msm_gpu.c           |  1 +
> >  drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
> >  drivers/gpu/drm/msm/msm_submitqueue.c   | 52 ++++++++++++++++++++++++-
> >  include/uapi/drm/msm_drm.h              |  1 +
> >  6 files changed, 85 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > index 31bbf2c83de4..f688e37059b8 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> >               if (!capable(CAP_SYS_ADMIN))
> >                       return UERR(EPERM, drm, "invalid permissions");
> >               return msm_file_private_set_sysprof(ctx, gpu, value);
> > +     case MSM_PARAM_REQ_CNTRS:
> > +             return msm_file_private_request_counters(ctx, gpu, value);
> >       default:
> >               return UERR(EINVAL, drm, "%s: invalid param: %u", gpu->name, param);
> >       }
> > diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> > index 6416d2cb4efc..bf8314ff4a25 100644
> > --- a/drivers/gpu/drm/msm/msm_drv.c
> > +++ b/drivers/gpu/drm/msm/msm_drv.c
> > @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device *dev, struct drm_file *file)
> >        * It is not possible to set sysprof param to non-zero if gpu
> >        * is not initialized:
> >        */
> > -     if (priv->gpu)
> > +     if (ctx->sysprof)
> >               msm_file_private_set_sysprof(ctx, priv->gpu, 0);
> >
> > +     if (ctx->counters_requested)
> > +             msm_file_private_request_counters(ctx, priv->gpu, 0);
> > +
> >       context_close(ctx);
> >  }
> >
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> > index 82f204f3bb8f..013b59ca3bb1 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.c
> > +++ b/drivers/gpu/drm/msm/msm_gpu.c
> > @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
> >       gpu->nr_rings = nr_rings;
> >
> >       refcount_set(&gpu->sysprof_active, 1);
> > +     refcount_set(&gpu->local_counters_active, 1);
> >
> >       return 0;
> >
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> > index e25009150579..83c61e523b1b 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.h
> > +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > @@ -195,12 +195,28 @@ struct msm_gpu {
> >       int nr_rings;
> >
> >       /**
> > -      * sysprof_active:
> > +      * @sysprof_active:
> >        *
> > -      * The count of contexts that have enabled system profiling.
> > +      * The count of contexts that have enabled system profiling plus one.
> > +      *
> > +      * Note: refcount_t does not like 0->1 transitions.. we want to keep
> > +      * the under/overflow checks that refcount_t provides, but allow
> > +      * multiple on/off transitions so we track the logical value plus one.)
> >        */
> >       refcount_t sysprof_active;
> >
> > +     /**
> > +      * @local_counters_active:
> > +      *
> > +      * The count of contexts that have requested local (intra-submit)
> > +      * performance counter usage plus one.
> > +      *
> > +      * Note: refcount_t does not like 0->1 transitions.. we want to keep
> > +      * the under/overflow checks that refcount_t provides, but allow
> > +      * multiple on/off transitions so we track the logical value plus one.)
> > +      */
> > +     refcount_t local_counters_active;
> > +
> >       /**
> >        * lock:
> >        *
> > @@ -383,6 +399,13 @@ struct msm_file_private {
> >        */
> >       int sysprof;
> >
> > +     /**
> > +      * @counters_requested:
> > +      *
> > +      * Has the context requested local perfcntr usage.
> > +      */
> > +     bool counters_requested;
> > +
> >       /**
> >        * comm: Overridden task comm, see MSM_PARAM_COMM
> >        *
> > @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
> >
> >  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> >                                struct msm_gpu *gpu, int sysprof);
> > +int msm_file_private_request_counters(struct msm_file_private *ctx,
> > +                                   struct msm_gpu *gpu, int reqcntrs);
> >  void __msm_file_private_destroy(struct kref *kref);
> >
> >  static inline void msm_file_private_put(struct msm_file_private *ctx)
> > diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
> > index 7fed1de63b5d..1e1e21e6f7ae 100644
> > --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> > +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> > @@ -10,6 +10,15 @@
> >  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> >                                struct msm_gpu *gpu, int sysprof)
> >  {
> > +     int ret = 0;
> > +
> > +     mutex_lock(&gpu->lock);
> > +
> > +     if (sysprof && (refcount_read(&gpu->local_counters_active) > 1)) {
> > +             ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
> > +             goto out_unlock;
> > +     }
> > +
> >       /*
> >        * Since pm_runtime and sysprof_active are both refcounts, we
> >        * call apply the new value first, and then unwind the previous
> > @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> >
> >       switch (sysprof) {
> >       default:
> > -             return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
> > +             ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
> > +             goto out_unlock;
> >       case 2:
> >               pm_runtime_get_sync(&gpu->pdev->dev);
> >               fallthrough;
> > @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> >
> >       ctx->sysprof = sysprof;
> >
> > -     return 0;
> > +out_unlock:
> > +     mutex_unlock(&gpu->lock);
> > +
> > +     return ret;
> > +}
> > +
> > +int msm_file_private_request_counters(struct msm_file_private *ctx,
> > +                                   struct msm_gpu *gpu, int reqctrs)
> > +{
> > +     int ret = 0;
> > +
> > +     mutex_lock(&gpu->lock);
> > +
> > +     if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
> > +             ret = UERR(EBUSY, gpu->dev, "System profiling active");
> > +             goto out_unlock;
> > +     }
> > +
> > +     if (reqctrs) {
> > +             if (ctx->counters_requested) {
> > +                     ret = UERR(EINVAL, gpu->dev, "Already requested");
> > +                     goto out_unlock;
> > +             }
> > +
> > +             ctx->counters_requested = true;
> > +             refcount_inc(&gpu->local_counters_active);
> > +     } else {
> > +             if (!ctx->counters_requested) {
> > +                     ret = UERR(EINVAL, gpu->dev, "Not requested");
> > +                     goto out_unlock;
> > +             }
> > +             refcount_dec(&gpu->local_counters_active);
> > +             ctx->counters_requested = false;
> > +     }
> > +
> > +out_unlock:
> > +     mutex_unlock(&gpu->lock);
> > +
> > +     return ret;
> >  }
> >
> >  void __msm_file_private_destroy(struct kref *kref)
> > diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
> > index 2342cb90857e..ae7fb355e4a1 100644
> > --- a/include/uapi/drm/msm_drm.h
> > +++ b/include/uapi/drm/msm_drm.h
> > @@ -91,6 +91,7 @@ struct drm_msm_timespec {
> >  #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
> >  #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
> >  #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
> > +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-submit) perfcntr usage  */
> >
> >  /* For backwards compat.  The original support for preemption was based on
> >   * a single ring per priority level so # of priority levels equals the #
>
Rob Clark Dec. 12, 2024, 5:12 p.m. UTC | #4
On Thu, Dec 12, 2024 at 9:08 AM Rob Clark <robdclark@gmail.com> wrote:
>
> On Thu, Dec 12, 2024 at 7:59 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> >
> > On 12/5/2024 10:24 PM, Rob Clark wrote:
> > > From: Rob Clark <robdclark@chromium.org>
> > >
> > > Performance counter usage falls into two categories:
> > >
> > > 1. Local usage, where the counter configuration, start, and end read
> > >    happen within (locally to) a single SUBMIT.  In this case, there is
> > >    no dependency on counter configuration or values between submits, and
> > >    in fact counters are normally cleared on context switches, making it
> > >    impossible to rely on cross-submit state.
> > >
> > > 2. Global usage, where a single privilaged daemon/process is sampling
> > >    counter values across all processes for profiling.
> > >
> > > The two categories are mutually exclusive.  While you can have many
> > > processes making local counter usage, you cannot combine global and
> > > local usage without the two stepping on each others feet (by changing
> > > counter configuration).
> > >
> > > For global counter usage, there is already a SYSPROF param (since global
> > > counter usage requires disabling counter clearing on context switch).
> > > This patch adds a REQ_CNTRS param to request local counter usage.  If
> > > one or more processes has requested counter usage, then a SYSPROF
> > > request will fail with -EBUSY.  And if SYSPROF is active, then REQ_CNTRS
> > > will fail with -EBUSY, maintaining the mutual exclusivity.
> > >
> > > This is purely an advisory interface to help coordinate userspace.
> > > There is no real means of enforcement, but the worst that can happen if
> > > userspace ignores a REQ_CNTRS failure is that you'll get nonsense
> > > profiling data.
> > >
> > > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > > ---
> > > kgsl takes a different approach, which involves a lot more UABI for
> > > assigning counters to different processes.  But I think by taking
> > > advantage of the fact that mesa (freedreno+turnip) reconfigure the
> > > counters they need in each SUBMIT, for their respective gl/vk perf-
> > > counter extensions, we can take this simpler approach.
> >
> > KGSL's approach is preemption and ifpc safe (also whatever HW changes
> > that will come up in future generations). How will we ensure that here?
> >
> > I have plans to bring up IFPC support in near future. Also, I brought up
> > this point during preemption series. But from the responses, I felt that
> > profiling was not considered a serious usecase. Still I wonder how the
> > perfcounter extensions work accurately with preemption.
>
> Re: IFPC, I think initially we have to inhibit IFPC when SYSPROF is active
>
> Longer term, I think we want to just save and restore all of the SEL
> regs as well as the counters themselves on preemption.  AFAIU
> currently only the counters themselves are saved/restored.  But there
> is only one 32b SEL reg for each 64b counter, so I'm not sure that you
> save that many cycles by not just saving/restoring the SEL regs as
> well.  (And of course with REQ_CNTRS the kernel knows which processes
> need counter save/restore and which do not, so you are only taking the
> extra context switch overhead if a process is actually using the
> perfcntrs.)

Actually I'm maybe blending two different, but similar cases.
PREAMBLE/POSTAMBLE, I think, cover us for preemption

For IFPC, we'd need a way to tell GMU that SYSPROF is active, so it
could save/restore all the counters and selectors  (IFPC shouldn't
matter for local profiling / REQ_CNTRS case, since you wouldn't go
into IFPC mid-submit.)

BR,
-R

> Alternatively, I think we could just declare this as a userspace
> problem, and solve it with CP_SET_AMBLE PREAMBLE/POSTAMBLE?
>
> Just for background, rendernode UABI is exposed to all processes that
> can use the GPU, ie. basically everything.  Which makes it an
> attractive attack surface.  This is why I prefer minimalism when it
> comes to UABI, and not adding new ioctls and complexity in the kernel
> when it is not essential ;-)
>
> BR,
> -R
>
> > -Akhil
> >
> > >
> > >  drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
> > >  drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
> > >  drivers/gpu/drm/msm/msm_gpu.c           |  1 +
> > >  drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
> > >  drivers/gpu/drm/msm/msm_submitqueue.c   | 52 ++++++++++++++++++++++++-
> > >  include/uapi/drm/msm_drm.h              |  1 +
> > >  6 files changed, 85 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > index 31bbf2c83de4..f688e37059b8 100644
> > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > > @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> > >               if (!capable(CAP_SYS_ADMIN))
> > >                       return UERR(EPERM, drm, "invalid permissions");
> > >               return msm_file_private_set_sysprof(ctx, gpu, value);
> > > +     case MSM_PARAM_REQ_CNTRS:
> > > +             return msm_file_private_request_counters(ctx, gpu, value);
> > >       default:
> > >               return UERR(EINVAL, drm, "%s: invalid param: %u", gpu->name, param);
> > >       }
> > > diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> > > index 6416d2cb4efc..bf8314ff4a25 100644
> > > --- a/drivers/gpu/drm/msm/msm_drv.c
> > > +++ b/drivers/gpu/drm/msm/msm_drv.c
> > > @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device *dev, struct drm_file *file)
> > >        * It is not possible to set sysprof param to non-zero if gpu
> > >        * is not initialized:
> > >        */
> > > -     if (priv->gpu)
> > > +     if (ctx->sysprof)
> > >               msm_file_private_set_sysprof(ctx, priv->gpu, 0);
> > >
> > > +     if (ctx->counters_requested)
> > > +             msm_file_private_request_counters(ctx, priv->gpu, 0);
> > > +
> > >       context_close(ctx);
> > >  }
> > >
> > > diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> > > index 82f204f3bb8f..013b59ca3bb1 100644
> > > --- a/drivers/gpu/drm/msm/msm_gpu.c
> > > +++ b/drivers/gpu/drm/msm/msm_gpu.c
> > > @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
> > >       gpu->nr_rings = nr_rings;
> > >
> > >       refcount_set(&gpu->sysprof_active, 1);
> > > +     refcount_set(&gpu->local_counters_active, 1);
> > >
> > >       return 0;
> > >
> > > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> > > index e25009150579..83c61e523b1b 100644
> > > --- a/drivers/gpu/drm/msm/msm_gpu.h
> > > +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > > @@ -195,12 +195,28 @@ struct msm_gpu {
> > >       int nr_rings;
> > >
> > >       /**
> > > -      * sysprof_active:
> > > +      * @sysprof_active:
> > >        *
> > > -      * The count of contexts that have enabled system profiling.
> > > +      * The count of contexts that have enabled system profiling plus one.
> > > +      *
> > > +      * Note: refcount_t does not like 0->1 transitions.. we want to keep
> > > +      * the under/overflow checks that refcount_t provides, but allow
> > > +      * multiple on/off transitions so we track the logical value plus one.)
> > >        */
> > >       refcount_t sysprof_active;
> > >
> > > +     /**
> > > +      * @local_counters_active:
> > > +      *
> > > +      * The count of contexts that have requested local (intra-submit)
> > > +      * performance counter usage plus one.
> > > +      *
> > > +      * Note: refcount_t does not like 0->1 transitions.. we want to keep
> > > +      * the under/overflow checks that refcount_t provides, but allow
> > > +      * multiple on/off transitions so we track the logical value plus one.)
> > > +      */
> > > +     refcount_t local_counters_active;
> > > +
> > >       /**
> > >        * lock:
> > >        *
> > > @@ -383,6 +399,13 @@ struct msm_file_private {
> > >        */
> > >       int sysprof;
> > >
> > > +     /**
> > > +      * @counters_requested:
> > > +      *
> > > +      * Has the context requested local perfcntr usage.
> > > +      */
> > > +     bool counters_requested;
> > > +
> > >       /**
> > >        * comm: Overridden task comm, see MSM_PARAM_COMM
> > >        *
> > > @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
> > >
> > >  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> > >                                struct msm_gpu *gpu, int sysprof);
> > > +int msm_file_private_request_counters(struct msm_file_private *ctx,
> > > +                                   struct msm_gpu *gpu, int reqcntrs);
> > >  void __msm_file_private_destroy(struct kref *kref);
> > >
> > >  static inline void msm_file_private_put(struct msm_file_private *ctx)
> > > diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
> > > index 7fed1de63b5d..1e1e21e6f7ae 100644
> > > --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> > > +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> > > @@ -10,6 +10,15 @@
> > >  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> > >                                struct msm_gpu *gpu, int sysprof)
> > >  {
> > > +     int ret = 0;
> > > +
> > > +     mutex_lock(&gpu->lock);
> > > +
> > > +     if (sysprof && (refcount_read(&gpu->local_counters_active) > 1)) {
> > > +             ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
> > > +             goto out_unlock;
> > > +     }
> > > +
> > >       /*
> > >        * Since pm_runtime and sysprof_active are both refcounts, we
> > >        * call apply the new value first, and then unwind the previous
> > > @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> > >
> > >       switch (sysprof) {
> > >       default:
> > > -             return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
> > > +             ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
> > > +             goto out_unlock;
> > >       case 2:
> > >               pm_runtime_get_sync(&gpu->pdev->dev);
> > >               fallthrough;
> > > @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> > >
> > >       ctx->sysprof = sysprof;
> > >
> > > -     return 0;
> > > +out_unlock:
> > > +     mutex_unlock(&gpu->lock);
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +int msm_file_private_request_counters(struct msm_file_private *ctx,
> > > +                                   struct msm_gpu *gpu, int reqctrs)
> > > +{
> > > +     int ret = 0;
> > > +
> > > +     mutex_lock(&gpu->lock);
> > > +
> > > +     if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
> > > +             ret = UERR(EBUSY, gpu->dev, "System profiling active");
> > > +             goto out_unlock;
> > > +     }
> > > +
> > > +     if (reqctrs) {
> > > +             if (ctx->counters_requested) {
> > > +                     ret = UERR(EINVAL, gpu->dev, "Already requested");
> > > +                     goto out_unlock;
> > > +             }
> > > +
> > > +             ctx->counters_requested = true;
> > > +             refcount_inc(&gpu->local_counters_active);
> > > +     } else {
> > > +             if (!ctx->counters_requested) {
> > > +                     ret = UERR(EINVAL, gpu->dev, "Not requested");
> > > +                     goto out_unlock;
> > > +             }
> > > +             refcount_dec(&gpu->local_counters_active);
> > > +             ctx->counters_requested = false;
> > > +     }
> > > +
> > > +out_unlock:
> > > +     mutex_unlock(&gpu->lock);
> > > +
> > > +     return ret;
> > >  }
> > >
> > >  void __msm_file_private_destroy(struct kref *kref)
> > > diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
> > > index 2342cb90857e..ae7fb355e4a1 100644
> > > --- a/include/uapi/drm/msm_drm.h
> > > +++ b/include/uapi/drm/msm_drm.h
> > > @@ -91,6 +91,7 @@ struct drm_msm_timespec {
> > >  #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
> > >  #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
> > >  #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
> > > +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-submit) perfcntr usage  */
> > >
> > >  /* For backwards compat.  The original support for preemption was based on
> > >   * a single ring per priority level so # of priority levels equals the #
> >
Akhil P Oommen Dec. 13, 2024, 4:47 p.m. UTC | #5
On 12/12/2024 10:42 PM, Rob Clark wrote:
> On Thu, Dec 12, 2024 at 9:08 AM Rob Clark <robdclark@gmail.com> wrote:
>>
>> On Thu, Dec 12, 2024 at 7:59 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>>>
>>> On 12/5/2024 10:24 PM, Rob Clark wrote:
>>>> From: Rob Clark <robdclark@chromium.org>
>>>>
>>>> Performance counter usage falls into two categories:
>>>>
>>>> 1. Local usage, where the counter configuration, start, and end read
>>>>    happen within (locally to) a single SUBMIT.  In this case, there is
>>>>    no dependency on counter configuration or values between submits, and
>>>>    in fact counters are normally cleared on context switches, making it
>>>>    impossible to rely on cross-submit state.
>>>>
>>>> 2. Global usage, where a single privilaged daemon/process is sampling
>>>>    counter values across all processes for profiling.
>>>>
>>>> The two categories are mutually exclusive.  While you can have many
>>>> processes making local counter usage, you cannot combine global and
>>>> local usage without the two stepping on each others feet (by changing
>>>> counter configuration).

As such the HW doesn't have any limitation, unless you run out of
counters in a group. We just need an arbitration between processes (UMD
or KMD based).

Also, KGSL exposes an ioctl to directly read the counter with a fixed
minimal latency. Because inline reads via submission may have huge
latency spikes based on workload especially when compute shaders are
involved. Isn't a low latency counter reads desirable in a fullfledged
system profiler?

>>>>
>>>> For global counter usage, there is already a SYSPROF param (since global
>>>> counter usage requires disabling counter clearing on context switch).
>>>> This patch adds a REQ_CNTRS param to request local counter usage.  If
>>>> one or more processes has requested counter usage, then a SYSPROF
>>>> request will fail with -EBUSY.  And if SYSPROF is active, then REQ_CNTRS
>>>> will fail with -EBUSY, maintaining the mutual exclusivity.
>>>>
>>>> This is purely an advisory interface to help coordinate userspace.
>>>> There is no real means of enforcement, but the worst that can happen if
>>>> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
>>>> profiling data.
>>>>
>>>> Signed-off-by: Rob Clark <robdclark@chromium.org>
>>>> ---
>>>> kgsl takes a different approach, which involves a lot more UABI for
>>>> assigning counters to different processes.  But I think by taking
>>>> advantage of the fact that mesa (freedreno+turnip) reconfigure the
>>>> counters they need in each SUBMIT, for their respective gl/vk perf-
>>>> counter extensions, we can take this simpler approach.
>>>
>>> KGSL's approach is preemption and ifpc safe (also whatever HW changes
>>> that will come up in future generations). How will we ensure that here?
>>>
>>> I have plans to bring up IFPC support in near future. Also, I brought up
>>> this point during preemption series. But from the responses, I felt that
>>> profiling was not considered a serious usecase. Still I wonder how the
>>> perfcounter extensions work accurately with preemption.
>>
>> Re: IFPC, I think initially we have to inhibit IFPC when SYSPROF is active
>>
>> Longer term, I think we want to just save and restore all of the SEL
>> regs as well as the counters themselves on preemption.  AFAIU
>> currently only the counters themselves are saved/restored.  But there
>> is only one 32b SEL reg for each 64b counter, so I'm not sure that you
>> save that many cycles by not just saving/restoring the SEL regs as
>> well.  (And of course with REQ_CNTRS the kernel knows which processes
>> need counter save/restore and which do not, so you are only taking the
>> extra context switch overhead if a process is actually using the
>> perfcntrs.)
> 
> Actually I'm maybe blending two different, but similar cases.
> PREAMBLE/POSTAMBLE, I think, cover us for preemption
> 
> For IFPC, we'd need a way to tell GMU that SYSPROF is active, so it
> could save/restore all the counters and selectors  (IFPC shouldn't
> matter for local profiling / REQ_CNTRS case, since you wouldn't go
> into IFPC mid-submit.)
> 
> BR,
> -R
> 
>> Alternatively, I think we could just declare this as a userspace
>> problem, and solve it with CP_SET_AMBLE PREAMBLE/POSTAMBLE?
>>
>> Just for background, rendernode UABI is exposed to all processes that
>> can use the GPU, ie. basically everything.  Which makes it an
>> attractive attack surface.  This is why I prefer minimalism when it
>> comes to UABI, and not adding new ioctls and complexity in the kernel
>> when it is not essential ;-)

I fully agree with you about maintaining minimalism in KMD. Here all we
need is a way for UMD to ask "give me a counter offset with 'x'
countable from 'y' group". And let KMD do the arbitration of counters
between userspace processes and also within KMD. And we can cut down on
some of the related things present in kgsl which are unnecessary at the
moment.

More importantly, I am not sure if we should really fight hard against
something that is basically an architectur spec. Future HW evolution
happens based on this architecture. So is it really wise to build things
in the opposite direction. FYI, all other GPU KMD drivers which Qcom
uses are aligned on this.

-Akhil

>>
>> BR,
>> -R
>>
>>> -Akhil
>>>
>>>>
>>>>  drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
>>>>  drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
>>>>  drivers/gpu/drm/msm/msm_gpu.c           |  1 +
>>>>  drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
>>>>  drivers/gpu/drm/msm/msm_submitqueue.c   | 52 ++++++++++++++++++++++++-
>>>>  include/uapi/drm/msm_drm.h              |  1 +
>>>>  6 files changed, 85 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>> index 31bbf2c83de4..f688e37059b8 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
>>>>               if (!capable(CAP_SYS_ADMIN))
>>>>                       return UERR(EPERM, drm, "invalid permissions");
>>>>               return msm_file_private_set_sysprof(ctx, gpu, value);
>>>> +     case MSM_PARAM_REQ_CNTRS:
>>>> +             return msm_file_private_request_counters(ctx, gpu, value);
>>>>       default:
>>>>               return UERR(EINVAL, drm, "%s: invalid param: %u", gpu->name, param);
>>>>       }
>>>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
>>>> index 6416d2cb4efc..bf8314ff4a25 100644
>>>> --- a/drivers/gpu/drm/msm/msm_drv.c
>>>> +++ b/drivers/gpu/drm/msm/msm_drv.c
>>>> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device *dev, struct drm_file *file)
>>>>        * It is not possible to set sysprof param to non-zero if gpu
>>>>        * is not initialized:
>>>>        */
>>>> -     if (priv->gpu)
>>>> +     if (ctx->sysprof)
>>>>               msm_file_private_set_sysprof(ctx, priv->gpu, 0);
>>>>
>>>> +     if (ctx->counters_requested)
>>>> +             msm_file_private_request_counters(ctx, priv->gpu, 0);
>>>> +
>>>>       context_close(ctx);
>>>>  }
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
>>>> index 82f204f3bb8f..013b59ca3bb1 100644
>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>>>> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>>>>       gpu->nr_rings = nr_rings;
>>>>
>>>>       refcount_set(&gpu->sysprof_active, 1);
>>>> +     refcount_set(&gpu->local_counters_active, 1);
>>>>
>>>>       return 0;
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>>>> index e25009150579..83c61e523b1b 100644
>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>>> @@ -195,12 +195,28 @@ struct msm_gpu {
>>>>       int nr_rings;
>>>>
>>>>       /**
>>>> -      * sysprof_active:
>>>> +      * @sysprof_active:
>>>>        *
>>>> -      * The count of contexts that have enabled system profiling.
>>>> +      * The count of contexts that have enabled system profiling plus one.
>>>> +      *
>>>> +      * Note: refcount_t does not like 0->1 transitions.. we want to keep
>>>> +      * the under/overflow checks that refcount_t provides, but allow
>>>> +      * multiple on/off transitions so we track the logical value plus one.)
>>>>        */
>>>>       refcount_t sysprof_active;
>>>>
>>>> +     /**
>>>> +      * @local_counters_active:
>>>> +      *
>>>> +      * The count of contexts that have requested local (intra-submit)
>>>> +      * performance counter usage plus one.
>>>> +      *
>>>> +      * Note: refcount_t does not like 0->1 transitions.. we want to keep
>>>> +      * the under/overflow checks that refcount_t provides, but allow
>>>> +      * multiple on/off transitions so we track the logical value plus one.)
>>>> +      */
>>>> +     refcount_t local_counters_active;
>>>> +
>>>>       /**
>>>>        * lock:
>>>>        *
>>>> @@ -383,6 +399,13 @@ struct msm_file_private {
>>>>        */
>>>>       int sysprof;
>>>>
>>>> +     /**
>>>> +      * @counters_requested:
>>>> +      *
>>>> +      * Has the context requested local perfcntr usage.
>>>> +      */
>>>> +     bool counters_requested;
>>>> +
>>>>       /**
>>>>        * comm: Overridden task comm, see MSM_PARAM_COMM
>>>>        *
>>>> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
>>>>
>>>>  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>                                struct msm_gpu *gpu, int sysprof);
>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>>> +                                   struct msm_gpu *gpu, int reqcntrs);
>>>>  void __msm_file_private_destroy(struct kref *kref);
>>>>
>>>>  static inline void msm_file_private_put(struct msm_file_private *ctx)
>>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
>>>> index 7fed1de63b5d..1e1e21e6f7ae 100644
>>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
>>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
>>>> @@ -10,6 +10,15 @@
>>>>  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>                                struct msm_gpu *gpu, int sysprof)
>>>>  {
>>>> +     int ret = 0;
>>>> +
>>>> +     mutex_lock(&gpu->lock);
>>>> +
>>>> +     if (sysprof && (refcount_read(&gpu->local_counters_active) > 1)) {
>>>> +             ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
>>>> +             goto out_unlock;
>>>> +     }
>>>> +
>>>>       /*
>>>>        * Since pm_runtime and sysprof_active are both refcounts, we
>>>>        * call apply the new value first, and then unwind the previous
>>>> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>
>>>>       switch (sysprof) {
>>>>       default:
>>>> -             return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>>>> +             ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>>>> +             goto out_unlock;
>>>>       case 2:
>>>>               pm_runtime_get_sync(&gpu->pdev->dev);
>>>>               fallthrough;
>>>> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>
>>>>       ctx->sysprof = sysprof;
>>>>
>>>> -     return 0;
>>>> +out_unlock:
>>>> +     mutex_unlock(&gpu->lock);
>>>> +
>>>> +     return ret;
>>>> +}
>>>> +
>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>>> +                                   struct msm_gpu *gpu, int reqctrs)
>>>> +{
>>>> +     int ret = 0;
>>>> +
>>>> +     mutex_lock(&gpu->lock);
>>>> +
>>>> +     if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
>>>> +             ret = UERR(EBUSY, gpu->dev, "System profiling active");
>>>> +             goto out_unlock;
>>>> +     }
>>>> +
>>>> +     if (reqctrs) {
>>>> +             if (ctx->counters_requested) {
>>>> +                     ret = UERR(EINVAL, gpu->dev, "Already requested");
>>>> +                     goto out_unlock;
>>>> +             }
>>>> +
>>>> +             ctx->counters_requested = true;
>>>> +             refcount_inc(&gpu->local_counters_active);
>>>> +     } else {
>>>> +             if (!ctx->counters_requested) {
>>>> +                     ret = UERR(EINVAL, gpu->dev, "Not requested");
>>>> +                     goto out_unlock;
>>>> +             }
>>>> +             refcount_dec(&gpu->local_counters_active);
>>>> +             ctx->counters_requested = false;
>>>> +     }
>>>> +
>>>> +out_unlock:
>>>> +     mutex_unlock(&gpu->lock);
>>>> +
>>>> +     return ret;
>>>>  }
>>>>
>>>>  void __msm_file_private_destroy(struct kref *kref)
>>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
>>>> index 2342cb90857e..ae7fb355e4a1 100644
>>>> --- a/include/uapi/drm/msm_drm.h
>>>> +++ b/include/uapi/drm/msm_drm.h
>>>> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
>>>>  #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
>>>>  #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
>>>>  #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
>>>> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-submit) perfcntr usage  */
>>>>
>>>>  /* For backwards compat.  The original support for preemption was based on
>>>>   * a single ring per priority level so # of priority levels equals the #
>>>
Akhil P Oommen Dec. 13, 2024, 4:50 p.m. UTC | #6
On 12/12/2024 9:44 PM, Antonino Maniscalco wrote:
> On 12/12/24 4:58 PM, Akhil P Oommen wrote:
>> On 12/5/2024 10:24 PM, Rob Clark wrote:
>>> From: Rob Clark <robdclark@chromium.org>
>>>
>>> Performance counter usage falls into two categories:
>>>
>>> 1. Local usage, where the counter configuration, start, and end read
>>>     happen within (locally to) a single SUBMIT.  In this case, there is
>>>     no dependency on counter configuration or values between submits,
>>> and
>>>     in fact counters are normally cleared on context switches, making it
>>>     impossible to rely on cross-submit state.
>>>
>>> 2. Global usage, where a single privilaged daemon/process is sampling
>>>     counter values across all processes for profiling.
>>>
>>> The two categories are mutually exclusive.  While you can have many
>>> processes making local counter usage, you cannot combine global and
>>> local usage without the two stepping on each others feet (by changing
>>> counter configuration).
>>>
>>> For global counter usage, there is already a SYSPROF param (since global
>>> counter usage requires disabling counter clearing on context switch).
>>> This patch adds a REQ_CNTRS param to request local counter usage.  If
>>> one or more processes has requested counter usage, then a SYSPROF
>>> request will fail with -EBUSY.  And if SYSPROF is active, then REQ_CNTRS
>>> will fail with -EBUSY, maintaining the mutual exclusivity.
>>>
>>> This is purely an advisory interface to help coordinate userspace.
>>> There is no real means of enforcement, but the worst that can happen if
>>> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
>>> profiling data.
>>>
>>> Signed-off-by: Rob Clark <robdclark@chromium.org>
>>> ---
>>> kgsl takes a different approach, which involves a lot more UABI for
>>> assigning counters to different processes.  But I think by taking
>>> advantage of the fact that mesa (freedreno+turnip) reconfigure the
>>> counters they need in each SUBMIT, for their respective gl/vk perf-
>>> counter extensions, we can take this simpler approach.
>>
>> KGSL's approach is preemption and ifpc safe (also whatever HW changes
>> that will come up in future generations). How will we ensure that here?
>>
>> I have plans to bring up IFPC support in near future. Also, I brought up
>> this point during preemption series. But from the responses, I felt that
>> profiling was not considered a serious usecase. Still I wonder how the
>> perfcounter extensions work accurately with preemption.
> 
> So back then I implemented the postamble IB to clear perf counters and
> that gets disabled when sysprof (so global usage) is happening. The
> kernel is oblivious to "Local isage" of profiling but in that case
> really what we want to do is disable preemption which in my
> understanding can be done from userspace with a PKT. In my understanding
> this had us covered for all usecases.

I think this wasn't mentioned at that time. Which UMD PKT in a6x+ did
you mean?

-Akhil.

> 
> So what would you expect instead we should do kernel side to make
> profiling preemption safe?
> 
>>
>> -Akhil
>>
>>>
>>>   drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
>>>   drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
>>>   drivers/gpu/drm/msm/msm_gpu.c           |  1 +
>>>   drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
>>>   drivers/gpu/drm/msm/msm_submitqueue.c   | 52 ++++++++++++++++++++++++-
>>>   include/uapi/drm/msm_drm.h              |  1 +
>>>   6 files changed, 85 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/
>>> drm/msm/adreno/adreno_gpu.c
>>> index 31bbf2c83de4..f688e37059b8 100644
>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct
>>> msm_file_private *ctx,
>>>           if (!capable(CAP_SYS_ADMIN))
>>>               return UERR(EPERM, drm, "invalid permissions");
>>>           return msm_file_private_set_sysprof(ctx, gpu, value);
>>> +    case MSM_PARAM_REQ_CNTRS:
>>> +        return msm_file_private_request_counters(ctx, gpu, value);
>>>       default:
>>>           return UERR(EINVAL, drm, "%s: invalid param: %u", gpu-
>>> >name, param);
>>>       }
>>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/
>>> msm_drv.c
>>> index 6416d2cb4efc..bf8314ff4a25 100644
>>> --- a/drivers/gpu/drm/msm/msm_drv.c
>>> +++ b/drivers/gpu/drm/msm/msm_drv.c
>>> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device
>>> *dev, struct drm_file *file)
>>>        * It is not possible to set sysprof param to non-zero if gpu
>>>        * is not initialized:
>>>        */
>>> -    if (priv->gpu)
>>> +    if (ctx->sysprof)
>>>           msm_file_private_set_sysprof(ctx, priv->gpu, 0);
>>>   +    if (ctx->counters_requested)
>>> +        msm_file_private_request_counters(ctx, priv->gpu, 0);
>>> +
>>>       context_close(ctx);
>>>   }
>>>   diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/
>>> msm_gpu.c
>>> index 82f204f3bb8f..013b59ca3bb1 100644
>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>>> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct
>>> platform_device *pdev,
>>>       gpu->nr_rings = nr_rings;
>>>         refcount_set(&gpu->sysprof_active, 1);
>>> +    refcount_set(&gpu->local_counters_active, 1);
>>>         return 0;
>>>   diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/
>>> msm_gpu.h
>>> index e25009150579..83c61e523b1b 100644
>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>> @@ -195,12 +195,28 @@ struct msm_gpu {
>>>       int nr_rings;
>>>         /**
>>> -     * sysprof_active:
>>> +     * @sysprof_active:
>>>        *
>>> -     * The count of contexts that have enabled system profiling.
>>> +     * The count of contexts that have enabled system profiling plus
>>> one.
>>> +     *
>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
>>> keep
>>> +     * the under/overflow checks that refcount_t provides, but allow
>>> +     * multiple on/off transitions so we track the logical value
>>> plus one.)
>>>        */
>>>       refcount_t sysprof_active;
>>>   +    /**
>>> +     * @local_counters_active:
>>> +     *
>>> +     * The count of contexts that have requested local (intra-submit)
>>> +     * performance counter usage plus one.
>>> +     *
>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
>>> keep
>>> +     * the under/overflow checks that refcount_t provides, but allow
>>> +     * multiple on/off transitions so we track the logical value
>>> plus one.)
>>> +     */
>>> +    refcount_t local_counters_active;
>>> +
>>>       /**
>>>        * lock:
>>>        *
>>> @@ -383,6 +399,13 @@ struct msm_file_private {
>>>        */
>>>       int sysprof;
>>>   +    /**
>>> +     * @counters_requested:
>>> +     *
>>> +     * Has the context requested local perfcntr usage.
>>> +     */
>>> +    bool counters_requested;
>>> +
>>>       /**
>>>        * comm: Overridden task comm, see MSM_PARAM_COMM
>>>        *
>>> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
>>>     int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>                    struct msm_gpu *gpu, int sysprof);
>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>> +                      struct msm_gpu *gpu, int reqcntrs);
>>>   void __msm_file_private_destroy(struct kref *kref);
>>>     static inline void msm_file_private_put(struct msm_file_private
>>> *ctx)
>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/
>>> msm/msm_submitqueue.c
>>> index 7fed1de63b5d..1e1e21e6f7ae 100644
>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
>>> @@ -10,6 +10,15 @@
>>>   int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>                    struct msm_gpu *gpu, int sysprof)
>>>   {
>>> +    int ret = 0;
>>> +
>>> +    mutex_lock(&gpu->lock);
>>> +
>>> +    if (sysprof && (refcount_read(&gpu->local_counters_active) > 1)) {
>>> +        ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
>>> +        goto out_unlock;
>>> +    }
>>> +
>>>       /*
>>>        * Since pm_runtime and sysprof_active are both refcounts, we
>>>        * call apply the new value first, and then unwind the previous
>>> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct
>>> msm_file_private *ctx,
>>>         switch (sysprof) {
>>>       default:
>>> -        return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>>> +        ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>>> +        goto out_unlock;
>>>       case 2:
>>>           pm_runtime_get_sync(&gpu->pdev->dev);
>>>           fallthrough;
>>> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct
>>> msm_file_private *ctx,
>>>         ctx->sysprof = sysprof;
>>>   -    return 0;
>>> +out_unlock:
>>> +    mutex_unlock(&gpu->lock);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>> +                      struct msm_gpu *gpu, int reqctrs)
>>> +{
>>> +    int ret = 0;
>>> +
>>> +    mutex_lock(&gpu->lock);
>>> +
>>> +    if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
>>> +        ret = UERR(EBUSY, gpu->dev, "System profiling active");
>>> +        goto out_unlock;
>>> +    }
>>> +
>>> +    if (reqctrs) {
>>> +        if (ctx->counters_requested) {
>>> +            ret = UERR(EINVAL, gpu->dev, "Already requested");
>>> +            goto out_unlock;
>>> +        }
>>> +
>>> +        ctx->counters_requested = true;
>>> +        refcount_inc(&gpu->local_counters_active);
>>> +    } else {
>>> +        if (!ctx->counters_requested) {
>>> +            ret = UERR(EINVAL, gpu->dev, "Not requested");
>>> +            goto out_unlock;
>>> +        }
>>> +        refcount_dec(&gpu->local_counters_active);
>>> +        ctx->counters_requested = false;
>>> +    }
>>> +
>>> +out_unlock:
>>> +    mutex_unlock(&gpu->lock);
>>> +
>>> +    return ret;
>>>   }
>>>     void __msm_file_private_destroy(struct kref *kref)
>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
>>> index 2342cb90857e..ae7fb355e4a1 100644
>>> --- a/include/uapi/drm/msm_drm.h
>>> +++ b/include/uapi/drm/msm_drm.h
>>> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
>>>   #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
>>>   #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
>>>   #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
>>> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-
>>> submit) perfcntr usage  */
>>>     /* For backwards compat.  The original support for preemption was
>>> based on
>>>    * a single ring per priority level so # of priority levels equals
>>> the #
>>
> 
> 
> Best regards,
Antonino Maniscalco Dec. 13, 2024, 5:10 p.m. UTC | #7
On 12/13/24 5:50 PM, Akhil P Oommen wrote:
> On 12/12/2024 9:44 PM, Antonino Maniscalco wrote:
>> On 12/12/24 4:58 PM, Akhil P Oommen wrote:
>>> On 12/5/2024 10:24 PM, Rob Clark wrote:
>>>> From: Rob Clark <robdclark@chromium.org>
>>>>
>>>> Performance counter usage falls into two categories:
>>>>
>>>> 1. Local usage, where the counter configuration, start, and end read
>>>>      happen within (locally to) a single SUBMIT.  In this case, there is
>>>>      no dependency on counter configuration or values between submits,
>>>> and
>>>>      in fact counters are normally cleared on context switches, making it
>>>>      impossible to rely on cross-submit state.
>>>>
>>>> 2. Global usage, where a single privilaged daemon/process is sampling
>>>>      counter values across all processes for profiling.
>>>>
>>>> The two categories are mutually exclusive.  While you can have many
>>>> processes making local counter usage, you cannot combine global and
>>>> local usage without the two stepping on each others feet (by changing
>>>> counter configuration).
>>>>
>>>> For global counter usage, there is already a SYSPROF param (since global
>>>> counter usage requires disabling counter clearing on context switch).
>>>> This patch adds a REQ_CNTRS param to request local counter usage.  If
>>>> one or more processes has requested counter usage, then a SYSPROF
>>>> request will fail with -EBUSY.  And if SYSPROF is active, then REQ_CNTRS
>>>> will fail with -EBUSY, maintaining the mutual exclusivity.
>>>>
>>>> This is purely an advisory interface to help coordinate userspace.
>>>> There is no real means of enforcement, but the worst that can happen if
>>>> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
>>>> profiling data.
>>>>
>>>> Signed-off-by: Rob Clark <robdclark@chromium.org>
>>>> ---
>>>> kgsl takes a different approach, which involves a lot more UABI for
>>>> assigning counters to different processes.  But I think by taking
>>>> advantage of the fact that mesa (freedreno+turnip) reconfigure the
>>>> counters they need in each SUBMIT, for their respective gl/vk perf-
>>>> counter extensions, we can take this simpler approach.
>>>
>>> KGSL's approach is preemption and ifpc safe (also whatever HW changes
>>> that will come up in future generations). How will we ensure that here?
>>>
>>> I have plans to bring up IFPC support in near future. Also, I brought up
>>> this point during preemption series. But from the responses, I felt that
>>> profiling was not considered a serious usecase. Still I wonder how the
>>> perfcounter extensions work accurately with preemption.
>>
>> So back then I implemented the postamble IB to clear perf counters and
>> that gets disabled when sysprof (so global usage) is happening. The
>> kernel is oblivious to "Local isage" of profiling but in that case
>> really what we want to do is disable preemption which in my
>> understanding can be done from userspace with a PKT. In my understanding
>> this had us covered for all usecases.
> 
> I think this wasn't mentioned at that time. Which UMD PKT in a6x+ did
> you mean?

Ah, I thought it wasmentioned, sorry.
The packet I was referring to is:
	<doc> Make next dword 1 to disable preemption, 0 to re-enable it. </doc>
	<value name="CP_PREEMPT_DISABLE" value="0x6c" variants="A6XX"/>

BTW you mentioned wanting to look into IFPC. Since I too wanted to look 
into implementing it wonder if you could let me know when you planned on 
working on it.

> 
> -Akhil.
> 
>>
>> So what would you expect instead we should do kernel side to make
>> profiling preemption safe?
>>
>>>
>>> -Akhil
>>>
>>>>
>>>>    drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
>>>>    drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
>>>>    drivers/gpu/drm/msm/msm_gpu.c           |  1 +
>>>>    drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
>>>>    drivers/gpu/drm/msm/msm_submitqueue.c   | 52 ++++++++++++++++++++++++-
>>>>    include/uapi/drm/msm_drm.h              |  1 +
>>>>    6 files changed, 85 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/
>>>> drm/msm/adreno/adreno_gpu.c
>>>> index 31bbf2c83de4..f688e37059b8 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct
>>>> msm_file_private *ctx,
>>>>            if (!capable(CAP_SYS_ADMIN))
>>>>                return UERR(EPERM, drm, "invalid permissions");
>>>>            return msm_file_private_set_sysprof(ctx, gpu, value);
>>>> +    case MSM_PARAM_REQ_CNTRS:
>>>> +        return msm_file_private_request_counters(ctx, gpu, value);
>>>>        default:
>>>>            return UERR(EINVAL, drm, "%s: invalid param: %u", gpu-
>>>>> name, param);
>>>>        }
>>>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/
>>>> msm_drv.c
>>>> index 6416d2cb4efc..bf8314ff4a25 100644
>>>> --- a/drivers/gpu/drm/msm/msm_drv.c
>>>> +++ b/drivers/gpu/drm/msm/msm_drv.c
>>>> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device
>>>> *dev, struct drm_file *file)
>>>>         * It is not possible to set sysprof param to non-zero if gpu
>>>>         * is not initialized:
>>>>         */
>>>> -    if (priv->gpu)
>>>> +    if (ctx->sysprof)
>>>>            msm_file_private_set_sysprof(ctx, priv->gpu, 0);
>>>>    +    if (ctx->counters_requested)
>>>> +        msm_file_private_request_counters(ctx, priv->gpu, 0);
>>>> +
>>>>        context_close(ctx);
>>>>    }
>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/
>>>> msm_gpu.c
>>>> index 82f204f3bb8f..013b59ca3bb1 100644
>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>>>> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct
>>>> platform_device *pdev,
>>>>        gpu->nr_rings = nr_rings;
>>>>          refcount_set(&gpu->sysprof_active, 1);
>>>> +    refcount_set(&gpu->local_counters_active, 1);
>>>>          return 0;
>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/
>>>> msm_gpu.h
>>>> index e25009150579..83c61e523b1b 100644
>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>>> @@ -195,12 +195,28 @@ struct msm_gpu {
>>>>        int nr_rings;
>>>>          /**
>>>> -     * sysprof_active:
>>>> +     * @sysprof_active:
>>>>         *
>>>> -     * The count of contexts that have enabled system profiling.
>>>> +     * The count of contexts that have enabled system profiling plus
>>>> one.
>>>> +     *
>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
>>>> keep
>>>> +     * the under/overflow checks that refcount_t provides, but allow
>>>> +     * multiple on/off transitions so we track the logical value
>>>> plus one.)
>>>>         */
>>>>        refcount_t sysprof_active;
>>>>    +    /**
>>>> +     * @local_counters_active:
>>>> +     *
>>>> +     * The count of contexts that have requested local (intra-submit)
>>>> +     * performance counter usage plus one.
>>>> +     *
>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
>>>> keep
>>>> +     * the under/overflow checks that refcount_t provides, but allow
>>>> +     * multiple on/off transitions so we track the logical value
>>>> plus one.)
>>>> +     */
>>>> +    refcount_t local_counters_active;
>>>> +
>>>>        /**
>>>>         * lock:
>>>>         *
>>>> @@ -383,6 +399,13 @@ struct msm_file_private {
>>>>         */
>>>>        int sysprof;
>>>>    +    /**
>>>> +     * @counters_requested:
>>>> +     *
>>>> +     * Has the context requested local perfcntr usage.
>>>> +     */
>>>> +    bool counters_requested;
>>>> +
>>>>        /**
>>>>         * comm: Overridden task comm, see MSM_PARAM_COMM
>>>>         *
>>>> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
>>>>      int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>                     struct msm_gpu *gpu, int sysprof);
>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>>> +                      struct msm_gpu *gpu, int reqcntrs);
>>>>    void __msm_file_private_destroy(struct kref *kref);
>>>>      static inline void msm_file_private_put(struct msm_file_private
>>>> *ctx)
>>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/
>>>> msm/msm_submitqueue.c
>>>> index 7fed1de63b5d..1e1e21e6f7ae 100644
>>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
>>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
>>>> @@ -10,6 +10,15 @@
>>>>    int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>                     struct msm_gpu *gpu, int sysprof)
>>>>    {
>>>> +    int ret = 0;
>>>> +
>>>> +    mutex_lock(&gpu->lock);
>>>> +
>>>> +    if (sysprof && (refcount_read(&gpu->local_counters_active) > 1)) {
>>>> +        ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
>>>> +        goto out_unlock;
>>>> +    }
>>>> +
>>>>        /*
>>>>         * Since pm_runtime and sysprof_active are both refcounts, we
>>>>         * call apply the new value first, and then unwind the previous
>>>> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct
>>>> msm_file_private *ctx,
>>>>          switch (sysprof) {
>>>>        default:
>>>> -        return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>>>> +        ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>>>> +        goto out_unlock;
>>>>        case 2:
>>>>            pm_runtime_get_sync(&gpu->pdev->dev);
>>>>            fallthrough;
>>>> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct
>>>> msm_file_private *ctx,
>>>>          ctx->sysprof = sysprof;
>>>>    -    return 0;
>>>> +out_unlock:
>>>> +    mutex_unlock(&gpu->lock);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>>> +                      struct msm_gpu *gpu, int reqctrs)
>>>> +{
>>>> +    int ret = 0;
>>>> +
>>>> +    mutex_lock(&gpu->lock);
>>>> +
>>>> +    if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
>>>> +        ret = UERR(EBUSY, gpu->dev, "System profiling active");
>>>> +        goto out_unlock;
>>>> +    }
>>>> +
>>>> +    if (reqctrs) {
>>>> +        if (ctx->counters_requested) {
>>>> +            ret = UERR(EINVAL, gpu->dev, "Already requested");
>>>> +            goto out_unlock;
>>>> +        }
>>>> +
>>>> +        ctx->counters_requested = true;
>>>> +        refcount_inc(&gpu->local_counters_active);
>>>> +    } else {
>>>> +        if (!ctx->counters_requested) {
>>>> +            ret = UERR(EINVAL, gpu->dev, "Not requested");
>>>> +            goto out_unlock;
>>>> +        }
>>>> +        refcount_dec(&gpu->local_counters_active);
>>>> +        ctx->counters_requested = false;
>>>> +    }
>>>> +
>>>> +out_unlock:
>>>> +    mutex_unlock(&gpu->lock);
>>>> +
>>>> +    return ret;
>>>>    }
>>>>      void __msm_file_private_destroy(struct kref *kref)
>>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
>>>> index 2342cb90857e..ae7fb355e4a1 100644
>>>> --- a/include/uapi/drm/msm_drm.h
>>>> +++ b/include/uapi/drm/msm_drm.h
>>>> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
>>>>    #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
>>>>    #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
>>>>    #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
>>>> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-
>>>> submit) perfcntr usage  */
>>>>      /* For backwards compat.  The original support for preemption was
>>>> based on
>>>>     * a single ring per priority level so # of priority levels equals
>>>> the #
>>>
>>
>>
>> Best regards,
> 


Best regards,
Rob Clark Dec. 13, 2024, 5:50 p.m. UTC | #8
On Fri, Dec 13, 2024 at 8:47 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 12/12/2024 10:42 PM, Rob Clark wrote:
> > On Thu, Dec 12, 2024 at 9:08 AM Rob Clark <robdclark@gmail.com> wrote:
> >>
> >> On Thu, Dec 12, 2024 at 7:59 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> >>>
> >>> On 12/5/2024 10:24 PM, Rob Clark wrote:
> >>>> From: Rob Clark <robdclark@chromium.org>
> >>>>
> >>>> Performance counter usage falls into two categories:
> >>>>
> >>>> 1. Local usage, where the counter configuration, start, and end read
> >>>>    happen within (locally to) a single SUBMIT.  In this case, there is
> >>>>    no dependency on counter configuration or values between submits, and
> >>>>    in fact counters are normally cleared on context switches, making it
> >>>>    impossible to rely on cross-submit state.
> >>>>
> >>>> 2. Global usage, where a single privilaged daemon/process is sampling
> >>>>    counter values across all processes for profiling.
> >>>>
> >>>> The two categories are mutually exclusive.  While you can have many
> >>>> processes making local counter usage, you cannot combine global and
> >>>> local usage without the two stepping on each others feet (by changing
> >>>> counter configuration).
>
> As such the HW doesn't have any limitation, unless you run out of
> counters in a group. We just need an arbitration between processes (UMD
> or KMD based).

True.. but is this actually needed?  Are there real-life use-cases?
Or is it just something that someone wrote down in a requirements
document because they could?

An app _can_ query the counters itself, although in isolation they
aren't super useful.  What a user/developer would actually find useful
are the derived counters that the global/system profiler provides.
The local counters are useful to get per-shader cycle counts, but I've
never found myself looking at that _and_ global/system profiler at the
same time.  Getting accurate local counter values involves inserting
extra WFI's breaking the GPU pipelining, making global profiling kinda
useless.

> Also, KGSL exposes an ioctl to directly read the counter with a fixed
> minimal latency. Because inline reads via submission may have huge
> latency spikes based on workload especially when compute shaders are
> involved. Isn't a low latency counter reads desirable in a fullfledged
> system profiler?

For system profiler, we read back the counters from the cpu[1].
Although we might need to revisit that for android.

[1] https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/perfcntrs/freedreno_dt.c?ref_type=heads#L223

> >>>>
> >>>> For global counter usage, there is already a SYSPROF param (since global
> >>>> counter usage requires disabling counter clearing on context switch).
> >>>> This patch adds a REQ_CNTRS param to request local counter usage.  If
> >>>> one or more processes has requested counter usage, then a SYSPROF
> >>>> request will fail with -EBUSY.  And if SYSPROF is active, then REQ_CNTRS
> >>>> will fail with -EBUSY, maintaining the mutual exclusivity.
> >>>>
> >>>> This is purely an advisory interface to help coordinate userspace.
> >>>> There is no real means of enforcement, but the worst that can happen if
> >>>> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
> >>>> profiling data.
> >>>>
> >>>> Signed-off-by: Rob Clark <robdclark@chromium.org>
> >>>> ---
> >>>> kgsl takes a different approach, which involves a lot more UABI for
> >>>> assigning counters to different processes.  But I think by taking
> >>>> advantage of the fact that mesa (freedreno+turnip) reconfigure the
> >>>> counters they need in each SUBMIT, for their respective gl/vk perf-
> >>>> counter extensions, we can take this simpler approach.
> >>>
> >>> KGSL's approach is preemption and ifpc safe (also whatever HW changes
> >>> that will come up in future generations). How will we ensure that here?
> >>>
> >>> I have plans to bring up IFPC support in near future. Also, I brought up
> >>> this point during preemption series. But from the responses, I felt that
> >>> profiling was not considered a serious usecase. Still I wonder how the
> >>> perfcounter extensions work accurately with preemption.
> >>
> >> Re: IFPC, I think initially we have to inhibit IFPC when SYSPROF is active
> >>
> >> Longer term, I think we want to just save and restore all of the SEL
> >> regs as well as the counters themselves on preemption.  AFAIU
> >> currently only the counters themselves are saved/restored.  But there
> >> is only one 32b SEL reg for each 64b counter, so I'm not sure that you
> >> save that many cycles by not just saving/restoring the SEL regs as
> >> well.  (And of course with REQ_CNTRS the kernel knows which processes
> >> need counter save/restore and which do not, so you are only taking the
> >> extra context switch overhead if a process is actually using the
> >> perfcntrs.)
> >
> > Actually I'm maybe blending two different, but similar cases.
> > PREAMBLE/POSTAMBLE, I think, cover us for preemption
> >
> > For IFPC, we'd need a way to tell GMU that SYSPROF is active, so it
> > could save/restore all the counters and selectors  (IFPC shouldn't
> > matter for local profiling / REQ_CNTRS case, since you wouldn't go
> > into IFPC mid-submit.)
> >
> > BR,
> > -R
> >
> >> Alternatively, I think we could just declare this as a userspace
> >> problem, and solve it with CP_SET_AMBLE PREAMBLE/POSTAMBLE?
> >>
> >> Just for background, rendernode UABI is exposed to all processes that
> >> can use the GPU, ie. basically everything.  Which makes it an
> >> attractive attack surface.  This is why I prefer minimalism when it
> >> comes to UABI, and not adding new ioctls and complexity in the kernel
> >> when it is not essential ;-)
>
> I fully agree with you about maintaining minimalism in KMD. Here all we
> need is a way for UMD to ask "give me a counter offset with 'x'
> countable from 'y' group". And let KMD do the arbitration of counters
> between userspace processes and also within KMD. And we can cut down on
> some of the related things present in kgsl which are unnecessary at the
> moment.

I'm not completely ruling it out.. just trying to figure out if we
actually need it.  Maybe android forces us to switch to something
other than devmem for reading counters?  That is the most plausible
reason I could think of to add UABI for this.

(OTOH we could alternatively just give privileged userspace a way to
mmap the mmio for percntrs via drm instead of devmem)

> More importantly, I am not sure if we should really fight hard against
> something that is basically an architectur spec. Future HW evolution
> happens based on this architecture. So is it really wise to build things
> in the opposite direction. FYI, all other GPU KMD drivers which Qcom
> uses are aligned on this.

Sure, but future hw is going to need future userspace.  It isn't
really problematic to introduce a new UABI for future hw, because
there is no "new kernel + old userspace" scenario.

BR,
-R

> -Akhil
>
> >>
> >> BR,
> >> -R
> >>
> >>> -Akhil
> >>>
> >>>>
> >>>>  drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
> >>>>  drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
> >>>>  drivers/gpu/drm/msm/msm_gpu.c           |  1 +
> >>>>  drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
> >>>>  drivers/gpu/drm/msm/msm_submitqueue.c   | 52 ++++++++++++++++++++++++-
> >>>>  include/uapi/drm/msm_drm.h              |  1 +
> >>>>  6 files changed, 85 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>> index 31bbf2c83de4..f688e37059b8 100644
> >>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
> >>>>               if (!capable(CAP_SYS_ADMIN))
> >>>>                       return UERR(EPERM, drm, "invalid permissions");
> >>>>               return msm_file_private_set_sysprof(ctx, gpu, value);
> >>>> +     case MSM_PARAM_REQ_CNTRS:
> >>>> +             return msm_file_private_request_counters(ctx, gpu, value);
> >>>>       default:
> >>>>               return UERR(EINVAL, drm, "%s: invalid param: %u", gpu->name, param);
> >>>>       }
> >>>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> >>>> index 6416d2cb4efc..bf8314ff4a25 100644
> >>>> --- a/drivers/gpu/drm/msm/msm_drv.c
> >>>> +++ b/drivers/gpu/drm/msm/msm_drv.c
> >>>> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device *dev, struct drm_file *file)
> >>>>        * It is not possible to set sysprof param to non-zero if gpu
> >>>>        * is not initialized:
> >>>>        */
> >>>> -     if (priv->gpu)
> >>>> +     if (ctx->sysprof)
> >>>>               msm_file_private_set_sysprof(ctx, priv->gpu, 0);
> >>>>
> >>>> +     if (ctx->counters_requested)
> >>>> +             msm_file_private_request_counters(ctx, priv->gpu, 0);
> >>>> +
> >>>>       context_close(ctx);
> >>>>  }
> >>>>
> >>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> >>>> index 82f204f3bb8f..013b59ca3bb1 100644
> >>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
> >>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> >>>> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
> >>>>       gpu->nr_rings = nr_rings;
> >>>>
> >>>>       refcount_set(&gpu->sysprof_active, 1);
> >>>> +     refcount_set(&gpu->local_counters_active, 1);
> >>>>
> >>>>       return 0;
> >>>>
> >>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> >>>> index e25009150579..83c61e523b1b 100644
> >>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
> >>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> >>>> @@ -195,12 +195,28 @@ struct msm_gpu {
> >>>>       int nr_rings;
> >>>>
> >>>>       /**
> >>>> -      * sysprof_active:
> >>>> +      * @sysprof_active:
> >>>>        *
> >>>> -      * The count of contexts that have enabled system profiling.
> >>>> +      * The count of contexts that have enabled system profiling plus one.
> >>>> +      *
> >>>> +      * Note: refcount_t does not like 0->1 transitions.. we want to keep
> >>>> +      * the under/overflow checks that refcount_t provides, but allow
> >>>> +      * multiple on/off transitions so we track the logical value plus one.)
> >>>>        */
> >>>>       refcount_t sysprof_active;
> >>>>
> >>>> +     /**
> >>>> +      * @local_counters_active:
> >>>> +      *
> >>>> +      * The count of contexts that have requested local (intra-submit)
> >>>> +      * performance counter usage plus one.
> >>>> +      *
> >>>> +      * Note: refcount_t does not like 0->1 transitions.. we want to keep
> >>>> +      * the under/overflow checks that refcount_t provides, but allow
> >>>> +      * multiple on/off transitions so we track the logical value plus one.)
> >>>> +      */
> >>>> +     refcount_t local_counters_active;
> >>>> +
> >>>>       /**
> >>>>        * lock:
> >>>>        *
> >>>> @@ -383,6 +399,13 @@ struct msm_file_private {
> >>>>        */
> >>>>       int sysprof;
> >>>>
> >>>> +     /**
> >>>> +      * @counters_requested:
> >>>> +      *
> >>>> +      * Has the context requested local perfcntr usage.
> >>>> +      */
> >>>> +     bool counters_requested;
> >>>> +
> >>>>       /**
> >>>>        * comm: Overridden task comm, see MSM_PARAM_COMM
> >>>>        *
> >>>> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
> >>>>
> >>>>  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> >>>>                                struct msm_gpu *gpu, int sysprof);
> >>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
> >>>> +                                   struct msm_gpu *gpu, int reqcntrs);
> >>>>  void __msm_file_private_destroy(struct kref *kref);
> >>>>
> >>>>  static inline void msm_file_private_put(struct msm_file_private *ctx)
> >>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
> >>>> index 7fed1de63b5d..1e1e21e6f7ae 100644
> >>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> >>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> >>>> @@ -10,6 +10,15 @@
> >>>>  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> >>>>                                struct msm_gpu *gpu, int sysprof)
> >>>>  {
> >>>> +     int ret = 0;
> >>>> +
> >>>> +     mutex_lock(&gpu->lock);
> >>>> +
> >>>> +     if (sysprof && (refcount_read(&gpu->local_counters_active) > 1)) {
> >>>> +             ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
> >>>> +             goto out_unlock;
> >>>> +     }
> >>>> +
> >>>>       /*
> >>>>        * Since pm_runtime and sysprof_active are both refcounts, we
> >>>>        * call apply the new value first, and then unwind the previous
> >>>> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> >>>>
> >>>>       switch (sysprof) {
> >>>>       default:
> >>>> -             return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
> >>>> +             ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
> >>>> +             goto out_unlock;
> >>>>       case 2:
> >>>>               pm_runtime_get_sync(&gpu->pdev->dev);
> >>>>               fallthrough;
> >>>> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> >>>>
> >>>>       ctx->sysprof = sysprof;
> >>>>
> >>>> -     return 0;
> >>>> +out_unlock:
> >>>> +     mutex_unlock(&gpu->lock);
> >>>> +
> >>>> +     return ret;
> >>>> +}
> >>>> +
> >>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
> >>>> +                                   struct msm_gpu *gpu, int reqctrs)
> >>>> +{
> >>>> +     int ret = 0;
> >>>> +
> >>>> +     mutex_lock(&gpu->lock);
> >>>> +
> >>>> +     if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
> >>>> +             ret = UERR(EBUSY, gpu->dev, "System profiling active");
> >>>> +             goto out_unlock;
> >>>> +     }
> >>>> +
> >>>> +     if (reqctrs) {
> >>>> +             if (ctx->counters_requested) {
> >>>> +                     ret = UERR(EINVAL, gpu->dev, "Already requested");
> >>>> +                     goto out_unlock;
> >>>> +             }
> >>>> +
> >>>> +             ctx->counters_requested = true;
> >>>> +             refcount_inc(&gpu->local_counters_active);
> >>>> +     } else {
> >>>> +             if (!ctx->counters_requested) {
> >>>> +                     ret = UERR(EINVAL, gpu->dev, "Not requested");
> >>>> +                     goto out_unlock;
> >>>> +             }
> >>>> +             refcount_dec(&gpu->local_counters_active);
> >>>> +             ctx->counters_requested = false;
> >>>> +     }
> >>>> +
> >>>> +out_unlock:
> >>>> +     mutex_unlock(&gpu->lock);
> >>>> +
> >>>> +     return ret;
> >>>>  }
> >>>>
> >>>>  void __msm_file_private_destroy(struct kref *kref)
> >>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
> >>>> index 2342cb90857e..ae7fb355e4a1 100644
> >>>> --- a/include/uapi/drm/msm_drm.h
> >>>> +++ b/include/uapi/drm/msm_drm.h
> >>>> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
> >>>>  #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
> >>>>  #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
> >>>>  #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
> >>>> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-submit) perfcntr usage  */
> >>>>
> >>>>  /* For backwards compat.  The original support for preemption was based on
> >>>>   * a single ring per priority level so # of priority levels equals the #
> >>>
>
Akhil P Oommen Dec. 16, 2024, 4:43 p.m. UTC | #9
On 12/13/2024 11:20 PM, Rob Clark wrote:
> On Fri, Dec 13, 2024 at 8:47 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>>
>> On 12/12/2024 10:42 PM, Rob Clark wrote:
>>> On Thu, Dec 12, 2024 at 9:08 AM Rob Clark <robdclark@gmail.com> wrote:
>>>>
>>>> On Thu, Dec 12, 2024 at 7:59 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>>>>>
>>>>> On 12/5/2024 10:24 PM, Rob Clark wrote:
>>>>>> From: Rob Clark <robdclark@chromium.org>
>>>>>>
>>>>>> Performance counter usage falls into two categories:
>>>>>>
>>>>>> 1. Local usage, where the counter configuration, start, and end read
>>>>>>    happen within (locally to) a single SUBMIT.  In this case, there is
>>>>>>    no dependency on counter configuration or values between submits, and
>>>>>>    in fact counters are normally cleared on context switches, making it
>>>>>>    impossible to rely on cross-submit state.
>>>>>>
>>>>>> 2. Global usage, where a single privilaged daemon/process is sampling
>>>>>>    counter values across all processes for profiling.
>>>>>>
>>>>>> The two categories are mutually exclusive.  While you can have many
>>>>>> processes making local counter usage, you cannot combine global and
>>>>>> local usage without the two stepping on each others feet (by changing
>>>>>> counter configuration).
>>
>> As such the HW doesn't have any limitation, unless you run out of
>> counters in a group. We just need an arbitration between processes (UMD
>> or KMD based).
> 
> True.. but is this actually needed?  Are there real-life use-cases?
> Or is it just something that someone wrote down in a requirements
> document because they could?

Not an expert on this, but system profiler may be used with an
application that use perfcounter extensions (for something like dynamic
workload adjustments??), right?

> 
> An app _can_ query the counters itself, although in isolation they
> aren't super useful.  What a user/developer would actually find useful
> are the derived counters that the global/system profiler provides.
> The local counters are useful to get per-shader cycle counts, but I've
> never found myself looking at that _and_ global/system profiler at the
> same time.  Getting accurate local counter values involves inserting
> extra WFI's breaking the GPU pipelining, making global profiling kinda
> useless.
> 
>> Also, KGSL exposes an ioctl to directly read the counter with a fixed
>> minimal latency. Because inline reads via submission may have huge
>> latency spikes based on workload especially when compute shaders are
>> involved. Isn't a low latency counter reads desirable in a fullfledged
>> system profiler?
> 
> For system profiler, we read back the counters from the cpu[1].
> Although we might need to revisit that for android.
> 
> [1] https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/perfcntrs/freedreno_dt.c?ref_type=heads#L223
> 

That won't work when IFPC is enabled. Disabling IFPC is a potential
short term option, but that is suboptimal as it alters the environment.
IFPC can occur between every frame.

>>>>>>
>>>>>> For global counter usage, there is already a SYSPROF param (since global
>>>>>> counter usage requires disabling counter clearing on context switch).
>>>>>> This patch adds a REQ_CNTRS param to request local counter usage.  If
>>>>>> one or more processes has requested counter usage, then a SYSPROF
>>>>>> request will fail with -EBUSY.  And if SYSPROF is active, then REQ_CNTRS
>>>>>> will fail with -EBUSY, maintaining the mutual exclusivity.
>>>>>>
>>>>>> This is purely an advisory interface to help coordinate userspace.
>>>>>> There is no real means of enforcement, but the worst that can happen if
>>>>>> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
>>>>>> profiling data.
>>>>>>
>>>>>> Signed-off-by: Rob Clark <robdclark@chromium.org>
>>>>>> ---
>>>>>> kgsl takes a different approach, which involves a lot more UABI for
>>>>>> assigning counters to different processes.  But I think by taking
>>>>>> advantage of the fact that mesa (freedreno+turnip) reconfigure the
>>>>>> counters they need in each SUBMIT, for their respective gl/vk perf-
>>>>>> counter extensions, we can take this simpler approach.
>>>>>
>>>>> KGSL's approach is preemption and ifpc safe (also whatever HW changes
>>>>> that will come up in future generations). How will we ensure that here?
>>>>>
>>>>> I have plans to bring up IFPC support in near future. Also, I brought up
>>>>> this point during preemption series. But from the responses, I felt that
>>>>> profiling was not considered a serious usecase. Still I wonder how the
>>>>> perfcounter extensions work accurately with preemption.
>>>>
>>>> Re: IFPC, I think initially we have to inhibit IFPC when SYSPROF is active
>>>>
>>>> Longer term, I think we want to just save and restore all of the SEL
>>>> regs as well as the counters themselves on preemption.  AFAIU
>>>> currently only the counters themselves are saved/restored.  But there
>>>> is only one 32b SEL reg for each 64b counter, so I'm not sure that you
>>>> save that many cycles by not just saving/restoring the SEL regs as
>>>> well.  (And of course with REQ_CNTRS the kernel knows which processes
>>>> need counter save/restore and which do not, so you are only taking the
>>>> extra context switch overhead if a process is actually using the
>>>> perfcntrs.)
>>>
>>> Actually I'm maybe blending two different, but similar cases.
>>> PREAMBLE/POSTAMBLE, I think, cover us for preemption
>>>
>>> For IFPC, we'd need a way to tell GMU that SYSPROF is active, so it
>>> could save/restore all the counters and selectors  (IFPC shouldn't
>>> matter for local profiling / REQ_CNTRS case, since you wouldn't go
>>> into IFPC mid-submit.)
>>>
>>> BR,
>>> -R
>>>
>>>> Alternatively, I think we could just declare this as a userspace
>>>> problem, and solve it with CP_SET_AMBLE PREAMBLE/POSTAMBLE?
>>>>
>>>> Just for background, rendernode UABI is exposed to all processes that
>>>> can use the GPU, ie. basically everything.  Which makes it an
>>>> attractive attack surface.  This is why I prefer minimalism when it
>>>> comes to UABI, and not adding new ioctls and complexity in the kernel
>>>> when it is not essential ;-)
>>
>> I fully agree with you about maintaining minimalism in KMD. Here all we
>> need is a way for UMD to ask "give me a counter offset with 'x'
>> countable from 'y' group". And let KMD do the arbitration of counters
>> between userspace processes and also within KMD. And we can cut down on
>> some of the related things present in kgsl which are unnecessary at the
>> moment.
> 
> I'm not completely ruling it out.. just trying to figure out if we
> actually need it.  Maybe android forces us to switch to something
> other than devmem for reading counters?  That is the most plausible
> reason I could think of to add UABI for this.
> 
> (OTOH we could alternatively just give privileged userspace a way to
> mmap the mmio for percntrs via drm instead of devmem)

This is non-starter with IFPC currently, unless we disable IFPC which is
suboptimal IMO.

> 
>> More importantly, I am not sure if we should really fight hard against
>> something that is basically an architectur spec. Future HW evolution
>> happens based on this architecture. So is it really wise to build things
>> in the opposite direction. FYI, all other GPU KMD drivers which Qcom
>> uses are aligned on this.
> 
> Sure, but future hw is going to need future userspace.  It isn't
> really problematic to introduce a new UABI for future hw, because
> there is no "new kernel + old userspace" scenario.
> 

With IoT projects, a6x/a7x gpu needs to be supported for another decade
or so. A full-fledged profiling support (like an open source equivalent
of Snapdragon Profiler) may become an important product requirement in
future, depending on the customer. At that time, it is convenient to
have a low friction path to implement that. Otherwise, upstream drivers
will be always treated as second class internally here and that worries
me, tbh.

If you are inclined to this, I can help with the KMD side changes based
on what I proposed. I see that V3D/V4C exposes something similar.

-Akhil

> BR,
> -R
> 
>> -Akhil
>>
>>>>
>>>> BR,
>>>> -R
>>>>
>>>>> -Akhil
>>>>>
>>>>>>
>>>>>>  drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
>>>>>>  drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
>>>>>>  drivers/gpu/drm/msm/msm_gpu.c           |  1 +
>>>>>>  drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
>>>>>>  drivers/gpu/drm/msm/msm_submitqueue.c   | 52 ++++++++++++++++++++++++-
>>>>>>  include/uapi/drm/msm_drm.h              |  1 +
>>>>>>  6 files changed, 85 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>> index 31bbf2c83de4..f688e37059b8 100644
>>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
>>>>>>               if (!capable(CAP_SYS_ADMIN))
>>>>>>                       return UERR(EPERM, drm, "invalid permissions");
>>>>>>               return msm_file_private_set_sysprof(ctx, gpu, value);
>>>>>> +     case MSM_PARAM_REQ_CNTRS:
>>>>>> +             return msm_file_private_request_counters(ctx, gpu, value);
>>>>>>       default:
>>>>>>               return UERR(EINVAL, drm, "%s: invalid param: %u", gpu->name, param);
>>>>>>       }
>>>>>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
>>>>>> index 6416d2cb4efc..bf8314ff4a25 100644
>>>>>> --- a/drivers/gpu/drm/msm/msm_drv.c
>>>>>> +++ b/drivers/gpu/drm/msm/msm_drv.c
>>>>>> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device *dev, struct drm_file *file)
>>>>>>        * It is not possible to set sysprof param to non-zero if gpu
>>>>>>        * is not initialized:
>>>>>>        */
>>>>>> -     if (priv->gpu)
>>>>>> +     if (ctx->sysprof)
>>>>>>               msm_file_private_set_sysprof(ctx, priv->gpu, 0);
>>>>>>
>>>>>> +     if (ctx->counters_requested)
>>>>>> +             msm_file_private_request_counters(ctx, priv->gpu, 0);
>>>>>> +
>>>>>>       context_close(ctx);
>>>>>>  }
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
>>>>>> index 82f204f3bb8f..013b59ca3bb1 100644
>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>>>>>> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>>>>>>       gpu->nr_rings = nr_rings;
>>>>>>
>>>>>>       refcount_set(&gpu->sysprof_active, 1);
>>>>>> +     refcount_set(&gpu->local_counters_active, 1);
>>>>>>
>>>>>>       return 0;
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>>>>>> index e25009150579..83c61e523b1b 100644
>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>>>>> @@ -195,12 +195,28 @@ struct msm_gpu {
>>>>>>       int nr_rings;
>>>>>>
>>>>>>       /**
>>>>>> -      * sysprof_active:
>>>>>> +      * @sysprof_active:
>>>>>>        *
>>>>>> -      * The count of contexts that have enabled system profiling.
>>>>>> +      * The count of contexts that have enabled system profiling plus one.
>>>>>> +      *
>>>>>> +      * Note: refcount_t does not like 0->1 transitions.. we want to keep
>>>>>> +      * the under/overflow checks that refcount_t provides, but allow
>>>>>> +      * multiple on/off transitions so we track the logical value plus one.)
>>>>>>        */
>>>>>>       refcount_t sysprof_active;
>>>>>>
>>>>>> +     /**
>>>>>> +      * @local_counters_active:
>>>>>> +      *
>>>>>> +      * The count of contexts that have requested local (intra-submit)
>>>>>> +      * performance counter usage plus one.
>>>>>> +      *
>>>>>> +      * Note: refcount_t does not like 0->1 transitions.. we want to keep
>>>>>> +      * the under/overflow checks that refcount_t provides, but allow
>>>>>> +      * multiple on/off transitions so we track the logical value plus one.)
>>>>>> +      */
>>>>>> +     refcount_t local_counters_active;
>>>>>> +
>>>>>>       /**
>>>>>>        * lock:
>>>>>>        *
>>>>>> @@ -383,6 +399,13 @@ struct msm_file_private {
>>>>>>        */
>>>>>>       int sysprof;
>>>>>>
>>>>>> +     /**
>>>>>> +      * @counters_requested:
>>>>>> +      *
>>>>>> +      * Has the context requested local perfcntr usage.
>>>>>> +      */
>>>>>> +     bool counters_requested;
>>>>>> +
>>>>>>       /**
>>>>>>        * comm: Overridden task comm, see MSM_PARAM_COMM
>>>>>>        *
>>>>>> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
>>>>>>
>>>>>>  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>>>                                struct msm_gpu *gpu, int sysprof);
>>>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>>>>> +                                   struct msm_gpu *gpu, int reqcntrs);
>>>>>>  void __msm_file_private_destroy(struct kref *kref);
>>>>>>
>>>>>>  static inline void msm_file_private_put(struct msm_file_private *ctx)
>>>>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
>>>>>> index 7fed1de63b5d..1e1e21e6f7ae 100644
>>>>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
>>>>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
>>>>>> @@ -10,6 +10,15 @@
>>>>>>  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>>>                                struct msm_gpu *gpu, int sysprof)
>>>>>>  {
>>>>>> +     int ret = 0;
>>>>>> +
>>>>>> +     mutex_lock(&gpu->lock);
>>>>>> +
>>>>>> +     if (sysprof && (refcount_read(&gpu->local_counters_active) > 1)) {
>>>>>> +             ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
>>>>>> +             goto out_unlock;
>>>>>> +     }
>>>>>> +
>>>>>>       /*
>>>>>>        * Since pm_runtime and sysprof_active are both refcounts, we
>>>>>>        * call apply the new value first, and then unwind the previous
>>>>>> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>>>
>>>>>>       switch (sysprof) {
>>>>>>       default:
>>>>>> -             return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>>>>>> +             ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>>>>>> +             goto out_unlock;
>>>>>>       case 2:
>>>>>>               pm_runtime_get_sync(&gpu->pdev->dev);
>>>>>>               fallthrough;
>>>>>> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>>>
>>>>>>       ctx->sysprof = sysprof;
>>>>>>
>>>>>> -     return 0;
>>>>>> +out_unlock:
>>>>>> +     mutex_unlock(&gpu->lock);
>>>>>> +
>>>>>> +     return ret;
>>>>>> +}
>>>>>> +
>>>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>>>>> +                                   struct msm_gpu *gpu, int reqctrs)
>>>>>> +{
>>>>>> +     int ret = 0;
>>>>>> +
>>>>>> +     mutex_lock(&gpu->lock);
>>>>>> +
>>>>>> +     if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
>>>>>> +             ret = UERR(EBUSY, gpu->dev, "System profiling active");
>>>>>> +             goto out_unlock;
>>>>>> +     }
>>>>>> +
>>>>>> +     if (reqctrs) {
>>>>>> +             if (ctx->counters_requested) {
>>>>>> +                     ret = UERR(EINVAL, gpu->dev, "Already requested");
>>>>>> +                     goto out_unlock;
>>>>>> +             }
>>>>>> +
>>>>>> +             ctx->counters_requested = true;
>>>>>> +             refcount_inc(&gpu->local_counters_active);
>>>>>> +     } else {
>>>>>> +             if (!ctx->counters_requested) {
>>>>>> +                     ret = UERR(EINVAL, gpu->dev, "Not requested");
>>>>>> +                     goto out_unlock;
>>>>>> +             }
>>>>>> +             refcount_dec(&gpu->local_counters_active);
>>>>>> +             ctx->counters_requested = false;
>>>>>> +     }
>>>>>> +
>>>>>> +out_unlock:
>>>>>> +     mutex_unlock(&gpu->lock);
>>>>>> +
>>>>>> +     return ret;
>>>>>>  }
>>>>>>
>>>>>>  void __msm_file_private_destroy(struct kref *kref)
>>>>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
>>>>>> index 2342cb90857e..ae7fb355e4a1 100644
>>>>>> --- a/include/uapi/drm/msm_drm.h
>>>>>> +++ b/include/uapi/drm/msm_drm.h
>>>>>> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
>>>>>>  #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
>>>>>>  #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
>>>>>>  #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
>>>>>> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-submit) perfcntr usage  */
>>>>>>
>>>>>>  /* For backwards compat.  The original support for preemption was based on
>>>>>>   * a single ring per priority level so # of priority levels equals the #
>>>>>
>>
Connor Abbott Dec. 16, 2024, 4:58 p.m. UTC | #10
On Mon, Dec 16, 2024 at 11:55 AM Akhil P Oommen
<quic_akhilpo@quicinc.com> wrote:
>
> On 12/13/2024 10:40 PM, Antonino Maniscalco wrote:
> > On 12/13/24 5:50 PM, Akhil P Oommen wrote:
> >> On 12/12/2024 9:44 PM, Antonino Maniscalco wrote:
> >>> On 12/12/24 4:58 PM, Akhil P Oommen wrote:
> >>>> On 12/5/2024 10:24 PM, Rob Clark wrote:
> >>>>> From: Rob Clark <robdclark@chromium.org>
> >>>>>
> >>>>> Performance counter usage falls into two categories:
> >>>>>
> >>>>> 1. Local usage, where the counter configuration, start, and end read
> >>>>>      happen within (locally to) a single SUBMIT.  In this case,
> >>>>> there is
> >>>>>      no dependency on counter configuration or values between submits,
> >>>>> and
> >>>>>      in fact counters are normally cleared on context switches,
> >>>>> making it
> >>>>>      impossible to rely on cross-submit state.
> >>>>>
> >>>>> 2. Global usage, where a single privilaged daemon/process is sampling
> >>>>>      counter values across all processes for profiling.
> >>>>>
> >>>>> The two categories are mutually exclusive.  While you can have many
> >>>>> processes making local counter usage, you cannot combine global and
> >>>>> local usage without the two stepping on each others feet (by changing
> >>>>> counter configuration).
> >>>>>
> >>>>> For global counter usage, there is already a SYSPROF param (since
> >>>>> global
> >>>>> counter usage requires disabling counter clearing on context switch).
> >>>>> This patch adds a REQ_CNTRS param to request local counter usage.  If
> >>>>> one or more processes has requested counter usage, then a SYSPROF
> >>>>> request will fail with -EBUSY.  And if SYSPROF is active, then
> >>>>> REQ_CNTRS
> >>>>> will fail with -EBUSY, maintaining the mutual exclusivity.
> >>>>>
> >>>>> This is purely an advisory interface to help coordinate userspace.
> >>>>> There is no real means of enforcement, but the worst that can
> >>>>> happen if
> >>>>> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
> >>>>> profiling data.
> >>>>>
> >>>>> Signed-off-by: Rob Clark <robdclark@chromium.org>
> >>>>> ---
> >>>>> kgsl takes a different approach, which involves a lot more UABI for
> >>>>> assigning counters to different processes.  But I think by taking
> >>>>> advantage of the fact that mesa (freedreno+turnip) reconfigure the
> >>>>> counters they need in each SUBMIT, for their respective gl/vk perf-
> >>>>> counter extensions, we can take this simpler approach.
> >>>>
> >>>> KGSL's approach is preemption and ifpc safe (also whatever HW changes
> >>>> that will come up in future generations). How will we ensure that here?
> >>>>
> >>>> I have plans to bring up IFPC support in near future. Also, I
> >>>> brought up
> >>>> this point during preemption series. But from the responses, I felt
> >>>> that
> >>>> profiling was not considered a serious usecase. Still I wonder how the
> >>>> perfcounter extensions work accurately with preemption.
> >>>
> >>> So back then I implemented the postamble IB to clear perf counters and
> >>> that gets disabled when sysprof (so global usage) is happening. The
> >>> kernel is oblivious to "Local isage" of profiling but in that case
> >>> really what we want to do is disable preemption which in my
> >>> understanding can be done from userspace with a PKT. In my understanding
> >>> this had us covered for all usecases.
> >>
> >> I think this wasn't mentioned at that time. Which UMD PKT in a6x+ did
> >> you mean?
> >
> > Ah, I thought it wasmentioned, sorry.
> > The packet I was referring to is:
> >     <doc> Make next dword 1 to disable preemption, 0 to re-enable it. </
> > doc>
> >     <value name="CP_PREEMPT_DISABLE" value="0x6c" variants="A6XX"/>
>
> Ah! Okay. I think this packet is not used by the downstream blob. IMO,
> disabling preemption is still a suboptimal solution.

Downstream doesn't expose userspace perfcounters (i.e.
VK_KHR_performance_query) at all. My understanding is that Android
requires you not to expose them for security reasons, but I could be
wrong there. In any case, Qualcomm clearly hasn't really thought
through what it would take to make everything work well with userspace
perfcounters and hasn't implemented the necessary firmware bits for
that, so we're left muddling through and doing what we can.

Connor

>
> >
> > BTW you mentioned wanting to look into IFPC. Since I too wanted to look
> > into implementing it wonder if you could let me know when you planned on
> > working on it.
>
> I have few patches in progress. Nothing final yet and need verification
> on the hw side. Also, I need to do some housekeeping here to debug gmu
> issues since IFPC increases the probability of those a lot.
>
> I will try to send out the patches very soon.
>
> -Akhil.
>
> >
> >>
> >> -Akhil.
> >>
> >>>
> >>> So what would you expect instead we should do kernel side to make
> >>> profiling preemption safe?
> >>>
> >>>>
> >>>> -Akhil
> >>>>
> >>>>>
> >>>>>    drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
> >>>>>    drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
> >>>>>    drivers/gpu/drm/msm/msm_gpu.c           |  1 +
> >>>>>    drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
> >>>>>    drivers/gpu/drm/msm/msm_submitqueue.c   | 52 +++++++++++++++++++
> >>>>> +++++-
> >>>>>    include/uapi/drm/msm_drm.h              |  1 +
> >>>>>    6 files changed, 85 insertions(+), 5 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/
> >>>>> drm/msm/adreno/adreno_gpu.c
> >>>>> index 31bbf2c83de4..f688e37059b8 100644
> >>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>>> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct
> >>>>> msm_file_private *ctx,
> >>>>>            if (!capable(CAP_SYS_ADMIN))
> >>>>>                return UERR(EPERM, drm, "invalid permissions");
> >>>>>            return msm_file_private_set_sysprof(ctx, gpu, value);
> >>>>> +    case MSM_PARAM_REQ_CNTRS:
> >>>>> +        return msm_file_private_request_counters(ctx, gpu, value);
> >>>>>        default:
> >>>>>            return UERR(EINVAL, drm, "%s: invalid param: %u", gpu-
> >>>>>> name, param);
> >>>>>        }
> >>>>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/
> >>>>> msm_drv.c
> >>>>> index 6416d2cb4efc..bf8314ff4a25 100644
> >>>>> --- a/drivers/gpu/drm/msm/msm_drv.c
> >>>>> +++ b/drivers/gpu/drm/msm/msm_drv.c
> >>>>> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device
> >>>>> *dev, struct drm_file *file)
> >>>>>         * It is not possible to set sysprof param to non-zero if gpu
> >>>>>         * is not initialized:
> >>>>>         */
> >>>>> -    if (priv->gpu)
> >>>>> +    if (ctx->sysprof)
> >>>>>            msm_file_private_set_sysprof(ctx, priv->gpu, 0);
> >>>>>    +    if (ctx->counters_requested)
> >>>>> +        msm_file_private_request_counters(ctx, priv->gpu, 0);
> >>>>> +
> >>>>>        context_close(ctx);
> >>>>>    }
> >>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/
> >>>>> msm_gpu.c
> >>>>> index 82f204f3bb8f..013b59ca3bb1 100644
> >>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
> >>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> >>>>> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct
> >>>>> platform_device *pdev,
> >>>>>        gpu->nr_rings = nr_rings;
> >>>>>          refcount_set(&gpu->sysprof_active, 1);
> >>>>> +    refcount_set(&gpu->local_counters_active, 1);
> >>>>>          return 0;
> >>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/
> >>>>> msm_gpu.h
> >>>>> index e25009150579..83c61e523b1b 100644
> >>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
> >>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> >>>>> @@ -195,12 +195,28 @@ struct msm_gpu {
> >>>>>        int nr_rings;
> >>>>>          /**
> >>>>> -     * sysprof_active:
> >>>>> +     * @sysprof_active:
> >>>>>         *
> >>>>> -     * The count of contexts that have enabled system profiling.
> >>>>> +     * The count of contexts that have enabled system profiling plus
> >>>>> one.
> >>>>> +     *
> >>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
> >>>>> keep
> >>>>> +     * the under/overflow checks that refcount_t provides, but allow
> >>>>> +     * multiple on/off transitions so we track the logical value
> >>>>> plus one.)
> >>>>>         */
> >>>>>        refcount_t sysprof_active;
> >>>>>    +    /**
> >>>>> +     * @local_counters_active:
> >>>>> +     *
> >>>>> +     * The count of contexts that have requested local (intra-submit)
> >>>>> +     * performance counter usage plus one.
> >>>>> +     *
> >>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
> >>>>> keep
> >>>>> +     * the under/overflow checks that refcount_t provides, but allow
> >>>>> +     * multiple on/off transitions so we track the logical value
> >>>>> plus one.)
> >>>>> +     */
> >>>>> +    refcount_t local_counters_active;
> >>>>> +
> >>>>>        /**
> >>>>>         * lock:
> >>>>>         *
> >>>>> @@ -383,6 +399,13 @@ struct msm_file_private {
> >>>>>         */
> >>>>>        int sysprof;
> >>>>>    +    /**
> >>>>> +     * @counters_requested:
> >>>>> +     *
> >>>>> +     * Has the context requested local perfcntr usage.
> >>>>> +     */
> >>>>> +    bool counters_requested;
> >>>>> +
> >>>>>        /**
> >>>>>         * comm: Overridden task comm, see MSM_PARAM_COMM
> >>>>>         *
> >>>>> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
> >>>>>      int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> >>>>>                     struct msm_gpu *gpu, int sysprof);
> >>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
> >>>>> +                      struct msm_gpu *gpu, int reqcntrs);
> >>>>>    void __msm_file_private_destroy(struct kref *kref);
> >>>>>      static inline void msm_file_private_put(struct msm_file_private
> >>>>> *ctx)
> >>>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/
> >>>>> msm/msm_submitqueue.c
> >>>>> index 7fed1de63b5d..1e1e21e6f7ae 100644
> >>>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> >>>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> >>>>> @@ -10,6 +10,15 @@
> >>>>>    int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> >>>>>                     struct msm_gpu *gpu, int sysprof)
> >>>>>    {
> >>>>> +    int ret = 0;
> >>>>> +
> >>>>> +    mutex_lock(&gpu->lock);
> >>>>> +
> >>>>> +    if (sysprof && (refcount_read(&gpu->local_counters_active) >
> >>>>> 1)) {
> >>>>> +        ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
> >>>>> +        goto out_unlock;
> >>>>> +    }
> >>>>> +
> >>>>>        /*
> >>>>>         * Since pm_runtime and sysprof_active are both refcounts, we
> >>>>>         * call apply the new value first, and then unwind the previous
> >>>>> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct
> >>>>> msm_file_private *ctx,
> >>>>>          switch (sysprof) {
> >>>>>        default:
> >>>>> -        return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d",
> >>>>> sysprof);
> >>>>> +        ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
> >>>>> +        goto out_unlock;
> >>>>>        case 2:
> >>>>>            pm_runtime_get_sync(&gpu->pdev->dev);
> >>>>>            fallthrough;
> >>>>> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct
> >>>>> msm_file_private *ctx,
> >>>>>          ctx->sysprof = sysprof;
> >>>>>    -    return 0;
> >>>>> +out_unlock:
> >>>>> +    mutex_unlock(&gpu->lock);
> >>>>> +
> >>>>> +    return ret;
> >>>>> +}
> >>>>> +
> >>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
> >>>>> +                      struct msm_gpu *gpu, int reqctrs)
> >>>>> +{
> >>>>> +    int ret = 0;
> >>>>> +
> >>>>> +    mutex_lock(&gpu->lock);
> >>>>> +
> >>>>> +    if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
> >>>>> +        ret = UERR(EBUSY, gpu->dev, "System profiling active");
> >>>>> +        goto out_unlock;
> >>>>> +    }
> >>>>> +
> >>>>> +    if (reqctrs) {
> >>>>> +        if (ctx->counters_requested) {
> >>>>> +            ret = UERR(EINVAL, gpu->dev, "Already requested");
> >>>>> +            goto out_unlock;
> >>>>> +        }
> >>>>> +
> >>>>> +        ctx->counters_requested = true;
> >>>>> +        refcount_inc(&gpu->local_counters_active);
> >>>>> +    } else {
> >>>>> +        if (!ctx->counters_requested) {
> >>>>> +            ret = UERR(EINVAL, gpu->dev, "Not requested");
> >>>>> +            goto out_unlock;
> >>>>> +        }
> >>>>> +        refcount_dec(&gpu->local_counters_active);
> >>>>> +        ctx->counters_requested = false;
> >>>>> +    }
> >>>>> +
> >>>>> +out_unlock:
> >>>>> +    mutex_unlock(&gpu->lock);
> >>>>> +
> >>>>> +    return ret;
> >>>>>    }
> >>>>>      void __msm_file_private_destroy(struct kref *kref)
> >>>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
> >>>>> index 2342cb90857e..ae7fb355e4a1 100644
> >>>>> --- a/include/uapi/drm/msm_drm.h
> >>>>> +++ b/include/uapi/drm/msm_drm.h
> >>>>> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
> >>>>>    #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
> >>>>>    #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
> >>>>>    #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
> >>>>> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-
> >>>>> submit) perfcntr usage  */
> >>>>>      /* For backwards compat.  The original support for preemption was
> >>>>> based on
> >>>>>     * a single ring per priority level so # of priority levels equals
> >>>>> the #
> >>>>
> >>>
> >>>
> >>> Best regards,
> >>
> >
> >
> > Best regards,
>
Rob Clark Dec. 16, 2024, 5:16 p.m. UTC | #11
On Mon, Dec 16, 2024 at 8:59 AM Connor Abbott <cwabbott0@gmail.com> wrote:
>
> On Mon, Dec 16, 2024 at 11:55 AM Akhil P Oommen
> <quic_akhilpo@quicinc.com> wrote:
> >
> > On 12/13/2024 10:40 PM, Antonino Maniscalco wrote:
> > > On 12/13/24 5:50 PM, Akhil P Oommen wrote:
> > >> On 12/12/2024 9:44 PM, Antonino Maniscalco wrote:
> > >>> On 12/12/24 4:58 PM, Akhil P Oommen wrote:
> > >>>> On 12/5/2024 10:24 PM, Rob Clark wrote:
> > >>>>> From: Rob Clark <robdclark@chromium.org>
> > >>>>>
> > >>>>> Performance counter usage falls into two categories:
> > >>>>>
> > >>>>> 1. Local usage, where the counter configuration, start, and end read
> > >>>>>      happen within (locally to) a single SUBMIT.  In this case,
> > >>>>> there is
> > >>>>>      no dependency on counter configuration or values between submits,
> > >>>>> and
> > >>>>>      in fact counters are normally cleared on context switches,
> > >>>>> making it
> > >>>>>      impossible to rely on cross-submit state.
> > >>>>>
> > >>>>> 2. Global usage, where a single privilaged daemon/process is sampling
> > >>>>>      counter values across all processes for profiling.
> > >>>>>
> > >>>>> The two categories are mutually exclusive.  While you can have many
> > >>>>> processes making local counter usage, you cannot combine global and
> > >>>>> local usage without the two stepping on each others feet (by changing
> > >>>>> counter configuration).
> > >>>>>
> > >>>>> For global counter usage, there is already a SYSPROF param (since
> > >>>>> global
> > >>>>> counter usage requires disabling counter clearing on context switch).
> > >>>>> This patch adds a REQ_CNTRS param to request local counter usage.  If
> > >>>>> one or more processes has requested counter usage, then a SYSPROF
> > >>>>> request will fail with -EBUSY.  And if SYSPROF is active, then
> > >>>>> REQ_CNTRS
> > >>>>> will fail with -EBUSY, maintaining the mutual exclusivity.
> > >>>>>
> > >>>>> This is purely an advisory interface to help coordinate userspace.
> > >>>>> There is no real means of enforcement, but the worst that can
> > >>>>> happen if
> > >>>>> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
> > >>>>> profiling data.
> > >>>>>
> > >>>>> Signed-off-by: Rob Clark <robdclark@chromium.org>
> > >>>>> ---
> > >>>>> kgsl takes a different approach, which involves a lot more UABI for
> > >>>>> assigning counters to different processes.  But I think by taking
> > >>>>> advantage of the fact that mesa (freedreno+turnip) reconfigure the
> > >>>>> counters they need in each SUBMIT, for their respective gl/vk perf-
> > >>>>> counter extensions, we can take this simpler approach.
> > >>>>
> > >>>> KGSL's approach is preemption and ifpc safe (also whatever HW changes
> > >>>> that will come up in future generations). How will we ensure that here?
> > >>>>
> > >>>> I have plans to bring up IFPC support in near future. Also, I
> > >>>> brought up
> > >>>> this point during preemption series. But from the responses, I felt
> > >>>> that
> > >>>> profiling was not considered a serious usecase. Still I wonder how the
> > >>>> perfcounter extensions work accurately with preemption.
> > >>>
> > >>> So back then I implemented the postamble IB to clear perf counters and
> > >>> that gets disabled when sysprof (so global usage) is happening. The
> > >>> kernel is oblivious to "Local isage" of profiling but in that case
> > >>> really what we want to do is disable preemption which in my
> > >>> understanding can be done from userspace with a PKT. In my understanding
> > >>> this had us covered for all usecases.
> > >>
> > >> I think this wasn't mentioned at that time. Which UMD PKT in a6x+ did
> > >> you mean?
> > >
> > > Ah, I thought it wasmentioned, sorry.
> > > The packet I was referring to is:
> > >     <doc> Make next dword 1 to disable preemption, 0 to re-enable it. </
> > > doc>
> > >     <value name="CP_PREEMPT_DISABLE" value="0x6c" variants="A6XX"/>
> >
> > Ah! Okay. I think this packet is not used by the downstream blob. IMO,
> > disabling preemption is still a suboptimal solution.
>
> Downstream doesn't expose userspace perfcounters (i.e.
> VK_KHR_performance_query) at all. My understanding is that Android
> requires you not to expose them for security reasons, but I could be
> wrong there. In any case, Qualcomm clearly hasn't really thought
> through what it would take to make everything work well with userspace
> perfcounters and hasn't implemented the necessary firmware bits for
> that, so we're left muddling through and doing what we can.

That is correct, VK_KHR_performance_query is disallowed on android.
There is an android CTS test for that.

BR,
-R

>
> Connor
>
> >
> > >
> > > BTW you mentioned wanting to look into IFPC. Since I too wanted to look
> > > into implementing it wonder if you could let me know when you planned on
> > > working on it.
> >
> > I have few patches in progress. Nothing final yet and need verification
> > on the hw side. Also, I need to do some housekeeping here to debug gmu
> > issues since IFPC increases the probability of those a lot.
> >
> > I will try to send out the patches very soon.
> >
> > -Akhil.
> >
> > >
> > >>
> > >> -Akhil.
> > >>
> > >>>
> > >>> So what would you expect instead we should do kernel side to make
> > >>> profiling preemption safe?
> > >>>
> > >>>>
> > >>>> -Akhil
> > >>>>
> > >>>>>
> > >>>>>    drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
> > >>>>>    drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
> > >>>>>    drivers/gpu/drm/msm/msm_gpu.c           |  1 +
> > >>>>>    drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
> > >>>>>    drivers/gpu/drm/msm/msm_submitqueue.c   | 52 +++++++++++++++++++
> > >>>>> +++++-
> > >>>>>    include/uapi/drm/msm_drm.h              |  1 +
> > >>>>>    6 files changed, 85 insertions(+), 5 deletions(-)
> > >>>>>
> > >>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/
> > >>>>> drm/msm/adreno/adreno_gpu.c
> > >>>>> index 31bbf2c83de4..f688e37059b8 100644
> > >>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > >>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > >>>>> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct
> > >>>>> msm_file_private *ctx,
> > >>>>>            if (!capable(CAP_SYS_ADMIN))
> > >>>>>                return UERR(EPERM, drm, "invalid permissions");
> > >>>>>            return msm_file_private_set_sysprof(ctx, gpu, value);
> > >>>>> +    case MSM_PARAM_REQ_CNTRS:
> > >>>>> +        return msm_file_private_request_counters(ctx, gpu, value);
> > >>>>>        default:
> > >>>>>            return UERR(EINVAL, drm, "%s: invalid param: %u", gpu-
> > >>>>>> name, param);
> > >>>>>        }
> > >>>>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/
> > >>>>> msm_drv.c
> > >>>>> index 6416d2cb4efc..bf8314ff4a25 100644
> > >>>>> --- a/drivers/gpu/drm/msm/msm_drv.c
> > >>>>> +++ b/drivers/gpu/drm/msm/msm_drv.c
> > >>>>> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device
> > >>>>> *dev, struct drm_file *file)
> > >>>>>         * It is not possible to set sysprof param to non-zero if gpu
> > >>>>>         * is not initialized:
> > >>>>>         */
> > >>>>> -    if (priv->gpu)
> > >>>>> +    if (ctx->sysprof)
> > >>>>>            msm_file_private_set_sysprof(ctx, priv->gpu, 0);
> > >>>>>    +    if (ctx->counters_requested)
> > >>>>> +        msm_file_private_request_counters(ctx, priv->gpu, 0);
> > >>>>> +
> > >>>>>        context_close(ctx);
> > >>>>>    }
> > >>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/
> > >>>>> msm_gpu.c
> > >>>>> index 82f204f3bb8f..013b59ca3bb1 100644
> > >>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
> > >>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> > >>>>> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct
> > >>>>> platform_device *pdev,
> > >>>>>        gpu->nr_rings = nr_rings;
> > >>>>>          refcount_set(&gpu->sysprof_active, 1);
> > >>>>> +    refcount_set(&gpu->local_counters_active, 1);
> > >>>>>          return 0;
> > >>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/
> > >>>>> msm_gpu.h
> > >>>>> index e25009150579..83c61e523b1b 100644
> > >>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
> > >>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > >>>>> @@ -195,12 +195,28 @@ struct msm_gpu {
> > >>>>>        int nr_rings;
> > >>>>>          /**
> > >>>>> -     * sysprof_active:
> > >>>>> +     * @sysprof_active:
> > >>>>>         *
> > >>>>> -     * The count of contexts that have enabled system profiling.
> > >>>>> +     * The count of contexts that have enabled system profiling plus
> > >>>>> one.
> > >>>>> +     *
> > >>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
> > >>>>> keep
> > >>>>> +     * the under/overflow checks that refcount_t provides, but allow
> > >>>>> +     * multiple on/off transitions so we track the logical value
> > >>>>> plus one.)
> > >>>>>         */
> > >>>>>        refcount_t sysprof_active;
> > >>>>>    +    /**
> > >>>>> +     * @local_counters_active:
> > >>>>> +     *
> > >>>>> +     * The count of contexts that have requested local (intra-submit)
> > >>>>> +     * performance counter usage plus one.
> > >>>>> +     *
> > >>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
> > >>>>> keep
> > >>>>> +     * the under/overflow checks that refcount_t provides, but allow
> > >>>>> +     * multiple on/off transitions so we track the logical value
> > >>>>> plus one.)
> > >>>>> +     */
> > >>>>> +    refcount_t local_counters_active;
> > >>>>> +
> > >>>>>        /**
> > >>>>>         * lock:
> > >>>>>         *
> > >>>>> @@ -383,6 +399,13 @@ struct msm_file_private {
> > >>>>>         */
> > >>>>>        int sysprof;
> > >>>>>    +    /**
> > >>>>> +     * @counters_requested:
> > >>>>> +     *
> > >>>>> +     * Has the context requested local perfcntr usage.
> > >>>>> +     */
> > >>>>> +    bool counters_requested;
> > >>>>> +
> > >>>>>        /**
> > >>>>>         * comm: Overridden task comm, see MSM_PARAM_COMM
> > >>>>>         *
> > >>>>> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
> > >>>>>      int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> > >>>>>                     struct msm_gpu *gpu, int sysprof);
> > >>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
> > >>>>> +                      struct msm_gpu *gpu, int reqcntrs);
> > >>>>>    void __msm_file_private_destroy(struct kref *kref);
> > >>>>>      static inline void msm_file_private_put(struct msm_file_private
> > >>>>> *ctx)
> > >>>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/
> > >>>>> msm/msm_submitqueue.c
> > >>>>> index 7fed1de63b5d..1e1e21e6f7ae 100644
> > >>>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> > >>>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> > >>>>> @@ -10,6 +10,15 @@
> > >>>>>    int msm_file_private_set_sysprof(struct msm_file_private *ctx,
> > >>>>>                     struct msm_gpu *gpu, int sysprof)
> > >>>>>    {
> > >>>>> +    int ret = 0;
> > >>>>> +
> > >>>>> +    mutex_lock(&gpu->lock);
> > >>>>> +
> > >>>>> +    if (sysprof && (refcount_read(&gpu->local_counters_active) >
> > >>>>> 1)) {
> > >>>>> +        ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
> > >>>>> +        goto out_unlock;
> > >>>>> +    }
> > >>>>> +
> > >>>>>        /*
> > >>>>>         * Since pm_runtime and sysprof_active are both refcounts, we
> > >>>>>         * call apply the new value first, and then unwind the previous
> > >>>>> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct
> > >>>>> msm_file_private *ctx,
> > >>>>>          switch (sysprof) {
> > >>>>>        default:
> > >>>>> -        return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d",
> > >>>>> sysprof);
> > >>>>> +        ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
> > >>>>> +        goto out_unlock;
> > >>>>>        case 2:
> > >>>>>            pm_runtime_get_sync(&gpu->pdev->dev);
> > >>>>>            fallthrough;
> > >>>>> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct
> > >>>>> msm_file_private *ctx,
> > >>>>>          ctx->sysprof = sysprof;
> > >>>>>    -    return 0;
> > >>>>> +out_unlock:
> > >>>>> +    mutex_unlock(&gpu->lock);
> > >>>>> +
> > >>>>> +    return ret;
> > >>>>> +}
> > >>>>> +
> > >>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
> > >>>>> +                      struct msm_gpu *gpu, int reqctrs)
> > >>>>> +{
> > >>>>> +    int ret = 0;
> > >>>>> +
> > >>>>> +    mutex_lock(&gpu->lock);
> > >>>>> +
> > >>>>> +    if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
> > >>>>> +        ret = UERR(EBUSY, gpu->dev, "System profiling active");
> > >>>>> +        goto out_unlock;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +    if (reqctrs) {
> > >>>>> +        if (ctx->counters_requested) {
> > >>>>> +            ret = UERR(EINVAL, gpu->dev, "Already requested");
> > >>>>> +            goto out_unlock;
> > >>>>> +        }
> > >>>>> +
> > >>>>> +        ctx->counters_requested = true;
> > >>>>> +        refcount_inc(&gpu->local_counters_active);
> > >>>>> +    } else {
> > >>>>> +        if (!ctx->counters_requested) {
> > >>>>> +            ret = UERR(EINVAL, gpu->dev, "Not requested");
> > >>>>> +            goto out_unlock;
> > >>>>> +        }
> > >>>>> +        refcount_dec(&gpu->local_counters_active);
> > >>>>> +        ctx->counters_requested = false;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +out_unlock:
> > >>>>> +    mutex_unlock(&gpu->lock);
> > >>>>> +
> > >>>>> +    return ret;
> > >>>>>    }
> > >>>>>      void __msm_file_private_destroy(struct kref *kref)
> > >>>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
> > >>>>> index 2342cb90857e..ae7fb355e4a1 100644
> > >>>>> --- a/include/uapi/drm/msm_drm.h
> > >>>>> +++ b/include/uapi/drm/msm_drm.h
> > >>>>> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
> > >>>>>    #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
> > >>>>>    #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
> > >>>>>    #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
> > >>>>> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-
> > >>>>> submit) perfcntr usage  */
> > >>>>>      /* For backwards compat.  The original support for preemption was
> > >>>>> based on
> > >>>>>     * a single ring per priority level so # of priority levels equals
> > >>>>> the #
> > >>>>
> > >>>
> > >>>
> > >>> Best regards,
> > >>
> > >
> > >
> > > Best regards,
> >
Akhil P Oommen Dec. 17, 2024, 6:58 p.m. UTC | #12
On 12/17/2024 2:21 AM, Rob Clark wrote:
> On Mon, Dec 16, 2024 at 12:25 PM Akhil P Oommen
> <quic_akhilpo@quicinc.com> wrote:
>>
>> On 12/16/2024 10:28 PM, Connor Abbott wrote:
>>> On Mon, Dec 16, 2024 at 11:55 AM Akhil P Oommen
>>> <quic_akhilpo@quicinc.com> wrote:
>>>>
>>>> On 12/13/2024 10:40 PM, Antonino Maniscalco wrote:
>>>>> On 12/13/24 5:50 PM, Akhil P Oommen wrote:
>>>>>> On 12/12/2024 9:44 PM, Antonino Maniscalco wrote:
>>>>>>> On 12/12/24 4:58 PM, Akhil P Oommen wrote:
>>>>>>>> On 12/5/2024 10:24 PM, Rob Clark wrote:
>>>>>>>>> From: Rob Clark <robdclark@chromium.org>
>>>>>>>>>
>>>>>>>>> Performance counter usage falls into two categories:
>>>>>>>>>
>>>>>>>>> 1. Local usage, where the counter configuration, start, and end read
>>>>>>>>>      happen within (locally to) a single SUBMIT.  In this case,
>>>>>>>>> there is
>>>>>>>>>      no dependency on counter configuration or values between submits,
>>>>>>>>> and
>>>>>>>>>      in fact counters are normally cleared on context switches,
>>>>>>>>> making it
>>>>>>>>>      impossible to rely on cross-submit state.
>>>>>>>>>
>>>>>>>>> 2. Global usage, where a single privilaged daemon/process is sampling
>>>>>>>>>      counter values across all processes for profiling.
>>>>>>>>>
>>>>>>>>> The two categories are mutually exclusive.  While you can have many
>>>>>>>>> processes making local counter usage, you cannot combine global and
>>>>>>>>> local usage without the two stepping on each others feet (by changing
>>>>>>>>> counter configuration).
>>>>>>>>>
>>>>>>>>> For global counter usage, there is already a SYSPROF param (since
>>>>>>>>> global
>>>>>>>>> counter usage requires disabling counter clearing on context switch).
>>>>>>>>> This patch adds a REQ_CNTRS param to request local counter usage.  If
>>>>>>>>> one or more processes has requested counter usage, then a SYSPROF
>>>>>>>>> request will fail with -EBUSY.  And if SYSPROF is active, then
>>>>>>>>> REQ_CNTRS
>>>>>>>>> will fail with -EBUSY, maintaining the mutual exclusivity.
>>>>>>>>>
>>>>>>>>> This is purely an advisory interface to help coordinate userspace.
>>>>>>>>> There is no real means of enforcement, but the worst that can
>>>>>>>>> happen if
>>>>>>>>> userspace ignores a REQ_CNTRS failure is that you'll get nonsense
>>>>>>>>> profiling data.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Rob Clark <robdclark@chromium.org>
>>>>>>>>> ---
>>>>>>>>> kgsl takes a different approach, which involves a lot more UABI for
>>>>>>>>> assigning counters to different processes.  But I think by taking
>>>>>>>>> advantage of the fact that mesa (freedreno+turnip) reconfigure the
>>>>>>>>> counters they need in each SUBMIT, for their respective gl/vk perf-
>>>>>>>>> counter extensions, we can take this simpler approach.
>>>>>>>>
>>>>>>>> KGSL's approach is preemption and ifpc safe (also whatever HW changes
>>>>>>>> that will come up in future generations). How will we ensure that here?
>>>>>>>>
>>>>>>>> I have plans to bring up IFPC support in near future. Also, I
>>>>>>>> brought up
>>>>>>>> this point during preemption series. But from the responses, I felt
>>>>>>>> that
>>>>>>>> profiling was not considered a serious usecase. Still I wonder how the
>>>>>>>> perfcounter extensions work accurately with preemption.
>>>>>>>
>>>>>>> So back then I implemented the postamble IB to clear perf counters and
>>>>>>> that gets disabled when sysprof (so global usage) is happening. The
>>>>>>> kernel is oblivious to "Local isage" of profiling but in that case
>>>>>>> really what we want to do is disable preemption which in my
>>>>>>> understanding can be done from userspace with a PKT. In my understanding
>>>>>>> this had us covered for all usecases.
>>>>>>
>>>>>> I think this wasn't mentioned at that time. Which UMD PKT in a6x+ did
>>>>>> you mean?
>>>>>
>>>>> Ah, I thought it wasmentioned, sorry.
>>>>> The packet I was referring to is:
>>>>>     <doc> Make next dword 1 to disable preemption, 0 to re-enable it. </
>>>>> doc>
>>>>>     <value name="CP_PREEMPT_DISABLE" value="0x6c" variants="A6XX"/>
>>>>
>>>> Ah! Okay. I think this packet is not used by the downstream blob. IMO,
>>>> disabling preemption is still a suboptimal solution.
>>>
>>> Downstream doesn't expose userspace perfcounters (i.e.
>>> VK_KHR_performance_query) at all. My understanding is that Android
>>> requires you not to expose them for security reasons, but I could be
>>> wrong there. In any case, Qualcomm clearly hasn't really thought
>>> through what it would take to make everything work well with userspace
>>> perfcounters and hasn't implemented the necessary firmware bits for
>>> that, so we're left muddling through and doing what we can.
>>>
>>
>> Honestly, I don't know what you meant by the missing support.
> 
> GMU support to save/restore SEL regs on IFPC when SYSPROF is enabled ;-)
> 

That won't solve the case of reading counters via devmem. That still
would require disabling ifpc.

> If we had that, we wouldn't need ioclts to assign+configure counters,
> everything else could be done in userspace (modulo trying to do both
> local and global profiling at the same time.. but I'm not convinced
> that is a valid use-case, as I mentioned earlier)
> 
Lets ignore this use-case then. We can revisit if it becomes a thing in
upstream.

-Akhil.

> BR,
> -R
> 
>> -Akhil.
>>
>>> Connor
>>>
>>>>
>>>>>
>>>>> BTW you mentioned wanting to look into IFPC. Since I too wanted to look
>>>>> into implementing it wonder if you could let me know when you planned on
>>>>> working on it.
>>>>
>>>> I have few patches in progress. Nothing final yet and need verification
>>>> on the hw side. Also, I need to do some housekeeping here to debug gmu
>>>> issues since IFPC increases the probability of those a lot.
>>>>
>>>> I will try to send out the patches very soon.
>>>>
>>>> -Akhil.
>>>>
>>>>>
>>>>>>
>>>>>> -Akhil.
>>>>>>
>>>>>>>
>>>>>>> So what would you expect instead we should do kernel side to make
>>>>>>> profiling preemption safe?
>>>>>>>
>>>>>>>>
>>>>>>>> -Akhil
>>>>>>>>
>>>>>>>>>
>>>>>>>>>    drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +
>>>>>>>>>    drivers/gpu/drm/msm/msm_drv.c           |  5 ++-
>>>>>>>>>    drivers/gpu/drm/msm/msm_gpu.c           |  1 +
>>>>>>>>>    drivers/gpu/drm/msm/msm_gpu.h           | 29 +++++++++++++-
>>>>>>>>>    drivers/gpu/drm/msm/msm_submitqueue.c   | 52 +++++++++++++++++++
>>>>>>>>> +++++-
>>>>>>>>>    include/uapi/drm/msm_drm.h              |  1 +
>>>>>>>>>    6 files changed, 85 insertions(+), 5 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/
>>>>>>>>> drm/msm/adreno/adreno_gpu.c
>>>>>>>>> index 31bbf2c83de4..f688e37059b8 100644
>>>>>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>>>>> @@ -441,6 +441,8 @@ int adreno_set_param(struct msm_gpu *gpu, struct
>>>>>>>>> msm_file_private *ctx,
>>>>>>>>>            if (!capable(CAP_SYS_ADMIN))
>>>>>>>>>                return UERR(EPERM, drm, "invalid permissions");
>>>>>>>>>            return msm_file_private_set_sysprof(ctx, gpu, value);
>>>>>>>>> +    case MSM_PARAM_REQ_CNTRS:
>>>>>>>>> +        return msm_file_private_request_counters(ctx, gpu, value);
>>>>>>>>>        default:
>>>>>>>>>            return UERR(EINVAL, drm, "%s: invalid param: %u", gpu-
>>>>>>>>>> name, param);
>>>>>>>>>        }
>>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/
>>>>>>>>> msm_drv.c
>>>>>>>>> index 6416d2cb4efc..bf8314ff4a25 100644
>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_drv.c
>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_drv.c
>>>>>>>>> @@ -377,9 +377,12 @@ static void msm_postclose(struct drm_device
>>>>>>>>> *dev, struct drm_file *file)
>>>>>>>>>         * It is not possible to set sysprof param to non-zero if gpu
>>>>>>>>>         * is not initialized:
>>>>>>>>>         */
>>>>>>>>> -    if (priv->gpu)
>>>>>>>>> +    if (ctx->sysprof)
>>>>>>>>>            msm_file_private_set_sysprof(ctx, priv->gpu, 0);
>>>>>>>>>    +    if (ctx->counters_requested)
>>>>>>>>> +        msm_file_private_request_counters(ctx, priv->gpu, 0);
>>>>>>>>> +
>>>>>>>>>        context_close(ctx);
>>>>>>>>>    }
>>>>>>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/
>>>>>>>>> msm_gpu.c
>>>>>>>>> index 82f204f3bb8f..013b59ca3bb1 100644
>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>>>>>>>>> @@ -991,6 +991,7 @@ int msm_gpu_init(struct drm_device *drm, struct
>>>>>>>>> platform_device *pdev,
>>>>>>>>>        gpu->nr_rings = nr_rings;
>>>>>>>>>          refcount_set(&gpu->sysprof_active, 1);
>>>>>>>>> +    refcount_set(&gpu->local_counters_active, 1);
>>>>>>>>>          return 0;
>>>>>>>>>    diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/
>>>>>>>>> msm_gpu.h
>>>>>>>>> index e25009150579..83c61e523b1b 100644
>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>>>>>>>> @@ -195,12 +195,28 @@ struct msm_gpu {
>>>>>>>>>        int nr_rings;
>>>>>>>>>          /**
>>>>>>>>> -     * sysprof_active:
>>>>>>>>> +     * @sysprof_active:
>>>>>>>>>         *
>>>>>>>>> -     * The count of contexts that have enabled system profiling.
>>>>>>>>> +     * The count of contexts that have enabled system profiling plus
>>>>>>>>> one.
>>>>>>>>> +     *
>>>>>>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
>>>>>>>>> keep
>>>>>>>>> +     * the under/overflow checks that refcount_t provides, but allow
>>>>>>>>> +     * multiple on/off transitions so we track the logical value
>>>>>>>>> plus one.)
>>>>>>>>>         */
>>>>>>>>>        refcount_t sysprof_active;
>>>>>>>>>    +    /**
>>>>>>>>> +     * @local_counters_active:
>>>>>>>>> +     *
>>>>>>>>> +     * The count of contexts that have requested local (intra-submit)
>>>>>>>>> +     * performance counter usage plus one.
>>>>>>>>> +     *
>>>>>>>>> +     * Note: refcount_t does not like 0->1 transitions.. we want to
>>>>>>>>> keep
>>>>>>>>> +     * the under/overflow checks that refcount_t provides, but allow
>>>>>>>>> +     * multiple on/off transitions so we track the logical value
>>>>>>>>> plus one.)
>>>>>>>>> +     */
>>>>>>>>> +    refcount_t local_counters_active;
>>>>>>>>> +
>>>>>>>>>        /**
>>>>>>>>>         * lock:
>>>>>>>>>         *
>>>>>>>>> @@ -383,6 +399,13 @@ struct msm_file_private {
>>>>>>>>>         */
>>>>>>>>>        int sysprof;
>>>>>>>>>    +    /**
>>>>>>>>> +     * @counters_requested:
>>>>>>>>> +     *
>>>>>>>>> +     * Has the context requested local perfcntr usage.
>>>>>>>>> +     */
>>>>>>>>> +    bool counters_requested;
>>>>>>>>> +
>>>>>>>>>        /**
>>>>>>>>>         * comm: Overridden task comm, see MSM_PARAM_COMM
>>>>>>>>>         *
>>>>>>>>> @@ -626,6 +649,8 @@ void msm_submitqueue_destroy(struct kref *kref);
>>>>>>>>>      int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>>>>>>                     struct msm_gpu *gpu, int sysprof);
>>>>>>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>>>>>>>> +                      struct msm_gpu *gpu, int reqcntrs);
>>>>>>>>>    void __msm_file_private_destroy(struct kref *kref);
>>>>>>>>>      static inline void msm_file_private_put(struct msm_file_private
>>>>>>>>> *ctx)
>>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/
>>>>>>>>> msm/msm_submitqueue.c
>>>>>>>>> index 7fed1de63b5d..1e1e21e6f7ae 100644
>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
>>>>>>>>> @@ -10,6 +10,15 @@
>>>>>>>>>    int msm_file_private_set_sysprof(struct msm_file_private *ctx,
>>>>>>>>>                     struct msm_gpu *gpu, int sysprof)
>>>>>>>>>    {
>>>>>>>>> +    int ret = 0;
>>>>>>>>> +
>>>>>>>>> +    mutex_lock(&gpu->lock);
>>>>>>>>> +
>>>>>>>>> +    if (sysprof && (refcount_read(&gpu->local_counters_active) >
>>>>>>>>> 1)) {
>>>>>>>>> +        ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
>>>>>>>>> +        goto out_unlock;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>>        /*
>>>>>>>>>         * Since pm_runtime and sysprof_active are both refcounts, we
>>>>>>>>>         * call apply the new value first, and then unwind the previous
>>>>>>>>> @@ -18,7 +27,8 @@ int msm_file_private_set_sysprof(struct
>>>>>>>>> msm_file_private *ctx,
>>>>>>>>>          switch (sysprof) {
>>>>>>>>>        default:
>>>>>>>>> -        return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d",
>>>>>>>>> sysprof);
>>>>>>>>> +        ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
>>>>>>>>> +        goto out_unlock;
>>>>>>>>>        case 2:
>>>>>>>>>            pm_runtime_get_sync(&gpu->pdev->dev);
>>>>>>>>>            fallthrough;
>>>>>>>>> @@ -43,7 +53,45 @@ int msm_file_private_set_sysprof(struct
>>>>>>>>> msm_file_private *ctx,
>>>>>>>>>          ctx->sysprof = sysprof;
>>>>>>>>>    -    return 0;
>>>>>>>>> +out_unlock:
>>>>>>>>> +    mutex_unlock(&gpu->lock);
>>>>>>>>> +
>>>>>>>>> +    return ret;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +int msm_file_private_request_counters(struct msm_file_private *ctx,
>>>>>>>>> +                      struct msm_gpu *gpu, int reqctrs)
>>>>>>>>> +{
>>>>>>>>> +    int ret = 0;
>>>>>>>>> +
>>>>>>>>> +    mutex_lock(&gpu->lock);
>>>>>>>>> +
>>>>>>>>> +    if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
>>>>>>>>> +        ret = UERR(EBUSY, gpu->dev, "System profiling active");
>>>>>>>>> +        goto out_unlock;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    if (reqctrs) {
>>>>>>>>> +        if (ctx->counters_requested) {
>>>>>>>>> +            ret = UERR(EINVAL, gpu->dev, "Already requested");
>>>>>>>>> +            goto out_unlock;
>>>>>>>>> +        }
>>>>>>>>> +
>>>>>>>>> +        ctx->counters_requested = true;
>>>>>>>>> +        refcount_inc(&gpu->local_counters_active);
>>>>>>>>> +    } else {
>>>>>>>>> +        if (!ctx->counters_requested) {
>>>>>>>>> +            ret = UERR(EINVAL, gpu->dev, "Not requested");
>>>>>>>>> +            goto out_unlock;
>>>>>>>>> +        }
>>>>>>>>> +        refcount_dec(&gpu->local_counters_active);
>>>>>>>>> +        ctx->counters_requested = false;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +out_unlock:
>>>>>>>>> +    mutex_unlock(&gpu->lock);
>>>>>>>>> +
>>>>>>>>> +    return ret;
>>>>>>>>>    }
>>>>>>>>>      void __msm_file_private_destroy(struct kref *kref)
>>>>>>>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
>>>>>>>>> index 2342cb90857e..ae7fb355e4a1 100644
>>>>>>>>> --- a/include/uapi/drm/msm_drm.h
>>>>>>>>> +++ b/include/uapi/drm/msm_drm.h
>>>>>>>>> @@ -91,6 +91,7 @@ struct drm_msm_timespec {
>>>>>>>>>    #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
>>>>>>>>>    #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
>>>>>>>>>    #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
>>>>>>>>> +#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-
>>>>>>>>> submit) perfcntr usage  */
>>>>>>>>>      /* For backwards compat.  The original support for preemption was
>>>>>>>>> based on
>>>>>>>>>     * a single ring per priority level so # of priority levels equals
>>>>>>>>> the #
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Best regards,
>>>>>>
>>>>>
>>>>>
>>>>> Best regards,
>>>>
>>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index 31bbf2c83de4..f688e37059b8 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -441,6 +441,8 @@  int adreno_set_param(struct msm_gpu *gpu, struct msm_file_private *ctx,
 		if (!capable(CAP_SYS_ADMIN))
 			return UERR(EPERM, drm, "invalid permissions");
 		return msm_file_private_set_sysprof(ctx, gpu, value);
+	case MSM_PARAM_REQ_CNTRS:
+		return msm_file_private_request_counters(ctx, gpu, value);
 	default:
 		return UERR(EINVAL, drm, "%s: invalid param: %u", gpu->name, param);
 	}
diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index 6416d2cb4efc..bf8314ff4a25 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -377,9 +377,12 @@  static void msm_postclose(struct drm_device *dev, struct drm_file *file)
 	 * It is not possible to set sysprof param to non-zero if gpu
 	 * is not initialized:
 	 */
-	if (priv->gpu)
+	if (ctx->sysprof)
 		msm_file_private_set_sysprof(ctx, priv->gpu, 0);
 
+	if (ctx->counters_requested)
+		msm_file_private_request_counters(ctx, priv->gpu, 0);
+
 	context_close(ctx);
 }
 
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 82f204f3bb8f..013b59ca3bb1 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -991,6 +991,7 @@  int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 	gpu->nr_rings = nr_rings;
 
 	refcount_set(&gpu->sysprof_active, 1);
+	refcount_set(&gpu->local_counters_active, 1);
 
 	return 0;
 
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index e25009150579..83c61e523b1b 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -195,12 +195,28 @@  struct msm_gpu {
 	int nr_rings;
 
 	/**
-	 * sysprof_active:
+	 * @sysprof_active:
 	 *
-	 * The count of contexts that have enabled system profiling.
+	 * The count of contexts that have enabled system profiling plus one.
+	 *
+	 * Note: refcount_t does not like 0->1 transitions.. we want to keep
+	 * the under/overflow checks that refcount_t provides, but allow
+	 * multiple on/off transitions so we track the logical value plus one.)
 	 */
 	refcount_t sysprof_active;
 
+	/**
+	 * @local_counters_active:
+	 *
+	 * The count of contexts that have requested local (intra-submit)
+	 * performance counter usage plus one.
+	 *
+	 * Note: refcount_t does not like 0->1 transitions.. we want to keep
+	 * the under/overflow checks that refcount_t provides, but allow
+	 * multiple on/off transitions so we track the logical value plus one.)
+	 */
+	refcount_t local_counters_active;
+
 	/**
 	 * lock:
 	 *
@@ -383,6 +399,13 @@  struct msm_file_private {
 	 */
 	int sysprof;
 
+	/**
+	 * @counters_requested:
+	 *
+	 * Has the context requested local perfcntr usage.
+	 */
+	bool counters_requested;
+
 	/**
 	 * comm: Overridden task comm, see MSM_PARAM_COMM
 	 *
@@ -626,6 +649,8 @@  void msm_submitqueue_destroy(struct kref *kref);
 
 int msm_file_private_set_sysprof(struct msm_file_private *ctx,
 				 struct msm_gpu *gpu, int sysprof);
+int msm_file_private_request_counters(struct msm_file_private *ctx,
+				      struct msm_gpu *gpu, int reqcntrs);
 void __msm_file_private_destroy(struct kref *kref);
 
 static inline void msm_file_private_put(struct msm_file_private *ctx)
diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
index 7fed1de63b5d..1e1e21e6f7ae 100644
--- a/drivers/gpu/drm/msm/msm_submitqueue.c
+++ b/drivers/gpu/drm/msm/msm_submitqueue.c
@@ -10,6 +10,15 @@ 
 int msm_file_private_set_sysprof(struct msm_file_private *ctx,
 				 struct msm_gpu *gpu, int sysprof)
 {
+	int ret = 0;
+
+	mutex_lock(&gpu->lock);
+
+	if (sysprof && (refcount_read(&gpu->local_counters_active) > 1)) {
+		ret = UERR(EBUSY, gpu->dev, "Local counter usage active");
+		goto out_unlock;
+	}
+
 	/*
 	 * Since pm_runtime and sysprof_active are both refcounts, we
 	 * call apply the new value first, and then unwind the previous
@@ -18,7 +27,8 @@  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
 
 	switch (sysprof) {
 	default:
-		return UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
+		ret = UERR(EINVAL, gpu->dev, "Invalid sysprof: %d", sysprof);
+		goto out_unlock;
 	case 2:
 		pm_runtime_get_sync(&gpu->pdev->dev);
 		fallthrough;
@@ -43,7 +53,45 @@  int msm_file_private_set_sysprof(struct msm_file_private *ctx,
 
 	ctx->sysprof = sysprof;
 
-	return 0;
+out_unlock:
+	mutex_unlock(&gpu->lock);
+
+	return ret;
+}
+
+int msm_file_private_request_counters(struct msm_file_private *ctx,
+				      struct msm_gpu *gpu, int reqctrs)
+{
+	int ret = 0;
+
+	mutex_lock(&gpu->lock);
+
+	if (reqctrs && (refcount_read(&gpu->sysprof_active) > 1)) {
+		ret = UERR(EBUSY, gpu->dev, "System profiling active");
+		goto out_unlock;
+	}
+
+	if (reqctrs) {
+		if (ctx->counters_requested) {
+			ret = UERR(EINVAL, gpu->dev, "Already requested");
+			goto out_unlock;
+		}
+
+		ctx->counters_requested = true;
+		refcount_inc(&gpu->local_counters_active);
+	} else {
+		if (!ctx->counters_requested) {
+			ret = UERR(EINVAL, gpu->dev, "Not requested");
+			goto out_unlock;
+		}
+		refcount_dec(&gpu->local_counters_active);
+		ctx->counters_requested = false;
+	}
+
+out_unlock:
+	mutex_unlock(&gpu->lock);
+
+	return ret;
 }
 
 void __msm_file_private_destroy(struct kref *kref)
diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
index 2342cb90857e..ae7fb355e4a1 100644
--- a/include/uapi/drm/msm_drm.h
+++ b/include/uapi/drm/msm_drm.h
@@ -91,6 +91,7 @@  struct drm_msm_timespec {
 #define MSM_PARAM_UBWC_SWIZZLE 0x12 /* RO */
 #define MSM_PARAM_MACROTILE_MODE 0x13 /* RO */
 #define MSM_PARAM_UCHE_TRAP_BASE 0x14 /* RO */
+#define MSM_PARAM_REQ_CNTRS  0x15 /* WO: request "local" (intra-submit) perfcntr usage  */
 
 /* For backwards compat.  The original support for preemption was based on
  * a single ring per priority level so # of priority levels equals the #