[2/2] sched_ext: Add cpuperf support

Message ID	20240619031250.2936087-3-tj@kernel.org
State	Superseded
Headers	From: Tejun Heo <tj@kernel.org>
	To: rafael@kernel.org, viresh.kumar@linaro.org
	Cc: linux-pm@vger.kernel.org, void@manifault.com,
	    linux-kernel@vger.kernel.org, kernel-team@meta.com, mingo@redhat.com,
	    peterz@infradead.org, Tejun Heo <tj@kernel.org>,
	    David Vernet <dvernet@meta.com>,
	    "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
	Subject: [PATCH 2/2] sched_ext: Add cpuperf support
	Date: Tue, 18 Jun 2024 17:12:03 -1000
	In-Reply-To: <20240619031250.2936087-1-tj@kernel.org>
Series	[1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy(), [2/2] sched_ext: Add cpuperf support

Tejun Heo June 19, 2024, 3:12 a.m. UTC

sched_ext currently does not integrate with schedutil. When schedutil is the
governor, frequencies are left unregulated and usually get stuck close to
the highest performance level from running RT tasks.

Add CPU performance monitoring and scaling support by integrating into
schedutil. The following kfuncs are added:

- scx_bpf_cpuperf_cap(): Query the relative performance capacity of
  different CPUs in the system.

- scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
  relative to its max performance.

- scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.

This gives direct control over CPU performance setting to the BPF scheduler.
The only changes on the schedutil side are accounting for the utilization
factor from sched_ext and disabling the frequency holding heuristics, as they
may not apply well to sched_ext schedulers, which may have a much weaker
connection between tasks and their current / last CPU.

With cpuperf support added, there is no reason to block uclamp. Enable it
while at it.

A toy implementation of cpuperf is added to scx_qmap as a demonstration of
the feature.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c         |  12 +-
 kernel/sched/ext.c                       |  83 ++++++++++++-
 kernel/sched/ext.h                       |   9 ++
 kernel/sched/sched.h                     |   1 +
 tools/sched_ext/include/scx/common.bpf.h |   3 +
 tools/sched_ext/scx_qmap.bpf.c           | 142 ++++++++++++++++++++++-
 tools/sched_ext/scx_qmap.c               |   8 ++
 7 files changed, 252 insertions(+), 6 deletions(-)
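
As a rough illustration of how the two query kfuncs relate, the following
userspace sketch combines a CPU's capacity (relative to the fastest CPU in the
system) with its current level (relative to its own maximum), using the
formula from the kernel-doc of scx_bpf_cpuperf_cur(). The per-CPU tables and
the model_*() / system_relative_perf() names are hypothetical stand-ins, not
kernel APIs; only SCX_CPUPERF_ONE (SCHED_CAPACITY_SCALE, i.e. 1024) comes from
the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SCX_CPUPERF_ONE 1024u	/* SCHED_CAPACITY_SCALE in the kernel */

/* Hypothetical 4-CPU system: two little CPUs followed by two big CPUs. */
static const uint32_t model_cap[4] = { 512, 512, 1024, 1024 };
/* Hypothetical current level of each CPU relative to its own maximum. */
static const uint32_t model_cur[4] = { 1024, 256, 512, 1024 };

/* Stand-in for scx_bpf_cpuperf_cap(): capacity vs. the fastest CPU. */
static uint32_t model_cpuperf_cap(int cpu)
{
	return model_cap[cpu];
}

/* Stand-in for scx_bpf_cpuperf_cur(): current level vs. the CPU's own max. */
static uint32_t model_cpuperf_cur(int cpu)
{
	return model_cur[cpu];
}

/* Current performance relative to the maximum available in the system. */
static uint32_t system_relative_perf(int cpu)
{
	return model_cpuperf_cap(cpu) * model_cpuperf_cur(cpu) / SCX_CPUPERF_ONE;
}
```

In this model, a little CPU running flat out and a big CPU running at half
its maximum both report the same system-relative performance.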

Tejun Heo June 21, 2024, 10:39 p.m. UTC | #1

On Wed, Jun 19, 2024 at 09:51:39AM -1000, Tejun Heo wrote:
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.
> 
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
> 
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
>   different CPUs in the system.
> 
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
>   relative to its max performance.
> 
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
> 
> This gives direct control over CPU performance setting to the BPF scheduler.
> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling the frequency holding heuristics, as they
> may not apply well to sched_ext schedulers, which may have a much weaker
> connection between tasks and their current / last CPU.
> 
> With cpuperf support added, there is no reason to block uclamp. Enable it
> while at it.
> 
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
> 
> v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util()
>     to avoid factoring in stale util metric. (Christian)
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> Cc: Christian Loehle <christian.loehle@arm.com>

Applied to sched_ext/for-6.11.

Thanks.

Hongyan Xia July 2, 2024, 10:23 a.m. UTC | #2

On 19/06/2024 04:12, Tejun Heo wrote:
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.
> 
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
> 
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
>    different CPUs in the system.
> 
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
>    relative to its max performance.
> 
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
> 
> This gives direct control over CPU performance setting to the BPF scheduler.
> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling the frequency holding heuristics, as they
> may not apply well to sched_ext schedulers, which may have a much weaker
> connection between tasks and their current / last CPU.
> 
> With cpuperf support added, there is no reason to block uclamp. Enable it
> while at it.
> 
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>   kernel/sched/cpufreq_schedutil.c         |  12 +-
>   kernel/sched/ext.c                       |  83 ++++++++++++-
>   kernel/sched/ext.h                       |   9 ++
>   kernel/sched/sched.h                     |   1 +
>   tools/sched_ext/include/scx/common.bpf.h |   3 +
>   tools/sched_ext/scx_qmap.bpf.c           | 142 ++++++++++++++++++++++-
>   tools/sched_ext/scx_qmap.c               |   8 ++
>   7 files changed, 252 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 972b7dd65af2..12174c0137a5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>   
>   static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
>   {
> -	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> +	unsigned long min, max;
> +	unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
> +		scx_cpuperf_target(sg_cpu->cpu);
>   
>   	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
>   	util = max(util, boost);
> @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
>   	unsigned long idle_calls;
>   	bool ret;
>   
> +	/*
> +	 * The heuristics in this function is for the fair class. For SCX, the
> +	 * performance target comes directly from the BPF scheduler. Let's just
> +	 * follow it.
> +	 */
> +	if (scx_switched_all())
> +		return false;
> +
>   	/* if capped by uclamp_max, always update to be in compliance */
>   	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
>   		return false;
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f814e84ceeb3..04fb0eeee5ec 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -16,6 +16,8 @@ enum scx_consts {
>   	SCX_EXIT_BT_LEN			= 64,
>   	SCX_EXIT_MSG_LEN		= 1024,
>   	SCX_EXIT_DUMP_DFL_LEN		= 32768,
> +
> +	SCX_CPUPERF_ONE			= SCHED_CAPACITY_SCALE,
>   };
>   
>   enum scx_exit_kind {
> @@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = {
>   	.update_curr		= update_curr_scx,
>   
>   #ifdef CONFIG_UCLAMP_TASK
> -	.uclamp_enabled		= 0,
> +	.uclamp_enabled		= 1,
>   #endif
>   };
>   

Hi. I know this is a bit late, but the implication of this change here 
can be quite interesting.

With this patch but without switching this knob from 0 to 1, this series
gives me the perfect opportunity to implement a custom uclamp within
sched_ext on top of the cpufreq support added. I think some vendors looking
at sched_ext would want this too. But if .uclamp_enabled == 1, the mainline
uclamp implementation is in effect regardless of which ext scheduler is
loaded. In fact, uclamp_{inc,dec}() are called before
{enqueue,dequeue}_task(), so there is now no easy way to circumvent it.

What would be really nice is to have cpufreq support in sched_ext without
forcing uclamp_enabled. But I also think there will be people who are happy
with the current uclamp implementation and just want to reuse it. The best
option would be to let the loaded scheduler decide somehow, although I don't
know whether there is an easy way to do that yet.

> [...]
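
The ordering concern can be sketched with a small userspace model. It assumes
only what is stated above: the core enqueue path does the uclamp accounting
first and then calls the class's enqueue callback. All model_* names are
hypothetical and the counter merely stands in for the per-rq uclamp buckets.

```c
#include <assert.h>

struct model_rq {
	int uclamp_tasks;	/* stands in for the rq's uclamp bucket counts */
	int seen_at_enqueue;	/* what the class callback observed */
};

struct model_class {
	int uclamp_enabled;
	void (*enqueue_task)(struct model_rq *rq);
};

/* Class callback: by the time it runs, the accounting has already happened. */
static void model_enqueue_ext(struct model_rq *rq)
{
	rq->seen_at_enqueue = rq->uclamp_tasks;
}

/* Core path: uclamp accounting first, class callback second. */
static void model_core_enqueue(struct model_rq *rq,
			       const struct model_class *class)
{
	if (class->uclamp_enabled)
		rq->uclamp_tasks++;	/* stands in for uclamp_rq_inc() */
	class->enqueue_task(rq);
}
```

With uclamp_enabled set, the callback always sees the task already accounted,
so a scheduler that only controls the callback cannot opt out.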

Tejun Heo July 2, 2024, 4:37 p.m. UTC | #3

Hello, Hongyan.

On Tue, Jul 02, 2024 at 11:23:58AM +0100, Hongyan Xia wrote:
> What would be really nice is to have cpufreq support in sched_ext without
> forcing uclamp_enabled. But I also think there will be people who are happy
> with the current uclamp implementation and just want to reuse it. The best
> option would be to let the loaded scheduler decide somehow, although I don't
> know whether there is an easy way to do that yet.

I don't know much about uclamp but at least from the sched_ext side, it's
trivial to add an ops flag for it. Because we know that no tasks are on the
ext class before the BPF scheduler is loaded, as long as we switch the
uclamp_enabled value while the BPF scheduler is not loaded, the uclamp
buckets should stay balanced. AFAICS, the only core change we need to make
is moving the uclamp_enabled bool outside sched_class so that it can be
changed at runtime. Is that the case or am I missing something?

Thanks.
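
The balance argument can be illustrated with a toy userspace model in which a
single counter stands in for the uclamp buckets and both enqueue and dequeue
are gated on the same flag. The names are hypothetical; the point is only that
flipping the flag while nothing is queued keeps the counter balanced, whereas
flipping it with a task queued leaves a stale count behind.

```c
#include <assert.h>
#include <stdbool.h>

struct model_rq {
	int bucket_tasks;	/* stands in for the uclamp bucket counts */
};

/* Stand-in for a runtime-switchable uclamp_enabled knob. */
static bool model_uclamp_enabled = true;

static void model_enqueue(struct model_rq *rq)
{
	if (model_uclamp_enabled)
		rq->bucket_tasks++;
}

static void model_dequeue(struct model_rq *rq)
{
	if (model_uclamp_enabled)
		rq->bucket_tasks--;
}
```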

Hongyan Xia July 2, 2024, 5:12 p.m. UTC | #4

On 02/07/2024 17:37, Tejun Heo wrote:
> Hello, Hongyan.
> 
> On Tue, Jul 02, 2024 at 11:23:58AM +0100, Hongyan Xia wrote:
>> What would be really nice is to have cpufreq support in sched_ext without
>> forcing uclamp_enabled. But I also think there will be people who are happy
>> with the current uclamp implementation and just want to reuse it. The best
>> option would be to let the loaded scheduler decide somehow, although I don't
>> know whether there is an easy way to do that yet.
> 
> I don't know much about uclamp but at least from the sched_ext side, it's
> trivial to add an ops flag for it. Because we know that no tasks are on the
> ext class before the BPF scheduler is loaded, as long as we switch the
> uclamp_enabled value while the BPF scheduler is not loaded, the uclamp
> buckets should stay balanced. AFAICS, the only core change we need to make
> is moving the uclamp_enabled bool outside sched_class so that it can be
> changed at runtime. Is that the case or am I missing something?
> 

Pretty much. Just to clarify what I meant, it would be fantastic if for 
ext, sched_class->uclamp_enabled is decided the moment we load the 
custom scheduler, not globally enabled all the time for all ext 
schedulers, in case the custom scheduler wants to ignore uclamp or has 
its own uclamp implementation. During ext_ops->init(), it would be great 
if the loaded scheduler could decide whether its 
sched_class->uclamp_enabled should be enabled.

However, sched_class->uclamp_enabled is just a normal struct variable, 
so I cannot immediately see a clean way to let the loaded scheduler 
program this field. We might be able to expose a function from the 
kernel side to write sched_class->uclamp_enabled during ext_ops->init(), 
although that looks a bit messy.

Tejun Heo July 2, 2024, 5:56 p.m. UTC | #5

Hello,

So, maybe something like this. It's not the prettiest, but it avoids adding
indirect calls to fair and rt while allowing sched_ext to report what the
BPF scheduler wants. Only compile tested. Would something like this work for
the use cases you have in mind?

Thanks.

Index: work/kernel/sched/core.c
===================================================================
--- work.orig/kernel/sched/core.c
+++ work/kernel/sched/core.c
@@ -1671,6 +1671,20 @@ static inline void uclamp_rq_dec_id(stru
 	}
 }
 
+bool sched_uclamp_enabled(void)
+{
+	return true;
+}
+
+static bool class_supports_uclamp(const struct sched_class *class)
+{
+	if (likely(class->uclamp_enabled == sched_uclamp_enabled))
+		return true;
+	if (!class->uclamp_enabled)
+		return false;
+	return class->uclamp_enabled();
+}
+
 static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
@@ -1684,7 +1698,7 @@ static inline void uclamp_rq_inc(struct
 	if (!static_branch_unlikely(&sched_uclamp_used))
 		return;
 
-	if (unlikely(!p->sched_class->uclamp_enabled))
+	if (!class_supports_uclamp(p->sched_class))
 		return;
 
 	for_each_clamp_id(clamp_id)
@@ -1708,7 +1722,7 @@ static inline void uclamp_rq_dec(struct
 	if (!static_branch_unlikely(&sched_uclamp_used))
 		return;
 
-	if (unlikely(!p->sched_class->uclamp_enabled))
+	if (!class_supports_uclamp(p->sched_class))
 		return;
 
 	for_each_clamp_id(clamp_id)
Index: work/kernel/sched/ext.c
===================================================================
--- work.orig/kernel/sched/ext.c
+++ work/kernel/sched/ext.c
@@ -116,10 +116,17 @@ enum scx_ops_flags {
 	 */
 	SCX_OPS_SWITCH_PARTIAL	= 1LLU << 3,
 
+	/*
+	 * Disable built-in uclamp support. Can be useful when the BPF scheduler
+	 * wants to implement custom uclamp support.
+	 */
+	SCX_OPS_DISABLE_UCLAMP	= 1LLU << 4,
+
 	SCX_OPS_ALL_FLAGS	= SCX_OPS_KEEP_BUILTIN_IDLE |
 				  SCX_OPS_ENQ_LAST |
 				  SCX_OPS_ENQ_EXITING |
-				  SCX_OPS_SWITCH_PARTIAL,
+				  SCX_OPS_SWITCH_PARTIAL |
+				  SCX_OPS_DISABLE_UCLAMP,
 };
 
 /* argument container for ops.init_task() */
@@ -3437,6 +3444,13 @@ static void switched_from_scx(struct rq
 static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
 static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
 
+#ifdef CONFIG_UCLAMP_TASK
+static bool uclamp_enabled_scx(void)
+{
+	return !(scx_ops.flags & SCX_OPS_DISABLE_UCLAMP);
+}
+#endif
+
 int scx_check_setscheduler(struct task_struct *p, int policy)
 {
 	lockdep_assert_rq_held(task_rq(p));
@@ -3522,7 +3536,7 @@ DEFINE_SCHED_CLASS(ext) = {
 	.update_curr		= update_curr_scx,
 
 #ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 1,
+	.uclamp_enabled		= uclamp_enabled_scx,
 #endif
 };
 
Index: work/kernel/sched/fair.c
===================================================================
--- work.orig/kernel/sched/fair.c
+++ work/kernel/sched/fair.c
@@ -13228,9 +13228,7 @@ DEFINE_SCHED_CLASS(fair) = {
 	.task_is_throttled	= task_is_throttled_fair,
 #endif
 
-#ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 1,
-#endif
+	SCHED_CLASS_UCLAMP_ENABLED
 };
 
 #ifdef CONFIG_SCHED_DEBUG
Index: work/kernel/sched/rt.c
===================================================================
--- work.orig/kernel/sched/rt.c
+++ work/kernel/sched/rt.c
@@ -2681,9 +2681,7 @@ DEFINE_SCHED_CLASS(rt) = {
 	.task_is_throttled	= task_is_throttled_rt,
 #endif
 
-#ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 1,
-#endif
+	SCHED_CLASS_UCLAMP_ENABLED
 };
 
 #ifdef CONFIG_RT_GROUP_SCHED
Index: work/kernel/sched/sched.h
===================================================================
--- work.orig/kernel/sched/sched.h
+++ work/kernel/sched/sched.h
@@ -2339,11 +2339,6 @@ struct affinity_context {
 extern s64 update_curr_common(struct rq *rq);
 
 struct sched_class {
-
-#ifdef CONFIG_UCLAMP_TASK
-	int uclamp_enabled;
-#endif
-
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task)   (struct rq *rq);
@@ -2405,8 +2400,21 @@ struct sched_class {
 #ifdef CONFIG_SCHED_CORE
 	int (*task_is_throttled)(struct task_struct *p, int cpu);
 #endif
+
+#ifdef CONFIG_UCLAMP_TASK
+	bool (*uclamp_enabled)(void);
+#endif
 };
 
+#ifdef CONFIG_UCLAMP_TASK
+bool sched_uclamp_enabled(void);
+
+#define SCHED_CLASS_UCLAMP_ENABLED	\
+	.uclamp_enabled = sched_uclamp_enabled,
+#else
+#define SCHED_CLASS_UCLAMP_ENABLED
+#endif
+
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
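
The pointer-comparison fast path can be tried out in a userspace sketch.
Classes that use the shared default callback are recognized by comparing the
function pointer, so no indirect call is made for them; only a class with its
own callback pays for the call. The model_* names and the call counter are
hypothetical. The caller returns early only when the class does not support
uclamp, mirroring the `!p->sched_class->uclamp_enabled` test that the hunk
replaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_class {
	bool (*uclamp_enabled)(void);
};

/* Counts how often a class-specific callback is really invoked. */
static int model_indirect_calls;

/* Shared default used by classes that always support uclamp. */
static bool model_default_enabled(void)
{
	return true;
}

/* Class-specific callback, e.g. a scheduler that opted out. */
static bool model_ext_disabled(void)
{
	model_indirect_calls++;
	return false;
}

static bool model_class_supports_uclamp(const struct model_class *class)
{
	if (class->uclamp_enabled == model_default_enabled)
		return true;	/* fast path: pointer compare, no call */
	if (!class->uclamp_enabled)
		return false;	/* class never supports uclamp */
	return class->uclamp_enabled();
}

/* Account the task only if its class supports uclamp. */
static void model_uclamp_rq_inc(const struct model_class *class,
				int *bucket_tasks)
{
	if (!model_class_supports_uclamp(class))
		return;
	(*bucket_tasks)++;
}
```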

Hongyan Xia July 2, 2024, 8:41 p.m. UTC | #6

On 02/07/2024 18:56, Tejun Heo wrote:
> Hello,
> 
> So, maybe something like this. It's not the prettiest, but it avoids adding
> indirect calls to fair and rt while allowing sched_ext to report what the
> BPF scheduler wants. Only compile tested. Would something like this work for
> the use cases you have in mind?
> 
> Thanks.
> 
> Index: work/kernel/sched/core.c
> ===================================================================
> --- work.orig/kernel/sched/core.c
> +++ work/kernel/sched/core.c
> @@ -1671,6 +1671,20 @@ static inline void uclamp_rq_dec_id(stru
>   	}
>   }
>   
> +bool sched_uclamp_enabled(void)
> +{
> +	return true;
> +}
> +
> +static bool class_supports_uclamp(const struct sched_class *class)
> +{
> +	if (likely(class->uclamp_enabled == sched_uclamp_enabled))
> +		return true;
> +	if (!class->uclamp_enabled)
> +		return false;
> +	return class->uclamp_enabled();
> +}
> +
>   static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>   {
>   	enum uclamp_id clamp_id;
> @@ -1684,7 +1698,7 @@ static inline void uclamp_rq_inc(struct
>   	if (!static_branch_unlikely(&sched_uclamp_used))
>   		return;
>   
> -	if (unlikely(!p->sched_class->uclamp_enabled))
> +	if (!class_supports_uclamp(p->sched_class))
>   		return;
>   
>   	for_each_clamp_id(clamp_id)
> @@ -1708,7 +1722,7 @@ static inline void uclamp_rq_dec(struct
>   	if (!static_branch_unlikely(&sched_uclamp_used))
>   		return;
>   
> -	if (unlikely(!p->sched_class->uclamp_enabled))
> +	if (!class_supports_uclamp(p->sched_class))
>   		return;
>   
>   	for_each_clamp_id(clamp_id)
> Index: work/kernel/sched/ext.c
> ===================================================================
> --- work.orig/kernel/sched/ext.c
> +++ work/kernel/sched/ext.c
> @@ -116,10 +116,17 @@ enum scx_ops_flags {
>   	 */
>   	SCX_OPS_SWITCH_PARTIAL	= 1LLU << 3,
>   
> +	/*
> +	 * Disable built-in uclamp support. Can be useful when the BPF scheduler
> +	 * wants to implement custom uclamp support.
> +	 */
> +	SCX_OPS_DISABLE_UCLAMP	= 1LLU << 4,
> +
>   	SCX_OPS_ALL_FLAGS	= SCX_OPS_KEEP_BUILTIN_IDLE |
>   				  SCX_OPS_ENQ_LAST |
>   				  SCX_OPS_ENQ_EXITING |
> -				  SCX_OPS_SWITCH_PARTIAL,
> +				  SCX_OPS_SWITCH_PARTIAL |
> +				  SCX_OPS_DISABLE_UCLAMP,
>   };
>   
>   /* argument container for ops.init_task() */
> @@ -3437,6 +3444,13 @@ static void switched_from_scx(struct rq
>   static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
>   static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
>   
> +#ifdef CONFIG_UCLAMP_TASK
> +static bool uclamp_enabled_scx(void)
> +{
> +	return !(scx_ops.flags & SCX_OPS_DISABLE_UCLAMP);
> +}
> +#endif
> +
>   int scx_check_setscheduler(struct task_struct *p, int policy)
>   {
>   	lockdep_assert_rq_held(task_rq(p));
> @@ -3522,7 +3536,7 @@ DEFINE_SCHED_CLASS(ext) = {
>   	.update_curr		= update_curr_scx,
>   
>   #ifdef CONFIG_UCLAMP_TASK
> -	.uclamp_enabled		= 1,
> +	.uclamp_enabled		= uclamp_enabled_scx,
>   #endif
>   };
>   
> Index: work/kernel/sched/fair.c
> ===================================================================
> --- work.orig/kernel/sched/fair.c
> +++ work/kernel/sched/fair.c
> @@ -13228,9 +13228,7 @@ DEFINE_SCHED_CLASS(fair) = {
>   	.task_is_throttled	= task_is_throttled_fair,
>   #endif
>   
> -#ifdef CONFIG_UCLAMP_TASK
> -	.uclamp_enabled		= 1,
> -#endif
> +	SCHED_CLASS_UCLAMP_ENABLED
>   };
>   
>   #ifdef CONFIG_SCHED_DEBUG
> Index: work/kernel/sched/rt.c
> ===================================================================
> --- work.orig/kernel/sched/rt.c
> +++ work/kernel/sched/rt.c
> @@ -2681,9 +2681,7 @@ DEFINE_SCHED_CLASS(rt) = {
>   	.task_is_throttled	= task_is_throttled_rt,
>   #endif
>   
> -#ifdef CONFIG_UCLAMP_TASK
> -	.uclamp_enabled		= 1,
> -#endif
> +	SCHED_CLASS_UCLAMP_ENABLED
>   };
>   
>   #ifdef CONFIG_RT_GROUP_SCHED
> Index: work/kernel/sched/sched.h
> ===================================================================
> --- work.orig/kernel/sched/sched.h
> +++ work/kernel/sched/sched.h
> @@ -2339,11 +2339,6 @@ struct affinity_context {
>   extern s64 update_curr_common(struct rq *rq);
>   
>   struct sched_class {
> -
> -#ifdef CONFIG_UCLAMP_TASK
> -	int uclamp_enabled;
> -#endif
> -
>   	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
>   	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
>   	void (*yield_task)   (struct rq *rq);
> @@ -2405,8 +2400,21 @@ struct sched_class {
>   #ifdef CONFIG_SCHED_CORE
>   	int (*task_is_throttled)(struct task_struct *p, int cpu);
>   #endif
> +
> +#ifdef CONFIG_UCLAMP_TASK
> +	bool (*uclamp_enabled)(void);
> +#endif
>   };
>   
> +#ifdef CONFIG_UCLAMP_TASK
> +bool sched_uclamp_enabled(void);
> +
> +#define SCHED_CLASS_UCLAMP_ENABLED	\
> +	.uclamp_enabled = sched_uclamp_enabled,
> +#else
> +#define SCHED_CLASS_UCLAMP_ENABLED
> +#endif
> +
>   static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
>   {
>   	WARN_ON_ONCE(rq->curr != prev);

Looks good to me!

Actually, if we are okay with changing the sched_class struct and 
touching the code of other classes, I wonder if a cleaner solution is 
just to completely remove sched_class->uclamp_enabled and let each class 
decide what to do in enqueue and dequeue, so instead of

	uclamp_inc/dec();
	p->sched_class->enqueue/dequeue_task();

we can just

	p->sched_class->enqueue/dequeue_task();
		do_uclamp_inside_each_class();

and we export uclamp_inc/dec() functions from core.c to RT, fair and 
ext. For RT and fair, just

	enqueue/dequeue_task_fair/rt();
		uclamp_inc/dec();

For ext, maybe expose bpf_uclamp_inc/dec() to the custom scheduler. If a 
scheduler wants to reuse the current uclamp implementation, just call 
these. If not, skip them and implement its own.
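
A minimal userspace sketch of this alternative, assuming the accounting helper
is exported to the classes: fair always calls it from its own enqueue, while
ext calls it only when the loaded scheduler asks for the built-in
implementation. All model_* names and the opt-in parameter are hypothetical;
in the proposal the ext side would be a kfunc such as bpf_uclamp_inc() invoked
by the BPF scheduler itself.

```c
#include <assert.h>
#include <stdbool.h>

struct model_rq {
	int nr_queued;
	int uclamp_tasks;	/* stands in for the uclamp bucket counts */
};

/* Stands in for an exported uclamp_inc(). */
static void model_uclamp_inc(struct model_rq *rq)
{
	rq->uclamp_tasks++;
}

/* fair (and rt) keep today's behavior by calling the helper themselves. */
static void model_enqueue_fair(struct model_rq *rq)
{
	rq->nr_queued++;
	model_uclamp_inc(rq);
}

/* ext leaves the decision to the loaded scheduler. */
static void model_enqueue_ext(struct model_rq *rq, bool wants_builtin_uclamp)
{
	rq->nr_queued++;
	if (wants_builtin_uclamp)
		model_uclamp_inc(rq);
}
```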

Tejun Heo July 2, 2024, 9:12 p.m. UTC | #7

Hello,

On Tue, Jul 02, 2024 at 09:41:30PM +0100, Hongyan Xia wrote:
...
> Actually, if we are okay with changing the sched_class struct and touching
> the code of other classes, I wonder if a cleaner solution is just to
> completely remove sched_class->uclamp_enabled and let each class decide what
> to do in enqueue and dequeue, so instead of
> 
> 	uclamp_inc/dec();
> 	p->sched_class->enqueue/dequeue_task();
> 
> we can just
> 
> 	p->sched_class->enqueue/dequeue_task();
> 		do_uclamp_inside_each_class();
> 
> and we export uclamp_inc/dec() functions from core.c to RT, fair and ext.
> For RT and fair, just
> 
> 	enqueue/dequeue_task_fair/rt();
> 		uclamp_inc/dec();
>
> For ext, maybe expose bpf_uclamp_inc/dec() to the custom scheduler. If a
> scheduler wants to reuse the current uclamp implementation, just call these.
> If not, skip them and implement its own.

That does sound a lot better. Mind writing up a patchset?

Thanks.

Qais Yousef July 24, 2024, 11:45 p.m. UTC | #8

On 06/18/24 17:12, Tejun Heo wrote:
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.

Have you tried to investigate why that is? By default, RT tasks run at max
frequency. The only way to prevent that is by using uclamp:

	https://kernel.org/doc/html/latest/scheduler/sched-util-clamp.html#sched-util-clamp-min-rt-default

If that's not the cause, then it's likely something else is broken.

> 
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
> 
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
>   different CPUs in the system.
> 
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
>   relative to its max performance.
> 
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.

What exactly is the problem you're seeing? You shouldn't need to set
performance directly. Are you trying to fix a problem, or add a new feature?

> 
> This gives direct control over CPU performance setting to the BPF scheduler.

Why would we need to do that? schedutil is supposed to operate on the
utilization signal. Doesn't overriding it with custom, unknown changes turn
it into an arbitrary governor that depends on whichever BPF sched_ext
scheduler is currently loaded? This makes bug reports and debugging a lot
harder.

I do hope, by the way, that loading an external scheduler taints the kernel.
With these arbitrary changes, it's hard to know whether a problem is in the
kernel or in an external out-of-tree entity. Out-of-tree modules taint the
kernel, and so should loading a sched_ext scheduler.

It should neither cause spurious bug reports nor prevent us from changing the
code without worrying about breaking out-of-tree code.

> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling the frequency holding heuristics, as they
> may not apply well to sched_ext schedulers, which may have a much weaker
> connection between tasks and their current / last CPU.
> 
> With cpuperf support added, there is no reason to block uclamp. Enable it
> while at it.
> 
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  kernel/sched/cpufreq_schedutil.c         |  12 +-
>  kernel/sched/ext.c                       |  83 ++++++++++++-
>  kernel/sched/ext.h                       |   9 ++
>  kernel/sched/sched.h                     |   1 +
>  tools/sched_ext/include/scx/common.bpf.h |   3 +
>  tools/sched_ext/scx_qmap.bpf.c           | 142 ++++++++++++++++++++++-
>  tools/sched_ext/scx_qmap.c               |   8 ++
>  7 files changed, 252 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 972b7dd65af2..12174c0137a5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>  
>  static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
>  {
> -	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> +	unsigned long min, max;
> +	unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
> +		scx_cpuperf_target(sg_cpu->cpu);
>  
>  	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
>  	util = max(util, boost);
> @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
>  	unsigned long idle_calls;
>  	bool ret;
>  
> +	/*
> +	 * The heuristics in this function is for the fair class. For SCX, the
> +	 * performance target comes directly from the BPF scheduler. Let's just
> +	 * follow it.
> +	 */
> +	if (scx_switched_all())
> +		return false;

Why do you need to override it completely? What problems did you find with the
current util value, and what have you done to try to fix them first rather
than overriding it completely?

> +
>  	/* if capped by uclamp_max, always update to be in compliance */
>  	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
>  		return false;
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f814e84ceeb3..04fb0eeee5ec 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -16,6 +16,8 @@ enum scx_consts {
>  	SCX_EXIT_BT_LEN			= 64,
>  	SCX_EXIT_MSG_LEN		= 1024,
>  	SCX_EXIT_DUMP_DFL_LEN		= 32768,
> +
> +	SCX_CPUPERF_ONE			= SCHED_CAPACITY_SCALE,
>  };
>  
>  enum scx_exit_kind {
> @@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = {
>  	.update_curr		= update_curr_scx,
>  
>  #ifdef CONFIG_UCLAMP_TASK
> -	.uclamp_enabled		= 0,
> +	.uclamp_enabled		= 1,
>  #endif
>  };
>  
> @@ -4393,7 +4395,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  	struct scx_task_iter sti;
>  	struct task_struct *p;
>  	unsigned long timeout;
> -	int i, ret;
> +	int i, cpu, ret;
>  
>  	mutex_lock(&scx_ops_enable_mutex);
>  
> @@ -4442,6 +4444,9 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  
>  	atomic_long_set(&scx_nr_rejected, 0);
>  
> +	for_each_possible_cpu(cpu)
> +		cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE;
> +
>  	/*
>  	 * Keep CPUs stable during enable so that the BPF scheduler can track
>  	 * online CPUs by watching ->on/offline_cpu() after ->init().
> @@ -5835,6 +5840,77 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data,
>  		ops_dump_flush();
>  }
>  
> +/**
> + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
> + * @cpu: CPU of interest
> + *
> + * Return the maximum relative capacity of @cpu in relation to the most
> + * performant CPU in the system. The return value is in the range [1,
> + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
> + */
> +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
> +{
> +	if (ops_cpu_valid(cpu, NULL))
> +		return arch_scale_cpu_capacity(cpu);
> +	else
> +		return SCX_CPUPERF_ONE;
> +}

Hmm. This is tricky. It looks fine, but I worry that if we want to change how
we handle capacities in the future, we will be tied down forever because
out-of-tree sched_ext schedulers would no longer be able to load.

How are we going to protect against such potential changes? Just make it a NOP?

This is a bit hypothetical, but so far these have been considered internal
scheduler details that could change at any time with no consequences. Once
schedulers attach to this information, changing it becomes a lot harder, as
there are external dependencies that will fail to load or work properly. And
what is the regression rule in this case?

You should make all functions return an error to future-proof them against
suddenly disappearing.

> +
> +/**
> + * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
> + * @cpu: CPU of interest
> + *
> + * Return the current relative performance of @cpu in relation to its maximum.
> + * The return value is in the range [1, %SCX_CPUPERF_ONE].
> + *
> + * The current performance level of a CPU in relation to the maximum performance
> + * available in the system can be calculated as follows:
> + *
> + *   scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE
> + *
> + * The result is in the range [1, %SCX_CPUPERF_ONE].
> + */
> +__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
> +{
> +	if (ops_cpu_valid(cpu, NULL))
> +		return arch_scale_freq_capacity(cpu);
> +	else
> +		return SCX_CPUPERF_ONE;
> +}
> +
> +/**
> + * scx_bpf_cpuperf_set - Set the relative performance target of a CPU
> + * @cpu: CPU of interest
> + * @perf: target performance level [0, %SCX_CPUPERF_ONE]
> + * @flags: %SCX_CPUPERF_* flags
> + *
> + * Set the target performance level of @cpu to @perf. @perf is in linear
> + * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
> + * schedutil cpufreq governor chooses the target frequency.
> + *
> + * The actual performance level chosen, CPU grouping, and the overhead and
> + * latency of the operations are dependent on the hardware and cpufreq driver in
> + * use. Consult hardware and cpufreq documentation for more information. The
> + * current performance level can be monitored using scx_bpf_cpuperf_cur().
> + */
> +__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
> +{
> +	if (unlikely(perf > SCX_CPUPERF_ONE)) {
> +		scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
> +		return;
> +	}
> +
> +	if (ops_cpu_valid(cpu, NULL)) {
> +		struct rq *rq = cpu_rq(cpu);
> +
> +		rq->scx.cpuperf_target = perf;
> +
> +		rcu_read_lock_sched_notrace();
> +		cpufreq_update_util(cpu_rq(cpu), 0);
> +		rcu_read_unlock_sched_notrace();
> +	}
> +}

Is the problem that you break how the util signal works in sched_ext? Or do
you want fine-grained control? We expect user applications to use uclamp to
set their performance requirements. And sched_ext should not break the util
signal, no? If it does, and there's a good reason for it, then it is not
compatible with schedutil which, as the name indicates, operates on the util
signal as defined in PELT.

You can always use min_freq/max_freq in sysfs to force min and max frequencies
without hacking the governor. I don't advise it, though, and I'd recommend
trying to be compatible with schedutil as-is rather than modifying it.
Consistency is key.


Thanks

--
Qais Yousef
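
For reference, the validation performed by the quoted scx_bpf_cpuperf_set()
can be modelled in userspace as below. This is only a sketch: the kfunc in the
patch returns void and reports an out-of-range target through
scx_ops_error(), whereas this hypothetical model_cpuperf_set() returns a bool
so the outcome is visible to the caller, in the spirit of the suggestion above
that the functions return an error. The CPU count and the target array are
assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCX_CPUPERF_ONE	1024u	/* SCHED_CAPACITY_SCALE in the kernel */
#define MODEL_NR_CPUS	4	/* hypothetical system size */

/* Stands in for rq->scx.cpuperf_target of each CPU. */
static uint32_t model_cpuperf_target[MODEL_NR_CPUS];

/* Returns true if the target was accepted and stored. */
static bool model_cpuperf_set(int32_t cpu, uint32_t perf)
{
	if (perf > SCX_CPUPERF_ONE)
		return false;	/* the kfunc calls scx_ops_error() here */
	if (cpu < 0 || cpu >= MODEL_NR_CPUS)
		return false;	/* the kfunc checks ops_cpu_valid() here */
	model_cpuperf_target[cpu] = perf;
	return true;
}
```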
