mbox series

[v2,0/4] Per CPU idle injection

Message ID 20221129233419.4022830-1-srinivas.pandruvada@linux.intel.com
Headers show
Series Per CPU idle injection | expand

Message

Srinivas Pandruvada Nov. 29, 2022, 11:34 p.m. UTC
This series introduces per CPU idle injection. In preparation for this
enhance the existing powercap/idle_inject and modify intel_powerclamp
to use this. Then add per core idle injection driver.

v2
- Update based on feedback from Rafael on patch 2/4
- Kconfig dependency issue
Reported-by: kernel test robot <lkp@intel.com>

Srinivas Pandruvada (4):
  powercap: idle_inject: Export symbols
  powercap: idle_inject: Add prepare/complete callbacks
  thermal/drivers/intel_powerclamp: Use powercap idle-inject framework
  thermal/drivers/intel_cpu_idle_cooling: Introduce Intel cpu idle
    cooling driver

 drivers/powercap/idle_inject.c                |  69 ++++-
 drivers/thermal/intel/Kconfig                 |  12 +
 drivers/thermal/intel/Makefile                |   1 +
 .../thermal/intel/intel_cpu_idle_cooling.c    | 261 ++++++++++++++++
 drivers/thermal/intel/intel_powerclamp.c      | 292 ++++++++----------
 include/linux/idle_inject.h                   |   4 +
 6 files changed, 467 insertions(+), 172 deletions(-)
 create mode 100644 drivers/thermal/intel/intel_cpu_idle_cooling.c

Comments

Rafael J. Wysocki Jan. 12, 2023, 3:05 p.m. UTC | #1
On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> Export symbols for external interfaces, so that they can be used in
> other loadable modules.
>
> Export is done under name space IDLE_INJECT.
>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
> v2:
>         No change
>
>  drivers/powercap/idle_inject.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
> index fe86a09e3b67..dfa989182e71 100644
> --- a/drivers/powercap/idle_inject.c
> +++ b/drivers/powercap/idle_inject.c
> @@ -160,6 +160,7 @@ void idle_inject_set_duration(struct idle_inject_device *ii_dev,
>                 WRITE_ONCE(ii_dev->idle_duration_us, idle_duration_us);
>         }
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_set_duration, IDLE_INJECT);
>
>  /**
>   * idle_inject_get_duration - idle and run duration retrieval helper
> @@ -174,6 +175,7 @@ void idle_inject_get_duration(struct idle_inject_device *ii_dev,
>         *run_duration_us = READ_ONCE(ii_dev->run_duration_us);
>         *idle_duration_us = READ_ONCE(ii_dev->idle_duration_us);
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_get_duration, IDLE_INJECT);
>
>  /**
>   * idle_inject_set_latency - set the maximum latency allowed
> @@ -185,6 +187,7 @@ void idle_inject_set_latency(struct idle_inject_device *ii_dev,
>  {
>         WRITE_ONCE(ii_dev->latency_us, latency_us);
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_set_latency, IDLE_INJECT);
>
>  /**
>   * idle_inject_start - start idle injections
> @@ -216,6 +219,7 @@ int idle_inject_start(struct idle_inject_device *ii_dev)
>
>         return 0;
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_start, IDLE_INJECT);
>
>  /**
>   * idle_inject_stop - stops idle injections
> @@ -262,6 +266,7 @@ void idle_inject_stop(struct idle_inject_device *ii_dev)
>
>         cpu_hotplug_enable();
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_stop, IDLE_INJECT);
>
>  /**
>   * idle_inject_setup - prepare the current task for idle injection
> @@ -337,6 +342,7 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
>
>         return NULL;
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
>
>  /**
>   * idle_inject_unregister - unregister idle injection control device
> @@ -357,6 +363,7 @@ void idle_inject_unregister(struct idle_inject_device *ii_dev)
>
>         kfree(ii_dev);
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_unregister, IDLE_INJECT);
>
>  static struct smp_hotplug_thread idle_inject_threads = {
>         .store = &idle_inject_thread.tsk,
> --
> 2.31.1
>
Rafael J. Wysocki Jan. 12, 2023, 6:32 p.m. UTC | #2
On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> There are two idle injection implementation in the Linux kernel. One
> via intel_powerclamp and the other using powercap/idle_inject. Both
> implementation end up in calling play_idle* function from a FIFO
> priority thread. Both can't be used at the same time.
>
> Currently per core idle injection (cpuidle_cooling) is using
> powercap/idle_inject, which is not used in platforms where
> intel_powerclamp is used for system wide idle injection. So there is
> no conflict. But there are some use cases where per core idle injection
> is beneficial on the same system where system wide idle injection is
> also used via intel_powerclamp. To avoid conflict only one of the idle
> injection type must be in use at a time. This require a common framework
> which both per core and system wide idle injection can use.
>
> Here powercap/idle_inject can be used for both per-core and for system
> wide idle injection. This framework has a well defined interface
> which allow registry for per-core or for all CPUs (system wide). If
> particular CPU is already participating in idle injection, the call
> to registry fails. Here the registry can be done when user space
> changes the current cooling device state.
>
> Also one framework for idle injection is better as there is one loop
> calling play_idle*, instead of multiple for better maintenance.
>
> So, reuse powercap/idle_inject calls in intel_powerclamp. This simplifies
> the code as all per CPU kthreads which calls play_idle* can be removed.
>
> The changes include:
> - Remove unneeded include files
> - Remove per CPU kthread workers: balancing_work and idle_injection_work
> - Reuse the compensation related code by moving from previous worker
> thread to idle_injection callbacks
> - Adjust the idle_duration and runtime by using powercap/idle_inject
> interface
> - Remove all variables, which are not required once powercap/idle_inject
> is used
> - Add mutex to avoid race during removal of idle injection during module
> unload and user action to change idle inject percent
> - Use READ_ONCE and WRITE_ONCE for data accessed from multiple CPUs
>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
> v2:
> - Use idle_inject_register_full instead of idle_inject_register
> - Also fix dependency issue with POWERCAP config
> Reported-by: kernel test robot <lkp@intel.com>
>
>  drivers/thermal/intel/Kconfig            |   2 +
>  drivers/thermal/intel/intel_powerclamp.c | 292 ++++++++++-------------
>  2 files changed, 126 insertions(+), 168 deletions(-)
>
> diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
> index f0c845679250..6c2a95f41c81 100644
> --- a/drivers/thermal/intel/Kconfig
> +++ b/drivers/thermal/intel/Kconfig
> @@ -3,6 +3,8 @@ config INTEL_POWERCLAMP
>         tristate "Intel PowerClamp idle injection driver"
>         depends on X86
>         depends on CPU_SUP_INTEL
> +       select POWERCAP
> +       select IDLE_INJECT
>         help
>           Enable this to enable Intel PowerClamp idle injection driver. This
>           enforce idle time which results in more package C-state residency. The
> diff --git a/drivers/thermal/intel/intel_powerclamp.c b/drivers/thermal/intel/intel_powerclamp.c
> index b80e25ec1261..3f2b20ae8f68 100644
> --- a/drivers/thermal/intel/intel_powerclamp.c
> +++ b/drivers/thermal/intel/intel_powerclamp.c
> @@ -2,7 +2,7 @@
>  /*
>   * intel_powerclamp.c - package c-state idle injection
>   *
> - * Copyright (c) 2012, Intel Corporation.
> + * Copyright (c) 2022, Intel Corporation.

Nit: I would retain the original year of introduction, so 2012 - 2022.

>   *
>   * Authors:
>   *     Arjan van de Ven <arjan@linux.intel.com>
> @@ -27,21 +27,15 @@
>  #include <linux/module.h>
>  #include <linux/kernel.h>
>  #include <linux/delay.h>
> -#include <linux/kthread.h>
>  #include <linux/cpu.h>
>  #include <linux/thermal.h>
> -#include <linux/slab.h>
> -#include <linux/tick.h>
>  #include <linux/debugfs.h>
>  #include <linux/seq_file.h>
> -#include <linux/sched/rt.h>
> -#include <uapi/linux/sched/types.h>
> +#include <linux/idle_inject.h>
>
> -#include <asm/nmi.h>
>  #include <asm/msr.h>
>  #include <asm/mwait.h>
>  #include <asm/cpu_device_id.h>
> -#include <asm/hardirq.h>
>
>  #define MAX_TARGET_RATIO (50U)
>  /* For each undisturbed clamping period (no extra wake ups during idle time),
> @@ -60,6 +54,7 @@ static struct dentry *debug_dir;
>
>  /* user selected target */
>  static unsigned int set_target_ratio;
> +static bool target_ratio_updated;
>  static unsigned int current_ratio;
>  static bool should_skip;
>
> @@ -67,26 +62,20 @@ static unsigned int control_cpu; /* The cpu assigned to collect stat and update
>                                   * control parameters. default to BSP but BSP
>                                   * can be offlined.
>                                   */
> -static bool clamping;
> -
> -struct powerclamp_worker_data {
> -       struct kthread_worker *worker;
> -       struct kthread_work balancing_work;
> -       struct kthread_delayed_work idle_injection_work;
> +struct powerclamp_data {
>         unsigned int cpu;
>         unsigned int count;
>         unsigned int guard;
>         unsigned int window_size_now;
>         unsigned int target_ratio;
> -       unsigned int duration_jiffies;
>         bool clamping;
>  };
>
> -static struct powerclamp_worker_data __percpu *worker_data;
> +static struct powerclamp_data powerclamp_data;
> +
>  static struct thermal_cooling_device *cooling_dev;
> -static unsigned long *cpu_clamping_mask;  /* bit map for tracking per cpu
> -                                          * clamping kthread worker
> -                                          */
> +
> +static DEFINE_MUTEX(powerclamp_lock);
>
>  static unsigned int duration;
>  static unsigned int pkg_cstate_ratio_cur;
> @@ -344,79 +333,33 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
>         return set_target_ratio + guard <= current_ratio;
>  }
>
> -static void clamp_balancing_func(struct kthread_work *work)
> +static unsigned int get_run_time(void)
>  {
> -       struct powerclamp_worker_data *w_data;
> -       int sleeptime;
> -       unsigned long target_jiffies;
>         unsigned int compensated_ratio;
> -       int interval; /* jiffies to sleep for each attempt */
> -
> -       w_data = container_of(work, struct powerclamp_worker_data,
> -                             balancing_work);
> +       unsigned int runtime;
>
>         /*
>          * make sure user selected ratio does not take effect until
>          * the next round. adjust target_ratio if user has changed
>          * target such that we can converge quickly.
>          */
> -       w_data->target_ratio = READ_ONCE(set_target_ratio);
> -       w_data->guard = 1 + w_data->target_ratio / 20;
> -       w_data->window_size_now = window_size;
> -       w_data->duration_jiffies = msecs_to_jiffies(duration);
> -       w_data->count++;
> +       powerclamp_data.target_ratio = READ_ONCE(set_target_ratio);
> +       powerclamp_data.guard = 1 + powerclamp_data.target_ratio / 20;
> +       powerclamp_data.window_size_now = window_size;
>
>         /*
>          * systems may have different ability to enter package level
>          * c-states, thus we need to compensate the injected idle ratio
>          * to achieve the actual target reported by the HW.
>          */
> -       compensated_ratio = w_data->target_ratio +
> -               get_compensation(w_data->target_ratio);
> +       compensated_ratio = powerclamp_data.target_ratio +
> +               get_compensation(powerclamp_data.target_ratio);
>         if (compensated_ratio <= 0)
>                 compensated_ratio = 1;
> -       interval = w_data->duration_jiffies * 100 / compensated_ratio;
> -
> -       /* align idle time */
> -       target_jiffies = roundup(jiffies, interval);
> -       sleeptime = target_jiffies - jiffies;
> -       if (sleeptime <= 0)
> -               sleeptime = 1;
> -
> -       if (clamping && w_data->clamping && cpu_online(w_data->cpu))
> -               kthread_queue_delayed_work(w_data->worker,
> -                                          &w_data->idle_injection_work,
> -                                          sleeptime);
> -}
>
> -static void clamp_idle_injection_func(struct kthread_work *work)
> -{
> -       struct powerclamp_worker_data *w_data;
> -
> -       w_data = container_of(work, struct powerclamp_worker_data,
> -                             idle_injection_work.work);
> +       runtime = duration * 100 / compensated_ratio - duration;
>
> -       /*
> -        * only elected controlling cpu can collect stats and update
> -        * control parameters.
> -        */
> -       if (w_data->cpu == control_cpu &&
> -           !(w_data->count % w_data->window_size_now)) {
> -               should_skip =
> -                       powerclamp_adjust_controls(w_data->target_ratio,
> -                                                  w_data->guard,
> -                                                  w_data->window_size_now);
> -               smp_mb();
> -       }
> -
> -       if (should_skip)
> -               goto balance;
> -
> -       play_idle(jiffies_to_usecs(w_data->duration_jiffies));
> -
> -balance:
> -       if (clamping && w_data->clamping && cpu_online(w_data->cpu))
> -               kthread_queue_work(w_data->worker, &w_data->balancing_work);
> +       return runtime;
>  }
>
>  /*
> @@ -452,104 +395,127 @@ static void poll_pkg_cstate(struct work_struct *dummy)
>         msr_last = msr_now;
>         tsc_last = tsc_now;
>
> -       if (true == clamping)
> +       if (powerclamp_data.clamping)
>                 schedule_delayed_work(&poll_pkg_cstate_work, HZ);
>  }
>
> -static void start_power_clamp_worker(unsigned long cpu)
> +static struct idle_inject_device *ii_dev;
> +
> +static int idle_inject_begin(unsigned int cpu)

So this would be the ->prepare() callback to be invoked on each CPU
from idle_inject_fn() IIUC.

>  {
> -       struct powerclamp_worker_data *w_data = per_cpu_ptr(worker_data, cpu);
> -       struct kthread_worker *worker;
> +       /*
> +        * only elected controlling cpu can collect stats and update
> +        * control parameters.
> +        */
> +       if (cpu == control_cpu) {
> +               bool update = READ_ONCE(target_ratio_updated);
> +
> +               if (!(powerclamp_data.count % powerclamp_data.window_size_now)) {
> +                       bool skip = powerclamp_adjust_controls(powerclamp_data.target_ratio,
> +                                                      powerclamp_data.guard,
> +                                                      powerclamp_data.window_size_now);
> +                       WRITE_ONCE(should_skip, skip);
> +                       update = true;
> +               }
>
> -       worker = kthread_create_worker_on_cpu(cpu, 0, "kidle_inj/%ld", cpu);
> -       if (IS_ERR(worker))
> -               return;
> +               if (update) {
> +                       unsigned int runtime;
> +
> +                       runtime = get_run_time();
> +                       idle_inject_set_duration(ii_dev, runtime, duration);
> +                       WRITE_ONCE(target_ratio_updated, false);
> +               }
> +               powerclamp_data.count++;
> +       }
> +
> +       if (READ_ONCE(should_skip))
> +               return -EAGAIN;

This has a bit of a synchronization issue, because the control CPU is
not guaranteed to run this code before any other CPUs in the given
cycle, so at least some of them may see a stale value of should_skip
and they will still inject idle in this cycle.  Or else, they may skip
idle injection when it should be done.

I think that it would be better to run the callback from
idle_inject_timer_fn() where it would decide whether or not to call
idle_inject_wakeup(), in which case the control CPU would not be
needed any more (which would be a plus), because the "control" could
be done by the CPU running the timer function, whichever it is.

Does this sound viable?  Or if it doesn't, then why?
Srinivas Pandruvada Jan. 12, 2023, 8:23 p.m. UTC | #3
On Thu, 2023-01-12 at 19:32 +0100, Rafael J. Wysocki wrote:
> On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
> <srinivas.pandruvada@linux.intel.com> wrote:
> > 
> > There are two idle injection implementation in the Linux kernel.
> > One
> > via intel_powerclamp and the other using powercap/idle_inject. Both
> > implementation end up in calling play_idle* function from a FIFO
> > priority thread. Both can't be used at the same time.
> > 
> > Currently per core idle injection (cpuidle_cooling) is using
> > powercap/idle_inject, which is not used in platforms where
> > intel_powerclamp is used for system wide idle injection. So there
> > is
> > no conflict. But there are some use cases where per core idle
> > injection
> > is beneficial on the same system where system wide idle injection
> > is
> > also used via intel_powerclamp. To avoid conflict only one of the
> > idle
> > injection type must be in use at a time. This require a common
> > framework
> > which both per core and system wide idle injection can use.
> > 
> > Here powercap/idle_inject can be used for both per-core and for
> > system
> > wide idle injection. This framework has a well defined interface
> > which allow registry for per-core or for all CPUs (system wide). If
> > particular CPU is already participating in idle injection, the call
> > to registry fails. Here the registry can be done when user space
> > changes the current cooling device state.
> > 
> > Also one framework for idle injection is better as there is one
> > loop
> > calling play_idle*, instead of multiple for better maintenance.
> > 
> > So, reuse powercap/idle_inject calls in intel_powerclamp. This
> > simplifies
> > the code as all per CPU kthreads which calls play_idle* can be
> > removed.
> > 
> > The changes include:
> > - Remove unneeded include files
> > - Remove per CPU kthread workers: balancing_work and
> > idle_injection_work
> > - Reuse the compensation related code by moving from previous
> > worker
> > thread to idle_injection callbacks
> > - Adjust the idle_duration and runtime by using
> > powercap/idle_inject
> > interface
> > - Remove all variables, which are not required once
> > powercap/idle_inject
> > is used
> > - Add mutex to avoid race during removal of idle injection during
> > module
> > unload and user action to change idle inject percent
> > - Use READ_ONCE and WRITE_ONCE for data accessed from multiple CPUs
> > 
> > Signed-off-by: Srinivas Pandruvada
> > <srinivas.pandruvada@linux.intel.com>
> > ---
> > v2:
> > - Use idle_inject_register_full instead of idle_inject_register
> > - Also fix dependency issue with POWERCAP config
> > Reported-by: kernel test robot <lkp@intel.com>
> > 
> >  drivers/thermal/intel/Kconfig            |   2 +
> >  drivers/thermal/intel/intel_powerclamp.c | 292 ++++++++++---------
> > ----
> >  2 files changed, 126 insertions(+), 168 deletions(-)
> > 
> > diff --git a/drivers/thermal/intel/Kconfig
> > b/drivers/thermal/intel/Kconfig
> > index f0c845679250..6c2a95f41c81 100644
> > --- a/drivers/thermal/intel/Kconfig
> > +++ b/drivers/thermal/intel/Kconfig
> > @@ -3,6 +3,8 @@ config INTEL_POWERCLAMP
> >         tristate "Intel PowerClamp idle injection driver"
> >         depends on X86
> >         depends on CPU_SUP_INTEL
> > +       select POWERCAP
> > +       select IDLE_INJECT
> >         help
> >           Enable this to enable Intel PowerClamp idle injection
> > driver. This
> >           enforce idle time which results in more package C-state
> > residency. The
> > diff --git a/drivers/thermal/intel/intel_powerclamp.c
> > b/drivers/thermal/intel/intel_powerclamp.c
> > index b80e25ec1261..3f2b20ae8f68 100644
> > --- a/drivers/thermal/intel/intel_powerclamp.c
> > +++ b/drivers/thermal/intel/intel_powerclamp.c
> > @@ -2,7 +2,7 @@
> >  /*
> >   * intel_powerclamp.c - package c-state idle injection
> >   *
> > - * Copyright (c) 2012, Intel Corporation.
> > + * Copyright (c) 2022, Intel Corporation.
> 
> Nit: I would retain the original year of introduction, so 2012 -
> 2022.
OK

> 
> >   *
> > 

[...]

> > +
> > +static int idle_inject_begin(unsigned int cpu)
> 
> So this would be the ->prepare() callback to be invoked on each CPU
> from idle_inject_fn() IIUC.
> 
Yes

> >  {
> > -       struct powerclamp_worker_data *w_data =
> > per_cpu_ptr(worker_data, cpu);
> > -       struct kthread_worker *worker;
> > +       /*
> > +        * only elected controlling cpu can collect stats and
> > update
> > +        * control parameters.
> > +        */
> > +       if (cpu == control_cpu) {
> > +               bool update = READ_ONCE(target_ratio_updated);
> > +
> > +               if (!(powerclamp_data.count %
> > powerclamp_data.window_size_now)) {
> > +                       bool skip =
> > powerclamp_adjust_controls(powerclamp_data.target_ratio,
> > +                                                     
> > powerclamp_data.guard,
> > +                                                     
> > powerclamp_data.window_size_now);
> > +                       WRITE_ONCE(should_skip, skip);
> > +                       update = true;
> > +               }
> > 
> > -       worker = kthread_create_worker_on_cpu(cpu, 0,
> > "kidle_inj/%ld", cpu);
> > -       if (IS_ERR(worker))
> > -               return;
> > +               if (update) {
> > +                       unsigned int runtime;
> > +
> > +                       runtime = get_run_time();
> > +                       idle_inject_set_duration(ii_dev, runtime,
> > duration);
> > +                       WRITE_ONCE(target_ratio_updated, false);
> > +               }
> > +               powerclamp_data.count++;
> > +       }
> > +
> > +       if (READ_ONCE(should_skip))
> > +               return -EAGAIN;
> 
> This has a bit of a synchronization issue, because the control CPU is
> not guaranteed to run this code before any other CPUs in the given
> cycle, so at least some of them may see a stale value of should_skip
> and they will still inject idle in this cycle.  Or else, they may
> skip
> idle injection when it should be done.
This is correct observation. This is true in in even in current
implementation. The per thread timer in the existing implementation has
this sync issue. So I tried to just mimic current implementation as is.


> 
> I think that it would be better to run the callback from
> idle_inject_timer_fn() where it would decide whether or not to call
> idle_inject_wakeup(), in which case the control CPU would not be
> needed any more (which would be a plus), because the "control" could
> be done by the CPU running the timer function, whichever it is.
> 
> Does this sound viable? 
Yes it is. In this case prepare() callback from idle_inject core is not
per CPU, but per device.

Thanks,
Srinivas

>  Or if it doesn't, then why?
Rafael J. Wysocki Jan. 12, 2023, 9:03 p.m. UTC | #4
On Thu, Jan 12, 2023 at 9:23 PM srinivas pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> On Thu, 2023-01-12 at 19:32 +0100, Rafael J. Wysocki wrote:
> > On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
> > <srinivas.pandruvada@linux.intel.com> wrote:
> > >
> > > There are two idle injection implementation in the Linux kernel.
> > > One
> > > via intel_powerclamp and the other using powercap/idle_inject. Both
> > > implementation end up in calling play_idle* function from a FIFO
> > > priority thread. Both can't be used at the same time.
> > >
> > > Currently per core idle injection (cpuidle_cooling) is using
> > > powercap/idle_inject, which is not used in platforms where
> > > intel_powerclamp is used for system wide idle injection. So there
> > > is
> > > no conflict. But there are some use cases where per core idle
> > > injection
> > > is beneficial on the same system where system wide idle injection
> > > is
> > > also used via intel_powerclamp. To avoid conflict only one of the
> > > idle
> > > injection type must be in use at a time. This require a common
> > > framework
> > > which both per core and system wide idle injection can use.
> > >
> > > Here powercap/idle_inject can be used for both per-core and for
> > > system
> > > wide idle injection. This framework has a well defined interface
> > > which allow registry for per-core or for all CPUs (system wide). If
> > > particular CPU is already participating in idle injection, the call
> > > to registry fails. Here the registry can be done when user space
> > > changes the current cooling device state.
> > >
> > > Also one framework for idle injection is better as there is one
> > > loop
> > > calling play_idle*, instead of multiple for better maintenance.
> > >
> > > So, reuse powercap/idle_inject calls in intel_powerclamp. This
> > > simplifies
> > > the code as all per CPU kthreads which calls play_idle* can be
> > > removed.
> > >
> > > The changes include:
> > > - Remove unneeded include files
> > > - Remove per CPU kthread workers: balancing_work and
> > > idle_injection_work
> > > - Reuse the compensation related code by moving from previous
> > > worker
> > > thread to idle_injection callbacks
> > > - Adjust the idle_duration and runtime by using
> > > powercap/idle_inject
> > > interface
> > > - Remove all variables, which are not required once
> > > powercap/idle_inject
> > > is used
> > > - Add mutex to avoid race during removal of idle injection during
> > > module
> > > unload and user action to change idle inject percent
> > > - Use READ_ONCE and WRITE_ONCE for data accessed from multiple CPUs
> > >
> > > Signed-off-by: Srinivas Pandruvada
> > > <srinivas.pandruvada@linux.intel.com>
> > > ---
> > > v2:
> > > - Use idle_inject_register_full instead of idle_inject_register
> > > - Also fix dependency issue with POWERCAP config
> > > Reported-by: kernel test robot <lkp@intel.com>
> > >
> > >  drivers/thermal/intel/Kconfig            |   2 +
> > >  drivers/thermal/intel/intel_powerclamp.c | 292 ++++++++++---------
> > > ----
> > >  2 files changed, 126 insertions(+), 168 deletions(-)
> > >
> > > diff --git a/drivers/thermal/intel/Kconfig
> > > b/drivers/thermal/intel/Kconfig
> > > index f0c845679250..6c2a95f41c81 100644
> > > --- a/drivers/thermal/intel/Kconfig
> > > +++ b/drivers/thermal/intel/Kconfig
> > > @@ -3,6 +3,8 @@ config INTEL_POWERCLAMP
> > >         tristate "Intel PowerClamp idle injection driver"
> > >         depends on X86
> > >         depends on CPU_SUP_INTEL
> > > +       select POWERCAP
> > > +       select IDLE_INJECT
> > >         help
> > >           Enable this to enable Intel PowerClamp idle injection
> > > driver. This
> > >           enforce idle time which results in more package C-state
> > > residency. The
> > > diff --git a/drivers/thermal/intel/intel_powerclamp.c
> > > b/drivers/thermal/intel/intel_powerclamp.c
> > > index b80e25ec1261..3f2b20ae8f68 100644
> > > --- a/drivers/thermal/intel/intel_powerclamp.c
> > > +++ b/drivers/thermal/intel/intel_powerclamp.c
> > > @@ -2,7 +2,7 @@
> > >  /*
> > >   * intel_powerclamp.c - package c-state idle injection
> > >   *
> > > - * Copyright (c) 2012, Intel Corporation.
> > > + * Copyright (c) 2022, Intel Corporation.
> >
> > Nit: I would retain the original year of introduction, so 2012 -
> > 2022.
> OK
>
> >
> > >   *
> > >
>
> [...]
>
> > > +
> > > +static int idle_inject_begin(unsigned int cpu)
> >
> > So this would be the ->prepare() callback to be invoked on each CPU
> > from idle_inject_fn() IIUC.
> >
> Yes
>
> > >  {
> > > -       struct powerclamp_worker_data *w_data =
> > > per_cpu_ptr(worker_data, cpu);
> > > -       struct kthread_worker *worker;
> > > +       /*
> > > +        * only elected controlling cpu can collect stats and
> > > update
> > > +        * control parameters.
> > > +        */
> > > +       if (cpu == control_cpu) {
> > > +               bool update = READ_ONCE(target_ratio_updated);
> > > +
> > > +               if (!(powerclamp_data.count %
> > > powerclamp_data.window_size_now)) {
> > > +                       bool skip =
> > > powerclamp_adjust_controls(powerclamp_data.target_ratio,
> > > +
> > > powerclamp_data.guard,
> > > +
> > > powerclamp_data.window_size_now);
> > > +                       WRITE_ONCE(should_skip, skip);
> > > +                       update = true;
> > > +               }
> > >
> > > -       worker = kthread_create_worker_on_cpu(cpu, 0,
> > > "kidle_inj/%ld", cpu);
> > > -       if (IS_ERR(worker))
> > > -               return;
> > > +               if (update) {
> > > +                       unsigned int runtime;
> > > +
> > > +                       runtime = get_run_time();
> > > +                       idle_inject_set_duration(ii_dev, runtime,
> > > duration);
> > > +                       WRITE_ONCE(target_ratio_updated, false);
> > > +               }
> > > +               powerclamp_data.count++;
> > > +       }
> > > +
> > > +       if (READ_ONCE(should_skip))
> > > +               return -EAGAIN;
> >
> > This has a bit of a synchronization issue, because the control CPU is
> > not guaranteed to run this code before any other CPUs in the given
> > cycle, so at least some of them may see a stale value of should_skip
> > and they will still inject idle in this cycle.  Or else, they may
> > skip
> > idle injection when it should be done.
> This is correct observation. This is true in in even in current
> implementation. The per thread timer in the existing implementation has
> this sync issue. So I tried to just mimic current implementation as is.

I see, but I don't think that the new implementation has to be bug
compatible with the old one.

> >
> > I think that it would be better to run the callback from
> > idle_inject_timer_fn() where it would decide whether or not to call
> > idle_inject_wakeup(), in which case the control CPU would not be
> > needed any more (which would be a plus), because the "control" could
> > be done by the CPU running the timer function, whichever it is.
> >
> > Does this sound viable?
>
> Yes it is. In this case prepare() callback from idle_inject core is not
> per CPU, but per device.

Right.

BTW, I also would call it "update" and make it return bool, so
idle_inject_wakeup() would be called when it returned 'true'.