Message ID | 20230925081139.1305766-1-lukasz.luba@arm.com |
---|---|
Headers | show |
Series | Introduce runtime modifiable Energy Model | expand |
On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > The Energy Model (EM) is going to support runtime modifications. This > new callback would be used in the upcoming EM changes. The drivers > or frameworks which want to modify the EM have to implement the > update_power() callback. > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > --- > include/linux/energy_model.h | 22 ++++++++++++++++++++++ > 1 file changed, 22 insertions(+) > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h > index d236e08e80dc..546dee90f716 100644 > --- a/include/linux/energy_model.h > +++ b/include/linux/energy_model.h > @@ -168,6 +168,26 @@ struct em_data_callback { > */ > int (*get_cost)(struct device *dev, unsigned long freq, > unsigned long *cost); > + > + /** > + * update_power() - Provide new power at the given performance state of > + * a device The meaning of the above is unclear to me. > + * @dev : Device for which we do this operation (can be a CPU) It is unclear what "we" means in this context. Maybe say "Target device (can be a CPU)"? > + * @freq : Frequency at the performance state in kHz What performance state does this refer to? And the frequency of what? > + * @power : New power value at the performance state > + * (modified) Similarly unclear as the above. > + * @priv : Pointer to private data useful for tracking context > + * during runtime modifications of EM. Who's going to set this pointer and use this data? > + * > + * The update_power() is used by runtime modifiable EM. It aims to I would drop "The" from the above. > + * provide updated power value for a given frequency, which is stored > + * in the performance state. A given frequency of what and the performance state of what does this refer to? > + The power value provided by this callback > + * should fit in the [0, EM_MAX_POWER] range. > + * > + * Return 0 on success, or appropriate error value in case of failure. > + */ > + int (*update_power)(struct device *dev, unsigned long freq, > + unsigned long *power, void *priv); > }; > #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) ((em_cb).active_power = cb) > #define EM_ADV_DATA_CB(_active_power_cb, _cost_cb) \ > @@ -175,6 +195,7 @@ struct em_data_callback { > .get_cost = _cost_cb } > #define EM_DATA_CB(_active_power_cb) \ > EM_ADV_DATA_CB(_active_power_cb, NULL) > +#define EM_UPDATE_CB(_update_power_cb) { .update_power = &_update_power_cb } > > struct em_perf_domain *em_cpu_get(int cpu); > struct em_perf_domain *em_pd_get(struct device *dev); > @@ -331,6 +352,7 @@ struct em_data_callback {}; > #define EM_ADV_DATA_CB(_active_power_cb, _cost_cb) { } > #define EM_DATA_CB(_active_power_cb) { } > #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) do { } while (0) > +#define EM_UPDATE_CB(_update_cb) { } > > static inline > int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states, > --
On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > The new runtime table would be populated with a new power data to better > reflect the actual power. The power can vary over time e.g. due to the > SoC temperature change. Higher temperature can increase power values. > For longer running scenarios, such as game or camera, when also other > devices are used (e.g. GPU, ISP) the CPU power can change. The new > EM framework is able to addresses this issue and change the data > at runtime safely. > > The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS) > for the task placement. All the other users (thermal, etc.) are still > using the default (basic) EM. This fact drove the design of this feature. > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > --- > include/linux/energy_model.h | 4 +++- > kernel/power/energy_model.c | 12 +++++++++++- > 2 files changed, 14 insertions(+), 2 deletions(-) > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h > index 546dee90f716..740e7c25cfff 100644 > --- a/include/linux/energy_model.h > +++ b/include/linux/energy_model.h > @@ -39,7 +39,7 @@ struct em_perf_state { > /** > * struct em_perf_table - Performance states table > * @state: List of performance states, in ascending order > - * @rcu: RCU used for safe access and destruction > + * @rcu: RCU used only for runtime modifiable table This still doesn't appear to be used anywhere, so why change it here? > */ > struct em_perf_table { > struct em_perf_state *state; > @@ -49,6 +49,7 @@ struct em_perf_table { > /** > * struct em_perf_domain - Performance domain > * @default_table: Pointer to the default em_perf_table > + * @runtime_table: Pointer to the runtime modifiable em_perf_table "Pointer to em_perf_table that can be dynamically updated" > * @nr_perf_states: Number of performance states > * @flags: See "em_perf_domain flags" > * @cpus: Cpumask covering the CPUs of the domain. It's here > @@ -64,6 +65,7 @@ struct em_perf_table { > */ > struct em_perf_domain { > struct em_perf_table *default_table; > + struct em_perf_table __rcu *runtime_table; > int nr_perf_states; > unsigned long flags; > unsigned long cpus[]; > diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c > index 797141638b29..5b40db38b745 100644 > --- a/kernel/power/energy_model.c > +++ b/kernel/power/energy_model.c > @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states, > return ret; > } > > + /* Initialize runtime table as default table. */ Redundant comment. > + rcu_assign_pointer(pd->runtime_table, default_table); > + > if (_is_cpu_device(dev)) > for_each_cpu(cpu, cpus) { > cpu_dev = get_cpu_device(cpu); > @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain); > */ > void em_dev_unregister_perf_domain(struct device *dev) > { > + struct em_perf_table __rcu *runtime_table; > struct em_perf_domain *pd; > > if (IS_ERR_OR_NULL(dev) || !dev->em_pd) > @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev) > return; > > pd = dev->em_pd; > - Unrelated change. > /* > * The mutex separates all register/unregister requests and protects > * from potential clean-up/setup issues in the debugfs directories. > * The debugfs directory name is the same as device's name. > */ > mutex_lock(&em_pd_mutex); > + Same here. > em_debug_remove_pd(dev); > > + runtime_table = pd->runtime_table; > + > + rcu_assign_pointer(pd->runtime_table, NULL); > + synchronize_rcu(); Is it really a good idea to call this under a mutex? > + > kfree(pd->default_table->state); > kfree(pd->default_table); > kfree(dev->em_pd); > + Unrelated change. > dev->em_pd = NULL; > mutex_unlock(&em_pd_mutex); > } > -- So this really adds a pointer to a table that can be dynamically updated to struct em_perf_domain without any users so far. It is not used anywhere as of this patch AFAICS, which is not what the changelog is saying.
On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote: First off, I would merge this with the previous patch, as the changes would be much clearer then IMO. > Add an interface which allows to modify EM power data at runtime. > The new power information is populated by the provided callback, which > is called for each performance state. But it all starts with copying the frequencies from the default table. > The CPU frequencies' efficiency is > re-calculated since that might be affected as well. The old EM memory > is going to be freed later using RCU mechanism. Not all of it, but the old runtime table that is not going to be used any more. > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > --- > include/linux/energy_model.h | 8 +++ > kernel/power/energy_model.c | 111 +++++++++++++++++++++++++++++++++++ > 2 files changed, 119 insertions(+) > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h > index 740e7c25cfff..8f055ab356ed 100644 > --- a/include/linux/energy_model.h > +++ b/include/linux/energy_model.h > @@ -201,6 +201,8 @@ struct em_data_callback { > > struct em_perf_domain *em_cpu_get(int cpu); > struct em_perf_domain *em_pd_get(struct device *dev); > +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb, > + void *priv); > int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states, > struct em_data_callback *cb, cpumask_t *span, > bool microwatts); > @@ -384,6 +386,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd) > { > return 0; > } > +static inline > +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb, > + void *priv) > +{ > + return -EINVAL; > +} > #endif > > #endif > diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c > index 2345837bfd2c..78e1495dc87e 100644 > --- a/kernel/power/energy_model.c > +++ b/kernel/power/energy_model.c > @@ -172,6 +172,101 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table, > return 0; > } > > +/** > + * em_dev_update_perf_domain() - Update runtime EM table for a device > + * @dev : Device for which the EM is to be updated > + * @cb : Callback function providing the power data for the EM > + * @priv : Pointer to private data useful for passing context > + * which might be required while calling @cb It is still unclear to me who is going to use this priv pointer and how. > + * > + * Update EM runtime modifiable table for a @dev using the callback > + * defined in @cb. The EM new power values are then used for calculating > + * the em_perf_state::cost for associated performance state. It actually allocates a new runtime table and populates it from scratch, using the frequencies from the default table and the callback. > + * > + * This function uses mutex to serialize writers, so it must not be called "a mutex" > + * from non-sleeping context. > + * > + * Return 0 on success or a proper error in case of failure. > + */ > +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb, > + void *priv) > +{ > + struct em_perf_table *runtime_table; > + unsigned long power, freq; > + struct em_perf_domain *pd; > + int ret, i; > + > + if (!cb || !cb->update_power) > + return -EINVAL; > + > + /* > + * The lock serializes update and unregister code paths. When the > + * EM has been unregistered in the meantime, we should capture that > + * when entering this critical section. It also makes sure that > + * two concurrent updates will be serialized. > + */ > + mutex_lock(&em_pd_mutex); > + > + if (!dev || !dev->em_pd) { Checking dev against NULL under the mutex is pointless (either it is NULL or it isn't, so check it earlier). > + ret = -EINVAL; > + goto unlock_em; > + } > + > + pd = dev->em_pd; And I would check pd against NULL here. > + > + runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL); > + if (!runtime_table) { > + ret = -ENOMEM; > + goto unlock_em; > + } > + > + runtime_table->state = kcalloc(pd->nr_perf_states, > + sizeof(struct em_perf_state), > + GFP_KERNEL); > + if (!runtime_table->state) { > + ret = -ENOMEM; > + goto free_runtime_table; > + } The above allocations can be merged into one and allocating memory under the mutex is questionable. > + > + /* Populate runtime table with updated values using driver callback */ > + for (i = 0; i < pd->nr_perf_states; i++) { > + freq = pd->default_table->state[i].frequency; > + runtime_table->state[i].frequency = freq; > + > + /* > + * Call driver callback to get a new power value for > + * a given frequency. > + */ > + ret = cb->update_power(dev, freq, &power, priv); > + if (ret) { > + dev_dbg(dev, "EM: runtime update error: %d\n", ret); > + goto free_runtime_state_table; > + } > + > + runtime_table->state[i].power = power; > + } > + > + ret = em_compute_costs(dev, runtime_table->state, cb, > + pd->nr_perf_states, pd->flags); > + if (ret) > + goto free_runtime_state_table; > + > + em_perf_runtime_table_set(dev, runtime_table); > + > + mutex_unlock(&em_pd_mutex); > + return 0; > + > +free_runtime_state_table: > + kfree(runtime_table->state); > +free_runtime_table: > + kfree(runtime_table); > +unlock_em: > + mutex_unlock(&em_pd_mutex); > + > + return -EINVAL; > +} > +EXPORT_SYMBOL_GPL(em_dev_update_perf_domain); > + > static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd, > int nr_states, struct em_data_callback *cb, > unsigned long flags) > @@ -494,6 +589,8 @@ void em_dev_unregister_perf_domain(struct device *dev) > * The mutex separates all register/unregister requests and protects > * from potential clean-up/setup issues in the debugfs directories. > * The debugfs directory name is the same as device's name. > + * The lock also protects the updater of the runtime modifiable > + * EM and this remover. > */ > mutex_lock(&em_pd_mutex); > > @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev) > > runtime_table = pd->runtime_table; > > + /* > + * Safely destroy runtime modifiable EM. By using the call > + * synchronize_rcu() we make sure we don't progress till last user > + * finished the RCU section and our update got applied. > + */ > rcu_assign_pointer(pd->runtime_table, NULL); > synchronize_rcu(); > > + /* > + * After the sync no updates will be in-flight, so free the > + * memory allocated for runtime table (if there was such). > + */ > + if (runtime_table != pd->default_table) { > + kfree(runtime_table->state); > + kfree(runtime_table); > + } Can't this race with the RCU callback freeing the runtime table? > + > kfree(pd->default_table->state); > kfree(pd->default_table); > kfree(dev->em_pd); > --
Hi Lukasz On 09/25/23 09:11, Lukasz Luba wrote: > Hi all, > > This patch set adds a new feature which allows to modify Energy Model (EM) > power values at runtime. It will allow to better reflect power model of > a recent SoCs and silicon. Different characteristics of the power usage > can be leveraged and thus better decisions made during task placement in EAS. > > It's part of feature set know as Dynamic Energy Model. It has been presented > and discussed recently at OSPM2023 [3]. This patch set implements the 1st > improvement for the EM. Thanks for the series! I'm on CC this time :-) Unfortunately I have a planned holiday starting tomorrow, so won't get a chance to do proper review till I'm back which would be few weeks from now. I just want to iterate that this is a real problem being seen in practice where it's hard to provide a single average EM for all workloads now. So we certainly need something like this. Hopefully I'll get a chance to help with review when I'm back from holidays. Thanks! -- Qais Yousef
On 9/26/23 19:39, Rafael J. Wysocki wrote: > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> Refactor a dedicated function which will be easier to maintain and re-use >> in future. The upcoming changes for the modifiable EM perf_state table >> will use it (instead of duplicating the code). >> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > > If I'm not mistaken, this patch by itself is not going to change the > observable functionality in any way and it would be good to say that > in the changelog. > > This also applies to some other patches in this series. > Correct, no functional changes. I will add that information to the patch header in next version.
On 9/26/23 19:59, Rafael J. Wysocki wrote: > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> The Energy Model (EM) is going to support runtime modifications. This >> new callback would be used in the upcoming EM changes. The drivers >> or frameworks which want to modify the EM have to implement the >> update_power() callback. >> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> >> --- >> include/linux/energy_model.h | 22 ++++++++++++++++++++++ >> 1 file changed, 22 insertions(+) >> >> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h >> index d236e08e80dc..546dee90f716 100644 >> --- a/include/linux/energy_model.h >> +++ b/include/linux/energy_model.h >> @@ -168,6 +168,26 @@ struct em_data_callback { >> */ >> int (*get_cost)(struct device *dev, unsigned long freq, >> unsigned long *cost); >> + >> + /** >> + * update_power() - Provide new power at the given performance state of >> + * a device > > The meaning of the above is unclear to me. I can try to rephrase this a bit: ' Provide a new power value for the device at the given frequency. This allows to reflect changed power profile in runtime.' > >> + * @dev : Device for which we do this operation (can be a CPU) > > It is unclear what "we" means in this context. Maybe say "Target > device (can be a CPU)"? That sounds better indeed. > >> + * @freq : Frequency at the performance state in kHz > > What performance state does this refer to? And the frequency of what? We just call the entry in EM the 'performance state' (so frequency and power). I will rephrase this as well: 'Frequency of the @dev expressed in kHz' > >> + * @power : New power value at the performance state >> + * (modified) > > Similarly unclear as the above. OK, it can be re-written as: 'Power value of the @dev at the given @freq (modified)' > >> + * @priv : Pointer to private data useful for tracking context >> + * during runtime modifications of EM. > > Who's going to set this pointer and use this data? The driver or kernel module, which is aware about thermal events. Those events might be coming from FW to kernel, but we need to maintain the same 'context' for all OPPs when we start the EM update. This 'priv' field is used for passing the 'context' back to the caller, so the caller can consistently the update. The update might be with some math formula which multiplies the power by some factor value (based on thermal model and current temperature). > >> + * >> + * The update_power() is used by runtime modifiable EM. It aims to > > I would drop "The" from the above. OK > >> + * provide updated power value for a given frequency, which is stored >> + * in the performance state. > > A given frequency of what and the performance state of what does this refer to? I will change that and add the '@dev' word to make this more precised. 'update_power() is used by runtime modifiable EM. It aims to update the @dev EM power values for all registered frequencies.'
On 9/26/23 20:12, Rafael J. Wysocki wrote: > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> The new runtime table would be populated with a new power data to better >> reflect the actual power. The power can vary over time e.g. due to the >> SoC temperature change. Higher temperature can increase power values. >> For longer running scenarios, such as game or camera, when also other >> devices are used (e.g. GPU, ISP) the CPU power can change. The new >> EM framework is able to addresses this issue and change the data >> at runtime safely. >> >> The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS) >> for the task placement. All the other users (thermal, etc.) are still >> using the default (basic) EM. This fact drove the design of this feature. >> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> >> --- >> include/linux/energy_model.h | 4 +++- >> kernel/power/energy_model.c | 12 +++++++++++- >> 2 files changed, 14 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h >> index 546dee90f716..740e7c25cfff 100644 >> --- a/include/linux/energy_model.h >> +++ b/include/linux/energy_model.h >> @@ -39,7 +39,7 @@ struct em_perf_state { >> /** >> * struct em_perf_table - Performance states table >> * @state: List of performance states, in ascending order >> - * @rcu: RCU used for safe access and destruction >> + * @rcu: RCU used only for runtime modifiable table > > This still doesn't appear to be used anywhere, so why change it here? I will try to move this later in the series. > >> */ >> struct em_perf_table { >> struct em_perf_state *state; >> @@ -49,6 +49,7 @@ struct em_perf_table { >> /** >> * struct em_perf_domain - Performance domain >> * @default_table: Pointer to the default em_perf_table >> + * @runtime_table: Pointer to the runtime modifiable em_perf_table > > "Pointer to em_perf_table that can be dynamically updated" OK > >> * @nr_perf_states: Number of performance states >> * @flags: See "em_perf_domain flags" >> * @cpus: Cpumask covering the CPUs of the domain. It's here >> @@ -64,6 +65,7 @@ struct em_perf_table { >> */ >> struct em_perf_domain { >> struct em_perf_table *default_table; >> + struct em_perf_table __rcu *runtime_table; >> int nr_perf_states; >> unsigned long flags; >> unsigned long cpus[]; >> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c >> index 797141638b29..5b40db38b745 100644 >> --- a/kernel/power/energy_model.c >> +++ b/kernel/power/energy_model.c >> @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states, >> return ret; >> } >> >> + /* Initialize runtime table as default table. */ > > Redundant comment. I'll drop it. > >> + rcu_assign_pointer(pd->runtime_table, default_table); >> + >> if (_is_cpu_device(dev)) >> for_each_cpu(cpu, cpus) { >> cpu_dev = get_cpu_device(cpu); >> @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain); >> */ >> void em_dev_unregister_perf_domain(struct device *dev) >> { >> + struct em_perf_table __rcu *runtime_table; >> struct em_perf_domain *pd; >> >> if (IS_ERR_OR_NULL(dev) || !dev->em_pd) >> @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev) >> return; >> >> pd = dev->em_pd; >> - > > Unrelated change. ACK > >> /* >> * The mutex separates all register/unregister requests and protects >> * from potential clean-up/setup issues in the debugfs directories. >> * The debugfs directory name is the same as device's name. >> */ >> mutex_lock(&em_pd_mutex); >> + > > Same here. ACK > >> em_debug_remove_pd(dev); >> >> + runtime_table = pd->runtime_table; >> + >> + rcu_assign_pointer(pd->runtime_table, NULL); >> + synchronize_rcu(); > > Is it really a good idea to call this under a mutex? This is the unregistration of the EM code path, so a thermal driver which gets some IRQs might not be aware that the EM is going to vanish. That's why those two code paths: update & unregister are protected with the same lock. This synchronize_rcu() won't be long, but makes sure that when we free(dev->em_pd) we don't leak runtime_table. > >> + >> kfree(pd->default_table->state); >> kfree(pd->default_table); >> kfree(dev->em_pd); >> + > > Unrelated change. ACK > >> dev->em_pd = NULL; >> mutex_unlock(&em_pd_mutex); >> } >> -- > > So this really adds a pointer to a table that can be dynamically > updated to struct em_perf_domain without any users so far. It is not > used anywhere as of this patch AFAICS, which is not what the changelog > is saying. Good catch. I will adjust the changlog in header and say: 'Add infrastructure and mechanisms for the new runtime table. The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)for the task placement. All the other users (thermal, etc.) are still using the default (basic) EM. This fact drove the design of this feature.'
On 9/26/23 20:48, Rafael J. Wysocki wrote: > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > First off, I would merge this with the previous patch, as the changes > would be much clearer then IMO. I was trying to avoid a big patch ~150 lines. I will do that merge. > >> Add an interface which allows to modify EM power data at runtime. >> The new power information is populated by the provided callback, which >> is called for each performance state. > > But it all starts with copying the frequencies from the default table. Yes, I can add that to the description. > >> The CPU frequencies' efficiency is >> re-calculated since that might be affected as well. The old EM memory >> is going to be freed later using RCU mechanism. > > Not all of it, but the old runtime table that is not going to be used any more. True, I will rephrase that, to make it more precised. > >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> [snip] >> >> +/** >> + * em_dev_update_perf_domain() - Update runtime EM table for a device >> + * @dev : Device for which the EM is to be updated >> + * @cb : Callback function providing the power data for the EM >> + * @priv : Pointer to private data useful for passing context >> + * which might be required while calling @cb > > It is still unclear to me who is going to use this priv pointer and how. I have explained that in some previous patch response. A driver or kernel module which monitors the thermal situation and has context. > >> + * >> + * Update EM runtime modifiable table for a @dev using the callback >> + * defined in @cb. The EM new power values are then used for calculating >> + * the em_perf_state::cost for associated performance state. > > It actually allocates a new runtime table and populates it from > scratch, using the frequencies from the default table and the > callback. Yes, it allocated new table and put the updated power values there. I can add that to the comment. > >> + * >> + * This function uses mutex to serialize writers, so it must not be called > > "a mutex" ACK > >> + * from non-sleeping context. [snip] >> + >> + if (!dev || !dev->em_pd) { > > Checking dev against NULL under the mutex is pointless (either it is > NULL or it isn't, so check it earlier). ACK > >> + ret = -EINVAL; >> + goto unlock_em; >> + } >> + >> + pd = dev->em_pd; > > And I would check pd against NULL here. It's done above, next to '!dev || !dev->em_pd' > >> + >> + runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL); >> + if (!runtime_table) { >> + ret = -ENOMEM; >> + goto unlock_em; >> + } >> + >> + runtime_table->state = kcalloc(pd->nr_perf_states, >> + sizeof(struct em_perf_state), >> + GFP_KERNEL); >> + if (!runtime_table->state) { >> + ret = -ENOMEM; >> + goto free_runtime_table; >> + } > > The above allocations can be merged into one and allocating memory > under the mutex is questionable. So how to make sure that there is no 2 callers trying to update the same EM or unregistration is not in the background? [snip] >> >> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev) >> >> runtime_table = pd->runtime_table; >> >> + /* >> + * Safely destroy runtime modifiable EM. By using the call >> + * synchronize_rcu() we make sure we don't progress till last user >> + * finished the RCU section and our update got applied. >> + */ >> rcu_assign_pointer(pd->runtime_table, NULL); >> synchronize_rcu(); >> >> + /* >> + * After the sync no updates will be in-flight, so free the >> + * memory allocated for runtime table (if there was such). >> + */ >> + if (runtime_table != pd->default_table) { >> + kfree(runtime_table->state); >> + kfree(runtime_table); >> + } > > Can't this race with the RCU callback freeing the runtime table? That's why there is this 'synchronize_rcu()' above and the mutex. The updating caller if finished the update, would unlock the mutex and this unregister code can go. Here we call the synchronize_rcu() so we assure the callback has finished for the update path and than we explicitly free the saved 'runtime_table' here. So all data should be freed and code serialized in those two paths.
On Fri, Sep 29, 2023 at 10:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > On 9/26/23 19:59, Rafael J. Wysocki wrote: > > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > >> > >> The Energy Model (EM) is going to support runtime modifications. This > >> new callback would be used in the upcoming EM changes. The drivers > >> or frameworks which want to modify the EM have to implement the > >> update_power() callback. > >> > >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > >> --- > >> include/linux/energy_model.h | 22 ++++++++++++++++++++++ > >> 1 file changed, 22 insertions(+) > >> > >> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h > >> index d236e08e80dc..546dee90f716 100644 > >> --- a/include/linux/energy_model.h > >> +++ b/include/linux/energy_model.h > >> @@ -168,6 +168,26 @@ struct em_data_callback { > >> */ > >> int (*get_cost)(struct device *dev, unsigned long freq, > >> unsigned long *cost); > >> + > >> + /** > >> + * update_power() - Provide new power at the given performance state of > >> + * a device > > > > The meaning of the above is unclear to me. > > I can try to rephrase this a bit: > ' Provide a new power value for the device at the given frequency. This > allows to reflect changed power profile in runtime.' Maybe "Estimate power for a given device frequency" > > > >> + * @dev : Device for which we do this operation (can be a CPU) > > > > It is unclear what "we" means in this context. Maybe say "Target > > device (can be a CPU)"? > > That sounds better indeed. > > > > >> + * @freq : Frequency at the performance state in kHz > > > > What performance state does this refer to? And the frequency of what? > > We just call the entry in EM the 'performance state' (so frequency and > power). I will rephrase this as well: > 'Frequency of the @dev expressed in kHz' Or "Device frequency for which to estimate power"? > > > >> + * @power : New power value at the performance state > >> + * (modified) > > > > Similarly unclear as the above. > > OK, it can be re-written as: > 'Power value of the @dev at the given @freq (modified)' This is not a power value, but a return pointer AFAICS. So "location to store the return power value". > > > >> + * @priv : Pointer to private data useful for tracking context > >> + * during runtime modifications of EM. > > > > Who's going to set this pointer and use this data? > > The driver or kernel module, which is aware about thermal events. Those > events might be coming from FW to kernel, but we need to maintain > the same 'context' for all OPPs when we start the EM update. > > This 'priv' field is used for passing the 'context' back to the > caller, so the caller can consistently the update. The update might > be with some math formula which multiplies the power by some factor > value (based on thermal model and current temperature). I would say "Additional data for the callback function, provided by the entity setting the callback pointer". > > > >> + * > >> + * The update_power() is used by runtime modifiable EM. It aims to > > > > I would drop "The" from the above. > > OK > > > > >> + * provide updated power value for a given frequency, which is stored > >> + * in the performance state. > > > > A given frequency of what and the performance state of what does this refer to? > > I will change that and add the '@dev' word to make this more precised. That would help. Overall, I would say "is used by ... to obtain a new power value for a given frequency of @dev". > > 'update_power() is used by runtime modifiable EM. It aims to update the > @dev EM power values for all registered frequencies.'
On Fri, Sep 29, 2023 at 11:15 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > > > On 9/26/23 20:12, Rafael J. Wysocki wrote: > > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > >> > >> The new runtime table would be populated with a new power data to better > >> reflect the actual power. The power can vary over time e.g. due to the > >> SoC temperature change. Higher temperature can increase power values. > >> For longer running scenarios, such as game or camera, when also other > >> devices are used (e.g. GPU, ISP) the CPU power can change. The new > >> EM framework is able to addresses this issue and change the data > >> at runtime safely. > >> > >> The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS) > >> for the task placement. All the other users (thermal, etc.) are still > >> using the default (basic) EM. This fact drove the design of this feature. > >> > >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > >> --- > >> include/linux/energy_model.h | 4 +++- > >> kernel/power/energy_model.c | 12 +++++++++++- > >> 2 files changed, 14 insertions(+), 2 deletions(-) > >> > >> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h > >> index 546dee90f716..740e7c25cfff 100644 > >> --- a/include/linux/energy_model.h > >> +++ b/include/linux/energy_model.h > >> @@ -39,7 +39,7 @@ struct em_perf_state { > >> /** > >> * struct em_perf_table - Performance states table > >> * @state: List of performance states, in ascending order > >> - * @rcu: RCU used for safe access and destruction > >> + * @rcu: RCU used only for runtime modifiable table > > > > This still doesn't appear to be used anywhere, so why change it here? > > I will try to move this later in the series. > > > > >> */ > >> struct em_perf_table { > >> struct em_perf_state *state; > >> @@ -49,6 +49,7 @@ struct em_perf_table { > >> /** > >> * struct em_perf_domain - Performance domain > >> * @default_table: Pointer to the default em_perf_table > >> + * @runtime_table: Pointer to the runtime modifiable em_perf_table > > > > "Pointer to em_perf_table that can be dynamically updated" > > OK > > > > >> * @nr_perf_states: Number of performance states > >> * @flags: See "em_perf_domain flags" > >> * @cpus: Cpumask covering the CPUs of the domain. It's here > >> @@ -64,6 +65,7 @@ struct em_perf_table { > >> */ > >> struct em_perf_domain { > >> struct em_perf_table *default_table; > >> + struct em_perf_table __rcu *runtime_table; > >> int nr_perf_states; > >> unsigned long flags; > >> unsigned long cpus[]; > >> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c > >> index 797141638b29..5b40db38b745 100644 > >> --- a/kernel/power/energy_model.c > >> +++ b/kernel/power/energy_model.c > >> @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states, > >> return ret; > >> } > >> > >> + /* Initialize runtime table as default table. */ > > > > Redundant comment. > > I'll drop it. > > > > >> + rcu_assign_pointer(pd->runtime_table, default_table); > >> + > >> if (_is_cpu_device(dev)) > >> for_each_cpu(cpu, cpus) { > >> cpu_dev = get_cpu_device(cpu); > >> @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain); > >> */ > >> void em_dev_unregister_perf_domain(struct device *dev) > >> { > >> + struct em_perf_table __rcu *runtime_table; > >> struct em_perf_domain *pd; > >> > >> if (IS_ERR_OR_NULL(dev) || !dev->em_pd) > >> @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev) > >> return; > >> > >> pd = dev->em_pd; > >> - > > > > Unrelated change. > > ACK > > > > >> /* > >> * The mutex separates all register/unregister requests and protects > >> * from potential clean-up/setup issues in the debugfs directories. > >> * The debugfs directory name is the same as device's name. > >> */ > >> mutex_lock(&em_pd_mutex); > >> + > > > > Same here. > > ACK > > > > >> em_debug_remove_pd(dev); > >> > >> + runtime_table = pd->runtime_table; > >> + > >> + rcu_assign_pointer(pd->runtime_table, NULL); > >> + synchronize_rcu(); > > > > Is it really a good idea to call this under a mutex? > > This is the unregistration of the EM code path, so a thermal > driver which gets some IRQs might not be aware that the EM > is going to vanish. That's why those two code paths: update > & unregister are protected with the same lock. > > This synchronize_rcu() won't be long, Are you sure? This potentially waits for all of the CPUs in the system to go through a quiescent state which may take a while in principle. In any case, though, this effectively makes everyone waiting for the mutex also wait for the grace period to elapse and they may not care about it. > but makes sure that when we free(dev->em_pd) we don't leak runtime_table. > > > > >> + > >> kfree(pd->default_table->state); > >> kfree(pd->default_table); > >> kfree(dev->em_pd); > >> + > > > > Unrelated change. > > ACK > > > > >> dev->em_pd = NULL; > >> mutex_unlock(&em_pd_mutex); > >> } > >> -- > > > > So this really adds a pointer to a table that can be dynamically > > updated to struct em_perf_domain without any users so far. It is not > > used anywhere as of this patch AFAICS, which is not what the changelog > > is saying. > > Good catch. I will adjust the changlog in header and say: > > 'Add infrastructure and mechanisms for the new runtime table. > The runtime modifiable EM data is used by the Energy Aware Scheduler > (EAS)for the task placement. I would make it more clear that this is going to happen after some other subsequent changes. > All the other users (thermal, etc.) are > still using the default (basic) EM. This fact drove the design of this > feature.'
On Fri, Sep 29, 2023 at 11:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > > > On 9/26/23 20:48, Rafael J. Wysocki wrote: > > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > > > First off, I would merge this with the previous patch, as the changes > > would be much clearer then IMO. > > I was trying to avoid a big patch ~150 lines. I will do that merge. > > > > >> Add an interface which allows to modify EM power data at runtime. > >> The new power information is populated by the provided callback, which > >> is called for each performance state. > > > > But it all starts with copying the frequencies from the default table. > > Yes, I can add that to the description. > > > > >> The CPU frequencies' efficiency is > >> re-calculated since that might be affected as well. The old EM memory > >> is going to be freed later using RCU mechanism. > > > > Not all of it, but the old runtime table that is not going to be used any more. > > True, I will rephrase that, to make it more precised. > > > > >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > > [snip] > > >> > >> +/** > >> + * em_dev_update_perf_domain() - Update runtime EM table for a device > >> + * @dev : Device for which the EM is to be updated > >> + * @cb : Callback function providing the power data for the EM > >> + * @priv : Pointer to private data useful for passing context > >> + * which might be required while calling @cb > > > > It is still unclear to me who is going to use this priv pointer and how. > > I have explained that in some previous patch response. A driver or > kernel module which monitors the thermal situation and has context. > > > > >> + * > >> + * Update EM runtime modifiable table for a @dev using the callback > >> + * defined in @cb. The EM new power values are then used for calculating > >> + * the em_perf_state::cost for associated performance state. > > > > It actually allocates a new runtime table and populates it from > > scratch, using the frequencies from the default table and the > > callback. > > Yes, it allocated new table and put the updated power values there. > I can add that to the comment. > > > > >> + * > >> + * This function uses mutex to serialize writers, so it must not be called > > > > "a mutex" > > ACK > > > > >> + * from non-sleeping context. > > [snip] > > >> + > >> + if (!dev || !dev->em_pd) { > > > > Checking dev against NULL under the mutex is pointless (either it is > > NULL or it isn't, so check it earlier). > > ACK > > > > >> + ret = -EINVAL; > >> + goto unlock_em; > >> + } > >> + > >> + pd = dev->em_pd; > > > > And I would check pd against NULL here. > > It's done above, next to '!dev || !dev->em_pd' Yes, it is, I meant something like this: if (!cb || !cb->update_power || !dev) return -EINVAL; mutex_lock(&em_pd_mutex); pd = dev->em_pd; if (!pd) { ret = -EINVAL; /* or perhaps -ENODATA */ goto unlock_em; } > > > >> + > >> + runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL); > >> + if (!runtime_table) { > >> + ret = -ENOMEM; > >> + goto unlock_em; > >> + } > >> + > >> + runtime_table->state = kcalloc(pd->nr_perf_states, > >> + sizeof(struct em_perf_state), > >> + GFP_KERNEL); > >> + if (!runtime_table->state) { > >> + ret = -ENOMEM; > >> + goto free_runtime_table; > >> + } > > > > The above allocations can be merged into one and allocating memory > > under the mutex is questionable. > > So how to make sure that there is no 2 callers trying to update the > same EM or unregistration is not in the background? You can allocate memory upfront and take the mutex before accessing the shared data structures. If there's an error in the code running under the mutex, release it and then free the memory. Allocating memory is one operation, updating the shared data structures to use it is another one. The former doesn't affect the shared state in any way, so why do it under the mutex? > [snip] > > >> > >> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev) > >> > >> runtime_table = pd->runtime_table; > >> > >> + /* > >> + * Safely destroy runtime modifiable EM. By using the call > >> + * synchronize_rcu() we make sure we don't progress till last user > >> + * finished the RCU section and our update got applied. > >> + */ > >> rcu_assign_pointer(pd->runtime_table, NULL); > >> synchronize_rcu(); > >> > >> + /* > >> + * After the sync no updates will be in-flight, so free the > >> + * memory allocated for runtime table (if there was such). > >> + */ > >> + if (runtime_table != pd->default_table) { > >> + kfree(runtime_table->state); > >> + kfree(runtime_table); > >> + } > > > > Can't this race with the RCU callback freeing the runtime table? > > That's why there is this 'synchronize_rcu()' above and the mutex. The > updating caller if finished the update, would unlock the mutex and this > unregister code can go. Here we call the synchronize_rcu() so we assure > the callback has finished for the update path and than we explicitly > free the saved 'runtime_table' here. So all data should be freed and > code serialized in those two paths. This doesn't quite agree with my understanding of what synchronize_rcu() does. IIUC, RCU callbacks can run as soon as the grace period has elapsed and they need not wait for synchronize_rcu() to return. Conversely, synchronize_rcu() doesn't wait for all of the RCU callbacks to complete. Now, em_destroy_rt_table_rcu() doesn't actually use the mutex, so how exactly is it protected against racing with this code?
On 9/29/23 14:18, Rafael J. Wysocki wrote: > On Fri, Sep 29, 2023 at 11:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >> [snip] >> >> It's done above, next to '!dev || !dev->em_pd' > > Yes, it is, I meant something like this: > > if (!cb || !cb->update_power || !dev) > return -EINVAL; > > mutex_lock(&em_pd_mutex); > > pd = dev->em_pd; > if (!pd) { > ret = -EINVAL; /* or perhaps -ENODATA */ > goto unlock_em; > } > > OK, I see what you mean. Let me change that. >>> >>>> + >>>> + runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL); >>>> + if (!runtime_table) { >>>> + ret = -ENOMEM; >>>> + goto unlock_em; >>>> + } >>>> + >>>> + runtime_table->state = kcalloc(pd->nr_perf_states, >>>> + sizeof(struct em_perf_state), >>>> + GFP_KERNEL); >>>> + if (!runtime_table->state) { >>>> + ret = -ENOMEM; >>>> + goto free_runtime_table; >>>> + } >>> >>> The above allocations can be merged into one and allocating memory >>> under the mutex is questionable. >> >> So how to make sure that there is no 2 callers trying to update the >> same EM or unregistration is not in the background? > > You can allocate memory upfront and take the mutex before accessing > the shared data structures. If there's an error in the code running > under the mutex, release it and then free the memory. > > Allocating memory is one operation, updating the shared data > structures to use it is another one. The former doesn't affect the > shared state in any way, so why do it under the mutex? Yes, make sense. I will shrink that critical section. Good catch, thanks! > >> [snip] >> >>>> >>>> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev) >>>> >>>> runtime_table = pd->runtime_table; >>>> >>>> + /* >>>> + * Safely destroy runtime modifiable EM. By using the call >>>> + * synchronize_rcu() we make sure we don't progress till last user >>>> + * finished the RCU section and our update got applied. >>>> + */ >>>> rcu_assign_pointer(pd->runtime_table, NULL); >>>> synchronize_rcu(); >>>> >>>> + /* >>>> + * After the sync no updates will be in-flight, so free the >>>> + * memory allocated for runtime table (if there was such). >>>> + */ >>>> + if (runtime_table != pd->default_table) { >>>> + kfree(runtime_table->state); >>>> + kfree(runtime_table); >>>> + } >>> >>> Can't this race with the RCU callback freeing the runtime table? >> >> That's why there is this 'synchronize_rcu()' above and the mutex. The >> updating caller if finished the update, would unlock the mutex and this >> unregister code can go. Here we call the synchronize_rcu() so we assure >> the callback has finished for the update path and than we explicitly >> free the saved 'runtime_table' here. So all data should be freed and >> code serialized in those two paths. > > This doesn't quite agree with my understanding of what synchronize_rcu() does. > > IIUC, RCU callbacks can run as soon as the grace period has elapsed > and they need not wait for synchronize_rcu() to return. Conversely, > synchronize_rcu() doesn't wait for all of the RCU callbacks to > complete. > > Now, em_destroy_rt_table_rcu() doesn't actually use the mutex, so how > exactly is it protected against racing with this code? I'll try to draw in on some pictures... (previous instance ) +---------------------+ | | | runtime table #1 | | | +---------------------+ (next instance ) +---------------------+ | | | runtime table #2 | | | +---------------------+ (not possible new instance) +.....................+ . . . runtime table #3 . . . +.....................+ cpu A - "updater" | cpu B - "remover" | ------------------------------|------------------------------ lock em_pd_mutex | | alloc runtime table #2 | lock em_pd_mutex | (waiting) async free instance #1 | . | . unlock em_pd_mutex | . | (enter critical section) | lock em_pd_mutex | set NULL to runtime table ptr (waiting) | (wanted to create #3 inst) | synchronize rcu to make it is visible . | . | implicit free instance #2 . | . | free the rest of EM and EM . | . | unlock em_pd_mutex (enter critical section) | !dev->em_pd so | unlock & exit | | | This should clean all involved memory and also prevent of allocating new instance, when we unregister EM.
On 9/29/23 13:27, Rafael J. Wysocki wrote: > On Fri, Sep 29, 2023 at 11:15 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> [snip] >>>> em_debug_remove_pd(dev); >>>> >>>> + runtime_table = pd->runtime_table; >>>> + >>>> + rcu_assign_pointer(pd->runtime_table, NULL); >>>> + synchronize_rcu(); >>> >>> Is it really a good idea to call this under a mutex? >> >> This is the unregistration of the EM code path, so a thermal >> driver which gets some IRQs might not be aware that the EM >> is going to vanish. That's why those two code paths: update >> & unregister are protected with the same lock. >> >> This synchronize_rcu() won't be long, > > Are you sure? This potentially waits for all of the CPUs in the > system to go through a quiescent state which may take a while in > principle. > > In any case, though, this effectively makes everyone waiting for the > mutex also wait for the grace period to elapse and they may not care > about it. My apologies for the delay, I had to check this. Yes, should be possible and safe to not wait here as you described on this synchronize_rcu(). What I have drawn in other response to patch 11/18 [1] should still be true. Thanks, I will remove this sync call from here. > >> but makes sure that when we free(dev->em_pd) we don't leak runtime_table. >> >>> >>>> + >>>> kfree(pd->default_table->state); >>>> kfree(pd->default_table); >>>> kfree(dev->em_pd); >>>> + >>> >>> Unrelated change. >> >> ACK >> >>> >>>> dev->em_pd = NULL; >>>> mutex_unlock(&em_pd_mutex); >>>> } >>>> -- >>> >>> So this really adds a pointer to a table that can be dynamically >>> updated to struct em_perf_domain without any users so far. It is not >>> used anywhere as of this patch AFAICS, which is not what the changelog >>> is saying. >> >> Good catch. I will adjust the changlog in header and say: >> >> 'Add infrastructure and mechanisms for the new runtime table. >> The runtime modifiable EM data is used by the Energy Aware Scheduler >> (EAS)for the task placement. > > I would make it more clear that this is going to happen after some > other subsequent changes. > OK, I will add that information too. [1] https://lore.kernel.org/lkml/91d6e9be-d50c-d157-55a0-79134cbd01fb@arm.com/
On 25/09/2023 10:11, Lukasz Luba wrote: > The Energy Model (EM) is going to support runtime modification. There > are going to be 2 EM tables which store information. This patch aims > to prepare the code to be generic and use one of the tables. The function > will no longer get a pointer to 'struct em_perf_domain' (the EM) but > instead a pointer to 'struct em_perf_state' (which is one of the EM's > tables). > > Prepare em_pd_get_efficient_state() for the upcoming changes and > make it possible to re-use. Return an index for the best performance > state for a given EM table. The function arguments that are introduced > should allow to work on different performance state arrays. The caller of > em_pd_get_efficient_state() should be able to use the index either > on the default or the modifiable EM table. > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > --- [ ... ] > @@ -251,7 +253,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, > * Find the lowest performance state of the Energy Model above the > * requested frequency. > */ > - ps = em_pd_get_efficient_state(pd, freq); > + i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq, > + pd->flags); nitpicking but s/i/state/ Other than that: Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> > + ps = &pd->table[i]; > > /* > * The capacity of a CPU in the domain at the performance state (ps)