Message ID: 20210401183654.27214-2-daniel.lezcano@linaro.org
State: New
Series: [v6,1/7] powercap/drivers/dtpm: Encapsulate even more the code
On Thu, Apr 01, 2021 at 08:36:49PM +0200, Daniel Lezcano wrote:
> A SoC can be differently structured depending on the platform and the kernel can not be aware of all the combinations, as well as the specific tweaks for a particular board.
>
> The creation of the hierarchy must be delegated to userspace.

Why? Isn't this what DT is for?

What "userspace tool" is going to be created to manage all of this? Pointers to that codebase?

> These changes provide a registering mechanism where the different subsystems will initialize their dtpm backends and register with a name the dtpm node in a list.
>
> The next changes will provide an userspace interface to create hierarchically the different nodes. Those will be created by name and found via the list filled by the different subsystem.
>
> If a specified name is not found in the list, it is assumed to be a virtual node which will have children and the default is to allocate such node.

So userspace sets the name?

Why not use the name in the device itself? I thought I asked that last time...

thanks,

greg k-h
Hi Greg,

On 01/04/2021 21:28, Greg KH wrote:
> On Thu, Apr 01, 2021 at 08:36:49PM +0200, Daniel Lezcano wrote:
>> A SoC can be differently structured depending on the platform and the kernel can not be aware of all the combinations, as well as the specific tweaks for a particular board.
>>
>> The creation of the hierarchy must be delegated to userspace.
>
> Why? Isn't this what DT is for?

I've always been told the DT describes the hardware. Here we are more describing a configuration, that is the reason why I've let the userspace to handle that through configfs.

> What "userspace tool" is going to be created to manage all of this? Pointers to that codebase?

You are certainly aware of most of it but let me give a bit more of context.

The thermal framework has cooling devices which export their 'state', a representation of their performance level, in sysfs. Unfortunately that gives access from the user space to act on the performance as a power limiter in total conflict with the in-kernel thermal governor decisions.

That is done from thermal daemon the different SoC vendors tweak for their platform. Depending on the application running and identified as a scenario, the daemon acts proactively on the different cooling devices to ensure a skin temperature which is far below the thermal limit of the components.

This usage of the cooling devices hijacked the real purpose of the thermal framework which is to protect the silicon. Nobody to blame, there is no alternative for userspace.

The use case falls under the power limitation framework prerogative and that is the reason why we provided a new framework to limit the power based on the powercap framework. The thermal daemon can then use it and stop abusing the thermal framework.

This DTPM framework allows to read the power consumption and set a power limit to a device.

While the powercap simple backend gives a single device entry, DTPM aggregates the different devices power by summing their power and their limits. The tree representation of the different DTPM nodes describe how their limits are set and how the power is computed along the different devices.

For more info, we did a presentation at ELC [1] and Linux PM microconference [2] and there is an article talking about it [3].

To answer your questions, there is a SoC vendor thermal daemon using DTPM and there is a tool created to watch the thermal framework and read the DTPM values, it is available at [4]. It is currently under development with the goal of doing power rebalancing / capping across the different nodes when there is a violation of the parent's power limit.

[1] https://ossna2020.sched.com/event/c3Wf/ideas-for-finer-grained-control-over-your-heat-budget-amit-kucheria-daniel-lezcano-linaro
[2] https://www.linuxplumbersconf.org/event/7/page/80-accepted-microconferences#power-cr
[3] https://www.linaro.org/blog/using-energy-model-to-stay-in-tdp-budget/
[4] https://git.linaro.org/people/daniel.lezcano/dtpm.git

>> These changes provide a registering mechanism where the different subsystems will initialize their dtpm backends and register with a name the dtpm node in a list.
>>
>> The next changes will provide an userspace interface to create hierarchically the different nodes. Those will be created by name and found via the list filled by the different subsystem.
>>
>> If a specified name is not found in the list, it is assumed to be a virtual node which will have children and the default is to allocate such node.
>
> So userspace sets the name?
>
> Why not use the name in the device itself? I thought I asked that last time...

I probably missed it, sorry for that.

When the userspace creates the directory in the configfs, there is a lookup with the name in the device list name. If it is found, then the device is used, otherwise a virtual node is created instead, its power consumption is equal to the sum of the children.

The different drivers register themselves with their name and the associated dtpm structure. The userspace pick in this list to create a hierarchy via configfs.

For example, a big.Little system.

 - little CPUs power limiter will have the name cpu0-cpufreq
 - big CPUs will have the name cpu4-cpufreq
 - gpu will have the name ff9a0000.gpu-devfreq
 - charger will have the name power-supply-charge
 - DDR memory controller can have the name dmc-devfreq

Userspace may want to create this hierarchy:

soc
 - package
   - cluster0
     - cpu0-cpufreq
   - cluster1
     - ff9a0000.gpu-devfreq
   - dmc-devfreq
 - battery
   - power-supply-charge

It will do:

mkdir soc (virtual node)
mkdir soc/cluster0 (virtual node)
mkdir soc/cluster0/cpu0-cpufreq (real device)
etc ...

The configfs does not represent the layout of the sensors or the floor plan of the devices but only the constraints we want to tie together.

That is the reason why I think using configfs instead of OF is more adequate and flexible as userspace deals with the power numbers. Moreover, we won't be facing issues with devices initialization priority when the daemon starts.

I thought we can add OF later, when the framework has more users and more devices. The configfs and OF can certainly co-exist or be mutually exclusive via the Kconfig option.

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
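To make the mkdir sequence above concrete, here is a minimal userspace sketch that builds that hierarchy through configfs. The /sys/kernel/config/dtpm mount point and directory layout are assumptions (the configfs interface itself only arrives in later patches of this series), so treat every path as a placeholder rather than an established ABI.

/* Minimal sketch, assuming the DTPM configfs tree sits under
 * /sys/kernel/config/dtpm (an assumption: the configfs interface is
 * introduced by later patches not shown here). */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

#define DTPM_CFG "/sys/kernel/config/dtpm"	/* assumed mount point */

static int mkdir_node(const char *relpath)
{
	char path[256];

	snprintf(path, sizeof(path), "%s/%s", DTPM_CFG, relpath);

	/* EEXIST is fine: the node was already created on a previous run */
	if (mkdir(path, 0755) && errno != EEXIST) {
		perror(path);
		return -1;
	}
	return 0;
}

int main(void)
{
	/* Virtual nodes: no driver registered these names with dtpm_add() */
	mkdir_node("soc");
	mkdir_node("soc/package");
	mkdir_node("soc/package/cluster0");
	mkdir_node("soc/package/cluster1");

	/* Real devices: names the drivers registered with dtpm_add() */
	mkdir_node("soc/package/cluster0/cpu0-cpufreq");
	mkdir_node("soc/package/cluster1/cpu4-cpufreq");
	mkdir_node("soc/package/ff9a0000.gpu-devfreq");
	mkdir_node("soc/package/dmc-devfreq");

	return 0;
}

Per the changelog, a directory name that matches an entry registered with dtpm_add() binds to that device, and any other name becomes a virtual aggregation node.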
On Fri, Apr 02, 2021 at 12:08:49AM +0200, Daniel Lezcano wrote:
>
> Hi Greg,
>
> On 01/04/2021 21:28, Greg KH wrote:
> > On Thu, Apr 01, 2021 at 08:36:49PM +0200, Daniel Lezcano wrote:
> >> A SoC can be differently structured depending on the platform and the kernel can not be aware of all the combinations, as well as the specific tweaks for a particular board.
> >>
> >> The creation of the hierarchy must be delegated to userspace.
> >
> > Why? Isn't this what DT is for?
>
> I've always been told the DT describes the hardware. Here we are more describing a configuration, that is the reason why I've let the userspace to handle that through configfs.

DT does describe how the hardware configuration is to be used. You are saying that the hardware description does not work and somehow you need a magic userspace tool to reconfigure things instead?

> > What "userspace tool" is going to be created to manage all of this? Pointers to that codebase?
>
> You are certainly aware of most of it but let me give a bit more of context.

No, I am not aware of it at all, thanks :)

> The thermal framework has cooling devices which export their 'state', a representation of their performance level, in sysfs. Unfortunately that gives access from the user space to act on the performance as a power limiter in total conflict with the in-kernel thermal governor decisions.

Why not remove that conflict?

> That is done from thermal daemon the different SoC vendors tweak for their platform. Depending on the application running and identified as a scenario, the daemon acts proactively on the different cooling devices to ensure a skin temperature which is far below the thermal limit of the components.

So userspace is going to try to manage power settings in order to keep thermal values under control? Has no one learned from our past mistakes when we used to have to do this 10 years ago and it turned out so bad that it was just baked into the hardware instead?

{sigh}

> This usage of the cooling devices hijacked the real purpose of the thermal framework which is to protect the silicon. Nobody to blame, there is no alternative for userspace.

Userspace should not care.

> The use case falls under the power limitation framework prerogative and that is the reason why we provided a new framework to limit the power based on the powercap framework. The thermal daemon can then use it and stop abusing the thermal framework.
>
> This DTPM framework allows to read the power consumption and set a power limit to a device.
>
> While the powercap simple backend gives a single device entry, DTPM aggregates the different devices power by summing their power and their limits. The tree representation of the different DTPM nodes describe how their limits are set and how the power is computed along the different devices.

That's all great, doing this in-kernel is fine, it's now the "userspace must set this up properly that I'm objecting to as no one will know how to do this.

> For more info, we did a presentation at ELC [1] and Linux PM microconference [2] and there is an article talking about it [3].
>
>
> To answer your questions, there is a SoC vendor thermal daemon using DTPM and there is a tool created to watch the thermal framework and read the DTPM values, it is available at [4]. It is currently under development with the goal of doing power rebalancing / capping across the different nodes when there is a violation of the parent's power limit.

Crazy ideas aside, your implementation of this is my main objection here. You are creating a user/kernel api that you will have to support for 20+ years, without a real userspace user just yet (from what I can tell). That's rough, and is going to mean that this gets messy over time.

Also there's the whole "tying sysfs to configfs" mess and reference counting that I object to as well...

> >> The next changes will provide an userspace interface to create hierarchically the different nodes. Those will be created by name and found via the list filled by the different subsystem.
> >>
> >> If a specified name is not found in the list, it is assumed to be a virtual node which will have children and the default is to allocate such node.
> >
> > So userspace sets the name?
> >
> > Why not use the name in the device itself? I thought I asked that last time...
>
> I probably missed it, sorry for that.
>
> When the userspace creates the directory in the configfs, there is a lookup with the name in the device list name. If it is found, then the device is used, otherwise a virtual node is created instead, its power consumption is equal to the sum of the children.
>
> The different drivers register themselves with their name and the associated dtpm structure. The userspace pick in this list to create a hierarchy via configfs.
>
> For example, a big.Little system.
>
>  - little CPUs power limiter will have the name cpu0-cpufreq
>  - big CPUs will have the name cpu4-cpufreq
>  - gpu will have the name ff9a0000.gpu-devfreq
>  - charger will have the name power-supply-charge
>  - DDR memory controller can have the name dmc-devfreq
>
> Userspace may want to create this hierarchy:
>
> soc
>  - package
>    - cluster0
>      - cpu0-cpufreq
>    - cluster1
>      - ff9a0000.gpu-devfreq
>    - dmc-devfreq
>  - battery
>    - power-supply-charge
>
> It will do:
>
> mkdir soc (virtual node)
> mkdir soc/cluster0 (virtual node)
> mkdir soc/cluster0/cpu0-cpufreq (real device)
> etc ...
>
> The configfs does not represent the layout of the sensors or the floor plan of the devices but only the constraints we want to tie together.
>
> That is the reason why I think using configfs instead of OF is more adequate and flexible as userspace deals with the power numbers. Moreover, we won't be facing issues with devices initialization priority when the daemon starts.
>
> I thought we can add OF later, when the framework has more users and more devices. The configfs and OF can certainly co-exist or be mutually exclusive via the Kconfig option.

Kconfig options are not ok for this, you want to build a kernel that works everywhere.

If you want to make virtual things, that's fine, but again, your tying of sysfs to configfs in odd ways, along with the reference counting bugs/nightmare the current implementation is creating, is a non-starter for me at the moment, sorry.

thanks,

greg k-h
On 02/04/2021 10:02, Greg KH wrote:
> On Fri, Apr 02, 2021 at 12:08:49AM +0200, Daniel Lezcano wrote:
>>
>> Hi Greg,
>>
>> On 01/04/2021 21:28, Greg KH wrote:
>>> On Thu, Apr 01, 2021 at 08:36:49PM +0200, Daniel Lezcano wrote:
>>>> A SoC can be differently structured depending on the platform and the kernel can not be aware of all the combinations, as well as the specific tweaks for a particular board.
>>>>
>>>> The creation of the hierarchy must be delegated to userspace.
>>>
>>> Why? Isn't this what DT is for?
>>
>> I've always been told the DT describes the hardware. Here we are more describing a configuration, that is the reason why I've let the userspace to handle that through configfs.
>
> DT does describe how the hardware configuration is to be used. You are saying that the hardware description does not work and somehow you need a magic userspace tool to reconfigure things instead?
>
>>> What "userspace tool" is going to be created to manage all of this? Pointers to that codebase?
>>
>> You are certainly aware of most of it but let me give a bit more of context.
>
> No, I am not aware of it at all, thanks :)
>
>> The thermal framework has cooling devices which export their 'state', a representation of their performance level, in sysfs. Unfortunately that gives access from the user space to act on the performance as a power limiter in total conflict with the in-kernel thermal governor decisions.
>
> Why not remove that conflict?

Because the cooling device should not have exported its state in a writable way; that is a weakness different applications are abusing. Moreover, the thermal framework is not designed for that. It is not the purpose of the cooling device to be managed as a power limiter by the userspace. The powercap framework is there for that.

There are devices registering themselves as a cooling device while there is no sensor and governor for them, just for the sake of acting on their power via opaque integer values.

>> That is done from thermal daemon the different SoC vendors tweak for their platform. Depending on the application running and identified as a scenario, the daemon acts proactively on the different cooling devices to ensure a skin temperature which is far below the thermal limit of the components.
>
> So userspace is going to try to manage power settings in order to keep thermal values under control? Has no one learned from our past mistakes when we used to have to do this 10 years ago and it turned out so bad that it was just baked into the hardware instead?

I agree, in a desktop/server environment that is true, but in the embedded world things are a bit different because we are on battery, the cooling devices are passive and the device is carried.

This is why there are different levels of thermal control:

 - In the hardware / firmware, to physically protect the silicon.

 - In the kernel, to protect from a hard reset or reboot. Usually each of them is represented with a thermal zone (and its sensor) and a governor. There is a 1:1 relationship. IOW, a governor manages a thermal zone without any knowledge of the other thermal zones' temperatures, just the one it is monitoring, and acts on the power (with different techniques) when the temperature threshold is reached. Its action is local and happens only after crossing a certain temperature for this component.

 - In the userspace, the logic will get notified about which application is running and can set the power limitation on different devices knowing the performances are enough for that particular application; that saves energy and consequently reduces the temperature. Its action is on multiple places and happens far below the thermal limits.

On the other side, the skin temperature is a particular sensor placed on the dissipation path and its temperature must be below 45°C (30 minutes skin contact at 45°C causes a 1st degree burn). It is the result of the dissipation of all the devices; the logic in userspace can know which devices to act on to keep the overall power dissipation low enough to stay below this 45°C (eg. reduce camera resolution).

So here userspace deals with performance, temperature, application profile, etc ... and abuses the cooling devices.

I'm not sure that, in the ARM/ARM64 embedded ecosystem, creating a complex hardware mechanism supported by a firmware is really possible. In a desktop/server environment, we do not care about these skin temperature (and battery savings) aspects.

> {sigh}
>
>> This usage of the cooling devices hijacked the real purpose of the thermal framework which is to protect the silicon. Nobody to blame, there is no alternative for userspace.
>
> Userspace should not care.
>
>> The use case falls under the power limitation framework prerogative and that is the reason why we provided a new framework to limit the power based on the powercap framework. The thermal daemon can then use it and stop abusing the thermal framework.
>>
>> This DTPM framework allows to read the power consumption and set a power limit to a device.
>>
>> While the powercap simple backend gives a single device entry, DTPM aggregates the different devices power by summing their power and their limits. The tree representation of the different DTPM nodes describe how their limits are set and how the power is computed along the different devices.
>
> That's all great, doing this in-kernel is fine, it's now the "userspace must set this up properly that I'm objecting to as no one will know how to do this.
>
>> For more info, we did a presentation at ELC [1] and Linux PM microconference [2] and there is an article talking about it [3].
>>
>>
>> To answer your questions, there is a SoC vendor thermal daemon using DTPM and there is a tool created to watch the thermal framework and read the DTPM values, it is available at [4]. It is currently under development with the goal of doing power rebalancing / capping across the different nodes when there is a violation of the parent's power limit.
>
> Crazy ideas aside, your implementation of this is my main objection here. You are creating a user/kernel api that you will have to support for 20+ years, without a real userspace user just yet (from what I can tell). That's rough, and is going to mean that this gets messy over time.

I'm not sure to understand, the API already exists since v3.3, it is the powercap and DTPM is its backend. AFAICT, there are already users of it except they create their own way to build the hierarchy today.

> Also there's the whole "tying sysfs to configfs" mess and reference counting that I object to as well...

Ok, I think my description was bad but I agree with the fact that we must get rid of any adherence between sysfs and configfs. Thanks for reporting the issue.

[ ... ]

>> The configfs does not represent the layout of the sensors or the floor plan of the devices but only the constraints we want to tie together.
>>
>> That is the reason why I think using configfs instead of OF is more adequate and flexible as userspace deals with the power numbers. Moreover, we won't be facing issues with devices initialization priority when the daemon starts.
>>
>> I thought we can add OF later, when the framework has more users and more devices. The configfs and OF can certainly co-exist or be mutually exclusive via the Kconfig option.
>
> Kconfig options are not ok for this, you want to build a kernel that works everywhere.
>
> If you want to make virtual things, that's fine, but again, your tying of sysfs to configfs in odd ways, along with the reference counting bugs/nightmare the current implementation is creating, is a non-starter for me at the moment, sorry.

I did not realize this reference counting issue. Thanks for spotting it.

The points you raised are IMO solvable and I can send a new version, but with all the comments I'm a bit lost with the approach with configfs.

Do you think it is preferable to use OF instead?

Thanks again for your time

  -- Daniel

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
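For context on the power numbers the userspace logic manipulates: DTPM zones are exposed through the generic powercap sysfs class, so a daemon reads and caps a node with plain file I/O. The sketch below uses the documented powercap attribute names (power_uw, constraint_0_max_power_uw, constraint_0_power_limit_uw), but the zone path is only an assumed example, since the real path depends on the hierarchy that was created.

/* Rough sketch of a userspace daemon capping a DTPM node through the
 * powercap sysfs ABI.  The zone path below is an assumption; the real
 * one depends on how the hierarchy was built. */
#include <stdio.h>

static long long read_ll(const char *path)
{
	FILE *f = fopen(path, "r");
	long long val = -1;

	if (f) {
		if (fscanf(f, "%lld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

static void write_ll(const char *path, long long val)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fprintf(f, "%lld", val);
		fclose(f);
	}
}

int main(void)
{
	const char *zone = "/sys/class/powercap/dtpm/dtpm:0"; /* assumed path */
	char attr[256];
	long long power, max;

	snprintf(attr, sizeof(attr), "%s/power_uw", zone);
	power = read_ll(attr);

	snprintf(attr, sizeof(attr), "%s/constraint_0_max_power_uw", zone);
	max = read_ll(attr);

	printf("current power: %lld uW (max %lld uW)\n", power, max);

	/* Cap the whole subtree at half of its maximum power */
	snprintf(attr, sizeof(attr), "%s/constraint_0_power_limit_uw", zone);
	write_ll(attr, max / 2);

	return 0;
}

Because the DTPM nodes form a tree, a limit written on a parent zone constrains the aggregated power of everything below it, which is what the rebalancing daemon described above relies on.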
On Fri, Apr 02, 2021 at 01:10:51PM +0200, Daniel Lezcano wrote:
> >> To answer your questions, there is a SoC vendor thermal daemon using DTPM and there is a tool created to watch the thermal framework and read the DTPM values, it is available at [4]. It is currently under development with the goal of doing power rebalancing / capping across the different nodes when there is a violation of the parent's power limit.
> >
> > Crazy ideas aside, your implementation of this is my main objection here. You are creating a user/kernel api that you will have to support for 20+ years, without a real userspace user just yet (from what I can tell). That's rough, and is going to mean that this gets messy over time.
>
> I'm not sure to understand, the API already exists since v3.3, it is the powercap and DTPM is its backend. AFAICT, there are already users of it except they create their own way to build the hierarchy today.

The configfs api is what I am referring to here, the ones in this patch series...

thanks,

greg k-h
diff --git a/drivers/powercap/dtpm.c b/drivers/powercap/dtpm.c
index 58433b8ef9a1..a707cc2965a1 100644
--- a/drivers/powercap/dtpm.c
+++ b/drivers/powercap/dtpm.c
@@ -34,6 +34,14 @@ static DEFINE_MUTEX(dtpm_lock);
 static struct powercap_control_type *pct;
 static struct dtpm *root;
 
+struct dtpm_node {
+	const char *name;
+	struct dtpm *dtpm;
+	struct list_head node;
+};
+
+static LIST_HEAD(dtpm_list);
+
 static int get_time_window_us(struct powercap_zone *pcz, int cid, u64 *window)
 {
 	return -ENOSYS;
@@ -152,6 +160,116 @@ static int __dtpm_update_power(struct dtpm *dtpm)
 	return ret;
 }
 
+static struct dtpm *__dtpm_lookup(const char *name)
+{
+	struct dtpm_node *node;
+
+	list_for_each_entry(node, &dtpm_list, node) {
+		if (!strcmp(name, node->name))
+			return node->dtpm;
+	}
+
+	return NULL;
+}
+
+/**
+ * dtpm_lookup - Lookup for a registered dtpm node given its name
+ * @name: the name of the dtpm device
+ *
+ * The function looks up in the list of the registered dtpm
+ * devices. This function must be called to create a dtpm node in the
+ * powercap hierarchy.
+ *
+ * Return: a pointer to a dtpm structure, NULL if not found.
+ */
+struct dtpm *dtpm_lookup(const char *name)
+{
+	struct dtpm *dtpm;
+
+	mutex_lock(&dtpm_lock);
+	dtpm = __dtpm_lookup(name);
+	mutex_unlock(&dtpm_lock);
+
+	return dtpm;
+}
+EXPORT_SYMBOL_GPL(dtpm_lookup);
+
+/**
+ * dtpm_add - Add the dtpm in the dtpm list
+ * @name: a name used as an identifier
+ * @dtpm: the dtpm node to be registered
+ *
+ * Stores the dtpm device in a list. The list contains all the devices
+ * which are power capable in terms of limitation and power
+ * consumption measurements. Even if conceptually, a power capable
+ * device won't register itself twice, the function will check if it
+ * was already registered in order to prevent a misuse of the API.
+ *
+ * Return: 0 on success, -EEXIST if the device name is already present
+ * in the list, -ENOMEM in case of memory allocation failure.
+ */
+int dtpm_add(const char *name, struct dtpm *dtpm)
+{
+	struct dtpm_node *node;
+	int ret;
+
+	mutex_lock(&dtpm_lock);
+
+	ret = -EEXIST;
+	if (__dtpm_lookup(name))
+		goto out_unlock;
+
+	ret = -ENOMEM;
+	node = kzalloc(sizeof(*node), GFP_KERNEL);
+	if (!node)
+		goto out_unlock;
+
+	node->name = kstrdup(name, GFP_KERNEL);
+	if (!node->name) {
+		kfree(node);
+		goto out_unlock;
+	}
+
+	node->dtpm = dtpm;
+
+	list_add(&node->node, &dtpm_list);
+
+	ret = 0;
+out_unlock:
+	mutex_unlock(&dtpm_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dtpm_add);
+
+/**
+ * dtpm_del - Remove the dtpm device from the list
+ * @name: the dtpm device name to be removed
+ *
+ * Remove the dtpm device from the list of the registered devices.
+ */
+void dtpm_del(const char *name)
+{
+	struct dtpm_node *node;
+
+	mutex_lock(&dtpm_lock);
+
+	list_for_each_entry(node, &dtpm_list, node) {
+
+		if (strcmp(name, node->name))
+			continue;
+
+		list_del(&node->node);
+		kfree(node->name);
+		kfree(node);
+
+		break;
+	}
+
+	mutex_unlock(&dtpm_lock);
+}
+EXPORT_SYMBOL_GPL(dtpm_del);
+
 /**
  * dtpm_update_power - Update the power on the dtpm
  * @dtpm: a pointer to a dtpm structure to update
@@ -208,8 +326,6 @@ int dtpm_release_zone(struct powercap_zone *pcz)
 	if (root == dtpm)
 		root = NULL;
 
-	kfree(dtpm);
-
 	return 0;
 }
 
@@ -388,7 +504,7 @@ void dtpm_unregister(struct dtpm *dtpm)
 {
 	powercap_unregister_zone(pct, &dtpm->zone);
 
-	pr_info("Unregistered dtpm node '%s'\n", dtpm->zone.name);
+	pr_debug("Unregistered dtpm node '%s'\n", dtpm->zone.name);
 }
 
 /**
@@ -457,7 +573,7 @@ int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm *parent)
 	if (dtpm->ops && !dtpm->ops->update_power_uw(dtpm))
 		__dtpm_add_power(dtpm);
 
-	pr_info("Registered dtpm node '%s' / %llu-%llu uW, \n",
+	pr_debug("Created dtpm node '%s' / %llu-%llu uW, \n",
 		dtpm->zone.name, dtpm->power_min, dtpm->power_max);
 
 	mutex_unlock(&dtpm_lock);
diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
index f6076de39540..9deafd16a197 100644
--- a/drivers/powercap/dtpm_cpu.c
+++ b/drivers/powercap/dtpm_cpu.c
@@ -177,7 +177,7 @@ static int cpuhp_dtpm_cpu_online(unsigned int cpu)
 
 	snprintf(name, sizeof(name), "cpu%d-cpufreq", dtpm_cpu->cpu);
 
-	ret = dtpm_register(name, dtpm, NULL);
+	ret = dtpm_add(name, dtpm);
 	if (ret)
 		goto out_kfree_dtpm_cpu;
 
@@ -185,12 +185,12 @@ static int cpuhp_dtpm_cpu_online(unsigned int cpu)
 			       &dtpm_cpu->qos_req, FREQ_QOS_MAX,
 			       pd->table[pd->nr_perf_states - 1].frequency);
 	if (ret)
-		goto out_dtpm_unregister;
+		goto out_dtpm_del;
 
 	return 0;
 
-out_dtpm_unregister:
-	dtpm_unregister(dtpm);
+out_dtpm_del:
+	dtpm_del(name);
 	dtpm_cpu = NULL;
 	dtpm = NULL;
 
diff --git a/include/linux/dtpm.h b/include/linux/dtpm.h
index acf8d3638988..577c71d4e098 100644
--- a/include/linux/dtpm.h
+++ b/include/linux/dtpm.h
@@ -75,4 +75,10 @@ int dtpm_register(const char *name, struct dtpm *dtpm, struct dtpm *parent);
 
 int dtpm_register_cpu(struct dtpm *parent);
 
+struct dtpm *dtpm_lookup(const char *name);
+
+int dtpm_add(const char *name, struct dtpm *dtpm);
+
+void dtpm_del(const char *name);
+
 #endif
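The configfs side that consumes this list lives in later patches of the series and is not shown here, but the flow described in the changelog can be sketched: when userspace creates a directory, look the name up among the devices registered with dtpm_add(), and fall back to an allocated virtual node otherwise. The dtpm_create_node() helper below is hypothetical; dtpm_lookup(), dtpm_init() and dtpm_register() are the existing DTPM calls it leans on.

#include <linux/dtpm.h>
#include <linux/slab.h>

/* Hypothetical helper, not part of this patch: bind a configfs
 * directory name to a registered dtpm device, or allocate a virtual
 * node when no driver claimed that name. */
static int dtpm_create_node(const char *name, struct dtpm *parent)
{
	struct dtpm *dtpm;

	/* A real device registered itself with dtpm_add(name, ...) */
	dtpm = dtpm_lookup(name);

	/* Otherwise it is a virtual node aggregating its children */
	if (!dtpm) {
		dtpm = kzalloc(sizeof(*dtpm), GFP_KERNEL);
		if (!dtpm)
			return -ENOMEM;
		dtpm_init(dtpm, NULL);	/* no ops: power is the sum of the children */
	}

	/* Hook the node into the powercap hierarchy under its parent */
	return dtpm_register(name, dtpm, parent);
}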