Message ID | 1622804761-126737-4-git-send-email-vincent.donnefort@arm.com |
---|---|
State | Superseded |
Headers | show |
Series | EM / PM: Inefficient OPPs | expand |
On 04-06-21, 12:05, Vincent Donnefort wrote: > Some SoCs such as the sd855 have OPPs within the same policy whose cost is > higher than others with a higher frequency. Those OPPs are inefficients and > it might be interesting for a governor to not use them. > > Adding a flag, CPUFREQ_INEFFICIENT_FREQ, to mark such OPPs into the > frequency table, as well as a new cpufreq_frequency_table member > "efficient". This new member will allow a governor to quickly resolve an > inefficient frequency to an efficient one. > > Efficient OPPs point to themselves. Governors must also ensure that the > efficiency resolution does not break the policy maximum. > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> I was thinking about a different approach when I suggested it. - The cpufreq table instead of an index, will have "efficient" as bool. - EM will set an OPP as efficient or inefficient, based on the OPP table, there will be a flag for this in the OPP table. - The cpufreq table is then created from OPP table and this information is automatically retrieved. -- viresh
On 6/7/21 6:02 AM, Viresh Kumar wrote: > On 04-06-21, 12:05, Vincent Donnefort wrote: >> Some SoCs such as the sd855 have OPPs within the same policy whose cost is >> higher than others with a higher frequency. Those OPPs are inefficients and >> it might be interesting for a governor to not use them. >> >> Adding a flag, CPUFREQ_INEFFICIENT_FREQ, to mark such OPPs into the >> frequency table, as well as a new cpufreq_frequency_table member >> "efficient". This new member will allow a governor to quickly resolve an >> inefficient frequency to an efficient one. >> >> Efficient OPPs point to themselves. Governors must also ensure that the >> efficiency resolution does not break the policy maximum. >> >> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> > > I was thinking about a different approach when I suggested it. > > - The cpufreq table instead of an index, will have "efficient" as bool. > > - EM will set an OPP as efficient or inefficient, based on the OPP > table, there will be a flag for this in the OPP table. > > - The cpufreq table is then created from OPP table and this > information is automatically retrieved. > There are some issues with your proposed approach: The EM doesn't depend on OPP framework (even no header from opp.h) and we don't want to add this dependency in EM. The Vincent's proposed implementation doesn't introduce this dependency. Regards, Lukasz
On 04-06-21, 12:05, Vincent Donnefort wrote: > int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy) > { > int ret; > @@ -362,7 +409,13 @@ int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy) > if (ret) > return ret; > > - return set_freq_table_sorted(policy); > + ret = set_freq_table_sorted(policy); > + if (ret) > + return ret; > + > + set_freq_table_efficiencies(policy); > + > + return ret; > } Lets provide a single callback from the cpufreq core to do all settings related to efficient frequencies and call it from em_dev_register_perf_domain() ? So we only do them for the cpufreq drivers that register themselves with EM. -- viresh
On Mon, Jun 14, 2021 at 12:58:35PM +0530, Viresh Kumar wrote: > On 04-06-21, 12:05, Vincent Donnefort wrote: > > int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy) > > { > > int ret; > > @@ -362,7 +409,13 @@ int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy) > > if (ret) > > return ret; > > > > - return set_freq_table_sorted(policy); > > + ret = set_freq_table_sorted(policy); > > + if (ret) > > + return ret; > > + > > + set_freq_table_efficiencies(policy); > > + > > + return ret; > > } > > Lets provide a single callback from the cpufreq core to do all > settings related to efficient frequencies and call it from > em_dev_register_perf_domain() ? Sadly we cannot do it in em_dev_register_perf_domain(). This function is called from the cpufreq driver init, when the table isn't available yet. > > So we only do them for the cpufreq drivers that register themselves > with EM. Currently, only the EM would set those inefficient OPPs. But it also gives the possibility for individual drivers to set them, even if they do not use the EM. > > -- > viresh
On 14-06-21, 14:35, Vincent Donnefort wrote: > On Mon, Jun 14, 2021 at 12:58:35PM +0530, Viresh Kumar wrote: > > On 04-06-21, 12:05, Vincent Donnefort wrote: > > > int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy) > > > { > > > int ret; > > > @@ -362,7 +409,13 @@ int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy) > > > if (ret) > > > return ret; > > > > > > - return set_freq_table_sorted(policy); > > > + ret = set_freq_table_sorted(policy); > > > + if (ret) > > > + return ret; > > > + > > > + set_freq_table_efficiencies(policy); > > > + > > > + return ret; > > > } > > > > Lets provide a single callback from the cpufreq core to do all > > settings related to efficient frequencies and call it from > > em_dev_register_perf_domain() ? > > Sadly we cannot do it in em_dev_register_perf_domain(). This function is called > from the cpufreq driver init, when the table isn't available yet. I looked at all the users of em_dev_register_perf_domain() and each one of them have the freq table ready at that point of time. > > > > So we only do them for the cpufreq drivers that register themselves > > with EM. > > Currently, only the EM would set those inefficient OPPs. But it also gives the > possibility for individual drivers to set them, even if they do not use the EM. This is exactly why I want those parts of the kernel to call a specific API to get this done. This shouldn't be done automatically by the cpufreq core. -- viresh
On Tue, Jun 15, 2021 at 10:32:11AM +0530, Viresh Kumar wrote: > On 14-06-21, 14:35, Vincent Donnefort wrote: > > On Mon, Jun 14, 2021 at 12:58:35PM +0530, Viresh Kumar wrote: > > > On 04-06-21, 12:05, Vincent Donnefort wrote: > > > > int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy) > > > > { > > > > int ret; > > > > @@ -362,7 +409,13 @@ int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy) > > > > if (ret) > > > > return ret; > > > > > > > > - return set_freq_table_sorted(policy); > > > > + ret = set_freq_table_sorted(policy); > > > > + if (ret) > > > > + return ret; > > > > + > > > > + set_freq_table_efficiencies(policy); > > > > + > > > > + return ret; > > > > } > > > > > > Lets provide a single callback from the cpufreq core to do all > > > settings related to efficient frequencies and call it from > > > em_dev_register_perf_domain() ? > > > > Sadly we cannot do it in em_dev_register_perf_domain(). This function is called > > from the cpufreq driver init, when the table isn't available yet. > > I looked at all the users of em_dev_register_perf_domain() and each > one of them have the freq table ready at that point of time. The cpufreq_policy is accessed through percpu data. I originally tried to get it with cpufreq_cpu_get_raw(). But when we init the cpufreq driver (and by extension when we call em_dev_register_perf_domain()), the percpu data isn't updated yet. I guess the only way at that moment would be to propagate the cpufreq_policy pointer through the functions parameters, which is a bit cumbersome, especially as the EM can be used with devfreq as well and that em_dev_register_perf_domain() can be called from dev_pm_opp_of_register_em() Would we be ok with that? I could use em_data_callback as well ... but that doesn't change the fact some registration is going through dev_pm_opp_of_register_em() which has no knowledge about the cpufreq_policy. Doesn't look a better approach. > > > > > > > So we only do them for the cpufreq drivers that register themselves > > > with EM. > > > > Currently, only the EM would set those inefficient OPPs. But it also gives the > > possibility for individual drivers to set them, even if they do not use the EM. > > This is exactly why I want those parts of the kernel to call a > specific API to get this done. This shouldn't be done automatically by > the cpufreq core. > > -- > viresh
On 15-06-21, 09:44, Vincent Donnefort wrote: > The cpufreq_policy is accessed through percpu data. I originally tried to get it > with cpufreq_cpu_get_raw(). But when we init the cpufreq driver (and by > extension when we call em_dev_register_perf_domain()), the percpu data isn't > updated yet. Right. > I guess the only way at that moment would be to propagate the cpufreq_policy > pointer through the functions parameters, which is a bit cumbersome, especially > as the EM can be used with devfreq as well and that em_dev_register_perf_domain() > can be called from dev_pm_opp_of_register_em() > > Would we be ok with that? No. You only talked about freq_table earlier and that's all I checked :) > I could use em_data_callback as well ... but that doesn't change the fact some > registration is going through dev_pm_opp_of_register_em() which has no knowledge > about the cpufreq_policy. Doesn't look a better approach. The point is that I don't want cpufreq to carry this for users, we have EM today, tomorrow we may want to mark a frequency as inefficient from somewhere else. The call need to initiate from EM core. And this isn't a cpufreq only thing, but is going to be generic along with other device types. This is exactly why I asked you earlier to play with OPP core for this. That is the central place for data for all such users. If this information is present at the OPP table (somehow), then we can just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core as well. This is the sequence that is followed in cpufreq drivers today: dev_pm_opp_of_cpumask_add_table(); dev_pm_opp_init_cpufreq_table(); dev_pm_opp_of_register_em(); What about changing this to: dev_pm_opp_of_cpumask_add_table(); /* Mark OPPs are inefficient here */ dev_pm_opp_of_register_em(); /* This should automatically pick the right set */ dev_pm_opp_init_cpufreq_table(); Will this break anything ? -- viresh
On Tue, Jun 15, 2021 at 03:47:06PM +0530, Viresh Kumar wrote: > On 15-06-21, 09:44, Vincent Donnefort wrote: > > The cpufreq_policy is accessed through percpu data. I originally tried to get it > > with cpufreq_cpu_get_raw(). But when we init the cpufreq driver (and by > > extension when we call em_dev_register_perf_domain()), the percpu data isn't > > updated yet. > > Right. > > > I guess the only way at that moment would be to propagate the cpufreq_policy > > pointer through the functions parameters, which is a bit cumbersome, especially > > as the EM can be used with devfreq as well and that em_dev_register_perf_domain() > > can be called from dev_pm_opp_of_register_em() > > > > Would we be ok with that? > > No. > > You only talked about freq_table earlier and that's all I checked :) > > > I could use em_data_callback as well ... but that doesn't change the fact some > > registration is going through dev_pm_opp_of_register_em() which has no knowledge > > about the cpufreq_policy. Doesn't look a better approach. > > The point is that I don't want cpufreq to carry this for users, we > have EM today, tomorrow we may want to mark a frequency as inefficient > from somewhere else. The call need to initiate from EM core. In the current version of this patchset, any driver can mark inefficiencies without relying on the EM, just by adding the flag CPUFREQ_INEFFICIENT_FREQ in cpufreq_frequency_table. Populating cpufreq_frequency_table from the EM in cpufreq was just an attempt to a less intrusive set of changes. > > And this isn't a cpufreq only thing, but is going to be generic along > with other device types. > > This is exactly why I asked you earlier to play with OPP core for > this. That is the central place for data for all such users. > > If this information is present at the OPP table (somehow), then we can > just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core > as well. > > This is the sequence that is followed in cpufreq drivers today: > > dev_pm_opp_of_cpumask_add_table(); > dev_pm_opp_init_cpufreq_table(); > dev_pm_opp_of_register_em(); > > What about changing this to: > > dev_pm_opp_of_cpumask_add_table(); > > /* Mark OPPs are inefficient here */ > dev_pm_opp_of_register_em(); > > /* This should automatically pick the right set */ > dev_pm_opp_init_cpufreq_table(); > > Will this break anything ? Probably not, but with this approach I'll have to modify all the cpufreq drivers that are registering the EM, which I tried to avoid as much as possible so far. But if we sum-up: 1. em_dev_register_perf_domain() find inefficiencies 2. dev_pm_opp_of_register_em() apply EM inefficiencies into the OPP structures Note: scmi-cpufreq would need special treatment, as it doesn't rely dev_pm_opp_of_register_em(). 3. dev_pm_opp_init_cpufreq_table() marks the cpufreq table with the OPP inefficiencies Guess it would ease the adoption by other OPP clients. However this patchset will clearly get bigger. Would you agree with that?
On 15-06-21, 18:15, Vincent Donnefort wrote: > On Tue, Jun 15, 2021 at 03:47:06PM +0530, Viresh Kumar wrote: > > The point is that I don't want cpufreq to carry this for users, we > > have EM today, tomorrow we may want to mark a frequency as inefficient > > from somewhere else. The call need to initiate from EM core. > > In the current version of this patchset, any driver can mark inefficiencies > without relying on the EM, just by adding the flag CPUFREQ_INEFFICIENT_FREQ in > cpufreq_frequency_table. Yeah, I wasn't really talking about cpufreq drivers but external entities, like EM. > Populating cpufreq_frequency_table from the EM in cpufreq was just an attempt to > a less intrusive set of changes. > > And this isn't a cpufreq only thing, but is going to be generic along > > with other device types. > > > > This is exactly why I asked you earlier to play with OPP core for > > this. That is the central place for data for all such users. > > > > If this information is present at the OPP table (somehow), then we can > > just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core > > as well. > > > > This is the sequence that is followed in cpufreq drivers today: > > > > dev_pm_opp_of_cpumask_add_table(); > > dev_pm_opp_init_cpufreq_table(); > > dev_pm_opp_of_register_em(); > > > > What about changing this to: > > > > dev_pm_opp_of_cpumask_add_table(); > > > > /* Mark OPPs are inefficient here */ > > dev_pm_opp_of_register_em(); > > > > /* This should automatically pick the right set */ > > dev_pm_opp_init_cpufreq_table(); > > > > Will this break anything ? > > Probably not, but with this approach I'll have to modify all the cpufreq drivers > that are registering the EM, which I tried to avoid as much as possible so far. Hmm. You are right as well, but I just want to get the right API in place which lives a longer life :) > But if we sum-up: > > 1. em_dev_register_perf_domain() find inefficiencies > > 2. dev_pm_opp_of_register_em() apply EM inefficiencies into the OPP structures I was looking to add a new API to the OPP core (dev_pm_opp_mark_inefficient()) to mark an OPP inefficient. And then get it called from em_create_perf_table(). But I now see that EM core rather has callbacks to call into and with that I think you should rather add another callback (.mark_inefficient()) in struct em_data_callback, to set inefficient frequencies. > Note: scmi-cpufreq would need special treatment, as it doesn't rely > dev_pm_opp_of_register_em(). For both dev_pm_opp_of_register_em() and scmi case, you can then set this callback to dev_pm_opp_mark_inefficient(). > 3. dev_pm_opp_init_cpufreq_table() marks the cpufreq table with the OPP > inefficiencies Yes. > Guess it would ease the adoption by other OPP clients. However this patchset > will clearly get bigger. > > Would you agree with that? Yes, it is fine. -- viresh
On 6/16/21 8:35 AM, Viresh Kumar wrote: > On 15-06-21, 18:15, Vincent Donnefort wrote: >> On Tue, Jun 15, 2021 at 03:47:06PM +0530, Viresh Kumar wrote: >>> The point is that I don't want cpufreq to carry this for users, we >>> have EM today, tomorrow we may want to mark a frequency as inefficient >>> from somewhere else. The call need to initiate from EM core. >> >> In the current version of this patchset, any driver can mark inefficiencies >> without relying on the EM, just by adding the flag CPUFREQ_INEFFICIENT_FREQ in >> cpufreq_frequency_table. > > Yeah, I wasn't really talking about cpufreq drivers but external > entities, like EM. > >> Populating cpufreq_frequency_table from the EM in cpufreq was just an attempt to >> a less intrusive set of changes. > >>> And this isn't a cpufreq only thing, but is going to be generic along >>> with other device types. >>> >>> This is exactly why I asked you earlier to play with OPP core for >>> this. That is the central place for data for all such users. >>> >>> If this information is present at the OPP table (somehow), then we can >>> just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core >>> as well. >>> >>> This is the sequence that is followed in cpufreq drivers today: >>> >>> dev_pm_opp_of_cpumask_add_table(); >>> dev_pm_opp_init_cpufreq_table(); >>> dev_pm_opp_of_register_em(); >>> >>> What about changing this to: >>> >>> dev_pm_opp_of_cpumask_add_table(); >>> >>> /* Mark OPPs are inefficient here */ >>> dev_pm_opp_of_register_em(); >>> >>> /* This should automatically pick the right set */ >>> dev_pm_opp_init_cpufreq_table(); >>> >>> Will this break anything ? >> >> Probably not, but with this approach I'll have to modify all the cpufreq drivers >> that are registering the EM, which I tried to avoid as much as possible so far. > > Hmm. You are right as well, but I just want to get the right API in > place which lives a longer life :) > >> But if we sum-up: >> >> 1. em_dev_register_perf_domain() find inefficiencies >> >> 2. dev_pm_opp_of_register_em() apply EM inefficiencies into the OPP structures > > I was looking to add a new API to the OPP core > (dev_pm_opp_mark_inefficient()) to mark an OPP inefficient. And then > get it called from em_create_perf_table(). > > But I now see that EM core rather has callbacks to call into and with Exactly, that's what I was trying to stress. > that I think you should rather add another callback > (.mark_inefficient()) in struct em_data_callback, to set inefficient > frequencies. I disagree. That's why I prefer Vincent's approach in this patch set. This proposal would cause more mess. Vincent proposed a small and clean modification. This modification can be done easily in this cpufreq place because we have EM in device cpu struct. Let's don't over-engineering. The inefficient information is only valid for schedutil, thus IMHO it can live like this patch set made - in the cpufreq table. Compare the v1 (I still don't understand why it was blocked), this v3 and your proposal.
On 16-06-21, 10:03, Lukasz Luba wrote: > On 6/16/21 8:35 AM, Viresh Kumar wrote: > > On 15-06-21, 18:15, Vincent Donnefort wrote: > > > But if we sum-up: > > > > > > 1. em_dev_register_perf_domain() find inefficiencies > > > > > > 2. dev_pm_opp_of_register_em() apply EM inefficiencies into the OPP structures > > > > I was looking to add a new API to the OPP core > > (dev_pm_opp_mark_inefficient()) to mark an OPP inefficient. And then > > get it called from em_create_perf_table(). > > > > But I now see that EM core rather has callbacks to call into and with > > Exactly, that's what I was trying to stress. > > > that I think you should rather add another callback > > (.mark_inefficient()) in struct em_data_callback, to set inefficient > > frequencies. > > I disagree. That's why I prefer Vincent's approach in this patch set. > This proposal would cause more mess. > > Vincent proposed a small and clean modification. This modification > can be done easily in this cpufreq place because we have EM in > device cpu struct. This may look clean to you, but not to me, sorry about that. Clean is not lesser number of lines for me, but rather having the right ownership of such things. For example this patch: https://lore.kernel.org/linux-pm/1622804761-126737-6-git-send-email-vincent.donnefort@arm.com/ tries to add EM stuff in cpufreq core. Cpufreq core doesn't care about EM and it shouldn't. And this piece of code doesn't belong here. Would you guys like to add a cpufreq specific call into the EM core? I won't, that's not a place for cpufreq stuff. It is the EM core. I was fine with not including OPP core into this, and I gave up that argument earlier, but then we realized that the cpufreq core isn't ready at the time we register with EM core. Honestly, OPP core looks to be a better place holder for such stuff. This is exactly the purpose of the OPP core. Moreover, we can apply the same logic to devfreq or other devices later, with or without EM core. Again, OPP core fits better. The cpufreq core already has the relevant APIs in place to the OPP core and this won't require a new API there. > Let's don't over-engineering. The inefficient information is only valid > for schedutil, thus IMHO it can live like this patch set made - in the > cpufreq table. For now, yes. There is no guarantee though that we won't have more in future. > Compare the v1 (I still don't understand why it was blocked), IIRC, it required more work to be done in the hotpath, i.e. traversing the freq list twice. > this v3 and your proposal. IMHO, adding such callbacks to the EM core, like .mark_efficient(), will only make this easier to handle for all different frameworks, and not otherwise. The code will look much cleaner everywhere.. -- viresh
On 6/16/21 10:31 AM, Viresh Kumar wrote: > On 16-06-21, 10:03, Lukasz Luba wrote: >> On 6/16/21 8:35 AM, Viresh Kumar wrote: >>> On 15-06-21, 18:15, Vincent Donnefort wrote: >>>> But if we sum-up: >>>> >>>> 1. em_dev_register_perf_domain() find inefficiencies >>>> >>>> 2. dev_pm_opp_of_register_em() apply EM inefficiencies into the OPP structures >>> >>> I was looking to add a new API to the OPP core >>> (dev_pm_opp_mark_inefficient()) to mark an OPP inefficient. And then >>> get it called from em_create_perf_table(). >>> >>> But I now see that EM core rather has callbacks to call into and with >> >> Exactly, that's what I was trying to stress. >> >>> that I think you should rather add another callback >>> (.mark_inefficient()) in struct em_data_callback, to set inefficient >>> frequencies. >> >> I disagree. That's why I prefer Vincent's approach in this patch set. >> This proposal would cause more mess. >> >> Vincent proposed a small and clean modification. This modification >> can be done easily in this cpufreq place because we have EM in >> device cpu struct. > > This may look clean to you, but not to me, sorry about that. > > Clean is not lesser number of lines for me, but rather having the > right ownership of such things. Some developers do like patches which removes more lines then adds ;) > > For example this patch: > > https://lore.kernel.org/linux-pm/1622804761-126737-6-git-send-email-vincent.donnefort@arm.com/ > > tries to add EM stuff in cpufreq core. Cpufreq core doesn't care about > EM and it shouldn't. And this piece of code doesn't belong here. > > Would you guys like to add a cpufreq specific call into the EM core? I > won't, that's not a place for cpufreq stuff. It is the EM core. I was > fine with not including OPP core into this, and I gave up that > argument earlier, but then we realized that the cpufreq core isn't > ready at the time we register with EM core. > > Honestly, OPP core looks to be a better place holder for such stuff. > This is exactly the purpose of the OPP core. Moreover, we can apply > the same logic to devfreq or other devices later, with or without EM > core. Again, OPP core fits better. > > The cpufreq core already has the relevant APIs in place to the OPP > core and this won't require a new API there. I don't see an API function in the OPP framework or a field in the OPP struct which gives information that this freq is inefficient. Thus, it will require new API changes: cpufreq --> OPP. > >> Let's don't over-engineering. The inefficient information is only valid >> for schedutil, thus IMHO it can live like this patch set made - in the >> cpufreq table. > > For now, yes. There is no guarantee though that we won't have more in > future. And there won't be in near future. We don't build massive interfaces because there *might* be potential *oneday*. Even for this idea, it was a massive work to do the research and prove it that this is worth to put mainline so all vendors will get it. The GPUs are slightly different beasts and they have different workloads (not util + SchedUtil driven AFAIK). > >> Compare the v1 (I still don't understand why it was blocked), > > IIRC, it required more work to be done in the hotpath, i.e. traversing > the freq list twice. In v1 there was LUT. IMHO we have too easily gave and said: 'Remove the Look-up table as the numbers weren't strong enough to justify the implementation.' But it had other benefits, which are now pointed. There was different issue, which we could fix now. With this patch set [1] EAS could have the freq_max limit, which SchedUtil has in the hotpath. What could be the modified v1 [2]: - LUT which holds two IDs: efficient, inefficient, take one according to the clamp f_max - add new argument 'policy->max' to em_pd_get_efficient_freq() freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, policy->max); The problem was that EAS couldn't know the clamp freq_max, which shouldn't be the blocker now. > >> this v3 and your proposal. > > IMHO, adding such callbacks to the EM core, like .mark_efficient(), > will only make this easier to handle for all different frameworks, and > not otherwise. The code will look much cleaner everywhere.. > What about coming back to the slightly modified v1 idea? That was really self-contained modification for this inefficient opps heuristic. [1] https://lore.kernel.org/lkml/20210614185815.15136-1-lukasz.luba@arm.com/ [2] https://lore.kernel.org/lkml/1617901829-381963-2-git-send-email-vincent.donnefort@arm.com/
On 16-06-21, 11:33, Lukasz Luba wrote: > On 6/16/21 10:31 AM, Viresh Kumar wrote: > > On 16-06-21, 10:03, Lukasz Luba wrote: > > Clean is not lesser number of lines for me, but rather having the > > right ownership of such things. > > Some developers do like patches which removes more lines then adds ;) :) > > > > For example this patch: > > > > https://lore.kernel.org/linux-pm/1622804761-126737-6-git-send-email-vincent.donnefort@arm.com/ > > > > tries to add EM stuff in cpufreq core. Cpufreq core doesn't care about > > EM and it shouldn't. And this piece of code doesn't belong here. > > > > Would you guys like to add a cpufreq specific call into the EM core? I > > won't, that's not a place for cpufreq stuff. It is the EM core. I was > > fine with not including OPP core into this, and I gave up that > > argument earlier, but then we realized that the cpufreq core isn't > > ready at the time we register with EM core. > > > > Honestly, OPP core looks to be a better place holder for such stuff. > > This is exactly the purpose of the OPP core. Moreover, we can apply > > the same logic to devfreq or other devices later, with or without EM > > core. Again, OPP core fits better. > > > > The cpufreq core already has the relevant APIs in place to the OPP > > core and this won't require a new API there. > > I don't see an API function in the OPP framework or a field in the > OPP struct which gives information that this freq is inefficient. > Thus, it will require new API changes: cpufreq --> OPP. dev_pm_opp_init_cpufreq_table() is all we need here, we just need to change it to update one more field in the cpufreq table's entries. > > > > > Let's don't over-engineering. The inefficient information is only valid > > > for schedutil, thus IMHO it can live like this patch set made - in the > > > cpufreq table. > > > > For now, yes. There is no guarantee though that we won't have more in > > future. > > And there won't be in near future. We don't build massive interfaces > because there *might* be potential *oneday*. Yes, true. > Even for this idea, it was a massive work to do the research and prove > it that this is worth to put mainline so all vendors will get it. > > The GPUs are slightly different beasts and they have different > workloads (not util + SchedUtil driven AFAIK). Right, but even if there is a single user for this, I think getting this through the right layers is a more cleaner solution. > In v1 there was LUT. Oops, yes, I started looking from V2 and not V1. > IMHO we have too easily gave and said: > 'Remove the Look-up table as the numbers weren't strong enough to justify > the implementation.' > But it had other benefits, which are now pointed. > > There was different issue, which we could fix now. > With this patch set [1] EAS could have the freq_max limit, which > SchedUtil has in the hotpath. > > What could be the modified v1 [2]: > - LUT which holds two IDs: efficient, inefficient, take one > according to the clamp f_max > - add new argument 'policy->max' to em_pd_get_efficient_freq() > > freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, policy->max); > > The problem was that EAS couldn't know the clamp freq_max, > which shouldn't be the blocker now. If you can do that without adding any EM specific stuff in the cpufreq core, I will mostly be fine. But honestly speaking, creating more data structures to keep related information doesn't scale well. We already have so many tables for keeping freq/voltage pairs, OPP, cpufreq, EM. You tried to add one more in EM I think V1, not sure. It is always better to consolidate and we almost reached to a point where that could have been done very easily. I understand that you didn't want to touch so many different parts, but anyway.. > > > this v3 and your proposal. > > > > IMHO, adding such callbacks to the EM core, like .mark_efficient(), > > will only make this easier to handle for all different frameworks, and > > not otherwise. The code will look much cleaner everywhere.. > > > > What about coming back to the slightly modified v1 idea? > That was really self-contained modification for this > inefficient opps heuristic. I am not sure if I really understand what that would be, but again adding another table is going to create more problems then it should. Anyway, that's my view, which can be wrong as well. Rafael: You have any suggestions here ? -- viresh
On 6/16/21 11:53 AM, Viresh Kumar wrote: > On 16-06-21, 11:33, Lukasz Luba wrote: >> On 6/16/21 10:31 AM, Viresh Kumar wrote: >>> On 16-06-21, 10:03, Lukasz Luba wrote: >>> Clean is not lesser number of lines for me, but rather having the >>> right ownership of such things. >> >> Some developers do like patches which removes more lines then adds ;) > > :) > [snip] >> >> What could be the modified v1 [2]: >> - LUT which holds two IDs: efficient, inefficient, take one >> according to the clamp f_max >> - add new argument 'policy->max' to em_pd_get_efficient_freq() >> >> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, policy->max); >> >> The problem was that EAS couldn't know the clamp freq_max, >> which shouldn't be the blocker now. > > If you can do that without adding any EM specific stuff in the cpufreq > core, I will mostly be fine. > > But honestly speaking, creating more data structures to keep related > information doesn't scale well. > > We already have so many tables for keeping freq/voltage pairs, OPP, > cpufreq, EM. You tried to add one more in EM I think V1, not sure. > > It is always better to consolidate and we almost reached to a point > where that could have been done very easily. I understand that you > didn't want to touch so many different parts, but anyway.. Yes, I don't want to touch some many generic parts because they are intended to be generic. This heuristic is only for EM platforms, which are arm, arm64 battery powered (not servers or other). Thus, I wanted to keep it locally. The cost of EM extra structures (the LUT) will only be for platforms for which EM discovers that they have inefficient performance levels. The code would even not be compiled in for x86, ppc, etc, in hot path. > >>>> this v3 and your proposal. >>> >>> IMHO, adding such callbacks to the EM core, like .mark_efficient(), >>> will only make this easier to handle for all different frameworks, and >>> not otherwise. The code will look much cleaner everywhere.. >>> >> >> What about coming back to the slightly modified v1 idea? >> That was really self-contained modification for this >> inefficient opps heuristic. > > I am not sure if I really understand what that would be, but again > adding another table is going to create more problems then it should. IMHO your proposals are more invasive for the generic code, while not always being even used. Plenty of architectures don't even set EM, even arm64 for servers doesn't use it. You and Rafael would have to maintain these modifications in generic code. It might be hard to remove it. While I recommend to keep this heuristic feature inside the EM and we will maintain it. If we decide after a few years that new arm64 platforms use some smarter FW for performance level, we might just disable this heuristic. > > Anyway, that's my view, which can be wrong as well. > > Rafael: You have any suggestions here ? > I would also like to hear Rafael's opinion :)
On Tuesday 15 Jun 2021 at 15:47:06 (+0530), Viresh Kumar wrote: > On 15-06-21, 09:44, Vincent Donnefort wrote: > > The cpufreq_policy is accessed through percpu data. I originally tried to get it > > with cpufreq_cpu_get_raw(). But when we init the cpufreq driver (and by > > extension when we call em_dev_register_perf_domain()), the percpu data isn't > > updated yet. > > Right. > > > I guess the only way at that moment would be to propagate the cpufreq_policy > > pointer through the functions parameters, which is a bit cumbersome, especially > > as the EM can be used with devfreq as well and that em_dev_register_perf_domain() > > can be called from dev_pm_opp_of_register_em() > > > > Would we be ok with that? > > No. > > You only talked about freq_table earlier and that's all I checked :) > > > I could use em_data_callback as well ... but that doesn't change the fact some > > registration is going through dev_pm_opp_of_register_em() which has no knowledge > > about the cpufreq_policy. Doesn't look a better approach. > > The point is that I don't want cpufreq to carry this for users, we > have EM today, tomorrow we may want to mark a frequency as inefficient > from somewhere else. The call need to initiate from EM core. > > And this isn't a cpufreq only thing, but is going to be generic along > with other device types. > > This is exactly why I asked you earlier to play with OPP core for > this. That is the central place for data for all such users. The thing is, the very reason for the existence of PM_EM in the first place is to NOT depend on PM_OPP which was deemed too Arm-specific. So let's not create those dependencies now please. > If this information is present at the OPP table (somehow), then we can > just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core > as well. Honnestly, I don't think PM_OPP should have anything to do with this, for the reason mentionned above. I don't really see the problem with cpufreq core querying the EM data directly, we already have core subsystems doing that (the scheduler) so why would that be a problem? If the only concern is for non-CPU devices, all we'd need is a matching feature in devfreq core and everything should be good no? Thanks, Quentin
On 22-06-21, 09:01, Quentin Perret wrote: > On Tuesday 15 Jun 2021 at 15:47:06 (+0530), Viresh Kumar wrote: > > The point is that I don't want cpufreq to carry this for users, we > > have EM today, tomorrow we may want to mark a frequency as inefficient > > from somewhere else. The call need to initiate from EM core. > > > > And this isn't a cpufreq only thing, but is going to be generic along > > with other device types. > > > > This is exactly why I asked you earlier to play with OPP core for > > this. That is the central place for data for all such users. > > The thing is, the very reason for the existence of PM_EM in the first > place is to NOT depend on PM_OPP which was deemed too Arm-specific. So > let's not create those dependencies now please. I am not asking to create one here. What I am saying is that users of EM, which eventually call em_dev_register_perf_domain() have the necessary details about about the performance states (like voltage/freq pairs, or power) and the EM core should talk to those users directly. The struct em_data_callback is already there for this reason so we can keep EM core independent of its users. I suggested to add another callback there to mark a frequency inefficient, that's all. In the case of ARM platforms, that would automatically mean OPP core, and for others it can be anything that has this knowledge. > > If this information is present at the OPP table (somehow), then we can > > just fix dev_pm_opp_init_cpufreq_table() to set this for cpufreq core > > as well. > > Honnestly, I don't think PM_OPP should have anything to do with this, > for the reason mentionned above. > > I don't really see the problem with cpufreq core querying the EM data > directly, we already have core subsystems doing that (the scheduler) so > why would that be a problem? If EM core is going to be the only entity ever which can provide this information, then maybe we can get add some core support for this. But if it is just another driver/layer which can add this information, then it is always better to provide APIs which such users can call. And so I was looking for cpufreq to provide an API, which can be called from such users. The problem here happened because we registered with the EM core from policy's ->init() and we couldn't get back to cpufreq core from there as the policy is only half initialized. > If the only concern is for non-CPU devices, > all we'd need is a matching feature in devfreq core and everything > should be good no? -- viresh
Hi Rafael, This is a gentle ping. You have probably not seen this discussion thread. On 6/16/21 1:45 PM, Lukasz Luba wrote: > > > On 6/16/21 11:53 AM, Viresh Kumar wrote: >> On 16-06-21, 11:33, Lukasz Luba wrote: >>> On 6/16/21 10:31 AM, Viresh Kumar wrote: >>>> On 16-06-21, 10:03, Lukasz Luba wrote: >>>> Clean is not lesser number of lines for me, but rather having the >>>> right ownership of such things. >>> >>> Some developers do like patches which removes more lines then adds ;) >> >> :) >> > > [snip] > >>> >>> What could be the modified v1 [2]: >>> - LUT which holds two IDs: efficient, inefficient, take one >>> Â Â according to the clamp f_max >>> - add new argument 'policy->max' to em_pd_get_efficient_freq() >>> >>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, >>> policy->max); >>> >>> The problem was that EAS couldn't know the clamp freq_max, >>> which shouldn't be the blocker now. >> >> If you can do that without adding any EM specific stuff in the cpufreq >> core, I will mostly be fine. >> >> But honestly speaking, creating more data structures to keep related >> information doesn't scale well. >> >> We already have so many tables for keeping freq/voltage pairs, OPP, >> cpufreq, EM. You tried to add one more in EM I think V1, not sure. >> >> It is always better to consolidate and we almost reached to a point >> where that could have been done very easily. I understand that you >> didn't want to touch so many different parts, but anyway.. > > Yes, I don't want to touch some many generic parts because they > are intended to be generic. This heuristic is only for EM platforms, > which are arm, arm64 battery powered (not servers or other). > Thus, I wanted to keep it locally. The cost of EM extra structures > (the LUT) will only be for platforms for which EM discovers that > they have inefficient performance levels. > The code would even not be compiled in for x86, ppc, etc, in hot path. > >>>>> this v3 and your proposal. >>>> >>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(), >>>> will only make this easier to handle for all different frameworks, and >>>> not otherwise. The code will look much cleaner everywhere.. >>>> >>> >>> What about coming back to the slightly modified v1 idea? >>> That was really self-contained modification for this >>> inefficient opps heuristic. >> >> I am not sure if I really understand what that would be, but again >> adding another table is going to create more problems then it should. > > IMHO your proposals are more invasive for the generic code, while > not always being even used. Plenty of architectures don't even set EM, > even arm64 for servers doesn't use it. You and Rafael would have to > maintain these modifications in generic code. It might be hard to remove > it. While I recommend to keep this heuristic feature inside the EM and > we will maintain it. If we decide after a few years that new arm64 > platforms use some smarter FW for performance level, we might just > disable this heuristic. > >> >> Anyway, that's my view, which can be wrong as well. >> >> Rafael: You have any suggestions here ? >> > > I would also like to hear Rafael's opinion :) It would be great if you could have a look. I will be grateful for your time spend on it and opinion. Regards, Lukasz
On Fri, Jul 2, 2021 at 4:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote: > > Hi Rafael, > > This is a gentle ping. > You have probably not seen this discussion thread. I have looked at it briefly for a few times, but not too much into detail. > On 6/16/21 1:45 PM, Lukasz Luba wrote: > > > > > > On 6/16/21 11:53 AM, Viresh Kumar wrote: > >> On 16-06-21, 11:33, Lukasz Luba wrote: > >>> On 6/16/21 10:31 AM, Viresh Kumar wrote: > >>>> On 16-06-21, 10:03, Lukasz Luba wrote: > >>>> Clean is not lesser number of lines for me, but rather having the > >>>> right ownership of such things. > >>> > >>> Some developers do like patches which removes more lines then adds ;) > >> > >> :) > >> > > > > [snip] > > > >>> > >>> What could be the modified v1 [2]: > >>> - LUT which holds two IDs: efficient, inefficient, take one > >>> according to the clamp f_max > >>> - add new argument 'policy->max' to em_pd_get_efficient_freq() > >>> > >>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, > >>> policy->max); > >>> > >>> The problem was that EAS couldn't know the clamp freq_max, > >>> which shouldn't be the blocker now. > >> > >> If you can do that without adding any EM specific stuff in the cpufreq > >> core, I will mostly be fine. > >> > >> But honestly speaking, creating more data structures to keep related > >> information doesn't scale well. > >> > >> We already have so many tables for keeping freq/voltage pairs, OPP, > >> cpufreq, EM. You tried to add one more in EM I think V1, not sure. > >> > >> It is always better to consolidate and we almost reached to a point > >> where that could have been done very easily. I understand that you > >> didn't want to touch so many different parts, but anyway.. > > > > Yes, I don't want to touch some many generic parts because they > > are intended to be generic. This heuristic is only for EM platforms, > > which are arm, arm64 battery powered (not servers or other). > > Thus, I wanted to keep it locally. The cost of EM extra structures > > (the LUT) will only be for platforms for which EM discovers that > > they have inefficient performance levels. > > The code would even not be compiled in for x86, ppc, etc, in hot path. > > > >>>>> this v3 and your proposal. > >>>> > >>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(), > >>>> will only make this easier to handle for all different frameworks, and > >>>> not otherwise. The code will look much cleaner everywhere.. > >>>> > >>> > >>> What about coming back to the slightly modified v1 idea? > >>> That was really self-contained modification for this > >>> inefficient opps heuristic. > >> > >> I am not sure if I really understand what that would be, but again > >> adding another table is going to create more problems then it should. > > > > IMHO your proposals are more invasive for the generic code, while > > not always being even used. Plenty of architectures don't even set EM, > > even arm64 for servers doesn't use it. You and Rafael would have to > > maintain these modifications in generic code. It might be hard to remove > > it. While I recommend to keep this heuristic feature inside the EM and > > we will maintain it. If we decide after a few years that new arm64 > > platforms use some smarter FW for performance level, we might just > > disable this heuristic. > > > >> > >> Anyway, that's my view, which can be wrong as well. > >> > >> Rafael: You have any suggestions here ? > >> > > > > I would also like to hear Rafael's opinion :) > > It would be great if you could have a look. > I will be grateful for your time spend on it and opinion. First of all, IMO checking whether or not a given frequency is "efficient" doesn't belong to cpufreq governors. The governor knows what the min and max supported frequencies are and selects one from that range and note that it doesn't even check whether or not the selected frequency is present in the frequency table. That part belongs to the driver or the general frequency table handling in the cpufreq core. So the governor doesn't care and it shouldn't care IMO. Besides, in the cpufreq_driver_fast_switch() case the driver's ->fast_switch() callback is entirely responsible for applying the selected frequency, so I'm not sure how this "efficient" thing is going to work then? Anyway, since we are talking about frequency tables, that would be the __cpufreq_driver_target() case only and specifically in the __target_index() case only (note how far away this is from the governor). Now, you may think about modifying cpufreq_frequency_table_target() to skip "inefficient" frequencies, but then the question is why those frequencies need to be there in the frequency table in the first place?
On Fri, Jul 2, 2021 at 5:46 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Fri, Jul 2, 2021 at 4:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote: > > > > Hi Rafael, > > > > This is a gentle ping. > > You have probably not seen this discussion thread. > > I have looked at it briefly for a few times, but not too much into detail. > > > On 6/16/21 1:45 PM, Lukasz Luba wrote: > > > > > > > > > On 6/16/21 11:53 AM, Viresh Kumar wrote: > > >> On 16-06-21, 11:33, Lukasz Luba wrote: > > >>> On 6/16/21 10:31 AM, Viresh Kumar wrote: > > >>>> On 16-06-21, 10:03, Lukasz Luba wrote: > > >>>> Clean is not lesser number of lines for me, but rather having the > > >>>> right ownership of such things. > > >>> > > >>> Some developers do like patches which removes more lines then adds ;) > > >> > > >> :) > > >> > > > > > > [snip] > > > > > >>> > > >>> What could be the modified v1 [2]: > > >>> - LUT which holds two IDs: efficient, inefficient, take one > > >>> according to the clamp f_max > > >>> - add new argument 'policy->max' to em_pd_get_efficient_freq() > > >>> > > >>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, > > >>> policy->max); > > >>> > > >>> The problem was that EAS couldn't know the clamp freq_max, > > >>> which shouldn't be the blocker now. > > >> > > >> If you can do that without adding any EM specific stuff in the cpufreq > > >> core, I will mostly be fine. > > >> > > >> But honestly speaking, creating more data structures to keep related > > >> information doesn't scale well. > > >> > > >> We already have so many tables for keeping freq/voltage pairs, OPP, > > >> cpufreq, EM. You tried to add one more in EM I think V1, not sure. > > >> > > >> It is always better to consolidate and we almost reached to a point > > >> where that could have been done very easily. I understand that you > > >> didn't want to touch so many different parts, but anyway.. > > > > > > Yes, I don't want to touch some many generic parts because they > > > are intended to be generic. This heuristic is only for EM platforms, > > > which are arm, arm64 battery powered (not servers or other). > > > Thus, I wanted to keep it locally. The cost of EM extra structures > > > (the LUT) will only be for platforms for which EM discovers that > > > they have inefficient performance levels. > > > The code would even not be compiled in for x86, ppc, etc, in hot path. > > > > > >>>>> this v3 and your proposal. > > >>>> > > >>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(), > > >>>> will only make this easier to handle for all different frameworks, and > > >>>> not otherwise. The code will look much cleaner everywhere.. > > >>>> > > >>> > > >>> What about coming back to the slightly modified v1 idea? > > >>> That was really self-contained modification for this > > >>> inefficient opps heuristic. > > >> > > >> I am not sure if I really understand what that would be, but again > > >> adding another table is going to create more problems then it should. > > > > > > IMHO your proposals are more invasive for the generic code, while > > > not always being even used. Plenty of architectures don't even set EM, > > > even arm64 for servers doesn't use it. You and Rafael would have to > > > maintain these modifications in generic code. It might be hard to remove > > > it. While I recommend to keep this heuristic feature inside the EM and > > > we will maintain it. If we decide after a few years that new arm64 > > > platforms use some smarter FW for performance level, we might just > > > disable this heuristic. > > > > > >> > > >> Anyway, that's my view, which can be wrong as well. > > >> > > >> Rafael: You have any suggestions here ? > > >> > > > > > > I would also like to hear Rafael's opinion :) > > > > It would be great if you could have a look. > > I will be grateful for your time spend on it and opinion. > > First of all, IMO checking whether or not a given frequency is > "efficient" doesn't belong to cpufreq governors. The governor knows > what the min and max supported frequencies are and selects one from > that range and note that it doesn't even check whether or not the > selected frequency is present in the frequency table. That part > belongs to the driver or the general frequency table handling in the > cpufreq core. > > So the governor doesn't care and it shouldn't care IMO. > > Besides, in the cpufreq_driver_fast_switch() case the driver's > ->fast_switch() callback is entirely responsible for applying the > selected frequency, so I'm not sure how this "efficient" thing is > going to work then? > > Anyway, since we are talking about frequency tables, that would be the > __cpufreq_driver_target() case only and specifically in the > __target_index() case only (note how far away this is from the > governor). > > Now, you may think about modifying cpufreq_frequency_table_target() to > skip "inefficient" frequencies, but then the question is why those > frequencies need to be there in the frequency table in the first > place? I'm guessing that the problem is that cpufreq_cooling works by using freq_qos_update_request() to update the max frequency limit and if that is in effect you'd rather use the inefficient frequencies, whereas when the governor selects an inefficient frequency below the policy limit, you'd rather use a higher-but-efficient frequency instead (within the policy limit). Am I guessing correctly?
On 7/2/21 5:04 PM, Rafael J. Wysocki wrote: > On Fri, Jul 2, 2021 at 5:46 PM Rafael J. Wysocki <rafael@kernel.org> wrote: >> >> On Fri, Jul 2, 2021 at 4:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote: >>> >>> Hi Rafael, >>> >>> This is a gentle ping. >>> You have probably not seen this discussion thread. >> >> I have looked at it briefly for a few times, but not too much into detail. >> >>> On 6/16/21 1:45 PM, Lukasz Luba wrote: >>>> >>>> >>>> On 6/16/21 11:53 AM, Viresh Kumar wrote: >>>>> On 16-06-21, 11:33, Lukasz Luba wrote: >>>>>> On 6/16/21 10:31 AM, Viresh Kumar wrote: >>>>>>> On 16-06-21, 10:03, Lukasz Luba wrote: >>>>>>> Clean is not lesser number of lines for me, but rather having the >>>>>>> right ownership of such things. >>>>>> >>>>>> Some developers do like patches which removes more lines then adds ;) >>>>> >>>>> :) >>>>> >>>> >>>> [snip] >>>> >>>>>> >>>>>> What could be the modified v1 [2]: >>>>>> - LUT which holds two IDs: efficient, inefficient, take one >>>>>> according to the clamp f_max >>>>>> - add new argument 'policy->max' to em_pd_get_efficient_freq() >>>>>> >>>>>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, >>>>>> policy->max); >>>>>> >>>>>> The problem was that EAS couldn't know the clamp freq_max, >>>>>> which shouldn't be the blocker now. >>>>> >>>>> If you can do that without adding any EM specific stuff in the cpufreq >>>>> core, I will mostly be fine. >>>>> >>>>> But honestly speaking, creating more data structures to keep related >>>>> information doesn't scale well. >>>>> >>>>> We already have so many tables for keeping freq/voltage pairs, OPP, >>>>> cpufreq, EM. You tried to add one more in EM I think V1, not sure. >>>>> >>>>> It is always better to consolidate and we almost reached to a point >>>>> where that could have been done very easily. I understand that you >>>>> didn't want to touch so many different parts, but anyway.. >>>> >>>> Yes, I don't want to touch some many generic parts because they >>>> are intended to be generic. This heuristic is only for EM platforms, >>>> which are arm, arm64 battery powered (not servers or other). >>>> Thus, I wanted to keep it locally. The cost of EM extra structures >>>> (the LUT) will only be for platforms for which EM discovers that >>>> they have inefficient performance levels. >>>> The code would even not be compiled in for x86, ppc, etc, in hot path. >>>> >>>>>>>> this v3 and your proposal. >>>>>>> >>>>>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(), >>>>>>> will only make this easier to handle for all different frameworks, and >>>>>>> not otherwise. The code will look much cleaner everywhere.. >>>>>>> >>>>>> >>>>>> What about coming back to the slightly modified v1 idea? >>>>>> That was really self-contained modification for this >>>>>> inefficient opps heuristic. >>>>> >>>>> I am not sure if I really understand what that would be, but again >>>>> adding another table is going to create more problems then it should. >>>> >>>> IMHO your proposals are more invasive for the generic code, while >>>> not always being even used. Plenty of architectures don't even set EM, >>>> even arm64 for servers doesn't use it. You and Rafael would have to >>>> maintain these modifications in generic code. It might be hard to remove >>>> it. While I recommend to keep this heuristic feature inside the EM and >>>> we will maintain it. If we decide after a few years that new arm64 >>>> platforms use some smarter FW for performance level, we might just >>>> disable this heuristic. >>>> >>>>> >>>>> Anyway, that's my view, which can be wrong as well. >>>>> >>>>> Rafael: You have any suggestions here ? >>>>> >>>> >>>> I would also like to hear Rafael's opinion :) >>> >>> It would be great if you could have a look. >>> I will be grateful for your time spend on it and opinion. >> >> First of all, IMO checking whether or not a given frequency is >> "efficient" doesn't belong to cpufreq governors. The governor knows >> what the min and max supported frequencies are and selects one from >> that range and note that it doesn't even check whether or not the >> selected frequency is present in the frequency table. That part >> belongs to the driver or the general frequency table handling in the >> cpufreq core. >> >> So the governor doesn't care and it shouldn't care IMO. >> >> Besides, in the cpufreq_driver_fast_switch() case the driver's >> ->fast_switch() callback is entirely responsible for applying the >> selected frequency, so I'm not sure how this "efficient" thing is >> going to work then? >> >> Anyway, since we are talking about frequency tables, that would be the >> __cpufreq_driver_target() case only and specifically in the >> __target_index() case only (note how far away this is from the >> governor). >> >> Now, you may think about modifying cpufreq_frequency_table_target() to >> skip "inefficient" frequencies, but then the question is why those >> frequencies need to be there in the frequency table in the first >> place? > > I'm guessing that the problem is that cpufreq_cooling works by using > freq_qos_update_request() to update the max frequency limit and if > that is in effect you'd rather use the inefficient frequencies, > whereas when the governor selects an inefficient frequency below the > policy limit, you'd rather use a higher-but-efficient frequency > instead (within the policy limit). > > Am I guessing correctly? > Yes, correct. Thermal would use all (efficient + inefficient), but we in cpufreq governor would like to pick if possible the efficient one (below the thermal limit).
[...] Hi Rafael, > > > > It would be great if you could have a look. > > I will be grateful for your time spend on it and opinion. > > First of all, IMO checking whether or not a given frequency is > "efficient" doesn't belong to cpufreq governors. The governor knows > what the min and max supported frequencies are and selects one from > that range and note that it doesn't even check whether or not the > selected frequency is present in the frequency table. That part > belongs to the driver or the general frequency table handling in the > cpufreq core. > > So the governor doesn't care and it shouldn't care IMO. A governor such as userspace should probably be able select any frequency with no regard to efficiency. > > Besides, in the cpufreq_driver_fast_switch() case the driver's > ->fast_switch() callback is entirely responsible for applying the > selected frequency, so I'm not sure how this "efficient" thing is > going to work then? This shouldn't be a problem if a governor that leverages inefficient OPPs resolves an inefficient one to an efficient one. The target_freq would simply point to a frequency known as efficient. > Anyway, since we are talking about frequency tables, that would be the > __cpufreq_driver_target() case only and specifically in the > __target_index() case only (note how far away this is from the > governor). > > Now, you may think about modifying cpufreq_frequency_table_target() to > skip "inefficient" frequencies, but then the question is why those > frequencies need to be there in the frequency table in the first > place? Cpufreq_cooling needs those frequencies, even inefficients, also there's probably no reason to hide them from the userspace. -- Vincent
On Fri, Jul 2, 2021 at 6:14 PM Vincent Donnefort <vincent.donnefort@arm.com> wrote: > > [...] > > Hi Rafael, > > > > > > > It would be great if you could have a look. > > > I will be grateful for your time spend on it and opinion. > > > > First of all, IMO checking whether or not a given frequency is > > "efficient" doesn't belong to cpufreq governors. The governor knows > > what the min and max supported frequencies are and selects one from > > that range and note that it doesn't even check whether or not the > > selected frequency is present in the frequency table. That part > > belongs to the driver or the general frequency table handling in the > > cpufreq core. > > > > So the governor doesn't care and it shouldn't care IMO. > > A governor such as userspace should probably be able select any frequency > with no regard to efficiency. You seem to be arguing that the decision whether or not to use inefficient frequencies needs to be made at the governor level, because the driver must use whatever frequency is selected by the userspace governor, since the user process driving it may be assuming that to be the case. That's why you want to hack schedutil to filter out the inefficient frequencies, so you don't need to worry about them below that level. That would be putting driver knowledge into the governor which cpufreq was specifically designed to avoid. There is a way to do that, though, which is to put a governor into the driver and use ->setoplicy(), but the good news is that you don't even need to do that. So there are those CPUFREQ_RELATION_* symbols used by governors to tell drivers what to do next (I wonder what they would be good for after your changes for that matter) and you can add more of them. For example, you can define a CPUFREQ_RELATION_EFFICIENT bit to tell the driver to avoid inefficient frequencies and pass that from schedutil. Besides, I'm not sure why you still want to use inefficient frequencies if they are selected by the ondemand or conservative governors. > > > > Besides, in the cpufreq_driver_fast_switch() case the driver's > > ->fast_switch() callback is entirely responsible for applying the > > selected frequency, so I'm not sure how this "efficient" thing is > > going to work then? > > This shouldn't be a problem if a governor that leverages inefficient OPPs > resolves an inefficient one to an efficient one. The target_freq would simply > point to a frequency known as efficient. No, that doesn't help. Moreover, it is harmful, because it changes the interface between this particular governor and drivers in a subtle and potentially confusing way. Namely, drivers now can expect that the frequency passed to ->fast_switch() is the one corresponding to the current utilization demand as seen by the governor, subject to the policy limits, not the-closest-efficient-one-present-in-the-frequency-table. > > Anyway, since we are talking about frequency tables, that would be the > > __cpufreq_driver_target() case only and specifically in the > > __target_index() case only (note how far away this is from the > > governor). > > > > Now, you may think about modifying cpufreq_frequency_table_target() to > > skip "inefficient" frequencies, but then the question is why those > > frequencies need to be there in the frequency table in the first > > place? > > Cpufreq_cooling needs those frequencies, even inefficients, I've already figured that out. > also there's probably no reason to hide them from the userspace.
On Fri, Jul 2, 2021 at 6:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote: > > > > On 7/2/21 5:04 PM, Rafael J. Wysocki wrote: > > On Fri, Jul 2, 2021 at 5:46 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > >> > >> On Fri, Jul 2, 2021 at 4:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote: > >>> > >>> Hi Rafael, > >>> > >>> This is a gentle ping. > >>> You have probably not seen this discussion thread. > >> > >> I have looked at it briefly for a few times, but not too much into detail. > >> > >>> On 6/16/21 1:45 PM, Lukasz Luba wrote: > >>>> > >>>> > >>>> On 6/16/21 11:53 AM, Viresh Kumar wrote: > >>>>> On 16-06-21, 11:33, Lukasz Luba wrote: > >>>>>> On 6/16/21 10:31 AM, Viresh Kumar wrote: > >>>>>>> On 16-06-21, 10:03, Lukasz Luba wrote: > >>>>>>> Clean is not lesser number of lines for me, but rather having the > >>>>>>> right ownership of such things. > >>>>>> > >>>>>> Some developers do like patches which removes more lines then adds ;) > >>>>> > >>>>> :) > >>>>> > >>>> > >>>> [snip] > >>>> > >>>>>> > >>>>>> What could be the modified v1 [2]: > >>>>>> - LUT which holds two IDs: efficient, inefficient, take one > >>>>>> according to the clamp f_max > >>>>>> - add new argument 'policy->max' to em_pd_get_efficient_freq() > >>>>>> > >>>>>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, > >>>>>> policy->max); > >>>>>> > >>>>>> The problem was that EAS couldn't know the clamp freq_max, > >>>>>> which shouldn't be the blocker now. > >>>>> > >>>>> If you can do that without adding any EM specific stuff in the cpufreq > >>>>> core, I will mostly be fine. > >>>>> > >>>>> But honestly speaking, creating more data structures to keep related > >>>>> information doesn't scale well. > >>>>> > >>>>> We already have so many tables for keeping freq/voltage pairs, OPP, > >>>>> cpufreq, EM. You tried to add one more in EM I think V1, not sure. > >>>>> > >>>>> It is always better to consolidate and we almost reached to a point > >>>>> where that could have been done very easily. I understand that you > >>>>> didn't want to touch so many different parts, but anyway.. > >>>> > >>>> Yes, I don't want to touch some many generic parts because they > >>>> are intended to be generic. This heuristic is only for EM platforms, > >>>> which are arm, arm64 battery powered (not servers or other). > >>>> Thus, I wanted to keep it locally. The cost of EM extra structures > >>>> (the LUT) will only be for platforms for which EM discovers that > >>>> they have inefficient performance levels. > >>>> The code would even not be compiled in for x86, ppc, etc, in hot path. > >>>> > >>>>>>>> this v3 and your proposal. > >>>>>>> > >>>>>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(), > >>>>>>> will only make this easier to handle for all different frameworks, and > >>>>>>> not otherwise. The code will look much cleaner everywhere.. > >>>>>>> > >>>>>> > >>>>>> What about coming back to the slightly modified v1 idea? > >>>>>> That was really self-contained modification for this > >>>>>> inefficient opps heuristic. > >>>>> > >>>>> I am not sure if I really understand what that would be, but again > >>>>> adding another table is going to create more problems then it should. > >>>> > >>>> IMHO your proposals are more invasive for the generic code, while > >>>> not always being even used. Plenty of architectures don't even set EM, > >>>> even arm64 for servers doesn't use it. You and Rafael would have to > >>>> maintain these modifications in generic code. It might be hard to remove > >>>> it. While I recommend to keep this heuristic feature inside the EM and > >>>> we will maintain it. If we decide after a few years that new arm64 > >>>> platforms use some smarter FW for performance level, we might just > >>>> disable this heuristic. > >>>> > >>>>> > >>>>> Anyway, that's my view, which can be wrong as well. > >>>>> > >>>>> Rafael: You have any suggestions here ? > >>>>> > >>>> > >>>> I would also like to hear Rafael's opinion :) > >>> > >>> It would be great if you could have a look. > >>> I will be grateful for your time spend on it and opinion. > >> > >> First of all, IMO checking whether or not a given frequency is > >> "efficient" doesn't belong to cpufreq governors. The governor knows > >> what the min and max supported frequencies are and selects one from > >> that range and note that it doesn't even check whether or not the > >> selected frequency is present in the frequency table. That part > >> belongs to the driver or the general frequency table handling in the > >> cpufreq core. > >> > >> So the governor doesn't care and it shouldn't care IMO. > >> > >> Besides, in the cpufreq_driver_fast_switch() case the driver's > >> ->fast_switch() callback is entirely responsible for applying the > >> selected frequency, so I'm not sure how this "efficient" thing is > >> going to work then? > >> > >> Anyway, since we are talking about frequency tables, that would be the > >> __cpufreq_driver_target() case only and specifically in the > >> __target_index() case only (note how far away this is from the > >> governor). > >> > >> Now, you may think about modifying cpufreq_frequency_table_target() to > >> skip "inefficient" frequencies, but then the question is why those > >> frequencies need to be there in the frequency table in the first > >> place? > > > > I'm guessing that the problem is that cpufreq_cooling works by using > > freq_qos_update_request() to update the max frequency limit and if > > that is in effect you'd rather use the inefficient frequencies, > > whereas when the governor selects an inefficient frequency below the > > policy limit, you'd rather use a higher-but-efficient frequency > > instead (within the policy limit). > > > > Am I guessing correctly? > > > > Yes, correct. Thermal would use all (efficient + inefficient), but > we in cpufreq governor would like to pick if possible the efficient > one (below the thermal limit). To address that, you need to pass more information from schedutil to __cpufreq_driver_target() that down the road can be used by cpufreq_frequency_table_target() to decide whether or not to skip the inefficient frequencies. For example, you can define CPUFREQ_RELATION_EFFICIENT and pass it from schedutil to __cpufreq_driver_target() in the "relation" argument, and clear it if the target frequency is above the max policy limit, or if ->target() is to be called.
On 7/2/21 6:53 PM, Rafael J. Wysocki wrote: > On Fri, Jul 2, 2021 at 6:08 PM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> >> >> On 7/2/21 5:04 PM, Rafael J. Wysocki wrote: >>> On Fri, Jul 2, 2021 at 5:46 PM Rafael J. Wysocki <rafael@kernel.org> wrote: >>>> >>>> On Fri, Jul 2, 2021 at 4:21 PM Lukasz Luba <lukasz.luba@arm.com> wrote: >>>>> >>>>> Hi Rafael, >>>>> >>>>> This is a gentle ping. >>>>> You have probably not seen this discussion thread. >>>> >>>> I have looked at it briefly for a few times, but not too much into detail. >>>> >>>>> On 6/16/21 1:45 PM, Lukasz Luba wrote: >>>>>> >>>>>> >>>>>> On 6/16/21 11:53 AM, Viresh Kumar wrote: >>>>>>> On 16-06-21, 11:33, Lukasz Luba wrote: >>>>>>>> On 6/16/21 10:31 AM, Viresh Kumar wrote: >>>>>>>>> On 16-06-21, 10:03, Lukasz Luba wrote: >>>>>>>>> Clean is not lesser number of lines for me, but rather having the >>>>>>>>> right ownership of such things. >>>>>>>> >>>>>>>> Some developers do like patches which removes more lines then adds ;) >>>>>>> >>>>>>> :) >>>>>>> >>>>>> >>>>>> [snip] >>>>>> >>>>>>>> >>>>>>>> What could be the modified v1 [2]: >>>>>>>> - LUT which holds two IDs: efficient, inefficient, take one >>>>>>>> according to the clamp f_max >>>>>>>> - add new argument 'policy->max' to em_pd_get_efficient_freq() >>>>>>>> >>>>>>>> freq = em_pd_get_efficient_freq(em_cpu_get(policy->cpu), freq, >>>>>>>> policy->max); >>>>>>>> >>>>>>>> The problem was that EAS couldn't know the clamp freq_max, >>>>>>>> which shouldn't be the blocker now. >>>>>>> >>>>>>> If you can do that without adding any EM specific stuff in the cpufreq >>>>>>> core, I will mostly be fine. >>>>>>> >>>>>>> But honestly speaking, creating more data structures to keep related >>>>>>> information doesn't scale well. >>>>>>> >>>>>>> We already have so many tables for keeping freq/voltage pairs, OPP, >>>>>>> cpufreq, EM. You tried to add one more in EM I think V1, not sure. >>>>>>> >>>>>>> It is always better to consolidate and we almost reached to a point >>>>>>> where that could have been done very easily. I understand that you >>>>>>> didn't want to touch so many different parts, but anyway.. >>>>>> >>>>>> Yes, I don't want to touch some many generic parts because they >>>>>> are intended to be generic. This heuristic is only for EM platforms, >>>>>> which are arm, arm64 battery powered (not servers or other). >>>>>> Thus, I wanted to keep it locally. The cost of EM extra structures >>>>>> (the LUT) will only be for platforms for which EM discovers that >>>>>> they have inefficient performance levels. >>>>>> The code would even not be compiled in for x86, ppc, etc, in hot path. >>>>>> >>>>>>>>>> this v3 and your proposal. >>>>>>>>> >>>>>>>>> IMHO, adding such callbacks to the EM core, like .mark_efficient(), >>>>>>>>> will only make this easier to handle for all different frameworks, and >>>>>>>>> not otherwise. The code will look much cleaner everywhere.. >>>>>>>>> >>>>>>>> >>>>>>>> What about coming back to the slightly modified v1 idea? >>>>>>>> That was really self-contained modification for this >>>>>>>> inefficient opps heuristic. >>>>>>> >>>>>>> I am not sure if I really understand what that would be, but again >>>>>>> adding another table is going to create more problems then it should. >>>>>> >>>>>> IMHO your proposals are more invasive for the generic code, while >>>>>> not always being even used. Plenty of architectures don't even set EM, >>>>>> even arm64 for servers doesn't use it. You and Rafael would have to >>>>>> maintain these modifications in generic code. It might be hard to remove >>>>>> it. While I recommend to keep this heuristic feature inside the EM and >>>>>> we will maintain it. If we decide after a few years that new arm64 >>>>>> platforms use some smarter FW for performance level, we might just >>>>>> disable this heuristic. >>>>>> >>>>>>> >>>>>>> Anyway, that's my view, which can be wrong as well. >>>>>>> >>>>>>> Rafael: You have any suggestions here ? >>>>>>> >>>>>> >>>>>> I would also like to hear Rafael's opinion :) >>>>> >>>>> It would be great if you could have a look. >>>>> I will be grateful for your time spend on it and opinion. >>>> >>>> First of all, IMO checking whether or not a given frequency is >>>> "efficient" doesn't belong to cpufreq governors. The governor knows >>>> what the min and max supported frequencies are and selects one from >>>> that range and note that it doesn't even check whether or not the >>>> selected frequency is present in the frequency table. That part >>>> belongs to the driver or the general frequency table handling in the >>>> cpufreq core. >>>> >>>> So the governor doesn't care and it shouldn't care IMO. >>>> >>>> Besides, in the cpufreq_driver_fast_switch() case the driver's >>>> ->fast_switch() callback is entirely responsible for applying the >>>> selected frequency, so I'm not sure how this "efficient" thing is >>>> going to work then? >>>> >>>> Anyway, since we are talking about frequency tables, that would be the >>>> __cpufreq_driver_target() case only and specifically in the >>>> __target_index() case only (note how far away this is from the >>>> governor). >>>> >>>> Now, you may think about modifying cpufreq_frequency_table_target() to >>>> skip "inefficient" frequencies, but then the question is why those >>>> frequencies need to be there in the frequency table in the first >>>> place? >>> >>> I'm guessing that the problem is that cpufreq_cooling works by using >>> freq_qos_update_request() to update the max frequency limit and if >>> that is in effect you'd rather use the inefficient frequencies, >>> whereas when the governor selects an inefficient frequency below the >>> policy limit, you'd rather use a higher-but-efficient frequency >>> instead (within the policy limit). >>> >>> Am I guessing correctly? >>> >> >> Yes, correct. Thermal would use all (efficient + inefficient), but >> we in cpufreq governor would like to pick if possible the efficient >> one (below the thermal limit). > > To address that, you need to pass more information from schedutil to > __cpufreq_driver_target() that down the road can be used by > cpufreq_frequency_table_target() to decide whether or not to skip the > inefficient frequencies. > > For example, you can define CPUFREQ_RELATION_EFFICIENT and pass it > from schedutil to __cpufreq_driver_target() in the "relation" > argument, and clear it if the target frequency is above the max policy > limit, or if ->target() is to be called. > Thank you Rafael for valuable comments. We will have to experiment with that option and come back with implementation based on it. Regards, Lukasz
[...] > > > > > > I'm guessing that the problem is that cpufreq_cooling works by using > > > freq_qos_update_request() to update the max frequency limit and if > > > that is in effect you'd rather use the inefficient frequencies, > > > whereas when the governor selects an inefficient frequency below the > > > policy limit, you'd rather use a higher-but-efficient frequency > > > instead (within the policy limit). > > > > > > Am I guessing correctly? > > > > > > > Yes, correct. Thermal would use all (efficient + inefficient), but > > we in cpufreq governor would like to pick if possible the efficient > > one (below the thermal limit). > > To address that, you need to pass more information from schedutil to > __cpufreq_driver_target() that down the road can be used by > cpufreq_frequency_table_target() to decide whether or not to skip the > inefficient frequencies. > > For example, you can define CPUFREQ_RELATION_EFFICIENT and pass it > from schedutil to __cpufreq_driver_target() in the "relation" > argument, and clear it if the target frequency is above the max policy > limit, or if ->target() is to be called. What about a cpufreq_policy option that if sets would make cpufreq_frequency_table_target() skip inefficient OPPs while staying within the limit of max policy? Each governor could decide to set it or not, but it would hide the efficiency resolution to the governor and allow drivers that implements ->target() to also implements support for inefficient OPPs. That flag could be set according to a new cpufreq_governor flag CPUFREQ_GOV_SKIP_INEFFICIENCIES? That could though modify behaviors like powersave_bias from ondemand. But if a frequency is inefficient, there's probably no power saving anyway. -- Vincent
On Fri, Jul 2, 2021 at 9:17 PM Vincent Donnefort <vincent.donnefort@arm.com> wrote: > > [...] > > > > > > > > > I'm guessing that the problem is that cpufreq_cooling works by using > > > > freq_qos_update_request() to update the max frequency limit and if > > > > that is in effect you'd rather use the inefficient frequencies, > > > > whereas when the governor selects an inefficient frequency below the > > > > policy limit, you'd rather use a higher-but-efficient frequency > > > > instead (within the policy limit). > > > > > > > > Am I guessing correctly? > > > > > > > > > > Yes, correct. Thermal would use all (efficient + inefficient), but > > > we in cpufreq governor would like to pick if possible the efficient > > > one (below the thermal limit). > > > > To address that, you need to pass more information from schedutil to > > __cpufreq_driver_target() that down the road can be used by > > cpufreq_frequency_table_target() to decide whether or not to skip the > > inefficient frequencies. > > > > For example, you can define CPUFREQ_RELATION_EFFICIENT and pass it > > from schedutil to __cpufreq_driver_target() in the "relation" > > argument, and clear it if the target frequency is above the max policy > > limit, or if ->target() is to be called. > > What about a cpufreq_policy option that if sets would make > cpufreq_frequency_table_target() skip inefficient OPPs while staying within > the limit of max policy? That would work too, -> > Each governor could decide to set it or not, but > it would hide the efficiency resolution to the governor and allow drivers > that implements ->target() to also implements support for inefficient OPPs. -> but alternatively there could be an additional cpufreq driver flag to be set by the drivers implementing ->target() and wanting to deal with CPUFREQ_RELATION_EFFICIENT themselves (an opt-in of sorts). So the governors that want it may pass CPUFREQ_RELATION_EFFICIENT to __cpufreq_driver_target() and then it will be passed to ->target() depending on whether or not the new driver flag is set. > That flag could be set according to a new cpufreq_governor flag > CPUFREQ_GOV_SKIP_INEFFICIENCIES? > > That could though modify behaviors like powersave_bias from ondemand. But if > a frequency is inefficient, there's probably no power saving anyway. AFAICS, the userspace governor aside, using inefficient frequencies only works with the powersave governor. In the other cases, RELATION_L (say) can be interpreted as "the closest efficient frequency equal to or above the target" with the max policy limit possibly causing inefficient frequencies to be used if they are closer to the limit than the next efficient one. As a rule, the governors don't assume that there are any inefficient frequencies in the table. In fact, they don't make any assumptions regarding the contents of the frequency table at all. They don't even assume that the driver uses a frequency table in the first place.
[...] > > > > What about a cpufreq_policy option that if sets would make > > cpufreq_frequency_table_target() skip inefficient OPPs while staying within > > the limit of max policy? > > That would work too, -> > > > Each governor could decide to set it or not, but > > it would hide the efficiency resolution to the governor and allow drivers > > that implements ->target() to also implements support for inefficient OPPs. > > -> but alternatively there could be an additional cpufreq driver flag > to be set by the drivers implementing ->target() and wanting to deal > with CPUFREQ_RELATION_EFFICIENT themselves (an opt-in of sorts). > > So the governors that want it may pass CPUFREQ_RELATION_EFFICIENT to > __cpufreq_driver_target() and then it will be passed to ->target() > depending on whether or not the new driver flag is set. Of course, I can implement this instead of a cpufreq_policy flag in v4. I suppose then right fallback for CPUFREQ_RELATION_EFFICIENT in case the driver doesn't opt-in is CPUFREQ_RELATION_L. > > > That flag could be set according to a new cpufreq_governor flag > > CPUFREQ_GOV_SKIP_INEFFICIENCIES? > > > > That could though modify behaviors like powersave_bias from ondemand. But if > > a frequency is inefficient, there's probably no power saving anyway. > > AFAICS, the userspace governor aside, using inefficient frequencies > only works with the powersave governor. In the other cases, > RELATION_L (say) can be interpreted as "the closest efficient > frequency equal to or above the target" with the max policy limit > possibly causing inefficient frequencies to be used if they are closer > to the limit than the next efficient one. > > As a rule, the governors don't assume that there are any inefficient > frequencies in the table. In fact, they don't make any assumptions > regarding the contents of the frequency table at all. They don't even > assume that the driver uses a frequency table in the first place. So all the governors, beside powersave and userspace would replace their RELATION_L with RELATION_EFFICIENT. I'll add the changes in v4. So if I sum-up: new RELATION_EFFICIENT that resolves RELATION_L to an higher efficient frequency (if necessary) within the limits of policy->max. CPUfreq drivers can opt-in by setting an appropriate flag. If they do not, RELATION_EFFICIENT will be rewritten in RELATION_L. All governors but userspace and powersave would use RELATION_EFFICIENT instead of RELATION_L. If that works for you, I'll implement this in a v4, as well as some improvements for the CPUfreq/EM registration following the discussion with Viresh. -- Vincent
On 06-07-21, 09:12, Vincent Donnefort wrote:
> Of course, I can implement this instead of a cpufreq_policy flag in v4.
Why should this ever be a policy flag and not governor's flag ?
--
viresh
On Tue, Jul 06, 2021 at 02:07:41PM +0530, Viresh Kumar wrote: > On 06-07-21, 09:12, Vincent Donnefort wrote: > > Of course, I can implement this instead of a cpufreq_policy flag in v4. > > Why should this ever be a policy flag and not governor's flag ? I was referring to how letting know a driver which registers ->target() that we want or not inefficient frequencies. My proposal was to rely on a cpufreq_policy's flag that the driver can read. > > -- > viresh
On 06-07-21, 09:43, Vincent Donnefort wrote: > I was referring to how letting know a driver which registers ->target() that we > want or not inefficient frequencies. My proposal was to rely on a > cpufreq_policy's flag that the driver can read. I am bit confused at this point, lets see how it looks eventually as patches :) -- viresh
On Tue, Jul 6, 2021 at 10:13 AM Vincent Donnefort <vincent.donnefort@arm.com> wrote: > > [...] > > > > > > > What about a cpufreq_policy option that if sets would make > > > cpufreq_frequency_table_target() skip inefficient OPPs while staying within > > > the limit of max policy? > > > > That would work too, -> > > > > > Each governor could decide to set it or not, but > > > it would hide the efficiency resolution to the governor and allow drivers > > > that implements ->target() to also implements support for inefficient OPPs. > > > > -> but alternatively there could be an additional cpufreq driver flag > > to be set by the drivers implementing ->target() and wanting to deal > > with CPUFREQ_RELATION_EFFICIENT themselves (an opt-in of sorts). > > > > So the governors that want it may pass CPUFREQ_RELATION_EFFICIENT to > > __cpufreq_driver_target() and then it will be passed to ->target() > > depending on whether or not the new driver flag is set. > > Of course, I can implement this instead of a cpufreq_policy flag in v4. > I suppose then right fallback for CPUFREQ_RELATION_EFFICIENT in case the > driver doesn't opt-in is CPUFREQ_RELATION_L. > > > > > > That flag could be set according to a new cpufreq_governor flag > > > CPUFREQ_GOV_SKIP_INEFFICIENCIES? > > > > > > That could though modify behaviors like powersave_bias from ondemand. But if > > > a frequency is inefficient, there's probably no power saving anyway. > > > > AFAICS, the userspace governor aside, using inefficient frequencies > > only works with the powersave governor. In the other cases, > > RELATION_L (say) can be interpreted as "the closest efficient > > frequency equal to or above the target" with the max policy limit > > possibly causing inefficient frequencies to be used if they are closer > > to the limit than the next efficient one. > > > > As a rule, the governors don't assume that there are any inefficient > > frequencies in the table. In fact, they don't make any assumptions > > regarding the contents of the frequency table at all. They don't even > > assume that the driver uses a frequency table in the first place. > > So all the governors, beside powersave and userspace would replace their > RELATION_L with RELATION_EFFICIENT. I'll add the changes in v4. > > So if I sum-up: new RELATION_EFFICIENT that resolves RELATION_L to an higher > efficient frequency (if necessary) within the limits of policy->max. Yes. It can be called RELATION_E for brevity. > CPUfreq drivers can opt-in by setting an appropriate flag. If they do not, > RELATION_EFFICIENT will be rewritten in RELATION_L. Yes, and cpufreq_frequency_table_target() will take RELATION_E into account if set. > All governors but userspace and powersave would use RELATION_EFFICIENT instead of RELATION_L. Yes. > If that works for you, I'll implement this in a v4, as well as some > improvements for the CPUfreq/EM registration following the discussion with > Viresh. Sounds good, thanks!
diff --git a/drivers/cpufreq/freq_table.c b/drivers/cpufreq/freq_table.c index 67e56cf..0756d7d6 100644 --- a/drivers/cpufreq/freq_table.c +++ b/drivers/cpufreq/freq_table.c @@ -351,6 +351,53 @@ static int set_freq_table_sorted(struct cpufreq_policy *policy) return 0; } +static void set_freq_table_efficiencies(struct cpufreq_policy *policy) +{ + struct cpufreq_frequency_table *pos, *table = policy->freq_table; + enum cpufreq_table_sorting sort = policy->freq_table_sorted; + int efficient, idx; + + /* Not supported */ + if (sort == CPUFREQ_TABLE_UNSORTED) { + cpufreq_for_each_entry_idx(pos, table, idx) + pos->efficient = idx; + return; + } + + /* The highest frequency is always efficient */ + cpufreq_for_each_entry_idx(pos, table, idx) { + if (pos->frequency == CPUFREQ_ENTRY_INVALID) + continue; + + efficient = idx; + + if (sort == CPUFREQ_TABLE_SORTED_DESCENDING) + break; + } + + for (;;) { + pos = &table[idx]; + + if (pos->frequency == CPUFREQ_ENTRY_INVALID) + continue; + + if (pos->flags & CPUFREQ_INEFFICIENT_FREQ) { + pos->efficient = efficient; + } else { + pos->efficient = idx; + efficient = idx; + } + + if (sort == CPUFREQ_TABLE_SORTED_ASCENDING) { + if (--idx < 0) + break; + } else { + if (table[++idx].frequency == CPUFREQ_TABLE_END) + break; + } + } +} + int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy) { int ret; @@ -362,7 +409,13 @@ int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy) if (ret) return ret; - return set_freq_table_sorted(policy); + ret = set_freq_table_sorted(policy); + if (ret) + return ret; + + set_freq_table_efficiencies(policy); + + return ret; } MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>"); diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 353969c..d10784c 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -666,13 +666,15 @@ struct governor_attr { #define CPUFREQ_ENTRY_INVALID ~0u #define CPUFREQ_TABLE_END ~1u /* Special Values of .flags field */ -#define CPUFREQ_BOOST_FREQ (1 << 0) +#define CPUFREQ_BOOST_FREQ (1 << 0) +#define CPUFREQ_INEFFICIENT_FREQ (1 << 1) struct cpufreq_frequency_table { unsigned int flags; unsigned int driver_data; /* driver specific data, not used by core */ unsigned int frequency; /* kHz - doesn't need to be in ascending * order */ + unsigned int efficient; /* idx of an efficient frequency */ }; #if defined(CONFIG_CPU_FREQ) && defined(CONFIG_PM_OPP)
Some SoCs such as the sd855 have OPPs within the same policy whose cost is higher than others with a higher frequency. Those OPPs are inefficients and it might be interesting for a governor to not use them. Adding a flag, CPUFREQ_INEFFICIENT_FREQ, to mark such OPPs into the frequency table, as well as a new cpufreq_frequency_table member "efficient". This new member will allow a governor to quickly resolve an inefficient frequency to an efficient one. Efficient OPPs point to themselves. Governors must also ensure that the efficiency resolution does not break the policy maximum. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>