Message ID | 20230522145702.2419654-4-lukasz.luba@arm.com |
---|---|
State | New |
Headers | show |
Series | Add basic tracing for uclamp and schedutil | expand |
On Mon, May 22, 2023 at 4:57 PM Lukasz Luba <lukasz.luba@arm.com> wrote: > > Some of the frequency update requests coming form the task scheduler > might be filter out. It can happen when the previous request was served > not that long ago (in a period smaller than provided by the cpufreq driver > as minimum for frequency update). In such case, we want to know if some of > the frequency updates cannot make through. > Export the new tracepoint as well. That would allow to handle it by a > toolkit for trace analyzes. > > Reported-by: kernel test robot <lkp@intel.com> # solved tricky build > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > --- > include/trace/events/sched.h | 4 ++++ > kernel/sched/cpufreq_schedutil.c | 10 ++++++++-- > 2 files changed, 12 insertions(+), 2 deletions(-) > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > index dbfb30809f15..e34b7cd5de73 100644 > --- a/include/trace/events/sched.h > +++ b/include/trace/events/sched.h > @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp, > TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value), > TP_ARGS(tsk, uclamp_id, value)); > > +DECLARE_TRACE(schedutil_update_filtered_tp, > + TP_PROTO(int cpu), > + TP_ARGS(cpu)); > + > #endif /* _TRACE_SCHED_H */ > > /* This part must be outside protection */ > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > index f462496e5c07..4f9daf258a65 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -6,6 +6,8 @@ > * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > */ > > +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp); > + > #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8) > > struct sugov_tunables { > @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, > > ignore_dl_rate_limit(sg_cpu); > > - if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) > + if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) { > + trace_schedutil_update_filtered_tp(sg_cpu->cpu); It looks like the tracepoint can be added to sugov_should_update_freq() for less code duplication. > return false; > + } > > sugov_get_util(sg_cpu); > sugov_iowait_apply(sg_cpu, time, max_cap); > @@ -446,8 +450,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags) > > ignore_dl_rate_limit(sg_cpu); > > - if (!sugov_should_update_freq(sg_policy, time)) > + if (!sugov_should_update_freq(sg_policy, time)) { > + trace_schedutil_update_filtered_tp(sg_cpu->cpu); > goto unlock; > + } > > next_f = sugov_next_freq_shared(sg_cpu, time); > > --
Hi Qais, I have somehow missed your feedback on this series. On 5/31/23 19:31, Qais Yousef wrote: > On 05/22/23 15:57, Lukasz Luba wrote: >> Some of the frequency update requests coming form the task scheduler >> might be filter out. It can happen when the previous request was served >> not that long ago (in a period smaller than provided by the cpufreq driver >> as minimum for frequency update). In such case, we want to know if some of >> the frequency updates cannot make through. >> Export the new tracepoint as well. That would allow to handle it by a >> toolkit for trace analyzes. >> >> Reported-by: kernel test robot <lkp@intel.com> # solved tricky build >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> >> --- >> include/trace/events/sched.h | 4 ++++ >> kernel/sched/cpufreq_schedutil.c | 10 ++++++++-- >> 2 files changed, 12 insertions(+), 2 deletions(-) >> >> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h >> index dbfb30809f15..e34b7cd5de73 100644 >> --- a/include/trace/events/sched.h >> +++ b/include/trace/events/sched.h >> @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp, >> TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value), >> TP_ARGS(tsk, uclamp_id, value)); >> >> +DECLARE_TRACE(schedutil_update_filtered_tp, >> + TP_PROTO(int cpu), >> + TP_ARGS(cpu)); >> + >> #endif /* _TRACE_SCHED_H */ >> >> /* This part must be outside protection */ >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c >> index f462496e5c07..4f9daf258a65 100644 >> --- a/kernel/sched/cpufreq_schedutil.c >> +++ b/kernel/sched/cpufreq_schedutil.c >> @@ -6,6 +6,8 @@ >> * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> */ >> >> +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp); >> + >> #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8) >> >> struct sugov_tunables { >> @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, >> >> ignore_dl_rate_limit(sg_cpu); >> >> - if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) >> + if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) { >> + trace_schedutil_update_filtered_tp(sg_cpu->cpu); >> return false; >> + } > > Can't we have something more generic here too? Are you interested to count > these events? How do you plan to use it? The plan is to record those events, count them and maybe adjust the FW if the frequency switching capabilities are too low, e.g. 4ms... We need those numbers to point out that there is a need for faster FW micro-controller to serve those incoming requests. > > I think this will be a very noisy event by the way. Could be, but on the other hand for those statistical analysis 'the more the better'. It will also depend on number of CPUs in the cluster, e.g. 4 CPUs vs 1 CPU. I don't know when we will switch to this per-cpu cpufreq mode when all CPUs behave like independent DVFS. Juno mainline kernel and FW supports that mode. We would have to compare those two modes and measure how much we gain/loose when using one and not the other. Furthermore, we already suspect some of our integration testing for EAS-mainline (on Juno) failing due to filtered out requests. How much that would impact other boards - it would be nice to see in traces. Thanks for your feedback! Lukasz
Hi Rafael, On 6/20/23 18:40, Rafael J. Wysocki wrote: > On Mon, May 22, 2023 at 4:57 PM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> Some of the frequency update requests coming form the task scheduler >> might be filter out. It can happen when the previous request was served >> not that long ago (in a period smaller than provided by the cpufreq driver >> as minimum for frequency update). In such case, we want to know if some of >> the frequency updates cannot make through. >> Export the new tracepoint as well. That would allow to handle it by a >> toolkit for trace analyzes. >> >> Reported-by: kernel test robot <lkp@intel.com> # solved tricky build >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> >> --- >> include/trace/events/sched.h | 4 ++++ >> kernel/sched/cpufreq_schedutil.c | 10 ++++++++-- >> 2 files changed, 12 insertions(+), 2 deletions(-) >> >> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h >> index dbfb30809f15..e34b7cd5de73 100644 >> --- a/include/trace/events/sched.h >> +++ b/include/trace/events/sched.h >> @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp, >> TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value), >> TP_ARGS(tsk, uclamp_id, value)); >> >> +DECLARE_TRACE(schedutil_update_filtered_tp, >> + TP_PROTO(int cpu), >> + TP_ARGS(cpu)); >> + >> #endif /* _TRACE_SCHED_H */ >> >> /* This part must be outside protection */ >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c >> index f462496e5c07..4f9daf258a65 100644 >> --- a/kernel/sched/cpufreq_schedutil.c >> +++ b/kernel/sched/cpufreq_schedutil.c >> @@ -6,6 +6,8 @@ >> * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> */ >> >> +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp); >> + >> #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8) >> >> struct sugov_tunables { >> @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, >> >> ignore_dl_rate_limit(sg_cpu); >> >> - if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) >> + if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) { >> + trace_schedutil_update_filtered_tp(sg_cpu->cpu); > > It looks like the tracepoint can be added to > sugov_should_update_freq() for less code duplication. > Make sense. I will move that trace there. In such case, of movement that trace call... Based on your comment for patch 2/3 I got impression that you still want it. For me it looks more 'aligned' w/ that patch 2/3. The two functions code flows: sugov_update_shared() and sugov_update_single_common() - how they call and interpret result from sugov_should_update_freq() - is more clear IMO. So I will keep that patch 2/3 in the next version. Although, if you don't like it - please tell me and I will drop it. Thanks for the review! Lukasz
On 06/20/23 18:52, Lukasz Luba wrote: > Hi Qais, > > I have somehow missed your feedback on this series. > > On 5/31/23 19:31, Qais Yousef wrote: > > On 05/22/23 15:57, Lukasz Luba wrote: > > > Some of the frequency update requests coming form the task scheduler > > > might be filter out. It can happen when the previous request was served > > > not that long ago (in a period smaller than provided by the cpufreq driver > > > as minimum for frequency update). In such case, we want to know if some of > > > the frequency updates cannot make through. > > > Export the new tracepoint as well. That would allow to handle it by a > > > toolkit for trace analyzes. > > > > > > Reported-by: kernel test robot <lkp@intel.com> # solved tricky build > > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > > > --- > > > include/trace/events/sched.h | 4 ++++ > > > kernel/sched/cpufreq_schedutil.c | 10 ++++++++-- > > > 2 files changed, 12 insertions(+), 2 deletions(-) > > > > > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > > > index dbfb30809f15..e34b7cd5de73 100644 > > > --- a/include/trace/events/sched.h > > > +++ b/include/trace/events/sched.h > > > @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp, > > > TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value), > > > TP_ARGS(tsk, uclamp_id, value)); > > > +DECLARE_TRACE(schedutil_update_filtered_tp, > > > + TP_PROTO(int cpu), > > > + TP_ARGS(cpu)); > > > + > > > #endif /* _TRACE_SCHED_H */ > > > /* This part must be outside protection */ > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > > index f462496e5c07..4f9daf258a65 100644 > > > --- a/kernel/sched/cpufreq_schedutil.c > > > +++ b/kernel/sched/cpufreq_schedutil.c > > > @@ -6,6 +6,8 @@ > > > * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > */ > > > +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp); > > > + > > > #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8) > > > struct sugov_tunables { > > > @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, > > > ignore_dl_rate_limit(sg_cpu); > > > - if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) > > > + if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) { > > > + trace_schedutil_update_filtered_tp(sg_cpu->cpu); > > > return false; > > > + } > > > > Can't we have something more generic here too? Are you interested to count > > these events? How do you plan to use it? > > The plan is to record those events, count them and maybe adjust the FW > if the frequency switching capabilities are too low, e.g. 4ms... You mean as part of tuning step for the system or at runtime? The latter seems to indicate for a proper interface instead. IMHO I think the current filtering mechanism needs a bit of a massage. One thing I think we must do is to ignore the filter if there's a big sudden change in requested frequency. Like for instance if a big task migrates. Then prev_cpu should go to lower freq sooner, and new_cpu should change to higher frequency sooner too. The filtering makes sense only in steady state situation where we are ramping up or down gradually. If no one beats me to it, I'll propose something in that regard. > > We need those numbers to point out that there is a need for faster > FW micro-controller to serve those incoming requests. I think there's a big assumption here that the filter is always set correctly ;-) > > > > > I think this will be a very noisy event by the way. > > Could be, but on the other hand for those statistical analysis > 'the more the better'. It will also depend on number of > CPUs in the cluster, e.g. 4 CPUs vs 1 CPU. > > I don't know when we will switch to this per-cpu cpufreq mode > when all CPUs behave like independent DVFS. Juno mainline kernel and FW > supports that mode. We would have to compare those two modes and > measure how much we gain/loose when using one and not the other. > > Furthermore, we already suspect some of our integration testing for > EAS-mainline (on Juno) failing due to filtered out requests. How much > that would impact other boards - it would be nice to see in traces. Another problem I think we have is that the DVFS headroom value should be a function of this filter. At the moment it is hardcoded to a random value which causes power issue. So to summarize I think there are two improvements required (and if anyone has the time to try them out go ahead otherwise I'll get to it): 1. The filter should only be applied if the history hasn't changed. ie: we are gradually increasing or decreasing PELT. Otherwise we should honour sudden changes ASAP. 2. DVFS headroom should be a function of the filter. 25% is too high for 500us. It could be too low for 10ms (I don't know). Thanks! -- Qais Yousef > > Thanks for your feedback! > Lukasz
On 06/30/23 13:01, Qais Yousef wrote: > On 06/20/23 18:52, Lukasz Luba wrote: > > Hi Qais, > > > > I have somehow missed your feedback on this series. > > > > On 5/31/23 19:31, Qais Yousef wrote: > > > On 05/22/23 15:57, Lukasz Luba wrote: > > > > Some of the frequency update requests coming form the task scheduler > > > > might be filter out. It can happen when the previous request was served > > > > not that long ago (in a period smaller than provided by the cpufreq driver > > > > as minimum for frequency update). In such case, we want to know if some of > > > > the frequency updates cannot make through. > > > > Export the new tracepoint as well. That would allow to handle it by a > > > > toolkit for trace analyzes. > > > > > > > > Reported-by: kernel test robot <lkp@intel.com> # solved tricky build > > > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > > > > --- > > > > include/trace/events/sched.h | 4 ++++ > > > > kernel/sched/cpufreq_schedutil.c | 10 ++++++++-- > > > > 2 files changed, 12 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > > > > index dbfb30809f15..e34b7cd5de73 100644 > > > > --- a/include/trace/events/sched.h > > > > +++ b/include/trace/events/sched.h > > > > @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp, > > > > TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value), > > > > TP_ARGS(tsk, uclamp_id, value)); > > > > +DECLARE_TRACE(schedutil_update_filtered_tp, > > > > + TP_PROTO(int cpu), > > > > + TP_ARGS(cpu)); > > > > + > > > > #endif /* _TRACE_SCHED_H */ > > > > /* This part must be outside protection */ > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > > > index f462496e5c07..4f9daf258a65 100644 > > > > --- a/kernel/sched/cpufreq_schedutil.c > > > > +++ b/kernel/sched/cpufreq_schedutil.c > > > > @@ -6,6 +6,8 @@ > > > > * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > */ > > > > +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp); > > > > + > > > > #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8) > > > > struct sugov_tunables { > > > > @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, > > > > ignore_dl_rate_limit(sg_cpu); > > > > - if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) > > > > + if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) { > > > > + trace_schedutil_update_filtered_tp(sg_cpu->cpu); > > > > return false; > > > > + } > > > > > > Can't we have something more generic here too? Are you interested to count > > > these events? How do you plan to use it? > > > > The plan is to record those events, count them and maybe adjust the FW > > if the frequency switching capabilities are too low, e.g. 4ms... > > You mean as part of tuning step for the system or at runtime? The latter seems > to indicate for a proper interface instead. > > IMHO I think the current filtering mechanism needs a bit of a massage. > > One thing I think we must do is to ignore the filter if there's a big sudden > change in requested frequency. Like for instance if a big task migrates. Then > prev_cpu should go to lower freq sooner, and new_cpu should change to higher > frequency sooner too. The filtering makes sense only in steady state situation > where we are ramping up or down gradually. > > If no one beats me to it, I'll propose something in that regard. > > > > > We need those numbers to point out that there is a need for faster > > FW micro-controller to serve those incoming requests. > > I think there's a big assumption here that the filter is always set correctly > ;-) > > > > > > > > > I think this will be a very noisy event by the way. > > > > Could be, but on the other hand for those statistical analysis > > 'the more the better'. It will also depend on number of > > CPUs in the cluster, e.g. 4 CPUs vs 1 CPU. > > > > I don't know when we will switch to this per-cpu cpufreq mode > > when all CPUs behave like independent DVFS. Juno mainline kernel and FW > > supports that mode. We would have to compare those two modes and > > measure how much we gain/loose when using one and not the other. > > > > Furthermore, we already suspect some of our integration testing for > > EAS-mainline (on Juno) failing due to filtered out requests. How much > > that would impact other boards - it would be nice to see in traces. > > Another problem I think we have is that the DVFS headroom value should be > a function of this filter. At the moment it is hardcoded to a random value > which causes power issue. > > So to summarize I think there are two improvements required (and if anyone has > the time to try them out go ahead otherwise I'll get to it): > > 1. The filter should only be applied if the history hasn't changed. ie: we are > gradually increasing or decreasing PELT. Otherwise we should honour sudden > changes ASAP. > 2. DVFS headroom should be a function of the filter. 25% is too high for > 500us. It could be too low for 10ms (I don't know). To expand a bit more since it's related. Our migration margins should depend on the tick value instead of random magic numbers they are today. More precisely the balance_interval. If there's a misfit task, then we should upmigrate it at wake up only if we think it'll become misfit before the load balancer kicks in. Otherwise the load balancer should do the correction if it becomes long running/misfit. If the sys admin wants to speed up/slow down migration it should be throw controlling PELT IMO and not these magic margin values - which are hardcoded to random values at the moment anyway that are not suitable for every system. And since we decoupled overutilized from misfit lb; I think our definition should improve to better detect when the system needs to disable packing and starts spreading. Current check for overutilized based on misfit is no longer adequate IMO. Especially when there's a single misfit task in the system. Again, just sharing thoughts in case someone interested to work on this before I get a chance to share any patches ;-) Cheers -- Qais Yousef
Hi Qais, On 6/30/23 14:25, Qais Yousef wrote: > On 06/30/23 13:01, Qais Yousef wrote: >> On 06/20/23 18:52, Lukasz Luba wrote: >>> Hi Qais, >>> >>> I have somehow missed your feedback on this series. >>> >>> On 5/31/23 19:31, Qais Yousef wrote: >>>> On 05/22/23 15:57, Lukasz Luba wrote: >>>>> Some of the frequency update requests coming form the task scheduler >>>>> might be filter out. It can happen when the previous request was served >>>>> not that long ago (in a period smaller than provided by the cpufreq driver >>>>> as minimum for frequency update). In such case, we want to know if some of >>>>> the frequency updates cannot make through. >>>>> Export the new tracepoint as well. That would allow to handle it by a >>>>> toolkit for trace analyzes. >>>>> >>>>> Reported-by: kernel test robot <lkp@intel.com> # solved tricky build >>>>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> >>>>> --- >>>>> include/trace/events/sched.h | 4 ++++ >>>>> kernel/sched/cpufreq_schedutil.c | 10 ++++++++-- >>>>> 2 files changed, 12 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h >>>>> index dbfb30809f15..e34b7cd5de73 100644 >>>>> --- a/include/trace/events/sched.h >>>>> +++ b/include/trace/events/sched.h >>>>> @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp, >>>>> TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value), >>>>> TP_ARGS(tsk, uclamp_id, value)); >>>>> +DECLARE_TRACE(schedutil_update_filtered_tp, >>>>> + TP_PROTO(int cpu), >>>>> + TP_ARGS(cpu)); >>>>> + >>>>> #endif /* _TRACE_SCHED_H */ >>>>> /* This part must be outside protection */ >>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c >>>>> index f462496e5c07..4f9daf258a65 100644 >>>>> --- a/kernel/sched/cpufreq_schedutil.c >>>>> +++ b/kernel/sched/cpufreq_schedutil.c >>>>> @@ -6,6 +6,8 @@ >>>>> * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>>>> */ >>>>> +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp); >>>>> + >>>>> #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8) >>>>> struct sugov_tunables { >>>>> @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, >>>>> ignore_dl_rate_limit(sg_cpu); >>>>> - if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) >>>>> + if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) { >>>>> + trace_schedutil_update_filtered_tp(sg_cpu->cpu); >>>>> return false; >>>>> + } >>>> >>>> Can't we have something more generic here too? Are you interested to count >>>> these events? How do you plan to use it? >>> >>> The plan is to record those events, count them and maybe adjust the FW >>> if the frequency switching capabilities are too low, e.g. 4ms... >> >> You mean as part of tuning step for the system or at runtime? The latter seems >> to indicate for a proper interface instead. Not at runtime, the FW change or maybe even the uC would be needed for this. Therefore, our client which experiments with the new platform can run the analysis and spot this. Then it can change the FW if there was an issue, or maybe even upgrade the HW if there are severe issues with delivering needed performance on time (e.g. in high display refresh rates and first-frame drops). >> >> IMHO I think the current filtering mechanism needs a bit of a massage. >> >> One thing I think we must do is to ignore the filter if there's a big sudden >> change in requested frequency. Like for instance if a big task migrates. Then >> prev_cpu should go to lower freq sooner, and new_cpu should change to higher >> frequency sooner too. The filtering makes sense only in steady state situation >> where we are ramping up or down gradually. This is kind of a heuristic, which is biased for mobiles IMO. >> >> If no one beats me to it, I'll propose something in that regard. >> >>> >>> We need those numbers to point out that there is a need for faster >>> FW micro-controller to serve those incoming requests. >> >> I think there's a big assumption here that the filter is always set correctly >> ;-) In our case, it is set correctly, the value is coming directly from FW [1]. It is the FW+HW limit, so not much to do with this in the kernel. >> >>> >>>> >>>> I think this will be a very noisy event by the way. >>> >>> Could be, but on the other hand for those statistical analysis >>> 'the more the better'. It will also depend on number of >>> CPUs in the cluster, e.g. 4 CPUs vs 1 CPU. >>> >>> I don't know when we will switch to this per-cpu cpufreq mode >>> when all CPUs behave like independent DVFS. Juno mainline kernel and FW >>> supports that mode. We would have to compare those two modes and >>> measure how much we gain/loose when using one and not the other. >>> >>> Furthermore, we already suspect some of our integration testing for >>> EAS-mainline (on Juno) failing due to filtered out requests. How much >>> that would impact other boards - it would be nice to see in traces. >> >> Another problem I think we have is that the DVFS headroom value should be >> a function of this filter. At the moment it is hardcoded to a random value >> which causes power issue. It's not a random value, as you can see in [1]. This is the main goal for this $subject - provide information with proper test that the FW or HW change is needed. If you like to have a decent performance in your Linux based solution, the faster FW/HW would be needed. I don't want to put more hacks or heuristics which try to workaround performance issues with the HW. E.g. if someone instead of a 200MHz uC running fast FW would put 100MHz uC than should get quality data from integration tests, that such a design might not work well with recent OSes and apps. Currently those kind of 'design' checks are very hard, because require sophisticated knowledge and we are trying to lower that bar for more engineers. >> >> So to summarize I think there are two improvements required (and if anyone has >> the time to try them out go ahead otherwise I'll get to it): >> >> 1. The filter should only be applied if the history hasn't changed. ie: we are >> gradually increasing or decreasing PELT. Otherwise we should honour sudden >> changes ASAP. >> 2. DVFS headroom should be a function of the filter. 25% is too high for >> 500us. It could be too low for 10ms (I don't know). > > To expand a bit more since it's related. Our migration margins should depend > on the tick value instead of random magic numbers they are today. More > precisely the balance_interval. If there's a misfit task, then we should > upmigrate it at wake up only if we think it'll become misfit before the load > balancer kicks in. Otherwise the load balancer should do the correction if it > becomes long running/misfit. If the sys admin wants to speed up/slow down > migration it should be throw controlling PELT IMO and not these magic margin > values - which are hardcoded to random values at the moment anyway that are not > suitable for every system. > > And since we decoupled overutilized from misfit lb; I think our definition > should improve to better detect when the system needs to disable packing and > starts spreading. Current check for overutilized based on misfit is no longer > adequate IMO. Especially when there's a single misfit task in the system. > > Again, just sharing thoughts in case someone interested to work on this before > I get a chance to share any patches ;-) Those are all heuristics and some of your ideas are going very beyond the $subject. As I said the main goal for the $subject is to deliver information that the FW/HW might need a re-design (maybe even a more silicon for the uC). I cannot stop you from creating a dedicated thread with your patches, though ;) Regards, Lukasz [1] https://elixir.bootlin.com/linux/v6.4/source/drivers/cpufreq/scmi-cpufreq.c#L224
On 07/04/23 09:23, Lukasz Luba wrote: > Hi Qais, > > On 6/30/23 14:25, Qais Yousef wrote: > > On 06/30/23 13:01, Qais Yousef wrote: > > > On 06/20/23 18:52, Lukasz Luba wrote: > > > > Hi Qais, > > > > > > > > I have somehow missed your feedback on this series. > > > > > > > > On 5/31/23 19:31, Qais Yousef wrote: > > > > > On 05/22/23 15:57, Lukasz Luba wrote: > > > > > > Some of the frequency update requests coming form the task scheduler > > > > > > might be filter out. It can happen when the previous request was served > > > > > > not that long ago (in a period smaller than provided by the cpufreq driver > > > > > > as minimum for frequency update). In such case, we want to know if some of > > > > > > the frequency updates cannot make through. > > > > > > Export the new tracepoint as well. That would allow to handle it by a > > > > > > toolkit for trace analyzes. > > > > > > > > > > > > Reported-by: kernel test robot <lkp@intel.com> # solved tricky build > > > > > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > > > > > > --- > > > > > > include/trace/events/sched.h | 4 ++++ > > > > > > kernel/sched/cpufreq_schedutil.c | 10 ++++++++-- > > > > > > 2 files changed, 12 insertions(+), 2 deletions(-) > > > > > > > > > > > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > > > > > > index dbfb30809f15..e34b7cd5de73 100644 > > > > > > --- a/include/trace/events/sched.h > > > > > > +++ b/include/trace/events/sched.h > > > > > > @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp, > > > > > > TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value), > > > > > > TP_ARGS(tsk, uclamp_id, value)); > > > > > > +DECLARE_TRACE(schedutil_update_filtered_tp, > > > > > > + TP_PROTO(int cpu), > > > > > > + TP_ARGS(cpu)); > > > > > > + > > > > > > #endif /* _TRACE_SCHED_H */ > > > > > > /* This part must be outside protection */ > > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > > > > > index f462496e5c07..4f9daf258a65 100644 > > > > > > --- a/kernel/sched/cpufreq_schedutil.c > > > > > > +++ b/kernel/sched/cpufreq_schedutil.c > > > > > > @@ -6,6 +6,8 @@ > > > > > > * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > > > */ > > > > > > +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp); > > > > > > + > > > > > > #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8) > > > > > > struct sugov_tunables { > > > > > > @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, > > > > > > ignore_dl_rate_limit(sg_cpu); > > > > > > - if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) > > > > > > + if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) { > > > > > > + trace_schedutil_update_filtered_tp(sg_cpu->cpu); > > > > > > return false; > > > > > > + } > > > > > > > > > > Can't we have something more generic here too? Are you interested to count > > > > > these events? How do you plan to use it? > > > > > > > > The plan is to record those events, count them and maybe adjust the FW > > > > if the frequency switching capabilities are too low, e.g. 4ms... > > > > > > You mean as part of tuning step for the system or at runtime? The latter seems > > > to indicate for a proper interface instead. > > Not at runtime, the FW change or maybe even the uC would be needed for > this. Therefore, our client which experiments with the new platform > can run the analysis and spot this. Then it can change the FW > if there was an issue, or maybe even upgrade the HW if there are severe > issues with delivering needed performance on time (e.g. in high display > refresh rates and first-frame drops). > > > > > > > IMHO I think the current filtering mechanism needs a bit of a massage. > > > > > > One thing I think we must do is to ignore the filter if there's a big sudden > > > change in requested frequency. Like for instance if a big task migrates. Then > > > prev_cpu should go to lower freq sooner, and new_cpu should change to higher > > > frequency sooner too. The filtering makes sense only in steady state situation > > > where we are ramping up or down gradually. > > This is kind of a heuristic, which is biased for mobiles IMO. How come? big tasks are not only on mobile? A 500+ task can exist on any system? > > > > > > > If no one beats me to it, I'll propose something in that regard. > > > > > > > > > > > We need those numbers to point out that there is a need for faster > > > > FW micro-controller to serve those incoming requests. > > > > > > I think there's a big assumption here that the filter is always set correctly > > > ;-) > > In our case, it is set correctly, the value is coming directly from FW > [1]. It is the FW+HW limit, so not much to do with this in the kernel. > > > > > > > > > > > > > > > > > > I think this will be a very noisy event by the way. > > > > > > > > Could be, but on the other hand for those statistical analysis > > > > 'the more the better'. It will also depend on number of > > > > CPUs in the cluster, e.g. 4 CPUs vs 1 CPU. > > > > > > > > I don't know when we will switch to this per-cpu cpufreq mode > > > > when all CPUs behave like independent DVFS. Juno mainline kernel and FW > > > > supports that mode. We would have to compare those two modes and > > > > measure how much we gain/loose when using one and not the other. > > > > > > > > Furthermore, we already suspect some of our integration testing for > > > > EAS-mainline (on Juno) failing due to filtered out requests. How much > > > > that would impact other boards - it would be nice to see in traces. > > > > > > Another problem I think we have is that the DVFS headroom value should be > > > a function of this filter. At the moment it is hardcoded to a random value > > > which causes power issue. > > It's not a random value, as you can see in [1]. This is the main goal I'm referring to the 25% in map_util_perf(). > for this $subject - provide information with proper test that the FW > or HW change is needed. If you like to have a decent performance in > your Linux based solution, the faster FW/HW would be needed. I don't > want to put more hacks or heuristics which try to workaround performance > issues with the HW. E.g. if someone instead of a 200MHz uC running fast > FW would put 100MHz uC than should get quality data from integration > tests, that such a design might not work well with recent OSes and apps. > Currently those kind of 'design' checks are very hard, because require > sophisticated knowledge and we are trying to lower that bar for more > engineers. I think we're talking about different things ;-) > > > > > > > So to summarize I think there are two improvements required (and if anyone has > > > the time to try them out go ahead otherwise I'll get to it): > > > > > > 1. The filter should only be applied if the history hasn't changed. ie: we are > > > gradually increasing or decreasing PELT. Otherwise we should honour sudden > > > changes ASAP. > > > 2. DVFS headroom should be a function of the filter. 25% is too high for > > > 500us. It could be too low for 10ms (I don't know). > > > > To expand a bit more since it's related. Our migration margins should depend > > on the tick value instead of random magic numbers they are today. More > > precisely the balance_interval. If there's a misfit task, then we should > > upmigrate it at wake up only if we think it'll become misfit before the load > > balancer kicks in. Otherwise the load balancer should do the correction if it > > becomes long running/misfit. If the sys admin wants to speed up/slow down > > migration it should be throw controlling PELT IMO and not these magic margin > > values - which are hardcoded to random values at the moment anyway that are not > > suitable for every system. > > > > And since we decoupled overutilized from misfit lb; I think our definition > > should improve to better detect when the system needs to disable packing and > > starts spreading. Current check for overutilized based on misfit is no longer > > adequate IMO. Especially when there's a single misfit task in the system. > > > > Again, just sharing thoughts in case someone interested to work on this before > > I get a chance to share any patches ;-) > > Those are all heuristics and some of your ideas are going very beyond > the $subject. As I said the main goal for the $subject is to > deliver information that the FW/HW might need a re-design (maybe even > a more silicon for the uC). > > I cannot stop you from creating a dedicated thread with your patches, > though ;) Fair enough. /me goes away -- Qais Yousef > > Regards, > Lukasz > > [1] https://elixir.bootlin.com/linux/v6.4/source/drivers/cpufreq/scmi-cpufreq.c#L224
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index dbfb30809f15..e34b7cd5de73 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -739,6 +739,10 @@ DECLARE_TRACE(uclamp_update_tsk_tp, TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value), TP_ARGS(tsk, uclamp_id, value)); +DECLARE_TRACE(schedutil_update_filtered_tp, + TP_PROTO(int cpu), + TP_ARGS(cpu)); + #endif /* _TRACE_SCHED_H */ /* This part must be outside protection */ diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index f462496e5c07..4f9daf258a65 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -6,6 +6,8 @@ * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> */ +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp); + #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8) struct sugov_tunables { @@ -318,8 +320,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, ignore_dl_rate_limit(sg_cpu); - if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) + if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) { + trace_schedutil_update_filtered_tp(sg_cpu->cpu); return false; + } sugov_get_util(sg_cpu); sugov_iowait_apply(sg_cpu, time, max_cap); @@ -446,8 +450,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags) ignore_dl_rate_limit(sg_cpu); - if (!sugov_should_update_freq(sg_policy, time)) + if (!sugov_should_update_freq(sg_policy, time)) { + trace_schedutil_update_filtered_tp(sg_cpu->cpu); goto unlock; + } next_f = sugov_next_freq_shared(sg_cpu, time);
Some of the frequency update requests coming form the task scheduler might be filter out. It can happen when the previous request was served not that long ago (in a period smaller than provided by the cpufreq driver as minimum for frequency update). In such case, we want to know if some of the frequency updates cannot make through. Export the new tracepoint as well. That would allow to handle it by a toolkit for trace analyzes. Reported-by: kernel test robot <lkp@intel.com> # solved tricky build Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> --- include/trace/events/sched.h | 4 ++++ kernel/sched/cpufreq_schedutil.c | 10 ++++++++-- 2 files changed, 12 insertions(+), 2 deletions(-)