Message ID: 20230315164910.302265-1-krzysztof.kozlowski@linaro.org
State: New
Series: [RFC] cpufreq: qcom-cpufreq-hw: allow work to be done on other CPU for PREEMPT_RT
On 15/03/2023 17:49, Krzysztof Kozlowski wrote:
> Qualcomm cpufreq driver configures interrupts with affinity to each
> cluster, e.g. dcvsh-irq-0, dcvsh-irq-4 and dcvsh-irq-7 on SM8250.
> Triggered interrupt will schedule delayed work, but, since workqueue
> prefers local CPUs, it might get executed on a CPU dedicated to realtime
> tasks causing unexpected latencies in realtime workload.
>
> Use unbound workqueue for such case. This might come with performance
> or energy penalty, e.g. because of cache miss or when other CPU is
> sleeping.
>
> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> ---
> drivers/cpufreq/qcom-cpufreq-hw.c | 11 ++++++++++-

Let me also paste impact of this patch - rtla osnoise on entirely idle
system (cores 2-7 isolated for Realtime):

BEFORE:

osnoise/7-2967      [007] d..h2..  3937.898311: irq_noise: dcvsh-irq-7:179 start 3937.898310871 duration 104 ns
irq/179-dcvsh-i-343 [007] d..h2..  3937.898318: irq_noise: IPI:6 start 3937.898317537 duration 104 ns
irq/179-dcvsh-i-343 [007] d...3..  3937.898321: thread_noise: irq/179-dcvsh-i:343 start 3937.898316287 duration 4740 ns
kworker/7:0-85      [007] d..h2..  3937.898323: irq_noise: IPI:6 start 3937.898322381 duration 104 ns
kworker/7:0-85      [007] d...3..  3937.898343: thread_noise: kworker/7:0:85 start 3937.898321339 duration 20990 ns
osnoise/7-2967      [007] .......  3937.898343: sample_threshold: start 3937.898308475 duration 34531 ns interference 5

Noise duration: 34 us

AFTER:

osnoise/7-2637      [007] d..h2..   530.563819: irq_noise: dcvsh-irq-7:178 start 530.563817139 duration 260 ns
osnoise/7-2637      [007] d..h2..   530.563827: irq_noise: IPI:6 start 530.563826670 duration 156 ns
osnoise/7-2637      [007] .......   530.563828: sample_threshold: start 530.563814587 duration 12864 ns interference 2

Noise duration: 13 us

Best regards,
Krzysztof
On 17/03/2023 00:57, Hillf Danton wrote:
> On 16 Mar 2023 13:28:18 +0100 Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>> On 15/03/2023 17:49, Krzysztof Kozlowski wrote:
>>> Qualcomm cpufreq driver configures interrupts with affinity to each
>>> cluster, e.g. dcvsh-irq-0, dcvsh-irq-4 and dcvsh-irq-7 on SM8250.
>>> Triggered interrupt will schedule delayed work, but, since workqueue
>>> prefers local CPUs, it might get executed on a CPU dedicated to realtime
>>> tasks causing unexpected latencies in realtime workload.
>>>
>>> Use unbound workqueue for such case. This might come with performance
>>> or energy penalty, e.g. because of cache miss or when other CPU is
>>> sleeping.
>>>
>>> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>> ---
>>> drivers/cpufreq/qcom-cpufreq-hw.c | 11 ++++++++++-
>>
>> Let me also paste impact of this patch - rtla osnoise on entirely idle
>> system (cores 2-7 isolated for Realtime):
>
> Are cores 2-7 the non-housekeeping CPUs[1]?

Yes, since they are isolated for realtime this is synonymous.

> [1] https://lore.kernel.org/lkml/20230223150624.GA29739@lst.de/

Best regards,
Krzysztof
On 2023-03-15 17:49:10 [+0100], Krzysztof Kozlowski wrote:
> Qualcomm cpufreq driver configures interrupts with affinity to each
> cluster, e.g. dcvsh-irq-0, dcvsh-irq-4 and dcvsh-irq-7 on SM8250.
> Triggered interrupt will schedule delayed work, but, since workqueue
> prefers local CPUs, it might get executed on a CPU dedicated to realtime
> tasks causing unexpected latencies in realtime workload.
>
> Use unbound workqueue for such case. This might come with performance
> or energy penalty, e.g. because of cache miss or when other CPU is
> sleeping.

I miss the point where it explains that only PREEMPT_RT is affected by
this.

> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> ---
> drivers/cpufreq/qcom-cpufreq-hw.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c b/drivers/cpufreq/qcom-cpufreq-hw.c
> index 2f581d2d617d..c5ff8d25fabb 100644
> --- a/drivers/cpufreq/qcom-cpufreq-hw.c
> +++ b/drivers/cpufreq/qcom-cpufreq-hw.c
> @@ -390,7 +390,16 @@ static irqreturn_t qcom_lmh_dcvs_handle_irq(int irq, void *data)
>
>  	/* Disable interrupt and enable polling */
>  	disable_irq_nosync(c_data->throttle_irq);
> -	schedule_delayed_work(&c_data->throttle_work, 0);
> +
> +	/*
> +	 * Workqueue prefers local CPUs and since interrupts have set affinity,
> +	 * the work might execute on a CPU dedicated to realtime tasks.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		queue_delayed_work_on(WORK_CPU_UNBOUND, system_unbound_wq,
> +				      &c_data->throttle_work, 0);
> +	else
> +		schedule_delayed_work(&c_data->throttle_work, 0);

You isolated CPUs and use this on PREEMPT_RT. And this special use-case
is your reasoning to make this change and let it depend on PREEMPT_RT?

If you do PREEMPT_RT and you care about latency I would argue that you
either disable cpufreq or set it to PERFORMANCE so that the highest
available frequency is set once and not changed afterwards.

>  	if (qcom_cpufreq.soc_data->reg_intr_clr)
>  		writel_relaxed(GT_IRQ_STATUS,
> --
> 2.34.1
>
On 21/03/2023 11:04, Sebastian Andrzej Siewior wrote:
> On 2023-03-15 17:49:10 [+0100], Krzysztof Kozlowski wrote:
>> Qualcomm cpufreq driver configures interrupts with affinity to each
>> cluster, e.g. dcvsh-irq-0, dcvsh-irq-4 and dcvsh-irq-7 on SM8250.
>> Triggered interrupt will schedule delayed work, but, since workqueue
>> prefers local CPUs, it might get executed on a CPU dedicated to realtime
>> tasks causing unexpected latencies in realtime workload.
>>
>> Use unbound workqueue for such case. This might come with performance
>> or energy penalty, e.g. because of cache miss or when other CPU is
>> sleeping.
>
> I miss the point where it explains that only PREEMPT_RT is affected by
> this.

I assume "realtime tasks" implies this, but I can make it clearer.

>> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>> ---
>> drivers/cpufreq/qcom-cpufreq-hw.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c b/drivers/cpufreq/qcom-cpufreq-hw.c
>> index 2f581d2d617d..c5ff8d25fabb 100644
>> --- a/drivers/cpufreq/qcom-cpufreq-hw.c
>> +++ b/drivers/cpufreq/qcom-cpufreq-hw.c
>> @@ -390,7 +390,16 @@ static irqreturn_t qcom_lmh_dcvs_handle_irq(int irq, void *data)
>>
>>  	/* Disable interrupt and enable polling */
>>  	disable_irq_nosync(c_data->throttle_irq);
>> -	schedule_delayed_work(&c_data->throttle_work, 0);
>> +
>> +	/*
>> +	 * Workqueue prefers local CPUs and since interrupts have set affinity,
>> +	 * the work might execute on a CPU dedicated to realtime tasks.
>> +	 */
>> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
>> +		queue_delayed_work_on(WORK_CPU_UNBOUND, system_unbound_wq,
>> +				      &c_data->throttle_work, 0);
>> +	else
>> +		schedule_delayed_work(&c_data->throttle_work, 0);
>
> You isolated CPUs and use this on PREEMPT_RT. And this special use-case
> is your reasoning to make this change and let it depend on PREEMPT_RT?
>
> If you do PREEMPT_RT and you care about latency I would argue that you
> either disable cpufreq and set it to PERFORMANCE so that the highest
> available frequency is set once and not changed afterwards.

The cpufreq is set to performance. It will be changed anyway because
underlying FW notifies through such interrupts about thermal mitigation
happening.

The only other solution is to disable the cpufreq device, e.g. by not
compiling it.

Best regards,
Krzysztof
On 2023-03-21 11:24:46 [+0100], Krzysztof Kozlowski wrote:
> >> --- a/drivers/cpufreq/qcom-cpufreq-hw.c
> >> +++ b/drivers/cpufreq/qcom-cpufreq-hw.c
> >> @@ -390,7 +390,16 @@ static irqreturn_t qcom_lmh_dcvs_handle_irq(int irq, void *data)
> >>
> >>  	/* Disable interrupt and enable polling */
> >>  	disable_irq_nosync(c_data->throttle_irq);
> >> -	schedule_delayed_work(&c_data->throttle_work, 0);
> >> +
> >> +	/*
> >> +	 * Workqueue prefers local CPUs and since interrupts have set affinity,
> >> +	 * the work might execute on a CPU dedicated to realtime tasks.
> >> +	 */
> >> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> >> +		queue_delayed_work_on(WORK_CPU_UNBOUND, system_unbound_wq,
> >> +				      &c_data->throttle_work, 0);
> >> +	else
> >> +		schedule_delayed_work(&c_data->throttle_work, 0);
> >
> > You isolated CPUs and use this on PREEMPT_RT. And this special use-case
> > is your reasoning to make this change and let it depend on PREEMPT_RT?
> >
> > If you do PREEMPT_RT and you care about latency I would argue that you
> > either disable cpufreq and set it to PERFORMANCE so that the highest
> > available frequency is set once and not changed afterwards.
>
> The cpufreq is set to performance. It will be changed anyway because
> underlying FW notifies through such interrupts about thermal mitigation
> happening.

I still fail to understand why this is PREEMPT_RT specific and not a
problem in general when it comes to NO_HZ_FULL and/or CPU isolation.
However the thermal notifications have nothing to do with cpufreq.

> The only other solution is to disable the cpufreq device, e.g. by not
> compiling it.

People often disable cpufreq because _usually_ the system boots at
maximum performance. There are however exceptions, and even x86 systems
are sometimes configured to a lower clock speed by the firmware/BIOS.
In this case it is nice to have cpufreq so the system can be set to a
higher clock speed during boot, with cpufreq then remaining idle unless
the governor changes it.

> Best regards,
> Krzysztof

Sebastian
On 21/03/2023 11:57, Sebastian Andrzej Siewior wrote:
> On 2023-03-21 11:24:46 [+0100], Krzysztof Kozlowski wrote:
>>>> --- a/drivers/cpufreq/qcom-cpufreq-hw.c
>>>> +++ b/drivers/cpufreq/qcom-cpufreq-hw.c
>>>> @@ -390,7 +390,16 @@ static irqreturn_t qcom_lmh_dcvs_handle_irq(int irq, void *data)
>>>>
>>>>  	/* Disable interrupt and enable polling */
>>>>  	disable_irq_nosync(c_data->throttle_irq);
>>>> -	schedule_delayed_work(&c_data->throttle_work, 0);
>>>> +
>>>> +	/*
>>>> +	 * Workqueue prefers local CPUs and since interrupts have set affinity,
>>>> +	 * the work might execute on a CPU dedicated to realtime tasks.
>>>> +	 */
>>>> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
>>>> +		queue_delayed_work_on(WORK_CPU_UNBOUND, system_unbound_wq,
>>>> +				      &c_data->throttle_work, 0);
>>>> +	else
>>>> +		schedule_delayed_work(&c_data->throttle_work, 0);
>>>
>>> You isolated CPUs and use this on PREEMPT_RT. And this special use-case
>>> is your reasoning to make this change and let it depend on PREEMPT_RT?
>>>
>>> If you do PREEMPT_RT and you care about latency I would argue that you
>>> either disable cpufreq and set it to PERFORMANCE so that the highest
>>> available frequency is set once and not changed afterwards.
>>
>> The cpufreq is set to performance. It will be changed anyway because
>> underlying FW notifies through such interrupts about thermal mitigation
>> happening.
>
> I still fail to understand why this is PREEMPT_RT specific and not a
> problem in general when it comes to NO_HZ_FULL and/or CPU isolation.

Hm, good point. I actually don't know what the workqueue recommendation
is for NO_HZ_FULL CPUs - is workqueue locality still preferred? And how
would such code look?

if (tick_nohz_tick_stopped())?

> However the thermal notifications have nothing to do with cpufreq.

They have. The FW notifies that thermal mitigation is happening and the
maximum allowed frequency is now XYZ. The cpufreq receives this and sets
the maximum allowed scaling frequency for the governor.

>> The only other solution is to disable the cpufreq device, e.g. by not
>> compiling it.
>
> People often disable cpufreq because _usually_ the system boots at
> maximum performance. There are however exceptions and even x86 system
> are configured sometimes to a lower clock speed by the firmware/ BIOS.
> In this case it is nice to have a cpufreq so it is possible to set the
> system during boot to a higher clock speed. And then remain idle unless
> the cpufreq governor changed.

Which we do not want here, thus disabling cpufreq is not an interesting
solution...

Best regards,
Krzysztof
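To make the question above concrete, here is one possible shape for an
isolation-aware queueing decision, covering NO_HZ_FULL/isolcpus setups
without keying on CONFIG_PREEMPT_RT. This is a sketch, not the posted
patch: the helper name and the choice of HK_TYPE_WQ are assumptions, and
the housekeeping API comes from <linux/sched/isolation.h>.

```c
#include <linux/sched/isolation.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

/*
 * Hypothetical helper: queue throttle work away from isolated CPUs.
 * If the local CPU is not a housekeeping CPU for workqueues, let the
 * unbound workqueue pick a housekeeping CPU instead of running the
 * work locally.
 */
static void throttle_queue_work(struct delayed_work *dwork)
{
	if (!housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_WQ))
		queue_delayed_work_on(WORK_CPU_UNBOUND, system_unbound_wq,
				      dwork, 0);
	else
		schedule_delayed_work(dwork, 0);
}
```

Note that when CPU isolation is not configured, housekeeping_cpu()
returns true for every CPU, so this would fall back to the plain
schedule_delayed_work() path.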
On 2023-03-21 12:27:42 [+0100], Krzysztof Kozlowski wrote:
> > I still fail to understand why this is PREEMPT_RT specific and not a
> > problem in general when it comes to NO_HZ_FULL and/or CPU isolation.
>
> Hm, good point, I actually don't know what is the workqueue
> recommendation for NO_HZ_FULL CPUs - is still locality of the workqueue
> preferred?

If you isolate a CPU you want the kernel to stay away from it. The idea
is that something is done on that CPU and the kernel should leave it
alone. That is why the HZ tick is avoided. That is why timers migrate to
the "housekeeping" CPU and do not fire on the CPU on which they were
programmed (unless the timer has to fire on this CPU).

> And how such code would look like?
> if (tick_nohz_tick_stopped())?

Yeah, closer :) The CPU-mask for workqueues can still be different on
non-NOHZ-full CPUs. Still, you interrupt a CPU doing in-userland work
and this is not desired.

You have a threaded-IRQ which does nothing but schedule a worker. Why?
Why not sleep and remain in that threaded IRQ until the work is done?
You _can_ sleep in the threaded IRQ if you have to. Force-threaded is
different but this one is explicitly threaded so you could do it.

> > However the thermal notifications have nothing to do with cpufreq.
>
> They have. The FW notifies that thermal mitigation is happening and
> maximum allowed frequency is now XYZ. The cpufreq receives this and sets
> maximum allowed scaling frequency for governor.

I see. So the driver is doing something in the worst case. This
interrupt, you have it per-CPU and you need to handle it on this CPU? I
mean, could you change the affinity of the interrupt to another CPU?

> Best regards,
> Krzysztof

Sebastian
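The "remain in the threaded IRQ" suggestion above could look roughly like
the sketch below. This is illustrative only: the poll loop, the
qcom_cpufreq_throttled() check, and the polling interval are assumptions,
not code from the driver; only qcom_lmh_dcvs_notify() is a name from the
existing driver.

```c
#include <linux/delay.h>
#include <linux/interrupt.h>

/*
 * Sketch: a threaded handler that does the throttle polling itself
 * instead of bouncing through a workqueue. A threaded handler runs
 * with interrupts and preemption enabled, so it may sleep and take
 * mutexes for as long as the throttling condition persists.
 */
static irqreturn_t qcom_lmh_dcvs_irq_thread(int irq, void *data)
{
	struct qcom_cpufreq_data *c_data = data;

	do {
		qcom_lmh_dcvs_notify(c_data);	/* existing throttle handling */
		msleep(10);			/* assumed polling interval */
	} while (qcom_cpufreq_throttled(c_data));	/* hypothetical check */

	return IRQ_HANDLED;
}
```

With IRQF_ONESHOT the line stays masked while this thread runs, so no
new interrupt storms in during the poll loop.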
On 21/03/2023 14:39, Sebastian Andrzej Siewior wrote:
> On 2023-03-21 12:27:42 [+0100], Krzysztof Kozlowski wrote:
>>> I still fail to understand why this is PREEMPT_RT specific and not a
>>> problem in general when it comes to NO_HZ_FULL and/or CPU isolation.
>>
>> Hm, good point, I actually don't know what is the workqueue
>> recommendation for NO_HZ_FULL CPUs - is still locality of the workqueue
>> preferred?
>
> If you isolate a CPU you want the kernel to stay away from it. The idea
> is that something is done on that CPU and the kernel should leave it
> alone. That is why the HZ tick is avoided. That is why timers migrate to
> the "housekeeping" CPU and do not fire on the CPU on which they were
> programmed (unless the timer has to fire on this CPU).
>
>> And how such code would look like?
>> if (tick_nohz_tick_stopped())?
>
> Yeah, closer :) The CPU-mask for workqueues can still be different on
> non-NOHZ-full CPUs. Still, you interrupt a CPU doing in-userland work
> and this is not desired.

Probably this should be done by the workqueue core code. Individual
drivers should not need to investigate which CPUs are isolated.

> You have a threaded-IRQ which does nothing but schedule a worker. Why?
> Why not sleep and remain in that threaded IRQ until the work is done?
> You _can_ sleep in the threaded IRQ if you have to. Force-threaded is
> different but this one is explicitly threaded so you could do it.

If I get your point correctly, you want the IRQ handler thread to do the
actual work instead of scheduling work? The answer to this is probably
here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e0e27c3d4e20dab861566f1c348ae44e4b498630

>>> However the thermal notifications have nothing to do with cpufreq.
>>
>> They have. The FW notifies that thermal mitigation is happening and
>> maximum allowed frequency is now XYZ. The cpufreq receives this and sets
>> maximum allowed scaling frequency for governor.
>
> I see. So the driver is doing something in the worst case. This
> interrupt, you have it per-CPU and you need to handle it on this CPU? I
> mean, could you change the affinity of the interrupt to another CPU?

I don't know. The commit introducing it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3ed6dfbd3bb987b3d2de86304ae45972ebff5870
claimed it helps to reduce the number of interrupts hitting the CPU by
10x-100x... I don't see it - neither in tests nor in the code - so I am
just thinking about reverting that one.

Best regards,
Krzysztof
On 2023-03-23 09:16:27 [+0100], Krzysztof Kozlowski wrote:
> > Yeah, closer :) The CPU-mask for workqueues can still be different on
> > non-NOHZ-full CPUs. Still, you interrupt a CPU doing in-userland work
> > and this is not desired.
>
> Probably this should be done by the workqueue core code. Individual
> drivers should not need to investigate which CPUs are isolated.

_Either_ this is part of the interrupt service routine or it is not.
Sometimes work can be offloaded. However this interrupt is short and
offloads work to a workqueue. Can the interrupt be moved to another CPU
without breaking something? The user can not only change the CPUs on
which the workqueue can run but also the affinity of the IRQ itself. If
the user wishes to isolate CPU X and moves workqueues and interrupts
away from that CPU, the question is why this is a problem for you.

> > You have a threaded-IRQ which does nothing but schedule a worker. Why?
> > Why not sleep and remain in that threaded IRQ until the work is done?
> > You _can_ sleep in the threaded IRQ if you have to. Force-threaded is
> > different but this one is explicitly threaded so you could do it.
>
> If I get your point correctly, you want the IRQ handler thread to do the
> actual work instead of scheduling work? The answer to this is probably
> here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e0e27c3d4e20dab861566f1c348ae44e4b498630

Let me look.

| Re-enabling an interrupt from its own interrupt handler may cause
| an interrupt storm, if there is a pending interrupt and because its
| handling is disabled due to already done entrance into the handler
| above in the stack.

I have a hard time parsing this. disable_irq_nosync() and enable_irq()
shouldn't be invoked from within the interrupt handler itself. This
interrupt is already requested as a threaded handler with IRQF_ONESHOT.
This means the IRQ-chip already disables the interrupt while the
threaded handler is invoked. No need for that.

I don't know what the purpose of reg_intr_clr here is. Acking the
interrupt before reading the status and doing any work does not look
right.

| Also, apparently it is improper to lock a mutex in an interrupt contex.

Again, this is an interrupt handler requested as a threaded handler.
This handler is invoked with interrupts and preemption enabled. The code
in this handler can invoke ssleep() or mutex_lock(). A might_sleep()
does not produce a splat here. It is okay to acquire a mutex. This is
why we have threaded interrupts. You can't acquire a mutex in a
forced-threaded handler. This is not the case here.

> >>> However the thermal notifications have nothing to do with cpufreq.
> >>
> >> They have. The FW notifies that thermal mitigation is happening and
> >> maximum allowed frequency is now XYZ. The cpufreq receives this and sets
> >> maximum allowed scaling frequency for governor.
> >
> > I see. So the driver is doing something in the worst case. This
> > interrupt, you have it per-CPU and you need to handle it on this CPU? I
> > mean, could you change the affinity of the interrupt to another CPU?
>
> I don't know. The commit introducing it:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3ed6dfbd3bb987b3d2de86304ae45972ebff5870
> claimed it helps to reduce the number of interrupts hitting the CPU by
> 10x-100x... I don't see it - neither in tests nor in the code - so I am
> just thinking about reverting that one.

So it may run on another CPU, but doing it on the right cluster reduces
the received interrupts 10-100 times. Do we have a per-cluster register,
or is the interrupt ACK wrong and this is what is observed? The
questions are:
- What is the interrupt signaling?
- What must be done to acknowledge the interrupt?
This needs to be figured out and verified to actually work as intended.

The hardware might keep sending interrupts because the source is either
not acknowledged properly or the source of the interrupt (the reason why
it was generated in the first place) is still pending/not handled. The
changes you reference look like "if we do this then it seems better".
No explanation of the issue/root cause and the targeted solution.

> Best regards,
> Krzysztof

Sebastian
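For reference, the oneshot-threaded pattern described above can be
sketched as follows. With IRQF_ONESHOT the irq core keeps the line
masked from the hard-irq until the threaded handler returns, so the
handler itself needs no disable_irq_nosync()/enable_irq() dance. This is
a sketch following the driver's naming, not the exact posted code.

```c
#include <linux/interrupt.h>

/*
 * Sketch: request the LMH throttle interrupt as an explicitly threaded,
 * oneshot handler. No hard-irq part is needed; the line stays masked
 * until qcom_lmh_dcvs_handle_irq() (which may sleep) returns.
 * IRQF_NO_AUTOEN keeps the interrupt disabled until the driver is ready
 * and calls enable_irq() itself.
 */
ret = request_threaded_irq(c_data->throttle_irq,
			   NULL,			/* no hard-irq handler */
			   qcom_lmh_dcvs_handle_irq,	/* threaded, may sleep */
			   IRQF_ONESHOT | IRQF_NO_AUTOEN,
			   c_data->irq_name, c_data);
```

Because the handler runs in its own kernel thread, taking a mutex or
sleeping inside it is legitimate, which is Sebastian's point about the
e0e27c3d4e20 commit message.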
diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c b/drivers/cpufreq/qcom-cpufreq-hw.c
index 2f581d2d617d..c5ff8d25fabb 100644
--- a/drivers/cpufreq/qcom-cpufreq-hw.c
+++ b/drivers/cpufreq/qcom-cpufreq-hw.c
@@ -390,7 +390,16 @@ static irqreturn_t qcom_lmh_dcvs_handle_irq(int irq, void *data)
 
 	/* Disable interrupt and enable polling */
 	disable_irq_nosync(c_data->throttle_irq);
-	schedule_delayed_work(&c_data->throttle_work, 0);
+
+	/*
+	 * Workqueue prefers local CPUs and since interrupts have set affinity,
+	 * the work might execute on a CPU dedicated to realtime tasks.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		queue_delayed_work_on(WORK_CPU_UNBOUND, system_unbound_wq,
+				      &c_data->throttle_work, 0);
+	else
+		schedule_delayed_work(&c_data->throttle_work, 0);
 
 	if (qcom_cpufreq.soc_data->reg_intr_clr)
 		writel_relaxed(GT_IRQ_STATUS,
Qualcomm cpufreq driver configures interrupts with affinity to each
cluster, e.g. dcvsh-irq-0, dcvsh-irq-4 and dcvsh-irq-7 on SM8250.
Triggered interrupt will schedule delayed work, but, since workqueue
prefers local CPUs, it might get executed on a CPU dedicated to realtime
tasks causing unexpected latencies in realtime workload.

Use unbound workqueue for such case. This might come with performance
or energy penalty, e.g. because of cache miss or when other CPU is
sleeping.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
---
 drivers/cpufreq/qcom-cpufreq-hw.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)