Message ID | 20240905092645.2885200-1-christian.loehle@arm.com |
---|---|
Series | cpufreq: cpuidle: Remove iowait behaviour |
On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle <christian.loehle@arm.com> wrote:
>
> Remove CPU iowaiters influence on idle state selection.
> Remove the menu notion of performance multiplier which increased with
> the number of tasks that went to iowait sleep on this CPU and haven't
> woken up yet.
>
> Relying on iowait for cpuidle is problematic for a few reasons:
> 1. There is no guarantee that an iowaiting task will wake up on the
> same CPU.
> 2. The task being in iowait says nothing about the idle duration, we
> could be selecting shallower states for a long time.
> 3. The task being in iowait doesn't always imply a performance hit
> with increased latency.
> 4. If there is such a performance hit, the number of iowaiting tasks
> doesn't directly correlate.
> 5. The definition of iowait altogether is vague at best, it is
> sprinkled across kernel code.
>
> Signed-off-by: Christian Loehle <christian.loehle@arm.com>

I promised feedback on this series.

As far as this particular patch is concerned, I generally agree with
all of the above, so I'm going to put it into linux-next right away
and see if anyone reports a problem with it.

> ---
>  drivers/cpuidle/governors/menu.c | 76 ++++--------------------------------
>  1 file changed, 9 insertions(+), 67 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index f3c9d49f0f2a..28363bfa3e4c 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -19,7 +19,7 @@
>
>  #include "gov.h"
>
> -#define BUCKETS 12
> +#define BUCKETS 6
>  #define INTERVAL_SHIFT 3
>  #define INTERVALS (1UL << INTERVAL_SHIFT)
>  #define RESOLUTION 1024
> @@ -29,12 +29,11 @@
>  /*
>   * Concepts and ideas behind the menu governor
>   *
> - * For the menu governor, there are 3 decision factors for picking a C
> + * For the menu governor, there are 2 decision factors for picking a C
>   * state:
>   * 1) Energy break even point
> - * 2) Performance impact
> - * 3) Latency tolerance (from pmqos infrastructure)
> - * These three factors are treated independently.
> + * 2) Latency tolerance (from pmqos infrastructure)
> + * These two factors are treated independently.
>   *
>   * Energy break even point
>   * -----------------------
> @@ -75,30 +74,6 @@
>   * intervals and if the stand deviation of these 8 intervals is below a
>   * threshold value, we use the average of these intervals as prediction.
>   *
> - * Limiting Performance Impact
> - * ---------------------------
> - * C states, especially those with large exit latencies, can have a real
> - * noticeable impact on workloads, which is not acceptable for most sysadmins,
> - * and in addition, less performance has a power price of its own.
> - *
> - * As a general rule of thumb, menu assumes that the following heuristic
> - * holds:
> - * The busier the system, the less impact of C states is acceptable
> - *
> - * This rule-of-thumb is implemented using a performance-multiplier:
> - * If the exit latency times the performance multiplier is longer than
> - * the predicted duration, the C state is not considered a candidate
> - * for selection due to a too high performance impact. So the higher
> - * this multiplier is, the longer we need to be idle to pick a deep C
> - * state, and thus the less likely a busy CPU will hit such a deep
> - * C state.
> - *
> - * Currently there is only one value determining the factor:
> - * 10 points are added for each process that is waiting for IO on this CPU.
> - * (This value was experimentally determined.)
> - * Utilization is no longer a factor as it was shown that it never contributed
> - * significantly to the performance multiplier in the first place.
> - *
>   */
>
>  struct menu_device {
> @@ -112,19 +87,10 @@ struct menu_device {
>  	int interval_ptr;
>  };
>
> -static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters)
> +static inline int which_bucket(u64 duration_ns)
>  {
>  	int bucket = 0;
>
> -	/*
> -	 * We keep two groups of stats; one with no
> -	 * IO pending, one without.
> -	 * This allows us to calculate
> -	 * E(duration)|iowait
> -	 */
> -	if (nr_iowaiters)
> -		bucket = BUCKETS/2;
> -
>  	if (duration_ns < 10ULL * NSEC_PER_USEC)
>  		return bucket;
>  	if (duration_ns < 100ULL * NSEC_PER_USEC)
> @@ -138,19 +104,6 @@ static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters)
>  	return bucket + 5;
>  }
>
> -/*
> - * Return a multiplier for the exit latency that is intended
> - * to take performance requirements into account.
> - * The more performance critical we estimate the system
> - * to be, the higher this multiplier, and thus the higher
> - * the barrier to go to an expensive C state.
> - */
> -static inline int performance_multiplier(unsigned int nr_iowaiters)
> -{
> -	/* for IO wait tasks (per cpu!) we add 10x each */
> -	return 1 + 10 * nr_iowaiters;
> -}
> -
>  static DEFINE_PER_CPU(struct menu_device, menu_devices);
>
>  static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev);
> @@ -258,8 +211,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  	struct menu_device *data = this_cpu_ptr(&menu_devices);
>  	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
>  	u64 predicted_ns;
> -	u64 interactivity_req;
> -	unsigned int nr_iowaiters;
>  	ktime_t delta, delta_tick;
>  	int i, idx;
>
> @@ -268,8 +219,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  		data->needs_update = 0;
>  	}
>
> -	nr_iowaiters = nr_iowait_cpu(dev->cpu);
> -
>  	/* Find the shortest expected idle interval. */
>  	predicted_ns = get_typical_interval(data) * NSEC_PER_USEC;
>  	if (predicted_ns > RESIDENCY_THRESHOLD_NS) {
> @@ -283,7 +232,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  		}
>
>  		data->next_timer_ns = delta;
> -		data->bucket = which_bucket(data->next_timer_ns, nr_iowaiters);
> +		data->bucket = which_bucket(data->next_timer_ns);
>
>  		/* Round up the result for half microseconds. */
>  		timer_us = div_u64((RESOLUTION * DECAY * NSEC_PER_USEC) / 2 +
> @@ -301,7 +250,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  		 */
>  		data->next_timer_ns = KTIME_MAX;
>  		delta_tick = TICK_NSEC / 2;
> -		data->bucket = which_bucket(KTIME_MAX, nr_iowaiters);
> +		data->bucket = which_bucket(KTIME_MAX);
>  	}
>
>  	if (unlikely(drv->state_count <= 1 || latency_req == 0) ||
> @@ -328,15 +277,8 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  		 */
>  		if (predicted_ns < TICK_NSEC)
>  			predicted_ns = data->next_timer_ns;
> -	} else {
> -		/*
> -		 * Use the performance multiplier and the user-configurable
> -		 * latency_req to determine the maximum exit latency.
> -		 */
> -		interactivity_req = div64_u64(predicted_ns,
> -					performance_multiplier(nr_iowaiters));
> -		if (latency_req > interactivity_req)
> -			latency_req = interactivity_req;
> +	} else if (latency_req > predicted_ns) {
> +		latency_req = predicted_ns;
>  	}
>
>  	/*
> --
> 2.34.1
>
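
For readers following the hunks, the net effect on bucket selection and on the exit-latency limit can be sketched as a small userspace fragment. This is an illustration under assumptions, not the kernel code: the bucket thresholds between 100 us and the final bucket are not visible in the quoted hunk and are assumed here to continue in powers of ten, `uint64_t` stands in for the kernel's `u64`/`s64`, and `old_latency_limit()` / `new_latency_limit()` are hypothetical names for logic that sits inline in `menu_select()`.

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Post-patch bucket choice: six buckets keyed only on expected sleep length. */
static inline int which_bucket(uint64_t duration_ns)
{
	if (duration_ns < 10ULL * NSEC_PER_USEC)
		return 0;
	if (duration_ns < 100ULL * NSEC_PER_USEC)
		return 1;
	/* The next three thresholds are assumed; the quoted hunk elides them. */
	if (duration_ns < 1000ULL * NSEC_PER_USEC)
		return 2;
	if (duration_ns < 10000ULL * NSEC_PER_USEC)
		return 3;
	if (duration_ns < 100000ULL * NSEC_PER_USEC)
		return 4;
	return 5;
}

/* Pre-patch: the predicted idle time is divided by 1 + 10 * nr_iowaiters. */
static inline uint64_t old_latency_limit(uint64_t latency_req,
					 uint64_t predicted_ns,
					 unsigned int nr_iowaiters)
{
	uint64_t interactivity_req = predicted_ns / (1 + 10ULL * nr_iowaiters);

	return latency_req > interactivity_req ? interactivity_req : latency_req;
}

/* Post-patch: the limit is capped by the predicted idle time alone. */
static inline uint64_t new_latency_limit(uint64_t latency_req,
					 uint64_t predicted_ns)
{
	return latency_req > predicted_ns ? predicted_ns : latency_req;
}
```

With two iowaiters and a 500 us prediction, the old limit works out to 500000 / 21 = 23809 ns, while the new limit stays at 500000 ns, so states with a higher exit latency remain eligible. With no iowaiters the two helpers agree.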
On Monday 30 Sep 2024 at 18:34:24 (+0200), Rafael J. Wysocki wrote:
> On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
> >
> > iowait boost in schedutil was introduced by
> > commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting").
> > with it more or less following intel_pstate's approach to increase
> > frequency after an iowait wakeup.
> > Behaviour that is piggy-backed onto iowait boost is problematic
> > due to a lot of reasons, so remove it.
> >
> > For schedutil specifically these are some of the reasons:
> > 1. Boosting is applied even in scenarios where it doesn't improve
> > throughput.
>
> Well, I wouldn't argue this way because it is kind of like saying that
> air conditioning is used even when it doesn't really help. It is
> sometimes hard to know in advance whether or not it will help though.
>
> > 2. The boost is not accounted for in EAS: a) feec() will only consider
> > the actual task utilization for task placement, but another CPU might
> > be more energy-efficient at that capacity than the boosted one.)
> > b) When placing a non-IO task while a CPU is boosted compute_energy()
> > assumes a lower OPP than what is actually applied. This leads to
> > wrong EAS decisions.
>
> That's a very good point IMV and so is the one regarding UCLAMP_MAX (8
> in your list).

I would actually argue that this is also an implementation problem
rather than something fundamental about boosting. EAS could be taught
about iowait boosting and factor that into the decisions.

> If the goal is to set the adequate performance for a given utilization
> level (either actual or prescribed), boosting doesn't really play well
> with this and it shouldn't be used at least in these cases.

There's plenty of cases where EAS will correctly understand that
migrating a task away will not reduce the OPP (e.g. another task on the
rq has a uclamp_min request, or another CPU in the perf domain has a
higher request), so iowait boosting could probably be added. In fact if
the iowait boost was made a task property, EAS could easily understand
the effect of migrating that boost with the task (it's not fundamentally
different from migrating a task with a high uclamp_min from the energy
model perspective).

> > 3. Actual IO heavy workloads are hardly distinguished from infrequent
> > in_iowait wakeups.
>
> Do infrequent in_iowait wakeups really cause the boosting to be
> applied at full swing?
>
> > 4. The boost isn't accounted for in task placement.
>
> I'm not sure what exactly this means. "Big" vs "little" or something else?
>
> > 5. The boost isn't associated with a task, it therefore lingers on the
> > rq even after the responsible task has migrated / stopped.
>
> Fair enough, but this is rather a problem with the implementation of
> boosting and not with the basic idea of it.

+1

> > 6. The boost isn't associated with a task, it therefore needs to ramp
> > up again when migrated.
>
> Well, that again is somewhat implementation-related IMV, and it need
> not be problematic in principle. Namely, if a task migrates and it is
> not the only one in the "new" CPUs runqueue, and the other tasks in
> there don't use in_iowait, maybe it's better to not boost it?
>
> It also means that boosting is not very consistent, though, which is a
> valid point.
>
> > 7. Since schedutil doesn't know which task is getting woken up,
> > multiple unrelated in_iowait tasks lead to boosting.
>
> Well, that's by design: it boosts, when "there is enough IO pressure
> in the runqueue", so to speak.
>
> Basically, it is a departure from the "make performance follow
> utilization" general idea and it is based on the observation that in
> some cases performance can be improved by taking additional
> information into account.
>
> It is also about pure performance, not about energy efficiency.
>
> > 8. Boosting is hard to control with UCLAMP_MAX (which is only active
> > when the task is on the rq, which for boosted tasks is usually not
> > the case for most of the time).

Sounds like another reason to make iowait boosting per-task to me :-)

I've always thought that turning iowait boosting into some sort of
in-kernel uclamp_min request would be a good approach for most of the
issues mentioned above. Note that I'm not necessarily saying to use the
actual uclamp infrastructure (though it's a valid option), I'm really
just talking about the concept. Is that something you've considered?

I presume we could even factor out the 'logic' part of the code that
decides how to request the boost into its own thing, and possibly have
different policies for different use-cases, but that might be overkill.

Thanks,
Quentin
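
The idea of treating the boost as a per-task, uclamp_min-like floor can be made concrete with a toy model. Everything below is hypothetical and exists only to illustrate the concept from this mail: `struct toy_task`, `toy_rq_request()` and `TOY_CAPACITY_SCALE` are invented names, and none of this is the scheduler's actual uclamp or schedutil code.

```c
#include <stddef.h>

#define TOY_CAPACITY_SCALE 1024u

/* Hypothetical task: a utilization plus a per-task iowait boost floor. */
struct toy_task {
	unsigned int util;
	unsigned int iowait_boost;
};

/*
 * Performance request for one runqueue: the summed utilization (capped),
 * raised to the largest boost floor among the tasks queued there.
 */
static unsigned int toy_rq_request(const struct toy_task *tasks, size_t nr)
{
	unsigned int util = 0, boost_floor = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		util += tasks[i].util;
		if (tasks[i].iowait_boost > boost_floor)
			boost_floor = tasks[i].iowait_boost;
	}
	if (util > TOY_CAPACITY_SCALE)
		util = TOY_CAPACITY_SCALE;
	return util > boost_floor ? util : boost_floor;
}
```

Because the floor travels with the task in this model, removing the boosted task from the array drops the request back to plain utilization, which is the "boost lingers on the rq" and "boost must ramp up again" behaviour from points 5 and 6 expressed as data placement rather than as policy.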
Hi,

A caveat: I'm a userspace developer that occasionally strays into kernel land
(see e.g. the io_uring iowait thing). So I'm likely to get some kernel side
things wrong.

On 2024-10-03 11:30:52 +0100, Christian Loehle wrote:
> These are the main issues with transforming the existing mechanism into
> a per-task attribute.
> Almost unsolvable is: Does reducing "iowait pressure" (be it per-task or per-rq)
> actually improve throughput even (assuming for now that this throughput is
> something we care about, I'm sure you know that isn't always the case, e.g.
> background tasks). With MCQ devices and some reasonable IO workload that is
> IO-bound our iowait boosting is often just boosting CPU frequency (which uses
> power obviously) to queue in yet another request for a device which has essentially
> endless pending requests. If pending request N+1 arrives x usecs earlier or
> later at the device then makes no difference in IO throughput.

That's sometimes true, but definitely not all the time? There are plenty of
workloads with low-queue-depth style IO. Which often are also rather latency
sensitive.

E.g. the device a database journal resides on will typically have a low queue
depth. It's extremely common in OLTPish workloads to be bound by the latency
of journal flushes. If, after the journal flush completes, the CPU is clocked
low and takes a while to wake up, you'll see substantially worse performance.

> If boosting would improve e.g. IOPS (of that device) is something the block layer
> (with a lot of added infrastructure, but at least in theory it would know what
> device we're iowaiting on, unlike the scheduler) could tell us about. If that is
> actually useful for user experience (i.e. worth the power) only userspace can decide
> (and then we're back at uclamp_min anyway).

I think there are many cases where userspace won't realistically be able to do
anything about that.

For one, just because, for some workload, a too deep idle state is bad during
IO, doesn't mean userspace won't ever want to clock down. And it's probably
going to be too expensive to change any attributes around idle states for
individual IOs.

Are there actually any non-privileged APIs around this that userspace *could*
even change? I'd not consider moving to busy-polling based APIs a realistic
alternative.

For many workloads cpuidle is way too aggressive dropping into lower states
*despite* iowait. But just disabling all lower idle states obviously has
undesirable energy usage implications. It surely is the answer for some
workloads, but I don't think it'd be good to promote it as the sole solution.

It's easy to under-estimate the real-world impact of a change like this. When
benchmarking we tend to see what kind of throughput we can get, by having N
clients hammering the server as fast as they can. But in the real world it's
pretty rare for anything latency sensitive to go full blast - rather there's a
rate of incoming requests and the clients are sensitive to requests being
processed more slowly.

That's not to say that the current situation can't be improved - I've seen way
too many workloads where the only ways to get decent performance were one of:

- disable most idle states (via sysfs or /dev/cpu_dma_latency)
- just have busy loops when idling - doesn't work when doing synchronous
  syscalls that block though
- have some lower priority tasks scheduled that just burn CPU

I'm just worried that removing iowait will make this worse.

Greetings,

Andres Freund
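
For reference, the `/dev/cpu_dma_latency` workaround mentioned in the list above is usually done along these lines. This is a minimal sketch rather than a recommended pattern: `hold_latency_request()` is a hypothetical helper name, the path is passed in as a parameter, and opening the real device typically requires elevated privileges. The PM QoS interface accepts a binary 32-bit latency value in microseconds and keeps the request registered for as long as the file descriptor stays open.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Open a PM QoS latency device (normally /dev/cpu_dma_latency) and write
 * the tolerated latency in microseconds as a binary 32-bit value.
 * The request stays in force until the returned descriptor is closed.
 * Returns the descriptor, or -1 on failure.
 */
static int hold_latency_request(const char *path, int32_t max_latency_us)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, &max_latency_us, sizeof(max_latency_us)) !=
	    (ssize_t)sizeof(max_latency_us)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A latency-sensitive process would call this once with a small value, keep the descriptor open while the sensitive phase lasts, and close it afterwards. The request is system-wide, not per-IO, which is the granularity problem raised above.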
On 10/5/24 01:39, Andres Freund wrote:
> Hi,
>
>
> A caveat: I'm a userspace developer that occasionally strays into kernel land
> (see e.g. the io_uring iowait thing). So I'm likely to get some kernel side
> things wrong.

Thank you for your input!

>
> On 2024-10-03 11:30:52 +0100, Christian Loehle wrote:
>> These are the main issues with transforming the existing mechanism into
>> a per-task attribute.
>> Almost unsolvable is: Does reducing "iowait pressure" (be it per-task or per-rq)
>> actually improve throughput even (assuming for now that this throughput is
>> something we care about, I'm sure you know that isn't always the case, e.g.
>> background tasks). With MCQ devices and some reasonable IO workload that is
>> IO-bound our iowait boosting is often just boosting CPU frequency (which uses
>> power obviously) to queue in yet another request for a device which has essentially
>> endless pending requests. If pending request N+1 arrives x usecs earlier or
>> later at the device then makes no difference in IO throughput.
>
> That's sometimes true, but definitely not all the time? There are plenty
> workloads with low-queue-depth style IO. Which often are also rather latency
> sensitive.
>
> E.g. the device a database journal resides on will typically have a low queue
> depth. It's extremely common in OLTPish workloads to be bound by the latency
> of journal flushes. If, after the journal flush completes, the CPU is clocked
> low and takes a while to wake up, you'll see substantially worse performance.

Yeah, absolutely, and if we knew what a latency-sensitive journal flush is,
tuning cpuidle and cpufreq to it would probably be reasonable.
I did test mmtests filebench-oltp, which looked fine. Do you have any other
benchmarks you would like to see?

>> If boosting would improve e.g. IOPS (of that device) is something the block layer
>> (with a lot of added infrastructure, but at least in theory it would know what
>> device we're iowaiting on, unlike the scheduler) could tell us about. If that is
>> actually useful for user experience (i.e. worth the power) only userspace can decide
>> (and then we're back at uclamp_min anyway).
>
> I think there are many cases where userspace won't realistically be able to do
> anything about that.
>
> For one, just because, for some workload, a too deep idle state is bad during
> IO, doesn't mean userspace won't ever want to clock down. And it's probably
> going to be too expensive to change any attributes around idle states for
> individual IOs.

So the kernel currently applies these to all of them essentially.

>
> Are there actually any non-privileged APIs around this that userspace *could*
> even change? I'd not consider moving to busy-polling based APIs a realistic
> alternative.

No, and I'm not sure an actual non-privileged API would be a good idea, would
it? It is essentially changing hardware behavior.
So does busy-polling of course, but the kernel can at least curb that and
maintain fairness and so forth.

>
> For many workloads cpuidle is way too aggressive dropping into lower states
> *despite* iowait. But just disabling all lower idle states obviously has
> undesirable energy usage implications. It surely is the answer for some
> workloads, but I don't think it'd be good to promote it as the sole solution.

Right, but we (cpuidle) don't know how to distinguish the two, we just do it
for all of them. Whether kernel or userspace applies the same (awful) heuristic
doesn't make that much of a difference in practice.

>
> It's easy to under-estimate the real-world impact of a change like this. When
> benchmarking we tend to see what kind of throughput we can get, by having N
> clients hammering the server as fast as they can. But in the real world that's
> pretty rare for anything latency sensitive to go full blast - rather there's a
> rate of requests incoming and that the clients are sensitive to requests being
> processed more slowly.

Agreed, this series is posted as RFT and I'm happy to take a look at any
regressions for both the cpufreq and cpuidle parts of it.

>
>
> That's not to say that the current situation can't be improved - I've seen way
> too many workloads where the only ways to get decent performance were one of:
>
> - disable most idle states (via sysfs or /dev/cpu_dma_latency)
> - just have busy loops when idling - doesn't work when doing synchronous
>   syscalls that block though
> - have some lower priority tasks scheduled that just burns CPU
>
> I'm just worried that removing iowait will make this worse.

To avoid creating more confusion, I just need to mention again that almost all
of what you replied refers to cpuidle, not cpufreq (which this particular patch
was about).

Regards,
Christian