mbox series

[RFC,v2,0/1] Introduce per-task io utilization boost

Message ID 20240518113947.2127802-1-christian.loehle@arm.com
Headers show
Series Introduce per-task io utilization boost | expand

Message

Christian Loehle May 18, 2024, 11:39 a.m. UTC
There is a feature inside of both schedutil and intel_pstate called
iowait boosting which tries to prevent selecting a low frequency
during IO workloads when it impacts throughput.
The feature is implemented by checking for task wakeups that have
the in_iowait flag set and boost the CPU of the rq accordingly
(implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).

The necessity of the feature is argued with the potentially low
utilization of a task being frequently in_iowait (i.e. most of the
time not enqueued on any rq and cannot build up utilization).

The RFC focuses on the schedutil implementation.
intel_pstate implementation is possible, but with reviews of v1 it
seems a governor-based implementation is preferred.
Current schedutil iowait boosting has several issues:
1. Boosting happens even in scenarios where it doesn't improve
throughput. [1]
2. The boost is not accounted for in EAS: a) feec() will only consider
 the actual task utilization for task placement, but another CPU might
 be more energy-efficient at that capacity than the boosted one.)
 b) When placing a non-IO task while a CPU is boosted compute_energy()
 assumes a lower OPP than what is actually applied. This leads to
 wrong EAS decisions.
3. Actual IO heavy workloads are hardly distinguished from infrequent
in_iowait wakeups.
4. The boost isn't associated with a task, it therefore isn't considered
for task placement, potentially missing out on higher capacity CPUs on
heterogeneous CPU topologies.
5. The boost isn't associated with a task, it therefore lingers on the
rq even after the responsible task has migrated / stopped.
6. The boost isn't associated with a task, it therefore needs to ramp
up again when migrated.
7. Since schedutil doesn't know which task is getting woken up,
multiple unrelated in_iowait tasks might lead to boosting.
8. Boosting is hard to control with UCLAMP_MAX.

We attempt to mitigate all of the above by reworking the way the
iowait boosting (io boosting from here on) works in two major ways:
- Carry the boost in task_struct, so it is a per-task attribute and
behaves similar to utilization of the task in some ways.
- Employ a counting-based tracking strategy that only boosts as long
as it sees benefits and returns to minimal boosting dynamically.

Note that some the issues (1, 3) can be solved by using a
counting-based strategy on a per-rq basis, i.e. in sugov entirely.
Experiments with Android in particular showed that such a strategy
(which necessarily needs longer intervals to be reasonably stable)
is too prone to migrations to be useful generally.
We therefore consider the additional complexity of such a per-task
based approach like proposed to be worth it.

We require a minimum of 1000 iowait wakeups per second to start
boosting.
This isn't too far off from what sugov currently does, since it resets
the boost if it hasn't seen an iowait wakeup for TICK_NSEC.
For CONFIG_HZ=1000 we are on par, for anything below we are stricter.
We justify this by the small possible improvement by boosting in the
first place with 'rare' iowait wakeups.

When IO even leads to a task being in iowait isn't as straightforward
to explain.
Of course if the issued IO can be served by the page cache (e.g. on
reads because the pages are contained, on writes because they can be
marked dirty and the writeback takes care of it later) the actual
issuing task is usually not in iowait.
We consider this the good case, since whenever the scheduler and a
potential userspace / kernel switch is in the critical path for IO
there is possibly overhead impacting throughput.
We therefore focus on random read from here on, because (on synchronous
IO [3]) this will lead to the task being set in iowait for every IO.
This is where iowait boosting shows its biggest throughput improvement.