mbox series

[RFC,0/2] Introduce per-task io utilization boost

Message ID 20240304201625.100619-1-christian.loehle@arm.com
Headers show
Series Introduce per-task io utilization boost | expand

Message

Christian Loehle March 4, 2024, 8:16 p.m. UTC
There is a feature inside of both schedutil and intel_pstate called
iowait boosting which tries to prevent selecting a low frequency
during IO workloads when it impacts throughput.
The feature is implemented by checking for task wakeups that have
the in_iowait flag set and boost the CPU of the rq accordingly
(implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).

The necessity of the feature is argued with the potentially low
utilization of a task being frequently in_iowait (i.e. most of the
time not enqueued on any rq and cannot build up utilization).

The RFC focuses on the schedutil implementation.
intel_pstate frequency selection isn't touched for now, suggestions are
very welcome.
Current schedutil iowait boosting has several issues:
1. Boosting happens even in scenarios where it doesn't improve
throughput. [1]
2. The boost is not accounted for in EAS: a) feec() will only consider
 the actual utilization for task placement, but another CPU might be
 more energy-efficient at that capacity than the boosted one.)
 b) When placing a non-IO task while a CPU is boosted compute_energy()
 will not consider the (potentially 'free') boosted capacity, but the
 one it would have without the boost (since the boost is only applied
 in sugov).
3. Actual IO heavy workloads are hardly distinguished from infrequent
in_iowait wakeups.
4. The boost isn't associated with a task, it therefore isn't considered
for task placement, potentially missing out on higher capacity CPUs on
heterogeneous CPU topologies.
5. The boost isn't associated with a task, it therefore lingers on the
rq even after the responsible task has migrated / stopped.
6. The boost isn't associated with a task, it therefore needs to ramp
up again when migrated.
7. Since schedutil doesn't know which task is getting woken up,
multiple unrelated in_iowait tasks might lead to boosting.

We attempt to mitigate all of the above by reworking the way the
iowait boosting (io boosting from here on) works in two major ways:
- Carry the boost in task_struct, so it is a per-task attribute and
behaves similar to utilization of the task in some ways.
- Employ a counting-based tracking strategy that only boosts as long
as it sees benefits and returns to no boosting dynamically.

Note that some the issues (1, 3) can be solved by using a
counting-based strategy on a per-rq basis, i.e. in sugov entirely.
Experiments with Android in particular showed that such a strategy
(which necessarily needs longer intervals to be reasonably stable)
is too prone to migrations to be useful generally.
We therefore consider the additional complexity of such a per-task
based approach like proposed to be worth it.

We require a minimum of 1000 iowait wakeups per second to start
boosting.
This isn't too far off from what sugov currently does, since it resets
the boost if it hasn't seen an iowait wakeup for TICK_NSEC.
For CONFIG_HZ=1000 we are on par, for anything below we are stricter.
We justify this by the small possible improvement by boosting in the
first place with 'rare' few iowait wakeups.

When IO even leads to a task being in iowait isn't as straightforward
to explain.
Of course if the issued IO can be served by the page cache (e.g. on
reads because the pages are contained, on writes because they can be
marked dirty and the writeback takes care of it later) the actual
issuing task is usually not in iowait.
We consider this the good case, since whenever the scheduler and a
potential userspace / kernel switch is in the critical path for IO
there is possibly overhead impacting throughput.
We therefore focus on random read from here on, because (on synchronous
IO [3]) this will lead to the task being set in iowait for every IO.
This is where iowait boosting shows its biggest throughput improvement.

Comments

Bart Van Assche March 5, 2024, 12:20 a.m. UTC | #1
On 3/4/24 12:16, Christian Loehle wrote:
> Pixel 6 ufs Android 14 (7 runs for because device showed some variance)
> [6605, 6622, 6633, 6652, 6690, 6697, 6754] sugov mainline
> [7141, 7173, 7198, 7220, 7280, 7427, 7452] per-task tracking
> [2390, 2392, 2406, 2437, 2464, 2487, 2813] sugov no iowait boost
> [7812, 7837, 7837, 7851, 7900, 7959, 7980] performance governor

Variance of performance results for Pixel devices can be reduced greatly
by disabling devfreq scaling, e.g. as follows (this may cause thermal
issues if the system load is high enough):

      for d in $(adb shell echo /sys/class/devfreq/*); do
	adb shell "cat $d/available_frequencies |
		tr ' ' '\n' |
		sort -n |
		case $devfreq in
			min) head -n1;;
			max) tail -n1;;
		esac > $d/min_freq"
     done

> Showcasing some different IO scenarios, again all random read,
> median out of 5 runs, all on rk3399 with NVMe.
> e.g. io_uring6x4 means 6 threads with 4 iodepth each, results can be
> obtained using:
> fio --minimal --time_based --name=test --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=io_uring --iodepth=4 --numjobs=6 --group_reporting | cut -d \; -f 8

So buffered I/O was used during this test? Shouldn't direct I/O be used
for this kind of tests (--buffered=0)? Additionally, which I/O scheduler
was configured? I recommend --ioscheduler=none for this kind of tests.

> - Higher cap is not always beneficial, we might place the task away
> from the CPU where the interrupt handler is running, making it run
> on an unboosted CPU which may have a bigger impact than the difference
> between the CPU's capacity the task moved to. (Of course the boost will
> then be reverted again, but a ping-pong every interval is possible).

In the above I see "the interrupt handler". Does this mean that the NVMe
controller in the test setup only supports one completion interrupt for
all completion queues instead of one completion interrupt per completion
queue? There are already Android phones and developer boards available
that support the latter, namely the boards equipped with a UFSHCI 4.0 
controller.

Thanks,

Bart.
Christian Loehle March 5, 2024, 9:13 a.m. UTC | #2
Hi Bart,

On 05/03/2024 00:20, Bart Van Assche wrote:
> On 3/4/24 12:16, Christian Loehle wrote:
>> Pixel 6 ufs Android 14 (7 runs for because device showed some variance)
>> [6605, 6622, 6633, 6652, 6690, 6697, 6754] sugov mainline
>> [7141, 7173, 7198, 7220, 7280, 7427, 7452] per-task tracking
>> [2390, 2392, 2406, 2437, 2464, 2487, 2813] sugov no iowait boost
>> [7812, 7837, 7837, 7851, 7900, 7959, 7980] performance governor
> 
> Variance of performance results for Pixel devices can be reduced greatly
> by disabling devfreq scaling, e.g. as follows (this may cause thermal
> issues if the system load is high enough):
> 
>      for d in $(adb shell echo /sys/class/devfreq/*); do
>     adb shell "cat $d/available_frequencies |
>         tr ' ' '\n' |
>         sort -n |
>         case $devfreq in
>             min) head -n1;;
>             max) tail -n1;;
>         esac > $d/min_freq"
>     done
> 

Thanks for the hint!

>> Showcasing some different IO scenarios, again all random read,
>> median out of 5 runs, all on rk3399 with NVMe.
>> e.g. io_uring6x4 means 6 threads with 4 iodepth each, results can be
>> obtained using:
>> fio --minimal --time_based --name=test --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=io_uring --iodepth=4 --numjobs=6 --group_reporting | cut -d \; -f 8
> 
> So buffered I/O was used during this test? Shouldn't direct I/O be used
> for this kind of tests (--buffered=0)? Additionally, which I/O scheduler
> was configured? I recommend --ioscheduler=none for this kind of tests.

Yes I opted for buffered I/O, I guess it's the eternal question if you
should benchmark the device/stack (O_DIRECT) or be more realistic to actual
use cases (probably). I opted for the latter, but since it's 4K randread
on significantly large devices the results don't differ too much.


>> - Higher cap is not always beneficial, we might place the task away
>> from the CPU where the interrupt handler is running, making it run
>> on an unboosted CPU which may have a bigger impact than the difference
>> between the CPU's capacity the task moved to. (Of course the boost will
>> then be reverted again, but a ping-pong every interval is possible).
> 
> In the above I see "the interrupt handler". Does this mean that the NVMe
> controller in the test setup only supports one completion interrupt for
> all completion queues instead of one completion interrupt per completion
> queue? There are already Android phones and developer boards available
> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.

No, both NVMe test setups have one completion interrupt per completion queue,
so this caveat doesn't affect them, higher capacity CPU is strictly better.
The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
interrupt (on CPU0 on my setup).
The difference between the CPU capacities on the Pixel6 is able to make up for this.
The big CPU is still the best to run these single-threaded fio benchmarks on in terms
of throughput.
FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
interrupt to a big CPU, too. Similarly for the mmc.
Unfortunately the infrastructure is far from being there for the scheduler to move the
interrupt to the same performance domain as the task, which is often optimal both in
terms of throughput and in terms of power.
I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
patch will of course be greatly increased.
Thanks!

Best Regards,
Christian
Bart Van Assche March 5, 2024, 6:36 p.m. UTC | #3
On 3/5/24 01:13, Christian Loehle wrote:
> On 05/03/2024 00:20, Bart Van Assche wrote:
>> On 3/4/24 12:16, Christian Loehle wrote:
>>> - Higher cap is not always beneficial, we might place the task away
>>> from the CPU where the interrupt handler is running, making it run
>>> on an unboosted CPU which may have a bigger impact than the difference
>>> between the CPU's capacity the task moved to. (Of course the boost will
>>> then be reverted again, but a ping-pong every interval is possible).
>>
>> In the above I see "the interrupt handler". Does this mean that the NVMe
>> controller in the test setup only supports one completion interrupt for
>> all completion queues instead of one completion interrupt per completion
>> queue? There are already Android phones and developer boards available
>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
> 
> No, both NVMe test setups have one completion interrupt per completion queue,
> so this caveat doesn't affect them, higher capacity CPU is strictly better.
> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
> interrupt (on CPU0 on my setup).

I think that measurements should be provided in the cover letter for the
two types of storage controllers: one series of measurements for a
storage controller with a single completion interrupt and a second
series of measurements for storage controllers with one completion
interrupt per CPU.

> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
> interrupt to a big CPU, too. Similarly for the mmc.
> Unfortunately the infrastructure is far from being there for the scheduler to move the
> interrupt to the same performance domain as the task, which is often optimal both in
> terms of throughput and in terms of power.
> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
> patch will of course be greatly increased.

I'm not sure whether making the completion interrupt follow the workload
is a good solution. I'm concerned that this would increase energy
consumption by keeping the big cores active longer than necessary. I
like this solution better (improves storage performance on at least
devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
Handle HMP systems when completing IO"
(https://lore.kernel.org/linux-block/20240223155749.2958009-1-qyousef@layalina.io/).

Thanks,

Bart.
Christian Loehle March 6, 2024, 10:49 a.m. UTC | #4
Hi Bart,

On 05/03/2024 18:36, Bart Van Assche wrote:
> On 3/5/24 01:13, Christian Loehle wrote:
>> On 05/03/2024 00:20, Bart Van Assche wrote:
>>> On 3/4/24 12:16, Christian Loehle wrote:
>>>> - Higher cap is not always beneficial, we might place the task away
>>>> from the CPU where the interrupt handler is running, making it run
>>>> on an unboosted CPU which may have a bigger impact than the difference
>>>> between the CPU's capacity the task moved to. (Of course the boost will
>>>> then be reverted again, but a ping-pong every interval is possible).
>>>
>>> In the above I see "the interrupt handler". Does this mean that the NVMe
>>> controller in the test setup only supports one completion interrupt for
>>> all completion queues instead of one completion interrupt per completion
>>> queue? There are already Android phones and developer boards available
>>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
>>
>> No, both NVMe test setups have one completion interrupt per completion queue,
>> so this caveat doesn't affect them, higher capacity CPU is strictly better.
>> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
>> interrupt (on CPU0 on my setup).
> 
> I think that measurements should be provided in the cover letter for the
> two types of storage controllers: one series of measurements for a
> storage controller with a single completion interrupt and a second
> series of measurements for storage controllers with one completion
> interrupt per CPU.

Of the same type of storage controller? Or what is missing for you in
the cover letter exactly (ufs/emmc: single completion interrupt,
nvme: one completion interrupt per CPU).

> 
>> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
>> interrupt to a big CPU, too. Similarly for the mmc.
>> Unfortunately the infrastructure is far from being there for the scheduler to move the
>> interrupt to the same performance domain as the task, which is often optimal both in
>> terms of throughput and in terms of power.
>> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
>> patch will of course be greatly increased.
> 
> I'm not sure whether making the completion interrupt follow the workload
> is a good solution. I'm concerned that this would increase energy
> consumption by keeping the big cores active longer than necessary. I
> like this solution better (improves storage performance on at least
> devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
> Handle HMP systems when completing IO"
> (https://lore.kernel.org/linux-block/20240223155749.2958009-1-qyousef@layalina.io/).

That patch is good, don't get me wrong, but you still lose out by running everything
up to blk_mq_complete_request() on (potentially) a LITTlE (that might be run on a low OPP),
while having a big CPU available at a high OPP anyway ("for free").
It is only adjacent to the series but I've done some measurements (Pixel6 again, same device
as cover letter, Base is Android 6.6 mainline kernel (so without my series, but I somewhat forced
the effects by task pinning), Applied is with both of sched: blk: Handle HMP systems when completing IO):

Pretty numbers (IOPS):
Base irq@CPU0 median: 6969
Base irq@CPU6 median: 8407 (+20.6%)
Applied irq@CPU0 median: 7144 (+2.5%)
Applied irq@CPU6 median: 8288 (18.9%)

This is with psyncx1 4K Random Read again, of course anything with queue depth
takes advantage of batch completions to significantly reduce irq pressure.

Not so pretty numbers and full list commands used:

w/o patch:
irq on CPU0 (default):
psyncx1: 7000 6969 7025 6954 6964
io_uring4x128: 28766 28280 28339 28310 28349
irq on CPU6:
psyncx1: 8342 8492 8355 8407 8532
io_uring4x128: 28641 28356 25908 25787 25853

with patch:
irq on CPU0:
psyncx1: 7672 7144 7301 6976 6889
io_uring4x128: 28266 26314 27648 24482 25301
irq on CPU6:
psyncx1: 8208 8401 8351 8221 8288
io_uring4x128: 25603 25438 25453 25514 25402


for i in $(seq 0 4); do taskset c0 /data/local/tmp/fio_aosp_build --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/block/sda --minimal | awk -F ";" '{print $8}'; sleep 30; done

for i in $(seq 0 4); do taskset c0 /data/local/tmp/fio_aosp_build --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/block/sda --ioengine=io_uring --iodepth=128 --numjobs=4 --group_reporting --minimal | awk -F ";" '{print $8}'; sleep 30; done

echo 6 > /proc/irq/296/smp_affinity_list


Kind Regards,
Christian
Qais Yousef March 21, 2024, 12:39 p.m. UTC | #5
(Thanks for the CC Bart)

On 03/06/24 10:49, Christian Loehle wrote:
> Hi Bart,
> 
> On 05/03/2024 18:36, Bart Van Assche wrote:
> > On 3/5/24 01:13, Christian Loehle wrote:
> >> On 05/03/2024 00:20, Bart Van Assche wrote:
> >>> On 3/4/24 12:16, Christian Loehle wrote:
> >>>> - Higher cap is not always beneficial, we might place the task away
> >>>> from the CPU where the interrupt handler is running, making it run
> >>>> on an unboosted CPU which may have a bigger impact than the difference
> >>>> between the CPU's capacity the task moved to. (Of course the boost will
> >>>> then be reverted again, but a ping-pong every interval is possible).
> >>>
> >>> In the above I see "the interrupt handler". Does this mean that the NVMe
> >>> controller in the test setup only supports one completion interrupt for
> >>> all completion queues instead of one completion interrupt per completion
> >>> queue? There are already Android phones and developer boards available
> >>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
> >>
> >> No, both NVMe test setups have one completion interrupt per completion queue,
> >> so this caveat doesn't affect them, higher capacity CPU is strictly better.
> >> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
> >> interrupt (on CPU0 on my setup).
> > 
> > I think that measurements should be provided in the cover letter for the
> > two types of storage controllers: one series of measurements for a
> > storage controller with a single completion interrupt and a second
> > series of measurements for storage controllers with one completion
> > interrupt per CPU.
> 
> Of the same type of storage controller? Or what is missing for you in
> the cover letter exactly (ufs/emmc: single completion interrupt,
> nvme: one completion interrupt per CPU).
> 
> > 
> >> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
> >> interrupt to a big CPU, too. Similarly for the mmc.
> >> Unfortunately the infrastructure is far from being there for the scheduler to move the
> >> interrupt to the same performance domain as the task, which is often optimal both in
> >> terms of throughput and in terms of power.
> >> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
> >> patch will of course be greatly increased.
> > 
> > I'm not sure whether making the completion interrupt follow the workload
> > is a good solution. I'm concerned that this would increase energy
> > consumption by keeping the big cores active longer than necessary. I
> > like this solution better (improves storage performance on at least
> > devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
> > Handle HMP systems when completing IO"
> > (https://lore.kernel.org/linux-block/20240223155749.2958009-1-qyousef@layalina.io/).
> 
> That patch is good, don't get me wrong, but you still lose out by running everything
> up to blk_mq_complete_request() on (potentially) a LITTlE (that might be run on a low OPP),
> while having a big CPU available at a high OPP anyway ("for free").
> It is only adjacent to the series but I've done some measurements (Pixel6 again, same device
> as cover letter, Base is Android 6.6 mainline kernel (so without my series, but I somewhat forced
> the effects by task pinning), Applied is with both of sched: blk: Handle HMP systems when completing IO):

So you want the hardirq to move to the big core? Unlike softirq, there will be
a single hardirq for the controller (to my limited knowledge), so if there are
multiple requests I'm not sure we can easily match which one relates to which
before it triggers. So we can end up waking up the wrong core.

Generally this should be a userspace policy. If there's a scenario where the
throughput is that important they can easily move the hardirq to the big core
unconditionally and move it back again once this high throughput scenario is no
longer important.

Or where you describing a different problem?

Glad to see your series by the way :-) I'll get a chance to review it over the
weekend hopefully.


Cheers

--
Qais Yousef

> 
> Pretty numbers (IOPS):
> Base irq@CPU0 median: 6969
> Base irq@CPU6 median: 8407 (+20.6%)
> Applied irq@CPU0 median: 7144 (+2.5%)
> Applied irq@CPU6 median: 8288 (18.9%)
> 
> This is with psyncx1 4K Random Read again, of course anything with queue depth
> takes advantage of batch completions to significantly reduce irq pressure.
> 
> Not so pretty numbers and full list commands used:
> 
> w/o patch:
> irq on CPU0 (default):
> psyncx1: 7000 6969 7025 6954 6964
> io_uring4x128: 28766 28280 28339 28310 28349
> irq on CPU6:
> psyncx1: 8342 8492 8355 8407 8532
> io_uring4x128: 28641 28356 25908 25787 25853
> 
> with patch:
> irq on CPU0:
> psyncx1: 7672 7144 7301 6976 6889
> io_uring4x128: 28266 26314 27648 24482 25301
> irq on CPU6:
> psyncx1: 8208 8401 8351 8221 8288
> io_uring4x128: 25603 25438 25453 25514 25402
> 
> 
> for i in $(seq 0 4); do taskset c0 /data/local/tmp/fio_aosp_build --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/block/sda --minimal | awk -F ";" '{print $8}'; sleep 30; done
> 
> for i in $(seq 0 4); do taskset c0 /data/local/tmp/fio_aosp_build --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/block/sda --ioengine=io_uring --iodepth=128 --numjobs=4 --group_reporting --minimal | awk -F ";" '{print $8}'; sleep 30; done
> 
> echo 6 > /proc/irq/296/smp_affinity_list
> 
> 
> Kind Regards,
> Christian
Christian Loehle March 21, 2024, 5:57 p.m. UTC | #6
On 21/03/2024 12:39, Qais Yousef wrote:
[snip]
>> On 05/03/2024 18:36, Bart Van Assche wrote:
>>> On 3/5/24 01:13, Christian Loehle wrote:
>>>> On 05/03/2024 00:20, Bart Van Assche wrote:
>>>>> On 3/4/24 12:16, Christian Loehle wrote:
>>>>>> - Higher cap is not always beneficial, we might place the task away
>>>>>> from the CPU where the interrupt handler is running, making it run
>>>>>> on an unboosted CPU which may have a bigger impact than the difference
>>>>>> between the CPU's capacity the task moved to. (Of course the boost will
>>>>>> then be reverted again, but a ping-pong every interval is possible).
>>>>>
>>>>> In the above I see "the interrupt handler". Does this mean that the NVMe
>>>>> controller in the test setup only supports one completion interrupt for
>>>>> all completion queues instead of one completion interrupt per completion
>>>>> queue? There are already Android phones and developer boards available
>>>>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
>>>>
>>>> No, both NVMe test setups have one completion interrupt per completion queue,
>>>> so this caveat doesn't affect them, higher capacity CPU is strictly better.
>>>> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
>>>> interrupt (on CPU0 on my setup).
>>>
>>> I think that measurements should be provided in the cover letter for the
>>> two types of storage controllers: one series of measurements for a
>>> storage controller with a single completion interrupt and a second
>>> series of measurements for storage controllers with one completion
>>> interrupt per CPU.
>>
>> Of the same type of storage controller? Or what is missing for you in
>> the cover letter exactly (ufs/emmc: single completion interrupt,
>> nvme: one completion interrupt per CPU).
>>
>>>
>>>> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
>>>> interrupt to a big CPU, too. Similarly for the mmc.
>>>> Unfortunately the infrastructure is far from being there for the scheduler to move the
>>>> interrupt to the same performance domain as the task, which is often optimal both in
>>>> terms of throughput and in terms of power.
>>>> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
>>>> patch will of course be greatly increased.
>>>
>>> I'm not sure whether making the completion interrupt follow the workload
>>> is a good solution. I'm concerned that this would increase energy
>>> consumption by keeping the big cores active longer than necessary. I
>>> like this solution better (improves storage performance on at least
>>> devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
>>> Handle HMP systems when completing IO"
>>> (https://lore.kernel.org/linux-block/20240223155749.2958009-1-qyousef@layalina.io/).
>>
>> That patch is good, don't get me wrong, but you still lose out by running everything
>> up to blk_mq_complete_request() on (potentially) a LITTlE (that might be run on a low OPP),
>> while having a big CPU available at a high OPP anyway ("for free").
>> It is only adjacent to the series but I've done some measurements (Pixel6 again, same device
>> as cover letter, Base is Android 6.6 mainline kernel (so without my series, but I somewhat forced
>> the effects by task pinning), Applied is with both of sched: blk: Handle HMP systems when completing IO):
> 
> So you want the hardirq to move to the big core? Unlike softirq, there will be
> a single hardirq for the controller (to my limited knowledge), so if there are
> multiple requests I'm not sure we can easily match which one relates to which
> before it triggers. So we can end up waking up the wrong core.

It would be beneficial to move the hardirq to a big core if the IO task
is using it anyway.
I'm not sure I actually want to. There are quite a few pitfalls (like you
mentioned) that the scheduler really shouldn't be concerned about.
Moving the hardirq, if implemented in the kernel, would have to be done by the
host controller driver anyway, which would explode this series.
(host controller drivers are quite fragmented e.g. on mmc)

The fact that having a higher capacity CPU available ("running faster") for an
IO task doesn't (always) imply higher throughput because of the hardirq staying
on some LITTLE CPU is bothering (for this series), though.

> 
> Generally this should be a userspace policy. If there's a scenario where the
> throughput is that important they can easily move the hardirq to the big core
> unconditionally and move it back again once this high throughput scenario is no
> longer important.

It also feels wrong to let this be a userspace policy, as the hardirq must be
migrated to the perf domain of the task, which userspace isn't aware of.
Unless you expect userspace to do
CPU_affinity_task=big_perf_domain_0 && hardirq_affinity=big_perf_domain_0
but then you could just as well ask them to set performance governor for
big_perf_domain_0 (or uclamp_min=1024) and need neither this series nor
any iowait boosting.

Furthermore you can't generally expect userspace to know if their IO will lead
to any interrupt at all, much less which one. They ideally don't even know if
the file IO they are doing is backed by any physical storage in the first place.
(Or even further, that they are doing file IO at all, they might just be
e.g. page-faulting.)

> 
> Or where you describing a different problem?

That is the problem I mentioned in the series and Bart and I were discussing.
It's a problem of the series as in "the numbers aren't that impressive".
Current iowait boosting on embedded/mobile systems will perform quite well by
chance, as the (low util) task will often be on the same perf domain the hardirq
will be run on. As can be seen in the cover letter the benefit of running the
task on a (2xLITTLE capacity) big CPU therefore are practically non-existent,
for tri-gear systems where big CPU is more like 10xLITTLE capacity the benefit
will be much greater.
I just wanted to point this out. We might just acknowledge the problem and say
"don't care" about the potential performance benefits of those scenarios that
would require hardirq moving.
In the long-term it looks like for UFS the problem will disappear as we are
expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
is already the case.

I CC'd Uffe and Adrian for mmc, to my knowledge the only subsystem where
'fast' (let's say >10K IOPS) devices are common, but only one queue/hardirq
is available (and it doesn't look like this is changing anytime soon).
I would also love to hear what Bart or other UFS folks think about it.
Furthermore if I forgot any storage subsystem with the same behavior in that
regards do tell me.

Lastly, you could consider the IO workload:
IO task being in iowait very frequently [1] with just a single IO inflight [2]
and only very little time being spent on the CPU in-between iowaits[3],
therefore the interrupt handler being on the critical path for IO throughput
to a non-negligible degree, to be niche, but it's precisely the use-case where
iowait boosting shows it's biggest benefit.

Sorry for the abomination of a sentence, see footnotes for the reasons.

[1] If sugov doesn't see significantly more than 1 iowait per TICK_NSEC it
won't apply any significant boost currently.
[2] If the storage devices has enough in-flight requests to serve, iowait
boosting is unnecessary/wasteful, see cover letter.
[3] If the task actually uses the CPU in-between iowaits, it will build up
utilization, iowait boosting benefit diminishes.

> 
> Glad to see your series by the way :-) I'll get a chance to review it over the
> weekend hopefully.

Thank you!
Apologies for not CCing you in the first place, I am curious about your opinion
on the concept!

FWIW I did mess up a last-minute, what was supposed to be, cosmetic change that
only received a quick smoke test, so 1/2 needs the following:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4aaf64023b03..2b6f521be658 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6824,7 +6824,7 @@ static void dequeue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
        } else if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
                /* Reduce boost */
                if (p->io_boost_level > 1)
-                       io_boost_scale_interval(p, true);
+                       io_boost_scale_interval(p, false);
                else
                        p->io_boost_level = 0;
        } else if (p->io_boost_level == IO_BOOST_LEVELS) {


I'll probably send a v2 rebased on 6.9 when it's out anyway, but so far the
changes are mostly cosmetic and addressing Bart's comments about the benchmark
numbers in the cover letter.

Kind Regards,
Christian
Bart Van Assche March 21, 2024, 7:52 p.m. UTC | #7
On 3/21/24 10:57, Christian Loehle wrote:
> In the long-term it looks like for UFS the problem will disappear as we are
> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
> is already the case.

Why the focus on storage controllers with a single completion interrupt?
It probably won't take long (one year?) until all new high-end
smartphones may have support for multiple completion interrupts.

Thanks,

Bart.
Vincent Guittot March 22, 2024, 6:08 p.m. UTC | #8
Hi Christian,

On Mon, 4 Mar 2024 at 21:17, Christian Loehle <christian.loehle@arm.com> wrote:
>
> There is a feature inside of both schedutil and intel_pstate called
> iowait boosting which tries to prevent selecting a low frequency
> during IO workloads when it impacts throughput.
> The feature is implemented by checking for task wakeups that have
> the in_iowait flag set and boost the CPU of the rq accordingly
> (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
>
> The necessity of the feature is argued with the potentially low
> utilization of a task being frequently in_iowait (i.e. most of the
> time not enqueued on any rq and cannot build up utilization).
>
> The RFC focuses on the schedutil implementation.
> intel_pstate frequency selection isn't touched for now, suggestions are
> very welcome.
> Current schedutil iowait boosting has several issues:
> 1. Boosting happens even in scenarios where it doesn't improve
> throughput. [1]
> 2. The boost is not accounted for in EAS: a) feec() will only consider
>  the actual utilization for task placement, but another CPU might be
>  more energy-efficient at that capacity than the boosted one.)
>  b) When placing a non-IO task while a CPU is boosted compute_energy()
>  will not consider the (potentially 'free') boosted capacity, but the
>  one it would have without the boost (since the boost is only applied
>  in sugov).
> 3. Actual IO heavy workloads are hardly distinguished from infrequent
> in_iowait wakeups.
> 4. The boost isn't associated with a task, it therefore isn't considered
> for task placement, potentially missing out on higher capacity CPUs on
> heterogeneous CPU topologies.
> 5. The boost isn't associated with a task, it therefore lingers on the
> rq even after the responsible task has migrated / stopped.
> 6. The boost isn't associated with a task, it therefore needs to ramp
> up again when migrated.
> 7. Since schedutil doesn't know which task is getting woken up,
> multiple unrelated in_iowait tasks might lead to boosting.
>
> We attempt to mitigate all of the above by reworking the way the
> iowait boosting (io boosting from here on) works in two major ways:
> - Carry the boost in task_struct, so it is a per-task attribute and
> behaves similar to utilization of the task in some ways.
> - Employ a counting-based tracking strategy that only boosts as long
> as it sees benefits and returns to no boosting dynamically.

Thanks for working on improving IO boosting. I have started to read
your patchset and have few comments about your proposal:

The main one is that the io boosting decision should remain a cpufreq
governor decision and so the io boosting value should be applied by
the governor like in sugov_effective_cpu_perf() as an example instead
of everywhere in the scheduler code.

Then, the algorithm to track the right interval bucket and the mapping
of intervals into utilization really looks like a policy which has
been defined with heuristics and as a result further seems to be a
governor decision

Finally adding some atomic operation in the fast path is not really desirable

I will continue to review your patchset

>
> Note that some the issues (1, 3) can be solved by using a
> counting-based strategy on a per-rq basis, i.e. in sugov entirely.
> Experiments with Android in particular showed that such a strategy
> (which necessarily needs longer intervals to be reasonably stable)
> is too prone to migrations to be useful generally.
> We therefore consider the additional complexity of such a per-task
> based approach like proposed to be worth it.
>
> We require a minimum of 1000 iowait wakeups per second to start
> boosting.
> This isn't too far off from what sugov currently does, since it resets
> the boost if it hasn't seen an iowait wakeup for TICK_NSEC.
> For CONFIG_HZ=1000 we are on par, for anything below we are stricter.
> We justify this by the small possible improvement by boosting in the
> first place with 'rare' few iowait wakeups.
>
> When IO even leads to a task being in iowait isn't as straightforward
> to explain.
> Of course if the issued IO can be served by the page cache (e.g. on
> reads because the pages are contained, on writes because they can be
> marked dirty and the writeback takes care of it later) the actual
> issuing task is usually not in iowait.
> We consider this the good case, since whenever the scheduler and a
> potential userspace / kernel switch is in the critical path for IO
> there is possibly overhead impacting throughput.
> We therefore focus on random read from here on, because (on synchronous
> IO [3]) this will lead to the task being set in iowait for every IO.
> This is where iowait boosting shows its biggest throughput improvement.
> From here on IOPS (IO operations per second) and iowait wakeups may
> therefore be used interchangeably.
>
> Performance:
> Throughput for random read tries to be on par with the sugov
> implementation of iowait boosting for reasonably long-lived workloads.
> See the following table for some results, values are in IOPS, the
> tests are ran for 30s with pauses in-between, results are sorted.
>
> nvme on rk3399
> [3588, 3590, 3597, 3632, 3745] sugov mainline
> [3581, 3751, 3770, 3771, 3885] per-task tracking
> [2592, 2639, 2701, 2717, 2784] sugov no iowait boost
> [3218, 3451, 3598, 3848, 3921] performance governor
>
> emmc with cqe on rk3399
> [4146, 4155, 4159, 4161, 4193] sugov mainline
> [2848, 3217, 4375, 4380, 4454] per-task tracking
> [2510, 2665, 3093, 3101, 3105] sugov no iowait boost
> [4690, 4803, 4860, 4976, 5069] performance governor
>
> sd card on rk3399
> [1777, 1780, 1806, 1827, 1850] sugov mainline
> [1470, 1476, 1507, 1534, 1586] per-task tracking
> [1356, 1372, 1373, 1377, 1416] sugov no iowait boost
> [1861, 1890, 1901, 1905, 1908] performance governor
>
> Pixel 6 ufs Android 14 (7 runs for because device showed some variance)
> [6605, 6622, 6633, 6652, 6690, 6697, 6754] sugov mainline
> [7141, 7173, 7198, 7220, 7280, 7427, 7452] per-task tracking
> [2390, 2392, 2406, 2437, 2464, 2487, 2813] sugov no iowait boost
> [7812, 7837, 7837, 7851, 7900, 7959, 7980] performance governor
>
> Apple M1 apple-nvme
> [27421, 28331, 28515, 28699, 29529] sugov mainline
> [27274, 27344, 27345, 27384, 27930] per-task tracking
> [14480, 14512, 14625, 14872, 14967] sugov no iowait boost
> [31595, 32085, 32386, 32465, 32643] performance governor
>
> Showcasing some different IO scenarios, again all random read,
> median out of 5 runs, all on rk3399 with NVMe.
> e.g. io_uring6x4 means 6 threads with 4 iodepth each, results can be
> obtained using:
> fio --minimal --time_based --name=test --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=io_uring --iodepth=4 --numjobs=6 --group_reporting | cut -d \; -f 8
>
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |               | Sugov mainline | Per-task tracking | Sugov no boost | Performance | Powersave |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |       psyncx1 |           4073 |              3793 |           2979 |        4190 |      2788 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |       psyncx4 |          13921 |             13503 |          10635 |       13931 |     10225 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |       psyncx6 |          18473 |             17866 |          15902 |       19080 |     15789 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |       psyncx8 |          22498 |             21242 |          19867 |       22650 |     18837 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |      psyncx10 |          24801 |             23552 |          23658 |       25096 |     21474 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |      psyncx12 |          26743 |             25377 |          26372 |       26663 |     23613 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |     libaio1x1 |           4054 |              3542 |           2776 |        4055 |      2780 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |   libaio1x128 |           3959 |              3516 |           2758 |        3590 |      2560 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |   libaio4x128 |          13451 |             12517 |          10313 |       13403 |      9994 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |     libaio6x1 |          18394 |             17432 |          15340 |       18954 |     15251 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |     libaio6x4 |          18329 |             17100 |          15238 |       18623 |     15270 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |   libaio6x128 |          18066 |             16964 |          15139 |       18577 |     15192 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |   io_uring1x1 |           4043 |              3548 |           2810 |        4039 |      2689 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |  io_uring4x64 |          35790 |             32814 |          35983 |       34934 |     33254 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring1x128 |          32651 |             30427 |          32429 |       33232 |      9973 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring2x128 |          34928 |             32595 |          34922 |       33726 |     18790 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring4x128 |          34414 |             32173 |          34932 |       33332 |     33005 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> |   io_uring6x4 |          31578 |             29260 |          31714 |       31399 |     31784 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring6x128 |          34480 |             32634 |          34973 |       33390 |     36452 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
>
> Based on the above we can basically categorize these into the following
> three:
> a) boost is useful
> b) boost irrelevant (util dominates)
> c) boost is energy-inefficient (boost dominates)
>
> The aim of the patch 1/2 is to boost as much as necessary for a) while
> boosting little for c) (thus saving energy).
>
> Energy-savings:
> Regarding sugov iowait boosting problem 1 mentioned earlier,
> some improvement can be seen:
> Tested on rk3399 (LLLL)(bb) with an NVMe, 30s runtime
> CPU0 perf domain spans 0-3 with 400MHz to 1400MHz
> CPU4 perf domain spans 4-5 with 400MHz to 1800MHz
>
> io_uring6x128:
> Sugov iowait boost:
> Average frequency for CPU0 : 1.180 GHz
> Average frequency for CPU4 : 1.504 GHz
> Per-task tracking:
> Average frequency for CPU0 : 1.070 GHz
> Average frequency for CPU4 : 1.211 GHz
>
> io_uring12x128:
> Sugov iowait boost:
> Average frequency for CPU0 : 1.324 GHz
> Average frequency for CPU4 : 1.444 GHz
> Per-task tracking:
> Average frequency for CPU0 : 1.260 GHz
> Average frequency for CPU4 : 1.062 GHz
> (In both cases actually 400MHz on both perf domains is optimal, more
> fine-tuning could get us closer [2])
>
> [1]
> There are many scenarios when it doesn't, so let's start with
> explaining when it does:
> Boosting improves throughput if there is frequent IO to a device from
> one or few origins, such that the device is likely idle when the task
> is enqueued on the rq and reducing this time cuts down on the storage
> device idle time.
> This might not be true (and boosting doesn't help) if:
> - The storage device uses the idle time to actually commit the IO to
> persistent storage or do other management activity (this can be
> observed with e.g. writes to flash-based storage, which will usually
> write to cache and flush the cache when idle or necessary).
> - The device is under thermal pressure and needs idle time to cool off
> (not uncommon for e.g. nvme devices).
> Furthermore the assumption (the device being idle while task is
> enqueued) is false altogether if:
> - Other tasks use the same storage device.
> - The task uses asynchronous IO with iodepth > 1 like io_uring, the
> in_iowait is then just to fill the queue on the host again.
> - The task just sets in_iowait to signal it is waiting on io to not
> appear as system idle, it might not send any io at all (cf with
> the various occurrences of in_iowait, io_mutex_lock and io_schedule*).
>
> [3]
> Unfortunately even for asynchronous IO iowait may be set, in the case
> of io_uring this is specifically for the iowait boost to trigger, see
> commit ("8a796565cec3 io_uring: Use io_schedule* in cqring wait")
> which is why the energy-savings are so significant here, as io_uring
> load on the CPU is minimal.
>
> Problems encountered:
> - Higher cap is not always beneficial, we might place the task away
> from the CPU where the interrupt handler is running, making it run
> on an unboosted CPU which may have a bigger impact than the difference
> between the CPU's capacity the task moved to. (Of course the boost will
> then be reverted again, but a ping-pong every interval is possible).
> - [2] tracking and scaling can be improved (io_uring12x128 still shows
> boosting): Unfortunately tracking purely per-task shows some limits.
> One task might show more iowaits per second when boosted, but overall
> throughput doesn't increase => there is still some boost.
> The task throughput improvement is somewhat limited though,
> so by fine-tuning the thresholds there could be mitigations.
>
> Christian Loehle (2):
>   sched/fair: Introduce per-task io util boost
>   cpufreq/schedutil: Remove iowait boost
>
>  include/linux/sched.h            |  15 +++
>  kernel/sched/cpufreq_schedutil.c | 150 ++--------------------------
>  kernel/sched/fair.c              | 165 +++++++++++++++++++++++++++++--
>  kernel/sched/sched.h             |   4 +-
>  4 files changed, 182 insertions(+), 152 deletions(-)
>
> --
> 2.34.1
>
Vincent Guittot March 28, 2024, 10:09 a.m. UTC | #9
On Mon, 25 Mar 2024 at 13:24, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 22/03/2024 18:08, Vincent Guittot wrote:
> > Hi Christian,
> Hi Vincent,
> thanks for taking a look.
>
> >
> > On Mon, 4 Mar 2024 at 21:17, Christian Loehle <christian.loehle@arm.com> wrote:
> >>
> >> There is a feature inside of both schedutil and intel_pstate called
> >> iowait boosting which tries to prevent selecting a low frequency
> >> during IO workloads when it impacts throughput.
> >> The feature is implemented by checking for task wakeups that have
> >> the in_iowait flag set and boost the CPU of the rq accordingly
> >> (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
> >>
> >> The necessity of the feature is argued with the potentially low
> >> utilization of a task being frequently in_iowait (i.e. most of the
> >> time not enqueued on any rq and cannot build up utilization).
> >>
> >> The RFC focuses on the schedutil implementation.
> >> intel_pstate frequency selection isn't touched for now, suggestions are
> >> very welcome.
> >> Current schedutil iowait boosting has several issues:
> >> 1. Boosting happens even in scenarios where it doesn't improve
> >> throughput. [1]
> >> 2. The boost is not accounted for in EAS: a) feec() will only consider
> >>  the actual utilization for task placement, but another CPU might be
> >>  more energy-efficient at that capacity than the boosted one.)
> >>  b) When placing a non-IO task while a CPU is boosted compute_energy()
> >>  will not consider the (potentially 'free') boosted capacity, but the
> >>  one it would have without the boost (since the boost is only applied
> >>  in sugov).
> >> 3. Actual IO heavy workloads are hardly distinguished from infrequent
> >> in_iowait wakeups.
> >> 4. The boost isn't associated with a task, it therefore isn't considered
> >> for task placement, potentially missing out on higher capacity CPUs on
> >> heterogeneous CPU topologies.
> >> 5. The boost isn't associated with a task, it therefore lingers on the
> >> rq even after the responsible task has migrated / stopped.
> >> 6. The boost isn't associated with a task, it therefore needs to ramp
> >> up again when migrated.
> >> 7. Since schedutil doesn't know which task is getting woken up,
> >> multiple unrelated in_iowait tasks might lead to boosting.
> >>
> >> We attempt to mitigate all of the above by reworking the way the
> >> iowait boosting (io boosting from here on) works in two major ways:
> >> - Carry the boost in task_struct, so it is a per-task attribute and
> >> behaves similar to utilization of the task in some ways.
> >> - Employ a counting-based tracking strategy that only boosts as long
> >> as it sees benefits and returns to no boosting dynamically.
> >
> > Thanks for working on improving IO boosting. I have started to read
> > your patchset and have few comments about your proposal:
> >
> > The main one is that the io boosting decision should remain a cpufreq
> > governor decision and so the io boosting value should be applied by
> > the governor like in sugov_effective_cpu_perf() as an example instead
> > of everywhere in the scheduler code
> Having it move into the scheduler is to enable it for EAS (e.g. boosting
> a LITTLE to it's highest OPP often being much less energy-efficient than
> having a higher cap CPU at a lower OPP) and to enable higher capacities
> reachable on other CPUs, too.

sugov_effective_cpu_perf() is used by EAS when finding the final OPP
and computing the energy so I don't see a problem of moving the policy
(converting some iowait boost information into some performance level)
into the cpufreq governor. EAS should be able to select the more
efficient CPU for the waking task.
Furthermore, you add it into the utilization whereas iowait boost is
not a capacity that will be used by the task but like a minimum
bandwidth requirement to speedup its execution; This could be seen
like uclamp_min especially if you also want to use the iowait boosting
to migrate tasks. But I don't think that this is exactly the same.
Uclamp_min helps when a task has always the same amount of small work
to do periodically. Whatever the frequency, its utilization remains
(almost) the same and is not really expected to impact its period. In
the case of iowait boost, when you increase the frequency, the task
will do more work and its utilization will decrease (because the
overall periods will decrease). This increase of the utilization
should be the trigger for migrating the task on another CPU.

> I guess for you the first one is the more interesting one.
>
> >
> > Then, the algorithm to track the right interval bucket and the mapping
> > of intervals into utilization really looks like a policy which has
> > been defined with heuristics and as a result further seems to be a
> > governor decision
>
> I did have a comparable thing as a governor decision, but the entire
> "Test if util boost increases iowaits seen per interval and only boost
> accordingly" really only works if the interval is long enough, my proposed
> starting length of 25ms really being the lower limit for the storage devices
> we want to cover (IO latency not being constant and therefore iowaits per
> interval being somewhat noisy).

Your explanation above confirms that it's a policy for your storage devices.

> Given that the IO tasks will be enqueued/dequeued very frequently it just
> isn't credible to expect them to land on the same CPU for many intervals,
> unless your system is very bare-bones and idle, but even on an idle Android
> I see any interval above 50ms to be unusable and not provide any throughput
> improvement.
> The idea of tracking the iowaits I do find the best option in this vague and
> noisy environment of "iowait wakeups" and definitely worth having, so that's
> why I opted for it being in the scheduler code, but I'd love to hear your
> thoughts/alternatives.

There is 3 parts in your proposal
1- tracking per task iowait statistics
2- translate that into a capacity more than an utilization
3- use this value in EAS

Having 1- in the scheduler seems ok but 2- and 3- should not be
injected directly into the scheduler



> I'd also like an improvement on the definition of iowait or some more separate
> flag for boostable IO, the entire "boost on any iowait wakeup" is groping in
> the dark which I'm trying to combat, but it's somewhat out of scope here.
>
> >
> > Finally adding some atomic operation in the fast path is not really desirable
> Agreed, I'll look into it, for now I wanted as much feedback on the two major
> changes:
> - iowait boost now per-task
> - boosting based upon iowaits seen per interval
>
> >
> > I will continue to review your patchset
>
> Thank you, looking forward to seeing your review.
>
> >>[snip]
> Kind Regards,
> Christian