Message ID: 20210428232821.2506201-1-swood@redhat.com
Series: newidle_balance() PREEMPT_RT latency mitigations
On Thu, 2021-04-29 at 09:12 +0200, Vincent Guittot wrote:
> Hi Scott,
>
> On Thu, 29 Apr 2021 at 01:28, Scott Wood <swood@redhat.com> wrote:
> > These patches mitigate latency caused by newidle_balance() on large
> > systems when PREEMPT_RT is enabled, by enabling interrupts when the lock
> > is dropped, and exiting early at various points if an RT task is
> > runnable on the current CPU.
> >
> > On a system with 128 CPUs, these patches dropped latency (as measured by
> > a 12-hour rteval run) from 1045us to 317us (when applied to
> > 5.12.0-rc3-rt3).
>
> The patch below has been queued for v5.13 and removes the update of
> blocked load, which seemed to be the major reason for long preempt/irq-off
> sections during newly idle balance:
> https://lore.kernel.org/lkml/20210224133007.28644-1-vincent.guittot@linaro.org/
>
> I would be curious to see how it impacts your cases.

I still get 1000+ us latencies with those patches applied.

-Scott
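A rough sketch of the approach under discussion (a simplification that assumes
current kernel helper names such as sched_rt_runnable() and load_balance(); it
is not the posted series): it shows where newidle_balance() drops the runqueue
lock, where interrupts can be re-enabled, and where balancing can bail out
early once an RT task becomes runnable on the current CPU.

```c
/*
 * Illustrative sketch only, NOT the posted patches: newidle_balance() runs
 * with the rq lock held and interrupts off, so the idea is to re-enable
 * interrupts for the window where the lock is dropped, and to stop balancing
 * as soon as an RT task becomes runnable locally.
 */
static int newidle_balance_sketch(struct rq *this_rq, struct rq_flags *rf)
{
	int this_cpu = this_rq->cpu;
	int continue_balancing = 1;
	int pulled_task = 0;
	struct sched_domain *sd;

	rq_unpin_lock(this_rq, rf);
	raw_spin_unlock(&this_rq->lock);
	local_irq_enable();		/* lock dropped: let irqs (and RT wakeups) in */

	rcu_read_lock();
	for_each_domain(this_cpu, sd) {
		/* An RT task showed up on this CPU: stop and go run it. */
		if (sched_rt_runnable(this_rq))
			break;

		if (sd->flags & SD_BALANCE_NEWIDLE)
			pulled_task = load_balance(this_cpu, this_rq, sd,
						   CPU_NEWLY_IDLE,
						   &continue_balancing);
		if (pulled_task || !continue_balancing)
			break;
	}
	rcu_read_unlock();

	local_irq_disable();
	raw_spin_lock(&this_rq->lock);
	rq_repin_lock(this_rq, rf);

	return pulled_task;
}
```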
On Sat, 2021-05-01 at 17:03 -0500, Scott Wood wrote:
> On Thu, 2021-04-29 at 09:12 +0200, Vincent Guittot wrote:
> > Hi Scott,
> >
> > On Thu, 29 Apr 2021 at 01:28, Scott Wood <swood@redhat.com> wrote:
> > > These patches mitigate latency caused by newidle_balance() on large
> > > systems when PREEMPT_RT is enabled, by enabling interrupts when the lock
> > > is dropped, and exiting early at various points if an RT task is
> > > runnable on the current CPU.
> > >
> > > On a system with 128 CPUs, these patches dropped latency (as measured by
> > > a 12-hour rteval run) from 1045us to 317us (when applied to
> > > 5.12.0-rc3-rt3).
> >
> > The patch below has been queued for v5.13 and removes the update of
> > blocked load, which seemed to be the major reason for long preempt/irq-off
> > sections during newly idle balance:
> > https://lore.kernel.org/lkml/20210224133007.28644-1-vincent.guittot@linaro.org/
> >
> > I would be curious to see how it impacts your cases.
>
> I still get 1000+ us latencies with those patches applied.

If NEWIDLE balancing migrates one task, how does that manage to consume
a full *millisecond*, and why would that only be a problem for RT?

-Mike

(rt tasks don't play in the !rt balancer here; if a CPU goes idle, tough titty)
On Sun, 2021-05-02 at 05:25 +0200, Mike Galbraith wrote:
> On Sat, 2021-05-01 at 17:03 -0500, Scott Wood wrote:
> > On Thu, 2021-04-29 at 09:12 +0200, Vincent Guittot wrote:
> > > Hi Scott,
> > >
> > > On Thu, 29 Apr 2021 at 01:28, Scott Wood <swood@redhat.com> wrote:
> > > > These patches mitigate latency caused by newidle_balance() on large
> > > > systems when PREEMPT_RT is enabled, by enabling interrupts when the
> > > > lock is dropped, and exiting early at various points if an RT task
> > > > is runnable on the current CPU.
> > > >
> > > > On a system with 128 CPUs, these patches dropped latency (as
> > > > measured by a 12-hour rteval run) from 1045us to 317us (when
> > > > applied to 5.12.0-rc3-rt3).
> > >
> > > The patch below has been queued for v5.13 and removes the update of
> > > blocked load, which seemed to be the major reason for long
> > > preempt/irq-off sections during newly idle balance:
> > > https://lore.kernel.org/lkml/20210224133007.28644-1-vincent.guittot@linaro.org/
> > >
> > > I would be curious to see how it impacts your cases.
> >
> > I still get 1000+ us latencies with those patches applied.
>
> If NEWIDLE balancing migrates one task, how does that manage to consume
> a full *millisecond*, and why would that only be a problem for RT?
>
> -Mike
>
> (rt tasks don't play in the !rt balancer here; if a CPU goes idle, tough titty)

Determining which task to pull is apparently taking that long (again, this
is on a 128-CPU system).  RT is singled out because that is the config that
makes significant tradeoffs to keep latencies down (I expect this would be
far from the only possible 1ms+ latency on a non-RT kernel), and there was
concern about the overhead of a double context switch when pulling a task
to a newidle CPU.

-Scott
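To make the cost concrete: most of the time goes into scanning candidate
tasks on the busiest runqueue.  A sketch of the kind of early bail-out being
discussed (illustrative only; helper names like sched_rt_runnable(),
can_migrate_task(), and detach_task() follow kernel conventions, but this is
not the posted series) could abort that scan as soon as an RT task is waiting
on the destination CPU.

```c
/*
 * Illustrative sketch, not the posted series: abort the detach loop on the
 * busiest runqueue as soon as an RT task becomes runnable on the CPU we are
 * pulling for, instead of finishing a potentially long scan.
 */
static int detach_tasks_sketch(struct lb_env *env)
{
	struct list_head *tasks = &env->src_rq->cfs_tasks;
	struct task_struct *p;
	int detached = 0;

	lockdep_assert_held(&env->src_rq->lock);

	while (!list_empty(tasks)) {
		/* Early exit: an RT task is waiting on the destination CPU. */
		if (sched_rt_runnable(env->dst_rq))
			break;

		p = list_last_entry(tasks, struct task_struct, se.group_node);

		if (!can_migrate_task(p, env)) {
			list_move(&p->se.group_node, tasks);
			continue;
		}

		detach_task(p, env);
		detached++;

		/* A newly idle CPU only needs one task to stop being idle. */
		if (env->idle == CPU_NEWLY_IDLE)
			break;
	}

	return detached;
}
```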
On Mon, 2021-05-03 at 11:33 -0500, Scott Wood wrote:
> On Sun, 2021-05-02 at 05:25 +0200, Mike Galbraith wrote:
> > On Sat, 2021-05-01 at 17:03 -0500, Scott Wood wrote:
> > > On Thu, 2021-04-29 at 09:12 +0200, Vincent Guittot wrote:
> > > > Hi Scott,
> > > >
> > > > On Thu, 29 Apr 2021 at 01:28, Scott Wood <swood@redhat.com> wrote:
> > > > > These patches mitigate latency caused by newidle_balance() on large
> > > > > systems when PREEMPT_RT is enabled, by enabling interrupts when the
> > > > > lock is dropped, and exiting early at various points if an RT task
> > > > > is runnable on the current CPU.
> > > > >
> > > > > On a system with 128 CPUs, these patches dropped latency (as
> > > > > measured by a 12-hour rteval run) from 1045us to 317us (when
> > > > > applied to 5.12.0-rc3-rt3).
> > > >
> > > > The patch below has been queued for v5.13 and removes the update of
> > > > blocked load, which seemed to be the major reason for long
> > > > preempt/irq-off sections during newly idle balance:
> > > > https://lore.kernel.org/lkml/20210224133007.28644-1-vincent.guittot@linaro.org/
> > > >
> > > > I would be curious to see how it impacts your cases.
> > >
> > > I still get 1000+ us latencies with those patches applied.
> >
> > If NEWIDLE balancing migrates one task, how does that manage to consume
> > a full *millisecond*, and why would that only be a problem for RT?
> >
> > -Mike
> >
> > (rt tasks don't play in the !rt balancer here; if a CPU goes idle, tough titty)
>
> Determining which task to pull is apparently taking that long (again, this
> is on a 128-CPU system).  RT is singled out because that is the config that
> makes significant tradeoffs to keep latencies down (I expect this would be
> far from the only possible 1ms+ latency on a non-RT kernel), and there was
> concern about the overhead of a double context switch when pulling a task
> to a newidle CPU.

What I think has been going on is that you're running a synchronized RT
load, many CPUs go idle as a thundering herd, and meet at focal point
busiest.  What I was alluding to was that preventing such large-scale
pile-ups would be way better than poking holes in it for RT to try to
sneak through.  If pile-up it is, while not particularly likely, the
same should happen with normal tasks, wasting cycles generating heat.

The main issue I see with these patches is that the resulting number is
still so gawd-awful as to mean "nope, not rt ready", making the whole
exercise look a bit like a noop.

-Mike
On Mon, 2021-05-03 at 20:52 +0200, Mike Galbraith wrote:
> On Mon, 2021-05-03 at 11:33 -0500, Scott Wood wrote:
> > On Sun, 2021-05-02 at 05:25 +0200, Mike Galbraith wrote:
> > > If NEWIDLE balancing migrates one task, how does that manage to
> > > consume a full *millisecond*, and why would that only be a problem
> > > for RT?
> > >
> > > -Mike
> > >
> > > (rt tasks don't play in the !rt balancer here; if a CPU goes idle, tough titty)
> >
> > Determining which task to pull is apparently taking that long (again,
> > this is on a 128-CPU system).  RT is singled out because that is the
> > config that makes significant tradeoffs to keep latencies down (I
> > expect this would be far from the only possible 1ms+ latency on a
> > non-RT kernel), and there was concern about the overhead of a double
> > context switch when pulling a task to a newidle CPU.
>
> What I think has been going on is that you're running a synchronized RT
> load, many CPUs go idle as a thundering herd, and meet at focal point
> busiest.  What I was alluding to was that preventing such large-scale
> pile-ups would be way better than poking holes in it for RT to try to
> sneak through.  If pile-up it is, while not particularly likely, the
> same should happen with normal tasks, wasting cycles generating heat.
>
> The main issue I see with these patches is that the resulting number is
> still so gawd-awful as to mean "nope, not rt ready", making the whole
> exercise look a bit like a noop.

It doesn't look like rteval asks cyclictest to synchronize, but
regardless, how is this "poking holes"?  Making sure interrupts are
enabled during potentially long-running activities is pretty fundamental
to PREEMPT_RT.  What specifically is your suggestion?

And yes, 317 us is still not a very good number for PREEMPT_RT, but
progress is progress.  It's hard to address the moderate latency spikes
if they're obscured by large latency spikes.  One also needs to have
realistic expectations when it comes to RT on large systems, particularly
when not isolating the latency-sensitive CPUs.

-Scott
On Mon, 2021-05-03 at 16:57 -0500, Scott Wood wrote:
> On Mon, 2021-05-03 at 20:52 +0200, Mike Galbraith wrote:
> > On Mon, 2021-05-03 at 11:33 -0500, Scott Wood wrote:
> > > On Sun, 2021-05-02 at 05:25 +0200, Mike Galbraith wrote:
> > > > If NEWIDLE balancing migrates one task, how does that manage to
> > > > consume a full *millisecond*, and why would that only be a problem
> > > > for RT?
> > > >
> > > > -Mike
> > > >
> > > > (rt tasks don't play in the !rt balancer here; if a CPU goes idle, tough titty)
> > >
> > > Determining which task to pull is apparently taking that long (again,
> > > this is on a 128-CPU system).  RT is singled out because that is the
> > > config that makes significant tradeoffs to keep latencies down (I
> > > expect this would be far from the only possible 1ms+ latency on a
> > > non-RT kernel), and there was concern about the overhead of a double
> > > context switch when pulling a task to a newidle CPU.
> >
> > What I think has been going on is that you're running a synchronized RT
> > load, many CPUs go idle as a thundering herd, and meet at focal point
> > busiest.  What I was alluding to was that preventing such large-scale
> > pile-ups would be way better than poking holes in it for RT to try to
> > sneak through.  If pile-up it is, while not particularly likely, the
> > same should happen with normal tasks, wasting cycles generating heat.
> >
> > The main issue I see with these patches is that the resulting number is
> > still so gawd-awful as to mean "nope, not rt ready", making the whole
> > exercise look a bit like a noop.
>
> It doesn't look like rteval asks cyclictest to synchronize, but
> regardless, how is this "poking holes"?

Pulling a single task is taking _a full millisecond_, which I see as a
mountain of cycles, directly through which you open a path for wakeups.
That "poking holes" isn't meant to be some kind of crude derogatory
remark; it's just the way I see what was done.  The mountain still
stands, you didn't remove it.

> Making sure interrupts are
> enabled during potentially long-running activities is pretty fundamental
> to PREEMPT_RT.  What specifically is your suggestion?

Try to include the fair class in any load-balancing improvement if at all
possible, because that's where most of the real-world benefit is to be had.

> And yes, 317 us is still not a very good number for PREEMPT_RT, but
> progress is progress.  It's hard to address the moderate latency spikes
> if they're obscured by large latency spikes.  One also needs to have
> realistic expectations when it comes to RT on large systems, particularly
> when not isolating the latency-sensitive CPUs.

Agreed.  But.  Specifically because the result remains intolerable to
anything remotely sensitive, users running such loads on their big boxen
are not going to be doing mixed load; that flat does not work, which is
why I said the patch set looks a bit like a noop: it excludes the
audience that stands to gain.. nearly anything.  Big box HPC (acronym
includes RT) gains absolutely nothing, as does big box general case with
its not particularly prevalent, but definitely existent, RT tasks.  Big
box users who are NOT running anything they care deeply about do receive
some love.. but don't care deeply, and certainly don't care any more
deeply than general-case users do about these collision-induced latency
spikes.

-Mike