[0/2] Exposing nice CPU usage to userspace


Message ID

20240823201317.156379-1-joshua.hahn6@gmail.com

Headers

From: Joshua Hahn <joshua.hahn6@gmail.com>
To: tj@kernel.org
Cc: lizefan.x@bytedance.com,
	hannes@cmpxchg.org,
	mkoutny@suse.com,
	shuah@kernel.org,
	cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org
Subject: [PATCH 0/2] Exposing nice CPU usage to userspace
Date: Fri, 23 Aug 2024 13:05:16 -0700
Message-ID: <20240823201317.156379-1-joshua.hahn6@gmail.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

Exposing nice CPU usage to userspace

Message

Joshua Hahn <joshua.hahn6@gmail.com> Aug. 23, 2024, 8:05 p.m. UTC

From: Joshua Hahn <joshua.hahn6@gmail.com>

Niced CPU usage is a metric reported in host-level /proc/stat, but is
not reported in cgroup-level statistics in cpu.stat. However, when a
host contains multiple tasks across different workloads, it becomes
difficult to gauge how much of the task is being spent on niced
processes based on /proc/stat alone, since host-level metrics do not
provide this cgroup-level granularity.

Exposing this metric will allow load balancers to correctly probe the
niced CPU metric for each workload, and make more informed decisions
when directing higher priority tasks.

Joshua Hahn (2):
  Tracking cgroup-level niced CPU time
  Selftests for niced CPU statistics

 include/linux/cgroup-defs.h               |  1 +
 kernel/cgroup/rstat.c                     | 16 ++++-
 tools/testing/selftests/cgroup/test_cpu.c | 72 +++++++++++++++++++++++
 3 files changed, 86 insertions(+), 3 deletions(-)

Comments

Michal Koutný Aug. 26, 2024, 11:59 a.m. UTC | #1

Hello.

On Fri, Aug 23, 2024 at 01:05:16PM GMT, Joshua Hahn <joshua.hahn6@gmail.com> wrote:
> Niced CPU usage is a metric reported in host-level /proc/stat, but is
> not reported in cgroup-level statistics in cpu.stat. However, when a
> host contains multiple tasks across different workloads, it becomes
> difficult to gauge how much of the task is being spent on niced
> processes based on /proc/stat alone, since host-level metrics do not
> provide this cgroup-level granularity.

The difference between the two metrics is in cputime.c:
        index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
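
To illustrate the line above, here is a minimal userspace sketch of how
that ternary splits user-mode time into two buckets. The struct and
function names (sketch_task, sketch_cpustat, sketch_account_user) are
hypothetical stand-ins invented for this example, not the kernel's own
types; only the nice > 0 test is taken from cputime.c.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel's accounting types. */
enum sketch_index { CPUTIME_USER, CPUTIME_NICE, CPUTIME_NR };

struct sketch_task {
	int nice;			/* nice value, -20..19 */
};

struct sketch_cpustat {
	uint64_t t[CPUTIME_NR];		/* accumulated time per bucket */
};

/*
 * Charge user-mode time to the NICE bucket when the task has a positive
 * nice value, and to the USER bucket otherwise (nice <= 0).
 */
static void sketch_account_user(struct sketch_cpustat *s,
				const struct sketch_task *p, uint64_t delta)
{
	enum sketch_index index = (p->nice > 0) ? CPUTIME_NICE : CPUTIME_USER;

	s->t[index] += delta;
}
```

Note that a task with a negative nice value still lands in the USER
bucket; only positive nice values count as niced time.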

> Exposing this metric will allow load balancers to correctly probe the
> niced CPU metric for each workload, and make more informed decisions
> when directing higher priority tasks.

How would this work? (E.g. if too little nice time -> reduce priority
of high prio tasks?)

Thanks,
Michal

Joshua Hahn Aug. 26, 2024, 4:13 p.m. UTC | #2

Hello, thank you for reviewing the patch.

On Mon, Aug 26, 2024 at 10:43 AM Michal Koutný <mkoutny@suse.com> wrote:
> The difference between the two metrics is in cputime.c:
>         index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
>
> > Exposing this metric will allow load balancers to correctly probe the
> > niced CPU metric for each workload, and make more informed decisions
> > when directing higher priority tasks.
>
> How would this work? (E.g. if too little nice time -> reduce priority
> of high prio tasks?)

We can find what fraction of a task's CPU time is spent running niced
processes by dividing the two metrics (nice / user). When a high prio
task arrives at the load balancer, which must decide where to delegate
it, the balancer can use each workload's niced fraction as one factor
in making that decision.
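
The division described above could be done in userspace roughly as
follows. This is a sketch under stated assumptions: it assumes the new
field would appear in cpu.stat as a "nice_usec" line next to the
existing "user_usec" line (the field name is not given in this thread),
and that user_usec already includes niced time so the ratio stays in
[0, 1]. The helpers stat_lookup and nice_fraction are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Find a "key value" line in cpu.stat-style text.
 * Returns 0 and stores the value on success, -1 if the key is missing
 * or its value cannot be parsed.
 */
static int stat_lookup(const char *text, const char *key, uint64_t *out)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ')
			return sscanf(line + klen, "%" SCNu64, out) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;		/* step past the newline */
	}
	return -1;
}

/*
 * Fraction of user time that was niced (nice / user).
 * Returns -1.0 when either field is absent, e.g. on a kernel without
 * the assumed nice_usec field.
 */
static double nice_fraction(const char *text)
{
	uint64_t user, nice;

	if (stat_lookup(text, "user_usec", &user) ||
	    stat_lookup(text, "nice_usec", &nice))
		return -1.0;
	if (user == 0)
		return 0.0;
	return (double)nice / (double)user;
}
```

A caller would read the contents of a cgroup's cpu.stat file into a
buffer and pass it to nice_fraction(); a negative result means the
metric is not available.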

The reverse is also true; host-level information in /proc/stat may
indicate that a high percentage of CPU time is being used by nice
processes, giving the illusion that all tasks on the host are running
nice processes, when in reality just one task is using a lot of nice
CPU time and the other tasks are running non-nice processes. By
including cgroup-level nice statistics, we can get a clearer picture
and avoid overloading a host with too many high prio tasks.
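
A small sketch of that effect, using made-up numbers and hypothetical
names (cg_sample, host_nice_fraction, count_niced_cgroups): the
host-wide ratio can look heavily niced even though only one cgroup
contributes any niced time. The per-cgroup nice_usec values are assumed
inputs, as above.

```c
#include <stddef.h>
#include <stdint.h>

/* One per-cgroup sample; field names mirror the assumed cpu.stat keys. */
struct cg_sample {
	const char *name;
	uint64_t user_usec;
	uint64_t nice_usec;
};

/* nice / user, treating an idle cgroup as 0 rather than dividing by 0. */
static double frac(uint64_t nice, uint64_t user)
{
	return user ? (double)nice / (double)user : 0.0;
}

/* Host-wide view: sum every cgroup first, then take the ratio. */
static double host_nice_fraction(const struct cg_sample *cg, size_t n)
{
	uint64_t user = 0, nice = 0;

	for (size_t i = 0; i < n; i++) {
		user += cg[i].user_usec;
		nice += cg[i].nice_usec;
	}
	return frac(nice, user);
}

/* Per-cgroup view: how many cgroups exceed the niced-fraction threshold. */
static size_t count_niced_cgroups(const struct cg_sample *cg, size_t n,
				  double threshold)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (frac(cg[i].nice_usec, cg[i].user_usec) > threshold)
			count++;
	return count;
}
```

With three cgroups where one batch job accounts for 6 of the host's 8
seconds of user time, all of it niced, the host-wide fraction is 0.75
while only one of the three cgroups is actually niced.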

As you suggested, this information can also help in re-prioritizing
the processes, which may help high prio tasks get executed sooner.

Thanks,
Joshua

Tejun Heo Aug. 26, 2024, 6:19 p.m. UTC | #3

Hello,

On Fri, Aug 23, 2024 at 01:05:17PM -0700, Joshua Hahn <joshua.hahn6@gmail.com> wrote:
> From: Joshua Hahn <joshua.hahn6@gmail.com>
> 
> Cgroup-level CPU statistics currently include time spent on
> user/system processes, but do not include niced CPU time (despite
> already being tracked). This patch exposes niced CPU time to the
> userspace, allowing users to get a better understanding of their
> hardware limits and can facilitate better load-balancing.

You aren't talking about the in-kernel scheduler's load balancer, right? If
so, can you please update the description? This is a bit too confusing for a
commit message for a kernel commit.

Thanks.

Joshua Hahn Aug. 29, 2024, 7:26 p.m. UTC | #4

Hello, thank you for reviewing the patch.

> > Cgroup-level CPU statistics currently include time spent on
> > user/system processes, but do not include niced CPU time (despite
> > already being tracked). This patch exposes niced CPU time to the
> > userspace, allowing users to get a better understanding of their
> > hardware limits and can facilitate better load-balancing.
>
> You aren't talking about the in-kernel scheduler's load balancer, right? If
> so, can you please update the description? This is a bit too confusing for a
> commit message for a kernel commit.

Thank you for pointing this out -- I'll edit the commit message to the
following in a v2:

Cgroup-level CPU statistics currently include time spent on
user/system processes, but do not include niced CPU time (despite
already being tracked). This patch exposes niced CPU time to
userspace, allowing users to get a better understanding of their
hardware limits and facilitating more informed workload distribution.

Thanks,
Joshua