Message ID | 20210217120004.7984-3-arbn@yandex-team.com
---|---
State | New
Series | [1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat
Andrey Ryabinin <arbn@yandex-team.com> writes:

> cpuacct has 2 different ways of accounting and showing user
> and system times.
>
> The first one uses cpuacct_account_field() to account times
> and cpuacct.stat file to expose them. And this one seems to work ok.
>
> The second one is uses cpuacct_charge() function for accounting and
> set of cpuacct.usage* files to show times. Despite some attempts to
> fix it in the past it still doesn't work. E.g. while running KVM
> guest the cpuacct_charge() accounts most of the guest time as
> system time. This doesn't match with user&system times shown in
> cpuacct.stat or proc/<pid>/stat.

I couldn't reproduce this running a cpu bound load in a kvm guest on a
nohz_full cpu on 5.11. The time is almost entirely in cpuacct.usage and
_user, while _sys stays low.

Could you say more about how you're seeing this? Don't really doubt
there's a problem, just wondering what you're doing.

> diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
> index 941c28cf9738..7eff79faab0d 100644
> --- a/kernel/sched/cpuacct.c
> +++ b/kernel/sched/cpuacct.c
> @@ -29,7 +29,7 @@ struct cpuacct_usage {
>  struct cpuacct {
>  	struct cgroup_subsys_state css;
>  	/* cpuusage holds pointer to a u64-type object on every CPU */
> -	struct cpuacct_usage __percpu *cpuusage;

Definition of struct cpuacct_usage can go away now.

> @@ -99,7 +99,8 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
>  static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
>  				 enum cpuacct_stat_index index)
>  {
> -	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> +	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> +	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>  	u64 data;

There's a BUG_ON below this that could probably be WARN_ON_ONCE while
you're here.

> @@ -278,8 +274,8 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
>  	for_each_possible_cpu(cpu) {
>  		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>
> -		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
> -		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
> +		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
> +		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];

unnecessary whitespace change?
On 3/18/21 1:22 AM, Daniel Jordan wrote:
> Andrey Ryabinin <arbn@yandex-team.com> writes:
>
>> cpuacct has 2 different ways of accounting and showing user
>> and system times.
>>
>> The first one uses cpuacct_account_field() to account times
>> and cpuacct.stat file to expose them. And this one seems to work ok.
>>
>> The second one is uses cpuacct_charge() function for accounting and
>> set of cpuacct.usage* files to show times. Despite some attempts to
>> fix it in the past it still doesn't work. E.g. while running KVM
>> guest the cpuacct_charge() accounts most of the guest time as
>> system time. This doesn't match with user&system times shown in
>> cpuacct.stat or proc/<pid>/stat.
>
> I couldn't reproduce this running a cpu bound load in a kvm guest on a
> nohz_full cpu on 5.11. The time is almost entirely in cpuacct.usage and
> _user, while _sys stays low.
>
> Could you say more about how you're seeing this? Don't really doubt
> there's a problem, just wondering what you're doing.
>

Yeah, it's almost unnoticeable if you run some load in a guest like qemu.
But a simpler case, a busy loop in KVM_RUN, triggers this:

 # git clone https://github.com/aryabinin/kvmsample
 # make
 # mkdir /sys/fs/cgroup/cpuacct/test
 # echo $$ > /sys/fs/cgroup/cpuacct/test/tasks
 # ./kvmsample &
 # for i in {1..5}; do cat /sys/fs/cgroup/cpuacct/test/cpuacct.usage_sys; sleep 1; done
 1976535645
 2979839428
 3979832704
 4983603153
 5983604157

>> diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
>> index 941c28cf9738..7eff79faab0d 100644
>> --- a/kernel/sched/cpuacct.c
>> +++ b/kernel/sched/cpuacct.c
>> @@ -29,7 +29,7 @@ struct cpuacct_usage {
>>  struct cpuacct {
>>  	struct cgroup_subsys_state css;
>>  	/* cpuusage holds pointer to a u64-type object on every CPU */
>> -	struct cpuacct_usage __percpu *cpuusage;
>
> Definition of struct cpuacct_usage can go away now.
>

Done.

>> @@ -99,7 +99,8 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
>>  static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
>>  				 enum cpuacct_stat_index index)
>>  {
>> -	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
>> +	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
>> +	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>>  	u64 data;
>
> There's a BUG_ON below this that could probably be WARN_ON_ONCE while
> you're here
>

Sure.

>> @@ -278,8 +274,8 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
>>  	for_each_possible_cpu(cpu) {
>>  		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>>
>> -		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
>> -		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
>> +		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
>> +		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];
>
> unnecessary whitespace change?
>

Yup.
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 941c28cf9738..7eff79faab0d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -29,7 +29,7 @@ struct cpuacct_usage {
 struct cpuacct {
 	struct cgroup_subsys_state css;
 	/* cpuusage holds pointer to a u64-type object on every CPU */
-	struct cpuacct_usage __percpu *cpuusage;
+	u64 __percpu *cpuusage;
 	struct kernel_cpustat __percpu *cpustat;
 };
 
@@ -49,7 +49,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
 	return css_ca(ca->css.parent);
 }
 
-static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
 static struct cpuacct root_cpuacct = {
 	.cpustat	= &kernel_cpustat,
 	.cpuusage	= &root_cpuacct_cpuusage,
@@ -68,7 +68,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (!ca)
 		goto out;
 
-	ca->cpuusage = alloc_percpu(struct cpuacct_usage);
+	ca->cpuusage = alloc_percpu(u64);
 	if (!ca->cpuusage)
 		goto out_free_ca;
 
@@ -99,7 +99,8 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
 static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 				 enum cpuacct_stat_index index)
 {
-	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 	u64 data;
 
 	/*
@@ -115,14 +116,17 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
 #endif
 
-	if (index == CPUACCT_STAT_NSTATS) {
-		int i = 0;
-
-		data = 0;
-		for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-			data += cpuusage->usages[i];
-	} else {
-		data = cpuusage->usages[index];
+	switch (index) {
+	case CPUACCT_STAT_USER:
+		data = cpustat[CPUTIME_USER] + cpustat[CPUTIME_NICE];
+		break;
+	case CPUACCT_STAT_SYSTEM:
+		data = cpustat[CPUTIME_SYSTEM] + cpustat[CPUTIME_IRQ] +
+			cpustat[CPUTIME_SOFTIRQ];
+		break;
+	case CPUACCT_STAT_NSTATS:
+		data = *cpuusage;
+		break;
 	}
 
 #ifndef CONFIG_64BIT
@@ -132,10 +136,14 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	return data;
 }
 
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu)
 {
-	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-	int i;
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
+
+	/* Don't allow to reset global kernel_cpustat */
+	if (ca == &root_cpuacct)
+		return;
 
 #ifndef CONFIG_64BIT
 	/*
@@ -143,9 +151,10 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	 */
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
 #endif
-
-	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-		cpuusage->usages[i] = val;
+	*cpuusage = 0;
+	cpustat[CPUTIME_USER] = cpustat[CPUTIME_NICE] = 0;
+	cpustat[CPUTIME_SYSTEM] = cpustat[CPUTIME_IRQ] = 0;
+	cpustat[CPUTIME_SOFTIRQ] = 0;
 
 #ifndef CONFIG_64BIT
 	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
@@ -196,7 +205,7 @@ static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
 		return -EINVAL;
 
 	for_each_possible_cpu(cpu)
-		cpuacct_cpuusage_write(ca, cpu, 0);
+		cpuacct_cpuusage_write(ca, cpu);
 
 	return 0;
 }
@@ -243,25 +252,12 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 	seq_puts(m, "\n");
 
 	for_each_possible_cpu(cpu) {
-		struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-
 		seq_printf(m, "%d", cpu);
-
-		for (index = 0; index < CPUACCT_STAT_NSTATS; index++) {
-#ifndef CONFIG_64BIT
-			/*
-			 * Take rq->lock to make 64-bit read safe on 32-bit
-			 * platforms.
-			 */
-			raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-#endif
-
-			seq_printf(m, " %llu", cpuusage->usages[index]);
+		for (index = 0; index < CPUACCT_STAT_NSTATS; index++)
+			seq_printf(m, " %llu",
+				   cpuacct_cpuusage_read(ca, cpu, index));
 
-#ifndef CONFIG_64BIT
-			raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#endif
-		}
 		seq_puts(m, "\n");
 	}
 	return 0;
@@ -278,8 +274,8 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
 	for_each_possible_cpu(cpu) {
 		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
+		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
+		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];
 		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
 		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
 		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
@@ -339,16 +335,11 @@ static struct cftype files[] = {
 void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 {
 	struct cpuacct *ca;
-	int index = CPUACCT_STAT_SYSTEM;
-	struct pt_regs *regs = get_irq_regs() ? : task_pt_regs(tsk);
-
-	if (regs && user_mode(regs))
-		index = CPUACCT_STAT_USER;
 
 	rcu_read_lock();
 	for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
-		__this_cpu_add(ca->cpuusage->usages[index], cputime);
+		__this_cpu_add(*ca->cpuusage, cputime);
 
 	rcu_read_unlock();
 }
cpuacct has 2 different ways of accounting and showing user
and system times.

The first one uses cpuacct_account_field() to account times
and the cpuacct.stat file to expose them. And this one seems to work ok.

The second one uses the cpuacct_charge() function for accounting and the
set of cpuacct.usage* files to show times. Despite some attempts to
fix it in the past it still doesn't work. E.g. while running a KVM
guest, cpuacct_charge() accounts most of the guest time as
system time. This doesn't match the user & system times shown in
cpuacct.stat or /proc/<pid>/stat.

Use the cpustats accounted in cpuacct_account_field() as the source of
user/sys times for the cpuacct.usage* files. Make cpuacct_charge()
account only the total execution time.

Fixes: d740037fac70 ("sched/cpuacct: Split usage accounting into user_usage and sys_usage")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Cc: <stable@vger.kernel.org>
---
 kernel/sched/cpuacct.c | 77 +++++++++++++++++++-----------------------
 1 file changed, 34 insertions(+), 43 deletions(-)