Message ID | 1974978.nRy8TqEeLZ@kreacher |
---|---|
State | New |
Headers | show |
Series | [RFT,v1] cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known | expand |
On Mon, 2021-02-15 at 20:24 +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover > boost frequencies") attempted to address a performance issue involving > acpi-cpufreq, the schedutil governor and scale-invariance on x86 by > extending the frequency tables created by acpi-cpufreq to cover the > entire range of "turbo" (or "boost") frequencies, but that caused > frequencies reported via /proc/cpuinfo and the scaling_cur_freq > attribute in sysfs to change which may confuse users and monitoring > tools. > > For this reason, revert the part of commit 3c55e94c0ade adding the > extra entry to the frequency table and use the observation that > in principle cpuinfo.max_freq need not be equal to the maximum > frequency listed in the frequency table for the given policy. > > Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq > drivers to set their own cpuinfo.max_freq above that frequency and > change acpi-cpufreq to set cpuinfo.max_freq to the maximum boost > frequency found via CPPC. > > This should be sufficient to let all of the cpufreq subsystem know > the real maximum frequency of the CPU without changing frequency > reporting. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305 > Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies") > Reported-by: Matt McDonald <gardotd426@gmail.com> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > Michael, Giovanni, > > The fix for the EPYC performance regression that was merged into 5.11 introduced > an undesirable side-effect by distorting the CPU frequency reporting via > /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details). > > The patch below is reported to address this problem and it should still allow > schedutil to achieve desirable performance, because it simply sets > cpuinfo.max_freq without extending the frequency table of the CPU. > > Please test this one and let me know if it adversely affects performance. > > Thanks! > > --- > drivers/cpufreq/acpi-cpufreq.c | 62 ++++++++++------------------------------- > drivers/cpufreq/freq_table.c | 8 ++++- > 2 files changed, 23 insertions(+), 47 deletions(-) Hello Rafael, I've run the quick image processing test below and the performance is in line with v5.11. I'll send some more results as longer tests complete. TEST : Intel Open Image Denoise, www.openimagedenoise.org INVOCATION : ./denoise -hdr memorial.pfm -out out.pfm -bench 200 -threads $NTHREADS CPU : MODEL : 2x AMD EPYC 7742 FREQUENCY TABLE : P2: 1.50 GHz P1: 2.00 GHz P0: 2.25 GHz MAX BOOST : 3.40 GHz Results: threads, msecs (ratio). Lower is better. v5.11 v5.11-patch ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1 1071.43 (1.00) 1068.57 (1.00) 2 541.50 (1.00) 542.26 (1.00) 4 276.38 (1.00) 276.96 (1.00) 8 149.51 (1.00) 149.24 (1.00) 16 78.57 (1.00) 78.57 (1.00) 24 57.59 (1.00) 57.67 (1.00) 32 46.40 (1.00) 46.30 (1.00) 48 37.48 (1.00) 38.28 (1.02) 64 33.18 (1.00) 33.69 (1.02) 80 30.73 (1.00) 31.24 (1.02) 96 28.06 (1.00) 28.79 (1.03) 112 27.82 (1.00) 28.14 (1.01) 120 28.33 (1.00) 29.16 (1.03) 128 28.44 (1.00) 28.35 (1.00) Giovanni
On Mon, 2021-02-15 at 20:24 +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Commit 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover > boost frequencies") attempted to address a performance issue involving > acpi-cpufreq, the schedutil governor and scale-invariance on x86 by > extending the frequency tables created by acpi-cpufreq to cover the > entire range of "turbo" (or "boost") frequencies, but that caused > frequencies reported via /proc/cpuinfo and the scaling_cur_freq > attribute in sysfs to change which may confuse users and monitoring > tools. > > For this reason, revert the part of commit 3c55e94c0ade adding the > extra entry to the frequency table and use the observation that > in principle cpuinfo.max_freq need not be equal to the maximum > frequency listed in the frequency table for the given policy. > > Namely, modify cpufreq_frequency_table_cpuinfo() to allow cpufreq > drivers to set their own cpuinfo.max_freq above that frequency and > change acpi-cpufreq to set cpuinfo.max_freq to the maximum boost > frequency found via CPPC. > > This should be sufficient to let all of the cpufreq subsystem know > the real maximum frequency of the CPU without changing frequency > reporting. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=211305 > Fixes: 3c55e94c0ade ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies") > Reported-by: Matt McDonald <gardotd426@gmail.com> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > Michael, Giovanni, > > The fix for the EPYC performance regression that was merged into 5.11 introduced > an undesirable side-effect by distorting the CPU frequency reporting via > /proc/cpuinfo and scaling_cur_freq (see the BZ link above for details). > > The patch below is reported to address this problem and it should still allow > schedutil to achieve desirable performance, because it simply sets > cpuinfo.max_freq without extending the frequency table of the CPU. > > Please test this one and let me know if it adversely affects performance. > > Thanks! Hello Rafael, more extended testing confirms the initial feeling; performance with this patch is mostly identical to vanilla v5.11. Tbench shows an improvement. Thanks for the fix! Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz> Results follow. The machine has two sockets with an AMD EPYC 7742 each. The governor is always schedutil. Ratios of time, lower is better: v5.11 v5.11 vanilla patch - - - - - - - - - - - - - - - - - - - - - - - - - - - - NASA Parallel Benchmarks w/ MPI 1.00 0.96 NASA Parallel Benchmarks w/ OpenMP 1.00 ~ dbench on XFS 1.00 ~ Linux kernel compilation 1.00 ~ git unit test suite 1.00 ~ Ratio of throughput, higher is better: v5.11 v5.11 vanilla patch - - - - - - - - - - - - - - - - - - - - - - - - - - - - tbench on localhost 1.00 1.09 Tilde (~): no change wrt baseline. Giovanni
Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c =================================================================== --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c @@ -54,7 +54,6 @@ struct acpi_cpufreq_data { unsigned int resume; unsigned int cpu_feature; unsigned int acpi_perf_cpu; - unsigned int first_perf_state; cpumask_var_t freqdomain_cpus; void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val); u32 (*cpu_freq_read)(struct acpi_pct_register *reg); @@ -223,10 +222,10 @@ static unsigned extract_msr(struct cpufr perf = to_perf_data(data); - cpufreq_for_each_entry(pos, policy->freq_table + data->first_perf_state) + cpufreq_for_each_entry(pos, policy->freq_table) if (msr == perf->states[pos->driver_data].status) return pos->frequency; - return policy->freq_table[data->first_perf_state].frequency; + return policy->freq_table[0].frequency; } static unsigned extract_freq(struct cpufreq_policy *policy, u32 val) @@ -365,7 +364,6 @@ static unsigned int get_cur_freq_on_cpu( struct cpufreq_policy *policy; unsigned int freq; unsigned int cached_freq; - unsigned int state; pr_debug("%s (%d)\n", __func__, cpu); @@ -377,11 +375,7 @@ static unsigned int get_cur_freq_on_cpu( if (unlikely(!data || !policy->freq_table)) return 0; - state = to_perf_data(data)->state; - if (state < data->first_perf_state) - state = data->first_perf_state; - - cached_freq = policy->freq_table[state].frequency; + cached_freq = policy->freq_table[to_perf_data(data)->state].frequency; freq = extract_freq(policy, get_cur_val(cpumask_of(cpu), data)); if (freq != cached_freq) { /* @@ -680,7 +674,6 @@ static int acpi_cpufreq_cpu_init(struct struct cpuinfo_x86 *c = &cpu_data(cpu); unsigned int valid_states = 0; unsigned int result = 0; - unsigned int state_count; u64 max_boost_ratio; unsigned int i; #ifdef CONFIG_SMP @@ -795,28 +788,8 @@ static int acpi_cpufreq_cpu_init(struct goto err_unreg; } - state_count = perf->state_count + 1; - - max_boost_ratio = get_max_boost_ratio(cpu); - if (max_boost_ratio) { - /* - * Make a room for one more entry to represent the highest - * available "boost" frequency. - */ - state_count++; - valid_states++; - data->first_perf_state = valid_states; - } else { - /* - * If the maximum "boost" frequency is unknown, ask the arch - * scale-invariance code to use the "nominal" performance for - * CPU utilization scaling so as to prevent the schedutil - * governor from selecting inadequate CPU frequencies. - */ - arch_set_max_freq_ratio(true); - } - - freq_table = kcalloc(state_count, sizeof(*freq_table), GFP_KERNEL); + freq_table = kcalloc(perf->state_count + 1, sizeof(*freq_table), + GFP_KERNEL); if (!freq_table) { result = -ENOMEM; goto err_unreg; @@ -851,27 +824,25 @@ static int acpi_cpufreq_cpu_init(struct } freq_table[valid_states].frequency = CPUFREQ_TABLE_END; + max_boost_ratio = get_max_boost_ratio(cpu); if (max_boost_ratio) { - unsigned int state = data->first_perf_state; - unsigned int freq = freq_table[state].frequency; + unsigned int freq = freq_table[0].frequency; /* * Because the loop above sorts the freq_table entries in the * descending order, freq is the maximum frequency in the table. * Assume that it corresponds to the CPPC nominal frequency and - * use it to populate the frequency field of the extra "boost" - * frequency entry. + * use it to set cpuinfo.max_freq. */ - freq_table[0].frequency = freq * max_boost_ratio >> SCHED_CAPACITY_SHIFT; + policy->cpuinfo.max_freq = freq * max_boost_ratio >> SCHED_CAPACITY_SHIFT; + } else { /* - * The purpose of the extra "boost" frequency entry is to make - * the rest of cpufreq aware of the real maximum frequency, but - * the way to request it is the same as for the first_perf_state - * entry that is expected to cover the entire range of "boost" - * frequencies of the CPU, so copy the driver_data value from - * that entry. + * If the maximum "boost" frequency is unknown, ask the arch + * scale-invariance code to use the "nominal" performance for + * CPU utilization scaling so as to prevent the schedutil + * governor from selecting inadequate CPU frequencies. */ - freq_table[0].driver_data = freq_table[state].driver_data; + arch_set_max_freq_ratio(true); } policy->freq_table = freq_table; @@ -947,8 +918,7 @@ static void acpi_cpufreq_cpu_ready(struc { struct acpi_processor_performance *perf = per_cpu_ptr(acpi_perf_data, policy->cpu); - struct acpi_cpufreq_data *data = policy->driver_data; - unsigned int freq = policy->freq_table[data->first_perf_state].frequency; + unsigned int freq = policy->freq_table[0].frequency; if (perf->states[0].core_frequency * 1000 != freq) pr_warn(FW_WARN "P-state 0 is not max freq\n"); Index: linux-pm/drivers/cpufreq/freq_table.c =================================================================== --- linux-pm.orig/drivers/cpufreq/freq_table.c +++ linux-pm/drivers/cpufreq/freq_table.c @@ -52,7 +52,13 @@ int cpufreq_frequency_table_cpuinfo(stru } policy->min = policy->cpuinfo.min_freq = min_freq; - policy->max = policy->cpuinfo.max_freq = max_freq; + policy->max = max_freq; + /* + * If the driver has set its own cpuinfo.max_freq above max_freq, leave + * it as is. + */ + if (policy->cpuinfo.max_freq < max_freq) + policy->max = policy->cpuinfo.max_freq = max_freq; if (policy->min == ~0) return -EINVAL;