diff mbox

[AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning

Message ID 1452513883-25826-1-git-send-email-james.greenhalgh@arm.com
State New
Headers show

Commit Message

James Greenhalgh Jan. 11, 2016, 12:04 p.m. UTC
Hi,

I've seen a couple of large performance issues caused by expanding
the high-precision reciprocal square root for Cortex-A57, so I'd like
to turn it off by default.

This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for
some private microbenchmark kernels which stress the divide/sqrt/multiply
units. It therefore seems to me to be the correct choice to make across
a number of workloads.

Bootstrapped and tested on aarch64-none-linux-gnu with no issues.

OK?

Thanks,
James

---
2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/aarch64/aarch64.c (cortexa57_tunings): Remove
	AARCH64_EXTRA_TUNE_RECIP_SQRT.

Comments

James Greenhalgh Jan. 25, 2016, 11:20 a.m. UTC | #1
On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> 

> Hi,

> 

> I've seen a couple of large performance issues caused by expanding

> the high-precision reciprocal square root for Cortex-A57, so I'd like

> to turn it off by default.

> 

> This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from

> Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for

> some private microbenchmark kernels which stress the divide/sqrt/multiply

> units. It therefore seems to me to be the correct choice to make across

> a number of workloads.

> 

> Bootstrapped and tested on aarch64-none-linux-gnu with no issues.

> 

> OK?


*Ping*

Thanks,
James

> ---

> 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>

> 

> 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove

> 	AARCH64_EXTRA_TUNE_RECIP_SQRT.

> 


> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c

> index 1d5d898..999c9fc 100644

> --- a/gcc/config/aarch64/aarch64.c

> +++ b/gcc/config/aarch64/aarch64.c

> @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =

>    0,	/* max_case_values.  */

>    0,	/* cache_line_size.  */

>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */

> -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS

> -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */

> +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */

>  };

>  

>  static const struct tune_params cortexa72_tunings =
James Greenhalgh Feb. 1, 2016, 2 p.m. UTC | #2
On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
> On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:

> > 

> > Hi,

> > 

> > I've seen a couple of large performance issues caused by expanding

> > the high-precision reciprocal square root for Cortex-A57, so I'd like

> > to turn it off by default.

> > 

> > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from

> > Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for

> > some private microbenchmark kernels which stress the divide/sqrt/multiply

> > units. It therefore seems to me to be the correct choice to make across

> > a number of workloads.

> > 

> > Bootstrapped and tested on aarch64-none-linux-gnu with no issues.

> > 

> > OK?

> 

> *Ping*


*pingx2*

Thanks,
James

> > ---

> > 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>

> > 

> > 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove

> > 	AARCH64_EXTRA_TUNE_RECIP_SQRT.

> > 

> 

> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c

> > index 1d5d898..999c9fc 100644

> > --- a/gcc/config/aarch64/aarch64.c

> > +++ b/gcc/config/aarch64/aarch64.c

> > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =

> >    0,	/* max_case_values.  */

> >    0,	/* cache_line_size.  */

> >    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */

> > -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS

> > -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */

> > +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */

> >  };

> >  

> >  static const struct tune_params cortexa72_tunings =

>
James Greenhalgh Feb. 8, 2016, 10:57 a.m. UTC | #3
On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote:
> On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:

> > On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:

> > > 

> > > Hi,

> > > 

> > > I've seen a couple of large performance issues caused by expanding

> > > the high-precision reciprocal square root for Cortex-A57, so I'd like

> > > to turn it off by default.

> > > 

> > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from

> > > Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for

> > > some private microbenchmark kernels which stress the divide/sqrt/multiply

> > > units. It therefore seems to me to be the correct choice to make across

> > > a number of workloads.

> > > 

> > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues.

> > > 

> > > OK?

> > 

> > *Ping*

> 

> *pingx2*


*ping^3*

Thanks,
James

> > > ---

> > > 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>

> > > 

> > > 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove

> > > 	AARCH64_EXTRA_TUNE_RECIP_SQRT.

> > > 

> > 

> > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c

> > > index 1d5d898..999c9fc 100644

> > > --- a/gcc/config/aarch64/aarch64.c

> > > +++ b/gcc/config/aarch64/aarch64.c

> > > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =

> > >    0,	/* max_case_values.  */

> > >    0,	/* cache_line_size.  */

> > >    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */

> > > -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS

> > > -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */

> > > +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */

> > >  };

> > >  

> > >  static const struct tune_params cortexa72_tunings =

> > 

>
James Greenhalgh Feb. 15, 2016, 10:50 a.m. UTC | #4
On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote:
> On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote:

> > On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:

> > > On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:

> > > > 

> > > > Hi,

> > > > 

> > > > I've seen a couple of large performance issues caused by expanding

> > > > the high-precision reciprocal square root for Cortex-A57, so I'd like

> > > > to turn it off by default.

> > > > 

> > > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from

> > > > Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for

> > > > some private microbenchmark kernels which stress the divide/sqrt/multiply

> > > > units. It therefore seems to me to be the correct choice to make across

> > > > a number of workloads.

> > > > 

> > > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues.

> > > > 

> > > > OK?

> > > 

> > > *Ping*

> > 

> > *pingx2*

> 

> *ping^3*


*ping^4*

Thanks,
James

> > > > ---

> > > > 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>

> > > > 

> > > > 	* config/aarch64/aarch64.c (cortexa57_tunings): Remove

> > > > 	AARCH64_EXTRA_TUNE_RECIP_SQRT.

> > > > 

> > > 

> > > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c

> > > > index 1d5d898..999c9fc 100644

> > > > --- a/gcc/config/aarch64/aarch64.c

> > > > +++ b/gcc/config/aarch64/aarch64.c

> > > > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =

> > > >    0,	/* max_case_values.  */

> > > >    0,	/* cache_line_size.  */

> > > >    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */

> > > > -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS

> > > > -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */

> > > > +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */

> > > >  };

> > > >  

> > > >  static const struct tune_params cortexa72_tunings =

> > > 

> > 

>
Marcus Shawcroft Feb. 16, 2016, 8:49 a.m. UTC | #5
On 11 January 2016 at 12:04, James Greenhalgh <james.greenhalgh@arm.com> wrote:

> 2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>

>

>         * config/aarch64/aarch64.c (cortexa57_tunings): Remove

>         AARCH64_EXTRA_TUNE_RECIP_SQRT.

>


OK /Marcus
James Greenhalgh Feb. 16, 2016, 10:28 a.m. UTC | #6
On Mon, Feb 15, 2016 at 11:24:53AM -0600, Evandro Menezes wrote:
> On 02/15/16 04:50, James Greenhalgh wrote:

> >On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote:

> >>On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote:

> >>>On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:

> >>>>On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:

> >>>>>Hi,

> >>>>>

> >>>>>I've seen a couple of large performance issues caused by expanding

> >>>>>the high-precision reciprocal square root for Cortex-A57, so I'd like

> >>>>>to turn it off by default.

> >>>>>

> >>>>>This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from

> >>>>>Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) for

> >>>>>some private microbenchmark kernels which stress the divide/sqrt/multiply

> >>>>>units. It therefore seems to me to be the correct choice to make across

> >>>>>a number of workloads.

> >>>>>

> >>>>>Bootstrapped and tested on aarch64-none-linux-gnu with no issues.

> >>>>>

> >>>>>OK?

> >>>>*Ping*

> >>>*pingx2*

> >>*ping^3*

> >*ping^4*

> >

> >Thanks,

> >James

> >

> >>>>>---

> >>>>>2015-12-11  James Greenhalgh  <james.greenhalgh@arm.com>

> >>>>>

> >>>>>	* config/aarch64/aarch64.c (cortexa57_tunings): Remove

> >>>>>	AARCH64_EXTRA_TUNE_RECIP_SQRT.

> >>>>>

> >>>>>diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c

> >>>>>index 1d5d898..999c9fc 100644

> >>>>>--- a/gcc/config/aarch64/aarch64.c

> >>>>>+++ b/gcc/config/aarch64/aarch64.c

> >>>>>@@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =

> >>>>>    0,	/* max_case_values.  */

> >>>>>    0,	/* cache_line_size.  */

> >>>>>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */

> >>>>>-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS

> >>>>>-   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */

> >>>>>+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */

> >>>>>  };

> >>>>>  static const struct tune_params cortexa72_tunings =

> >

> 

> James,

> 

> There seem to be SPEC CPU2000fp validation issues on A57 when this

> flag is present too.  Though I evaluated the algorithm with a huge

> random set of values, always delivering accuracy around 1ulp, which

> should be enough for CPU2000fp (wit x86-64), I expected the

> benchmarks to pass.

> 

> My suspicion is that the Newton series on AArch64 is probably good

> only for SP.  Then, DP might require an extra round, probably

> exacerbating the performance penalty.

> 

> I'd like to try to split this tuning option into one for SP and

> another for DP.  Thoughts?


I haven't seen validation issues with the default expansion, but with
-mlow-precision-recip-sqrt I do see failures. I think this is to be
expected. I don't support splitting the low-precision flag to
"-mlow-precision-float-recip-sqrt" and "-mlow-precision-double-recip-sqrt",
I think that is pushing a particular set of Spec tuning flags over any
meaningful use case.

I could imagine a case for splitting the internal tuning flag to give
AARCH64_EXTRA_TUNE_SF_RECIP_SQRT and AARCH64_EXTRA_TUNE_DF_RECIP_SQRT, but
I'm not sure I understand the benefits of this? Certainly, I think your goals
for performance (turn on for 64-bit divide/sqrt) would contradict your goals
for accuracy (turn off for 64-bit divide/sqrt).

I'm happy with these flags as they are, but I might be missing a subtelty in
your argument?

Thanks,
James
diff mbox

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 1d5d898..999c9fc 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -484,8 +484,7 @@  static const struct tune_params cortexa57_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
-   | AARCH64_EXTRA_TUNE_RECIP_SQRT)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
 };
 
 static const struct tune_params cortexa72_tunings =