mbox series

[00/23] Add remaining CORE-MATH binary32 implementations to libm

Message ID 20241129132032.476978-1-adhemerval.zanella@linaro.org
Headers show
Series Add remaining CORE-MATH binary32 implementations to libm | expand

Message

Adhemerval Zanella Netto Nov. 29, 2024, 1:17 p.m. UTC
This patchset adds the optimized and correctly rounded acosf, acohf,
asinf, asinhf, atanf, atan2f, atanhf, coshf, sinhf, and tanf from
CORE-MATH [1].  Each implementation has a benchmark to evaluate the
performance improvements.

In general, the results are pretty good with just some remarks:

* acosf/asinf hits hard the branch prediction on the range x < 0x1.c2a1dcp-1
  for acosf and x < 0x1.852p+126 for asinf.

  For acosf, this is required to make it correctly rounded on non-default
  rounding modes, and for asinf, it is fast-path optimization.

  The performance profiles are wildly different depending on the chip: I see
  regressions compared to glibc implementation on AMD Zen3 targeting
  x86-64/x86-64-v2, but I also see large improvements for x86-64-v3 and also
   on aarch64 Neoverse-N1 and powerpc POWER10.

  I think it should not be a blocker for integration.

* coshf performance is the only regression compared to the current glibc.
  This is mostly due to the benchmark used (which I modeled using CORE-MATH
  input range) showing that the glibc code hotspot is on expf, an optimized
  version from ARM-optimized routines.  

  Neither current expf nor coshf is correctly rounded, and the maximum error
  is 2ulps for FE_TONEAREST and 3ulp for another rounding.

  I am not sure if this would be a blocker, and I plan to remove the old SVID
  compat wrapper in subsequent patches (which should improve the function
  performance by about ~10%).

[1] https://gitlab.inria.fr/core-math/core-math

Adhemerval Zanella (23):
  benchtests: Add acosf benchmark
  benchtests: Add acoshf benchmark
  benchtests: Add asinf benchmark
  benchtests: Add asinhf benchmark
  benchtests: Add atanf benchmark
  benchtests: Add atan2f benchmark
  benchtests: Add atanhf benchmark
  benchtests: Add coshf benchmark
  benchtests: Add sinhf benchmark
  benchtests: Add tanhf benchmark
  math: Add inf support on gen-auto-libm-tests.c
  math: Fix the expected atanf (inf) results
  math: Fix the expected atan2f (inf) results
  math: Use acosf from CORE-MATH
  math: Use acoshf from CORE-MATH
  math: Use asinf from CORE-MATH
  math: Use asinhf from CORE-MATH
  math: Use atanf from CORE-MATH
  math: Use atan2f from CORE-MATH
  math: Use atanhf from CORE-MATH
  math: Use coshf from CORE-MATH
  math: Use sinhf from CORE-MATH
  math: Use tanhf from CORE-MATH

 SHARED-FILES                                  |   40 +
 benchtests/Makefile                           |   10 +
 benchtests/acosf-inputs                       | 2710 +++++++++++++++++
 benchtests/acoshf-inputs                      | 1005 ++++++
 benchtests/asinf-inputs                       | 2710 +++++++++++++++++
 benchtests/asinhf-inputs                      | 2005 ++++++++++++
 benchtests/atan2f-inputs                      | 2005 ++++++++++++
 benchtests/atanf-inputs                       | 2005 ++++++++++++
 benchtests/atanhf-inputs                      | 2005 ++++++++++++
 benchtests/coshf-inputs                       | 2005 ++++++++++++
 benchtests/sinhf-inputs                       | 2005 ++++++++++++
 benchtests/tanhf-inputs                       | 2005 ++++++++++++
 math/auto-libm-test-in                        |   52 +
 math/auto-libm-test-out-atan                  |   50 +
 math/auto-libm-test-out-atan2                 | 2316 ++++++++++++++
 math/gen-auto-libm-tests.c                    |   23 +-
 math/libm-test-atan.inc                       |    2 -
 math/libm-test-atan2.inc                      |   56 -
 sysdeps/aarch64/libm-test-ulps                |   44 +-
 sysdeps/alpha/fpu/libm-test-ulps              |   40 -
 sysdeps/arc/fpu/libm-test-ulps                |   40 -
 sysdeps/arc/nofpu/libm-test-ulps              |   10 -
 sysdeps/arm/libm-test-ulps                    |   48 +-
 sysdeps/csky/fpu/libm-test-ulps               |   40 -
 sysdeps/csky/nofpu/libm-test-ulps             |   40 -
 sysdeps/hppa/fpu/libm-test-ulps               |   40 -
 sysdeps/i386/fpu/e_acosf.S                    |   23 -
 sysdeps/i386/fpu/e_acoshf.S                   |  101 -
 sysdeps/i386/fpu/e_asinf.S                    |   38 -
 sysdeps/i386/fpu/e_atan2f.S                   |   30 -
 sysdeps/i386/fpu/e_atanhf.S                   |  110 -
 sysdeps/i386/fpu/libm-test-ulps               |   25 -
 sysdeps/i386/fpu/s_asinhf.S                   |  139 -
 sysdeps/i386/fpu/s_atanf.S                    |   30 -
 .../i386/i686/fpu/multiarch/libm-test-ulps    |   25 -
 sysdeps/ieee754/flt-32/e_acosf.c              |  191 +-
 sysdeps/ieee754/flt-32/e_acoshf.c             |  230 +-
 sysdeps/ieee754/flt-32/e_asinf.c              |  210 +-
 sysdeps/ieee754/flt-32/e_atan2f.c             |  337 +-
 sysdeps/ieee754/flt-32/e_atanhf.c             |  210 +-
 sysdeps/ieee754/flt-32/e_coshf.c              |  156 +-
 sysdeps/ieee754/flt-32/e_sinhf.c              |  169 +-
 sysdeps/ieee754/flt-32/s_asinhf.c             |  219 +-
 sysdeps/ieee754/flt-32/s_atanf.c              |  186 +-
 sysdeps/ieee754/flt-32/s_tanhf.c              |  131 +-
 sysdeps/loongarch/lp64/libm-test-ulps         |   44 +-
 sysdeps/m68k/coldfire/fpu/libm-test-ulps      |    2 -
 sysdeps/m68k/m680x0/fpu/libm-test-ulps        |   11 -
 sysdeps/microblaze/libm-test-ulps             |   10 -
 sysdeps/mips/mips32/libm-test-ulps            |   40 -
 sysdeps/mips/mips64/libm-test-ulps            |   40 -
 sysdeps/or1k/fpu/libm-test-ulps               |   40 -
 sysdeps/or1k/nofpu/libm-test-ulps             |   40 -
 sysdeps/powerpc/fpu/libm-test-ulps            |   44 +-
 sysdeps/powerpc/nofpu/libm-test-ulps          |   40 -
 sysdeps/riscv/nofpu/libm-test-ulps            |   40 -
 sysdeps/riscv/rvd/libm-test-ulps              |   40 -
 sysdeps/s390/fpu/libm-test-ulps               |   40 -
 sysdeps/sh/libm-test-ulps                     |   20 -
 sysdeps/sparc/fpu/libm-test-ulps              |   40 -
 sysdeps/x86_64/fpu/libm-test-ulps             |   46 +-
 61 files changed, 24381 insertions(+), 2027 deletions(-)
 create mode 100644 benchtests/acosf-inputs
 create mode 100644 benchtests/acoshf-inputs
 create mode 100644 benchtests/asinf-inputs
 create mode 100644 benchtests/asinhf-inputs
 create mode 100644 benchtests/atan2f-inputs
 create mode 100644 benchtests/atanf-inputs
 create mode 100644 benchtests/atanhf-inputs
 create mode 100644 benchtests/coshf-inputs
 create mode 100644 benchtests/sinhf-inputs
 create mode 100644 benchtests/tanhf-inputs
 delete mode 100644 sysdeps/i386/fpu/e_acosf.S
 delete mode 100644 sysdeps/i386/fpu/e_acoshf.S
 delete mode 100644 sysdeps/i386/fpu/e_asinf.S
 delete mode 100644 sysdeps/i386/fpu/e_atan2f.S
 delete mode 100644 sysdeps/i386/fpu/e_atanhf.S
 delete mode 100644 sysdeps/i386/fpu/s_asinhf.S
 delete mode 100644 sysdeps/i386/fpu/s_atanf.S

Comments

Florian Weimer Dec. 2, 2024, 2:39 p.m. UTC | #1
* Adhemerval Zanella:

> * acosf/asinf hits hard the branch prediction on the range x < 0x1.c2a1dcp-1
>   for acosf and x < 0x1.852p+126 for asinf.
>
>   For acosf, this is required to make it correctly rounded on non-default
>   rounding modes, and for asinf, it is fast-path optimization.
>
>   The performance profiles are wildly different depending on the chip: I see
>   regressions compared to glibc implementation on AMD Zen3 targeting
>   x86-64/x86-64-v2, but I also see large improvements for x86-64-v3 and also
>    on aarch64 Neoverse-N1 and powerpc POWER10.

We can add an FMA variant for acosf/asinf as an IFUNC.  This should
avoid the performance regression for x86-64 on most machines.

According to the commit message, that would also help with coshf for
x86-64?

Thanks,
Florian