From patchwork Wed Nov 16 21:30:03 2016
X-Patchwork-Submitter: Thomas Koenig
X-Patchwork-Id: 82615
To: fortran@gcc.gnu.org, gcc-patches@gcc.gnu.org
From: Thomas Koenig
Subject: [patch, libfortran] Add AVX-specific matmul
Message-ID: <05fbb04a-f4c1-cb61-9baa-7a86ea673784@netcologne.de>
Date: Wed, 16 Nov 2016 22:30:03 +0100

Hello world,

the attached patch adds an AVX-specific version of the matmul
intrinsic to the Fortran library. It works by using the target_clones
attribute: the actual work is done by a static function compiled both
for AVX and for the default architecture, and the user-callable entry
point dispatches to the right clone at runtime.

On AVX hardware, this can give another 40% speed increase for large
matrices; a resulting binary reached around 15 GFlops for larger
matrices on a 3.4 GHz i7-2600 CPU. I am currently building/regtesting
on that machine. As a cross-check, I also compiled this on
powerpc64-unknown-linux-gnu, without any ill effects.

OK for trunk?

Regards

	Thomas

2016-11-16  Thomas Koenig

	PR fortran/78379
	* m4/matmul.m4:  For x86_64, make the work function for matmul
	static with target_clones for AVX and default, and create a
	wrapper function to call it.
	* generated/matmul_c10.c: Regenerated.
	* generated/matmul_c16.c: Regenerated.
	* generated/matmul_c4.c: Regenerated.
	* generated/matmul_c8.c: Regenerated.
	* generated/matmul_i1.c: Regenerated.
	* generated/matmul_i16.c: Regenerated.
	* generated/matmul_i2.c: Regenerated.
	* generated/matmul_i4.c: Regenerated.
	* generated/matmul_i8.c: Regenerated.
	* generated/matmul_r10.c: Regenerated.
	* generated/matmul_r16.c: Regenerated.
	* generated/matmul_r4.c: Regenerated.
	* generated/matmul_r8.c: Regenerated.
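For readers unfamiliar with the attribute, here is a minimal,
self-contained sketch of the scheme the patch applies to each
generated file: a static worker carrying target_clones plus an
exported wrapper. The saxpy names are made up for illustration and
are not part of the patch; on x86_64, GCC emits an AVX clone and a
default clone of the worker, plus a resolver that selects one at
program startup.

#include <stdio.h>

#ifdef __x86_64__
/* The worker is cloned for AVX and for the default architecture;
   the loader-time resolver picks the right one.  */
static void aux_saxpy (float *restrict y, const float *restrict x,
		       float a, int n)
	__attribute__ ((target_clones("avx,default")));

void
saxpy (float *restrict y, const float *restrict x, float a, int n)
{
  aux_saxpy (y, x, a, n);
}

static void
aux_saxpy (float *restrict y, const float *restrict x, float a, int n)
#else
void
saxpy (float *restrict y, const float *restrict x, float a, int n)
#endif
{
  int i;
  for (i = 0; i < n; i++)
    y[i] += a * x[i];
}

int
main (void)
{
  float x[4] = { 1, 2, 3, 4 }, y[4] = { 0 };
  saxpy (y, x, 2.0f, 4);
  printf ("%g %g %g %g\n", y[0], y[1], y[2], y[3]);  /* 2 4 6 8 */
  return 0;
}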
Index: generated/matmul_c10.c
===================================================================
--- generated/matmul_c10.c	(Revision 242477)
+++ generated/matmul_c10.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_c10 (gfc_array_c10 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_c10);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_c10 (gfc_array_c10 * const restrict retarray,
+	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_c10 (gfc_array_c10 * const restrict retarray,
 	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_c10 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_c10 (gfc_array_c10 * const restrict retarray,
+	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_c10 (gfc_array_c10 * const restrict retarray,
+	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_COMPLEX_10 * restrict abase;
   const GFC_COMPLEX_10 * restrict bbase;
   GFC_COMPLEX_10 * restrict dest;
Index: generated/matmul_c16.c
===================================================================
--- generated/matmul_c16.c	(Revision 242477)
+++ generated/matmul_c16.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_c16 (gfc_array_c16 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_c16);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_c16 (gfc_array_c16 * const restrict retarray,
+	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_c16 (gfc_array_c16 * const restrict retarray,
 	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_c16 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_c16 (gfc_array_c16 * const restrict retarray,
+	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_c16 (gfc_array_c16 * const restrict retarray,
+	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_COMPLEX_16 * restrict abase;
   const GFC_COMPLEX_16 * restrict bbase;
   GFC_COMPLEX_16 * restrict dest;
Index: generated/matmul_c4.c
===================================================================
--- generated/matmul_c4.c	(Revision 242477)
+++ generated/matmul_c4.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_c4 (gfc_array_c4 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_c4);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_c4 (gfc_array_c4 * const restrict retarray,
+	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_c4 (gfc_array_c4 * const restrict retarray,
 	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_c4 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_c4 (gfc_array_c4 * const restrict retarray,
+	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_c4 (gfc_array_c4 * const restrict retarray,
+	gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_COMPLEX_4 * restrict abase;
   const GFC_COMPLEX_4 * restrict bbase;
   GFC_COMPLEX_4 * restrict dest;
Index: generated/matmul_c8.c
===================================================================
--- generated/matmul_c8.c	(Revision 242477)
+++ generated/matmul_c8.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_c8 (gfc_array_c8 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_c8);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_c8 (gfc_array_c8 * const restrict retarray,
+	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_c8 (gfc_array_c8 * const restrict retarray,
 	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_c8 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_c8 (gfc_array_c8 * const restrict retarray,
+	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_c8 (gfc_array_c8 * const restrict retarray,
+	gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_COMPLEX_8 * restrict abase;
   const GFC_COMPLEX_8 * restrict bbase;
   GFC_COMPLEX_8 * restrict dest;
Index: generated/matmul_i1.c
===================================================================
--- generated/matmul_i1.c	(Revision 242477)
+++ generated/matmul_i1.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_i1 (gfc_array_i1 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_i1);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_i1 (gfc_array_i1 * const restrict retarray,
+	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_i1 (gfc_array_i1 * const restrict retarray,
 	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_i1 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_i1 (gfc_array_i1 * const restrict retarray,
+	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_i1 (gfc_array_i1 * const restrict retarray,
+	gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_INTEGER_1 * restrict abase;
   const GFC_INTEGER_1 * restrict bbase;
   GFC_INTEGER_1 * restrict dest;
Index: generated/matmul_i16.c
===================================================================
--- generated/matmul_i16.c	(Revision 242477)
+++ generated/matmul_i16.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_i16 (gfc_array_i16 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_i16);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_i16 (gfc_array_i16 * const restrict retarray,
+	gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_i16 (gfc_array_i16 * const restrict retarray,
 	gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_i16 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_i16 (gfc_array_i16 * const restrict retarray,
+	gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_i16 (gfc_array_i16 * const restrict retarray,
+	gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_INTEGER_16 * restrict abase;
   const GFC_INTEGER_16 * restrict bbase;
   GFC_INTEGER_16 * restrict dest;
Index: generated/matmul_i2.c
===================================================================
--- generated/matmul_i2.c	(Revision 242477)
+++ generated/matmul_i2.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_i2 (gfc_array_i2 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_i2);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_i2 (gfc_array_i2 * const restrict retarray,
+	gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_i2 (gfc_array_i2 * const restrict retarray,
 	gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_i2 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_i2 (gfc_array_i2 * const restrict retarray,
+	gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_i2 (gfc_array_i2 * const restrict retarray,
+	gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_INTEGER_2 * restrict abase;
   const GFC_INTEGER_2 * restrict bbase;
   GFC_INTEGER_2 * restrict dest;
Index: generated/matmul_i4.c
===================================================================
--- generated/matmul_i4.c	(Revision 242477)
+++ generated/matmul_i4.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_i4 (gfc_array_i4 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_i4);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_i4 (gfc_array_i4 * const restrict retarray,
+	gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_i4 (gfc_array_i4 * const restrict retarray,
 	gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_i4 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_i4 (gfc_array_i4 * const restrict retarray,
+	gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_i4 (gfc_array_i4 * const restrict retarray,
+	gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_INTEGER_4 * restrict abase;
   const GFC_INTEGER_4 * restrict bbase;
   GFC_INTEGER_4 * restrict dest;
Index: generated/matmul_i8.c
===================================================================
--- generated/matmul_i8.c	(Revision 242477)
+++ generated/matmul_i8.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_i8 (gfc_array_i8 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_i8);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_i8 (gfc_array_i8 * const restrict retarray,
+	gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_i8 (gfc_array_i8 * const restrict retarray,
 	gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_i8 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_i8 (gfc_array_i8 * const restrict retarray,
+	gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_i8 (gfc_array_i8 * const restrict retarray,
+	gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_INTEGER_8 * restrict abase;
   const GFC_INTEGER_8 * restrict bbase;
   GFC_INTEGER_8 * restrict dest;
Index: generated/matmul_r10.c
===================================================================
--- generated/matmul_r10.c	(Revision 242477)
+++ generated/matmul_r10.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_r10 (gfc_array_r10 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_r10);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_r10 (gfc_array_r10 * const restrict retarray,
+	gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_r10 (gfc_array_r10 * const restrict retarray,
 	gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_r10 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_r10 (gfc_array_r10 * const restrict retarray,
+	gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_r10 (gfc_array_r10 * const restrict retarray,
+	gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_REAL_10 * restrict abase;
   const GFC_REAL_10 * restrict bbase;
   GFC_REAL_10 * restrict dest;
Index: generated/matmul_r16.c
===================================================================
--- generated/matmul_r16.c	(Revision 242477)
+++ generated/matmul_r16.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_r16 (gfc_array_r16 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_r16);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_r16 (gfc_array_r16 * const restrict retarray,
+	gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_r16 (gfc_array_r16 * const restrict retarray,
 	gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_r16 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_r16 (gfc_array_r16 * const restrict retarray,
+	gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_r16 (gfc_array_r16 * const restrict retarray,
+	gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_REAL_16 * restrict abase;
   const GFC_REAL_16 * restrict bbase;
   GFC_REAL_16 * restrict dest;
Index: generated/matmul_r4.c
===================================================================
--- generated/matmul_r4.c	(Revision 242477)
+++ generated/matmul_r4.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_r4 (gfc_array_r4 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_r4);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_r4 (gfc_array_r4 * const restrict retarray,
+	gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_r4 (gfc_array_r4 * const restrict retarray,
 	gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_r4 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_r4 (gfc_array_r4 * const restrict retarray,
+	gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_r4 (gfc_array_r4 * const restrict retarray,
+	gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_REAL_4 * restrict abase;
   const GFC_REAL_4 * restrict bbase;
   GFC_REAL_4 * restrict dest;
Index: generated/matmul_r8.c
===================================================================
--- generated/matmul_r8.c	(Revision 242477)
+++ generated/matmul_r8.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_r8 (gfc_array_r8 * const restri
 	int blas_limit, blas_call gemm);
 export_proto(matmul_r8);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_r8 (gfc_array_r8 * const restrict retarray,
+	gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_r8 (gfc_array_r8 * const restrict retarray,
 	gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_r8 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_r8 (gfc_array_r8 * const restrict retarray,
+	gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_r8 (gfc_array_r8 * const restrict retarray,
+	gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_REAL_8 * restrict abase;
   const GFC_REAL_8 * restrict bbase;
   GFC_REAL_8 * restrict dest;
Index: m4/matmul.m4
===================================================================
--- m4/matmul.m4	(Revision 242477)
+++ m4/matmul.m4	(Arbeitskopie)
@@ -76,11 +76,37 @@ extern void matmul_'rtype_code` ('rtype` * const r
 	int blas_limit, blas_call gemm);
 export_proto(matmul_'rtype_code`);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_'rtype_code` ('rtype` * const restrict retarray,
+	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_'rtype_code` ('rtype` * const restrict retarray,
 	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_'rtype_code` (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_'rtype_code` ('rtype` * const restrict retarray,
+	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_'rtype_code` ('rtype` * const restrict retarray,
+	'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const 'rtype_name` * restrict abase;
   const 'rtype_name` * restrict bbase;
   'rtype_name` * restrict dest;
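As a footnote on the dispatch mechanism: the resolver that
target_clones generates decides at load time based on runtime CPU
detection, so whether the "avx" clone will actually run on a given
x86_64 machine can be probed with GCC's CPU builtins, which consult
the same kind of feature data. This little check is illustrative only
and not part of the patch:

#include <stdio.h>

int
main (void)
{
  /* Initialize GCC's CPU model data, then query the AVX feature bit.  */
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx"))
    printf ("AVX available: the avx clone should be selected.\n");
  else
    printf ("no AVX: the default clone should be selected.\n");
  return 0;
}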