From patchwork Mon Nov 14 00:08:50 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jerry DeLisle X-Patchwork-Id: 81987 Delivered-To: patch@linaro.org Received: by 10.140.97.165 with SMTP id m34csp779268qge; Sun, 13 Nov 2016 16:10:12 -0800 (PST) X-Received: by 10.99.108.8 with SMTP id h8mr43221484pgc.93.1479082212855; Sun, 13 Nov 2016 16:10:12 -0800 (PST) Return-Path: Received: from sourceware.org (server1.sourceware.org. [209.132.180.131]) by mx.google.com with ESMTPS id e13si15775242pgf.220.2016.11.13.16.10.12 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 13 Nov 2016 16:10:12 -0800 (PST) Received-SPF: pass (google.com: domain of gcc-patches-return-441282-patch=linaro.org@gcc.gnu.org designates 209.132.180.131 as permitted sender) client-ip=209.132.180.131; Authentication-Results: mx.google.com; dkim=pass header.i=@gcc.gnu.org; spf=pass (google.com: domain of gcc-patches-return-441282-patch=linaro.org@gcc.gnu.org designates 209.132.180.131 as permitted sender) smtp.mailfrom=gcc-patches-return-441282-patch=linaro.org@gcc.gnu.org DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :subject:to:cc:message-id:date:mime-version:content-type; q=dns; s=default; b=IN2ygg1ejSuBjDvPn4396ubxoVXPK4SDYEaIzpQpKzM1kJo5qt mI7y0/okXYG6lGso+gOQSU+KjYSfDIC1w/s1Fx0XvotoTchVtGQutHsZ21RdadCi XQk4aUr1GlcuHSNMXqoZMIXXV+6Lwte2c/TaSgkur3oU9Etrd1YwAF1R0= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :subject:to:cc:message-id:date:mime-version:content-type; s= default; bh=rN0SWU9FuXsAZ7o3/UZ/BtjbnGQ=; b=OWGVv/yv6zuhJ/i9p/6N l71hBzn+DG5w9aImEYCOonXXp2XrcjqxAJ64k2psOoOjOaZs8GfdZJGGzQ1yJW54 u3wprP0uzVsj8+kAJZTNm/kJpznreDOYFbC3vH54z1Um5WtNmPoRCUFUaC90+quQ j83Rc8Qs9mCFaA7gom8QpnU= Received: (qmail 91092 invoked by alias); 14 Nov 2016 00:09:18 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 91060 invoked by uid 89); 14 Nov 2016 00:09:16 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.3 required=5.0 tests=AWL, BAYES_50, KAM_ASCII_DIVIDERS, RP_MATCHES_RCVD, SPF_PASS autolearn=ham version=3.3.2 spammy=ny, science, ii, Science X-Spam-User: qpsmtpd, 2 recipients X-HELO: mtaout003-public.msg.strl.va.charter.net Received: from mtaout003-public.msg.strl.va.charter.net (HELO mtaout003-public.msg.strl.va.charter.net) (68.114.190.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 14 Nov 2016 00:08:55 +0000 Received: from impout002 ([68.114.189.17]) by mtaout003.msg.strl.va.charter.net (InterMail vM.9.00.023.01 201-2473-194) with ESMTP id <20161114000853.NRNW7355.mtaout003.msg.strl.va.charter.net@impout002>; Sun, 13 Nov 2016 18:08:53 -0600 Received: from amda8.localdomain ([96.41.215.23]) by impout002 with charter.net id 7c8r1u0010Wrkg001c8r97; Sun, 13 Nov 2016 18:08:53 -0600 X-Authority-Analysis: v=2.1 cv=UfjfSciN c=1 sm=1 tr=0 a=salB9WdMPIDduBH7JsZfrA==:117 a=salB9WdMPIDduBH7JsZfrA==:17 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=r77TgQKjGQsHNAKrUKIA:9 a=ozQi6Qy7AAAA:8 a=mDV3o1hIAAAA:8 a=38KySD9ugQW6moVqGfEA:9 a=QEXdDO2ut3YA:10 a=pvXDIpvsHAiv-iJZ780A:9 a=VYR0TuFiMAOvUaOw:21 a=bCHIDxSGlfYQCkxo:21 a=ZhkpOFxDRTm-_ORk-vYA:9 a=5011TGWGUjXZ4oUQxJtK:22 a=_FVE-zBwftR9WsbkzFJk:22 X-Auth-id: anZkZWxpc2xlQGNoYXJ0ZXIubmV0 From: Jerry DeLisle Subject: [patch,libgfortran] PR51119 - MATMUL slow for large matrices To: "fortran@gcc.gnu.org" Cc: GCC Patches Message-ID: <2aad89ce-02e1-45bf-0bdc-d318e7995595@charter.net> Date: Sun, 13 Nov 2016 16:08:50 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 Hi all, Attached patch implements a fast blocked matrix multiply. The basic algorithm is derived from netlib.org tuned blas dgemm. See matmul.m4 for reference. The matmul() function is compiled with -Ofast -funroll-loops. This can be customized further if there is an undesired optimization being used. This is accomplished using #pragma optimize ( string ). My results on 3.8 GHz machine with single core: $ gfc -Ofast -funroll-loops -finline-matmul-limit=32 compare.f90 $ ./a.out ========================================================= ================ MEASURED GIGAFLOPS = ========================================================= Matmul Matmul fixed Matmul variable Size Loops explicit refMatmul assumed explicit ========================================================= 2 2000 23.810 0.058 0.116 0.191 4 2000 1.979 0.294 0.437 0.421 8 2000 3.089 0.826 0.928 0.993 16 2000 4.115 3.262 2.600 3.381 32 2000 6.066 5.201 3.008 4.873 64 2000 6.596 4.847 6.624 6.603 128 2000 8.389 5.965 8.370 8.375 256 477 9.520 6.003 9.449 9.452 512 59 8.563 2.783 8.359 8.500 1024 7 8.672 1.537 8.457 8.604 2048 1 8.586 1.753 8.371 8.511 Results may vary, but I found 32 is the right place to set the limit on inlining. Regression tested on x86-64-linux power8 (gcc112) Special thanks to Thomas for helping me test and debug. An additional test case as well. OK for trunk? Best regards, Jerry 2016-11-09 Jerry DeLisle Thomas Koenig PR libgfortran/51119 * m4/matmul.m4: For the case of all strides = 1, implement a fast blocked matrix multiply. Fix some whitespace. * generated/matmul_c10.c: Regenerate. * generated/matmul_c16.c: Regenerate. * generated/matmul_c4.c: Regenerate. * generated/matmul_c8.c: Regenerate. * generated/matmul_i1.c: Regenerate. * generated/matmul_i16.c: Regenerate. * generated/matmul_i2.c: Regenerate. * generated/matmul_i4.c: Regenerate. * generated/matmul_i8.c: Regenerate. * generated/matmul_r10.c: Regenerate. * generated/matmul_r16.c: Regenerate. * generated/matmul_r4.c: Regenerate. * generated/matmul_r8.c: Regenerate. diff --git a/libgfortran/generated/matmul_c10.c b/libgfortran/generated/matmul_c10.c index c955818..6d1985d 100644 --- a/libgfortran/generated/matmul_c10.c +++ b/libgfortran/generated/matmul_c10.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_COMPLEX_10) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_c10 (gfc_array_c10 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_c10); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_c10 (gfc_array_c10 * const restrict retarray, gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_c10 (gfc_array_c10 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_c10 (gfc_array_c10 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_COMPLEX_10)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_c10 (gfc_array_c10 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_COMPLEX_10 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_COMPLEX_10 * restrict bbase_y; - GFC_COMPLEX_10 * restrict dest_y; - const GFC_COMPLEX_10 * restrict abase_n; - GFC_COMPLEX_10 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_COMPLEX_10 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_COMPLEX_10) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_COMPLEX_10)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_COMPLEX_10 *a, *b; + GFC_COMPLEX_10 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_COMPLEX_10 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_COMPLEX_10)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_c10 (gfc_array_c10 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_c10 (gfc_array_c10 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_c16.c b/libgfortran/generated/matmul_c16.c index 25fe56e..d967483 100644 --- a/libgfortran/generated/matmul_c16.c +++ b/libgfortran/generated/matmul_c16.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_COMPLEX_16) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_c16 (gfc_array_c16 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_c16); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_c16 (gfc_array_c16 * const restrict retarray, gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_c16 (gfc_array_c16 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_c16 (gfc_array_c16 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_COMPLEX_16)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_c16 (gfc_array_c16 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_COMPLEX_16 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_COMPLEX_16 * restrict bbase_y; - GFC_COMPLEX_16 * restrict dest_y; - const GFC_COMPLEX_16 * restrict abase_n; - GFC_COMPLEX_16 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_COMPLEX_16 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_COMPLEX_16) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_COMPLEX_16)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_COMPLEX_16 *a, *b; + GFC_COMPLEX_16 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_COMPLEX_16 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_COMPLEX_16)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_c16 (gfc_array_c16 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_c16 (gfc_array_c16 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_c4.c b/libgfortran/generated/matmul_c4.c index e9d2ed3..0305782 100644 --- a/libgfortran/generated/matmul_c4.c +++ b/libgfortran/generated/matmul_c4.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_COMPLEX_4) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_c4 (gfc_array_c4 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_c4); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_c4 (gfc_array_c4 * const restrict retarray, gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_c4 (gfc_array_c4 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_c4 (gfc_array_c4 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_COMPLEX_4)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_c4 (gfc_array_c4 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_COMPLEX_4 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_COMPLEX_4 * restrict bbase_y; - GFC_COMPLEX_4 * restrict dest_y; - const GFC_COMPLEX_4 * restrict abase_n; - GFC_COMPLEX_4 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_COMPLEX_4 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_COMPLEX_4) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_COMPLEX_4)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_COMPLEX_4 *a, *b; + GFC_COMPLEX_4 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_COMPLEX_4 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_COMPLEX_4)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_c4 (gfc_array_c4 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_c4 (gfc_array_c4 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_c8.c b/libgfortran/generated/matmul_c8.c index 8a127da..efd5623 100644 --- a/libgfortran/generated/matmul_c8.c +++ b/libgfortran/generated/matmul_c8.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_COMPLEX_8) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_c8 (gfc_array_c8 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_c8); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_c8 (gfc_array_c8 * const restrict retarray, gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_c8 (gfc_array_c8 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_c8 (gfc_array_c8 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_COMPLEX_8)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_c8 (gfc_array_c8 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_COMPLEX_8 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_COMPLEX_8 * restrict bbase_y; - GFC_COMPLEX_8 * restrict dest_y; - const GFC_COMPLEX_8 * restrict abase_n; - GFC_COMPLEX_8 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_COMPLEX_8 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_COMPLEX_8) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_COMPLEX_8)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_COMPLEX_8 *a, *b; + GFC_COMPLEX_8 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_COMPLEX_8 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_COMPLEX_8)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_c8 (gfc_array_c8 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_c8 (gfc_array_c8 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_i1.c b/libgfortran/generated/matmul_i1.c index fdb3092..58d12ab 100644 --- a/libgfortran/generated/matmul_i1.c +++ b/libgfortran/generated/matmul_i1.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_INTEGER_1) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_i1 (gfc_array_i1 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_i1); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_i1 (gfc_array_i1 * const restrict retarray, gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_i1 (gfc_array_i1 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_i1 (gfc_array_i1 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_INTEGER_1)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_i1 (gfc_array_i1 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_INTEGER_1 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_INTEGER_1 * restrict bbase_y; - GFC_INTEGER_1 * restrict dest_y; - const GFC_INTEGER_1 * restrict abase_n; - GFC_INTEGER_1 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_INTEGER_1 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_INTEGER_1) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_INTEGER_1)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_INTEGER_1 *a, *b; + GFC_INTEGER_1 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_INTEGER_1 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_INTEGER_1)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_i1 (gfc_array_i1 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_i1 (gfc_array_i1 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_i16.c b/libgfortran/generated/matmul_i16.c index 80eb63c..bfcc028 100644 --- a/libgfortran/generated/matmul_i16.c +++ b/libgfortran/generated/matmul_i16.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_INTEGER_16) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_i16 (gfc_array_i16 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_i16); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_i16 (gfc_array_i16 * const restrict retarray, gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_i16 (gfc_array_i16 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_i16 (gfc_array_i16 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_INTEGER_16)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_i16 (gfc_array_i16 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_INTEGER_16 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_INTEGER_16 * restrict bbase_y; - GFC_INTEGER_16 * restrict dest_y; - const GFC_INTEGER_16 * restrict abase_n; - GFC_INTEGER_16 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_INTEGER_16 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_INTEGER_16) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_INTEGER_16)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_INTEGER_16 *a, *b; + GFC_INTEGER_16 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_INTEGER_16 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_INTEGER_16)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_i16 (gfc_array_i16 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_i16 (gfc_array_i16 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_i2.c b/libgfortran/generated/matmul_i2.c index 281a013..a65c40b 100644 --- a/libgfortran/generated/matmul_i2.c +++ b/libgfortran/generated/matmul_i2.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_INTEGER_2) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_i2 (gfc_array_i2 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_i2); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_i2 (gfc_array_i2 * const restrict retarray, gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_i2 (gfc_array_i2 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_i2 (gfc_array_i2 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_INTEGER_2)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_i2 (gfc_array_i2 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_INTEGER_2 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_INTEGER_2 * restrict bbase_y; - GFC_INTEGER_2 * restrict dest_y; - const GFC_INTEGER_2 * restrict abase_n; - GFC_INTEGER_2 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_INTEGER_2 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_INTEGER_2) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_INTEGER_2)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_INTEGER_2 *a, *b; + GFC_INTEGER_2 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_INTEGER_2 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_INTEGER_2)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_i2 (gfc_array_i2 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_i2 (gfc_array_i2 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_i4.c b/libgfortran/generated/matmul_i4.c index 2dc526d..933f8d5 100644 --- a/libgfortran/generated/matmul_i4.c +++ b/libgfortran/generated/matmul_i4.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_INTEGER_4) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_i4 (gfc_array_i4 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_i4); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_i4 (gfc_array_i4 * const restrict retarray, gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_i4 (gfc_array_i4 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_i4 (gfc_array_i4 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_INTEGER_4)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_i4 (gfc_array_i4 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_INTEGER_4 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_INTEGER_4 * restrict bbase_y; - GFC_INTEGER_4 * restrict dest_y; - const GFC_INTEGER_4 * restrict abase_n; - GFC_INTEGER_4 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_INTEGER_4 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_INTEGER_4) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_INTEGER_4)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_INTEGER_4 *a, *b; + GFC_INTEGER_4 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_INTEGER_4 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_INTEGER_4)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_i4 (gfc_array_i4 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_i4 (gfc_array_i4 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_i8.c b/libgfortran/generated/matmul_i8.c index 0ff728d..62f82b9 100644 --- a/libgfortran/generated/matmul_i8.c +++ b/libgfortran/generated/matmul_i8.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_INTEGER_8) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_i8 (gfc_array_i8 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_i8); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_i8 (gfc_array_i8 * const restrict retarray, gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_i8 (gfc_array_i8 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_i8 (gfc_array_i8 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_INTEGER_8)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_i8 (gfc_array_i8 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_INTEGER_8 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_INTEGER_8 * restrict bbase_y; - GFC_INTEGER_8 * restrict dest_y; - const GFC_INTEGER_8 * restrict abase_n; - GFC_INTEGER_8 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_INTEGER_8 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_INTEGER_8) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_INTEGER_8)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_INTEGER_8 *a, *b; + GFC_INTEGER_8 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_INTEGER_8 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_INTEGER_8)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_i8 (gfc_array_i8 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_i8 (gfc_array_i8 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_r10.c b/libgfortran/generated/matmul_r10.c index a34856f..3d8c32c 100644 --- a/libgfortran/generated/matmul_r10.c +++ b/libgfortran/generated/matmul_r10.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_REAL_10) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_r10 (gfc_array_r10 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_r10); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_r10 (gfc_array_r10 * const restrict retarray, gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_r10 (gfc_array_r10 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_r10 (gfc_array_r10 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_REAL_10)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_r10 (gfc_array_r10 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_REAL_10 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_REAL_10 * restrict bbase_y; - GFC_REAL_10 * restrict dest_y; - const GFC_REAL_10 * restrict abase_n; - GFC_REAL_10 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_REAL_10 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_REAL_10) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_REAL_10)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_REAL_10 *a, *b; + GFC_REAL_10 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_REAL_10 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_REAL_10)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_r10 (gfc_array_r10 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_r10 (gfc_array_r10 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_r16.c b/libgfortran/generated/matmul_r16.c index d2f11bd..e5a0a92 100644 --- a/libgfortran/generated/matmul_r16.c +++ b/libgfortran/generated/matmul_r16.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_REAL_16) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_r16 (gfc_array_r16 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_r16); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_r16 (gfc_array_r16 * const restrict retarray, gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_r16 (gfc_array_r16 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_r16 (gfc_array_r16 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_REAL_16)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_r16 (gfc_array_r16 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_REAL_16 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_REAL_16 * restrict bbase_y; - GFC_REAL_16 * restrict dest_y; - const GFC_REAL_16 * restrict abase_n; - GFC_REAL_16 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_REAL_16 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_REAL_16) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_REAL_16)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_REAL_16 *a, *b; + GFC_REAL_16 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_REAL_16 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_REAL_16)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_r16 (gfc_array_r16 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_r16 (gfc_array_r16 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_r4.c b/libgfortran/generated/matmul_r4.c index ff3b93f..6b6ad9b 100644 --- a/libgfortran/generated/matmul_r4.c +++ b/libgfortran/generated/matmul_r4.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_REAL_4) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_r4 (gfc_array_r4 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_r4); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_r4 (gfc_array_r4 * const restrict retarray, gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_r4 (gfc_array_r4 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_r4 (gfc_array_r4 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_REAL_4)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_r4 (gfc_array_r4 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_REAL_4 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_REAL_4 * restrict bbase_y; - GFC_REAL_4 * restrict dest_y; - const GFC_REAL_4 * restrict abase_n; - GFC_REAL_4 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_REAL_4 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_REAL_4) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_REAL_4)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_REAL_4 *a, *b; + GFC_REAL_4 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_REAL_4 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_REAL_4)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_r4 (gfc_array_r4 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_r4 (gfc_array_r4 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/generated/matmul_r8.c b/libgfortran/generated/matmul_r8.c index af805ee..f3d0149 100644 --- a/libgfortran/generated/matmul_r8.c +++ b/libgfortran/generated/matmul_r8.c @@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see #if defined (HAVE_GFC_REAL_8) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we'll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -75,6 +75,9 @@ extern void matmul_r8 (gfc_array_r8 * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_r8); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_r8 (gfc_array_r8 * const restrict retarray, gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas, @@ -99,7 +102,7 @@ matmul_r8 (gfc_array_r8 * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -127,47 +130,47 @@ matmul_r8 (gfc_array_r8 * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_REAL_8)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } if (GFC_DESCRIPTOR_RANK (retarray) == 1) @@ -230,61 +233,294 @@ matmul_r8 (gfc_array_r8 * const restrict retarray, bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we're performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const GFC_REAL_8 one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const GFC_REAL_8 * restrict bbase_y; - GFC_REAL_8 * restrict dest_y; - const GFC_REAL_8 * restrict abase_n; - GFC_REAL_8 bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const GFC_REAL_8 one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof (GFC_REAL_8) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = (GFC_REAL_8)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const GFC_REAL_8 *a, *b; + GFC_REAL_8 *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + GFC_REAL_8 t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = (GFC_REAL_8)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -334,7 +570,9 @@ matmul_r8 (gfc_array_r8 * const restrict retarray, for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -372,5 +610,5 @@ matmul_r8 (gfc_array_r8 * const restrict retarray, } } } - +#pragma GCC reset_options #endif diff --git a/libgfortran/m4/matmul.m4 b/libgfortran/m4/matmul.m4 index 468615b..20a0404 100644 --- a/libgfortran/m4/matmul.m4 +++ b/libgfortran/m4/matmul.m4 @@ -33,7 +33,7 @@ include(iparm.m4)dnl `#if defined (HAVE_'rtype_name`) /* Prototype for the BLAS ?gemm subroutine, a pointer to which can be - passed to us by the front-end, in which case we''`ll call it for large + passed to us by the front-end, in which case we call it for large matrices. */ typedef void (*blas_call)(const char *, const char *, const int *, const int *, @@ -76,6 +76,9 @@ extern void matmul_'rtype_code` ('rtype` * const restrict retarray, int blas_limit, blas_call gemm); export_proto(matmul_'rtype_code`); +#pragma GCC optimize ( "-Ofast" ) +#pragma GCC optimize ( "-funroll-loops" ) + void matmul_'rtype_code` ('rtype` * const restrict retarray, 'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas, @@ -100,7 +103,7 @@ matmul_'rtype_code` ('rtype` * const restrict retarray, o One-dimensional argument B is implicitly treated as a column matrix dimensioned [count, 1], so ycount=1. - */ +*/ if (retarray->base_addr == NULL) { @@ -128,47 +131,47 @@ matmul_'rtype_code` ('rtype` * const restrict retarray, = xmallocarray (size0 ((array_t *) retarray), sizeof ('rtype_name`)); retarray->offset = 0; } - else if (unlikely (compile_options.bounds_check)) - { - index_type ret_extent, arg_extent; - - if (GFC_DESCRIPTOR_RANK (a) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else if (GFC_DESCRIPTOR_RANK (b) == 1) - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic: is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - else - { - arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 1:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - - arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); - ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); - if (arg_extent != ret_extent) - runtime_error ("Incorrect extent in return array in" - " MATMUL intrinsic for dimension 2:" - " is %ld, should be %ld", - (long int) ret_extent, (long int) arg_extent); - } - } + else if (unlikely (compile_options.bounds_check)) + { + index_type ret_extent, arg_extent; + + if (GFC_DESCRIPTOR_RANK (a) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else if (GFC_DESCRIPTOR_RANK (b) == 1) + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic: is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + else + { + arg_extent = GFC_DESCRIPTOR_EXTENT(a,0); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 1:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + + arg_extent = GFC_DESCRIPTOR_EXTENT(b,1); + ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1); + if (arg_extent != ret_extent) + runtime_error ("Incorrect extent in return array in" + " MATMUL intrinsic for dimension 2:" + " is %ld, should be %ld", + (long int) ret_extent, (long int) arg_extent); + } + } ' sinclude(`matmul_asm_'rtype_code`.m4')dnl ` @@ -232,61 +235,294 @@ sinclude(`matmul_asm_'rtype_code`.m4')dnl bbase = b->base_addr; dest = retarray->base_addr; - - /* Now that everything is set up, we''`re performing the multiplication + /* Now that everything is set up, we perform the multiplication itself. */ #define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x))) +#define min(a,b) ((a) <= (b) ? (a) : (b)) +#define max(a,b) ((a) >= (b) ? (a) : (b)) if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1) && (bxstride == 1 || bystride == 1) && (((float) xcount) * ((float) ycount) * ((float) count) > POW3(blas_limit))) - { - const int m = xcount, n = ycount, k = count, ldc = rystride; - const 'rtype_name` one = 1, zero = 0; - const int lda = (axstride == 1) ? aystride : axstride, - ldb = (bxstride == 1) ? bystride : bxstride; - - if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) - { - assert (gemm != NULL); - gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k, - &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1); - return; - } - } - - if (rxstride == 1 && axstride == 1 && bxstride == 1) { - const 'rtype_name` * restrict bbase_y; - 'rtype_name` * restrict dest_y; - const 'rtype_name` * restrict abase_n; - 'rtype_name` bbase_yn; + const int m = xcount, n = ycount, k = count, ldc = rystride; + const 'rtype_name` one = 1, zero = 0; + const int lda = (axstride == 1) ? aystride : axstride, + ldb = (bxstride == 1) ? bystride : bxstride; - if (rystride == xcount) - memset (dest, 0, (sizeof ('rtype_name`) * xcount * ycount)); - else + if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1) { - for (y = 0; y < ycount; y++) - for (x = 0; x < xcount; x++) - dest[x + y*rystride] = ('rtype_name`)0; + assert (gemm != NULL); + gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, + &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest, + &ldc, 1, 1); + return; } + } - for (y = 0; y < ycount; y++) + if (rxstride == 1 && axstride == 1 && bxstride == 1) + { + /* This block of code implements a tuned matmul, derived from + Superscalar GEMM-based level 3 BLAS, Beta version 0.1 + + Bo Kagstrom and Per Ling + Department of Computing Science + Umea University + S-901 87 Umea, Sweden + + from netlib.org, translated to C, and modified for matmul.m4. */ + + const 'rtype_name` *a, *b; + 'rtype_name` *c; + const index_type m = xcount, n = ycount, k = count; + + /* System generated locals */ + index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2, + i3, i4, i5, i6; + + /* Local variables */ + 'rtype_name` t1[65536], /* was [256][256] */ + f11, f12, f21, f22, f31, f32, f41, f42, + f13, f14, f23, f24, f33, f34, f43, f44; + index_type i, j, l, ii, jj, ll; + index_type isec, jsec, lsec, uisec, ujsec, ulsec; + + a = abase; + b = bbase; + c = retarray->base_addr; + + /* Parameter adjustments */ + c_dim1 = m; + c_offset = 1 + c_dim1; + c -= c_offset; + a_dim1 = aystride; + a_offset = 1 + a_dim1; + a -= a_offset; + b_dim1 = bystride; + b_offset = 1 + b_dim1; + b -= b_offset; + + /* Early exit if possible */ + if (m == 0 || n == 0 || k == 0) + return; + + /* Empty c first. */ + for (j=1; j<=n; j++) + for (i=1; i<=m; i++) + c[i + j * c_dim1] = ('rtype_name`)0; + + /* Start turning the crank. */ + i1 = n; + for (jj = 1; jj <= i1; jj += 512) { - bbase_y = bbase + y*bystride; - dest_y = dest + y*rystride; - for (n = 0; n < count; n++) + /* Computing MIN */ + i2 = 512; + i3 = n - jj + 1; + jsec = min(i2,i3); + ujsec = jsec - jsec % 4; + i2 = k; + for (ll = 1; ll <= i2; ll += 256) { - abase_n = abase + n*aystride; - bbase_yn = bbase_y[n]; - for (x = 0; x < xcount; x++) + /* Computing MIN */ + i3 = 256; + i4 = k - ll + 1; + lsec = min(i3,i4); + ulsec = lsec - lsec % 2; + + i3 = m; + for (ii = 1; ii <= i3; ii += 256) { - dest_y[x] += abase_n[x] * bbase_yn; + /* Computing MIN */ + i4 = 256; + i5 = m - ii + 1; + isec = min(i4,i5); + uisec = isec - isec % 2; + i4 = ll + ulsec - 1; + for (l = ll; l <= i4; l += 2) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 2) + { + t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] = + a[i + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] = + a[i + (l + 1) * a_dim1]; + t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + l * a_dim1]; + t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] = + a[i + 1 + (l + 1) * a_dim1]; + } + if (uisec < isec) + { + t1[l - ll + 1 + (isec << 8) - 257] = + a[ii + isec - 1 + l * a_dim1]; + t1[l - ll + 2 + (isec << 8) - 257] = + a[ii + isec - 1 + (l + 1) * a_dim1]; + } + } + if (ulsec < lsec) + { + i4 = ii + isec - 1; + for (i = ii; i<= i4; ++i) + { + t1[lsec + ((i - ii + 1) << 8) - 257] = + a[i + (ll + lsec - 1) * a_dim1]; + } + } + + uisec = isec - isec % 4; + i4 = jj + ujsec - 1; + for (j = jj; j <= i4; j += 4) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f22 = c[i + 1 + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f23 = c[i + 1 + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + f24 = c[i + 1 + (j + 3) * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + f32 = c[i + 2 + (j + 1) * c_dim1]; + f42 = c[i + 3 + (j + 1) * c_dim1]; + f33 = c[i + 2 + (j + 2) * c_dim1]; + f43 = c[i + 3 + (j + 2) * c_dim1]; + f34 = c[i + 2 + (j + 3) * c_dim1]; + f44 = c[i + 3 + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + j * b_dim1]; + f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 1) * b_dim1]; + f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 2) * b_dim1]; + f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257] + * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + (j + 1) * c_dim1] = f12; + c[i + 1 + (j + 1) * c_dim1] = f22; + c[i + (j + 2) * c_dim1] = f13; + c[i + 1 + (j + 2) * c_dim1] = f23; + c[i + (j + 3) * c_dim1] = f14; + c[i + 1 + (j + 3) * c_dim1] = f24; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + c[i + 2 + (j + 1) * c_dim1] = f32; + c[i + 3 + (j + 1) * c_dim1] = f42; + c[i + 2 + (j + 2) * c_dim1] = f33; + c[i + 3 + (j + 2) * c_dim1] = f43; + c[i + 2 + (j + 3) * c_dim1] = f34; + c[i + 3 + (j + 3) * c_dim1] = f44; + } + if (uisec < isec) + { + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + f12 = c[i + (j + 1) * c_dim1]; + f13 = c[i + (j + 2) * c_dim1]; + f14 = c[i + (j + 3) * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 1) * b_dim1]; + f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 2) * b_dim1]; + f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + (j + 3) * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + (j + 1) * c_dim1] = f12; + c[i + (j + 2) * c_dim1] = f13; + c[i + (j + 3) * c_dim1] = f14; + } + } + } + if (ujsec < jsec) + { + i4 = jj + jsec - 1; + for (j = jj + ujsec; j <= i4; ++j) + { + i5 = ii + uisec - 1; + for (i = ii; i <= i5; i += 4) + { + f11 = c[i + j * c_dim1]; + f21 = c[i + 1 + j * c_dim1]; + f31 = c[i + 2 + j * c_dim1]; + f41 = c[i + 3 + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - + 257] * b[l + j * b_dim1]; + f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - + 257] * b[l + j * b_dim1]; + f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + c[i + 1 + j * c_dim1] = f21; + c[i + 2 + j * c_dim1] = f31; + c[i + 3 + j * c_dim1] = f41; + } + i5 = ii + isec - 1; + for (i = ii + uisec; i <= i5; ++i) + { + f11 = c[i + j * c_dim1]; + i6 = ll + lsec - 1; + for (l = ll; l <= i6; ++l) + { + f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - + 257] * b[l + j * b_dim1]; + } + c[i + j * c_dim1] = f11; + } + } + } } } } + return; } else if (rxstride == 1 && aystride == 1 && bxstride == 1) { @@ -336,7 +572,9 @@ sinclude(`matmul_asm_'rtype_code`.m4')dnl for (n = 0; n < count; n++) for (x = 0; x < xcount; x++) /* dest[x,y] += a[x,n] * b[n,y] */ - dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride]; + dest[x*rxstride + y*rystride] += + abase[x*axstride + n*aystride] * + bbase[n*bxstride + y*bystride]; } else if (GFC_DESCRIPTOR_RANK (a) == 1) { @@ -373,6 +611,6 @@ sinclude(`matmul_asm_'rtype_code`.m4')dnl } } } -} - -#endif' +}' +#pragma GCC reset_options +#endif