From patchwork Tue Oct 31 20:09:21 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Adhemerval Zanella
X-Patchwork-Id: 739595
From: Adhemerval Zanella
To: libc-alpha@sourceware.org, Noah Goldstein, "H . J . Lu", Bruce Merry
Subject: [PATCH 0/4] x86: Improve ERMS usage on Zen3+
Date: Tue, 31 Oct 2023 17:09:21 -0300
Message-Id: <20231031200925.3297456-1-adhemerval.zanella@linaro.org>
X-Mailer: git-send-email 2.34.1

For the sizes where REP MOVSB and REP STOSB are used on Zen3+ cores, the
resulting performance is lower than with vectorized instructions (with
some input alignments showing a very large performance gap, as indicated
by BZ#30995). The glibc enables ERMS on AMD code for sizes between 2113
(rep_movsb_threshold) and the L2 cache size (rep_movsb_stop_threshold,
or 524288 on a Zen3 core).
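To make the size window above concrete, here is a minimal sketch of the range check. The function name is hypothetical and this is not the glibc implementation (which lives in assembly); the inclusive lower bound and exclusive upper bound are an assumption chosen to match the benchmarked sizes 2113 and 524287.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not a glibc symbol.  Returns true when a copy of
   SIZE bytes falls in the window where REP MOVSB would be selected.
   Assumption: the window is [threshold, stop_threshold).  */
static bool
uses_rep_movsb (size_t size, size_t threshold, size_t stop_threshold)
{
  /* An empty or inverted window simply means ERMS is never selected.  */
  if (threshold >= stop_threshold)
    return false;
  return size >= threshold && size < stop_threshold;
}

/* Values quoted in the cover letter for a Zen3 core.  */
static const size_t zen3_rep_movsb_threshold = 2113;
static const size_t zen3_rep_movsb_stop_threshold = 524288;

/* Self-check of the window boundaries under the assumption above.  */
static void
check_zen3_window (void)
{
  assert (uses_rep_movsb (2113, zen3_rep_movsb_threshold,
                          zen3_rep_movsb_stop_threshold));
  assert (!uses_rep_movsb (524288, zen3_rep_movsb_threshold,
                           zen3_rep_movsb_stop_threshold));
}
```

Under this model, raising the lower threshold to 1000000 leaves the window empty, so every size takes the vectorized path.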

Using the provided benchmarks from BZ#30995, the memcpy on Ryzen 9 5900X
shows:

  Size (bytes)  Destination Alignment  Throughput (GB/s)
          2113                      0            84.2448
          2113                     15             4.4310
        524287                      0            57.1122
        524287                     15             4.34671

Using vectorized instructions instead, with the tunable
GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000, it shows:

  Size (bytes)  Destination Alignment  Throughput (GB/s)
          2113                      0           124.1830
          2113                     15           121.8720
        524287                      0            58.3212
        524287                     15            58.5352

Increasing the number of concurrent jobs does show improvements in ERMS
over vectorized instructions as well. The performance difference with
ERMS improves if input alignments are equal, although it does not reach
parity with the vectorized path.

The memset also shows a similar performance improvement with vectorized
instructions instead of REP STOSB. On the same machine, the default
strategy shows:

  Size (bytes)  Destination Alignment  Throughput (GB/s)
          2113                      0            68.0113
          2113                     15            56.1880
        524287                      0           119.3670
        524287                     15           116.2590

While with GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=1000000:

  Size (bytes)  Destination Alignment  Throughput (GB/s)
          2113                      0           133.2310
          2113                     15           132.5800
        524287                      0           112.0650
        524287                     15           118.0960

I also saw a slight performance increase on 502.gcc_r (1 copy), where
the result went from 9.82 to 9.85. The benchmark exercises both memcpy
and memset heavily.

The first patch adds a way to check if a tunable is set (BZ 27069), which
is used in the second patch to select the best strategy. The BZ 30994
fix also adds a new tunable, glibc.cpu.x86_rep_movsb_stop_threshold, so
the caller can specify a size range to force ERMS usage (from the
BZ #30994 discussion, there are some cases where ERMS is profitable).

Patch 3 disables ERMS usage for memset on Zen 3+. And patch 4 slightly
improves the x86 memcpy documentation.

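As an illustration of why a "tunable is set" check is useful, the sketch below shows one way such a check could drive the threshold choice. Everything here is hypothetical: the struct, the field names, the function and the Zen3+ policy are assumptions made for illustration, not the interface or logic added by the patches. It also assumes REP MOVSB is selected only for sizes in the half-open range [threshold, stop_threshold).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tunable record; glibc's internal representation differs.  */
struct tunable_sketch
{
  size_t value;   /* Current value, either the default or user-supplied.  */
  bool user_set;  /* True if the value came from GLIBC_TUNABLES.  */
};

/* Hypothetical policy: an explicit user setting always wins; otherwise a
   Zen3+ core gets a threshold that leaves the ERMS size range empty.  */
static size_t
select_rep_movsb_threshold (const struct tunable_sketch *t,
                            bool is_zen3_plus, size_t stop_threshold)
{
  assert (t != NULL);
  if (t->user_set)
    return t->value;        /* Honor the user's explicit request.  */
  if (is_zen3_plus)
    return stop_threshold;  /* threshold == stop: ERMS never selected.  */
  return t->value;          /* Built-in default for other cores.  */
}
```

Without the "set" flag, a user who explicitly asks for the default value would be indistinguishable from one who set nothing, so the CPU-specific override could not safely be skipped.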
Adhemerval Zanella (4):
  elf: Add a way to check if tunable is set (BZ 27069)
  x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
  x86: Do not prefer ERMS for memset on Zen3+
  x86: Expand the comment on when REP STOSB is used on memset

 elf/dl-tunable-types.h                        |  1 +
 elf/dl-tunables.c                             | 40 ++++++++++
 elf/dl-tunables.h                             | 28 +++++++
 elf/dl-tunables.list                          |  1 +
 manual/tunables.texi                          |  9 +++
 scripts/gen-tunables.awk                      |  4 +-
 sysdeps/x86/dl-cacheinfo.h                    | 74 ++++++++++++-------
 sysdeps/x86/dl-tunables.list                  | 10 +++
 .../multiarch/memset-vec-unaligned-erms.S     |  4 +-
 9 files changed, 142 insertions(+), 29 deletions(-)