From patchwork Mon Jan 28 15:58:57 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Richard Henderson <richard.henderson@linaro.org>
X-Patchwork-Id: 156787
Delivered-To: patch@linaro.org
Received: by 2002:a02:48:0:0:0:0:0 with SMTP id 69csp3617808jaa;
 Mon, 28 Jan 2019 08:25:41 -0800 (PST)
X-Google-Smtp-Source: ALg8bN67ShcJxrPKO0PXRtzEdg7HI7JSYVu214r7GgrsujDxRHTtyZAyzi52u+8ShfXHRHNzZQ3R
X-Received: by 2002:adf:cd0e:: with SMTP id
 w14mr23267765wrm.218.1548692741522; 
 Mon, 28 Jan 2019 08:25:41 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1548692741; cv=none;
 d=google.com; s=arc-20160816;
 b=aSVV2nb4ZINo7BNdWuNmY6BKpoyBD4xU0ezm5Zs3DbawjFYxS62a7ZehrcJTvtYW0Y
 /Qn8LNroHyY8p48Perq5RkMVuBMtlBVTAhnv93VmP6n5rsHiOYptzJQAOTPATUr3byti
 vZmSBtB9yzRjNEKOAvRcJ4HsCzUybwGwTrsezuzwOCYcjnMqm2UVjX3nIHjFPASIVQvR
 zun2LeOGzYABFUuNSHT3vl/HciSKixuZsAWlk7l+mstk7G5fanauJ7KcmVIVJBdyMdfG
 /piXIqgicpGGRXnmUA5djgR9aJSWYVZ6gyJZHCcH9Y5sZR29xoiuynyoEXOwQYnk+ckm
 oDTw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816; 
 h=sender:errors-to:cc:list-subscribe:list-help:list-post:list-archive
 :list-unsubscribe:list-id:precedence:subject
 :content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:to:from:dkim-signature;
 bh=G3R6M8lo+egOdVDHfgVuN8I78ohbr7aGRLQnNFv5cCY=;
 b=VF2bddu/x3KgnpTgb1OAeEgJEV3qio66T9eh3XUlJCBDHsYmqCR3GDy+W7/0QDk8qY
 rH31+Y2rmlY9RdnkVn9GS+yAxu6XRJ+hZ3/ccf+OjbB8JNuoHYkSxk4JAowuV789A4rl
 VRw06RaQtovKwEEsljJz5pu5y5fGlJSo4krSKkHBnjt4atlohKPsqBqWVVVveMhwCYqi
 58FRnD+BTLclISRL7KjzD0WidcO0M1hoP4n2EUoKs4XNgrv9turO2G1fSLSmUqmqJVU+
 TO7n1gWjBaMA7yMPmj7int9URbfjPSihIODF8eor0wmsrJG4RDxBhrLgu5SyvDxJkU8A
 bfgQ==
ARC-Authentication-Results: i=1; mx.google.com;
 dkim=fail header.i=@linaro.org header.s=google header.b=BI8PyEot;
 spf=pass (google.com: domain of
 qemu-devel-bounces+patch=linaro.org@nongnu.org designates
 209.51.188.17 as permitted sender)
 smtp.mailfrom="qemu-devel-bounces+patch=linaro.org@nongnu.org";
 dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linaro.org
Return-Path: <qemu-devel-bounces+patch=linaro.org@nongnu.org>
Received: from lists.gnu.org (lists.gnu.org. [209.51.188.17])
 by mx.google.com with ESMTPS id
 i67si34910582wmi.164.2019.01.28.08.25.41 for <patch@linaro.org>
 (version=TLS1 cipher=AES128-SHA bits=128/128);
 Mon, 28 Jan 2019 08:25:41 -0800 (PST)
Received-SPF: pass (google.com: domain of
 qemu-devel-bounces+patch=linaro.org@nongnu.org designates
 209.51.188.17 as permitted sender) client-ip=209.51.188.17; 
Authentication-Results: mx.google.com;
 dkim=fail header.i=@linaro.org header.s=google header.b=BI8PyEot;
 spf=pass (google.com: domain of
 qemu-devel-bounces+patch=linaro.org@nongnu.org designates
 209.51.188.17 as permitted sender)
 smtp.mailfrom="qemu-devel-bounces+patch=linaro.org@nongnu.org";
 dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linaro.org
Received: from localhost ([127.0.0.1]:34517 helo=lists.gnu.org)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <qemu-devel-bounces+patch=linaro.org@nongnu.org>)
 id 1go9jA-0007qz-9r
 for patch@linaro.org; Mon, 28 Jan 2019 11:25:40 -0500
Received: from eggs.gnu.org ([209.51.188.92]:33630)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <richard.henderson@linaro.org>) id 1go9K6-0005Dq-UG
 for qemu-devel@nongnu.org; Mon, 28 Jan 2019 10:59:49 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <richard.henderson@linaro.org>) id 1go9K4-0007vO-Ib
 for qemu-devel@nongnu.org; Mon, 28 Jan 2019 10:59:46 -0500
Received: from mail-pl1-x642.google.com ([2607:f8b0:4864:20::642]:41075)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
 (Exim 4.71) (envelope-from <richard.henderson@linaro.org>)
 id 1go9K4-0007om-7N
 for qemu-devel@nongnu.org; Mon, 28 Jan 2019 10:59:44 -0500
Received: by mail-pl1-x642.google.com with SMTP id u6so7919431plm.8
 for <qemu-devel@nongnu.org>; Mon, 28 Jan 2019 07:59:30 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; 
 h=from:to:cc:subject:date:message-id:in-reply-to:references
 :mime-version:content-transfer-encoding;
 bh=G3R6M8lo+egOdVDHfgVuN8I78ohbr7aGRLQnNFv5cCY=;
 b=BI8PyEot8BbJCc0GinTMC85x5fEgOgU38qSj2XRMrC4iaX0hllM5w7Rqn2eX6GmD9t
 OSzJtYxRiHCgCGszRUKreVsWEZ+oQf356KVDesWsnvclvFnL6HpR1BSXVP1zA1M9iJTJ
 RbJ6JLOWjnBNZ9apMKCRx522jkoKg0FeTX758=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references:mime-version:content-transfer-encoding;
 bh=G3R6M8lo+egOdVDHfgVuN8I78ohbr7aGRLQnNFv5cCY=;
 b=mlrt5F3thSmKr15t7SGa14AhplILEaNvLQQumCwaPRUUM+wyaW1LblgLwH513iF+g5
 plk95nYFble12ZO4WYooCn5q8SlwRIy3v2TA3OMqcGRqRTrewvNFfstG96lf67RDLBSa
 tg/a3lV2a8icr4NiobpInUD+ExQBLH8+dOnZfZjCdGioLdfTDT6Kt6QIABPfhO3R8+cQ
 feqUjVOTNgz8T9hh2j7iLO4NW4giqiEZIiKnnNSvpD+P2PGu0m7lR3vnxVzMn9KjcKgR
 YtLCDgId8gkrGC3BbtPQ54Q2iUIJhqLXwj1kRf13dmtm8oRFttq5aEpfvD8P7sIwSUwu
 rmLw==
X-Gm-Message-State: AJcUukfw+KtevSUaWO/fMQQYyeROo3Ok3g7eYX2ShnJSWlTV8CisgCtF
 qZbTQc4y9K45BRYrJivIDjAsQawDHRU=
X-Received: by 2002:a17:902:b48b:: with SMTP id
 y11mr21442992plr.200.1548691169074; 
 Mon, 28 Jan 2019 07:59:29 -0800 (PST)
Received: from cloudburst.twiddle.net
 (50-233-235-3-static.hfc.comcastbusiness.net. [50.233.235.3])
 by smtp.gmail.com with ESMTPSA id
 p2sm45518687pfp.125.2019.01.28.07.59.27
 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256);
 Mon, 28 Jan 2019 07:59:28 -0800 (PST)
From: Richard Henderson <richard.henderson@linaro.org>
To: qemu-devel@nongnu.org
Date: Mon, 28 Jan 2019 07:58:57 -0800
Message-Id: <20190128155907.20607-14-richard.henderson@linaro.org>
X-Mailer: git-send-email 2.17.2
In-Reply-To: <20190128155907.20607-1-richard.henderson@linaro.org>
References: <20190128155907.20607-1-richard.henderson@linaro.org>
MIME-Version: 1.0
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
 recognized.
X-Received-From: 2607:f8b0:4864:20::642
Subject: [Qemu-devel] [PULL 13/23] tcg/i386: enable dynamic TLB sizing
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Cc: peter.maydell@linaro.org, "Emilio G. Cota" <cota@braap.org>
Errors-To: qemu-devel-bounces+patch=linaro.org@nongnu.org
Sender: "Qemu-devel" <qemu-devel-bounces+patch=linaro.org@nongnu.org>

From: "Emilio G. Cota" <cota@braap.org>

As the following experiments show, this series is a net perf gain,
particularly for memory-heavy workloads. Experiments are run on an
Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz.

1. System boot + shudown, debian aarch64:

- Before (v3.1.0):
 Performance counter stats for './die.sh v3.1.0' (10 runs):

       9019.797015      task-clock (msec)         #    0.993 CPUs utilized            ( +-  0.23% )
    29,910,312,379      cycles                    #    3.316 GHz                      ( +-  0.14% )
    54,699,252,014      instructions              #    1.83  insn per cycle           ( +-  0.08% )
    10,061,951,686      branches                  # 1115.541 M/sec                    ( +-  0.08% )
       172,966,530      branch-misses             #    1.72% of all branches          ( +-  0.07% )

       9.084039051 seconds time elapsed                                          ( +-  0.23% )

- After:
 Performance counter stats for './die.sh tlb-dyn-v5' (10 runs):

       8624.084842      task-clock (msec)         #    0.993 CPUs utilized            ( +-  0.23% )
    28,556,123,404      cycles                    #    3.311 GHz                      ( +-  0.13% )
    51,755,089,512      instructions              #    1.81  insn per cycle           ( +-  0.05% )
     9,526,513,946      branches                  # 1104.641 M/sec                    ( +-  0.05% )
       166,578,509      branch-misses             #    1.75% of all branches          ( +-  0.19% )

       8.680540350 seconds time elapsed                                          ( +-  0.24% )

That is, a 4.4% perf increase.

2. System boot + shutdown, ubuntu 18.04 x86_64:

- Before (v3.1.0):
      56100.574751      task-clock (msec)         #    1.016 CPUs utilized            ( +-  4.81% )
   200,745,466,128      cycles                    #    3.578 GHz                      ( +-  5.24% )
   431,949,100,608      instructions              #    2.15  insn per cycle           ( +-  5.65% )
    77,502,383,330      branches                  # 1381.490 M/sec                    ( +-  6.18% )
       844,681,191      branch-misses             #    1.09% of all branches          ( +-  3.82% )

      55.221556378 seconds time elapsed                                          ( +-  5.01% )

- After:
      56603.419540      task-clock (msec)         #    1.019 CPUs utilized            ( +- 10.19% )
   202,217,930,479      cycles                    #    3.573 GHz                      ( +- 10.69% )
   439,336,291,626      instructions              #    2.17  insn per cycle           ( +- 14.14% )
    80,538,357,447      branches                  # 1422.853 M/sec                    ( +- 16.09% )
       776,321,622      branch-misses             #    0.96% of all branches          ( +-  3.77% )

      55.549661409 seconds time elapsed                                          ( +- 10.44% )

No improvement (within noise range). Note that for this workload,
increasing the time window too much can lead to perf degradation,
since it flushes the TLB *very* frequently.

3. x86_64 SPEC06int:

           x86_64-softmmu speedup vs. v3.1.0 for SPEC06int (test set)
            Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Skylake)

5.5 +------------------------------------------------------------------------+
    |                   +-+                                                  |
  5 |-+.................+-+...............................tlb-dyn-v5.......+-|
    |                   * *                                                  |
4.5 |-+.................*.*................................................+-|
    |                   * *                                                  |
  4 |-+.................*.*................................................+-|
    |                   * *                                                  |
3.5 |-+.................*.*................................................+-|
    |                   * *                                                  |
  3 |-+......+-+*.......*.*................................................+-|
    |        *  *       * *                                                  |
2.5 |-+......*..*.......*.*.................................+-+*...........+-|
    |        *  *       * *                                 *  *             |
  2 |-+......*..*.......*.*.................................*..*...........+-|
    |        *  *       * *                                 *  *  +-+        |
1.5 |-+......*..*.......*.*.................................*..*.*+-+.*+-+.+-|
    |        *  * *+-+  * *  +-+       *+-+  +-+       +-+  *  * *  * *  *   |
  1 |++++-+*+*++*+*++*++*+*++*+*+++-+*+*+-++*+-++++-++++-+++*++*+*++*+*++*+++|
    |   *  * *  * *  *  * *  * *  *  * *  * *  *  * *  * *  *  * *  * *  *   |
0.5 +------------------------------------------------------------------------+
  400.perlb401.bzip403.g429445.g456.hm462.libq464.h471.omn47483.xalancbgeomean
  png: https://imgur.com/YRF90f7

That is, a 1.51x average speedup over the baseline, with a max speedup
of 5.17x.

Here's a different look at the SPEC06int results, using KVM as the baseline:

             x86_64-softmmu slowdown vs. KVM for SPEC06int (test set)
             Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Skylake)

25 +---------------------------------------------------------------------------+
   |                   +-+                                        +-+          |
   |                   * *                             +-+      v3.1.0         |
   |                   * *                             +-+  tlb-dyn-v5         |
   |                   * *                             * *        +-+          |
20 |-+.................*.*.............................*.+-+......*.*........+-|
   |                   * *                             * # #      * *          |
   |        +-+        * *                             * # #      * *          |
   |        * *        * *                             * # #      * *          |
15 |-+......*.*........*.*.............................*.#.#......*.+-+......+-|
   |        * *        * *                             * # #      * #|#        |
   |        * *        * *        +-+                  * # #      * +-+        |
   |        * *  +-+   * *        ++-+       +-+       * # #      * # # +-+    |
   |        * *  +-+   * *        * ##       *|   +-+  * # #      * # # +-+    |
10 |-+......*.*..*.+-+.*.*........*.##.......++-+.*.+-+*.#.#......*.#.#.*.*..+-|
   |        * *  * +-+ * *        * ## +-+   *# # * # #* # # +-+  * # # * *    |
   |        * *  * # # * *  +-+   * ## * +-+ *# # * # #* # # * *  * # # *+-+   |
   |        * *  * # # * *  * +-+ * ## * # # *# # * # #* # # * *  * # # * ##   |
 5 |-+......*.+-+*.#.#.*.*..*.#.#.*.##.*.#.#.*#.#.*.#.#*.#.#.*.*..*.#.#.*.##.+-|
   |        * # #* # # * +-+* # # * ## * # # *# # * # #* # # * *  * # # * ##   |
   |        * # #* # # * # #* # # * ## * # # *# # * # #* # # * +-+* # # * ##   |
   |   ++-+ * # #* # # * # #* # # * ## * # # *# # * # #* # # * # #* # # * ##   |
   |+++*#+#+*+#+#*+#+#+*+#+#*+#+#+*+##+*+#+#+*#+#+*+#+#*+#+#+*+#+#*+#+#+*+##+++|
 0 +---------------------------------------------------------------------------+
 400.perlbe401.bzi403.gc429445.go456.h462.libqu464.h471.omne4483.xalancbmgeomean
  png: https://imgur.com/YzAMNEV

After this series, we bring down the average SPEC06int slowdown vs KVM
from 11.47x to 7.58x.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <20190116170114.26802-4-cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/i386/tcg-target.h     |  2 +-
 tcg/i386/tcg-target.inc.c | 28 ++++++++++++++--------------
 2 files changed, 15 insertions(+), 15 deletions(-)

-- 
2.17.2

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 94aa4cef7c..eb40312e67 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -27,7 +27,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE  1
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
-#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0
+#define TCG_TARGET_IMPLEMENTS_DYN_TLB 1
 
 #ifdef __x86_64__
 # define TCG_TARGET_REG_BITS  64
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index dbf8253718..4d84aea3a9 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -329,6 +329,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define OPC_ARITH_GvEv	(0x03)		/* ... plus (ARITH_FOO << 3) */
 #define OPC_ANDN        (0xf2 | P_EXT38)
 #define OPC_ADD_GvEv	(OPC_ARITH_GvEv | (ARITH_ADD << 3))
+#define OPC_AND_GvEv    (OPC_ARITH_GvEv | (ARITH_AND << 3))
 #define OPC_BLENDPS     (0x0c | P_EXT3A | P_DATA16)
 #define OPC_BSF         (0xbc | P_EXT)
 #define OPC_BSR         (0xbd | P_EXT)
@@ -1641,7 +1642,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
         }
         if (TCG_TYPE_PTR == TCG_TYPE_I64) {
             hrexw = P_REXW;
-            if (TARGET_PAGE_BITS + CPU_TLB_BITS > 32) {
+            if (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32) {
                 tlbtype = TCG_TYPE_I64;
                 tlbrexw = P_REXW;
             }
@@ -1649,6 +1650,15 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
     }
 
     tcg_out_mov(s, tlbtype, r0, addrlo);
+    tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0,
+                   TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
+
+    tcg_out_modrm_offset(s, OPC_AND_GvEv + trexw, r0, TCG_AREG0,
+                         offsetof(CPUArchState, tlb_mask[mem_index]));
+
+    tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r0, TCG_AREG0,
+                         offsetof(CPUArchState, tlb_table[mem_index]));
+
     /* If the required alignment is at least as large as the access, simply
        copy the address and mask.  For lesser alignments, check that we don't
        cross pages for the complete access.  */
@@ -1658,20 +1668,10 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
         tcg_out_modrm_offset(s, OPC_LEA + trexw, r1, addrlo, s_mask - a_mask);
     }
     tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask;
-
-    tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0,
-                   TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
-
     tgen_arithi(s, ARITH_AND + trexw, r1, tlb_mask, 0);
-    tgen_arithi(s, ARITH_AND + tlbrexw, r0,
-                (CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS, 0);
-
-    tcg_out_modrm_sib_offset(s, OPC_LEA + hrexw, r0, TCG_AREG0, r0, 0,
-                             offsetof(CPUArchState, tlb_table[mem_index][0])
-                             + which);
 
     /* cmp 0(r0), r1 */
-    tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, 0);
+    tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, which);
 
     /* Prepare for both the fast path add of the tlb addend, and the slow
        path function argument setup.  */
@@ -1684,7 +1684,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
 
     if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
         /* cmp 4(r0), addrhi */
-        tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, 4);
+        tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, which + 4);
 
         /* jne slow_path */
         tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
@@ -1696,7 +1696,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
 
     /* add addend(r0), r1 */
     tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r1, r0,
-                         offsetof(CPUTLBEntry, addend) - which);
+                         offsetof(CPUTLBEntry, addend));
 }
 
 /*