From patchwork Mon Apr 28 05:11:30 2014
X-Patchwork-Submitter: zhichang.yuan@linaro.org
X-Patchwork-Id: 29167
From: zhichang.yuan@linaro.org
To: linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
 will.deacon@arm.com
Cc: dsaxena@linaro.org, liguozhu@huawei.com,
 "zhichang.yuan" <zhichang.yuan@linaro.org>
Subject: [PATCHv2 2/6] arm64: lib: Implement optimized memmove routine
Date: Mon, 28 Apr 2014 13:11:30 +0800
Message-Id: <1398661895-5559-3-git-send-email-zhichang.yuan@linaro.org>
In-Reply-To: <1398661895-5559-1-git-send-email-zhichang.yuan@linaro.org>
References: <1398661895-5559-1-git-send-email-zhichang.yuan@linaro.org>

From: "zhichang.yuan" <zhichang.yuan@linaro.org>

This patch, based on Linaro's Cortex Strings library, improves the
performance of the assembly-optimized memmove() function.
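
For reference, the dispatch strategy of the new routine can be modelled
in C roughly as follows (a sketch only; memmove_model is a hypothetical
name, not the kernel implementation): forward copies are delegated to
the optimized memcpy whenever the buffers cannot overlap destructively,
and only the genuinely overlapping dst > src case is copied backwards.

#include <stddef.h>
#include <string.h>

static void *memmove_model(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* dest below src, or regions disjoint: a forward memcpy is safe. */
	if (d < s || d >= s + count)
		return memcpy(dest, src, count);

	/* Overlapping with dest above src: copy from the end downwards. */
	while (count--)
		d[count] = s[count];
	return dest;
}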
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
 arch/arm64/lib/memmove.S | 190 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 165 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/lib/memmove.S b/arch/arm64/lib/memmove.S
index b79fdfa..57b19ea 100644
--- a/arch/arm64/lib/memmove.S
+++ b/arch/arm64/lib/memmove.S
@@ -1,5 +1,13 @@
 /*
  * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
@@ -16,6 +24,7 @@
 
 #include <linux/linkage.h>
 #include <asm/assembler.h>
+#include <asm/cache.h>
 
 /*
  * Move a buffer from src to test (alignment handled by the hardware).
@@ -28,30 +37,161 @@
  * Returns:
  *	x0 - dest
  */
+dstin	.req	x0
+src	.req	x1
+count	.req	x2
+tmp1	.req	x3
+tmp1w	.req	w3
+tmp2	.req	x4
+tmp2w	.req	w4
+tmp3	.req	x5
+tmp3w	.req	w5
+dst	.req	x6
+
+A_l	.req	x7
+A_h	.req	x8
+B_l	.req	x9
+B_h	.req	x10
+C_l	.req	x11
+C_h	.req	x12
+D_l	.req	x13
+D_h	.req	x14
+
 ENTRY(memmove)
-	cmp	x0, x1
-	b.ls	memcpy
-	add	x4, x0, x2
-	add	x1, x1, x2
-	subs	x2, x2, #8
-	b.mi	2f
-1:	ldr	x3, [x1, #-8]!
-	subs	x2, x2, #8
-	str	x3, [x4, #-8]!
-	b.pl	1b
-2:	adds	x2, x2, #4
-	b.mi	3f
-	ldr	w3, [x1, #-4]!
-	sub	x2, x2, #4
-	str	w3, [x4, #-4]!
-3:	adds	x2, x2, #2
-	b.mi	4f
-	ldrh	w3, [x1, #-2]!
-	sub	x2, x2, #2
-	strh	w3, [x4, #-2]!
-4:	adds	x2, x2, #1
-	b.mi	5f
-	ldrb	w3, [x1, #-1]
-	strb	w3, [x4, #-1]
-5:	ret
+	cmp	dstin, src
+	b.lo	memcpy
+	add	tmp1, src, count
+	cmp	dstin, tmp1
+	b.hs	memcpy		/* No overlap. */
+
+	add	dst, dstin, count
+	add	src, src, count
+	cmp	count, #16
+	b.lo	.Ltail15	/* probably unaligned accesses. */
+
+	ands	tmp2, src, #15	/* Bytes to reach alignment. */
+	b.eq	.LSrcAligned
+	sub	count, count, tmp2
+	/*
+	 * Copy the leading unaligned bytes first so that src becomes
+	 * aligned. The cost of these few extra instructions is acceptable,
+	 * and all subsequent accesses are then based on aligned addresses.
+	 */
+	tbz	tmp2, #0, 1f
+	ldrb	tmp1w, [src, #-1]!
+	strb	tmp1w, [dst, #-1]!
+1:
+	tbz	tmp2, #1, 2f
+	ldrh	tmp1w, [src, #-2]!
+	strh	tmp1w, [dst, #-2]!
+2:
+	tbz	tmp2, #2, 3f
+	ldr	tmp1w, [src, #-4]!
+	str	tmp1w, [dst, #-4]!
+3:
+	tbz	tmp2, #3, .LSrcAligned
+	ldr	tmp1, [src, #-8]!
+	str	tmp1, [dst, #-8]!
+
+.LSrcAligned:
+	cmp	count, #64
+	b.ge	.Lcpy_over64
+
+	/*
+	 * Deal with small copies quickly by dropping straight into the
+	 * exit block.
+	 */
+.Ltail63:
+	/*
+	 * Copy up to 48 bytes of data. At this point we only need the
+	 * bottom 6 bits of count to be accurate.
+	 */
+	ands	tmp1, count, #0x30
+	b.eq	.Ltail15
+	cmp	tmp1w, #0x20
+	b.eq	1f
+	b.lt	2f
+	ldp	A_l, A_h, [src, #-16]!
+	stp	A_l, A_h, [dst, #-16]!
+1:
+	ldp	A_l, A_h, [src, #-16]!
+	stp	A_l, A_h, [dst, #-16]!
+2:
+	ldp	A_l, A_h, [src, #-16]!
+	stp	A_l, A_h, [dst, #-16]!
+
+.Ltail15:
+	tbz	count, #3, 1f
+	ldr	tmp1, [src, #-8]!
+	str	tmp1, [dst, #-8]!
+1:
+	tbz	count, #2, 2f
+	ldr	tmp1w, [src, #-4]!
+	str	tmp1w, [dst, #-4]!
+2:
+	tbz	count, #1, 3f
+	ldrh	tmp1w, [src, #-2]!
+	strh	tmp1w, [dst, #-2]!
+3:
+	tbz	count, #0, .Lexitfunc
+	ldrb	tmp1w, [src, #-1]
+	strb	tmp1w, [dst, #-1]
+
+.Lexitfunc:
+	ret
+
+.Lcpy_over64:
+	subs	count, count, #128
+	b.ge	.Lcpy_body_large
+	/*
+	 * Less than 128 bytes to copy, so handle 64 bytes here and then jump
+	 * to the tail.
+	 */
+	ldp	A_l, A_h, [src, #-16]
+	stp	A_l, A_h, [dst, #-16]
+	ldp	B_l, B_h, [src, #-32]
+	ldp	C_l, C_h, [src, #-48]
+	stp	B_l, B_h, [dst, #-32]
+	stp	C_l, C_h, [dst, #-48]
+	ldp	D_l, D_h, [src, #-64]!
+	stp	D_l, D_h, [dst, #-64]!
+
+	tst	count, #0x3f
+	b.ne	.Ltail63
+	ret
+
+	/*
+	 * Critical loop. Start at a new cache line boundary. Assuming
+	 * 64 bytes per line, this ensures the entire loop is in one line.
+	 */
+	.p2align	L1_CACHE_SHIFT
+.Lcpy_body_large:
+	/* Pre-load 64 bytes of data. */
+	ldp	A_l, A_h, [src, #-16]
+	ldp	B_l, B_h, [src, #-32]
+	ldp	C_l, C_h, [src, #-48]
+	ldp	D_l, D_h, [src, #-64]!
+1:
+	/*
+	 * Interleave the load of the next 64-byte block with the store of
+	 * the previously loaded 64-byte block.
+	 */
+	stp	A_l, A_h, [dst, #-16]
+	ldp	A_l, A_h, [src, #-16]
+	stp	B_l, B_h, [dst, #-32]
+	ldp	B_l, B_h, [src, #-32]
+	stp	C_l, C_h, [dst, #-48]
+	ldp	C_l, C_h, [src, #-48]
+	stp	D_l, D_h, [dst, #-64]!
+	ldp	D_l, D_h, [src, #-64]!
+	subs	count, count, #64
+	b.ge	1b
+	stp	A_l, A_h, [dst, #-16]
+	stp	B_l, B_h, [dst, #-32]
+	stp	C_l, C_h, [dst, #-48]
+	stp	D_l, D_h, [dst, #-64]!
+
+	tst	count, #0x3f
+	b.ne	.Ltail63
+	ret
 ENDPROC(memmove)
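
The .Ltail15 sequence above can be read as the following C sketch
(tail15_backward is a hypothetical helper for illustration, not part of
the patch): each set bit of the remaining count selects exactly one copy
of 8, 4, 2 or 1 bytes, so the sub-16-byte tail needs no loop. As with
the assembly's single ldr/str pairs, each chunk is loaded into a
temporary in full before being stored, which stays correct even when a
chunk's source and destination ranges overlap.

#include <stdint.h>
#include <string.h>

/* dst and src point one past the end of the remaining data; count < 16. */
static void tail15_backward(unsigned char *dst, const unsigned char *src,
			    uint64_t count)
{
	uint64_t t64;
	uint32_t t32;
	uint16_t t16;

	if (count & 8) {		/* tbz count, #3, 1f */
		src -= 8; dst -= 8;
		memcpy(&t64, src, 8);	/* load the whole chunk first... */
		memcpy(dst, &t64, 8);	/* ...then store it, like ldr/str */
	}
	if (count & 4) {		/* tbz count, #2, 2f */
		src -= 4; dst -= 4;
		memcpy(&t32, src, 4);
		memcpy(dst, &t32, 4);
	}
	if (count & 2) {		/* tbz count, #1, 3f */
		src -= 2; dst -= 2;
		memcpy(&t16, src, 2);
		memcpy(dst, &t16, 2);
	}
	if (count & 1)			/* tbz count, #0, .Lexitfunc */
		dst[-1] = src[-1];
}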
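
Likewise, the software pipelining in .Lcpy_body_large amounts to the C
outline below (copy_body_large is a hypothetical name; the sketch assumes
src and dst point one past the end of their regions and that count is a
multiple of 64, at least 128). One 64-byte block is pre-loaded before the
loop, and each iteration fetches the next block before storing the
previously loaded one, so the loads always run a full block ahead of the
stores; the assembly interleaves the two instruction by instruction,
which is shown here as whole-block operations.

#include <stdint.h>
#include <string.h>

struct block64 {
	uint64_t q[8];	/* one 64-byte chunk (the A..D register pairs) */
};

static void copy_body_large(unsigned char *dst, const unsigned char *src,
			    uint64_t count)
{
	struct block64 cur, next;

	/* Pre-load the top 64 bytes (the four ldp's before label 1:). */
	src -= 64;
	memcpy(&cur, src, 64);
	count -= 64;

	/*
	 * Fetch the next block before storing the previous one, so the
	 * loads stay 64 bytes ahead of the stores and never depend on them.
	 */
	while (count >= 64) {
		src -= 64;
		memcpy(&next, src, 64);
		dst -= 64;
		memcpy(dst, &cur, 64);
		cur = next;
		count -= 64;
	}

	/* Drain the pipeline: store the last pre-loaded block. */
	dst -= 64;
	memcpy(dst, &cur, 64);
}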