From patchwork Mon Apr 28 05:11:31 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: zhichang.yuan@linaro.org
X-Patchwork-Id: 29170
From: zhichang.yuan@linaro.org
To: linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
	will.deacon@arm.com
Cc: dsaxena@linaro.org, liguozhu@huawei.com,
	"zhichang.yuan" <zhichang.yuan@linaro.org>
Subject: [PATCHv2 3/6] arm64: lib: Implement optimized memset routine
Date: Mon, 28 Apr 2014 13:11:31 +0800
Message-Id: <1398661895-5559-4-git-send-email-zhichang.yuan@linaro.org>
X-Mailer: git-send-email 1.7.9.5
In-Reply-To: <1398661895-5559-1-git-send-email-zhichang.yuan@linaro.org>
References: <1398661895-5559-1-git-send-email-zhichang.yuan@linaro.org>

From: "zhichang.yuan" <zhichang.yuan@linaro.org>

This patch, based on Linaro's Cortex Strings library, improves the
performance of the assembly-optimized memset() function.

Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
 arch/arm64/lib/memset.S | 207 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 185 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/lib/memset.S b/arch/arm64/lib/memset.S
index 87e4a68..7c72dfd 100644
--- a/arch/arm64/lib/memset.S
+++ b/arch/arm64/lib/memset.S
@@ -1,5 +1,13 @@
 /*
  * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
@@ -16,6 +24,7 @@
 
 #include <linux/linkage.h>
 #include <asm/assembler.h>
+#include <asm/cache.h>
 
 /*
  * Fill in the buffer with character c (alignment handled by the hardware)
@@ -27,27 +36,181 @@
  * Returns:
  *	x0 - buf
  */
+
+dstin		.req	x0
+val		.req	w1
+count		.req	x2
+tmp1		.req	x3
+tmp1w		.req	w3
+tmp2		.req	x4
+tmp2w		.req	w4
+zva_len_x	.req	x5
+zva_len		.req	w5
+zva_bits_x	.req	x6
+
+A_l		.req	x7
+A_lw		.req	w7
+dst		.req	x8
+tmp3w		.req	w9
+tmp3		.req	x9
+
 ENTRY(memset)
-	mov	x4, x0
-	and	w1, w1, #0xff
-	orr	w1, w1, w1, lsl #8
-	orr	w1, w1, w1, lsl #16
-	orr	x1, x1, x1, lsl #32
-	subs	x2, x2, #8
-	b.mi	2f
-1:	str	x1, [x4], #8
-	subs	x2, x2, #8
-	b.pl	1b
-2:	adds	x2, x2, #4
-	b.mi	3f
-	sub	x2, x2, #4
-	str	w1, [x4], #4
-3:	adds	x2, x2, #2
-	b.mi	4f
-	sub	x2, x2, #2
-	strh	w1, [x4], #2
-4:	adds	x2, x2, #1
-	b.mi	5f
-	strb	w1, [x4]
-5:	ret
+	mov	dst, dstin	/* Preserve return value. */
+	and	A_lw, val, #255
+	orr	A_lw, A_lw, A_lw, lsl #8
+	orr	A_lw, A_lw, A_lw, lsl #16
+	orr	A_l, A_l, A_l, lsl #32
+
+	cmp	count, #15
+	b.hi	.Lover16_proc
+	/* All stores below may be unaligned. */
+	tbz	count, #3, 1f
+	str	A_l, [dst], #8
+1:
+	tbz	count, #2, 2f
+	str	A_lw, [dst], #4
+2:
+	tbz	count, #1, 3f
+	strh	A_lw, [dst], #2
+3:
+	tbz	count, #0, 4f
+	strb	A_lw, [dst]
+4:
+	ret
+
+.Lover16_proc:
+	/* Is the start address 16-byte aligned? */
+	neg	tmp2, dst
+	ands	tmp2, tmp2, #15
+	b.eq	.Laligned
+/*
+* The count is at least 16, so we can use stp to store the first 16 bytes
+* and then advance dst to the next 16-byte boundary.
+*/
+	stp	A_l, A_l, [dst]		/* possibly unaligned store */
+	/* make dst 16-byte aligned */
+	sub	count, count, tmp2
+	add	dst, dst, tmp2
+
+.Laligned:
+	cbz	A_l, .Lzero_mem
+
+.Ltail_maybe_long:
+	cmp	count, #64
+	b.ge	.Lnot_short
+.Ltail63:
+	ands	tmp1, count, #0x30
+	b.eq	3f
+	cmp	tmp1w, #0x20
+	b.eq	1f
+	b.lt	2f
+	stp	A_l, A_l, [dst], #16
+1:
+	stp	A_l, A_l, [dst], #16
+2:
+	stp	A_l, A_l, [dst], #16
+/*
+* The remaining length is less than 16; use stp to write the last 16 bytes.
+* Some bytes are written twice and the access may be unaligned.
+*/
+3:
+	ands	count, count, #15
+	cbz	count, 4f
+	add	dst, dst, count
+	stp	A_l, A_l, [dst, #-16]	/* Repeat some/all of last store. */
+4:
+	ret
+
+	/*
+	* Critical loop. Start at a new cache line boundary. Assuming
+	* 64 bytes per line, this ensures the entire loop is in one line.
+	*/
+	.p2align	L1_CACHE_SHIFT
+.Lnot_short:
+	sub	dst, dst, #16	/* Pre-bias. */
+	sub	count, count, #64
+1:
+	stp	A_l, A_l, [dst, #16]
+	stp	A_l, A_l, [dst, #32]
+	stp	A_l, A_l, [dst, #48]
+	stp	A_l, A_l, [dst, #64]!
+	subs	count, count, #64
+	b.ge	1b
+	tst	count, #0x3f
+	add	dst, dst, #16
+	b.ne	.Ltail63
+.Lexitfunc:
+	ret
+
+	/*
+	* For zeroing memory, check to see if we can use the ZVA feature to
+	* zero entire 'cache' lines.
+	*/
+.Lzero_mem:
+	cmp	count, #63
+	b.le	.Ltail63
+	/*
+	* For zeroing small amounts of memory, it's not worth setting up
+	* the line-clear code.
+	*/
+	cmp	count, #128
+	b.lt	.Lnot_short	/* count is at least 128 bytes */
+
+	mrs	tmp1, dczid_el0
+	tbnz	tmp1, #4, .Lnot_short
+	mov	tmp3w, #4
+	and	zva_len, tmp1w, #15	/* Safety: other bits reserved. */
+	lsl	zva_len, tmp3w, zva_len
+
+	ands	tmp3w, zva_len, #63
+	/*
+	* Ensure zva_len is not less than 64: it is not worthwhile to use
+	* ZVA if the block size is smaller than 64 bytes.
+	*/
+	b.ne	.Lnot_short
+.Lzero_by_line:
+	/*
+	* Compute how far we need to go to become suitably aligned. We're
+	* already at quad-word alignment.
+	*/
+	cmp	count, zva_len_x
+	b.lt	.Lnot_short		/* Not enough to reach alignment. */
+	sub	zva_bits_x, zva_len_x, #1
+	neg	tmp2, dst
+	ands	tmp2, tmp2, zva_bits_x
+	b.eq	2f			/* Already aligned. */
+	/* Not aligned, check that there's enough to copy after alignment. */
+	sub	tmp1, count, tmp2
+	/*
+	* Guarantee that the length left to be zeroed with ZVA is at least 64
+	* bytes and at least one ZVA block, so the stores below cannot run
+	* past the end of the buffer.
+	*/
+	cmp	tmp1, #64
+	ccmp	tmp1, zva_len_x, #8, ge	/* NZCV=0b1000 */
+	b.lt	.Lnot_short
+	/*
+	* We know that there's at least 64 bytes to zero and that it's safe
+	* to overrun by 64 bytes.
+	*/
+	mov	count, tmp1
+1:
+	stp	A_l, A_l, [dst]
+	stp	A_l, A_l, [dst, #16]
+	stp	A_l, A_l, [dst, #32]
+	subs	tmp2, tmp2, #64
+	stp	A_l, A_l, [dst, #48]
+	add	dst, dst, #64
+	b.ge	1b
+	/* We've overrun a bit, so adjust dst downwards. */
+	add	dst, dst, tmp2
+2:
+	sub	count, count, zva_len_x
+3:
+	dc	zva, dst
+	add	dst, dst, zva_len_x
+	subs	count, count, zva_len_x
+	b.ge	3b
+	ands	count, count, zva_bits_x
+	b.ne	.Ltail_maybe_long
+	ret
 ENDPROC(memset)
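
For readers less fluent in A64 assembly, below is a minimal C sketch of the store strategy the non-zeroing path uses: a bit-by-bit tbz chain for sizes up to 15 bytes, one possibly overlapping 16-byte store to reach 16-byte alignment, a 64-byte main loop, and an overlapping 16-byte store for the tail. This is an illustration only, not part of the patch: memset_model() is a hypothetical name, memcpy() stands in for the unaligned str/stp instructions, and the DC ZVA zeroing path is omitted (see the next sketch).

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void *memset_model(void *dstin, int c, size_t count)
{
	unsigned char *dst = dstin;
	uint64_t v = (uint8_t)c;

	/* Replicate the fill byte across 64 bits (the orr/lsl sequence). */
	v |= v << 8;
	v |= v << 16;
	v |= v << 32;

	if (count <= 15) {
		/* Small sizes: one store per set bit of count (tbz chain). */
		if (count & 8) { memcpy(dst, &v, 8); dst += 8; }
		if (count & 4) { memcpy(dst, &v, 4); dst += 4; }
		if (count & 2) { memcpy(dst, &v, 2); dst += 2; }
		if (count & 1) { *dst = (uint8_t)v; }
		return dstin;
	}

	size_t skew = -(uintptr_t)dst & 15;
	if (skew) {
		/* One possibly unaligned 16-byte store, then bump dst to the
		 * next 16-byte boundary; the overlap re-writes a few bytes. */
		memcpy(dst, &v, 8);
		memcpy(dst + 8, &v, 8);
		dst += skew;
		count -= skew;
	}

	/* Main loop: 64 bytes per iteration (four stp instructions). */
	while (count >= 64) {
		for (int i = 0; i < 64; i += 8)
			memcpy(dst + i, &v, 8);
		dst += 64;
		count -= 64;
	}

	/* Tail: 16-byte stores, finishing with an overlapping store that
	 * ends exactly at the last byte (stp A_l, A_l, [dst, #-16]). */
	while (count >= 16) {
		memcpy(dst, &v, 8);
		memcpy(dst + 8, &v, 8);
		dst += 16;
		count -= 16;
	}
	if (count) {
		dst += count;
		memcpy(dst - 16, &v, 8);
		memcpy(dst - 8, &v, 8);
	}
	return dstin;
}

The overlapping head and tail stores are the key trade-off: they write a few bytes twice so that no per-byte fix-up loops are needed once count exceeds 15.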
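The zeroing path keys off DCZID_EL0: bits [3:0] give log2 of the DC ZVA block size in 4-byte words, and bit 4 (DZP) means DC ZVA is prohibited, which is what the mrs/tbnz and mov/and/lsl sequence above implements. Here is a hedged C sketch of that decode; zva_block_size() is a hypothetical helper, not something the patch or the kernel defines, and dczid is assumed to be the raw register value.

#include <stdbool.h>
#include <stdint.h>

static unsigned int zva_block_size(uint64_t dczid, bool *usable)
{
	/* Bits [3:0]: log2(block size) in 4-byte words, hence 4 << bs bytes;
	 * mirrors "mov tmp3w, #4; and zva_len, tmp1w, #15;
	 * lsl zva_len, tmp3w, zva_len". */
	unsigned int block = 4u << (dczid & 0xf);

	/* Bit 4 (DZP) set means DC ZVA is prohibited (tbnz tmp1, #4, ...);
	 * the routine also refuses block sizes below 64 bytes
	 * (ands tmp3w, zva_len, #63; b.ne .Lnot_short). */
	*usable = !(dczid & (1u << 4)) && block >= 64;

	return block;
}

For example, dczid = 0x4 decodes to a 64-byte block with ZVA usable, which is the typical configuration on ARMv8 cores of this era and matches the 64-byte assumptions in the loop above.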