From patchwork Thu Mar 7 16:25:50 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paolo Valente X-Patchwork-Id: 159883 Delivered-To: patch@linaro.org Received: by 2002:a02:5cc1:0:0:0:0:0 with SMTP id w62csp7628468jad; Thu, 7 Mar 2019 08:26:27 -0800 (PST) X-Google-Smtp-Source: APXvYqxwiEEcYp/z1mc3SnNHWKFTUwGxw0cBYe7L+QaGtcZiWflAgFWD23tE9pGzLBITPn1d4Qrv X-Received: by 2002:a62:1e82:: with SMTP id e124mr13639426pfe.258.1551975987263; Thu, 07 Mar 2019 08:26:27 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1551975987; cv=none; d=google.com; s=arc-20160816; b=jU2WpNvSIIP5YzCeHk6oNUJwUqpi1ul580lsjY8dG3NRoUJt0z3Mn6o2vc0wJYMPBH XdrYs1OQdRGFWG6djsNUw0oTa/piXbBARa8h2KY+fC9JmMrjFQjLQlnTkidQuMSyNHvf ZQOJvZBOtq+1/vFe1CSa0Zo845IqA2hfYSSQ17YjwrDQm9+Sk589vyro6EEWQQaNpSRT pLuM9sFlMtap+LLe27XILDlFCaY7jhNvsKLK2OJWJfRufGwVzC09B4s4VFDg+JXO5ekS +9oJjmPFAJGqX/Ys155tnZU0AOdcPdi823U7mGw1uSuyo5/FK8SOOBo3dP9WWsRC5eOA Ae+w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=9QZLgskSCrdBD9un8CU62IVDYuz5MmLQQB3oIqrmxZE=; b=xYWFY2S9b3QihRKukV2B/hquTAl5XkzdJ3UV6hD5k86cLvbIEKTRh55BCb0mRZVPAh Ots10Tknb+kMhAEypkF4NaGh/eTNiLikeblREhWGwrVlRtsHbDcYKor3F/FdH/v9a87/ HqVslPgOk2DEE76odusPripvu3Z7VhkT9/6ktWrM5IUVodQwUnLYtUqyAP/c7IqL6zkY CLe6i7dhBf0Chddlz36EAMmiR2bVTzAju1H/EmK5USU6qEUB+9QFAIeK8jBuRuzFixB/ J22Kv+fVIYc2919nuZ7G9Z1il3dVSmbwB5jN9CI9ZKTThgKcYP05EHYLafh5ExnkIrpi l3aA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=tjyoaWfK; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id bd11si4283160plb.157.2019.03.07.08.26.25; Thu, 07 Mar 2019 08:26:27 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=tjyoaWfK; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726504AbfCGQ0Y (ORCPT + 31 others); Thu, 7 Mar 2019 11:26:24 -0500 Received: from mail-wr1-f68.google.com ([209.85.221.68]:36257 "EHLO mail-wr1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726404AbfCGQ0T (ORCPT ); Thu, 7 Mar 2019 11:26:19 -0500 Received: by mail-wr1-f68.google.com with SMTP id o17so18166517wrw.3 for ; Thu, 07 Mar 2019 08:26:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=9QZLgskSCrdBD9un8CU62IVDYuz5MmLQQB3oIqrmxZE=; b=tjyoaWfK+HDBZepAXYP0d91deWi5pF2hxjojvl2j4nqYnUeK/GX4i0hoxcN00Y6wVF XJnEKqwJoaMYkTU/dS6Oro+keZgr8s3/KgfWXyQ2SfCvWfcGvEH+PGg5cPc0/pIVfIG7 V+J2tM18i6kG7fuxnipRLAycXriyklmSCFOGbdIobZ96KAozWL4kf+0hRvwCXI6V/1UX bGDkesg4I0+7Dx6nB/GVgAC0qNDyaLI7ZoawkJjJiYBrmHB9NMcD6bLjc6YW06Rpc7hm L53tH2ZDmZqx21sMgX/8ErJRS7pQL18f5uwhN3M6q+h15AqOZrcXGOHRjsNMcZzoQNvT cWJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=9QZLgskSCrdBD9un8CU62IVDYuz5MmLQQB3oIqrmxZE=; b=uHFKFotuN8n5RzvwYtrGWKgcd3+/+hUB5XX8Jq2mkQLG0Cr+obsn/oP8bgT1kIuExB U8EjS+RVCFep1uJbN/IxcyH+ivE3Vvfj3xW8hi1eX8aCyxkENHkmn4wlDPaNS2kRUsVQ nsR/rP00BFQWVGrZpm0cxn8Jmex/qy4UZnhc1WQhfiF7PjNI0xZ8phDlKLcj7DMvBu25 tjdKajF57zznz1eOvKL0mb3yXxMpbpvdskA6yt+ARSuPtFRPfsHVaqTGPSObiYAA+uTs h4coJiZKbSUAvHY/6+oEf4nA41umQk2sPwhVE2gXETw/JmpUjHK7KCvGBFJ7FY6HCI7D kD1g== X-Gm-Message-State: APjAAAXFuRO5fAoqvtFaGItJJL4gMVJrYt2C/dNXHxqawpVWzy6uDvM4 mwz1RuG6L2sPwnz3q8c//SiBsA== X-Received: by 2002:adf:eb89:: with SMTP id t9mr7264180wrn.21.1551975976529; Thu, 07 Mar 2019 08:26:16 -0800 (PST) Received: from localhost.localdomain (146-241-117-61.dyn.eolo.it. [146.241.117.61]) by smtp.gmail.com with ESMTPSA id o127sm5801797wmo.20.2019.03.07.08.26.15 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 07 Mar 2019 08:26:15 -0800 (PST) From: Paolo Valente To: Jens Axboe Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, ulf.hansson@linaro.org, linus.walleij@linaro.org, broonie@kernel.org, bfq-iosched@googlegroups.com, oleksandr@natalenko.name, fra.fra.800@gmail.com, Paolo Valente , Alessio Masola Subject: [PATCH BUGFIX IMPROVEMENT 4/8] block, bfq: do not merge queues on flash storage with queueing Date: Thu, 7 Mar 2019 17:25:50 +0100 Message-Id: <20190307162554.77205-5-paolo.valente@linaro.org> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190307162554.77205-1-paolo.valente@linaro.org> References: <20190307162554.77205-1-paolo.valente@linaro.org> MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org To boost throughput with a set of processes doing interleaved I/O (i.e., a set of processes whose individual I/O is random, but whose merged cumulative I/O is sequential), BFQ merges the queues associated with these processes, i.e., redirects the I/O of these processes into a common, shared queue. In the shared queue, I/O requests are ordered by their position on the medium, thus sequential I/O gets dispatched to the device when the shared queue is served. Queue merging costs execution time, because, to detect which queues to merge, BFQ must maintain a list of the head I/O requests of active queues, ordered by request positions. Measurements showed that this costs about 10% of BFQ's total per-request processing time. Request processing time becomes more and more critical as the speed of the underlying storage device grows. Yet, fortunately, queue merging is basically useless on the very devices that are so fast to make request processing time critical. To reach a high throughput, these devices must have many requests queued at the same time. But, in this configuration, the internal scheduling algorithms of these devices do also the job of queue merging: they reorder requests so as to obtain as much as possible a sequential I/O pattern. As a consequence, with processes doing interleaved I/O, the throughput reached by one such device is likely to be the same, with and without queue merging. In view of this fact, this commit disables queue merging, and all related housekeeping, for non-rotational devices with internal queueing. The total, single-lock-protected, per-request processing time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz (time measured with simple code instrumentation, and using the throughput-sync.sh script of the S suite [1], in performance-profiling mode). To put this result into context, the total, single-lock-protected, per-request execution time of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ). Disabling merging provides a further, remarkable benefit in terms of throughput. Merging tends to make many workloads artificially more uneven, mainly because of shared queues remaining non empty for incomparably more time than normal queues. So, if, e.g., one of the queues in a set of merged queues has a higher weight than a normal queue, then the shared queue may inherit such a high weight and, by staying almost always active, may force BFQ to perform I/O plugging most of the time. This evidently makes it harder for BFQ to let the device reach a high throughput. As a practical example of this problem, and of the benefits of this commit, we measured again the throughput in the nasty scenario considered in previous commit messages: dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any of the other I/O schedulers. As such, this is also likely to be the maximum possible throughput reachable with this workload on this device, because I/O is mostly random, and the other schedulers basically just pass I/O requests to the drive as fast as possible. [1] https://github.com/Algodev-github/S Tested-by: Francesco Pollicino Signed-off-by: Alessio Masola Signed-off-by: Paolo Valente --- block/bfq-cgroup.c | 3 +- block/bfq-iosched.c | 73 +++++++++++++++++++++++++++++++++++++++++---- block/bfq-iosched.h | 3 ++ 3 files changed, 73 insertions(+), 6 deletions(-) -- 2.20.1 diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index c6113af31960..2a74a3f2a8f7 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -578,7 +578,8 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, bfqg_and_blkg_get(bfqg); if (bfq_bfqq_busy(bfqq)) { - bfq_pos_tree_add_move(bfqd, bfqq); + if (unlikely(!bfqd->nonrot_with_queueing)) + bfq_pos_tree_add_move(bfqd, bfqq); bfq_activate_bfqq(bfqd, bfqq); } diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 41364c0cca8c..b96be3764b8a 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -595,7 +595,16 @@ static bool bfq_too_late_for_merging(struct bfq_queue *bfqq) bfq_merge_time_limit); } -void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq) +/* + * The following function is not marked as __cold because it is + * actually cold, but for the same performance goal described in the + * comments on the likely() at the beginning of + * bfq_setup_cooperator(). Unexpectedly, to reach an even lower + * execution time for the case where this function is not invoked, we + * had to add an unlikely() in each involved if(). + */ +void __cold +bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq) { struct rb_node **p, *parent; struct bfq_queue *__bfqq; @@ -1849,8 +1858,9 @@ static void bfq_add_request(struct request *rq) /* * Adjust priority tree position, if next_rq changes. + * See comments on bfq_pos_tree_add_move() for the unlikely(). */ - if (prev != bfqq->next_rq) + if (unlikely(!bfqd->nonrot_with_queueing && prev != bfqq->next_rq)) bfq_pos_tree_add_move(bfqd, bfqq); if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */ @@ -1990,7 +2000,9 @@ static void bfq_remove_request(struct request_queue *q, bfqq->pos_root = NULL; } } else { - bfq_pos_tree_add_move(bfqd, bfqq); + /* see comments on bfq_pos_tree_add_move() for the unlikely() */ + if (unlikely(!bfqd->nonrot_with_queueing)) + bfq_pos_tree_add_move(bfqd, bfqq); } if (rq->cmd_flags & REQ_META) @@ -2075,7 +2087,12 @@ static void bfq_request_merged(struct request_queue *q, struct request *req, */ if (prev != bfqq->next_rq) { bfq_updated_next_req(bfqd, bfqq); - bfq_pos_tree_add_move(bfqd, bfqq); + /* + * See comments on bfq_pos_tree_add_move() for + * the unlikely(). + */ + if (unlikely(!bfqd->nonrot_with_queueing)) + bfq_pos_tree_add_move(bfqd, bfqq); } } } @@ -2357,6 +2374,46 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq, { struct bfq_queue *in_service_bfqq, *new_bfqq; + /* + * Do not perform queue merging if the device is non + * rotational and performs internal queueing. In fact, such a + * device reaches a high speed through internal parallelism + * and pipelining. This means that, to reach a high + * throughput, it must have many requests enqueued at the same + * time. But, in this configuration, the internal scheduling + * algorithm of the device does exactly the job of queue + * merging: it reorders requests so as to obtain as much as + * possible a sequential I/O pattern. As a consequence, with + * the workload generated by processes doing interleaved I/O, + * the throughput reached by the device is likely to be the + * same, with and without queue merging. + * + * Disabling merging also provides a remarkable benefit in + * terms of throughput. Merging tends to make many workloads + * artificially more uneven, because of shared queues + * remaining non empty for incomparably more time than + * non-merged queues. This may accentuate workload + * asymmetries. For example, if one of the queues in a set of + * merged queues has a higher weight than a normal queue, then + * the shared queue may inherit such a high weight and, by + * staying almost always active, may force BFQ to perform I/O + * plugging most of the time. This evidently makes it harder + * for BFQ to let the device reach a high throughput. + * + * Finally, the likely() macro below is not used because one + * of the two branches is more likely than the other, but to + * have the code path after the following if() executed as + * fast as possible for the case of a non rotational device + * with queueing. We want it because this is the fastest kind + * of device. On the opposite end, the likely() may lengthen + * the execution time of BFQ for the case of slower devices + * (rotational or at least without queueing). But in this case + * the execution time of BFQ matters very little, if not at + * all. + */ + if (likely(bfqd->nonrot_with_queueing)) + return NULL; + /* * Prevent bfqq from being merged if it has been created too * long ago. The idea is that true cooperating processes, and @@ -2986,8 +3043,10 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq) bfq_requeue_bfqq(bfqd, bfqq, true); /* * Resort priority tree of potential close cooperators. + * See comments on bfq_pos_tree_add_move() for the unlikely(). */ - bfq_pos_tree_add_move(bfqd, bfqq); + if (unlikely(!bfqd->nonrot_with_queueing)) + bfq_pos_tree_add_move(bfqd, bfqq); } /* @@ -5051,6 +5110,9 @@ static void bfq_update_hw_tag(struct bfq_data *bfqd) bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD; bfqd->max_rq_in_driver = 0; bfqd->hw_tag_samples = 0; + + bfqd->nonrot_with_queueing = + blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag; } static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd) @@ -5882,6 +5944,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e) INIT_HLIST_HEAD(&bfqd->burst_list); bfqd->hw_tag = -1; + bfqd->nonrot_with_queueing = blk_queue_nonrot(bfqd->queue); bfqd->bfq_max_budget = bfq_default_max_budget; diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h index 26869cfbbfa9..829730b96fb2 100644 --- a/block/bfq-iosched.h +++ b/block/bfq-iosched.h @@ -497,6 +497,9 @@ struct bfq_data { /* number of requests dispatched and waiting for completion */ int rq_in_driver; + /* true if the device is non rotational and performs queueing */ + bool nonrot_with_queueing; + /* * Maximum number of requests in driver in the last * @hw_tag_samples completed requests.