From patchwork Thu Nov 19 11:41:33 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "heming.zhao@suse.com"
X-Patchwork-Id: 329123
From: Zhao Heming <heming.zhao@suse.com>
To: linux-raid@vger.kernel.org, song@kernel.org, guoqing.jiang@cloud.ionos.com,
    xni@redhat.com
Cc: Zhao Heming <heming.zhao@suse.com>, lidong.zhong@suse.com, neilb@suse.de,
    colyli@suse.de, stable@vger.kernel.org
Subject: [PATCH v4 1/2] md/cluster: block reshape with remote resync job
Date: Thu, 19 Nov 2020 19:41:33 +0800
Message-Id: <1605786094-5582-2-git-send-email-heming.zhao@suse.com>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1605786094-5582-1-git-send-email-heming.zhao@suse.com>
References: <1605786094-5582-1-git-send-email-heming.zhao@suse.com>

A reshape request should be blocked while a resync job is ongoing. In a
cluster environment, a node can start a resync job even if the resync
command wasn't issued on it: e.g., when the user runs "mdadm --grow" on
node A, node B will sometimes start the resync. However, the current
update_raid_disks() only checks the local recovery status, which is
incomplete. As a result, "mdadm --grow" succeeds on the local node while
the remote node, busy with its resync job, refuses to do the reshape.
This inconsistent handling leaves the array in an unexpected state; if
the user doesn't notice and keeps issuing mdadm commands, the array
eventually stops working.

Fix this by blocking the reshape request: when a node executes "--grow"
and detects an ongoing resync, it should stop and report an error to
the user.

The following script reproduces the issue with ~100% probability.
(Two nodes share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB.)
```
# on node1, node2 is the remote node.
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
 drivers/md/md.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
         return -EINVAL;
     if (mddev->sync_thread ||
         test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+        test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
         mddev->reshape_position != MaxSector)
         return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
         }
     }
 
-    if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-        update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+    if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+        ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+        if (ret)
+            pr_warn("md: updating array disks failed. %d\n", ret);
+    }
 
     /*
      * Since mddev->delta_disks has already updated in update_raid_disks,
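For illustration only, the gate this patch adds can be modelled in user space
as below. The struct and function names are hypothetical stand-ins for the
checks in update_raid_disks(); they are not kernel API, they just make the new
MD_RESYNCING_REMOTE condition easy to see in isolation.
```
/*
 * User-space model of the reshape gate; hypothetical names, not kernel code.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct array_state {
	bool sync_thread_running;	/* mddev->sync_thread */
	bool recovery_running;		/* MD_RECOVERY_RUNNING */
	bool remote_resync;		/* MD_RESYNCING_REMOTE (new check) */
	bool reshape_in_progress;	/* reshape_position != MaxSector */
};

static int reshape_allowed(const struct array_state *s)
{
	if (s->sync_thread_running || s->recovery_running ||
	    s->remote_resync ||		/* the condition added by this patch */
	    s->reshape_in_progress)
		return -EBUSY;
	return 0;
}

int main(void)
{
	/* A remote node is resyncing: "--grow" must now be rejected. */
	struct array_state s = { .remote_resync = true };

	printf("reshape_allowed -> %d\n", reshape_allowed(&s)); /* -EBUSY */
	return 0;
}
```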
From patchwork Thu Nov 19 11:41:34 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "heming.zhao@suse.com"
X-Patchwork-Id: 329928
From: Zhao Heming <heming.zhao@suse.com>
To: linux-raid@vger.kernel.org, song@kernel.org, guoqing.jiang@cloud.ionos.com,
    xni@redhat.com
Cc: Zhao Heming <heming.zhao@suse.com>, lidong.zhong@suse.com, neilb@suse.de,
    colyli@suse.de, stable@vger.kernel.org
Subject: [PATCH v4 2/2] md/cluster: fix deadlock when node is doing resync job
Date: Thu, 19 Nov 2020 19:41:34 +0800
Message-Id: <1605786094-5582-3-git-send-email-heming.zhao@suse.com>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1605786094-5582-1-git-send-email-heming.zhao@suse.com>
References: <1605786094-5582-1-git-send-email-heming.zhao@suse.com>

md-cluster uses MD_CLUSTER_SEND_LOCK to ensure that only one node can send a
msg at a time. While sending a msg, a node can concurrently receive a msg from
another node. When a node is doing a resync job, grabbing token_lockres:EX may
trigger a deadlock:
```
nodeA                       nodeB
--------------------        --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                            b.
                            md_do_sync
                             resync_info_update
                              send RESYNCING
                               + set MD_CLUSTER_SEND_LOCK
                               + wait for holding token_lockres:EX

                            c.
                            mdadm /dev/md0 --remove /dev/sdg
                             + held reconfig_mutex
                             + send REMOVE
                                + wait_event(MD_CLUSTER_SEND_LOCK)

                            d.
                            recv_daemon //METADATA_UPDATED from A
                              process_metadata_update
                               + (mddev_trylock(mddev) ||
                                  MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                  //this time, both return false forever
```
Explanation:
a. A sends METADATA_UPDATED.
   This blocks any other node from sending a msg.
b. B is doing a resync job, which sends RESYNCING at intervals.
   This is blocked waiting to hold the token_lockres:EX lock.
c. B runs "mdadm --remove", which wants to send REMOVE.
   This is blocked by step b: MD_CLUSTER_SEND_LOCK is 1.
d. B receives the METADATA_UPDATED msg that A sent in step a.
   This is blocked by step c: the mddev lock is held, so wait_event can't
   take the mddev lock. (btw, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD stays
   ZERO in this scenario.)

There is a similar deadlock in commit 0ba959774e93 ("md-cluster: use sync way
to handle METADATA_UPDATED msg"). In that commit, step c is "update sb"; in
this patch, step c is "mdadm --remove".

To fix this, follow the approach of metadata_update_start(), which grabs the
token in the same way: lock_comm() can use the same steps to avoid the
deadlock, by moving the MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD handling from
lock_token() to lock_comm(). This slightly enlarges the window in which
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set, but it is safe and breaks the
deadlock.
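To make the ordering change concrete, here is a rough user-space sketch of the
reworked flow, reconstructed from the diff below; every helper is an
illustrative stand-in for the real md-cluster/DLM primitives, not the kernel
API.
```
/*
 * Sketch of the reworked send-path locking; stand-in helpers only,
 * reconstructed from the diff below.
 */
#include <stdbool.h>
#include <stdio.h>

static bool holding_mutex_for_recvd;	/* models MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD */

static int take_token_ex(void)       { return 0; } /* dlm_lock_sync(token_lockres, EX) */
static void take_recv_mutex(void)    { }           /* mutex_lock(&cinfo->recv_mutex) */
static void wait_for_send_lock(void) { }           /* wait_event(!MD_CLUSTER_SEND_LOCK) */
static void wake_recv_daemon(void)   { }           /* md_wakeup_thread(mddev->thread) */

/* After the patch, lock_token() only takes the token and the recv mutex. */
static int lock_token(void)
{
	int error = take_token_ex();

	if (!error)
		take_recv_mutex();
	return error;
}

/*
 * lock_comm() now raises the "recv daemon may run without the mddev lock"
 * flag *before* it blocks on MD_CLUSTER_SEND_LOCK and the token, so while
 * this caller waits, the receive daemon can still process an incoming
 * METADATA_UPDATED and the remote node can release the token; the cycle in
 * steps a-d above never forms.
 */
static int lock_comm(bool mddev_locked)
{
	bool set_bit = false;
	int rv;

	if (mddev_locked && !holding_mutex_for_recvd) {
		holding_mutex_for_recvd = true;
		wake_recv_daemon();
		set_bit = true;
	}
	wait_for_send_lock();
	rv = lock_token();
	if (set_bit)
		holding_mutex_for_recvd = false;
	return rv;
}

int main(void)
{
	printf("lock_comm(true) -> %d\n", lock_comm(true));
	return 0;
}
```
This mirrors what metadata_update_start() already does, which is why the
commit message refers to it.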
Repro steps (I only triggered it 3 times in hundreds of tests):
two nodes share 3 iSCSI LUNs: sdg/sdh/sdi. Each LUN is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```
The test script hangs when executing "mdadm --remove".
```
# dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```
Finally, thanks to Xiao for the solution.

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/md-cluster.c | 69 +++++++++++++++++++++++------------------
 drivers/md/md.c         |  6 ++--
 2 files changed, 43 insertions(+), 32 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 4aaf4820b6f6..405bcc5d513e 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -664,9 +664,27 @@ static void recv_daemon(struct md_thread *thread)
  * Takes the lock on the TOKEN lock resource so no other
  * node can communicate while the operation is underway.
  */
-static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
+static int lock_token(struct md_cluster_info *cinfo)
 {
-    int error, set_bit = 0;
+    int error;
+
+    error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
+    if (error) {
+        pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
+                __func__, __LINE__, error);
+    } else {
+        /* Lock the receive sequence */
+        mutex_lock(&cinfo->recv_mutex);
+    }
+    return error;
+}
+
+/* lock_comm()
+ * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
+ */
+static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
+{
+    int rv, set_bit = 0;
     struct mddev *mddev = cinfo->mddev;
 
     /*
@@ -677,34 +695,19 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
      */
     if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
                       &cinfo->state)) {
-        error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+        rv = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
                           &cinfo->state);
-        WARN_ON_ONCE(error);
+        WARN_ON_ONCE(rv);
         md_wakeup_thread(mddev->thread);
         set_bit = 1;
     }
-    error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
-    if (set_bit)
-        clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
-    if (error)
-        pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
-                __func__, __LINE__, error);
-
-    /* Lock the receive sequence */
-    mutex_lock(&cinfo->recv_mutex);
-    return error;
-}
-
-/* lock_comm()
- * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
- */
-static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
-{
     wait_event(cinfo->wait,
-           !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
-
-    return lock_token(cinfo, mddev_locked);
+        !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
+    rv = lock_token(cinfo);
+    if (set_bit)
+        clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+    return rv;
 }
 
 static void unlock_comm(struct md_cluster_info *cinfo)
@@ -784,9 +787,11 @@ static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
 {
     int ret;
 
-    lock_comm(cinfo, mddev_locked);
-    ret = __sendmsg(cinfo, cmsg);
-    unlock_comm(cinfo);
+    ret = lock_comm(cinfo, mddev_locked);
+    if (!ret) {
+        ret = __sendmsg(cinfo, cmsg);
+        unlock_comm(cinfo);
+    }
     return ret;
 }
 
@@ -1061,7 +1066,7 @@ static int metadata_update_start(struct mddev *mddev)
         return 0;
     }
 
-    ret = lock_token(cinfo, 1);
+    ret = lock_token(cinfo);
     clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
     return ret;
 }
@@ -1255,7 +1260,10 @@ static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
     int raid_slot = -1;
 
     md_update_sb(mddev, 1);
-    lock_comm(cinfo, 1);
+    if (lock_comm(cinfo, 1)) {
+        pr_err("%s: lock_comm failed\n", __func__);
+        return;
+    }
 
     memset(&cmsg, 0, sizeof(cmsg));
     cmsg.type = cpu_to_le32(METADATA_UPDATED);
@@ -1407,7 +1415,8 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev)
     cmsg.type = cpu_to_le32(NEWDISK);
     memcpy(cmsg.uuid, uuid, 16);
     cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-    lock_comm(cinfo, 1);
+    if (lock_comm(cinfo, 1))
+        return -EAGAIN;
     ret = __sendmsg(cinfo, &cmsg);
     if (ret) {
         unlock_comm(cinfo);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 74280e353b8f..46da165afde2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6948,8 +6948,10 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
         goto busy;
 
 kick_rdev:
-    if (mddev_is_clustered(mddev))
-        md_cluster_ops->remove_disk(mddev, rdev);
+    if (mddev_is_clustered(mddev)) {
+        if (md_cluster_ops->remove_disk(mddev, rdev))
+            goto busy;
+    }
 
     md_kick_rdev_from_array(rdev);
     set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);