From patchwork Tue Jun 23 19:55:41 2020
X-Patchwork-Submitter: Greg KH
X-Patchwork-Id: 223221
From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Omer Shpigelman,
 Oded Gabbay, Sasha Levin
Subject: [PATCH 5.4 146/314] habanalabs: increase timeout during reset
Date: Tue, 23 Jun 2020 21:55:41 +0200
Message-Id: <20200623195345.818635556@linuxfoundation.org>
In-Reply-To: <20200623195338.770401005@linuxfoundation.org>
References: <20200623195338.770401005@linuxfoundation.org>
X-Mailing-List: stable@vger.kernel.org

From: Oded Gabbay

[ Upstream commit 7a65ee046b2238e053f6ebb610e1a082cfc49490 ]

When doing training, the DL framework (e.g. TensorFlow) performs
hundreds of thousands of memory allocations and mappings. If the driver
needs to perform a hard reset during training, it kills the application
and unmaps all those memory allocations. Unfortunately, because of the
large number of mappings, the driver isn't able to do that within the
current timeout (5 seconds).

Therefore, increase the timeout significantly, to 30 seconds, to avoid a
situation where the driver resets the device with active mappings, which
can sometimes cause a kernel bug.

Note that this doesn't mean the reset always takes the full 30 seconds,
because the reset thread checks once per second whether the unmap
operation is done.

Reviewed-by: Omer Shpigelman
Signed-off-by: Oded Gabbay
Signed-off-by: Sasha Levin
---
 drivers/misc/habanalabs/habanalabs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 75862be53c60e..30addffd76f53 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -23,7 +23,7 @@
 #define HL_MMAP_CB_MASK (0x8000000000000000ull >> PAGE_SHIFT)
-#define HL_PENDING_RESET_PER_SEC 5
+#define HL_PENDING_RESET_PER_SEC 30
 #define HL_DEVICE_TIMEOUT_USEC 1000000 /* 1 s */
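
For illustration only, here is a minimal userspace sketch of the bounded
polling described in the commit message: check once per second whether
the unmap has finished and give up after HL_PENDING_RESET_PER_SEC
seconds. It is not the driver's code. struct fake_device, unmap_done()
and wait_for_unmap() are hypothetical names invented for this sketch,
and the one-second sleep is left out so the example runs instantly.

```c
#include <assert.h>
#include <stdbool.h>

/* Value introduced by the patch: upper bound, in seconds, on the wait. */
#define HL_PENDING_RESET_PER_SEC 30

/*
 * Hypothetical stand-in for driver state: how many one-second polls are
 * still needed before every mapping has been torn down.
 */
struct fake_device {
	int polls_until_unmapped;
};

/* Hypothetical completion check; the real driver tracks this differently. */
static bool unmap_done(struct fake_device *dev)
{
	if (dev->polls_until_unmapped > 0)
		dev->polls_until_unmapped--;
	return dev->polls_until_unmapped == 0;
}

/*
 * Poll once per simulated second, up to timeout_sec times.
 * Returns the number of seconds waited, or -1 if the timeout expired
 * while mappings were still active.
 */
static int wait_for_unmap(struct fake_device *dev, int timeout_sec)
{
	assert(timeout_sec > 0);

	for (int waited = 1; waited <= timeout_sec; waited++) {
		/* A real loop would sleep for one second here. */
		if (unmap_done(dev))
			return waited;
	}
	return -1;
}
```

In this sketch, a teardown that needs 12 polls times out against the old
5-second bound but completes after 12 polls against the new 30-second
bound, and a teardown that needs 3 polls returns after 3 in either case.
This mirrors the point that the larger timeout is only a ceiling.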