From patchwork Wed Jan 15 01:26:45 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 208962
Date: Tue, 14 Jan 2020 17:26:45 -0800
In-Reply-To: <20200115012651.228058-1-almasrymina@google.com>
Message-Id: <20200115012651.228058-2-almasrymina@google.com>
References: <20200115012651.228058-1-almasrymina@google.com>
X-Mailer: git-send-email 2.25.0.rc1.283.g88dfdc4193-goog
Subject: [PATCH v10 2/8] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
From: Mina Almasry
To: mike.kravetz@oracle.com, rientjes@google.com, shakeelb@google.com
Cc: shuah@kernel.org, almasrymina@google.com, gthelen@google.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org, aneesh.kumar@linux.vnet.ibm.com
Sender: linux-kselftest-owner@vger.kernel.org
X-Mailing-List: linux-kselftest@vger.kernel.org

Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage or hugetlb
reservation counter. Adds a new interface to uncharge a hugetlb_cgroup counter via hugetlb_cgroup_uncharge_counter. Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init, hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline. Signed-off-by: Mina Almasry --- Changes in v10: - Added missing VM_BUG_ON Changes in V9: - Fixed HUGETLB_CGROUP_MIN_ORDER. - Minor variable name update. - Moved some init/cleanup code from later patches in the series to this patch. - Updated reparenting of reservation accounting. --- include/linux/hugetlb_cgroup.h | 68 ++++++++++++++--------- mm/hugetlb.c | 19 ++++--- mm/hugetlb_cgroup.c | 99 +++++++++++++++++++++++++--------- 3 files changed, 128 insertions(+), 58 deletions(-) -- 2.25.0.rc1.283.g88dfdc4193-goog diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 063962f6dfc6a..eab8a70d5bcb5 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -20,29 +20,37 @@ struct hugetlb_cgroup; /* * Minimum page order trackable by hugetlb cgroup. - * At least 3 pages are necessary for all the tracking information. + * At least 4 pages are necessary for all the tracking information. 
*/ #define HUGETLB_CGROUP_MIN_ORDER 2 #ifdef CONFIG_CGROUP_HUGETLB -static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page, + bool reserved) { VM_BUG_ON_PAGE(!PageHuge(page), page); if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return NULL; - return (struct hugetlb_cgroup *)page[2].private; + if (reserved) + return (struct hugetlb_cgroup *)page[3].private; + else + return (struct hugetlb_cgroup *)page[2].private; } -static inline -int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) +static inline int set_hugetlb_cgroup(struct page *page, + struct hugetlb_cgroup *h_cg, + bool reservation) { VM_BUG_ON_PAGE(!PageHuge(page), page); if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return -1; - page[2].private = (unsigned long)h_cg; + if (reservation) + page[3].private = (unsigned long)h_cg; + else + page[2].private = (unsigned long)h_cg; return 0; } @@ -52,26 +60,34 @@ static inline bool hugetlb_cgroup_disabled(void) } extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr); + struct hugetlb_cgroup **ptr, + bool reserved); extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg, - struct page *page); + struct page *page, bool reserved); extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, - struct page *page); + struct page *page, bool reserved); + extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg); + struct hugetlb_cgroup *h_cg, + bool reserved); +extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p, + unsigned long nr_pages, + struct cgroup_subsys_state *css); + extern void hugetlb_cgroup_file_init(void) __init; extern void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage); #else -static inline struct hugetlb_cgroup 
*hugetlb_cgroup_from_page(struct page *page) +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page, + bool reserved) { return NULL; } -static inline -int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) +static inline int set_hugetlb_cgroup(struct page *page, + struct hugetlb_cgroup *h_cg, bool reserved) { return 0; } @@ -81,28 +97,30 @@ static inline bool hugetlb_cgroup_disabled(void) return true; } -static inline int -hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr) +static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, + struct hugetlb_cgroup **ptr, + bool reserved) { return 0; } -static inline void -hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg, - struct page *page) +static inline void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, + struct hugetlb_cgroup *h_cg, + struct page *page, + bool reserved) { } -static inline void -hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page) +static inline void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, + struct page *page, + bool reserved) { } -static inline void -hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg) +static inline void hugetlb_cgroup_uncharge_cgroup(int idx, + unsigned long nr_pages, + struct hugetlb_cgroup *h_cg, + bool reserved) { } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index dd8737a94bec4..62a4cf3db4090 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1068,7 +1068,8 @@ static void update_and_free_page(struct hstate *h, struct page *page) 1 << PG_active | 1 << PG_private | 1 << PG_writeback); } - VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page); + VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, false), page); + VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, true), page); set_compound_page_dtor(page, NULL_COMPOUND_DTOR); 
set_page_refcounted(page); if (hstate_is_gigantic(h)) { @@ -1178,8 +1179,8 @@ static void __free_huge_page(struct page *page) spin_lock(&hugetlb_lock); clear_page_huge_active(page); - hugetlb_cgroup_uncharge_page(hstate_index(h), - pages_per_huge_page(h), page); + hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h), + page, false); if (restore_reserve) h->resv_huge_pages++; @@ -1253,7 +1254,8 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) INIT_LIST_HEAD(&page->lru); set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); spin_lock(&hugetlb_lock); - set_hugetlb_cgroup(page, NULL); + set_hugetlb_cgroup(page, NULL, false); + set_hugetlb_cgroup(page, NULL, true); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; spin_unlock(&hugetlb_lock); @@ -2039,7 +2041,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, gbl_chg = 1; } - ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg); + ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg, + false); if (ret) goto out_subpool_put; @@ -2063,7 +2066,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, list_move(&page->lru, &h->hugepage_activelist); /* Fall through */ } - hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page); + hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page, + false); spin_unlock(&hugetlb_lock); set_page_private(page, (unsigned long)spool); @@ -2087,7 +2091,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, return page; out_uncharge_cgroup: - hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg); + hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg, + false); out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1); diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 209f9b9604d34..c434f69f38354 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -96,8 +96,12 @@ static inline bool 
hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg) int idx; for (idx = 0; idx < hugetlb_max_hstate; idx++) { - if (page_counter_read(&h_cg->hugepage[idx])) + if (page_counter_read( + hugetlb_cgroup_get_counter(h_cg, idx, true)) || + page_counter_read( + hugetlb_cgroup_get_counter(h_cg, idx, false))) { return true; + } } return false; } @@ -108,18 +112,33 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, int idx; for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) { - struct page_counter *counter = &h_cgroup->hugepage[idx]; - struct page_counter *parent = NULL; + struct page_counter *fault_parent = NULL; + struct page_counter *reserved_parent = NULL; unsigned long limit; int ret; - if (parent_h_cgroup) - parent = &parent_h_cgroup->hugepage[idx]; - page_counter_init(counter, parent); + if (parent_h_cgroup) { + fault_parent = hugetlb_cgroup_get_counter( + parent_h_cgroup, idx, false); + reserved_parent = hugetlb_cgroup_get_counter( + parent_h_cgroup, idx, true); + } + page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx, + false), + fault_parent); + page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx, + true), + reserved_parent); limit = round_down(PAGE_COUNTER_MAX, 1 << huge_page_order(&hstates[idx])); - ret = page_counter_set_max(counter, limit); + + ret = page_counter_set_max( + hugetlb_cgroup_get_counter(h_cgroup, idx, false), + limit); + VM_BUG_ON(ret); + ret = page_counter_set_max( + hugetlb_cgroup_get_counter(h_cgroup, idx, true), limit); VM_BUG_ON(ret); } } @@ -149,7 +168,6 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) kfree(h_cgroup); } - /* * Should be called with hugetlb_lock held. 
* Since we are holding hugetlb_lock, pages cannot get moved from @@ -165,7 +183,7 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, struct hugetlb_cgroup *page_hcg; struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); - page_hcg = hugetlb_cgroup_from_page(page); + page_hcg = hugetlb_cgroup_from_page(page, false); /* * We can have pages in active list without any cgroup * ie, hugepage with less than 3 pages. We can safely @@ -184,7 +202,7 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, /* Take the pages off the local counter */ page_counter_cancel(counter, nr_pages); - set_hugetlb_cgroup(page, parent); + set_hugetlb_cgroup(page, parent, false); out: return; } @@ -227,7 +245,7 @@ static inline void hugetlb_event(struct hugetlb_cgroup *hugetlb, int idx, } int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr) + struct hugetlb_cgroup **ptr, bool reserved) { int ret = 0; struct page_counter *counter; @@ -250,13 +268,20 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, } rcu_read_unlock(); - if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, - &counter)) { + if (!page_counter_try_charge(hugetlb_cgroup_get_counter(h_cg, idx, + reserved), + nr_pages, &counter)) { ret = -ENOMEM; hugetlb_event(hugetlb_cgroup_from_counter(counter, idx), idx, HUGETLB_MAX); + css_put(&h_cg->css); + goto done; } - css_put(&h_cg->css); + /* Reservations take a reference to the css because they do not get + * reparented. 
+ */ + if (!reserved) + css_put(&h_cg->css); done: *ptr = h_cg; return ret; @@ -265,12 +290,12 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, /* Should be called with hugetlb_lock held */ void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg, - struct page *page) + struct page *page, bool reserved) { if (hugetlb_cgroup_disabled() || !h_cg) return; - set_hugetlb_cgroup(page, h_cg); + set_hugetlb_cgroup(page, h_cg, reserved); return; } @@ -278,23 +303,29 @@ void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, * Should be called with hugetlb_lock held */ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, - struct page *page) + struct page *page, bool reserved) { struct hugetlb_cgroup *h_cg; if (hugetlb_cgroup_disabled()) return; lockdep_assert_held(&hugetlb_lock); - h_cg = hugetlb_cgroup_from_page(page); + h_cg = hugetlb_cgroup_from_page(page, reserved); if (unlikely(!h_cg)) return; - set_hugetlb_cgroup(page, NULL); - page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); + set_hugetlb_cgroup(page, NULL, reserved); + + page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, reserved), + nr_pages); + + if (reserved) + css_put(&h_cg->css); + return; } void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg) + struct hugetlb_cgroup *h_cg, bool reserved) { if (hugetlb_cgroup_disabled() || !h_cg) return; @@ -302,8 +333,22 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER) return; - page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); - return; + page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, reserved), + nr_pages); + + if (reserved) + css_put(&h_cg->css); +} + +void hugetlb_cgroup_uncharge_counter(struct page_counter *p, + unsigned long nr_pages, + struct cgroup_subsys_state *css) +{ + if (hugetlb_cgroup_disabled() || !p || !css) + 
return; + + page_counter_uncharge(p, nr_pages); + css_put(css); } enum { @@ -675,6 +720,7 @@ void __init hugetlb_cgroup_file_init(void) void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage) { struct hugetlb_cgroup *h_cg; + struct hugetlb_cgroup *h_cg_reservation; struct hstate *h = page_hstate(oldhpage); if (hugetlb_cgroup_disabled()) @@ -682,11 +728,12 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage) VM_BUG_ON_PAGE(!PageHuge(oldhpage), oldhpage); spin_lock(&hugetlb_lock); - h_cg = hugetlb_cgroup_from_page(oldhpage); - set_hugetlb_cgroup(oldhpage, NULL); + h_cg = hugetlb_cgroup_from_page(oldhpage, false); + h_cg_reservation = hugetlb_cgroup_from_page(oldhpage, true); + set_hugetlb_cgroup(oldhpage, NULL, false); /* move the h_cg details to new cgroup */ - set_hugetlb_cgroup(newhpage, h_cg); + set_hugetlb_cgroup(newhpage, h_cg_reservation, true); list_move(&newhpage->lru, &h->hugepage_activelist); spin_unlock(&hugetlb_lock); return;

From patchwork Wed Jan 15 01:26:47 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 208963
Date: Tue, 14 Jan 2020 17:26:47 -0800
In-Reply-To: <20200115012651.228058-1-almasrymina@google.com>
Message-Id: <20200115012651.228058-4-almasrymina@google.com>
References: <20200115012651.228058-1-almasrymina@google.com>
X-Mailer: git-send-email 2.25.0.rc1.283.g88dfdc4193-goog
Subject: [PATCH v10 4/8] hugetlb: disable region_add file_region coalescing
From: Mina Almasry
To: mike.kravetz@oracle.com, rientjes@google.com, shakeelb@google.com
Cc: shuah@kernel.org, almasrymina@google.com, gthelen@google.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org, aneesh.kumar@linux.vnet.ibm.com
Sender: linux-kselftest-owner@vger.kernel.org
X-Mailing-List: linux-kselftest@vger.kernel.org

A follow-up patch in this series adds hugetlb cgroup uncharge info to the
file_region entries in resv->regions. The cgroup uncharge info may differ
for different regions, so they can no longer be coalesced at region_add
time. So, disable region coalescing in region_add in this patch.

Behavior change:

Say a resv_map exists like this [0->1], [2->3], and [5->6].

Then a region_chg/add call comes in region_chg/add(f=0, t=5).

Old code would generate resv->regions: [0->5], [5->6].
New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
[5->6].

Special care needs to be taken to handle the resv->adds_in_progress
variable correctly. In the past, only 1 region would be added for every
region_chg and region_add call. But now, each call may add multiple
regions, so we can no longer increment adds_in_progress by 1 in region_chg,
or decrement adds_in_progress by 1 after region_add or region_abort.
Instead, region_chg calls add_reservation_in_range() to count the number of
regions needed and allocates those, and that info is passed to region_add
and region_abort to decrement adds_in_progress correctly.

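For illustration only (this is not part of the patch), the behavior change
above can be sketched in user-space C. The sketch assumes a small sorted
array in place of the kernel's linked list of file_region entries, with no
locking and no region cache. struct resv_sketch, insert_at() and fill_gaps()
are hypothetical names; the loop loosely follows add_reservation_in_range()
as modified by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_REGIONS 16	/* arbitrary capacity chosen for this sketch */

struct region { long from, to; };	/* half-open range [from, to) */

struct resv_sketch {
	struct region r[MAX_REGIONS];	/* kept sorted by 'from' */
	int n;
};

/* Insert [from, to) at index pos, shifting later entries up by one. */
static void insert_at(struct resv_sketch *m, int pos, long from, long to)
{
	assert(m->n < MAX_REGIONS);
	memmove(&m->r[pos + 1], &m->r[pos],
		(size_t)(m->n - pos) * sizeof(m->r[0]));
	m->r[pos].from = from;
	m->r[pos].to = to;
	m->n++;
}

/*
 * Fill every gap inside [f, t) with its own region, without coalescing.
 * Returns the number of pages newly covered. With count_only set, the map
 * is left untouched and *regions_needed (if non-NULL) receives the number
 * of new entries a real add would insert.
 */
static long fill_gaps(struct resv_sketch *m, long f, long t,
		      long *regions_needed, int count_only)
{
	long add = 0, last = f;	/* 'last' plays last_accounted_offset */
	int i;

	if (regions_needed)
		*regions_needed = 0;

	for (i = 0; i < m->n; i++) {
		/* Regions starting before f can only move 'last' forward. */
		if (m->r[i].from < f) {
			if (m->r[i].to > last)
				last = m->r[i].to;
			continue;
		}
		/* A region starting beyond t ends the walk. */
		if (m->r[i].from > t)
			break;
		/* Gap between 'last' and this region: give it its own entry. */
		if (m->r[i].from > last) {
			add += m->r[i].from - last;
			if (!count_only) {
				insert_at(m, i, last, m->r[i].from);
				i++;	/* step past the entry just inserted */
			} else if (regions_needed) {
				*regions_needed += 1;
			}
		}
		last = m->r[i].to;
	}

	/* Tail of the range that no existing region covers. */
	if (last < t) {
		add += t - last;
		if (!count_only)
			insert_at(m, i, last, t);
		else if (regions_needed)
			*regions_needed += 1;
	}
	return add;
}
```

With the map [0->1], [2->3], [5->6] and f=0, t=5, a count_only pass reports
3 pages and 2 regions needed, and the modifying pass leaves [0->1], [1->2],
[2->3], [3->5], [5->6], as in the example above.
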
We've also modified the assumption that region_add after region_chg never fails. region_chg now pre-allocates at least 1 region for region_add. If region_add needs more regions than region_chg has allocated for it, then it may fail. Signed-off-by: Mina Almasry Reviewed-by: Mike Kravetz --- Changes in v9: - Added clarifications in the comments and addressed minor issues from code review. Changes in v7: - region_chg no longer allocates (t-f) / 2 file_region entries. Changes in v6: - Fix bug in number of region_caches allocated by region_chg --- mm/hugetlb.c | 327 +++++++++++++++++++++++++++++++++++---------------- 1 file changed, 223 insertions(+), 104 deletions(-) -- 2.25.0.rc1.283.g88dfdc4193-goog diff --git a/mm/hugetlb.c b/mm/hugetlb.c index f1b63946ee95c..de0028e9a8630 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -245,110 +245,179 @@ struct file_region { long to; }; +/* Helper that removes a struct file_region from the resv_map cache and returns + * it for use. + */ +static struct file_region * +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to) +{ + struct file_region *nrg = NULL; + + VM_BUG_ON(resv->region_cache_count <= 0); + + resv->region_cache_count--; + nrg = list_first_entry(&resv->region_cache, struct file_region, link); + VM_BUG_ON(!nrg); + list_del(&nrg->link); + + nrg->from = from; + nrg->to = to; + + return nrg; +} + /* Must be called with resv->lock held. Calling this with count_only == true * will count the number of pages to be added but will not modify the linked - * list. + * list. If regions_needed != NULL and count_only == true, then regions_needed + * will indicate the number of file_regions needed in the cache to + * add the regions for this range. 
*/ static long add_reservation_in_range(struct resv_map *resv, long f, long t, - bool count_only) + long *regions_needed, bool count_only) { - long chg = 0; + long add = 0; struct list_head *head = &resv->regions; + long last_accounted_offset = f; struct file_region *rg = NULL, *trg = NULL, *nrg = NULL; - /* Locate the region we are before or in. */ - list_for_each_entry(rg, head, link) - if (f <= rg->to) - break; - - /* Round our left edge to the current segment if it encloses us. */ - if (f > rg->from) - f = rg->from; + if (regions_needed) + *regions_needed = 0; - chg = t - f; + /* In this loop, we essentially handle an entry for the range + * [last_accounted_offset, rg->from), at every iteration, with some + * bounds checking. + */ + list_for_each_entry_safe(rg, trg, head, link) { + /* Skip irrelevant regions that start before our range. */ + if (rg->from < f) { + /* If this region ends after the last accounted offset, + * then we need to update last_accounted_offset. + */ + if (rg->to > last_accounted_offset) + last_accounted_offset = rg->to; + continue; + } - /* Check for and consume any regions we now overlap with. */ - nrg = rg; - list_for_each_entry_safe(rg, trg, rg->link.prev, link) { - if (&rg->link == head) - break; + /* When we find a region that starts beyond our range, we've + * finished. + */ if (rg->from > t) break; - /* We overlap with this area, if it extends further than - * us then we must extend ourselves. Account for its - * existing reservation. + /* Add an entry for last_accounted_offset -> rg->from, and + * update last_accounted_offset. 
*/ - if (rg->to > t) { - chg += rg->to - t; - t = rg->to; + if (rg->from > last_accounted_offset) { + add += rg->from - last_accounted_offset; + if (!count_only) { + nrg = get_file_region_entry_from_cache( + resv, last_accounted_offset, rg->from); + list_add(&nrg->link, rg->link.prev); + } else if (regions_needed) + *regions_needed += 1; } - chg -= rg->to - rg->from; - if (!count_only && rg != nrg) { - list_del(&rg->link); - kfree(rg); - } + last_accounted_offset = rg->to; } - if (!count_only) { - nrg->from = f; - nrg->to = t; + /* Handle the case where our range extends beyond + * last_accounted_offset. + */ + if (last_accounted_offset < t) { + add += t - last_accounted_offset; + if (!count_only) { + nrg = get_file_region_entry_from_cache( + resv, last_accounted_offset, t); + list_add(&nrg->link, rg->link.prev); + } else if (regions_needed) + *regions_needed += 1; } - return chg; + return add; } /* * Add the huge page range represented by [f, t) to the reserve - * map. Existing regions will be expanded to accommodate the specified - * range, or a region will be taken from the cache. Sufficient regions - * must exist in the cache due to the previous call to region_chg with - * the same range. + * map. Regions will be taken from the cache to fill in this range. + * Sufficient regions should exist in the cache due to the previous + * call to region_chg with the same range, but in some cases the cache will not + * have sufficient entries due to races with other code doing region_add or + * region_del. The extra needed entries will be allocated. * - * Return the number of new huge pages added to the map. This - * number is greater than or equal to zero. + * regions_needed is the out value provided by a previous call to region_chg. + * + * Return the number of new huge pages added to the map. This number is greater + * than or equal to zero. If file_region entries needed to be allocated for + * this operation and we were not able to allocate, it returns -ENOMEM. 
+ * region_add of regions of length 1 never allocate file_regions and cannot + * fail; region_chg will always allocate at least 1 entry and a region_add for + * 1 page will only require at most 1 entry. */ -static long region_add(struct resv_map *resv, long f, long t) +static long region_add(struct resv_map *resv, long f, long t, + long in_regions_needed) { - struct list_head *head = &resv->regions; - struct file_region *rg, *nrg; - long add = 0; + long add = 0, actual_regions_needed = 0, i = 0; + struct file_region *trg = NULL, *rg = NULL; + struct list_head allocated_regions; + + INIT_LIST_HEAD(&allocated_regions); spin_lock(&resv->lock); - /* Locate the region we are either in or before. */ - list_for_each_entry(rg, head, link) - if (f <= rg->to) - break; +retry: + + /* Count how many regions are actually needed to execute this add. */ + add_reservation_in_range(resv, f, t, &actual_regions_needed, true); /* - * If no region exists which can be expanded to include the - * specified range, pull a region descriptor from the cache - * and use it for this range. + * Check for sufficient descriptors in the cache to accommodate + * this add operation. Note that actual_regions_needed may be greater + * than in_regions_needed. In this case, we need to make sure that we + * allocate extra entries, such that we have enough for all the + * existing adds_in_progress, plus the excess needed for this + * operation. */ - if (&rg->link == head || t < rg->from) { - VM_BUG_ON(resv->region_cache_count <= 0); + if (resv->region_cache_count < + resv->adds_in_progress + + (actual_regions_needed - in_regions_needed)) { + /* region_add operation of range 1 should never need to + * allocate file_region entries. + */ + VM_BUG_ON(t - f <= 1); - resv->region_cache_count--; - nrg = list_first_entry(&resv->region_cache, struct file_region, - link); - list_del(&nrg->link); + /* Must drop lock to allocate a new descriptor. 
*/ + spin_unlock(&resv->lock); + for (i = 0; i < (actual_regions_needed - in_regions_needed); + i++) { + trg = kmalloc(sizeof(*trg), GFP_KERNEL); + if (!trg) + goto out_of_memory; + list_add(&trg->link, &allocated_regions); + } + spin_lock(&resv->lock); - nrg->from = f; - nrg->to = t; - list_add(&nrg->link, rg->link.prev); + list_for_each_entry_safe(rg, trg, &allocated_regions, link) { + list_del(&rg->link); + list_add(&rg->link, &resv->region_cache); + resv->region_cache_count++; + } - add += t - f; - goto out_locked; + goto retry; } - add = add_reservation_in_range(resv, f, t, false); + add = add_reservation_in_range(resv, f, t, NULL, false); + + resv->adds_in_progress -= in_regions_needed; -out_locked: - resv->adds_in_progress--; spin_unlock(&resv->lock); VM_BUG_ON(add < 0); return add; + +out_of_memory: + list_for_each_entry_safe(rg, trg, &allocated_regions, link) { + list_del(&rg->link); + kfree(rg); + } + return -ENOMEM; } /* @@ -358,49 +427,79 @@ static long region_add(struct resv_map *resv, long f, long t) * call to region_add that will actually modify the reserve * map to add the specified range [f, t). region_chg does * not change the number of huge pages represented by the - * map. A new file_region structure is added to the cache - * as a placeholder, so that the subsequent region_add - * call will have all the regions it needs and will not fail. + * map. A number of new file_region structures is added to the cache as a + * placeholder, for the subsequent region_add call to use. At least 1 + * file_region structure is added. + * + * out_regions_needed is the number of regions added to the + * resv->adds_in_progress. This value needs to be provided to a follow up call + * to region_add or region_abort for proper accounting. * * Returns the number of huge pages that need to be added to the existing * reservation map for the range [f, t). This number is greater or equal to * zero. 
-ENOMEM is returned if a new file_region structure or cache entry * is needed and can not be allocated. */ -static long region_chg(struct resv_map *resv, long f, long t) +static long region_chg(struct resv_map *resv, long f, long t, + long *out_regions_needed) { - long chg = 0; + struct file_region *trg = NULL, *rg = NULL; + long chg = 0, i = 0, to_allocate = 0; + struct list_head allocated_regions; + + INIT_LIST_HEAD(&allocated_regions); spin_lock(&resv->lock); -retry_locked: - resv->adds_in_progress++; + + /* Count how many hugepages in this range are NOT represented. */ + chg = add_reservation_in_range(resv, f, t, out_regions_needed, true); + + if (*out_regions_needed == 0) + *out_regions_needed = 1; + + resv->adds_in_progress += *out_regions_needed; /* * Check for sufficient descriptors in the cache to accommodate * the number of in progress add operations. */ - if (resv->adds_in_progress > resv->region_cache_count) { - struct file_region *trg; - - VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 1); - /* Must drop lock to allocate a new descriptor. */ - resv->adds_in_progress--; + while (resv->region_cache_count < resv->adds_in_progress) { + to_allocate = resv->adds_in_progress - resv->region_cache_count; + + /* Must drop lock to allocate a new descriptor. Note that even + * though we drop the lock here, we do not make another call to + * add_reservation_in_range after re-acquiring the lock. + * Essentially this branch makes sure that we have enough + * descriptors in the cache as suggested by the first call to + * add_reservation_in_range. If more regions turn out to be + * required, region_add will deal with it. 
+ */ spin_unlock(&resv->lock); - - trg = kmalloc(sizeof(*trg), GFP_KERNEL); - if (!trg) - return -ENOMEM; + for (i = 0; i < to_allocate; i++) { + trg = kmalloc(sizeof(*trg), GFP_KERNEL); + if (!trg) + goto out_of_memory; + list_add(&trg->link, &allocated_regions); + } spin_lock(&resv->lock); - list_add(&trg->link, &resv->region_cache); - resv->region_cache_count++; - goto retry_locked; - } - chg = add_reservation_in_range(resv, f, t, true); + list_for_each_entry_safe(rg, trg, &allocated_regions, link) { + list_del(&rg->link); + list_add(&rg->link, &resv->region_cache); + resv->region_cache_count++; + } + } spin_unlock(&resv->lock); return chg; + +out_of_memory: + list_for_each_entry_safe(rg, trg, &allocated_regions, link) { + list_del(&rg->link); + kfree(rg); + } + return -ENOMEM; } /* @@ -408,17 +507,20 @@ static long region_chg(struct resv_map *resv, long f, long t) * of the resv_map keeps track of the operations in progress between * calls to region_chg and region_add. Operations are sometimes * aborted after the call to region_chg. In such cases, region_abort - * is called to decrement the adds_in_progress counter. + * is called to decrement the adds_in_progress counter. regions_needed + * is the value returned by the region_chg call, it is used to decrement + * the adds_in_progress counter. * * NOTE: The range arguments [f, t) are not needed or used in this * routine. They are kept to make reading the calling code easier as * arguments will match the associated region_chg call. 
*/ -static void region_abort(struct resv_map *resv, long f, long t) +static void region_abort(struct resv_map *resv, long f, long t, + long regions_needed) { spin_lock(&resv->lock); VM_BUG_ON(!resv->region_cache_count); - resv->adds_in_progress--; + resv->adds_in_progress -= regions_needed; spin_unlock(&resv->lock); } @@ -1883,6 +1985,7 @@ static long __vma_reservation_common(struct hstate *h, struct resv_map *resv; pgoff_t idx; long ret; + long dummy_out_regions_needed; resv = vma_resv_map(vma); if (!resv) @@ -1891,20 +1994,29 @@ static long __vma_reservation_common(struct hstate *h, idx = vma_hugecache_offset(h, vma, addr); switch (mode) { case VMA_NEEDS_RESV: - ret = region_chg(resv, idx, idx + 1); + ret = region_chg(resv, idx, idx + 1, &dummy_out_regions_needed); + /* We assume that vma_reservation_* routines always operate on + * 1 page, and that adding to resv map a 1 page entry can only + * ever require 1 region. + */ + VM_BUG_ON(dummy_out_regions_needed != 1); break; case VMA_COMMIT_RESV: - ret = region_add(resv, idx, idx + 1); + ret = region_add(resv, idx, idx + 1, 1); + /* region_add calls of range 1 should never fail. */ + VM_BUG_ON(ret < 0); break; case VMA_END_RESV: - region_abort(resv, idx, idx + 1); + region_abort(resv, idx, idx + 1, 1); ret = 0; break; case VMA_ADD_RESV: - if (vma->vm_flags & VM_MAYSHARE) - ret = region_add(resv, idx, idx + 1); - else { - region_abort(resv, idx, idx + 1); + if (vma->vm_flags & VM_MAYSHARE) { + ret = region_add(resv, idx, idx + 1, 1); + /* region_add calls of range 1 should never fail. 
*/ + VM_BUG_ON(ret < 0); + } else { + region_abort(resv, idx, idx + 1, 1); ret = region_del(resv, idx, idx + 1); } break; @@ -4563,12 +4675,12 @@ int hugetlb_reserve_pages(struct inode *inode, struct vm_area_struct *vma, vm_flags_t vm_flags) { - long ret, chg; + long ret, chg, add = -1; struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; struct hugetlb_cgroup *h_cg; - long gbl_reserve; + long gbl_reserve, regions_needed = 0; /* This should never happen */ if (from > to) { @@ -4598,7 +4710,7 @@ int hugetlb_reserve_pages(struct inode *inode, */ resv_map = inode_resv_map(inode); - chg = region_chg(resv_map, from, to); + chg = region_chg(resv_map, from, to, &regions_needed); } else { /* Private mapping. */ @@ -4668,9 +4780,14 @@ int hugetlb_reserve_pages(struct inode *inode, * else has to be done for private mappings here */ if (!vma || vma->vm_flags & VM_MAYSHARE) { - long add = region_add(resv_map, from, to); - - if (unlikely(chg > add)) { + add = region_add(resv_map, from, to, regions_needed); + + if (unlikely(add < 0)) { + hugetlb_acct_memory(h, -gbl_reserve); + /* put back original number of pages, chg */ + (void)hugepage_subpool_put_pages(spool, chg); + goto out_err; + } else if (unlikely(chg > add)) { /* * pages in this range were added to the reserve * map between region_chg and region_add. This @@ -4688,9 +4805,11 @@ int hugetlb_reserve_pages(struct inode *inode, return 0; out_err: if (!vma || vma->vm_flags & VM_MAYSHARE) - /* Don't call region_abort if region_chg failed */ - if (chg >= 0) - region_abort(resv_map, from, to); + /* Only call region_abort if the region_chg succeeded but the + * region_add failed or didn't run.
+ */ + if (chg >= 0 && add < 0) + region_abort(resv_map, from, to, regions_needed); if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) kref_put(&resv_map->refs, resv_map_release); return ret; From patchwork Wed Jan 15 01:26:48 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 208965 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-17.4 required=3.0 tests=DKIMWL_WL_MED, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH, MAILING_LIST_MULTI, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT, USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2F3D1C3F68F for ; Wed, 15 Jan 2020 01:27:08 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id E68DA24681 for ; Wed, 15 Jan 2020 01:27:07 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="g4vrCCuF" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729017AbgAOB1H (ORCPT ); Tue, 14 Jan 2020 20:27:07 -0500 Received: from mail-pf1-f201.google.com ([209.85.210.201]:37483 "EHLO mail-pf1-f201.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729016AbgAOB1F (ORCPT ); Tue, 14 Jan 2020 20:27:05 -0500 Received: by mail-pf1-f201.google.com with SMTP id d85so9934126pfd.4 for ; Tue, 14 Jan 2020 17:27:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=4/iY+6M9fGnOfYT38lRw2eCjml2wSZh+p85JWa17Yuo=; 
b=g4vrCCuF7G6cMYYudAo5h/E+v30Ehekj3WxyXuXzVQwxDVcWyVoAAN3OFGHHNOsiAg oAkKe7Ko+zrccunYtqksk8HBagVGIFSKE8PCZJdKXVvwp+PHcES2ZZnq7AebZfeeckKi jgYiRZkTukb34BC9YGdLOqqdR2cP2F1lsj4ZvttUyUT5J2k23i0ut8y1fVXCw/CxoBBG lFs1rvXjdPKkzNClMppobOlE59ccjODz4up1ITe2phRIQAELd6kv9Xv6kTa4ghgsB5Yr t6KzckbvynWqurVbOA8YhGSuyV/B7OfODYL5MB0+J1WVdTUIuuy8zZpwg5He71suUDrC jgIw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=4/iY+6M9fGnOfYT38lRw2eCjml2wSZh+p85JWa17Yuo=; b=gNXg2Ly9xVFTFqOmJr8JDhc45DaL8TufRylDay4Ers0gSrDmkZnQERlpqgcl02CGHs /uceOWlj729fpwthFrqRTuzOuaOgV4xfnfkg/Bf7FKFL8DovliOThhUZ8Fyh3xN4xQrQ 7449JPK9FN5ui/MU0D6h98Akf6uaiXC9ehlWmaBhT4wHWUE+U2vFu/C4pnBqSukXswDd 6rtDe4CuNmzLUxtWFu9WWzsdlwgg2CfTYsS5hA0tLl43FaxmxvX7HB5I7FlQ1Yh4/zOh 5W1yiqIifopP0V5FpsTDg7qZim557nme9PitjQU436SO7mTBak//+RxTsOkIlKglV9tU 4HuQ== X-Gm-Message-State: APjAAAUq1ld13H3OW04TFUbPWHJmyFFE+s4NYo4KkmyUHXDy7jIX4nTj BYI5zGwS5fkckQLb2XmMFxFx35cRQb1FXH7wew== X-Google-Smtp-Source: APXvYqwddvd+8Tr4pUJA1VUpFUOfuOZzunJnUAn2p1fyIp9b1Iu3ONU8F5HWpFwkrI3ylCktac5hVZz59lyTHGk5vg== X-Received: by 2002:a63:2949:: with SMTP id p70mr30626707pgp.191.1579051624531; Tue, 14 Jan 2020 17:27:04 -0800 (PST) Date: Tue, 14 Jan 2020 17:26:48 -0800 In-Reply-To: <20200115012651.228058-1-almasrymina@google.com> Message-Id: <20200115012651.228058-5-almasrymina@google.com> Mime-Version: 1.0 References: <20200115012651.228058-1-almasrymina@google.com> X-Mailer: git-send-email 2.25.0.rc1.283.g88dfdc4193-goog Subject: [PATCH v10 5/8] hugetlb_cgroup: add accounting for shared mappings From: Mina Almasry To: mike.kravetz@oracle.com, rientjes@google.com, shakeelb@google.com Cc: shuah@kernel.org, almasrymina@google.com, gthelen@google.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org, aneesh.kumar@linux.vnet.ibm.com Sender: 
linux-kselftest-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives in the resv_map entries, in file_region->reservation_counter. After a call to region_chg, we charge the appropriate hugetlb_cgroup, and if successful, we pass on the hugetlb_cgroup info to a follow-up region_add call. When a file_region entry is added to the resv_map via region_add, we put the pointer to that cgroup in file_region->reservation_counter. If charging doesn't succeed, we report the error to the caller, so that the kernel fails the reservation. On region_del, which is when the hugetlb memory is unreserved, we also uncharge the file_region->reservation_counter. Signed-off-by: Mina Almasry --- Changes in v10: - Deleted duplicated code snippet. Changes in V9: - Updated for hugetlb reservation reparenting. --- mm/hugetlb.c | 156 ++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 124 insertions(+), 32 deletions(-) -- 2.25.0.rc1.283.g88dfdc4193-goog diff --git a/mm/hugetlb.c b/mm/hugetlb.c index de0028e9a8630..9bcfc12c5d214 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -243,6 +243,16 @@ struct file_region { struct list_head link; long from; long to; +#ifdef CONFIG_CGROUP_HUGETLB + /* + * On shared mappings, each reserved region appears as a struct + * file_region in resv_map. These fields hold the info needed to + * uncharge each reservation. + */ + struct page_counter *reservation_counter; + unsigned long pages_per_hpage; + struct cgroup_subsys_state *css; +#endif }; /* Helper that removes a struct file_region from the resv_map cache and returns @@ -266,6 +276,25 @@ get_file_region_entry_from_cache(struct resv_map *resv, long from, long to) return nrg; } +/* Helper that records hugetlb_cgroup uncharge info.
*/ +static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg, + struct file_region *nrg, + struct hstate *h) +{ +#ifdef CONFIG_CGROUP_HUGETLB + if (h_cg) { + nrg->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + nrg->pages_per_hpage = pages_per_huge_page(h); + nrg->css = &h_cg->css; + } else { + nrg->reservation_counter = NULL; + nrg->pages_per_hpage = 0; + nrg->css = NULL; + } +#endif +} + /* Must be called with resv->lock held. Calling this with count_only == true * will count the number of pages to be added but will not modify the linked * list. If regions_needed != NULL and count_only == true, then regions_needed @@ -273,7 +302,9 @@ get_file_region_entry_from_cache(struct resv_map *resv, long from, long to) * add the regions for this range. */ static long add_reservation_in_range(struct resv_map *resv, long f, long t, - long *regions_needed, bool count_only) + struct hugetlb_cgroup *h_cg, + struct hstate *h, long *regions_needed, + bool count_only) { long add = 0; struct list_head *head = &resv->regions; @@ -312,6 +343,8 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, if (!count_only) { nrg = get_file_region_entry_from_cache( resv, last_accounted_offset, rg->from); + record_hugetlb_cgroup_uncharge_info(h_cg, nrg, + h); list_add(&nrg->link, rg->link.prev); } else if (regions_needed) *regions_needed += 1; @@ -328,11 +361,13 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, if (!count_only) { nrg = get_file_region_entry_from_cache( resv, last_accounted_offset, t); + record_hugetlb_cgroup_uncharge_info(h_cg, nrg, h); list_add(&nrg->link, rg->link.prev); } else if (regions_needed) *regions_needed += 1; } + VM_BUG_ON(add < 0); return add; } @@ -353,7 +388,8 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, * fail; region_chg will always allocate at least 1 entry and a region_add for * 1 page will only require at most 1 entry. 
*/ -static long region_add(struct resv_map *resv, long f, long t, +static long region_add(struct hstate *h, struct hugetlb_cgroup *h_cg, + struct resv_map *resv, long f, long t, long in_regions_needed) { long add = 0, actual_regions_needed = 0, i = 0; @@ -366,7 +402,8 @@ static long region_add(struct resv_map *resv, long f, long t, retry: /* Count how many regions are actually needed to execute this add. */ - add_reservation_in_range(resv, f, t, &actual_regions_needed, true); + add_reservation_in_range(resv, f, t, NULL, NULL, &actual_regions_needed, + true); /* * Check for sufficient descriptors in the cache to accommodate @@ -404,7 +441,7 @@ static long region_add(struct resv_map *resv, long f, long t, goto retry; } - add = add_reservation_in_range(resv, f, t, NULL, false); + add = add_reservation_in_range(resv, f, t, h_cg, h, NULL, false); resv->adds_in_progress -= in_regions_needed; @@ -452,7 +489,8 @@ static long region_chg(struct resv_map *resv, long f, long t, spin_lock(&resv->lock); /* Count how many hugepages in this range are NOT respresented. */ - chg = add_reservation_in_range(resv, f, t, out_regions_needed, true); + chg = add_reservation_in_range(resv, f, t, NULL, NULL, + out_regions_needed, true); if (*out_regions_needed == 0) *out_regions_needed = 1; @@ -524,6 +562,29 @@ static void region_abort(struct resv_map *resv, long f, long t, spin_unlock(&resv->lock); } +static void uncharge_cgroup_if_shared_mapping(struct resv_map *resv, + struct file_region *rg, + unsigned long nr_pages) +{ +#ifdef CONFIG_CGROUP_HUGETLB + /* + * If resv->reservation_counter is NULL, then this is either a shared + * reservation, or cgroup charging is disabled on this resv_map. + * + * If the cgroup charging is disabled, then rg->reservation_counter is + * NULL and the uncharge counter call is a no-op. If the mapping is + * shared then the reserved memory is tracked in the file_struct + * entries inside of resv_map. So we need to uncharge the memory here. 
+ */ + if (rg->reservation_counter && rg->pages_per_hpage && nr_pages > 0 && + !resv->reservation_counter) { + hugetlb_cgroup_uncharge_counter(rg->reservation_counter, + nr_pages * rg->pages_per_hpage, + rg->css); + } +#endif +} + /* * Delete the specified range [f, t) from the reserve map. If the * t parameter is LONG_MAX, this indicates that ALL regions after f @@ -588,11 +649,22 @@ static long region_del(struct resv_map *resv, long f, long t) /* New entry for end of split region */ nrg->from = t; nrg->to = rg->to; + +#ifdef CONFIG_CGROUP_HUGETLB + nrg->reservation_counter = rg->reservation_counter; + nrg->pages_per_hpage = rg->pages_per_hpage; + nrg->css = rg->css; + css_get(rg->css); +#endif + INIT_LIST_HEAD(&nrg->link); /* Original entry is trimmed */ rg->to = f; + uncharge_cgroup_if_shared_mapping(resv, rg, + nrg->to - nrg->from); + list_add(&nrg->link, &rg->link); nrg = NULL; break; @@ -600,6 +672,8 @@ static long region_del(struct resv_map *resv, long f, long t) if (f <= rg->from && t >= rg->to) { /* Remove entire region */ del += rg->to - rg->from; + uncharge_cgroup_if_shared_mapping(resv, rg, + rg->to - rg->from); list_del(&rg->link); kfree(rg); continue; @@ -608,14 +682,20 @@ static long region_del(struct resv_map *resv, long f, long t) if (f <= rg->from) { /* Trim beginning of region */ del += t - rg->from; rg->from = t; + + uncharge_cgroup_if_shared_mapping(resv, rg, + t - rg->from); } else { /* Trim end of region */ del += rg->to - f; rg->to = f; + + uncharge_cgroup_if_shared_mapping(resv, rg, rg->to - f); } } spin_unlock(&resv->lock); kfree(nrg); + return del; } @@ -2002,7 +2082,7 @@ static long __vma_reservation_common(struct hstate *h, VM_BUG_ON(dummy_out_regions_needed != 1); break; case VMA_COMMIT_RESV: - ret = region_add(resv, idx, idx + 1, 1); + ret = region_add(NULL, NULL, resv, idx, idx + 1, 1); /* region_add calls of range 1 should never fail. 
*/ VM_BUG_ON(ret < 0); break; @@ -2012,7 +2092,7 @@ static long __vma_reservation_common(struct hstate *h, break; case VMA_ADD_RESV: if (vma->vm_flags & VM_MAYSHARE) { - ret = region_add(resv, idx, idx + 1, 1); + ret = region_add(NULL, NULL, resv, idx, idx + 1, 1); /* region_add calls of range 1 should never fail. */ VM_BUG_ON(ret < 0); } else { @@ -4679,7 +4759,7 @@ int hugetlb_reserve_pages(struct inode *inode, struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; - struct hugetlb_cgroup *h_cg; + struct hugetlb_cgroup *h_cg = NULL; long gbl_reserve, regions_needed = 0; /* This should never happen */ @@ -4720,23 +4800,6 @@ int hugetlb_reserve_pages(struct inode *inode, chg = to - from; - if (hugetlb_cgroup_charge_cgroup(hstate_index(h), - chg * pages_per_huge_page(h), - &h_cg, true)) { - kref_put(&resv_map->refs, resv_map_release); - return -ENOMEM; - } - -#ifdef CONFIG_CGROUP_HUGETLB - /* - * Since this branch handles private mappings, we attach the - * counter to uncharge for this reservation off resv_map. - */ - resv_map->reservation_counter = - &h_cg->reserved_hugepage[hstate_index(h)]; - resv_map->pages_per_hpage = pages_per_huge_page(h); -#endif - set_vma_resv_map(vma, resv_map); set_vma_resv_flags(vma, HPAGE_RESV_OWNER); } @@ -4746,6 +4809,26 @@ int hugetlb_reserve_pages(struct inode *inode, goto out_err; } + ret = hugetlb_cgroup_charge_cgroup( + hstate_index(h), chg * pages_per_huge_page(h), &h_cg, true); + + if (ret < 0) { + ret = -ENOMEM; + goto out_err; + } + +#ifdef CONFIG_CGROUP_HUGETLB + if (vma && !(vma->vm_flags & VM_MAYSHARE) && h_cg) { + /* For private mappings, the hugetlb_cgroup uncharge info hangs + * of the resv_map. + */ + resv_map->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + resv_map->pages_per_hpage = pages_per_huge_page(h); + resv_map->css = &h_cg->css; + } +#endif + /* * There must be enough pages in the subpool for the mapping. 
If * the subpool has a minimum size, there may be some global @@ -4754,7 +4837,7 @@ int hugetlb_reserve_pages(struct inode *inode, gbl_reserve = hugepage_subpool_get_pages(spool, chg); if (gbl_reserve < 0) { ret = -ENOSPC; - goto out_err; + goto out_uncharge_cgroup; } /* @@ -4763,9 +4846,7 @@ int hugetlb_reserve_pages(struct inode *inode, */ ret = hugetlb_acct_memory(h, gbl_reserve); if (ret < 0) { - /* put back original number of pages, chg */ - (void)hugepage_subpool_put_pages(spool, chg); - goto out_err; + goto out_put_pages; } /* @@ -4780,7 +4861,7 @@ int hugetlb_reserve_pages(struct inode *inode, * else has to be done for private mappings here */ if (!vma || vma->vm_flags & VM_MAYSHARE) { - add = region_add(resv_map, from, to, regions_needed); + add = region_add(h, h_cg, resv_map, from, to, regions_needed); if (unlikely(add < 0)) { hugetlb_acct_memory(h, -gbl_reserve); @@ -4797,12 +4878,23 @@ int hugetlb_reserve_pages(struct inode *inode, */ long rsv_adjust; - rsv_adjust = hugepage_subpool_put_pages(spool, - chg - add); + hugetlb_cgroup_uncharge_cgroup( + hstate_index(h), + (chg - add) * pages_per_huge_page(h), h_cg, + true); + + rsv_adjust = + hugepage_subpool_put_pages(spool, chg - add); hugetlb_acct_memory(h, -rsv_adjust); } } return 0; +out_put_pages: + /* put back original number of pages, chg */ + (void)hugepage_subpool_put_pages(spool, chg); +out_uncharge_cgroup: + hugetlb_cgroup_uncharge_cgroup( + hstate_index(h), chg * pages_per_huge_page(h), h_cg, true); out_err: if (!vma || vma->vm_flags & VM_MAYSHARE) /* Only call region_abort if the region_chg succeeded but the From patchwork Wed Jan 15 01:26:51 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 208964 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-17.4 required=3.0 tests=DKIMWL_WL_MED, 
DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH, MAILING_LIST_MULTI, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT, USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id BD957C33CB2 for ; Wed, 15 Jan 2020 01:27:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 889C024658 for ; Wed, 15 Jan 2020 01:27:19 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="DdORwNw0" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729026AbgAOB1S (ORCPT ); Tue, 14 Jan 2020 20:27:18 -0500 Received: from mail-pf1-f202.google.com ([209.85.210.202]:37484 "EHLO mail-pf1-f202.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729076AbgAOB1M (ORCPT ); Tue, 14 Jan 2020 20:27:12 -0500 Received: by mail-pf1-f202.google.com with SMTP id d85so9934289pfd.4 for ; Tue, 14 Jan 2020 17:27:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=h8U2qNtEFbaq+hBB/JCIu5FRCXJjzgS5K1PNEW4ZKws=; b=DdORwNw0ndB7yPDnt8tT0WEqtdwD8V4r97JqJT/LXfvkDwe5s1Q21YiAh1Gty98/b0 UViZadY+jCLv9/GwljHX1EpDUv/iKm4sPdChU0zjoMGyXN7zg1cbfcpgPdzbX5qTrdF/ dbKOqYdZjkBCfPZQQ1YxpKX6DhhfqJq+713oEumbz1oluyU8GfaKlJqBtya86kphQBPp hQ6z3xF6QIrMNetm1VyXpVPsEnkMLh9SXbwVlRZzxr3OVTLRib9eqf1oTPYa4qshTYex xOfkILN202JFeCcVen+QZ9LO1j+FVNwWsgKIETR7mXnbAXLRN4ALA87go3rMV94BSNjZ 9NZA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=h8U2qNtEFbaq+hBB/JCIu5FRCXJjzgS5K1PNEW4ZKws=; 
b=ZIYykBB2iAvrMCeRQzrVZP8XinlMlgQCiyIa5Jq+QrFT9sH8prncXbCeZ3TAqRwu59 3T7vJKhT4O4+1/iE/6mX79aWrWYzhOK5UxHLSAC5yMWPzao28ET7MrIjQHpggpMc+N5Z UIt6QJAQF9m/6egANVi7+96FNlDIYYCMyEKOCdhojj8NgzH/AANft/dAMHIISwTxdqvl 5tyBIz9QsOQCGYae3jpgX/TQFTJ0/gxUC0/rRhBLsjrt7sElK75WROp27xQVXs5XNifo kUB79w88h/abS9NnubcBB+2a8I2UBkvTkbTwmMHO95TbcZsQc6gQ8wGz6n7cqHU59Ck1 U1Yg== X-Gm-Message-State: APjAAAVIGK0PPVvtCKKd0J4R4kFnnmvQTwCRS/1nXVsIv4PbsG0V5G2q 5HfOl+2Ezh0aRMV1j3WMmpe37K0hRo5mn4/+Uw== X-Google-Smtp-Source: APXvYqxmMi2/ACArgSv7fIgXtolIquE4UCUC/iP8qxu7HL4rR76nCNRpBZfD1IoYHxOtfCJxpCr2SfVY1MIWGdoxtA== X-Received: by 2002:a63:4b49:: with SMTP id k9mr30074514pgl.269.1579051631919; Tue, 14 Jan 2020 17:27:11 -0800 (PST) Date: Tue, 14 Jan 2020 17:26:51 -0800 In-Reply-To: <20200115012651.228058-1-almasrymina@google.com> Message-Id: <20200115012651.228058-8-almasrymina@google.com> Mime-Version: 1.0 References: <20200115012651.228058-1-almasrymina@google.com> X-Mailer: git-send-email 2.25.0.rc1.283.g88dfdc4193-goog Subject: [PATCH v10 8/8] hugetlb_cgroup: Add hugetlb_cgroup reservation docs From: Mina Almasry To: mike.kravetz@oracle.com, rientjes@google.com, shakeelb@google.com Cc: shuah@kernel.org, almasrymina@google.com, gthelen@google.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org, aneesh.kumar@linux.vnet.ibm.com Sender: linux-kselftest-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Add docs for how to use hugetlb_cgroup reservations, and their behavior. Signed-off-by: Mina Almasry --- Changes in v10: - Clarify reparenting behavior. - Reword benefits of reservation limits. Changes in v6: - Updated docs to reflect the new design based on a new counter that tracks both reservations and faults. 
--- .../admin-guide/cgroup-v1/hugetlb.rst | 103 ++++++++++++++++-- 1 file changed, 92 insertions(+), 11 deletions(-) -- 2.25.0.rc1.283.g88dfdc4193-goog diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst index a3902aa253a96..00dd4daf55f19 100644 --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst @@ -2,13 +2,6 @@ HugeTLB Controller ================== -The HugeTLB controller allows to limit the HugeTLB usage per control group and -enforces the controller limit during page fault. Since HugeTLB doesn't -support page reclaim, enforcing the limit at page fault time implies that, -the application will get SIGBUS signal if it tries to access HugeTLB pages -beyond its limit. This requires the application to know beforehand how much -HugeTLB pages it would require for its use. - HugeTLB controller can be created by first mounting the cgroup filesystem. # mount -t cgroup -o hugetlb none /sys/fs/cgroup @@ -28,10 +21,14 @@ process (bash) into it. 
 Brief summary of control files::

- hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
- hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
- hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
- hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit
+ hugetlb.<hugepagesize>.resv.limit_in_bytes # set/show limit of "hugepagesize" hugetlb reservations
+ hugetlb.<hugepagesize>.resv.max_usage_in_bytes # show max "hugepagesize" hugetlb reservations and no-reserve faults
+ hugetlb.<hugepagesize>.resv.usage_in_bytes # show current reservations and no-reserve faults for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.resv.failcnt # show the number of allocation failures due to HugeTLB reservation limit
+ hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb faults
+ hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
+ hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.failcnt # show the number of allocation failures due to HugeTLB usage limit

 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
 files include::
@@ -40,11 +37,95 @@ files include::
  hugetlb.1GB.max_usage_in_bytes
  hugetlb.1GB.usage_in_bytes
  hugetlb.1GB.failcnt
+ hugetlb.1GB.resv.limit_in_bytes
+ hugetlb.1GB.resv.max_usage_in_bytes
+ hugetlb.1GB.resv.usage_in_bytes
+ hugetlb.1GB.resv.failcnt
  hugetlb.64KB.limit_in_bytes
  hugetlb.64KB.max_usage_in_bytes
  hugetlb.64KB.usage_in_bytes
  hugetlb.64KB.failcnt
+ hugetlb.64KB.resv.limit_in_bytes
+ hugetlb.64KB.resv.max_usage_in_bytes
+ hugetlb.64KB.resv.usage_in_bytes
+ hugetlb.64KB.resv.failcnt
  hugetlb.32MB.limit_in_bytes
  hugetlb.32MB.max_usage_in_bytes
  hugetlb.32MB.usage_in_bytes
  hugetlb.32MB.failcnt
+ hugetlb.32MB.resv.limit_in_bytes
+ hugetlb.32MB.resv.max_usage_in_bytes
+ hugetlb.32MB.resv.usage_in_bytes
+ hugetlb.32MB.resv.failcnt
+
+
+1. Page fault accounting
+
+hugetlb.<hugepagesize>.limit_in_bytes
+hugetlb.<hugepagesize>.max_usage_in_bytes
+hugetlb.<hugepagesize>.usage_in_bytes
+hugetlb.<hugepagesize>.failcnt
+
+The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per
+control group and enforces the limit during page fault. Since HugeTLB
+doesn't support page reclaim, enforcing the limit at page fault time implies
+that the application will get a SIGBUS signal if it tries to fault in HugeTLB
+pages beyond its limit. Therefore the application needs to know exactly how many
+HugeTLB pages it uses beforehand, and the sysadmin needs to make sure that
+there are enough available on the machine for all the users to avoid processes
+getting SIGBUS.
+
+
+2. Reservation accounting
+
+hugetlb.<hugepagesize>.resv.limit_in_bytes
+hugetlb.<hugepagesize>.resv.max_usage_in_bytes
+hugetlb.<hugepagesize>.resv.usage_in_bytes
+hugetlb.<hugepagesize>.resv.failcnt
+
+The HugeTLB controller allows users to limit the HugeTLB reservations per
+control group and enforces the controller limit at reservation time and at the
+fault of HugeTLB memory for which no reservation exists. Since reservation
+limits are enforced at reservation time (on mmap or shmget), reservation limits
+never cause the application to get a SIGBUS signal if the memory was reserved
+beforehand. For MAP_NORESERVE allocations, the reservation limit behaves the
+same as the fault limit, enforcing memory usage at fault time and causing the
+application to receive a SIGBUS if it's crossing its limit.
+
+Reservation limits are superior to the page fault limits described above, since
+reservation limits are enforced at reservation time (on mmap or shmget), and
+never cause the application to get a SIGBUS signal if the memory was reserved
+beforehand. This allows for easier fallback to alternatives such as
+non-HugeTLB memory for example. In the case of page fault accounting, it's very
+hard to avoid processes getting SIGBUS since the sysadmin needs to know
+precisely the HugeTLB usage of all the tasks in the system and make sure there
+are enough pages to satisfy all requests. Avoiding tasks getting SIGBUS on
+overcommitted systems is practically impossible with page fault accounting.
+
+
+3. Caveats with shared memory
+
+For shared HugeTLB memory, both HugeTLB reservation and page faults are charged
+to the first task that causes the memory to be reserved or faulted, and all
+subsequent uses of this reserved or faulted memory are done without charging.
+
+Shared HugeTLB memory is only uncharged when it is unreserved or deallocated.
+This is usually when the HugeTLB file is deleted, and not when the task that
+caused the reservation or fault has exited.
+
+
+4. Caveats with HugeTLB cgroup offline.
+
+When a HugeTLB cgroup goes offline with some reservations or faults still
+charged to it, the behavior is as follows:
+
+- The fault charges are charged to the parent HugeTLB cgroup (reparented),
+- the reservation charges remain on the offline HugeTLB cgroup.
+
+This means that if a HugeTLB cgroup gets offlined while there are still HugeTLB
+reservations charged to it, that cgroup persists as a zombie until all HugeTLB
+reservations are uncharged. HugeTLB reservations behave in this manner to match
+the memory controller whose cgroups also persist as zombies until all charged
+memory is uncharged. Also, the tracking of HugeTLB reservations is a bit more
+complex compared to the tracking of HugeTLB faults, so it is significantly
+harder to reparent reservations at offline time.
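
The fallback to non-HugeTLB memory that the reservation accounting section mentions can be sketched from user space. This is a minimal sketch and not part of the series: map_hugetlb_or_fallback() is a hypothetical helper name, and it assumes an anonymous private mapping whose length is a multiple of the huge page size. Since the mapping is created without MAP_NORESERVE, the reservation is expected to be made (and, with this series, charged against the resv limit) at mmap() time, so a refusal can be handled right there rather than showing up later as SIGBUS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, for illustration only (not from the patch series).
 * Tries to map `len` bytes backed by HugeTLB pages; if the kernel refuses
 * the mapping (for example because the reservation could not be made),
 * falls back to regular pages. `len` is assumed to be a multiple of the
 * huge page size. Returns NULL only if both attempts fail.
 */
static void *map_hugetlb_or_fallback(size_t len, int *used_hugetlb)
{
	void *p;

	assert(used_hugetlb != NULL);

	/* No MAP_NORESERVE: the reservation is requested at mmap() time. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p != MAP_FAILED) {
		*used_hugetlb = 1;
		return p;
	}

	/* HugeTLB mapping was refused: fall back to normal pages. */
	*used_hugetlb = 0;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}
```

A caller would check the returned pointer for NULL, read *used_hugetlb to learn which kind of memory it got, and release the mapping with munmap() when done. With MAP_NORESERVE the same call would defer the check to fault time, which is the case the fault limit discussion above warns about.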