[05/11] mm: memcg/slab: fix percpu slab vmstats flushing

From: Roman Gushchin <guro@fb.com>
Subject: mm: memcg/slab: fix percpu slab vmstats flushing

Currently slab percpu vmstats are flushed twice: during the memcg
offlining and just before freeing the memcg structure.  Each time percpu
counters are summed, added to the atomic counterparts and propagated up
the cgroup tree.

The second flushing is required due to how recursive vmstats are
implemented: counters are batched in percpu variables on a local level,
and once a percpu value crosses some predefined threshold, it spills
over to the atomic values on the local and each ancestor level.  It means
that without flushing, some numbers cached in percpu variables will be
dropped on the floor each time a cgroup is destroyed.  And with uptime the
error on upper levels might become noticeable.
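
The batching described above can be illustrated with a small userspace
sketch.  It is not the kernel code: the sketch_* names, the CPU count and
the plain long standing in for the atomic counter are assumptions made
for illustration.  Only the 32-page threshold comes from this changelog.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_NR_CPUS 4   /* hypothetical CPU count for the sketch */
#define SKETCH_BATCH   32  /* spill threshold, in pages */

struct sketch_cgroup {
	struct sketch_cgroup *parent;
	long percpu[SKETCH_NR_CPUS]; /* local per-cpu cache */
	long atomic_total;           /* stands in for the atomic counter */
};

/* Account delta pages on one cpu; spill once the cache exceeds the batch. */
static void sketch_mod_state(struct sketch_cgroup *cg, int cpu, long delta)
{
	long x;

	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	x = cg->percpu[cpu] + delta;

	if (labs(x) > SKETCH_BATCH) {
		struct sketch_cgroup *it;

		/* Propagate to the local level and to every ancestor. */
		for (it = cg; it; it = it->parent)
			it->atomic_total += x;
		x = 0;
	}
	cg->percpu[cpu] = x;
}

/* Pages still cached per-cpu, i.e. not yet visible on any level. */
static long sketch_unflushed(const struct sketch_cgroup *cg)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sum += cg->percpu[cpu];
	return sum;
}
```

If a cgroup is freed while sketch_unflushed() is non-zero and nothing
flushes the remainder, that amount never reaches the ancestors.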

The first flushing aims to make counters on ancestor levels more precise.
Dying cgroups may remain in the dying state for a long time.  After
kmem_cache reparenting, which is performed during the offlining, slab
counters of the dying cgroup have no chance to be updated, because
any slab operations will be performed on the parent level.  It means that
the inaccuracy caused by percpu batching will not decrease until the final
destruction of the cgroup.  The original idea was that flushing slab
counters during the offlining should minimize the visible inaccuracy of
slab counters on the parent level.

The problem is that percpu counters are not zeroed after the first
flushing.  So every cached percpu value is summed twice.  It creates a
small error (up to 32 pages per cpu, but usually less) which accumulates
on the parent cgroup level.  After creating and destroying thousands of
child cgroups, the slab counter on the parent level can be way off the
real value.
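
A minimal userspace sketch of the double summation follows.  The demo_*
names and the CPU count are hypothetical, and a plain long stands in for
the parent's atomic counter.  It only models "sum the percpu values into
the parent without zeroing them", done once or twice.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4  /* hypothetical CPU count for the sketch */

struct demo_memcg {
	long percpu[DEMO_NR_CPUS]; /* cached, not yet propagated pages */
	long *parent_total;        /* stands in for the parent's counter */
};

/* Sum the per-cpu caches into the parent without zeroing them. */
static void demo_flush(struct demo_memcg *cg)
{
	int cpu;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		*cg->parent_total += cg->percpu[cpu];
}

/*
 * Error seen on the parent for one child that caches `cached` pages on
 * every cpu and is flushed nr_flushes times before it goes away.
 */
static long demo_parent_error(long cached, int nr_flushes)
{
	long parent_total = 0;
	struct demo_memcg child = { .parent_total = &parent_total };
	long truth = cached * DEMO_NR_CPUS;
	int cpu, i;

	assert(nr_flushes >= 1);
	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		child.percpu[cpu] = cached;

	for (i = 0; i < nr_flushes; i++)
		demo_flush(&child);

	return parent_total - truth;
}
```

With two flushes (offlining plus freeing) the parent is off by the full
cached amount; with a single flush before freeing the error is zero.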

For now, let's just stop flushing slab counters on memcg offlining.  It
can't be done correctly without scheduling a work on each cpu: reading and
zeroing it during css offlining can race with an asynchronous update,
which doesn't expect values to be changed underneath.

With this change, slab counters on the parent level will become eventually
consistent.  Once all dying children are gone, values are correct.  And if
not, the error is capped at 32 * NR_CPUS pages per dying cgroup.

It's not perfect, as slabs are reparented, so any updates after the
reparenting will happen on the parent level.  It means that if a slab page
was allocated, a counter on the child level was bumped, then the page was
reparented and freed, the annihilation of positive and negative counter
values will not happen until the child cgroup is released.  It makes slab
counters different from others, and it might make us want to implement
flushing in a correct form again.  But it's also a question of performance:
scheduling a work on each cpu isn't free, and it's an open question whether
the benefit of having more accurate counters is worth it.

We might also consider flushing all counters on offlining, not only slab
counters.

So let's fix the main problem now: make the slab counters eventually
consistent, so at least the error won't grow with uptime (or more
precisely the number of created and destroyed cgroups).  And think about
the accuracy of counters separately.

Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |    5 ++---
 mm/memcontrol.c        |   37 +++++++++----------------------------
 2 files changed, 11 insertions(+), 31 deletions(-)

Message-ID: <20200114002916.-zlNE%akpm@linux-foundation.org>
Date: Mon, 13 Jan 2020 16:29:16 -0800