mbox series

[00/30] The panic notifiers refactor

Message ID 20220427224924.592546-1-gpiccoli@igalia.com
Headers show
Series The panic notifiers refactor | expand

Message

Guilherme G. Piccoli April 27, 2022, 10:48 p.m. UTC
Hey folks, this is an attempt to improve/refactor the dated panic notifiers
infrastructure. This is strongly based in a suggestion made by Pter Mladek [0]
some time ago, and it's finally ready. Below I'll detail the patch ordering,
testing made, etc.
First, a bit about the reason behind this.

The panic notifiers list is an infrastructure that allows callbacks to execute
during panic time. Happens that anybody can add functions there, no ordering
is enforced (by default) and the decision to execute or not such notifiers
before kdump may lead to high risk of failure in crash scenarios - default is
not to execute any of them. There is a parameter acting as a switch for that.
But some architectures require some notifiers, so..it's messy.

The suggestion from Petr came after a patch submission to add a notifiers
filter, allowing the notifiers selection by function name, which was welcomed
by some people, but not by Petr, which claimed the code should indeed have a
refactor - and it made a lot of sense, his suggestion makes code more clear
and reliable.

So, this series might be split in 3 portions:

Part 1: the first 18 patches are mostly fixes (one or two might be considered
improvements), mostly replacing spinlocks/mutexes with safer alternatives for
atomic contexts, like spin_trylock, etc. We also focused on commenting
everything that is possible and clean-up code.

Part 2, the core: patches 19-25 are the main refactor, which splits the panic
notifiers list in three, introduce the concept of panic notifier level and
clean-up and highly comment the code, effectively leading to a more reliable
and clear, yet highly customizable panic path.

Part 3: The remaining 5 patches are fixes that _require the main refactor_
patches, they don't make sense without the core changes - but again, these are
small fixes and not part of the main goal of refactoring the panic code.

I've tried my best to make the patches the more "bisectable" as possible, so
they tend to be self-contained and easy to backport (specially patches from
part 1). Notice that the series is *based on 5.18-rc4* - usually a refactor
like this would be based on linux-next, but since we have many fixes in the
series, I kept it based on mainline tree. Of course I could change that in a
subsequent iteration, if desired.

Since this touches multiple architectures and drivers, it's very difficult to
test it really (by executing all touched code). So, my tests split in two
approaches: build tests and real tests, that involves panic triggering with
and without kdump, changing panic notifiers level, etc.

Build tests (using cross-compilers): alpha, arm, arm64, mips (sgi 22 and 32),
parisc, s390, sparc, um, x86_64 (couldn't get a functional xtensa cross
compiler).

Real/full tests: x86_64 (Hyper-V and QEMU guests) + PowerPC (pseries guest).

Here is the link with the .config files used: https://people.igalia.com/gpiccoli/panic_notifiers_configs/
(tried my best to build all the affected code).

Finally, a bit about my CCing strategy: I've included everybody present in the
original thread [0] plus some maintainers and other interested parties as CC
in the full series. But the patches have individual CC lists, for people that
are definitely related to them but might not care much for the whole series;
nevertheless, _everybody_ mentioned at least once in some patch is CCed in this
cover-letter. Hopefully I didn't forget to include anybody - all the mailing
lists were CCed in the whole series. Apologies in advance if (a) you received
emails you didn't want to or, (b) I forgot to include you but it was something
considered interesting by you.

Thanks in advance for reviews / comments / suggestions!
Cheers,

Guilherme

[0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/

Guilherme G. Piccoli (30):
  x86/crash,reboot: Avoid re-disabling VMX in all CPUs on crash/restart
  ARM: kexec: Disable IRQs/FIQs also on crash CPUs shutdown path
  notifier: Add panic notifiers info and purge trailing whitespaces
  firmware: google: Convert regular spinlock into trylock on panic path
  misc/pvpanic: Convert regular spinlock into trylock on panic path
  soc: bcm: brcmstb: Document panic notifier action and remove useless header
  mips: ip22: Reword PANICED to PANICKED and remove useless header
  powerpc/setup: Refactor/untangle panic notifiers
  coresight: cpu-debug: Replace mutex with mutex_trylock on panic notifier
  alpha: Clean-up the panic notifier code
  um: Improve panic notifiers consistency and ordering
  parisc: Replace regular spinlock with spin_trylock on panic path
  s390/consoles: Improve panic notifiers reliability
  panic: Properly identify the panic event to the notifiers' callbacks
  bus: brcmstb_gisb: Clean-up panic/die notifiers
  drivers/hv/vmbus, video/hyperv_fb: Untangle and refactor Hyper-V panic notifiers
  tracing: Improve panic/die notifiers
  notifier: Show function names on notifier routines if DEBUG_NOTIFIERS is set
  panic: Add the panic hypervisor notifier list
  panic: Add the panic informational notifier list
  panic: Introduce the panic pre-reboot notifier list
  panic: Introduce the panic post-reboot notifier list
  printk: kmsg_dump: Introduce helper to inform number of dumpers
  panic: Refactor the panic path
  panic, printk: Add console flush parameter and convert panic_print to a notifier
  Drivers: hv: Do not force all panic notifiers to execute before kdump
  powerpc: Do not force all panic notifiers to execute before kdump
  panic: Unexport crash_kexec_post_notifiers
  powerpc: ps3, pseries: Avoid duplicate call to kmsg_dump() on panic
  um: Avoid duplicate call to kmsg_dump()

 .../admin-guide/kernel-parameters.txt         |  54 ++-
 Documentation/admin-guide/sysctl/kernel.rst   |   5 +-
 arch/alpha/kernel/setup.c                     |  40 +--
 arch/arm/kernel/machine_kexec.c               |   3 +
 arch/arm64/kernel/setup.c                     |   2 +-
 arch/mips/kernel/relocate.c                   |   2 +-
 arch/mips/sgi-ip22/ip22-reset.c               |  13 +-
 arch/mips/sgi-ip32/ip32-reset.c               |   3 +-
 arch/parisc/include/asm/pdc.h                 |   1 +
 arch/parisc/kernel/firmware.c                 |  27 +-
 arch/parisc/kernel/pdc_chassis.c              |   3 +-
 arch/powerpc/include/asm/bug.h                |   2 +-
 arch/powerpc/kernel/fadump.c                  |   8 -
 arch/powerpc/kernel/setup-common.c            |  76 ++--
 arch/powerpc/kernel/traps.c                   |   6 +-
 arch/powerpc/platforms/powernv/opal.c         |   2 +-
 arch/powerpc/platforms/ps3/setup.c            |   2 +-
 arch/powerpc/platforms/pseries/setup.c        |   2 +-
 arch/s390/kernel/ipl.c                        |   4 +-
 arch/s390/kernel/setup.c                      |  19 +-
 arch/sparc/kernel/setup_32.c                  |  27 +-
 arch/sparc/kernel/setup_64.c                  |  29 +-
 arch/sparc/kernel/sstate.c                    |   3 +-
 arch/um/drivers/mconsole_kern.c               |  10 +-
 arch/um/kernel/um_arch.c                      |  11 +-
 arch/x86/include/asm/cpu.h                    |   1 +
 arch/x86/kernel/crash.c                       |   8 +-
 arch/x86/kernel/reboot.c                      |  14 +-
 arch/x86/kernel/setup.c                       |   2 +-
 arch/x86/xen/enlighten.c                      |   2 +-
 arch/xtensa/platforms/iss/setup.c             |   4 +-
 drivers/bus/brcmstb_gisb.c                    |  28 +-
 drivers/char/ipmi/ipmi_msghandler.c           |  12 +-
 drivers/edac/altera_edac.c                    |   3 +-
 drivers/firmware/google/gsmi.c                |  10 +-
 drivers/hv/hv_common.c                        |  12 -
 drivers/hv/vmbus_drv.c                        | 113 +++---
 .../hwtracing/coresight/coresight-cpu-debug.c |  11 +-
 drivers/leds/trigger/ledtrig-activity.c       |   4 +-
 drivers/leds/trigger/ledtrig-heartbeat.c      |   4 +-
 drivers/leds/trigger/ledtrig-panic.c          |   3 +-
 drivers/misc/bcm-vk/bcm_vk_dev.c              |   6 +-
 drivers/misc/ibmasm/heartbeat.c               |  16 +-
 drivers/misc/pvpanic/pvpanic.c                |  14 +-
 drivers/net/ipa/ipa_smp2p.c                   |   5 +-
 drivers/parisc/power.c                        |  21 +-
 drivers/power/reset/ltc2952-poweroff.c        |   4 +-
 drivers/remoteproc/remoteproc_core.c          |   6 +-
 drivers/s390/char/con3215.c                   |  38 +-
 drivers/s390/char/con3270.c                   |  36 +-
 drivers/s390/char/raw3270.c                   |  18 +
 drivers/s390/char/raw3270.h                   |   1 +
 drivers/s390/char/sclp_con.c                  |  30 +-
 drivers/s390/char/sclp_vt220.c                |  44 +--
 drivers/s390/char/zcore.c                     |   5 +-
 drivers/soc/bcm/brcmstb/pm/pm-arm.c           |  18 +-
 drivers/soc/tegra/ari-tegra186.c              |   3 +-
 drivers/staging/olpc_dcon/olpc_dcon.c         |   6 +-
 drivers/video/fbdev/hyperv_fb.c               |  12 +-
 include/linux/console.h                       |   2 +
 include/linux/kmsg_dump.h                     |   7 +
 include/linux/notifier.h                      |   8 +-
 include/linux/panic.h                         |   3 -
 include/linux/panic_notifier.h                |  12 +-
 include/linux/printk.h                        |   1 +
 kernel/hung_task.c                            |   3 +-
 kernel/kexec_core.c                           |   8 +-
 kernel/notifier.c                             |  48 ++-
 kernel/panic.c                                | 335 +++++++++++-------
 kernel/printk/printk.c                        |  76 ++++
 kernel/rcu/tree.c                             |   1 -
 kernel/rcu/tree_stall.h                       |   3 +-
 kernel/trace/trace.c                          |  59 +--
 .../selftests/pstore/pstore_crash_test        |   5 +-
 74 files changed, 953 insertions(+), 486 deletions(-)

Comments

Paul E. McKenney April 27, 2022, 11:49 p.m. UTC | #1
On Wed, Apr 27, 2022 at 07:49:14PM -0300, Guilherme G. Piccoli wrote:
> The goal of this new panic notifier is to allow its users to
> register callbacks to run earlier in the panic path than they
> currently do. This aims at informational mechanisms, like dumping
> kernel offsets and showing device error data (in case it's simple
> registers reading, for example) as well as mechanisms to disable
> log flooding (like hung_task detector / RCU warnings) and the
> tracing dump_on_oops (when enabled).
> 
> Any (non-invasive) information that should be provided before
> kmsg_dump() as well as log flooding preventing code should fit
> here, as long it offers relatively low risk for kdump.
> 
> For now, the patch is almost a no-op, although it changes a bit
> the ordering in which some panic notifiers are executed - specially
> affected by this are the notifiers responsible for disabling the
> hung_task detector / RCU warnings, which now run first. In a
> subsequent patch, the panic path will be refactored, then the
> panic informational notifiers will effectively run earlier,
> before ksmg_dump() (and usually before kdump as well).
> 
> We also defer documenting it all properly in the subsequent
> refactor patch. Finally, while at it, we removed some useless
> header inclusions too.
> 
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Hari Bathini <hbathini@linux.ibm.com>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: Jonathan Hunter <jonathanh@nvidia.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Cc: Leo Yan <leo.yan@linaro.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Mike Leach <mike.leach@linaro.org>
> Cc: Mikko Perttunen <mperttunen@nvidia.com>
> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> Cc: Thierry Reding <thierry.reding@gmail.com>
> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Randy Dunlap April 28, 2022, 12:28 a.m. UTC | #2
On 4/27/22 15:49, Guilherme G. Piccoli wrote:
> +	crash_kexec_post_notifiers
> +			This was DEPRECATED - users should always prefer the

			This is DEPRECATED - users should always prefer the

> +			parameter "panic_notifiers_level" - check its entry
> +			in this documentation for details on how it works.
> +			Setting this parameter is exactly the same as setting
> +			"panic_notifiers_level=4".
Xiaoming Ni April 28, 2022, 1:01 a.m. UTC | #3
On 2022/4/28 6:49, Guilherme G. Piccoli wrote:
> Currently we have a debug infrastructure in the notifiers file, but
> it's very simple/limited. This patch extends it by:
> 
> (a) Showing all registered/unregistered notifiers' callback names;
> 
> (b) Adding a dynamic debug tuning to allow showing called notifiers'
> function names. Notice that this should be guarded as a tunable since
> it can flood the kernel log buffer.
> 
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Xiaoming Ni <nixiaoming@huawei.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
> 
> We have some design decisions that worth discussing here:
> 
> (a) First of call, using C99 helps a lot to write clear and concise code, but
> due to commit 4d94f910e79a ("Kbuild: use -Wdeclaration-after-statement") we
> have a warning if mixing variable declarations with code. For this patch though,
> doing that makes the code way clear, so decision was to add the debug code
> inside brackets whenever this warning pops up. We can change that, but that'll
> cause more ifdefs in the same function.
> 
> (b) In the symbol lookup helper function, we modify the parameter passed but
> even more, we return it as well! This is unusual and seems unnecessary, but was
> the strategy taken to allow embedding such function in the pr_debug() call.
> 
> Not doing that would likely requiring 3 symbol_name variables to avoid
> concurrency (registering notifier A while calling notifier B) - we rely in
> local variables as a serialization mechanism.
> 
> We're open for suggestions in case this design is not appropriate;
> thanks in advance!
> 
>   kernel/notifier.c | 48 +++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/notifier.c b/kernel/notifier.c
> index ba005ebf4730..21032ebcde57 100644
> --- a/kernel/notifier.c
> +++ b/kernel/notifier.c
> @@ -7,6 +7,22 @@
>   #include <linux/vmalloc.h>
>   #include <linux/reboot.h>
>   
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> +#include <linux/kallsyms.h>
> +
> +/*
> + *	Helper to get symbol names in case DEBUG_NOTIFIERS is set.
> + *	Return the modified parameter is a strategy used to achieve
> + *	the pr_debug() functionality - with this, function is only
> + *	executed if the dynamic debug tuning is effectively set.
> + */
> +static inline char *notifier_name(struct notifier_block *nb, char *sym_name)
> +{
> +	lookup_symbol_name((unsigned long)(nb->notifier_call), sym_name);
> +	return sym_name;
> +}
> +#endif
> +
>   /*
>    *	Notifier list for kernel code which wants to be called
>    *	at shutdown. This is used to stop any idling DMA operations
> @@ -34,20 +50,41 @@ static int notifier_chain_register(struct notifier_block **nl,
>   	}
>   	n->next = *nl;
>   	rcu_assign_pointer(*nl, n);
> +
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> +	{
> +		char sym_name[KSYM_NAME_LEN];
> +
> +		pr_info("notifiers: registered %s()\n",
> +			notifier_name(n, sym_name));
> +	}

Duplicate Code.

Is it better to use __func__ and %pS?

pr_info("%s: %pS\n", __func__, n->notifier_call);


> +#endif
>   	return 0;
>   }
>   
>   static int notifier_chain_unregister(struct notifier_block **nl,
>   		struct notifier_block *n)
>   {
> +	int ret = -ENOENT;
> +
>   	while ((*nl) != NULL) {
>   		if ((*nl) == n) {
>   			rcu_assign_pointer(*nl, n->next);
> -			return 0;
> +			ret = 0;
> +			break;
>   		}
>   		nl = &((*nl)->next);
>   	}
> -	return -ENOENT;
> +
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> +	if (!ret) {
> +		char sym_name[KSYM_NAME_LEN];
> +
> +		pr_info("notifiers: unregistered %s()\n",
> +			notifier_name(n, sym_name));
> +	}
Duplicate Code.

Is it better to use __func__ and %pS?

pr_info("%s: %pS\n", __func__, n->notifier_call);
> +#endif
> +	return ret;
>   }
>   
>   /**
> @@ -80,6 +117,13 @@ static int notifier_call_chain(struct notifier_block **nl,
>   			nb = next_nb;
>   			continue;
>   		}
> +
Is the "#ifdef" missing here?
> +		{
> +			char sym_name[KSYM_NAME_LEN];
> +
> +			pr_debug("notifiers: calling %s()\n",
> +				 notifier_name(nb, sym_name));
Duplicate Code.

Is it better to use __func__ and %pS?

pr_info("%s: %pS\n", __func__, n->notifier_call);
> +		}
>   #endif
>   		ret = nb->notifier_call(nb, val, v);
>   
> 

Thanks
Xiaoming Ni
Suzuki K Poulose April 28, 2022, 8:11 a.m. UTC | #4
Hi Guilherme,

On 27/04/2022 23:49, Guilherme G. Piccoli wrote:
> The panic notifier infrastructure executes registered callbacks when
> a panic event happens - such callbacks are executed in atomic context,
> with interrupts and preemption disabled in the running CPU and all other
> CPUs disabled. That said, mutexes in such context are not a good idea.
> 
> This patch replaces a regular mutex with a mutex_trylock safer approach;
> given the nature of the mutex used in the driver, it should be pretty
> uncommon being unable to acquire such mutex in the panic path, hence
> no functional change should be observed (and if it is, that would be
> likely a deadlock with the regular mutex).
> 
> Fixes: 2227b7c74634 ("coresight: add support for CPU debug module")
> Cc: Leo Yan <leo.yan@linaro.org>
> Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
> Cc: Mike Leach <mike.leach@linaro.org>
> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>

How would you like to proceed with queuing this ? I am happy
either way. In case you plan to push this as part of this
series (I don't see any potential conflicts) :

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Suzuki K Poulose April 28, 2022, 8:14 a.m. UTC | #5
On 27/04/2022 23:49, Guilherme G. Piccoli wrote:
> The goal of this new panic notifier is to allow its users to
> register callbacks to run earlier in the panic path than they
> currently do. This aims at informational mechanisms, like dumping
> kernel offsets and showing device error data (in case it's simple
> registers reading, for example) as well as mechanisms to disable
> log flooding (like hung_task detector / RCU warnings) and the
> tracing dump_on_oops (when enabled).
> 
> Any (non-invasive) information that should be provided before
> kmsg_dump() as well as log flooding preventing code should fit
> here, as long it offers relatively low risk for kdump.
> 
> For now, the patch is almost a no-op, although it changes a bit
> the ordering in which some panic notifiers are executed - specially
> affected by this are the notifiers responsible for disabling the
> hung_task detector / RCU warnings, which now run first. In a
> subsequent patch, the panic path will be refactored, then the
> panic informational notifiers will effectively run earlier,
> before ksmg_dump() (and usually before kdump as well).
> 
> We also defer documenting it all properly in the subsequent
> refactor patch. Finally, while at it, we removed some useless
> header inclusions too.
> 
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Hari Bathini <hbathini@linux.ibm.com>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: Jonathan Hunter <jonathanh@nvidia.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Cc: Leo Yan <leo.yan@linaro.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Mike Leach <mike.leach@linaro.org>
> Cc: Mikko Perttunen <mperttunen@nvidia.com>
> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> Cc: Thierry Reding <thierry.reding@gmail.com>
> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
>   arch/arm64/kernel/setup.c                         | 2 +-
>   arch/mips/kernel/relocate.c                       | 2 +-
>   arch/powerpc/kernel/setup-common.c                | 2 +-
>   arch/x86/kernel/setup.c                           | 2 +-
>   drivers/bus/brcmstb_gisb.c                        | 2 +-
>   drivers/hwtracing/coresight/coresight-cpu-debug.c | 4 ++--
>   drivers/soc/tegra/ari-tegra186.c                  | 3 ++-
>   include/linux/panic_notifier.h                    | 1 +
>   kernel/hung_task.c                                | 3 ++-
>   kernel/panic.c                                    | 4 ++++
>   kernel/rcu/tree.c                                 | 1 -
>   kernel/rcu/tree_stall.h                           | 3 ++-
>   kernel/trace/trace.c                              | 2 +-
>   13 files changed, 19 insertions(+), 12 deletions(-)
> 

...

> diff --git a/drivers/hwtracing/coresight/coresight-cpu-debug.c b/drivers/hwtracing/coresight/coresight-cpu-debug.c
> index 1874df7c6a73..7b1012454525 100644
> --- a/drivers/hwtracing/coresight/coresight-cpu-debug.c
> +++ b/drivers/hwtracing/coresight/coresight-cpu-debug.c
> @@ -535,7 +535,7 @@ static int debug_func_init(void)
>   			    &debug_func_knob_fops);
>   
>   	/* Register function to be called for panic */
> -	ret = atomic_notifier_chain_register(&panic_notifier_list,
> +	ret = atomic_notifier_chain_register(&panic_info_list,
>   					     &debug_notifier);
>   	if (ret) {
>   		pr_err("%s: unable to register notifier: %d\n",
> @@ -552,7 +552,7 @@ static int debug_func_init(void)
>   
>   static void debug_func_exit(void)
>   {
> -	atomic_notifier_chain_unregister(&panic_notifier_list,
> +	atomic_notifier_chain_unregister(&panic_info_list,
>   					 &debug_notifier);
>   	debugfs_remove_recursive(debug_debugfs_dir);
>   }

Acked-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Alex Elder April 28, 2022, 2:13 p.m. UTC | #6
On 4/27/22 5:49 PM, Guilherme G. Piccoli wrote:
> This patch renames the panic_notifier_list to panic_pre_reboot_list;
> the idea is that a subsequent patch will refactor the panic path
> in order to better split the notifiers, running some of them very
> early, some of them not so early [but still before kmsg_dump()] and
> finally, the rest should execute late, after kdump. The latter ones
> are now in the panic pre-reboot list - the name comes from the idea
> that these notifiers execute before panic() attempts rebooting the
> machine (if that option is set).
> 
> We also took the opportunity to clean-up useless header inclusions,
> improve some notifier block declarations (e.g. in ibmasm/heartbeat.c)
> and more important, change some priorities - we hereby set 2 notifiers
> to run late in the list [iss_panic_event() and the IPMI panic_event()]
> due to the risks they offer (may not return, for example).
> Proper documentation is going to be provided in a subsequent patch,
> that effectively refactors the panic path.
> 
> Cc: Alex Elder <elder@kernel.org>

For "drivers/net/ipa/ipa_smp2p.c":

Acked-by: Alex Elder <elder@kernel.org>

> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Chris Zankel <chris@zankel.net>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Corey Minyard <minyard@acm.org>
> Cc: Dexuan Cui <decui@microsoft.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
> Cc: James Morse <james.morse@arm.com>
> Cc: Johannes Berg <johannes@sipsolutions.net>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
> Cc: Matt Turner <mattst88@gmail.com>
> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> Cc: Max Filippov <jcmvbkbc@gmail.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Robert Richter <rric@kernel.org>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
> 

. . .
Corey Minyard April 28, 2022, 4:26 p.m. UTC | #7
On Wed, Apr 27, 2022 at 07:49:15PM -0300, Guilherme G. Piccoli wrote:
> This patch renames the panic_notifier_list to panic_pre_reboot_list;
> the idea is that a subsequent patch will refactor the panic path
> in order to better split the notifiers, running some of them very
> early, some of them not so early [but still before kmsg_dump()] and
> finally, the rest should execute late, after kdump. The latter ones
> are now in the panic pre-reboot list - the name comes from the idea
> that these notifiers execute before panic() attempts rebooting the
> machine (if that option is set).
> 
> We also took the opportunity to clean-up useless header inclusions,
> improve some notifier block declarations (e.g. in ibmasm/heartbeat.c)
> and more important, change some priorities - we hereby set 2 notifiers
> to run late in the list [iss_panic_event() and the IPMI panic_event()]
> due to the risks they offer (may not return, for example).
> Proper documentation is going to be provided in a subsequent patch,
> that effectively refactors the panic path.

For the IPMI portion:

Acked-by: Corey Minyard <cminyard@mvista.com>

Note that the IPMI panic_event() should always return, but it may take
some time, especially if the IPMI controller is no longer functional.
So the risk of a long delay is there and it makes sense to move it very
late.

-corey

> 
> Cc: Alex Elder <elder@kernel.org>
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Chris Zankel <chris@zankel.net>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Corey Minyard <minyard@acm.org>
> Cc: Dexuan Cui <decui@microsoft.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
> Cc: James Morse <james.morse@arm.com>
> Cc: Johannes Berg <johannes@sipsolutions.net>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
> Cc: Matt Turner <mattst88@gmail.com>
> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> Cc: Max Filippov <jcmvbkbc@gmail.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Robert Richter <rric@kernel.org>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
> 
> Notice that, with this name change, out-of-tree code that relies in the global
> exported "panic_notifier_list" will fail to build. We could easily keep the
> retro-compatibility by making the old symbol to still exist and point to the
> pre_reboot list (or even, keep the old naming).
> 
> But our design choice was to allow the breakage, making users rethink their
> notifiers, adding them in the list that fits best. If that wasn't a good
> decision, we're open to change it, of course.
> Thanks in advance for the review!
> 
>  arch/alpha/kernel/setup.c             |  4 ++--
>  arch/parisc/kernel/pdc_chassis.c      |  3 +--
>  arch/powerpc/kernel/setup-common.c    |  2 +-
>  arch/s390/kernel/ipl.c                |  4 ++--
>  arch/um/drivers/mconsole_kern.c       |  2 +-
>  arch/um/kernel/um_arch.c              |  2 +-
>  arch/x86/xen/enlighten.c              |  2 +-
>  arch/xtensa/platforms/iss/setup.c     |  4 ++--
>  drivers/char/ipmi/ipmi_msghandler.c   | 12 +++++++-----
>  drivers/edac/altera_edac.c            |  3 +--
>  drivers/hv/vmbus_drv.c                |  4 ++--
>  drivers/leds/trigger/ledtrig-panic.c  |  3 +--
>  drivers/misc/ibmasm/heartbeat.c       | 16 +++++++++-------
>  drivers/net/ipa/ipa_smp2p.c           |  5 ++---
>  drivers/parisc/power.c                |  4 ++--
>  drivers/remoteproc/remoteproc_core.c  |  6 ++++--
>  drivers/s390/char/con3215.c           |  2 +-
>  drivers/s390/char/con3270.c           |  2 +-
>  drivers/s390/char/sclp_con.c          |  2 +-
>  drivers/s390/char/sclp_vt220.c        |  2 +-
>  drivers/staging/olpc_dcon/olpc_dcon.c |  6 ++++--
>  drivers/video/fbdev/hyperv_fb.c       |  4 ++--
>  include/linux/panic_notifier.h        |  2 +-
>  kernel/panic.c                        |  9 ++++-----
>  24 files changed, 54 insertions(+), 51 deletions(-)
> 
> diff --git a/arch/alpha/kernel/setup.c b/arch/alpha/kernel/setup.c
> index d88bdf852753..8ace0d7113b6 100644
> --- a/arch/alpha/kernel/setup.c
> +++ b/arch/alpha/kernel/setup.c
> @@ -472,8 +472,8 @@ setup_arch(char **cmdline_p)
>  	}
>  
>  	/* Register a call for panic conditions. */
> -	atomic_notifier_chain_register(&panic_notifier_list,
> -			&alpha_panic_block);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list,
> +					&alpha_panic_block);
>  
>  #ifndef alpha_using_srm
>  	/* Assume that we've booted from SRM if we haven't booted from MILO.
> diff --git a/arch/parisc/kernel/pdc_chassis.c b/arch/parisc/kernel/pdc_chassis.c
> index da154406d368..0fd8d87fb4f9 100644
> --- a/arch/parisc/kernel/pdc_chassis.c
> +++ b/arch/parisc/kernel/pdc_chassis.c
> @@ -22,7 +22,6 @@
>  #include <linux/kernel.h>
>  #include <linux/panic_notifier.h>
>  #include <linux/reboot.h>
> -#include <linux/notifier.h>
>  #include <linux/cache.h>
>  #include <linux/proc_fs.h>
>  #include <linux/seq_file.h>
> @@ -135,7 +134,7 @@ void __init parisc_pdc_chassis_init(void)
>  				PDC_CHASSIS_VER);
>  
>  		/* initialize panic notifier chain */
> -		atomic_notifier_chain_register(&panic_notifier_list,
> +		atomic_notifier_chain_register(&panic_pre_reboot_list,
>  				&pdc_chassis_panic_block);
>  
>  		/* initialize reboot notifier chain */
> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> index d04b8bf8dbc7..3518bebc10ad 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -762,7 +762,7 @@ void __init setup_panic(void)
>  
>  	/* Low-level platform-specific routines that should run on panic */
>  	if (ppc_md.panic)
> -		atomic_notifier_chain_register(&panic_notifier_list,
> +		atomic_notifier_chain_register(&panic_pre_reboot_list,
>  					       &ppc_panic_block);
>  }
>  
> diff --git a/arch/s390/kernel/ipl.c b/arch/s390/kernel/ipl.c
> index 1cc85b8ff42e..4a88c5bb6e15 100644
> --- a/arch/s390/kernel/ipl.c
> +++ b/arch/s390/kernel/ipl.c
> @@ -2034,7 +2034,7 @@ static int on_panic_notify(struct notifier_block *self,
>  			   unsigned long event, void *data)
>  {
>  	do_panic();
> -	return NOTIFY_OK;
> +	return NOTIFY_DONE;
>  }
>  
>  static struct notifier_block on_panic_nb = {
> @@ -2069,7 +2069,7 @@ void __init setup_ipl(void)
>  		/* We have no info to copy */
>  		break;
>  	}
> -	atomic_notifier_chain_register(&panic_notifier_list, &on_panic_nb);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list, &on_panic_nb);
>  }
>  
>  void s390_reset_system(void)
> diff --git a/arch/um/drivers/mconsole_kern.c b/arch/um/drivers/mconsole_kern.c
> index 2ea0421bcc3f..21c13b4e24a3 100644
> --- a/arch/um/drivers/mconsole_kern.c
> +++ b/arch/um/drivers/mconsole_kern.c
> @@ -855,7 +855,7 @@ static struct notifier_block panic_exit_notifier = {
>  
>  static int add_notifier(void)
>  {
> -	atomic_notifier_chain_register(&panic_notifier_list,
> +	atomic_notifier_chain_register(&panic_pre_reboot_list,
>  			&panic_exit_notifier);
>  	return 0;
>  }
> diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
> index 4485b1a7c8e4..fc6e443299da 100644
> --- a/arch/um/kernel/um_arch.c
> +++ b/arch/um/kernel/um_arch.c
> @@ -257,7 +257,7 @@ static struct notifier_block panic_exit_notifier = {
>  
>  void uml_finishsetup(void)
>  {
> -	atomic_notifier_chain_register(&panic_notifier_list,
> +	atomic_notifier_chain_register(&panic_pre_reboot_list,
>  				       &panic_exit_notifier);
>  
>  	uml_postsetup();
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index 30c6e986a6cd..d4f4de239a21 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -290,7 +290,7 @@ static struct notifier_block xen_panic_block = {
>  
>  int xen_panic_handler_init(void)
>  {
> -	atomic_notifier_chain_register(&panic_notifier_list, &xen_panic_block);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list, &xen_panic_block);
>  	return 0;
>  }
>  
> diff --git a/arch/xtensa/platforms/iss/setup.c b/arch/xtensa/platforms/iss/setup.c
> index d3433e1bb94e..eeeeb6cff6bd 100644
> --- a/arch/xtensa/platforms/iss/setup.c
> +++ b/arch/xtensa/platforms/iss/setup.c
> @@ -13,7 +13,6 @@
>   */
>  #include <linux/init.h>
>  #include <linux/kernel.h>
> -#include <linux/notifier.h>
>  #include <linux/panic_notifier.h>
>  #include <linux/printk.h>
>  #include <linux/string.h>
> @@ -53,6 +52,7 @@ iss_panic_event(struct notifier_block *this, unsigned long event, void *ptr)
>  
>  static struct notifier_block iss_panic_block = {
>  	.notifier_call = iss_panic_event,
> +	.priority = INT_MIN, /* run as late as possible, may not return */
>  };
>  
>  void __init platform_setup(char **p_cmdline)
> @@ -81,5 +81,5 @@ void __init platform_setup(char **p_cmdline)
>  		}
>  	}
>  
> -	atomic_notifier_chain_register(&panic_notifier_list, &iss_panic_block);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list, &iss_panic_block);
>  }
> diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
> index c59265146e9c..6c4770949c01 100644
> --- a/drivers/char/ipmi/ipmi_msghandler.c
> +++ b/drivers/char/ipmi/ipmi_msghandler.c
> @@ -25,7 +25,6 @@
>  #include <linux/slab.h>
>  #include <linux/ipmi.h>
>  #include <linux/ipmi_smi.h>
> -#include <linux/notifier.h>
>  #include <linux/init.h>
>  #include <linux/proc_fs.h>
>  #include <linux/rcupdate.h>
> @@ -5375,10 +5374,13 @@ static int ipmi_register_driver(void)
>  	return rv;
>  }
>  
> +/*
> + * we should execute this panic callback late, since it involves
> + * a complex call-chain and panic() runs in atomic context.
> + */
>  static struct notifier_block panic_block = {
>  	.notifier_call	= panic_event,
> -	.next		= NULL,
> -	.priority	= 200	/* priority: INT_MAX >= x >= 0 */
> +	.priority	= INT_MIN + 1,
>  };
>  
>  static int ipmi_init_msghandler(void)
> @@ -5406,7 +5408,7 @@ static int ipmi_init_msghandler(void)
>  	timer_setup(&ipmi_timer, ipmi_timeout, 0);
>  	mod_timer(&ipmi_timer, jiffies + IPMI_TIMEOUT_JIFFIES);
>  
> -	atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list, &panic_block);
>  
>  	initialized = true;
>  
> @@ -5438,7 +5440,7 @@ static void __exit cleanup_ipmi(void)
>  	if (initialized) {
>  		destroy_workqueue(remove_work_wq);
>  
> -		atomic_notifier_chain_unregister(&panic_notifier_list,
> +		atomic_notifier_chain_unregister(&panic_pre_reboot_list,
>  						 &panic_block);
>  
>  		/*
> diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
> index e7e8e624a436..4890e9cba6fb 100644
> --- a/drivers/edac/altera_edac.c
> +++ b/drivers/edac/altera_edac.c
> @@ -16,7 +16,6 @@
>  #include <linux/kernel.h>
>  #include <linux/mfd/altera-sysmgr.h>
>  #include <linux/mfd/syscon.h>
> -#include <linux/notifier.h>
>  #include <linux/of_address.h>
>  #include <linux/of_irq.h>
>  #include <linux/of_platform.h>
> @@ -2163,7 +2162,7 @@ static int altr_edac_a10_probe(struct platform_device *pdev)
>  		int dberror, err_addr;
>  
>  		edac->panic_notifier.notifier_call = s10_edac_dberr_handler;
> -		atomic_notifier_chain_register(&panic_notifier_list,
> +		atomic_notifier_chain_register(&panic_pre_reboot_list,
>  					       &edac->panic_notifier);
>  
>  		/* Printout a message if uncorrectable error previously. */
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 901b97034308..3717c323aa36 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1622,7 +1622,7 @@ static int vmbus_bus_init(void)
>  	 * Always register the vmbus unload panic notifier because we
>  	 * need to shut the VMbus channel connection on panic.
>  	 */
> -	atomic_notifier_chain_register(&panic_notifier_list,
> +	atomic_notifier_chain_register(&panic_pre_reboot_list,
>  			       &hyperv_panic_vmbus_unload_block);
>  
>  	vmbus_request_offers();
> @@ -2851,7 +2851,7 @@ static void __exit vmbus_exit(void)
>  	 * The vmbus panic notifier is always registered, hence we should
>  	 * also unconditionally unregister it here as well.
>  	 */
> -	atomic_notifier_chain_unregister(&panic_notifier_list,
> +	atomic_notifier_chain_unregister(&panic_pre_reboot_list,
>  					&hyperv_panic_vmbus_unload_block);
>  
>  	free_page((unsigned long)hv_panic_page);
> diff --git a/drivers/leds/trigger/ledtrig-panic.c b/drivers/leds/trigger/ledtrig-panic.c
> index 64abf2e91608..34fd5170723f 100644
> --- a/drivers/leds/trigger/ledtrig-panic.c
> +++ b/drivers/leds/trigger/ledtrig-panic.c
> @@ -7,7 +7,6 @@
>  
>  #include <linux/kernel.h>
>  #include <linux/init.h>
> -#include <linux/notifier.h>
>  #include <linux/panic_notifier.h>
>  #include <linux/leds.h>
>  #include "../leds.h"
> @@ -64,7 +63,7 @@ static long led_panic_blink(int state)
>  
>  static int __init ledtrig_panic_init(void)
>  {
> -	atomic_notifier_chain_register(&panic_notifier_list,
> +	atomic_notifier_chain_register(&panic_pre_reboot_list,
>  				       &led_trigger_panic_nb);
>  
>  	led_trigger_register_simple("panic", &trigger);
> diff --git a/drivers/misc/ibmasm/heartbeat.c b/drivers/misc/ibmasm/heartbeat.c
> index 59c9a0d95659..d6acae88b722 100644
> --- a/drivers/misc/ibmasm/heartbeat.c
> +++ b/drivers/misc/ibmasm/heartbeat.c
> @@ -8,7 +8,6 @@
>   * Author: Max Asböck <amax@us.ibm.com>
>   */
>  
> -#include <linux/notifier.h>
>  #include <linux/panic_notifier.h>
>  #include "ibmasm.h"
>  #include "dot_command.h"
> @@ -24,7 +23,7 @@ static int suspend_heartbeats = 0;
>   * In the case of a panic the interrupt handler continues to work and thus
>   * continues to respond to heartbeats, making the service processor believe
>   * the OS is still running and thus preventing a reboot.
> - * To prevent this from happening a callback is added the panic_notifier_list.
> + * To prevent this from happening a callback is added in a panic notifier list.
>   * Before responding to a heartbeat the driver checks if a panic has happened,
>   * if yes it suspends heartbeat, causing the service processor to reboot as
>   * expected.
> @@ -32,20 +31,23 @@ static int suspend_heartbeats = 0;
>  static int panic_happened(struct notifier_block *n, unsigned long val, void *v)
>  {
>  	suspend_heartbeats = 1;
> -	return 0;
> +	return NOTIFY_DONE;
>  }
>  
> -static struct notifier_block panic_notifier = { panic_happened, NULL, 1 };
> +static struct notifier_block panic_notifier = {
> +	.notifier_call = panic_happened,
> +};
>  
>  void ibmasm_register_panic_notifier(void)
>  {
> -	atomic_notifier_chain_register(&panic_notifier_list, &panic_notifier);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list,
> +					&panic_notifier);
>  }
>  
>  void ibmasm_unregister_panic_notifier(void)
>  {
> -	atomic_notifier_chain_unregister(&panic_notifier_list,
> -			&panic_notifier);
> +	atomic_notifier_chain_unregister(&panic_pre_reboot_list,
> +					&panic_notifier);
>  }
>  
>  
> diff --git a/drivers/net/ipa/ipa_smp2p.c b/drivers/net/ipa/ipa_smp2p.c
> index 211233612039..92cdf6e0637c 100644
> --- a/drivers/net/ipa/ipa_smp2p.c
> +++ b/drivers/net/ipa/ipa_smp2p.c
> @@ -7,7 +7,6 @@
>  #include <linux/types.h>
>  #include <linux/device.h>
>  #include <linux/interrupt.h>
> -#include <linux/notifier.h>
>  #include <linux/panic_notifier.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/soc/qcom/smem.h>
> @@ -138,13 +137,13 @@ static int ipa_smp2p_panic_notifier_register(struct ipa_smp2p *smp2p)
>  	smp2p->panic_notifier.notifier_call = ipa_smp2p_panic_notifier;
>  	smp2p->panic_notifier.priority = INT_MAX;	/* Do it early */
>  
> -	return atomic_notifier_chain_register(&panic_notifier_list,
> +	return atomic_notifier_chain_register(&panic_pre_reboot_list,
>  					      &smp2p->panic_notifier);
>  }
>  
>  static void ipa_smp2p_panic_notifier_unregister(struct ipa_smp2p *smp2p)
>  {
> -	atomic_notifier_chain_unregister(&panic_notifier_list,
> +	atomic_notifier_chain_unregister(&panic_pre_reboot_list,
>  					 &smp2p->panic_notifier);
>  }
>  
> diff --git a/drivers/parisc/power.c b/drivers/parisc/power.c
> index 8512884de2cf..5bb0868f5f08 100644
> --- a/drivers/parisc/power.c
> +++ b/drivers/parisc/power.c
> @@ -233,7 +233,7 @@ static int __init power_init(void)
>  	}
>  
>  	/* Register a call for panic conditions. */
> -	atomic_notifier_chain_register(&panic_notifier_list,
> +	atomic_notifier_chain_register(&panic_pre_reboot_list,
>  			&parisc_panic_block);
>  
>  	return 0;
> @@ -243,7 +243,7 @@ static void __exit power_exit(void)
>  {
>  	kthread_stop(power_task);
>  
> -	atomic_notifier_chain_unregister(&panic_notifier_list,
> +	atomic_notifier_chain_unregister(&panic_pre_reboot_list,
>  			&parisc_panic_block);
>  
>  	pdc_soft_power_button(0);
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index c510125769b9..24799ff239e6 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -2795,12 +2795,14 @@ static int rproc_panic_handler(struct notifier_block *nb, unsigned long event,
>  static void __init rproc_init_panic(void)
>  {
>  	rproc_panic_nb.notifier_call = rproc_panic_handler;
> -	atomic_notifier_chain_register(&panic_notifier_list, &rproc_panic_nb);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list,
> +				       &rproc_panic_nb);
>  }
>  
>  static void __exit rproc_exit_panic(void)
>  {
> -	atomic_notifier_chain_unregister(&panic_notifier_list, &rproc_panic_nb);
> +	atomic_notifier_chain_unregister(&panic_pre_reboot_list,
> +					 &rproc_panic_nb);
>  }
>  
>  static int __init remoteproc_init(void)
> diff --git a/drivers/s390/char/con3215.c b/drivers/s390/char/con3215.c
> index 192198bd3dc4..07379dd3f1f3 100644
> --- a/drivers/s390/char/con3215.c
> +++ b/drivers/s390/char/con3215.c
> @@ -867,7 +867,7 @@ static int __init con3215_init(void)
>  		raw3215[0] = NULL;
>  		return -ENODEV;
>  	}
> -	atomic_notifier_chain_register(&panic_notifier_list, &on_panic_nb);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list, &on_panic_nb);
>  	register_reboot_notifier(&on_reboot_nb);
>  	register_console(&con3215);
>  	return 0;
> diff --git a/drivers/s390/char/con3270.c b/drivers/s390/char/con3270.c
> index 476202f3d8a0..e79bf3e7bde3 100644
> --- a/drivers/s390/char/con3270.c
> +++ b/drivers/s390/char/con3270.c
> @@ -645,7 +645,7 @@ con3270_init(void)
>  	condev->cline->len = 0;
>  	con3270_create_status(condev);
>  	condev->input = alloc_string(&condev->freemem, 80);
> -	atomic_notifier_chain_register(&panic_notifier_list, &on_panic_nb);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list, &on_panic_nb);
>  	register_reboot_notifier(&on_reboot_nb);
>  	register_console(&con3270);
>  	return 0;
> diff --git a/drivers/s390/char/sclp_con.c b/drivers/s390/char/sclp_con.c
> index e5d947c763ea..7ca9d4c45d60 100644
> --- a/drivers/s390/char/sclp_con.c
> +++ b/drivers/s390/char/sclp_con.c
> @@ -288,7 +288,7 @@ sclp_console_init(void)
>  	timer_setup(&sclp_con_timer, sclp_console_timeout, 0);
>  
>  	/* enable printk-access to this driver */
> -	atomic_notifier_chain_register(&panic_notifier_list, &on_panic_nb);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list, &on_panic_nb);
>  	register_reboot_notifier(&on_reboot_nb);
>  	register_console(&sclp_console);
>  	return 0;
> diff --git a/drivers/s390/char/sclp_vt220.c b/drivers/s390/char/sclp_vt220.c
> index a32f34a1c6d2..97cf9e290c28 100644
> --- a/drivers/s390/char/sclp_vt220.c
> +++ b/drivers/s390/char/sclp_vt220.c
> @@ -838,7 +838,7 @@ sclp_vt220_con_init(void)
>  	if (rc)
>  		return rc;
>  	/* Attach linux console */
> -	atomic_notifier_chain_register(&panic_notifier_list, &on_panic_nb);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list, &on_panic_nb);
>  	register_reboot_notifier(&on_reboot_nb);
>  	register_console(&sclp_vt220_console);
>  	return 0;
> diff --git a/drivers/staging/olpc_dcon/olpc_dcon.c b/drivers/staging/olpc_dcon/olpc_dcon.c
> index 7284cb4ac395..cb50471f2246 100644
> --- a/drivers/staging/olpc_dcon/olpc_dcon.c
> +++ b/drivers/staging/olpc_dcon/olpc_dcon.c
> @@ -653,7 +653,8 @@ static int dcon_probe(struct i2c_client *client, const struct i2c_device_id *id)
>  	}
>  
>  	register_reboot_notifier(&dcon->reboot_nb);
> -	atomic_notifier_chain_register(&panic_notifier_list, &dcon_panic_nb);
> +	atomic_notifier_chain_register(&panic_pre_reboot_list,
> +				       &dcon_panic_nb);
>  
>  	return 0;
>  
> @@ -676,7 +677,8 @@ static int dcon_remove(struct i2c_client *client)
>  	struct dcon_priv *dcon = i2c_get_clientdata(client);
>  
>  	unregister_reboot_notifier(&dcon->reboot_nb);
> -	atomic_notifier_chain_unregister(&panic_notifier_list, &dcon_panic_nb);
> +	atomic_notifier_chain_unregister(&panic_pre_reboot_list,
> +					 &dcon_panic_nb);
>  
>  	free_irq(DCON_IRQ, dcon);
>  
> diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
> index f3494b868a64..ec21e63592be 100644
> --- a/drivers/video/fbdev/hyperv_fb.c
> +++ b/drivers/video/fbdev/hyperv_fb.c
> @@ -1253,7 +1253,7 @@ static int hvfb_probe(struct hv_device *hdev,
>  	 */
>  	par->hvfb_panic_nb.notifier_call = hvfb_on_panic;
>  	par->hvfb_panic_nb.priority = INT_MIN + 10,
> -	atomic_notifier_chain_register(&panic_notifier_list,
> +	atomic_notifier_chain_register(&panic_pre_reboot_list,
>  				       &par->hvfb_panic_nb);
>  
>  	return 0;
> @@ -1276,7 +1276,7 @@ static int hvfb_remove(struct hv_device *hdev)
>  	struct fb_info *info = hv_get_drvdata(hdev);
>  	struct hvfb_par *par = info->par;
>  
> -	atomic_notifier_chain_unregister(&panic_notifier_list,
> +	atomic_notifier_chain_unregister(&panic_pre_reboot_list,
>  					 &par->hvfb_panic_nb);
>  
>  	par->update = false;
> diff --git a/include/linux/panic_notifier.h b/include/linux/panic_notifier.h
> index 7364a346bcb0..7912aacbc0e5 100644
> --- a/include/linux/panic_notifier.h
> +++ b/include/linux/panic_notifier.h
> @@ -5,9 +5,9 @@
>  #include <linux/notifier.h>
>  #include <linux/types.h>
>  
> -extern struct atomic_notifier_head panic_notifier_list;
>  extern struct atomic_notifier_head panic_hypervisor_list;
>  extern struct atomic_notifier_head panic_info_list;
> +extern struct atomic_notifier_head panic_pre_reboot_list;
>  
>  extern bool crash_kexec_post_notifiers;
>  
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 73ca1bc44e30..a9d43b98b05b 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -69,16 +69,15 @@ EXPORT_SYMBOL_GPL(panic_timeout);
>  #define PANIC_PRINT_ALL_CPU_BT		0x00000040
>  unsigned long panic_print;
>  
> -ATOMIC_NOTIFIER_HEAD(panic_notifier_list);
> -
> -EXPORT_SYMBOL(panic_notifier_list);
> -
>  ATOMIC_NOTIFIER_HEAD(panic_hypervisor_list);
>  EXPORT_SYMBOL(panic_hypervisor_list);
>  
>  ATOMIC_NOTIFIER_HEAD(panic_info_list);
>  EXPORT_SYMBOL(panic_info_list);
>  
> +ATOMIC_NOTIFIER_HEAD(panic_pre_reboot_list);
> +EXPORT_SYMBOL(panic_pre_reboot_list);
> +
>  static long no_blink(int state)
>  {
>  	return 0;
> @@ -295,7 +294,7 @@ void panic(const char *fmt, ...)
>  	 */
>  	atomic_notifier_call_chain(&panic_hypervisor_list, PANIC_NOTIFIER, buf);
>  	atomic_notifier_call_chain(&panic_info_list, PANIC_NOTIFIER, buf);
> -	atomic_notifier_call_chain(&panic_notifier_list, PANIC_NOTIFIER, buf);
> +	atomic_notifier_call_chain(&panic_pre_reboot_list, PANIC_NOTIFIER, buf);
>  
>  	panic_print_sys_info(false);
>  
> -- 
> 2.36.0
>
Helge Deller April 28, 2022, 4:55 p.m. UTC | #8
On 4/28/22 00:49, Guilherme G. Piccoli wrote:
> The panic notifiers' callbacks execute in an atomic context, with
> interrupts/preemption disabled, and all CPUs not running the panic
> function are off, so it's very dangerous to wait on a regular
> spinlock, there's a risk of deadlock.
>
> This patch refactors the panic notifier of parisc/power driver
> to make use of spin_trylock - for that, we've added a second
> version of the soft-power function. Also, some comments were
> reorganized and trailing white spaces, useless header inclusion
> and blank lines were removed.
>
> Cc: Helge Deller <deller@gmx.de>
> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>

You may add:
Acked-by: Helge Deller <deller@gmx.de> # parisc

Helge


> ---
>  arch/parisc/include/asm/pdc.h |  1 +
>  arch/parisc/kernel/firmware.c | 27 +++++++++++++++++++++++----
>  drivers/parisc/power.c        | 17 ++++++++++-------
>  3 files changed, 34 insertions(+), 11 deletions(-)
>
> diff --git a/arch/parisc/include/asm/pdc.h b/arch/parisc/include/asm/pdc.h
> index b643092d4b98..7a106008e258 100644
> --- a/arch/parisc/include/asm/pdc.h
> +++ b/arch/parisc/include/asm/pdc.h
> @@ -83,6 +83,7 @@ int pdc_do_firm_test_reset(unsigned long ftc_bitmap);
>  int pdc_do_reset(void);
>  int pdc_soft_power_info(unsigned long *power_reg);
>  int pdc_soft_power_button(int sw_control);
> +int pdc_soft_power_button_panic(int sw_control);
>  void pdc_io_reset(void);
>  void pdc_io_reset_devices(void);
>  int pdc_iodc_getc(void);
> diff --git a/arch/parisc/kernel/firmware.c b/arch/parisc/kernel/firmware.c
> index 6a7e315bcc2e..0e2f70b592f4 100644
> --- a/arch/parisc/kernel/firmware.c
> +++ b/arch/parisc/kernel/firmware.c
> @@ -1232,15 +1232,18 @@ int __init pdc_soft_power_info(unsigned long *power_reg)
>  }
>
>  /*
> - * pdc_soft_power_button - Control the soft power button behaviour
> - * @sw_control: 0 for hardware control, 1 for software control
> + * pdc_soft_power_button{_panic} - Control the soft power button behaviour
> + * @sw_control: 0 for hardware control, 1 for software control
>   *
>   *
>   * This PDC function places the soft power button under software or
>   * hardware control.
> - * Under software control the OS may control to when to allow to shut
> - * down the system. Under hardware control pressing the power button
> + * Under software control the OS may control to when to allow to shut
> + * down the system. Under hardware control pressing the power button
>   * powers off the system immediately.
> + *
> + * The _panic version relies in spin_trylock to prevent deadlock
> + * on panic path.
>   */
>  int pdc_soft_power_button(int sw_control)
>  {
> @@ -1254,6 +1257,22 @@ int pdc_soft_power_button(int sw_control)
>  	return retval;
>  }
>
> +int pdc_soft_power_button_panic(int sw_control)
> +{
> +	int retval;
> +	unsigned long flags;
> +
> +	if (!spin_trylock_irqsave(&pdc_lock, flags)) {
> +		pr_emerg("Couldn't enable soft power button\n");
> +		return -EBUSY; /* ignored by the panic notifier */
> +	}
> +
> +	retval = mem_pdc_call(PDC_SOFT_POWER, PDC_SOFT_POWER_ENABLE, __pa(pdc_result), sw_control);
> +	spin_unlock_irqrestore(&pdc_lock, flags);
> +
> +	return retval;
> +}
> +
>  /*
>   * pdc_io_reset - Hack to avoid overlapping range registers of Bridges devices.
>   * Primarily a problem on T600 (which parisc-linux doesn't support) but
> diff --git a/drivers/parisc/power.c b/drivers/parisc/power.c
> index 456776bd8ee6..8512884de2cf 100644
> --- a/drivers/parisc/power.c
> +++ b/drivers/parisc/power.c
> @@ -37,7 +37,6 @@
>  #include <linux/module.h>
>  #include <linux/init.h>
>  #include <linux/kernel.h>
> -#include <linux/notifier.h>
>  #include <linux/panic_notifier.h>
>  #include <linux/reboot.h>
>  #include <linux/sched/signal.h>
> @@ -175,16 +174,21 @@ static void powerfail_interrupt(int code, void *x)
>
>
>
> -/* parisc_panic_event() is called by the panic handler.
> - * As soon as a panic occurs, our tasklets above will not be
> - * executed any longer. This function then re-enables the
> - * soft-power switch and allows the user to switch off the system
> +/*
> + * parisc_panic_event() is called by the panic handler.
> + *
> + * As soon as a panic occurs, our tasklets above will not
> + * be executed any longer. This function then re-enables
> + * the soft-power switch and allows the user to switch off
> + * the system. We rely in pdc_soft_power_button_panic()
> + * since this version spin_trylocks (instead of regular
> + * spinlock), preventing deadlocks on panic path.
>   */
>  static int parisc_panic_event(struct notifier_block *this,
>  		unsigned long event, void *ptr)
>  {
>  	/* re-enable the soft-power switch */
> -	pdc_soft_power_button(0);
> +	pdc_soft_power_button_panic(0);
>  	return NOTIFY_DONE;
>  }
>
> @@ -193,7 +197,6 @@ static struct notifier_block parisc_panic_block = {
>  	.priority	= INT_MAX,
>  };
>
> -
>  static int __init power_init(void)
>  {
>  	unsigned long ret;
Guilherme G. Piccoli April 29, 2022, 2:01 p.m. UTC | #9
On 28/04/2022 05:11, Suzuki K Poulose wrote:
> Hi Guilherme,
> [...] 
> How would you like to proceed with queuing this ? I am happy
> either way. In case you plan to push this as part of this
> series (I don't see any potential conflicts) :
> 
> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>

Thanks for your review Suzuki, much appreciated!

About your question, I'm not sure yet - in case the core changes would
take a while (like if community find them polemic, require many changes,
etc) I might split this series in 2 parts, the fixes part vs the
improvements per se. Either way, a V2 is going to happen for sure, and
in that moment, I'll let you know what I think it's best.

But either way, any choice you prefer is fine by me as well (like if you
want to merge it now or postpone to get merged in the future), this is
not an urgent fix I think =)
Cheers,


Guilherme
Guilherme G. Piccoli April 29, 2022, 2:34 p.m. UTC | #10
On 28/04/2022 13:55, Helge Deller wrote:
> [...]
> You may add:
> Acked-by: Helge Deller <deller@gmx.de> # parisc
> 
> Helge

Thanks Helge, added!
Cheers,


Guilherme

> 
> 
>> ---
>>  arch/parisc/include/asm/pdc.h |  1 +
>>  arch/parisc/kernel/firmware.c | 27 +++++++++++++++++++++++----
>>  drivers/parisc/power.c        | 17 ++++++++++-------
>>  3 files changed, 34 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/parisc/include/asm/pdc.h b/arch/parisc/include/asm/pdc.h
>> index b643092d4b98..7a106008e258 100644
>> --- a/arch/parisc/include/asm/pdc.h
>> +++ b/arch/parisc/include/asm/pdc.h
>> @@ -83,6 +83,7 @@ int pdc_do_firm_test_reset(unsigned long ftc_bitmap);
>>  int pdc_do_reset(void);
>>  int pdc_soft_power_info(unsigned long *power_reg);
>>  int pdc_soft_power_button(int sw_control);
>> +int pdc_soft_power_button_panic(int sw_control);
>>  void pdc_io_reset(void);
>>  void pdc_io_reset_devices(void);
>>  int pdc_iodc_getc(void);
>> diff --git a/arch/parisc/kernel/firmware.c b/arch/parisc/kernel/firmware.c
>> index 6a7e315bcc2e..0e2f70b592f4 100644
>> --- a/arch/parisc/kernel/firmware.c
>> +++ b/arch/parisc/kernel/firmware.c
>> @@ -1232,15 +1232,18 @@ int __init pdc_soft_power_info(unsigned long *power_reg)
>>  }
>>
>>  /*
>> - * pdc_soft_power_button - Control the soft power button behaviour
>> - * @sw_control: 0 for hardware control, 1 for software control
>> + * pdc_soft_power_button{_panic} - Control the soft power button behaviour
>> + * @sw_control: 0 for hardware control, 1 for software control
>>   *
>>   *
>>   * This PDC function places the soft power button under software or
>>   * hardware control.
>> - * Under software control the OS may control to when to allow to shut
>> - * down the system. Under hardware control pressing the power button
>> + * Under software control the OS may control to when to allow to shut
>> + * down the system. Under hardware control pressing the power button
>>   * powers off the system immediately.
>> + *
>> + * The _panic version relies in spin_trylock to prevent deadlock
>> + * on panic path.
>>   */
>>  int pdc_soft_power_button(int sw_control)
>>  {
>> @@ -1254,6 +1257,22 @@ int pdc_soft_power_button(int sw_control)
>>  	return retval;
>>  }
>>
>> +int pdc_soft_power_button_panic(int sw_control)
>> +{
>> +	int retval;
>> +	unsigned long flags;
>> +
>> +	if (!spin_trylock_irqsave(&pdc_lock, flags)) {
>> +		pr_emerg("Couldn't enable soft power button\n");
>> +		return -EBUSY; /* ignored by the panic notifier */
>> +	}
>> +
>> +	retval = mem_pdc_call(PDC_SOFT_POWER, PDC_SOFT_POWER_ENABLE, __pa(pdc_result), sw_control);
>> +	spin_unlock_irqrestore(&pdc_lock, flags);
>> +
>> +	return retval;
>> +}
>> +
>>  /*
>>   * pdc_io_reset - Hack to avoid overlapping range registers of Bridges devices.
>>   * Primarily a problem on T600 (which parisc-linux doesn't support) but
>> diff --git a/drivers/parisc/power.c b/drivers/parisc/power.c
>> index 456776bd8ee6..8512884de2cf 100644
>> --- a/drivers/parisc/power.c
>> +++ b/drivers/parisc/power.c
>> @@ -37,7 +37,6 @@
>>  #include <linux/module.h>
>>  #include <linux/init.h>
>>  #include <linux/kernel.h>
>> -#include <linux/notifier.h>
>>  #include <linux/panic_notifier.h>
>>  #include <linux/reboot.h>
>>  #include <linux/sched/signal.h>
>> @@ -175,16 +174,21 @@ static void powerfail_interrupt(int code, void *x)
>>
>>
>>
>> -/* parisc_panic_event() is called by the panic handler.
>> - * As soon as a panic occurs, our tasklets above will not be
>> - * executed any longer. This function then re-enables the
>> - * soft-power switch and allows the user to switch off the system
>> +/*
>> + * parisc_panic_event() is called by the panic handler.
>> + *
>> + * As soon as a panic occurs, our tasklets above will not
>> + * be executed any longer. This function then re-enables
>> + * the soft-power switch and allows the user to switch off
>> + * the system. We rely in pdc_soft_power_button_panic()
>> + * since this version spin_trylocks (instead of regular
>> + * spinlock), preventing deadlocks on panic path.
>>   */
>>  static int parisc_panic_event(struct notifier_block *this,
>>  		unsigned long event, void *ptr)
>>  {
>>  	/* re-enable the soft-power switch */
>> -	pdc_soft_power_button(0);
>> +	pdc_soft_power_button_panic(0);
>>  	return NOTIFY_DONE;
>>  }
>>
>> @@ -193,7 +197,6 @@ static struct notifier_block parisc_panic_block = {
>>  	.priority	= INT_MAX,
>>  };
>>
>> -
>>  static int __init power_init(void)
>>  {
>>  	unsigned long ret;
>
Guilherme G. Piccoli April 29, 2022, 2:50 p.m. UTC | #11
Thanks Paul and Suzuki for the ACKs.
Cheers,


Guilherme
Guilherme G. Piccoli April 29, 2022, 3:18 p.m. UTC | #12
On 28/04/2022 13:26, Corey Minyard wrote:
> [...]
> 
> For the IPMI portion:
> 
> Acked-by: Corey Minyard <cminyard@mvista.com>

Thanks Alex and Corey for the ACKs!

> 
> Note that the IPMI panic_event() should always return, but it may take
> some time, especially if the IPMI controller is no longer functional.
> So the risk of a long delay is there and it makes sense to move it very
> late.
> 

Thanks, I agree - the patch moves it to the (latest - 1) position, since
some arch code might run as the latest and effectively stops the machine.
Cheers,


Guilherme
Guilherme G. Piccoli April 29, 2022, 4:04 p.m. UTC | #13
On 27/04/2022 21:28, Randy Dunlap wrote:
> 
> 
> On 4/27/22 15:49, Guilherme G. Piccoli wrote:
>> +	crash_kexec_post_notifiers
>> +			This was DEPRECATED - users should always prefer the
> 
> 			This is DEPRECATED - users should always prefer the
> 
>> +			parameter "panic_notifiers_level" - check its entry
>> +			in this documentation for details on how it works.
>> +			Setting this parameter is exactly the same as setting
>> +			"panic_notifiers_level=4".
> 

Thanks Randy, for your suggestion - but I confess I couldn't understand
it properly. It's related to spaces/tabs, right? What you suggest me to
change in this formatting? Just by looking the email I can't parse.

Cheers,


Guilherme
Max Filippov April 29, 2022, 4:04 p.m. UTC | #14
On Wed, Apr 27, 2022 at 3:55 PM Guilherme G. Piccoli
<gpiccoli@igalia.com> wrote:
>
> This patch renames the panic_notifier_list to panic_pre_reboot_list;
> the idea is that a subsequent patch will refactor the panic path
> in order to better split the notifiers, running some of them very
> early, some of them not so early [but still before kmsg_dump()] and
> finally, the rest should execute late, after kdump. The latter ones
> are now in the panic pre-reboot list - the name comes from the idea
> that these notifiers execute before panic() attempts rebooting the
> machine (if that option is set).
>
> We also took the opportunity to clean-up useless header inclusions,
> improve some notifier block declarations (e.g. in ibmasm/heartbeat.c)
> and more important, change some priorities - we hereby set 2 notifiers
> to run late in the list [iss_panic_event() and the IPMI panic_event()]
> due to the risks they offer (may not return, for example).
> Proper documentation is going to be provided in a subsequent patch,
> that effectively refactors the panic path.

[...]

>  arch/xtensa/platforms/iss/setup.c     |  4 ++--For xtensa:

For xtensa:
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Michael Kelley April 29, 2022, 4:26 p.m. UTC | #15
From: Guilherme G. Piccoli <gpiccoli@igalia.com> Sent: Wednesday, April 27, 2022 3:49 PM
> 
> Currently the regular CPU shutdown path for ARM disables IRQs/FIQs
> in the secondary CPUs - smp_send_stop() calls ipi_cpu_stop(), which
> is responsible for that. This makes sense, since we're turning off
> such CPUs, putting them in an endless busy-wait loop.
> 
> Problem is that there is an alternative path for disabling CPUs,
> in the form of function crash_smp_send_stop(), used for kexec/panic
> paths. This functions relies in a SMP call that also triggers a

s/functions relies in/function relies on/

> busy-wait loop [at machine_crash_nonpanic_core()], but *without*
> disabling interrupts. This might lead to odd scenarios, like early
> interrupts in the boot of kexec'd kernel or even interrupts in
> other CPUs while the main one still works in the panic path and
> assumes all secondary CPUs are (really!) off.
> 
> This patch mimics the ipi_cpu_stop() interrupt disable mechanism
> in the crash CPU shutdown path, hence disabling IRQs/FIQs in all
> secondary CPUs in the kexec/panic path as well.
> 
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Russell King <linux@armlinux.org.uk>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
>  arch/arm/kernel/machine_kexec.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c
> index f567032a09c0..ef788ee00519 100644
> --- a/arch/arm/kernel/machine_kexec.c
> +++ b/arch/arm/kernel/machine_kexec.c
> @@ -86,6 +86,9 @@ void machine_crash_nonpanic_core(void *unused)
>  	set_cpu_online(smp_processor_id(), false);
>  	atomic_dec(&waiting_for_crash_ipi);
> 
> +	local_fiq_disable();
> +	local_irq_disable();
> +
>  	while (1) {
>  		cpu_relax();
>  		wfe();
> --
> 2.36.0
Michael Kelley April 29, 2022, 4:27 p.m. UTC | #16
From: Guilherme G. Piccoli <gpiccoli@igalia.com> Sent: Wednesday, April 27, 2022 3:49 PM
> 
> Currently we have a debug infrastructure in the notifiers file, but
> it's very simple/limited. This patch extends it by:
> 
> (a) Showing all registered/unregistered notifiers' callback names;
> 
> (b) Adding a dynamic debug tuning to allow showing called notifiers'
> function names. Notice that this should be guarded as a tunable since
> it can flood the kernel log buffer.
> 
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Xiaoming Ni <nixiaoming@huawei.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
> 
> We have some design decisions that worth discussing here:
> 
> (a) First of call, using C99 helps a lot to write clear and concise code, but

s/call/all/

> due to commit 4d94f910e79a ("Kbuild: use -Wdeclaration-after-statement") we
> have a warning if mixing variable declarations with code. For this patch though,
> doing that makes the code way clear, so decision was to add the debug code
> inside brackets whenever this warning pops up. We can change that, but that'll
> cause more ifdefs in the same function.
> 
> (b) In the symbol lookup helper function, we modify the parameter passed but
> even more, we return it as well! This is unusual and seems unnecessary, but was
> the strategy taken to allow embedding such function in the pr_debug() call.
> 
> Not doing that would likely requiring 3 symbol_name variables to avoid
> concurrency (registering notifier A while calling notifier B) - we rely in
> local variables as a serialization mechanism.
> 
> We're open for suggestions in case this design is not appropriate;
> thanks in advance!
> 
>  kernel/notifier.c | 48 +++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/notifier.c b/kernel/notifier.c
> index ba005ebf4730..21032ebcde57 100644
> --- a/kernel/notifier.c
> +++ b/kernel/notifier.c
> @@ -7,6 +7,22 @@
>  #include <linux/vmalloc.h>
>  #include <linux/reboot.h>
> 
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> +#include <linux/kallsyms.h>
> +
> +/*
> + *	Helper to get symbol names in case DEBUG_NOTIFIERS is set.
> + *	Return the modified parameter is a strategy used to achieve
> + *	the pr_debug() functionality - with this, function is only
> + *	executed if the dynamic debug tuning is effectively set.
> + */
> +static inline char *notifier_name(struct notifier_block *nb, char *sym_name)
> +{
> +	lookup_symbol_name((unsigned long)(nb->notifier_call), sym_name);
> +	return sym_name;
> +}
> +#endif
> +
>  /*
>   *	Notifier list for kernel code which wants to be called
>   *	at shutdown. This is used to stop any idling DMA operations
> @@ -34,20 +50,41 @@ static int notifier_chain_register(struct notifier_block **nl,
>  	}
>  	n->next = *nl;
>  	rcu_assign_pointer(*nl, n);
> +
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> +	{
> +		char sym_name[KSYM_NAME_LEN];
> +
> +		pr_info("notifiers: registered %s()\n",
> +			notifier_name(n, sym_name));
> +	}
> +#endif
>  	return 0;
>  }
> 
>  static int notifier_chain_unregister(struct notifier_block **nl,
>  		struct notifier_block *n)
>  {
> +	int ret = -ENOENT;
> +
>  	while ((*nl) != NULL) {
>  		if ((*nl) == n) {
>  			rcu_assign_pointer(*nl, n->next);
> -			return 0;
> +			ret = 0;
> +			break;
>  		}
>  		nl = &((*nl)->next);
>  	}
> -	return -ENOENT;
> +
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> +	if (!ret) {
> +		char sym_name[KSYM_NAME_LEN];
> +
> +		pr_info("notifiers: unregistered %s()\n",
> +			notifier_name(n, sym_name));
> +	}
> +#endif
> +	return ret;
>  }
> 
>  /**
> @@ -80,6 +117,13 @@ static int notifier_call_chain(struct notifier_block **nl,
>  			nb = next_nb;
>  			continue;
>  		}
> +
> +		{
> +			char sym_name[KSYM_NAME_LEN];
> +
> +			pr_debug("notifiers: calling %s()\n",
> +				 notifier_name(nb, sym_name));
> +		}
>  #endif
>  		ret = nb->notifier_call(nb, val, v);
> 
> --
> 2.36.0
Michael Kelley April 29, 2022, 5:16 p.m. UTC | #17
From: Guilherme G. Piccoli <gpiccoli@igalia.com> Sent: Wednesday, April 27, 2022 3:49 PM
> 
> Currently Hyper-V guests are among the most relevant users of the panic
> infrastructure, like panic notifiers, kmsg dumpers, etc. The reasons rely
> both in cleaning-up procedures (closing a hypervisor <-> guest connection,
> disabling a paravirtualized timer) as well as to data collection (sending
> panic information to the hypervisor) and framebuffer management.
> 
> The thing is: some notifiers are related to others, ordering matters, some
> functionalities are duplicated and there are lots of conditionals behind
> sending panic information to the hypervisor. This patch, as part of an
> effort to clean-up the panic notifiers mechanism and better document
> things, address some of the issues/complexities of Hyper-V panic handling
> through the following changes:
> 
> (a) We have die and panic notifiers on vmbus_drv.c and both have goals of
> sending panic information to the hypervisor, though the panic notifier is
> also responsible for a cleaning-up procedure.
> 
> This commit clears the code by splitting the panic notifier in two, one
> for closing the vmbus connection whereas the other is only for sending
> panic info to hypervisor. With that, it was possible to merge the die and
> panic notifiers in a single/well-documented function, and clear some
> conditional complexities on sending such information to the hypervisor.
> 
> (b) The new panic notifier created after (a) is only doing a single thing:
> cleaning the vmbus connection. This procedure might cause a delay (due to
> hypervisor I/O completion), so we postpone that to run late. But more
> relevant: this *same* vmbus unloading happens in the crash_shutdown()
> handler, so if kdump is set, we can safely skip this panic notifier and
> defer such clean-up to the kexec crash handler.

While the last sentence is true for Hyper-V on x86/x64, it's not true for
Hyper-V on ARM64.  x86/x64 has the 'machine_ops' data structure
with the ability to provide a custom crash_shutdown() function, which
Hyper-V does in the form of hv_machine_crash_shutdown().  But ARM64
has no mechanism to provide such a custom function that will eventually
do the needed vmbus_initiate_unload() before running kdump.

I'm not immediately sure what the best solution is for ARM64.  At this
point, I'm just pointing out the problem and will think about the tradeoffs
for various possible solutions.  Please do the same yourself. :-)

> 
> (c) There is also a Hyper-V framebuffer panic notifier, which relies in
> doing a vmbus operation that demands a valid connection. So, we must
> order this notifier with the panic notifier from vmbus_drv.c, in order to
> guarantee that the framebuffer code executes before the vmbus connection
> is unloaded.

Patch 21 of this set puts the Hyper-V FB panic notifier on the pre_reboot
notifier list, which means it won't execute before the VMbus connection
unload in the case of kdump.   This notifier is making sure that Hyper-V
is notified about the last updates made to the frame buffer before the
panic, so maybe it needs to be put on the hypervisor notifier list.  It
sends a message to Hyper-V over its existing VMbus channel, but it
does not wait for a reply.  It does, however, obtain a spin lock on the
ring buffer used to communicate with Hyper-V.   Unless someone has
a better suggestion, I'm inclined to take the risk of blocking on that
spin lock.

> 
> Also, this commit removes a useless header.
> 
> Although there is code rework and re-ordering, we expect that this change
> has no functional regressions but instead optimize the path and increase
> panic reliability on Hyper-V. This was tested on Hyper-V with success.
> 
> Fixes: 792f232d57ff ("Drivers: hv: vmbus: Fix potential crash on module unload")
> Fixes: 74347a99e73a ("x86/Hyper-V: Unload vmbus channel in hv panic callback")

The "Fixes:" tags imply that these changes should be backported to older
longterm kernel versions, which I don't think is the case.  There is a
dependency on Patch 14 of your series where PANIC_NOTIFIER is
introduced.

> Cc: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
> Cc: Dexuan Cui <decui@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Michael Kelley <mikelley@microsoft.com>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Tianyu Lan <Tianyu.Lan@microsoft.com>
> Cc: Wei Liu <wei.liu@kernel.org>
> Tested-by: Fabio A M Martins <fabiomirmar@gmail.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
> 
> Special thanks to Michael Kelley for the good information about the Hyper-V
> panic path in email threads some months ago, and to Fabio for the testing
> performed.
> 
> Michael and all Microsoft folks: a careful analysis to double-check our changes
> and assumptions here is really appreciated, this code is complex and intricate,
> it is possible some corner case might have been overlooked.
> 
> Thanks in advance!
> 
>  drivers/hv/vmbus_drv.c          | 109 ++++++++++++++++++++------------
>  drivers/video/fbdev/hyperv_fb.c |   8 +++
>  2 files changed, 76 insertions(+), 41 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 14de17087864..f37f12d48001 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -24,11 +24,11 @@
>  #include <linux/sched/task_stack.h>
> 
>  #include <linux/delay.h>
> -#include <linux/notifier.h>
>  #include <linux/panic_notifier.h>
>  #include <linux/ptrace.h>
>  #include <linux/screen_info.h>
>  #include <linux/kdebug.h>
> +#include <linux/kexec.h>
>  #include <linux/efi.h>
>  #include <linux/random.h>
>  #include <linux/kernel.h>
> @@ -68,51 +68,75 @@ static int hyperv_report_reg(void)
>  	return !sysctl_record_panic_msg || !hv_panic_page;
>  }
> 
> -static int hyperv_panic_event(struct notifier_block *nb, unsigned long val,
> +/*
> + * The panic notifier below is responsible solely for unloading the
> + * vmbus connection, which is necessary in a panic event. But notice
> + * that this same unloading procedure is executed in the Hyper-V
> + * crash_shutdown() handler [see hv_crash_handler()], which basically
> + * means that we can postpone its execution if we have kdump set,
> + * since it will run the crash_shutdown() handler anyway. Even more
> + * intrincated is the relation of this notifier with Hyper-V framebuffer

s/intrincated/intricate/

> + * panic notifier - we need vmbus connection alive there in order to
> + * succeed, so we need to order both with each other [for reference see
> + * hvfb_on_panic()] - this is done using notifiers' priorities.
> + */
> +static int hv_panic_vmbus_unload(struct notifier_block *nb, unsigned long val,
>  			      void *args)
> +{
> +	if (!kexec_crash_loaded())

I'm not clear on the purpose of this condition.  I think it means
we will skip the vmbus_initiate_unload() if a panic occurs in the
kdump kernel.  Is there a reason a panic in the kdump kernel
should be treated differently?  Or am I misunderstanding?

> +		vmbus_initiate_unload(true);
> +
> +	return NOTIFY_DONE;
> +}
> +static struct notifier_block hyperv_panic_vmbus_unload_block = {
> +	.notifier_call	= hv_panic_vmbus_unload,
> +	.priority	= INT_MIN + 1, /* almost the latest one to execute */
> +};
> +
> +/*
> + * The following callback works both as die and panic notifier; its
> + * goal is to provide panic information to the hypervisor unless the
> + * kmsg dumper is gonna be used [see hv_kmsg_dump()], which provides
> + * more information but is not always available.
> + *
> + * Notice that both the panic/die report notifiers are registered only
> + * if we have the capability HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE set.
> + */
> +static int hv_die_panic_notify_crash(struct notifier_block *nb,
> +				     unsigned long val, void *args)
>  {
>  	struct pt_regs *regs;
> +	bool is_die;
> 
> -	vmbus_initiate_unload(true);
> -
> -	/*
> -	 * Hyper-V should be notified only once about a panic.  If we will be
> -	 * doing hv_kmsg_dump() with kmsg data later, don't do the notification
> -	 * here.
> -	 */
> -	if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE
> -	    && hyperv_report_reg()) {
> +	/* Don't notify Hyper-V unless we have a die oops event or panic. */
> +	switch (val) {
> +	case DIE_OOPS:
> +		is_die = true;
> +		regs = ((struct die_args *)args)->regs;
> +		break;
> +	case PANIC_NOTIFIER:
> +		is_die = false;
>  		regs = current_pt_regs();
> -		hyperv_report_panic(regs, val, false);
> -	}
> -	return NOTIFY_DONE;
> -}
> -
> -static int hyperv_die_event(struct notifier_block *nb, unsigned long val,
> -			    void *args)
> -{
> -	struct die_args *die = args;
> -	struct pt_regs *regs = die->regs;
> -
> -	/* Don't notify Hyper-V if the die event is other than oops */
> -	if (val != DIE_OOPS)
> +		break;
> +	default:
>  		return NOTIFY_DONE;
> +	}
> 
>  	/*
> -	 * Hyper-V should be notified only once about a panic.  If we will be
> -	 * doing hv_kmsg_dump() with kmsg data later, don't do the notification
> -	 * here.
> +	 * Hyper-V should be notified only once about a panic/die. If we will
> +	 * be calling hv_kmsg_dump() later with kmsg data, don't do the
> +	 * notification here.
>  	 */
>  	if (hyperv_report_reg())
> -		hyperv_report_panic(regs, val, true);
> +		hyperv_report_panic(regs, val, is_die);
> +
>  	return NOTIFY_DONE;
>  }
> -
> -static struct notifier_block hyperv_die_block = {
> -	.notifier_call = hyperv_die_event,
> +static struct notifier_block hyperv_die_report_block = {
> +	.notifier_call = hv_die_panic_notify_crash,
>  };
> -static struct notifier_block hyperv_panic_block = {
> -	.notifier_call = hyperv_panic_event,
> +static struct notifier_block hyperv_panic_report_block = {
> +	.notifier_call = hv_die_panic_notify_crash,
>  };
> 
>  static const char *fb_mmio_name = "fb_range";
> @@ -1589,16 +1613,17 @@ static int vmbus_bus_init(void)
>  		if (hyperv_crash_ctl & HV_CRASH_CTL_CRASH_NOTIFY_MSG)
>  			hv_kmsg_dump_register();
> 
> -		register_die_notifier(&hyperv_die_block);
> +		register_die_notifier(&hyperv_die_report_block);
> +		atomic_notifier_chain_register(&panic_notifier_list,
> +						&hyperv_panic_report_block);
>  	}
> 
>  	/*
> -	 * Always register the panic notifier because we need to unload
> -	 * the VMbus channel connection to prevent any VMbus
> -	 * activity after the VM panics.
> +	 * Always register the vmbus unload panic notifier because we
> +	 * need to shut the VMbus channel connection on panic.
>  	 */
>  	atomic_notifier_chain_register(&panic_notifier_list,
> -			       &hyperv_panic_block);
> +			       &hyperv_panic_vmbus_unload_block);
> 
>  	vmbus_request_offers();
> 
> @@ -2817,15 +2842,17 @@ static void __exit vmbus_exit(void)
> 
>  	if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
>  		kmsg_dump_unregister(&hv_kmsg_dumper);
> -		unregister_die_notifier(&hyperv_die_block);
> +		unregister_die_notifier(&hyperv_die_report_block);
> +		atomic_notifier_chain_unregister(&panic_notifier_list,
> +						&hyperv_panic_report_block);
>  	}
> 
>  	/*
> -	 * The panic notifier is always registered, hence we should
> +	 * The vmbus panic notifier is always registered, hence we should
>  	 * also unconditionally unregister it here as well.
>  	 */
>  	atomic_notifier_chain_unregister(&panic_notifier_list,
> -					 &hyperv_panic_block);
> +					&hyperv_panic_vmbus_unload_block);
> 
>  	free_page((unsigned long)hv_panic_page);
>  	unregister_sysctl_table(hv_ctl_table_hdr);
> diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
> index c8e0ea27caf1..f3494b868a64 100644
> --- a/drivers/video/fbdev/hyperv_fb.c
> +++ b/drivers/video/fbdev/hyperv_fb.c
> @@ -1244,7 +1244,15 @@ static int hvfb_probe(struct hv_device *hdev,
>  	par->fb_ready = true;
> 
>  	par->synchronous_fb = false;
> +
> +	/*
> +	 * We need to be sure this panic notifier runs _before_ the
> +	 * vmbus disconnect, so order it by priority. It must execute
> +	 * before the function hv_panic_vmbus_unload() [drivers/hv/vmbus_drv.c],
> +	 * which is almost at the end of list, with priority = INT_MIN + 1.
> +	 */
>  	par->hvfb_panic_nb.notifier_call = hvfb_on_panic;
> +	par->hvfb_panic_nb.priority = INT_MIN + 10,
>  	atomic_notifier_chain_register(&panic_notifier_list,
>  				       &par->hvfb_panic_nb);
> 
> --
> 2.36.0
Michael Kelley April 29, 2022, 5:53 p.m. UTC | #18
From: Guilherme G. Piccoli <gpiccoli@igalia.com> Sent: Wednesday, April 27, 2022 3:49 PM
> 
> The panic() function is somewhat convoluted - a lot of changes were
> made over the years, adding comments that might be misleading/outdated
> now, it has a code structure that is a bit complex to follow, with
> lots of conditionals, for example. The panic notifier list is something
> else - a single list, with multiple callbacks of different purposes,
> that run in a non-deterministic order and may affect hardly kdump
> reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.
> 
> This patch proposes a major refactor on the panic path based on Petr's
> idea [0] - basically we split the notifiers list in three, having a set
> of different call points in the panic path. Below a list of changes
> proposed in this patch, culminating in the panic notifiers level
> concept:
> 
> (a) First of all, we improved comments all over the function
> and removed useless variables / includes. Also, as part of this
> clean-up we concentrate the console flushing functions in a helper.
> 
> (b) As mentioned before, there is a split of the panic notifier list
> in three, based on the purpose of the callback. The code contains
> good documentation in form of comments, but a summary of the three
> lists follows:
> 
> - the hypervisor list aims low-risk procedures to inform hypervisors
> or firmware about the panic event, also includes LED-related functions;
> 
> - the informational list contains callbacks that provide more details,
> like kernel offset or trace dump (if enabled) and also includes the
> callbacks aimed at reducing log pollution or warns, like the RCU and
> hung task disable callbacks;
> 
> - finally, the pre_reboot list is the old notifier list renamed,
> containing the more risky callbacks that didn't fit the previous
> lists. There is also a 4th list (the post_reboot one), but it's not
> related with the original list - it contains late time architecture
> callbacks aimed at stopping the machine, for example.
> 
> The 3 notifiers lists execute in different moments, hypervisor being
> the first, followed by informational and finally the pre_reboot list.
> 
> (c) But then, there is the ordering problem of the notifiers against
> the crash_kernel() call - kdump must be as reliable as possible.
> For that, a simple binary "switch" as "crash_kexec_post_notifiers"
> is not enough, hence we introduce here concept of panic notifier
> levels: there are 5 levels, from 0 (no notifier executes before
> kdump) until 4 (all notifiers run before kdump); the default level
> is 2, in which the hypervisor and (iff we have any kmsg dumper)
> the informational notifiers execute before kdump.
> 
> The detailed documentation of the levels is present in code comments
> and in the kernel-parameters.txt file; as an analogy with the previous
> panic() implementation, the level 0 is exactly the same as the old
> behavior of notifiers, running all after kdump, and the level 4 is
> the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as
> a deprecated one).
> 
> (d) Finally, an important change made here: we now use only the
> function "crash_smp_send_stop()" to shut all the secondary CPUs
> in the panic path. Before, there was a case of using the regular
> "smp_send_stop()", but the better approach is to simplify the
> code and try to use the function which was created exclusively
> for the panic path. Experiments showed that it works fine, and
> code was very simplified with that.
> 
> Functional change is expected from this refactor, since now we
> call some notifiers by default before kdump, but the goal here
> besides code clean-up is to have a better panic path, more
> reliable and deterministic, but also very customizable.
> 
> [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/
> 
> Suggested-by: Petr Mladek <pmladek@suse.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
> 
> Special thanks to Petr and Baoquan for the suggestion and feedback in a previous
> email thread. There's some important design decisions that worth mentioning and
> discussing:
> 
> * The default panic notifiers level is 2, based on Petr Mladek's suggestion,
> which makes a lot of sense. Of course, this is customizable through the
> parameter, but would be something worthwhile to have a KConfig option to set
> the default level? It would help distros that want the old behavior
> (no notifiers before kdump) as default.
> 
> * The implementation choice was to _avoid_ intricate if conditionals in the
> panic path, which would _definitely_ be present with the panic notifiers levels
> idea; so, instead of lots of if conditionals, the set/clear bits approach with
> functions called in 2 points (but executing only in one of them) is much easier
> to follow an was used here; the ordering helper function and the comments also
> help a lot to avoid confusion (hopefully).
> 
> * Choice was to *always* use crash_smp_send_stop() instead of sometimes making
> use of the regular smp_send_stop(); for most architectures they are the same,
> including Xen (on x86). For the ones that override it, all should work fine,
> in the powerpc case it's even more correct (see the subsequent patch
> "powerpc: Do not force all panic notifiers to execute before kdump")
> 
> There seems to be 2 cases that requires some plumbing to work 100% right:
> - ARM doesn't disable local interrupts / FIQs in the crash version of
> send_stop(); we patched that early in this series;
> - x86 could face an issue if we have VMX and do use crash_smp_send_stop()
> _without_ kdump, but this is fixed in the first patch of the series (and
> it's a bug present even before this refactor).
> 
> * Notice we didn't add a sysrq for panic notifiers level - should have it?
> Alejandro proposed recently to add a sysrq for "crash_kexec_post_notifiers",
> let me know if you feel the need here Alejandro, since the core parameters are
> present in /sys, I didn't consider much gain in having a sysrq, but of course
> I'm open to suggestions!
> 
> Thanks advance for the review!
> 
>  .../admin-guide/kernel-parameters.txt         |  42 ++-
>  include/linux/panic_notifier.h                |   1 +
>  kernel/kexec_core.c                           |   8 +-
>  kernel/panic.c                                | 292 +++++++++++++-----
>  .../selftests/pstore/pstore_crash_test        |   5 +-
>  5 files changed, 252 insertions(+), 96 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> index 3f1cc5e317ed..8d3524060ce3 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -829,6 +829,13 @@
>  			It will be ignored when crashkernel=X,high is not used
>  			or memory reserved is below 4G.
> 
> +	crash_kexec_post_notifiers
> +			This was DEPRECATED - users should always prefer the
> +			parameter "panic_notifiers_level" - check its entry
> +			in this documentation for details on how it works.
> +			Setting this parameter is exactly the same as setting
> +			"panic_notifiers_level=4".
> +
>  	cryptomgr.notests
>  			[KNL] Disable crypto self-tests
> 
> @@ -3784,6 +3791,33 @@
>  			timeout < 0: reboot immediately
>  			Format: <timeout>
> 
> +	panic_notifiers_level=
> +			[KNL] Set the panic notifiers execution order.
> +			Format: <unsigned int>
> +			We currently have 4 lists of panic notifiers; based
> +			on the functionality and risk (for panic success) the
> +			callbacks are added in a given list. The lists are:
> +			- hypervisor/FW notification list (low risk);
> +			- informational list (low/medium risk);
> +			- pre_reboot list (higher risk);
> +			- post_reboot list (only run late in panic and after
> +			kdump, not configurable for now).
> +			This parameter defines the ordering of the first 3
> +			lists with regards to kdump; the levels determine
> +			which set of notifiers execute before kdump. The
> +			accepted levels are:
> +			0: kdump is the first thing to run, NO list is
> +			executed before kdump.
> +			1: only the hypervisor list is executed before kdump.
> +			2 (default level): the hypervisor list and (*if*
> +			there's any kmsg_dumper defined) the informational
> +			list are executed before kdump.
> +			3: both the hypervisor and the informational lists
> +			(always) execute before kdump.

I'm not clear on why level 2 exists.  What is the scenario where
execution of the info list before kdump should be conditional on the
existence of a kmsg_dumper?   Maybe the scenario is described
somewhere in the patch set and I just missed it.

> +			4: the 3 lists (hypervisor, info and pre_reboot)
> +			execute before kdump - this behavior is analog to the
> +			deprecated parameter "crash_kexec_post_notifiers".
> +
>  	panic_print=	Bitmask for printing system info when panic happens.
>  			User can chose combination of the following bits:
>  			bit 0: print all tasks info
> @@ -3814,14 +3848,6 @@
>  	panic_on_warn	panic() instead of WARN().  Useful to cause kdump
>  			on a WARN().
> 
> -	crash_kexec_post_notifiers
> -			Run kdump after running panic-notifiers and dumping
> -			kmsg. This only for the users who doubt kdump always
> -			succeeds in any situation.
> -			Note that this also increases risks of kdump failure,
> -			because some panic notifiers can make the crashed
> -			kernel more unstable.
> -
>  	parkbd.port=	[HW] Parallel port number the keyboard adapter is
>  			connected to, default is 0.
>  			Format: <parport#>
> diff --git a/include/linux/panic_notifier.h b/include/linux/panic_notifier.h
> index bcf6a5ea9d7f..b5041132321d 100644
> --- a/include/linux/panic_notifier.h
> +++ b/include/linux/panic_notifier.h
> @@ -10,6 +10,7 @@ extern struct atomic_notifier_head panic_info_list;
>  extern struct atomic_notifier_head panic_pre_reboot_list;
>  extern struct atomic_notifier_head panic_post_reboot_list;
> 
> +bool panic_notifiers_before_kdump(void);
>  extern bool crash_kexec_post_notifiers;
> 
>  enum panic_notifier_val {
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 68480f731192..f8906db8ca72 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -74,11 +74,11 @@ struct resource crashk_low_res = {
>  int kexec_should_crash(struct task_struct *p)
>  {
>  	/*
> -	 * If crash_kexec_post_notifiers is enabled, don't run
> -	 * crash_kexec() here yet, which must be run after panic
> -	 * notifiers in panic().
> +	 * In case any panic notifiers are set to execute before kexec,
> +	 * don't run crash_kexec() here yet, which must run after these
> +	 * panic notifiers are executed, in the panic() path.
>  	 */
> -	if (crash_kexec_post_notifiers)
> +	if (panic_notifiers_before_kdump())
>  		return 0;
>  	/*
>  	 * There are 4 panic() calls in make_task_dead() path, each of which
> diff --git a/kernel/panic.c b/kernel/panic.c
> index bf792102b43e..b7c055d4f87f 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -15,7 +15,6 @@
>  #include <linux/kgdb.h>
>  #include <linux/kmsg_dump.h>
>  #include <linux/kallsyms.h>
> -#include <linux/notifier.h>
>  #include <linux/vt_kern.h>
>  #include <linux/module.h>
>  #include <linux/random.h>
> @@ -52,14 +51,23 @@ static unsigned long tainted_mask =
>  static int pause_on_oops;
>  static int pause_on_oops_flag;
>  static DEFINE_SPINLOCK(pause_on_oops_lock);
> -bool crash_kexec_post_notifiers;
> +
>  int panic_on_warn __read_mostly;
> +bool panic_on_taint_nousertaint;
>  unsigned long panic_on_taint;
> -bool panic_on_taint_nousertaint = false;
> 
>  int panic_timeout = CONFIG_PANIC_TIMEOUT;
>  EXPORT_SYMBOL_GPL(panic_timeout);
> 
> +/* Initialized with all notifiers set to run before kdump */
> +static unsigned long panic_notifiers_bits = 15;
> +
> +/* Default level is 2, see kernel-parameters.txt */
> +unsigned int panic_notifiers_level = 2;
> +
> +/* DEPRECATED in favor of panic_notifiers_level */
> +bool crash_kexec_post_notifiers;
> +
>  #define PANIC_PRINT_TASK_INFO		0x00000001
>  #define PANIC_PRINT_MEM_INFO		0x00000002
>  #define PANIC_PRINT_TIMER_INFO		0x00000004
> @@ -109,10 +117,14 @@ void __weak nmi_panic_self_stop(struct pt_regs *regs)
>  }
> 
>  /*
> - * Stop other CPUs in panic.  Architecture dependent code may override this
> - * with more suitable version.  For example, if the architecture supports
> - * crash dump, it should save registers of each stopped CPU and disable
> - * per-CPU features such as virtualization extensions.
> + * Stop other CPUs in panic context.
> + *
> + * Architecture dependent code may override this with more suitable version.
> + * For example, if the architecture supports crash dump, it should save the
> + * registers of each stopped CPU and disable per-CPU features such as
> + * virtualization extensions. When not overridden in arch code (and for
> + * x86/xen), this is exactly the same as execute smp_send_stop(), but
> + * guarded against duplicate execution.
>   */
>  void __weak crash_smp_send_stop(void)
>  {
> @@ -183,6 +195,112 @@ static void panic_print_sys_info(bool console_flush)
>  		ftrace_dump(DUMP_ALL);
>  }
> 
> +/*
> + * Helper that accumulates all console flushing routines executed on panic.
> + */
> +static void console_flushing(void)
> +{
> +#ifdef CONFIG_VT
> +	unblank_screen();
> +#endif
> +	console_unblank();
> +
> +	/*
> +	 * In this point, we may have disabled other CPUs, hence stopping the
> +	 * CPU holding the lock while still having some valuable data in the
> +	 * console buffer.
> +	 *
> +	 * Try to acquire the lock then release it regardless of the result.
> +	 * The release will also print the buffers out. Locks debug should
> +	 * be disabled to avoid reporting bad unlock balance when panic()
> +	 * is not being called from OOPS.
> +	 */
> +	debug_locks_off();
> +	console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> +
> +	panic_print_sys_info(true);
> +}
> +
> +#define PN_HYPERVISOR_BIT	0
> +#define PN_INFO_BIT		1
> +#define PN_PRE_REBOOT_BIT	2
> +#define PN_POST_REBOOT_BIT	3
> +
> +/*
> + * Determine the order of panic notifiers with regards to kdump.
> + *
> + * This function relies in the "panic_notifiers_level" kernel parameter
> + * to determine how to order the notifiers with regards to kdump. We
> + * have currently 5 levels. For details, please check the kernel docs for
> + * "panic_notifiers_level" at Documentation/admin-guide/kernel-parameters.txt.
> + *
> + * Default level is 2, which means the panic hypervisor and informational
> + * (unless we don't have any kmsg_dumper) lists will execute before kdump.
> + */
> +static void order_panic_notifiers_and_kdump(void)
> +{
> +	/*
> +	 * The parameter "crash_kexec_post_notifiers" is deprecated, but
> +	 * valid. Users that set it want really all panic notifiers to
> +	 * execute before kdump, so it's effectively the same as setting
> +	 * the panic notifiers level to 4.
> +	 */
> +	if (panic_notifiers_level >= 4 || crash_kexec_post_notifiers)
> +		return;
> +
> +	/*
> +	 * Based on the level configured (smaller than 4), we clear the
> +	 * proper bits in "panic_notifiers_bits". Notice that this bitfield
> +	 * is initialized with all notifiers set.
> +	 */
> +	switch (panic_notifiers_level) {
> +	case 3:
> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> +		break;
> +	case 2:
> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> +
> +		if (!kmsg_has_dumpers())
> +			clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> +		break;
> +	case 1:
> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> +		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> +		break;
> +	case 0:
> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> +		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> +		clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
> +		break;
> +	}

I think the above switch statement could be done as follows:

if (panic_notifiers_level <= 3)
	clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
if (panic_notifiers_level <= 2)
	if (!kmsg_has_dumpers())
		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
if (panic_notifiers_level <=1)
	clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
if (panic_notifiers_level == 0)
	clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);

That's about half the lines of code.  It's somewhat a matter of style,
so treat this as just a suggestion to consider.  I just end up looking
for a better solution when I see the same line of code repeated
3 or 4 times!

> +}
> +
> +/*
> + * Set of helpers to execute the panic notifiers only once.
> + * Just the informational notifier cares about the return.
> + */
> +static inline bool notifier_run_once(struct atomic_notifier_head head,
> +				     char *buf, long bit)
> +{
> +	if (test_and_change_bit(bit, &panic_notifiers_bits)) {
> +		atomic_notifier_call_chain(&head, PANIC_NOTIFIER, buf);
> +		return true;
> +	}
> +	return false;
> +}
> +
> +#define panic_notifier_hypervisor_once(buf)\
> +	notifier_run_once(panic_hypervisor_list, buf, PN_HYPERVISOR_BIT)
> +
> +#define panic_notifier_info_once(buf)\
> +	notifier_run_once(panic_info_list, buf, PN_INFO_BIT)
> +
> +#define panic_notifier_pre_reboot_once(buf)\
> +	notifier_run_once(panic_pre_reboot_list, buf, PN_PRE_REBOOT_BIT)
> +
> +#define panic_notifier_post_reboot_once(buf)\
> +	notifier_run_once(panic_post_reboot_list, buf, PN_POST_REBOOT_BIT)
> +
>  /**
>   *	panic - halt the system
>   *	@fmt: The text string to print
> @@ -198,32 +316,29 @@ void panic(const char *fmt, ...)
>  	long i, i_next = 0, len;
>  	int state = 0;
>  	int old_cpu, this_cpu;
> -	bool _crash_kexec_post_notifiers = crash_kexec_post_notifiers;
> 
> -	if (panic_on_warn) {
> -		/*
> -		 * This thread may hit another WARN() in the panic path.
> -		 * Resetting this prevents additional WARN() from panicking the
> -		 * system on this thread.  Other threads are blocked by the
> -		 * panic_mutex in panic().
> -		 */
> -		panic_on_warn = 0;
> -	}
> +	/*
> +	 * This thread may hit another WARN() in the panic path, so
> +	 * resetting this option prevents additional WARN() from
> +	 * re-panicking the system here.
> +	 */
> +	panic_on_warn = 0;
> 
>  	/*
>  	 * Disable local interrupts. This will prevent panic_smp_self_stop
> -	 * from deadlocking the first cpu that invokes the panic, since
> -	 * there is nothing to prevent an interrupt handler (that runs
> -	 * after setting panic_cpu) from invoking panic() again.
> +	 * from deadlocking the first cpu that invokes the panic, since there
> +	 * is nothing to prevent an interrupt handler (that runs after setting
> +	 * panic_cpu) from invoking panic() again. Also disables preemption
> +	 * here - notice it's not safe to rely on interrupt disabling to avoid
> +	 * preemption, since any cond_resched() or cond_resched_lock() might
> +	 * trigger a reschedule if the preempt count is 0 (for reference, see
> +	 * Documentation/locking/preempt-locking.rst). Some functions called
> +	 * from here want preempt disabled, so no point enabling it later.
>  	 */
>  	local_irq_disable();
>  	preempt_disable_notrace();
> 
>  	/*
> -	 * It's possible to come here directly from a panic-assertion and
> -	 * not have preempt disabled. Some functions called from here want
> -	 * preempt to be disabled. No point enabling it later though...
> -	 *
>  	 * Only one CPU is allowed to execute the panic code from here. For
>  	 * multiple parallel invocations of panic, all other CPUs either
>  	 * stop themself or will wait until they are stopped by the 1st CPU
> @@ -266,73 +381,75 @@ void panic(const char *fmt, ...)
>  	kgdb_panic(buf);
> 
>  	/*
> -	 * If we have crashed and we have a crash kernel loaded let it handle
> -	 * everything else.
> -	 * If we want to run this after calling panic_notifiers, pass
> -	 * the "crash_kexec_post_notifiers" option to the kernel.
> +	 * Here lies one of the most subtle parts of the panic path,
> +	 * the panic notifiers and their order with regards to kdump.
> +	 * We currently have 4 sets of notifiers:
>  	 *
> -	 * Bypass the panic_cpu check and call __crash_kexec directly.
> +	 *  - the hypervisor list is composed by callbacks that are related
> +	 *  to warn the FW / hypervisor about panic, or non-invasive LED
> +	 *  controlling functions - (hopefully) low-risk for kdump, should
> +	 *  run early if possible.
> +	 *
> +	 *  - the informational list is composed by functions dumping data
> +	 *  like kernel offsets, device error registers or tracing buffer;
> +	 *  also log flooding prevention callbacks fit in this list. It is
> +	 *  relatively safe to run before kdump.
> +	 *
> +	 *  - the pre_reboot list basically is everything else, all the
> +	 *  callbacks that don't fit in the 2 previous lists. It should
> +	 *  run *after* kdump if possible, as it contains high-risk
> +	 *  functions that may break kdump.
> +	 *
> +	 *  - we also have a 4th list of notifiers, the post_reboot
> +	 *  callbacks. This is not strongly related to kdump since it's
> +	 *  always executed late in the panic path, after the restart
> +	 *  mechanism (if set); its goal is to provide a way for
> +	 *  architecture code effectively power-off/disable the system.
> +	 *
> +	 *  The kernel provides the "panic_notifiers_level" parameter
> +	 *  to adjust the ordering in which these notifiers should run
> +	 *  with regards to kdump - the default level is 2, so both the
> +	 *  hypervisor and informational notifiers should execute before
> +	 *  the __crash_kexec(); the info notifier won't run by default
> +	 *  unless there's some kmsg_dumper() registered. For details
> +	 *  about it, check Documentation/admin-guide/kernel-parameters.txt.
> +	 *
> +	 *  Notice that the code relies in bits set/clear operations to
> +	 *  determine the ordering, functions *_once() execute only one
> +	 *  time, as their name implies. The goal is to prevent too much
> +	 *  if conditionals and more confusion. Finally, regarding CPUs
> +	 *  disabling: unless NO panic notifier executes before kdump,
> +	 *  we always disable secondary CPUs before __crash_kexec() and
> +	 *  the notifiers execute.
>  	 */
> -	if (!_crash_kexec_post_notifiers) {
> +	order_panic_notifiers_and_kdump();
> +
> +	/* If no level, we should kdump ASAP. */
> +	if (!panic_notifiers_level)
>  		__crash_kexec(NULL);
> 
> -		/*
> -		 * Note smp_send_stop is the usual smp shutdown function, which
> -		 * unfortunately means it may not be hardened to work in a
> -		 * panic situation.
> -		 */
> -		smp_send_stop();
> -	} else {
> -		/*
> -		 * If we want to do crash dump after notifier calls and
> -		 * kmsg_dump, we will need architecture dependent extra
> -		 * works in addition to stopping other CPUs.
> -		 */
> -		crash_smp_send_stop();
> -	}
> +	crash_smp_send_stop();
> +	panic_notifier_hypervisor_once(buf);
> 
> -	/*
> -	 * Run any panic handlers, including those that might need to
> -	 * add information to the kmsg dump output.
> -	 */
> -	atomic_notifier_call_chain(&panic_hypervisor_list, PANIC_NOTIFIER, buf);
> -	atomic_notifier_call_chain(&panic_info_list, PANIC_NOTIFIER, buf);
> -	atomic_notifier_call_chain(&panic_pre_reboot_list, PANIC_NOTIFIER, buf);
> +	if (panic_notifier_info_once(buf)) {
> +		panic_print_sys_info(false);
> +		kmsg_dump(KMSG_DUMP_PANIC);
> +	}
> 
> -	panic_print_sys_info(false);
> +	panic_notifier_pre_reboot_once(buf);
> 
> -	kmsg_dump(KMSG_DUMP_PANIC);
> +	__crash_kexec(NULL);
> 
> -	/*
> -	 * If you doubt kdump always works fine in any situation,
> -	 * "crash_kexec_post_notifiers" offers you a chance to run
> -	 * panic_notifiers and dumping kmsg before kdump.
> -	 * Note: since some panic_notifiers can make crashed kernel
> -	 * more unstable, it can increase risks of the kdump failure too.
> -	 *
> -	 * Bypass the panic_cpu check and call __crash_kexec directly.
> -	 */
> -	if (_crash_kexec_post_notifiers)
> -		__crash_kexec(NULL);
> +	panic_notifier_hypervisor_once(buf);
> 
> -#ifdef CONFIG_VT
> -	unblank_screen();
> -#endif
> -	console_unblank();
> -
> -	/*
> -	 * We may have ended up stopping the CPU holding the lock (in
> -	 * smp_send_stop()) while still having some valuable data in the console
> -	 * buffer.  Try to acquire the lock then release it regardless of the
> -	 * result.  The release will also print the buffers out.  Locks debug
> -	 * should be disabled to avoid reporting bad unlock balance when
> -	 * panic() is not being callled from OOPS.
> -	 */
> -	debug_locks_off();
> -	console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> +	if (panic_notifier_info_once(buf)) {
> +		panic_print_sys_info(false);
> +		kmsg_dump(KMSG_DUMP_PANIC);
> +	}
> 
> -	panic_print_sys_info(true);
> +	panic_notifier_pre_reboot_once(buf);
> 
> +	console_flushing();
>  	if (!panic_blink)
>  		panic_blink = no_blink;
> 
> @@ -363,8 +480,7 @@ void panic(const char *fmt, ...)
>  		emergency_restart();
>  	}
> 
> -	atomic_notifier_call_chain(&panic_post_reboot_list,
> -				   PANIC_NOTIFIER, buf);
> +	panic_notifier_post_reboot_once(buf);
> 
>  	pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);
> 
> @@ -383,6 +499,15 @@ void panic(const char *fmt, ...)
> 
>  EXPORT_SYMBOL(panic);
> 
> +/*
> + * Helper used in the kexec code, to validate if any
> + * panic notifier is set to execute early, before kdump.
> + */
> +inline bool panic_notifiers_before_kdump(void)
> +{
> +	return panic_notifiers_level || crash_kexec_post_notifiers;
> +}
> +
>  /*
>   * TAINT_FORCED_RMMOD could be a per-module flag but the module
>   * is being removed anyway.
> @@ -692,6 +817,9 @@ core_param(panic, panic_timeout, int, 0644);
>  core_param(panic_print, panic_print, ulong, 0644);
>  core_param(pause_on_oops, pause_on_oops, int, 0644);
>  core_param(panic_on_warn, panic_on_warn, int, 0644);
> +core_param(panic_notifiers_level, panic_notifiers_level, uint, 0644);
> +
> +/* DEPRECATED in favor of panic_notifiers_level */
>  core_param(crash_kexec_post_notifiers, crash_kexec_post_notifiers, bool, 0644);
> 
>  static int __init oops_setup(char *s)
> diff --git a/tools/testing/selftests/pstore/pstore_crash_test
> b/tools/testing/selftests/pstore/pstore_crash_test
> index 2a329bbb4aca..1e60ce4501aa 100755
> --- a/tools/testing/selftests/pstore/pstore_crash_test
> +++ b/tools/testing/selftests/pstore/pstore_crash_test
> @@ -25,6 +25,7 @@ touch $REBOOT_FLAG
>  sync
> 
>  # cause crash
> -# Note: If you use kdump and want to see kmesg-* files after reboot, you should
> -#       specify 'crash_kexec_post_notifiers' in 1st kernel's cmdline.
> +# Note: If you use kdump and want to see kmsg-* files after reboot, you should
> +#       be sure that the parameter "panic_notifiers_level" is more than '2' (the
> +#       default value for this parameter is '2') in the first kernel's cmdline.
>  echo c > /proc/sysrq-trigger
> --
> 2.36.0
Marc Zyngier April 29, 2022, 6:20 p.m. UTC | #19
On Wed, 27 Apr 2022 23:48:56 +0100,
"Guilherme G. Piccoli" <gpiccoli@igalia.com> wrote:
> 
> Currently the regular CPU shutdown path for ARM disables IRQs/FIQs
> in the secondary CPUs - smp_send_stop() calls ipi_cpu_stop(), which
> is responsible for that. This makes sense, since we're turning off
> such CPUs, putting them in an endless busy-wait loop.
> 
> Problem is that there is an alternative path for disabling CPUs,
> in the form of function crash_smp_send_stop(), used for kexec/panic
> paths. This functions relies in a SMP call that also triggers a
> busy-wait loop [at machine_crash_nonpanic_core()], but *without*
> disabling interrupts. This might lead to odd scenarios, like early
> interrupts in the boot of kexec'd kernel or even interrupts in
> other CPUs while the main one still works in the panic path and
> assumes all secondary CPUs are (really!) off.
> 
> This patch mimics the ipi_cpu_stop() interrupt disable mechanism
> in the crash CPU shutdown path, hence disabling IRQs/FIQs in all
> secondary CPUs in the kexec/panic path as well.
> 
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Russell King <linux@armlinux.org.uk>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
>  arch/arm/kernel/machine_kexec.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c
> index f567032a09c0..ef788ee00519 100644
> --- a/arch/arm/kernel/machine_kexec.c
> +++ b/arch/arm/kernel/machine_kexec.c
> @@ -86,6 +86,9 @@ void machine_crash_nonpanic_core(void *unused)
>  	set_cpu_online(smp_processor_id(), false);
>  	atomic_dec(&waiting_for_crash_ipi);
>  
> +	local_fiq_disable();
> +	local_irq_disable();
> +

My expectations would be that, since we're getting here using an IPI,
interrupts are already masked. So what reenabled them the first place?

Thanks,

	M.
Guilherme G. Piccoli April 29, 2022, 7:34 p.m. UTC | #20
On 29/04/2022 13:04, Max Filippov wrote:
> [...]
>>  arch/xtensa/platforms/iss/setup.c     |  4 ++--For xtensa:
> 
> For xtensa:
> Acked-by: Max Filippov <jcmvbkbc@gmail.com>
> 

Perfect, thanks Max!
Cheers,


Guilherme
Guilherme G. Piccoli April 29, 2022, 7:38 p.m. UTC | #21
On 27/04/2022 22:01, Xiaoming Ni wrote:
> [...]
> Duplicate Code.
> 
> Is it better to use __func__ and %pS?
> 
> pr_info("%s: %pS\n", __func__, n->notifier_call);
> 
> 

This is a great suggestion Xiaoming, much appreciated!
I feel like reinventing the wheel here - with your idea, code was super
clear and concise, very nice suggestion!!

The only 2 things that diverge from your idea: I'm using '%ps' (not
showing offsets) and also, kept the wording "(un)registered/calling",
not using __func__ - I feel it's a bit odd in the output.
OK for you?

I'm definitely using your idea in V2 heh
Cheers,


Guilherme
Guilherme G. Piccoli April 29, 2022, 8:38 p.m. UTC | #22
On 29/04/2022 14:53, Michael Kelley (LINUX) wrote:
> From: Guilherme G. Piccoli <gpiccoli@igalia.com> Sent: Wednesday, April 27, 2022 3:49 PM
>> [...]
>> +	panic_notifiers_level=
>> +			[KNL] Set the panic notifiers execution order.
>> +			Format: <unsigned int>
>> +			We currently have 4 lists of panic notifiers; based
>> +			on the functionality and risk (for panic success) the
>> +			callbacks are added in a given list. The lists are:
>> +			- hypervisor/FW notification list (low risk);
>> +			- informational list (low/medium risk);
>> +			- pre_reboot list (higher risk);
>> +			- post_reboot list (only run late in panic and after
>> +			kdump, not configurable for now).
>> +			This parameter defines the ordering of the first 3
>> +			lists with regards to kdump; the levels determine
>> +			which set of notifiers execute before kdump. The
>> +			accepted levels are:
>> +			0: kdump is the first thing to run, NO list is
>> +			executed before kdump.
>> +			1: only the hypervisor list is executed before kdump.
>> +			2 (default level): the hypervisor list and (*if*
>> +			there's any kmsg_dumper defined) the informational
>> +			list are executed before kdump.
>> +			3: both the hypervisor and the informational lists
>> +			(always) execute before kdump.
> 
> I'm not clear on why level 2 exists.  What is the scenario where
> execution of the info list before kdump should be conditional on the
> existence of a kmsg_dumper?   Maybe the scenario is described
> somewhere in the patch set and I just missed it.
> 

Hi Michael, thanks for your review/consideration. So, this idea started
kind of some time ago. It all started with a need of exposing more
information on kernel log *before* kdump and *before* pstore -
specifically, we're talking about panic_print. But this cause some
reactions, Baoquan was very concerned with that [0]. Soon after, I've
proposed a panic notifiers filter (orthogonal) approach, to which Petr
suggested instead doing a major refactor [1] - it finally is alive in
the form of this series.

The theory behind the level 2 is to allow a scenario of kdump with the
minimum amount of notifiers - what is the point in printing more
information if the user doesn't care, since it's going to kdump? Now, if
there is a kmsg dumper, it means that there is likely some interest in
collecting information, and that might as well be required before the
potential kdump (which is my case, hence the proposal on [0]).

Instead of forcing one of the two behaviors (level 1 or level 3), we
have a middle-term/compromise: if there's interest in collecting such
data (in the form of a kmsg dumper), we then execute the informational
notifiers before kdump. If not, why to increase (even slightly) the risk
for kdump?

I'm OK in removing the level 2 if people prefer, but I don't feel it's a
burden, quite opposite - seems a good way to accommodate the somewhat
antagonistic ideas (jump to kdump ASAP vs collecting more info in the
panicked kernel log).

[0] https://lore.kernel.org/lkml/20220126052246.GC2086@MiWiFi-R3L-srv/

[1] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/


>[...]
>> +	 * Based on the level configured (smaller than 4), we clear the
>> +	 * proper bits in "panic_notifiers_bits". Notice that this bitfield
>> +	 * is initialized with all notifiers set.
>> +	 */
>> +	switch (panic_notifiers_level) {
>> +	case 3:
>> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> +		break;
>> +	case 2:
>> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> +
>> +		if (!kmsg_has_dumpers())
>> +			clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
>> +		break;
>> +	case 1:
>> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> +		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
>> +		break;
>> +	case 0:
>> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> +		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
>> +		clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
>> +		break;
>> +	}
> 
> I think the above switch statement could be done as follows:
> 
> if (panic_notifiers_level <= 3)
> 	clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> if (panic_notifiers_level <= 2)
> 	if (!kmsg_has_dumpers())
> 		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> if (panic_notifiers_level <=1)
> 	clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> if (panic_notifiers_level == 0)
> 	clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
> 
> That's about half the lines of code.  It's somewhat a matter of style,
> so treat this as just a suggestion to consider.  I just end up looking
> for a better solution when I see the same line of code repeated
> 3 or 4 times!
> 

It's a good idea - I liked your code. The switch seems more
natural/explicit for me, even duplicating some lines, but in case more
people prefer your way, I can definitely change the code - thanks for
the suggestion.
Cheers,


Guilherme
Guilherme G. Piccoli April 29, 2022, 9:38 p.m. UTC | #23
Thanks Marc and Michael for the review/discussion.

On 29/04/2022 15:20, Marc Zyngier wrote:
> [...]

> My expectations would be that, since we're getting here using an IPI,
> interrupts are already masked. So what reenabled them the first place?
> 
> Thanks,
> 
> 	M.
> 

Marc, I did some investigation in the code (and tried/failed in the ARM
documentation as well heh), but this is still not 100% clear for me.

You're saying IPI calls disable IRQs/FIQs by default in the the target
CPUs? Where does it happen? I'm a bit confused if this a processor
mechanism, or it's in code.

Looking the smp_send_stop() in arch/arm/, it does IPI the CPUs, with the
flag IPI_CPU_STOP, eventually calling ipi_cpu_stop(), and the latter
does disable IRQ/FIQ in code - that's where I stole my code from.

But crash_smp_send_stop() is different, it seems to IPI the other CPUs
with the flag IPI_CALL_FUNC, which leads to calling
generic_smp_call_function_interrupt() - does it disable interrupts/FIQs
as well? I couldn't find it.

Appreciate your clarifications about that, thanks again.
Cheers,


Guilherme
Russell King (Oracle) April 29, 2022, 9:45 p.m. UTC | #24
On Fri, Apr 29, 2022 at 06:38:19PM -0300, Guilherme G. Piccoli wrote:
> Thanks Marc and Michael for the review/discussion.
> 
> On 29/04/2022 15:20, Marc Zyngier wrote:
> > [...]
> 
> > My expectations would be that, since we're getting here using an IPI,
> > interrupts are already masked. So what reenabled them the first place?
> > 
> > Thanks,
> > 
> > 	M.
> > 
> 
> Marc, I did some investigation in the code (and tried/failed in the ARM
> documentation as well heh), but this is still not 100% clear for me.
> 
> You're saying IPI calls disable IRQs/FIQs by default in the the target
> CPUs? Where does it happen? I'm a bit confused if this a processor
> mechanism, or it's in code.

When we taken an IRQ, IRQs will be masked, FIQs will not. IPIs are
themselves interrupts, so IRQs will be masked while the IPI is being
processed. Therefore, there should be no need to re-disable the
already disabled interrupts.

> But crash_smp_send_stop() is different, it seems to IPI the other CPUs
> with the flag IPI_CALL_FUNC, which leads to calling
> generic_smp_call_function_interrupt() - does it disable interrupts/FIQs
> as well? I couldn't find it.

It's buried in the architecture behaviour. When the CPU takes an
interrupt and jumps to the interrupt vector in the vectors page, it is
architecturally defined that interrupts will be disabled. If they
weren't architecturally disabled at this point, then as soon as the
first instruction is processed (at the interrupt vector, likely a
branch) the CPU would immediately take another jump to the interrupt
vector, and this process would continue indefinitely, making interrupt
handling utterly useless.

So, you won't find an explicit instruction in the code path from the
vectors to the IPI handler that disables interrupts - because it's
written into the architecture that this is what must happen.

IRQs are a lower priority than FIQs, so FIQs remain unmasked.
Guilherme G. Piccoli April 29, 2022, 9:56 p.m. UTC | #25
On 29/04/2022 18:45, Russell King (Oracle) wrote:
> [...]
>> Marc, I did some investigation in the code (and tried/failed in the ARM
>> documentation as well heh), but this is still not 100% clear for me.
>>
>> You're saying IPI calls disable IRQs/FIQs by default in the the target
>> CPUs? Where does it happen? I'm a bit confused if this a processor
>> mechanism, or it's in code.
> 
> When we taken an IRQ, IRQs will be masked, FIQs will not. IPIs are
> themselves interrupts, so IRQs will be masked while the IPI is being
> processed. Therefore, there should be no need to re-disable the
> already disabled interrupts.
> 
>> But crash_smp_send_stop() is different, it seems to IPI the other CPUs
>> with the flag IPI_CALL_FUNC, which leads to calling
>> generic_smp_call_function_interrupt() - does it disable interrupts/FIQs
>> as well? I couldn't find it.
> 
> It's buried in the architecture behaviour. When the CPU takes an
> interrupt and jumps to the interrupt vector in the vectors page, it is
> architecturally defined that interrupts will be disabled. If they
> weren't architecturally disabled at this point, then as soon as the
> first instruction is processed (at the interrupt vector, likely a
> branch) the CPU would immediately take another jump to the interrupt
> vector, and this process would continue indefinitely, making interrupt
> handling utterly useless.
> 
> So, you won't find an explicit instruction in the code path from the
> vectors to the IPI handler that disables interrupts - because it's
> written into the architecture that this is what must happen.
> 
> IRQs are a lower priority than FIQs, so FIQs remain unmasked.
> 

Thanks a lot for the *great* explanation Russell, much appreciated.
So, this leads to the both following questions:

a) Shall we then change the patch to only disable FIQs, since it's panic
path and we don't want secondary CPUs getting interrupted, but only
spinning quietly "forever"?

b) How about cleaning ipi_cpu_stop() then, by dropping the call to
local_irq_disable() there, to avoid the double IRQ disabling?

Thanks,


Guilherme
Marc Zyngier April 29, 2022, 10 p.m. UTC | #26
On Fri, 29 Apr 2022 22:45:14 +0100,
"Russell King (Oracle)" <linux@armlinux.org.uk> wrote:
> 
> On Fri, Apr 29, 2022 at 06:38:19PM -0300, Guilherme G. Piccoli wrote:
> > Thanks Marc and Michael for the review/discussion.
> > 
> > On 29/04/2022 15:20, Marc Zyngier wrote:
> > > [...]
> > 
> > > My expectations would be that, since we're getting here using an IPI,
> > > interrupts are already masked. So what reenabled them the first place?
> > > 
> > > Thanks,
> > > 
> > > 	M.
> > > 
> > 
> > Marc, I did some investigation in the code (and tried/failed in the ARM
> > documentation as well heh), but this is still not 100% clear for me.
> > 
> > You're saying IPI calls disable IRQs/FIQs by default in the the target
> > CPUs? Where does it happen? I'm a bit confused if this a processor
> > mechanism, or it's in code.
> 
> When we taken an IRQ, IRQs will be masked, FIQs will not. IPIs are
> themselves interrupts, so IRQs will be masked while the IPI is being
> processed. Therefore, there should be no need to re-disable the
> already disabled interrupts.
> 
> > But crash_smp_send_stop() is different, it seems to IPI the other CPUs
> > with the flag IPI_CALL_FUNC, which leads to calling
> > generic_smp_call_function_interrupt() - does it disable interrupts/FIQs
> > as well? I couldn't find it.
> 
> It's buried in the architecture behaviour. When the CPU takes an
> interrupt and jumps to the interrupt vector in the vectors page, it is
> architecturally defined that interrupts will be disabled. If they
> weren't architecturally disabled at this point, then as soon as the
> first instruction is processed (at the interrupt vector, likely a
> branch) the CPU would immediately take another jump to the interrupt
> vector, and this process would continue indefinitely, making interrupt
> handling utterly useless.
> 
> So, you won't find an explicit instruction in the code path from the
> vectors to the IPI handler that disables interrupts - because it's
> written into the architecture that this is what must happen.
> 
> IRQs are a lower priority than FIQs, so FIQs remain unmasked.

Ah, you're of course right. That's one of the huge differences between
AArch32 and AArch64, where the former has per target mode masking
rules, and the later masks everything on entry...

	M.
Guilherme G. Piccoli April 29, 2022, 10:35 p.m. UTC | #27
Hi Michael, first of all thanks for the great review, much appreciated.
Some comments inline below:

On 29/04/2022 14:16, Michael Kelley (LINUX) wrote:
> [...]
>> hypervisor I/O completion), so we postpone that to run late. But more
>> relevant: this *same* vmbus unloading happens in the crash_shutdown()
>> handler, so if kdump is set, we can safely skip this panic notifier and
>> defer such clean-up to the kexec crash handler.
> 
> While the last sentence is true for Hyper-V on x86/x64, it's not true for
> Hyper-V on ARM64.  x86/x64 has the 'machine_ops' data structure
> with the ability to provide a custom crash_shutdown() function, which
> Hyper-V does in the form of hv_machine_crash_shutdown().  But ARM64
> has no mechanism to provide such a custom function that will eventually
> do the needed vmbus_initiate_unload() before running kdump.
> 
> I'm not immediately sure what the best solution is for ARM64.  At this
> point, I'm just pointing out the problem and will think about the tradeoffs
> for various possible solutions.  Please do the same yourself. :-)
> 

Oh, you're totally right! I just assumed ARM64 would the the same, my
bad. Just to propose some alternatives, so you/others can also discuss
here and we can reach a consensus about the trade-offs:

(a) We could forget about this change, and always do the clean-up here,
not relying in machine_crash_shutdown().
Pro: really simple, behaves the same as it is doing currently.
Con: less elegant/concise, doesn't allow arm64 customization.

(b) Add a way to allow ARM64 customization of shutdown crash handler.
Pro: matches x86, more customizable, improves arm64 arch code.
Con: A tad more complex.

Also, a question that came-up: if ARM64 has no way of calling special
crash shutdown handler, how can you execute hv_stimer_cleanup() and
hv_synic_disable_regs() there? Or are they not required in ARM64?


>>
>> (c) There is also a Hyper-V framebuffer panic notifier, which relies in
>> doing a vmbus operation that demands a valid connection. So, we must
>> order this notifier with the panic notifier from vmbus_drv.c, in order to
>> guarantee that the framebuffer code executes before the vmbus connection
>> is unloaded.
> 
> Patch 21 of this set puts the Hyper-V FB panic notifier on the pre_reboot
> notifier list, which means it won't execute before the VMbus connection
> unload in the case of kdump.   This notifier is making sure that Hyper-V
> is notified about the last updates made to the frame buffer before the
> panic, so maybe it needs to be put on the hypervisor notifier list.  It
> sends a message to Hyper-V over its existing VMbus channel, but it
> does not wait for a reply.  It does, however, obtain a spin lock on the
> ring buffer used to communicate with Hyper-V.   Unless someone has
> a better suggestion, I'm inclined to take the risk of blocking on that
> spin lock.

The logic behind that was: when kdump is set, we'd skip the vmbus
disconnect on notifiers, deferring that to crash_shutdown(), logic this
one refuted in the above discussion on ARM64 (one more Pro argument to
the idea of refactoring aarch64 code to allow a custom crash shutdown
handler heh). But you're right, for the default level 2, we skip the
pre_reboot notifiers on kdump, effectively skipping this notifier.

Some ideas of what we can do here:

I) we could change the framebuffer notifier to rely on trylocks, instead
of risking a lockup scenario, and with that, we can execute it before
the vmbus disconnect in the hypervisor list;

II) we ignore the hypervisor notifier in case of kdump _by default_, and
if the users don't want that, they can always set the panic notifier
level to 4 and run all notifiers prior to kdump; would that be terrible
you think? Kdump users might don't care about the framebuffer...

III) we go with approach (b) above and refactor arm64 code to allow the
custom crash handler on kdump time, then [with point (I) above] the
logic proposed in this series is still valid - seems more and more the
most correct/complete solution.

In any case, I guess we should avoid workarounds if possible and do the
things the best way we can, to encompass all (or almost all) the
possible scenarios and don't force things on users (like enforcing panic
notifier level 4 for Hyper-V or something like this...)

More feedback from you / Hyper-V folks is pretty welcome about this.


> 
>> [...]
> The "Fixes:" tags imply that these changes should be backported to older
> longterm kernel versions, which I don't think is the case.  There is a
> dependency on Patch 14 of your series where PANIC_NOTIFIER is
> introduced.
> 

Oh, this was more related with archeology of the kernel. When I'm
investigating stuff, I really want to understand why code was added and
that usually require some time git blaming stuff, so having that pronto
in the commit message is a bonus.

But of course we don't need to use the Fixes tag for that, easy to only
mention it in the text. A secondary benefit by using this tag is to
indicate this is a _real fix_ to some code, and not an improvement, but
as you say, I agree we shouldn't backport it to previous releases having
or not the Fixes tag (AFAIK it's not mandatory to backport stuff with
Fixes tag).


>> [...]
>> + * intrincated is the relation of this notifier with Hyper-V framebuffer
> 
> s/intrincated/intricate/

Thanks, fixed in V2!


>
>> [...]
>> +static int hv_panic_vmbus_unload(struct notifier_block *nb, unsigned long val,
>>  			      void *args)
>> +{
>> +	if (!kexec_crash_loaded())
> 
> I'm not clear on the purpose of this condition.  I think it means
> we will skip the vmbus_initiate_unload() if a panic occurs in the
> kdump kernel.  Is there a reason a panic in the kdump kernel
> should be treated differently?  Or am I misunderstanding?

This is really related with the point discussed in the top of this
response - I assumed both ARM64/x86_64 would behave the same and
disconnect the vmbus through the custom crash handler when kdump is set,
so worth skipping it here in the notifier. But that's not true for ARM64
as you pointed, so this guard against kexec is really part of the
decision/discussion on what to do with ARM64 heh

Cheers!
Michael Kelley May 3, 2022, 5:31 p.m. UTC | #28
From: Guilherme G. Piccoli <gpiccoli@igalia.com> Sent: Friday, April 29, 2022 1:38 PM
> 
> On 29/04/2022 14:53, Michael Kelley (LINUX) wrote:
> > From: Guilherme G. Piccoli <gpiccoli@igalia.com> Sent: Wednesday, April 27, 2022
> 3:49 PM
> >> [...]
> >> +	panic_notifiers_level=
> >> +			[KNL] Set the panic notifiers execution order.
> >> +			Format: <unsigned int>
> >> +			We currently have 4 lists of panic notifiers; based
> >> +			on the functionality and risk (for panic success) the
> >> +			callbacks are added in a given list. The lists are:
> >> +			- hypervisor/FW notification list (low risk);
> >> +			- informational list (low/medium risk);
> >> +			- pre_reboot list (higher risk);
> >> +			- post_reboot list (only run late in panic and after
> >> +			kdump, not configurable for now).
> >> +			This parameter defines the ordering of the first 3
> >> +			lists with regards to kdump; the levels determine
> >> +			which set of notifiers execute before kdump. The
> >> +			accepted levels are:
> >> +			0: kdump is the first thing to run, NO list is
> >> +			executed before kdump.
> >> +			1: only the hypervisor list is executed before kdump.
> >> +			2 (default level): the hypervisor list and (*if*
> >> +			there's any kmsg_dumper defined) the informational
> >> +			list are executed before kdump.
> >> +			3: both the hypervisor and the informational lists
> >> +			(always) execute before kdump.
> >
> > I'm not clear on why level 2 exists.  What is the scenario where
> > execution of the info list before kdump should be conditional on the
> > existence of a kmsg_dumper?   Maybe the scenario is described
> > somewhere in the patch set and I just missed it.
> >
> 
> Hi Michael, thanks for your review/consideration. So, this idea started
> kind of some time ago. It all started with a need of exposing more
> information on kernel log *before* kdump and *before* pstore -
> specifically, we're talking about panic_print. But this cause some
> reactions, Baoquan was very concerned with that [0]. Soon after, I've
> proposed a panic notifiers filter (orthogonal) approach, to which Petr
> suggested instead doing a major refactor [1] - it finally is alive in
> the form of this series.
> 
> The theory behind the level 2 is to allow a scenario of kdump with the
> minimum amount of notifiers - what is the point in printing more
> information if the user doesn't care, since it's going to kdump? Now, if
> there is a kmsg dumper, it means that there is likely some interest in
> collecting information, and that might as well be required before the
> potential kdump (which is my case, hence the proposal on [0]).
> 
> Instead of forcing one of the two behaviors (level 1 or level 3), we
> have a middle-term/compromise: if there's interest in collecting such
> data (in the form of a kmsg dumper), we then execute the informational
> notifiers before kdump. If not, why to increase (even slightly) the risk
> for kdump?
> 
> I'm OK in removing the level 2 if people prefer, but I don't feel it's a
> burden, quite opposite - seems a good way to accommodate the somewhat
> antagonistic ideas (jump to kdump ASAP vs collecting more info in the
> panicked kernel log).
> 
> [0] https://lore.kernel.org/lkml/20220126052246.GC2086@MiWiFi-R3L-srv/
> 
> [1] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/
> 

To me, it's a weak correlation between having a kmsg dumper, and
wanting or not wanting the info level output to come before kdump.
Hyper-V is one of only a few places that register a kmsg dumper, so most
Linux instances outside of Hyper-V guest (and PowerPC systems?) will have
the info level output after kdump.  It seems like anyone who cared strongly
about the info level output would set the panic_notifier_level to 1 or to 3
so that the result is more deterministic.  But that's just my opinion, and
it's probably an opinion that is not as well informed on the topic as some
others in the discussion. So keeping things as in your patch set is not a
show-stopper for me.

However, I would request a clarification in the documentation.   The
panic_notifier_level affects not only the hypervisor, informational,
and pre_reboot lists, but it also affects panic_print_sys_info() and
kmsg_dump().  Specifically, at level 1, panic_print_sys_info() and
kmsg_dump() will not be run before kdump.  At level 3, they will
always be run before kdump.  Your documentation above mentions
"informational lists" (plural), which I take to vaguely include
kmsg_dump() and panic_print_sys_info(), but being explicit about
the effect would be better.

Michael

> 
> >[...]
> >> +	 * Based on the level configured (smaller than 4), we clear the
> >> +	 * proper bits in "panic_notifiers_bits". Notice that this bitfield
> >> +	 * is initialized with all notifiers set.
> >> +	 */
> >> +	switch (panic_notifiers_level) {
> >> +	case 3:
> >> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> >> +		break;
> >> +	case 2:
> >> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> >> +
> >> +		if (!kmsg_has_dumpers())
> >> +			clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> >> +		break;
> >> +	case 1:
> >> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> >> +		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> >> +		break;
> >> +	case 0:
> >> +		clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> >> +		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> >> +		clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
> >> +		break;
> >> +	}
> >
> > I think the above switch statement could be done as follows:
> >
> > if (panic_notifiers_level <= 3)
> > 	clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> > if (panic_notifiers_level <= 2)
> > 	if (!kmsg_has_dumpers())
> > 		clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> > if (panic_notifiers_level <=1)
> > 	clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> > if (panic_notifiers_level == 0)
> > 	clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
> >
> > That's about half the lines of code.  It's somewhat a matter of style,
> > so treat this as just a suggestion to consider.  I just end up looking
> > for a better solution when I see the same line of code repeated
> > 3 or 4 times!
> >
> 
> It's a good idea - I liked your code. The switch seems more
> natural/explicit for me, even duplicating some lines, but in case more
> people prefer your way, I can definitely change the code - thanks for
> the suggestion.
> Cheers,
> 
> 
> Guilherme
Guilherme G. Piccoli May 3, 2022, 6:06 p.m. UTC | #29
On 03/05/2022 14:31, Michael Kelley (LINUX) wrote:
> [...]
> 
> To me, it's a weak correlation between having a kmsg dumper, and
> wanting or not wanting the info level output to come before kdump.
> Hyper-V is one of only a few places that register a kmsg dumper, so most
> Linux instances outside of Hyper-V guest (and PowerPC systems?) will have
> the info level output after kdump.  It seems like anyone who cared strongly
> about the info level output would set the panic_notifier_level to 1 or to 3
> so that the result is more deterministic.  But that's just my opinion, and
> it's probably an opinion that is not as well informed on the topic as some
> others in the discussion. So keeping things as in your patch set is not a
> show-stopper for me.
> 
> However, I would request a clarification in the documentation.   The
> panic_notifier_level affects not only the hypervisor, informational,
> and pre_reboot lists, but it also affects panic_print_sys_info() and
> kmsg_dump().  Specifically, at level 1, panic_print_sys_info() and
> kmsg_dump() will not be run before kdump.  At level 3, they will
> always be run before kdump.  Your documentation above mentions
> "informational lists" (plural), which I take to vaguely include
> kmsg_dump() and panic_print_sys_info(), but being explicit about
> the effect would be better.
> 
> Michael

Thanks again Michael, to express your points and concerns - great idea
of documentation improvement here, I'll do that for V2, for sure.

The idea of "defaulting" to skip the info list on kdump (if no
kmsg_dump() is set) is again a mechanism that aims at accommodating all
users and concerns of antagonistic goals, kdump vs notifier lists.

Before this patch set, by default no notifier executed before kdump. So,
the "pendulum"  was strongly on kdump side, and clearly this was a
sub-optimal decision - proof of that is that both Hyper-V / PowerPC code
forcibly set the "crash_kexec_post_notifiers". The goal here is to have
a more lightweight list that by default runs before kdump, a secondary
list that only runs before kdump if there's usage for that (either user
sets that or kmsg_dumper set is considered a valid user), and the
remaining notifiers run by default only after kdump, all of that very
customizable through the levels idea.

Now, one thing we could do to improve consistency for the hyper-v case:
having a kmsg_dump_once() helper, and *for Hyper-V only*, call it on the
hypervisor list, within the info notifier (that would be moved to
hypervisor list, ofc).
Let's wait for more feedback on that, just throwing some ideas in order
we can have everyone happy with the end-result!

Cheers,


Guilherme
Michael Kelley May 3, 2022, 6:13 p.m. UTC | #30
From: Guilherme G. Piccoli <gpiccoli@igalia.com> Sent: Friday, April 29, 2022 3:35 PM
> 
> Hi Michael, first of all thanks for the great review, much appreciated.
> Some comments inline below:
> 
> On 29/04/2022 14:16, Michael Kelley (LINUX) wrote:
> > [...]
> >> hypervisor I/O completion), so we postpone that to run late. But more
> >> relevant: this *same* vmbus unloading happens in the crash_shutdown()
> >> handler, so if kdump is set, we can safely skip this panic notifier and
> >> defer such clean-up to the kexec crash handler.
> >
> > While the last sentence is true for Hyper-V on x86/x64, it's not true for
> > Hyper-V on ARM64.  x86/x64 has the 'machine_ops' data structure
> > with the ability to provide a custom crash_shutdown() function, which
> > Hyper-V does in the form of hv_machine_crash_shutdown().  But ARM64
> > has no mechanism to provide such a custom function that will eventually
> > do the needed vmbus_initiate_unload() before running kdump.
> >
> > I'm not immediately sure what the best solution is for ARM64.  At this
> > point, I'm just pointing out the problem and will think about the tradeoffs
> > for various possible solutions.  Please do the same yourself. :-)
> >
> 
> Oh, you're totally right! I just assumed ARM64 would the the same, my
> bad. Just to propose some alternatives, so you/others can also discuss
> here and we can reach a consensus about the trade-offs:
> 
> (a) We could forget about this change, and always do the clean-up here,
> not relying in machine_crash_shutdown().
> Pro: really simple, behaves the same as it is doing currently.
> Con: less elegant/concise, doesn't allow arm64 customization.
> 
> (b) Add a way to allow ARM64 customization of shutdown crash handler.
> Pro: matches x86, more customizable, improves arm64 arch code.
> Con: A tad more complex.
> 
> Also, a question that came-up: if ARM64 has no way of calling special
> crash shutdown handler, how can you execute hv_stimer_cleanup() and
> hv_synic_disable_regs() there? Or are they not required in ARM64?
> 

My suggestion is to do (a) for now.  I suspect (b) could be a more
extended discussion and I wouldn't want your patch set to get held
up on that discussion.  I don't know what the sense of the ARM64
maintainers would be toward (b).  They have tried to avoid picking
up code warts like have accumulated on the x86/x64 side over the
years, and I agree with that effort.  But as more and varied
hypervisors become available for ARM64, it seems like a framework
for supporting a custom shutdown handler may become necessary.
But that could take a little time.

You are right about hv_stimer_cleanup() and hv_synic_disable_regs().
We are not running these when a panic occurs on ARM64, and we
should be, though the risk is small.   We will pursue (b) and add
these additional cleanups as part of that.  But again, I would suggest
doing (a) for now, and we will switch back to your solution once
(b) is in place.

> 
> >>
> >> (c) There is also a Hyper-V framebuffer panic notifier, which relies in
> >> doing a vmbus operation that demands a valid connection. So, we must
> >> order this notifier with the panic notifier from vmbus_drv.c, in order to
> >> guarantee that the framebuffer code executes before the vmbus connection
> >> is unloaded.
> >
> > Patch 21 of this set puts the Hyper-V FB panic notifier on the pre_reboot
> > notifier list, which means it won't execute before the VMbus connection
> > unload in the case of kdump.   This notifier is making sure that Hyper-V
> > is notified about the last updates made to the frame buffer before the
> > panic, so maybe it needs to be put on the hypervisor notifier list.  It
> > sends a message to Hyper-V over its existing VMbus channel, but it
> > does not wait for a reply.  It does, however, obtain a spin lock on the
> > ring buffer used to communicate with Hyper-V.   Unless someone has
> > a better suggestion, I'm inclined to take the risk of blocking on that
> > spin lock.
> 
> The logic behind that was: when kdump is set, we'd skip the vmbus
> disconnect on notifiers, deferring that to crash_shutdown(), logic this
> one refuted in the above discussion on ARM64 (one more Pro argument to
> the idea of refactoring aarch64 code to allow a custom crash shutdown
> handler heh). But you're right, for the default level 2, we skip the
> pre_reboot notifiers on kdump, effectively skipping this notifier.
> 
> Some ideas of what we can do here:
> 
> I) we could change the framebuffer notifier to rely on trylocks, instead
> of risking a lockup scenario, and with that, we can execute it before
> the vmbus disconnect in the hypervisor list;

I think we have to do this approach for now.

> 
> II) we ignore the hypervisor notifier in case of kdump _by default_, and
> if the users don't want that, they can always set the panic notifier
> level to 4 and run all notifiers prior to kdump; would that be terrible
> you think? Kdump users might don't care about the framebuffer...
> 
> III) we go with approach (b) above and refactor arm64 code to allow the
> custom crash handler on kdump time, then [with point (I) above] the
> logic proposed in this series is still valid - seems more and more the
> most correct/complete solution.

But even when/if we get approach (b) implemented, having the
framebuffer notifier on the pre_reboot list is still too late with the
default of panic_notifier_level = 2.  The kdump path will reset the
VMbus connection and then the framebuffer notifier won't work.

> 
> In any case, I guess we should avoid workarounds if possible and do the
> things the best way we can, to encompass all (or almost all) the
> possible scenarios and don't force things on users (like enforcing panic
> notifier level 4 for Hyper-V or something like this...)
> 
> More feedback from you / Hyper-V folks is pretty welcome about this.
> 
> 
> >
> >> [...]
> > The "Fixes:" tags imply that these changes should be backported to older
> > longterm kernel versions, which I don't think is the case.  There is a
> > dependency on Patch 14 of your series where PANIC_NOTIFIER is
> > introduced.
> >
> 
> Oh, this was more related with archeology of the kernel. When I'm
> investigating stuff, I really want to understand why code was added and
> that usually require some time git blaming stuff, so having that pronto
> in the commit message is a bonus.
> 
> But of course we don't need to use the Fixes tag for that, easy to only
> mention it in the text. A secondary benefit by using this tag is to
> indicate this is a _real fix_ to some code, and not an improvement, but
> as you say, I agree we shouldn't backport it to previous releases having
> or not the Fixes tag (AFAIK it's not mandatory to backport stuff with
> Fixes tag).
> 
> 
> >> [...]
> >> + * intrincated is the relation of this notifier with Hyper-V framebuffer
> >
> > s/intrincated/intricate/
> 
> Thanks, fixed in V2!
> 
> 
> >
> >> [...]
> >> +static int hv_panic_vmbus_unload(struct notifier_block *nb, unsigned long val,
> >>  			      void *args)
> >> +{
> >> +	if (!kexec_crash_loaded())
> >
> > I'm not clear on the purpose of this condition.  I think it means
> > we will skip the vmbus_initiate_unload() if a panic occurs in the
> > kdump kernel.  Is there a reason a panic in the kdump kernel
> > should be treated differently?  Or am I misunderstanding?
> 
> This is really related with the point discussed in the top of this
> response - I assumed both ARM64/x86_64 would behave the same and
> disconnect the vmbus through the custom crash handler when kdump is set,
> so worth skipping it here in the notifier. But that's not true for ARM64
> as you pointed, so this guard against kexec is really part of the
> decision/discussion on what to do with ARM64 heh

But note that vmbus_initiate_unload() already has a guard built-in.
If the intent of this test is just as a guard against running twice,
then it isn't needed.

> 
> Cheers!
Guilherme G. Piccoli May 3, 2022, 6:57 p.m. UTC | #31
On 03/05/2022 15:13, Michael Kelley (LINUX) wrote:
> [...]
>> (a) We could forget about this change, and always do the clean-up here,
>> not relying in machine_crash_shutdown().
>> Pro: really simple, behaves the same as it is doing currently.
>> Con: less elegant/concise, doesn't allow arm64 customization.
>>
>> (b) Add a way to allow ARM64 customization of shutdown crash handler.
>> Pro: matches x86, more customizable, improves arm64 arch code.
>> Con: A tad more complex.
>>
>> Also, a question that came-up: if ARM64 has no way of calling special
>> crash shutdown handler, how can you execute hv_stimer_cleanup() and
>> hv_synic_disable_regs() there? Or are they not required in ARM64?
>>
> 
> My suggestion is to do (a) for now.  I suspect (b) could be a more
> extended discussion and I wouldn't want your patch set to get held
> up on that discussion.  I don't know what the sense of the ARM64
> maintainers would be toward (b).  They have tried to avoid picking
> up code warts like have accumulated on the x86/x64 side over the
> years, and I agree with that effort.  But as more and varied
> hypervisors become available for ARM64, it seems like a framework
> for supporting a custom shutdown handler may become necessary.
> But that could take a little time.
> 
> You are right about hv_stimer_cleanup() and hv_synic_disable_regs().
> We are not running these when a panic occurs on ARM64, and we
> should be, though the risk is small.   We will pursue (b) and add
> these additional cleanups as part of that.  But again, I would suggest
> doing (a) for now, and we will switch back to your solution once
> (b) is in place.
> 

Thanks again Michael, I'll stick with (a) for now. I'll check with ARM64
community about that, and I might even try to implement something in
parallel (if you are not already working on that - lemme know please),
so we don't get stuck here. As you said, I feel that this is more and
more relevant as the number of panic/crash/kexec scenarios tend to
increase in ARM64.


>> [...]
>> Some ideas of what we can do here:
>>
>> I) we could change the framebuffer notifier to rely on trylocks, instead
>> of risking a lockup scenario, and with that, we can execute it before
>> the vmbus disconnect in the hypervisor list;
> 
> I think we have to do this approach for now.
> 
>>
>> II) we ignore the hypervisor notifier in case of kdump _by default_, and
>> if the users don't want that, they can always set the panic notifier
>> level to 4 and run all notifiers prior to kdump; would that be terrible
>> you think? Kdump users might don't care about the framebuffer...
>>
>> III) we go with approach (b) above and refactor arm64 code to allow the
>> custom crash handler on kdump time, then [with point (I) above] the
>> logic proposed in this series is still valid - seems more and more the
>> most correct/complete solution.
> 
> But even when/if we get approach (b) implemented, having the
> framebuffer notifier on the pre_reboot list is still too late with the
> default of panic_notifier_level = 2.  The kdump path will reset the
> VMbus connection and then the framebuffer notifier won't work.
> 

OK, perfect! I'll work something along these lines in V2, allowing the
FB notifier to always run in the hypervisor list before the vmbus unload
mechanism.


>> [...]
>>>> +static int hv_panic_vmbus_unload(struct notifier_block *nb, unsigned long val,
>>>>  			      void *args)
>>>> +{
>>>> +	if (!kexec_crash_loaded())
>>>
>>> I'm not clear on the purpose of this condition.  I think it means
>>> we will skip the vmbus_initiate_unload() if a panic occurs in the
>>> kdump kernel.  Is there a reason a panic in the kdump kernel
>>> should be treated differently?  Or am I misunderstanding?
>>
>> This is really related with the point discussed in the top of this
>> response - I assumed both ARM64/x86_64 would behave the same and
>> disconnect the vmbus through the custom crash handler when kdump is set,
>> so worth skipping it here in the notifier. But that's not true for ARM64
>> as you pointed, so this guard against kexec is really part of the
>> decision/discussion on what to do with ARM64 heh
> 
> But note that vmbus_initiate_unload() already has a guard built-in.
> If the intent of this test is just as a guard against running twice,
> then it isn't needed.

Since we're going to avoid relying in the custom crash_shutdown(), due
to the lack of ARM64 support for now, this check will be removed in V2.

Its purpose was to skip the notifier *proactively* in case kexec is set,
given that...once kexec happens, the custom crash_shutdown() would run
the same function (wrong assumption for ARM64, my bad).

Postponing that slightly would maybe gain us some time while the
hypervisor finish its work, so we'd delay less in the vmbus unload path
- that was the rationale behind this check.


Cheers!
Guilherme G. Piccoli May 9, 2022, 1:09 p.m. UTC | #32
On 28/04/2022 05:11, Suzuki K Poulose wrote:
> Hi Guilherme,
> 
> On 27/04/2022 23:49, Guilherme G. Piccoli wrote:
>> The panic notifier infrastructure executes registered callbacks when
>> a panic event happens - such callbacks are executed in atomic context,
>> with interrupts and preemption disabled in the running CPU and all other
>> CPUs disabled. That said, mutexes in such context are not a good idea.
>>
>> This patch replaces a regular mutex with a mutex_trylock safer approach;
>> given the nature of the mutex used in the driver, it should be pretty
>> uncommon being unable to acquire such mutex in the panic path, hence
>> no functional change should be observed (and if it is, that would be
>> likely a deadlock with the regular mutex).
>>
>> Fixes: 2227b7c74634 ("coresight: add support for CPU debug module")
>> Cc: Leo Yan <leo.yan@linaro.org>
>> Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
>> Cc: Mike Leach <mike.leach@linaro.org>
>> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
>> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> 
> How would you like to proceed with queuing this ? I am happy
> either way. In case you plan to push this as part of this
> series (I don't see any potential conflicts) :
> 
> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>

Hi Suzuki, some other maintainers are taking the patches to their next
branches for example. I'm working on V2, and I guess in the end would be
nice to reduce the size of the series a bit.

So, do you think you could pick this one for your coresight/next branch
(or even for rc cycle, your call - this is really a fix)?
This way, I won't re-submit this one in V2, since it's gonna be merged
already in your branch.

Thanks in advance,


Guilherme
Guilherme G. Piccoli May 9, 2022, 2:25 p.m. UTC | #33
On 29/04/2022 13:04, Guilherme G. Piccoli wrote:
> On 27/04/2022 21:28, Randy Dunlap wrote:
>>
>>
>> On 4/27/22 15:49, Guilherme G. Piccoli wrote:
>>> +	crash_kexec_post_notifiers
>>> +			This was DEPRECATED - users should always prefer the
>>
>> 			This is DEPRECATED - users should always prefer the
>>
>>> +			parameter "panic_notifiers_level" - check its entry
>>> +			in this documentation for details on how it works.
>>> +			Setting this parameter is exactly the same as setting
>>> +			"panic_notifiers_level=4".
>>
> 
> Thanks Randy, for your suggestion - but I confess I couldn't understand
> it properly. It's related to spaces/tabs, right? What you suggest me to
> change in this formatting? Just by looking the email I can't parse.
> 
> Cheers,
> 
> 
> Guilherme

Complete lack of attention from me, apologies!
The suggestions was s/was/is - already fixed for V2, thanks Randy.
Suzuki K Poulose May 9, 2022, 4:14 p.m. UTC | #34
Hi

On 09/05/2022 14:09, Guilherme G. Piccoli wrote:
> On 28/04/2022 05:11, Suzuki K Poulose wrote:
>> Hi Guilherme,
>>
>> On 27/04/2022 23:49, Guilherme G. Piccoli wrote:
>>> The panic notifier infrastructure executes registered callbacks when
>>> a panic event happens - such callbacks are executed in atomic context,
>>> with interrupts and preemption disabled in the running CPU and all other
>>> CPUs disabled. That said, mutexes in such context are not a good idea.
>>>
>>> This patch replaces a regular mutex with a mutex_trylock safer approach;
>>> given the nature of the mutex used in the driver, it should be pretty
>>> uncommon being unable to acquire such mutex in the panic path, hence
>>> no functional change should be observed (and if it is, that would be
>>> likely a deadlock with the regular mutex).
>>>
>>> Fixes: 2227b7c74634 ("coresight: add support for CPU debug module")
>>> Cc: Leo Yan <leo.yan@linaro.org>
>>> Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
>>> Cc: Mike Leach <mike.leach@linaro.org>
>>> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
>>> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
>>
>> How would you like to proceed with queuing this ? I am happy
>> either way. In case you plan to push this as part of this
>> series (I don't see any potential conflicts) :
>>
>> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> 
> Hi Suzuki, some other maintainers are taking the patches to their next
> branches for example. I'm working on V2, and I guess in the end would be
> nice to reduce the size of the series a bit.
> 
> So, do you think you could pick this one for your coresight/next branch
> (or even for rc cycle, your call - this is really a fix)?
> This way, I won't re-submit this one in V2, since it's gonna be merged
> already in your branch.

I have queued this to coresight/next.

Thanks
Suzuki
Guilherme G. Piccoli May 9, 2022, 4:26 p.m. UTC | #35
On 09/05/2022 13:14, Suzuki K Poulose wrote:
> [...]> 
> I have queued this to coresight/next.
> 
> Thanks
> Suzuki


Thanks a lot Suzuki!
Steven Rostedt May 10, 2022, 5:29 p.m. UTC | #36
On Thu, 28 Apr 2022 09:01:13 +0800
Xiaoming Ni <nixiaoming@huawei.com> wrote:

> > +#ifdef CONFIG_DEBUG_NOTIFIERS
> > +	{
> > +		char sym_name[KSYM_NAME_LEN];
> > +
> > +		pr_info("notifiers: registered %s()\n",
> > +			notifier_name(n, sym_name));
> > +	}  
> 
> Duplicate Code.
> 
> Is it better to use __func__ and %pS?
> 
> pr_info("%s: %pS\n", __func__, n->notifier_call);
> 
> 
> > +#endif

Also, don't sprinkle #ifdef in C code. Instead:

	if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS))
		pr_info("notifers: regsiter %ps()\n",
			n->notifer_call);


Or define a print macro at the start of the C file that is a nop if it is
not defined, and use the macro.

-- Steve
Guilherme G. Piccoli May 16, 2022, 4:14 p.m. UTC | #37
On 10/05/2022 14:29, Steven Rostedt wrote:
> [...]
> Also, don't sprinkle #ifdef in C code. Instead:
> 
> 	if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS))
> 		pr_info("notifers: regsiter %ps()\n",
> 			n->notifer_call);
> 
> 
> Or define a print macro at the start of the C file that is a nop if it is
> not defined, and use the macro.

Thanks, I'll go with the IS_ENABLED() idea in V2 - appreciate the hint.
Cheers,


Guilherme
Guilherme G. Piccoli May 23, 2022, 8:40 p.m. UTC | #38
On 28/04/2022 13:55, Helge Deller wrote:
> [...]
> You may add:
> Acked-by: Helge Deller <deller@gmx.de> # parisc
> 
> Helge

Hi Helge, do you think would be possible to still pick this one for
v5.19 or do you prefer to hold for the next release?

I'm working on V2, so if it's merged for 5.19 I won't send it again.
Thanks,


Guilherme