Message ID | 20220427224924.592546-25-gpiccoli@igalia.com |
---|---|
State | New |
Headers | show |
Series | The panic notifiers refactor | expand |
From: Guilherme G. Piccoli <gpiccoli@igalia.com> Sent: Friday, April 29, 2022 1:38 PM > > On 29/04/2022 14:53, Michael Kelley (LINUX) wrote: > > From: Guilherme G. Piccoli <gpiccoli@igalia.com> Sent: Wednesday, April 27, 2022 > 3:49 PM > >> [...] > >> + panic_notifiers_level= > >> + [KNL] Set the panic notifiers execution order. > >> + Format: <unsigned int> > >> + We currently have 4 lists of panic notifiers; based > >> + on the functionality and risk (for panic success) the > >> + callbacks are added in a given list. The lists are: > >> + - hypervisor/FW notification list (low risk); > >> + - informational list (low/medium risk); > >> + - pre_reboot list (higher risk); > >> + - post_reboot list (only run late in panic and after > >> + kdump, not configurable for now). > >> + This parameter defines the ordering of the first 3 > >> + lists with regards to kdump; the levels determine > >> + which set of notifiers execute before kdump. The > >> + accepted levels are: > >> + 0: kdump is the first thing to run, NO list is > >> + executed before kdump. > >> + 1: only the hypervisor list is executed before kdump. > >> + 2 (default level): the hypervisor list and (*if* > >> + there's any kmsg_dumper defined) the informational > >> + list are executed before kdump. > >> + 3: both the hypervisor and the informational lists > >> + (always) execute before kdump. > > > > I'm not clear on why level 2 exists. What is the scenario where > > execution of the info list before kdump should be conditional on the > > existence of a kmsg_dumper? Maybe the scenario is described > > somewhere in the patch set and I just missed it. > > > > Hi Michael, thanks for your review/consideration. So, this idea started > kind of some time ago. It all started with a need of exposing more > information on kernel log *before* kdump and *before* pstore - > specifically, we're talking about panic_print. But this cause some > reactions, Baoquan was very concerned with that [0]. Soon after, I've > proposed a panic notifiers filter (orthogonal) approach, to which Petr > suggested instead doing a major refactor [1] - it finally is alive in > the form of this series. > > The theory behind the level 2 is to allow a scenario of kdump with the > minimum amount of notifiers - what is the point in printing more > information if the user doesn't care, since it's going to kdump? Now, if > there is a kmsg dumper, it means that there is likely some interest in > collecting information, and that might as well be required before the > potential kdump (which is my case, hence the proposal on [0]). > > Instead of forcing one of the two behaviors (level 1 or level 3), we > have a middle-term/compromise: if there's interest in collecting such > data (in the form of a kmsg dumper), we then execute the informational > notifiers before kdump. If not, why to increase (even slightly) the risk > for kdump? > > I'm OK in removing the level 2 if people prefer, but I don't feel it's a > burden, quite opposite - seems a good way to accommodate the somewhat > antagonistic ideas (jump to kdump ASAP vs collecting more info in the > panicked kernel log). > > [0] https://lore.kernel.org/lkml/20220126052246.GC2086@MiWiFi-R3L-srv/ > > [1] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/ > To me, it's a weak correlation between having a kmsg dumper, and wanting or not wanting the info level output to come before kdump. Hyper-V is one of only a few places that register a kmsg dumper, so most Linux instances outside of Hyper-V guest (and PowerPC systems?) will have the info level output after kdump. It seems like anyone who cared strongly about the info level output would set the panic_notifier_level to 1 or to 3 so that the result is more deterministic. But that's just my opinion, and it's probably an opinion that is not as well informed on the topic as some others in the discussion. So keeping things as in your patch set is not a show-stopper for me. However, I would request a clarification in the documentation. The panic_notifier_level affects not only the hypervisor, informational, and pre_reboot lists, but it also affects panic_print_sys_info() and kmsg_dump(). Specifically, at level 1, panic_print_sys_info() and kmsg_dump() will not be run before kdump. At level 3, they will always be run before kdump. Your documentation above mentions "informational lists" (plural), which I take to vaguely include kmsg_dump() and panic_print_sys_info(), but being explicit about the effect would be better. Michael > > >[...] > >> + * Based on the level configured (smaller than 4), we clear the > >> + * proper bits in "panic_notifiers_bits". Notice that this bitfield > >> + * is initialized with all notifiers set. > >> + */ > >> + switch (panic_notifiers_level) { > >> + case 3: > >> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > >> + break; > >> + case 2: > >> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > >> + > >> + if (!kmsg_has_dumpers()) > >> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > >> + break; > >> + case 1: > >> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > >> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > >> + break; > >> + case 0: > >> + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > >> + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > >> + clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits); > >> + break; > >> + } > > > > I think the above switch statement could be done as follows: > > > > if (panic_notifiers_level <= 3) > > clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > > if (panic_notifiers_level <= 2) > > if (!kmsg_has_dumpers()) > > clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > > if (panic_notifiers_level <=1) > > clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > > if (panic_notifiers_level == 0) > > clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits); > > > > That's about half the lines of code. It's somewhat a matter of style, > > so treat this as just a suggestion to consider. I just end up looking > > for a better solution when I see the same line of code repeated > > 3 or 4 times! > > > > It's a good idea - I liked your code. The switch seems more > natural/explicit for me, even duplicating some lines, but in case more > people prefer your way, I can definitely change the code - thanks for > the suggestion. > Cheers, > > > Guilherme
On 03/05/2022 14:31, Michael Kelley (LINUX) wrote: > [...] > > To me, it's a weak correlation between having a kmsg dumper, and > wanting or not wanting the info level output to come before kdump. > Hyper-V is one of only a few places that register a kmsg dumper, so most > Linux instances outside of Hyper-V guest (and PowerPC systems?) will have > the info level output after kdump. It seems like anyone who cared strongly > about the info level output would set the panic_notifier_level to 1 or to 3 > so that the result is more deterministic. But that's just my opinion, and > it's probably an opinion that is not as well informed on the topic as some > others in the discussion. So keeping things as in your patch set is not a > show-stopper for me. > > However, I would request a clarification in the documentation. The > panic_notifier_level affects not only the hypervisor, informational, > and pre_reboot lists, but it also affects panic_print_sys_info() and > kmsg_dump(). Specifically, at level 1, panic_print_sys_info() and > kmsg_dump() will not be run before kdump. At level 3, they will > always be run before kdump. Your documentation above mentions > "informational lists" (plural), which I take to vaguely include > kmsg_dump() and panic_print_sys_info(), but being explicit about > the effect would be better. > > Michael Thanks again Michael, to express your points and concerns - great idea of documentation improvement here, I'll do that for V2, for sure. The idea of "defaulting" to skip the info list on kdump (if no kmsg_dump() is set) is again a mechanism that aims at accommodating all users and concerns of antagonistic goals, kdump vs notifier lists. Before this patch set, by default no notifier executed before kdump. So, the "pendulum" was strongly on kdump side, and clearly this was a sub-optimal decision - proof of that is that both Hyper-V / PowerPC code forcibly set the "crash_kexec_post_notifiers". The goal here is to have a more lightweight list that by default runs before kdump, a secondary list that only runs before kdump if there's usage for that (either user sets that or kmsg_dumper set is considered a valid user), and the remaining notifiers run by default only after kdump, all of that very customizable through the levels idea. Now, one thing we could do to improve consistency for the hyper-v case: having a kmsg_dump_once() helper, and *for Hyper-V only*, call it on the hypervisor list, within the info notifier (that would be moved to hypervisor list, ofc). Let's wait for more feedback on that, just throwing some ideas in order we can have everyone happy with the end-result! Cheers, Guilherme
On 29/04/2022 13:04, Guilherme G. Piccoli wrote: > On 27/04/2022 21:28, Randy Dunlap wrote: >> >> >> On 4/27/22 15:49, Guilherme G. Piccoli wrote: >>> + crash_kexec_post_notifiers >>> + This was DEPRECATED - users should always prefer the >> >> This is DEPRECATED - users should always prefer the >> >>> + parameter "panic_notifiers_level" - check its entry >>> + in this documentation for details on how it works. >>> + Setting this parameter is exactly the same as setting >>> + "panic_notifiers_level=4". >> > > Thanks Randy, for your suggestion - but I confess I couldn't understand > it properly. It's related to spaces/tabs, right? What you suggest me to > change in this formatting? Just by looking the email I can't parse. > > Cheers, > > > Guilherme Complete lack of attention from me, apologies! The suggestions was s/was/is - already fixed for V2, thanks Randy.
Sorry for the delayed response. Unfortunately, I had 10 days holidays until yesterday... > .../admin-guide/kernel-parameters.txt | 42 ++- > include/linux/panic_notifier.h | 1 + > kernel/kexec_core.c | 8 +- > kernel/panic.c | 292 +++++++++++++----- > .../selftests/pstore/pstore_crash_test | 5 +- > 5 files changed, 252 insertions(+), 96 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 3f1cc5e317ed..8d3524060ce3 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt ...snip... > @@ -3784,6 +3791,33 @@ > timeout < 0: reboot immediately > Format: <timeout> > > + panic_notifiers_level= > + [KNL] Set the panic notifiers execution order. > + Format: <unsigned int> > + We currently have 4 lists of panic notifiers; based > + on the functionality and risk (for panic success) the > + callbacks are added in a given list. The lists are: > + - hypervisor/FW notification list (low risk); > + - informational list (low/medium risk); > + - pre_reboot list (higher risk); > + - post_reboot list (only run late in panic and after > + kdump, not configurable for now). > + This parameter defines the ordering of the first 3 > + lists with regards to kdump; the levels determine > + which set of notifiers execute before kdump. The > + accepted levels are: > + 0: kdump is the first thing to run, NO list is > + executed before kdump. > + 1: only the hypervisor list is executed before kdump. > + 2 (default level): the hypervisor list and (*if* Hmmm, why are you trying to change default setting? Based on the current design of kdump, it's natural to put what the handlers for these level 1 and level 2 handlers do in machine_crash_shutdown(), as these are necessary by default, right? Or have you already tried that and figured out it's difficult in some reason and reached the current design? If so, why is that difficult? Could you point to if there is already such discussion online? kdump is designed to perform as little things as possible before transferring the execution to the 2nd kernel in order to increase reliability. Just detour to panic() increases risks of kdump failure in the sense of increasing the executed codes in the abnormal situation, which is very the note in the explanation of crash_kexec_post_notifiers. Also, the current implementation of crash_kexec_post_notifiers uses the panic notifier, but this is not from the technical reason. Ideally, it should have been implemented in the context of crash_kexec() independently of panic(). That is, it looks to me that, in addition to changing design of panic notifier, you are trying to integrate shutdown code of the crash kexec and the panic paths. If so, this is a big design change for kdump. I'm concerned about increase of reliability. I'd like you to discuss them carefully. Thanks. HATAYAMA, Daisuke
Hey Hatayma, thanks for your great analysis and no need for apologies! I'll comment/respond properly inline below, just noticing here that I've CCed Mark and Marc (from the ARM64 perspective), Michael (Hyper-V perspective) and Hari (PowerPC perspective), besides the usual suspects as Petr, Baoquan, etc. On 09/05/2022 12:16, d.hatayama@fujitsu.com wrote: > Sorry for the delayed response. Unfortunately, I had 10 days holidays > until yesterday... > [...] >> + We currently have 4 lists of panic notifiers; based >> + on the functionality and risk (for panic success) the >> + callbacks are added in a given list. The lists are: >> + - hypervisor/FW notification list (low risk); >> + - informational list (low/medium risk); >> + - pre_reboot list (higher risk); >> + - post_reboot list (only run late in panic and after >> + kdump, not configurable for now). >> + This parameter defines the ordering of the first 3 >> + lists with regards to kdump; the levels determine >> + which set of notifiers execute before kdump. The >> + accepted levels are: >> + 0: kdump is the first thing to run, NO list is >> + executed before kdump. >> + 1: only the hypervisor list is executed before kdump. >> + 2 (default level): the hypervisor list and (*if* > > Hmmm, why are you trying to change default setting? > > Based on the current design of kdump, it's natural to put what the > handlers for these level 1 and level 2 handlers do in > machine_crash_shutdown(), as these are necessary by default, right? > > Or have you already tried that and figured out it's difficult in some > reason and reached the current design? If so, why is that difficult? > Could you point to if there is already such discussion online? > > kdump is designed to perform as little things as possible before > transferring the execution to the 2nd kernel in order to increase > reliability. Just detour to panic() increases risks of kdump failure > in the sense of increasing the executed codes in the abnormal > situation, which is very the note in the explanation of > crash_kexec_post_notifiers. > > Also, the current implementation of crash_kexec_post_notifiers uses > the panic notifier, but this is not from the technical > reason. Ideally, it should have been implemented in the context of > crash_kexec() independently of panic(). > > That is, it looks to me that, in addition to changing design of panic > notifier, you are trying to integrate shutdown code of the crash kexec > and the panic paths. If so, this is a big design change for kdump. > I'm concerned about increase of reliability. I'd like you to discuss > them carefully.
Hello, first, I am sorry for stepping into the discussion so late. I was busy with some other stuff and this patchset is far from trivial. Second, thanks a lot for putting so much effort into it. Most of the changes look pretty good, especially all the fixes of particular notifiers and split into four lists. Though this patch will need some more love. See below for more details. On Wed 2022-04-27 19:49:18, Guilherme G. Piccoli wrote: > The panic() function is somewhat convoluted - a lot of changes were > made over the years, adding comments that might be misleading/outdated > now, it has a code structure that is a bit complex to follow, with > lots of conditionals, for example. The panic notifier list is something > else - a single list, with multiple callbacks of different purposes, > that run in a non-deterministic order and may affect hardly kdump > reliability - see the "crash_kexec_post_notifiers" workaround-ish flag. > > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -3784,6 +3791,33 @@ > timeout < 0: reboot immediately > Format: <timeout> > > + panic_notifiers_level= > + [KNL] Set the panic notifiers execution order. > + Format: <unsigned int> > + We currently have 4 lists of panic notifiers; based > + on the functionality and risk (for panic success) the > + callbacks are added in a given list. The lists are: > + - hypervisor/FW notification list (low risk); > + - informational list (low/medium risk); > + - pre_reboot list (higher risk); > + - post_reboot list (only run late in panic and after > + kdump, not configurable for now). > + This parameter defines the ordering of the first 3 > + lists with regards to kdump; the levels determine > + which set of notifiers execute before kdump. The > + accepted levels are: This talks only about kdump. The reality is much more complicated. The level affect the order of: + notifiers vs. kdump + notifiers vs. crash_dump + crash_dump vs. kdump There might theoretically many variants of the ordering of kdump, crash_dump, and the 4 notifier list. Some variants do not make much sense. You choose 5 variants and tried to select them by a level number. The question is if we really could easily describe the meaning this way. It is not only about a "level" of notifiers before kdump. It is also about the ordering of crash_dump vs. kdump. IMHO, "level" semantic does not fit there. Maybe more parameters might be easier to understand the effect. Anyway, we first need to agree on the chosen variants. I am going to discuss it more in the code, see below. > + 0: kdump is the first thing to run, NO list is > + executed before kdump. > + 1: only the hypervisor list is executed before kdump. > + 2 (default level): the hypervisor list and (*if* > + there's any kmsg_dumper defined) the informational > + list are executed before kdump. > + 3: both the hypervisor and the informational lists > + (always) execute before kdump. > + 4: the 3 lists (hypervisor, info and pre_reboot) > + execute before kdump - this behavior is analog to the > + deprecated parameter "crash_kexec_post_notifiers". > + > panic_print= Bitmask for printing system info when panic happens. > User can chose combination of the following bits: > bit 0: print all tasks info > --- a/kernel/panic.c > +++ b/kernel/panic.c > @@ -183,6 +195,112 @@ static void panic_print_sys_info(bool console_flush) > ftrace_dump(DUMP_ALL); > } > > +/* > + * Helper that accumulates all console flushing routines executed on panic. > + */ > +static void console_flushing(void) > +{ > +#ifdef CONFIG_VT > + unblank_screen(); > +#endif > + console_unblank(); > + > + /* > + * In this point, we may have disabled other CPUs, hence stopping the > + * CPU holding the lock while still having some valuable data in the > + * console buffer. > + * > + * Try to acquire the lock then release it regardless of the result. > + * The release will also print the buffers out. Locks debug should > + * be disabled to avoid reporting bad unlock balance when panic() > + * is not being called from OOPS. > + */ > + debug_locks_off(); > + console_flush_on_panic(CONSOLE_FLUSH_PENDING); > + > + panic_print_sys_info(true); > +} > + > +#define PN_HYPERVISOR_BIT 0 > +#define PN_INFO_BIT 1 > +#define PN_PRE_REBOOT_BIT 2 > +#define PN_POST_REBOOT_BIT 3 > + > +/* > + * Determine the order of panic notifiers with regards to kdump. > + * > + * This function relies in the "panic_notifiers_level" kernel parameter > + * to determine how to order the notifiers with regards to kdump. We > + * have currently 5 levels. For details, please check the kernel docs for > + * "panic_notifiers_level" at Documentation/admin-guide/kernel-parameters.txt. > + * > + * Default level is 2, which means the panic hypervisor and informational > + * (unless we don't have any kmsg_dumper) lists will execute before kdump. > + */ > +static void order_panic_notifiers_and_kdump(void) > +{ > + /* > + * The parameter "crash_kexec_post_notifiers" is deprecated, but > + * valid. Users that set it want really all panic notifiers to > + * execute before kdump, so it's effectively the same as setting > + * the panic notifiers level to 4. > + */ > + if (panic_notifiers_level >= 4 || crash_kexec_post_notifiers) > + return; > + > + /* > + * Based on the level configured (smaller than 4), we clear the > + * proper bits in "panic_notifiers_bits". Notice that this bitfield > + * is initialized with all notifiers set. > + */ > + switch (panic_notifiers_level) { > + case 3: > + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > + break; > + case 2: > + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > + > + if (!kmsg_has_dumpers()) > + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > + break; > + case 1: > + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > + break; > + case 0: > + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > + clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits); > + break; > + } > +} > > +/* > + * Set of helpers to execute the panic notifiers only once. > + * Just the informational notifier cares about the return. > + */ > +static inline bool notifier_run_once(struct atomic_notifier_head head, > + char *buf, long bit) > +{ > + if (test_and_change_bit(bit, &panic_notifiers_bits)) { > + atomic_notifier_call_chain(&head, PANIC_NOTIFIER, buf); > + return true; > + } > + return false; > +} Here is the code using the above functions. It helps to discuss the design and logic. <kernel/panic.c> order_panic_notifiers_and_kdump(); /* If no level, we should kdump ASAP. */ if (!panic_notifiers_level) __crash_kexec(NULL); crash_smp_send_stop(); panic_notifier_hypervisor_once(buf); if (panic_notifier_info_once(buf)) kmsg_dump(KMSG_DUMP_PANIC); panic_notifier_pre_reboot_once(buf); __crash_kexec(NULL); panic_notifier_hypervisor_once(buf); if (panic_notifier_info_once(buf)) kmsg_dump(KMSG_DUMP_PANIC); panic_notifier_pre_reboot_once(buf); </kernel/panic.c> I have to say that the logic is very unclear. Almost all functions are called twice: + __crash_kexec() + kmsg_dump() + panic_notifier_hypervisor_once() + panic_notifier_pre_reboot_once() + panic_notifier_info_once() It is pretty hard to find what functions are always called in the same order and where the order can be inverted. The really used code path is defined by order_panic_notifiers_and_kdump() that encodes "level" into "bits". The bits are then flipped in panic_notifier_*_once() calls that either do something or not. kmsg_dump() is called according to the bit flip. It is an interesting approach. I guess that you wanted to avoid too many if/then/else levels in panic(). But honestly, it looks like a black magic to me. IMHO, it is always easier to follow if/then/else logic than using a translation table that requires additional bit flips when a value is used more times. Also I guess that it is good proof that "level" abstraction does not fit here. Normal levels would not need this kind of magic. OK, the question is how to make it better. Let's start with a clear picture of the problem: 1. panic() has basically two funtions: + show/store debug information (optional ways and amount) + do something with the system (reboot, stay hanged) 2. There are 4 ways how to show/store the information: + tell hypervisor to store what it is interested about + crash_dump + kmsg_dump() + consoles , where crash_dump and consoles are special: + crash_dump does not return. Instead it ends up with reboot. + Consoles work transparently. They just need an extra flush before reboot or staying hanged. 3. The various notifiers do things like: + tell hypervisor about the crash + print more information (also stop watchdogs) + prepare system for reboot (touch some interfaces) + prepare system for staying hanged (blinking) Note that it pretty nicely matches the 4 notifier lists. Now, we need to decide about the ordering. The main area is how to store the debug information. Consoles are transparent so the quesition is about: + hypervisor + crash_dump + kmsg_dump Some people need none and some people want all. There is a risk that system might hung at any stage. This why people want to make the order configurable. But crash_dump() does not return when it succeeds. And kmsg_dump() users havn't complained about hypervisor problems yet. So, that two variants might be enough: + crash_dump (hypervisor, kmsg_dump as fallback) + hypervisor, kmsg_dump, crash_dump One option "panic_prefer_crash_dump" should be enough. And the code might look like: void panic() { [...] dump_stack(); kgdb_panic(buf); < --- here starts the reworked code --- > /* crash dump is enough when enabled and preferred. */ if (panic_prefer_crash_dump) __crash_kexec(NULL); /* Stop other CPUs and focus on handling the panic state. */ if (has_kexec_crash_image) crash_smp_send_stop(); else smp_send_stop() /* Notify hypervisor about the system panic. */ atomic_notifier_call_chain(&panic_hypervisor_list, 0, NULL); /* * No need to risk extra info when there is no kmsg dumper * registered. */ if (!has_kmsg_dumper()) __crash_kexec(NULL); /* Add extra info from different subsystems. */ atomic_notifier_call_chain(&panic_info_list, 0, NULL); kmsg_dump(KMSG_DUMP_PANIC); __crash_kexec(NULL); /* Flush console */ unblank_screen(); console_unblank(); debug_locks_off(); console_flush_on_panic(CONSOLE_FLUSH_PENDING); if (panic_timeout > 0) { delay() } /* * Prepare system for eventual reboot and allow custom * reboot handling. */ atomic_notifier_call_chain(&panic_reboot_list, 0, NULL); if (panic_timeout != 0) { reboot(); } /* * Prepare system for the infinite waiting, for example, * setup blinking. */ atomic_notifier_call_chain(&panic_loop_list, 0, NULL); infinite_loop(); } __crash_kexec() is there 3 times but otherwise the code looks quite straight forward. Note 1: I renamed the two last notifier list. The name 'post-reboot' did sound strange from the logical POV ;-) Note 2: We have to avoid the possibility to call "reboot" list before kmsg_dump(). All callbacks providing info have to be in the info list. It a callback combines info and reboot functionality then it should be split. There must be another way to calm down problematic info callbacks. And it has to be solved when such a problem is reported. Is there any known issue, please? It is possible that I have missed something important. But I would really like to make the logic as simple as possible. Best Regards, Petr
On 12/05/2022 11:03, Petr Mladek wrote: > Hello, > > first, I am sorry for stepping into the discussion so late. > I was busy with some other stuff and this patchset is far > from trivial. > > Second, thanks a lot for putting so much effort into it. > Most of the changes look pretty good, especially all > the fixes of particular notifiers and split into > four lists. > > Though this patch will need some more love. See below > for more details. Thanks a lot for your review Petr, it is much appreciated! No need for apologies, there is no urgency here =) > [...] > This talks only about kdump. The reality is much more complicated. > The level affect the order of: > > + notifiers vs. kdump > + notifiers vs. crash_dump > + crash_dump vs. kdump First of all, I'd like to ask you please to clarify to me *exactly* what are the differences between "crash_dump" and "kdump". I'm sorry if that's a silly question, I need to be 100% sure I understand the concepts the same way you do. > There might theoretically many variants of the ordering of kdump, > crash_dump, and the 4 notifier list. Some variants do not make > much sense. You choose 5 variants and tried to select them by > a level number. > > The question is if we really could easily describe the meaning this > way. It is not only about a "level" of notifiers before kdump. It is > also about the ordering of crash_dump vs. kdump. IMHO, "level" > semantic does not fit there. > > Maybe more parameters might be easier to understand the effect. > Anyway, we first need to agree on the chosen variants. > I am going to discuss it more in the code, see below. > > > [...] > Here is the code using the above functions. It helps to discuss > the design and logic. > > <kernel/panic.c> > order_panic_notifiers_and_kdump(); > > /* If no level, we should kdump ASAP. */ > if (!panic_notifiers_level) > __crash_kexec(NULL); > > crash_smp_send_stop(); > panic_notifier_hypervisor_once(buf); > > if (panic_notifier_info_once(buf)) > kmsg_dump(KMSG_DUMP_PANIC); > > panic_notifier_pre_reboot_once(buf); > > __crash_kexec(NULL); > > panic_notifier_hypervisor_once(buf); > > if (panic_notifier_info_once(buf)) > kmsg_dump(KMSG_DUMP_PANIC); > > panic_notifier_pre_reboot_once(buf); > </kernel/panic.c> > > I have to say that the logic is very unclear. Almost all > functions are called twice: > > + __crash_kexec() > + kmsg_dump() > + panic_notifier_hypervisor_once() > + panic_notifier_pre_reboot_once() > + panic_notifier_info_once() > > It is pretty hard to find what functions are always called in the same > order and where the order can be inverted. > > The really used code path is defined by order_panic_notifiers_and_kdump() > that encodes "level" into "bits". The bits are then flipped in > panic_notifier_*_once() calls that either do something or not. > kmsg_dump() is called according to the bit flip. > > It is an interesting approach. I guess that you wanted to avoid too > many if/then/else levels in panic(). But honestly, it looks like > a black magic to me. > > IMHO, it is always easier to follow if/then/else logic than using > a translation table that requires additional bit flips when > a value is used more times. > > Also I guess that it is good proof that "level" abstraction does > not fit here. Normal levels would not need this kind of magic. Heheh OK, I appreciate your opinion, but I guess we'll need to agree in disagree here - I'm much more fond to this kind of code than a bunch of if/else blocks that almost give headaches. Encoding such "level" logic in the if/else scheme is very convoluted, generates a very big code. And the functions aren't so black magic - they map a level in bits, and the functions _once() are called...once! Although we switch the position in the code, so there are 2 calls, one of them is called and the other not. But that's totally fine to change - especially if we're moving away from the "level" logic. I see below you propose a much simpler approach - if we follow that, definitely we won't need the "black magic" approach heheh > > OK, the question is how to make it better. Let's start with > a clear picture of the problem: > > 1. panic() has basically two funtions: > > + show/store debug information (optional ways and amount) > + do something with the system (reboot, stay hanged) > > > 2. There are 4 ways how to show/store the information: > > + tell hypervisor to store what it is interested about > + crash_dump > + kmsg_dump() > + consoles > > , where crash_dump and consoles are special: > > + crash_dump does not return. Instead it ends up with reboot. > > + Consoles work transparently. They just need an extra flush > before reboot or staying hanged. > > > 3. The various notifiers do things like: > > + tell hypervisor about the crash > + print more information (also stop watchdogs) > + prepare system for reboot (touch some interfaces) > + prepare system for staying hanged (blinking) > > Note that it pretty nicely matches the 4 notifier lists. > I really appreciate the summary skill you have, to convert complex problems in very clear and concise ideas. Thanks for that, very useful! I agree with what was summarized above. > Now, we need to decide about the ordering. The main area is how > to store the debug information. Consoles are transparent so > the quesition is about: > > + hypervisor > + crash_dump > + kmsg_dump > > Some people need none and some people want all. There is a > risk that system might hung at any stage. This why people want to > make the order configurable. > > But crash_dump() does not return when it succeeds. And kmsg_dump() > users havn't complained about hypervisor problems yet. So, that > two variants might be enough: > > + crash_dump (hypervisor, kmsg_dump as fallback) > + hypervisor, kmsg_dump, crash_dump > > One option "panic_prefer_crash_dump" should be enough. > And the code might look like: > > void panic() > { > [...] > dump_stack(); > kgdb_panic(buf); > > < --- here starts the reworked code --- > > > /* crash dump is enough when enabled and preferred. */ > if (panic_prefer_crash_dump) > __crash_kexec(NULL); > > /* Stop other CPUs and focus on handling the panic state. */ > if (has_kexec_crash_image) > crash_smp_send_stop(); > else > smp_send_stop() > Here we have a very important point. Why do we need 2 variants of SMP CPU stopping functions? I disagree with that - my understanding of this after some study in architectures is that the crash_() variant is "stronger", should work in all cases and if not, we should fix that - that'd be a bug. Such variant either maps to smp_send_stop() (in various architectures, including XEN/x86) or overrides the basic function with more proper handling for panic() case...I don't see why we still need such distinction, if you / others have some insight about that, I'd like to hear =) > /* Notify hypervisor about the system panic. */ > atomic_notifier_call_chain(&panic_hypervisor_list, 0, NULL); > > /* > * No need to risk extra info when there is no kmsg dumper > * registered. > */ > if (!has_kmsg_dumper()) > __crash_kexec(NULL); > > /* Add extra info from different subsystems. */ > atomic_notifier_call_chain(&panic_info_list, 0, NULL); > > kmsg_dump(KMSG_DUMP_PANIC); > __crash_kexec(NULL); > > /* Flush console */ > unblank_screen(); > console_unblank(); > debug_locks_off(); > console_flush_on_panic(CONSOLE_FLUSH_PENDING); > > if (panic_timeout > 0) { > delay() > } > > /* > * Prepare system for eventual reboot and allow custom > * reboot handling. > */ > atomic_notifier_call_chain(&panic_reboot_list, 0, NULL); You had the order of panic_reboot_list VS. consoles flushing inverted. It might make sense, although I didn't do that in V1... Are you OK in having a helper for console flushing, as I did in V1? It makes code of panic() a bit less polluted / more focused I feel. > > if (panic_timeout != 0) { > reboot(); > } > > /* > * Prepare system for the infinite waiting, for example, > * setup blinking. > */ > atomic_notifier_call_chain(&panic_loop_list, 0, NULL); > > infinite_loop(); > } > > > __crash_kexec() is there 3 times but otherwise the code looks > quite straight forward. > > Note 1: I renamed the two last notifier list. The name 'post-reboot' > did sound strange from the logical POV ;-) > > Note 2: We have to avoid the possibility to call "reboot" list > before kmsg_dump(). All callbacks providing info > have to be in the info list. It a callback combines > info and reboot functionality then it should be split. > > There must be another way to calm down problematic > info callbacks. And it has to be solved when such > a problem is reported. Is there any known issue, please? > > It is possible that I have missed something important. > But I would really like to make the logic as simple as possible. OK, I agree with you! It's indeed simpler and if others agree, I can happily change the logic to what you proposed. Although...currently the "crash_kexec_post_notifiers" allows to call _all_ panic_reboot_list callbacks _before kdump_. We need to mention this change in the commit messages, but I really would like to hear the opinions of heavy users of notifiers (as Michael/Hyper-V) and the kdump interested parties (like Baoquan / Dave Young / Hayatama). If we all agree on such approach, will change that for V2 =) Thanks again Petr, for the time spent in such detailed review! Cheers, Guilherme
On Sun 2022-05-15 19:47:39, Guilherme G. Piccoli wrote: > On 12/05/2022 11:03, Petr Mladek wrote: > > This talks only about kdump. The reality is much more complicated. > > The level affect the order of: > > > > + notifiers vs. kdump > > + notifiers vs. crash_dump > > + crash_dump vs. kdump > > First of all, I'd like to ask you please to clarify to me *exactly* what > are the differences between "crash_dump" and "kdump". I'm sorry if > that's a silly question, I need to be 100% sure I understand the > concepts the same way you do. Ah, it should have been: + notifiers vs. kmsg_dump + notifiers vs. crash_dump + crash_dump vs. kmsg_dump I am sorry for the confusion. Even "crash_dump" is slightly misleading because there is no function with this name. But it seems to be easier to understand than __crash_kexec(). > > There might theoretically many variants of the ordering of kdump, > > crash_dump, and the 4 notifier list. Some variants do not make > > much sense. You choose 5 variants and tried to select them by > > a level number. > > > > The question is if we really could easily describe the meaning this > > way. It is not only about a "level" of notifiers before kdump. It is > > also about the ordering of crash_dump vs. kdump. IMHO, "level" > > semantic does not fit there. > > > > Maybe more parameters might be easier to understand the effect. > > Anyway, we first need to agree on the chosen variants. > > I am going to discuss it more in the code, see below. > > > > > > [...] > > Here is the code using the above functions. It helps to discuss > > the design and logic. > > > > I have to say that the logic is very unclear. Almost all > > functions are called twice: > > > > The really used code path is defined by order_panic_notifiers_and_kdump() > > that encodes "level" into "bits". The bits are then flipped in > > panic_notifier_*_once() calls that either do something or not. > > kmsg_dump() is called according to the bit flip. > > > > Also I guess that it is good proof that "level" abstraction does > > not fit here. Normal levels would not need this kind of magic. > > Heheh OK, I appreciate your opinion, but I guess we'll need to agree in > disagree here - I'm much more fond to this kind of code than a bunch of > if/else blocks that almost give headaches. Encoding such "level" logic > in the if/else scheme is very convoluted, generates a very big code. And > the functions aren't so black magic - they map a level in bits, and the > functions _once() are called...once! Although we switch the position in > the code, so there are 2 calls, one of them is called and the other not. I see. Well, I would consider this as a warning that the approach is too complex. If the code, using if/then/else, would cause headaches then also understanding of the behavior would cause headaches for both users and programmers. > But that's totally fine to change - especially if we're moving away from > the "level" logic. I see below you propose a much simpler approach - if > we follow that, definitely we won't need the "black magic" approach heheh I do not say that my proposal is fully correct. But we really need this kind of simpler approach. > > OK, the question is how to make it better. > > One option "panic_prefer_crash_dump" should be enough. > > And the code might look like: > > > > void panic() > > { > > [...] > > dump_stack(); > > kgdb_panic(buf); > > > > < --- here starts the reworked code --- > > > > > /* crash dump is enough when enabled and preferred. */ > > if (panic_prefer_crash_dump) > > __crash_kexec(NULL); > > > > /* Stop other CPUs and focus on handling the panic state. */ > > if (has_kexec_crash_image) > > crash_smp_send_stop(); > > else > > smp_send_stop() > > > > Here we have a very important point. Why do we need 2 variants of SMP > CPU stopping functions? I disagree with that - my understanding of this > after some study in architectures is that the crash_() variant is > "stronger", should work in all cases and if not, we should fix that - > that'd be a bug. > > Such variant either maps to smp_send_stop() (in various architectures, > including XEN/x86) or overrides the basic function with more proper > handling for panic() case...I don't see why we still need such > distinction, if you / others have some insight about that, I'd like to > hear =) The two variants were introduced by the commit 0ee59413c967c35a6dd ("x86/panic: replace smp_send_stop() with kdump friendly version in panic path") It points to https://lkml.org/lkml/2015/6/24/44 that talks about still running watchdogs. It is possible that the problem could be fixed another way. It is even possible that it has already been fixed by the notifiers that disable the watchdogs. Anyway, any change of the smp_send_stop() behavior should be done in a separate patch. It will help with bisection of possible regression. Also it would require a good explanation in the commit message. I would personally do it in a separate patch(set). > > /* Notify hypervisor about the system panic. */ > > atomic_notifier_call_chain(&panic_hypervisor_list, 0, NULL); > > > > /* > > * No need to risk extra info when there is no kmsg dumper > > * registered. > > */ > > if (!has_kmsg_dumper()) > > __crash_kexec(NULL); > > > > /* Add extra info from different subsystems. */ > > atomic_notifier_call_chain(&panic_info_list, 0, NULL); > > > > kmsg_dump(KMSG_DUMP_PANIC); > > __crash_kexec(NULL); > > > > /* Flush console */ > > unblank_screen(); > > console_unblank(); > > debug_locks_off(); > > console_flush_on_panic(CONSOLE_FLUSH_PENDING); > > > > if (panic_timeout > 0) { > > delay() > > } > > > > /* > > * Prepare system for eventual reboot and allow custom > > * reboot handling. > > */ > > atomic_notifier_call_chain(&panic_reboot_list, 0, NULL); > > You had the order of panic_reboot_list VS. consoles flushing inverted. > It might make sense, although I didn't do that in V1... IMHO, it makes sense: 1. panic_reboot_list contains notifiers that do the reboot immediately, for example, xen_panic_event, alpha_panic_event. The consoles have to be flushed earlier. 2. console_flush_on_panic() ignores the result of console_trylock() and always calls console_unlock(). As a result the lock should be unlocked at the end. And any further printk() should be able to printk the messages to the console immediately. It means that any messages printed by the reboot notifiers should appear on the console as well. > Are you OK in having a helper for console flushing, as I did in V1? It > makes code of panic() a bit less polluted / more focused I feel. Yes, it makes sense. Well, it would better to do it in a separate patch. The patch patch reworking the logic should be as small as possible. It will simplify the review. > > if (panic_timeout != 0) { > > reboot(); > > } > > > > /* > > * Prepare system for the infinite waiting, for example, > > * setup blinking. > > */ > > atomic_notifier_call_chain(&panic_loop_list, 0, NULL); > > > > infinite_loop(); > > } > > > > > > __crash_kexec() is there 3 times but otherwise the code looks > > quite straight forward. > > > > Note 1: I renamed the two last notifier list. The name 'post-reboot' > > did sound strange from the logical POV ;-) > > > > Note 2: We have to avoid the possibility to call "reboot" list > > before kmsg_dump(). All callbacks providing info > > have to be in the info list. It a callback combines > > info and reboot functionality then it should be split. > > > > There must be another way to calm down problematic > > info callbacks. And it has to be solved when such > > a problem is reported. Is there any known issue, please? > > > > It is possible that I have missed something important. > > But I would really like to make the logic as simple as possible. > > OK, I agree with you! It's indeed simpler and if others agree, I can > happily change the logic to what you proposed. Although...currently the > "crash_kexec_post_notifiers" allows to call _all_ panic_reboot_list > callbacks _before kdump_. > > We need to mention this change in the commit messages, but I really > would like to hear the opinions of heavy users of notifiers (as > Michael/Hyper-V) and the kdump interested parties (like Baoquan / Dave > Young / Hayatama). If we all agree on such approach, will change that > for V2 =) Sure, we need to make sure that we call everything that is needed. And it should be documented. I believe that this is the right way because: + It was actually the motivation for this patchset. We split the notifiers into separate lists because we want to call only the really needed ones before kmsg_dump and crash_dump. + If anything is needed for crash_dump that it should be called even when crash_dump is called first. It should be either hardcoded into crash_dump() or we would need another notifier list that will be always called before crash_dump. Thanks a lot for working on this. Best Regards, Petr
On 16/05/2022 07:21, Petr Mladek wrote: > [...] > Ah, it should have been: > > + notifiers vs. kmsg_dump > + notifiers vs. crash_dump > + crash_dump vs. kmsg_dump > > I am sorry for the confusion. Even "crash_dump" is slightly > misleading because there is no function with this name. > But it seems to be easier to understand than __crash_kexec(). Cool, thanks! Now it's totally clear for me =) I feel crash dump is the proper term, but I personally prefer kdump to avoid mess-up with user space "core dump" concept heheh Also, KDUMP is an entry on MAINTAINERS file. > [...] >> Heheh OK, I appreciate your opinion, but I guess we'll need to agree in >> disagree here - I'm much more fond to this kind of code than a bunch of >> if/else blocks that almost give headaches. Encoding such "level" logic >> in the if/else scheme is very convoluted, generates a very big code. And >> the functions aren't so black magic - they map a level in bits, and the >> functions _once() are called...once! Although we switch the position in >> the code, so there are 2 calls, one of them is called and the other not. > > I see. Well, I would consider this as a warning that the approach is > too complex. If the code, using if/then/else, would cause headaches > then also understanding of the behavior would cause headaches for > both users and programmers. > > >> But that's totally fine to change - especially if we're moving away from >> the "level" logic. I see below you propose a much simpler approach - if >> we follow that, definitely we won't need the "black magic" approach heheh > > I do not say that my proposal is fully correct. But we really need > this kind of simpler approach. It's cool, I agree that your idea is much simpler and makes sense - mine seems to be an over-engineering effort. Let's see the opinions of the interested parties, I'm curious to see if everybody agrees here, that'd would be ideal (and kind of "wishful thinking" I guess heheh - panic path is polemic). > [...] >> Here we have a very important point. Why do we need 2 variants of SMP >> CPU stopping functions? I disagree with that - my understanding of this >> after some study in architectures is that the crash_() variant is >> "stronger", should work in all cases and if not, we should fix that - >> that'd be a bug. >> >> Such variant either maps to smp_send_stop() (in various architectures, >> including XEN/x86) or overrides the basic function with more proper >> handling for panic() case...I don't see why we still need such >> distinction, if you / others have some insight about that, I'd like to >> hear =) > > The two variants were introduced by the commit 0ee59413c967c35a6dd > ("x86/panic: replace smp_send_stop() with kdump friendly version in > panic path") > > It points to https://lkml.org/lkml/2015/6/24/44 that talks about > still running watchdogs. > > It is possible that the problem could be fixed another way. It is > even possible that it has already been fixed by the notifiers > that disable the watchdogs. > > Anyway, any change of the smp_send_stop() behavior should be done > in a separate patch. It will help with bisection of possible > regression. Also it would require a good explanation in > the commit message. I would personally do it in a separate > patch(set). Thanks for the archeology and interesting findings. I agree that is better to split in smaller patches. I'm planning a split in 3 patches for V2: clean-up (comment, console flushing idea, useless header), the refactor itself and finally, this SMP change. > [...] >> You had the order of panic_reboot_list VS. consoles flushing inverted. >> It might make sense, although I didn't do that in V1... > > IMHO, it makes sense: > > 1. panic_reboot_list contains notifiers that do the reboot > immediately, for example, xen_panic_event, alpha_panic_event. > The consoles have to be flushed earlier. > > 2. console_flush_on_panic() ignores the result of console_trylock() > and always calls console_unlock(). As a result the lock should > be unlocked at the end. And any further printk() should be able > to printk the messages to the console immediately. It means > that any messages printed by the reboot notifiers should appear > on the console as well. > [...] >> OK, I agree with you! It's indeed simpler and if others agree, I can >> happily change the logic to what you proposed. Although...currently the >> "crash_kexec_post_notifiers" allows to call _all_ panic_reboot_list >> callbacks _before kdump_. >> >> We need to mention this change in the commit messages, but I really >> would like to hear the opinions of heavy users of notifiers (as >> Michael/Hyper-V) and the kdump interested parties (like Baoquan / Dave >> Young / Hayatama). If we all agree on such approach, will change that >> for V2 =) > > Sure, we need to make sure that we call everything that is needed. > And it should be documented. > > I believe that this is the right way because: > > + It was actually the motivation for this patchset. We split > the notifiers into separate lists because we want to call > only the really needed ones before kmsg_dump and crash_dump. > > + If anything is needed for crash_dump that it should be called > even when crash_dump is called first. It should be either > hardcoded into crash_dump() or we would need another notifier > list that will be always called before crash_dump. Ack, makes sense! Will do that in V2 =) For the "hardcoded" part, we have the custom machine_crash_kexec() in some archs (like x86), unfortunately not in all of them - this path is ideally for mandatory code that is required for a successful crash_kexec(). The problem is the "same old, same old" - architecture folks push that to panic notifiers; notifiers folks push it to the arch custom shutdown handler (see [0] heheh). (CCed Marc / Mark in case they want to chime-in here...) >[...] > Thanks a lot for working on this. > > Best Regards, > Petr You're welcome, _thank you_ for the great and detailed reviews! Cheers, Guilherme [0] https://lore.kernel.org/lkml/427a8277-49f0-4317-d6c3-4a15d7070e55@igalia.com/
On 05/15/22 at 07:47pm, Guilherme G. Piccoli wrote: > On 12/05/2022 11:03, Petr Mladek wrote: ...... > > OK, the question is how to make it better. Let's start with > > a clear picture of the problem: > > > > 1. panic() has basically two funtions: > > > > + show/store debug information (optional ways and amount) > > + do something with the system (reboot, stay hanged) > > > > > > 2. There are 4 ways how to show/store the information: > > > > + tell hypervisor to store what it is interested about > > + crash_dump > > + kmsg_dump() > > + consoles > > > > , where crash_dump and consoles are special: > > > > + crash_dump does not return. Instead it ends up with reboot. > > > > + Consoles work transparently. They just need an extra flush > > before reboot or staying hanged. > > > > > > 3. The various notifiers do things like: > > > > + tell hypervisor about the crash > > + print more information (also stop watchdogs) > > + prepare system for reboot (touch some interfaces) > > + prepare system for staying hanged (blinking) > > > > Note that it pretty nicely matches the 4 notifier lists. > > > > I really appreciate the summary skill you have, to convert complex > problems in very clear and concise ideas. Thanks for that, very useful! > I agree with what was summarized above. I want to say the similar words to Petr's reviewing comment when I went through the patches and traced each reviewing sub-thread to try to catch up. Petr has reivewed this series so carefully and given many comments I want to ack immediately. I agree with most of the suggestions from Petr to this patch, except of one tiny concern, please see below inline comment. > > > > Now, we need to decide about the ordering. The main area is how > > to store the debug information. Consoles are transparent so > > the quesition is about: > > > > + hypervisor > > + crash_dump > > + kmsg_dump > > > > Some people need none and some people want all. There is a > > risk that system might hung at any stage. This why people want to > > make the order configurable. > > > > But crash_dump() does not return when it succeeds. And kmsg_dump() > > users havn't complained about hypervisor problems yet. So, that > > two variants might be enough: > > > > + crash_dump (hypervisor, kmsg_dump as fallback) > > + hypervisor, kmsg_dump, crash_dump > > > > One option "panic_prefer_crash_dump" should be enough. > > And the code might look like: > > > > void panic() > > { > > [...] > > dump_stack(); > > kgdb_panic(buf); > > > > < --- here starts the reworked code --- > > > > > /* crash dump is enough when enabled and preferred. */ > > if (panic_prefer_crash_dump) > > __crash_kexec(NULL); I like the proposed skeleton of panic() and code style suggested by Petr very much. About panic_prefer_crash_dump which might need be added, I hope it has a default value true. This makes crash_dump execute at first by default just as before, unless people specify panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump, this is inconsistent with the old behaviour. > > > > /* Stop other CPUs and focus on handling the panic state. */ > > if (has_kexec_crash_image) > > crash_smp_send_stop(); > > else > > smp_send_stop() > > > > Here we have a very important point. Why do we need 2 variants of SMP > CPU stopping functions? I disagree with that - my understanding of this > after some study in architectures is that the crash_() variant is > "stronger", should work in all cases and if not, we should fix that - > that'd be a bug. > > Such variant either maps to smp_send_stop() (in various architectures, > including XEN/x86) or overrides the basic function with more proper > handling for panic() case...I don't see why we still need such > distinction, if you / others have some insight about that, I'd like to > hear =) > > > > /* Notify hypervisor about the system panic. */ > > atomic_notifier_call_chain(&panic_hypervisor_list, 0, NULL); > > > > /* > > * No need to risk extra info when there is no kmsg dumper > > * registered. > > */ > > if (!has_kmsg_dumper()) > > __crash_kexec(NULL); > > > > /* Add extra info from different subsystems. */ > > atomic_notifier_call_chain(&panic_info_list, 0, NULL); > > > > kmsg_dump(KMSG_DUMP_PANIC); > > __crash_kexec(NULL); > > > > /* Flush console */ > > unblank_screen(); > > console_unblank(); > > debug_locks_off(); > > console_flush_on_panic(CONSOLE_FLUSH_PENDING); > > > > if (panic_timeout > 0) { > > delay() > > } > > > > /* > > * Prepare system for eventual reboot and allow custom > > * reboot handling. > > */ > > atomic_notifier_call_chain(&panic_reboot_list, 0, NULL); > > You had the order of panic_reboot_list VS. consoles flushing inverted. > It might make sense, although I didn't do that in V1... > Are you OK in having a helper for console flushing, as I did in V1? It > makes code of panic() a bit less polluted / more focused I feel. > > > > > > if (panic_timeout != 0) { > > reboot(); > > } > > > > /* > > * Prepare system for the infinite waiting, for example, > > * setup blinking. > > */ > > atomic_notifier_call_chain(&panic_loop_list, 0, NULL); > > > > infinite_loop(); > > } > > > > > > __crash_kexec() is there 3 times but otherwise the code looks > > quite straight forward. > > > > Note 1: I renamed the two last notifier list. The name 'post-reboot' > > did sound strange from the logical POV ;-) > > > > Note 2: We have to avoid the possibility to call "reboot" list > > before kmsg_dump(). All callbacks providing info > > have to be in the info list. It a callback combines > > info and reboot functionality then it should be split. > > > > There must be another way to calm down problematic > > info callbacks. And it has to be solved when such > > a problem is reported. Is there any known issue, please? > > > > It is possible that I have missed something important. > > But I would really like to make the logic as simple as possible. > > OK, I agree with you! It's indeed simpler and if others agree, I can > happily change the logic to what you proposed. Although...currently the > "crash_kexec_post_notifiers" allows to call _all_ panic_reboot_list > callbacks _before kdump_. > > We need to mention this change in the commit messages, but I really > would like to hear the opinions of heavy users of notifiers (as > Michael/Hyper-V) and the kdump interested parties (like Baoquan / Dave > Young / Hayatama). If we all agree on such approach, will change that > for V2 =) > > Thanks again Petr, for the time spent in such detailed review! > Cheers, > > > Guilherme >
On 19/05/2022 20:45, Baoquan He wrote: > [...] >> I really appreciate the summary skill you have, to convert complex >> problems in very clear and concise ideas. Thanks for that, very useful! >> I agree with what was summarized above. > > I want to say the similar words to Petr's reviewing comment when I went > through the patches and traced each reviewing sub-thread to try to > catch up. Petr has reivewed this series so carefully and given many > comments I want to ack immediately. > > I agree with most of the suggestions from Petr to this patch, except of > one tiny concern, please see below inline comment. Hi Baoquan, thanks! I'm glad you're also reviewing that =) > [...] > > I like the proposed skeleton of panic() and code style suggested by > Petr very much. About panic_prefer_crash_dump which might need be added, > I hope it has a default value true. This makes crash_dump execute at > first by default just as before, unless people specify > panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add > panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump, > this is inconsistent with the old behaviour. I'd like to understand better why the crash_kexec() must always be the first thing in your use case. If we keep that behavior, we'll see all sorts of workarounds - see the last patches of this series, Hyper-V and PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force execution of their relevant notifiers (like the vmbus disconnect, specially in arm64 that has no custom machine_crash_shutdown, or the fadump case in ppc). This led to more risk in kdump. The thing is: with the notifiers' split, we tried to keep only the most relevant/necessary stuff in this first list, things that ultimately should improve kdump reliability or if not, at least not break it. My feeling is that, with this series, we should change the idea/concept that kdump must run first nevertheless, not matter what. We're here trying to accommodate the antagonistic goals of hypervisors that need some clean-up (even for kdump to work) VS. kdump users, that wish a "pristine" system reboot ASAP after the crash. Cheers, Guilherme
On Fri 2022-05-20 08:23:33, Guilherme G. Piccoli wrote: > On 19/05/2022 20:45, Baoquan He wrote: > > [...] > >> I really appreciate the summary skill you have, to convert complex > >> problems in very clear and concise ideas. Thanks for that, very useful! > >> I agree with what was summarized above. > > > > I want to say the similar words to Petr's reviewing comment when I went > > through the patches and traced each reviewing sub-thread to try to > > catch up. Petr has reivewed this series so carefully and given many > > comments I want to ack immediately. > > > > I agree with most of the suggestions from Petr to this patch, except of > > one tiny concern, please see below inline comment. > > Hi Baoquan, thanks! I'm glad you're also reviewing that =) > > > > [...] > > > > I like the proposed skeleton of panic() and code style suggested by > > Petr very much. About panic_prefer_crash_dump which might need be added, > > I hope it has a default value true. This makes crash_dump execute at > > first by default just as before, unless people specify > > panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add > > panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump, > > this is inconsistent with the old behaviour. > > I'd like to understand better why the crash_kexec() must always be the > first thing in your use case. If we keep that behavior, we'll see all > sorts of workarounds - see the last patches of this series, Hyper-V and > PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force > execution of their relevant notifiers (like the vmbus disconnect, > specially in arm64 that has no custom machine_crash_shutdown, or the > fadump case in ppc). This led to more risk in kdump. > > The thing is: with the notifiers' split, we tried to keep only the most > relevant/necessary stuff in this first list, things that ultimately > should improve kdump reliability or if not, at least not break it. My > feeling is that, with this series, we should change the idea/concept > that kdump must run first nevertheless, not matter what. We're here > trying to accommodate the antagonistic goals of hypervisors that need > some clean-up (even for kdump to work) VS. kdump users, that wish a > "pristine" system reboot ASAP after the crash. Good question. I wonder if Baoquan knows about problems caused by the particular notifiers that will end up in the hypervisor list. Note that there will be some shuffles and the list will be slightly different in V2. Anyway, I see four possible solutions: 1. The most conservative approach is to keep the current behavior and call kdump first by default. 2. A medium conservative approach to change the default default behavior and call hypervisor and eventually the info notifiers before kdump. There still would be the possibility to call kdump first by the command line parameter. 3. Remove the possibility to call kdump first completely. It would assume that all the notifiers in the info list are super safe or that they make kdump actually more safe. 4. Create one more notifier list for operations that always should be called before crash_dump. Regarding the extra notifier list (4th solution). It is not clear to me whether it would be always called even before hypervisor list or when kdump is not enabled. We must not over-engineer it. 2nd proposal looks like a good compromise. But maybe we could do this change few releases later. The notifiers split is a big change on its own. Best Regards, Petr
On 05/20/22 at 08:23am, Guilherme G. Piccoli wrote: > On 19/05/2022 20:45, Baoquan He wrote: > > [...] > >> I really appreciate the summary skill you have, to convert complex > >> problems in very clear and concise ideas. Thanks for that, very useful! > >> I agree with what was summarized above. > > > > I want to say the similar words to Petr's reviewing comment when I went > > through the patches and traced each reviewing sub-thread to try to > > catch up. Petr has reivewed this series so carefully and given many > > comments I want to ack immediately. > > > > I agree with most of the suggestions from Petr to this patch, except of > > one tiny concern, please see below inline comment. > > Hi Baoquan, thanks! I'm glad you're also reviewing that =) > > > > [...] > > > > I like the proposed skeleton of panic() and code style suggested by > > Petr very much. About panic_prefer_crash_dump which might need be added, > > I hope it has a default value true. This makes crash_dump execute at > > first by default just as before, unless people specify > > panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add > > panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump, > > this is inconsistent with the old behaviour. > > I'd like to understand better why the crash_kexec() must always be the > first thing in your use case. If we keep that behavior, we'll see all > sorts of workarounds - see the last patches of this series, Hyper-V and > PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force > execution of their relevant notifiers (like the vmbus disconnect, > specially in arm64 that has no custom machine_crash_shutdown, or the > fadump case in ppc). This led to more risk in kdump. Firstly, kdump is not always the first thing. In any use case, if kdump kernel is not loaded, it's not the first thing at all. Not to mention if crash_kexec_post_notifiers is specified. if kdump kernel is loaded, kdump has been executing firslty, since it was added into kenrel/panic(); Until 2014, Masa added crash_kexec_post_notifiers kernel parameter to make panic notifiers be able to execute before kdump if specified. commit dc009d92435f99498cbc579ce76bf28e837e2c14 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Sat Jun 25 14:57:52 2005 -0700 [PATCH] kexec: add kexec syscalls commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 Author: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Date: Fri Jun 6 14:37:07 2014 -0700 kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifers Changing this will cause regression. During these years, nobody ever doubt kdump should execute firstly if crashkernel is reserved and kdump kernel is loaded. That's not saying we can't change this, but need a convincing justification. Secondly, even with the notifiers' split, we can't guarantee people will absolutely add notifiers into right list in the future. Letting kdump execute behind lists by default will put kdump into risk. For example, you replied to Hatamata saying you have been working with kdump in the last 3, 4 years, and you have have been working on these panic notifiers refactoring issue in the recent months. However, in your refactoring patches of introducing hypervisor/info/pre-reboot, I noticed you acked the suggestion from Petr that several notifiers need be moved to correct position. So even you can't make sure these, how can other people be able to recognize which list should be 100% appropriate when they try to register one notifier for their sub-component? At last, I am wondering why fadump matters. I don't know in which case people wants to load kdump kernel, but expect to trigger crash fadump. Power people need consider this carefully and makes some change. Fadump just borrows the crashkernel reservation mechanism. If fadump would rather take risk to run all panic notifiers, whether fadump really needs them or not, then execute crash_fadump(), that's powerpc's business. As for Hyper-V, if it enforces to terminate VMbus connection, no matter it's kdump or not, why not taking it out of panic notifiers list and execute it before kdump unconditionally. Below is abstracted from Michael's words. https://lore.kernel.org/all/MWHPR21MB15933573F5C81C5250BF6A1CD75E9@MWHPR21MB1593.namprd21.prod.outlook.com/T/#u ======= I looked at the code again, and should revise my previous comments somewhat. The Hyper-V resets that I described indeed must be done prior to kexec'ing the kdump kernel. Most such resets are actually done via __crash_kexec() -> machine_crash_shutdown(), not via the panic notifier. However, the Hyper-V panic notifier must terminate the VMbus connection, because that must be done even if kdump is not being invoked. See commit 74347a99e73. ======= > > The thing is: with the notifiers' split, we tried to keep only the most > relevant/necessary stuff in this first list, things that ultimately > should improve kdump reliability or if not, at least not break it. My > feeling is that, with this series, we should change the idea/concept > that kdump must run first nevertheless, not matter what. We're here > trying to accommodate the antagonistic goals of hypervisors that need > some clean-up (even for kdump to work) VS. kdump users, that wish a > "pristine" system reboot ASAP after the crash. > > Cheers, > > > Guilherme >
On 05/24/22 at 10:01am, Petr Mladek wrote: > On Fri 2022-05-20 08:23:33, Guilherme G. Piccoli wrote: > > On 19/05/2022 20:45, Baoquan He wrote: > > > [...] > > >> I really appreciate the summary skill you have, to convert complex > > >> problems in very clear and concise ideas. Thanks for that, very useful! > > >> I agree with what was summarized above. > > > > > > I want to say the similar words to Petr's reviewing comment when I went > > > through the patches and traced each reviewing sub-thread to try to > > > catch up. Petr has reivewed this series so carefully and given many > > > comments I want to ack immediately. > > > > > > I agree with most of the suggestions from Petr to this patch, except of > > > one tiny concern, please see below inline comment. > > > > Hi Baoquan, thanks! I'm glad you're also reviewing that =) > > > > > > > [...] > > > > > > I like the proposed skeleton of panic() and code style suggested by > > > Petr very much. About panic_prefer_crash_dump which might need be added, > > > I hope it has a default value true. This makes crash_dump execute at > > > first by default just as before, unless people specify > > > panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add > > > panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump, > > > this is inconsistent with the old behaviour. > > > > I'd like to understand better why the crash_kexec() must always be the > > first thing in your use case. If we keep that behavior, we'll see all > > sorts of workarounds - see the last patches of this series, Hyper-V and > > PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force > > execution of their relevant notifiers (like the vmbus disconnect, > > specially in arm64 that has no custom machine_crash_shutdown, or the > > fadump case in ppc). This led to more risk in kdump. > > > > The thing is: with the notifiers' split, we tried to keep only the most > > relevant/necessary stuff in this first list, things that ultimately > > should improve kdump reliability or if not, at least not break it. My > > feeling is that, with this series, we should change the idea/concept > > that kdump must run first nevertheless, not matter what. We're here > > trying to accommodate the antagonistic goals of hypervisors that need > > some clean-up (even for kdump to work) VS. kdump users, that wish a > > "pristine" system reboot ASAP after the crash. > > Good question. I wonder if Baoquan knows about problems caused by the > particular notifiers that will end up in the hypervisor list. Note > that there will be some shuffles and the list will be slightly > different in V2. Yes, I knew some of them. Please check my response to Guilherme. We have bug to track the issue on Hyper-V in which failure happened during panic notifiers running, haven't come to kdump. Seems both of us sent mail replying to Guilherme at the same time. > > Anyway, I see four possible solutions: > > 1. The most conservative approach is to keep the current behavior > and call kdump first by default. > > 2. A medium conservative approach to change the default default > behavior and call hypervisor and eventually the info notifiers > before kdump. There still would be the possibility to call kdump > first by the command line parameter. > > 3. Remove the possibility to call kdump first completely. It would > assume that all the notifiers in the info list are super safe > or that they make kdump actually more safe. > > 4. Create one more notifier list for operations that always should > be called before crash_dump. I would vote for 1 or 4 without any hesitation, and prefer 4. I ever suggest the variant of solution 4 in v1 reviewing. That's taking those notifiers out of list and enforcing to execute them before kdump. E.g the one on HyperV to terminate VMbus connection. Maybe solution 4 is better to provide a determinate way for people to add necessary code at the earliest part. > > Regarding the extra notifier list (4th solution). It is not clear to > me whether it would be always called even before hypervisor list or > when kdump is not enabled. We must not over-engineer it. One thing I would like to notice is, no matter how perfect we split the lists this time, we can't gurantee people will add notifiers reasonablly in the future. And people from different sub-component may not do sufficient investigation and add them to fulfil their local purpose. The current panic notifers list is the best example. Hyper-V actually wants to run some necessary code before kdump, but not all of them, they just add it, ignoring the original purpose of crash_kexec_post_notifiers. I guess they do like this just because it's easy to do, no need to bother changing code in generic place. Solution 4 can make this no doubt, that's why I like it better. > > 2nd proposal looks like a good compromise. But maybe we could do > this change few releases later. The notifiers split is a big > change on its own. As I replied to Guilherme, solution 2 will cause regression if not calling kdump firstly. Solution 3 leaves people space to make mistake, they could add nontifier into wrong list. I would like to note again that the panic notifiers are optional to run, while kdump is expectd once loaded, from the original purpose. I guess people I know will still have this thought, e.g Hatayama, Masa, they are truly often use panic notifiers like this on their company's system.
"Guilherme G. Piccoli" <gpiccoli@igalia.com> writes: > The panic() function is somewhat convoluted - a lot of changes were > made over the years, adding comments that might be misleading/outdated > now, it has a code structure that is a bit complex to follow, with > lots of conditionals, for example. The panic notifier list is something > else - a single list, with multiple callbacks of different purposes, > that run in a non-deterministic order and may affect hardly kdump > reliability - see the "crash_kexec_post_notifiers" workaround-ish flag. > > This patch proposes a major refactor on the panic path based on Petr's > idea [0] - basically we split the notifiers list in three, having a set > of different call points in the panic path. Below a list of changes > proposed in this patch, culminating in the panic notifiers level > concept: > > (a) First of all, we improved comments all over the function > and removed useless variables / includes. Also, as part of this > clean-up we concentrate the console flushing functions in a helper. > > (b) As mentioned before, there is a split of the panic notifier list > in three, based on the purpose of the callback. The code contains > good documentation in form of comments, but a summary of the three > lists follows: > > - the hypervisor list aims low-risk procedures to inform hypervisors > or firmware about the panic event, also includes LED-related functions; > > - the informational list contains callbacks that provide more details, > like kernel offset or trace dump (if enabled) and also includes the > callbacks aimed at reducing log pollution or warns, like the RCU and > hung task disable callbacks; > > - finally, the pre_reboot list is the old notifier list renamed, > containing the more risky callbacks that didn't fit the previous > lists. There is also a 4th list (the post_reboot one), but it's not > related with the original list - it contains late time architecture > callbacks aimed at stopping the machine, for example. > > The 3 notifiers lists execute in different moments, hypervisor being > the first, followed by informational and finally the pre_reboot list. > > (c) But then, there is the ordering problem of the notifiers against > the crash_kernel() call - kdump must be as reliable as possible. > For that, a simple binary "switch" as "crash_kexec_post_notifiers" > is not enough, hence we introduce here concept of panic notifier > levels: there are 5 levels, from 0 (no notifier executes before > kdump) until 4 (all notifiers run before kdump); the default level > is 2, in which the hypervisor and (iff we have any kmsg dumper) > the informational notifiers execute before kdump. > > The detailed documentation of the levels is present in code comments > and in the kernel-parameters.txt file; as an analogy with the previous > panic() implementation, the level 0 is exactly the same as the old > behavior of notifiers, running all after kdump, and the level 4 is > the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as > a deprecated one). > > (d) Finally, an important change made here: we now use only the > function "crash_smp_send_stop()" to shut all the secondary CPUs > in the panic path. Before, there was a case of using the regular > "smp_send_stop()", but the better approach is to simplify the > code and try to use the function which was created exclusively > for the panic path. Experiments showed that it works fine, and > code was very simplified with that. > > Functional change is expected from this refactor, since now we > call some notifiers by default before kdump, but the goal here > besides code clean-up is to have a better panic path, more > reliable and deterministic, but also very customizable. > > [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/ I am late to this discussion. My apologies. Unfortunately I am also very grouchy. Notifiers before kexec on panic are fundamentally broken. So please just remove crash_kexec_post notifiers and be done with it. Part of the deep issue is that firmware always has a common and broken implementation for anything that is not mission critical to motherboards. Notifiers in any sense on these paths are just bollocks. Any kind of notifier list is fundamentally fragile in the face of memory corruption and very very difficult to review. So I am going to refresh my ancient NACK on this. I can certainly appreciate that there are pieces of the reboot paths that can be improved. I don't think making anything more feature full or flexible is any kind of real improvement. Eric > > Suggested-by: Petr Mladek <pmladek@suse.com> > Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> > --- > > Special thanks to Petr and Baoquan for the suggestion and feedback in a previous > email thread. There's some important design decisions that worth mentioning and > discussing: > > * The default panic notifiers level is 2, based on Petr Mladek's suggestion, > which makes a lot of sense. Of course, this is customizable through the > parameter, but would be something worthwhile to have a KConfig option to set > the default level? It would help distros that want the old behavior > (no notifiers before kdump) as default. > > * The implementation choice was to _avoid_ intricate if conditionals in the > panic path, which would _definitely_ be present with the panic notifiers levels > idea; so, instead of lots of if conditionals, the set/clear bits approach with > functions called in 2 points (but executing only in one of them) is much easier > to follow an was used here; the ordering helper function and the comments also > help a lot to avoid confusion (hopefully). > > * Choice was to *always* use crash_smp_send_stop() instead of sometimes making > use of the regular smp_send_stop(); for most architectures they are the same, > including Xen (on x86). For the ones that override it, all should work fine, > in the powerpc case it's even more correct (see the subsequent patch > "powerpc: Do not force all panic notifiers to execute before kdump") > > There seems to be 2 cases that requires some plumbing to work 100% right: > - ARM doesn't disable local interrupts / FIQs in the crash version of > send_stop(); we patched that early in this series; > - x86 could face an issue if we have VMX and do use crash_smp_send_stop() > _without_ kdump, but this is fixed in the first patch of the series (and > it's a bug present even before this refactor). > > * Notice we didn't add a sysrq for panic notifiers level - should have it? > Alejandro proposed recently to add a sysrq for "crash_kexec_post_notifiers", > let me know if you feel the need here Alejandro, since the core parameters are > present in /sys, I didn't consider much gain in having a sysrq, but of course > I'm open to suggestions! > > Thanks advance for the review! > > .../admin-guide/kernel-parameters.txt | 42 ++- > include/linux/panic_notifier.h | 1 + > kernel/kexec_core.c | 8 +- > kernel/panic.c | 292 +++++++++++++----- > .../selftests/pstore/pstore_crash_test | 5 +- > 5 files changed, 252 insertions(+), 96 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 3f1cc5e317ed..8d3524060ce3 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -829,6 +829,13 @@ > It will be ignored when crashkernel=X,high is not used > or memory reserved is below 4G. > > + crash_kexec_post_notifiers > + This was DEPRECATED - users should always prefer the > + parameter "panic_notifiers_level" - check its entry > + in this documentation for details on how it works. > + Setting this parameter is exactly the same as setting > + "panic_notifiers_level=4". > + > cryptomgr.notests > [KNL] Disable crypto self-tests > > @@ -3784,6 +3791,33 @@ > timeout < 0: reboot immediately > Format: <timeout> > > + panic_notifiers_level= > + [KNL] Set the panic notifiers execution order. > + Format: <unsigned int> > + We currently have 4 lists of panic notifiers; based > + on the functionality and risk (for panic success) the > + callbacks are added in a given list. The lists are: > + - hypervisor/FW notification list (low risk); > + - informational list (low/medium risk); > + - pre_reboot list (higher risk); > + - post_reboot list (only run late in panic and after > + kdump, not configurable for now). > + This parameter defines the ordering of the first 3 > + lists with regards to kdump; the levels determine > + which set of notifiers execute before kdump. The > + accepted levels are: > + 0: kdump is the first thing to run, NO list is > + executed before kdump. > + 1: only the hypervisor list is executed before kdump. > + 2 (default level): the hypervisor list and (*if* > + there's any kmsg_dumper defined) the informational > + list are executed before kdump. > + 3: both the hypervisor and the informational lists > + (always) execute before kdump. > + 4: the 3 lists (hypervisor, info and pre_reboot) > + execute before kdump - this behavior is analog to the > + deprecated parameter "crash_kexec_post_notifiers". > + > panic_print= Bitmask for printing system info when panic happens. > User can chose combination of the following bits: > bit 0: print all tasks info > @@ -3814,14 +3848,6 @@ > panic_on_warn panic() instead of WARN(). Useful to cause kdump > on a WARN(). > > - crash_kexec_post_notifiers > - Run kdump after running panic-notifiers and dumping > - kmsg. This only for the users who doubt kdump always > - succeeds in any situation. > - Note that this also increases risks of kdump failure, > - because some panic notifiers can make the crashed > - kernel more unstable. > - > parkbd.port= [HW] Parallel port number the keyboard adapter is > connected to, default is 0. > Format: <parport#> > diff --git a/include/linux/panic_notifier.h b/include/linux/panic_notifier.h > index bcf6a5ea9d7f..b5041132321d 100644 > --- a/include/linux/panic_notifier.h > +++ b/include/linux/panic_notifier.h > @@ -10,6 +10,7 @@ extern struct atomic_notifier_head panic_info_list; > extern struct atomic_notifier_head panic_pre_reboot_list; > extern struct atomic_notifier_head panic_post_reboot_list; > > +bool panic_notifiers_before_kdump(void); > extern bool crash_kexec_post_notifiers; > > enum panic_notifier_val { > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > index 68480f731192..f8906db8ca72 100644 > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -74,11 +74,11 @@ struct resource crashk_low_res = { > int kexec_should_crash(struct task_struct *p) > { > /* > - * If crash_kexec_post_notifiers is enabled, don't run > - * crash_kexec() here yet, which must be run after panic > - * notifiers in panic(). > + * In case any panic notifiers are set to execute before kexec, > + * don't run crash_kexec() here yet, which must run after these > + * panic notifiers are executed, in the panic() path. > */ > - if (crash_kexec_post_notifiers) > + if (panic_notifiers_before_kdump()) > return 0; > /* > * There are 4 panic() calls in make_task_dead() path, each of which > diff --git a/kernel/panic.c b/kernel/panic.c > index bf792102b43e..b7c055d4f87f 100644 > --- a/kernel/panic.c > +++ b/kernel/panic.c > @@ -15,7 +15,6 @@ > #include <linux/kgdb.h> > #include <linux/kmsg_dump.h> > #include <linux/kallsyms.h> > -#include <linux/notifier.h> > #include <linux/vt_kern.h> > #include <linux/module.h> > #include <linux/random.h> > @@ -52,14 +51,23 @@ static unsigned long tainted_mask = > static int pause_on_oops; > static int pause_on_oops_flag; > static DEFINE_SPINLOCK(pause_on_oops_lock); > -bool crash_kexec_post_notifiers; > + > int panic_on_warn __read_mostly; > +bool panic_on_taint_nousertaint; > unsigned long panic_on_taint; > -bool panic_on_taint_nousertaint = false; > > int panic_timeout = CONFIG_PANIC_TIMEOUT; > EXPORT_SYMBOL_GPL(panic_timeout); > > +/* Initialized with all notifiers set to run before kdump */ > +static unsigned long panic_notifiers_bits = 15; > + > +/* Default level is 2, see kernel-parameters.txt */ > +unsigned int panic_notifiers_level = 2; > + > +/* DEPRECATED in favor of panic_notifiers_level */ > +bool crash_kexec_post_notifiers; > + > #define PANIC_PRINT_TASK_INFO 0x00000001 > #define PANIC_PRINT_MEM_INFO 0x00000002 > #define PANIC_PRINT_TIMER_INFO 0x00000004 > @@ -109,10 +117,14 @@ void __weak nmi_panic_self_stop(struct pt_regs *regs) > } > > /* > - * Stop other CPUs in panic. Architecture dependent code may override this > - * with more suitable version. For example, if the architecture supports > - * crash dump, it should save registers of each stopped CPU and disable > - * per-CPU features such as virtualization extensions. > + * Stop other CPUs in panic context. > + * > + * Architecture dependent code may override this with more suitable version. > + * For example, if the architecture supports crash dump, it should save the > + * registers of each stopped CPU and disable per-CPU features such as > + * virtualization extensions. When not overridden in arch code (and for > + * x86/xen), this is exactly the same as execute smp_send_stop(), but > + * guarded against duplicate execution. > */ > void __weak crash_smp_send_stop(void) > { > @@ -183,6 +195,112 @@ static void panic_print_sys_info(bool console_flush) > ftrace_dump(DUMP_ALL); > } > > +/* > + * Helper that accumulates all console flushing routines executed on panic. > + */ > +static void console_flushing(void) > +{ > +#ifdef CONFIG_VT > + unblank_screen(); > +#endif > + console_unblank(); > + > + /* > + * In this point, we may have disabled other CPUs, hence stopping the > + * CPU holding the lock while still having some valuable data in the > + * console buffer. > + * > + * Try to acquire the lock then release it regardless of the result. > + * The release will also print the buffers out. Locks debug should > + * be disabled to avoid reporting bad unlock balance when panic() > + * is not being called from OOPS. > + */ > + debug_locks_off(); > + console_flush_on_panic(CONSOLE_FLUSH_PENDING); > + > + panic_print_sys_info(true); > +} > + > +#define PN_HYPERVISOR_BIT 0 > +#define PN_INFO_BIT 1 > +#define PN_PRE_REBOOT_BIT 2 > +#define PN_POST_REBOOT_BIT 3 > + > +/* > + * Determine the order of panic notifiers with regards to kdump. > + * > + * This function relies in the "panic_notifiers_level" kernel parameter > + * to determine how to order the notifiers with regards to kdump. We > + * have currently 5 levels. For details, please check the kernel docs for > + * "panic_notifiers_level" at Documentation/admin-guide/kernel-parameters.txt. > + * > + * Default level is 2, which means the panic hypervisor and informational > + * (unless we don't have any kmsg_dumper) lists will execute before kdump. > + */ > +static void order_panic_notifiers_and_kdump(void) > +{ > + /* > + * The parameter "crash_kexec_post_notifiers" is deprecated, but > + * valid. Users that set it want really all panic notifiers to > + * execute before kdump, so it's effectively the same as setting > + * the panic notifiers level to 4. > + */ > + if (panic_notifiers_level >= 4 || crash_kexec_post_notifiers) > + return; > + > + /* > + * Based on the level configured (smaller than 4), we clear the > + * proper bits in "panic_notifiers_bits". Notice that this bitfield > + * is initialized with all notifiers set. > + */ > + switch (panic_notifiers_level) { > + case 3: > + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > + break; > + case 2: > + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > + > + if (!kmsg_has_dumpers()) > + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > + break; > + case 1: > + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > + break; > + case 0: > + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > + clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits); > + break; > + } > +} > + > +/* > + * Set of helpers to execute the panic notifiers only once. > + * Just the informational notifier cares about the return. > + */ > +static inline bool notifier_run_once(struct atomic_notifier_head head, > + char *buf, long bit) > +{ > + if (test_and_change_bit(bit, &panic_notifiers_bits)) { > + atomic_notifier_call_chain(&head, PANIC_NOTIFIER, buf); > + return true; > + } > + return false; > +} > + > +#define panic_notifier_hypervisor_once(buf)\ > + notifier_run_once(panic_hypervisor_list, buf, PN_HYPERVISOR_BIT) > + > +#define panic_notifier_info_once(buf)\ > + notifier_run_once(panic_info_list, buf, PN_INFO_BIT) > + > +#define panic_notifier_pre_reboot_once(buf)\ > + notifier_run_once(panic_pre_reboot_list, buf, PN_PRE_REBOOT_BIT) > + > +#define panic_notifier_post_reboot_once(buf)\ > + notifier_run_once(panic_post_reboot_list, buf, PN_POST_REBOOT_BIT) > + > /** > * panic - halt the system > * @fmt: The text string to print > @@ -198,32 +316,29 @@ void panic(const char *fmt, ...) > long i, i_next = 0, len; > int state = 0; > int old_cpu, this_cpu; > - bool _crash_kexec_post_notifiers = crash_kexec_post_notifiers; > > - if (panic_on_warn) { > - /* > - * This thread may hit another WARN() in the panic path. > - * Resetting this prevents additional WARN() from panicking the > - * system on this thread. Other threads are blocked by the > - * panic_mutex in panic(). > - */ > - panic_on_warn = 0; > - } > + /* > + * This thread may hit another WARN() in the panic path, so > + * resetting this option prevents additional WARN() from > + * re-panicking the system here. > + */ > + panic_on_warn = 0; > > /* > * Disable local interrupts. This will prevent panic_smp_self_stop > - * from deadlocking the first cpu that invokes the panic, since > - * there is nothing to prevent an interrupt handler (that runs > - * after setting panic_cpu) from invoking panic() again. > + * from deadlocking the first cpu that invokes the panic, since there > + * is nothing to prevent an interrupt handler (that runs after setting > + * panic_cpu) from invoking panic() again. Also disables preemption > + * here - notice it's not safe to rely on interrupt disabling to avoid > + * preemption, since any cond_resched() or cond_resched_lock() might > + * trigger a reschedule if the preempt count is 0 (for reference, see > + * Documentation/locking/preempt-locking.rst). Some functions called > + * from here want preempt disabled, so no point enabling it later. > */ > local_irq_disable(); > preempt_disable_notrace(); > > /* > - * It's possible to come here directly from a panic-assertion and > - * not have preempt disabled. Some functions called from here want > - * preempt to be disabled. No point enabling it later though... > - * > * Only one CPU is allowed to execute the panic code from here. For > * multiple parallel invocations of panic, all other CPUs either > * stop themself or will wait until they are stopped by the 1st CPU > @@ -266,73 +381,75 @@ void panic(const char *fmt, ...) > kgdb_panic(buf); > > /* > - * If we have crashed and we have a crash kernel loaded let it handle > - * everything else. > - * If we want to run this after calling panic_notifiers, pass > - * the "crash_kexec_post_notifiers" option to the kernel. > + * Here lies one of the most subtle parts of the panic path, > + * the panic notifiers and their order with regards to kdump. > + * We currently have 4 sets of notifiers: > * > - * Bypass the panic_cpu check and call __crash_kexec directly. > + * - the hypervisor list is composed by callbacks that are related > + * to warn the FW / hypervisor about panic, or non-invasive LED > + * controlling functions - (hopefully) low-risk for kdump, should > + * run early if possible. > + * > + * - the informational list is composed by functions dumping data > + * like kernel offsets, device error registers or tracing buffer; > + * also log flooding prevention callbacks fit in this list. It is > + * relatively safe to run before kdump. > + * > + * - the pre_reboot list basically is everything else, all the > + * callbacks that don't fit in the 2 previous lists. It should > + * run *after* kdump if possible, as it contains high-risk > + * functions that may break kdump. > + * > + * - we also have a 4th list of notifiers, the post_reboot > + * callbacks. This is not strongly related to kdump since it's > + * always executed late in the panic path, after the restart > + * mechanism (if set); its goal is to provide a way for > + * architecture code effectively power-off/disable the system. > + * > + * The kernel provides the "panic_notifiers_level" parameter > + * to adjust the ordering in which these notifiers should run > + * with regards to kdump - the default level is 2, so both the > + * hypervisor and informational notifiers should execute before > + * the __crash_kexec(); the info notifier won't run by default > + * unless there's some kmsg_dumper() registered. For details > + * about it, check Documentation/admin-guide/kernel-parameters.txt. > + * > + * Notice that the code relies in bits set/clear operations to > + * determine the ordering, functions *_once() execute only one > + * time, as their name implies. The goal is to prevent too much > + * if conditionals and more confusion. Finally, regarding CPUs > + * disabling: unless NO panic notifier executes before kdump, > + * we always disable secondary CPUs before __crash_kexec() and > + * the notifiers execute. > */ > - if (!_crash_kexec_post_notifiers) { > + order_panic_notifiers_and_kdump(); > + > + /* If no level, we should kdump ASAP. */ > + if (!panic_notifiers_level) > __crash_kexec(NULL); > > - /* > - * Note smp_send_stop is the usual smp shutdown function, which > - * unfortunately means it may not be hardened to work in a > - * panic situation. > - */ > - smp_send_stop(); > - } else { > - /* > - * If we want to do crash dump after notifier calls and > - * kmsg_dump, we will need architecture dependent extra > - * works in addition to stopping other CPUs. > - */ > - crash_smp_send_stop(); > - } > + crash_smp_send_stop(); > + panic_notifier_hypervisor_once(buf); > > - /* > - * Run any panic handlers, including those that might need to > - * add information to the kmsg dump output. > - */ > - atomic_notifier_call_chain(&panic_hypervisor_list, PANIC_NOTIFIER, buf); > - atomic_notifier_call_chain(&panic_info_list, PANIC_NOTIFIER, buf); > - atomic_notifier_call_chain(&panic_pre_reboot_list, PANIC_NOTIFIER, buf); > + if (panic_notifier_info_once(buf)) { > + panic_print_sys_info(false); > + kmsg_dump(KMSG_DUMP_PANIC); > + } > > - panic_print_sys_info(false); > + panic_notifier_pre_reboot_once(buf); > > - kmsg_dump(KMSG_DUMP_PANIC); > + __crash_kexec(NULL); > > - /* > - * If you doubt kdump always works fine in any situation, > - * "crash_kexec_post_notifiers" offers you a chance to run > - * panic_notifiers and dumping kmsg before kdump. > - * Note: since some panic_notifiers can make crashed kernel > - * more unstable, it can increase risks of the kdump failure too. > - * > - * Bypass the panic_cpu check and call __crash_kexec directly. > - */ > - if (_crash_kexec_post_notifiers) > - __crash_kexec(NULL); > + panic_notifier_hypervisor_once(buf); > > -#ifdef CONFIG_VT > - unblank_screen(); > -#endif > - console_unblank(); > - > - /* > - * We may have ended up stopping the CPU holding the lock (in > - * smp_send_stop()) while still having some valuable data in the console > - * buffer. Try to acquire the lock then release it regardless of the > - * result. The release will also print the buffers out. Locks debug > - * should be disabled to avoid reporting bad unlock balance when > - * panic() is not being callled from OOPS. > - */ > - debug_locks_off(); > - console_flush_on_panic(CONSOLE_FLUSH_PENDING); > + if (panic_notifier_info_once(buf)) { > + panic_print_sys_info(false); > + kmsg_dump(KMSG_DUMP_PANIC); > + } > > - panic_print_sys_info(true); > + panic_notifier_pre_reboot_once(buf); > > + console_flushing(); > if (!panic_blink) > panic_blink = no_blink; > > @@ -363,8 +480,7 @@ void panic(const char *fmt, ...) > emergency_restart(); > } > > - atomic_notifier_call_chain(&panic_post_reboot_list, > - PANIC_NOTIFIER, buf); > + panic_notifier_post_reboot_once(buf); > > pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf); > > @@ -383,6 +499,15 @@ void panic(const char *fmt, ...) > > EXPORT_SYMBOL(panic); > > +/* > + * Helper used in the kexec code, to validate if any > + * panic notifier is set to execute early, before kdump. > + */ > +inline bool panic_notifiers_before_kdump(void) > +{ > + return panic_notifiers_level || crash_kexec_post_notifiers; > +} > + > /* > * TAINT_FORCED_RMMOD could be a per-module flag but the module > * is being removed anyway. > @@ -692,6 +817,9 @@ core_param(panic, panic_timeout, int, 0644); > core_param(panic_print, panic_print, ulong, 0644); > core_param(pause_on_oops, pause_on_oops, int, 0644); > core_param(panic_on_warn, panic_on_warn, int, 0644); > +core_param(panic_notifiers_level, panic_notifiers_level, uint, 0644); > + > +/* DEPRECATED in favor of panic_notifiers_level */ > core_param(crash_kexec_post_notifiers, crash_kexec_post_notifiers, bool, 0644); > > static int __init oops_setup(char *s) > diff --git a/tools/testing/selftests/pstore/pstore_crash_test b/tools/testing/selftests/pstore/pstore_crash_test > index 2a329bbb4aca..1e60ce4501aa 100755 > --- a/tools/testing/selftests/pstore/pstore_crash_test > +++ b/tools/testing/selftests/pstore/pstore_crash_test > @@ -25,6 +25,7 @@ touch $REBOOT_FLAG > sync > > # cause crash > -# Note: If you use kdump and want to see kmesg-* files after reboot, you should > -# specify 'crash_kexec_post_notifiers' in 1st kernel's cmdline. > +# Note: If you use kdump and want to see kmsg-* files after reboot, you should > +# be sure that the parameter "panic_notifiers_level" is more than '2' (the > +# default value for this parameter is '2') in the first kernel's cmdline. > echo c > /proc/sysrq-trigger
Hey folks, first of all thanks a lot for the reviews / opinions about this. I imagined that such change would be polemic, and I see I was right heh I'll try to "mix" all the relevant opinions in a single email, since they happened in different responses and even different mail threads. I've looped here the most interested parties based on the feedback received, such as Baoquan (kdump), Hatayama (kdump), Eric (kexec), Mark (arm64), Michael (Hyper-V), Petr (console/printk) and Vitaly (hyper-v / kvm). I hope we can discuss and try to reach some consensus - my apologies in advance for this long message! So, here goes some feedback we received about this change and correlated feedback from arm64 community - my apologies if I missed something important, I've tried to collect the most relevant portions, while keeping the summary "as short" as possible. I'll respond to such feedback below, after the quotes. On 24/05/2022 05:32, Baoquan He wrote: >> [...] >> Firstly, kdump is not always the first thing. In any use case, if kdump >> kernel is not loaded, it's not the first thing at all. Not to mention >> if crash_kexec_post_notifiers is specified. >> [...] >> Changing this will cause regression. During these years, nobody ever doubt >> kdump should execute firstly if crashkernel is reserved and kdump kernel is >> loaded. That's not saying we can't change >> this, but need a convincing justification. >> [...] >> Secondly, even with the notifiers' split, we can't guarantee people will >> absolutely add notifiers into right list in the future. Letting kdump >> execute behind lists by default will put kdump into risk. >> [...] >> As for Hyper-V, if it enforces to terminate VMbus connection, no matter >> it's kdump or not, why not taking it out of panic notifiers list and >> execute it before kdump unconditionally. On 24/05/2022 05:01, Petr Mladek wrote: >> [...] >> Anyway, I see four possible solutions: >> >> 1. The most conservative approach is to keep the current behavior >> and call kdump first by default. >> >> 2. A medium conservative approach to change the default default >> behavior and call hypervisor and eventually the info notifiers >> before kdump. There still would be the possibility to call kdump >> first by the command line parameter. >> >> 3. Remove the possibility to call kdump first completely. It would >> assume that all the notifiers in the info list are super safe >> or that they make kdump actually more safe. >> >> 4. Create one more notifier list for operations that always should >> be called before crash_dump. >> >> Regarding the extra notifier list (4th solution). It is not clear to >> me whether it would be always called even before hypervisor list or >> when kdump is not enabled. We must not over-engineer it. >> >> 2nd proposal looks like a good compromise. But maybe we could do >> this change few releases later. The notifiers split is a big >> change on its own. On 24/05/2022 07:18, Baoquan He wrote: >>[...] >> I would vote for 1 or 4 without any hesitation, and prefer 4. I ever >> suggest the variant of solution 4 in v1 reviewing. That's taking those >> notifiers out of list and enforcing to execute them before kdump. E.g >> the one on HyperV to terminate VMbus connection. Maybe solution 4 is >> better to provide a determinate way for people to add necessary code >> at the earliest part. >> [...] >>> >>> Regarding the extra notifier list (4th solution). It is not clear to >>> me whether it would be always called even before hypervisor list or >>> when kdump is not enabled. We must not over-engineer it. >> >> One thing I would like to notice is, no matter how perfect we split the >> lists this time, we can't gurantee people will add notifiers reasonablly >> in the future. And people from different sub-component may not do >> sufficient investigation and add them to fulfil their local purpose. >> >> The current panic notifers list is the best example. Hyper-V actually >> wants to run some necessary code before kdump, but not all of them, they >> just add it, ignoring the original purpose of >> crash_kexec_post_notifiers. I guess they do like this just because it's >> easy to do, no need to bother changing code in generic place. >> >> Solution 4 can make this no doubt, that's why I like it better. >> [...] >> As I replied to Guilherme, solution 2 will cause regression if not >> calling kdump firstly. Solution 3 leaves people space to make mistake, >> they could add nontifier into wrong list. >> >> I would like to note again that the panic notifiers are optional to run, >> while kdump is expectd once loaded, from the original purpose. I guess >> people I know will still have this thought, e.g Hatayama, Masa, they are >> truly often use panic notifiers like this on their company's system. On 24/05/2022 11:44, Eric W. Biederman wrote: > [...] > Unfortunately I am also very grouchy. > > Notifiers before kexec on panic are fundamentally broken. So please > just remove crash_kexec_post notifiers and be done with it. Part of the > deep issue is that firmware always has a common and broken > implementation for anything that is not mission critical to > motherboards. > > Notifiers in any sense on these paths are just bollocks. Any kind of > notifier list is fundamentally fragile in the face of memory corruption > and very very difficult to review. > > So I am going to refresh my ancient NACK on this. > > I can certainly appreciate that there are pieces of the reboot paths > that can be improved. I don't think making anything more feature full > or flexible is any kind of real improvement. Now, from the thread "Should arm64 have a custom crash shutdown handler?" (link: https://lore.kernel.org/lkml/427a8277-49f0-4317-d6c3-4a15d7070e55@igalia.com/), we have: On 05/05/2022 08:10, Mark Rutland wrote: >> On Wed, May 04, 2022 at 05:00:42PM -0300, Guilherme G. Piccoli wrote: >>> [...] >>> Currently, when we kexec in arm64, the function machine_crash_shutdown() >>> is called as a handler to disable CPUs and (potentially) do extra >>> quiesce work. In the aforementioned architectures, there's a way to >>> override this function, if for example an hypervisor wish to have its >>> guests running their own custom shutdown machinery. >> >> What exactly do you need to do in this custom shutdown machinery? >> >> The general expectation for arm64 is that any hypervisor can implement PSCI, >> and as long as you have that, CPUs (and the VM as a whole) can be shutdown in a >> standard way. >> >> I suspect what you're actually after is a mechanism to notify the hypervisor >> when the guest crashes, rather than changing the way the shutdown itself >> occurs? If so, we already have panic notifiers, and QEMU has a "pvpanic" >> device using that. See drivers/misc/pvpanic/. OK, so it seems we have some points in which agreement exists, and some points that there is no agreement and instead, we have antagonistic / opposite views and needs. Let's start with the easier part heh It seems everybody agrees that *we shouldn't over-engineer things*, and as per Eric good words: making the panic path more feature-full or increasing flexibility isn't a good idea. So, as a "corollary": the panic level approach I'm proposing is not a good fit, I'll drop it and let's go with something simpler. Another point of agreement seems to be that _notifier lists in the panic path are dangerous_, for *2 different reasons*: (a) We cannot guarantee that people won't add crazy callbacks there, we can plan and document things the best as possible - it'll never be enough, somebody eventually would slip a nonsense callback that would break things and defeat the planned purpose of such a list; (b) As per Eric point, in a panic/crash situation we might have memory corruption exactly in the list code / pointers, etc, so the notifier lists are, by nature, a bit fragile. But I think we shouldn't consider it completely "bollocks", since this approach has been used for a while with a good success rate. So, lists aren't perfect at all, but at the same time, they aren't completely useless. Now, to the points in which there are conflicting / antagonistic needs/views: (I) Kdump should be the first thing to run, as it's been like that since forever. But...notice that "crash_kexec_post_notifiers" was created exactly as a way to circumvent that, so we can see this is not an absolute truth. Some users really *require to execute* some special code *before kdump*. Worth noticing here that regular kexec invokes the drivers .shutdown() handlers, while kdump [aka crash_kexec()] does not, so we must have a way to run code before kdump in a crash situation. (II) If *we need* to have some code always running before kdump/reboot on panic path (like the Hyper-V vmbus connection unload), *where to add such code*? Again, conflicting views. Some would say we should hardcode this in the panic() function. Others, that we should use the custom machine_crash_shutdown() infrastructure - but notice that this isn't available in all architectures, like arm64. Finally, others suggest to...use notifier lists! Which was more or less the approach we took in this patch. How can we reach consensus on this? Not everybody will be 100% happy, that's for sure. Also, I'd risk to say keep things as-is now or even getting rid of "crash_kexec_post_notifiers" won't work at all, we have users with legitimate needs of running code before a kdump/reboot when crash happens. The *main goal* should be to have a *simple solution* that doesn't require users to abuse parameters, like it's been done with "crash_kexec_post_notifiers" (Hyper-V and PowerPC currently force this parameter to be enabled, for example). To avoid using a 4th list, especially given the list nature is a bit fragile, I'd suggest one of the 3 following approaches - I *really appreciate feedbacks* on that so I can implement the best solution and avoid wasting time in some poor/disliked solution: (1) We could have a helper function in the "beginning" of panic path, i.e., after the logic to disable preemption/IRQs/secondary CPUs, but *before* kdump. Then, users like Hyper-V that require to execute stuff regardless of kdump or not, would run their callbacks from there, directly, no lists involved. - pros: simple, doesn't mess with arch code or involve lists. - cons: header issues - will need to "export" such function from driver code, for example, to run in core code. Also, some code might only be required to run in some architectures, or only if kdump is set, etc., making the callbacks more complex / full of if conditionals. (2) Similarly to previous solution, we could have a helper in the kexec core code, not in the panic path. This way, users that require running stuff *before a kdump* would add direct calls there; if kdump isn't configured, and if such users also require that this code execute in panic nevertheless, they'd need to also add a callback to some notifier list. - pros: also simple / doesn't mess with arch code or involve lists; restricts the callbacks to kdump case. - cons: also header issues, but might cause code duplicity too, since some users would require both to run their code before a kdump and in some panic notifier list. (3) Have a way in arm64 (and all archs maybe) to run code before a kdump - this is analog to the custom machine_crash_shutdown() we have nowadays in some archs. - pros: decouple kdump-required callbacks from panic notifiers, doesn't involve lists, friendly to arch-dependent callbacks. - cons: also header issues, might cause code duplicity (since some users would also require to run their code in panic path regardless of kdump) and involve changing arch code (some maintainers like Mark aren't fond about that, with good reasons!). So, hopefully we can converge to some direction even if not 100% of users are happy - this problem is naturally full of trade-offs. Thanks again for the reviews and the time you're spending reading these long threads. Cheers, Guilherme
On Thu 2022-05-26 13:25:57, Guilherme G. Piccoli wrote: > OK, so it seems we have some points in which agreement exists, and some > points that there is no agreement and instead, we have antagonistic / > opposite views and needs. Let's start with the easier part heh > > It seems everybody agrees that *we shouldn't over-engineer things*, and > as per Eric good words: making the panic path more feature-full or > increasing flexibility isn't a good idea. So, as a "corollary": the > panic level approach I'm proposing is not a good fit, I'll drop it and > let's go with something simpler. Makes sense. > Another point of agreement seems to be that _notifier lists in the panic > path are dangerous_, for *2 different reasons*: > > (a) We cannot guarantee that people won't add crazy callbacks there, we > can plan and document things the best as possible - it'll never be > enough, somebody eventually would slip a nonsense callback that would > break things and defeat the planned purpose of such a list; It is true that notifier lists might allow to add crazy stuff without proper review more easily. Things added into the core code would most likely get better review. But nothing is error-proof. And bugs will happen with any approach. > (b) As per Eric point, in a panic/crash situation we might have memory > corruption exactly in the list code / pointers, etc, so the notifier > lists are, by nature, a bit fragile. But I think we shouldn't consider > it completely "bollocks", since this approach has been used for a while > with a good success rate. So, lists aren't perfect at all, but at the > same time, they aren't completely useless. I am not able to judge this. Of course, any extra step increases the risk. I am just not sure how much more complicated it would be to hardcode the calls. Most of them are architecture and/or feature specific. And such code is often hard to review and maintain. > To avoid using a 4th list, 4th or 5th? We already have "hypervisor", "info", "pre-reboot", and "pre-loop". The 5th might be pre-crash-exec. > especially given the list nature is a bit > fragile, I'd suggest one of the 3 following approaches - I *really > appreciate feedbacks* on that so I can implement the best solution and > avoid wasting time in some poor/disliked solution: Honestly, I am not able to decide what might be better without seeing the code. Most things fits pretty well into the 4 proposed lists: "hypervisor", "info", "pre-reboot", and "pre-loop". IMHO, the only question is the code that needs to be always called even before crash_dump. I suggest that you solve the crash_dump callbacks the way that looks best to you. Ideally do it in a separate patch so it can be reviewed and reworked more easily. I believe that a fresh code with an updated split and simplified logic would help us to move forward. Best Regards, Petr
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3f1cc5e317ed..8d3524060ce3 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -829,6 +829,13 @@ It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. + crash_kexec_post_notifiers + This was DEPRECATED - users should always prefer the + parameter "panic_notifiers_level" - check its entry + in this documentation for details on how it works. + Setting this parameter is exactly the same as setting + "panic_notifiers_level=4". + cryptomgr.notests [KNL] Disable crypto self-tests @@ -3784,6 +3791,33 @@ timeout < 0: reboot immediately Format: <timeout> + panic_notifiers_level= + [KNL] Set the panic notifiers execution order. + Format: <unsigned int> + We currently have 4 lists of panic notifiers; based + on the functionality and risk (for panic success) the + callbacks are added in a given list. The lists are: + - hypervisor/FW notification list (low risk); + - informational list (low/medium risk); + - pre_reboot list (higher risk); + - post_reboot list (only run late in panic and after + kdump, not configurable for now). + This parameter defines the ordering of the first 3 + lists with regards to kdump; the levels determine + which set of notifiers execute before kdump. The + accepted levels are: + 0: kdump is the first thing to run, NO list is + executed before kdump. + 1: only the hypervisor list is executed before kdump. + 2 (default level): the hypervisor list and (*if* + there's any kmsg_dumper defined) the informational + list are executed before kdump. + 3: both the hypervisor and the informational lists + (always) execute before kdump. + 4: the 3 lists (hypervisor, info and pre_reboot) + execute before kdump - this behavior is analog to the + deprecated parameter "crash_kexec_post_notifiers". + panic_print= Bitmask for printing system info when panic happens. User can chose combination of the following bits: bit 0: print all tasks info @@ -3814,14 +3848,6 @@ panic_on_warn panic() instead of WARN(). Useful to cause kdump on a WARN(). - crash_kexec_post_notifiers - Run kdump after running panic-notifiers and dumping - kmsg. This only for the users who doubt kdump always - succeeds in any situation. - Note that this also increases risks of kdump failure, - because some panic notifiers can make the crashed - kernel more unstable. - parkbd.port= [HW] Parallel port number the keyboard adapter is connected to, default is 0. Format: <parport#> diff --git a/include/linux/panic_notifier.h b/include/linux/panic_notifier.h index bcf6a5ea9d7f..b5041132321d 100644 --- a/include/linux/panic_notifier.h +++ b/include/linux/panic_notifier.h @@ -10,6 +10,7 @@ extern struct atomic_notifier_head panic_info_list; extern struct atomic_notifier_head panic_pre_reboot_list; extern struct atomic_notifier_head panic_post_reboot_list; +bool panic_notifiers_before_kdump(void); extern bool crash_kexec_post_notifiers; enum panic_notifier_val { diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 68480f731192..f8906db8ca72 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -74,11 +74,11 @@ struct resource crashk_low_res = { int kexec_should_crash(struct task_struct *p) { /* - * If crash_kexec_post_notifiers is enabled, don't run - * crash_kexec() here yet, which must be run after panic - * notifiers in panic(). + * In case any panic notifiers are set to execute before kexec, + * don't run crash_kexec() here yet, which must run after these + * panic notifiers are executed, in the panic() path. */ - if (crash_kexec_post_notifiers) + if (panic_notifiers_before_kdump()) return 0; /* * There are 4 panic() calls in make_task_dead() path, each of which diff --git a/kernel/panic.c b/kernel/panic.c index bf792102b43e..b7c055d4f87f 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -15,7 +15,6 @@ #include <linux/kgdb.h> #include <linux/kmsg_dump.h> #include <linux/kallsyms.h> -#include <linux/notifier.h> #include <linux/vt_kern.h> #include <linux/module.h> #include <linux/random.h> @@ -52,14 +51,23 @@ static unsigned long tainted_mask = static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -bool crash_kexec_post_notifiers; + int panic_on_warn __read_mostly; +bool panic_on_taint_nousertaint; unsigned long panic_on_taint; -bool panic_on_taint_nousertaint = false; int panic_timeout = CONFIG_PANIC_TIMEOUT; EXPORT_SYMBOL_GPL(panic_timeout); +/* Initialized with all notifiers set to run before kdump */ +static unsigned long panic_notifiers_bits = 15; + +/* Default level is 2, see kernel-parameters.txt */ +unsigned int panic_notifiers_level = 2; + +/* DEPRECATED in favor of panic_notifiers_level */ +bool crash_kexec_post_notifiers; + #define PANIC_PRINT_TASK_INFO 0x00000001 #define PANIC_PRINT_MEM_INFO 0x00000002 #define PANIC_PRINT_TIMER_INFO 0x00000004 @@ -109,10 +117,14 @@ void __weak nmi_panic_self_stop(struct pt_regs *regs) } /* - * Stop other CPUs in panic. Architecture dependent code may override this - * with more suitable version. For example, if the architecture supports - * crash dump, it should save registers of each stopped CPU and disable - * per-CPU features such as virtualization extensions. + * Stop other CPUs in panic context. + * + * Architecture dependent code may override this with more suitable version. + * For example, if the architecture supports crash dump, it should save the + * registers of each stopped CPU and disable per-CPU features such as + * virtualization extensions. When not overridden in arch code (and for + * x86/xen), this is exactly the same as execute smp_send_stop(), but + * guarded against duplicate execution. */ void __weak crash_smp_send_stop(void) { @@ -183,6 +195,112 @@ static void panic_print_sys_info(bool console_flush) ftrace_dump(DUMP_ALL); } +/* + * Helper that accumulates all console flushing routines executed on panic. + */ +static void console_flushing(void) +{ +#ifdef CONFIG_VT + unblank_screen(); +#endif + console_unblank(); + + /* + * In this point, we may have disabled other CPUs, hence stopping the + * CPU holding the lock while still having some valuable data in the + * console buffer. + * + * Try to acquire the lock then release it regardless of the result. + * The release will also print the buffers out. Locks debug should + * be disabled to avoid reporting bad unlock balance when panic() + * is not being called from OOPS. + */ + debug_locks_off(); + console_flush_on_panic(CONSOLE_FLUSH_PENDING); + + panic_print_sys_info(true); +} + +#define PN_HYPERVISOR_BIT 0 +#define PN_INFO_BIT 1 +#define PN_PRE_REBOOT_BIT 2 +#define PN_POST_REBOOT_BIT 3 + +/* + * Determine the order of panic notifiers with regards to kdump. + * + * This function relies in the "panic_notifiers_level" kernel parameter + * to determine how to order the notifiers with regards to kdump. We + * have currently 5 levels. For details, please check the kernel docs for + * "panic_notifiers_level" at Documentation/admin-guide/kernel-parameters.txt. + * + * Default level is 2, which means the panic hypervisor and informational + * (unless we don't have any kmsg_dumper) lists will execute before kdump. + */ +static void order_panic_notifiers_and_kdump(void) +{ + /* + * The parameter "crash_kexec_post_notifiers" is deprecated, but + * valid. Users that set it want really all panic notifiers to + * execute before kdump, so it's effectively the same as setting + * the panic notifiers level to 4. + */ + if (panic_notifiers_level >= 4 || crash_kexec_post_notifiers) + return; + + /* + * Based on the level configured (smaller than 4), we clear the + * proper bits in "panic_notifiers_bits". Notice that this bitfield + * is initialized with all notifiers set. + */ + switch (panic_notifiers_level) { + case 3: + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); + break; + case 2: + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); + + if (!kmsg_has_dumpers()) + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); + break; + case 1: + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); + break; + case 0: + clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); + clear_bit(PN_INFO_BIT, &panic_notifiers_bits); + clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits); + break; + } +} + +/* + * Set of helpers to execute the panic notifiers only once. + * Just the informational notifier cares about the return. + */ +static inline bool notifier_run_once(struct atomic_notifier_head head, + char *buf, long bit) +{ + if (test_and_change_bit(bit, &panic_notifiers_bits)) { + atomic_notifier_call_chain(&head, PANIC_NOTIFIER, buf); + return true; + } + return false; +} + +#define panic_notifier_hypervisor_once(buf)\ + notifier_run_once(panic_hypervisor_list, buf, PN_HYPERVISOR_BIT) + +#define panic_notifier_info_once(buf)\ + notifier_run_once(panic_info_list, buf, PN_INFO_BIT) + +#define panic_notifier_pre_reboot_once(buf)\ + notifier_run_once(panic_pre_reboot_list, buf, PN_PRE_REBOOT_BIT) + +#define panic_notifier_post_reboot_once(buf)\ + notifier_run_once(panic_post_reboot_list, buf, PN_POST_REBOOT_BIT) + /** * panic - halt the system * @fmt: The text string to print @@ -198,32 +316,29 @@ void panic(const char *fmt, ...) long i, i_next = 0, len; int state = 0; int old_cpu, this_cpu; - bool _crash_kexec_post_notifiers = crash_kexec_post_notifiers; - if (panic_on_warn) { - /* - * This thread may hit another WARN() in the panic path. - * Resetting this prevents additional WARN() from panicking the - * system on this thread. Other threads are blocked by the - * panic_mutex in panic(). - */ - panic_on_warn = 0; - } + /* + * This thread may hit another WARN() in the panic path, so + * resetting this option prevents additional WARN() from + * re-panicking the system here. + */ + panic_on_warn = 0; /* * Disable local interrupts. This will prevent panic_smp_self_stop - * from deadlocking the first cpu that invokes the panic, since - * there is nothing to prevent an interrupt handler (that runs - * after setting panic_cpu) from invoking panic() again. + * from deadlocking the first cpu that invokes the panic, since there + * is nothing to prevent an interrupt handler (that runs after setting + * panic_cpu) from invoking panic() again. Also disables preemption + * here - notice it's not safe to rely on interrupt disabling to avoid + * preemption, since any cond_resched() or cond_resched_lock() might + * trigger a reschedule if the preempt count is 0 (for reference, see + * Documentation/locking/preempt-locking.rst). Some functions called + * from here want preempt disabled, so no point enabling it later. */ local_irq_disable(); preempt_disable_notrace(); /* - * It's possible to come here directly from a panic-assertion and - * not have preempt disabled. Some functions called from here want - * preempt to be disabled. No point enabling it later though... - * * Only one CPU is allowed to execute the panic code from here. For * multiple parallel invocations of panic, all other CPUs either * stop themself or will wait until they are stopped by the 1st CPU @@ -266,73 +381,75 @@ void panic(const char *fmt, ...) kgdb_panic(buf); /* - * If we have crashed and we have a crash kernel loaded let it handle - * everything else. - * If we want to run this after calling panic_notifiers, pass - * the "crash_kexec_post_notifiers" option to the kernel. + * Here lies one of the most subtle parts of the panic path, + * the panic notifiers and their order with regards to kdump. + * We currently have 4 sets of notifiers: * - * Bypass the panic_cpu check and call __crash_kexec directly. + * - the hypervisor list is composed by callbacks that are related + * to warn the FW / hypervisor about panic, or non-invasive LED + * controlling functions - (hopefully) low-risk for kdump, should + * run early if possible. + * + * - the informational list is composed by functions dumping data + * like kernel offsets, device error registers or tracing buffer; + * also log flooding prevention callbacks fit in this list. It is + * relatively safe to run before kdump. + * + * - the pre_reboot list basically is everything else, all the + * callbacks that don't fit in the 2 previous lists. It should + * run *after* kdump if possible, as it contains high-risk + * functions that may break kdump. + * + * - we also have a 4th list of notifiers, the post_reboot + * callbacks. This is not strongly related to kdump since it's + * always executed late in the panic path, after the restart + * mechanism (if set); its goal is to provide a way for + * architecture code effectively power-off/disable the system. + * + * The kernel provides the "panic_notifiers_level" parameter + * to adjust the ordering in which these notifiers should run + * with regards to kdump - the default level is 2, so both the + * hypervisor and informational notifiers should execute before + * the __crash_kexec(); the info notifier won't run by default + * unless there's some kmsg_dumper() registered. For details + * about it, check Documentation/admin-guide/kernel-parameters.txt. + * + * Notice that the code relies in bits set/clear operations to + * determine the ordering, functions *_once() execute only one + * time, as their name implies. The goal is to prevent too much + * if conditionals and more confusion. Finally, regarding CPUs + * disabling: unless NO panic notifier executes before kdump, + * we always disable secondary CPUs before __crash_kexec() and + * the notifiers execute. */ - if (!_crash_kexec_post_notifiers) { + order_panic_notifiers_and_kdump(); + + /* If no level, we should kdump ASAP. */ + if (!panic_notifiers_level) __crash_kexec(NULL); - /* - * Note smp_send_stop is the usual smp shutdown function, which - * unfortunately means it may not be hardened to work in a - * panic situation. - */ - smp_send_stop(); - } else { - /* - * If we want to do crash dump after notifier calls and - * kmsg_dump, we will need architecture dependent extra - * works in addition to stopping other CPUs. - */ - crash_smp_send_stop(); - } + crash_smp_send_stop(); + panic_notifier_hypervisor_once(buf); - /* - * Run any panic handlers, including those that might need to - * add information to the kmsg dump output. - */ - atomic_notifier_call_chain(&panic_hypervisor_list, PANIC_NOTIFIER, buf); - atomic_notifier_call_chain(&panic_info_list, PANIC_NOTIFIER, buf); - atomic_notifier_call_chain(&panic_pre_reboot_list, PANIC_NOTIFIER, buf); + if (panic_notifier_info_once(buf)) { + panic_print_sys_info(false); + kmsg_dump(KMSG_DUMP_PANIC); + } - panic_print_sys_info(false); + panic_notifier_pre_reboot_once(buf); - kmsg_dump(KMSG_DUMP_PANIC); + __crash_kexec(NULL); - /* - * If you doubt kdump always works fine in any situation, - * "crash_kexec_post_notifiers" offers you a chance to run - * panic_notifiers and dumping kmsg before kdump. - * Note: since some panic_notifiers can make crashed kernel - * more unstable, it can increase risks of the kdump failure too. - * - * Bypass the panic_cpu check and call __crash_kexec directly. - */ - if (_crash_kexec_post_notifiers) - __crash_kexec(NULL); + panic_notifier_hypervisor_once(buf); -#ifdef CONFIG_VT - unblank_screen(); -#endif - console_unblank(); - - /* - * We may have ended up stopping the CPU holding the lock (in - * smp_send_stop()) while still having some valuable data in the console - * buffer. Try to acquire the lock then release it regardless of the - * result. The release will also print the buffers out. Locks debug - * should be disabled to avoid reporting bad unlock balance when - * panic() is not being callled from OOPS. - */ - debug_locks_off(); - console_flush_on_panic(CONSOLE_FLUSH_PENDING); + if (panic_notifier_info_once(buf)) { + panic_print_sys_info(false); + kmsg_dump(KMSG_DUMP_PANIC); + } - panic_print_sys_info(true); + panic_notifier_pre_reboot_once(buf); + console_flushing(); if (!panic_blink) panic_blink = no_blink; @@ -363,8 +480,7 @@ void panic(const char *fmt, ...) emergency_restart(); } - atomic_notifier_call_chain(&panic_post_reboot_list, - PANIC_NOTIFIER, buf); + panic_notifier_post_reboot_once(buf); pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf); @@ -383,6 +499,15 @@ void panic(const char *fmt, ...) EXPORT_SYMBOL(panic); +/* + * Helper used in the kexec code, to validate if any + * panic notifier is set to execute early, before kdump. + */ +inline bool panic_notifiers_before_kdump(void) +{ + return panic_notifiers_level || crash_kexec_post_notifiers; +} + /* * TAINT_FORCED_RMMOD could be a per-module flag but the module * is being removed anyway. @@ -692,6 +817,9 @@ core_param(panic, panic_timeout, int, 0644); core_param(panic_print, panic_print, ulong, 0644); core_param(pause_on_oops, pause_on_oops, int, 0644); core_param(panic_on_warn, panic_on_warn, int, 0644); +core_param(panic_notifiers_level, panic_notifiers_level, uint, 0644); + +/* DEPRECATED in favor of panic_notifiers_level */ core_param(crash_kexec_post_notifiers, crash_kexec_post_notifiers, bool, 0644); static int __init oops_setup(char *s) diff --git a/tools/testing/selftests/pstore/pstore_crash_test b/tools/testing/selftests/pstore/pstore_crash_test index 2a329bbb4aca..1e60ce4501aa 100755 --- a/tools/testing/selftests/pstore/pstore_crash_test +++ b/tools/testing/selftests/pstore/pstore_crash_test @@ -25,6 +25,7 @@ touch $REBOOT_FLAG sync # cause crash -# Note: If you use kdump and want to see kmesg-* files after reboot, you should -# specify 'crash_kexec_post_notifiers' in 1st kernel's cmdline. +# Note: If you use kdump and want to see kmsg-* files after reboot, you should +# be sure that the parameter "panic_notifiers_level" is more than '2' (the +# default value for this parameter is '2') in the first kernel's cmdline. echo c > /proc/sysrq-trigger
The panic() function is somewhat convoluted - a lot of changes were made over the years, adding comments that might be misleading/outdated now, it has a code structure that is a bit complex to follow, with lots of conditionals, for example. The panic notifier list is something else - a single list, with multiple callbacks of different purposes, that run in a non-deterministic order and may affect hardly kdump reliability - see the "crash_kexec_post_notifiers" workaround-ish flag. This patch proposes a major refactor on the panic path based on Petr's idea [0] - basically we split the notifiers list in three, having a set of different call points in the panic path. Below a list of changes proposed in this patch, culminating in the panic notifiers level concept: (a) First of all, we improved comments all over the function and removed useless variables / includes. Also, as part of this clean-up we concentrate the console flushing functions in a helper. (b) As mentioned before, there is a split of the panic notifier list in three, based on the purpose of the callback. The code contains good documentation in form of comments, but a summary of the three lists follows: - the hypervisor list aims low-risk procedures to inform hypervisors or firmware about the panic event, also includes LED-related functions; - the informational list contains callbacks that provide more details, like kernel offset or trace dump (if enabled) and also includes the callbacks aimed at reducing log pollution or warns, like the RCU and hung task disable callbacks; - finally, the pre_reboot list is the old notifier list renamed, containing the more risky callbacks that didn't fit the previous lists. There is also a 4th list (the post_reboot one), but it's not related with the original list - it contains late time architecture callbacks aimed at stopping the machine, for example. The 3 notifiers lists execute in different moments, hypervisor being the first, followed by informational and finally the pre_reboot list. (c) But then, there is the ordering problem of the notifiers against the crash_kernel() call - kdump must be as reliable as possible. For that, a simple binary "switch" as "crash_kexec_post_notifiers" is not enough, hence we introduce here concept of panic notifier levels: there are 5 levels, from 0 (no notifier executes before kdump) until 4 (all notifiers run before kdump); the default level is 2, in which the hypervisor and (iff we have any kmsg dumper) the informational notifiers execute before kdump. The detailed documentation of the levels is present in code comments and in the kernel-parameters.txt file; as an analogy with the previous panic() implementation, the level 0 is exactly the same as the old behavior of notifiers, running all after kdump, and the level 4 is the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as a deprecated one). (d) Finally, an important change made here: we now use only the function "crash_smp_send_stop()" to shut all the secondary CPUs in the panic path. Before, there was a case of using the regular "smp_send_stop()", but the better approach is to simplify the code and try to use the function which was created exclusively for the panic path. Experiments showed that it works fine, and code was very simplified with that. Functional change is expected from this refactor, since now we call some notifiers by default before kdump, but the goal here besides code clean-up is to have a better panic path, more reliable and deterministic, but also very customizable. [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/ Suggested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> --- Special thanks to Petr and Baoquan for the suggestion and feedback in a previous email thread. There's some important design decisions that worth mentioning and discussing: * The default panic notifiers level is 2, based on Petr Mladek's suggestion, which makes a lot of sense. Of course, this is customizable through the parameter, but would be something worthwhile to have a KConfig option to set the default level? It would help distros that want the old behavior (no notifiers before kdump) as default. * The implementation choice was to _avoid_ intricate if conditionals in the panic path, which would _definitely_ be present with the panic notifiers levels idea; so, instead of lots of if conditionals, the set/clear bits approach with functions called in 2 points (but executing only in one of them) is much easier to follow an was used here; the ordering helper function and the comments also help a lot to avoid confusion (hopefully). * Choice was to *always* use crash_smp_send_stop() instead of sometimes making use of the regular smp_send_stop(); for most architectures they are the same, including Xen (on x86). For the ones that override it, all should work fine, in the powerpc case it's even more correct (see the subsequent patch "powerpc: Do not force all panic notifiers to execute before kdump") There seems to be 2 cases that requires some plumbing to work 100% right: - ARM doesn't disable local interrupts / FIQs in the crash version of send_stop(); we patched that early in this series; - x86 could face an issue if we have VMX and do use crash_smp_send_stop() _without_ kdump, but this is fixed in the first patch of the series (and it's a bug present even before this refactor). * Notice we didn't add a sysrq for panic notifiers level - should have it? Alejandro proposed recently to add a sysrq for "crash_kexec_post_notifiers", let me know if you feel the need here Alejandro, since the core parameters are present in /sys, I didn't consider much gain in having a sysrq, but of course I'm open to suggestions! Thanks advance for the review! .../admin-guide/kernel-parameters.txt | 42 ++- include/linux/panic_notifier.h | 1 + kernel/kexec_core.c | 8 +- kernel/panic.c | 292 +++++++++++++----- .../selftests/pstore/pstore_crash_test | 5 +- 5 files changed, 252 insertions(+), 96 deletions(-)