Message ID | 20210512210459.1983026-1-luzmaximilian@gmail.com |
---|---|
State | New |
Headers | show |
Series | x86/i8259: Work around buggy legacy PIC | expand |
From: Maximilian Luz > Sent: 12 May 2021 22:05 > > The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4 has > some problems on boot. For some reason it consistently does not respond > on the first try, requiring a couple more tries before it finally > responds. That seems very strange, something else must be going on that causes the grief. The 8259 will be built into to the one of the cpu support chips. I can't imagine that requires anything special. It's not as though you have a real 8259 - which even a 286 can break the inter-cycle recovery on (with the circuit from the application note). David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On 5/13/21 10:10 AM, David Laight wrote: > From: Maximilian Luz >> Sent: 12 May 2021 22:05 >> >> The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4 has >> some problems on boot. For some reason it consistently does not respond >> on the first try, requiring a couple more tries before it finally >> responds. > > That seems very strange, something else must be going on that causes the grief. > The 8259 will be built into to the one of the cpu support chips. > I can't imagine that requires anything special. Right, it's definitely strange. Both Sachi (I imagine) and I don't know much about these devices, so we're open for suggestions. For reference: [1] and following comments are the discussion where we (mostly Sachi) discovered the issue, tracking this from platform_get_irq() returning -EINVAL to the PIC not probing. It also has a dmesg log [2] attached, but as far as we can tell [ 0.105820] Using NULL legacy PIC is the only relevant line to that in there. And lastly, if that's any help at all: The PIC device is described in ACPI in [3]. The Surface Laptop 3 also has an AMD CPU (although a prior generation) and has the PIC described in the exact same way (can also be found in that repository), but doesn't exhibit that behavior (and doesn't show the "Using NULL legacy PIC" line). I expect there's not much you can change to that definition so that's probably irrelevant here. Again, I don't really know anything about these devices, so my guess would be bad hardware revision or bad firmware revision. All I know is that retrying seems to "fix" it. > It's not as though you have a real 8259 - which even a 286 can > break the inter-cycle recovery on (with the circuit from the > application note). Right, I imagine that's some emulation for legacy reasons? Regards, Max [1]: https://github.com/linux-surface/linux-surface/issues/425#issuecomment-831921098 [2]: https://github.com/linux-surface/linux-surface/files/6421166/dmesg-2021-05-04_22-11.txt [3]: https://github.com/linux-surface/acpidumps/blob/69d5ecc1954ea5e90829b8e33541308e7451e951/surface_laptop_4_amd/dsdt.dsl#L1201-L1221
> -----Original Message----- > From: Maximilian Luz <luzmaximilian@gmail.com> > Sent: 13 May 2021 11:12 > To: David Laight <David.Laight@ACULAB.COM>; Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar > <mingo@redhat.com>; Borislav Petkov <bp@alien8.de> > Cc: H. Peter Anvin <hpa@zytor.com>; Sachi King <nakato@nakato.io>; x86@kernel.org; linux- > kernel@vger.kernel.org; stable@vger.kernel.org > Subject: Re: [PATCH] x86/i8259: Work around buggy legacy PIC > > On 5/13/21 10:10 AM, David Laight wrote: > > From: Maximilian Luz > >> Sent: 12 May 2021 22:05 > >> > >> The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4 has > >> some problems on boot. For some reason it consistently does not respond > >> on the first try, requiring a couple more tries before it finally > >> responds. > > > > That seems very strange, something else must be going on that causes the grief. > > The 8259 will be built into to the one of the cpu support chips. > > I can't imagine that requires anything special. > > Right, it's definitely strange. Both Sachi (I imagine) and I don't know > much about these devices, so we're open for suggestions. I found a copy of the datasheet (I don't seem to have the black book): https://pdos.csail.mit.edu/6.828/2010/readings/hardware/8259A.pdf The PC hardware has two 8259 in cascade mode. (Cascaded using an interrupt that wasn't really using in the original 8088 PC which only had one 8259.) I wonder if the bios has actually initialised is properly. Some initialisation writes have to be done to set everything up. It is also worth noting that the probe code is spectacularly crap. It writes 0xff and then checks that 0xff is read back. Almost anything (including a failed PCIe read to the ISA bridge) will return 0xff and make the test pass. It's about 35 years since I last wrote the code to initialise an 8259. The memory cells are foggy. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
From: Sachi King > Sent: 14 May 2021 20:41 > > On Thursday, May 13, 2021 8:36:27 PM AEST David Laight wrote: > > > -----Original Message----- > > > From: Maximilian Luz <luzmaximilian@gmail.com> > > > Sent: 13 May 2021 11:12 > > > To: David Laight <David.Laight@ACULAB.COM>; Thomas Gleixner > > > <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav Petkov > > > <bp@alien8.de> > > > Cc: H. Peter Anvin <hpa@zytor.com>; Sachi King <nakato@nakato.io>; > > > x86@kernel.org; linux-kernel@vger.kernel.org; stable@vger.kernel.org > > > Subject: Re: [PATCH] x86/i8259: Work around buggy legacy PIC > > > > > > On 5/13/21 10:10 AM, David Laight wrote: > > > > > > > From: Maximilian Luz > > > > > > > >> Sent: 12 May 2021 22:05 > > > >> > > > >> > > > >> > > > >> The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4 > > > >> has > > > >> some problems on boot. For some reason it consistently does not > > > >> respond > > > >> on the first try, requiring a couple more tries before it finally > > > >> responds. > > > > > > > > > > > > > > > > That seems very strange, something else must be going on that causes the > > > > grief. > > > > The 8259 will be built into to the one of the cpu support > > > > chips. > > > > I can't imagine that requires anything special. > > > > > > > > > Right, it's definitely strange. Both Sachi (I imagine) and I don't know > > > much about these devices, so we're open for suggestions. > > > > > > I found a copy of the datasheet (I don't seem to have the black book): > > > > https://pdos.csail.mit.edu/6.828/2010/readings/hardware/8259A.pdf > > > > The PC hardware has two 8259 in cascade mode. > > (Cascaded using an interrupt that wasn't really using in the original > > 8088 PC which only had one 8259.) > > > > I wonder if the bios has actually initialised is properly. > > Some initialisation writes have to be done to set everything up. > > I suspect by the displayed behaviour you are correct and that it has > not. I'm struggling to figure out who to talk to to see that is > something that can be fixed in the firmware. > > > It is also worth noting that the probe code is spectacularly crap. > > It writes 0xff and then checks that 0xff is read back. > > Almost anything (including a failed PCIe read to the ISA bridge) > > will return 0xff and make the test pass. > > I was under the impression that it wrote 0xfb, and 0xff would be > considered a failure. I probably misread it. For anything like a probe you really need to write one value and read back something different - that is hopefully reasonably unique. OTOH just the write can trash the wrong hardware - many old PC hardware probes were very dubious. Although unlikely here (since the logic is all inside chip) on a real IO bus with tristate drivers a write followed by a read to an unknown address could easily return the written value due to capacitance of the data bus. > > It's about 35 years since I last wrote the code to initialise an 8259. > > The memory cells are foggy. > > I'm not sure the i8259 is needed on the device, as the interrupts > appear to function on the device if I bypass the nr_legacy_irqs() check > while the legacy_pic is set to the null_legacy_pic. > > The null_legacy_pic however specifies having 0 irqs, and the io_apic > does not allow us to set the pin attributes unless the pin we are > attempting to set is less than nr_legacy_irqs. > > The IOAPIC seems to take responsibility for the 0-15 interrupts on this > specific hardware, should we maybe be ignoring the i8259 and looking > into allowing interrupts 0-15 to be setup even when the legacy_pic is > not available? That woke up some memory cells. IIRC on an SMP kernel the IOAPIC is always used in preference to the 8259. So maybe the initialisation code is just plain wrong! David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On 5/14/21 9:41 PM, Sachi King wrote: > On Thursday, May 13, 2021 8:36:27 PM AEST David Laight wrote: >>> -----Original Message----- >>> From: Maximilian Luz <luzmaximilian@gmail.com> >>> Sent: 13 May 2021 11:12 >>> To: David Laight <David.Laight@ACULAB.COM>; Thomas Gleixner >>> <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav Petkov >>> <bp@alien8.de> >>> Cc: H. Peter Anvin <hpa@zytor.com>; Sachi King <nakato@nakato.io>; >>> x86@kernel.org; linux-kernel@vger.kernel.org; stable@vger.kernel.org >>> Subject: Re: [PATCH] x86/i8259: Work around buggy legacy PIC >>> >>> On 5/13/21 10:10 AM, David Laight wrote: >>> >>>> From: Maximilian Luz >>>> >>>>> Sent: 12 May 2021 22:05 >>>>> >>>>> >>>>> >>>>> The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4 >>>>> has >>>>> some problems on boot. For some reason it consistently does not >>>>> respond >>>>> on the first try, requiring a couple more tries before it finally >>>>> responds. >>>> >>>> >>>> >>>> That seems very strange, something else must be going on that causes the >>>> grief. >>>> The 8259 will be built into to the one of the cpu support >>>> chips. >>>> I can't imagine that requires anything special. >>> >>> >>> Right, it's definitely strange. Both Sachi (I imagine) and I don't know >>> much about these devices, so we're open for suggestions. >> >> >> I found a copy of the datasheet (I don't seem to have the black book): >> >> https://pdos.csail.mit.edu/6.828/2010/readings/hardware/8259A.pdf >> >> The PC hardware has two 8259 in cascade mode. >> (Cascaded using an interrupt that wasn't really using in the original >> 8088 PC which only had one 8259.) >> >> I wonder if the bios has actually initialised is properly. >> Some initialisation writes have to be done to set everything up. > > I suspect by the displayed behaviour you are correct and that it has > not. I'm struggling to figure out who to talk to to see that is > something that can be fixed in the firmware. I'd assume that _some_ sort of interrupt setup is done by the BIOS/UEFI. The UEFI on those devices is fairly well-featured, with touch support via SPI and all. Furthermore, keyboard (also supported in the device's UEFI) is handled via a custom UART protocol. Unless they rely on polling for all of that, I believe they'd have to set up some interrupts. Although, as you mention later on, that could also be handled via the IOAPIC and the PIC is actually not supposed to be used. Maybe some legacy component that never got tested and just broke with some new hardware/firmware revision without anyone noticing? And since Linux still seems to rely on that, we might be the first to notice. Regards, Max
David, On Thu, May 13 2021 at 10:36, David Laight wrote: >> -----Original Message----- >> From: Maximilian Luz <luzmaximilian@gmail.com> >> Sent: 13 May 2021 11:12 >> To: David Laight <David.Laight@ACULAB.COM>; Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar >> <mingo@redhat.com>; Borislav Petkov <bp@alien8.de> >> Cc: H. Peter Anvin <hpa@zytor.com>; Sachi King <nakato@nakato.io>; x86@kernel.org; linux- >> kernel@vger.kernel.org; stable@vger.kernel.org >> Subject: Re: [PATCH] x86/i8259: Work around buggy legacy PIC can you please fix your mail client and spare us the useless header duplication in the reply? > It is also worth noting that the probe code is spectacularly crap. > It writes 0xff and then checks that 0xff is read back. > Almost anything (including a failed PCIe read to the ISA bridge) > will return 0xff and make the test pass. unsigned char probe_val = ~(1 << PIC_CASCADE_IR); outb(probe_val, PIC_MASTER_IMR); new_val = inb(PIC_MASTER_IMR); How is that writing 0xFF? Thanks, tglx
From: Thomas Gleixner > Sent: 14 May 2021 14:02 > > David, > > On Thu, May 13 2021 at 10:36, David Laight wrote: > > >> -----Original Message----- > >> From: Maximilian Luz <luzmaximilian@gmail.com> > > can you please fix your mail client and spare us the useless header > duplication in the reply? I have to delete them by hand - must have forgotten, I can't fix outlook :-) > > It is also worth noting that the probe code is spectacularly crap. > > It writes 0xff and then checks that 0xff is read back. > > Almost anything (including a failed PCIe read to the ISA bridge) > > will return 0xff and make the test pass. > > unsigned char probe_val = ~(1 << PIC_CASCADE_IR); > > outb(probe_val, PIC_MASTER_IMR); > new_val = inb(PIC_MASTER_IMR); > > How is that writing 0xFF? Sorry I misread the code and diagnostic output. In any case writing a value and expecting the same value back isn't exactly a high-quality probe. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Max, On Thu, May 13 2021 at 12:11, Maximilian Luz wrote: > And lastly, if that's any help at all: The PIC device is described in > ACPI in [3]. The Surface Laptop 3 also has an AMD CPU (although a prior > generation) and has the PIC described in the exact same way (can also be > found in that repository), but doesn't exhibit that behavior (and > doesn't show the "Using NULL legacy PIC" line). I expect there's not > much you can change to that definition so that's probably irrelevant > here. > > Again, I don't really know anything about these devices, so my guess > would be bad hardware revision or bad firmware revision. All I know is > that retrying seems to "fix" it. That might be a a power optimization thing. Except for older systems the PIC is not really required for IOAPiC to work. But there is some historical code which makes assumptions. We can change that, but that needs some careful thoughts. Thanks, tglx
From: Thomas Gleixner > Sent: 14 May 2021 14:45 > > Max, > > On Thu, May 13 2021 at 12:11, Maximilian Luz wrote: > > And lastly, if that's any help at all: The PIC device is described in > > ACPI in [3]. The Surface Laptop 3 also has an AMD CPU (although a prior > > generation) and has the PIC described in the exact same way (can also be > > found in that repository), but doesn't exhibit that behavior (and > > doesn't show the "Using NULL legacy PIC" line). I expect there's not > > much you can change to that definition so that's probably irrelevant > > here. > > > > Again, I don't really know anything about these devices, so my guess > > would be bad hardware revision or bad firmware revision. All I know is > > that retrying seems to "fix" it. > > That might be a a power optimization thing. > > Except for older systems the PIC is not really required for IOAPiC to > work. But there is some historical code which makes assumptions. We can > change that, but that needs some careful thoughts. A more interesting probe would be: - Write some value to register 1 - the mask. - Write 9 to register zero (selects interrupt in service register). - Read register 0 - should be zero since we aren't in as ISR. - Read register 1 - should get the mask back. You can also write 8 to register 0, reads then return the pending interrupts. Their might be pending interrupts - so that value can't be checked. But if reads start returning the last written value you might only have capacitors on the data bus. The required initialisation registers are pretty fixed for the PC hardware. But finding the values requires a bit of work. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
* David Laight <David.Laight@ACULAB.COM> wrote: > > > It is also worth noting that the probe code is spectacularly crap. > > > It writes 0xff and then checks that 0xff is read back. > > > Almost anything (including a failed PCIe read to the ISA bridge) > > > will return 0xff and make the test pass. > > > > unsigned char probe_val = ~(1 << PIC_CASCADE_IR); > > > > outb(probe_val, PIC_MASTER_IMR); > > new_val = inb(PIC_MASTER_IMR); > > > > How is that writing 0xFF? > > Sorry I misread the code and diagnostic output. > > In any case writing a value and expecting the same value back > isn't exactly a high-quality probe. It's not, and it's not intended to be: 0x21 is a well-known port nobody was crazy enough to override yet, so that probe basically filters out the "there is nothing at that port, at all" case, which would normally return 0xff, or in a few weird cases 0x00 perhaps. Writing something inbetween those values and getting the same value back tells us that something functional occupies that well-known IO-port, pretending to be a i8259 PIC. Which is what we wanted to know, given the context. Thanks, Ingo
On 5/14/21 9:12 AM, David Laight wrote: > > A more interesting probe would be: > - Write some value to register 1 - the mask. > - Write 9 to register zero (selects interrupt in service register). > - Read register 0 - should be zero since we aren't in as ISR. > - Read register 1 - should get the mask back. > You can also write 8 to register 0, reads then return the pending interrupts. > Their might be pending interrupts - so that value can't be checked. > > But if reads start returning the last written value you might only > have capacitors on the data bus. What data bus? These things haven't been on a physical parallel bus for ages. > The required initialisation registers are pretty fixed for the PC hardware. > But finding the values requires a bit of work. > > David And you always risk activating new bugs. Since this appears to be a specific platform advertising the wrong answer in firmware, this is better handled as a quirk. -hpa
On Fri, May 14 2021 at 13:58, Maximilian Luz wrote: > On 5/14/21 9:41 PM, Sachi King wrote: > I'd assume that _some_ sort of interrupt setup is done by the BIOS/UEFI. > The UEFI on those devices is fairly well-featured, with touch support > via SPI and all. Furthermore, keyboard (also supported in the device's > UEFI) is handled via a custom UART protocol. Unless they rely on polling > for all of that, I believe they'd have to set up some interrupts. Polling would be truly surprising. > Although, as you mention later on, that could also be handled via the > IOAPIC and the PIC is actually not supposed to be used. Maybe some > legacy component that never got tested and just broke with some new > hardware/firmware revision without anyone noticing? And since Linux > still seems to rely on that, we might be the first to notice. That's a valid assumption. As I said, we can make IOAPIC work even w/o PIC. I'll have a look how much PIC assumptions are still around. Thanks, tglx
On 5/14/21 10:32 AM, Thomas Gleixner wrote: > > That's a valid assumption. As I said, we can make IOAPIC work even w/o > PIC. I'll have a look how much PIC assumptions are still around. > As far as I read, the problem isn't actually the absence of a PIC (we definitely boot systems without PICs all the time now), but rather that the PIC is advertised in ACPI but is buggy or absent; a similar platform with different firmware doesn't have problem. If my understanding of the thread is correct, it's quirk fodder. -hpa
On Thursday, May 13, 2021 8:36:27 PM AEST David Laight wrote: > > -----Original Message----- > > From: Maximilian Luz <luzmaximilian@gmail.com> > > Sent: 13 May 2021 11:12 > > To: David Laight <David.Laight@ACULAB.COM>; Thomas Gleixner > > <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav Petkov > > <bp@alien8.de> > > Cc: H. Peter Anvin <hpa@zytor.com>; Sachi King <nakato@nakato.io>; > > x86@kernel.org; linux-kernel@vger.kernel.org; stable@vger.kernel.org > > Subject: Re: [PATCH] x86/i8259: Work around buggy legacy PIC > > > > On 5/13/21 10:10 AM, David Laight wrote: > > > > > From: Maximilian Luz > > > > > >> Sent: 12 May 2021 22:05 > > >> > > >> > > >> > > >> The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4 > > >> has > > >> some problems on boot. For some reason it consistently does not > > >> respond > > >> on the first try, requiring a couple more tries before it finally > > >> responds. > > > > > > > > > > > > That seems very strange, something else must be going on that causes the > > > grief. > > > The 8259 will be built into to the one of the cpu support > > > chips. > > > I can't imagine that requires anything special. > > > > > > Right, it's definitely strange. Both Sachi (I imagine) and I don't know > > much about these devices, so we're open for suggestions. > > > I found a copy of the datasheet (I don't seem to have the black book): > > https://pdos.csail.mit.edu/6.828/2010/readings/hardware/8259A.pdf > > The PC hardware has two 8259 in cascade mode. > (Cascaded using an interrupt that wasn't really using in the original > 8088 PC which only had one 8259.) > > I wonder if the bios has actually initialised is properly. > Some initialisation writes have to be done to set everything up. I suspect by the displayed behaviour you are correct and that it has not. I'm struggling to figure out who to talk to to see that is something that can be fixed in the firmware. > It is also worth noting that the probe code is spectacularly crap. > It writes 0xff and then checks that 0xff is read back. > Almost anything (including a failed PCIe read to the ISA bridge) > will return 0xff and make the test pass. I was under the impression that it wrote 0xfb, and 0xff would be considered a failure. > It's about 35 years since I last wrote the code to initialise an 8259. > The memory cells are foggy. I'm not sure the i8259 is needed on the device, as the interrupts appear to function on the device if I bypass the nr_legacy_irqs() check while the legacy_pic is set to the null_legacy_pic. The null_legacy_pic however specifies having 0 irqs, and the io_apic does not allow us to set the pin attributes unless the pin we are attempting to set is less than nr_legacy_irqs. The IOAPIC seems to take responsibility for the 0-15 interrupts on this specific hardware, should we maybe be ignoring the i8259 and looking into allowing interrupts 0-15 to be setup even when the legacy_pic is not available? Cheers, Sachi
On 5/14/21 7:35 PM, H. Peter Anvin wrote: > On 5/14/21 10:32 AM, Thomas Gleixner wrote: >> >> That's a valid assumption. As I said, we can make IOAPIC work even w/o >> PIC. I'll have a look how much PIC assumptions are still around. >> > > As far as I read, the problem isn't actually the absence of a PIC (we definitely boot systems without PICs all the time now), but rather that the PIC is advertised in ACPI but is buggy or absent; a similar platform with different firmware doesn't have problem. > > If my understanding of the thread is correct, it's quirk fodder. I believe the theory was that, while the PIC is advertised in ACPI, it might be expected to not be used and only present for some legacy reason (therefore untested and buggy). Which I believe led to the question whether we shouldn't prefer IOAPIC on systems like that in general. So I guess it comes down to how you define "systems like that". By Tomas' comment, I guess it should be possible to implement this as "systems that should prefer IOAPIC over legacy PIC" quirk. I guess all modern machines should have an IOAPIC, so it might also be preferable to expand that definition, maybe over time and with enough testing. If possible, I think that would be preferable to the "just retry until it works" kind of workaround. As Sachi mentioned in her reply: > I'm not sure the i8259 is needed on the device, as the interrupts > appear to function on the device if I bypass the nr_legacy_irqs() check > while the legacy_pic is set to the null_legacy_pic. > > The null_legacy_pic however specifies having 0 irqs, and the io_apic > does not allow us to set the pin attributes unless the pin we are > attempting to set is less than nr_legacy_irqs. > > The IOAPIC seems to take responsibility for the 0-15 interrupts on this > specific hardware, should we maybe be ignoring the i8259 and looking > into allowing interrupts 0-15 to be setup even when the legacy_pic is > not available? Just my two cents, please keep in mind that I'm out of my depth here. Regards, Max
Max, On Sat, May 15 2021 at 00:47, Maximilian Luz wrote: > I believe the theory was that, while the PIC is advertised in ACPI, it > might be expected to not be used and only present for some legacy reason > (therefore untested and buggy). Which I believe led to the question > whether we shouldn't prefer IOAPIC on systems like that in general. So I > guess it comes down to how you define "systems like that". By Tomas' > comment, I guess it should be possible to implement this as "systems > that should prefer IOAPIC over legacy PIC" quirk. > > I guess all modern machines should have an IOAPIC, so it might also be > preferable to expand that definition, maybe over time and with enough > testing. I just double checked and we actually can boot just fine without the PIC even when it is advertised, but disfunctional. Can you please add "apic=verbose" to the kernel command line and provide full dmesg output for a kernel w/o your patch and one with your patch applied? Thanks, tglx
On 5/17/21 8:40 PM, Thomas Gleixner wrote: > Max, > > On Sat, May 15 2021 at 00:47, Maximilian Luz wrote: >> I believe the theory was that, while the PIC is advertised in ACPI, it >> might be expected to not be used and only present for some legacy reason >> (therefore untested and buggy). Which I believe led to the question >> whether we shouldn't prefer IOAPIC on systems like that in general. So I >> guess it comes down to how you define "systems like that". By Tomas' >> comment, I guess it should be possible to implement this as "systems >> that should prefer IOAPIC over legacy PIC" quirk. >> >> I guess all modern machines should have an IOAPIC, so it might also be >> preferable to expand that definition, maybe over time and with enough >> testing. > > I just double checked and we actually can boot just fine without the > PIC even when it is advertised, but disfunctional. > > Can you please add "apic=verbose" to the kernel command line and provide > full dmesg output for a kernel w/o your patch and one with your patch > applied? I don't actually own an affected device, but I'm sure Sachi can provide you with that. As far as we can tell, due to the NULL PIC being chosen nr_legacy_irqs() returns 0. That in turn causes mp_check_pin_attr() to return false because is_level and active_low don't seem to match the expected values. That check is essentially ignored if nr_legacy_irqs() returns a high enough value. I guess that might also be a firmware bug here? Not sure where the expected values come from. Due to this, mp_map_pin_to_irq() fails with -EBUSY which causes acpi_register_gsi() to fail. That fails in acpi_dev_get_irqresource(), which causes the IRQ resource to be marked as disabled. Down the line, this then causes platform_get_irq() to return -EINVAL, because the IRQ we're trying to get has the IORESOURCE_DISABLED bit set. Sachi can probably walk you through this a bit better as she's the one who tracked this down. See also [1, 2] and following comments. Regards, Max [1]: https://github.com/linux-surface/linux-surface/issues/425#issuecomment-835309201 [2]: https://github.com/linux-surface/linux-surface/issues/425#issuecomment-835261784
On Mon, May 17 2021 at 21:25, Maximilian Luz wrote: > On 5/17/21 8:40 PM, Thomas Gleixner wrote: >> Can you please add "apic=verbose" to the kernel command line and provide >> full dmesg output for a kernel w/o your patch and one with your patch >> applied? > > I don't actually own an affected device, but I'm sure Sachi can provide > you with that. Ok. > As far as we can tell, due to the NULL PIC being chosen nr_legacy_irqs() > returns 0. That in turn causes mp_check_pin_attr() to return false > because is_level and active_low don't seem to match the expected > values. Ok. > That check is essentially ignored if nr_legacy_irqs() returns a high > enough value. Close enough. > I guess that might also be a firmware bug here? Not sure where the > expected values come from. They come from the interrupt override ACPI table and if not supplied then irq 0-15 is preset with default values, which are type=edge and polarity=high, i.e. the opposite of what the failing driver wants. The ACPI table lacks an override entry for IRQ7. I looked at one of the dmesg files in that github thread and that has overrides: [ 0.111674] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) [ 0.111681] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level) [ 0.111688] ACPI: IRQ0 used by override. [ 0.111692] ACPI: IRQ9 used by override. IRQ7 should have a corresponding entry as IRQ9 has: https://github.com/linux-surface/acpidumps/blob/4da0148744164cea0c924dab92f45842fde03177/surface_laptop_4_amd/apic.dsl#L178 Subtable Type : 02 [Interrupt Source Override] Length : 0A Bus : 00 Source : 07 Interrupt : 00000007 Flags (decoded below) : 000F Polarity : 3 Trigger Mode : 3 > Sachi can probably walk you through this a bit better as she's the one > who tracked this down. See also [1, 2] and following comments. Impressive detective work! Sachi, can you please try the hack below to confirm the above? It's not meant to be a solution, but it's the most trivial way to validate this. I'm pretty sure that Windows on Surface does not care about the PIC at all. Whether that's on purpose to safe power or just because Windows ignores the PIC completely by now does not matter at all. No idea how that repeated poking on the PIC makes it come alive either and TBH, I don't care too much about it simply because Linux is able to cope with a missing PIC as long as the ACPI tables are correct. I'm way too tired to think about a proper solution for that problem and I noticed another related issue in that dmesg output: [ 0.272448] Failed to register legacy timer interrupt It's not a problem which causes failures, but it's related to the missing PIC. Needs some more thoughts with brain awake... Thanks, tglx --- --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -21,6 +21,7 @@ #include <linux/efi-bgrt.h> #include <linux/serial_core.h> #include <linux/pgtable.h> +#include <linux/dmi.h> #include <asm/e820/api.h> #include <asm/irqdomain.h> @@ -1155,6 +1156,17 @@ static void __init mp_config_acpi_legacy } } +static const struct dmi_system_id surface_quirk[] __initconst = { + { + .ident = "Microsoft Surface Laptop 4 (AMD)", + .matches = { + DMI_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), + DMI_MATCH(DMI_PRODUCT_SKU, "Surface_Laptop_4_1952:1953") + }, + }, + {} +}; + /* * Parse IOAPIC related entries in MADT * returns 0 on success, < 0 on error @@ -1212,6 +1224,11 @@ static int __init acpi_parse_madt_ioapic acpi_sci_ioapic_setup(acpi_gbl_FADT.sci_interrupt, 0, 0, acpi_gbl_FADT.sci_interrupt); + if (dmi_check_system(surface_quirk)) { + pr_warn("Surface hack: Override irq 7\n"); + mp_override_legacy_irq(7, 3, 3, 7); + } + /* Fill in identity legacy mappings where no override */ mp_config_acpi_legacy_irqs();
On Tue, May 18, 2021 at 01:27:02AM +0200, Thomas Gleixner wrote: > On Mon, May 17 2021 at 21:25, Maximilian Luz wrote: > > On 5/17/21 8:40 PM, Thomas Gleixner wrote: > >> Can you please add "apic=verbose" to the kernel command line and provide > >> full dmesg output for a kernel w/o your patch and one with your patch > >> applied? > > > > I don't actually own an affected device, but I'm sure Sachi can provide > > you with that. > > Ok. > > > As far as we can tell, due to the NULL PIC being chosen nr_legacy_irqs() > > returns 0. That in turn causes mp_check_pin_attr() to return false > > because is_level and active_low don't seem to match the expected > > values. > > Ok. > > > That check is essentially ignored if nr_legacy_irqs() returns a high > > enough value. > > Close enough. > > > I guess that might also be a firmware bug here? Not sure where the > > expected values come from. > > They come from the interrupt override ACPI table and if not supplied > then irq 0-15 is preset with default values, which are type=edge and > polarity=high, i.e. the opposite of what the failing driver wants. > > The ACPI table lacks an override entry for IRQ7. I looked at one of the > dmesg files in that github thread and that has overrides: > > [ 0.111674] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) > [ 0.111681] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level) > [ 0.111688] ACPI: IRQ0 used by override. > [ 0.111692] ACPI: IRQ9 used by override. > > IRQ7 should have a corresponding entry as IRQ9 has: > > https://github.com/linux-surface/acpidumps/blob/4da0148744164cea0c924dab92f45842fde03177/surface_laptop_4_amd/apic.dsl#L178 > > Subtable Type : 02 [Interrupt Source Override] > Length : 0A > Bus : 00 > Source : 07 > Interrupt : 00000007 > Flags (decoded below) : 000F > Polarity : 3 > Trigger Mode : 3 > > > Sachi can probably walk you through this a bit better as she's the one > > who tracked this down. See also [1, 2] and following comments. > > Impressive detective work! > > Sachi, can you please try the hack below to confirm the above? > > It's not meant to be a solution, but it's the most trivial way to > validate this. > > I'm pretty sure that Windows on Surface does not care about the PIC at > all. Whether that's on purpose to safe power or just because Windows > ignores the PIC completely by now does not matter at all. No idea how > that repeated poking on the PIC makes it come alive either and TBH, I > don't care too much about it simply because Linux is able to cope with a > missing PIC as long as the ACPI tables are correct. > > I'm way too tired to think about a proper solution for that problem and > I noticed another related issue in that dmesg output: > > [ 0.272448] Failed to register legacy timer interrupt > > It's not a problem which causes failures, but it's related to the > missing PIC. But ACPI has a pretty nice means about missing legacy hardware, it's called Hardware Reduced mode. It excludes automatically the (legacy) PIC, PIT, etc. OTOH it excludes ACPI power chip as well. I haven't looked into this, just share my thoughts what else can be checked. (On Intel the MID devices use that approach) > Needs some more thoughts with brain awake... -- With Best Regards, Andy Shevchenko
Sachi, On Wed, May 19 2021 at 05:58, Sachi King wrote: > On Tuesday, May 18, 2021 9:27:02 AM AEST Thomas Gleixner wrote: >> On Mon, May 17 2021 at 21:25, Maximilian Luz wrote: >> > On 5/17/21 8:40 PM, Thomas Gleixner wrote: >> >> Can you please add "apic=verbose" to the kernel command line and provide >> >> full dmesg output for a kernel w/o your patch and one with your patch >> >> applied? > > I've linked to a dmesg with "apic=verbose" below with the irq 7 override hack > applied. Would you still like a copy without either of these patches > applied? No, that's pretty conclusive. > What's the convention for including a dmesg on the mailing list? I've > included the copy via gist as pasting it into email seems incorrect, but > that's also probably not the correct convention. That's fine. >> Sachi, can you please try the hack below to confirm the above? >> >> It's not meant to be a solution, but it's the most trivial way to >> validate this. > > I've applied that patch and the GPIO driver loads and functions. It's exactly what I expected. So now for a proper solution. After talking to Raphael it seems something like that quirk is the best option we have. I'll try to come up with something less horrible. There is some other minor issue I noticed, which I need to look at as well. Thanks for digging into this and for testing! tglx
On Tuesday, May 18, 2021 9:27:02 AM AEST Thomas Gleixner wrote: > On Mon, May 17 2021 at 21:25, Maximilian Luz wrote: > > On 5/17/21 8:40 PM, Thomas Gleixner wrote: > >> Can you please add "apic=verbose" to the kernel command line and provide > >> full dmesg output for a kernel w/o your patch and one with your patch > >> applied? I've linked to a dmesg with "apic=verbose" below with the irq 7 override hack applied. Would you still like a copy without either of these patches applied? What's the convention for including a dmesg on the mailing list? I've included the copy via gist as pasting it into email seems incorrect, but that's also probably not the correct convention. > Sachi, can you please try the hack below to confirm the above? > > It's not meant to be a solution, but it's the most trivial way to > validate this. I've applied that patch and the GPIO driver loads and functions. The dmesg with the patch and "apic=verbose" I've placed at: https://gist.github.com/nakato/08c6962a540d91b6f9597303d54bffe5 Thanks, Sachi
diff --git a/arch/x86/kernel/i8259.c b/arch/x86/kernel/i8259.c index 282b4ee1339f..0da757c6b292 100644 --- a/arch/x86/kernel/i8259.c +++ b/arch/x86/kernel/i8259.c @@ -1,4 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 + +#define pr_fmt(fmt) "i8259: " fmt + #include <linux/linkage.h> #include <linux/errno.h> #include <linux/signal.h> @@ -16,6 +19,7 @@ #include <linux/io.h> #include <linux/delay.h> #include <linux/pgtable.h> +#include <linux/dmi.h> #include <linux/atomic.h> #include <asm/timer.h> @@ -298,11 +302,39 @@ static void unmask_8259A(void) raw_spin_unlock_irqrestore(&i8259A_lock, flags); } +/* + * DMI table to identify devices with quirky probe behavior. See comment in + * probe_8259A() for more details. + */ +static const struct dmi_system_id retry_probe_quirk_table[] = { + { + .ident = "Microsoft Surface Laptop 4 (AMD)", + .matches = { + DMI_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), + DMI_MATCH(DMI_PRODUCT_SKU, "Surface_Laptop_4_1952:1953") + }, + }, + {} +}; + static int probe_8259A(void) { unsigned long flags; unsigned char probe_val = ~(1 << PIC_CASCADE_IR); unsigned char new_val; + unsigned int i, imax = 1; + + /* + * Some systems have a legacy PIC that doesn't immediately respond + * after boot. We know it's there, we know it should respond and is + * required for proper interrupt handling later on, so let's try a + * couple of times. + */ + if (dmi_check_system(retry_probe_quirk_table)) { + pr_warn("system with broken legacy PIC detected, re-trying multiple times if necessary\n"); + imax = 10; + } + /* * Check to see if we have a PIC. * Mask all except the cascade and read @@ -312,15 +344,24 @@ static int probe_8259A(void) */ raw_spin_lock_irqsave(&i8259A_lock, flags); - outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */ - outb(probe_val, PIC_MASTER_IMR); - new_val = inb(PIC_MASTER_IMR); - if (new_val != probe_val) { - printk(KERN_INFO "Using NULL legacy PIC\n"); + for (i = 0; i < imax; i++) { + outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */ + outb(probe_val, PIC_MASTER_IMR); + new_val = inb(PIC_MASTER_IMR); + if (new_val == probe_val) + break; + } + + if (i == imax) { + pr_info("using NULL legacy PIC\n"); legacy_pic = &null_legacy_pic; } raw_spin_unlock_irqrestore(&i8259A_lock, flags); + + if (imax > 1 && i < imax) + pr_info("got legacy PIC after %d tries\n", i + 1); + return nr_legacy_irqs(); }