| Message ID | 20230513220418.19357-7-kirill.shutemov@linux.intel.com |
| --- | --- |
| State | Superseded |
| Series | mm, x86/cc, efi: Implement support for unaccepted memory |
On Sun, 14 May 2023 at 00:04, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> The unwanted loads are typically harmless. But, they might be made to
> totally unrelated or even unmapped memory. load_unaligned_zeropad()
> relies on exception fixup (#PF, #GP and now #VE) to recover from these
> unwanted loads.
>
> But, this approach does not work for unaccepted memory. For TDX, a load
> from unaccepted memory will not lead to a recoverable exception within
> the guest. The guest will exit to the VMM where the only recourse is to
> terminate the guest.
>

Does this mean that the kernel maps memory before accepting it? As
otherwise, I would assume that such an access would page fault inside
the guest before triggering an exception related to the unaccepted
state.

> There are two parts to fix this issue and comprehensively avoid access
> to unaccepted memory. Together these ensure that an extra "guard" page
> is accepted in addition to the memory that needs to be used.
>
> 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
>    checks up to end+unit_size if 'end' is aligned on a unit_size
>    boundary.
> 2. Implicitly extend accept_memory(start, end) to end+unit_size if 'end'
>    is aligned on a unit_size boundary.
>
> Side note: This leads to something strange. Pages which were accepted
>            at boot, marked by the firmware as accepted and will never
>            _need_ to be accepted might be on unaccepted_pages list
>            This is a cue to ensure that the next page is accepted
>            before 'page' can be used.
>
> This is an actual, real-world problem which was discovered during TDX
> testing.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
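To make the overshoot concrete, here is a minimal user-space sketch; it is illustrative only and does not use the kernel's exception-fixup-based load_unaligned_zeropad(). The second mmap()ed page stands in for the page that would be unaccepted in a TDX guest.

```c
/*
 * Illustrative user-space sketch (not the kernel implementation): a string
 * stored flush against a page boundary is read with one unaligned 8-byte
 * load, the way a word-at-a-time parser would read it.  The load also
 * touches the first bytes of the following page.  Here that page is mapped
 * and zero-filled, so the extra bytes read as zero; in a TDX guest with the
 * next page unaccepted, the same load would be fatal.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *buf = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* 6 bytes ("hello" plus NUL) ending exactly at the page boundary */
	unsigned char *s = buf + page - 6;
	memcpy(s, "hello", 6);

	/* One unaligned 8-byte word load starting at the string... */
	uint64_t word;
	memcpy(&word, s, sizeof(word));

	/* ...reads 2 bytes past the boundary, into the second page. */
	printf("word load at offset %ld reads offsets %ld..%ld, value 0x%016llx\n",
	       (long)(s - buf), (long)(s - buf), (long)(s - buf) + 7,
	       (unsigned long long)word);
	return 0;
}
```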
On 5/16/23 11:08, Ard Biesheuvel wrote:
>> But, this approach does not work for unaccepted memory. For TDX, a load
>> from unaccepted memory will not lead to a recoverable exception within
>> the guest. The guest will exit to the VMM where the only recourse is to
>> terminate the guest.
>>
> Does this mean that the kernel maps memory before accepting it? As
> otherwise, I would assume that such an access would page fault inside
> the guest before triggering an exception related to the unaccepted
> state.

Yes, the kernel maps memory before accepting it (modulo things like
DEBUG_PAGEALLOC).
On Tue, May 16, 2023 at 08:08:37PM +0200, Ard Biesheuvel wrote:
> On Sun, 14 May 2023 at 00:04, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> > The unwanted loads are typically harmless. But, they might be made to
> > totally unrelated or even unmapped memory. load_unaligned_zeropad()
> > relies on exception fixup (#PF, #GP and now #VE) to recover from these
> > unwanted loads.
> >
> > But, this approach does not work for unaccepted memory. For TDX, a load
> > from unaccepted memory will not lead to a recoverable exception within
> > the guest. The guest will exit to the VMM where the only recourse is to
> > terminate the guest.
> >
>
> Does this mean that the kernel maps memory before accepting it? As
> otherwise, I would assume that such an access would page fault inside
> the guest before triggering an exception related to the unaccepted
> state.

Yes, the kernel maps all memory into the direct mapping, whether it is
accepted or not [yet]. The problem is that an access to unaccepted memory
does not cause a page fault on TDX. It causes an unrecoverable exit to the
host, so it must not happen for legitimate accesses, including a
load_unaligned_zeropad() overshoot.

For context: there is a way to configure the TDX environment to trigger a
#VE on such accesses, and it is the default. But Linux requires such #VEs
to be disabled because they open an attack vector from the host to the
guest: the host can pull any private page from under the kernel at any
point and trigger such a #VE. If it happens at just the right time in the
syscall gap or NMI entry code, it can be exploitable. See also commits
9a22bf6debbf and 373e715e31bf.
On Tue, 16 May 2023 at 20:27, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/16/23 11:08, Ard Biesheuvel wrote:
> >> But, this approach does not work for unaccepted memory. For TDX, a load
> >> from unaccepted memory will not lead to a recoverable exception within
> >> the guest. The guest will exit to the VMM where the only recourse is to
> >> terminate the guest.
> >>
> > Does this mean that the kernel maps memory before accepting it? As
> > otherwise, I would assume that such an access would page fault inside
> > the guest before triggering an exception related to the unaccepted
> > state.
>
> Yes, the kernel maps memory before accepting it (modulo things like
> DEBUG_PAGEALLOC).
>

OK, and so the architecture stipulates that prefetching or other
speculative accesses must never deliver exceptions to the host regarding
such ranges?

If this all works as it should, then I'm ok with leaving this here, but I
imagine we may want to factor out some arch specific policy here in the
future, as I don't think this would work the same on ARM.
On Tue, May 16, 2023 at 08:35:27PM +0200, Ard Biesheuvel wrote:
> On Tue, 16 May 2023 at 20:27, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 5/16/23 11:08, Ard Biesheuvel wrote:
> > >> But, this approach does not work for unaccepted memory. For TDX, a load
> > >> from unaccepted memory will not lead to a recoverable exception within
> > >> the guest. The guest will exit to the VMM where the only recourse is to
> > >> terminate the guest.
> > >>
> > > Does this mean that the kernel maps memory before accepting it? As
> > > otherwise, I would assume that such an access would page fault inside
> > > the guest before triggering an exception related to the unaccepted
> > > state.
> >
> > Yes, the kernel maps memory before accepting it (modulo things like
> > DEBUG_PAGEALLOC).
> >
>
> OK, and so the architecture stipulates that prefetching or other
> speculative accesses must never deliver exceptions to the host
> regarding such ranges?
>
> If this all works as it should, then I'm ok with leaving this here,
> but I imagine we may want to factor out some arch specific policy here
> in the future, as I don't think this would work the same on ARM.

Even if other architectures don't need this, it is harmless: we just
accept one unit ahead of time.
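As a worked example of what accepting "one unit ahead of time" means, here is a stand-alone sketch that applies the same end-rounding as the patch; the 2 MiB unit_size is an assumption for illustration (the real value comes from the EFI unaccepted memory table).

```c
/*
 * Stand-alone sketch of the guard-unit rounding, with an assumed 2 MiB
 * unit_size.  When a range ends exactly on a unit boundary, the check and
 * the acceptance are extended by one unit.  When it ends mid-unit, no
 * extension is needed: a word that overshoots 'end' still lies inside the
 * unit containing 'end', which is covered by the range anyway.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long unit_size = 2ULL << 20;	/* assumed: 2 MiB */
	unsigned long long ends[] = {
		8ULL << 20,		/* ends on a unit boundary      */
		(8ULL << 20) + 4096,	/* ends in the middle of a unit */
	};

	for (int i = 0; i < 2; i++) {
		unsigned long long end = ends[i];

		/* same adjustment as in the patch */
		if (!(end % unit_size))
			end += unit_size;

		printf("requested end = 0x%llx, effective end = 0x%llx\n",
		       ends[i], end);
	}
	return 0;
}
```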
On 5/16/23 11:35, Ard Biesheuvel wrote:
>>> Does this mean that the kernel maps memory before accepting it? As
>>> otherwise, I would assume that such an access would page fault inside
>>> the guest before triggering an exception related to the unaccepted
>>> state.
>> Yes, the kernel maps memory before accepting it (modulo things like
>> DEBUG_PAGEALLOC).
>>
> OK, and so the architecture stipulates that prefetching or other
> speculative accesses must never deliver exceptions to the host
> regarding such ranges?

I don't know of anywhere that this is explicitly written. It's probably
implicit _somewhere_ in the reams of VMX/TDX and base SDM docs, but heck
if I know where it is. :)

If this is something anyone wants to see added to the SEPT_VE_DISABLE
documentation, please speak up. I don't think it would be hard to get it
added and provide an explicit guarantee.
On Tue, May 16, 2023 at 01:03:32PM -0700, Dave Hansen wrote:
> On 5/16/23 11:35, Ard Biesheuvel wrote:
> >>> Does this mean that the kernel maps memory before accepting it? As
> >>> otherwise, I would assume that such an access would page fault inside
> >>> the guest before triggering an exception related to the unaccepted
> >>> state.
> >> Yes, the kernel maps memory before accepting it (modulo things like
> >> DEBUG_PAGEALLOC).
> >>
> > OK, and so the architecture stipulates that prefetching or other
> > speculative accesses must never deliver exceptions to the host
> > regarding such ranges?
>
> I don't know of anywhere that this is explicitly written. It's probably
> implicit _somewhere_ in the reams of VMX/TDX and base SDM docs, but heck
> if I know where it is. :)

It is not specific to TDX: on x86 (and all architectures with precise
exceptions) exception handling is delayed until instruction retirement and
will not happen if speculation turned out to be wrong. And prefetching
never generates exceptions. But I failed to find it right away in the
5000+ pages of the Intel Software Developer's Manual. :/
On 5/16/23 14:52, Kirill A. Shutemov wrote:
> On Tue, May 16, 2023 at 01:03:32PM -0700, Dave Hansen wrote:
>> On 5/16/23 11:35, Ard Biesheuvel wrote:
>>>>> Does this mean that the kernel maps memory before accepting it? As
>>>>> otherwise, I would assume that such an access would page fault inside
>>>>> the guest before triggering an exception related to the unaccepted
>>>>> state.
>>>> Yes, the kernel maps memory before accepting it (modulo things like
>>>> DEBUG_PAGEALLOC).
>>>>
>>> OK, and so the architecture stipulates that prefetching or other
>>> speculative accesses must never deliver exceptions to the host
>>> regarding such ranges?
>> I don't know of anywhere that this is explicitly written. It's probably
>> implicit _somewhere_ in the reams of VMX/TDX and base SDM docs, but heck
>> if I know where it is. 😄
> It is not specific to TDX: on x86 (and all architectures with precise
> exceptions) exception handling is delayed until instruction retirement and
> will not happen if speculation turned out to be wrong. And prefetching
> never generates exceptions.

Not to be Debbie Downer too much here, but it's *totally* possible for
speculative execution to go read memory that causes you to machine check.
We've had such bugs in Linux.

We just happen to be lucky in this case that the unaccepted memory
exceptions don't generate machine checks *AND* TDX hardware does not
machine check on speculative accesses that would _just_ violate TDX
security properties.

You're right for normal, sane exceptions, though.
On Wed, 17 May 2023 at 00:00, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/16/23 14:52, Kirill A. Shutemov wrote:
> > On Tue, May 16, 2023 at 01:03:32PM -0700, Dave Hansen wrote:
> >> On 5/16/23 11:35, Ard Biesheuvel wrote:
> >>>>> Does this mean that the kernel maps memory before accepting it? As
> >>>>> otherwise, I would assume that such an access would page fault inside
> >>>>> the guest before triggering an exception related to the unaccepted
> >>>>> state.
> >>>> Yes, the kernel maps memory before accepting it (modulo things like
> >>>> DEBUG_PAGEALLOC).
> >>>>
> >>> OK, and so the architecture stipulates that prefetching or other
> >>> speculative accesses must never deliver exceptions to the host
> >>> regarding such ranges?
> >> I don't know of anywhere that this is explicitly written. It's probably
> >> implicit _somewhere_ in the reams of VMX/TDX and base SDM docs, but heck
> >> if I know where it is. 😄
> > It is not specific to TDX: on x86 (and all architectures with precise
> > exceptions) exception handling is delayed until instruction retirement and
> > will not happen if speculation turned out to be wrong. And prefetching
> > never generates exceptions.
>
> Not to be Debbie Downer too much here, but it's *totally* possible for
> speculative execution to go read memory that causes you to machine
> check. We've had such bugs in Linux.
>
> We just happen to be lucky in this case that the unaccepted memory
> exceptions don't generate machine checks *AND* TDX hardware does not
> machine check on speculative accesses that would _just_ violate TDX
> security properties.
>
> You're right for normal, sane exceptions, though.

Same thing on ARM, although I'd have to check their RME stuff in more
detail to see how it behaves in this particular case.

But Kirill is right that it doesn't really matter for the logic in this
patch - it just accepts some additional pages. The relevant difference
between implementations will likely be whether unaccepted memory gets
mapped beforehand in the first place, but we'll deal with that once we
have to.

As long as we only accept memory that appears in the bitmap as
'unaccepted', this kind of rounding seems safe and reasonable to me.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
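To illustrate why the rounding is safe, here is a toy model of the bitmap walk; the names, sizes, and the unit-granular interface are made up for the example (a caller range always ends on a unit boundary here, so the extension is unconditional). Acceptance only does work for units whose bit is still set, and the clamp keeps the guard unit from running past the end of the bitmap.

```c
/*
 * Toy model (illustrative only): a set bit means the unit is still
 * unaccepted.  The guard unit added by the rounding either gets accepted
 * early, is already clear (no-op), or is clamped away at the end of the
 * bitmap.
 */
#include <stdbool.h>
#include <stdio.h>

#define NUNITS 8

static bool unaccepted_bit[NUNITS] = {
	false, false, false, false,	/* units 0-3: accepted by firmware */
	true,  true,  true,  true,	/* units 4-7: still unaccepted     */
};

static void toy_accept(unsigned long first, unsigned long end)
{
	end += 1;			/* guard unit, as in the patch */
	if (end > NUNITS)		/* clamp, as in the patch      */
		end = NUNITS;

	for (unsigned long i = first; i < end; i++) {
		if (!unaccepted_bit[i])
			continue;	/* already accepted: nothing to do */
		printf("accepting unit %lu\n", i);
		unaccepted_bit[i] = false;
	}
}

int main(void)
{
	toy_accept(2, 4);	/* guard is unit 4: accepted ahead of time    */
	toy_accept(1, 3);	/* guard is unit 3: already clear, no-op      */
	toy_accept(6, 8);	/* guard would be unit 8: clamped, no overrun */
	return 0;
}
```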
On 5/16/23 11:33, Kirill A. Shutemov wrote:
> For context: there is a way to configure the TDX environment to trigger a
> #VE on such accesses, and it is the default. But Linux requires such #VEs
> to be disabled because they open an attack vector from the host to the
> guest: the host can pull any private page from under the kernel at any
> point and trigger such a #VE. If it happens at just the right time in the
> syscall gap or NMI entry code, it can be exploitable.

I'm kinda uncomfortable with saying it's exploitable. It really boils down
to not wanting to deal with managing a new IST exception.

While the NMI IST implementation is about as good as we can get it, I
believe there are still holes in it (even if we consider only how it
interacts with #MC). The more IST users we add, the more holes there are.
You add the fact that an actual adversary can induce the exceptions
instead of (rare and mostly random) radiation that causes #MC, and it
makes me want to either curl up in a little ball or pursue a new career.

So, exploitable? Dunno. Do I want to touch a #VE/IST implementation? No
way, not with a 10 foot pole.
On 5/13/23 17:04, Kirill A. Shutemov wrote:
> load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> The unwanted loads are typically harmless. But, they might be made to
> totally unrelated or even unmapped memory. load_unaligned_zeropad()
> relies on exception fixup (#PF, #GP and now #VE) to recover from these
> unwanted loads.
>
> But, this approach does not work for unaccepted memory. For TDX, a load
> from unaccepted memory will not lead to a recoverable exception within
> the guest. The guest will exit to the VMM where the only recourse is to
> terminate the guest.
>
> There are two parts to fix this issue and comprehensively avoid access
> to unaccepted memory. Together these ensure that an extra "guard" page
> is accepted in addition to the memory that needs to be used.
>
> 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
>    checks up to end+unit_size if 'end' is aligned on a unit_size
>    boundary.
> 2. Implicitly extend accept_memory(start, end) to end+unit_size if 'end'
>    is aligned on a unit_size boundary.
>
> Side note: This leads to something strange. Pages which were accepted
>            at boot, marked by the firmware as accepted and will never
>            _need_ to be accepted might be on unaccepted_pages list
>            This is a cue to ensure that the next page is accepted
>            before 'page' can be used.
>
> This is an actual, real-world problem which was discovered during TDX
> testing.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
```diff
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index bb91c41f76fb..3d1ca60916dd 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -37,6 +37,34 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 	start -= unaccepted->phys_base;
 	end -= unaccepted->phys_base;
 
+	/*
+	 * load_unaligned_zeropad() can lead to unwanted loads across page
+	 * boundaries. The unwanted loads are typically harmless. But, they
+	 * might be made to totally unrelated or even unmapped memory.
+	 * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
+	 * #VE) to recover from these unwanted loads.
+	 *
+	 * But, this approach does not work for unaccepted memory. For TDX, a
+	 * load from unaccepted memory will not lead to a recoverable exception
+	 * within the guest. The guest will exit to the VMM where the only
+	 * recourse is to terminate the guest.
+	 *
+	 * There are two parts to fix this issue and comprehensively avoid
+	 * access to unaccepted memory. Together these ensure that an extra
+	 * "guard" page is accepted in addition to the memory that needs to be
+	 * used:
+	 *
+	 * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
+	 *    checks up to end+unit_size if 'end' is aligned on a unit_size
+	 *    boundary.
+	 *
+	 * 2. Implicitly extend accept_memory(start, end) to end+unit_size if
+	 *    'end' is aligned on a unit_size boundary. (immediately following
+	 *    this comment)
+	 */
+	if (!(end % unit_size))
+		end += unit_size;
+
 	/* Make sure not to overrun the bitmap */
 	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
 		end = unaccepted->size * unit_size * BITS_PER_BYTE;
@@ -84,6 +112,13 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
 	start -= unaccepted->phys_base;
 	end -= unaccepted->phys_base;
 
+	/*
+	 * Also consider the unaccepted state of the *next* page. See fix #1 in
+	 * the comment on load_unaligned_zeropad() in accept_memory().
+	 */
+	if (!(end % unit_size))
+		end += unit_size;
+
 	/* Make sure not to overrun the bitmap */
 	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
 		end = unaccepted->size * unit_size * BITS_PER_BYTE;
```
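For context, a hypothetical caller sketch (not part of the patch): the prototypes are declared locally here and are assumed to match the definitions in unaccepted_memory.c above. It shows the intended usage pattern in which the guard-unit extension is invisible to callers.

```c
/*
 * Hypothetical caller (illustrative only): code about to touch the
 * physical range [phys, phys + size) checks and accepts it first.  Only
 * the two helpers come from this series; the caller itself is made up.
 */
#include <linux/types.h>

bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);
void accept_memory(phys_addr_t start, phys_addr_t end);

static void touch_physical_range(phys_addr_t phys, unsigned long size)
{
	phys_addr_t end = phys + size;

	/*
	 * If 'end' lands on a unit_size boundary, both helpers implicitly
	 * look one unit further, so a load_unaligned_zeropad() overshoot
	 * just past 'end' cannot hit unaccepted memory afterwards.
	 */
	if (range_contains_unaccepted_memory(phys, end))
		accept_memory(phys, end);
}
```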