[PATCHv14,5/9] efi: Add unaccepted memory support

Message ID 20230606142637.5171-6-kirill.shutemov@linux.intel.com
State Accepted
Commit 2053bc57f36763febced0b5cd91821698bcf6b3d
Series mm, x86/cc, efi: Implement support for unaccepted memory

Commit Message

Kirill A. Shutemov June 6, 2023, 2:26 p.m. UTC
efi_config_parse_tables() reserves the memory that holds the unaccepted
memory configuration table so that it won't be reused by the page
allocator.

Core-mm requires a few helpers to support unaccepted memory:

 - accept_memory() checks the range of addresses against the bitmap and
   accepts memory if needed.

 - range_contains_unaccepted_memory() checks if anything within the
   range requires acceptance.
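
As a minimal sketch of the contract (a hypothetical caller; the helper
name ensure_accepted() is purely illustrative and not part of this patch):

	/* Hypothetical caller: make sure [phys, phys + size) is accepted. */
	static void ensure_accepted(phys_addr_t phys, unsigned long size)
	{
		if (range_contains_unaccepted_memory(phys, phys + size))
			accept_memory(phys, phys + size);
	}

accept_memory() already consults the bitmap itself; the separate predicate
lets callers such as the page allocator cheaply test a range without
accepting it.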

Architectural code has to provide efi_get_unaccepted_table(), which
returns a pointer to the unaccepted memory configuration table.

arch_accept_memory() handles the arch-specific part of memory acceptance.
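
For reference, the x86 arch_accept_memory() (wired up by later patches in
this series, with SEV-SNP support added separately) looks roughly like the
sketch below; treat the exact guards and helpers as illustrative of that
code rather than as part of this patch:

	static inline void arch_accept_memory(phys_addr_t start, phys_addr_t end)
	{
		/* Platform-specific memory-acceptance call goes here */
		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
			if (!tdx_accept_memory(start, end))
				panic("TDX: Failed to accept memory\n");
		} else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
			snp_accept_memory(start, end);
		} else {
			panic("Cannot accept memory: unknown platform\n");
		}
	}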

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/platform/efi/efi.c              |   3 +
 drivers/firmware/efi/Makefile            |   1 +
 drivers/firmware/efi/efi.c               |  25 +++++
 drivers/firmware/efi/unaccepted_memory.c | 112 +++++++++++++++++++++++
 include/linux/efi.h                      |   1 +
 5 files changed, 142 insertions(+)
 create mode 100644 drivers/firmware/efi/unaccepted_memory.c

Comments

Mel Gorman July 3, 2023, 1:25 p.m. UTC | #1
On Tue, Jun 06, 2023 at 05:26:33PM +0300, Kirill A. Shutemov wrote:
> efi_config_parse_tables() reserves the memory that holds the unaccepted
> memory configuration table so that it won't be reused by the page
> allocator.
> 
> Core-mm requires a few helpers to support unaccepted memory:
> 
>  - accept_memory() checks the range of addresses against the bitmap and
>    accepts memory if needed.
> 
>  - range_contains_unaccepted_memory() checks if anything within the
>    range requires acceptance.
> 
> Architectural code has to provide efi_get_unaccepted_table(), which
> returns a pointer to the unaccepted memory configuration table.
> 
> arch_accept_memory() handles the arch-specific part of memory acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>

By and large, this looks ok from the page allocator perspective as the
checks for unaccepted are mostly after watermark checks. However, if you
look in the initial fast path, you'll see this

        /* 
         * Forbid the first pass from falling back to types that fragment
         * memory until all local zones are considered.
         */     
        alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);

While checking watermarks should be fine from a functional perspective and
the fast paths are unaffected, there is a risk of premature fragmentation
until all memory has been accepted. Meeting watermarks does not necessarily
mean that fragmentation is avoided as pageblocks can get mixed while still
meeting watermarks. This will be tricky to detect in most cases so it may
be worth considering augmenting the watermark checks with a check in this
block for unaccepted memory

        /*
         * It's possible on a UMA machine to get through all zones that are
         * fragmented. If avoiding fragmentation, reset and try again.
         */
        if (no_fallback) {
                alloc_flags &= ~ALLOC_NOFRAGMENT;
                goto retry;
        }

I think the watermark checks will still be needed because, hypothetically,
if 100% of allocations were MIGRATE_UNMOVABLE then there would never be a
request that fragments memory, and so memory would never be accepted.
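
Something like the below, perhaps (untested, purely to illustrate the idea;
has_unaccepted_memory() and try_to_accept_memory() are the helpers this
series adds to mm/page_alloc.c):

        if (no_fallback) {
                /*
                 * Accept more memory, if available, before falling back
                 * to mixing pageblocks.
                 */
                if (has_unaccepted_memory() &&
                    try_to_accept_memory(ac->preferred_zoneref->zone, order))
                        goto retry;
                alloc_flags &= ~ALLOC_NOFRAGMENT;
                goto retry;
        }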

It's not a blocker to the series, simply a suggestion for follow-on
work.
Kirill A. Shutemov July 4, 2023, 2:37 p.m. UTC | #2
On Mon, Jul 03, 2023 at 02:25:18PM +0100, Mel Gorman wrote:
> On Tue, Jun 06, 2023 at 05:26:33PM +0300, Kirill A. Shutemov wrote:
> > efi_config_parse_tables() reserves the memory that holds the unaccepted
> > memory configuration table so that it won't be reused by the page
> > allocator.
> > 
> > Core-mm requires a few helpers to support unaccepted memory:
> > 
> >  - accept_memory() checks the range of addresses against the bitmap and
> >    accepts memory if needed.
> > 
> >  - range_contains_unaccepted_memory() checks if anything within the
> >    range requires acceptance.
> > 
> > Architectural code has to provide efi_get_unaccepted_table(), which
> > returns a pointer to the unaccepted memory configuration table.
> > 
> > arch_accept_memory() handles the arch-specific part of memory acceptance.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
> > Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> 
> By and large, this looks ok from the page allocator perspective as the
> checks for unaccepted are mostly after watermark checks. However, if you
> look in the initial fast path, you'll see this
> 
>         /* 
>          * Forbid the first pass from falling back to types that fragment
>          * memory until all local zones are considered.
>          */     
>         alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);
> 
> While checking watermarks should be fine from a functional perspective and
> the fast paths are unaffected, there is a risk of premature fragmentation
> until all memory has been accepted. Meeting watermarks does not necessarily
> mean that fragmentation is avoided as pageblocks can get mixed while still
> meeting watermarks.

Could you elaborate on this scenario?

The current code checks the watermark; if it is met, it tries rmqueue().

If rmqueue() fails anyway, it tries to accept more pages and retries the
zone if that succeeds.

I'm not sure how we can get to the 'if (no_fallback) {' case with any
unaccepted memory in the allowed zones.

I see that there are preferred_zoneref and spread_dirty_pages cases, but
unaccepted memory seems to change nothing for them.

Hm?
Mel Gorman July 12, 2023, 9:18 a.m. UTC | #3
On Tue, Jul 04, 2023 at 05:37:40PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jul 03, 2023 at 02:25:18PM +0100, Mel Gorman wrote:
> > On Tue, Jun 06, 2023 at 05:26:33PM +0300, Kirill A. Shutemov wrote:
> > > efi_config_parse_tables() reserves the memory that holds the unaccepted
> > > memory configuration table so that it won't be reused by the page
> > > allocator.
> > > 
> > > Core-mm requires a few helpers to support unaccepted memory:
> > > 
> > >  - accept_memory() checks the range of addresses against the bitmap and
> > >    accepts memory if needed.
> > > 
> > >  - range_contains_unaccepted_memory() checks if anything within the
> > >    range requires acceptance.
> > > 
> > > Architectural code has to provide efi_get_unaccepted_table(), which
> > > returns a pointer to the unaccepted memory configuration table.
> > > 
> > > arch_accept_memory() handles the arch-specific part of memory acceptance.
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
> > > Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> > 
> > By and large, this looks ok from the page allocator perspective as the
> > checks for unaccepted are mostly after watermark checks. However, if you
> > look in the initial fast path, you'll see this
> > 
> >         /* 
> >          * Forbid the first pass from falling back to types that fragment
> >          * memory until all local zones are considered.
> >          */     
> >         alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);
> > 
> > While checking watermarks should be fine from a functional perspective and
> > the fast paths are unaffected, there is a risk of premature fragmentation
> > until all memory has been accepted. Meeting watermarks does not necessarily
> > mean that fragmentation is avoided as pageblocks can get mixed while still
> > meeting watermarks.
> 
> Could you elaborate on this scenario?
> 
> The current code checks the watermark; if it is met, it tries rmqueue().
> 
> If rmqueue() fails anyway, it tries to accept more pages and retries the
> zone if that succeeds.
> 
> I'm not sure how we can get to the 'if (no_fallback) {' case with any
> unaccepted memory in the allowed zones.
> 

Let's take an extreme example and assume that the low watermark is lower
than 2MB (one pageblock). Just before the watermark is reached (free
count between 1MB and 2MB), it is unlikely that all free pages are within
pageblocks of the same migratetype (e.g. MIGRATE_MOVABLE). If there is an
allocation near the watermark of a different type (e.g. MIGRATE_UNMOVABLE)
then the page allocation could fall back to a different pageblock and now
it is mixed.  It's a condition that is only obvious if you are explicitly
checking for it via tracepoints.  This can happen in the normal case, but
unaccepted memory makes it worse because the "pageblock mixing" could have
been avoided if the "no_fallback" case accepted at least one new pageblock
instead of mixing pageblocks.

That is an extreme example but the same logic applies when the free
count is at or near MIGRATE_TYPES*pageblock_nr_pages as it is not
guaranteed that the pageblocks with free pages are a migratetype that
matches the allocation request.
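
For scale: with 4K pages on x86-64 a pageblock is 2MB (pageblock_nr_pages
= 512), so with roughly half a dozen migratetypes that point is only on
the order of 12MB of free memory (rough, config-dependent figures).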

Hence, it may be more robust from a fragmentation perspective if
ALLOC_NOFRAGMENT requests accept memory when it is available and retry,
rather than clearing ALLOC_NOFRAGMENT and mixing pageblocks before the
watermarks are reached.

> I see that there are preferred_zoneref and spread_dirty_pages cases, but
> unaccepted memory seems to change nothing for them.
> 

preferred_zoneref is about premature zone exhaustion and
spread_dirty_pages is about avoiding premature stalls on a node/zone due
to an imbalance in the number of pages waiting for writeback to
complete. There is an argument to be made that they should also accept
memory but it's less clear how much of a problem this is. Both are very
obvious when they "fail" and likely are covered by the existing
watermark checks. Premature pageblock mixing is more subtle as the final
impact (root cause of a premature THP allocation failure) is harder to
detect.
Michael Roth Oct. 10, 2023, 9:05 p.m. UTC | #4
On Tue, Jun 06, 2023 at 05:26:33PM +0300, Kirill A. Shutemov wrote:
> efi_config_parse_tables() reserves the memory that holds the unaccepted
> memory configuration table so that it won't be reused by the page
> allocator.
> 
> Core-mm requires a few helpers to support unaccepted memory:
> 
>  - accept_memory() checks the range of addresses against the bitmap and
>    accepts memory if needed.
> 
>  - range_contains_unaccepted_memory() checks if anything within the
>    range requires acceptance.
> 
> Architectural code has to provide efi_get_unaccepted_table(), which
> returns a pointer to the unaccepted memory configuration table.
> 
> arch_accept_memory() handles the arch-specific part of memory acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> ---
>  arch/x86/platform/efi/efi.c              |   3 +
>  drivers/firmware/efi/Makefile            |   1 +
>  drivers/firmware/efi/efi.c               |  25 +++++
>  drivers/firmware/efi/unaccepted_memory.c | 112 +++++++++++++++++++++++
>  include/linux/efi.h                      |   1 +
>  5 files changed, 142 insertions(+)
>  create mode 100644 drivers/firmware/efi/unaccepted_memory.c
> 
> diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> new file mode 100644
> index 000000000000..08a9a843550a
> --- /dev/null
> +++ b/drivers/firmware/efi/unaccepted_memory.c
> @@ -0,0 +1,112 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/efi.h>
> +#include <linux/memblock.h>
> +#include <linux/spinlock.h>
> +#include <asm/unaccepted_memory.h>
> +
> +/* Protects unaccepted memory bitmap */
> +static DEFINE_SPINLOCK(unaccepted_memory_lock);
> +
> +/*
> + * accept_memory() -- Consult bitmap and accept the memory if needed.
> + *
> + * Only memory that is explicitly marked as unaccepted in the bitmap requires
> + * an action. All the remaining memory is implicitly accepted and doesn't need
> + * acceptance.
> + *
> + * No need to accept:
> + *  - anything if the system has no unaccepted table;
> + *  - memory that is below phys_base;
> + *  - memory that is above the memory addressable by the bitmap;
> + */
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	struct efi_unaccepted_memory *unaccepted;
> +	unsigned long range_start, range_end;
> +	unsigned long flags;
> +	u64 unit_size;
> +
> +	unaccepted = efi_get_unaccepted_table();
> +	if (!unaccepted)
> +		return;
> +
> +	unit_size = unaccepted->unit_size;
> +
> +	/*
> +	 * Only care for the part of the range that is represented
> +	 * in the bitmap.
> +	 */
> +	if (start < unaccepted->phys_base)
> +		start = unaccepted->phys_base;
> +	if (end < unaccepted->phys_base)
> +		return;
> +
> +	/* Translate to offsets from the beginning of the bitmap */
> +	start -= unaccepted->phys_base;
> +	end -= unaccepted->phys_base;
> +
> +	/* Make sure not to overrun the bitmap */
> +	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
> +		end = unaccepted->size * unit_size * BITS_PER_BYTE;
> +
> +	range_start = start / unit_size;
> +
> +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
> +				   DIV_ROUND_UP(end, unit_size)) {
> +		unsigned long phys_start, phys_end;
> +		unsigned long len = range_end - range_start;
> +
> +		phys_start = range_start * unit_size + unaccepted->phys_base;
> +		phys_end = range_end * unit_size + unaccepted->phys_base;
> +
> +		arch_accept_memory(phys_start, phys_end);
> +		bitmap_clear(unaccepted->bitmap, range_start, len);
> +	}
> +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +}

While testing SNP guests running today's tip/master (ef19bc9dddc3) I ran
into what seems to be fairly significant lock contention due to the
unaccepted_memory_lock spinlock above, which results in a constant stream
of soft-lockups until the workload gets all its memory accepted/faulted
in if the guest has around 16+ vCPUs.

I've included the guest dmesg traces I was seeing below.

In this case I was running a 32 vCPU guest with 200GB of memory running on
a 256 thread EPYC (Milan) system, and can trigger the above situation fairly
reliably by running the following workload in a freshly-booted guest:

  stress --vm 32 --vm-bytes 5G --vm-keep

Scaling up the number of stress threads and vCPUs should make it easier
to reproduce.

Other than unresponsiveness/lockup messages until the memory is accepted,
the guest seems to continue running fine, but for large guests where
unaccepted memory is more likely to be useful, it seems like it could be
an issue, especially when considering 100+ vCPU guests.

-Mike

[  105.753073] watchdog: BUG: soft lockup - CPU#3 stuck for 27s! [stress:1590]
[  105.756109] Modules linked in: btrfs(E) blake2b_generic(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) raid1(E) raid0(E) multipath(E) linear(E) crc32_pclmul(E) virtio_net(E) i2c_i801(E) net_failover(E) i2c_smbus(E) psmouse(E) failover(E) virtio_scsi(E) lpc_ich(E)
[  105.771644] irq event stamp: 392274
[  105.773852] hardirqs last  enabled at (392273): [<ffffffff9fbddf9a>] _raw_spin_unlock_irqrestore+0x5a/0x70
[  105.780058] hardirqs last disabled at (392274): [<ffffffff9fbddca9>] _raw_spin_lock_irqsave+0x69/0x70
[  105.785486] softirqs last  enabled at (392158): [<ffffffff9fbdf663>] __do_softirq+0x2a3/0x357
[  105.790627] softirqs last disabled at (392149): [<ffffffff9ecd0f5c>] irq_exit_rcu+0x8c/0xb0
[  105.795972] CPU: 3 PID: 1590 Comm: stress Tainted: G            EL     6.6.0-rc5-snp-guest-v10n+ #1
[  105.802416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[  105.807040] RIP: 0010:_raw_spin_unlock_irqrestore+0x45/0x70
[  105.810327] Code: b1 b1 18 ff 4c 89 e7 e8 79 e7 18 ff 81 e3 00 02 00 00 75 26 9c 58 0f 1f 40 00 f6 c4 02 75 22 48 85 db 74 06 fb 0f 1f 44 00 00 <65> ff 0d fc 5d 46 60 5b 41 5c 5d c3 cc cc cc cc e8 c6 9f 28 ff eb
[  105.818749] RSP: 0000:ffffc9000585faa8 EFLAGS: 00000206
[  105.821155] RAX: 0000000000000046 RBX: 0000000000000200 RCX: 00000000000028ce
[  105.825258] RDX: ffffffff9f8c7830 RSI: ffffffff9fbddf9a RDI: ffffffff9fbddf9a
[  105.828287] RBP: ffffc9000585fab8 R08: 0000000000000000 R09: 0000000000000000
[  105.831322] R10: 000000ffffffffff R11: 0000000000000000 R12: ffffffffa091a160
[  105.834713] R13: 00000000000028cd R14: ffff88807c459030 R15: ffff88807c459018
[  105.838230] FS:  00007fa4a8ee8740(0000) GS:ffff88b1aff80000(0000) knlGS:0000000000000000
[  105.841677] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  105.844115] CR2: 00007fa3719ec010 CR3: 000800010f7c2003 CR4: 0000000000770ef0
[  105.847160] PKRU: 55555554
[  105.848490] Call Trace:
[  105.849634]  <IRQ>
[  105.850517]  ? show_regs+0x68/0x70
[  105.851986]  ? watchdog_timer_fn+0x1dc/0x240
[  105.853868]  ? __pfx_watchdog_timer_fn+0x10/0x10
[  105.855827]  ? __hrtimer_run_queues+0x1c3/0x340
[  105.857821]  ? hrtimer_interrupt+0x109/0x240
[  105.859673]  ? __sysvec_apic_timer_interrupt+0x67/0x170
[  105.861941]  ? sysvec_apic_timer_interrupt+0x7b/0x90
[  105.864048]  </IRQ>
[  105.864981]  <TASK>
[  105.865901]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[  105.868110]  ? accept_memory+0x150/0x170
[  105.869795]  ? _raw_spin_unlock_irqrestore+0x5a/0x70
[  105.871885]  ? _raw_spin_unlock_irqrestore+0x5a/0x70
[  105.873966]  ? _raw_spin_unlock_irqrestore+0x45/0x70
[  105.876047]  accept_memory+0x150/0x170
[  105.877665]  try_to_accept_memory+0x134/0x1b0
[  105.879532]  get_page_from_freelist+0xa3e/0x1370
[  105.881479]  ? lock_acquire+0xd8/0x2b0
[  105.883045]  __alloc_pages+0x1b7/0x390
[  105.884619]  __folio_alloc+0x1b/0x50
[  105.886151]  ? policy_node+0x57/0x70
[  105.887673]  vma_alloc_folio+0xa6/0x360
[  105.889325]  do_pte_missing+0x1a5/0x8b0
[  105.890948]  __handle_mm_fault+0x75e/0xda0
[  105.892693]  handle_mm_fault+0xe1/0x2d0
[  105.894304]  do_user_addr_fault+0x1ce/0xa40
[  105.896050]  ? exit_to_user_mode_prepare+0xa4/0x230
[  105.898113]  exc_page_fault+0x84/0x200
[  105.899693]  asm_exc_page_fault+0x27/0x30
[  105.901405] RIP: 0033:0x55ebd4602cf0
[  105.902926] Code: 8b 54 24 0c 31 c0 85 d2 0f 94 c0 89 04 24 41 83 fd 02 0f 8f 3e 02 00 00 48 85 ed 48 89 d8 7e 1b 66 2e 0f 1f 84 00 00 00 00 00 <c6> 00 5a 4c 01 f8 48 89 c2 48 29 da 48 39 d5 7f ef 49 83 fc 00 0f
[  105.910671] RSP: 002b:00007ffebfede470 EFLAGS: 00010206
[  105.912897] RAX: 00007fa3719ec010 RBX: 00007fa3683ff010 RCX: 00007fa3683ff010
[  105.915859] RDX: 00000000095ed000 RSI: 0000000140001000 RDI: 0000000000000000
[  105.918853] RBP: 0000000140000000 R08: 00000000ffffffff R09: 0000000000000000
[  105.921836] R10: 0000000000000022 R11: 0000000000000246 R12: ffffffffffffffff
[  105.924810] R13: 0000000000000002 R14: fffffffffffff000 R15: 0000000000001000
[  105.927771]  </TASK>
[  110.367774] watchdog: BUG: soft lockup - CPU#15 stuck for 32s! [stress:1594]
[  110.369415] watchdog: BUG: soft lockup - CPU#13 stuck for 36s! [stress:1602]
Kirill A. Shutemov Oct. 13, 2023, 12:33 p.m. UTC | #5
On Tue, Oct 10, 2023 at 04:05:18PM -0500, Michael Roth wrote:
> On Tue, Jun 06, 2023 at 05:26:33PM +0300, Kirill A. Shutemov wrote:
> > efi_config_parse_tables() reserves the memory that holds the unaccepted
> > memory configuration table so that it won't be reused by the page
> > allocator.
> > 
> > Core-mm requires a few helpers to support unaccepted memory:
> > 
> >  - accept_memory() checks the range of addresses against the bitmap and
> >    accepts memory if needed.
> > 
> >  - range_contains_unaccepted_memory() checks if anything within the
> >    range requires acceptance.
> > 
> > Architectural code has to provide efi_get_unaccepted_table(), which
> > returns a pointer to the unaccepted memory configuration table.
> > 
> > arch_accept_memory() handles the arch-specific part of memory acceptance.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
> > Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> > ---
> >  arch/x86/platform/efi/efi.c              |   3 +
> >  drivers/firmware/efi/Makefile            |   1 +
> >  drivers/firmware/efi/efi.c               |  25 +++++
> >  drivers/firmware/efi/unaccepted_memory.c | 112 +++++++++++++++++++++++
> >  include/linux/efi.h                      |   1 +
> >  5 files changed, 142 insertions(+)
> >  create mode 100644 drivers/firmware/efi/unaccepted_memory.c
> > 
> > diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> > new file mode 100644
> > index 000000000000..08a9a843550a
> > --- /dev/null
> > +++ b/drivers/firmware/efi/unaccepted_memory.c
> > @@ -0,0 +1,112 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +
> > +#include <linux/efi.h>
> > +#include <linux/memblock.h>
> > +#include <linux/spinlock.h>
> > +#include <asm/unaccepted_memory.h>
> > +
> > +/* Protects unaccepted memory bitmap */
> > +static DEFINE_SPINLOCK(unaccepted_memory_lock);
> > +
> > +/*
> > + * accept_memory() -- Consult bitmap and accept the memory if needed.
> > + *
> > + * Only memory that is explicitly marked as unaccepted in the bitmap requires
> > + * an action. All the remaining memory is implicitly accepted and doesn't need
> > + * acceptance.
> > + *
> > + * No need to accept:
> > + *  - anything if the system has no unaccepted table;
> > + *  - memory that is below phys_base;
> > + *  - memory that is above the memory addressable by the bitmap;
> > + */
> > +void accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > +	struct efi_unaccepted_memory *unaccepted;
> > +	unsigned long range_start, range_end;
> > +	unsigned long flags;
> > +	u64 unit_size;
> > +
> > +	unaccepted = efi_get_unaccepted_table();
> > +	if (!unaccepted)
> > +		return;
> > +
> > +	unit_size = unaccepted->unit_size;
> > +
> > +	/*
> > +	 * Only care for the part of the range that is represented
> > +	 * in the bitmap.
> > +	 */
> > +	if (start < unaccepted->phys_base)
> > +		start = unaccepted->phys_base;
> > +	if (end < unaccepted->phys_base)
> > +		return;
> > +
> > +	/* Translate to offsets from the beginning of the bitmap */
> > +	start -= unaccepted->phys_base;
> > +	end -= unaccepted->phys_base;
> > +
> > +	/* Make sure not to overrun the bitmap */
> > +	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
> > +		end = unaccepted->size * unit_size * BITS_PER_BYTE;
> > +
> > +	range_start = start / unit_size;
> > +
> > +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> > +	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
> > +				   DIV_ROUND_UP(end, unit_size)) {
> > +		unsigned long phys_start, phys_end;
> > +		unsigned long len = range_end - range_start;
> > +
> > +		phys_start = range_start * unit_size + unaccepted->phys_base;
> > +		phys_end = range_end * unit_size + unaccepted->phys_base;
> > +
> > +		arch_accept_memory(phys_start, phys_end);
> > +		bitmap_clear(unaccepted->bitmap, range_start, len);
> > +	}
> > +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> > +}
> 
> While testing SNP guests running today's tip/master (ef19bc9dddc3) I ran
> into what seems to be fairly significant lock contention due to the
> unaccepted_memory_lock spinlock above, which results in a constant stream
> of soft-lockups until the workload gets all its memory accepted/faulted
> in if the guest has around 16+ vCPUs.
> 
> I've included the guest dmesg traces I was seeing below.
> 
> In this case I was running a 32 vCPU guest with 200GB of memory running on
> a 256 thread EPYC (Milan) system, and can trigger the above situation fairly
> reliably by running the following workload in a freshly-booted guest:
> 
>   stress --vm 32 --vm-bytes 5G --vm-keep
> 
> Scaling up the number of stress threads and vCPUs should make it easier
> to reproduce.
> 
> Other than unresponsiveness/lockup messages until the memory is accepted,
> the guest seems to continue running fine, but for large guests where
> unaccepted memory is more likely to be useful, it seems like it could be
> an issue, especially when considering 100+ vCPU guests.

Okay, sorry for the delay. It took time to reproduce it with TDX.

I will look at what can be done.
Kirill A. Shutemov Oct. 13, 2023, 4:22 p.m. UTC | #6
On Fri, Oct 13, 2023 at 03:33:58PM +0300, Kirill A. Shutemov wrote:
> > While testing SNP guests running today's tip/master (ef19bc9dddc3) I ran
> > into what seems to be fairly significant lock contention due to the
> > unaccepted_memory_lock spinlock above, which results in a constant stream
> > of soft-lockups until the workload gets all its memory accepted/faulted
> > in if the guest has around 16+ vCPUs.
> > 
> > I've included the guest dmesg traces I was seeing below.
> > 
> > In this case I was running a 32 vCPU guest with 200GB of memory running on
> > a 256 thread EPYC (Milan) system, and can trigger the above situation fairly
> > reliably by running the following workload in a freshly-booted guest:
> > 
> >   stress --vm 32 --vm-bytes 5G --vm-keep
> > 
> > Scaling up the number of stress threads and vCPUs should make it easier
> > to reproduce.
> > 
> > Other than unresponsiveness/lockup messages until the memory is accepted,
> > the guest seems to continue running fine, but for large guests where
> > unaccepted memory is more likely to be useful, it seems like it could be
> > an issue, especially when considering 100+ vCPU guests.
> 
> Okay, sorry for the delay. It took time to reproduce it with TDX.
> 
> I will look at what can be done.

Could you check if the patch below helps?

diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index 853f7dc3c21d..591da3f368fa 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -8,6 +8,14 @@
 /* Protects unaccepted memory bitmap */
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
 
+struct accept_range {
+	struct list_head list;
+	unsigned long start;
+	unsigned long end;
+};
+
+static LIST_HEAD(accepting_list);
+
 /*
  * accept_memory() -- Consult bitmap and accept the memory if needed.
  *
@@ -24,6 +32,7 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 {
 	struct efi_unaccepted_memory *unaccepted;
 	unsigned long range_start, range_end;
+	struct accept_range range, *entry;
 	unsigned long flags;
 	u64 unit_size;
 
@@ -80,7 +89,25 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 
 	range_start = start / unit_size;
 
+	range.start = start;
+	range.end = end;
+retry:
 	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+
+	list_for_each_entry(entry, &accepting_list, list) {
+		if (entry->end < start)
+			continue;
+		if (entry->start > end)
+			continue;
+		spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+		/* Somebody else accepting the range */
+		cpu_relax();
+		goto retry;
+	}
+
+	list_add(&range.list, &accepting_list);
+
 	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
 				   DIV_ROUND_UP(end, unit_size)) {
 		unsigned long phys_start, phys_end;
@@ -89,9 +116,15 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 		phys_start = range_start * unit_size + unaccepted->phys_base;
 		phys_end = range_end * unit_size + unaccepted->phys_base;
 
+		spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
 		arch_accept_memory(phys_start, phys_end);
+
+		spin_lock_irqsave(&unaccepted_memory_lock, flags);
 		bitmap_clear(unaccepted->bitmap, range_start, len);
 	}
+
+	list_del(&range.list);
 	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
 }
Vlastimil Babka Oct. 13, 2023, 4:44 p.m. UTC | #7
On 10/13/23 18:22, Kirill A. Shutemov wrote:
> On Fri, Oct 13, 2023 at 03:33:58PM +0300, Kirill A. Shutemov wrote:
>> > While testing SNP guests running today's tip/master (ef19bc9dddc3) I ran
>> > into what seems to be fairly significant lock contention due to the
>> > unaccepted_memory_lock spinlock above, which results in a constant stream
>> > of soft-lockups until the workload gets all its memory accepted/faulted
>> > in if the guest has around 16+ vCPUs.
>> > 
>> > I've included the guest dmesg traces I was seeing below.
>> > 
>> > In this case I was running a 32 vCPU guest with 200GB of memory running on
>> > a 256 thread EPYC (Milan) system, and can trigger the above situation fairly
>> > reliably by running the following workload in a freshly-booted guest:
>> > 
>> >   stress --vm 32 --vm-bytes 5G --vm-keep
>> > 
>> > Scaling up the number of stress threads and vCPUs should make it easier
>> > to reproduce.
>> > 
>> > Other than unresponsiveness/lockup messages until the memory is accepted,
>> > the guest seems to continue running fine, but for large guests where
>> > unaccepted memory is more likely to be useful, it seems like it could be
>> > an issue, especially when considering 100+ vCPU guests.
>> 
>> Okay, sorry for the delay. It took time to reproduce it with TDX.
>> 
>> I will look at what can be done.
> 
> Could you check if the patch below helps?
> 
> diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> index 853f7dc3c21d..591da3f368fa 100644
> --- a/drivers/firmware/efi/unaccepted_memory.c
> +++ b/drivers/firmware/efi/unaccepted_memory.c
> @@ -8,6 +8,14 @@
>  /* Protects unaccepted memory bitmap */
>  static DEFINE_SPINLOCK(unaccepted_memory_lock);
>  
> +struct accept_range {
> +	struct list_head list;
> +	unsigned long start;
> +	unsigned long end;
> +};
> +
> +static LIST_HEAD(accepting_list);
> +
>  /*
>   * accept_memory() -- Consult bitmap and accept the memory if needed.
>   *
> @@ -24,6 +32,7 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>  {
>  	struct efi_unaccepted_memory *unaccepted;
>  	unsigned long range_start, range_end;
> +	struct accept_range range, *entry;
>  	unsigned long flags;
>  	u64 unit_size;
>  
> @@ -80,7 +89,25 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>  
>  	range_start = start / unit_size;
>  
> +	range.start = start;
> +	range.end = end;
> +retry:
>  	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +
> +	list_for_each_entry(entry, &accepting_list, list) {
> +		if (entry->end < start)
> +			continue;
> +		if (entry->start > end)
> +			continue;
> +		spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +
> +		/* Somebody else accepting the range */
> +		cpu_relax();

Should this be rather cond_resched()? I think cpu_relax() isn't enough to
prevent soft lockups.
Although IIUC hitting this should be rare, as the contending tasks will pick
different ranges via try_to_accept_memory_one(), right?

> +		goto retry;
> +	}
> +
> +	list_add(&range.list, &accepting_list);
> +
>  	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
>  				   DIV_ROUND_UP(end, unit_size)) {
>  		unsigned long phys_start, phys_end;
> @@ -89,9 +116,15 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>  		phys_start = range_start * unit_size + unaccepted->phys_base;
>  		phys_end = range_end * unit_size + unaccepted->phys_base;
>  
> +		spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +
>  		arch_accept_memory(phys_start, phys_end);
> +
> +		spin_lock_irqsave(&unaccepted_memory_lock, flags);
>  		bitmap_clear(unaccepted->bitmap, range_start, len);
>  	}
> +
> +	list_del(&range.list);
>  	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
>  }
>
Kirill A. Shutemov Oct. 13, 2023, 5:27 p.m. UTC | #8
On Fri, Oct 13, 2023 at 06:44:45PM +0200, Vlastimil Babka wrote:
> On 10/13/23 18:22, Kirill A. Shutemov wrote:
> > On Fri, Oct 13, 2023 at 03:33:58PM +0300, Kirill A. Shutemov wrote:
> >> > While testing SNP guests running today's tip/master (ef19bc9dddc3) I ran
> >> > into what seems to be fairly significant lock contention due to the
> >> > unaccepted_memory_lock spinlock above, which results in a constant stream
> >> > of soft-lockups until the workload gets all its memory accepted/faulted
> >> > in if the guest has around 16+ vCPUs.
> >> > 
> >> > I've included the guest dmesg traces I was seeing below.
> >> > 
> >> > In this case I was running a 32 vCPU guest with 200GB of memory running on
> >> > a 256 thread EPYC (Milan) system, and can trigger the above situation fairly
> >> > reliably by running the following workload in a freshly-booted guest:
> >> > 
> >> >   stress --vm 32 --vm-bytes 5G --vm-keep
> >> > 
> >> > Scaling up the number of stress threads and vCPUs should make it easier
> >> > to reproduce.
> >> > 
> >> > Other than unresponsiveness/lockup messages until the memory is accepted,
> >> > the guest seems to continue running fine, but for large guests where
> >> > unaccepted memory is more likely to be useful, it seems like it could be
> >> > an issue, especially when considering 100+ vCPU guests.
> >> 
> >> Okay, sorry for the delay. It took time to reproduce it with TDX.
> >> 
> >> I will look at what can be done.
> > 
> > Could you check if the patch below helps?
> > 
> > diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> > index 853f7dc3c21d..591da3f368fa 100644
> > --- a/drivers/firmware/efi/unaccepted_memory.c
> > +++ b/drivers/firmware/efi/unaccepted_memory.c
> > @@ -8,6 +8,14 @@
> >  /* Protects unaccepted memory bitmap */
> >  static DEFINE_SPINLOCK(unaccepted_memory_lock);
> >  
> > +struct accept_range {
> > +	struct list_head list;
> > +	unsigned long start;
> > +	unsigned long end;
> > +};
> > +
> > +static LIST_HEAD(accepting_list);
> > +
> >  /*
> >   * accept_memory() -- Consult bitmap and accept the memory if needed.
> >   *
> > @@ -24,6 +32,7 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
> >  {
> >  	struct efi_unaccepted_memory *unaccepted;
> >  	unsigned long range_start, range_end;
> > +	struct accept_range range, *entry;
> >  	unsigned long flags;
> >  	u64 unit_size;
> >  
> > @@ -80,7 +89,25 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
> >  
> >  	range_start = start / unit_size;
> >  
> > +	range.start = start;
> > +	range.end = end;
> > +retry:
> >  	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> > +
> > +	list_for_each_entry(entry, &accepting_list, list) {
> > +		if (entry->end < start)
> > +			continue;
> > +		if (entry->start > end)
> > +			continue;
> > +		spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> > +
> > +		/* Somebody else accepting the range */
> > +		cpu_relax();
> 
> Should this be rather cond_resched()? I think cpu_relax() isn't enough to
> prevent soft lockups.

Right. For some reason, I thought we cannot call cond_resched() from
atomic context (we sometimes get there from atomic context), but we can.

> Although IIUC hitting this should be rare, as the contending tasks will pick
> different ranges via try_to_accept_memory_one(), right?

Yes, it should be rare.

Generally, with the exception of memblock, we accept all memory in MAX_ORDER
chunks (4MB with 4K pages). As long as unit_size is not larger than a
MAX_ORDER chunk, the page allocator should never trigger the conflict, as
the caller owns the full range to accept.

I will test the idea with larger unit_size to see how it behaves.
Tom Lendacky Oct. 13, 2023, 5:45 p.m. UTC | #9
On 10/13/23 11:22, Kirill A. Shutemov wrote:
> On Fri, Oct 13, 2023 at 03:33:58PM +0300, Kirill A. Shutemov wrote:
>>> While testing SNP guests running today's tip/master (ef19bc9dddc3) I ran
>>> into what seems to be fairly significant lock contention due to the
>>> unaccepted_memory_lock spinlock above, which results in a constant stream
>>> of soft-lockups until the workload gets all its memory accepted/faulted
>>> in if the guest has around 16+ vCPUs.
>>>
>>> I've included the guest dmesg traces I was seeing below.
>>>
>>> In this case I was running a 32 vCPU guest with 200GB of memory running on
>>> a 256 thread EPYC (Milan) system, and can trigger the above situation fairly
>>> reliably by running the following workload in a freshly-booted guest:
>>>
>>>    stress --vm 32 --vm-bytes 5G --vm-keep
>>>
>>> Scaling up the number of stress threads and vCPUs should make it easier
>>> to reproduce.
>>>
>>> Other than unresponsiveness/lockup messages until the memory is accepted,
>>> the guest seems to continue running fine, but for large guests where
>>> unaccepted memory is more likely to be useful, it seems like it could be
>>> an issue, especially when considering 100+ vCPU guests.
>>
>> Okay, sorry for the delay. It took time to reproduce it with TDX.
>>
>> I will look at what can be done.
> 
> Could you check if the patch below helps?
> 
> diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> index 853f7dc3c21d..591da3f368fa 100644
> --- a/drivers/firmware/efi/unaccepted_memory.c
> +++ b/drivers/firmware/efi/unaccepted_memory.c
> @@ -8,6 +8,14 @@
>   /* Protects unaccepted memory bitmap */
>   static DEFINE_SPINLOCK(unaccepted_memory_lock);
>   
> +struct accept_range {
> +	struct list_head list;
> +	unsigned long start;
> +	unsigned long end;
> +};
> +
> +static LIST_HEAD(accepting_list);
> +
>   /*
>    * accept_memory() -- Consult bitmap and accept the memory if needed.
>    *
> @@ -24,6 +32,7 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>   {
>   	struct efi_unaccepted_memory *unaccepted;
>   	unsigned long range_start, range_end;
> +	struct accept_range range, *entry;
>   	unsigned long flags;
>   	u64 unit_size;
>   
> @@ -80,7 +89,25 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>   
>   	range_start = start / unit_size;
>   
> +	range.start = start;
> +	range.end = end;
> +retry:
>   	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +
> +	list_for_each_entry(entry, &accepting_list, list) {
> +		if (entry->end < start)
> +			continue;
> +		if (entry->start > end)

Should this be a >= check since start and end are page aligned values?

> +			continue;
> +		spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +
> +		/* Somebody else accepting the range */
> +		cpu_relax();
> +		goto retry;

Could you set some kind of flag here so that ...

> +	}
> +

... once you get here, that means that area was accepted and removed from 
the list, so I think you could just drop the lock and exit now, right? 
Because at that point the bitmap will have been updated and you wouldn't 
be accepting any memory anyway?

Thanks,
Tom

> +	list_add(&range.list, &accepting_list);
> +
>   	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
>   				   DIV_ROUND_UP(end, unit_size)) {
>   		unsigned long phys_start, phys_end;
> @@ -89,9 +116,15 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>   		phys_start = range_start * unit_size + unaccepted->phys_base;
>   		phys_end = range_end * unit_size + unaccepted->phys_base;
>   
> +		spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +
>   		arch_accept_memory(phys_start, phys_end);
> +
> +		spin_lock_irqsave(&unaccepted_memory_lock, flags);
>   		bitmap_clear(unaccepted->bitmap, range_start, len);
>   	}
> +
> +	list_del(&range.list);
>   	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
>   }
>
Kirill A. Shutemov Oct. 13, 2023, 7:53 p.m. UTC | #10
On Fri, Oct 13, 2023 at 12:45:20PM -0500, Tom Lendacky wrote:
> On 10/13/23 11:22, Kirill A. Shutemov wrote:
> > On Fri, Oct 13, 2023 at 03:33:58PM +0300, Kirill A. Shutemov wrote:
> > > > While testing SNP guests running today's tip/master (ef19bc9dddc3) I ran
> > > > into what seems to be fairly significant lock contention due to the
> > > > unaccepted_memory_lock spinlock above, which results in a constant stream
> > > > of soft-lockups until the workload gets all its memory accepted/faulted
> > > > in if the guest has around 16+ vCPUs.
> > > > 
> > > > I've included the guest dmesg traces I was seeing below.
> > > > 
> > > > In this case I was running a 32 vCPU guest with 200GB of memory running on
> > > > a 256 thread EPYC (Milan) system, and can trigger the above situation fairly
> > > > reliably by running the following workload in a freshly-booted guest:
> > > > 
> > > >    stress --vm 32 --vm-bytes 5G --vm-keep
> > > > 
> > > > Scaling up the number of stress threads and vCPUs should make it easier
> > > > to reproduce.
> > > > 
> > > > Other than unresponsiveness/lockup messages until the memory is accepted,
> > > > the guest seems to continue running fine, but for large guests where
> > > > unaccepted memory is more likely to be useful, it seems like it could be
> > > > an issue, especially when considering 100+ vCPU guests.
> > > 
> > > Okay, sorry for the delay. It took time to reproduce it with TDX.
> > > 
> > > I will look at what can be done.
> > 
> > Could you check if the patch below helps?
> > 
> > diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> > index 853f7dc3c21d..591da3f368fa 100644
> > --- a/drivers/firmware/efi/unaccepted_memory.c
> > +++ b/drivers/firmware/efi/unaccepted_memory.c
> > @@ -8,6 +8,14 @@
> >   /* Protects unaccepted memory bitmap */
> >   static DEFINE_SPINLOCK(unaccepted_memory_lock);
> > +struct accept_range {
> > +	struct list_head list;
> > +	unsigned long start;
> > +	unsigned long end;
> > +};
> > +
> > +static LIST_HEAD(accepting_list);
> > +
> >   /*
> >    * accept_memory() -- Consult bitmap and accept the memory if needed.
> >    *
> > @@ -24,6 +32,7 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
> >   {
> >   	struct efi_unaccepted_memory *unaccepted;
> >   	unsigned long range_start, range_end;
> > +	struct accept_range range, *entry;
> >   	unsigned long flags;
> >   	u64 unit_size;
> > @@ -80,7 +89,25 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
> >   	range_start = start / unit_size;
> > +	range.start = start;
> > +	range.end = end;
> > +retry:
> >   	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> > +
> > +	list_for_each_entry(entry, &accepting_list, list) {
> > +		if (entry->end < start)
> > +			continue;
> > +		if (entry->start > end)
> 
> Should this be a >= check since start and end are page aligned values?

Right. Good catch.

> > +			continue;
> > +		spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> > +
> > +		/* Somebody else accepting the range */
> > +		cpu_relax();
> > +		goto retry;
> 
> Could you set some kind of flag here so that ...
> 
> > +	}
> > +
> 
> ... once you get here, that means that area was accepted and removed from
> the list, so I think you could just drop the lock and exit now, right?
> Because at that point the bitmap will have been updated and you wouldn't be
> accepting any memory anyway?

No. Consider the case where someone else accepts part of the range you are
accepting: say you are accepting [0M, 4M) and hit a list entry for
[2M, 4M). Once that entry is gone from the list, [0M, 2M) may still be
unaccepted, so you cannot just exit.

I guess we can check whether the range on the list fully covers what we are
accepting, but that adds complication. Checking the bitmap at this point is
cheap enough: we already hold the lock.
Kirill A. Shutemov Oct. 13, 2023, 9:54 p.m. UTC | #11
On Fri, Oct 13, 2023 at 08:27:28PM +0300, Kirill A. Shutemov wrote:
> I will test the idea with larger unit_size to see how it behaves.

It indeed uncovered an issue. We need to record ranges on accepting_list
at unit_size granularity. Otherwise, we fail to stop parallel accept
requests to the same unit_size block if they don't overlap on physical
addresses: with, say, a 2M unit_size, requests for [0, 4K) and
[1M, 1M + 4K) don't overlap physically, but both would accept the same
2M unit.

Updated patch:

diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index 853f7dc3c21d..8af0306c8e5c 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -5,9 +5,17 @@
 #include <linux/spinlock.h>
 #include <asm/unaccepted_memory.h>
 
-/* Protects unaccepted memory bitmap */
+/* Protects unaccepted memory bitmap and accepting_list */
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
 
+struct accept_range {
+	struct list_head list;
+	unsigned long start;
+	unsigned long end;
+};
+
+static LIST_HEAD(accepting_list);
+
 /*
  * accept_memory() -- Consult bitmap and accept the memory if needed.
  *
@@ -24,6 +32,7 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 {
 	struct efi_unaccepted_memory *unaccepted;
 	unsigned long range_start, range_end;
+	struct accept_range range, *entry;
 	unsigned long flags;
 	u64 unit_size;
 
@@ -78,20 +87,58 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
 		end = unaccepted->size * unit_size * BITS_PER_BYTE;
 
-	range_start = start / unit_size;
-
+	range.start = start / unit_size;
+	range.end = DIV_ROUND_UP(end, unit_size);
+retry:
 	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+
+	/*
+	 * Check if anybody works on accepting the same range of the memory.
+	 *
+	 * The check is done with unit_size granularity. It is crucial to catch
+	 * all accept requests to the same unit_size block, even if they don't
+	 * overlap on physical address level.
+	 */
+	list_for_each_entry(entry, &accepting_list, list) {
+		if (entry->end < range.start)
+			continue;
+		if (entry->start >= range.end)
+			continue;
+
+		/*
+		 * Somebody else accepting the range. Or at least part of it.
+		 *
+		 * Drop the lock and retry until it is complete.
+		 */
+		spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+		cond_resched();
+		goto retry;
+	}
+
+	/*
+	 * Register that the range is about to be accepted.
+	 * Make sure nobody else will accept it.
+	 */
+	list_add(&range.list, &accepting_list);
+
+	range_start = range.start;
 	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
-				   DIV_ROUND_UP(end, unit_size)) {
+				   range.end) {
 		unsigned long phys_start, phys_end;
 		unsigned long len = range_end - range_start;
 
 		phys_start = range_start * unit_size + unaccepted->phys_base;
 		phys_end = range_end * unit_size + unaccepted->phys_base;
 
+		spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
 		arch_accept_memory(phys_start, phys_end);
+
+		spin_lock_irqsave(&unaccepted_memory_lock, flags);
 		bitmap_clear(unaccepted->bitmap, range_start, len);
 	}
+
+	list_del(&range.list);
 	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
 }

Patch

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index f3f2d87cce1b..e9f99c56f3ce 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -96,6 +96,9 @@  static const unsigned long * const efi_tables[] = {
 #ifdef CONFIG_EFI_COCO_SECRET
 	&efi.coco_secret,
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	&efi.unaccepted,
+#endif
 };
 
 u64 efi_setup;		/* efi setup_data physical address */
diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index b51f2a4c821e..e489fefd23da 100644
--- a/drivers/firmware/efi/Makefile
+++ b/drivers/firmware/efi/Makefile
@@ -41,3 +41,4 @@  obj-$(CONFIG_EFI_CAPSULE_LOADER)	+= capsule-loader.o
 obj-$(CONFIG_EFI_EARLYCON)		+= earlycon.o
 obj-$(CONFIG_UEFI_CPER_ARM)		+= cper-arm.o
 obj-$(CONFIG_UEFI_CPER_X86)		+= cper-x86.o
+obj-$(CONFIG_UNACCEPTED_MEMORY)		+= unaccepted_memory.o
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 7dce06e419c5..d817e7afd266 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -50,6 +50,9 @@  struct efi __read_mostly efi = {
 #ifdef CONFIG_EFI_COCO_SECRET
 	.coco_secret		= EFI_INVALID_TABLE_ADDR,
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	.unaccepted		= EFI_INVALID_TABLE_ADDR,
+#endif
 };
 EXPORT_SYMBOL(efi);
 
@@ -605,6 +608,9 @@  static const efi_config_table_type_t common_tables[] __initconst = {
 #ifdef CONFIG_EFI_COCO_SECRET
 	{LINUX_EFI_COCO_SECRET_AREA_GUID,	&efi.coco_secret,	"CocoSecret"	},
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	{LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID,	&efi.unaccepted,	"Unaccepted"	},
+#endif
 #ifdef CONFIG_EFI_GENERIC_STUB
 	{LINUX_EFI_SCREEN_INFO_TABLE_GUID,	&screen_info_table			},
 #endif
@@ -759,6 +765,25 @@  int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
 		}
 	}
 
+	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
+	    efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
+		struct efi_unaccepted_memory *unaccepted;
+
+		unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
+		if (unaccepted) {
+			unsigned long size;
+
+			if (unaccepted->version == 1) {
+				size = sizeof(*unaccepted) + unaccepted->size;
+				memblock_reserve(efi.unaccepted, size);
+			} else {
+				efi.unaccepted = EFI_INVALID_TABLE_ADDR;
+			}
+
+			early_memunmap(unaccepted, sizeof(*unaccepted));
+		}
+	}
+
 	return 0;
 }
 
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
new file mode 100644
index 000000000000..08a9a843550a
--- /dev/null
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -0,0 +1,112 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/efi.h>
+#include <linux/memblock.h>
+#include <linux/spinlock.h>
+#include <asm/unaccepted_memory.h>
+
+/* Protects unaccepted memory bitmap */
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+/*
+ * accept_memory() -- Consult bitmap and accept the memory if needed.
+ *
+ * Only memory that is explicitly marked as unaccepted in the bitmap requires
+ * an action. All the remaining memory is implicitly accepted and doesn't need
+ * acceptance.
+ *
+ * No need to accept:
+ *  - anything if the system has no unaccepted table;
+ *  - memory that is below phys_base;
+ *  - memory that is above the memory addressable by the bitmap;
+ */
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	struct efi_unaccepted_memory *unaccepted;
+	unsigned long range_start, range_end;
+	unsigned long flags;
+	u64 unit_size;
+
+	unaccepted = efi_get_unaccepted_table();
+	if (!unaccepted)
+		return;
+
+	unit_size = unaccepted->unit_size;
+
+	/*
+	 * Only care for the part of the range that is represented
+	 * in the bitmap.
+	 */
+	if (start < unaccepted->phys_base)
+		start = unaccepted->phys_base;
+	if (end < unaccepted->phys_base)
+		return;
+
+	/* Translate to offsets from the beginning of the bitmap */
+	start -= unaccepted->phys_base;
+	end -= unaccepted->phys_base;
+
+	/* Make sure not to overrun the bitmap */
+	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+		end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+	range_start = start / unit_size;
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
+				   DIV_ROUND_UP(end, unit_size)) {
+		unsigned long phys_start, phys_end;
+		unsigned long len = range_end - range_start;
+
+		phys_start = range_start * unit_size + unaccepted->phys_base;
+		phys_end = range_end * unit_size + unaccepted->phys_base;
+
+		arch_accept_memory(phys_start, phys_end);
+		bitmap_clear(unaccepted->bitmap, range_start, len);
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
+{
+	struct efi_unaccepted_memory *unaccepted;
+	unsigned long flags;
+	bool ret = false;
+	u64 unit_size;
+
+	unaccepted = efi_get_unaccepted_table();
+	if (!unaccepted)
+		return false;
+
+	unit_size = unaccepted->unit_size;
+
+	/*
+	 * Only care for the part of the range that is represented
+	 * in the bitmap.
+	 */
+	if (start < unaccepted->phys_base)
+		start = unaccepted->phys_base;
+	if (end < unaccepted->phys_base)
+		return false;
+
+	/* Translate to offsets from the beginning of the bitmap */
+	start -= unaccepted->phys_base;
+	end -= unaccepted->phys_base;
+
+	/* Make sure not to overrun the bitmap */
+	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+		end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	while (start < end) {
+		if (test_bit(start / unit_size, unaccepted->bitmap)) {
+			ret = true;
+			break;
+		}
+
+		start += unit_size;
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+	return ret;
+}
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 29cc622910da..9864f9c00da2 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -646,6 +646,7 @@  extern struct efi {
 	unsigned long			tpm_final_log;		/* TPM2 Final Events Log table */
 	unsigned long			mokvar_table;		/* MOK variable config table */
 	unsigned long			coco_secret;		/* Confidential computing secret table */
+	unsigned long			unaccepted;		/* Unaccepted memory table */
 
 	efi_get_time_t			*get_time;
 	efi_set_time_t			*set_time;