diff mbox series

[1/9] cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL

Message ID 20240430183730.561960-2-ankur.a.arora@oracle.com
State Superseded
Headers show
Series Enable haltpoll for arm64 | expand

Commit Message

Ankur Arora April 30, 2024, 6:37 p.m. UTC
ARCH_HAS_CPU_RELAX is a bit of a misnomer since all architectures
define cpu_relax(). Not all, however, have a performant version, with
some only implementing it as a compiler barrier.

In contexts that this config option is used, it is expected to provide
an architectural primitive that can be used as part of a polling
mechanism -- one that would be cheaper than spinning in a tight loop.

Advertise the availability of such a primitive by renaming to
ARCH_HAS_OPTIMIZED_POLL. And, while at it, explicitly condition
cpuidle-haltpoll and intel-idle, both of which depend on a polling
state, on it.

Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/Kconfig              | 2 +-
 drivers/acpi/processor_idle.c | 4 ++--
 drivers/cpuidle/Kconfig       | 2 +-
 drivers/cpuidle/Makefile      | 2 +-
 drivers/idle/Kconfig          | 1 +
 include/linux/cpuidle.h       | 2 +-
 6 files changed, 7 insertions(+), 6 deletions(-)

Comments

Christoph Lameter (Ampere) May 2, 2024, 1:33 a.m. UTC | #1
On Tue, 30 Apr 2024, Ankur Arora wrote:

> ARCH_HAS_CPU_RELAX is a bit of a misnomer since all architectures
> define cpu_relax(). Not all, however, have a performant version, with
> some only implementing it as a compiler barrier.
>
> In contexts that this config option is used, it is expected to provide
> an architectural primitive that can be used as part of a polling
> mechanism -- one that would be cheaper than spinning in a tight loop.

The intend of cpu_relax() is not a polling mechanism. Initial AFAICT it 
was introduced on x86 as the REP NOP instruction. Aka as PAUSE. And it was 
part of a spin loop. So there was no connection to polling anything.

The intend was to make the processor aware that we are in a spin loop. 
Various processors have different actions that they take upon encountering 
such a cpu relax operation.

The polling (WFE/WFI) available on ARM (and potentially other platforms) 
is a different mechanism that is actually intended to reduce the power 
requirement of the processor until a certain condition is met and that 
check is done in hardware.

These are not the same and I think we need both config options.

The issues that you have with WFET later in the patchset arise from not 
making this distinction.

The polling (waiting for an event) could be implemented for a 
processor not supporting that in hardware by using a loop that 
checks for the condition and then does a cpu_relax().

With that you could f.e. support the existing cpu_relax() and also have 
some form of cpu_poll() interface.
Ankur Arora May 3, 2024, 4:13 a.m. UTC | #2
Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Tue, 30 Apr 2024, Ankur Arora wrote:
>
>> ARCH_HAS_CPU_RELAX is a bit of a misnomer since all architectures
>> define cpu_relax(). Not all, however, have a performant version, with
>> some only implementing it as a compiler barrier.
>>
>> In contexts that this config option is used, it is expected to provide
>> an architectural primitive that can be used as part of a polling
>> mechanism -- one that would be cheaper than spinning in a tight loop.
>
> The intend of cpu_relax() is not a polling mechanism. Initial AFAICT it was
> introduced on x86 as the REP NOP instruction. Aka as PAUSE. And it was part of a
> spin loop. So there was no connection to polling anything.

Agreed, cpu_relax() is just a mechanism to tell the pipeline that
we are in a spin-loop.

> The intend was to make the processor aware that we are in a spin loop. Various
> processors have different actions that they take upon encountering such a cpu
> relax operation.

Sure, though most processors don't have a nice mechanism to do that.
x86 clearly has the REP; NOP thing. arm64 only has a YIELD which from my
measurements is basically a NOP when executed on a system without
hardware threads.

And that's why only x86 defines ARCH_HAS_CPU_RELAX.

> The polling (WFE/WFI) available on ARM (and potentially other platforms) is a
> different mechanism that is actually intended to reduce the power requirement of
> the processor until a certain condition is met and that check is done in
> hardware.

Sure. Which almost exactly fits the bill for the poll-idle loop -- except for the
timeout part.

> These are not the same and I think we need both config options.

My main concern is that poll_idle() conflates polling in idle with
ARCH_HAS_CPU_RELAX, when they aren't really related.

So, poll_idle(), and its users should depend on ARCH_HAS_OPTIMIZED_POLL
which, if defined by some architecture, means that poll_idle() would
be better than a spin-wait loop.

Beyond that I'm okay to keep ARCH_HAS_CPU_RELAX around.

That said, do you see a use for ARCH_HAS_CPU_RELAX? The only current
user is the poll-idle path.

> The issues that you have with WFET later in the patchset arise from not making
> this distinction.

Did you mean the issue with WFE? I'm not using WFET in this patchset at all.

With WFE, sure there's a problem in that you depend on an interrupt or
the event-stream to get out of the wait. And, so sometimes you would
overshoot the target poll timeout.

> The polling (waiting for an event) could be implemented for a processor not
> supporting that in hardware by using a loop that checks for the condition and
> then does a cpu_relax().

Yeah. That's exactly what patch-6 does. smp_cond_load_relaxed() uses
cpu_relax() internally in its spin-loop variant (non arm64).

On arm64, this would use LDXR; WFE. Or are you suggesting implementing
the arm64 loop via cpu_relax() (and thus YIELD?)

Ankur

> With that you could f.e. support the existing cpu_relax() and also have some
> form of cpu_poll() interface.
Christoph Lameter (Ampere) May 3, 2024, 5:07 p.m. UTC | #3
On Thu, 2 May 2024, Ankur Arora wrote:

>> The intend was to make the processor aware that we are in a spin loop. Various
>> processors have different actions that they take upon encountering such a cpu
>> relax operation.
>
> Sure, though most processors don't have a nice mechanism to do that.
> x86 clearly has the REP; NOP thing. arm64 only has a YIELD which from my
> measurements is basically a NOP when executed on a system without
> hardware threads.
>
> And that's why only x86 defines ARCH_HAS_CPU_RELAX.

My impression is that the use of arm YIELD has led cpu architects to 
implement similar mechanisms to x86s PAUSE, This is not part of the spec 
but it has been there for a long time. So I would rather leave it as is.


>> These are not the same and I think we need both config options.
>
> My main concern is that poll_idle() conflates polling in idle with
> ARCH_HAS_CPU_RELAX, when they aren't really related.
>
> So, poll_idle(), and its users should depend on ARCH_HAS_OPTIMIZED_POLL
> which, if defined by some architecture, means that poll_idle() would
> be better than a spin-wait loop.
>
> Beyond that I'm okay to keep ARCH_HAS_CPU_RELAX around.
>
> That said, do you see a use for ARCH_HAS_CPU_RELAX? The only current
> user is the poll-idle path.

I would think that we need a generic cpu_poll() mechanism that can fall 
back to cpu_relax() on processors that do not offer such thing (x86?) and 
if not even that is there fall back.

We already have something like that in the smp_cond_acquire mechanism (a 
bit weird to put that in the barrier.h>).

So what if we had

void cpu_wait(unsigned flags, unsigned long timeout, void *cacheline);

With

#define CPU_POLL_INTERRUPT (1 << 0)
#define CPU_POLL_EVENT (1 <<  1)
#define CPU_POLL_CACHELINE (1 << 2)
#define CPU_POLL_TIMEOUT (1 << 3)
#define CPU_POLL_BROADCAST_EVENT (1 << 4)
#define CPU_POLL_LOCAL_EVENT (1 << 5)


The cpu_poll() function coud be generically defined in asm-generic and 
then arches could provide their own implementation optimizing the hardware 
polling mechanisms.

Any number of flags could be specified simultaneously. On ARM this would 
map then to SEVL SEV and WFI/WFE WFIT/WFET

So f.e.

cpu_wait(CPU_POLL_INTERUPT|CPU_POLL_EVENT|CPU_POLL_TIMEOUT|CPU_POLL_CACHELINE, 
timeout, &mylock);

to wait on a change in a cacheline with a timeout.

In additional we could then think about making effective use of the 
signaling mechanism provided by SEV in core logic of the kernel. Maybe 
that is more effective then waiting for a cacheline in some situations.


> With WFE, sure there's a problem in that you depend on an interrupt or
> the event-stream to get out of the wait. And, so sometimes you would
> overshoot the target poll timeout.

Right. The dependence on the event stream makes this approach a bit 
strange. Having some sort of generic cpu_wait() feature with timeout spec 
could avoid that.
Ankur Arora May 6, 2024, 9:27 p.m. UTC | #4
Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Thu, 2 May 2024, Ankur Arora wrote:
>
>>> The intend was to make the processor aware that we are in a spin loop. Various
>>> processors have different actions that they take upon encountering such a cpu
>>> relax operation.
>>
>> Sure, though most processors don't have a nice mechanism to do that.
>> x86 clearly has the REP; NOP thing. arm64 only has a YIELD which from my
>> measurements is basically a NOP when executed on a system without
>> hardware threads.
>>
>> And that's why only x86 defines ARCH_HAS_CPU_RELAX.
>
> My impression is that the use of arm YIELD has led cpu architects to implement
> similar mechanisms to x86s PAUSE, This is not part of the spec but it has been
> there for a long time. So I would rather leave it as is.
>
>
>>> These are not the same and I think we need both config options.
>>
>> My main concern is that poll_idle() conflates polling in idle with
>> ARCH_HAS_CPU_RELAX, when they aren't really related.
>>
>> So, poll_idle(), and its users should depend on ARCH_HAS_OPTIMIZED_POLL
>> which, if defined by some architecture, means that poll_idle() would
>> be better than a spin-wait loop.
>>
>> Beyond that I'm okay to keep ARCH_HAS_CPU_RELAX around.
>>
>> That said, do you see a use for ARCH_HAS_CPU_RELAX? The only current
>> user is the poll-idle path.
>
> I would think that we need a generic cpu_poll() mechanism that can fall back to
> cpu_relax() on processors that do not offer such thing (x86?) and if not even
> that is there fall back.
>
> We already have something like that in the smp_cond_acquire mechanism (a bit
> weird to put that in the barrier.h>).
>
> So what if we had
>
> void cpu_wait(unsigned flags, unsigned long timeout, void *cacheline);
>
> With
>
> #define CPU_POLL_INTERRUPT (1 << 0)
> #define CPU_POLL_EVENT (1 <<  1)
> #define CPU_POLL_CACHELINE (1 << 2)
> #define CPU_POLL_TIMEOUT (1 << 3)
> #define CPU_POLL_BROADCAST_EVENT (1 << 4)
> #define CPU_POLL_LOCAL_EVENT (1 << 5)
>
>
> The cpu_poll() function coud be generically defined in asm-generic and then
> arches could provide their own implementation optimizing the hardware polling
> mechanisms.
>
> Any number of flags could be specified simultaneously. On ARM this would map
> then to SEVL SEV and WFI/WFE WFIT/WFET
>
> So f.e.
>
> cpu_wait(CPU_POLL_INTERUPT|CPU_POLL_EVENT|CPU_POLL_TIMEOUT|CPU_POLL_CACHELINE,
> timeout, &mylock);
>
> to wait on a change in a cacheline with a timeout.
>
> In additional we could then think about making effective use of the signaling
> mechanism provided by SEV in core logic of the kernel. Maybe that is more
> effective then waiting for a cacheline in some situations.
>
>
>> With WFE, sure there's a problem in that you depend on an interrupt or
>> the event-stream to get out of the wait. And, so sometimes you would
>> overshoot the target poll timeout.
>
> Right. The dependence on the event stream makes this approach a bit strange.
> Having some sort of generic cpu_wait() feature with timeout spec could avoid
> that.

Thanks for the detailed comments. Helped me think through some of the details.

So, there are three issues that you bring up. Let me address each in turn.

1) A generic cpu_poll() mechanism that can fall back to cpu_relax().

> I would think that we need a generic cpu_poll() mechanism that can fall back to
> cpu_relax() on processors that do not offer such thing (x86?) and if not even
> that is there fall back.
>
> We already have something like that in the smp_cond_acquire mechanism (a bit
> weird to put that in the barrier.h>).

Isn't that exactly what this series does?

If you see patch-6, that gets rid of direct use of cpu_relax(), instead
using smp_cond_load_relaxed().

And smp_cond_load_relaxed(), in its generic variant (used everywhere but
arm64) uses cpu_relax() implicitly. Any architecture that override
this -- as arm64 does -- get their own optimizations.

(Maybe this patch would be clearer if it was sequenced after patch-6?)

2) That brings me back to your second point, about having a different
interface which allows for different optimizations.

> void cpu_wait(unsigned flags, unsigned long timeout, void *cacheline);
>
> With
>
> #define CPU_POLL_INTERRUPT (1 << 0)
> #define CPU_POLL_EVENT (1 <<  1)
> #define CPU_POLL_CACHELINE (1 << 2)
> #define CPU_POLL_TIMEOUT (1 << 3)
> #define CPU_POLL_BROADCAST_EVENT (1 << 4)
> #define CPU_POLL_LOCAL_EVENT (1 << 5)

I agree with you that the polling logic does need to handle timeouts
but I don't think that we need a special interface.

Given that we are only concerned about poll_idle() here and that needs
to work with the scheduler's set_nr_if_polling() machinery to elide IPIs,
the polling on a cacheline is needed anyway. That also means we poll_idle()
doesn't need to handle interrupts.

For the rest, the architecture could internally choose whichever
variation they perform best at -- so long as they can either spin-wait
or have some kind of event driven mechanism (WFE/WFET, MONITOR/MWAIT[X])


The timeout is something that I plan to address separately. I think we can
straight-forwardly extend to smp_cond_load_timeout() to do that at least
for WFET supporting platforms where others depend on the event-stream.

  #define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr, \
                                  time_limit, timeout) ({   \
          typeof(ptr) __PTR = (ptr);                              \
          __unqual_scalar_typeof(*ptr) VAL;                       \
          unsigned int __count = 0;                               \
          for (;;) {                                              \
                  VAL = READ_ONCE(*__PTR);                        \
                  if (cond_expr)                                  \
                          break;                                  \
                  cpu_relax();                                    \
                  if (__count++ < smp_cond_time_check_count)      \
                          continue;                               \
                                                                  \
                  if ((time_check_expr) > time_limit)             \
                          goto timeout;                           \
                                                                  \
                  __count = 0;                                    \
          }                                                       \
          (typeof(*ptr))VAL;                                      \
  })

arm64, for instance, can use alternatives to implement whichever variant
the processor supports (untested, also needs massaging).

  static inline void __cmpwait_case_##sz(volatile void *ptr,              \
                                  unsigned long val,                      \
                                  unsigned long etime)                    \
                                                                          \


          unsigned long tmp;                                              \
                                                                          \
          const unsigned long ecycles = xloops_to_cycles(nsecs_to_xloops(etime)); \
          asm volatile(                                                   \
          "       sevl\ n"                                                \
          "       wfe\ n"                                                 \
          "       ldxr" #sfx "\ t%" #w "[tmp], %[v]\n"                    \
          "       eor     %" #w "[tmp], %" #w "[tmp], %" #w "[val]\ n"    \
          "       cbnz    %" #w "[tmp], 1f\ n"                            \
          ALTERNATIVE("wfe\ n",                                           \
                  "msr s0_3_c1_c0_0, %[ecycles]\ n",                      \
                  ARM64_HAS_WFXT)                                         \
          "1:"                                                            \
          : [tmp] "=&r" (tmp), [v] "+Q" (*(u##sz *)ptr)                   \
          : [val] "r" (val), [ecycles] "r" (ecycles));                    \
  }

And, then poll_idle() only need be:

  static int __cpuidle poll_idle(struct cpuidle_device *dev,
                               struct cpuidle_driver *drv, int index)
  {
        u64 time_start;

        time_start = local_clock_noinstr();

        dev->poll_time_limit = false;

        raw_local_irq_enable();
        if (!current_set_polling_and_test()) {
                u64 time_limit;

                time_limit = cpuidle_poll_time(drv, dev);

                smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
                                              VAL & _TIF_NEED_RESCHED,
                                              local_clock_noinstr(),
                                              time_start + time_limit,
                                              timed_out);
        }
        ...

On x86, this generates code pretty similar to the current version and
on arm64 similar to the WFE version.

3) Dependence on the event-stream for the WFE variants

> Right. The dependence on the event stream makes this approach a bit strange.
> Having some sort of generic cpu_wait() feature with timeout spec could avoid
> that.

Not sure I agree with that. Seems to me, the event-stream is present for
exactly that -- so we don't wait in a WFE or WFI forever. The spec
say (section D12.2.3)

  "An implementation that includes the Generic Timer can use the system
  counter to generate one or more event streams, to generate periodic
  wakeup events as part of the mechanism described in Wait for Event.

  An event stream might be used:
   - To impose a time-out on a Wait For Event polling loop."

The overshoot is a problem, but I don't think it is a huge one. Most
of the time that haltpoll is in effect, it should wake up with work
to do in the guest_halt_poll_ns duration. The times, it doesn't it
would exit poll_idle() and exit to the hypervisor.

So, this is only an issue in the last iteration.

And, I'm not sure how a generic cpu_wait() would work? Either the
generic cpu_wait() spins in cpu_relax()/YIELD which is suboptimal
on arm64 or it uses WFE but sets a timer for 50us or whatever. Seems
like unnecessary overhead when the overshoot is relatively uncommon.

Or is there another mechanism you are thinking of for enforcing a
timeout?

Thanks

--
ankur
diff mbox series

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4474bf32d0a4..b238c874875a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -368,7 +368,7 @@  config ARCH_MAY_HAVE_PC_FDC
 config GENERIC_CALIBRATE_DELAY
 	def_bool y
 
-config ARCH_HAS_CPU_RELAX
+config ARCH_HAS_OPTIMIZED_POLL
 	def_bool y
 
 config ARCH_HIBERNATION_POSSIBLE
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index bd6a7857ce05..ccef38410950 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -36,7 +36,7 @@ 
 #include <asm/cpu.h>
 #endif
 
-#define ACPI_IDLE_STATE_START	(IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX) ? 1 : 0)
+#define ACPI_IDLE_STATE_START	(IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL) ? 1 : 0)
 
 static unsigned int max_cstate __read_mostly = ACPI_PROCESSOR_MAX_POWER;
 module_param(max_cstate, uint, 0400);
@@ -787,7 +787,7 @@  static int acpi_processor_setup_cstates(struct acpi_processor *pr)
 	if (max_cstate == 0)
 		max_cstate = 1;
 
-	if (IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX)) {
+	if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
 		cpuidle_poll_state_init(drv);
 		count = 1;
 	} else {
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index cac5997dca50..75f6e176bbc8 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -73,7 +73,7 @@  endmenu
 
 config HALTPOLL_CPUIDLE
 	tristate "Halt poll cpuidle driver"
-	depends on X86 && KVM_GUEST
+	depends on X86 && KVM_GUEST && ARCH_HAS_OPTIMIZED_POLL
 	select CPU_IDLE_GOV_HALTPOLL
 	default y
 	help
diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index d103342b7cfc..f29dfd1525b0 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -7,7 +7,7 @@  obj-y += cpuidle.o driver.o governor.o sysfs.o governors/
 obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
 obj-$(CONFIG_DT_IDLE_STATES)		  += dt_idle_states.o
 obj-$(CONFIG_DT_IDLE_GENPD)		  += dt_idle_genpd.o
-obj-$(CONFIG_ARCH_HAS_CPU_RELAX)	  += poll_state.o
+obj-$(CONFIG_ARCH_HAS_OPTIMIZED_POLL)	  += poll_state.o
 obj-$(CONFIG_HALTPOLL_CPUIDLE)		  += cpuidle-haltpoll.o
 
 ##################################################################################
diff --git a/drivers/idle/Kconfig b/drivers/idle/Kconfig
index 6707d2539fc4..6f9b1d48fede 100644
--- a/drivers/idle/Kconfig
+++ b/drivers/idle/Kconfig
@@ -4,6 +4,7 @@  config INTEL_IDLE
 	depends on CPU_IDLE
 	depends on X86
 	depends on CPU_SUP_INTEL
+	depends on ARCH_HAS_OPTIMIZED_POLL
 	help
 	  Enable intel_idle, a cpuidle driver that includes knowledge of
 	  native Intel hardware idle features.  The acpi_idle driver
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 3183aeb7f5b4..7e7e58a17b07 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -275,7 +275,7 @@  static inline void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev,
 }
 #endif
 
-#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
+#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_OPTIMIZED_POLL)
 void cpuidle_poll_state_init(struct cpuidle_driver *drv);
 #else
 static inline void cpuidle_poll_state_init(struct cpuidle_driver *drv) {}