
[2/4] sched/fair: Drop out incomplete current period when sched averages accrue

Message ID 20160412115645.GA3730@vingu-laptop
State New

Commit Message

Vincent Guittot April 12, 2016, 11:56 a.m. UTC
On Tuesday 12 Apr 2016 at 03:41:41 (+0800), Yuyang Du wrote:
> Hi Vincent,
> 
> On Mon, Apr 11, 2016 at 11:08:04AM +0200, Vincent Guittot wrote:
> > > @@ -2704,11 +2694,14 @@ static __always_inline int
> > >  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > >                   unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > >  {
> > > -       u64 delta, scaled_delta, periods;
> > > -       u32 contrib;
> > > -       unsigned int delta_w, scaled_delta_w, decayed = 0;
> > > +       u64 delta;
> > > +       u32 contrib, periods;
> > >         unsigned long scale_freq, scale_cpu;
> > >
> > > +       /*
> > > +        * now rolls down to a period boundary
> > > +        */
> > > +       now = now && (u64)(~0xFFFFF);
> > >         delta = now - sa->last_update_time;
> > >         /*
> > >          * This should only happen when time goes backwards, which it
> > > @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > >         }
> > >
> > >         /*
> > > -        * Use 1024ns as the unit of measurement since it's a reasonable
> > > -        * approximation of 1us and fast to compute.
> > > +        * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> > >          */
> > > -       delta >>= 10;
> > > -       if (!delta)
> > > +       periods = delta >> 20;
> > > +       if (!periods)
> > >                 return 0;
> > >         sa->last_update_time = now;
> > 
> > The optimization looks quite interesting but I see one potential issue
> > with migration as we will lose the part of the ongoing period that is
> > now not saved anymore. This lost part can be quite significant for a
> > short task that ping pongs between CPUs.
> 
> Yes, basically, it is that we lose precision (~1ms scale in contrast with ~1us scale).


But with HZ set to 1000 and a sched-slice in the same range, having a precision
of 1ms instead of 1us makes the load tracking of short tasks quite random IMHO.

You can keep recording this partial period without using it in the load tracking.
Something like below keeps the precision without sacrificing the optimization.

---
 kernel/sched/fair.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

> But as I wrote, we may either lose a sub-1ms, or gain a sub-1ms, statistically,
> they should even out, given the load/util updates are quite a large number of
> samples, and we do want a lot of samples for the metrics, this is the point of
> the entire average thing. Plus, as you also said, the incomplete current period
> also plays a (somewhat) negative role here.
> 
> In addition, excluding the flat hierarchical util patch, we gain quite some
> efficiency.
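
To illustrate the carry-over arithmetic proposed in the reply above, here is a minimal
standalone sketch; the toy_avg type, toy_update() helper and the sample values are
illustrative assumptions, not the kernel code. The elapsed time is split into whole
1024*1024 ns (~1ms) periods, and the sub-period remainder is kept in period_contrib so
it is accounted at a later update instead of being dropped.

#include <stdint.h>
#include <stdio.h>

#define PERIOD_SHIFT	20				/* 1 period = 1024*1024 ns, ~1ms */
#define PERIOD_MASK	((1ULL << PERIOD_SHIFT) - 1)

struct toy_avg {
	uint64_t last_update_time;	/* ns */
	uint32_t period_contrib;	/* sub-period remainder carried over */
};

/* Return the number of whole periods elapsed since the last accounted update. */
static uint32_t toy_update(struct toy_avg *sa, uint64_t now)
{
	uint64_t delta = now - sa->last_update_time;

	/* Add the remainder left over from the previous update. */
	delta += sa->period_contrib;

	if (!(delta >> PERIOD_SHIFT))
		return 0;	/* still inside the current period, nothing accrues yet */

	sa->last_update_time = now;
	/* Keep the sub-period remainder for the next update. */
	sa->period_contrib = delta & PERIOD_MASK;

	return (uint32_t)(delta >> PERIOD_SHIFT);
}

int main(void)
{
	struct toy_avg sa = { .last_update_time = 0, .period_contrib = 0 };
	uint32_t periods = toy_update(&sa, 2621440ULL);	/* 2.5 periods later */

	/* Expect 2 whole periods, with 524288 ns carried into period_contrib. */
	printf("periods=%u contrib=%u\n", periods, sa.period_contrib);
	return 0;
}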

Comments

Vincent Guittot April 13, 2016, 11:11 a.m. UTC | #1
On 12 April 2016 at 23:09, Yuyang Du <yuyang.du@intel.com> wrote:
>
> Hi Vincent,
>
> On Tue, Apr 12, 2016 at 01:56:45PM +0200, Vincent Guittot wrote:
> > On Tuesday 12 Apr 2016 at 03:41:41 (+0800), Yuyang Du wrote:
> > > Hi Vincent,
> > >
> > > On Mon, Apr 11, 2016 at 11:08:04AM +0200, Vincent Guittot wrote:
> > > > > @@ -2704,11 +2694,14 @@ static __always_inline int
> > > > >  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > > >                   unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > > > >  {
> > > > > -       u64 delta, scaled_delta, periods;
> > > > > -       u32 contrib;
> > > > > -       unsigned int delta_w, scaled_delta_w, decayed = 0;
> > > > > +       u64 delta;
> > > > > +       u32 contrib, periods;
> > > > >         unsigned long scale_freq, scale_cpu;
> > > > >
> > > > > +       /*
> > > > > +        * now rolls down to a period boundary
> > > > > +        */
> > > > > +       now = now && (u64)(~0xFFFFF);
> > > > >         delta = now - sa->last_update_time;
> > > > >         /*
> > > > >          * This should only happen when time goes backwards, which it
> > > > > @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > > >         }
> > > > >
> > > > >         /*
> > > > > -        * Use 1024ns as the unit of measurement since it's a reasonable
> > > > > -        * approximation of 1us and fast to compute.
> > > > > +        * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> > > > >          */
> > > > > -       delta >>= 10;
> > > > > -       if (!delta)
> > > > > +       periods = delta >> 20;
> > > > > +       if (!periods)
> > > > >                 return 0;
> > > > >         sa->last_update_time = now;
> > > >
> > > > The optimization looks quite interesting but I see one potential issue
> > > > with migration as we will lose the part of the ongoing period that is
> > > > now not saved anymore. This lost part can be quite significant for a
> > > > short task that ping pongs between CPUs.
> > >
> > > Yes, basically, it is that we lose precision (~1ms scale in contrast with ~1us scale).
> >
> > But with HZ set to 1000 and a sched-slice in the same range, having a precision
> > of 1ms instead of 1us makes the load tracking of short tasks quite
> > random IMHO.
> >
> > You can keep recording this partial period without using it in the load tracking.
> > Something like below keeps the precision without sacrificing the optimization.
>
> The residue is accumulated and rolled over to the next update every time. But is
> its state runnable/not-runnable, or running/not-running?


Yes, this needs to be sorted out.

>
>
> > ---
> >  kernel/sched/fair.c | 16 ++++++++++++----
> >  1 file changed, 12 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 68273e8..b234169 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -674,6 +674,12 @@ void init_entity_runnable_average(struct sched_entity *se)
> >       struct sched_avg *sa = &se->avg;
> >
> >       sa->last_update_time = 0;
> > +     /*
> > +      * sched_avg's period_contrib should be strictly less than 1024 * 1024, so
> > +      * we give it 1023 * 1024 to make sure it is almost a period (1024us), and
> > +      * will definitely be updated (after enqueue).
> > +      */
> > +     sa->period_contrib = 1023*1024;
> >       sa->load_avg = scale_load_down(se->load.weight);
> >       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
> >       /*
> > @@ -2698,10 +2704,6 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >       u32 contrib, periods;
> >       unsigned long scale_freq, scale_cpu;
> >
> > -     /*
> > -      * now rolls down to a period boundary
> > -      */
> > -     now = now && (u64)(~0xFFFFF);
> >       delta = now - sa->last_update_time;
> >       /*
> >        * This should only happen when time goes backwards, which it
> > @@ -2712,6 +2714,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >               return 0;
> >       }
> >
> > +     /* Add how much is left for the current period */
> > +     delta += sa->period_contrib;
> > +
> >       /*
> >        * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> >        */
> > @@ -2720,6 +2725,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >               return 0;
> >       sa->last_update_time = now;
> >
> > +     /* Get how much is left for the next period */
> > +     sa->period_contrib = delta & (u64)(0xFFFFF);
> > +
> >       scale_freq = arch_scale_freq_capacity(NULL, cpu);
> >       scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> >
> > > But as I wrote, we may either lose a sub-1ms, or gain a sub-1ms, statistically,
> > > they should even out, given the load/util updates are quite a large number of
> > > samples, and we do want a lot of samples for the metrics, this is the point of
> > > the entire average thing. Plus, as you also said, the incomplete current period
> > > also plays a (somewhat) negative role here.
> > >
> > > In addition, excluding the flat hierarchical util patch, we gain quite some
> > > efficiency.
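
To put numbers on the granularity point being debated, a small sketch with assumed
values: a roughly 0.9 ms delta accrues ~900 units under the original 1024 ns unit, but
zero whole periods under the proposed 1024*1024 ns period, so whether that sub-period
time is eventually counted depends on how the remainder is carried across updates.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t delta = 900 * 1024;	/* ns, roughly 0.9 ms (assumed example) */

	/* Original scheme: 1024 ns units, ~1us granularity. */
	printf("1us-scale units:   %llu\n", (unsigned long long)(delta >> 10)); /* 900 */

	/* Proposed scheme: 1024*1024 ns periods, ~1ms granularity. */
	printf("whole 1ms periods: %llu\n", (unsigned long long)(delta >> 20)); /* 0 */

	return 0;
}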

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68273e8..b234169 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -674,6 +674,12 @@  void init_entity_runnable_average(struct sched_entity *se)
 	struct sched_avg *sa = &se->avg;
 
 	sa->last_update_time = 0;
+	/*
+	 * sched_avg's period_contrib should be strictly less than 1024 * 1024, so
+	 * we give it 1023 * 1024 to make sure it is almost a period (1024us), and
+	 * will definitely be updated (after enqueue).
+	 */
+	sa->period_contrib = 1023*1024;
 	sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
 	/*
@@ -2698,10 +2704,6 @@  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	u32 contrib, periods;
 	unsigned long scale_freq, scale_cpu;
 
-	/*
-	 * now rolls down to a period boundary
-	 */
-	now = now && (u64)(~0xFFFFF);
 	delta = now - sa->last_update_time;
 	/*
 	 * This should only happen when time goes backwards, which it
@@ -2712,6 +2714,9 @@  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	}
 
+	/* Add how much is left for the current period */
+	delta += sa->period_contrib;
+
 	/*
 	 * Use 1024*1024ns as an approximation of 1ms period, pretty close.
 	 */
@@ -2720,6 +2725,9 @@  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
+	/* Get how much is left for the next period */
+	sa->period_contrib = delta & (u64)(0xFFFFF);
+
 	scale_freq = arch_scale_freq_capacity(NULL, cpu);
 	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
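
A quick numeric check of the 1023*1024 initialization above (the delta value below is an
assumed example, and this is a sketch rather than the kernel code): seeding period_contrib
with almost a full period means even a tiny first delta after enqueue crosses a period
boundary, so the freshly initialized average is updated almost immediately.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t period = 1024 * 1024;	/* ns, ~1ms */
	uint64_t contrib = 1023 * 1024;		/* initial period_contrib */
	uint64_t delta   = 2048;		/* ~2us after enqueue (assumed) */
	uint64_t total   = delta + contrib;

	/* 1047552 + 2048 = 1049600 -> 1 whole period, 1024 ns left over. */
	printf("periods=%llu contrib=%llu\n",
	       (unsigned long long)(total >> 20),
	       (unsigned long long)(total & (period - 1)));
	return 0;
}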