Broadly speaking, scheduling has to answer two questions:

  • When to schedule
  • Which task to schedule next

A good scheduling system can be judged by the following metrics:

  • fast process response time
  • good throughput for background jobs
  • avoidance of process starvation
  • reconciliation of the needs of low- and high-priority processes

Linux scheduling is based on time-sharing: multiple processes share the same CPU, and CPU time is divided into slices. Time-sharing relies on the timer interrupt and is transparent to processes.

Linux processes are preemptible. When a process enters the TASK_RUNNING state, the kernel checks whether its dynamic priority is higher than that of the currently running process. If it is, the current process is interrupted and the scheduler is invoked to pick another process to run (usually the one that just became runnable). A process can also be preempted when its time slice runs out: the TIF_NEED_RESCHED flag in the current process's struct thread_info is set, and the scheduler is invoked once the timer interrupt handler returns.

A preempted process is not suspended: it remains in the TASK_RUNNING state; it simply no longer uses the CPU.

## Scheduling Algorithms

Before Linux 2.6, the scheduling algorithm was simple: on a process switch, the kernel scanned the list of runnable processes, computed their priorities, and picked the "best" one to run. The problem is that the cost of each switch grows linearly with the number of runnable processes.

Scheduling policies:

  • SCHED_FIFO
  • SCHED_RR
  • SCHED_NORMAL

## O(1) Scheduler

Based on Linux 2.6.12.

### Priority

In the Linux 2.6.12 kernel, priority is expressed differently in user space and in the kernel. The user-space priority is the familiar nice value, in the range [-20, 19]. The kernel-side priority is called the static priority, in the range [0, 139]: [0, 99] is reserved for real-time processes, and [100, 139] maps to the user-space nice values.

Beyond the static priority, what actually takes effect is the dynamic priority: the static priority adjusted by a runtime bonus. $$ \text{dynamic\_priority} = \max(100, \min(\text{static\_priority} - \text{bonus} + 5, 139)), \quad \text{bonus} \in [0, 10] $$

```c
// kernel/sched.c

#define CURRENT_BONUS(p) \
    (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
        MAX_SLEEP_AVG)

/*
 * effective_prio - return the priority that is based on the static
 * priority but is modified by bonuses/penalties.
 *
 * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
 * into the -5 ... 0 ... +5 bonus/penalty range.
 *
 * We use 25% of the full 0...39 priority range so that:
 *
 * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
 * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks.
 *
 * Both properties are important to certain workloads.
 */
static int effective_prio(task_t *p)
{
    int bonus, prio;

    if (rt_task(p))
        return p->prio;

    bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;

    prio = p->static_prio - bonus;
    if (prio < MAX_RT_PRIO)
        prio = MAX_RT_PRIO;
    if (prio > MAX_PRIO-1)
        prio = MAX_PRIO-1;
    return prio;
}
```

The bonus depends on the sleep_avg field of the process descriptor. sleep_avg is maintained in recalc_task_prio(task_t *p, unsigned long long now); the code is not reproduced here. ==In short, the longer the average sleep time, the larger the bonus, which lowers the dynamic priority value, i.e. raises the effective priority: a process that sleeps a lot is presumed to be interactive (waiting for user input) and is boosted for better response time, while CPU hogs are penalized.==

```c
/*
 * These are the 'tuning knobs' of the scheduler:
 *
 * Minimum timeslice is 5 msecs (or 1 jiffy, whichever is larger),
 * default timeslice is 100 msecs, maximum timeslice is 800 msecs.
 * Timeslices get refilled after they expire.
 */
#define MIN_TIMESLICE       max(5 * HZ / 1000, 1)
#define DEF_TIMESLICE       (100 * HZ / 1000)

#define MAX_SLEEP_AVG       (DEF_TIMESLICE * MAX_BONUS)
```

### time_slice Calculation

A process's time slice is also affected by its priority. The general rule is that ==the higher the priority, the longer the allotted time slice==. $$ \text{base\_time\_quantum} = \begin{cases} (140 - \text{static\_priority}) \times 20 \text{ ms} & \text{if static\_priority} < 120 \\ (140 - \text{static\_priority}) \times 5 \text{ ms} & \text{if static\_priority} \ge 120 \end{cases} $$

```c
/*
 * task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
 * to time slice values: [800ms ... 100ms ... 5ms]
 *
 * The higher a thread's priority, the bigger timeslices
 * it gets during one round of execution. But even the lowest
 * priority thread gets MIN_TIMESLICE worth of execution time.
 */

#define SCALE_PRIO(x, prio) \
    max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO/2), MIN_TIMESLICE)

static inline unsigned int task_timeslice(task_t *p)
{
    if (p->static_prio < NICE_TO_PRIO(0))
        return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
    else
        return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
}
```

### Basic Idea

The scheduler's main function is schedule(), defined in kernel/sched.c. Each CPU core has its own per-cpu runqueue structure, which records the list of processes waiting to run on that CPU.

Each runqueue has two ==priority arrays==: `active` and `expired`. The `active` array holds processes that still have time_slice left; when a process exhausts its time_slice, it is moved to the `expired` array.

```c
// kernel/sched.c: scheduler_tick(void)
    if (!--p->time_slice) {
        dequeue_task(p, rq->active);
        set_tsk_need_resched(p);
        p->prio = effective_prio(p);
        p->time_slice = task_timeslice(p);
        p->first_time_slice = 0;

        if (!rq->expired_timestamp)
            rq->expired_timestamp = jiffies;
        if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
            enqueue_task(p, rq->expired);
            if (p->static_prio < rq->best_expired_prio)
                rq->best_expired_prio = p->static_prio;
        } else
            enqueue_task(p, rq->active);
    }
```

==scheduler_tick() is invoked from the timer interrupt path in kernel/timer.c.==

Once every process in the `active` array has been moved to `expired`, the runqueue swaps the two array pointers, i.e. the `expired` array becomes the new `active` array.

```c
// kernel/sched.c: schedule(void)
    array = rq->active;
    if (unlikely(!array->nr_active)) {
        /*
         * Switch the active and expired arrays.
         */
        schedstat_inc(rq, sched_switch);
        rq->active = rq->expired;
        rq->expired = array;
        array = rq->active;
        rq->expired_timestamp = 0;
        rq->best_expired_prio = MAX_PRIO;
    }
```

The `active` and `expired` arrays are always kept ordered, and a bitmap makes it possible to find the highest-priority task in O(1):

```c
struct prio_array {
    unsigned int nr_active;
    unsigned long bitmap[BITMAP_SIZE];
    struct list_head queue[MAX_PRIO];
};
```

Each bit of the bitmap represents one priority level and maps to the list of all tasks at that priority. When a task is enqueued at a priority, the corresponding bit is set; when a task is removed and no other task remains at that priority, the bit is cleared. Lookup finds the first set bit and takes a task from that priority's list.

## CFS Scheduler

Based on Linux 4.18.

The main issue with the O(1) algorithm is the complex heuristics used to mark a task as interactive or non-interactive. The algorithm tries to identify interactive processes by analyzing average sleep time (the amount of time the process spends waiting for input). Processes that sleep for long periods of time probably are waiting for user input, so the scheduler assumes they're interactive. The scheduler gives a priority bonus to interactive tasks (for better throughput) while penalizing non-interactive tasks by lowering their priorities. All the calculations to determine the interactivity of tasks are complex and subject to potential miscalculations, causing non-interactive behavior from an interactive process.

### vruntime

vruntime is a concept CFS introduces to measure how long tasks have run. Because the actual runtime is weighted by priority, it is called virtual runtime. The CFS runqueue is ordered entirely by virtual runtime.

How is a task's runtime accounted?

### Basic Idea

> CFS basically models an 'ideal, precise multitasking CPU' on real hardware. (Ingo Molnar, the author of CFS)

On an ideal multitasking CPU, every task would receive the same amount of runtime over any time interval. In reality, a CPU can execute instructions from only one task at a time, so the ideal is unattainable; modern systems use time-sharing to make multiple tasks appear to run in parallel on one CPU.

(Figure omitted: the left panel shows the ideal state; the right panel shows actual execution.)

Each task accumulates its own runtime on the CPU. ==The basic idea of CFS is to make every task's runtime converge toward the ideal state: a task that has run less is given higher scheduling priority and runtime compensation, while a task that has run more is deprioritized and, even when scheduled, receives less runtime.==

Compared with the O(1) scheduler, CFS manages runqueue tasks with an rbtree instead. The red-black tree is ordered by vruntime; in other words, the smaller a task's vruntime, the sooner it is scheduled.

### Core Flow

All of CFS's operations live in the fair_sched_class structure, defined in kernel/sched/fair.c.

A few fundamental operations sit at the bottom layer; every flow can be traced back to them:

  • place_entity - set the task's vruntime
  • enqueue_entity - add the task to the per-cpu rq
  • dequeue_entity - remove the task from the rq
#### New Task Flow

```
_do_fork()
│
├─copy_process()
│ │
│ └─sched_fork()
│   │
│   └─p->sched_class->task_fork(p)
│                     │
│                     └─task_fork_fair()
│                       │
│                       └─place_entity()
│
└─wake_up_new_task()
  │
  └─activate_task()
    │
    └─enqueue_task()
      │
      └─p->sched_class->enqueue_task()
                        │
                        └─enqueue_task_fair()
                          │
                          └─enqueue_entity()
```
  • How do cgroup parameters affect CFS?

  • Why a red-black tree rather than a min-heap?

The reason is: heaps are array-based and hence require contiguous memory in kernel space, because of the way heaps are implemented in Linux. See lib/prio_heap.c and include/linux/prio_heap.h and you'll note that the heap is kmalloc'd in heap_init. Once the degree of multiprogramming becomes huge, maintaining thousands of struct sched_entity requires a lot of contiguous space (it runs into several pages). From a time and performance point of view one might prefer a heap, since heapify can run in the background once the min vruntime is picked, but its space requirement is the bottleneck.

Since an rbtree implementation was readily available, kernel developers saw no need to implement a pointer-based heap.

Another interesting point: when a task (process or thread) changes state from runnable to blocked (waiting for I/O or a network resource), it must be removed from the runqueue, and the complexities are:

  • O(log(n)) for red black tree
  • O(n) for heap

The heap's remove operation is slow, which is why the red-black tree is better.

Also, getting the min vruntime from a heap is not actually O(1); that holds only if you read the root without removing it. In CFS we need to:

  • remove it (which requires an O(log n) heapify)
  • update its vruntime and insert it back into the runqueue, which is O(log n) too

