【Linux内核设计思想】五、进程调度

进程调度程序

进程调度程序是内核组成的一部分，主要负责选择下一个要运行的程序。换言之，进程调度程序是在可运行态进程之间分配有限的处理器时间资源的内核子系统，调度程序是多任务操作系统的基础，合理的调度才能最大程度的利用系统资源，才能达到多进程并发执行的效果。（多任务操作系统是指可以同时并发交互执行多个进程的操作系统，在单处理器时多进程交互执行并非真正并行，在多处理器时可以真正并行）。

进程调度程序的原则是最大程度的利用处理器时间，也就是说，只要有可执行的进程，那么就一定会有进程在执行，如果系统中的进程数量比处理器的数量多，那么总会有一些进程处于等待执行的状态，这时候就需要调度程序从一组可运行状态的进程中选择一个进程来执行，这就是调度程序的基本功能。

多任务操作系统分为非抢占式多任务（cooperative multitasking）和抢占式多任务（preemptive multitasking）。Linux操作系统就是典型的多任务操作系统，并提供了抢占式多任务模式，在此模式下，由调度程序来决定什么时候停止一个进程并执行其它进程，而强制挂起一个进程的动作就叫抢占（preemption），进程在被强占之前所运行的时间长度是预先设置好的，这个时间段就叫做时间片（timeslice），时间片是预先分配给每个可运行进程的处理器使用时间。合理的管理时间片能有效地帮助调度程序从全局角度做出调度决定，合理利用系统资源。对于非抢占多任务模式，进程会一直运行，除非它主动停止，进程主动挂起的动作称为让步（yielding），该模式下，调度程序无法对每个进程的运行时间做统一规定，进程占用处理器的时间也不可控。

策略

策略决定了调度程序在什么时候选择哪个进程执行，并负责优化处理器使用时间。

I/O消耗型和处理器消耗型进程

I/O消耗型进程：进程的大部分时间用来提交I/O请求或者等待I/O请求，这种进程经常处于可运行状态，但运行时间都很短，当等待更多的I/O请求时阻塞。

处理器消耗型进程：大多数时间用于执行代码，由于没有太多的I/O需求，这种进程一般会一直不停的运行，除非被抢占。由于这种进程不属于I/O驱动类型，从系统响应速度考虑，调度器不应该经常调度这种进程运行，调度的策略应该是降低处理器消耗型进程的运行频率，延长其运行时间。典型例子就是无限循环执行。

当然，有的服务器即是I/O消耗型，又是处理器消耗型。调度策略必须要在进程响应速度（响应时间短）和最大系统利用率（高吞吐量）之间寻找平衡。一般来说，交互式程序都是I/O消耗型的，调度程序通常会向这种类型的进程倾斜来缩短系统响应时间，Linux为了保证交互式应用，对进程的响应做了优化，缩短响应时间，更倾向于优先调度I/O消耗型进程。

进程优先级

调度算法中最基本的一种就是基于优先级的调度，即根据进程的价值和对处理器的时间需求进行分级，高优先级进程先运行，低优先级进程后运行，相同优先级进程按轮转方式调度，一般来说，优先级高的进程所分配的时间片也更长。调度程序总是选择时间片未用尽且优先级最高的进程来运行，而用户和系统都可以通过设置优先级来影响系统的调度。

在Linux中，实现了一种基于动态优先级的调度方法，即每个进程在最开始会设置一个最基本的优先级，而调度程序可以根据需要来加减优先级。比如说，一个进程在I/O等待上耗费的时间高于其运行时间（即I/O消耗型进程），那么它的优先级将会动态的被提高，来优先被调度；相反，如果一个进程的时间片非常短（处理器消耗型进程），那么它的优先级将会被动态降低。

Linux内核中提供两种独立的优先级范围：

nice值，范围为-20~+19，默认值为0。nice值越大优先级越低，nice值越小优先级越高。nice值为-20的进程可能获得的时间片最长，nice值为19的进程获得的时间片更短。

实时优先级，它的值是可配置的，默认情况下范围是0~99，任何实时进程的优先级都高于普通进程的优先级。

时间片

时间片是用于表示进程在被强占前所能持续运行时间的一个数值，默认时间片由调度器规定。时间片的设定非常关键，如果时间片太长会导致系统对交互的响应较差，且并行效果较差；时间片太短的话，会导致频繁进程切换，浪费系统资源和时间；并且I/O消耗型进程（需要较短时间片）和处理器消耗型进程（需要较长时间片）对时间片的要求也需要平衡。

Linux调度程序给交互式程序提供了更高的优先级，来保证它们更频繁地运行，并分配给交互式进程更长的时间片，另外，Linux调度程序还能根据进程优先级动态调整分配给进程的时间片。这样就保证了高优先级进程（更重要的进程），执行频率更高，执行时间更长。

值得注意的是，进程并不一定要一次性用完它的所有时间片，进程可以通过多次调度，分多次使用完自己的时间片。比如，一个进程时间片为100ms，它可以通过5次调度，每次20ms来用完自己的时间片。

当一个进程的时间片耗尽之后，这个进程就结束了，并且它不会再被投入运行。直到所有进程都耗尽了自己的时间片，所有进程的时间片将会被重新计算。

进程管理结构 task_struct 中有个保存着时间片的字段 counter, 如下:

structtask_struct {
    ...
    volatilelongneed_resched;
    ...
    longcounter;
    ...
};

进程抢占

Linux系统是抢占式的，当一个进程进入TASK_RUNNING状态的时候，内核会检查它的优先级是否高于正在运行的进程的优先级，如果是的话，将会唤醒调度程序，并抢占正在运行的进程。另外，如果一个进程的时间片变为0，也会唤醒调度程序，抢占当前进程并运行一个新的进程。

调度算法

Linux的调度程序定义在kernel/sched.c中。并且调度程序充分地实现了O(1)调度，全面的实现了SMP的可扩展性，每个处理器有自己的锁和可执行队列，并且尽量将一组相关的任务交给一个CPU连续执行，尽可能的保证公平。

/*
*  'schedule()' is the scheduler function. It's a very simple and nice
* scheduler: it's not perfect, but certainly works for most things.
*
* The goto is "interesting".
*
*   NOTE!!  Task 0 is the 'idle' task, which gets called when no other
* tasks can run. It can not be killed, and it cannot sleep. The 'state'
* information in task[0] is never used.
*/
asmlinkagevoidschedule(void)
{
structschedule_data*sched_data;
structtask_struct*prev, *next, *p;
structlist_head*tmp;
intthis_cpu, c;

if (!current->active_mm) BUG();
need_resched_back:
prev=current;
this_cpu=prev->processor;

if (in_interrupt())
gotoscheduling_in_interrupt;

release_kernel_lock(prev, this_cpu);

/* Do "administrative" work here while we don't hold any locks */
if (softirq_active(this_cpu) &softirq_mask(this_cpu))
gotohandle_softirq;
handle_softirq_back:

/*
* 'sched_data' is protected by the fact that we can run
* only one process per CPU.
*/
sched_data=&aligned_data[this_cpu].schedule_data;

spin_lock_irq(&runqueue_lock);

/* move an exhausted RR process to be last.. */
if (prev->policy==SCHED_RR)
gotomove_rr_last;
move_rr_back:

switch (prev->state) {
caseTASK_INTERRUPTIBLE:
if (signal_pending(prev)) {
prev->state=TASK_RUNNING;
break;
}
default:
del_from_runqueue(prev);
caseTASK_RUNNING:
}
prev->need_resched=0;

/*
* this is the scheduler proper:
*/

repeat_schedule:
/*
* Default process to select..
*/
next=idle_task(this_cpu);
c=-1000;
if (prev->state==TASK_RUNNING)
gotostill_running;

still_running_back:
list_for_each(tmp, &runqueue_head) {
p=list_entry(tmp, structtask_struct, run_list);
if (can_schedule(p, this_cpu)) {
intweight=goodness(p, this_cpu, prev->active_mm);
if (weight>c)
c=weight, next=p;
}
}

/* Do we need to re-calculate counters? */
if (!c)
gotorecalculate;
/*
* from this point on nothing can prevent us from
* switching to the next task, save this fact in
* sched_data.
*/
sched_data->curr=next;
#ifdef CONFIG_SMP
next->has_cpu=1;
next->processor=this_cpu;
#endif
spin_unlock_irq(&runqueue_lock);

if (prev==next)
gotosame_process;

#ifdef CONFIG_SMP
/*
* maintain the per-process 'last schedule' value.
* (this has to be recalculated even if we reschedule to
* the same process) Currently this is only used on SMP,
* and it's approximate, so we do not have to maintain
* it while holding the runqueue spinlock.
*/
sched_data->last_schedule=get_cycles();

/*
* We drop the scheduler lock early (it's a global spinlock),
* thus we have to lock the previous process from getting
* rescheduled during switch_to().
*/

#endif /* CONFIG_SMP */

kstat.context_swtch++;
/*
* there are 3 processes which are affected by a context switch:
*
* prev == .... ==> (last => next)
*
* It's the 'much more previous' 'prev' that is on next's stack,
* but prev is set to (the just run) 'last' process by switch_to().
* This might sound slightly confusing but makes tons of sense.
*/
prepare_to_switch();
{
structmm_struct*mm=next->mm;
structmm_struct*oldmm=prev->active_mm;
if (!mm) { // 如果是内核线程
if (next->active_mm) BUG();
next->active_mm=oldmm;
atomic_inc(&oldmm->mm_count);
enter_lazy_tlb(oldmm, next, this_cpu);
} else {
if (next->active_mm!=mm) BUG();
switch_mm(oldmm, mm, next, this_cpu);
}

if (!prev->mm) {
prev->active_mm=NULL;
mmdrop(oldmm);
}
}

/*
* This just switches the register state and the
* stack.
*/
switch_to(prev, next, prev);
__schedule_tail(prev);

same_process:
reacquire_kernel_lock(current);
if (current->need_resched)
gotoneed_resched_back;

return;

recalculate:
{
structtask_struct*p;
spin_unlock_irq(&runqueue_lock);
read_lock(&tasklist_lock);
for_each_task(p)
p->counter= (p->counter>>1) +NICE_TO_TICKS(p->nice);
read_unlock(&tasklist_lock);
spin_lock_irq(&runqueue_lock);
}
gotorepeat_schedule;

still_running:
c=goodness(prev, this_cpu, prev->active_mm);
next=prev;
gotostill_running_back;

handle_softirq:
do_softirq();
gotohandle_softirq_back;

move_rr_last:
if (!prev->counter) {
prev->counter=NICE_TO_TICKS(prev->nice);
move_last_runqueue(prev);
}
gotomove_rr_back;

scheduling_in_interrupt:
printk("Scheduling in interruptn");
BUG();
return;
}

可执行队列

可执行队列runqueue是调度程序中最基本的数据结构，它同样定义在kernel/sched.c中。可执行队列是给定处理的上的可执行进程的链表，每个处理器都有自己的可执行队列，并且每个可投入运行的进程都唯一归属于一个可执行队列。在可执行队列中还包含了每个处理器的调度信息。

structrunqueue{
spinlock_tlock;/* 保护运行队列的自旋锁 */
unsignedlongnr_running;/* 可运行任务数目 */
unsignedlongnr_seitches;/* 上下文切换数目 */
    unsignedlongexpired_timestamp;/* 队列最后被换出时间 */
    unsignedlongnr_uninterruptible;/* 处于不可中断睡眠状态的任务数目 */
    unsignedlonglongtimestamp_last_tick;/* 最后一个调度程序的节拍 */
    structtask_struct*curr;/* 当前运行任务 */
    structtask_struct*idle;/* 该处理器的空任务 */
    structmm_struct*prev_mm;/* 最后运行任务的mm_struct结构体 */
    structprio_array*active;/* 活动优先级队列 */
    structprio_array*expired;/* 超时优先级队列 */
    structprio_array*arrays[2];/* 实际优先级数组 */
    structtask_struct*migration_thread;/* 移出线程 */
    structlist_head*migration_queue;/* 移出队列 */
    atomic_tnr_iowait;/* 等待IO操作的任务数目 */
};

在内核中，还定义了一组宏来获取进程相关的可执行队列。cpu_rq(processor)宏用于返回给定处理器可执行队列的指针。this_rq()宏用于返回当前处理器的可执行队列。task_rq(task)宏用于返回给定任务所在的可执行队列的指针。

优先级数组

在每个运行队列中有两个优先级数组，一个活动优先级队列，一个超时优先级队列，在上面的struct runqueue中可以看到。优先级数组是一种能够提供O(1)级算法复杂度的数据结构。优先级数组使可运行处理器的每一种优先级都包含一个相应的队列，而这些队列包含对应优先级上的可执行进程链表。优先级数组还拥有一个优先级位图。

structprio_array{
intnr_active; /* 任务数目 */
    unsignedlongbitmap[BITMAP_SIZE]; /* 优先级位图 */
    structlist_headqueue[MAX_PRIO]; /* 优先级队列 */
};

MAX_PRIO定义了系统拥有的优先级数量（默认140），每个优先级都有一个struct list_head结构体。BITMAP_SIZE是优先级位图的大小，（至少）每一个位代表一个优先级。初始时，所有位置为0，当某个有优先级的进程开始准备执行（状态变为TASK_RUNNING），位图中相应的位就会被置为1。比如优先级为n的进程变为可执行状态，那么位图的第n位就被置1。Linux提供了快速查找位图的算法sched_find_first_bit()。

queue队列中的每一个成员都是struct list_head类型的，每个链表于一个给定的优先级相对应，该处理器上相应优先级的所有可运行进程都在这个链表上。要寻找下一个要运行的进程就可以直接在这个链表上找。

计数器nr_active保存了该优先级数组内可执行进程的数量。

重新计算时间片

在以前的操作系统中，当所有进程的时间片都用完时，通过遍历所有进程来重新计算每个进程的时间片，这种依赖循环的方式大大耗费了时间等资源。Linux调度程序通过为每个处理器维护两个优先级数组（活动数组和过期数组）来实现时间片的重新计算。活动数组内的可执行队列上的进程都还有剩余时间片，而过期数组内的可执行队列上的进程都耗尽了时间片。当进程的时间片用完时，它将被移动到过期数组，并且它的时间片已经重新计算好了。这样只要在两个数组之间切换就可以实现进程时间片的重新计算。而数组是通过指针访问的，所以交换进程所用的时间就是交换指针所用的时间。这种交换是O(1)级调度程序的核心，这将不在依赖循环从头到尾的计算时间片，只要在两个数组间切换就足够了。

schedule()

schedule()函数的功能是选定一个进程并切换到它去执行。当内核代码需要休眠时会直接调用schedule()函数，如果有进程被抢占也会调用该函数。schedule()函数独立于每个处理器运行，因此每个CPU自己要判断下一个要运行的进程是哪个。

schedule()函数在工作时，首先要在活动优先级数组中找到第一个被设置的位，该位表示优先级最高的可执行进程。然后，调度程序选择这个级别链表里的第一个进程，它就是系统中优先级最高的可执行程序，也就是即将被调度执行的程序。

计算优先级与时间片

进程都拥有一个初始优先级，即nice值，nice值的范围为-20~19（-20优先级最高，19优先级最低），默认值为0。nice的值存放在task_struct的static_prio域内，也叫做静态优先级，它从一开始被用户指定后就不能再改变。而调度程序要用到的动态优先级存放在prio域内，动态优先级通过一个关于静态优先级和进程交互的函数关系计算得到。

effective_prio()函数可以返回一个进程的动态优先级。该函数的原理为以nice值为基数，再加上-5到5之间的进程交互性的奖励或惩罚。

进程的交互性是如何判断的呢？调度程序通过进程休眠的时间来判断进程属于IO消耗型还是处理器消耗型，如果一个进程大部分时间处于休眠那么它就是IO消耗型，反之如果执行时间长于休眠时间就是处理器消耗型。

为了支持这种机制，Linux记录了一个进程休眠和执行的时间，这个时间存放在task_struct的sleep_avg域中。该值的范围为0~MAX_SLEEP_AVG，默认为10毫秒。进程每运行一个时钟节拍，sleep_avg就会做相应的递减，直到减为0，而进程休眠时，根据它休眠时间的长短，sleep_avg会做相应增加，直到加为MAX_SLEEP_AVG。

在一个进程创建的时候，新建的子进程和父进程均分父进程剩余的时间片。当一个任务的时间片用完之后，就要根据任务的静态优先级重新计算时间片。task_timeslice()函数为给定的进程返回一个新的时间片。

另外，Linux为了支持交互式进程，如果一个进程交互性非常强，那么当它的时间片用完后，它会再次被放置到活动数组而不是过期数组。

睡眠与唤醒

休眠的进程（被阻塞）处于一种特殊的不可执行状态。进程休眠一般是为了等待一些事件，进程也有可能是在尝试获取一个已被占用的内核信号量时被迫进入休眠。对与内核来说，当进程休眠时，进程会把自己标记为休眠状态，并把自己从可执行队列移出，加入到等待队列，然后调用schedule()函数来选择和执行一个其它进程。唤醒时反之，进程被设置为可执行状态，然后再从等待队列中移到可执行队列。

进程休眠的在状态有两种：TASK_INTERRUPTIBLE状态（收到信号会提前唤醒并响应信号）和TASK_UNINTERRUPTIBLE状态（会忽略信号）。

休眠通过等待队列来处理，等待队列是由等待某些事件发生的进程组成的简单链表。进程加入等待队列的过程如下：

调用DECLARE_WAITQUEUE()创建一个等待队列的项；

调用add_wait_queue()把自己加入到队列中，该队列会在等待的条件满足时唤醒它，事件发生时，对等待队列执行wake_up()操作；

进程状态变为TASK_INTERRUPTIBLE或TASK_UNINTERRUPTIBLE状态；

如果进程状态为TASK_INTERRUPTIBLE，当信号唤醒进程时，我们称为伪唤醒（唤醒不是因为事件的发生），此时检查并处理信号；

检查事件，事件发生则退出休眠，否则调用schedule()；

当进程被唤醒时，会再次检查条件是否为真，为真则退出循环，不为真则再次调用schedule()函数，并且一直重复这个操作；

当条件满足时，进程状态设置为TASK_RUNNING并调用remove_wait_queue()把自己移除等待队列。

进程的唤醒通过函数wake_up()实现，该函数会唤醒指定等待队列上的所有进程。它会调用函数try_to_wake_up()，该函数负责将进程设置为TASK_RUNNING状态，然后调用activate_task()将此进程放入可执行队列，如果被唤醒的进程优先级比当前正在执行的进程优先级高，还要设置need_resched标志，通常来说，哪段代码促使等待条件达成，哪段代码就要负责调用wake_up()函数。如果进程被唤醒并不是因为它所等待的条件达成，也就是虚假唤醒，那么需要一个循环处理来保证真正的唤醒。

负载平衡程序

Linux调度程序为多处理器系统的每个处理器准备了单独的可执行队列和锁，每个处理器拥有自己的进程链表，并且处理器只对自己的进程进行调度操作，整个调度系统从每个处理器的角度来看都是独立的。针对多处理器系统，就有可能会出现负载不均衡的情况，比如一个处理器队列上有很多进程，而另一个处理器的队列只有极少的进程。这个问题就需要负载平衡程序来解决，它负责保证可执行队列之间的负载均衡控制。负载平衡程序会对比当前处理器的可执行队列和系统中其它可执行队列，如果出现不均衡现象，就会把相对繁忙的队列中的进程抽到当前可执行队列中，在理想情况下，每个队列上的进程数目应该相等。

负载平衡程序由load_balance()函数实现，当schedule()函数执行的时候，只要当前可执行队列为空，load_balance()就会被调用；在系统空闲时，每隔1毫秒调用一次；在单处理器系统中，load_balance()不会被调用，甚至不会被编译进内核。

负载平衡程序在调用时需要锁住当前处理器的可执行队列并且要屏蔽中断，以避免可执行队列被并发访问。在定时器中调用load_balance()的时候，它需要解决所有运行队列间所有的负载失衡，如下图所示

load_balance()函数操作步骤主要包括：

首先，load_balance()函数调用find_busiest_queue()函数找到最繁忙的可执行队列。如果没有其它可执行队列中的进程数比当前队列中的进程数目多25%及以上，那么find_busiest_queue()函数返回NULL，load_balance()也返回。否则的话，返回最繁忙的那个可执行队列。

其次，load_balance()从最繁忙的运行队列选择一个优先级数组（最好是过期数组，非高速缓存命中cache hot）以便抽取进程，如果过期数组为空，选择活动数组。

然后，load_balance()找到含有进程并且优先级最高的链表，把优先级最高的进程分散开。

分析找到的所有优先级相同的进程，选择一个不是正在执行的、不会因为处理器相关性而不可移动的、不在高速缓存中的进程，调用pull_task()函数将该进程从最繁忙的队列中抽取到当前队列。

如果可执行队列之间仍然不均衡，则重复上面步骤。直到负载均衡，解除对当前运行队列的锁定，从load_balance()返回。

load_balance()函数的源码如下

staticintload_balance(intthis_cpu, structrq*this_rq,
structsched_domain*sd, enumcpu_idle_typeidle,
int*continue_balancing)
{
intld_moved, cur_ld_moved, active_balance=0;
structsched_domain*sd_parent=sd->parent;
structsched_group*group;
structrq*busiest;
structrq_flagsrf;
structcpumask*cpus=this_cpu_cpumask_var_ptr(load_balance_mask);
structlb_envenv= {
.sd=sd,
.dst_cpu=this_cpu,
.dst_rq=this_rq,
.dst_grpmask    =sched_group_span(sd->groups),
.idle=idle,
.loop_break=SCHED_NR_MIGRATE_BREAK,
.cpus=cpus,
.fbq_type=all,
.tasks=LIST_HEAD_INIT(env.tasks),
};

cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);

schedstat_inc(sd->lb_count[idle]);

redo:
if (!should_we_balance(&env)) {
*continue_balancing=0;
gotoout_balanced;
}

group=find_busiest_group(&env);
if (!group) {
schedstat_inc(sd->lb_nobusyg[idle]);
gotoout_balanced;
}

busiest=find_busiest_queue(&env, group);
if (!busiest) {
schedstat_inc(sd->lb_nobusyq[idle]);
gotoout_balanced;
}

WARN_ON_ONCE(busiest==env.dst_rq);

schedstat_add(sd->lb_imbalance[idle], env.imbalance);

env.src_cpu=busiest->cpu;
env.src_rq=busiest;

ld_moved=0;
/* Clear this flag as soon as we find a pullable task */
env.flags|=LBF_ALL_PINNED;
if (busiest->nr_running>1) {
/*
* Attempt to move tasks. If find_busiest_group has found
* an imbalance but busiest->nr_running <= 1, the group is
* still unbalanced. ld_moved simply stays zero, so it is
* correctly treated as an imbalance.
*/
env.loop_max  =min(sysctl_sched_nr_migrate, busiest->nr_running);

more_balance:
rq_lock_irqsave(busiest, &rf);
update_rq_clock(busiest);

/*
* cur_ld_moved - load moved in current iteration
* ld_moved     - cumulative load moved across iterations
*/
cur_ld_moved=detach_tasks(&env);

/*
* We've detached some tasks from busiest_rq. Every
* task is masked "TASK_ON_RQ_MIGRATING", so we can safely
* unlock busiest->lock, and we are able to be sure
* that nobody can manipulate the tasks in parallel.
* See task_rq_lock() family for the details.
*/

rq_unlock(busiest, &rf);

if (cur_ld_moved) {
attach_tasks(&env);
ld_moved+=cur_ld_moved;
}

local_irq_restore(rf.flags);

if (env.flags&LBF_NEED_BREAK) {
env.flags&=~LBF_NEED_BREAK;
/* Stop if we tried all running tasks */
if (env.loop<busiest->nr_running)
gotomore_balance;
}

/*
* Revisit (affine) tasks on src_cpu that couldn't be moved to
* us and move them to an alternate dst_cpu in our sched_group
* where they can run. The upper limit on how many times we
* iterate on same src_cpu is dependent on number of CPUs in our
* sched_group.
*
* This changes load balance semantics a bit on who can move
* load to a given_cpu. In addition to the given_cpu itself
* (or a ilb_cpu acting on its behalf where given_cpu is
* nohz-idle), we now have balance_cpu in a position to move
* load to given_cpu. In rare situations, this may cause
* conflicts (balance_cpu and given_cpu/ilb_cpu deciding
* _independently_ and at _same_ time to move some load to
* given_cpu) causing excess load to be moved to given_cpu.
* This however should not happen so much in practice and
* moreover subsequent load balance cycles should correct the
* excess load moved.
*/
if ((env.flags&LBF_DST_PINNED) &&env.imbalance>0) {

/* Prevent to re-select dst_cpu via env's CPUs */
__cpumask_clear_cpu(env.dst_cpu, env.cpus);

env.dst_rq=cpu_rq(env.new_dst_cpu);
env.dst_cpu=env.new_dst_cpu;
env.flags&=~LBF_DST_PINNED;
env.loop=0;
env.loop_break=SCHED_NR_MIGRATE_BREAK;

/*
* Go back to "more_balance" rather than "redo" since we
* need to continue with same src_cpu.
*/
gotomore_balance;
}

/*
* We failed to reach balance because of affinity.
*/
if (sd_parent) {
int*group_imbalance=&sd_parent->groups->sgc->imbalance;

if ((env.flags&LBF_SOME_PINNED) &&env.imbalance>0)
*group_imbalance=1;
}

/* All tasks on this runqueue were pinned by CPU affinity */
if (unlikely(env.flags&LBF_ALL_PINNED)) {
__cpumask_clear_cpu(cpu_of(busiest), cpus);
/*
* Attempting to continue load balancing at the current
* sched_domain level only makes sense if there are
* active CPUs remaining as possible busiest CPUs to
* pull load from which are not contained within the
* destination group that is receiving any migrated
* load.
*/
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop=0;
env.loop_break=SCHED_NR_MIGRATE_BREAK;
gotoredo;
}
gotoout_all_pinned;
}
}

if (!ld_moved) {
schedstat_inc(sd->lb_failed[idle]);
/*
* Increment the failure counter only on periodic balance.
* We do not want newidle balance, which can be very
* frequent, pollute the failure counter causing
* excessive cache_hot migrations and active balances.
*/
if (idle!=CPU_NEWLY_IDLE)
sd->nr_balance_failed++;

if (need_active_balance(&env)) {
unsignedlongflags;

raw_spin_rq_lock_irqsave(busiest, flags);

/*
* Don't kick the active_load_balance_cpu_stop,
* if the curr task on busiest CPU can't be
* moved to this_cpu:
*/
if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
raw_spin_rq_unlock_irqrestore(busiest, flags);
gotoout_one_pinned;
}

/* Record that we found at least one task that could run on this_cpu */
env.flags&=~LBF_ALL_PINNED;

/*
* ->active_balance synchronizes accesses to
* ->active_balance_work.  Once set, it's cleared
* only after active load balance is finished.
*/
if (!busiest->active_balance) {
busiest->active_balance=1;
busiest->push_cpu=this_cpu;
active_balance=1;
}
raw_spin_rq_unlock_irqrestore(busiest, flags);

if (active_balance) {
stop_one_cpu_nowait(cpu_of(busiest),
active_load_balance_cpu_stop, busiest,
&busiest->active_balance_work);
}
}
} else {
sd->nr_balance_failed=0;
}

if (likely(!active_balance) ||need_active_balance(&env)) {
/* We were unbalanced, so reset the balancing interval */
sd->balance_interval=sd->min_interval;
}

gotoout;

out_balanced:
/*
* We reach balance although we may have faced some affinity
* constraints. Clear the imbalance flag only if other tasks got
* a chance to move and fix the imbalance.
*/
if (sd_parent&&!(env.flags&LBF_ALL_PINNED)) {
int*group_imbalance=&sd_parent->groups->sgc->imbalance;

if (*group_imbalance)
*group_imbalance=0;
}

out_all_pinned:
/*
* We reach balance because all tasks are pinned at this level so
* we can't migrate them. Let the imbalance flag set so parent level
* can try to migrate them.
*/
schedstat_inc(sd->lb_balanced[idle]);

sd->nr_balance_failed=0;

out_one_pinned:
ld_moved=0;

/*
* newidle_balance() disregards balance intervals, so we could
* repeatedly reach this code, which would lead to balance_interval
* skyrocketing in a short amount of time. Skip the balance_interval
* increase logic to avoid that.
*/
if (env.idle==CPU_NEWLY_IDLE)
gotoout;

/* tune up the balancing interval */
if ((env.flags&LBF_ALL_PINNED&&
     sd->balance_interval<MAX_PINNED_INTERVAL) ||
    sd->balance_interval<sd->max_interval)
sd->balance_interval*=2;
out:
returnld_moved;
}

上下文切换和抢占

上下文切换

上下文切换是指一个可执行进程切换到另一个可执行进程，由context_switch()函数负责处理。当一个新进程被调度并准备投入运行的时候，schedule()就会调用该函数，并完成以下两个工作：

调用<asm/mmu_context.h>中的switch_mm()函数，把虚拟内存从上一个进程映射到新进程中；

调用<asm/system.h>中的switch_to()函数，从上一个进程的处理器状态切换到新进程的处理器状态，包括保存、恢复栈信息和寄存器信息。

内核提供了一个need_resched标志来表明是否需要重新执行一次调度，当某个进程耗尽时间片的时候，scheduler_tick()就会设置该标志，当一个优先级高的进程进入可执行状态的时候，try_to_wake_up()也会设置这个标志。

当再返回用户空间以及从中断返回的时候，内核也会检查need_resched标志，若该标志已被设置，内核会再继续执行之前调用调度程序。每个进程都包含一个need_resched标志，它存放在thread_info结构体中，用一个特别的标志变量中的一位来表示。（进程描述符通常存放在高速缓存中，访问它的速度远高于访问全局变量）

用户抢占

内核在返回用户空间的时候，如果need_resched标志被设置，那么将会导致schedule()函数被调用，这时候就会发生用户抢占。内核无论是从中断处理程序返回还是从系统调用返回，都会检查need_resched标志，若该标志被设置，将会调用schedule()函数运行一个更合适程序。

内核抢占

Linux内核支持内核抢占，只要重新调度是安全的，那么内核就可以在任何时间抢占正在执行的任务。一般来说，只要没有持有锁，内核就可以抢占（锁是非抢占区域的标志），并且内核是支持SMP的，所以，只要没有持有锁，那么正在执行的代码就是可重新导入的，也就是可以抢占的。

为了支持内核的抢占机制，在每个内核的thread_info中引入了preempt_count计数器，该计数器初始值为0，每当使用锁的时候数值加1，释放锁的时候数值减1，当数值为0的时候，内核可以抢占。当从中断返回内核空间的时候，内核会检查need_resched和preempt_count的值，如果need_resched已经被设置，并且preempt_count的值为0，说明有更重要的任务需要执行，并且此时可以安全抢占，此时，调度程序schedule()就会被调用。如果preempt_count不为0，说明当前任务持有锁，不能安全抢占，那么将会直接从中断返回当前执行的进程。当进程所持有的锁被释放了，即preempt_count变为0了，释放锁的代码会检查need_resched是否被设置，若已设置，就会调用调度程序schedule()函数。

如果内核中的进程被阻塞了，或者它显式的调用了schedule()函数，那么内核抢占就会显式发生。而显式调用schedule()函数说明进程知道此时是可以安全被抢占的。

内核抢占的时机一般在：

从中断处理程序正在执行，并且返回内核空间之前。

内核代码再一次具有可抢占性的时候。

内核中的进程显式调用schedule()函数。

内核中的任务阻塞（而导致调用schedule()函数）。

实时

Linux提供了两种实时调度策略SCHED_FIFO和SCHED_RR，以及一种非实时调度策略SCHED_NORMAL。SCHED_FIFO实现了一种简单的、先入先出的调度算法，它不使用时间片。SCHED_FIFO的进程会优先于所有的SCHED_NORMAL进程被调度，并且它不基于时间片，可以一直运行，直到它自己受阻塞或显式自动释放处理器。只有更高优先级的SCHED_FIFO进程或者SCHED_RR进程才能抢占SCHED_FIFO进程。SCHED_RR是一种带有时间片的SCHED_FIFO，即一种实时轮流调度算法。当SCHED_RR任务耗尽了它的时间片，其它同一优先级的实时任务将被轮流调度。

Linux提供的两种实时调度策略SCHED_FIFO和SCHED_RR都是静态优先级，Linux内核不会为实时进程计算动态优先级，这保证了给定优先级的实时进程总能抢占优先级比它低的进程。另外，Linux中的实时调度算法是一种软实时工作方式，也就是说，内核调度进程尽力使进程在它的限定时间到来前运行，单内核不保证总能满足这些进程的要求。

实时优先级范围从0到(MAX_RT_PRIO-1)，默认0~99。SCHED_NORMAL进程的nice值共享了这个取值空间，其范围为MAX_RT_PRIO到(MAX_RT_PRIO+40)，即nice的值-20~+19直接对应100~139的实时优先级范围。

与调度相关的系统调用

Linux系统调用提供了一系列的系统调用来操作处理进程的优先级、调度策略及处理器，同时还提供了显式将处理器交给其它进程的机制。

sched_setscheduler()和sched_getscheduler()分别用于设置和获取进程的调度策略和实时优先级，其主要工作是读写进程task_struct的policy和rt_priority的值。

#include <sched.h>

       intsched_setscheduler(pid_tpid, intpolicy,
                              conststructsched_param*param);

       intsched_getscheduler(pid_tpid);

       structsched_param {
           ...
           intsched_priority;
           ...
       };

sched_setparam()和sched_getparam()分别用于设置和获取进程的实时优先级，其主要工作是获取封装在sched_param特殊结构体中的rt_priority。

#include <sched.h>

       intsched_setparam(pid_tpid, conststructsched_param*param);

       intsched_getparam(pid_tpid, structsched_param*param);

       structsched_param {
           ...
           intsched_priority;
           ...
       };

sched_get_priority_max()和sched_get_priority_min()分别用于获取给定调度策略的最大和最小优先级，实时调度策略的最大最小优先级分别是(MAX_RT_PRIO-1)和1。

#include <sched.h>

       intsched_get_priority_max(intpolicy);

       intsched_get_priority_min(intpolicy);

对于普通进程，nice()函数可以将给定进程的静态优先级增加一个给定量，但是只有超级用户才能在调用时使用负值来提高优先级。nice函数会调用内核的set_user_nice()函数来设置进程的task_struct的static_prio和prio值。

#include <unistd.h>

       intnice(intinc);

Linux调度程序提供强制的处理器绑定机制（processor affinity），虽然说内核通过软机制使进程尽量运行在同一个处理器上，但是内核也运行用户强制指定某个进程必须运行在某个处理器上。该强制处理标志保存在task_struct的cpus_allowed位掩码标志中，该掩码标志的每一位对应一个系统可用的处理器。默认所有的位都被设置，进程可以在系统的所有可用处理器上执行。用户可以通过sched_setaffinity()设置一个位或者多个位组合的位掩码，调用sched_getaffinity()则返回当前的cpus_allowed位掩码。

#define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <sched.h>

       intsched_setaffinity(pid_tpid, size_tcpusetsize,
                             cpu_set_t*mask);

       intsched_getaffinity(pid_tpid, size_tcpusetsize,
                             cpu_set_t*mask);

sched_yield()系统调用可以让进程显式的将处理器时间让给其它等待执行的进程，其实现原理是将进程从活动队列转移到过期队列，这样，不仅抢占了当前进程，并将该进程放到了优先级队列的最后面，且将其投入到过期队列以保证短期内不会执行它。由于实时进程不会过期，所以实时进程只会被移动到优先级队列的末尾，而不会放到过期队列。内核代码一般可以直接调用yield()函数，由yield()函数确定给定进程确实处于可执行状态，并调用sched_yield()函数。用户空间程序直接调用sched_yield()系统调用即可。

#include <sched.h>

       intsched_yield(void);