Sleep and Wakeup

Erik Quanstrom
quanstro@labs.coraid.com

ABSTRACT

There are many ways to control concurrency in a shared-memory multiprogramming system. The kernel (and successors) built on a sleep and wakeup system. Conceptually this is very simple: a process waits for a resource to become available with sleep, and announces that resource's availability with wakeup. However, both the implementation and its use have suffered unexpected bugs in multiprocessor systems that are worth discussing.

Introduction



You may delay, but time will not.
— Benjamin Franklin

From the earliest days of it was realized that I/O takes a long time. So long, in fact, that it's worth doing other things while waiting. The idea that came out was to suspend the process that was waiting for something to happen, and schedule another process in the mean time. Since early machines needed to process characters from a number of dumb terminals, this worked out well. A process waiting on the disk wouldn't bother another process reading tty input. A process waiting one one tty needn't bother a process waiting on another.

This basic model has worked well throughout the history of operating systems, and has even extended with some modification to large symmetric multiprocessing machines. But this is one case where progress has not been kind to the programmer.

Version 1



Well-timed silence is the most commanding expression.
— Mark Helprin

The V1 version of sleep and wakeup uses the concept of a wait channel. The wait channel or wchan is the nexus between a process to be awoken and the waker. When the waking event happens, wakeup simply chains the sleeping process to the (run) queue of the given priority. To handle multiple processes sleeping on the same wchan, sleep simply stores the previous value of wchan on the stack. Sleep itself puts the originally waiting process back on to the queue. There is no additional manipulation of process state.

There are several interesting observations about this code.
* Calls wakeup without a prior sleep on the same wchan are handled.
* Wakeup handles priority inversion.
* Multiple waiters will be woken in FILO order.
* An interrupt pending when sleep is called aborts sleep.
wakeup:	/ wakeup processes waiting for an event by linking them to the
	/ queue
	mov	r1,-(sp)	/ put char on stack
	mov	(r0)+,r2	/ r2 points to a queue
	mov	(r0)+,r3	/ r3 = wait channel number
	movb	wlist(r3),r1	/ r1 contains process number in that wait
				/ channel that was sleeping
	beq	2f		/ if 0 return, nothing to wakeup
	cmp	r2,u.pri	/ is runq >= users priority
	bhis	1f		/ yes, don't set time quantum to zero
	clrb	uquant		/ time quantum = 0
1:
	clrb	wlist(r3)	/ zero wait channel entry
	jsr	r0,putlu	/ create a link from the last user on the Q
			   	/ to this process number that got woken
2:
	mov	(sp)+,r1	/ restore r1
	rts	r0
sleep: / wait for event
	jsr	r0,isintr	/ check to see if interrupt or quit from user
		br 2f		/ user interrupt?  return
	mov	(r0)+,r1	/ put number of wait channel in r1
	movb	wlist(r1),-(sp)	/ put old pid on stack
	movb	u.uno,wlist(r1)	/ put process number of process to put
				/ to sleep in there
	mov	cdev,-(sp)	/ nothing happened in isintr so
	jsr	r0,swap		/ swap out process that needs to sleep
	mov	(sp)+,cdev	/ restore device
	jsr	r0,isintr	/ check for interrupt of new process
		br 2f		/ yes, return to new user
	movb	(sp)+,r1	/ no, r1 = original pid in wchan
	beq	1f		/ if 0 branch
	mov	runq+4,r2	/ r2 points to lowest priority queue
	mov	300,*ps	/ processor priority = 6
	jsr	r0,putlu	/ create link to old process number
	clr	*ps		/ clear the status; process priority = 0
1:
	rts	r0		/ return
2:
	jmp     sysret		/ return to user

Version 6



Have you not done tormenting me with your accursed time! It's abominable! When! When! One day, is that not enough for you...?
Samuel Beckett, Waiting for Godot

The version 6 implementation looks quite different at first glance. However, in practice it works in a similar manner. Sleep still puts the process to sleep. This time more state is saved with the process. In particular a sleeping process' state will be either SWAIT or SSLEEP depending on if it honors or ignores signals.

Wakeup now loops through the process table and wakes (makes runnable) every process with a matching wchan. This does not present a scalability problem since NPROC is just 50 in V6, and it is unlikely that many installations had enough memory to fill the process table.
/*
 * Give up the processor till a wakeup occurs
 * on chan, at which time the process
 * enters the scheduling queue at priority pri.
 * The most important effect of pri is that when
 * pri<0 a signal cannot disturb the sleep;
 * if pri>=0 signals will be processed.
 * Callers of this routine must be prepared for
 * premature return, and check that the reason for
 * sleeping has gone away.
 */
sleep(chan, pri)
{
	register *rp, s;

	s = PS->integ;
	rp = u.u_procp;
	if(pri >= 0) {
		if(issig())
			goto psig;
		spl6();
		rp->p_wchan = chan;
		rp->p_stat = SWAIT;
		rp->p_pri = pri;
		spl0();
		if(runin != 0) {
			runin = 0;
			wakeup(&runin);
		}
		swtch();
		if(issig())
			goto psig;
	} else {
		spl6();
		rp->p_wchan = chan;
		rp->p_stat = SSLEEP;
		rp->p_pri = pri;
		spl0();
		swtch();
	}
	PS->integ = s;
	return;

	/*
	 * If priority was low (>=0) and
	 * there has been a signal,
	 * execute non-local goto to
	 * the qsav location.
	 * (see trap1/trap.c)
	 */
psig:
	aretu(u.u_qsav);
}
/*
 * Wake up all processes sleeping on chan.
 */
wakeup(chan)
{
	register struct proc *p;
	register c, i;

	c = chan;
	p = &proc[0];
	i = NPROC;
	do {
		if(p->p_wchan == c) {
			setrun(p);
		}
		p++;
	} while(--i);
}


The comment preceding sleep is particularly interesting. Even in a uniprocessor environment, we are warned that sleep may return even if we weren't woken up. We need a condition code that is seperate from the wakeup event.

Plan 9—wake one



The Plan 9 implementation of sleep/wakeup is a bit of a break from versions. The reasons for the change are for efficiency and to handle symmetric mutiprocessing[1]. Plan 9 uses a structure called Rendez (for rendezvous) to coordinate the wakeup event, rather than a wchan. This structure consists of a pointer to a process structure and a lock. Unlike a wchan, at most one process may sleep on a given Rendez at a time. Thus it may be necessary to protect Rendez structures with a lock. As one can imagine, since the assumption is now that there are multiple processors, the implementation is much tricker. Indeed [1] details a number of failed versions and the use of the automatic checker spin[2].

This version of sleep operates by checking a supplied condition function with interrupts turned off, and the process and Rendez locked. If the condition has happened or there are notes pending, sleep returns immediately. Otherwise, the sleeping process is recorded in the Rendez structure, and the process state is changed to Wakeme, and the scheduler is called. The scheduler turns interrupts back on.

Wakeup only has access to the Rendez, but with interrupts turned off it checks for a waiting process. If there is one it is readied (made runable) and then the lock is dropped and the spl is restored. Note in particular that the sleeping process is readied before the Rendez lock is dropped. Therefore, the woken process may start running before the lock is dropped.

It's worth noting that there are several obvious corner cases in this code that have been resolved in favor of never missing a wakeup[1]. An interrupt may have happened, obviating the check for the condition, but returning anyway. Or, he condition may have become true before sleep was called. Alternatively, the condition may become true just after it is checked.
/*
 *  sleep if a condition is not true.  Another process will
 *  awaken us after it sets the condition.  When we awaken
 *  the condition may no longer be true.
 *
 *  we lock both the process and the rendezvous to keep r->p
 *  and p->r synchronized.
 */
void
sleep(Rendez *r, int (*f)(void*), void *arg)
{
	int s;
	void (*pt)(Proc*, int, vlong);

	s = splhi();
	if(up->nlocks.ref)
		panic("process %lud sleeps locks held0, up->pid);
	lock(r);
	lock(&up->rlock);
	if(r->p)
		panic("double sleep");
	/*
	 *  Wakeup only knows there may be something to do by testing
	 *  r->p in order to get something to lock on.
	 */
	r->p = up;

	if((*f)(arg) || up->notepending){
		r->p = nil;
		unlock(&up->rlock);
		unlock(r);
	} else {
		/* change state and call scheduler */
		up->state = Wakeme;
		up->r = r;
		procsave(up);
		if(setlabel(&up->sched)) {
			/* here when the process is awakened */
			procrestore(up);
			spllo();
		} else {
			/* here to go to sleep (i.e. stop Running) */
			unlock(&up->rlock);
			unlock(r);
			gotolabel(&m->sched);
		}
	}
	if(up->notepending) {
		up->notepending = 0;
		splx(s);
		if(up->procctl == Proc_exitme && up->closingfgrp)
			forceclosefgrp();
		error(Eintr);
	}
	splx(s);
}
Proc*
wakeup(Rendez *r)
{
	Proc *p;
	int s;

	s = splhi();
	lock(r);
	p = r->p;

	if(p != nil){
		lock(&p->rlock);
		if(p->state != Wakeme || p->r != r)
			panic("wakeup: state");
		r->p = nil;
		p->r = nil;
		ready(p);
		unlock(&p->rlock);
	}
	unlock(r);
	splx(s);
	return p;
}

Careful with that Condition, Eugene



Now we're ready to tackle some use cases. The most basic error in sleep usage is to not have a condition variable. As we've seen, otherwise it's not possible to tell a spurious wakeup from a real wakeup. In some cases this does not matter: the object of the sleep is to reduce spinning. However if the event being waited for is I/O, it's important not to falsely conclude the I/O has finished.

So our initial cut at a sleep and wakeup might be
typedef struct Io {
	Rendez;
	int	cond;
};
void
dowake(Io *i)
{
	i->cond = 1;			/* unique assignment */
	wakeup(i);
}
int
iodone(void *v)
{
	return ((Io*)v)->cond;
}
void
doio(void)
{
	Io *i;

	dispatch(i = malloc(sizeof *i));
	sleep(i, iodone, i);
	free(i);
}

Unexpected wakeups?



Well-timed silence is the most commanding expression.
— Mark Helprin

The code in the previous section does not handle unexpected wakeups for any reason. Due to the fact that we're not reusing the Rendez, it's not possible to get a wakeup for a different event, but it is still possible to get notes. If the process doing I/O is a kernel process, this is not as much of a worry. However, if it is not (that is, in the syscall path), it is possible to be interrupted by alarm(2) or by the user hitting

The simplist correction to this defect is perhaps the following
void
doio(void)
{
	Io *i;

	dispatch(i = malloc(sizeof *i));
	while(waserror())
		/* ignore notes */;
	sleep(i, iodone, i);
	poperror();
	free(i);
}

When does the Wakeup happen?



We are time's subjects, and time bids be gone.
— William Shakespeare

The careful reader will notice that our code is still incorrect. As we noted in the discussion of how sleep and wakeup are constructed, it is possible for the awoken process to run before wakeup completes. To be more exact, it could be that the doio function has executed free before wakeup has unlocked the rendezvous. It might even be that the memory has been reused as a different structure. So this code has the potential to corrupt memory without some interlock. We could reuse the lock inside the Rendez structure and insure we can lock it before we continue, but this assumes the Rendez implementation. Alternatively, we could set cond to 2 after wakeup returns and only continue after sleep when this is true. So,
void
dowake(Io *i)
{
	i->cond = 1;
	wakeup(i);
	i->cond = 2;		/* atomic */
	/* hands off i now */
}
void
doio(void)
{
	Io *i;

	dispatch(i = malloc(sizeof *i));
	while(waserror())
		/* eat note */;
	sleep(i, iodone, i);
	poperror();
	while(i->cond != 2)
		/* rendez not done */;
	free(i);
}


Often a lock is used in place of a condition variable.

Double Wake



It's odd how people waiting for you stand out far less clearly than people you are waiting for.
— Jean Giraudoux

Often it makes sense to allocate a Rendez to a static hardware structure. In this case many wakeup events will be sent through the same Rendez, and it is possible for sleep and wakeup to get out-of-sync. A detailed description of this case is found in [1], and it's worth quoting here
A process may return from sleep with the condition function false if there is a delay between the condition coming true and wakeup being called, with the delay occurring just as the receiving process calls sleep. The condition is now true, so that process returns immediately, does whatever is appropriate, and then (say) decides to call sleep again. This time the condition is false, so it goes to sleep. The wakeup process then finds a sleeping process, and wakes it up, but the condition is now false.
There is an easy (and verified) solution: at the end of sleep or after sleep returns, if the condition is false, execute sleep again. This re-execution cannot repeat; the second synchronisation is guaranteed to function under the external conditions we are supposing.

A few similar schemes



The world is full of magical things patiently waiting for our wits to grow sharper.
— Bertrand Russell

Time is an illusion. Lunchtime doubly so.
— Douglas Adams, The Hitchhiker's Guide to the Galaxy

It is interesting to note that QLock, and which is often used to protect Rendez structures and RWLock use the same process scheduling primitives as sleep itself. (With the exception of using different process states.) In fact, they function quite a bit like the original Version 1 sleep. Inside QLock, for instance, is a spinlock-protected list of processes waiting for admittance. As each process exits the lock's critical section, it wakes the next process inside the critical section. If there is no next process, the lock is released.

Abbreviated Notes

[1]
Process Sleep and Wakeup on a Shared Memory Multiprocessor, Rob Pike, Dave Pressoto, Ken Thompson,
/sys/doc/sleep.ps
[2]
Design and Validation of Computer Protocols, Prentice-Hall, Englewood Cliffs, 1991.