A Short Introduction To Operating Systems
Mark Burgess
October 3, 2001
Contents
1. What is an operating system?
o 1.1 Key concepts
1.1.1 Hierarchies and black boxes
1.1.2 Resources and sharing
1.1.3 Communication, protocols, data types
1.1.4 System overhead
1.1.5 Caching
o 1.2 Hardware
1.2.1 The CPU
1.2.2 Memory
1.2.3 Devices
1.2.4 Interrupts, traps, exceptions
o 1.3 Software
1.3.1 Resource management
1.3.2 Spooling
1.3.3 System calls
1.3.4 Basic command language
1.3.5 Filesystem
1.3.6 Multiple windows and screens
o Note
o Exercises
2. Single-task OS
o 2.1 Memory map and registers
o 2.2 Stack
o 2.3 Input/Output
2.3.1 Interrupts
2.3.2 Buffers
2.3.3 Synchronous and asynchronous I/O
2.3.4 DMA - Direct Memory Access
o Exercises
3. Multi-tasking and multi-user OS
o 3.1 Competition for resources
3.1.1 Users - authentication
4.5.4 Recovery
o 4.6 Summary
o Exercises
o Project
5. Memory and storage
o 5.1 Logical and Physical Memory
5.1.1 Physical Address space
5.1.2 Word size
5.1.3 Paged RAM/ROM
5.1.4 Address binding - coexistence in memory
5.1.5 Shared libraries
5.1.6 Runtime binding
5.1.7 Segmentation - sharing
5.1.8 The malloc() function
5.1.9 Page size, fragmentation and alignment
5.1.10 Reclaiming fragmented memory (Tetris!)
o 5.2 Virtual Memory
5.2.1 Paging and Swapping
5.2.2 Demand Paging - Lazy evaluation
5.2.3 Swapping and paging algorithms
5.2.4 Thrashing
o 5.3 Disks: secondary storage
5.3.1 Physical structure
5.3.2 Device drivers and IDs
5.3.3 Checking data consistency and formatting
5.3.4 Scheduling
5.3.5 Partitions
5.3.6 Stripes
o 5.4 Disk Filesystems
5.4.1 Hierarchical filesystems and links
5.4.2 File types and device nodes
5.4.3 Permissions and access
5.4.4 File system protocols
5.4.5 Filesystem implementation and storage
5.4.6 The UNIX ufs filesystem
o Exercises
o Project
6. Networks: Services and protocols
o 6.1 Services: the client-server model
o 6.2 Communication and protocol
o 6.3 Services and Ports
o 6.4 UNIX client-server implementation
6.4.1 Socket based communication
6.4.2 RPC services
o 6.5 The telnet command
6.6 X11
6.7 html: hypertext markup language
Exercises
Project
7. TCP/IP Networks
o 7.1 The protocol hierarchy
7.1.1 The OSI model
7.1.2 Data encapsulation
o 7.2 The internet protocol family
7.2.1 udp
7.2.2 tcp
o 7.3 The physical layer
7.3.1 Network connectivity
7.3.2 Ethernet addresses
o 7.4 Internet Addresses and Routing
7.4.1 IP addresses, networks and domain names
7.4.2 Netmask and broadcast address
7.4.3 Routers and gateways
o 7.5 Network Naming services
7.5.1 The Domain Name Service
7.5.2 Network Information Service
o 7.6 Distributed Filesystems
7.6.1 NFS - the network filesystem
7.6.2 AFS - the andrew filesystem
7.6.3 DCE - the distributed computing environment
8. Security: design considerations
o 8.1 Who is responsible?
o 8.2 Passwords and encryption
8.2.1 UNIX passwords
8.2.2 Bad passwords
o 8.3 Super-user, or system administrator
8.3.1 Network administration
8.3.2 Setuid programs in unix
o 8.4 Backups
o 8.5 Intruders: Worms and Viruses
8.5.1 Back doors
o 8.6 Firewall
o 8.7 Public and Private Keys
Where next?
o Glossary
Index
About this document ...
Operating system      Classification
MS/PC DOS             S
Windows 3x            QM
Macintosh System 7    QM
Windows 9x            M*
AmigaDOS              M
MTS                   M (multi-user)
UNIX                  M (multi-user)
VMS                   M (multi-user)
NT                    S/M
Windows 2000          M (multi-user)
BeOS (Hamlet?)        M

(S = single-task, QM = quasi-multitasking, M = multitasking,
M* = multitasking without full memory protection.)
The first of these (MS/PC DOS and Windows 3x) are single-user, single-task systems which
build on a ROM based library of basic functions called the BIOS. These are system calls
which write to the screen or to disk etc. Although all these operating systems can service
interrupts, and therefore simulate the appearance of multitasking in some situations, the
older PC environments cannot be thought of as multi-tasking systems in any sense.
Only a single user application could be open at any time. Windows 95 replaced the old
coroutine approach of quasi-multitasking with a true context switching approach, but
only a single user system, without proper memory protection. Windows NT added a
proper kernel with memory protection, based on the VMS system, originally written for
the DEC/Vax. Later versions of Windows NT and Windows 2000 (a security and kernel
enhanced version of NT) allow multiple logins also through a terminal server. Windows
2000 thus has comparable functionality to Unix in this respect.
The Macintosh System 7 can be classified as single-user quasi-multitasking. That means
that it is possible to use several user applications simultaneously. A window manager can
simulate the appearance of several programs running simultaneously, but this relies on
each program obeying specific rules in order to achieve the illusion. The Macintosh is not a
true multitasking system in the sense that, if one program crashes, the whole system
crashes. Windows 9x is purported to be preemptively multitasking, but most program
crashes also crash the entire system. This might be due to the lack of proper memory
protection. The claim is somewhat confusing.
AmigaDOS is an operating system for the Commodore Amiga computer. It is based on
the UNIX model and is a fully multi-tasking, single-user system. Several programs may
be actively running at any time. The operating system includes a window environment
which means that each independent program has a `screen' of its own and does not
therefore have to compete for the screen with other programs. This has been a major
limitation on multi-tasking operating systems in the past.
MTS (the Michigan Terminal System) was one of the first time-sharing multi-user systems. It
supports only simple single-screen terminal based input/output and has no hierarchical
file system.
Unix is arguably the most important operating system today, and one which we shall
frequently refer to below. It comes in many forms, developed by different manufacturers.
Originally designed at AT&T, UNIX split into two camps early on: BSD (Berkeley
software distribution) and system 5 (AT&T license). The BSD version was developed as
a research project at the University of California at Berkeley. Many of the networking and
user-friendly features originate from these modifications. With time these two versions
have been merged back together and most systems are now a mixture of both worlds.
Historically BSD Unix has been most prevalent in universities, while system 5 has been
dominant in business environments. The trend during the last three years by Sun
Microsystems and Hewlett-Packard amongst others has been to move towards system 5,
keeping only the most important features of the BSD system. A standardization
committee for Unix called POSIX, formed by the major vendors, attempts to bring
compatibility to the Unix world. Here are some common versions of UNIX.
Unix                Manufacturer        Heritage
BSD                 Berkeley            BSD
SunOS (Solaris 1)   Sun Microsystems    BSD/sys 5
Solaris 2           Sun Microsystems    Sys 5
Ultrix              DEC/Compaq          BSD
                    DEC/Compaq          BSD/sys 5
HPUX                Hewlett-Packard     Sys 5
AIX                 IBM                 Sys 5 / BSD
IRIX                Silicon Graphics    Sys 5
GNU/Linux           Public Domain
SCO unix            Novell              Sys 5
Note that the original BSD source code is now in the public domain. Unix is generally
regarded as the most portable and powerful operating system available today by impartial
judges, but NT is improving quickly. Unix runs on everything from laptop computers to
CRAY mainframes. It is particularly good at managing large database applications and
can run on systems with hundreds of processors. Most Unix types support symmetric
multiprocessing and all support simultaneous logins by multiple users.
NT is a `new' operating system from Microsoft, based on the old VAX/VMS kernel from
the Digital Equipment Corporation (VMS's inventor moved to Microsoft) and the
Win32 API. Initially it reinvented many existing systems, but it is gradually being
forced to adopt many open standards from the Unix world. It is fully multitasking, and
can support multiple users (but only one at a time: simultaneous logins by different users
are not possible). It has virtual memory and multithreaded support for several processors. NT
has a built in object model and security framework which is amongst the most modern in
use.
The Be operating system, originally developed for a new multimedia computer called the
BeBox, is also new and is a fully multitasking OS. It is optimized for multimedia and is
now saleable software developed by Be.Com after the new computer concept failed due
to lack of financial backing. BeOS has proper memory protection but allows direct access
to video memory (required for fast video games). It also has virtual memory, is
preemptively multitasking and is based on a microkernel design. It shares little with Unix
except for a Bash shell, a POSIX programming interface and about 150 Unix commands
(including Perl).
mind later. Simple ideas often get lost amongst distracting details, but it is important to
remember that the ideas are simple.
Most multi-tasking systems have only a single central processor unit, and yet this is the
most precious resource a computer has. A multi-tasking operating system must therefore
share cpu-time between programs. That is, it must work for a time on one program, then
work a while on the next program, and so on. If the first program was left unfinished, it
must then return to work more on that, in a systematic way. The way an OS decides to
share its time between different tasks is called scheduling.
where the name is a string of bytes, terminated by a zero, and the time is a four byte integer
containing the time in hours. Computer B now knows enough to be able to extract the
information from the stream of bits.
It is important to understand that all computers have to agree on the way in which the
data are sent in advance. If the wrong protocol is assumed, then a string of characters
could easily be converted into a floating point number - but the result would have been
nonsense. Similarly, if computer A had sent the information incorrectly, computer B
might not be able to read the data and a protocol error would arise.
More generally, a protocol is an agreed sequence
of behaviour which must be followed.
For example, when passing parameters to functions in a computer program, there are
rules about how the parameter should be declared and in which order they are sent. This
is a simple example of a protocol. Protocols are an important part of communication and
data typing and they will appear in many forms during our discussion of operating
systems.
1.1.5 Caching
Caching is a technique used to speed up communication with slow devices. Usually the
CPU can read data much faster from memory than it can from a disk or network
connection, so it would like to keep an up-to-date copy of frequently used information in
memory. The memory area used to do this is called a cache. You can think of the whole
of the primary memory as being a cache for the secondary memory (disk).
Sometimes caching is used more generally to mean `keeping a local copy of data for
convenience'.
1.2 Hardware
Here we list the main hardware concepts.
which aim to be more efficient for a subset of instructions by using redundancy. These
have simpler instructions but can execute much more quickly, sometimes with several
instructions per clock cycle.
1.2.2 Memory
The primary memory is the most important resource a computer has. Since CPUs are only
made with instructions for reading and writing to memory, no programs would be able to
run without it. There are two types of memory: RAM - random access memory, or
read/write memory, which loses its contents when the machine is switched off, and ROM
- read only memory, which never loses its contents unless destroyed. ROM is normally
used for storing those most fundamental parts of the operating system which are required
the instant a computer is switched on, before it knows about disks etc.
1.2.3 Devices
The concept of a device really has two parts. There is the hardware unit which is
connected to the machine, and there is the logical device which is a name given by the
OS to a legal entry point for talking to a hardware-device. When a user writes to a logical
device, the OS invokes a device driver which performs the physical operations of
controlling the hardware. For example, when writing to a disk, the OS must control the
movement of the read-write heads. When writing to a printer, the OS places the
information in a queue and services the request when the printer becomes free.
Some common logical devices are: the system disks, the keyboard, the screen, the printer
and the audio device.
Disks and tapes are often called secondary memory or secondary storage.
Interrupts are graded in levels. Low level interrupts have a low priority, whereas high
level interrupts have a high priority. A high level interrupt can interrupt a low level
interrupt, so that the CPU must be able to recover from several `layers' of interruption
and end up doing what it was originally doing. This is accomplished by means of a stack
or heap. Moreover, programs can often choose whether or not they wish to be
interrupted by setting an interrupt mask which masks out the interrupts it does not want
to hear about. Masking interrupts can be dangerous, since data can be lost. All systems
therefore have non-maskable interrupts for the most crucial operations.
1.3 Software
1.3.1 Resource management
In order to keep track of how the system resources are being used, an OS must keep
tables or lists telling it what is free and what is not. For example, data cannot always be
stored neatly on a disk. As files are deleted, holes appear and the data become scattered
randomly over the disk surface.
1.3.2 Spooling
Spooling is a way of processing data serially. Print jobs are spooled to the printer,
because they must be printed in the right order (it would not help the user if the lines of
his/her file were liberally mixed together with parts of someone else's file). During a
spooling operation, only one job is performed at a time and other jobs wait in a queue to
be processed. Spooling is a form of batch processing.
Spooling comes from the need to copy data onto a spool of tape for storage. It has since
been dubbed Simultaneous Peripheral Operation On-Line, which is a pretty lousy
attempt to make something more meaningful out of the word `spool'!
System calls can be thought of as a very simple protocol - an agreed way of asking the
OS to perform a service. Some typical OS calls are: read, write (to screen, disk, printer
etc), stat (get the status of a file: its size and type) and malloc (request for memory
allocation).
On older microcomputers, where high level languages were uncommon, system calls were
often available only through assembler or machine code. On modern systems and
integrated systems like UNIX, they are available as functions in a high level language
like C.
Typed commands, for listing directories, copying files, starting programs and so on,
constitute a basic command language. Every computer must have such a language
(except perhaps the Macintosh - yawn!). In microcomputer operating systems the
command language is often built into the system code, whereas on larger systems (UNIX)
the commands are just executable programs.
The command language deals typically with: file management, process management and
text editing.
1.3.5 Filesystem
In creating a system to store files we must answer some basic questions.
Should the filesystem distinguish between types of files, e.g. executable files, text
files, scripts? If so, how? One way is to use file extensions, or a naming convention,
to identify files, like myprog.exe, SCRIPT.BAT, file.txt. The problem with this is
that the names can be abused by users. If one tries to execute a file which is not
meant to be executed, the result would be nonsense and might even be dangerous
to the point of crashing the system. One way around this problem is to introduce a
protocol or standard format for executable files, so that when the OS opens a file
for execution it first checks to see whether the file obeys the protocol. This
method is used for binary files in UNIX, for instance.
Protection. If several users will be storing files together on the same disk, should
each user's files be exclusive to him or her?
Is a mechanism required for sharing files between several users?
A hierarchical filesystem is a good starting point for organizing files, but it can be
too restrictive. Sometimes it is useful to have a file appear in several places at one
time. This can be accomplished with links. A link is not a copy of a file, but a
pointer to where a file really is. By making links to other places in a hierarchical
filesystem, its flexibility is increased considerably.
Note
Before proceeding, you should note that the design of operating systems is an active area
of research. There are no universal solutions to the issues that we shall discuss, rather OS
design must be thought of as a study of compromises. Hopefully you will get a feel for
this during the course of the tutorial.
Exercises
1. What are the key ingredients of an operating system?
2. What is the usefulness of system calls?
3. What is the difference between primary and secondary storage?
2. Single-task OS
Before tackling the complexities of multi-tasking, it is useful to think about the operation
of a single-task OS without all the clutter that multi-tasking entails. In a multi-task OS the
features we shall discuss below have to be reproduced N times and then augmented by
extra control structures.
Register
Purpose

Accumulator
Holds the result of the arithmetic or logical operation currently
being performed.

Program counter
Holds the address of the next instruction to be executed.

Index (addressing) registers
Used to specify the address of data to be loaded into or saved from
the accumulator, or operated on in some way.

Stack pointer
Holds the address of the top of the stack.

Status register
Holds flags describing the outcome of the last instruction and the
state of the CPU.
The memory, as seen by the CPU, is a large string of bytes starting with address zero and
increasing up to the maximum address. Physically it is made up, like a jigsaw puzzle, of
many memory chips and control chips, mapped into the diagram shown. Normally,
because of the hardware design of the CPU, not all of the memory is available to the user
of the machine. Some of it is required for the operation of the CPU.
The roughly distinguished areas in figure 2.1 are
Zero page: The first `page' of the memory is often reserved for a special
purpose. It is often faster to write to the zero page because you don't have to code
the leading zero for the address - special instructions for the zero page can leave
the `zero' implicit.
Stack: Every CPU needs a stack for executing subroutines. The stack is explained
in more detail below.
User programs: Space the user programs can `grow into'.
Screen memory: What you see on the screen of a computer is the image of an area
of memory, converted into colours and positions by a hardware video-controller.
The screen memory is the area of memory needed to define the colour of every
`point' or `unit' on the screen. Depending on what kind of visual system a
computer uses, this might be one byte per character and it might be four bytes per
pixel!
Memory mapped I/O: Hardware devices like disks and video controllers contain
smaller microprocessors of their own. The CPU gives them instructions by
placing numbers into their registers. To make this process simpler, these device
registers (only a few bytes per device, perhaps) are `wired' into the main memory
map, so that writing to the device is the same as writing to the rest of the memory.
Operating system: The operating system itself is a large program which often
takes up a large part of the available memory.
Note that this figure is very simplified. It does not show, for instance, special memory
which might be located inside the devices or CPU. Such memory is often used for
caching. Also it does not show how the various components are connected together by
means of a high speed data bus.
Figure 2.1: A simple schematic memory map of a microcomputer. The order of the
different segments of memory can vary depending on the system.
2.2 Stack
A stack is a so-called last-in first-out (LIFO) data structure. That is to say - the last thing
to be placed on top of a stack, when making it, is the first item which gets removed when
un-making it. Stacks are used by the CPU to store the current position within a program
before jumping to subroutines, so that they remember where to return to after the
subroutine is finished. Because of the nature of the stack, the CPU can simply deposit the
address of the next instruction to be executed (after the subroutine is finished) on top of
the stack. When the subroutine is finished, the CPU pulls the first address it finds off the
top of the stack and jumps to that location.
Notice that the stack mechanism will continue to work even if the subroutine itself calls
another subroutine, since the second subroutine causes another stack frame to be saved on
the top of the stack. When that is finished, it returns to the first subroutine and then to the
original program in the correct order.
On many older microcomputers and in many operating systems the stack is allocated with
a fixed size in advance. If too many levels of nested subroutines are called, the stack can
overflow. Consider the following example code for a stack.
//*********************************************************************
//
// A simple stack handler.
//
// Use the commands "push" and "pop" to push onto the stack and to pop
// "out" of the stack. The allocated stacksize is very small so that
// an overflow can occur if you push too far!! e.g. input
//
//   push 23
//   push 4
//   pop
//   push 678
//   quit
//
// In a real stack handler the numbers would be the address of the next
// instruction to return to after completing a subroutine.
//
// The program is compiled with
//
//   g++ stack.C
//
// MB 1994
//
//*********************************************************************

#include <iostream>
#include <sstream>
#include <string>

using namespace std;

//**********************************************************************
// Constants
//**********************************************************************

const int forever   = 1;
const int stacksize = 10;
const int bufsize   = 20;

//**********************************************************************

class Stack
{
public:
  int stack[stacksize];

  Stack();
  void ShowStack();
  void Push(int);
  int  Pop();

private:
  int stackpointer;
};

//**********************************************************************
// Level 0
//**********************************************************************

int main()
{
  char input[bufsize];
  string command;
  int number = 0, newnumber = 0;
  Stack s;

  cout << "Stack demo\n\n";
  s.ShowStack();

  while (forever)
    {
    cout << "Enter command: ";

    // Extract command

    if (!cin.getline(input,bufsize))
       {
       break;                          // end of input
       }

    number = 0;
    istringstream line(input);
    line >> command >> number;

    // Interpret command

    if (command == "push")
       {
       s.Push(number);
       }
    else if (command == "pop")
       {
       newnumber = s.Pop();
       }
    else if (command == "quit")
       {
       break;
       }
    else
       {
       cout << "Bad command\n\n";
       }

    s.ShowStack();
    }

  s.ShowStack();
}

//**********************************************************************
// Class Stack
//**********************************************************************

Stack::Stack()
{ int i;

  stackpointer = 0;

  for (i = 0; i < stacksize; i++)
    {
    stack[i] = 0;
    }
}

//**********************************************************************

void Stack::Push (int n)
{
  cout << "Pushing " << n << " on the stack\n";

  if (stackpointer >= stacksize)
    {
    cerr << "Stack overflow!\n";
    return;
    }

  stack[stackpointer] = n;
  stackpointer++;
}

//**********************************************************************

int Stack::Pop ()
{
  if (stackpointer == 0)
    {
    cerr << "Stack underflow!\n";
    return 0;
    }

  stackpointer--;
  cout << "Popped " << stack[stackpointer] << " from stack\n";
  return (stack[stackpointer]);
}

//**********************************************************************

void Stack::ShowStack ()
{ int i;

  for (i = stacksize-1; i >= 0; i--)
    {
    cout << "stack[" << i << "] = " << stack[i];

    if (i == stackpointer)
      {
      cout << " <<-- Pointer\n";
      }
    else
      {
      cout << endl;
      }
    }
}
In this example, only numbers are stored. At the hardware level, this kind of stack is used by the CPU to store addresses and registers during
machine-code subroutine jumps. Operating systems also use software controlled stacks during the execution of users' programs. High level
language subroutines can have local variables which are also copied to the stack as one large stack frame during the execution of subroutines.
2.3 Input/Output
Input arrives at the computer at unpredictable intervals. The system must be able to detect its arrival and respond to it.
2.3.1 Interrupts
Interrupts are hardware triggered signals which cause the CPU to stop what it is doing and jump to a special subroutine. Interrupts normally
arrive from hardware devices, such as when the user presses a key on the keyboard, or the disk device has fetched some data from the disk.
They can also be generated in software by errors like division by zero or illegal memory address.
When the CPU receives an interrupt, it saves the contents of its registers on the hardware stack and jumps to a special routine which will
determine the cause of the interrupt and respond to it appropriately. Interrupts occur at different levels. Low level interrupts can be interrupted
by high level interrupts. Interrupt handling routines have to work quickly, or the computer will be drowned in the business of servicing
interrupts. For certain critical operations, low level interrupts can be ignored by setting a mask (See also the generalization of this for multiuser
systems in chapter 4).
There is no logical difference between what happens during the execution of an interrupt routine and a subroutine. The difference is that
interrupt routines are triggered by events, whereas software subroutines follow a prearranged plan.
An important area is the interrupt vector. This is a region of memory reserved by the hardware for servicing of interrupts. Each interrupt has a
number from zero to the maximum number of interrupts supported on the CPU; for each interrupt, the interrupt vector must be programmed
with the address of a routine which is to be executed when the interrupt occurs. i.e. when an interrupt occurs, the system examines the address
in the interrupt vector for that interrupt and jumps to that location. The routine exits when it meets an RTI (return from interrupt) instruction.
2.3.2 Buffers
The CPU and the devices attached to it do not work at the same speed. Buffers are therefore needed to store incoming or outgoing information
temporarily, while it is waiting to be picked up by the other party. A buffer is simply an area of memory which works as a waiting area. It is a
first-in first-out (FIFO) data structure or queue.
Exercises
1. What is the program counter?
2. Explain why a stack is used to store local variables.
3. Some microprocessors (68000/Intel 386 upward) support multitasking internally.
A separate stack is then needed for each process. How can this be achieved?
4. Write a program to create a stack (LIFO) which can store any number of local
variables for each subroutine. Hint: use a linked list for the stack and for the
variables.
5. Write a program to implement a buffer (FIFO).
6. When a computer is first switched on, it executes a program called a bootstrap
program. This comes from the expression `to lift oneself by one's own bootstraps'.
The computer must begin to execute instructions and `get going'. Find out for
yourself, or speculate on how this takes place.
7. What is a stack-frame?
8. What is memory mapped I/O?
//******************************************************************
//
// Example of a segmentation fault in user mode
//
//******************************************************************
int main()
{ int *ptr;

  ptr = 0;     // address zero is not a legal address for user data
  *ptr = 3;    // writing through the null pointer causes the fault
}
The operating system fetches each instruction from the user program and executes
it personally, never giving it directly to the CPU. The OS software switches
between different processes by fetching the instructions it decides to execute. This
is a kind of software emulation. This method works, but it is extremely inefficient
because the OS and the user program are always running together. The full speed
of the CPU is not realized. This method is often used to make simulators and
debuggers.
A more common method is to switch off the OS while the user program is
executing and switch off the user process while the OS is executing. The
switching is achieved by hardware rather than software, as follows. When handing
control to a user program, the OS uses a hardware timer to ensure that control will
return after a certain time. The OS loads a fixed time interval into the timer's
control registers and gives control to the user process. The timer then counts down
to zero and when it reaches zero it generates a non-maskable interrupt,
whereupon control returns to the OS.
mountd: Deals with requests for `mounting' this machine's disks on other
machines - i.e. requests to access the disk on this machine from another machine
on the network.
rlogind: Handles requests to login from remote terminals.
keyserv: A server which stores public and private keys. Part of a network security
system.
syslogd: Records information about important events in a log file.
named: Converts machine names into their network addresses and vice versa.
Exercises
1. Write a program to manage an array of many stacks.
2. Describe the difference between the kernel and daemons in UNIX. What is the
point of making this distinction?
3. What is two-mode operation?
4. What is the difference between an emulator or simulator and true multi-tasking?
5. To prepare to for the project suggestion in the next chapter, write a program which
reads fictitious commands in from a file. The commands should be of the form:
   operator operand

   load   12
   add    23
   store  1334
   jsr    5678
   wait   1
   fork   0
etc. Read in the commands and print out a log of what the commands are, in the
form "Executing (operator) on (operand)". You should be able to recognize the
commands `wait' and `fork' specially, but the other commands may be anything
you like. The aim is to simulate the type of commands a real program has to
execute.
Process: This is a general term for a program which is being executed. All work
done by the CPU contributes to the execution of processes. Each process has a
descriptive information structure associated with it (normally held by the kernel)
called a process control block which keeps track of how far the execution has
progressed and what resources the process holds.
Task: On some systems processes are called tasks.
Job: Some systems distinguish between batch execution and interactive
execution. Batch (or queued) processes are often called jobs. They are like
production line processes which start, do something and quit, without stopping to
ask for input from a user. They are non-interactive processes.
Thread: (sometimes called a lightweight process) is different from process or task
in that a thread is not enough to get a whole program executed. A thread is a kind
of stripped down process - it is just one `active hand' in a program - something
which the CPU is doing on behalf of a program, but not enough to be called a
complete process. Threads remember what they have done separately, but they
share the information about what resources a program is using, and what state the
program is in. A thread is only a CPU assignment. Several threads can contribute
to a single task. When this happens, the information about one process or task is
used by many threads. Each task must have at least one thread in order to do any
work.
CPU burst: A period of uninterrupted CPU activity.
I/O burst: A period of uninterrupted input/output activity.
4.1.2 Scheduling
On most multitasking systems, only one process can truly be active at a time - the system must therefore share its time between the execution of
many processes. This sharing is called scheduling. (Scheduling is essentially
time management.)
Different methods of scheduling are appropriate for different kinds of execution. A queue is one form of scheduling in which each program
waits its turn and is executed serially. This is not very useful for handling multitasking, but it is necessary for scheduling devices which cannot
be shared by nature. An example of the latter is the printer. Each print job has to be completed before the next one can begin, otherwise all the
print jobs would be mixed up and interleaved resulting in nonsense.
We shall make a broad distinction between two types of scheduling:
Queueing. This is appropriate for serial or batch jobs like print spooling and
requests from a server. There are two main ways of giving priority to the jobs in a
queue. One is a first-come first-served (FCFS) basis, also referred to as first-in
first-out (FIFO); the other is to process the shortest job first (SJF).
Round-robin. This is the time-sharing approach in which several tasks can
coexist. The scheduler gives a short time-slice to each job, before moving on to
the next job, polling each task round and round. This way, all the tasks advance,
little by little, on a controlled basis.
These two categories are also referred to as non-preemptive and preemptive respectively, but there is a grey area.
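The round-robin idea can be sketched in a few lines of code. The function below is illustrative only (the name round_robin and the job lengths used are invented for this example): each job receives at most one time-slice of CPU before being moved to the back of the queue, and the function returns the order in which the jobs finish.

```cpp
#include <vector>
#include <queue>

// Round-robin sketch: each job runs for at most `slice` units before
// being put to the back of the queue. Returns the job indices in the
// order in which the jobs finish.
std::vector<int> round_robin(std::vector<int> remaining, int slice)
{
    std::queue<int> q;
    for (int i = 0; i < (int) remaining.size(); i++)
        q.push(i);

    std::vector<int> finished;
    while (!q.empty())
    {
        int job = q.front();
        q.pop();
        remaining[job] -= slice;       // give the job one time-slice
        if (remaining[job] <= 0)
            finished.push_back(job);   // the job has completed
        else
            q.push(job);               // back of the queue
    }
    return finished;
}
```

With jobs of length 3, 1 and 2 and a slice of 1, the shortest job finishes first even though it was not first in the queue - all the tasks advance, little by little.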
We want to maximize the efficiency of the machine, i.e. we would like all the
resources of the machine to be doing useful work all of the time - not to be idling
during one process when another process could be using them. The key to
organizing the resources is to get the CPU time-sharing right, since this is the
central `organ' in any computer, through which almost everything must happen.
But this cannot be achieved without also thinking about how the I/O devices must
be shared, since the I/O devices communicate by interrupting the CPU from what
it is doing. (Most workstations spend most of their time idling. There are
enormous amounts of untapped CPU power going to waste all over the world
each day.)
We would like as many jobs to get finished as quickly as possible.
Interactive users get irritated if the performance of the machine seems slow. We
would like the machine to appear fast for interactive users - or have a fast
response time.
Some of these criteria cannot be met simultaneously and we must make compromises. In particular, what is good for batch jobs is often not
good for interactive processes and vice-versa, as we remark under Run levels - priority below.
In addition, processes may be reduced in priority if their total accumulated CPU usage becomes very large. (This occurs, for example in
UNIX). The wisdom of this approach is arguable, since programs which take a long time to complete tend to be penalized. Indeed, they take
much longer to complete because their priority is reduced. If the priority continued to be lowered, long jobs would never get finished. This is
called process starvation and must be avoided.
Scheduling algorithms have to work without knowing how long processes will take. Often the best judge of how demanding a program will be
is the user who started the program. UNIX allows users to reduce the priority of a program themselves using the nice command. `Nice'
users are supposed to sacrifice their own self-interest for the good of others. Only the system manager can increase the priority of a process.
Another possibility which is often not considered, is that of increasing the priority of resource-gobbling programs in order to get them out of
the way as fast as possible. This is very difficult for an algorithm to judge, so it must be done manually by the system administrator.
As an example, here is the process structure used by the SunOS 4 (BSD) UNIX kernel, from the header file sys/proc.h (slightly abridged):

struct proc
{
  struct proc *p_link;        /* linked list of running processes */
  struct proc *p_rlink;
  struct proc *p_nxt;         /* linked list of allocated proc slots */
  struct proc **p_prev;       /* also zombies, and other states */
  struct as *p_as;            /* address space description */
  struct seguser *p_segu;     /* "u" segment for this process */
  caddr_t p_stack;            /* kernel stack top for this process */
  struct user *p_uarea;       /* u area for this process */
  char    p_usrpri;           /* user priority, based on p_cpu and p_nice */
  char    p_pri;              /* priority, negative is high */
  char    p_cpu;              /* cpu usage for scheduling */
  char    p_stat;             /* process status */
  char    p_time;             /* resident time for scheduling */
  char    p_nice;             /* nice for cpu usage */
  char    p_slptime;          /* time since last blocked */
  char    p_cursig;           /* current signal */
  int     p_sig;              /* signals pending to this process */
  int     p_sigmask;          /* current signal mask */
  int     p_sigignore;        /* signals being ignored */
  int     p_sigcatch;         /* signals being caught by user */
  int     p_flag;             /* various state flags */
  uid_t   p_uid;              /* user id, used to direct tty signals */
  uid_t   p_suid;             /* saved (effective) uid from exec */
  gid_t   p_sgid;             /* saved (effective) gid from exec */
  short   p_pgrp;             /* name of process group leader */
  short   p_pid;              /* unique process id */
  short   p_ppid;             /* process id of parent */
  u_short p_xstat;            /* exit status for wait */
  short   p_cpticks;          /* ticks of cpu time */
  struct  ucred *p_cred;      /* Process credentials */
  struct  rusage *p_ru;       /* mbuf holding exit information */
  int     p_tsize;            /* size of text (clicks) */
  int     p_dsize;            /* size of data space (clicks) */
  int     p_ssize;            /* copy of stack size (clicks) */
  int     p_rssize;           /* current resident set size in clicks */
  int     p_maxrss;           /* copy of u.u_limit[MAXRSS] */
  int     p_swrss;            /* resident set size before last swap */
  caddr_t p_wchan;            /* event process is awaiting */
  long    p_pctcpu;           /* (decayed) %cpu for this process */
  struct proc *p_pptr;        /* pointer to process structure of parent */
  struct proc *p_cptr;        /* pointer to youngest living child */
  struct proc *p_osptr;       /* pointer to older sibling */
  struct proc *p_ysptr;       /* pointer to younger sibling */
  struct proc *p_tptr;        /* pointer to process structure of tracer */
  struct itimerval p_realtimer;
  struct sess *p_sessp;       /* pointer to session information */
  struct proc *p_pglnk;       /* list of pgrps in same hash bucket */
  short   p_idhash;           /* hashed on p_pid, for kill+exit+... */
  short   p_swlocks;          /* number of swap vnode locks held */
  struct aiodone *p_aio_forw; /* (front of) asynchronous i/o queue */
  struct aiodone *p_aio_back; /* (rear of) asynchronous i/o queue */
  int     p_aio_count;        /* number of aio's in queue */
  int     p_threadcnt;        /* ref count of threads using this proc */
  int     p_cpuid;            /* processor this process is running on */
  int     p_pam;              /* processor affinity mask */
};
UNIX also uses a `user' structure to keep auxiliary information which is only needed when jobs are not `swapped out' (see next chapter).
1. Name. The name of the program which is to run as the new process must be
known.
2. Process ID and Process Control Block. The system creates a new process
control block, or locates an unused block in an array. This block is used to follow
the execution of the program through its course, keeping track of its resources and
priority. Each process control block is labelled by its PID or process identifier.
3. Locate the program to be executed on disk and allocate memory for the code
segment in RAM.
4. Load the program into the code segment and initialize the registers of the PCB
with the start address of the program and appropriate starting values for resources.
5. Priority. A priority must be computed for the process, using a default for the type
of process and any value which the user specified as a `nice' value (see Run levels
- priorities above).
6. Schedule the process for execution.
returncode = fork();
When this instruction is executed, the process concerned splits into two and both continue to execute independently from after the
instruction. If fork is successful, it returns 0 to the child process and the process identifier or pid of the child process to the parent. If, for some
reason, the call fails, it returns -1 to the parent.
//**************************************************************
//*
//* A brief demo of the UNIX process duplicator fork().
//*
//* g++ unix.C to compile this.
//*
//**************************************************************
#include <iostream.h>
extern "C" void sleep(unsigned);
extern "C" int fork();
extern "C" int getpid();
extern "C" int wait(int *);
extern "C" void exit(int);
void ChildProcess();
//***************************************************************
main ()
{ int pid, cid;
pid = getpid();
cout << "Fork demo! I am the parent (pid = " << pid << ")\n";
if (! fork())
{
cid = getpid();
cout << "I am the child (cid=" << cid << ") of (pid = " << pid <<
")\n";
ChildProcess();
exit(0);
}
cout << "Parent waiting here for the child...\n";
wait(NULL);
cout << "Child finished, parent quitting too!\n";
}
//**************************************************************
void ChildProcess()
{ int i;
for (i = 0; i < 10; i++)
{
cout << i << "..\n";
sleep(1);
}
}
Here is the output from the program in a test run. Note that the parent and child processes share the same output stream, so we see how they are
synchronised from the order in which the output is mixed.
A process can be in one of the following states:
1. New.
2. Ready (in line to be executed).
3. Running (active).
4. Waiting (sleeping, suspended).
5. Terminated (defunct).
When time-sharing, the scheduler only needs to consider the processes which are in the `ready' state. Changes of state are made by the system
and follow the pattern in the diagram below.
From state   Event                  To state
--------------------------------------------
New          Accepted               Ready
Ready        Scheduled / Dispatch   Running
Running      Need I/O               Waiting
Running      Scheduler timeout      Ready
Running      Completed / killed     Terminated
Waiting      I/O completed (woken)  Ready
Sorted queue, in which the elements are regularly ordered according to some rule.
The most prevalent example of this is the shortest job first (SJF) rule.
The FCFS queue is the simplest and incurs almost no system overhead. The SJF scheme can cost quite a lot in system overhead, since each task
in the queue must be evaluated to determine which is shortest. The SJF strategy is often used for print schedulers since it is quite inexpensive to
determine the size of a file to be printed (the file size is usually stored in the file itself).
The efficiency of the two schemes is subjective: long jobs have to wait longer if short jobs are moved in front of them, but if the distribution of
jobs is random then we can show that the average waiting time of any one job is shorter in the SJF scheme, because the greatest number of jobs
will always be executed in the shortest possible time.
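This claim is easy to check with a small calculation. The helper below is invented for illustration: it computes the average waiting time when jobs are served strictly in the order given, so the same set of jobs can be compared in FCFS order and in sorted (SJF) order.

```cpp
#include <vector>

// Average waiting time when jobs are served strictly in the order
// given: job i waits for the total service time of the jobs before it.
double average_wait(std::vector<int> jobs)
{
    double total_wait = 0;
    double elapsed = 0;
    for (int i = 0; i < (int) jobs.size(); i++)
    {
        total_wait += elapsed;   // this job waited for all earlier ones
        elapsed += jobs[i];
    }
    return total_wait / jobs.size();
}
```

With job lengths 10, 1, 2, 3 in FCFS order the average wait is (0+10+11+13)/4 = 8.5, while the sorted order 1, 2, 3, 10 gives (0+1+3+6)/4 = 2.5: the long job waits a little longer, but the average falls sharply.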
Of course this argument is rather stupid, since it is only the system which cares about the average waiting time per job, for its own prestige.
Users who print only long jobs do not share the same clinical viewpoint. Moreover, if only short jobs arrive after one long job, it is possible that
the long job will never get printed. This is an example of starvation. A fairer solution is required (see exercises below).
Queue scheduling can be used for CPU scheduling, but it is quite inefficient. To understand why simple queue scheduling is not desirable we
can begin by looking at a diagram which shows how the CPU and the devices are being used when a FCFS queue is used. We label each
process A, B, C ... etc. A blank space indicates that the CPU or I/O devices are in an idle state (waiting for a customer).

[Diagram: time runs horizontally, with one row for the CPU and one for the devices. A starts out with a CPU burst. At some point it needs input (say from a disk) and sends a request to the device. While the device services the request, the CPU is idle, waiting for the result. Similarly, when the result returns, the device lies idle, and so on until A is finished and B can begin.]

There are many blank spaces in the diagram, where the devices and the CPU are idle. Why, for example, couldn't the device be fetching
the I/O for B while the CPU was busy with A?
We can improve the picture by introducing a new rule: every time one process needs to wait for a device, it gets put to the back of the queue.
Now consider the following diagram, in which we have three processes A, B and C. They will always be scheduled in this order until
one or all of them is finished.

[Diagram: A starts out as before with a CPU burst and then sends a request to the device. While A has to wait for the device to complete its I/O, B takes over the CPU, since it is next in the queue, and so on round the queue. Gaps open up whenever the process at the head of the queue needs a resource which is still busy, and they grow as first A and then B finish, leaving the last process to wait alone.]
In the beginning, this second scheme looked pretty good - both the CPU and the devices were busy most of the time (few gaps in the diagram).
As processes finished, the efficiency got worse, but on a real system, someone will always be starting new processes so this might not be a
problem.
Let us ask - how can we improve this scheme? The resource utilization is not too bad, but the problem is that it assumes that every program
goes in a kind of cycle: a CPU burst, followed by an I/O burst, and so on.
If one program spoils this cycle by performing a lot of CPU intensive work, or by waiting for dozens of I/O requests, then the whole scheme
goes to pieces.
4.3 Threads
Figure 4.4: System and user level threads: suppose we think of a household kitchen as
being a process, then each electrical appliance which contributes to the work of the
kitchen is like a thread. In order to work, a thread needs power. The power sockets are like
kernel threads or CPUs. A job like making a cake or tidying up might involve several
threads (powered machinery), which might run in parallel or one after the other. Since
there are more appliances than power points, we have to schedule the time each appliance
gets power so as to share between all of them.
On a truly parallel computer (several CPUs) we might imagine parts of a program (different subroutines) running on quite different processors,
until they need to communicate. When one part of the program needs to send data to the other part, the two independent pieces must be
synchronized, or be made to wait for one another. But what is the point of this? We can always run independent procedures in a program as
separate programs, using the process mechanisms we have already introduced. They could communicate using normal interprocesses
communication. Why introduce another new concept? Why do we need threads?
The point is that threads are cheaper than normal processes, and that they can be scheduled for execution in a user-dependent way, with less
overhead. Threads are cheaper than a whole process because they do not have a full set of resources each. Whereas the process control block
for a heavyweight process is large and costly to context switch, the PCBs for threads are much smaller, since each thread has only a stack and
some registers to manage. It has no open file lists or resource lists, no accounting structures to update. All of these resources are shared by all
threads within the process. Threads can be assigned priorities - a higher priority thread will get put to the front of the queue.
Object         Resources
------------------------------------------------------------
Thread (LWP)   Stack, CPU time (1 thread per LWP)
Task (HWP)     n threads, plus shared code, data and files

The number of threads which can be active at any one time is equal to the number of system level threads, which in turn is equal to the number of CPUs on the
system.
Since threads work ``inside'' a single task, the normal process scheduler cannot normally tell which thread to run and which not to run - that is
up to the program. When the kernel schedules a process for execution, it must then find out from that process which is the next thread it must
execute. If the program is lucky enough to have more than one processor available, then several threads can be scheduled at the same time.
On a multiprocessor system, there are two important ways of dividing the work of the system between CPUs:
Asymmetric: one CPU does the work of the system, the other CPUs service user
requests.
Symmetric: All processors can be used by the system and users alike. No CPU is
special.
The asymmetric variant is potentially more wasteful, since it is rare that the system requires a whole CPU just to itself. This approach is more
common on very large machines with many processors, where the jobs the system has to do are quite demanding and warrant a CPU to themselves.
////////////////////////////////////////////////////////////////////////
//
// A single-threaded program which opens a series of files, one after
// the other, and counts the total number of lines.
//
////////////////////////////////////////////////////////////////////////
#include <iostream.h>
#include <fstream.h>
const int bufsize = 100;
void ParseFile(char *);
int LINECOUNT = 0;
/**********************************************************************/
main ()
{
cout << "Single threaded parent...\n";
ParseFile("proc1");
ParseFile("proc2");
ParseFile("proc3");
ParseFile("proc4");
cout << "Number of lines = " << LINECOUNT << endl;
}
/**********************************************************************/
void ParseFile(char *filename)
{ fstream file;
char buffer[bufsize];
cout << "Trying to open " << filename << endl;
file.open(filename, ios::in);
if (! file)
{
cerr << "Couldn't open file\n";
return;
}
while (!file.eof())
{
file.getline(buffer,bufsize);
cout << filename << ":" <<buffer << endl;
LINECOUNT++;
}
file.close();
}
This program calls the function ParseFile() several times to open and count the number of lines in a series of files. The number of
lines is held in a global variable called LINECOUNT. A global variable is, by definition, shared data. This will cause a problem when we
try to parallelize the program using threads. Here is the threaded version:
//
// Multithreaded version of the line-counting program, using POSIX threads.
//
#include <iostream.h>
#include <fstream.h>
#include <pthread.h>
#include <sched.h>
const int bufsize = 100;
const int maxfiles = 4;
void *ParseFile(char *);
int LINECOUNT = 0;
pthread_mutex_t MUTEX = PTHREAD_MUTEX_INITIALIZER;
/**********************************************************************/
main ()
{ pthread_t tid[maxfiles];
  int i,ret;

  // Create a thread for each file

  ret = pthread_create(&(tid[0]), NULL, (void *(*)(void *)) ParseFile, (void *) "proc1");
  ret = pthread_create(&(tid[1]), NULL, (void *(*)(void *)) ParseFile, (void *) "proc2");
  ret = pthread_create(&(tid[2]), NULL, (void *(*)(void *)) ParseFile, (void *) "proc3");
  ret = pthread_create(&(tid[3]), NULL, (void *(*)(void *)) ParseFile, (void *) "proc4");

  // Wait here for all the threads to rejoin the main thread
  // before exiting, or they will be killed prematurely

  for (i = 0; i < maxfiles; i++)
     {
     ret = pthread_join(tid[i], NULL);
     }

  cout << "Number of lines = " << LINECOUNT << endl;
}
/**********************************************************************/
void *ParseFile(char *filename)
{ fstream file;
char buffer[bufsize];
int ret;
cout << "Trying to open " << filename << endl;
file.open(filename, ios::in);
if (! file)
{
cerr << "Couldn't open file\n";
return NULL;
}
while (!file.eof())
{
file.getline(buffer,bufsize);
cout << filename << ":" <<buffer << endl;
// Critical section
ret = pthread_mutex_lock(&MUTEX);
LINECOUNT++;
ret = pthread_mutex_unlock(&MUTEX);
// Try uncommenting this ....
// Yield the process, to allow next thread to be run
// sched_yield();
}
file.close();
}
In this version of the program, a separate thread is spawned for each file. First we call the function pthread_create() for each file
we encounter. A new thread is spawned with a pointer to the function the thread should execute (in this case the same function for all threads),
called ParseFile(), which reads lines from the respective files and increments the global variable LINECOUNT. Several things
are important here.
The main program is itself a thread. It is essential that we tell the main program to wait for the additional threads to join the main program
before exiting, otherwise the main program will exit and kill all of the child threads immediately. Thread join-semantics are like wait-semantics
for normal processes.
Each of the threads updates the same global variable. Suppose now that two threads are running on different CPUs. It is possible that both
threads would try to alter the value of the variable LINECOUNT simultaneously. This is called a race condition and can lead to
unpredictable results. For this reason we use a mutex to lock the variable while it is being updated. We shall discuss this more in the next
section.
A final point to note is the commented out lines in the ParseFile() function. The call sched_yield() tells a running thread
to give itself up to the scheduler, so that the next thread to be scheduled can run instead. This function can be used to switch between several
threads. By calling this function after each line is read from the files, we can spread the CPU time evenly between each thread. Actually, it
is difficult to predict precisely which threads will be scheduled and when, because the threads in our program here are only a small number,
compared to the total number of threads waiting to be scheduled by the system. The interaction with disk I/O can also have a complicated effect
on the scheduling. On a single CPU system, threads are usually scheduled FCFS in a queue. If we yield after every instruction, it has the effect
of simulating round-robin scheduling.
/********************************************************************/
/*
*/
/* Creating a light weight process in SunOS 4.1.3
*/
/*
*/
/********************************************************************/
#include <lwp/lwp.h>
#include <lwp/stackdep.h>
#define MINSTACKSZ   1024
#define STACKSIZE    1000 + MINSTACKSZ
#define MAXPRIORITY  10
/*********************************************************************/
stkalign_t stack[STACKSIZE];
/*********************************************************************/
/* Zone 0
*/
/*********************************************************************/
main ()
{ thread_t tid;
int task();
pod_setmaxpri(MAXPRIORITY);
lwp_create(&tid,task,MAXPRIORITY,0,STKTOP(stack),0);
printf("Done! - Now other threads can run...\n");
}
/*********************************************************************/
/* Zone 1
*/
/*********************************************************************/
task ()
{
printf("Task: next thread after main()!\n");
}
Here is an example program containing several threads which wait for each other.
/********************************************************************/
/*
*/
/* Creating a light weight process in sunos 4.1.3 (Solaris 1)
*/
/*
*/
/* Yielding to other processes
*/
/*
*/
/********************************************************************/
#include <lwp/lwp.h>
#include <lwp/stackdep.h>
#define MINSTACKSZ   1024
#define STACKCACHE   1000
#define STACKSIZE    STACKCACHE + MINSTACKSZ
#define MAXPRIORITY  10
#define MINPRIORITY  1
/*********************************************************************/
stkalign_t stack[STACKSIZE];
/*********************************************************************/
/* Zone 0
*/
/*********************************************************************/
main ()
{ thread_t tid_main;
thread_t tid_prog1;
thread_t tid_prog2;
int prog1(), prog2();
lwp_self(&tid_main);
lwp_setstkcache(STACKCACHE,3);
lwp_create(&tid_prog1,prog1,MINPRIORITY,0,lwp_newstk(),0);
lwp_create(&tid_prog2,prog2,MINPRIORITY,0,lwp_newstk(),0);
printf("One ");
lwp_yield(THREADNULL);
printf("Four ");
lwp_yield(tid_prog2);
printf("Six ");
exit(0);
}
/*********************************************************************/
/* Zone 1,2..
*/
/*********************************************************************/
prog1 ()
{
printf("Two ");
if (lwp_yield(THREADNULL) < 0)
{
lwp_perror("Bad yield");
return;
}
printf("Seven \n");
}
/*********************************************************************/
prog2 ()
{
printf("Three ");
lwp_yield(THREADNULL);
printf("Five ");
}
file, it is probably not sensible: a better solution would be to use two separate
files.
3. How should we handle a collision between processes? Should we signal an error,
or try to make the processes wait in turn? There is no universal answer to this
question - in some cases it might be logically incorrect for two processes to
change data at the same time: if two processes try to change one numerical value
then one of them has to win - which one? On the other hand, if two processes try
to add something to a list, that makes sense, but we have to be sure that they do
not write their data on top of each other. The writing must happen serially, not in
parallel.
4.4.2 Serialization
The key idea in process synchronization is serialization. This means that we have to go to some pains to undo the work we have put into
making an operating system perform several tasks in parallel. As we mentioned, in the case of print queues, parallelism is not always
appropriate.
Synchronization is a large and difficult topic, so we shall only undertake to describe the problem and some of the principles involved here.
There are essentially two strategies to serializing processes in a multitasking environment.
The scheduler can be disabled for a short period of time, to prevent control being
given to another process during a critical action like modifying shared data. This
method is very inefficient on multiprocessor machines, since all other processors
have to be halted every time one wishes to execute a critical section.
A protocol can be introduced which all programs sharing data must obey. The
protocol ensures that processes have to queue up to gain access to shared data.
Processes which ignore the protocol ignore it at their own peril (and the peril of
the remainder of the system!). This method works on multiprocessor machines
also, though it is more difficult to visualize.
The responsibility of serializing important operations falls on programmers. The OS cannot impose any restrictions on silly behaviour - it can
only provide tools and mechanisms to assist the solution of the problem.
Get_Mutex(m);
// Update shared data
Release_Mutex(m);
The mutex variable is shared by all parties (e.g. a global variable). This protocol is meant to ensure that only one process at a time can get past
the function Get_Mutex. All other processes or threads are made to wait at the function Get_Mutex until that one process calls
Release_Mutex to release the lock. A method for implementing this is discussed below. Mutexes are a central part of multithreaded
programming.
//*********************************************************************
//
// Example of a program which uses a file lock to ensure
// that no one starts more than one copy of it.
//
//*********************************************************************
#include <iostream.h>
#include <fstream.h>
//**********************************************************************
// Include file
//**********************************************************************
extern "C" int getpid();
extern "C" void unlink(char *);
int Locked();
void RemoveLock();
const int true = 1;
const int false = 0;
const int exitstatus=1;
//**********************************************************************
// Main program
//**********************************************************************
main ()
{
if (Locked())
{
cout << "This program is already running!\n";
return exitstatus;
}
// Program here
RemoveLock();
}
//**********************************************************************
// Toolkit: locks
//**********************************************************************
int Locked ()
{ ifstream lockfile;
  ofstream newlock;
  int pid;

lockfile.open("/tmp/lockfile",ios::in);

if (lockfile)
   {
   return true;
   }

newlock.open("/tmp/lockfile",ios::out);

if (! newlock)
   {
   cerr << "Cannot secure a lock!\n";
   return true;
   }

pid = getpid();
newlock << pid;
newlock.close();
return false;
}
//**********************************************************************
void RemoveLock()
{
unlink("/tmp/lockfile");
}
If a user wishes to read a file, a non-exclusive lock is used. Other users can also get non-exclusive locks to read the file simultaneously, but
when a non-exclusive lock is placed on a file, no user may write to it.
To write to a file, we must get an exclusive lock. When an exclusive lock is obtained, no other users can read or write to the file.
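The compatibility rule can be summarized in a few lines. This is only a sketch of the decision, not a system call - the function name and the counters are invented for the illustration:

```cpp
// May a new lock be granted, given how many read (non-exclusive) and
// write (exclusive) locks are currently held on the file?
bool may_lock(bool exclusive, int readers, int writers)
{
    if (writers > 0)
        return false;            // an exclusive lock excludes everyone
    if (exclusive)
        return readers == 0;     // writing also requires no readers
    return true;                 // readers may share the file
}
```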
Peterson's algorithm solves the mutual exclusion problem for two processes using only shared variables:

int turn;
int interested[2];

void Get_Mutex (int pid)
{ int other;

other = 1 - pid;
interested[pid] = true;
turn = pid;

while (turn == pid && interested[other])    // busy wait
   {
   }
}

void Release_Mutex (int pid)
{
interested[pid] = false;
}
// Process 1                 // Process 2

...                          ...
procedure1();                wait(mysignal);
signal(mysignal);            procedure2();
...                          ...
These operations are a special case of interprocess communication. A semaphore is a flag which can have a more general value than just true or
false. A semaphore is an integer counting variable and is used to solve problems where there is competition between processes. The idea is that
one part of a program tends to increment the semaphore while another part tends to decrement the semaphore. The value of the flag variable
dictates whether a program will wait or continue, or whether something special will occur. There are many uses for semaphores and we shall
not go into them here. A simple example is reading and writing via buffers, where we count how many items are in the buffer. When the buffer
becomes full, the process which is filling it must be made to wait until space in the buffer is made available.
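A counting semaphore can be sketched using a mutex and a condition variable. The class below is a minimal illustration only - the names Semaphore, wait and signal follow the text above, not any particular library: wait decrements the counter, blocking while it is zero, and signal increments it and wakes one waiter.

```cpp
#include <mutex>
#include <condition_variable>

// Minimal counting-semaphore sketch.
class Semaphore
{
    int count;
    std::mutex m;
    std::condition_variable cv;

public:
    Semaphore(int initial) : count(initial) {}

    void wait()       // decrement; block while the count is zero
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return count > 0; });
        --count;
    }

    void signal()     // increment and wake one waiting thread
    {
        std::unique_lock<std::mutex> lock(m);
        ++count;
        cv.notify_one();
    }

    int value()       // current counter, for inspection
    {
        std::unique_lock<std::mutex> lock(m);
        return count;
    }
};
```

In the buffer example, the filling process would signal once per item written and the emptying process would wait before each read, so the counter tracks how many items are in the buffer.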
4.4.8 Monitors
Some languages (like Modula) have special language class-environments for dealing with mutual exclusion. Such an environment is called a
monitor.
4.5 Deadlock
Waiting and synchronization is not all sweetness and roses. Consider the European road rule which says: on minor roads one should always
wait for traffic coming from the right. If four cars arrive simultaneously at a crossroads (see figure) then, according to the rule all of them must
wait for each other and none of them can ever move. This situation is called deadlock. It is the stale-mate of the operating system world.
4.5.1 Cause
Deadlock occurs when a number of processes are waiting for an event which can only be caused by another of the waiting processes.
These are the essential requirements for a deadlock:
1. Circular waiting. There must be a set of processes P1, P2, ..., Pn where P1 is
waiting for a resource held by P2, P2 is waiting for P3, ... and Pn is waiting
for P1.
2. Non-sharable resources. It is not possible to share the resources or signals which
are being waited for. If the resource can be shared, there is no reason to wait.
3. No preemption. The processes can not be forced to give up the resources they are
holding.
There are likewise three methods for handling deadlock situations:
1. Prevention. We can try to design a protocol which ensures that deadlock never
occurs.
2. Recovery. We can allow the system to enter a deadlock state and then recover.
3. Ostrich method. We can pretend that deadlocks will never occur and live happily
in our ignorance. This is the method used by most operating systems. User
programs are expected to behave properly. The system does not interfere. This is
understandable: it is very hard to make general rules for every situation which
might arise.
4.5.2 Prevention
Deadlock prevention incurs a system overhead.
The simplest possibility for avoidance of deadlock is to introduce an extra layer of software for requesting resources in addition to a certain
amount of accounting. Each time a new request is made, the system analyses the allocation of resources before granting or refusing the
resource. The same applies for wait conditions.
The problem with this approach is that, if a process is not permitted to wait for another process - what should it do instead? At best the system
would have to reject or terminate programs which could enter deadlock, returning an error condition.
Another method is the following. One might demand that all programs declare what resources they will need in advance. Similarly all wait
conditions should be declared. The system could then analyse (and re-analyse each time a new process arrives) the resource allocation and pinpoint possible problems.
4.5.3 Detection
The detection of deadlock conditions is also a system overhead. At regular intervals the system is required to examine the state of all processes
and determine the interrelations between them. Since this is quite a performance burden, it is not surprising that most systems ignore deadlocks
and expect users to write careful programs.
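As a sketch of what such an examination involves, suppose each process waits for at most one other process (a single-resource model; the function name and representation are invented for illustration). Deadlock then shows up as a cycle in the `waits-for' relation:

```cpp
#include <vector>

// waits_for[i] = j means process i is waiting for process j;
// -1 means process i is not waiting. Deadlock exists if following
// the chain of waits from some process leads back to that process.
bool deadlocked(const std::vector<int>& waits_for)
{
    int n = waits_for.size();
    for (int start = 0; start < n; start++)
    {
        int p = start;
        for (int steps = 0; steps < n && waits_for[p] != -1; steps++)
        {
            p = waits_for[p];
            if (p == start)
                return true;   // circular waiting: a deadlock
        }
    }
    return false;
}
```

Even this toy version examines every process and follows every chain, which suggests why the real thing is considered an expensive overhead.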
4.5.4 Recovery
To recover from a deadlock, the system must either terminate one of the participants, and go on terminating them until the deadlock is cured, or
repossess the resources which are causing the deadlock from some processes until the deadlock is cured. The latter method is somewhat
dangerous since it can lead to incorrect program execution. Processes usually wait for a good reason, and any interruption of that reasoning
could lead to incorrect execution. Termination is a safer alternative.
4.6 Summary
In this chapter we have considered the creation and scheduling of processes. Each process may be described by
A process identifier.
A process control block which contains status information about the scheduled
processes.
A private stack for that process.
The scheduling of processes takes place by a variety of methods. The aim is to maximize the use of CPU time and spread the load for the
devices.
Processes can be synchronized using semaphores or flags. Protocol constructions such as critical sections and monitors guarantee that shared
data are not modified by more than one process at a time.
If a process has to wait for a condition which can never arise until it has finished waiting, then a deadlock is said to arise. The cause of
deadlock waiting is often a resource which cannot be shared. Most operating systems do not try to prevent deadlocks, but leave the problem to
user programs.
Exercises
1. Explain the difference between a light weight process and a normal process.
2. What is meant by the critical section of a program?
3. What is meant by deadlock?
4. Explain why round-robin scheduling would not be appropriate for managing a
print-queue.
5. Devise a combination of first-come-first-serve (FCFS) and shortest-job-first (SJF)
scheduling which would be the `fairest' solution to scheduling a print queue.
Project
You can learn a lot by solving the following problem. The idea is to make a time-sharing system of your own.
Together with the CPU, the physical memory (RAM) is the most important resource a computer has. The CPU chip has instructions to
manipulate data only directly in memory, so all arithmetic and logic operations must take place in RAM.
Memory mapped I/O - individual registers belonging to other chips and hardware
devices.
The interrupt vector - the CPU itself requires some workspace. Usually the
interrupt vector and sometimes the processor stack occupy fixed locations.
The operating system itself. This takes up a fair chunk of memory. On most
microcomputers this is located in ROM. On multiuser systems upgrades are much
more frequent and it is always loaded from disk into RAM.
The physical address space consists of every possible address to which memory chips are connected.
On some early systems the CPU could address less memory than was physically present, so the hardware allowed switching between several
banks of memory, only one of which was visible at a time.
Paging has obvious disadvantages - not all memory can be used at once and the method is seldom used nowadays since modern CPUs can
address much larger memory spaces. As we shall see later, multi-user systems use paging to disk. Instead of switching between hardware banks
of memory, they copy the old contents to disk and reuse the memory which is already there for something else.
At what stage should the binding of logical addresses to physical memory take place?
1. When the program is loaded into memory, once and for all?
2. While the program is being executed?
Initially it would seem that 1. is the better alternative, since 2. incurs a runtime overhead. In fact 2. is the more flexible option, for reasons which
will become more apparent when we consider paging to disk. By performing the binding at runtime, we have the freedom to completely
reorganize the use of physical memory dynamically at any time. This freedom is very important in a multitasking operating system where
memory has to be shared continually.
Figure 5.1: If a program hard codes addresses, there will be collisions when we try to
load a second program into memory. It is therefore important to have a way of allocating
addresses dynamically.
1. Considerable savings in disk space are made, because the standard library code is
never joined to the executable file which is stored on disk, thus there is only one
copy of the shared library on the system.
2. A saving of RAM can also be made since the library, once loaded into RAM can
often be shared by several programs. See under segmentation below.
3. A performance penalty is transferred from load-time to run-time, the first time a
function is accessed: the library must be loaded from disk during the execution of
the program. In the long run, this cost might be outweighed by the time it would
otherwise have taken to load the library for each of the programs which now share
it. Also, the amount of RAM needed to support the programs is now considerably
smaller.
Figure 5.2: Statically linked files append the entire library to each compiled program.
With shared libraries we can save disk and memory by linking a program dynamically
with a single copy of the library.
Does the logical address belong to the process? If not, generate an ownership
error (often called a segmentation fault, as we shall see below).
Translate the logical address into a physical address.
The ownership checking is performed at the logical level rather than the physical level because we want to be able to use the physical memory
in the most general possible way. If we bind physical addresses to a special user it means that we cannot later reorganize the physical memory
and part of the point of the exercise is lost. On the other hand, if users are only bound to logical addresses, we can fiddle as much as we like
with the physical memory and the user will never know.
One more question must be added to the above.
Are the data we want to access actually in the physical memory? As we shall see
later in this chapter, many systems (the most immediate example of which is
UNIX) allow paging to disk.
We shall return to this in the next section.
The conversion of logical addresses into physical addresses is familiar in many programming languages and is achieved by the use of
pointers. Instead of referring to data directly, one uses a pointer variable which holds the true address at which the data are kept. In
machine language, the same scheme is called ``indirect addressing''.
The difference between logical addresses and pointers
is that all pointers are user objects, and thus pointers
only point from one place in logical memory to another place
in logical memory. The mapping from logical to physical is
only visible to the designer of the system.
How is the translation performed in practice? To make the translation of logical to physical addresses practical, it is necessary to coarse-grain
the memory. If every single byte-address were independently converted, then two 32-bit addresses would be required for each
byte-address in the table and the storage space for the conversion table would be several times bigger than the memory of the system!
To get around this problem, we have to break up the memory into chunks of a certain size. Then we only need to map the start address of each
block, which is much cheaper if the blocks are big enough. There are two schemes for coarse graining the memory in this way:
1. Segmentation: logical memory is divided into blocks of varying length called segments.
2. Paging: memory is divided into blocks of a fixed size. A block of logical memory is
called a page and a block of physical memory is called a frame. Apart from the difference in names, they must
of course have the same size.
The second of these possibilities is an attractive proposition for a number of reasons. By breaking up the memory into smaller pieces, we have
the possibility of reorganizing (reusing) each piece separately. Large programs need not be entirely in memory if parts of them are not needed. Also, if
two programs use the same code, they can share pages, so two logical pages map into the same physical frame. This is advantageous for
shared libraries.
Page numbers and addresses
Page addressing is a simple matter if the size of one page is a power of 2. Since addresses are stored in bits, page numbers can be
assigned by simply throwing away the lower bits of every address: if the page size is 2^n bytes, the lower n bits of an address give the
offset within the page and the remaining upper bits give the page number. It is analogous to counting in blocks of a thousand in regular
base 10, where the leading digits give the block number. With a page size of 1024 = 2^10 bytes, for instance, the address 4096 has page
number 4 and offset 0, since 4096 = 4 x 1024.
An important consequence of the mapping of pages is that what appears to the user as a block of sequential memory may in reality be
spread in some random order just about anywhere in physical memory. The tables which map logical to physical memory are called the page
table and the frame table; they are stored per process and loaded as a part of context switching.
Figure 5.3: The UNIX process model, showing the various segments used by each
process. The stack contains all local (automatic) variables and the heap is allocated by
malloc().
The segment idea can be built on top of the page/frame concept above by demanding that segments be a whole number of pages. That way,
we retain the advantages of the page system. Segmentation is an additional overhead which relates to the sharing of logical memory between
processes. The page overhead relates to the mapping of logical to physical addresses.
Memory addressing with segments is like plotting points in a plane with coordinates
(segment,offset).
In C++, the operator new, which dynamically allocates memory, is a wrapper function for the C library function malloc(). When
we use new, the compiler translates this into a call to malloc(). As an example, let us ask what happens when we call the function
malloc(). malloc is part of the standard C library on any system, but we shall only be concerned with how it is implemented in
BSD UNIX. The function is used to obtain a pointer to (the address of) a block of memory n bytes long:
pointer = malloc(n);
Since malloc is a user-level command, it obtains logical memory for the caller. The acquisition of physical memory is taken care of by the
system on behalf of malloc, by deeper-level kernel commands.
In order to obtain n bytes of memory, malloc must normally acquire more memory than it needs, because the smallest unit of memory is a page.
Thus when malloc is called, it checks to see if the data segment of the current process has n
free bytes. If the space already exists within the
pages already allocated to the process, malloc uses this space and updates the free-memory list. If there is not sufficient space, malloc
makes a call to the brk() function, which tries to extend the size of the data segment. In most cases, not all the memory obtained is
required. The most extreme example would be the allocation of one char variable (one single byte). Then the remainder of the page is free,
and is added to the free memory list.
The next time malloc is called, it tries to use the remainder of the last allocated page, or any memory in the same segment which it
allocated earlier, but has since been freed.
The fact that malloc divides up pages of logical memory is of no consequence to the memory management system, since each process
maintains its own free memory list for the data segment. Since the segment always consists of a whole number of pages there is no conflict
with the page mapping algorithm.
Two kinds of wasted space, or fragmentation, arise from this scheme.
Internal fragmentation. This is space wasted by malloc in trying to fit data into a
segment (logical memory).
External fragmentation. This is space lying between segments in the physical
memory. (There are never holes between segments in logical memory since we
can always just renumber the logical addresses to remove them - they are not real
anyway.)
It is now easy to see how the paging concept goes hand in hand with the logical memory concept: each time the system pages out a frame of
physical memory, it sets a flag in the page table next to the logical page that was removed. If a process attempts to read from that page of
logical memory the system first examines the flag to see if the page is available and, if it is not, a page fault occurs.
A page fault is a hardware or software interrupt (depending on implementation) which passes control to the operating system. The OS proceeds
to locate the missing page in the swap area and move it back into a free frame of physical memory. It then binds the addresses by updating the
paging table and, when control returns to the waiting process, the missing page is automagically restored, as if it had never been gone.
Notice that the location of the physical frame is completely irrelevant to the user process. A frame does not have to be moved back into the
same place that it was removed from, because the runtime binding of addresses takes care of its relocation.
When a process needs more memory than it currently owns, the system can do one of three things:
1. Try to look for free memory at the end of the current segment and add it to the
current segment.
2. Try to allocate a new, larger segment, copy the data to the new segment and
deallocate the old one.
3. Swap out the process, reallocate and swap in again.
In this use of swap space, it is clear that a process is swapped out while it is waiting for a suitable hole to appear in the memory. This might
take a long time or it might be immediate. Another reason for swapping out a job is if it has been idle (sleeping) for a long time.
On a BSD-like UNIX system, the first three processes to be started are 0) the swapper,
1) init and 2) the pagedaemon. The pagedaemon
is responsible for examining the state of the page-table and deciding which pages are to be moved to disk. Normally the swapper will not swap
out processes unless they have been sleeping for a long time, because the pager will first strip them of their inactive pages. It will begin to swap
out processes however, if the average load on the system is very high. (The load average is a number based on the kernel's own internal
accounting and is supposed to reflect the state of system activity.) This gives `cheap' processes a chance to establish themselves. It is the
pagedaemon which makes the paging decisions. By copying read-only segments to the swap area at load time, the running overhead of paging
out read-only data is removed, since the data are always where we need them in swap space and never change. In modernized versions of
UNIX, such as the Solaris systems by Sun Microsystems, read only pages from the code segment are thrown away when they are selected for
swap out and then read in from the filesystem if needed again. Moreover, data pages are only allocated swap space when they are forced out of
physical memory. These optimizations reflect the fact that modern systems have more physical memory than previously; also disks are getting
faster.
Let us now look more generally at how paging decisions are made. The most important aspect of paging is that pages can still be accessed even
though they are physically in secondary storage (the disk). Suppose a page fault occurs and there are no free frames into which the relevant data
can be loaded. Then the OS must select a victim: it must choose a frame and free it so that the new faulted page can be read in. This is called
(obviously) page replacement. The success or failure of virtual memory rests on its ability to make page replacement decisions. Certain facts
might influence these algorithms. For instance, if a process is receiving I/O from a device, it would be foolish to page it out, so it would
probably be locked into RAM while the I/O is in progress. Here are some viable alternatives for page replacement.
The simplest scheme is first-in first-out (FIFO): the page which has been in physical memory longest is chosen as the victim. This algorithm
has the advantage of being very straightforward, but its performance can suffer if a page is in heavy use for a long period of
time. Such a page would be selected even though it was still in heavy use.
A second possibility is to give each page a `reference bit' which is set by the hardware each time the page is accessed, and cleared by the
paging algorithm as it scans for a victim. The idea is that pages which are frequently used will have their bits set often and will therefore
not get paged out. Of course, this testing incurs an overhead. In the extreme case that all pages are in heavy use the page algorithm must
cycle through all the pages setting their bits to zero
before finding the original page again. Even then, it might not find a page to replace, if the bit was set again while it was looking through the
others. In such a case, the paging system simply fails.
Figure 5.7: LRU page replacement algorithm. When there is a tie, the algorithm uses
FIFO.
The least recently used (LRU) algorithm chooses as its victim the page which has gone the longest time without being referenced. Two possibilities for such an implementation are the following.
We record the time at which each page was last referenced. Unlike the FIFO
scheme above, this means that we have to update the time-stamp every single time
memory is referenced, instead of only each time a page is replaced. If the copying
operation takes, say, five CPU instructions (jump to update routine, locate page
table entry, load system clock time, store system clock time, return), this means, roughly speaking, that the system is slowed down by a factor of around five. This
is an unacceptable loss, so unless the memory management unit can do something
fancy in hardware, this scheme is not worth the system's time.
We keep a stack of page addresses, so that the page number of the most recently
accessed page is always on the top of the stack. Although this sounds cheaper in
principle, since the page replacement algorithm never has to search for a
replacement - it just looks on top of the stack - it still results in a large system
overhead to maintain the stack. We must update a data structure which requires
process synchronization and therefore waiting. Again, without special hardware,
this is not economical.
In practice, many systems use something like the second-chance algorithm above. The UNIX pagedaemon uses this approach.
5.2.4 Thrashing
Swapping and paging can lead to quite a large system overhead. Compared to memory speeds, disk access is quite slow - and, in spite of
optimized disk access for the swap area, these operations delay the system markedly. Consider the sequence of events which takes place when a
page fault occurs:
1. The hardware (or memory management software) traps and passes control to the operating system.
2. The state of the running process is saved.
3. The operating system locates the missing page in the swap area.
4. A free frame is found; if none is free, a victim page must be selected and paged out first.
5. A disk transfer is scheduled to read the page into the frame, and the CPU is given to another process while the transfer completes.
6. When the page has been read in, the page table is updated and the interrupted process resumes where it left off.
Because the heads of the disk move together on all surfaces, we can increase read-write efficiency by allocating blocks in parallel across all
surfaces. Thus, if a file is stored in consecutive blocks, on a disk with n surfaces and k sectors per track we could read n * k sectors
(a whole cylinder) without any head movement.
When a disk is supplied by a manufacturer, the physical properties of the disk (number of tracks, number of heads, sectors per track, speed of
revolution) are provided with it. An operating system must be able to adjust to different types of disk. Clearly the number of sectors per track is
not a constant from disk to disk, nor necessarily is the number of tracks. The numbers given are just a convention used to work out a consistent
set of addresses on a disk and may not have anything to do with the hard and fast physical limits of the disk.
To address any portion of a disk, we need a three component address consisting of (surface, track, sector).
5.3.4 Scheduling
The disk is a resource which has to be shared. It therefore has to be scheduled for use, according to some kind of queue system. If a disk only
had one customer at a time, a first-come first-served (FCFS) policy would be adequate. However, requests both to read and to write may come
randomly from any user process or from the system on a multitasking system and so we must think carefully about how to service them.
Since a disk is hardware, and involves mechanical movement, it can literally be destroyed by asking it to do too much. One of the aims of
scheduling a disk device is to minimize wear on the disk surface. Another aim is to maximize the speed of access. If the disk heads are being
asked to go backwards and forwards randomly many times a second, much time can be lost. Floppy disks are particularly susceptible to errors
caused by misalignment between disk and disk head. The more a head moves rapidly backwards and forwards, the more likely it is to miss its
intended location and misread data. When this happens the data have to be read again and the whole process takes much longer.
Hard disks are more robust than floppies, but the algorithms for scheduling the disk nevertheless take into account the physical issue of
movement.
5.3.4.1 FCFS
As always, the simplest option for scheduling is the first-come first-served method. This can be thought of in two ways: i) that the first user to
obtain the disk gets to use it uninterrupted until his or her file access is finished, or ii) every individual disk access can be scheduled on a FCFS
basis. On a busy system, ii) can lead to wild thrashing of the disk heads as different processes first try to move them one way and then another.
The AmigaDOS system (at least up to 1.3) suffered from this problem even if there were only two processes. The system tried to time-share the
disk which resulted in a more than fifty percent loss in performance. The user could wait for minutes while the system tried to thrash out a job
which could have taken seconds if one job had been allowed to complete first without interruption.
5.3.5 Partitions
For the purposes of isolating special areas of the disk, most operating systems allow the disk surface to be divided into partitions. A partition
(also called a cylinder group) is just that: a group of cylinders which lie next to each other. By defining partitions we divide up the storage of
data into special areas, for convenience.
For instance, it is quite normal to keep the system software in one partition and user data in another partition. That way, when one makes a
back-up of the disk, user data can easily be kept separate from system data. The separation becomes a hardware matter.
Partitions are supported on MS-DOS, Macintosh, BSD UNIX, AmigaDOS etc. Remarkably, there are versions of System V UNIX which do not
support partitions. BSD UNIX partitions are a good example, and since we are focussing on UNIX we shall discuss its partitions in more detail.
BSD UNIX uses a special convention for the partitions on each disk. Each disk may have up to eight logical partitions which are labelled from
a to h.
Partition   Usage
a           root partition
b           swap partition
c           the whole disk
d           anything
e           anything
f           anything
g           anything
h           anything
Each partition is assigned a separate logical device and each device can only write to the cylinders which are defined as being its own.
Partitions can overlap, because they are just limits. Thus, if we read from logical device c, which is defined as the whole disk, we could, in
principle read from the whole disk, whereas if we use logical device b we may only read from the swap partition.
To use a partition we have to create a filesystem on it. This involves reserving workspace for the operating system and laying down suitable
markers for navigating over the surface of the disk. Since partitions are defined for convenience, it does not matter that they overlap. What is
important is that the filesystems on two partitions do not overlap! This is extremely important. If two filesystems overlap, they will destroy each other!
In BSD UNIX, partitions are created by editing a table which is downloaded into the device driver. In Sun's SunOS and Solaris operating
systems, a special command called format is used to make partitions. The newfs command is used to create a filesystem.
Once a partition has been created, it has to be mounted in order to be reachable from the directory structure of the filesystem. The mount action
is analogous to the opening of a file. On the Macintosh and Amiga operating systems, new disks are immediately sensed by the system and are
mounted. In the Macintosh case (which has only a pictorial graphical user interface) new partitions or disks are mounted on the desktop at the root
level. Under AmigaDOS, each new partition becomes a logical device and is given a logical device name which identifies the disk. If the
Workbench (graphical user interface) is running, the disks appear together with their device names on the workbench in the same way as the
Macintosh. Otherwise they appear in the mountlist.
In UNIX a partition is mounted using the command mount. For example a command like
mount /dev/sd0g /user-data
would mount partition g on disk number zero onto the directory /user-data. The result would be that all files on that partition would
appear under the directory /user-data. A prerequisite for mounting a UNIX partition is that the partition must contain a filesystem.
5.3.6 Stripes
In recent years some UNIX systems (particularly Hewlett-Packard) have experimented with disk striping. Disk striping is a way of increasing
the disk transfer rate by splitting the data from a given file across several different disks, instead of saving it all on one. Because the disks'
heads can now search independently, the speed of transfer is, in principle, increased by a factor of the number of disks used.
Figure 5.10: Disk striping: files are spread in parallel over several disks.
The UNIX system does not make any particular distinction on the basis of filenames. Instead it keeps a flag to say whether a file is executable
or not. If a file is not marked as executable, UNIX will not try to run it. If it is marked executable, it will try to execute the file. If the file is not
a valid binary program, it will fail. Executable binary files must conform to a certain protocol structure which identifies them to the operating
system as being fit for execution. If a text file is marked executable, UNIX will try to pass the lines of the file to the command line interpreter
or shell.
Certain files in the UNIX operating system are not really files at all but `handles' to devices. They are called device nodes. A device node is a
way `into' a device through a filesystem interface. It is convenient to be able to use normal filing commands to access devices. Not all devices
can be accessed in this way, but the interface is useful for those that can. In the Solaris 2 operating system, the kernel process list is represented
as such a directory of pseudo-files.
For user convenience, the file command attempts to guess the contents of UNIX files, but this is not used by the system.
When a file is opened, the system must:
1. see whether the file is inaccessible, because we do not have permission to open it.
2. see whether the file is inaccessible because it is being used by another user. When
we open a file for writing, a lock is placed on the file to prevent others from
writing to it simultaneously. This lock is removed by the close operation.
3. obtain pointers to where the file exists physically within the secondary storage
and set up a data structure called a filehandle which the system will use to
describe the state of the file as we use it.
4. set up any cached data which might be used by the OS.
Once a file is open, the system must present the user with a consistent picture of the filesystem. When a user program reads lines from a file, a
pointer should be advanced so that every line is read exactly once. An end-of-file condition should be signalled when the last of the data has
been read, etc. These are all aspects of an agreed protocol defined by the filesystem.
A more complex situation is the following. Suppose one user is reading a file and another user wants to write to it.
1. Should the user be allowed to write to the file while someone else is reading it?
2. If so, should the reader be able to see the changes immediately, or only after the
writer has closed the file?
There are two possibilities - either all users see changes immediately, or only users opening files after the changes were made see the changes.
Both versions of this are in use by different filesystem implementations. In the latter case, the OS has to keep several copies of the file until all
file handles are released and everyone agrees about the contents of the file.
It is difficult to say that one or the other type of behaviour is more correct. This is largely a subjective issue. What is important is that the
filesystem defines its behaviour and sticks to it consistently. The behaviour of the filesystem is often called filesystem semantics.
1. Linked lists. Each block of data includes a pointer to the next block of data in a
linked list. The difficulty with this method is that each block must be read in the
correct order and the blocks might be spread randomly over the disk. Thus the
retrieval of a file could require a lot of disk head movement which is slow.
2. Linked table. A linked list of blocks for each file is stored in a file allocation
table. All of the pointers for every file are collected together in one place. This
table could also be cached in RAM for faster access. This method is used by MS-DOS and a number of other microcomputer operating systems.
3. Indexing. Each file has an index containing a list of blocks which contain the file
itself. This index might be stored anywhere in principle. Space for it is normally
allocated on the disk itself, inside reserved disk blocks, and partly inside an index
table which is built when the filesystem is created. The index blocks are grouped
in one place for convenient access. This system is used in UNIX. Since the index
table must contain pointers to disk blocks, a way of storing the pointers must be
found. If the list is small and is held in a filesystem block, then most of the block
will be wasted. This is a drawback of the index method, but the main advantage of
this method is that it has few limitations.
Although the inodes can span an address space which is larger than 2^32 bytes, internal pointers in the file structures are still 32-bit (except in
OSF/1), and so a limitation to 2^32 bytes is imposed by the word size of the system hardware.
Exercises
1. Go back and think about shared libraries, in view of what you have learned about
logical, physical and virtual memory. What are the advantages and disadvantages
of shared libraries?
2. Write programs to code the page replacement algorithms discussed above.
Project
Write a program to model the behaviour of a hard disk. A disk drive contains a stepper motor which pushes the head one track at a time. You
can model the tracks and sectors of the disk as an array.
const int tracks = 20;
const int sectors_per_track = 20;
const int heads = 2;
const int bytes_per_sector = 64;
char harddisk[tracks][sectors_per_track][heads][bytes_per_sector];
Write a device-driver program which moves the head of the disk according to the LOOK scheme. You can choose yourself whether you base it
upon SCAN or CSCAN.
Why is this array not exactly like a disk? (Hint: think geometry.)
Suppose you have three files of data, two short and one long. Design a simple filesystem so that you can do the following:
1. Save the two short files onto your `virtual disk' from the real disk.
2. Retrieve the files again. Make sure that you can retrieve the files by name, as
many times as you like.
3. Delete the first file.
4. Save the longer file now, using the space that was freed when you deleted the
shorter file in step 3.
5. Plot the head movement of your disk on the screen using track number for the
horizontal axis against time vertically, so that the output looks something like the
following.
Track ->
1 2 3 4 5 6 7 8 9 ...
*
  *
    *
      *
        *
          *
            *
              *
            *
          *
        *
      *
Time is measured in units of one head movement - one click of the stepper motor.
Show how the head moves when you save and retrieve your files. Hint: use
separate output files to print the result of the head movement and the result of
retrieving a file.
Note that you will have to design a `protocol' for saving the data into the array. The disk array is just an array of characters, so if you want to
save a file, you need to know which data correspond to which file.
Hint: you might want to limit the filename size to, say, eight characters to make the problem easier, like in DOS.
Explain carefully how you locate files on your disk, and what scheme your filesystem uses to recover files in the correct order.
6. Networks: Services and protocols
6.1 Services: the client-server model
A network allows a number of machines to share resources, for example:
The printer,
User authentication data (password database),
Disks holding user data,
A reference clock which can be used to set the local clocks on all systems,
Addresses and telephone numbers.
To some extent, this idea of sharing was the idea behind multi-user systems. Not everyone can afford their own, so we share. What big
multiuser mainframe machines have taught us, however, is that a single monolithic computer with
Figure 6.1: The client server model. The client and the server need not be on the same
machine when there is a network present.
We have already encountered this kind of behaviour before in connection with system calls. The system kernel is a kind of server, which
provides I/O services for all the processes on a system. Also daemons, in the UNIX terminology, are servers which answer requests or perform
some house-keeping on behalf of other processes. The key idea is that there are always two elements: clients and servers.
On a network, we would like to arrange things so that the server and the client might be anywhere - on the same machine or on different
machines. We would like a flexible system for sending out requests for services into a network and getting an answer without having to know
too much about where the services are coming from.
To achieve these aims, we need:
Well-known ports. Every computer, all over the world, has to agree on the port
numbers for different services. A well-known port is a port number which is
reserved for a well-known service like ftp or telnet. It has been registered in a
world-wide register.
RPC program numbers. Historically, we distinguish between services and RPC,
although the effect of the two is the same. The system of calling RPC services is
different from that of normal services: it uses program numbers first, and works
out port numbers for itself.
On UNIX, the mapping from service names to port numbers is kept in the file /etc/services, which looks like this:
#
# Network services, Internet style
# This file is never consulted when the NIS are running
#
tcpmux          1/tcp                           # rfc-1078
echo            7/tcp
echo            7/udp
...
ftp             21/tcp
telnet          23/tcp
smtp            25/tcp          mail
time            37/tcp          timserver
time            37/udp          timserver
name            42/udp          nameserver
whois           43/tcp          nicname         # usually to sri-nic
domain          53/udp
domain          53/tcp
hostnames       101/tcp         hostname        # usually to sri-nic
sunrpc          111/udp
sunrpc          111/tcp
...
login           513/tcp
shell           514/tcp         cmd             # no passwords used
printer         515/tcp         spooler         # line printer spooler
courier         530/tcp         rpc             # experimental
uucp            540/tcp         uucpd           # uucp daemon
biff            512/udp         comsat
who             513/udp         whod
syslog          514/udp
talk            517/udp
route           520/udp         router routed
ingreslock      1524/tcp
bootpc          68/udp
bootp           67/udp          bootps
The file maps named services into port numbers and protocol type. The protocol type is also an agreed standard which is defined in the file
/etc/protocols, which looks like this:
#
# Internet (IP) protocols
# This file is never consulted when the NIS are running
#
ip      0       IP      # internet protocol, pseudo protocol number
icmp    1       ICMP    # internet control message protocol
igmp    2       IGMP    # internet group multicast protocol
ggp     3       GGP     # gateway-gateway protocol
tcp     6       TCP     # transmission control protocol
pup     12      PUP     # PARC universal packet protocol
udp     17      UDP     # user datagram protocol
hmp     20      HMP     # host monitoring protocol
xns-idp 22      XNS-IDP # Xerox NS IDP
rdp     27      RDP     # "reliable datagram" protocol
In order to open a socket, we must know the name of the host on which the server lives. If we don't know this information in advance, we can
send a broadcast request to all hosts, hoping that one of them will reply with their correct address (see next chapter).
Also, when the message arrives at a host which runs the server process, there are two possibilities: either the server process runs all the time as a daemon, listening to its port, or it is started on demand by the inetd daemon, which is configured by a file like this:
#
# Configuration file for inetd(8).  See inetd.conf(5).
#
# Internet services syntax:
#  <service_name> <socket_type> <proto> <flags> <user> <server_pathname> <args>
#
# Ftp and telnet are standard Internet services.
#
ftp     stream  tcp     nowait  root    /usr/etc/in.ftpd        in.ftpd
telnet  stream  tcp     nowait  root    /usr/etc/in.telnetd     in.telnetd
#
# Shell, login, exec, comsat and talk are BSD protocols.
#
shell   stream  tcp     nowait  root    /usr/etc/in.rshd        in.rshd
login   stream  tcp     nowait  root    /usr/etc/in.rlogind     in.rlogind
exec    stream  tcp     nowait  root    /usr/etc/in.rexecd      in.rexecd
comsat  dgram   udp     wait    root    /usr/etc/in.comsat      in.comsat
talk    dgram   udp     wait    root    /usr/etc/in.talkd       in.talkd
inetd listens on the network for service requests for all of the daemons which are in its configuration file, and - if such a request arrives - it
starts the server for the duration of the request.
Notice the fields `wait' and `nowait'. These tell inetd what to do if another request arrives while one request is being processed: should it
wait for the first request to finish (single-threaded) or should it start several processes (multi-threaded) to handle the requests?
For example, instead of typing

finger mark@mymachine

to get information on user mark from the finger database, we could contact the well-known port on host `mymachine' as follows:

telnet mymachine finger

Directory: /mn/anyon/u2/mark
Shell: /local/bin/tcsh
On since Aug 14 11:59:39 on ttyp1 from :0.0
17 minutes Idle Time
Mail last read Sun Aug 14 14:27:02 1994
No Plan.

Or had finger not been in /etc/services, we could have written

telnet hostname 79
Not all services accept textual input in this way, but telnet will try to contact their ports nevertheless.
6.6 X11
The X11 window system, used by Unix, is a client-server based application. A user's workstation runs a server process called X, whose job it is
to display windows on the user's display. Each application the user wishes to run is a client which must contact the server in order to have its
output displayed on the X-display.
Exercises
1. Explain what `protocol' means.
2. Describe briefly the client-server model.
3. What role do daemons play with respect to the UNIX kernel? Why are servers
daemons?
Project
Make a simple client-server model which communicates via UNIX files. The server should be sent an arithmetic problem to solve.
The client should send this request to the server, and the server should send back the answer. The client must be able to exit gracefully if the
server does not answer for any reason. (Hint: you could use the `sleep' command to wait for the server to reply.)
You will need to think of the following:
1. What filenames should you use to send messages from the client to the server and
from the server to the client?
2. Since the client and the server are independent processes, you need to find a way
of discovering when the client and the server have finished writing their replies,
so that you don't read only half of the answer by mistake.
3. The server should loop around and around, waiting for multiple requests, while
the client sends only one request and exits when it gets a reply.
7. TCP/IP Networks
In the last chapter we looked at some of the high level considerations for enabling transparent communication over a network. The next thing to
look at is how such a scheme of protocols is achieved in practice.
7  Application layer
6  Presentation layer
5  Session layer        RPC / sockets
4  Transport layer      tcp or udp
3  Network layer        IP internet protocol
2  Data link layer      ethernet (protocols)
1  Physical layer       ethernet (electronics)
At the lowest level, the sending of data between two machines takes place by manipulating voltages along wires. This means we need a device
driver for the signaller, and something to receive the data at the other end - a way of converting the signals into bytes; then we need a way of
structuring the data so that they make sense. Each of these elements is achieved by a different level of abstraction.
1. Physical layer. This is the problem of sending a signal along a wire, amplifying it
if it gets weak, removing noise etc. If the type of cable changes (we might want to
reflect signals off a satellite or use fibre optics) we need to convert one kind of
signal into another. Each type of transmission might have its own accepted ways
of sending data (i.e. protocols).
2. Data link layer. This is a layer of checking which makes sure that what was sent
from one end of a cable to the other actually arrived. This is sometimes called
handshaking.
3. Network layer. This is the layer of software which remembers which machines are
talking to other machines. It establishes connections and handles the delivery of
data by manipulating the physical layer. The network layer needs to know
something about addresses - i.e. where the data are going, since data might flow
along many cables and connections to arrive where they are going.
4. Transport layer. We shall concentrate on this layer for much of what follows. The
transport layer builds `packets' or `datagrams' so that the network layer knows
what is data and how to get the data to their destination. Because many machines
could be talking on the same network all at the same time, data are broken up into
short `bursts'. Only one machine can talk over a cable at a time so we must have
sharing. It is easy to share if the signals are sent in short bursts. This is analogous
to the sharing of CPU time by use of time-slices.
5. Session layer. This is the part of a host's operating system which helps a user
program to set up a connection. This is typically done with sockets or the RPC.
6. Presentation layer. How are the data to be sent by the sender and interpreted by
the receiver, so that there is no doubt about their contents? This is the role played
by the external data representation (XDR) in the RPC system.
7. Application layer. The program which wants to send data.
As always, the advantage of using a layered structure is that we can change the details of the lower layers without having to change the higher
layers. Layers 1 to 4 are those which involve the transport of data across a network. We could change all of these without doing serious damage to
the upper layers - thus as new technology arrives, we can improve network communication without having to rewrite software.
Most of these layers are quite static - only the physical layer is changing appreciably.
4  Application layer   user program    (OSI layers 6,7)
3  Transport layer     tcp or udp      (OSI layers 3,4,5)
2  Internet layer      IP
1  Physical layer      Network
At the internet layer, we have the IP or internet protocol which includes a specification of addresses and basic units of data transmission. The
official name for the lowest level data `packages' in the internet protocol is datagrams. Each datagram consists of a number of 32 bit words.
The first six of these words make up the IP header.
7.2.1 udp
The user datagram protocol is called unreliable because when an application chooses this protocol, the best it can do is to `throw its data out to
the wind' and hope that it will arrive. When we use udp transport, there is no guarantee that data will arrive at the destination and no
confirmation of receipt is sent by the receiver.
It is called connectionless because the messages are sent one by one, without any concept of there being an on-going connection between the
sender and receiver. This is like sending a letter in the post.
Udp is the simpler of the two transport layer protocols, since it requires no handshaking by the system. It is useful for applications which either
need or want to provide their own form of handshaking. For example, it would be natural to use the udp protocol for a `question-answer' type of
client-server system. The client knows that its question arrived if it gets an answer from the server, so asking the network protocols to guarantee
it would be a waste of time.
A single `message' of udp encapsulated datagrams is officially called a packet and is given a small header as shown in the figure below.
7.2.2 tcp
A single message of the transmission control protocol is called a segment. The tcp protocol is called reliable or connection-oriented because
sufficient handshaking is provided to guarantee the arrival and the ordering of the segments at their destination. The ordering of each message
implies a concept of two machines being in continual contact with one another. This is like a telephone conversation: both parties are in contact
all the time.
TCP connections are useful for sending data to servers, where no particular reply is required. For example, it would be used to send print jobs
to a printer queue across a network. The sender receives no reply from the print spooler, and wants every single line of data to arrive in the
correct order without having to worry.
The ethernet address is the only piece of information a machine has before it is configured. It is only used by diskless machines and some X-terminals as part of an ARP/RARP ((Reverse) Address Resolution Protocol) request to get an IP address.
Each host also has a numerical internet (IP) address of the form

xxx.yyy.zzz.mmm

where xxx etc. can be numbers from 0 to 255 (certain addresses are reserved). In addition to a numerical address, each host has a name. For
example, the following are valid internet addresses and the names they correspond to.
129.240.22.14     anyon.uio.no
128.39.89.10      nexus.iu.hioslo.no
192.48.96.9       ftp.uu.net
The addressing scheme is based on a hierarchical splitting of networks, subnets and hosts. To arrive correctly at its destination an IP packet has
to know exactly which network a host is connected to. This information is correctly coded into the numerical address, but is not contained
directly in the textual name form.
In the textual form, each machine belongs to a logical domain which has a name. The address takes the form
machine.domain-name
Thus in the above examples, `anyon', `nexus' and `ftp' are host names and `uio.no', `iu.hioslo.no' and `uu.net' are domain names.
The numerical form is strictly speaking a combination of a network address and a host address. The textual form is a combination of a
hostname and a domain name.
There is a subtle difference between these. Given a numerical IP address, datagrams can find their way precisely to the correct network and
machine. The textual information is not sufficient however because, while the hostname is unique, the remainder of the address (the domain
name) is usually a generic name for a group of networks - and we don't know how to choose the right one.
A logical domain, like the above examples, can encompass any number of different networks. For example, the domain name `uio.no'
encompasses all of the subnets under the address 129.240.*.*. The IP packets need to know which subnet the machine is on in order
to get to their destination, because the text name only says that they should go to 129.240.*.host. The * is unknown.
To complete this information, we need a database which maps internet domain names to internet addresses. This mapping is performed by the
Domain Name Service (DNS) or Berkeley Internet Name Domain (BIND) which we shall discuss below.
Figure 7.7: The netmask sets the division between network address and host address in the IP address.

With this netmask, only a limited number of different machines can belong to the domain with address xxx.yyy.zzz. If we wanted more, we would have to introduce a
different domain name for the extra hosts!
If we wanted more machines on each subnet, we would have to change the netmask and the definitions of the domain address. By making the
netmask 255.255.248.0, as in the figure above, we add an extra bit to the host part, doubling the number of possible hosts.

The Domain Name Service takes a request containing a textual name of the form

host.domain
and returns the numerical form of the IP address for that host. This is called resolving the name. Notice that no two machines in the same
domain may have the same name, otherwise the DNS would not be able to resolve the IP address from the textual form.
The DNS also performs the reverse service, converting numbers into names and stores extra information about which hosts are mail-exchangers
etc. The UNIX program nslookup can be used to browse in the Domain Name Service.
The domain name service is a daemon, called a nameserver, which runs on some chosen host (usually a UNIX machine, since the software was
written for UNIX) and looks up names in a database. The host on which the nameserver runs is often called a nameserver too.
Each server covers only the list of hosts in its local domain, not those of other domains - but it has a knowledge of other nameservers which can
answer queries in other domains. If a nameserver receives a request to resolve a name which is not in its own domain, it forwards the request to
the official nameserver for that domain.
Nameservers constantly update one another with the addresses of the official nameservers for each domain, so that the data are always up to date.
Each new network which is set up on the internet has to register its nameserver centrally so that this information is complete.
Every host on a network must know the name of its local nameserver in order to send requests for name resolution.
The DNS software which is in most widespread use is the Berkeley BIND software (Berkeley Internet Name Domains). Microsoft have their
own implementation called WINS (Windows internet nameservice) as their own commercial solution but this will soon be abandoned in favour
of DNS, since it lacks adequate functionality and security.
In the UNIX filesystem, a user must obtain a lock on a file in order to read or write to it. In NFS, the same system applies. A lock is obtained
from a lock server on the host where the real disk filesystem lies and the state of the filesystem is communicated by a state server. NFS is
sometimes called a stateless protocol, but this is a misleading title. The state of the filesystem on the server is maintained on the server which
owns the filesystem. If there is a crash, the server tries to reestablish the locks it held before the crash. If this is not possible because the
filesystem has changed in the meantime or because of unfortunate timing, the result is a `stale NFS filehandle' - an unrecoverable error. The
state information has to be cleared and restarted.
NFS is called stateless because the server does not record the requests which the client makes (except for locks). The server processes requests
without caring about which client it is serving, or about what it has done in previous requests. It doesn't know how much of a file the client has
read. In other words, it is the client's responsibility to keep track of what requests it sends to the server and whether or not it gets a reply.
NFS version 3 is now in use by some vendors and includes a number of improvements (and a few new bugs) over NFS. These include better
caching, access control lists (ACLs) etc.
Malicious users may try to hack into the system to destroy it.
Power failure might bring the system down.
A badly designed system may allow a user to accidentally destroy important data.
A system may not be able to function any longer because one user fills up the
entire disk with garbage.
Although discussions of security usually concentrate on the first of these possibilities, the latter two can be equally damaging to the system in
practice. One can protect against power failure by using un-interruptable power supplies (UPS). These are units which detect quickly when the
power falls below a certain threshold and switch to a battery. Although the battery does not last forever, the UPS gives a system administrator
a chance to halt the system by the proper route.
The problem of malicious users has been heightened in recent years by the growth of international networks. Anyone connected to a network can
try to log on to almost any machine. If a machine is very insecure, they may succeed. In other words, we are no longer looking only at our local
environment; we must expect potential threats to system security to come from any source.
The final point can be controlled by enforcing quotas on how much disk each user is allowed to use.
The user.
The system administrator.
The system designer.
Many would prefer to write this list upside down - but we must be practical. Usually we are not in a position to ring the system designer and
say `Hey, that system module you wrote is not secure, fix it!'. The response would at any rate take some time. Rather, we have to learn to take
the system as it comes (pending improvements in later releases) and make the best of it. All users of the system should be aware of security
issues.
Ideally, if all users were friendly and thoughtful, everyone would think about the welfare of the system and try to behave in a system-friendly
way. Unfortunately some users are not friendly, and accidents can happen even to friendly users.
root:99UaPHtxon3uk:0:1:Operator:/:/bin/csh
sundiag:*:0:1:System Diagnostic:/usr/diag/sundiag:/usr/diag/sundiag/sundiag
sysdiag:*:0:1:Old System Diagnostic:/usr/diag/sysdiag:/usr/diag/sysdiag/sysdiag
daemon:*:1:1::/:
sys:*:2:2::/:/bin/csh
bin:*:3:3::/bin:
uucp:*:4:8::/var/spool/uucppublic:
news:*:6:6::/var/spool/news:/bin/csh
audit:*:9:9::/etc/security/audit:/bin/csh
nobody:*:65534:65534:::
+@NISgroup::0:0:::
The fields of the file are: the login name, the encrypted password, the user ID, the group ID, the user's full name, the home directory and the login shell.
When a user types in his or her password, the system does not try to decrypt the password, but rather encrypts the password and compares the
coded forms. The reason for this is that there is no (publicly) known algorithm for decoding passwords encrypted using crypt(). Just to
reverse the process would take hundreds of thousands of years of CPU time.
To encrypt a password, crypt() takes the password string and a random number, known as a salt. To check a password, the system takes the first two
characters of the encrypted password as the salt, and compares the result of the crypt function with the encrypted form from the password
file.
Elaborate programs have been written to try to crack passwords in this way. Such programs are useful tools for the system administrator who
should keep an eye on which users have poor passwords. It is better that the system administrator finds a bad password before a malicious user
does.
On newer UNIX systems, passwords are stored in a shadow password file which is not /etc/passwd but a different non-readable file.
Since normal users cannot read this file, they can only try to log onto other users' accounts by trial and error. They cannot compare an
encrypted list of their own to the password file.
The superuser account exists to maintain the system and deal with special circumstances which arise. For safety, the superuser is normally mapped to the user `nobody' (with no name
and no rights) across a network. This means that the superuser has rights only on the local machine. To get rights on another machine
across a network, either special permission must be given by the remote machine - or the user must be able to log onto the machine by knowing
the root password.
As another example of network security - or lack of it - let us consider also the X-windows system. X is a windowing system which is designed
to work transparently over a network. X works by connecting to a server, anywhere on the network. Normally the X-server only allows the
machine on which it is running to access the display, but in a network situation it is not unusual to find users logged in on several different
machines. Such a user wants all the windows to appear on his or her workstation, so the X server allows certain other named hosts to open
windows on its display.
Before the introduction of the xauthority mechanism, all security was based on the xhost program. This was host based meaning that anyone
using a named host could open windows on the server. Many users do not understand the X system (which is quite complex) and simply disable
access control by calling xhost +. This allows any host in the world to connect to the user's server. In practice, this means that anyone in
the world can view the picture on such a user's screen.
Many programs have not adopted the xauthority system, which is user based, and so the xhost problem is still widespread.
An example of a setuid program is the ps program. ps lists all of the processes running in the kernel. In order to do this it needs permission
to access the private data structures in the kernel. By making ps setgid root, it allows ordinary users to be able to read as much as the writers
of ps thought fit, but no more.
Naturally, only the superuser can make a file setuid or setgid root.
Next, we have the problem of what to do with setuid programs which are read across the network. If we mount a filesystem across a network,
we have no control over what goes into the file. Suppose then a stupid system administrator, angry at the world and dying for revenge, made a
setuid root program which executed every command every user gave to it - then suddenly everybody who accessed this file over the network
would have root access on their local machine!
Clearly careless setuid programs can be a security risk, so network-based filesystems give the option of disallowing setuid programs.
8.4 Backups
Accidents happen even to the most careful users. Users delete files without meaning to, power failure leads to disk corruption, software bugs
can delete files, system administrators can make mistakes - and of course someone might actually steal your computer!
User data are the most important part of a computer system - anything else can be replaced. New disks can be bought, software can be loaded in
afresh - but once user data are gone, they are gone. It is therefore important to backup user data regularly. From a network vantage point, it is
useful to be able to take backups centrally. In BSD UNIX, this can be done using the rdump command.
Backing up data is expensive - both in terms of man-hours and in the cost of storage media. Some systems use secondary disks to keep backups
of important data. The cheaper alternative is to use tape, which comes in many forms. How often should backups be made?
Every night. The most important data should be backed up at least as often as
significant changes are made.
Every week. Less important data might only be worth backing up once a week.
Every month. For convenience you might want to record the setup of your system
software once in a while - even though this can be loaded in again from source.
Backup software is usually intelligent enough to be able to extract only files which have been modified since the last backup, so daily backups
need not copy every file every night.
How long should you keep a backup? It might take some time to discover that a file is gone. How long you keep backup tapes depends on how
highly you value your data. A year is not an unreasonable length of time.
The UNIX netstat command lists the connections currently open to and from a machine (the output below is abbreviated):

Proto Recv-Q Send-Q  Local Address   Foreign Address      (state)
tcp        0      0  saga.1094       xantos-7.6000        ESTABLISHED
tcp        0      0  saga.1086       xantos-4.6000        ESTABLISHED
tcp        0      0  saga.1023       anyon.uio.no.login   ESTABLISHED
tcp        0      0  saga.1080       xantos-4.6000        ESTABLISHED
...
This gives an indication of who is currently connected. Of course, intruders could connect when you are not watching, so another thing to do is
to monitor all the connections made to your machine continuously and dump the result to a file. This requires a considerable amount of storage
and some skill in interpreting the data. The program tcpdump will do this. Sun have their own version called etherfind.
On the other hand, we cannot live in a perpetual state of paranoia, thinking that everyone is out to get us. A balance must be struck by taking all
reasonable precautions and being aware of the problem. Finally, the super-user should never install software which is of suspicious or unknown
origin.
8.6 Firewall
One way of designing a network to protect it from attack is to use a machine as a ``firewall''. That is - a barrier to stop the spread of network
threats. The idea is to isolate important machines by placing another highly secure machine between the outside world and the local network.
The firewall is the only machine which is connected to a wide area network. It is also connected to the local area network, but it does not
forward packets to the local network and vice versa. Thus sensitive data can be hidden behind the firewall, where they can be shared on the
local network but not by the external network.
Where next?
There are many topics which have only been covered superficially in this introduction. A deeper understanding of networks and system
administration can be found in
http://www.iu.hioslo.no/~mark/lectures
Glossary
Index