Computer Communications 23 (2000) 267–289
www.elsevier.com/locate/comcom
Operating system support for multimedia systems
T. Plagemann a,*, V. Goebel a, P. Halvorsen a,1, O. Anshus b,2
a UniK-Center for Technology at Kjeller, University of Oslo, Norway
b Department of Computer Science, Princeton University, Princeton NJ, USA
Abstract
Distributed multimedia applications will be an important part of tomorrow’s application mix and require appropriate operating system
(OS) support. Neither hard real-time solutions nor best-effort solutions are directly well suited for this support. One reason is the co-existence
of real-time and best effort requirements in future systems. Another reason is that the requirements of multimedia applications are not easily
predictable, like variable bit rate coded video data and user interactivity. In this article, we present a survey of new developments in OS
support for (distributed) multimedia systems, which include: (1) development of new CPU and disk scheduling mechanisms that combine
real-time and best effort in integrated solutions; (2) provision of mechanisms to dynamically adapt resource reservations to current needs; (3)
establishment of new system abstractions for resource ownership to account resource consumption more accurately; (4) development of new
file system structures; (5) introduction of memory management mechanisms that utilize knowledge about application behavior; (6) reduction
of major performance bottlenecks, like copy operations in I/O subsystems; and (7) user-level control of resources including communication.
© 2000 Elsevier Science B.V. All rights reserved.
Keywords: Operating systems; Multimedia; Quality of service; Real-time
1. Introduction
Distributed multimedia systems and applications already play an important role today and will be one of the cornerstones of the future information society. More specifically, we believe that time-dependent data types will be a natural part of most future applications, just as time-independent data types are today. Thus, we will in the future not differentiate between multimedia and non-multimedia applications, but rather between hard real-time, soft real-time, and best-effort requirements for performance aspects like response time, delay jitter, synchronization skew, etc.
Obviously, all system elements that are used by applications, like networks, end-to-end protocols, database
systems, and operating systems (OSs), have to provide
appropriate support for these requirements. In this article,
we focus on the OS issues on which applications, end-to-end
protocols, and database systems directly rely. For simplicity, we use in this article the notion of multimedia systems and
* Corresponding author. Tel.: +47-64844733; fax: +47-63818146.
E-mail addresses: [email protected] (T. Plagemann), [email protected] (V. Goebel), [email protected] (P. Halvorsen), [email protected] (O. Anshus).
1 This research is sponsored by the Norwegian Research Council, DITS program under contract number 119403/431 (INSTANCE project).
2 On leave from the Computer Science Department, University of Tromsø, Norway.
applications, which also comprises distributed multimedia systems and applications.
The task of traditional OSs can be seen from two perspectives. In the top–down view, an OS provides an abstraction
over the pure hardware, making programming simpler and
programs more portable. In the bottom–up view, an OS is
responsible for an orderly and controlled allocation of
resources among the various executing programs competing
for them. The main emphasis of resource management in commodity OSs, like UNIX or Windows systems, is to distribute resources among applications to achieve fairness and efficiency. These time-sharing approaches work in a best-effort manner, i.e. no guarantees are given for the execution of a program other than to execute it as fast as possible while preserving overall throughput, response time, and fairness.
Specialized real-time OSs, in turn, emphasize managing resources in such a way that tasks can be finished within guaranteed deadlines. Multimedia applications are often characterized as soft real-time applications, because they require support for timely correct behavior. However, deadline misses do not necessarily lead to catastrophic consequences, even though the Quality of Service (QoS) degrades, perhaps making the user annoyed.
Early work in the area of OS support for multimedia
systems focussed on the real-time aspect to support the
QoS requirements of multimedia applications. Traditional
real-time scheduling algorithms, like Earliest Deadline
0140-3664/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved.
PII: S0140-3664(99)00180-2
Fig. 1. Periodic task model.
First (EDF) and Rate Monotonic (RM), have been adopted
for CPU (and disk) scheduling. These scheduling mechanisms are based on a periodic task model [1]. In this model,
tasks are characterized by a start time s, when the task
requires the first execution, and a fixed period p in which
the task requires execution time e with deadline d (see
Fig. 1). Often, the deadline is equal to the end of the period.
An 8 bit, 8 kHz PCM-encoded audio stream is a good example of such a periodic task: the constant sampling rate and the fixed sample size generate a constant bit stream.
In order to handle the stream more efficiently, samples of typically 20 ms are gathered in a packet. Thus, the system has to handle a packet in each period, i.e. p = 20 ms, before the next period. The fixed packet size requires a constant execution time e per period. This periodic task model is attractive from an engineering point of view, because it makes it possible to predict the future: in period k, which starts at s + (k − 1)p, the task with execution time e has to be finished before s + (k − 1)p + d.
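As a concrete sketch (our own illustration, not code from the paper), the periodic task model and the classic EDF utilization test, which admits a task set on one CPU iff the total utilization Σ e/p does not exceed 1 when deadlines equal periods, can be expressed as follows; the 2 ms execution time per audio packet is an assumed figure:

```python
# Illustrative sketch of the periodic task model of Fig. 1: a task
# starts at s, repeats with period p, needs execution time e per
# period, and has relative deadline d within each period.
from dataclasses import dataclass

@dataclass
class PeriodicTask:
    s: float  # start time of the first period
    p: float  # period
    e: float  # execution time required per period
    d: float  # relative deadline within each period

    def deadline(self, k: int) -> float:
        """Absolute deadline of period k (k = 1, 2, ...):
        s + (k - 1) * p + d."""
        return self.s + (k - 1) * self.p + self.d

def edf_feasible(tasks) -> bool:
    """Classic EDF feasibility test on one CPU (deadlines equal to
    periods): the task set is schedulable iff sum(e / p) <= 1."""
    return sum(t.e / t.p for t in tasks) <= 1.0

# 8 bit, 8 kHz PCM audio packetized every 20 ms; the 2 ms execution
# time per packet is an assumed figure for illustration.
audio = PeriodicTask(s=0.0, p=20.0, e=2.0, d=20.0)
print(audio.deadline(3))      # 60.0 ms: deadline of the third packet
print(edf_feasible([audio]))  # True: utilization is only 0.1
```

The admission decision here is exactly the kind of prediction the periodic model enables: given s, p, e, and d, every future deadline is known in advance.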
However, experiences with multimedia applications
including efficient variable bit rate (VBR) coding schemes
for video, like H.261, H.263, MJPEG, and MPEG, lead to
the conclusion that it is not that easy to foresee future
requirements. Video frames are generated at a fixed frequency (or period), but the size of the frames and the execution time to handle each of these frames are not constant [2,3]. They vary on a short time scale, between the
different frames, e.g. I, B, or P frames in MPEG, and on a
larger time scale, e.g. due to scene shifts such as from a
constant view on a landscape to an action scene. Furthermore, the degree of user interactivity is much higher in
recent multimedia applications, e.g. interactive distance
learning, than in earlier applications, like video-on-demand
(VoD). It is very likely that this trend will continue in the
future and make resource requirements even harder to
predict.
The latest developments in the area of multimedia OSs still emphasize QoS support, but often integrate adaptability and support both real-time and best-effort requirements.
Furthermore, new abstractions are introduced, like new
types of file systems and resource principals that decouple
processes from resource ownership. Finally, main
bottlenecks, like paging, copy operations, and disk I/O,
have been tackled to fulfill the stringent performance
requirements.
It is the goal of this article to give an overview of recent
developments in OS support for multimedia systems. OS
support for multimedia is an active research area, and therefore, it is not possible to discuss all particular solutions in
depth. Instead, we introduce for each OS aspect the basic
issues and give an overview and a classification of new
approaches. Furthermore, we describe a few examples in
more detail to enable the reader to grasp the idea of some
new solutions. However, for an in-depth understanding, the
reader has to refer to the original literature, because
this article is intended as a survey and to serve as an
“entry-point” for further studies.
The rest of this article is structured as follows: Section 2
discusses general OS developments, and Section 3
summarizes the requirements of multimedia applications.
The basic dependency between resource management and
QoS is discussed in Section 4. Management of the system resources, like CPU, disk, main memory, and other system resources, is discussed in Sections 5–8. New approaches to
overcome the I/O bottleneck are presented in Section 9.
Section 10 gives some conclusions.
2. Operating system architectures
Traditionally, an OS can be viewed as a resource allocator
or as a virtual machine. The abstractions developed to
support the virtual machine view include a virtual processor
and virtual memory. These abstractions give each process
the illusion that it is running alone on the computer. Each
virtual machine consumes physical resources like physical
processor and memory, data-, instruction-, and I/O buses,
data and instruction caches, and I/O ports. However, instead of allowing a process to access resources directly, it must do so through the OS. This is typically implemented via system calls. When a process makes a system call, the call is given
to a library routine which executes an instruction sending a
software interrupt to the OS kernel. In this way, the OS gets
control in a secure way and can execute the requested
service. This path is too costly for some services, because trapping to the OS involves the cost of crossing the user–supervisor level boundary at least twice, and possibly also crossing address spaces at least twice if a context switch to another process takes place. In addition, there are costs associated with the housekeeping activities of the OS.
When several processes are executing, each on its own
virtual processor, they will implicitly interfere with each
other through their use of physical resources. Primarily,
they will affect each other’s performance, because applications are not aware of physical resources and of each other.
A multimedia application can face a situation where it does
not get enough resources, because the OS is not aware of each application's short- and long-term needs. This will typically happen when the workload increases.
The need to go through the OS to access resources and the
way the OS is designed and implemented results in a system
where low-latency communication is difficult to achieve. It is also difficult to either have resources available when they are needed by a process, or to have a process ready to execute
when the resources, including cooperating processes on
other processors or computers, are available or ready.
A traditional general-purpose OS, like UNIX or Windows
NT, is not a good platform for the common case needs of
multimedia applications. In these OSs, resource allocation
for each process is based on general purpose scheduling
algorithms, which are developed to provide a balance
between throughput and response time, and to provide fairness. These algorithms get some limited feedback from the
application processes on what they are doing, but basically,
the OS has little or no understanding of what the applications are doing and what their requirements are. Also, the
degree to which an application can directly control
resources in a secure way, or provide the OS with hints
for its resource requirements, has traditionally been very
limited.
There are several OS architectures in use today, of which the monolithic kernel and the μ-kernel architectures, or their hybrids, are the most common. In a monolithic OS kernel,
all components are part of a large kernel, execute in a
hardware-protected supervisory mode, and can use the
entire instruction set. Consequently, the kernel has total
control over all resources. User processes execute in user
mode and can therefore use only a limited instruction set
and have only a limited control over resources. A user
process cannot execute an instruction to switch to supervisory mode, enable or disable interrupts, or directly access
I/O ports. When a user process needs OS services, it
requests the service from the OS, and the kernel performs
the requested service. Two crossings between user- and
kernel-level take place, from the user process to the kernel
and then back again when the service has finished.
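The cost of these two user–kernel crossings can be made tangible with a small microbenchmark. This is our own illustrative sketch, not code from any system discussed here; whether `os.getpid()` really traps into the kernel on every call depends on the libc, so treat the numbers as rough indications only:

```python
# Our own microbenchmark sketch: compare a path that crosses the
# user-kernel boundary with one that stays in user mode.
import os
import time

def time_per_call(fn, n=100_000):
    """Average wall-clock time of n calls to fn, in seconds."""
    t0 = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - t0) / n

syscall_cost = time_per_call(os.getpid)      # enters the kernel
usercall_cost = time_per_call(lambda: None)  # pure user-level call
print(f"system call: ~{syscall_cost * 1e9:.0f} ns per call")
print(f"user call:   ~{usercall_cost * 1e9:.0f} ns per call")
```

On typical hardware the kernel-crossing path is noticeably more expensive than the plain function call, which is the overhead the architectures discussed below try to reduce.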
In the μ-kernel architecture, the OS is divided into a smaller kernel, with many OS services running as processes in user mode. This architecture is flexible, but has traditionally resulted in increased overhead. The kernel sends a request for service to the correct user-level OS process. This creates extra overhead, because typically four crossings between user- and kernel-level take place, i.e. from the user process to the kernel, from the kernel to the user-level service, from the user-level service to the kernel, and finally, from the kernel to the user process. This can also result in degraded memory behavior because of reduced instruction locality, giving an increased number of cache misses. In
Ref. [4], a comparative study of three OSs, including NetBSD (a monolithic UNIX) and Windows NT (a μ-kernel-like architecture), is presented. The monolithic NetBSD has the lowest overhead for accessing services. However, overall system performance can be significantly determined by specific subsystems, e.g. graphics, file system, and disk buffer cache; and in some cases Windows NT does as well as NetBSD in spite of the higher overhead associated with its μ-kernel architecture.
Library OSs (also referred to as vertically structured
systems) like the Exokernel architecture [5,6] and Nemesis
[7] have been proposed as an alternative to monolithic and μ-kernel OSs. The basic idea is that those parts of an OS that
can run at user-level are executed as part of the application
processes. The OS is implemented as a set of libraries
shared by the applications. The OS kernel can be kept
very small, and it basically protects and exports the
resources of the computer to the applications, i.e. leaving
it to the applications to use the resources wisely. This allows
for a high flexibility with respect to the needs of individual
applications. At the same time it gives processes more direct
control over the resources with better performance as a
potential result.
However, the results presented in Refs. [8–10] identify
several areas of significance for the performance of an OS
including the switching overhead between user and kernel
mode, switching between address spaces, the cost of interprocess communication (IPC), and the impact of the OS
architecture upon memory behavior including cache misses.
These papers show how a μ-kernel OS can be designed and implemented to be at least as efficient as systems using other architectures. The SUMO OS [11] describes how to improve μ-kernels by reducing the number of protection crossings and context switches, even though it is built on top of the Chorus OS.
Threads can be used as a way of reducing OS-induced overhead. Basically, threads can be user-level or kernel-level supported. User-level threads have very low overhead
[12]. However, they are not always scheduled preemptively,
and then the programmer must make sure to resume threads
correctly. User-level threads can also result in blocking the
whole process if one thread blocks on a system service.
Even if user-level threads reduce the internal overhead within a process, we still have no resource control between virtual processors. Kernel-supported threads have much
higher overhead, but are typically scheduled preemptively
and will not block the whole process when individual
threads block. This makes kernel-level supported threads
simpler to use in order to achieve concurrency and overlap
processing and communication. Kernel-level supported
threads can potentially be scheduled according to the process requirements, but this is typically not done. Split-level scheduling [13] combines the advantages of user- and kernel-level thread scheduling by avoiding kernel-level
traps when possible. Scheduler activations [12] is a kernel
interface and a user-level thread package that together
combine the functionality of kernel threads with the performance and flexibility of user-level threads.
Low context-switching overhead can also be achieved when using processes. For example, Nemesis uses a single
address space for all processes. This allows for a low
context-switching overhead, because only protection rights
must be changed.
The rapid development of commodity multiprocessors and clusters of commodity computers provides a scalable approach to separating virtual machines onto physical processors and memories, thereby reducing the interference between them. However, there are still shared
Fig. 2. Whiteboard stream.
resources that must be scheduled, including networks,
gateways, routers, servers, file systems, and disks.
3. Multimedia application requirements
In this section, we briefly discuss the requirements
that multimedia applications impose onto OSs. First, we
exemplify typical requirements by introducing a multimedia application that has been in productive use since 1993 for teaching regular courses at the University of Oslo. Afterwards, we
give a more general characterization of multimedia
application requirements.
3.1. Example: interactive distance learning
The main goal of the electronic classrooms is to create a
distributed teaching environment which comes as close as
possible to a traditional (non-distributed) classroom [14].
The classroom system is composed of three main parts: electronic whiteboards, an audio system, and a video
system. At each site there is at least one large electronic
whiteboard to display transparencies. The lecturer can
write, draw, and erase comments on displayed transparencies by using a light-pen.
The main element of the audio system is a set of microphones mounted evenly distributed across the ceiling in order to capture the voices of all participants. The video
system comprises three cameras and two sets of monitors in each classroom. One camera focuses on the lecturer, and
two cameras focus on the students. A video-switch selects
the camera corresponding to the microphone with the loudest input signal. Two monitors are placed in the front and
two monitors are placed in the back of each classroom
displaying the incoming and outgoing video information.
All video data is compressed according to the compression
standard H.261.
During a lecture, at least two electronic classrooms are
connected. Teacher and students can freely interact with
each other regardless of whether they are in the same or
in different classrooms. Audio, video, and whiteboard
events are distributed in real-time to all sites, allowing all
participants to see each other, to talk to each other, and to
use the shared whiteboard to write, draw, and present
prepared material from each site.
Detailed measurements are reported in [15] and show that
the audio system with 8 bit, 16 kHz PCM encoding generates a constant bitstream of 128 Kbit/s. The video stream,
however, varies between 100 Kbit/s and 1.4 Mbit/s, because
it depends on the activity in the classroom. The traffic
pattern of the whiteboard fluctuates even more, because it depends solely on the interactivity of the users, i.e. teacher and students (see Fig. 2). The large peaks are generated by
downloading transparencies and range between 30 and
125 Kbit/s. The small peaks of approximately 10 Kbit/s
are generated by light-pen activities, like editing and
marking text on transparencies.
These measurement results show that the size of video
frames and corresponding execution times are not constant
and that the whiteboard stream cannot be characterized as a periodic task. Treating both as periodic tasks would require performing pessimistic resource allocation for video and installing a periodic process that polls for aperiodic user interactions. However, both solutions result in poor resource utilization.
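A back-of-the-envelope sketch (our own illustration) shows why peak-rate, i.e. pessimistic, allocation wastes resources for such bursty streams. The 125 Kbit/s peak is taken from the whiteboard measurements above; the 5 Kbit/s long-run mean is an assumed figure:

```python
# Reserving the peak rate for a bursty stream leaves most of the
# reservation unused on average.
def reservation_utilization(mean_rate, peak_rate):
    """Fraction of a peak-rate reservation that is used on average
    (rates in the same unit, e.g. Kbit/s)."""
    return mean_rate / peak_rate

# Whiteboard stream: rare 125 Kbit/s peaks, assumed 5 Kbit/s mean.
print(reservation_utilization(5, 125))  # 0.04: 96% of it is wasted
```

Under these assumptions, only 4% of the reserved bandwidth would actually be consumed, which is the poor utilization referred to above.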
3.2. Requirements in general
Generally, we can identify the following three orthogonal
requirements of multimedia applications [16,17]:
• High data throughput: audio streams in telephony quality
require 16 Kbit/s and in CD-quality 1.4 Mbit/s. Typical
video data rates range from approximately 1.2 Mbit/s for
MPEG, 64 Kbit/s to 2 Mbit/s for H.261, 20 Mbit/s for
compressed HDTV, and over 1 Gbit/s for uncompressed
HDTV.
• Low latency and high responsiveness: end-to-end delay
for audio streams (which is a sum of network delay and
two times end-system delay) should be below 150 ms to
be acceptable for most applications. However, without
special hardware echo cancellation, the end-to-end
delay should be below 40 ms. Lip synchronization requires playing out corresponding video and audio data with a maximum skew of 80 ms. The maximum synchronization skew for music and pointing at the corresponding notes is ±5 ms. Audio samples are typically gathered in 20 ms packets, i.e., 50 packets per second have to be handled per audio stream.
• QoS guarantees: to achieve a quality level that satisfies
user requirements, the system has to handle and deliver
multimedia data according to negotiated QoS parameters,
e.g., bounded delay and delay jitter.
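Several of the figures above follow directly from the raw stream parameters; the following sketch (our own arithmetic, not from the paper) rederives the PCM bit rates and the audio packet rate:

```python
# Rederiving stream parameters used in Sections 3.1 and 3.2.
def pcm_bitrate(bits_per_sample, sample_rate_hz):
    """Bit rate of an uncompressed PCM stream in bit/s."""
    return bits_per_sample * sample_rate_hz

def packets_per_second(packet_duration_ms):
    """Packet rate when samples are gathered into fixed packets."""
    return 1000 // packet_duration_ms

print(pcm_bitrate(8, 16000))   # 128000 bit/s: the classroom audio
print(pcm_bitrate(8, 8000))    # 64000 bit/s: 8 bit, 8 kHz PCM
print(packets_per_second(20))  # 50 packets per second per stream
```

The 128 Kbit/s classroom audio stream and the 50 packets per second per audio stream reported above both fall out of this simple calculation.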
Interrupt latency, context switching overhead, and data
movements are the major bottlenecks in OSs that determine
throughput, latency, and responsiveness. In Ref. [18], it is documented that especially the interrupt handling is a major overhead. To implement QoS guarantees for these performance aspects, advanced management of all system resources is required. The need for advanced resource management has led to the development of new OS abstractions and structures. In the following sections, we discuss basic resource management tasks and explain how the new OS abstractions and structures can reduce context switching overhead and data movement costs.

4. Resource management and QoS

A computer system has many resources that may be required to solve a problem: CPU, memory at different levels, bandwidth of I/O devices, e.g. disk and host–network interface, and bandwidth of the system bus. In Fig. 3, the basic resource types CPU, memory, and bandwidth are partitioned into concrete system resources.

Fig. 3. Operating system resources.

One of the primary functions of an OS is to multiplex these resources among the users of the system. In the event of conflicting resource requests, the traditional OS must decide which requests are allocated resources to operate the computer system fairly and efficiently [19]. Fairness and efficiency are still the most important goals for resource management in today’s commodity OSs. However, with respect to multimedia applications, other goals related to timeliness become of central importance. For example, user interactions and synchronization require short response times with upper bounds, and multimedia streams require a minimum throughput for a certain period of time. These application requirements are specified as QoS requirements. Typical application-level QoS specifications include parameter types like frame rate, resolution, jitter, end-to-end delay, and synchronization skew [20]. These high-level parameters have to be broken down (or mapped) into low-level parameters and resources that are necessary to support the requested QoS, like CPU time per period, amount of memory, and average and peak network bandwidth. A discussion of this mapping process is beyond the scope of this paper, but we want to emphasize at this point that such a specification of resource requirements is difficult to achieve. A constant frame rate does not necessarily require constant throughput and constant execution time per period. Furthermore, user interactions can generate unpredictable resource requirements.

In order to meet QoS requirements from applications and users, it is necessary to manage the system resources in such a manner that sufficient resources are available at the right time to perform the task with the requested QoS. In particular, resource management in an OS for QoS comprises the following basic tasks:
• Specification and allocation request for resources that are
required to perform the task with the requested QoS.
• Admission control includes a test whether enough
resources are available to satisfy the request without
interfering with previously granted requests. The way this test is performed depends on the requirement specification and the allocation mechanism used for this resource.
• Allocation and scheduling mechanisms assure that a
sufficient share of the resource is available at the right
time. The type of mechanism depends on the resource type. Resources that can only be used exclusively by a single process at a time have to be multiplexed in the temporal domain. In other words, exclusive resources, like CPU or disk I/O, have to be scheduled. Basically, we can differentiate between fair scheduling, real-time scheduling, and work-conserving and non-work-conserving scheduling mechanisms. So-called shared resources, like memory, basically require multiplexing in the spatial domain, which can be achieved, e.g. with the help of a table.
• Accounting tracks the actual amount of resources consumed in order to perform the task. Accounting
information is often used in scheduling mechanisms to
determine the order of waiting requests. Accounting
information is also necessary to make sure that no task
consumes more resources than negotiated and steals them
(in overload situations) from other tasks. Furthermore,
accounting information might trigger system-initiated
adaptation.
• Adaptation might be initiated by the user/application or the system and can mean downgrading QoS and the corresponding resource requirements, or upgrading them. In any case, adaptation leads to new allocation parameters. Accounting information about the actual resource
consumption might be used to optimize resource
utilization.
• Deallocation frees the resources.
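The interplay of these tasks can be sketched as a toy resource manager for a single resource. The class and its method names below are our own hypothetical interface for illustration, not the API of any surveyed system:

```python
# Toy sketch of the basic resource management tasks: admission
# control, allocation, accounting, adaptation, and deallocation.
class ResourceManager:
    def __init__(self, capacity):
        self.capacity = capacity  # e.g. CPU share or bandwidth units
        self.granted = {}         # task -> reserved amount
        self.used = {}            # task -> accounted consumption

    def request(self, task, amount):
        """Admission control: admit only if previously granted
        requests are not interfered with."""
        if sum(self.granted.values()) + amount > self.capacity:
            return False
        self.granted[task] = amount
        self.used[task] = 0
        return True

    def account(self, task, amount):
        """Accounting: track consumption, never beyond the grant,
        so no task can steal resources from others in overload."""
        allowed = min(amount, self.granted[task] - self.used[task])
        self.used[task] += allowed
        return allowed

    def adapt(self, task, new_amount):
        """Adaptation: upgrade or downgrade an existing allocation,
        yielding new allocation parameters."""
        others = sum(v for t, v in self.granted.items() if t != task)
        if others + new_amount > self.capacity:
            return False
        self.granted[task] = new_amount
        return True

    def release(self, task):
        """Deallocation frees the resources."""
        self.granted.pop(task, None)
        self.used.pop(task, None)

rm = ResourceManager(capacity=100)
print(rm.request("video", 70))  # True: fits into the capacity
print(rm.request("audio", 40))  # False: rejected by admission control
print(rm.adapt("video", 50))    # True: video downgrades its QoS
print(rm.request("audio", 40))  # True: now there is room
```

The final two calls show the role of adaptation: after the video task downgrades its reservation, the previously rejected audio request passes admission control.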
Specification, admission control, and allocation and scheduling strongly depend on the particular resource type, while adaptation and resource accounting represent more resource-type-independent principles. Thus, the following two subsections introduce adaptation and resource accounting, before we discuss the different system resources in more detail.
4.1. Adaptation

There are two motivations for adaptation in multimedia systems: (1) resource requirements are hard to predict, e.g. VBR video and interactivity; and (2) resource availability cannot be guaranteed if the system includes best-effort subsystems, e.g. today’s Internet or mobile systems. In best-effort systems, it is only possible to adapt the application with respect to the amount of application data the system has to handle. In OSs, both situations might occur, and it is possible to adapt both application and resource allocations. In Ref. [21], adaptation with feedback and adaptation without feedback are distinguished. Adaptation without feedback means that applications change only the functionality of the user interface and do not change any resource parameters. Therefore, we consider in this article only adaptation with feedback, i.e. feedback control systems. Fig. 4 shows a simplified view of the collaboration between resource consumer, e.g. application, and resource provider, e.g. CPU scheduler, in adaptation with feedback [22]:

Fig. 4. Feedback control for adaptation.

(A) The resource consumer, or a management entity, estimates its resource requirements and requests the provider to allocate the resource according to its specification.
(B) After admission control is successfully passed, the resource utilization is monitored. The monitoring results can reflect the general resource utilization, e.g. the resource is under- or over-utilized, and the accuracy of the consumer’s resource requirement estimations, e.g. it uses less or more resources than allocated. The monitoring results might trigger step (C) and/or (D).
(C) The provider requests the consumer to adjust its resource requirements, e.g. by reducing the frame rate of a video.
(D) The consumer requests the provider to adjust the allocation parameters.

Most of the recent results in the area of adaptive resource management discuss CPU management (see Section 5). Monitoring of actual execution times is supported in most of these systems. More general approaches for adaptive resource management include AQUA [22], SWiFT [23], Nemesis [7], Odyssey [24], and QualMan [20]. The crucial aspects of adaptation are the frequency at which feedback control (and adaptation) is performed and the related overhead. For example, SCOUT uses a coarse-grained feedback mechanism that operates in the order of several seconds [25]. On the other hand, the work presented in Ref. [26] aims to predict the execution times of single MPEG frames. Whether fine-grained adaptation of allocation parameters results in better QoS and/or better resource utilization is still open. However, it is obvious that it requires frequent adjustment of allocation parameters, which must not impose much overhead.

4.2. New abstractions for resource principals
Resource accounting represents a fundamental problem,
because most OSs treat a process as the “chargeable” entity
for allocation of resources, such as CPU time and memory.
In Ref. [27], it is pointed out that “a basic design premise of
such process-centric systems is that a process is the unit that
constitutes an independent activity. This gives the process
abstraction a dual function: it serves both as a protection
domain and as a resource principal.” This situation is insufficient, because there is no inherent one-to-one mapping between an application task and a process. A single process might serve multiple applications, and multiple processes might together serve a single application. For example,
protocol processing is performed in most monolithic kernels in the context of software interrupts. The corresponding resource consumption is charged to the unlucky process running at that time, or not accounted at all [27]. μ-kernels implement traditional OS services as user-level servers. Applications might invoke multiple user-level servers to perform work on their behalf, but the resource consumption is charged to the application and the user-level servers, instead of charging it to the application only [28]. It is important to
note that ownership of resources is not only important for
accounting reasons, but also for scheduling. The resources a process “owns”, e.g. CPU time, also define its scheduling parameters. For example, in commodity OSs with priority-based CPU scheduling, each process is associated with a priority, which in turn determines when it is scheduled, i.e. receives its CPU time. Thus, a server or kernel thread that performs a task on behalf of an application with QoS requirements should inherit the corresponding scheduling parameters for needed resources, e.g. the priority.
In Ref. [29], the problem of resource ownership and the
corresponding scheduling decisions is partially solved for a
monolithic kernel by deriving the weights of kernel activities from user weights. By this, the proportional-share scheduler in the extended FreeBSD kernel is able to provide an appropriate share of CPU bandwidth to the kernel activity such that the QoS requirements of the user process can be met.
We can identify two basic, orthogonal approaches to handle this problem: (1) introduction of a new abstraction for resource ownership; and (2) provision of as much control as possible over devices to applications, as is done in so-called library OSs, like Exokernel [5,6] and Nemesis [7]. In these systems, applications can directly control
resource consumption for network I/O and file I/O, because
network device drivers and disk device drivers are
accessible from the application in user-space without kernel
interference. Obviously, resource consumption can be easily
charged to the application.
In Ref. [27], a quite extensive discussion of new
abstractions for resource ownership can be found. These
abstractions differ in terms of: (1) which resources are
considered; (2) which relationships between threads and
resource owners are supported; and (3) which resource
consumptions are actually charged to the owner. In Stride
and Lottery scheduling [30], resource rights are encapsulated by tickets. Tickets are owned by clients, and ticket transfers allow resource rights to be passed between clients.
In Mach [31] and AlphaOS [32], migrating threads, and in Spring [33], shuttles, correspond to threads and own resources. Migration of threads between protection
domains enables these systems to account resource
consumption of independent activities to the correct
owner, i.e. thread. In Ref. [34], threads can act as schedulers for other threads and can donate CPU time to selected
threads. The reservation domains [35] of Eclipse, the Software Performance Units [36], and the scheduling domains
in Nemesis [7] enable the scheduler to consider the resource
consumption of a group of processes.
The reserve abstraction is introduced in Real-Time Mach
[28]. The main purpose of reserves is to accurately account
CPU time for activities that invoke user-level services, e.g.
client threads that invoke various servers (which are running
in user-space, because Real-Time Mach is a μ-kernel). Each
thread is bound to one reserve, and multiple threads, potentially running in different protection domains, can be bound
to a single reserve. By this, computations of server threads
that are performed on behalf of a client can be charged to the
client's reserve, and scheduling of the server computations is performed according to the reservations of the reserve. In
Ref. [37], the concept of reserves is extended to manage
additional resource types such as CPU, disk bandwidth,
network bandwidth, and virtual memory.
Resource containers [27] allow explicit and fine-grained
control over resource consumption at all levels in the
system, because they allow dynamic relationships between
resource principals and processes. The system provides
explicit operations to create new resource containers, to
bind processes to containers and to release these bindings,
and to share containers between processes. Resource
containers are associated with a set of attributes, like
scheduling parameters, memory limitations, and network
QoS values, that can be set by applications and support
the appropriate scheduling (decoupled from particular
process information). Furthermore, containers can be
bound to files and sockets, such that the kernel resource
consumption on behalf of these descriptors is charged to
the container.
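The accounting idea behind resource containers can be illustrated with a small model. This is a toy sketch, not the actual interface from Ref. [27]; all class and method names (ResourceContainer, charge, kernel_receive) are hypothetical:

```python
# Toy model of resource-container accounting (hypothetical names, not the
# actual API from Ref. [27]): both user-level work and kernel work performed
# on behalf of a bound descriptor are charged to the container, not to
# whichever process happens to be running.

class ResourceContainer:
    def __init__(self, cpu_share):
        self.cpu_share = cpu_share   # scheduling attribute, set by application
        self.consumed_ms = 0         # accounted CPU time

    def charge(self, ms):
        self.consumed_ms += ms

class Process:
    def __init__(self, container):
        self.container = container   # binding is dynamic and can be released

    def run(self, ms):
        self.container.charge(ms)    # user-level CPU time

class Socket:
    def __init__(self, container):
        self.container = container   # container bound to a descriptor

    def kernel_receive(self, ms):
        # kernel protocol processing for this socket is charged to the
        # container, even if it runs in the context of another process
        self.container.charge(ms)

container = ResourceContainer(cpu_share=0.25)
proc, sock = Process(container), Socket(container)
proc.run(10)
sock.kernel_receive(5)
print(container.consumed_ms)  # → 15
```

The point of the sketch is the decoupling: the resource principal (the container) is distinct from both the process and the descriptor, so a scheduler can act on `cpu_share` and `consumed_ms` independently of process boundaries.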
In the Rialto OS, an activity object is the abstraction to
which resources are allocated and against which resource
usage is charged [38,39]. Applications run by default in
their own activity and typically in their own process.
Activities may span over address spaces and machines.
Multiple threads of control may be associated with an
activity. The threads execute with rights granted by secured
user credentials associated with this activity. The CPU
scheduler treats all threads of an activity equally, because
the assumption is that they cooperate toward a common
goal.
The path abstraction in the SCOUT OS [40] combines
low-level de-multiplexing of network packets via a packet
filter with migrating threads. A path represents an I/O path
and is executed by a thread. One thread can sequentially
execute multiple paths. A newly awakened thread inherits
the scheduling requirements of the path and can adjust them
afterwards. The path object is extended in the Escort OS
with a mechanism to account all resource consumptions of a
path to defend against denial of service attacks [41].
Compared to the previously discussed approaches for
resource accounting, a fundamentally different approach is introduced in Ref. [42]. Multiple threads, which may reside in
different protection domains, are gathered in a job. Instead
of measuring the resource consumption of these threads,
Steere et al. [42] monitor the progress of jobs and adapt,
i.e. increase or decrease, the allocation of CPU to those jobs.
So-called symbiotic interfaces link the application-dependent notion of progress to system metrics, like a buffer's fill-level.
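The progress-based adaptation loop can be sketched as a simple feedback controller. This is a simplification under assumed parameters (target fill level, gain), not the actual controller of Ref. [42]:

```python
# Sketch of progress-driven CPU adaptation in the spirit of Ref. [42]
# (simplified; parameter names and controller form are assumptions): the
# fill level of a producer/consumer buffer serves as the progress metric,
# and the producing job's allocation is adjusted toward a target fill level.

def adapt_allocation(allocation, fill_level, target=0.5, gain=0.1,
                     lo=0.01, hi=1.0):
    """Proportional controller: a buffer that is too empty (fill level below
    target) means the producer lags, so its CPU allocation is increased."""
    error = target - fill_level
    return max(lo, min(hi, allocation + gain * error))

print(round(adapt_allocation(0.2, fill_level=0.1), 3))  # → 0.24 (more CPU)
print(round(adapt_allocation(0.2, fill_level=0.9), 3))  # → 0.16 (less CPU)
```

Because only progress is observed, the mechanism needs no per-thread resource measurement, which is exactly what distinguishes it from the accounting-based approaches above.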
5. CPU scheduling
Most commodity OSs, like Windows NT and Solaris,
perform priority scheduling and provide time-sharing and
real-time priorities. Priorities of threads in the real-time
range are never adjusted by the system. A straightforward
approach to assure that a time critical multimedia task
receives sufficient CPU time would be to assign a real-time priority to this task. However, in Ref. [43], it is
shown for the SVR4 UNIX scheduler that this approach
results in unacceptable system performance. A high
priority multimedia task, like video, has precedence over
all time-sharing tasks and is nearly always active. Starvation
of time-sharing tasks, e.g. window system and basic system
services, leads to poor QoS for the multimedia application
and unacceptable performance for the entire system. Thus,
the usage of real-time priorities does not automatically lead
to the desired system behavior. However, other implementations, like those described in Refs. [3,44], show that fixed
real-time priorities can be utilized to successfully schedule
real-time and best effort tasks in a general purpose UNIX
system.
Several new solutions have been recently developed. The
following section gives a general overview and classification of these solutions, and subsequent sections present two
solutions in more detail.
5.1. Classification of scheduling mechanisms
Two of the most popular paradigms for resource allocation and scheduling to satisfy the conflicting goals of flexibility, fairness, and QoS guarantees are proportional share resource allocation and reservation-based algorithms [45].

Table 1
CPU scheduling approaches

System/project | Paradigm | Admission control | Adaptation support | Resource principal | Context
AQUA | Reservation-based | Yes | Yes | No | Solaris
ARC | Proportional share | Yes | Yes | No | Solaris
Atropos | Hierarchical | Yes | Yes | Yes | Nemesis
BERT | Merges proportional share and reservation-based | Yes | Yes | Yes | SCOUT
FLUX | Hierarchical | No | No | No | Prototype in user-space
Ref. [2] | Proportional share | No | Yes | No | Framework
HeiTS | Reservation-based | Yes | No | No | AIX
Ref. [29] | Hierarchical | No | Yes (only monitoring of execution times) | No (system processes inherit weight from user process) | FreeBSD
Lottery, Stride | Proportional share | No | Yes | Yes | Mach
MTR-LS | Proportional share | Yes | No | Yes | Eclipse
Rialto | Hierarchical | Yes | Yes | Yes | Rialto
RT Mach | Reservation-based | Yes | Yes | Yes | RT-Mach
RT Upcalls | Reservation-based | No | No | No | NetBSD
SMART | Proportional share | No | Yes | No | Solaris
SRT | Hierarchical | Yes | Yes | No | User-level scheduler
Ref. [42] | Reservation-based | Yes | Yes | No | Prototype in user-space
SUMO | Hierarchical | Yes | No | No | Chorus

In proportional share allocation, resource requirements
are specified by the relative share (or weight) of the
resource. In a dynamic system, where tasks dynamically
enter and leave the system, a share depends on both the
current system state and the current time. The resource is
always allocated in proportion to the shares of competing
requests. Thus, in pure proportional share resource allocation, no guarantees can be given, because the share of CPU
might be arbitrarily low. The basis for most proportional
share resource allocation mechanisms are algorithms that
have been developed for packet scheduling in packet-switched networks, like weighted fair queuing [46], virtual
clock [47], and packet-by-packet generalized processor
sharing [48]. Examples of systems using proportional
share CPU allocation include Adaptive Rate Control
(ARC) [49], SMART [50], Earliest Eligible Virtual Deadline First [45], Lottery and Stride scheduling [30], and
Move-to-Rear List (MTR-LS) scheduling in Eclipse [35].
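As an illustration of proportional share allocation, the core of stride scheduling [30] fits in a few lines. This is a minimal sketch; the constant and the tie-breaking by name are choices made here, not taken from the original implementation:

```python
# Minimal stride scheduler (after stride scheduling [30]): each client's
# share of the CPU is proportional to its ticket count. The client with the
# minimum "pass" value runs next; running advances its pass by its stride.
import heapq

def stride_schedule(tickets, quanta):
    """tickets: dict client -> ticket count; returns run counts per client."""
    STRIDE1 = 1 << 20                 # global constant, an arbitrary choice
    heap = [(0, name, STRIDE1 // t) for name, t in sorted(tickets.items())]
    heapq.heapify(heap)
    runs = {name: 0 for name in tickets}
    for _ in range(quanta):
        pass_val, name, stride = heapq.heappop(heap)  # minimum pass runs
        runs[name] += 1
        heapq.heappush(heap, (pass_val + stride, name, stride))
    return runs

# A 3:1 ticket ratio yields a 3:1 CPU ratio over 40 quanta.
print(stride_schedule({"video": 3, "batch": 1}, 40))  # → {'video': 30, 'batch': 10}
```

A ticket transfer between clients would simply change the stride of the receiving client, which is how resource rights move in the Lottery/Stride model.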
In contrast to proportional share resource allocation,
reservation-based algorithms, like RM and EDF scheduling,
can be used to implement guaranteed QoS. The minimum
share for each thread is both state and time independent.
However, resource reservation sacrifices flexibility and fairness [45]. EDF scheduling is used in Nemesis [7], DASH
[51], and SUMO [52]. The principles of RM scheduling are
applied, for example, in Real-Time Mach [28], HeiTS [44],
AQUA [22] and for the Real-Time Upcall in the MARS
system [163,164]. To implement a feedback-driven proportional allocator for real-rate scheduling, the work presented
in Ref. [42] uses both EDF and RM.
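The admission decisions underlying such reservation-based schedulers can be illustrated with the classic utilization-based schedulability tests for periodic tasks. This is a sketch; real systems add blocking and overhead terms:

```python
# Utilization-based admission tests for periodic tasks (C = worst-case
# execution time, T = period): EDF admits up to full utilization, while RM
# uses the more conservative Liu/Layland bound n(2^(1/n) - 1).

def edf_admissible(tasks):
    """tasks: list of (C, T); EDF is feasible iff sum C/T <= 1."""
    return sum(c / t for c, t in tasks) <= 1.0

def rm_admissible(tasks):
    """Sufficient (not necessary) RM test: sum C/T <= n(2^(1/n) - 1)."""
    n = len(tasks)
    return sum(c / t for c, t in tasks) <= n * (2 ** (1 / n) - 1)

tasks = [(2, 10), (3, 15), (5, 20)]   # utilization 0.2 + 0.2 + 0.25 = 0.65
print(edf_admissible(tasks))          # → True
print(rm_admissible(tasks))           # → True (0.65 <= ~0.78 for n = 3)
```

The gap between the two bounds is the flexibility/guarantee trade-off mentioned above: EDF accepts more load, RM gives a simpler fixed-priority implementation.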
Most modern CPU scheduling solutions for multimedia systems are either based on proportional share allocation or combine different allocation paradigms in hierarchies to support both real-time and best-effort requirements.
For example, the scheduler in the Rialto system [39] and the
soft real-time (SRT) user-level scheduler from [3] combine
EDF and round-robin scheduling. The Atropos scheduler in
Nemesis [7] also applies EDF to sort waiting scheduling
domains in different queues. The proportional share resource allocation algorithm Start-time Fair Queuing [53] is used in Ref. [2] to achieve a hierarchical partition of CPU
bandwidth in a general framework. For each of these partitions, arbitrary scheduling mechanisms can be used. Proportional share scheduling is also the primary policy in Ref.
[29]. When multiple processes are eligible, they are scheduled according to EDF. In the context of the Flux project, a
CPU inheritance scheduling framework has been developed in which arbitrary threads can act as schedulers for
other threads and widely different scheduling policies
can co-exist [34]. The scheduler in SCOUT, called
BERT [25], merges reservation-based and proportional
share resource allocation in a single policy, instead of
combining two or more policies in a hierarchical
approach. Basically, BERT extends the virtual clock
algorithm by considering deadlines for the scheduling
decision and by allowing high-priority real-time tasks
to steal CPU cycles from low priority and best effort
tasks.
In addition to the resource allocation paradigm,
i.e. proportional share (P), reservation-based (R), and
hierarchical (H), Table 1 uses the following criteria to classify CPU scheduling approaches: (1) whether admission
control is performed; (2) whether adaptation is supported;
(3) whether a new abstraction for resource principal is introduced; and (4) what is the context of the scheduler, i.e. is it
part of a new kernel, integrated in an existing kernel, or
implemented on top of an existing kernel in user-space.
In the following subsections, we describe in more detail two
distinct approaches, i.e. Rialto scheduler and SMART, that
differ in all classification criteria except adaptation support.
5.2. Rialto scheduler
The scheduler of the Rialto OS [38,39,54] is based on
three fundamental abstractions:
• Activities are typically executing programs or applications comprising multiple threads of control.
Resources are allocated to activities and their usage is
charged to activities.
• CPU reservations are made by activities and are
requested in the form: “reserve X units of time out of every Y units for activity A”. Basically, period length
and reservations for each period can be of arbitrary
length.
• Time constraints are dynamic requests from threads to the scheduler to run a certain code segment to completion between a specified start time and deadline.
The scheduling decision, i.e. which threads to activate
next, is based on a pre-computed scheduling graph. Each
time a request for CPU reservation is issued, this scheduling
graph is recomputed. In this scheduling graph, each node
represents an activity with a CPU reservation, specified as
time interval and period, or represents free computation
time. For each base period, i.e. the lowest common denominator of periods from all CPU reservations, the scheduler
traverses the tree in a depth-first manner, but always backtracks to the root after visiting a leaf in the tree. Each
node, i.e. activity that is crossed during the traversal, is
scheduled for the specified amount of time. The execution
time associated with the schedule graph is fixed. Free
execution times are available for non-time-critical tasks.
This fixed schedule graph keeps the number of context
switches low and keeps the scheduling algorithm
scalable.
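A much simplified version of such precomputed scheduling can be sketched as follows. The function name is hypothetical, and the flat slot table ignores the interleaving across periods that the real scheduling graph provides:

```python
# Sketch of Rialto-style precomputed CPU schedules (strong simplification):
# reservations of the form "X time units out of every Y" pass admission
# control if total utilization fits, and a slot table over the base period
# is precomputed so that run-time dispatching is a cheap table lookup.
from functools import reduce
from math import gcd

def build_schedule(reservations):
    """reservations: dict activity -> (x, y); returns a slot table over the
    base period, or None if the requests are infeasible."""
    if sum(x / y for x, y in reservations.values()) > 1.0:
        return None                                  # reservation rejected
    base = reduce(lambda a, b: a * b // gcd(a, b),   # least common multiple
                  (y for _, y in reservations.values()))
    table = []
    for activity, (x, y) in reservations.items():
        table += [activity] * (x * base // y)        # slots per base period
    table += ["free"] * (base - len(table))          # free time: best effort
    return table

table = build_schedule({"A": (2, 10), "B": (1, 5)})
print(len(table), table.count("A"), table.count("B"), table.count("free"))
# → 10 2 2 6
print(build_schedule({"A": (9, 10), "B": (3, 10)}))  # → None (overcommitted)
```

As in Rialto, all cost is paid at reservation time: once the table exists, the dispatcher does no per-quantum computation, which keeps context-switch overhead low and the mechanism scalable.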
If threads request time constraints, the scheduler analyzes
their feasibility with the so-called time interval assignment
data structure. This data structure is based on the knowledge
represented in the schedule graph and checks whether
enough free computation time is available between start
time and deadline (including the already reserved time in
the CPU reserve). Threads are not allowed to define time
constraints when they might block—except for short
blocking intervals for synchronization or I/O. When, during the traversal of the scheduling graph, an interval assignment record for the current time is encountered, a
thread with an active time constraint is selected according
to EDF. Otherwise, threads of an activity are scheduled
according to round-robin. Free time for non-time-critical
tasks is also distributed according to round-robin.
If a thread with a time constraint blocks on a synchronization event, its priority (and its reservations) are passed to the holding thread.
5.3. SMART
The SMART scheduler [50] is designed for multimedia
and real-time applications and is implemented in Solaris
2.5.1. The main idea of SMART is to differentiate between
importance to determine the overall resource allocation for
each task and urgency to determine when each task is given
its allocation. Importance is valid for real-time and conventional tasks and is specified in the system by a tuple of
priority and biased virtual finishing time. Here, the virtual
finishing time, as known from fair-queuing schemes, is
extended with a bias, which is a bounded offset measuring
the ability of conventional tasks to tolerate longer and more
varied service delays. Application developers can specify
time constraints, i.e. deadlines and execution times, for a
particular block of code, and they can use the system notification. The system notification is an upcall that informs the
application that a deadline cannot be met and allows the
application to adapt to the situation. Applications can
query the scheduler for availability, which is an estimate
of processor time consumption of an application relative
to its processor allocation. Users of applications can specify
priority and share to bias the allocation of resources for the
different applications.
The SMART scheduler separates importance and urgency
considerations. First, it identifies all tasks that are important
enough to execute and collects them in a candidate set.
Afterwards, it orders the candidate set according to urgency
consideration. In more detail, the scheduler works as
follows: if the task with the highest value-tuple is a conventional task, schedule it. The highest value-tuple is either
determined by the highest priority or for equal priorities
by the earliest biased virtual finishing time. If the task
with the highest value-tuple is a real-time task, it creates a
candidate set of all real-time tasks that have a higher value-tuple than the highest conventional task. The candidate set is
scheduled according to the so-called best-effort real-time
scheduling algorithm. Basically, this algorithm finds the
task with the earliest deadline that can be executed without
violating deadlines of tasks with higher value-tuples.
SMART notifies an application if its computation cannot be completed before its deadline. This enables applications
to implement downscaling.
There is no admission control implemented in SMART.
Thus, SMART can only enforce real-time behavior in
underload situations.
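SMART's two-step selection (importance first, then urgency) can be sketched as follows. This is a strong simplification of the best-effort real-time algorithm, and the field names are chosen here for illustration:

```python
# Simplified sketch of SMART's scheduling decision [50]: importance is a
# value-tuple (priority, biased virtual finishing time); urgency (deadline)
# only orders the real-time candidate set.

def smart_pick(tasks):
    """tasks: list of dicts with 'name', 'rt' (bool), 'prio', 'bvft', and
    'deadline' for real-time tasks. Returns the name of the task to run."""
    # Importance: higher priority wins; ties are broken by the earlier
    # biased virtual finishing time (lower bvft).
    value = lambda t: (-t["prio"], t["bvft"])
    best = min(tasks, key=value)
    if not best["rt"]:
        return best["name"]               # conventional task: just run it
    # Candidate set: real-time tasks more important than the most important
    # conventional task (all real-time tasks if there is none).
    conv = [t for t in tasks if not t["rt"]]
    cutoff = min(conv, key=value) if conv else None
    cand = [t for t in tasks
            if t["rt"] and (cutoff is None or value(t) < value(cutoff))]
    # Urgency: earliest deadline among the candidates.
    return min(cand, key=lambda t: t["deadline"])["name"]

tasks = [
    {"name": "video", "rt": True, "prio": 5, "bvft": 2.0, "deadline": 30},
    {"name": "audio", "rt": True, "prio": 5, "bvft": 1.0, "deadline": 10},
    {"name": "shell", "rt": False, "prio": 1, "bvft": 0.5},
]
print(smart_pick(tasks))  # → audio (both RT tasks beat the shell; audio is most urgent)
```

The sketch omits SMART's deadline-feasibility check within the candidate set, which is where the real algorithm decides whether a less urgent but more important task must preempt.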
6. Disk management
Magnetic and optical disks enable permanent storage of data (we focus in this article only on issues related to magnetic disks). The file system is the central OS abstraction to handle data on disk. However, in most commodity OSs, it is possible to bypass the file system and to use raw disks, e.g. for
database systems. The two main resources that are of importance for disk management, no matter whether the file
system of an OS or raw disk is used, are:
• Memory space on disk: allocation of memory space at the
right place on disks, i.e. appropriate data placement, can
strongly influence the performance.
• Disk I/O bandwidth: access to disk has to be multiplexed
in the temporal domain. CPU scheduling algorithms
cannot directly be applied for disks, because of the
following reasons [55]: (1) a disk access cannot be
preempted, i.e. it is always necessary to read a whole
block; (2) access times (which correspond to execution
time for CPU scheduling) are not deterministic, because
they depend on the position of the disk head and the
position of data on disk; and (3) disk I/O represents the
main performance bottleneck in today’s systems.
Multimedia data can be managed on disk in two different
ways [57]: (1) the file organization on disk is not changed
and the required real-time support is provided by special
disk scheduling algorithms and large buffers to avoid jitter;
or (2) the data placement is optimized for continuous multimedia data in distributed storage hierarchies like disk arrays
[56].
In the following subsections, we discuss file management,
data placement, and disk scheduling issues of multimedia
file systems separately.
6.1. File management
The traditional access and control tasks of file systems,
like storing file information in sources, objects, program
libraries and executables, text, and accounting records,
have to be extended for multimedia file systems with real-time characteristics, coping with larger file sizes (high disk
bandwidth), and with multiple continuous and discrete data
streams in parallel (real-time delivery) [57]. In recent years, storage devices have become only marginally faster, while the performance of processors and networks has increased exponentially; this growing speed discrepancy makes efficient handling of multimedia data by file systems even more important. This is documented by the
large research activity to find new storage structures and
retrieval techniques. The existing approaches can be categorized along multiple criteria, and we present a brief classification along architectural issues and data characteristics.
From the architectural perspective, multimedia file
systems can be classified as [166]:
• Partitioned file systems consist of multiple subfile
systems, each tailored to handle data of a specific data
type. An integration layer may provide transparent access
to the data handled by the different subfile systems. There
are multiple examples of systems using this approach,
e.g. FFS [58], Randomized I/O (RIO) [59], Shark [60],
Tiger Shark [61], and the combination of UFS and CMFS
in [62].
• Integrated file systems multiplex all available resources
(storage space, disk bandwidth, and buffer cache) among
all multimedia data. Examples of integrated multimedia
file systems are the file system of Nemesis [63], Fellini
[64], and Symphony [65].
Another way to classify multimedia file systems is to
group the systems according to the supported multimedia
data characteristics:
• General file systems capable of handling multimedia data
to a certain extent, e.g. FFS [58], and log-structured file
systems [66–68].
• Multimedia file systems optimized for continuous multimedia data (video and audio data), e.g. SBVS [69], Mitra
[70], CMFS [71], PFS [72], Tiger [73,74], Shark [60],
Tiger Shark [61], and CMSS [75].
• Multimedia file systems handling mixed-media workloads (continuous and discrete multimedia data), e.g.
Fellini [64], Symphony [65], MMFS [76], the file system
of Nemesis [63], and RIO [59].
The file system of Nemesis [63] supports QoS guarantees
using a device driver model. This model realizes a low-level
abstraction providing separation of control and data path
operations. To enable the file system layers to be executed
as unprivileged code within shared libraries, data path
modules supply translation and protection of I/O requests.
QoS guarantees and isolation between clients are provided
by scheduling low-level operations within the device
drivers.
Fellini [64] supports storage and retrieval of continuous
and discrete multimedia data. The system provides rate
guarantees for active clients by using admission control to
limit the number of concurrent active clients.
Symphony [65] can manage heterogeneous multimedia
data supporting the coexistence of multiple data type
specific techniques. Symphony comprises a QoS-aware
disk scheduling algorithm for real-time and non-real-time
requests, and a storage manager supporting multiple block
sizes and data type specific placement, failure recovery, and
caching policies.
MMFS [76] handles interactive multimedia applications by extending the UNIX file system. MMFS has a
two-dimensional file structure for single-medium editing
and multimedia playback: (1) a single-medium strand
abstraction [165]; and (2) a multimedia file construct,
which ties together multiple strands that belong logically
together. MMFS uses application-specific information for
performance optimization of interactive playback. This
includes intelligent prefetching, state-based caching,
prioritized real-time disk scheduling, and synchronized
multi-stream retrieval.
RIO [59] provides real-time data retrieval with statistical
delay guarantees for continuous and discrete multimedia data.
The system applies random data allocation on heterogeneous
disks and partial data replication to achieve load balancing
and high performance.
In addition to the above-mentioned aspects, new mechanisms for multimedia file organization and metadata handling
are needed. For instance, in MMFS [76], each multimedia
file has a unique mnode, and every strand in a multimedia file has a unique inode. Mnodes include metadata
of multimedia files and multimedia-specific metadata of
each strand, e.g. recording rate, logical block size, and
size of the application data unit.
Metadata management in Symphony [65] uses a two-level
metadata structure (similar to inodes) allowing both data
type specific structure for files and supporting the traditional
byte stream interface. As in UNIX, fixed-size metadata
structures are stored on a reserved area of the disk. The
file metadata comprises, in addition to the traditional file
metadata, information about block size used to store the
file, type of data stored in the file, and a two-level index.
The first index level maps logical units, e.g., frames, to byte
offsets, and the second index level maps byte offsets to disk
block locations.
Fellini [64] uses the raw disk interface of UNIX to store
data. It maintains the following information: raw disk partition headers containing free space administration information on the disks, file control blocks similar to UNIX inodes
describing data layout on disk, and file data.
Minorca [77] divides the file system partition into multiple
sections: super block section, cylinder group section, and
extent section. Metadata such as inodes and directory blocks
are allocated in the cylinder group section in order to maintain
the contiguity of block allocation in the extent section.
6.2. Data placement
Data placement (also often referred to as disk layout and
data allocation) and disk scheduling are responsible for the
actual values of seek time, rotation time, and transfer time,
which are the three major components determining disk
efficiency [78]. There are a few general data placement
strategies for multimedia applications in which read
operations dominate and only few non-concurrent write
operations occur:
• Scattered placement: blocks are allocated at arbitrary
places on disk. Thus, sequential access to data will
usually cause a large number of intra-file seeks and
rotations resulting in high disk read times. However, in
RIO [59], random data placement is used to support
mixed-media workloads stored on heterogeneous disk
configurations. In their special scenario, the reported
performance measurements show results similar to those for conventional striping schemes [79].
• Contiguous placement: all data blocks of a file are
successively stored on disk. Contiguous allocation will
mostly result in better performance compared to scattered
allocation. The problem of contiguous allocation is that it
causes external fragmentation.
• Locally contiguous placement (also called extent-based
allocation): the file is divided into multiple fragments
(extents). All blocks of a fragment are stored contiguously,
but fragments can be scattered over one or multiple disks.
The fragment size is usually determined by the amount of
data required for one service round. Locally contiguous
placement causes less external fragmentation than
contiguous placement.
• Constrained placement: this strategy restricts the average distance, measured in tracks, between a finite sequence of blocks [71,80]. Constrained placement represents a
compromise of performance and fragmentation between
scattered and contiguous placement. However, complex
algorithms are needed to obey the defined constraints
[81]. This strategy takes into account only seek times
and not rotation times.
• VBR compressed data placement: conventional fixed-sized clusters correspond to varying amounts of time,
depending on the achieved compression [82]. Alternatively, the system can store data in clusters that correspond to a fixed amount of time, with a variable cluster
size. Additionally, compressed data might not correspond to an integral number of disk sectors, which introduces the problem of packing data [83].
To optimize write operations, log-structured placement
has been developed to reduce disk seeks for write-intensive
applications [66–68]. When modifying blocks of data, log-structured systems do not store modified blocks in their
original positions. Instead, all writes for all streams are
performed sequentially in a large contiguous free space.
Therefore, instead of requiring a seek (and possibly intra-file seeks) for each stream being written, only one seek is required
prior to a number of write operations. However, this does
not guarantee any improvement for read operations, and the
mechanism is more complex to implement.
For systems managing multiple storage devices, there
exist two possibilities of distributing data among disks
[78,84]:
• Data striping: to realize a larger logical sector, many
physical sectors from multiple disks are accessed in
parallel.
• Data interleaving: requests of a disk are handled independently of requests from other disks. All fragments of a request can be stored on 1 to n disks [85].
Some multimedia file systems, e.g. Symphony [65] and
Tiger Shark [61], use striping techniques to interleave both
continuous and non-continuous multimedia data across
multiple disks. There are two factors crucially determining
the performance of multi-disk systems [78]:
• Efficiency in using each disk. The amount of seek and
rotation times should be reduced as much as possible in
order to have more time available for data transfer.
• Fairness in distributing the load over all disks.
These two factors largely depend on the data distribution
strategy and the application characteristics. They are very
important to achieve synchronization, i.e. maintaining the temporal relationship between different multimedia data
streams [57]. Synchronization is often achieved by storing
and transmitting streams interleaved, e.g. by using an MPEG
compression mechanism. Another solution is time-stamping
of the multimedia data elements and appropriate buffering at
the presentation system to enable the OS to synchronize
related data elements of continuous and discrete multimedia
data.
6.3. Disk scheduling
Traditional disk scheduling algorithms focused mainly on
reducing seek times [86], e.g. Shortest Seek Time First
(SSTF) or SCAN. SSTF has high response time variations
and may result in starvation of certain requests. SCAN
reduces the response time variations and optimizes seek
times by serving the requests in an elevator-like way.
There exist many variations and hybrid solutions of the
SSTF and SCAN algorithms that are widely used today
[87–89].
Modern disk scheduling algorithms [90,91] try to minimize the sum of seek and rotational delays by prioritizing, e.g. serving the request with the Smallest Positioning Time First.
However, disk scheduling algorithms for multimedia data requests need to optimize, besides the traditional criteria, other criteria specific to multimedia data, including QoS guarantees [57,84]. The following list represents an
overview of recent multimedia data disk scheduling algorithms, which are primarily optimized for continuous data
streams [78,92]:
• EDF strategy [1] serves the block request with the
nearest deadline first. Strict EDF may cause low throughput and very high seek times. Thus, EDF is often adapted
or combined with other disk scheduling strategies.
• SCAN-EDF strategy [93] combines the seek optimization
of SCAN and the real-time guarantees of EDF. Requests
are served according to their deadline. The request with
the earliest deadline is served first, as in EDF. If multiple requests have the same (or similar) deadline, SCAN is
used to define the order to handle the requests. The
efficiency of SCAN-EDF depends on how often the
algorithm can be applied, i.e. how many requests have
the same (or similar) deadline, because the SCAN
optimization is only achieved for requests in the same
deadline class [94].
• Group Sweeping Strategy (GSS) [89,95] optimizes disk
arm movement by using a variation of SCAN handling
the requests in a round-robin fashion. GSS splits the
requests of continuous media streams into multiple
groups. The groups are handled in a fixed order. Within
a group, SCAN is used to determine time and order of
request serving. Thus, in one service round, a request
may be handled first. In another service round, it may
be the last request served in this group. To guarantee
continuity of playout, a smoothing buffer is needed.
The buffer size depends on the service round time and the required data rate. Thus, playout can only start at the end of the group containing the first retrieval request, when enough data is buffered. GSS represents
a trade-off between optimizations of buffer space and
disk arm movement. GSS is an improvement compared
to SCAN, which requires a buffer for every continuous
media request. However, GSS may reduce to SCAN
when only one group is built, or in the other extreme
case, GSS can behave like round-robin when every
group contains only one request.
• Scheduling in rounds, e.g. Refs. [79,84,96], splits every
continuous media request into multiple blocks (so-called fragments) such that the playout duration of each fragment is a certain constant time (normally a few seconds). The length of the round represents an upper
time limit for the system to retrieve the next fragment
from disk for all active requests. For each round, the
amount of buffered data must not be less than the amount of consumed data, so that the amount of buffered data does not effectively decrease over time. Disk scheduling algorithms with this property are called work-ahead-augmenting [71] or buffer-conserving [97]. Within a
round, it is possible to use round-robin or SCAN
scheduling.
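The SCAN-EDF tie-breaking described in the first item above can be sketched as follows (Python; the Request record and the example track/deadline values are illustrative, not taken from Ref. [94]):

```python
from dataclasses import dataclass

@dataclass
class Request:
    track: int      # disk track (cylinder) of the request
    deadline: float # absolute deadline of the request

def scan_edf_order(requests):
    """Serve the earliest deadline first; within a deadline class,
    sweep the tied requests in SCAN (ascending track) order."""
    return sorted(requests, key=lambda r: (r.deadline, r.track))

# Three requests share deadline 10 and are swept in track order;
# the deadline-5 request is served before all of them.
reqs = [Request(90, 10), Request(5, 10), Request(40, 10), Request(70, 5)]
order = scan_edf_order(reqs)
```

The SCAN optimization is only visible in the deadline-10 class, illustrating why SCAN-EDF's efficiency depends on how many requests fall into the same deadline class.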
However, only little work has been done on disk scheduling algorithms for mixed multimedia data workloads, serving discrete and continuous multimedia data
requests at the same time. Some examples are described
in [62,92,93,98–100]. These algorithms have to satisfy
three performance goals: (1) display continuous media
streams with minimal delay jitter; (2) serve discrete requests
with small average response times; and (3) avoid starvation
of discrete requests and keep variation of response times
low. In Ref. [92], disk scheduling algorithms for mixed-media workloads are classified by:
• Number of separate scheduling phases per round: one-phase algorithms produce mixed schedules, containing both discrete and continuous data requests. Two-phase algorithms have two non-overlapping scheduling phases, serving discrete and continuous data requests in isolation in the corresponding phase.
• Number of scheduling levels: hierarchical scheduling
algorithms for discrete data requests are based on defining clusters. The higher levels of the algorithms are
concerned with the efficient scheduling of clusters of
discrete requests. The lower levels efficiently schedule the requests within a cluster. The most important task to solve in this context is how to schedule
discrete data requests within the rounds of continuous
data requests, which are mostly served by SCAN variations.
For instance, Cello [101] uses such a two-level disk
scheduling architecture. It combines a class-independent
scheduler with a set of class-specific schedulers. Two time
scales are considered in the two levels of the framework to
allocate disk bandwidth: (1) coarse-grain allocation of bandwidth to application classes is performed by the class-independent scheduler; and (2) the fine-grain interleaving
of requests is managed by the class-specific schedulers. This
separation enables the co-existence of multiple disk
scheduling mechanisms at the same time, depending on the application requirements.
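A Cello-like two-level structure might be sketched as follows (Python; the proportional slot allocation and the example class policies are our simplifying assumptions, not Cello's actual algorithm):

```python
def two_level_schedule(classes, round_slots):
    """Coarse grain: split the slots of one service round among classes in
    proportion to their configured bandwidth share (class-independent
    scheduler).  Fine grain: each class-specific scheduler orders its own
    pending requests and contributes up to its quota."""
    total = sum(share for share, _, _ in classes)
    schedule = []
    for share, queue, class_scheduler in classes:
        quota = round_slots * share // total      # coarse-grain allocation
        picked = class_scheduler(queue)[:quota]   # fine-grain interleaving
        schedule.extend(picked)
    return schedule

# Hypothetical class-specific policies: SCAN (sort by track) for
# continuous media, FCFS for best-effort requests.
scan = lambda q: sorted(q)
fcfs = lambda q: list(q)
classes = [(2, [55, 10, 80], scan),   # continuous media, 2/3 of bandwidth
           (1, [7, 3], fcfs)]         # best effort, 1/3 of bandwidth
sched = two_level_schedule(classes, round_slots=6)
```

The separation is visible in the code: swapping `scan` or `fcfs` for another policy changes only the fine-grain level, not the bandwidth split.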
7. Memory management
Memory is an important resource, which has to be carefully managed. The virtual memory subsystem of commodity OSs allows processes to run in their own virtual
address spaces and to use more memory than physically
available. Thus, the memory manager has several complex
tasks such as keeping track of available resources and assigning
physical memory to a single process [57,102]. Further key
operations of the virtual memory system include [103,104]:
• Allocation of each process’ virtual address space and
mapping physical pages into a virtual address space
with appropriate protection.
• The page fault handler manages unmapped and invalid
memory references. Page faults happen when unmapped
memory is accessed, and memory references that are
inconsistent with the current protection are invalid.
• Loading data into memory and storing them back to disk.
• Duplicating an address space in case of a fork call.
Since virtual memory is mapped onto actual available
memory, the memory manager has to do paging or swapping, but due to the real-time performance sensitivity of multimedia applications, swapping should not be used in a
multimedia OS [57]. Thus, we focus on paging-based
memory systems. Techniques such as demand-paging and
memory-mapped files have been successfully used in
commodity OSs [17,105]. However, these techniques fail
to support multimedia applications, because they introduce
unpredictable memory access times, cause poor resource
utilization, and reduce performance. In the following
subsections, we present new approaches for memory allocation and utilization, data replacement, and prefetching using
application-specific knowledge to solve these problems.
Furthermore, we give a brief description of the UVM
Virtual Memory System that replaces the traditional virtual
memory system in NetBSD 1.4.
7.1. Memory allocation
Usually, upon process creation, a virtual address space is
allocated which contains the data of the process. Physical
memory is then allocated and assigned to a process and then
mapped into the virtual address space of the process according to available resources and a global or local allocation
scheme. This approach is also called user-centered allocation. Each process has its own share of the resources.
However, traditional memory allocation on a per client
(process) basis suffers from a linear increase of required
memory with the number of processes.
In order to better utilize the available memory, several
systems use so-called data-centered allocation where
memory is allocated to data objects rather than to a single
process. Thus, the data is seen as a resource principal. This
enables more cost-effective data-sharing techniques
[106,107]: (1) batching starts the video transmission when
several clients request the same movie and allows several
clients to share the same data stream; (2) buffering (or
bridging) caches data between consecutive clients omitting
new disk requests for the same data; (3) stream merging (or
adaptive piggy-backing) displays the same video clip at
different speeds to allow clients to catch up with each
other and then share the same stream; (4) content insertion
is a variation of stream merging, but rather than adjusting
the display rate, new content, e.g. commercials, is inserted
to align the consecutive playouts temporally; and (5) periodic services (or enhanced pay-per-view) assigns each clip a
retrieval period where several clients can start at the beginning of each period to view the same movie and to share
resources. These data-sharing techniques are used in several
systems. For example, a per movie memory allocation
scheme, i.e. a variant of the buffering scheme, for VoD
applications is described in Ref. [108]. All buffers are
shared among the clients watching the same movie and
work like a sliding window on the continuous data. When
the first client has consumed nearly all the data in the buffer,
it starts to refresh the oldest buffers with new data. Periodic
services are used in pyramid broadcasting [109]. The data is
split into partitions of growing size, because the consumption
rate of one partition is assumed to be lower than the downloading rate of the subsequent partition. Each partition is
then broadcast in short intervals on separate channels. A
client does not send a request to the server, but instead it
tunes into the channel transmitting the required data. The
data is cached on the receiver side, and during the playout of
a partition, the next partition is downloaded. In Refs.
[110,111], the same broadcasting idea is used. However,
to avoid very large partitions at the end of a movie and
thus to reduce the client buffer requirement, the partitioning
is changed such that not every partition increases in size, but
only each nth partition. Performance evaluations show that
the data-centered allocation schemes scale much better with
the number of users compared to user-centered allocation.
The total buffer space required is reduced, and the average
response time is minimized by using a small partition size at
the beginning of a movie.
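The growing partition sizes of pyramid broadcasting can be illustrated with a small sketch (Python; the geometric growth factor `alpha` and the example figures are hypothetical, not taken from Ref. [109]):

```python
def pyramid_partitions(movie_len, n_channels, alpha=2.0):
    """Split a movie into n_channels partitions of geometrically growing
    size (factor alpha), as in pyramid broadcasting.  Returns the
    partition lengths in playout seconds; the sizes sum to movie_len."""
    base = movie_len * (alpha - 1) / (alpha ** n_channels - 1)
    return [base * alpha ** i for i in range(n_channels)]

# A 120-second clip over 4 channels with doubling partition sizes.
parts = pyramid_partitions(movie_len=120.0, n_channels=4, alpha=2.0)
```

The first (smallest) partition bounds the worst-case startup delay, since it is rebroadcast most often, while each later partition can be downloaded during the playout of its predecessor.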
The memory reservation per storage device mechanism
[78] allocates a fixed, small number of memory buffers per
storage device in a server-push VoD server using a cycle-based scheduler. In the simplest case, only two buffers of
identical size are allocated per storage device. These buffers
work co-operatively, and during each cycle, the buffers
change task as data is received from disk. That is, data
from one process is read into the first buffer, and when all
the data is loaded into the buffer, the system starts to
transmit the information to the client. At the same time,
the disk starts to load data from the next client into the
other buffer. In this way, the buffers change task from
receiving disk data to transmitting data to the network until
all clients are served. The admission control adjusts the
number of concurrent users to prevent data loss when the
buffers switch and ensures the maintenance of all client
services.
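The buffer role-swapping can be sketched sequentially as follows (Python; a real server overlaps the disk read and the network transmit, which this single-threaded sketch does not):

```python
def serve_cycle(clients, read_block, transmit):
    """One service cycle with two co-operating buffers: while the full
    buffer (buf_a) is transmitted to one client, the idle buffer (buf_b)
    is filled from disk for the next client; then the buffers swap roles."""
    buf_a, buf_b = None, None
    for client in clients:
        buf_b = read_block(client)      # disk fills the idle buffer
        if buf_a is not None:
            transmit(buf_a)             # network drains the full buffer
        buf_a, buf_b = buf_b, None      # buffers change task
    if buf_a is not None:
        transmit(buf_a)                 # drain the last buffer

log = []
serve_cycle(["c1", "c2"],
            read_block=lambda c: f"data({c})",
            transmit=lambda b: log.append(b))
```

Admission control then only has to ensure that one disk read fits within one transmit time, so the buffers can switch without data loss.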
In Ref. [112], the traditional allocation and page-wiring
mechanism in Real-Time Mach is changed. To prevent privileged users from monopolizing memory usage by wiring unlimited amounts of pages, only real-time threads are allowed to wire pages, and only within their limited amount of allocated memory, i.e. if more pages are needed,
a request has to be sent to the reservation system. Thus,
pages may be wired in a secure way, and the reservation
system controls the amount of memory allocated to each
process.
7.2. Data replacement
When there is need for more buffer space, and there are
no available buffers, a buffer has to be replaced. How to best
choose which buffer to replace depends on the application.
However, due to the high data consumption rate in multimedia applications, data is often replaced before it might be
reused. The gain of using a complex page replacement
algorithm might be wasted, and a traditional algorithm as described in Refs. [113] or [102] might be used. Nevertheless, in some multimedia applications where data often
might be reused, proper replacement algorithms may
increase performance. The distance [114], the generalized
interval caching [115], and the SHR [116] schemes all replace buffers based on the distance between consecutive clients playing back the same data and on the amount of available buffers.
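The common idea behind these schemes can be sketched roughly as follows (Python; caching the shortest intervals first is a simplification of the cited algorithms, and the interval lengths are hypothetical):

```python
def pick_cached_intervals(intervals, buffer_capacity):
    """Interval-caching sketch: an interval is the gap (in buffered
    blocks) between two consecutive clients of the same stream.  Caching
    the shortest intervals first maximizes the number of client pairs
    served from memory within the buffer budget."""
    cached, used = [], 0
    for length in sorted(intervals):          # shortest intervals first
        if used + length <= buffer_capacity:  # does the interval fit?
            cached.append(length)
            used += length
    return cached

# Hypothetical interval lengths (blocks between consecutive viewers):
cached = pick_cached_intervals([30, 5, 12, 50], buffer_capacity=40)
```

The trailing client of each cached interval is then served entirely from buffers filled by the leading client, so no second disk stream is needed.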
Usually, data replacement is handled by the OS kernel
where most applications use the same mechanism. Thus, the
OS has full control, but the used mechanism is often tuned to
best overall performance and does not support application-specific requirements. In Nemesis [105], self-paging has been introduced as a technique to provide QoS to multimedia applications. The basic idea of self-paging is to “require every application to deal with all its own memory faults using its own concrete resources”. All paging operations are removed from the kernel; the kernel is only responsible for dispatching fault notifications. This gives the
application flexibility and control, which might be needed in
multimedia systems, at the cost of maintaining its own
virtual memory operations. However, a major problem of
self-paging is to optimize the global system performance.
Allocating resources directly to applications gives them
more control, but that means optimizations for global
performance improvement are not directly achieved.
7.3. Prefetching
The poor performance of demand-paging is due to the
low disk access speeds. Therefore, prefetching data from
disk to memory is better suited to support continuous playback of time-dependent data types. Prefetching is a mechanism to preload data from slow, high-latency storage devices
such as disks to fast, low-latency storage like main memory.
This reduces the response time of a data read request dramatically and increases the disk I/O bandwidth. Prefetching
mechanisms in multimedia systems can take advantage of
the sequential characteristics of multimedia presentations.
For example, in Ref. [117], a read-ahead mechanism
retrieves data before it is requested if the system determines
that the accesses are sequential. In Ref. [118], the utilization
of buffers and disk is optimized by prefetching all the shortest database queries, maximizing the number of processes
that can be activated once the running process is finished. In
Ref. [119], assuming a linear playout of the continuous data
stream, the data needed in the next period (determined by a
tradeoff between the maximum concurrent streams and the
initial delay) is prefetched into a shared buffer. Preloading
data according to the loading and consuming rate and the
available amount of buffers is described in Ref. [120].
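A sequential read-ahead detector in the spirit of Ref. [117] might look as follows (Python; the window size and the block-read interface are hypothetical):

```python
def readahead(read_block, last, current, window=4):
    """Sequential read-ahead sketch: if the current block follows the
    previous one, the access pattern is assumed sequential and the next
    `window` blocks are prefetched ahead of the playout point."""
    prefetched = []
    if current == last + 1:                       # sequential access detected
        for blk in range(current + 1, current + 1 + window):
            prefetched.append(read_block(blk))    # preload upcoming blocks
    return prefetched

fetched = []
readahead(lambda b: fetched.append(b) or b, last=6, current=7, window=3)
```

For continuous media, the sequentiality test almost always succeeds, which is exactly why prefetching pays off for this workload.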
In addition to the above-mentioned prefetching mechanisms designed for multimedia applications, more general
purpose facilities for retrieving data in advance are designed
which also could be used for certain multimedia applications. The informed prefetching and caching strategy [121]
preloads a certain amount of data where the buffers are
allocated/deallocated according to a global max–min valuation. This mechanism is further developed in Ref. [122]
where the automatic hint generation, based on speculative
pre-executions using mid-execution process states, is used
to prefetch data for possible future read requests. Moreover,
the dependence-based prefetching, described in Ref. [123],
captures the access patterns of linked data structures. A
prefetch engine runs in parallel with the original
program using these patterns to predict future data
references. Finally, in Ref. [124], an analytic approach
to file prefetching is described. During the execution of
a process a semantic data structure is built showing the file
accesses. When a program is re-executed, the saved access
trees are compared against the current access tree of the
activity, and if a similarity is found, the stored tree is used
to preload files.
Obviously, knowledge (or estimations) about application
behavior might be used for both replacement and
prefetching. In Ref. [125], the buffer replacement and
preloading strategy least/most relevant for presentation
designed for interactive continuous data streams is
presented. A multimedia object is replaced and prefetched
according to its relevance value computed according to the
presentation point/modus of the data playout. In Ref. [126],
this algorithm is extended for multiple users and QoS
support.
7.4. UVM virtual memory system

The UVM Virtual Memory System [103,104] replaces the virtual memory object, fault handling, and pager of the BSD virtual memory system, and retains only the machine dependent/independent layering and mapping structures. For example, the memory mapping is redesigned to increase efficiency and security, and map entry fragmentation caused by memory wiring is reduced. In BSD, the memory object structure is a stand-alone abstraction and under control of the virtual memory system. In UVM, the memory object structure is considered a secondary structure designed to be embedded with a handle for memory mapping, resulting in better efficiency, more flexibility, and fewer conflicts with external kernel subsystems. The new copy-on-write mechanism avoids unnecessary page allocations and data copying, and grouping or clustering the allocation and use of resources improves performance. Finally, a virtual memory-based data movement mechanism is introduced which allows data sharing with other subsystems, i.e. when combined with the I/O or IPC systems, it can reduce the data copying overhead in the kernel.

8. Management of other resources

This section takes a brief look at management aspects of OS resources that have not yet been discussed, such as scheduling of the system bus and cache management. Furthermore, we describe some mechanisms for speed improvements in memory access. Packet scheduling mechanisms to share network bandwidth between multiple streams at the host–network interface are not discussed here due to space considerations. All solutions for packet scheduling in OSs are adopted from packet scheduling in packet networks.

8.1. Bus scheduling

The SCSI bus is a priority-arbitrated bus. If multiple devices, e.g. disks, want to transfer data, the device with the highest priority will always get the bus. In systems with multiple disks, it is possible that real-time streams supported from a low-priority disk get starved by high-priority disks that serve best-effort requests [127]. DROPS [128] schedules requests to the SCSI subsystem such that the SCSI bandwidth can be fully exploited. It divides SCSI time into slots, where the size of the slots is determined by the worst-case seek times of the disk drives. SCSI is a relatively old technology, and PCI has become the main bus technology for multimedia PCs and workstations [129]. However, to the best of our knowledge, no work has been reported on scheduling of the PCI bus or other advanced bus technologies to support QoS, probably because the bus is no longer regarded as one of the most limiting performance bottlenecks, except in massively parallel I/O systems.

8.2. Cache management

All real-time applications rely on predictable scheduling, but the memory cache design makes it hard to forecast and schedule the processor time [10]. Furthermore, memory bandwidth and the general OS performance have not increased at the same rate as CPU performance. Benchmarked performance can be improved by enlarging and speeding up static RAM-based cache memory, but the large amount of multimedia data that has to be handled by the CPU and memory system will likely decrease cache hit ratios. If two processes use the same cache lines and are executed concurrently, there will not only be an increase in context switch overheads, but also a cache-interference cost that is more difficult to predict. Thus, the system performance may be dominated by slower main memory and I/O accesses. Furthermore, the busier a system is, the more likely it is that involuntary context switches occur, longer run queues must be searched by the scheduler, etc., flushing the caches even more frequently [17].

One approach to improve performance is to partition the second-level cache as described in Refs. [10,128]. Working sets of real-time and time-sharing applications are allowed to be separated into different partitions of the second-level cache. The time-sharing applications then cannot disrupt the cached working sets of real-time applications, which leads to better worst-case predictability.

Another approach is discussed in Ref. [130]. A very low overhead thread package is used, letting the application specify each thread’s use of data. The thread scheduler then executes in turn all threads using the same data. In this way, the data that is already in the cache is used by all threads needing it before it is flushed.

Bershad et al. [131] describe an approach using conflict detection and resolution to implement a cheap, large, and fast direct-mapped cache. The conflicts are detected by recording and summarizing a history of cache misses, and a software policy within the OS’s virtual memory system removes conflicts by dynamically remapping pages. This approach nearly matches the performance of a two-way set associative cache, but with lower hardware cost and lower complexity.
8.3. Speed improvements in memory accesses
The term dynamic RAM (DRAM), coined to indicate that
any random access in memory takes the same amount of
time, is slightly misleading. Most modern DRAMs provide
special capabilities that make it possible to perform some
accesses faster than others [132]. For example, consecutive
accesses to the same row in a page-mode memory are faster
than random accesses, and consecutive accesses that hit
different memory banks in a multi-bank system allow
concurrency and are thus faster than accesses that hit the same bank. The key point is that the order of the requests strongly affects the performance of the memory devices. For certain classes of computations, like those which involve streams of data where a high degree of spatial locality is present and where we, at least in theory, have perfect knowledge of the future references, a reordering of the memory accesses might give an improvement in memory bandwidth.

Fig. 5. Data transfers and copy operations.

The most common method to reduce latency is to increase the cache line size, i.e. using the memory bandwidth to fill several cache locations at the same time for each access. However, if the stream has a non-unit stride (the stride is the distance between successive stream elements in memory), i.e. successive data elements do not follow each other in memory, the cache will load data which will not be used. Thus, lengthening the cache line size increases the effective bandwidth of unit-stride streams, but decreases the cache hit rate for non-streamed accesses.

Another way of improving memory bandwidth in memory–cache data transfers for streamed access patterns is described in Ref. [132]. First, since streams often have no temporal locality, a separate buffer storage is provided for streamed data. This means that streamed data elements, which often are replaced before they might be reused, do not affect the replacement of data elements that might benefit from caching. Second, to take advantage of the order sensitivity of the memory system, a memory-scheduling unit is added to reorder the accesses. During compile-time, information about addresses, strides of a stream, and the number of data elements is collected, enabling the memory-scheduling unit to reorder the requests during run-time.
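The benefit of such reordering can be sketched as follows (Python; the row-bit split and the addresses are hypothetical, and a real memory-scheduling unit operates in hardware on physical addresses):

```python
def reorder_accesses(addresses, row_bits=10):
    """Order-sensitivity sketch: group pending stream accesses by DRAM
    row, so that consecutive requests hit the same open row (page mode)
    instead of alternating rows, which forces a row change each time."""
    return sorted(addresses, key=lambda a: a >> row_bits)

# Two interleaved streams touching rows 0 and 1 (1024-byte rows):
addrs = [0, 1024, 8, 1032, 16, 1040]
reordered = reorder_accesses(addrs)
```

In the reordered sequence, all row-0 accesses are issued before any row-1 access, so the open row is changed once instead of on every request.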
9. I/O tuning
Traditionally, there are several different possible data
transfers and copy operations within an end-system as
shown in Fig. 5. These often involve several different
components. Using the disk-to-network data path as an
example, a data object is first transferred from disk to
main memory (A). The data object is then managed by the
many subsystems within the OS designed with different
objectives, running in their own domain (either in user or
kernel space), and therefore, managing their buffers differently. Due to different buffer representations and protection
mechanisms, data is usually copied, at a high cost, from
domain to domain ((B), (C), or (D)) to allow the different
subsystems to manipulate the data. Finally, the data object is
transferred to the network interface (E). In addition to all
these data transfers, the data object is loaded into the
cache (F) and CPU registers (G) when the data object is
manipulated.
Fig. 5 clearly identifies the reason for the poor performance of the traditional I/O system. Data is copied several
times between different memory address spaces which also
causes several context switches. Copy operations and
context switches represent the main performance bottlenecks.
Furthermore, different subsystems, e.g. file system and
communication subsystem, are not integrated. Thus, they
include redundant functionality like buffer management,
and several identical copies of a data object might be stored
in main memory, which in turn reduces the effective size of
the physical memory. Finally, when concurrent users
request the same data, the different subsystems might have
to perform the same operations on the same data several
times.
We distinguish three types of copy operations: memory–
CPU, direct I/O (i.e. memory–I/O device), and memory–
memory. Solutions for these types of copy operations have
been developed for general purpose and application specific
systems. The two last subsections describe a sample
approach for each.
9.1. Memory–CPU copy operations
Data manipulations are time consuming and are often part
of different, distinct program modules or communication
protocol layers, which typically access data independently
of each other. Consequently, each data manipulation may
require access to uncached data resulting in loading the data
from memory to a CPU register, manipulating it, and possibly storing it back to memory. Thus, these repeated
memory–CPU data transfers, denoted (F) and (G) in Fig.
5, can have large impacts on the achieved I/O bandwidth. To
decrease the number of memory–CPU data transfers, integrated layer processing [133,134] performs all data manipulation steps, e.g. calculating error detection checksums,
executing encryption schemes, transforming data for
presentation, and moving data between address spaces, in
one or two integrated processing loops instead of
performing them stepwise as in most systems.
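Integrated layer processing can be illustrated with a sketch that fuses a deliberately simplistic checksum with the copy loop, touching each word only once (Python; the 16-bit additive checksum is a placeholder, not a real protocol checksum):

```python
def copy_and_checksum(src):
    """Integrated-layer-processing sketch: compute the checksum and copy
    the data in a single loop, so each word is loaded from memory into a
    register only once instead of once per manipulation step."""
    dst, checksum = [], 0
    for word in src:                       # one integrated pass over the data
        checksum = (checksum + word) & 0xFFFF   # manipulation step 1
        dst.append(word)                        # manipulation step 2 (copy)
    return dst, checksum

data = [1, 2, 3, 0xFFFD]
dst, csum = copy_and_checksum(data)
```

A stepwise implementation would perform two separate passes (one for the checksum, one for the copy) and thus roughly double the memory–CPU traffic that this loop incurs.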
9.2. Memory–I/O device copy operations

Data is transferred between hardware devices, such as disks and network adapters, and applications’ physical memory. This is often done via an intermediate subsystem, like the file system or the communication system, adding an extra memory copy. A mechanism to transfer data without multiple copying is direct I/O, which in some form is available in several commodity OSs today, e.g. Solaris and Windows NT. Direct memory access (DMA) or programmed I/O (PIO) is used to transfer data directly into a frame buffer where, e.g. the file system’s buffer cache is omitted in a data transfer from disk to application. Without involving the CPU in the data transfers, DMA can achieve transfer rates close to the limits of main memory and the I/O bus, but DMA increases complexity in the device adapters, and caches are often not coherent with respect to DMA [135]. Using PIO, on the other hand, the CPU is required to transfer every word of data between memory and the I/O adapter. Thus, only a fraction of the peak I/O bandwidth is achieved. Due to high transfer rates, DMA is often used for direct I/O data transfers. However, despite the reduced bandwidth, PIO can sometimes be preferable over DMA. If data manipulations, e.g. checksum calculations, can be integrated with the PIO data transfer, it is possible to save one memory access, and after a programmed data movement, the data may still reside in the cache, reducing further memory traffic.

In case of application–disk transfers, direct I/O can often be applied since the file system usually does not touch the data itself. However, in case of application–network adapter transfers, the communication system must generate packets, calculate checksums, etc., making it harder to avoid the data transfer through the communication system. Nevertheless, there are several attempts to avoid data touching and copy operations, i.e. reducing the traditional (B)–(E) data path in Fig. 5 to only (E). Afterburner [136] and Medusa [137] copy data directly onto the on-board memory using PIO, with integrated checksum and data length calculation, leaving just enough space in front of the cluster to add a packet header. Using DMA and a user-level implementation of the communication software, the application device channel [138,139] gives restricted but direct access to an ATM network adaptor, removing the OS kernel from the critical network send/receive path. In Ref. [49], no memory-to-memory copying is needed, using shared buffers or direct media streaming by linking the device and network connection together. Finally, in Refs. [140,141], zero-copy communication system architectures are reported for TCP and ATM, respectively. Virtual memory page remapping (see next subsection) is used to eliminate copying between applications running in user space and the OS kernel, and DMA is used to transfer data between memory and the network buffer.

9.3. Memory–memory copy operations
Direct I/O is typically used when transferring data
between main memory and a hardware device as described
above. However, data transfers between different process
address spaces is done through well-defined channels, like
pipes, sockets, files, and special devices, giving each
process full control of its own data [142]. Nevertheless,
such physical copying is slow and requires at least two
system calls per transaction, i.e. one on the sender and one on the receiver side. One way of reducing the IPC costs is to use
virtual page (re)mapping. That is, the data element is not
physically copied byte by byte, but only the address in
virtual memory to the data element in physical memory is
copied into the receiver’s address space. Access rights to the
data object after the data transfer are determined by the semantics used:
• The copy model copies all data from domain to domain
giving each process full control of its own data at the cost
of cross domain data copying and maintaining several
identical copies in memory.
• The move model removes the data from the source
domain by virtually remapping the data into the destination domain, avoiding the multiple-copies problem.
However, if the source later needs to re-access the
moved data, e.g. when handling a retransmission request,
the data must be fetched back.
• The share model makes the transferred data visible and
accessible to both the source and the target domain by
keeping pointers in virtual memory to the same physical
pages, i.e. by using shared memory where several
processes map the same data into their address space.
Thus, all the sharing processes may access the same
piece of memory without any system call overhead
other than the initial cost of mapping the memory.
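The share model can be illustrated with POSIX-style shared memory (Python’s `multiprocessing.shared_memory`; for brevity, the “consumer” here attaches within the same process, whereas a real consumer would attach by name from another process):

```python
from multiprocessing import shared_memory

# Share model sketch: source and target map the same physical pages,
# so a "transfer" only passes a name/pointer, never copies the payload.
payload = b"frame-0001"
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload              # producer writes once

# A consumer attaches to the same segment by name instead of copying:
view = shared_memory.SharedMemory(name=shm.name)
received = bytes(view.buf[:len(payload)])     # no cross-domain data copy

view.close()
shm.close()
shm.unlink()
```

After the initial mapping cost, both domains access the same piece of memory without further system call overhead, which is exactly the property the share model exploits.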
Several general cross-domain data copy avoidance architectures have been suggested, trying to minimize or even eliminate all (C), (B), and (D) copy operations depicted in
Fig. 5. Tenex [143] was one of the first systems to use virtual
copying, i.e. several pointers in virtual memory refer to one
physical page. Accent [144,145] generalized the concepts of
Tenex by integrating virtual memory management and IPC
in such a way that large data transfers could use memory
mapping techniques rather than physical data copying. The
V distributed system [146] and the DASH IPC mechanism
[147] use page remapping, and the container shipping
facility [148,149] uses virtual inter-domain data transfers
based on the move model where all in-memory copying is
removed. Furthermore, fast buffers (fbufs) [138,150] is a facility for I/O buffer management and data transfers across protection domain boundaries, primarily designed for handling network streams, in which shared virtual memory is combined with virtual page remapping. In Ref. [151],
fbufs is extended to a zero-copy I/O framework. Fast
in-kernel data paths between I/O objects, increasing
throughput and reducing context switch operations, are
described in Refs. [152,153]. A new system call, splice(),
moves data asynchronously and without user–process intervention to and from I/O objects specified by file descriptors.
These descriptors specify the source and sink of I/O data,
respectively. This system call is extended in the stream()
system call of the Roadrunner I/O system [154,155] to
support kernel data streaming between any pair of I/O
elements without crossing any virtual memory boundaries
using techniques derived from stackable file systems. The
Genie I/O system [156–158] inputs or outputs data to or
from shared buffers in-place (i.e. directly to or from application buffers) without touching distinct intermediate
system buffers. Data is shared by managing reference counters, and a page is only deallocated if there are no processes
referencing this page. The universal continuous media I/O
system [159,160] combines all types of I/O into a single
abstraction. The buffer management system is allowed to
align data buffers on page boundaries so that data can be
moved without copying which means that the kernel and the
application are sharing a data buffer rather than maintaining
their own separate copy. The UVM Virtual Memory System
[103,104] data movement mechanism provides new techniques that allow processes to exchange and share data in
memory without copying. The page layout and page transfer
facilities give support for page loan-out and reception of pages of memory, and the map entry passing enables exchanging chunks of the processes’ virtual address space.
9.4. IO-Lite
IO-Lite [161,162] is an I/O buffering and caching system
for a general purpose OS inspired by the fbuf mechanism.
IO-Lite unifies all buffering in a system. In particular,
buffering in all subsystems is integrated, and a single
physical copy of the data is shared safely and concurrently.
This is achieved by storing buffered I/O data in immutable
buffers whose location in memory never changes. Access control and protection are ensured at the granularity of
processes by maintaining access control lists to cached
pools of buffers. For cross-domain data transfers, IO-Lite
combines page remapping and shared memory.
All data is encapsulated in mutable buffer aggregates,
which are then passed among the different subsystems
and applications by reference. The sharing of read-only
immutable buffers enables efficient transfers of I/O data
across protection domain boundaries, i.e. all subsystems
may safely refer to the same physical copy of the data without problems of synchronization, protection, consistency,
etc. The price to pay, however, is that data cannot be modified in-place. This is solved by the buffer aggregate abstraction: the aggregate is mutable, a modified value is stored in a new buffer, and the modified sections are logically joined with the unchanged data through pointer manipulation.
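The split between immutable buffers and mutable buffer aggregates can be sketched as follows. The class names and the overwrite operation are hypothetical illustrations of the idea, not IO-Lite's actual interface.

```python
class Buffer:
    """Immutable I/O buffer: once created, its contents never change,
    so any number of protection domains can safely share it."""
    def __init__(self, data):
        self._data = bytes(data)

    @property
    def data(self):
        return self._data


class BufferAggregate:
    """Mutable list of (buffer, offset, length) entries. A 'write'
    puts the modified range into a new buffer and splices it in by
    pointer manipulation; existing buffers are never touched."""
    def __init__(self, buf):
        self.entries = [(buf, 0, len(buf.data))]

    def contents(self):
        return b"".join(b.data[o:o + n] for b, o, n in self.entries)

    def overwrite(self, pos, new_bytes):
        end = pos + len(new_bytes)
        prefix, suffix, cursor = [], [], 0
        for buf, off, length in self.entries:
            lo, hi = cursor, cursor + length
            if lo < pos:   # part of this entry before the modified range
                prefix.append((buf, off, min(length, pos - lo)))
            if hi > end:   # part of this entry after the modified range
                skip = max(0, end - lo)
                suffix.append((buf, off + skip, hi - max(lo, end)))
            cursor = hi
        patch = Buffer(new_bytes)  # the change lives in a fresh buffer
        self.entries = prefix + [(patch, 0, len(new_bytes))] + suffix
```

Overwriting the middle of an aggregate leaves the original buffer intact, so any other domain still referencing that buffer continues to see unchanged data.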
9.5. Multimedia mbuf
The multimedia mbuf (mmbuf) [163,164] is specially
designed for disk-to-network data transfers. It provides a
zero-copy data path for networked multimedia applications
by unifying the buffering structure in file I/O and network I/O.
This buffer system looks like a collection of clustered mbufs
that can be dynamically allocated and chained. The mmbuf
header includes references to an mbuf header and a buffer cache header. By manipulating the mmbuf header, the mmbuf can be transformed either into a traditional buffer, which a file system and a disk driver can handle, or into an mbuf, which the network protocols and network drivers can understand.
A new interface, which coexists with the old file system interface, is provided to retrieve and send data. The old buffer cache is bypassed by reading data from a file directly into an mmbuf chain. Both synchronous (blocking) and asynchronous (non-blocking) operations are supported, and read and send requests for multiple streams can be batched together in a single call, minimizing system call overhead.
At setup time, each stream allocates a ring of buffers, each
of which is an mmbuf chain. The size of each buffer
element, i.e. the mmbuf chain, depends on the size of the
multimedia frame it stores, and each buffer element can be
in one of four states: empty, reading, full, or sending.
Furthermore, to coordinate the data read and send activities,
two pointers (read and send) to the ring buffer are maintained. Then, for each periodic invocation of the stream
process, these pointers are used to handle data transfers. If
the read pointer points to a buffer element in the empty state, data is read into this chain of mmbufs, and the pointer is advanced to the succeeding chain, on which the next read is performed. If the send pointer points to a full buffer element, the data stored in that element is transmitted.
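The per-stream ring of buffer elements, driven by independent read and send pointers, can be modeled as below. This simplified sketch uses hypothetical names and collapses the transitional reading and sending states, i.e. reads and sends are modeled as completing instantly.

```python
EMPTY, FULL = "empty", "full"   # 'reading'/'sending' states collapsed for brevity

class StreamRing:
    """Ring of buffer elements (each standing in for an mmbuf chain),
    filled by a read pointer and drained by a send pointer."""
    def __init__(self, size):
        self.state = [EMPTY] * size
        self.data = [None] * size
        self.read_ptr = 0
        self.send_ptr = 0

    def read_tick(self, frame):
        """Periodic read activity: fill the element under read_ptr."""
        i = self.read_ptr
        if self.state[i] != EMPTY:
            return False                 # ring full: reader must wait
        self.data[i] = frame             # stands for a disk read into the chain
        self.state[i] = FULL
        self.read_ptr = (i + 1) % len(self.state)
        return True

    def send_tick(self):
        """Periodic send activity: transmit the element under send_ptr."""
        i = self.send_ptr
        if self.state[i] != FULL:
            return None                  # nothing ready to transmit yet
        frame = self.data[i]             # stands for handing the chain to the NIC
        self.state[i] = EMPTY
        self.send_ptr = (i + 1) % len(self.state)
        return frame
```

Because each pointer only advances past elements in the expected state, reading can never overrun unsent data, and sending never transmits an empty element.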
10. Conclusions
The aim of this article is to give an overview of recent
developments in the area of OS support for multimedia
applications. This is an active area, and many valuable research results have been published. Thus, we have not discussed or cited all recent results, but have tried to identify the major approaches and to present at least one representative of each.
Time-dependent multimedia data types, like audio and
video, will be a natural part of future applications and integrated together with time-independent data types, like text,
graphics, and images. Commodity OSs do not presently
support all the requirements of multimedia systems. New
OS abstractions need to be developed to support a mix of
applications with real-time and best-effort requirements and
to provide the necessary performance. Thus, management of
all system resources, including processors, main memory,
network, disk space, and disk I/O, is an important issue. The
management needed encompasses admission control,
allocation and scheduling, accounting, and adaptation.
Proposed approaches for better multimedia support include:
• New OS structures and architectures, like the library OSs
Exokernel and Nemesis.
• New mechanisms that are especially tailored for QoS
support, like specialized CPU and disk scheduling.
• New system abstractions, like resource principals for
resource ownership, inheritance of the associated
priorities, and accounting of resource utilization.
• Extended system abstractions to additionally support new
requirements, like synchronization support and metadata
in file systems.
• Avoidance of major system bottlenecks, for example elimination of copy operations through page remapping.
• Support for user-level control over resources including
user-level communication.
It is not clear what new OS architectures should look like, or even whether they are really needed at all. Monolithic and μ-kernel architectures can be developed further, and a careful design and implementation of such systems can provide both good performance and build on time-proven
approaches. When proposing new architectures, it becomes
very important to demonstrate both comparable or better
performance and better functionality than in existing solutions. Furthermore, it is important to implement and evaluate integrated systems and not only to study one isolated
aspect. In this respect, Nemesis is probably the most
advanced system.
To evaluate and compare performance and functionality
of new approaches, more detailed performance measurements and analysis are necessary. This implies designing
and implementing systems, and developing and using a
common set of micro- and application benchmarks for
evaluation of multimedia systems. The field is still very
active, and much work remains to be done before it becomes
known how to design and implement multimedia platforms.
Acknowledgements

We would like to thank Frank Eliassen, Liviu Iftode, Martin Karsten, Ketil Lund, Chuanbao Wang, and Lars Wolf for reviewing earlier versions of this paper and for their valuable comments and suggestions.

References

[1] C.L. Liu, J.W. Layland, Scheduling algorithms for multiprogramming in a hard real-time environment, Journal of the ACM 20 (1) (1973) 46–61.
[2] P. Goyal, X. Guo, H.M. Vin, A hierarchical CPU scheduler for multimedia operating systems, Proc. of 2nd USENIX Symp. on Operating Systems Design and Implementation (OSDI’96), Seattle, WA, USA, October 1996, pp. 107–121.
[3] H.-H. Chu, K. Nahrstedt, CPU service classes for multimedia applications, Proc. of IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS’99), Florence, Italy, June 1999.
[4] J.B. Chen, Y. Endo, K. Chan, D. Mazières, D. Dias, M.I. Seltzer, M.D. Smith, The measured performance of personal computer operating systems, ACM Transactions on Computer Systems 14 (1) (1996) 3–40.
[5] D. Engler, S.K. Gupta, F. Kaashoek, AVM: application-level virtual memory, Proc. of 5th Workshop on Hot Topics in Operating Systems (HotOS-V), Orcas Island, WA, USA, May 1995.
[6] M.F. Kaashoek, D.R. Engler, G.R. Ganger, H.M. Briceno, R. Hunt, D. Mazieres, T. Pinckney, R. Grimm, J. Jannotti, K. Mackenzie, Application performance and flexibility on exokernel systems, Proc. of 16th Symp. on Operating Systems Principles (SOSP’97), St. Malo, France, October 1997, pp. 52–65.
[7] I. Leslie, D. McAuley, R. Black, T. Roscoe, P. Barham, D. Evers, R. Fairbairns, E. Hyden, The design and implementation of an operating system to support distributed multimedia applications, IEEE Journal on Selected Areas in Communications 14 (7) (1996) 1280–1297.
[8] J. Liedtke, On micro kernel construction, Proc. of 15th ACM Symp. on Operating Systems Principles (SOSP’95), Copper Mountain, CO, USA, December 1995, pp. 237–250.
[9] J. Liedtke, Toward real microkernels, Communications of the ACM 39 (9) (1996) 70–77.
[10] H. Härtig, M. Hohmuth, J. Liedtke, S. Schönberg, J. Wolter, The performance of μ-kernel-based systems, Proc. of 16th ACM Symp. on Operating System Principles (SOSP’97), St. Malo, France, October 1997, pp. 66–77.
[11] G. Coulson, G. Blair, P. Robin, D. Shepherd, Supporting continuous media applications in a micro-kernel environment, in: O. Spaniol (Ed.), Architecture and Protocols for High-Speed Networks, Kluwer Academic, Dordrecht, 1994.
[12] T.E. Anderson, B.N. Bershad, E.D. Lazowska, H.M. Levy, Scheduler activations: effective kernel support for the user-level management of parallelism, ACM Transactions on Computer Systems 10 (1) (1992) 53–79.
[13] R. Govindan, D.P. Anderson, Scheduling and IPC mechanisms for continuous media, Proc. of 13th ACM Symp. on Operating Systems Principles (SOSP’91), Pacific Grove, CA, USA, October 1991, pp. 68–80.
[14] K.A. Bringsrud, G. Pedersen, Distributed electronic classrooms with large electronic whiteboards, Proc. of 4th Joint European Networking Conf. (JENC4), Trondheim, Norway, May 1993, pp. 132–144.
[15] T. Plagemann, V. Goebel, Analysis of quality-of-service in a wide-area interactive distance learning system, in: L. Wolf (Ed.), European Activities in Interactive Distributed Multimedia Systems and Telecommunication Services, Telecommunication Systems 11 (1/2) (1999) 139–160.
[16] K. Nahrstedt, R. Steinmetz, Resource management in networked multimedia systems, IEEE Computer 28 (5) (1995) 52–63.
[17] H. Schulzrinne, Operating system issues for continuous media, ACM/Springer Multimedia Systems 4 (5) (1996) 269–280.
[18] S. Araki, A. Bilas, C. Dubnicki, J. Edler, K. Konishi, J. Philbin, User-space communication: a quantitative study, Proc. of 10th Int. Conf. of High Performance Computing and Communications (SuperComputing’98), Orlando, FL, USA, November 1998.
[19] J.L. Peterson, A. Silberschatz, Operating System Concepts, Addison-Wesley, New York, 1985.
[20] K. Nahrstedt, H. Chu, S. Narayan, QoS-aware resource management for distributed multimedia applications, Journal on High-Speed Networking, Special Issue on Multimedia Networking 7 (3/4) (1999) 229–258.
[21] J. Gecsei, Adaptation in distributed multimedia systems, IEEE Multimedia 4 (2) (1997) 58–66.
[22] K. Lakshman, AQUA: an adaptive quality of service architecture for distributed multimedia applications, PhD thesis, Computer Science Department, University of Kentucky, Lexington, KY, USA, 1997.
[23] A. Goel, D. Steere, C. Pu, J. Walpole, SWiFT: A feedback control
and dynamic reconfiguration toolkit, Technical Report CSE-98-009,
Oregon Graduate Institute, Portland, OR, USA, June 1998.
[24] B. Noble, M. Satyanarayanan, D. Narayanan, J.E. Tilton, J. Flinn, K.
Walker, Agile application-aware adaptation for mobility, Proc. of
the 16th ACM Symp. on Operating System Principles (SOSP’97),
St. Malo, France, October 1997, pp. 276–287.
[25] A. Bavier, L. Peterson, D. Mosberger, BERT: a scheduler for besteffort and real-time paths, Technical Report TR 587-98, Princeton
University, Princeton, NJ, USA, August 1998.
[26] A. Bavier, B. Montz, L. Peterson, Predicting MPEG execution times,
Proc. of 1998 ACM Int. Conf. on Measurement and Modeling of
Computer Systems (SIGMETRICS’98), Madison, WI, USA, June
1998, pp. 131–140.
[27] G. Banga, P. Druschel, J.C. Mogul, Resource containers: a new
facility for resource management in server systems, Proc. of 3rd
USENIX Symp. on Operating Systems Design and Implementation
(OSDI’99) New Orleans, LA, USA, February 1999.
[28] C.W. Mercer, J. Zelenka, R. Rajkumar, On predictable operating
system protocol processing, Technical Report CMU-CS-94-165,
School of Computer Science, Carnegie Mellon University,
Pittsburgh, PA, May 1994.
[29] K. Jeffay, F.D. Smith, A. Moorthy, A. Anderson, Proportional share
scheduling of operating system services for real-time applications,
Proc. of 19th IEEE Real-Time System Symp. (RTSS’98), Madrid,
Spain, December 1998, pp. 480–491.
[30] C.A. Waldspurger, Lottery and Stride scheduling: flexible proportional-share resource management, PhD thesis, Department of
Electrical Engineering and Computer Science, Massachusetts
Institute of Technology, Cambridge, MA, USA, September 1995.
[31] B. Ford, J. Lepreau, Evolving Mach 3.0 to the migrating thread
model, Proc. of 1994 USENIX Winter Conf., San Francisco, CA,
USA, January 1994.
[32] R.K. Clark, E.D. Jensen, F.D. Reynolds, An architectural overview
of the alpha real-time distributed kernel, Workshop on MicroKernels and other Kernel Architectures, April 1992.
[33] G. Hamilton, P. Kougiouris, The spring nucleus: a microkernel for
objects, Proc. USENIX Summer Conf., Cincinnati, OH, USA, June
1993.
[34] B. Ford, S. Susarla, CPU inheritance scheduling, Proc. of 2nd
USENIX Symp. on Operating Systems Design and Implementation
(OSDI’96), Seattle, WA, USA, October 1996, pp. 91–105.
[35] J. Bruno, E. Gabber, B. Özden, A. Silberschatz, The Eclipse operating system: providing quality of service via reservation domains,
Proc. of USENIX Annual Technical Conf., New Orleans, LA,
June 1998.
[36] B. Verghese, A. Gupta, M. Rosenblum, Performance isolation:
sharing and isolation in shared memory multiprocessors, Proceeding
of 8th Int. Conf. on Architectural Support for Programming
Languages and Operating Systems (ASPLOS-VIII), San Jose, CA,
USA, October 1998.
[37] R. Rajkumar, K. Juvva, A. Molano, S. Oikawa, Resource kernels: a
resource-centric approach to real-time systems, Proc. of SPIE/ACM
Conf. on Multimedia Computing and Networking (MMCN’98), San
Jose, CA, USA, January 1998.
[38] M.B. Jones, P.J. Leach, R.P. Draves, J.S. Barrera, Modular real-time
resource management in the Rialto operating system, Proc. of 5th
Workshop on Hot Topics in Operating Systems (HotOS-V), Orcas
Island, WA, USA, May 1995, pp. 12–17.
[39] M.B. Jones, D. Rosu, M.-C. Rosu, CPU reservations and time
constraints: efficient, predictable scheduling of independent activities, Proc. of 16th ACM Symp. on Operating Systems Principles
(SOSP’97), St. Malo, France, October 1997, pp. 198–211.
[40] D. Mosberger, L.L. Peterson, Making paths explicit in the Scout
operating system, Proc. of 2nd USENIX Symp. on Operating
Systems Design and Implementation (OSDI’96), Seattle, WA,
USA, October 1996.
[41] O. Spatscheck, L.L. Peterson, Defending against denial of service
attacks in Scout, Proc. of 3rd USENIX Symp. on Operating Systems
Design and Implementation (OSDI’99), New Orleans, LA, USA,
February 1999, pp. 59–72.
[42] D.C. Steere, A. Goel, J. Gruenenberg, D. McNamee, C. Pu,
J. Walpole, A feedback-driven proportion allocator for real-rate
scheduling, Proc. of 3rd USENIX Symp. on Operating Systems
Design and Implementation (OSDI’99), New Orleans, LA, USA,
February 1999, pp. 145–158.
[43] J. Nieh, J.G. Hanko, J.D. Northcutt, G.A. Wall, SVR4 UNIX
scheduler unacceptable for multimedia applications, Proc. of 4th
Int. Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV’93), Lancaster, UK, November
1993.
[44] L.C. Wolf, W. Burke, C. Vogt, Evaluation of a CPU scheduling
mechanism for multimedia systems, Software—Practice and
Experience 26 (4) (1996) 375–398.
[45] I. Stoica, W. Abdel-Wahab, K. Jeffay, On the duality between
resource reservation and proportional share resource allocation,
Multimedia Computing and Networking 1997, SPIE Proc. Series,
Volume 3020, San Jose, CA, USA, February 1997, pp. 207–214.
[46] A. Demers, S. Keshav, S. Shenker, Analysis and simulation of a fair
queueing algorithm, Internet-Working: Research and Experience
1 (1) (1990) 3–26.
[47] L. Zhang, Virtual clock: a new traffic control algorithm for packet
switching networks, ACM Transactions on Computer Systems 9 (3)
(1991) 101–124.
[48] A.K. Parekh, R.G. Gallager, A generalized processor sharing
approach to flow control in integrated services networks: the
single-node case, IEEE/ACM Transactions on Networking 1 (3)
(1993) 344–357.
[49] D.K.Y. Yau, S.S. Lam, Operating system techniques for distributed
multimedia, Technical Report TR-95-36 (revised), Department of
Computer Sciences, University of Texas at Austin, Austin, TX,
USA, January 1996.
[50] J. Nieh, M.S. Lam, The design, implementation and evaluation of
Smart: a scheduler for multimedia applications, Proc. of 16th ACM
Symp. on Operating System Principles (SOSP’97), St. Malo, France,
October 1997, pp. 184–197.
[51] D.P. Anderson, S.Y. Tzou, R. Wahbe, R. Govindan, M. Andrews,
Support for Continuous Media in the DASH system, Proc. of 10th
Int. Conf. on Distributed Computing Systems (ICDCS’90), Paris,
France, May 1990, pp. 54–61.
[52] G. Coulson, A. Campbell, P. Robin, G. Blair, M. Papathomas,
D. Hutchinson, The design of a QoS controlled ATM based communication system in Chorus, IEEE Journal on Selected Areas of
Communications 13 (4) (1995) 686–699.
[53] P. Goyal, H.M. Vin, H. Cheng, Start-time fair queuing: a scheduling
algorithm for integrated services packet switching networks, Proc. of
ACM SIGCOMM’96, San Francisco, CA, USA, August 1996,
pp. 157–168.
[54] M.B. Jones, J.S. Barrera, A. Forin, P.J. Leach, D. Rosu, M.-C. Rosu,
An overview of the Rialto real-time architecture, Proc. of 7th ACM
SIGOPS European Workshop, Connemara, Ireland, September
1996, pp. 249–256.
[55] A. Molano, K. Juvva, R. Rajkumar, Real-time filesystems guaranteeing timing constraints for disk accesses in RT-Mach, Proc. of 18th
IEEE Real-Time Systems Symp. (RTSS’97), San Francisco, CA,
USA, December 1997.
[56] P.M. Chen, E.K. Lee, G.A. Gibson, R.H. Katz, D.A. Patterson,
RAID: high-performance, reliable, secondary storage, ACM
Computing Surveys 26 (2) (1994) 145–185.
[57] R. Steinmetz, Analyzing the multimedia operating system, IEEE
Multimedia 2 (1) (1995) 68–84.
[58] S.J. Leffler, M.K. McKusick, M.J. Karels, J.S. Quarterman, The
Design and Implementation of the 4.3BSD UNIX Operating System,
Addison-Wesley, New York, 1989.
[59] J.R. Santos, R. Muntz, Performance analysis of the RIO multimedia
storage system with heterogeneous disk configurations, Proc. of 6th
ACM Multimedia Conf. (ACM MM’98), Bristol, UK, September
1998, pp. 303–308.
[60] R.L. Haskin, The Shark continuous-media file server, Proc. of 38th
IEEE Int. Conf.: Technologies for the Information Superhighway
(COMPCON’93), San Francisco, CA, USA, February 1993,
pp. 12–15.
[61] R.L. Haskin, F.B. Schmuck, The Tiger Shark file system, Proc. of
41st IEEE Int. Conf.: Technologies for the Information Superhighway (COMPCON’96), Santa Clara, CA, USA, February 1996,
pp. 226–231.
[62] K.K. Ramakrishnan, L. Vaitzblit, C. Gray, U. Vahalia, D. Ting,
P. Tzelnic, S. Glaser, W. Duso, Operating system support for a
video-on-demand file service, Proc. of 4th Int. Workshop on
Network and Operating System Support for Digital Audio and
Video (NOSSDAV’93), Lancaster, U.K., 1993, pp. 216–227.
[63] P.R. Barham, A fresh approach to file system quality of service,
Proc. of 7th Int. Workshop on Network and Operating System
Support for Digital Audio and Video (NOSSDAV’97), St. Louis,
MO, USA, May 1997, pp. 119–128.
[64] C. Martin, P.S. Narayanan, B. Özden, R. Rastogi, A. Silberschatz,
The Fellini multimedia storage server, in: S.M. Chung (Ed.), Multimedia Information and Storage Management, Kluwer Academic,
Dordrecht, 1996, pp. 117–146.
[65] P.J. Shenoy, P. Goyal, S.S. Rao, H.M. Vin, Symphony: an integrated
multimedia file system, Proc. of ACM/SPIE Multimedia Computing
and Networking 1998 (MMCN’98), San Jose, CA, USA, January
1998, pp. 124–138.
[66] M. Rosenblum, The Design and Implementation of a Log-Structured
File System, Kluwer Academic, Dordrecht, 1995.
[67] M. Seltzer, K. Bostic, M.K. McKusick, C. Staelin, An implementation of a log-structured file system for UNIX, Proc. of USENIX
Winter Conf., San Diego, CA, USA, January 1993.
[68] R.Y. Wang, T.E. Anderson, D.A. Patterson, Virtual log based file
systems for a programmable disk, Proc. of 3rd USENIX Symp. on
Operating Systems Design and Implementation (OSDI’99), New
Orleans, LA, USA, February 1999, pp. 29–43.
[69] M. Vernick, C. Venkatramani, T. Chiueh, Adventures in building the
Stony Brook video server, Proc. of 4th ACM Multimedia Conf.
(ACM MM’96), Boston, MA, USA, November 1996, pp. 287–295.
[70] S. Ghandeharizadeh, R. Zimmermann, W. Shi, R. Rejaie, D. Ierardi,
T.-W. Li, Mitra: a scalable continuous media server, Multimedia
Tools and Applications 5 (1) (1997) 79–108.
[71] D. Anderson, Y. Osawa, R. Govindan, A file system for continuous
media, ACM Transactions on Computer Systems 10 (4) (1992)
311–337.
[72] W. Lee, D. Su, D. Wijesekera, J. Srivastava, D.R. KenchammanaHosekote, M. Foresti, Experimental evaluation of PFS continuous
media file system, Proc. of 6th ACM Int. Conf. on Information and
Knowledge Management (CIKM’97), Las Vegas, NV, USA,
November 1997, pp. 246–253.
[73] W. Bolosky, J. Barrera, R. Draves, R. Fitzgerald, G. Gibson,
M. Jones, S. Levi, N. Myhrvold, R. Rashid, The Tiger video file
server, Proc. of 6th Int. Workshop on Network and Operating
System Support for Digital Audio and Video (NOSSDAV’96),
Zushi, Japan, April 1996, pp. 212–223.
[74] W.J. Bolosky, R.P. Fitzgerald, J.R. Douceur, Distributed schedule
management in the Tiger video file server, Proc. of 16th ACM Symp.
on Operating System Principles (SOSP’97). St. Malo, France,
October 1997, pp. 212–223.
[75] P. Lougher, D. Shepherd, The design of a storage server for
continuous media, The Computer Journal 36 (1) (1993) 32–42.
[76] T.N. Niranjan, T. Chiueh, G.A. Schloss, Implementation and evaluation of a multimedia file system, Proc. of IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS’97), Ottawa, Canada, June
1997.
[77] C. Wang, V. Goebel, T. Plagemann, Techniques to increase disk
access locality in the Minorca multimedia file system (short
paper), to be published in Proc. of 7th ACM Multimedia Conf.
(ACM MM’99), Orlando, FL, USA, October 1999.
[78] A. Garcia-Martinez, J. Fernadez-Conde, A. Vina, Efficient memory
management in VoD servers, Computer Communications, in press.
[79] S. Berson, S. Ghandeharizadeh, R.R. Muntz, X. Ju, Staggered striping in multimedia information systems, Proc. of 1994 ACM Int.
Conf. on Management of Data (SIGMOD’94), Minneapolis, MN,
USA, May 1994, pp. 70–90.
[80] H.M. Vin, V. Rangan, Admission control algorithm for multimedia
on-demand servers, Proc. of 4th Int. Workshop on Network and
Operating System Support of Digital Audio and Video (NOSSDAV’93), La Jolla, CA, USA, 1993, pp. 56–68.
[81] E. Chang, H. Garcia-Molina, Reducing initial latency in media
servers, IEEE Multimedia 4 (3) (1997) 50–61.
[82] T.C. Bell, A. Moffat, I.H. Witten, J. Zobel, The MG retrieval system:
compressing for space and speed, Communications of the ACM 38
(4) (1995) 41–42.
[83] D.J. Gemmell, S. Christodoulakis, Principles of delay sensitive
multimedia data storage and retrieval, ACM Transactions on
Information Systems 10 (1) (1992) 51–90.
[84] D.J. Gemmell, H.M. Vin, D.D. Kandlur, P.V. Rangan, L.A. Rowe,
Multimedia storage servers: a tutorial, IEEE Computer 28 (5) (1995)
40–49.
[85] C. Abbott, Efficient editing of digital sound on disk, Journal of
Audio Engineering 32 (6) (1984) 394–402.
[86] P.J. Denning, Effects of scheduling on file memory operations, Proc.
AFIPS Conf., April 1967, pp. 9–21.
[87] R. Geist, S. Daniel, A continuum of disk scheduling algorithms,
ACM Transactions on Computer Systems 5 (1) (1987) 77–92.
[88] J. Coffman, M. Hofri, Queuing models of secondary storage devices,
in: H. Takagi (Ed.), Stochastic Analysis of Computer and Communication Systems, North-Holland, Amsterdam, 1990.
[89] P.S. Yu, M.S. Chen, D.D. Kandlur, Grouped sweeping scheduling
for DASD-based multimedia storage management, ACM Multimedia Systems 1 (3) (1993) 99–109.
[90] D.M. Jacobson, J. Wilkes, Disk scheduling algorithms based on
rotational position, HP Laboratories Technical Report HPL-CSP91-7, Palo Alto, CA, USA, February 1991.
[91] M. Seltzer, P. Chen, J. Ousterhout, Disk scheduling revisited, Proc.
of 1990 USENIX Technical Conf., Washington, D.C., USA, January
1990, pp. 313–323.
[92] Y. Rompogiannakis, G. Nerjes, P. Muth, M. Paterakis, P. Triantafillou, G. Weikum, Disk scheduling for mixed-media workloads in a
multimedia server, Proc. of 6th ACM Multimedia Conf. (ACM
MM’98), Bristol, UK, September 1998, pp. 297–302.
[93] A.L.N. Reddy, J.C. Wyllie, I/O issues in a multimedia system, IEEE
Computer 27 (3) (1994) 69–74.
[94] A.L.N. Reddy, J. Wyllie, Disk scheduling in a multimedia I/O
system, Proc. of 1st ACM Multimedia Conf. (ACM MM’93),
Anaheim, CA, USA, August 1993, pp. 225–233.
[95] M.-S. Chen, D.D. Kandlur, P.S. Yu, Optimization of the group
sweep scheduling (GSS) with heterogeneous multimedia streams,
Proc. of 1st ACM Multimedia Conf. (ACM MM’93), Anaheim,
CA, USA, August 1993, pp. 235–241.
[96] B. Özden, R. Rastogi, A. Silberschatz, Disk striping in video server
environments, Proc. of IEEE Int. Conf. on Multimedia Computing
and Systems (ICMCS’96), Hiroshima, Japan, June 1996.
[97] D.J. Gemmell, J. Han, Multimedia network file servers: multichannel delay sensitive data retrieval, Multimedia Systems 1 (6)
(1994) 240–252.
[98] T.H. Lin, W. Tarng, Scheduling periodic and aperiodic tasks in hard real-time computing systems, Proc. of 1991 ACM Int. Conf. on Measurement and Modeling of Computer Systems
[99] G. Nerjes, Y. Rompogiannakis, P. Muth, M. Paterakis, P. Triantafillou, G. Weikum, Scheduling strategies for mixed workloads in multimedia information servers, Proc. of IEEE International Workshop on Research Issues in Data Engineering (RIDE’98), Orlando, FL, USA, February 1998, pp. 121–128.
[100] R. Wijayaratne, A.L.N. Reddy, Integrated QoS management for disk I/O, Proc. of IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS’99), Florence, Italy, June 1999.
[101] P.J. Shenoy, H.M. Vin, Cello: a disk scheduling framework for next generation operating systems, Proc. of ACM Int. Conf. on Measurement and Modeling of Computer Systems (SIGMETRICS’98), Madison, WI, USA, June 1998.
[102] A.S. Tanenbaum, Modern Operating Systems, Prentice Hall, New York, 1992.
[103] C.D. Cranor, The design and implementation of the UVM virtual memory system, PhD thesis, Sever Institute of Technology, Department of Computer Science, Washington University, St. Louis, MO, USA, August 1998.
[104] C.D. Cranor, G.M. Parulkar, The UVM virtual memory system, Proc. of USENIX Annual Technical Conf., Monterey, CA, USA, June 1999.
[105] S.M. Hand, Self-paging in the Nemesis operating system, Proc. of 3rd USENIX Symp. on Operating Systems Design and Implementation (OSDI’99), New Orleans, LA, USA, February 1999, pp. 73–86.
[106] M.N. Garofalakis, B. Özden, A. Silberschatz, On periodic resource scheduling for continuous-media databases, The VLDB Journal 7 (4) (1998) 206–225.
[107] R. Krishnan, D. Venkatesh, T.D.C. Little, A failure and overload tolerance mechanism for continuous media servers, Proc. of 5th ACM Int. Multimedia Conf. (ACM MM’97), Seattle, WA, USA, November 1997, pp. 131–142.
[108] D. Rotem, J.L. Zhao, Buffer management for video database systems, Proc. of 11th Int. Conf. on Data Engineering (ICDE’95), Taipei, Taiwan, March 1995, pp. 439–448.
[109] S. Viswanathan, T. Imielinski, Metropolitan area video-on-demand service using pyramid broadcasting, Multimedia Systems 4 (4) (1996) 197–208.
[110] K.A. Hua, S. Sheu, Skyscraper broadcasting: a new broadcasting scheme for metropolitan video-on-demand systems, Proc. of ACM SIGCOMM’97, Cannes, France, September 1997, pp. 89–100.
[111] L. Gao, J. Kurose, D. Towsley, Efficient schemes for broadcasting popular videos, Proc. of 8th Int. Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV’98), Cambridge, UK.
[112] T. Nakajima, H. Tezuka, Virtual memory management for interactive continuous media applications, Proc. of IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS’97), Ottawa, Canada, June 1997.
[113] W. Effelsberg, T. Härder, Principles of database buffer management, ACM Transactions on Database Systems 9 (4) (1984) 560–595.
[114] B. Özden, R. Rastogi, A. Silberschatz, Buffer replacement algorithms for multimedia storage systems, Proc. of IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS’96), Hiroshima, Japan, June 1996.
[115] A. Dan, D. Sitaram, Multimedia caching strategies for heterogeneous application and server environments, Multimedia Tools and Applications 4 (3) (1997) 279–312.
[116] M. Kamath, K. Ramamritham, D. Towsley, Continuous media sharing in multimedia database systems, Proc. of 4th Int. Conf. on Database Systems for Advanced Applications (DASFAA’95), Singapore, April 1995, pp. 79–86.
[117] D.C. Anderson, J.S. Chase, S. Gadde, A.J. Gallatin, K.G. Yocum, M.J. Feeley, Cheating the I/O bottleneck: network storage with Trapeze/Myrinet, Proc. of USENIX Annual Technical Conf., New Orleans, LA, USA, June 1998.
[118] R.T. Ng, J. Yang, Maximizing buffer and disk utilization for news-on-demand, Proc. of 20th IEEE Int. Conf. on Very Large Databases (VLDB’94), Santiago, Chile, 1994, pp. 451–462.
[119] H. Tezuka, T. Nakajima, Simple continuous media storage server on
real-time mach, Proc. of USENIX Annual Technical Conf., San
Diego, CA, USA, January 1996.
[120] A. Zhang, S. Gollapudi, QoS management in educational digital
library environments, Technical Report CS-TR-95-53, State
University of New York at Buffalo, New York, NY, USA, 1995.
[121] R.H. Patterson, G.A. Gibson, E. Ginting, D. Stodolsky, J. Zelenka,
Informed prefetching and caching, Proc. of 15th ACM Symp. on
Operating System Principles (SOSP’95), Copper Mountain, CO,
USA, December 1995, pp. 79–95.
[122] F. Chang, G.A. Gibson, Automatic I/O hint generation through speculative execution, Proc. of 3rd USENIX Symp. on Operating
Systems Design and Implementation (OSDI’99), New Orleans,
LA, USA, February 1999, pp. 1–14.
[123] A. Roth, A. Moshovos, G.S. Sohi, Dependence based prefetching for
linked data structures, Proc. of 8th Int. Conf. on Architectural
Support for Programming Languages and Operating Systems
(ASPLOS-VIII), San Jose, CA, USA, October 1998, pp. 115–126.
[124] H. Lei, D. Duchamp, An analytical approach to file prefetching,
Proc. of USENIX Annual Technical Conf., Anaheim, CA, USA,
January 1997.
[125] F. Moser, A. Kraiss, W. Klas, L/MRP: A buffer management strategy for interactive continuous data flows in a multimedia DBMS,
Proc. of 21st IEEE Int. Conf. on Very Large Databases (VLDB’95),
Zurich, Switzerland, 1995, pp. 275–286.
[126] P. Halvorsen, V. Goebel, T. Plagemann, Q-L/MRP: a buffer management mechanism for QoS support in a multimedia DBMS, Proc. of
1998 IEEE Int. Workshop on Multimedia Database Management
Systems (IWMMDBMS’98), Dayton, OH, USA, August 1998,
pp. 162–171.
[127] A.L.N. Reddy, Scheduling in multimedia systems, Design and
Applications of Multimedia Systems, Kluwer Academic, Dordrecht,
1995.
[128] H. Härtig, R. Baumgartl, M. Borriss, C.-J. Hamann, M.
Hohmuth, F. Mehnert, L. Reuther, S. Schönberg, J. Wolter,
DROPS—OS support for distributed multimedia applications,
Proc. of 8th ACM SIGOPS European Workshop, Sintra, Portugal,
September 1998.
[129] J. Nishikawa, I. Okabayashi, Y. Mori, S. Sasaki, M. Migita,
Y. Obayashi, S. Furuya, K. Kaneko, Design and implementation
of video server for mixed-rate streams, Proc. of 7th Int. Workshop on Network and Operating System Support for Digital
Audio and Video (NOSSDAV’97), St. Louis, MO, USA, May
1997, pp. 3–11.
[130] J. Philbin, J. Edler, O.J. Anshus, C.C. Douglas, K. Li, Thread scheduling for cache locality, Proc. of 7th Int. Conf. on Architectural
Support for Programming Languages and Operating Systems
(ASPLOS-VII), Cambridge, MA, USA, October 1996, pp. 60–71.
[131] B.N. Bershad, D. Lee, T.H. Romer, J.B. Chen, Avoiding conflict
misses dynamically in large direct-mapped caches, Proc. of 6th
Int. Conf. on Architectural Support for Programming Languages
and Operating Systems (ASPLOS-VI), San Jose, CA, USA, October
1994, pp. 158–170.
[132] S.A. McKee, R.H. Klenke, K.L. Wright, W.A. Wulf, M.H. Salinas,
J.H. Aylor, A.P. Barson, Smarter memory: improving bandwidth for
streamed references, IEEE Computer 31 (7) (1998) 54–63.
[133] M.B. Abbott, L.L. Peterson, Increasing network throughput by integrating protocol layers, IEEE/ACM Transactions on Networking
1 (5) (1993) 600–610.
[134] D.D. Clark, D.L. Tennenhouse, Architectural considerations for a
new generation of protocols, Proc. of ACM SIGCOMM’90,
Philadelphia, PA, USA, September 1990, pp. 200–208.
[135] P. Druschel, M.B. Abbott, M.A. Pagels, L.L. Peterson, Network
subsystem design, IEEE Network 7 (4) (1993) 8–17.
[136] C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards,
J. Lumley, Afterburner, IEEE Network 7 (4) (1993) 36–43.
[137] D. Banks, M. Prudence, A high-performance network architecture for a PA-RISC workstation, IEEE Journal on Selected Areas in Communications 11 (2) (1993) 191–202.
[138] P. Druschel, Operating system support for high-speed communication, Communications of the ACM 39 (9) (1996) 41–51.
[139] P. Druschel, L.L. Peterson, B.S. Davie, Experiences with a high-speed network adaptor: a software perspective, Proc. of ACM SIGCOMM’94, London, UK, September 1994, pp. 2–13.
[140] H.-K.J. Chu, Zero-copy TCP in Solaris, Proc. of 1996 USENIX Annual Technical Conf., San Diego, CA, USA, January 1996, pp. 253–264.
[141] H. Kitamura, K. Taniguchi, H. Sakamoto, T. Nishida, A new OS architecture for high performance communication over ATM networks: zero-copy architecture, Proc. of 5th Int. Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV’95), Durham, NH, USA, April 1995, pp. 87–90.
[142] M.K. McKusick, K. Bostic, M.J. Karels, J.S. Quarterman, The Design and Implementation of the 4.4BSD Operating System, Addison-Wesley, New York, 1996.
[143] D.G. Bobrow, J.D. Burchfiel, D.L. Murphy, R.S. Tomlinson, TENEX, a paged time sharing system for the PDP-10, Communications of the ACM 15 (3) (1972) 135–143.
[144] R. Fitzgerald, R.F. Rashid, The integration of virtual memory management and interprocess communication in Accent, ACM Transactions on Computer Systems 4 (2) (1986) 147–177.
[145] R. Rashid, G. Robertson, Accent: a communication-oriented network operating system kernel, Proc. of 8th ACM Symp. on Operating Systems Principles (SOSP’81), Pacific Grove, CA, USA, December 1981, pp. 64–75.
[146] D.R. Cheriton, The V distributed system, Communications of the ACM 31 (3) (1988) 314–333.
[147] S.-Y. Tzou, D.P. Anderson, The performance of message-passing using restricted virtual memory remapping, Software—Practice and Experience 21 (3) (1991) 251–267.
[148] E.W. Anderson, Container shipping: a uniform interface for fast, efficient, high-bandwidth I/O, PhD thesis, Computer Science and Engineering Department, University of California, San Diego, CA, USA, 1995.
[149] J. Pasquale, E. Anderson, P.K. Muller, Container shipping—operating system support for I/O-intensive applications, IEEE Computer 27 (3) (1994) 84–93.
[150] P. Druschel, L.L. Peterson, Fbufs: a high-bandwidth cross-domain transfer facility, Proc. of 14th ACM Symp. on Operating Systems Principles (SOSP’93), Asheville, NC, USA, December 1993, pp. 189–202.
[151] M.N. Thadani, Y.A. Khalidi, An efficient zero-copy I/O framework for UNIX, Technical Report SMLI TR-95-39, Sun Microsystems Laboratories Inc., May 1995.
[152] K.R. Fall, A peer-to-peer I/O system in support of I/O intensive workloads, PhD thesis, Computer Science and Engineering Department, University of California, San Diego, CA, USA, 1994.
[153] K. Fall, J. Pasquale, Improving continuous-media playback performance with in-kernel data paths, Proc. of IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS’94), Boston, MA, USA, May 1994, pp. 100–109.
[154] F.W. Miller, P. Keleher, S.K. Tripathi, General data streaming, Proc. of 19th IEEE Real-Time Systems Symp. (RTSS’98), Madrid, Spain, December 1998.
[155] F.W. Miller, S.K. Tripathi, An integrated input/output system for kernel data streaming, Proc. of SPIE/ACM Multimedia Computing and Networking (MMCN’98), San Jose, CA, USA, January 1998, pp. 57–68.
[156] J.C. Brustoloni, Interoperation of copy avoidance in network and file I/O, Proc. of 18th IEEE Conf. on Computer Communications (INFOCOM’99), New York, NY, USA, March 1999.
[157] J.C. Brustoloni, P. Steenkiste, Effects of buffering semantics on I/O performance, Proc. of 2nd USENIX Symp. on Operating Systems Design and Implementation (OSDI’96), Seattle, WA, USA, October 1996, pp. 277–291.
[158] J.C. Brustoloni, P. Steenkiste, Evaluation of data passing and scheduling avoidance, Proc. of 7th Int. Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV’97), St. Louis, MO, USA, May 1997, pp. 101–111.
[159] C.D. Cranor, G.M. Parulkar, Universal continuous media I/O: design and implementation, Technical Report WUCS-94-34, Department of Computer Science, Washington University, St. Louis, MO, USA, 1994.
[160] C.D. Cranor, G.M. Parulkar, Design of universal continuous media I/O, Proc. of 5th Int. Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV’95), Durham, NH, USA, April 1995, pp. 83–86.
[161] V.S. Pai, IO-Lite: a copy-free UNIX I/O system, Master of Science thesis, Rice University, Houston, TX, USA, January 1997.
[162] V.S. Pai, P. Druschel, W. Zwaenepoel, IO-Lite: a unified I/O buffering and caching system, Proc. of 3rd USENIX Symp. on Operating Systems Design and Implementation (OSDI’99), New Orleans, LA, USA, February 1999, pp. 15–28.
[163] M.M. Buddhikot, X.J. Chen, D. Wu, G.M. Parulkar, Enhancements to 4.4BSD UNIX for efficient networked multimedia in project MARS, Proc. of IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS’98), Austin, TX, USA, June/July 1998.
[164] M.M. Buddhikot, Project MARS: scalable, high performance, web-based multimedia-on-demand (MOD) services and servers, PhD thesis, Sever Institute of Technology, Department of Computer Science, Washington University, St. Louis, MO, USA, August 1998.
[165] P.V. Rangan, H. Vin, Designing file systems for digital video and audio, Proc. of 13th ACM Symp. on Operating Systems Principles (SOSP’91), Pacific Grove, CA, USA, October 1991, pp. 81–94.
[166] P.J. Shenoy, P. Goyal, H.M. Vin, Architectural considerations for next generation file systems, to be published in Proc. of 7th ACM Multimedia Conf. (ACM MM’99), Orlando, FL, USA, October 1999.